When it comes to generative artificial intelligence, should your organization opt for public or proprietary AI? First, you need to consider the main differences between these options.
Public AI can draw on a wide knowledge base and handle many kinds of tasks. However, public AI services may feed the data users submit back into the model's training data, which can create security vulnerabilities. The alternative, AI trained and hosted in-house on proprietary data, can be more secure but requires much more infrastructure.
Some companies, including Samsung, have banned public generative AI tools for corporate use because of security risks. In response to these concerns, OpenAI, the company behind ChatGPT, added an option in April 2023 that lets users restrict the use of their data.
Aaron Kalb, co-founder and chief strategy officer at data analytics firm Alation, spoke with us about how generative AI is being used in data analytics and what organizations can learn from the state of this fast-moving field. His time as an engineer on Siri gives him insight into what organizations should consider when adopting emerging technologies, including the choice between public and proprietary AI datasets.
The following is a transcript of my interview with Kalb. It has been edited for length and clarity.
Train your own AI or use a public service?
Megan Crouse: Do you think companies having their own private pools of data fed to an AI will be the way of the future, or will it be a mix of public and proprietary AI?
Aaron Kalb: Internal large language models are interesting. Training on the whole internet has benefits and risks — not everyone can afford to do that or even wants to do it. I’ve been struck by how far you can get on a big pre-trained model with fine-tuning or prompt engineering.
For smaller players, there will be a lot of uses of the stuff [AI] that’s out there and reusable. I think larger players who can afford to make their own [AI] will be tempted to. If you look at, for example, AWS and Google Cloud Platform, some of this stuff feels like core infrastructure — I don’t mean what they do with AI, just what they do with hosting and server farms. It’s easy to think, ‘We’re a huge company, we should make our own server farm.’ But if your core business is agriculture or manufacturing, maybe you should let the A-teams at Amazon and Google build it and pay them a few cents per terabyte of storage or compute.
My guess is only the biggest tech companies over time will actually find it beneficial to maintain their own versions of these [AI]; most people will end up using a third-party service. Those services are going to get more secure, more accurate [and] more fine-tuned by industry and lower in price.
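Kalb's point about how far a big pre-trained model can go with prompt engineering alone can be sketched in a few lines. The example below is only illustrative: it assumes the openai Python package and an API key in the environment, and the company name, prompt content and model choice are hypothetical placeholders.

```python
# A minimal sketch of prompt engineering against a hosted, pre-trained model.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment
# variable; the company, product notes and model name are hypothetical.
from openai import OpenAI

client = OpenAI()

# "Prompt engineering": steer a general-purpose model with instructions and
# context instead of training or fine-tuning a model yourself.
system_prompt = (
    "You are a support assistant for Acme Analytics. "
    "Answer only from the provided product notes; if unsure, say so."
)
product_notes = "Acme Analytics supports CSV and Parquet imports up to 5 GB."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever hosted model you have access to
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"Product notes:\n{product_notes}\n\nQuestion: Can I import Parquet files?",
        },
    ],
)
print(response.choices[0].message.content)
```

Fine-tuning is the heavier alternative Kalb mentions: it adjusts the model's weights on your own examples, which costs more up front but can pay off when prompting alone isn't enough.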
SEE: GPT-4 cheat sheet: What is GPT-4, and what is it capable of?
How to decide if AI is right for your enterprise
Megan Crouse: What other questions do you think enterprise decision-makers should ask themselves before deciding whether to implement generative AI? In what cases might it be better not to use it?
Aaron Kalb: I have a design background, and the guiding model there is the design diamond: you ideate outward, then you select in. The other key thing I take from design is that you always start not with your product but with the user and the user’s problem. What are the biggest problems we have?
If the sales development team says, ‘We find we get a better response and open rate when the subject and body of our outreach emails are really tailored to that person based on their LinkedIn and their company website, but we’re spending hours a day doing that work manually, so we get a good open rate but don’t send many emails in a day,’ it turns out generative AI is great at that. You can make a widget that goes through your list of people to email and drafts a message based on the recipient’s LinkedIn page and the corporate website. The person just edits it instead of spending half an hour writing it. I think you have to start with what your problem is.
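A widget like the one Kalb describes could, in rough outline, look like the sketch below. It assumes the openai Python package; the prospect records, model name and word limit are hypothetical placeholders, and a person still reviews every draft before anything is sent.

```python
# Sketch of a personalized-outreach widget of the kind Kalb describes.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY env var;
# the prospect data and model choice are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

# In practice this list would come from a CRM export or an enrichment tool.
prospects = [
    {
        "name": "Dana Lee",
        "linkedin_summary": "VP of Data at a mid-size retailer; posts about data governance.",
        "company_blurb": "Retailer expanding its e-commerce analytics team.",
    },
]

for p in prospects:
    prompt = (
        f"Draft a short, friendly sales email to {p['name']}.\n"
        f"LinkedIn summary: {p['linkedin_summary']}\n"
        f"Company: {p['company_blurb']}\n"
        "Keep it under 120 words and suggest a 20-minute call."
    )
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # A person edits each draft instead of writing it from scratch.
    print(draft.choices[0].message.content)
```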
SEE: Generative AI can create text or video on demand, but it opens up concerns about plagiarism, misuse, bias and more.
Aaron Kalb: Even though it’s not exciting anymore, a lot of AI is predictive models. That’s a generation old, but it might be much more lucrative than giving people a box where they can type to a bot. People don’t like to type. You might be better off with a great user interface that makes predictions based on buyer clicks or something, even though that’s a different approach.
The most important things to think about [when it comes to generative AI] are security, performance [and] cost. The disadvantage is generative AI can be like using a bulldozer to move a backpack. And you’re introducing randomness, perhaps unnecessarily. There are many times you’d rather have something deterministic.
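Kalb's point about unwanted randomness maps to a knob most generative AI APIs expose: sampling temperature. The sketch below, again assuming the openai Python package and a placeholder model name, shows how setting the temperature to 0 makes outputs more repeatable, and why a plain deterministic rule may still be the better fit for a narrow task.

```python
# Sketch: reducing (not eliminating) randomness in generative output.
# Assumes the openai Python package (v1+); the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

question = "Classify this ticket as 'billing', 'bug' or 'how-to': 'I was charged twice.'"

# temperature=0 asks the model for its most likely answer each time,
# which makes repeated calls far more consistent than default sampling.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": question}],
)
print(reply.choices[0].message.content)

# For a task this constrained, a deterministic rule avoids the bulldozer entirely:
def classify(ticket: str) -> str:
    text = ticket.lower()
    if "charged" in text or "invoice" in text:
        return "billing"
    return "how-to"

print(classify("I was charged twice."))
```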
Determining ownership of the data AI uses
Megan Crouse: In terms of IT responsibility, if you are making your own datasets, who has ownership of the data the AI has access to? How does that integrate into the process?
Aaron Kalb: I look at AWS, and I trust that over time both the privacy concerns and the process are going to get better and better. Right now, certainly, that can be a hard thing. Over time, it’ll be possible to get an off-the-shelf thing with all the approvals and certifications you need to trust that, even if you’re in the federal government or a really regulated industry. It will not happen overnight, but I think that’s going to happen.
However, an LLM is a very heavy algorithm. The whole point is that it learns from everything but doesn’t know where anything came from. Any time you’re worried about bias, [AI may not be suitable]. And there’s not a lightweight version of this. The very thing that makes it impressive makes it expensive. Those expenses aren’t just money; they also come down to power. There aren’t enough electrons floating around.
Proprietary AI lets you look into the ‘black box’
Megan Crouse: Alation prides itself on delivering visibility in data governance. Have you discussed internally how and whether to get around the AI ‘black box’ problem, where it’s impossible to see why the AI makes the decisions it does?
Aaron Kalb: I think in places where you really want to know where all the ‘knowledge’ the AI is being trained on comes from, that’s a place where you might want to build your own model and control the scope of the data it’s trained on. The only problem there is the first ‘L’ of ‘LLM.’ If the model isn’t large enough, you don’t get the impressive performance. There’s a trade-off [with] smaller training data: more accuracy, less weirdness, but also less fluency and less impressive skills.
Finding a balance between usefulness and privacy
Megan Crouse: What have you learned from your time working on Siri that you apply to the way you approach AI?
Aaron Kalb: Siri was the first [chatbot-like AI]. It faced very steep competition from players such as Google, which had projects like Google Voice and huge corpora of user-generated conversational data. Siri didn’t have any of that; it was all based on corpora of texts from newspapers and things like that, plus a lot of old-school, template-based, inferential AI.
For a long time, even as Siri updated the algorithms it was using, its performance couldn’t improve as much. One [factor] is the privacy policy. Every conversation you have with Siri stands alone; there’s no way for it to learn over time. That helps users trust that their information isn’t being used in the hundreds of ways Google uses, and potentially misuses, such information, but it also means Apple couldn’t learn from it.
In the same way, Apple kept adding new functionality. Siri’s journey shows that the bigger your world, the more empowering the tool, but that’s also a risk: the more data you pull in, the more empowerment and the more privacy concern. This [generative AI] is hugely forward-looking tech, but you’re always moving sliders that trade off different things people care about.