March 25, 2024

The tech industry can’t agree on what open source AI means. That’s a problem.

Suddenly, “open source” is the latest buzzword in AI circles. Meta has pledged to create open-source artificial general intelligence. And Elon Musk is suing OpenAI over its lack of open-source AI models.

Meanwhile, a growing number of tech leaders and companies are setting themselves up as open-source champions. 

But there’s a fundamental problem—no one can agree on what “open-source AI” means. 

On the face of it, open-source AI promises a future where anyone can take part in the technology’s development. That could accelerate innovation, boost transparency, and give users greater control over systems that could soon reshape many aspects of our lives. But what even is it? What makes an AI model open source, and what disqualifies it?

The answers could have significant ramifications for the future of the technology. Until the tech industry has settled on a definition, powerful companies can easily bend the concept to suit their own needs, and it could become a tool to entrench the dominance of today’s leading players.

Entering this fray is the Open Source Initiative (OSI), the self-appointed arbiter of what it means to be open source. Founded in 1998, the nonprofit is the custodian of the Open Source Definition, a widely accepted set of rules that determine whether a piece of software can be considered open source.

Now, the organization has assembled a 70-strong group of researchers, lawyers, policymakers, activists, and representatives from big tech companies like Meta, Google, and Amazon to come up with a working definition of open-source AI. 

The open-source community is a big tent, though, running the gamut from hacktivists to Fortune 500 companies. While there’s broad agreement on the overarching principles, says Stefano Maffulli, OSI’s executive director, it’s becoming increasingly obvious that the devil is in the details. With so many competing interests to consider, finding a solution that satisfies everyone while ensuring the biggest companies play along is no easy task.

Fuzzy criteria

The lack of a settled definition has done little to prevent tech companies from adopting the term.

Last July, Meta made its Llama 2 model, which it referred to as open source, freely available, and it has a track record of publicly releasing AI technologies. “We support the OSI’s effort to define open-source AI and look forward to continuing to participate in their process for the benefit of the open source community across the world,” Jonathan Torres, Meta’s associate general counsel for AI, open source, and licensing, told us.

That stands in marked contrast to rival OpenAI, which has shared progressively fewer details about its leading models over the years, citing safety concerns. “We only open-source powerful AI models once we have carefully weighed the benefits and risks, including misuse and acceleration,” a spokesperson said. 

Other leading AI companies, like Stability AI and Aleph Alpha, have also released models described as open source, and Hugging Face hosts a large library of freely available AI models.

While Google has taken a more locked-down approach with its most powerful models, like Gemini and PaLM 2, the Gemma models released last month are freely accessible and designed to go toe-to-toe with Llama 2, though the company described them as “open” rather than “open source.”  

But there’s considerable disagreement about whether any of these models can really be described as open source. For a start, both Llama 2 and Gemma come with licenses that restrict what users can do with the models. That’s anathema to open-source principles: one of the key clauses of the Open Source Definition outlaws the imposition of any restrictions based on use cases.

The criteria are fuzzy even for models that don’t come with these kinds of conditions. The concept of open source was devised to ensure developers could use, study, modify, and share software without restrictions. But AI works in fundamentally different ways, and key concepts don’t translate from software to AI neatly, says Maffulli.

One of the biggest hurdles is the sheer number of ingredients that go into today’s AI models. All you need to tinker with a piece of software is the underlying source code, says Maffulli. But depending on your goal, dabbling with an AI model could require access to the trained model, its training data, the code used to preprocess this data, the code governing the training process, the underlying architecture of the model, or a host of other, more subtle details.

Which ingredients you need to meaningfully study and modify models remains open to interpretation. “We have identified what basic freedoms or basic rights we want to be able to exercise,” says Maffulli. “The mechanics of how to exercise those rights are not clear.”

Settling this debate will be essential if the AI community wants to reap the same benefits software developers gained from open source, which was built on broad consensus about what the term meant, says Maffulli. “Having [a definition] that is respected and adopted by a large chunk of the industry provides clarity,” he says. “And with clarity comes lower costs for compliance, less friction, shared understanding.”

By far the biggest sticking point is data. All the major AI companies have simply released pretrained models, without the datasets on which they were trained. For people pushing for a stricter definition of open-source AI, Maffulli says, this seriously constrains efforts to modify and study models, automatically disqualifying them as open source.

Others have argued that a simple description of the data is often enough to probe a model, says Maffulli, and you don’t necessarily need to retrain from scratch to make modifications. Pretrained models are routinely adapted through a process known as fine-tuning, in which they are partially retrained on a smaller, often application-specific, dataset.

Meta’s Llama 2 is a case in point, says Roman Shaposhnik, CEO of open-source AI company Ainekko and vice president of legal affairs for the Apache Software Foundation, who is involved in the OSI process. While Meta only released a pretrained model, a flourishing community of developers has been downloading and adapting it, and sharing their modifications.

“People are using it in all sorts of projects. There’s a whole ecosystem around it,” he says. “We therefore must call it something. Is it half-open? Is it ajar?”

While it may be technically possible to modify a model without its original training data, restricting access to a key ingredient is not really in the spirit of open source, says Zuzanna Warso, director of research at nonprofit Open Future, who is taking part in the OSI’s discussions. It’s also debatable whether it’s possible to truly exercise the freedom to study a model without knowing what information it was trained on.

“It’s a crucial component of this whole process,” she says. “If we care about openness, we should also care about the openness of the data.”

Have your cake and eat it

It’s important to understand why companies setting themselves up as open-source champions are reluctant to hand over training data. Access to high-quality training data is a major bottleneck for AI research and a competitive advantage for bigger firms that they’re eager to maintain, says Warso.

At the same time, open source carries a host of benefits that these companies would like to see translated to AI. At a superficial level, the term “open source” carries positive connotations for a lot of people, so engaging in so-called “open washing” can be an easy PR win, says Warso.

It can also have a significant impact on their bottom line. Economists at Harvard Business School recently found that open-source software has saved companies almost $9 trillion in development costs by allowing them to build their products on top of high-quality free software rather than writing it themselves.

For larger companies, open-sourcing their software so that it can be reused and modified by other developers can help build a powerful ecosystem around their products, says Warso. The classic example is Google’s open-sourcing of its Android mobile operating system, which cemented its dominant position at the heart of the smartphone revolution. Meta’s Mark Zuckerberg has been explicit about this motivation in earnings calls, saying “open source software often becomes an industry standard, and when companies standardize on building with our stack, that then becomes easier to integrate new innovations into our products.”

Crucially, it also appears that open-source AI may receive favorable regulatory treatment in some places, Warso says, pointing to the EU’s newly passed AI Act, which exempts certain open-source projects from some of its more stringent requirements.

Taken together, it’s clear why sharing pretrained models but restricting access to the data required to build them makes good business sense, says Warso. But it does smack of companies trying to have their cake and eat it too, she adds. And if the strategy helps entrench the already dominant positions of large tech companies, it’s hard to see how that fits with the underlying ethos of open source.

“We see openness as one of the tools to challenge the concentration of power,” says Warso. “If the definition is supposed to help in challenging these concentrations of power, then the question of data becomes even more important.”

Shaposhnik thinks a compromise is possible. A significant amount of data used to train the largest models already comes from open repositories like Wikipedia or Common Crawl, which scrapes data from the web and shares it freely. Companies could simply share the open resources used to train their models, he says, making it possible to recreate a reasonable approximation that should allow people to study and understand models.

The lack of clarity regarding whether training on art or writing scraped from the internet infringes on the creator’s property rights can cause legal complications though, says Aviya Skowron, head of policy and ethics at nonprofit AI research group EleutherAI, also involved in the OSI process. That makes developers wary of being open about their data.

Stefano Zacchiroli, a professor of computer science at the Polytechnic Institute of Paris who is also contributing to the OSI definition, appreciates the need for pragmatism. His personal view is that a full description of a model’s training data is the bare minimum for it to be described as open source, but he recognizes that stricter definitions of open-source AI might not have broad appeal.

Ultimately, the community needs to decide what it’s trying to achieve, says Zacchiroli: “Are you just following where the market is going so that they don’t essentially co-opt the term ‘open-source AI,’ or are you trying to pull the market toward being more open and providing more freedoms to the users?”

What’s the point of open source?

It’s debatable how much any definition of open-source AI will level the playing field anyway, says Sarah Myers West, co-executive director of the AI Now Institute. She co-authored a paper published in August 2023 exposing the lack of openness in many open-source AI projects. But it also highlighted that the vast amounts of data and computing power needed to train cutting-edge AI create deeper structural barriers for smaller players, no matter how open models are.

Myers West thinks there’s also a lack of clarity regarding what people hope to achieve by making AI open source. “Is it safety, is it the ability to conduct academic research, is it trying to foster greater competition?” she asks. “We need to be way more precise about what the goal is, and then how opening up a system changes the pursuit of that goal.”

The OSI seems keen to avoid those conversations. The draft definition mentions autonomy and transparency as key benefits, but Maffulli demurred when pressed to explain why the OSI values those concepts. The document also contains a section labeled “out of scope issues” that makes clear the definition won’t wade into questions around “ethical, trustworthy, or responsible” AI.

Maffulli says historically the open-source community has focused on enabling the frictionless sharing of software and avoided getting bogged down in debates about what that software should be used for. “It’s not our job,” he says.

But those questions can’t be dismissed, says Warso, no matter how hard people have tried over the decades. The idea that technology is neutral and that topics like ethics are “out of scope” is a myth, she adds. She suspects it’s a myth that needs to be upheld to prevent the open-source community’s loose coalition from fracturing. “I think people realize it’s not real [the myth], but we need this to move forward,” says Warso.

Beyond the OSI, others have taken a different approach. In 2022, a group of researchers introduced Responsible AI Licenses (RAIL), which are similar to open-source licenses but include clauses that can restrict specific use cases. The goal, says Danish Contractor, an AI researcher who co-created the license, is to let developers prevent their work from being used for things they consider inappropriate or unethical.

“As a researcher, I would hate for my stuff to be used in ways that would be detrimental,” he says. And he’s not alone: a recent analysis he and colleagues conducted on AI startup Hugging Face’s popular model-hosting platform found that 28% of models use RAIL. 

The license Google attached to its Gemma models follows a similar approach. Its terms of use list various prohibited use cases considered “harmful,” which reflects its “commitment to developing AI responsibly,” the company said in a recent blog post. The Allen Institute for AI has also developed its own take on open licensing. Its ImpACT Licenses restrict redistribution of models and data based on their potential risks.

Given how different AI is from conventional software, some level of experimentation with different degrees of openness is inevitable and probably good for the field, says Luis Villa, co-founder and legal lead at open-source software management company Tidelift. But he worries that a proliferation of “open-ish” licenses that are mutually incompatible could negate the frictionless collaboration that made open source so successful, slowing down innovation in AI, reducing transparency, and making it harder for smaller players to build on each other’s work.

Ultimately, Villa thinks the community needs to coalesce around a single standard; otherwise, industry will simply ignore it and decide for itself what “open” means. He doesn’t envy the OSI’s job, though. When it came up with the open-source software definition, it had the luxury of time and little outside scrutiny. Today, AI is firmly in the crosshairs of both big business and regulators.

But if the open-source community can’t settle on a definition, and quickly, someone else will come up with one that suits their own needs. “They’re going to fill that vacuum,” says Villa. “Mark Zuckerberg is going to tell us all what he thinks ‘open’ means, and he has a very big megaphone.”
