What Is Open Source AI Anyway?
AMSTERDAM — Last October, the Open Source Initiative (OSI) published its definition of what it would take for an AI model to be open source. At the time, OSI executive director Stefano Maffulli said that the definition was meant to be a conversation starter.
That’s definitely held true. Even as developers have generally taken a rather pragmatic approach to open weight models and their licenses, the OSI definition left something to be desired for many who wanted a more radical definition. That’s especially true when it comes to the data used to train the model, which the OSI definition says must be described in detail, but not made available.
Defining Open Source AI
At the Open Source Summit in Amsterdam, I sat down with Maffulli to talk about the current state of the discussion. He noted that not only has the conversation started but that the definition has become a tool for the OSI to engage with politicians, including the European Commission, where the AI Act, for example, will go into full effect in August 2026.
“It has been a very useful tool for us in discussions at the European Commission — and to some extent also in the United States and in Washington — to drive the interpretation of the AI Act and the [EU’s] guidelines for general-purpose AI,” Maffulli said. “The intention of the AI Act is to remove friction and give privileged access to open source developers, researchers in academia.”

He noted that the guidelines are the European Commission’s interpretation of the AI Act and define the obligations that providers of “general-purpose AI models” (which include virtually all large language models) have under the act. The act and guidelines specifically include exemptions for open source AI models, and Maffulli noted that these follow the principles also encoded in the OSI’s open source AI definition.
“They say basically that in order to have those obstacles removed, you need to be transparent. Therefore, you need to be very clean and clear about what went into the training set,” he noted, and stressed that the politicians understand why making full training sets available is generally not possible.
“They understand exactly what the issue is. You don’t have the copyright, the ownership of the data that they are distributing. So they know what went into the revision of the Copyright Act that gave the exceptions for text and data mining. And the text and data mining exceptions explicitly say you’re free to accumulate all the data, to scour and crawl the web and do whatever you want with it. Once you have done the analysis, throw away the data. It’s not yours to keep. And that’s exactly what resonates. It works.”
When working with the open source community at large, Maffulli said a lot of the work has been about clarifying the definition of open source AI. A popular model like Qwen may be open weight and licensed under a permissive, OSI-approved license, but a developer wouldn’t have the tools, code and data to replicate the work that the Qwen team did to build the model.
Maffulli acknowledged that the OSI definition sets a high bar and that very few models actually pass it right now.
“The Open Source Initiative has never been prescriptive. We’re not a standards body imposing fines,” Maffulli said. “There are finger waggers, for sure. There are people happy to point the finger at you and yell that you’re wrong. But in open source [in general], the definition came out of practice and from practitioners, and I think the evolution of the open source AI definition is going to follow the same path as the technology evolves, as the practice evolves and as the law evolves, which is something that we didn’t have to consider 20 years ago and now we have.”
Open Data
One area he is especially interested in right now is the datasets that comprise the training data for new models. Many companies, he said, are now looking to build data sets that are more resilient to lawsuits (“I don’t call them safe from the copyright perspective, because there is no safety. It’s one of the other things that we are learning,” he said).
A lot of companies now have a very hard time creating large data sets from the public web, which he described as “shrinking.” Common Crawl, the largest repository of web crawl data, Maffulli said, is having a hard time expanding its data set, in part because the web is increasingly being polluted by AI slop, but also because many large sites and publishers are asking for their data to be removed.
This goes back to an increasingly urgent discussion about the relationship between those building AI models and online publishers. The models depend on high-quality data, which typically comes from news organizations or large sites like Reddit and Stack Overflow, but those sites have long depended on Google and other search engines to send them readers — traffic they could then monetize to keep producing that content. The rise of large language models as alternatives to search engines is quickly upending that relationship, because few users ever click on LLM citations.
Maffulli’s position here may not sit well with publishers. “If we want to have a public AI, we need to safeguard the public web,” he said. “We need to safeguard taxes and take it away from the publishers. I think that there is no other way than play the Google Books card. The publisher should not have a say. In the same way that we have the concept that worked with Google Books needs to work also for AI, for AI training — in exchange of a public AI. You want to enter a secret deal with OpenAI, be my guest. But if I’m another AI firm — if I’m the Allen Institute for AI and I want to do public AI, then, sorry, it seems fair to some degree.”
He argues that the relationship between AI firms and publishers and projects like Common Crawl is out of balance. But as of now, we have neither the legal framework (because, he said, copyright doesn’t work at that scale) nor the technical framework in place to restore this balance.
I would also argue that for publishers, access to a public AI trained on their data may not be enough of an incentive to make that data openly available.
“We’ve got a lot of work to do if you want to have really public data sets — data sets that we can share, that we can build on, and we can build large — and we’re talking large — language models, GPT-style technologies. We need to work on these. We need to talk governance. We don’t have good ways to prove ownership.”