I’ve resisted writing much about this week’s frenzy over DeepSeek, the relatively small Chinese company that has put the cat amongst the frantically multiplying large language model pigeons. The main reason I’ve avoided the discussion is that I’m pretty sure that this is only one step further along the path of LLM evolution – next week or next month, something else will be the new shiny object and we’ll all have forgotten DeepSeek’s big splash. Instead, I’d like to discuss how DeepSeek’s labelling as open source AI helped them make such a big impact – and whether this labelling is accurate.
Inspiration and perspiration
It’s frequently said that in software, we stand on the shoulders of giants. Everything depends on what came before: C# builds on C++ builds on C builds on B (you’ve been around for a while if you remember that last one!). Every new database, platform, content management system or search engine builds on the lessons learned from and is inspired by previous generations.
The fuzzy part of this is what we mean by ‘inspired’ and how, practically, this translates into actual written code. Back in the early days, programmers would happily give away their code to others who needed it, to adapt and improve, in the expectation and hope that they in return would get access to the new version. Once people realised they could charge for software, this model changed, with some software becoming available only in compiled form under a written legal license. Those who still cared about sharing invented free software and, slightly later, open source, using licenses to permit pretty much any use of the software as long as you also maintained the freedom to re-use, modify and distribute the derivatives.
Of course, not everybody played by the rules. There are plenty of well-documented instances of proprietary, closed source software being built on the backs of open source contributors – and equally, of open source code that looks suspiciously similar to a commercial alternative. In some cases people have genuinely re-created functionality from scratch; in many others, just switching some variable and function names around was enough. It’s hard to replicate functionality cleanly – you can’t create code in a vacuum, and you can’t un-see something you might base your program on, even unconsciously. If someone objects and brings in the lawyers, we get into the nitty-gritty detail; if people believe that program X looks way too similar to program Y, suits are filed and settlements are made. If commercial strategies change, code that was previously open sometimes becomes more closed, which annoys previous contributors and may lead to a fork, creating a new open alternative.
However, open source has become an established and often disruptive force for good that underpins our technological world. Since you’re giving away your IP (under certain conditions) it’s challenging to make anywhere near as much money in open source as in commercial software, but it’s also a great and quick way to build a large user base and an active community. Repeatedly we see software sectors that were previously dominated by commercial vendors disrupted by open source alternatives – for example, the rise of Lucene-powered search engines in the early 2000s, or MySQL, now one of the most popular databases in the world.
Not really open AI
The evolution of LLMs has considerably muddied this picture, as they depend on training data, model code, weights and associated software. One principle of open source is that for something to be truly open, one must have unrestricted access to everything needed to build the system, but in the world of LLMs people seem perfectly happy to release model weights and call the result open source without also releasing the training data, training code and so on. Even worse, some very well known companies have invented their own licenses (complete with restrictions) and decided to call these open source when they’re clearly not. What we see here is open source used for marketing and disruption, without truly signing up to the principles behind it. Sadly, journalists and commentators often fall for it.
No matter what the license, it’s clear those building LLMs are still standing on each other’s shoulders. Engineers and data scientists are trying to advance the state of the art, racing towards AGI, and some are using whatever method they can to achieve that – be it ingesting millions of copyrighted works or using the output of one model to train another. Playing fast and loose with naming, licenses and laws isn’t an accident but a deliberate strategy – when you’re creating the golden goose of AI and potentially attracting billions of dollars of investment, none of that seems to matter.
Unfortunately the open source world is still struggling to define what we mean by open AI, which makes it even easier for AI companies to open-wash their offerings. The OSI have created a definition, which isn’t universally popular. We’ll no doubt hear a lot more on this at the State of Open conference in London next week, hosted by OpenUK, for which I’m proud to be an ambassador.
DeepSeek’s models have been widely reported as open source – but not everything you need to replicate them, such as the training code, is actually provided (although to their credit they’ve discussed how they did this in an associated paper). Interestingly, there is already an effort underway to provide a truly open model based on DeepSeek.
An open bridge
Time will tell whether whatever flavour of ‘open’ we decide on will be the best bridge over the moat being frantically dug by those large, well-funded (and mainly American) companies trying to land-grab and monetize AI. If we look back, it’s clear that commercial software can eventually lose out to open source as the latter commoditizes the market. Enterprises in particular are nervous about sending their data over the wire to an API and may decide it’s safer to run models in-house, which open source would permit. However, those with deep vendor relationships may choose to trust the offerings from Microsoft or Google rather than spending the time and money to build in-house AI expertise and capability. Most of the AI companies are now targeting enterprise adoption, although it won’t be as easy as they may think. Open models, software and training data also help address the significant and justified concerns many people have about bias.
However, if there’s one thing the DeepSeek release shows us, it’s that saying your cutting-edge LLM is open source (no matter how inaccurately) can have a significant effect on the AI market. No matter what huge AI funding governments announce or what grand commercial strategy you have, there’s always the chance someone inspired by your work will try to replicate and improve it – and if they then release this to the world with fewer restrictions, your moat is no longer an effective defence. This applies to DeepSeek as much as anyone else.
Perhaps the larger players should bear this in mind, and we can only hope that some of them realise that releasing truly open source AI may be a winning strategy.
Interested in how open source AI can work for your business? Contact me.