Open-Source in AI: The Big Misunderstanding

Open-source vs open-weight: why LLMs can't be truly open-source like Linux, and what you actually get when a model releases its weights.

Jean-Christophe Budin

Introduction

Hardly a week goes by without headlines in the press or on LinkedIn announcing the release of "open-source" AI models… when in reality, they aren't.

Am I nitpicking words? The distinction is actually fundamental.

"Open-source" and "open-weight" are often used interchangeably in AI discussions. Yet they refer to fundamentally different things.

This article explains the real differences between these two concepts and why generative AI struggles to follow the traditional open-source model.

Open-Source: total transparency as a foundation

What is open-source?

Open-source means total transparency: access to the full source code, the ability to modify it, audit it, and understand exactly how it works.

Take Linux. Every line of code is visible. You can compile your own version, adapt it, contribute improvements, or have the code audited to detect vulnerabilities.

This transparency enables:

  • Trust: any expert can verify there are no hidden backdoors
  • Collective security: thousands of eyes find and fix bugs quickly
  • Independence: no risk of being locked into a single company

This philosophy gave us Git, Apache, Firefox, PostgreSQL, Python… some of the most reliable software projects in the world, widely used in companies, including commercially. Google and Microsoft are even among the biggest open-source contributors.

LLMs: a fundamentally different architecture

Weights are not source code

An LLM (large language model) like GPT-5.2, Mistral 3, or Gemini 3 is not a traditional program. It's a mathematical architecture with billions of "weights" (the neural network's connection strengths), produced by training runs that can cost millions of euros.

The crucial difference: having an LLM’s weights is not like having a software project’s source code.

It’s more like receiving a compiled binary: you can run it, but you don’t see:

  • The training data used
  • The methodological choices (algorithms, hyperparameters)
  • The optimization strategies
  • The mistakes corrected along the way

Result: a powerful machine… but opaque.
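To make the binary analogy concrete, here is a toy sketch in Python (the matrix below is a made-up stand-in, not a real model): inference works with the weights alone, yet nothing in them reveals how they were produced.

```python
import numpy as np

# Toy illustration: a "model" reduced to its weights alone.
# This small matrix stands in for billions of parameters.
rng = np.random.default_rng(42)
weights = rng.normal(size=(4, 2))

def forward(x):
    """Inference works with weights alone -- like running a compiled binary."""
    return x @ weights

output = forward(np.ones(4))
print(output.shape)  # (2,)
# But nothing in `weights` reveals the training data, the
# hyperparameters, or the corrections made along the way.
```

You can run the machine, inspect every number in it, and still learn nothing about the process that created it.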

What is an "open-weight" model?

While the major players keep their top models closed, companies such as Mistral AI and DeepSeek publish their models' weights on Hugging Face. Even OpenAI and Google have joined in.

Illustration generated by Gemini

This kind of openness brings real benefits:

  • Run the model on your own servers without relying on an external API
  • Fine-tune it for your specific needs (legal, medical vocabulary…)
  • Keep your data on your own infrastructure
  • Control costs
  • Enable the community to innovate

Why "true" open-source is so hard for LLMs

1. Training data: the legal nightmare

The data used to train LLMs largely comes from copyright-protected works: books, articles, web content, scientific publications. This data has often been collected at scale without explicit author consent.

Publishing this data creates legal risk. Lawsuits are multiplying in the US and Europe, and revealing precisely which content was used would open up massive legal exposure. This is a structural problem that will only be resolved by changes to the legal framework.

Even though you can find hundreds of thousands of datasets on Hugging Face, the most valuable data is not copyright-free.

2. Reproducibility is extremely difficult

With Linux, if you have the source code, you can recompile the exact same version.

For LLMs, it’s far more complicated, even if you had everything (code, data, weights):

  • Training processes include randomness by design
  • Results differ across separate runs
  • Reproducing training can cost millions of euros

Perfect reproducibility is therefore practically out of reach, even if it’s theoretically possible.
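The randomness point can be sketched with a toy gradient-descent loop (a hypothetical illustration, nothing like a real LLM training pipeline): fixing the seed makes a run reproducible, but changing it yields different final weights, and real training adds hardware-level nondeterminism that no seed controls.

```python
import numpy as np

# Toy "training" loop: fit w so that w * x approximates x (target w = 1).
# Hypothetical sketch only; real LLM training involves billions of
# parameters, distributed hardware, and non-seeded nondeterminism.
def train(seed, steps=50, lr=0.1):
    rng = np.random.default_rng(seed)
    w = rng.normal()                  # random initialization
    for _ in range(steps):
        x = rng.normal()              # random data sampling
        grad = 2 * (w * x - x) * x    # gradient of (w*x - x)**2
        w -= lr * grad
    return w

print(train(seed=1) == train(seed=1))  # True: same seed, reproducible run
print(train(seed=1) == train(seed=2))  # False: different seed, different weights
```

Both runs converge toward the same target, yet their final weights are not bit-identical; at LLM scale, the divergence between runs is far larger.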

3. Biases and hallucinations can’t be detected from weights alone

An AI model’s hallucinations and biases are the equivalent of software vulnerabilities and bugs.

Even with access to weights, you can’t “read” biases or hallucinations directly from the model. They are invisibly distributed across billions of parameters.

The only way to detect them is to test the model in real settings with benchmarks. Biases show up in use, not through inspection.
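A toy benchmark sketch illustrates the idea (the scoring function and prompt pairs below are entirely made up): since no parameter can be read directly, you probe the model with matched inputs and compare its behavior across groups.

```python
# Toy sketch of behavioral bias testing. The "model" here is an opaque
# black box (a made-up scoring function); we can only probe its outputs.
def model_score(prompt):
    # Stands in for an LLM's output; real model behavior emerges from
    # all parameters at once and cannot be read off any single weight.
    return sum(ord(c) for c in prompt) % 7

# Matched prompt pairs differing only in one attribute.
pairs = [
    ("the engineer said he", "the engineer said she"),
    ("the nurse said he", "the nurse said she"),
]

gaps = [model_score(a) - model_score(b) for a, b in pairs]
print(gaps)  # nonzero gaps on matched pairs flag potential bias
```

The inspection happens entirely at the level of outputs: the bias is measurable in the gaps, never visible in the parameters.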

Truly transparent initiatives are the exception

A few projects try to get closer to real transparency. LLM360 is a notable example: they publish not only final weights, but also intermediate checkpoints, full training code, and share data responsibly.

These initiatives remain rare because they first have to solve legal problems - which requires political action.

What approach should you adopt?

For a company investing in AI, reality is neither "all proprietary models" nor "all open-weight." The right approach combines both.

Open-weight models have real strategic value for diversification: using a mix of open-weight models (Mistral, Qwen…) and commercial models (GPT, Gemini…) can reduce dependency on a single provider and improve continuity and service quality.

At Ask This Guy, for example, we use both open-weight models (which we can host with one or more European infrastructure providers) and commercial models, depending on the need. We monitor each provider's performance in real time.

It’s precisely this diversity of providers that helps ensure a better level of service for our customers.
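A minimal sketch of such diversification (the provider functions below are hypothetical stand-ins, not a real API): route each request to a preferred provider and fall back to another when it fails.

```python
# Hypothetical sketch of multi-provider routing. Each "provider" is a
# stand-in callable; in practice these would wrap a self-hosted
# open-weight model and one or more commercial APIs.
class ProviderError(Exception):
    pass

def open_weight_model(prompt):
    return f"[open-weight] {prompt}"

def commercial_model(prompt):
    raise ProviderError("API unavailable")  # simulate an outage

def ask(prompt, providers):
    """Try providers in order; fall back to the next one on failure."""
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError:
            continue
    raise ProviderError("all providers failed")

answer = ask("Summarize this contract.", [commercial_model, open_weight_model])
print(answer)
```

When the commercial API is down, the self-hosted open-weight model answers instead: service continues because no single provider is a point of failure.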

(Cover image credit: Peter Adams, faces of Open-Source)

Tags: AI, Open-Source, Open-Weight, LLM, Transparency, Mistral, Generative AI
