LLM On-Premises for Your Enterprise: What Can You Do with €60k?

Agents, RAG, internal assistants: how to choose between Make and Buy for your AI project. Figures, use cases, and cost analysis to help you decide.

Jean-Christophe Budin
4 min read

Running open-weight AI models on your own servers gives you control: your data stays in-house, and you are independent from providers. For enterprise AI projects - agents, RAG, internal assistants - this is attractive. Sometimes essential.

But AI infrastructure costs are substantial, and the usual "Make or Buy" reasoning doesn't apply in quite the same way here. This article gives you the benchmarks to decide with a clear view of the trade-offs.

A few definitions before we dive in

  • LLM (Large Language Model): the “brain” of the system. This is what you will deploy on your server.
  • Number of parameters: a measure of the model’s “size”. The higher this number, the more capable the model is at fine reasoning (and the more compute and memory it requires).
  • Inference: the moment when the model answers your request. We are not talking about training (which has already happened), but simply asking it a question and getting an answer.
  • Token: a small unit of text (a word or part of a word). When we say “100 tokens per second”, it means the model produces about 100 of these units each second: the higher this number, the faster the response.
  • RAG (Retrieval-Augmented Generation): a technique that gives the model access to your documents (PDFs, internal wikis, etc.) so it can answer using your data rather than its memory alone. Often the trigger for an enterprise AI project.
  • Agentic AI: an AI that does not just answer once; it chains steps (calling a tool, searching, reasoning, retrying) to accomplish a complex task, like an assistant that takes initiative.
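To make the RAG definition concrete, here is a minimal sketch of the retrieval step. It uses bag-of-words vectors as a toy stand-in for a real embeddings model; the documents, query, and helper names are illustrative, not a real API:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for an embeddings model: bag-of-words counts.
    # A real system would call the embeddings model deployed alongside the LLM.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # The retrieval half of RAG: rank documents by similarity to the query.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "expense reports must be filed within 30 days",
    "the cafeteria is open from 11:30 to 14:00",
]
context = retrieve("when must expense reports be filed", docs)[0]
# The retrieved passage is then injected into the LLM prompt:
prompt = f"Answer using this context:\n{context}\n\nQuestion: ..."
```

The generation half is unchanged: the model simply answers with your document in its context window instead of relying on memory alone.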

Agentic AI: chaining steps and tool calls

Not one but several models to deploy

First important point: an enterprise AI project, as soon as it gets a bit advanced, does not rely on one model but on several models running in parallel:

  • Main LLM: text generation, complex reasoning, dialogue.
  • Lightweight or specialised models: fast, simple tasks (routing, extraction, naming).
  • Embeddings model: essential for semantic search (RAG).
  • Vision model: if you analyse images or screenshots.
  • Image generation model: if your use case requires it.

In short, you will most likely start with one or two models and, over time, end up running five or six.
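The glue between these models is usually a small routing layer that picks the right one per task. A minimal sketch, where the model names and task types are hypothetical placeholders:

```python
# Hypothetical routing table: which model handles which task type.
# Names are placeholders, not real endpoints.
MODEL_FOR_TASK = {
    "chat":       "large-llm",         # main LLM: dialogue, complex reasoning
    "naming":     "small-llm",         # lightweight: title a conversation
    "extraction": "small-llm",         # lightweight: pull out a field
    "search":     "embeddings-model",  # semantic search for RAG
    "vision":     "vision-model",      # image / screenshot analysis
}

def route(task_type: str) -> str:
    # Default to the main LLM when the task type is unknown.
    return MODEL_FOR_TASK.get(task_type, "large-llm")
```

Every model in this table needs GPU memory and compute of its own, which is why the hardware bill grows faster than "one LLM" suggests.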

Users are demanding

Users have become impatient: they are used to consumer services like ChatGPT that deliver near-instant responses. Speed matters - it is often what makes users accept or reject the tool. It is hard to give a single reference threshold, but below 100 tokens generated per second, the tool is likely to feel slow. If the AI is agentic - calling multiple tools and "reasoning" - you should aim for 500 tokens/second or more for a fast response.

Reference thresholds:

  • ~500 tokens/s: multi-step projects, tool calls, reasoning, complex agents.
  • ~100 tokens/s: simple tasks without reflection (basic chat, completion).
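A back-of-the-envelope calculation shows why agents need the higher threshold. The 400-tokens-per-step figure is an illustrative assumption, and tool latency is ignored:

```python
def wall_time_s(tokens_per_step: int, steps: int, tokens_per_s: float) -> float:
    # Seconds spent purely on generation, for `steps` sequential LLM calls
    # of roughly equal size (tool and network latency ignored).
    return steps * tokens_per_step / tokens_per_s

wall_time_s(400, 1, 100)  # 4.0 s: a single answer, acceptable for basic chat
wall_time_s(400, 5, 100)  # 20.0 s: a 5-step agent at the same speed feels slow
wall_time_s(400, 5, 500)  # 4.0 s: the same agent at 500 tokens/s is usable again
```

The generation cost multiplies with each chained step, so what feels fine for simple chat becomes painful for agents.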

Users are also used to many additional services. For example, attaching documents to messages (e.g. PDFs) and having them processed immediately. Cloud LLM services excel at this (Gemini, ChatGPT, etc.). On-premises, you can offer it by converting the document to images and using a vision model, but that will be much slower and often limited (e.g. in terms of number of pages per document).

Key questions to ask before deciding

1) What are your real use cases?

  • Simple tasks (extraction, reformulation, naming) or complex ones (reasoning, multi-document synthesis, tool chains)?
  • Expected request volume (peak, average, growth).
  • Level of reasoning required: is a factual answer enough, or do you need chained reasoning steps?
  • Need for multi-step and tool-calling (agents)? This strongly pushes toward larger models (for the orchestrator) and higher inference speed.

2) What response quality is acceptable?

On complex queries, the gap between a medium-sized and a large model is clear: depth of reasoning, nuance, consistency. For an internal assistant or business agent, this difference can mean users accepting or rejecting the solution.

Conversely, for very focused use cases (simple RAG, well-defined repetitive tasks), a smaller model may be enough with much lighter infrastructure.

3) What is your long-term vision?

  • Hardware obsolescence: will the GPUs you buy today be suited to tomorrow’s models (size, formats, frameworks)?
  • Evolving needs: will your project grow in complexity (more tools, more reasoning, more users)?
  • Flexibility vs commitment: on-premises commits you for 3–5 years; a provider lets you evolve without changing your hardware estate.

When “small” models are enough


Several use cases are fine with lighter models:

  • Simple, precise tasks: e.g. generating a conversation name, extracting a field, reformulating a sentence.
  • Simple retrieval: basic document search, RAG with little reasoning.
  • Strongly RAG-oriented use cases with factual queries and little logical chaining.
  • Tasks with fine-tuning on a narrow domain.

Models around 20 billion parameters (e.g. mistral-small or gpt-oss-20b) can be enough for this, if you accept some hallucinations and the fact that the model "won't reason". The upside: very light infrastructure, making the Make scenario feasible.

If your use case is focused and doesn't need complex reasoning, a 20B model can work. The investment stays reasonable.

But honestly? This scenario is rare. Most projects quickly grow beyond this.

When medium or large models are essential

As soon as complexity increases, a model around 100 billion parameters becomes the minimum. gpt-oss-120b from OpenAI is often deployed, both because of its vendor’s reputation and its solid performance for its size.

Models above 300 billion parameters (Mistral Large, GLM, DeepSeek, etc.) are recommended when use cases include for example:

  • Complex queries with deep reasoning.
  • Multi-step agents with tool calls and chaining.
  • Multi-document analysis with nuanced synthesis and decision-making.
  • Tasks requiring fine understanding of context and implicits.

The quality difference shows as soon as the query demands real reasoning or complexity.

In short: as soon as you target advanced agents, advanced RAG, or demanding business reasoning, large models become hard to replace without significantly degrading the experience.

Let’s talk money

"Make" scenario: deploying gpt-oss-120b on-premises

Let's look at a 120 billion parameter model - medium-sized, good for simple scenarios.

A realistic setup: 2× NVIDIA H100 GPUs. That's about €60k up front. According to the Clarifai benchmark, inference speed is:

  • ~200 tokens/s for 1 user.
  • ~40 tokens/s for 100 concurrent users.

For multi-step and agent usage, the 500 tokens/s target is far from reached: user experience can suffer, especially under load. Moreover, a model of this size remains insufficient for advanced use cases and complex reasoning: for those, you need to aim at larger models, which further increases on-premises infrastructure.
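A crude way to reason about this trade-off: batched inference roughly shares an aggregate throughput across concurrent users. Real serving stacks behave non-linearly, so treat this as a rough guide, not a benchmark:

```python
def per_user_tokens_per_s(aggregate_tokens_per_s: float, concurrent_users: int) -> float:
    # Simplifying assumption: batched inference splits the aggregate
    # throughput evenly. Real stacks scale non-linearly with batch size.
    return aggregate_tokens_per_s / concurrent_users

# Aggregate implied by the Clarifai figures above:
# 100 users x ~40 tok/s each is roughly 4,000 tok/s in total.
per_user_tokens_per_s(4000, 100)  # 40.0 - matches the benchmark figure
per_user_tokens_per_s(4000, 8)    # 500.0 - agent-grade speed, but for only 8 users
```

In other words, to give every user agent-grade speed you either cap concurrency hard or buy a lot more hardware.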

“Buy” scenario: using an inference provider

According to Cerebras documentation, advertised throughput on the same model is on the order of ~3,000 tokens/s. At ATG we observe 1,000 to 2,000 tokens/s, which is still 5 to 50 times faster than the on-premises setup above.

With a €60k on-premises investment, you can therefore deploy a first building block of your AI project. Not very fast, not very powerful, but it still covers a few low-complexity use cases.

Break-even calculation

To choose between Make and Buy:

  1. Estimate how many tokens you'll process per year (tricky to predict).
  2. Compare total hardware costs (purchase, power, maintenance, aging) to API costs over the same period.
  3. Remember: hardware ages fast. What works today might not fit tomorrow's models. APIs evolve without you buying new hardware.

In theory, Buy can be cheaper. In practice, it rarely is - but it buys you flexibility.
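The steps above can be sketched as a simple calculation. Only the €60k hardware figure comes from the scenario above; the amortisation period, operating costs, and API price are assumptions to replace with your own numbers:

```python
# Illustrative break-even sketch. Plug in your own figures: everything
# except HARDWARE_EUR (the 2x H100 setup cited above) is an assumption.
HARDWARE_EUR = 60_000          # upfront purchase, from the Make scenario
YEARS = 4                      # assumed amortisation period
OPEX_EUR_PER_YEAR = 8_000      # assumed power, hosting, maintenance
API_EUR_PER_M_TOKENS = 0.50    # assumed blended API price per million tokens

def make_cost() -> float:
    # Total cost of ownership over the amortisation period.
    return HARDWARE_EUR + OPEX_EUR_PER_YEAR * YEARS

def buy_cost(tokens_per_year: float) -> float:
    # API spend over the same period.
    return tokens_per_year / 1_000_000 * API_EUR_PER_M_TOKENS * YEARS

def breakeven_tokens_per_year() -> float:
    # Annual volume above which Make becomes cheaper than Buy.
    return make_cost() / (API_EUR_PER_M_TOKENS * YEARS) * 1_000_000

breakeven_tokens_per_year()  # 46 billion tokens/year under these assumptions
```

Under these (debatable) assumptions you would need tens of billions of tokens per year before the hardware pays for itself, and that is before pricing in obsolescence.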

Conclusion

On-premises makes sense for simple, well-defined use cases with strict sovereignty needs - or when you're processing huge volumes. For ambitious projects with agents and complex reasoning, Buy usually wins on speed and simplicity.

On-premises is often a conviction-driven choice. Nothing wrong with that. But it's easier to implement once you've already learned what your project needs.

Our recommendation at Ask This Guy: start with a provider to gain maturity, validate your use cases and volumes, then consider on-premises only if volume and specific requirements (sovereignty, long-term cost) justify it.


Want to deploy a high-performance AI assistant without managing infrastructure? Book a 20-minute demo, we’ll show you how.

Tags: LLM, Sovereignty, RAG, Agents, Make vs Buy, Infrastructure, Performance
