We benchmarked the Gulf's sovereign models on an H100. Here is what we found.

There is a lot of pride in the new wave of sovereign models, the Arabic and Indic systems built in the Gulf and India, and there should be. But pride is not a number. Before we commit capital to serving these models, we wanted to know one boring thing precisely: how many tokens a second do they actually produce on a real GPU, and what does that mean for the cost of a token? So we rented a single H100 for about an hour, served three of them with vLLM, and measured. Here is the data, the gotchas, and the economics it implies. No vendor slides, just what the cards reported.

New here? The plain-words version

What this is, without the jargon

The brains. When you ask our AI something, a computer "brain" (an AI model) writes the answer. Our special ones were built in the Gulf and India and are great at Arabic and Indian languages, not just English. Most global providers skip those, so they are our edge.

The Arena. Two mystery brains answer the same question and you pick the better one, like a blind taste test. Your vote pushes the winner up a leaderboard.

Why it can be cheap. These brains think on a giant, pricey computer chip called a GPU, and it gulps electricity like a thirsty camel. We sit on cheap desert power, so running them costs less, which means it costs you less.

What we did below. Instead of guessing how good our setup was, we rented one of those chips for an hour and timed the brains with a stopwatch. Then we did a packing trick that made one slow brain more than twice as fast. And we did the honest money math on whether it all adds up. The rest of this post is that story, in detail.

The setup

One GPU, clean conditions, repeatable method. The goal was a defensible throughput figure per model, not a leaderboard.

Component | Choice
----------|-------
GPU       | One NVIDIA H100 80GB SXM (rented, on-demand)
Server    | vLLM 0.22, OpenAI-compatible
Models    | Falcon-H1 7B and 34B (UAE / TII), Sarvam-M 24B (India / Sarvam AI)
Workload  | 512 input tokens, 256 output tokens, 200 prompts, full concurrency
Metric    | Aggregate output tokens per second (what the user waits on)

A note on honesty: we wanted an A100, the cheaper card, but on the day the cheap on-demand stock was gone, so we ran on an H100. That is its own small lesson about this market, and we will come back to it. The benchmark itself is a one-line tool that ships with vLLM:

    # one model, real numbers, no dataset to wrangle
    vllm bench throughput --model tiiuae/Falcon-H1-7B-Instruct \
      --input-len 512 --output-len 256 --num-prompts 200 \
      --max-model-len 8192 --gpu-memory-utilization 0.90
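If you want to repeat the run for each model, a small wrapper saves retyping. This is a minimal sketch, not our harness: it reuses only the flags shown in the command above, it builds the command lines without executing them, and the two extra model IDs are assumed repository names (the post only names the 7B one), so check them before use.

```python
import shlex

# Only the first ID appears in the post. The other two are assumed
# repository names and should be verified before running anything.
MODELS = [
    "tiiuae/Falcon-H1-7B-Instruct",
    "sarvamai/sarvam-m",              # assumption
    "tiiuae/Falcon-H1-34B-Instruct",  # assumption
]


def bench_command(model, input_len=512, output_len=256, num_prompts=200,
                  max_model_len=8192, gpu_memory_utilization=0.90):
    """Build the argument list for one `vllm bench throughput` run."""
    return [
        "vllm", "bench", "throughput",
        "--model", model,
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # Print the commands for review; pass a list to subprocess.run to execute.
    for model in MODELS:
        print(shlex.join(bench_command(model)))
```

The Falcon-H1 runs needed lower concurrency than this, and the 34B needed further changes, as described below. Those extra settings are not reproduced here, so treat the wrapper as a starting point.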

The numbers

Here is what one H100 produced, in aggregate output tokens per second under full load.

Model         | Params | Output tok/s | Notes
--------------|--------|--------------|------
Falcon-H1 7B  | 7B     | 1,963        | hybrid Mamba + Transformer
Sarvam-M 24B  | 24B    | 1,189        | standard Transformer (Mistral-based)
Falcon-H1 34B | 34B    | 251          | barely fits one card, see below

If you serve open models, two of those numbers will look low, and you would be right to think so. A plain 7B Transformer on an H100 will happily push several thousand output tokens a second; Falcon-H1 7B gave us under two thousand. The 34B, at 251, is in a different category of slow. The reasons are interesting, and they are not about model quality.

Why the hybrid models are slower to serve (today)

Falcon-H1 is a hybrid: it mixes Transformer attention with Mamba, a state-space layer that trades the attention cache for a small fixed-size recurrent state. On paper that is a serving dream, because the memory per sequence stops growing with context. In practice, in mid-2026, the serving stack has not caught up with the architecture. Two things bit us:

First, concurrency is capped by Mamba cache blocks. vLLM allocates a fixed pool of these state slots, and each in-flight sequence needs one. The engine refused to start until we lowered the number of concurrent sequences to fit the pool. Fewer concurrent sequences means less batching, and less batching means lower aggregate throughput. That is most of the gap between Falcon-H1 7B and a comparable plain Transformer.

Second, the 34B simply does not leave room to go fast on one card. Its weights take about 68 of the 80 gigabytes, which left too little for the working memory that the fast path needs, and the first run failed with an out-of-memory error during graph capture. It ran only after we disabled that optimization and shrank the batch, which is why 251 tokens a second is a floor, not a verdict. Quantize the weights to 8-bit, or give it a second GPU, and that number moves a lot.
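The memory squeeze is easy to check on the back of an envelope. The sketch below assumes a nominal 34 billion parameters, two bytes per parameter in bf16 and one in fp8, decimal gigabytes, and the same 0.90 memory budget as the 7B command above (we are assuming that setting for illustration). It ignores everything except the weights, so the real headroom is smaller still.

```python
GB = 1e9  # decimal gigabytes, matching the "80GB" on the card's label


def weight_gb(params_billion, bytes_per_param):
    """Approximate weight footprint: parameter count times bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / GB


def headroom_gb(params_billion, bytes_per_param, card_gb=80, utilization=0.90):
    """Memory left inside the serving budget once the weights are loaded."""
    return card_gb * utilization - weight_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for label, bytes_per_param in (("bf16", 2), ("fp8", 1)):
        weights = weight_gb(34, bytes_per_param)
        spare = headroom_gb(34, bytes_per_param)
        print(f"{label}: weights ~{weights:.0f} GB, headroom ~{spare:.0f} GB")
```

In bf16 that is roughly 68 GB of weights against a 72 GB budget, a few gigabytes for caches and the working memory the fast path needs. Halving the bytes per parameter frees about 34 GB.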

None of this is a knock on the models. Falcon-H1 tops the open Arabic leaderboards and Sarvam-M is a serious Indic system. It is a knock on how young the serving software is for new architectures. The lesson for anyone planning to host these: budget engineering time, not just GPU time. The weights are open; making them fast is the work.

From tokens a second to the cost of a token

Now the part that decides whether any of this is a business. A GPU costs the same whether it is flat out or sitting idle. So the economics of inference are not really about the price of the card, they are about how full you keep it. The decisive number is utilization, and throughput sets how much revenue a full GPU can earn.

Work it through for the 7B. Call a rented H100 roughly $2.70 an hour on the open market. The workload carries 512 prompt tokens for every 256 generated, so when prompt and completion are billed together, each output token stands for three billable ones. At 1,963 output tokens a second, a fully loaded card therefore moves about 21 million billable tokens an hour. At open small-model rates of around ten cents per million tokens, that is about $2.10 an hour of revenue when the card is pinned at 100 percent. Against a $2.70 cost, the card does not pay for itself even when it is completely full.
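Here is that arithmetic as a short sketch, so you can swap in your own throughput and price. The function names are ours for illustration. The ten-cent price is the illustrative open-market rate from the paragraph above, not a quote.

```python
def billable_tokens_per_hour(output_tok_s, input_len=512, output_len=256):
    """Scale measured output throughput up to prompt plus completion tokens."""
    billable_per_output = (input_len + output_len) / output_len  # 3.0 here
    return output_tok_s * billable_per_output * 3600


def revenue_per_hour(output_tok_s, price_per_million):
    """Revenue from one card held at 100 percent load for an hour."""
    return billable_tokens_per_hour(output_tok_s) / 1e6 * price_per_million


if __name__ == "__main__":
    tokens = billable_tokens_per_hour(1963)
    revenue = revenue_per_hour(1963, 0.10)
    print(f"{tokens / 1e6:.1f}M billable tokens/hour")  # 21.2M
    print(f"${revenue:.2f}/hour at full load")          # $2.12
```

The unrounded figure is $2.12 an hour, which the text rounds to about $2.10. Either way it sits below the $2.70 rental.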

>100%: A rented 7B can't break even at open rates, even pinned flat out (the math wants ~127%)
~60%: Break-even use once you own the card on cheap power
$0.10/kWh: the cost floor that flips it

Read that first number again. On rented hardware at those rates, a small sovereign model needs more than a full card's worth of demand to break even, which is impossible. That is not a DesertGrid problem, it is the shape of the whole market: serving frontier-quality open models cheaply, on someone else's expensive GPUs, does not close. There are only three honest ways to make it close, and they are the entire strategy.
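Break-even utilization is just hourly cost divided by the revenue a full card can earn. A minimal sketch using the 7B figures from this post (the function and variable names are ours):

```python
def break_even_utilization(cost_per_hour, full_load_revenue_per_hour):
    """Fraction of the hour the card must be busy to cover its cost."""
    return cost_per_hour / full_load_revenue_per_hour


# 1,963 output tok/s, three billable tokens per output token, $0.10 per million.
full_load_revenue = 1963 * 3 * 3600 / 1e6 * 0.10
utilization = break_even_utilization(2.70, full_load_revenue)

if __name__ == "__main__":
    print(f"break-even utilization: {utilization:.0%}")  # 127%
```

Anything above 100 percent means no amount of demand rescues the card at that price and cost. Only a change to throughput, price, or cost base does.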

The three levers, and the only durable one

Throughput. Quantize the weights, batch harder, wait for the serving stack to mature. This is real and it helps, especially for the 34B, but it is a moving target you do not control.

Price. Charge more. Sovereign models can justify a premium, because they are in-region, data-resident under local law, and tuned for languages the global providers treat as an afterthought. Worth doing, but a premium only stretches so far before a buyer reaches for a cheaper generic model.

Cost base. Lower the denominator. This is the one that compounds. Notice what happened to the break-even number above when the card was owned and run on cheap power instead of rented: it fell from impossible to roughly sixty percent utilization. Most of a rented GPU's hourly price is the chip's own capital cost amortized over its life, and the next biggest line is electricity, paid continuously on every token for as long as the hardware runs. Own the chip, run it where power is cheap, and keep it busy, and the same model that loses money for a renter can make money for you.

Cheap power is not the whole answer for inference, and we will not pretend it is. The chip's capital cost matters more per hour than its electricity. But power is the line that never stops, it compounds across a three-year life, and it is the one advantage a competitor cannot copy with a press release. It is why we did this measurement before building, and why we build on a cost floor we can see.
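To see how capital and power trade off per hour, here is a sketch of an owned-card cost model. Only the $0.10/kWh rate and the three-year life come from this post. The purchase price and power draw are placeholder assumptions for illustration, not our figures, so the output is a shape rather than a quote.

```python
HOURS_PER_YEAR = 365 * 24


def capex_per_hour(purchase_price, life_years):
    """Straight-line amortization of the hardware over its service life."""
    return purchase_price / (life_years * HOURS_PER_YEAR)


def power_per_hour(draw_kw, price_per_kwh):
    """Electricity cost for one hour at a steady draw."""
    return draw_kw * price_per_kwh


def owned_cost_per_hour(purchase_price, life_years, draw_kw, price_per_kwh):
    return capex_per_hour(purchase_price, life_years) + power_per_hour(draw_kw, price_per_kwh)


if __name__ == "__main__":
    price = 30_000  # hypothetical all-in cost per GPU, USD
    draw = 1.0      # hypothetical all-in draw per GPU, kW
    capex = capex_per_hour(price, 3)
    power = power_per_hour(draw, 0.10)
    print(f"capex ${capex:.2f}/h, power ${power:.2f}/h, total ${capex + power:.2f}/h")
```

With placeholder inputs like these, the capital line dwarfs the power line per hour, which is the point made above. Change the inputs to match your own hardware quote before drawing conclusions.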

Update: we ran the throughput lever

We did not want to leave "quantize and it moves" as a hand-wave, so we went back to the same H100 and re-served every model in fp8, the eight-bit weight format that halves the memory a model's weights occupy compared with bf16. Same card, same workload, before and after:

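Model         | Before (bf16), tok/s | After (fp8), tok/s | Change
--------------|----------------------|--------------------|-------
Falcon-H1 7B  | 1,917                | 2,221              | +16%
Sarvam-M 24B  | 1,195                | 1,779              | +49%
Falcon-H1 34B | 235                  | 511                | +117%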

The pattern is the lesson. fp8 helps a plain Transformer most, because the memory it frees turns straight into more parallel requests, so Sarvam-M jumped by half. The Mamba-heavy Falcon-H1 7B gained the least, because its ceiling is the state-space kernels, not memory. And the throughput-starved 34B more than doubled, exactly where we expected, because halving the weights is the difference between a model that barely fits and one with room to work. It is still memory-bound enough to refuse the fastest execution path on a single card, so a second GPU would push it further again.
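The change column is plain percent change on the measured throughput. A short sketch that recomputes it from the before and after figures in the table:

```python
# (bf16 tok/s, fp8 tok/s) as reported in the table above.
RESULTS = {
    "Falcon-H1 7B": (1917, 2221),
    "Sarvam-M 24B": (1195, 1779),
    "Falcon-H1 34B": (235, 511),
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    for model, (bf16, fp8) in RESULTS.items():
        print(f"{model}: {percent_change(bf16, fp8):+.0f}%")
    # Falcon-H1 7B: +16%
    # Sarvam-M 24B: +49%
    # Falcon-H1 34B: +117%
```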

What does that do to the economics? It moves the lines, and it does not move the conclusion. The 34B crosses from "cannot break even at any load" to a still-demanding eighty percent or so of a full card, and Sarvam-M now breaks even under half a card's worth of demand. The small 7B-class models stay marginal on rented hardware, the same story the box above told. Optimization narrows the gap. The cost base is what closes it. We folded the winning settings into how we serve each model, so the speed is not a one-off benchmark, it is the default.
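For the 7B, the fp8 figure can be run through the same break-even arithmetic as before. For the larger models this post does not state the token prices behind the break-even claims, so the second function only backs out what price a given break-even would imply if the $2.70 rented rate were the cost base. That is our assumption for illustration, not a published rate.

```python
RENTED_COST_PER_HOUR = 2.70               # rented H100 reference from this post
BILLABLE_PER_OUTPUT = (512 + 256) / 256   # prompt plus completion per output token


def billable_millions_per_hour(output_tok_s):
    return output_tok_s * BILLABLE_PER_OUTPUT * 3600 / 1e6


def break_even_utilization(output_tok_s, price_per_million=0.10):
    """Share of a full card needed to cover the rental at a given token price."""
    full_load_revenue = billable_millions_per_hour(output_tok_s) * price_per_million
    return RENTED_COST_PER_HOUR / full_load_revenue


def implied_price(output_tok_s, target_utilization):
    """Token price (USD per million) at which the card breaks even at that load."""
    return RENTED_COST_PER_HOUR / (billable_millions_per_hour(output_tok_s) * target_utilization)


if __name__ == "__main__":
    print(f"7B fp8 at $0.10/M: {break_even_utilization(2221):.0%}")    # 113%
    print(f"34B fp8 at 80% load: ${implied_price(511, 0.80):.2f}/M")   # $0.61
```

At ten cents per million the 7B still needs more than a full card even in fp8, which is why it stays marginal on rented hardware.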

One honest caveat: fp8 is a lossy format. The throughput is free, the accuracy is not quite, so a serious deployment validates quality before it ships the speed. We measured the speed here; the quality pass is its own piece of work, and we will not quietly skip it.

Where this leaves DesertGrid

We host sovereign Arabic and Indic models so you do not have to fight the serving stack yourself, behind one OpenAI-compatible endpoint, on infrastructure designed around a low cost floor: Liwa's liquid-cooled, free-zone capacity at $0.10/kWh. The benchmark above is why that floor matters so much, and why we measured it in dollars before we spent any. Sign in to use the models, or browse the catalog and vote for which sovereign model we stand up next.

Questions we're sitting with

  • How much of the hybrid-model throughput gap is the architecture, and how much is just serving software that has not caught up yet?
  • If serving frontier-quality open models does not close on rented GPUs, how many providers quoting low prices are funding the gap out of a runway?
  • When you choose an inference provider, are you buying their price, or the cost floor that price has to survive?

Sources and notes

  1. vLLM documentation (serving engine and benchmark tool)
  2. TII, Falcon-H1 model family
  3. Sarvam AI, Sarvam-M
  4. RunPod GPU pricing (rented H100 reference)
  5. Liwa, white-label AI colocation at $0.10/kWh

Figures are a single-GPU snapshot, vLLM 0.22 on one H100 80GB, May 2026, output tokens per second at 512 in / 256 out. Throughput on these architectures is improving as serving software matures, so treat the numbers as a floor for their date. The economics use public GPU rates and illustrative open-model token prices; they are an analysis, not a quote or a promise of returns. DesertGrid and Liwa are Segments ventures.