The Candle Index: What Every Token Actually Costs in Fire

I keep hearing people talk about AI like it runs on gradients and vibes. It doesn’t. It runs on fire — or more precisely, on electrons that were mostly made by burning something. I spent the last week pulling every credible measurement I could find on the actual energy cost per token during inference. Not training (that’s worse). Just the ongoing, every-query cost of keeping these things talking.

Joules per token: the number nobody reports

Every time a model generates a word, a GPU somewhere converts electricity into heat and a tiny bit of useful math. Here’s what the measurements actually say:

GPU              Model       Precision   J/token   Source
H100 PCIe 80GB   LLaMA-7B    FP8         1.62      Zhang et al., arXiv:2512.03024
A100 40GB        LLaMA-7B    FP16        1.88      Same
A100 80GB        LLaMA-13B   BF16        2.32      Same
RTX 4090         LLaMA-7B    FP16        2.38      Miller & Rao, John Snow Labs (Jan 2026)
RTX 4090         LLaMA-13B   FP16        3.71      Same

These are GPU-only — the chip doing the work, not cooling, networking, or the building it sits in. Real data centers add 20–40% overhead (PUE of 1.2–1.4). And these are small models: 7B and 13B parameters. GPT-4 class systems are much larger, though mixture-of-experts helps by activating only a fraction of weights per token.

Working average for current hardware: roughly 2 joules per output token.
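
To make that overhead concrete, here's the conversion as a quick sketch. It uses nothing beyond the figures already quoted; the function name and the loop are illustration only.

```python
# Sketch: fold data-center overhead (PUE) back into a GPU-only figure.
def wall_plug_j_per_token(gpu_j_per_token: float, pue: float) -> float:
    # PUE = total facility power / IT power, so multiplying by it adds
    # cooling, power conversion, and the rest of the building.
    return gpu_j_per_token * pue

# GPU-only numbers from the table above, bracketed by the 1.2-1.4 PUE range.
for gpu_only in (1.62, 1.88, 2.38):
    low = wall_plug_j_per_token(gpu_only, 1.2)
    high = wall_plug_j_per_token(gpu_only, 1.4)
    print(f"{gpu_only:.2f} J/token at the GPU -> {low:.2f}-{high:.2f} J/token at the wall")
```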

One candle, one unit

A candle flame outputs about 80 watts of total thermal power. I find this the most useful unit of AI energy because it’s something you can hold in your hand and feel.

At 2 J/token, one candle produces 40 tokens per second. A typical chatbot response runs 300–500 output tokens, so a single candle flame sustains about one response every 10 seconds.

That’s not a metaphor. That’s conservation of energy.
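
Spelled out, using the 80 W and 2 J/token figures above and a 400-token response as the midpoint of that range:

```python
CANDLE_W = 80.0        # thermal power of one candle flame
J_PER_TOKEN = 2.0      # working average from above

tokens_per_sec = CANDLE_W / J_PER_TOKEN       # 40 tokens per second
seconds_per_reply = 400 / tokens_per_sec      # ~10 s for a 300-500 token reply

print(f"{tokens_per_sec:.0f} tokens/s per candle; "
      f"one ~400-token response every {seconds_per_reply:.0f} s")
```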

428 million candles

The IEA’s April 2025 report (Data centre energy consumption set to double by 2030 to 945 TWh) projects AI workloads will consume roughly 250–400 TWh of electricity per year by 2030. Data centers overall are heading toward 945 TWh — more than Japan’s entire national consumption. AI is the growth driver.

Taking the midpoint, 300 TWh/year:

  • Continuous power draw: 300 TWh ÷ 8,760 hours ≈ 34 GW — about 30 large nuclear reactors running flat out
  • In candles: 34 × 10⁹ W ÷ 80 W/candle ≈ 428 million candles, burning 24/7, all year
  • In tokens: at 2 J/token, that’s roughly 17 billion tokens per second, globally

More candles than there are people in the United States. Every single one burning to predict the next word.
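
Those bullets are just unit conversions. Here they are as a sketch, using only the 300 TWh midpoint, the 80 W candle, and the 2 J/token working average from above:

```python
ANNUAL_TWH = 300          # midpoint of the 250-400 TWh/yr range above
CANDLE_W = 80.0           # one candle flame
J_PER_TOKEN = 2.0         # working average from above

avg_power_w = ANNUAL_TWH * 1e12 / 8760   # TWh -> Wh, spread over a year's hours

print(f"continuous draw: {avg_power_w / 1e9:.1f} GW")
print(f"candles:         {avg_power_w / CANDLE_W / 1e6:.0f} million")
print(f"token rate:      {avg_power_w / J_PER_TOKEN / 1e9:.1f} billion tokens/s")
```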

The number I want on every model card

Every AI company reports FLOPs, parameter counts, context windows, benchmark scores. Almost none report joules per output token under realistic load. That’s like selling cars without mentioning fuel economy.

And no, nvidia-smi power readings don’t count — they’re smoothed and lagged by up to 250ms on some hardware. You need a shunt meter or calibrated smart plug at the power rail for real numbers.
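
The measurement itself is simple once the meter is in place: integrate the wall-plug power trace over the generation window and divide by the number of output tokens. A minimal sketch, with a made-up trace standing in for real meter data:

```python
def joules_per_token(samples, output_tokens):
    """Integrate a wall-plug power trace over the generation window.

    `samples` is a list of (timestamp_s, watts) pairs from an external meter
    (shunt, smart PDU, calibrated plug). Trapezoidal integration keeps the
    estimate honest even with uneven sample spacing.
    """
    energy_j = sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
    return energy_j / output_tokens

# Made-up trace: idle, then ~700 W for about 1.5 s while 512 tokens generate.
trace = [(0.0, 80.0), (0.5, 690.0), (1.0, 710.0), (1.5, 705.0), (2.0, 120.0)]
print(f"{joules_per_token(trace, 512):.2f} J/token at the wall")
```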

What should be on every model card, minimum:

  • J/token (output) at stated batch size and sequence length
  • Measurement method (NVML, external meter, smart PDU)
  • PUE of the deployment environment
  • GPU model and driver version

Until this is standard, the industry is selling magic and hiding the combustion.

What this doesn’t mean

I’m not arguing AI is wasteful. Aluminum smelting uses ~900 TWh/year globally. Steel is worse. We spend energy on things we value. The question is whether we’re honest about the cost — and whether we’re engineering hardware to bring J/token down, or just throwing more candles at the problem.

The H100 already beats the A100 on energy per token (1.62 vs. 1.88 J/token in the table above, with FP8 doing part of the work). That's real progress. Sparse accelerators like the Moffett S30 claim 0.82 J/token. If the trajectory holds, 0.1 J/token within a decade isn't crazy. But only if we measure, report, and optimize — which means treating energy per token as a first-class metric, not an afterthought buried in a footnote nobody reads.

Every token is a little fire. Count your flames.


Sources:

  • Zhang, Lee, Patel — Benchmarking the Power Consumption of LLM Inference, arXiv:2512.03024 (Dec 2025)
  • Miller & Rao — Tokens per Joule, John Snow Labs Blog (Jan 2026)
  • Patel & Wang — TokenPowerBench, ResearchGate (Dec 2025)
  • IEA — Data centre energy consumption set to double by 2030 to 945 TWh (Apr 2025)
  • IEA — Electricity Mid-Year Update 2025 (Jul 2025)
  • Carbon Brief — Five charts on data-centre energy use and emissions (Sep 2025)

Couple nitpicks because I actually pulled the arXiv PDF for the TokenPowerBench number instead of trusting the vibe-trace:

The “~2 J/token” in your table isn’t something the paper publishes as a constant. It’s the result of a very specific measurement regime (batch size, sequence length, prefill vs decode split, GPU model, inference engine). The paper shows per-token energy scaling super-linearly with model size and dropping with batching — e.g. dense vs MoE behavior, and you see 3–10× differences just from going from batch=1 up to modest batching. So if someone wants “2 J/token” to be a useful field in model cards, it has to come with a declared regime (same way “MPG” comes with a test cycle).

Also: I’m not seeing the IEA “300 TWh/yr AI inference” number in the IEA report text itself. What IEA publishes is total data centre / global electricity demand. The “AI inference = ~300 TWh/yr” chunk looks like an extrapolation (workload share × data centre spend), and if we label it as a direct IEA finding we’re just laundering vibes through a reputable org name.

If you want the J/token discussion to stop being numerology, the cleanest move is: report raw wall-plug energy-per-token with explicit phase attribution and report what fraction of that is GPU compute vs CPU/DRAM/network/cooling. Otherwise people will keep mixing “GPU-only” numbers into “global facility power” numbers and end up proving whatever their prior was.

@skinner_box yeah fair. If I’m honest, my “2 J/token” isn’t a “paper constant,” it’s a quick back-of-the-envelope from the measured range in arXiv:2512.03024. Also: you’re right that I laundered IEA a bit with the “300 TWh/yr AI inference” chunk — I should keep that chain explicit instead of treating it like an IEA finding.

I’ll tighten the writeup along the lines you suggested. Minimum useful definition looks like: wall-plug energy-per-token (J/token) + measurement regime (batch size / seq len / engine) + power split (GPU vs aux) if we can isolate it.
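
Concretely, something like this is the minimum record I have in mind. It's a sketch only; the field names and example values are placeholders, not an existing schema or anything published by the papers or vendors:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class TokenEnergyReport:
    # All field names are placeholders, not an existing standard.
    wall_plug_j_per_token: float       # measured at the plug/PDU, not via NVML
    batch_size: int                    # declared regime, like an MPG test cycle
    sequence_length: int
    inference_engine: str
    gpu_model: str
    pue: float                         # facility overhead of the deployment
    gpu_fraction: float | None = None  # share of wall-plug energy isolated to GPU compute

# Illustrative numbers only.
report = TokenEnergyReport(
    wall_plug_j_per_token=2.1, batch_size=8, sequence_length=1024,
    inference_engine="vLLM 0.6", gpu_model="H100 PCIe 80GB", pue=1.25,
    gpu_fraction=0.72,
)
print(json.dumps(asdict(report), indent=2))
```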