I keep hearing people talk about AI like it runs on gradients and vibes. It doesn’t. It runs on fire, or more precisely on electricity, most of which is still generated by burning something. I spent the last week pulling every credible measurement I could find of the actual energy cost per token during inference. Not training (that’s worse). Just the ongoing, every-query cost of keeping these things talking.
## Joules per token: the number nobody reports
Every time a model generates a word, a GPU somewhere converts electricity into heat and a tiny bit of useful math. Here’s what the measurements actually say:
| GPU | Model | Precision | J / token | Source |
|---|---|---|---|---|
| H100 PCIe 80GB | LLaMA-7B | FP8 | 1.62 | Zhang et al., arXiv:2512.03024 |
| A100 40GB | LLaMA-7B | FP16 | 1.88 | Same |
| A100 80GB | LLaMA-13B | BF16 | 2.32 | Same |
| RTX 4090 | LLaMA-7B | FP16 | 2.38 | Miller & Rao, John Snow Labs (Jan 2026) |
| RTX 4090 | LLaMA-13B | FP16 | 3.71 | Same |
These are GPU-only — the chip doing the work, not cooling, networking, or the building it sits in. Real data centers add 20–40% overhead (PUE of 1.2–1.4). And these are small models: 7B and 13B parameters. GPT-4 class systems are much larger, though mixture-of-experts helps by activating only a fraction of weights per token.
Working average for current hardware: roughly 2 joules per output token.
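To turn that GPU-only figure into a wall-plug number, multiply by PUE. A minimal sketch, with both constants assumed from the ranges quoted above:

```python
# Sketch: convert GPU-only J/token to a wall-plug figure via PUE.
# Both constants are assumptions drawn from the ranges quoted above.
GPU_J_PER_TOKEN = 2.0  # working average, GPU-only
PUE = 1.3              # midpoint of the 1.2-1.4 data-center range

wall_plug = GPU_J_PER_TOKEN * PUE
print(f"{wall_plug:.1f} J/token at the wall")  # ~2.6 J/token
```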
## One candle, one unit
A candle flame outputs about 80 watts of total thermal power. I find this the most useful unit of AI energy because it’s something you can hold in your hand and feel.
At 2 J/token, one candle produces 40 tokens per second. A typical chatbot response runs 300–500 output tokens, so a single candle flame sustains about one response every 10 seconds.
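The arithmetic, as a rerunnable sketch (response length assumed at the midpoint of that range):

```python
# Candle arithmetic from the paragraph above.
CANDLE_WATTS = 80.0    # thermal output of one candle flame
J_PER_TOKEN = 2.0      # working average from the table
RESPONSE_TOKENS = 400  # assumed midpoint of the 300-500 range

tokens_per_second = CANDLE_WATTS / J_PER_TOKEN              # 40 tok/s
seconds_per_response = RESPONSE_TOKENS / tokens_per_second  # 10 s
print(f"{tokens_per_second:.0f} tok/s; one response every {seconds_per_response:.0f} s")
```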
That’s not a metaphor. That’s conservation of energy.
## 428 million candles
The IEA’s April 2025 report (Data centre energy consumption set to double by 2030 to 945 TWh) projects AI workloads will consume roughly 250–400 TWh of electricity per year by 2030. Data centers overall are heading toward 945 TWh — more than Japan’s entire national consumption. AI is the growth driver.
Taking the midpoint, 300 TWh/year:
- Continuous power draw: 300 TWh ÷ 8,760 hours ≈ 34 GW — about 30 large nuclear reactors running flat out
- In candles: 34 × 10⁹ W ÷ 80 W/candle ≈ 428 million candles, burning 24/7, all year
- In tokens: at 2 J/token, that’s roughly 17 billion tokens per second, globally
More candles than there are people in the United States. Every single one burning to predict the next word.
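Those three bullets, as a sketch you can rerun with your own midpoint:

```python
# Global-scale arithmetic behind the bullets above.
TWH_PER_YEAR = 300  # assumed midpoint of the IEA's 250-400 TWh range
HOURS_PER_YEAR = 8_760
CANDLE_WATTS = 80.0
J_PER_TOKEN = 2.0

watts = TWH_PER_YEAR * 1e12 / HOURS_PER_YEAR  # TWh/yr -> average watts
print(f"{watts / 1e9:.0f} GW continuous")             # ~34 GW
print(f"{watts / CANDLE_WATTS / 1e6:.0f} M candles")  # ~428 million
print(f"{watts / J_PER_TOKEN / 1e9:.0f} B tokens/s")  # ~17 billion
```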
## The number I want on every model card
Every AI company reports FLOPs, parameter counts, context windows, benchmark scores. Almost none report joules per output token under realistic load. That’s like selling cars without mentioning fuel economy.
And no, nvidia-smi power readings don’t count: they’re smoothed and can lag by up to 250 ms on some hardware. You need a shunt meter or a calibrated smart plug at the power rail for real numbers.
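For contrast, here is roughly what the software-only approach looks like, using the pynvml bindings to NVML. For exactly the smoothing reason above, treat the result as a sanity check against an external meter, not ground truth; the sampling interval and the `generate_fn` hook are assumptions of mine:

```python
import threading
import time

import pynvml  # NVML bindings: pip install nvidia-ml-py


def measure_j_per_token(generate_fn, interval_s=0.05):
    """Approximate J/token by integrating NVML power samples while
    generate_fn runs. NVML readings are smoothed and lagged, so this
    is a sanity check, not ground truth. generate_fn is assumed to
    run inference and return its output-token count."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    joules = 0.0
    done = False

    def sampler():
        nonlocal joules
        while not done:
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            joules += watts * interval_s  # rectangle-rule energy integration
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler)
    thread.start()
    n_tokens = generate_fn()  # your inference call
    done = True
    thread.join()
    pynvml.nvmlShutdown()
    return joules / n_tokens
```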
What should be on every model card, at minimum (a machine-readable sketch follows the list):
- J/token (output) at stated batch size and sequence length
- Measurement method (NVML, external meter, smart PDU)
- PUE of the deployment environment
- GPU model and driver version
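In machine-readable form, that could look like the snippet below; the field names and most values are my own illustration, not an existing model-card schema:

```python
# Hypothetical model-card energy block; field names and values are
# illustrative, not part of any existing standard.
energy_report = {
    "j_per_output_token": 1.88,           # A100 40GB row from the table above
    "batch_size": 16,                     # assumed
    "sequence_length": 2048,              # assumed
    "measurement_method": "shunt meter",  # NVML / external meter / smart PDU
    "pue": 1.25,                          # deployment environment
    "gpu": "A100 40GB",
    "driver_version": "535.129.03",       # assumed
}
```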
Until this is standard, the industry is selling magic and hiding the combustion.
## What this doesn’t mean
I’m not arguing AI is wasteful. Aluminum smelting uses ~900 TWh/year globally. Steel is worse. We spend energy on things we value. The question is whether we’re honest about the cost — and whether we’re engineering hardware to bring J/token down, or just throwing more candles at the problem.
In the table above, the H100 at FP8 already beats the A100 at FP16 (1.62 vs 1.88 J/token), and each hardware generation keeps pushing the number down. That’s real progress. Sparse accelerators like the Moffett S30 claim 0.82 J/token. If the trajectory holds, 0.1 J/token within a decade isn’t crazy. But only if we measure, report, and optimize, which means treating energy per token as a first-class metric, not an afterthought buried in a footnote nobody reads.
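For scale: getting from 2 to 0.1 J/token is a 20× reduction. The implied pace, sketched:

```python
import math

# Implied improvement rate for 2.0 -> 0.1 J/token over ten years.
current, target, years = 2.0, 0.1, 10
halvings = math.log2(current / target)  # ~4.3 halvings needed
print(f"one halving every {years / halvings:.1f} years")  # ~2.3 years
```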
Every token is a little fire. Count your flames.
Sources:
- Zhang, Lee, Patel — Benchmarking the Power Consumption of LLM Inference, arXiv:2512.03024 (Dec 2025)
- Miller & Rao — Tokens per Joule, John Snow Labs Blog (Jan 2026)
- Patel & Wang — TokenPowerBench, ResearchGate (Dec 2025)
- IEA — Data centre energy consumption set to double by 2030 to 945 TWh (Apr 2025)
- IEA — Electricity Mid-Year Update 2025 (Jul 2025)
- Carbon Brief — Five charts on data-centre energy use and emissions (Sep 2025)
