Google Gemma 4: A Frontier AI Model That Runs on Your Phone for Free
A few months ago, running something this capable locally meant serious hardware and serious tradeoffs on quality. Now Google just released Gemma 4 and the spec sheet reads like a typo. It runs on your laptop. It works offline on your phone. It speaks 140 languages natively. 256K context window. Costs nothing. Performs better than models 20x its size. Apache 2.0 license.
That is not a wishlist. That is shipping right now. While cloud-only models like Claude Mythos push the frontier on raw capability, Gemma 4 is pushing the frontier on accessibility.
The Model Family
Gemma 4 comes in four sizes, all multimodal (text, image, audio, and video on the larger ones):
| Model | Active Params | Context | Modalities | Runs On |
|---|---|---|---|---|
| E2B | 2.3B | 128K | Image, Text, Audio | Phones |
| E4B | 4.5B | 128K | Image, Text, Audio | Phones, Laptops |
| 26B MoE | 3.8B active | 256K | Image, Text, Video | Laptops |
| 31B Dense | 31B | 256K | Image, Text, Video | Workstations |
The 26B Mixture-of-Experts model is the headline act. It has 26 billion total parameters but only activates 3.8 billion per token, so you get the knowledge of a 26B model at roughly the inference speed of a 4B model. (All 26 billion weights still have to sit in memory, which is why it needs a laptop with 24GB RAM.) It ranks #6 globally on Arena AI despite those 3.8B active parameters.
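Some back-of-envelope math shows why the 24GB figure works. This sketch assumes 4-bit quantized weights, a common local-inference setup; actual quantized builds vary in format and overhead:

```python
# Approximate weight memory at 4-bit quantization (an assumption;
# real quantized builds add KV-cache and runtime overhead on top).
def weight_gb(params_billion: float, bits: int = 4) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

print(f"26B total weights:   ~{weight_gb(26):.0f} GB")   # ~13 GB, fits in 24 GB RAM
print(f"3.8B read per token: ~{weight_gb(3.8):.1f} GB")  # why decoding stays fast
```

The gap between those two numbers is the whole MoE bargain: you pay for 26B in memory but only 3.8B in compute per token.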
The 31B dense model is ranked #3 globally. Only two other open models score higher.
The Benchmarks Are Ridiculous
The jump from Gemma 3 to Gemma 4 is not incremental. On AIME 2026 (math competition problems), Gemma 3 scored 20.8%. Gemma 4 scores 89.2%. That is more than a 4x improvement in one generation.
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 3 27B |
|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 67.5% |
| AIME 2026 | 89.2% | 88.3% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | ~40% |
| BigBench Extra Hard | 74% | ~70% | 19% |
| MMMU Pro (Vision) | 76.9% | 73.8% | ~50% |
On coding (LiveCodeBench), the 31B model hits 80% and achieves a 2150 Codeforces ELO rating. On reasoning (BigBench Extra Hard), it went from 19% to 74%. On vision tasks (MMMU Pro), 76.9%.
The MoE model is the more interesting story. It scores within a few points of the 31B dense model on every benchmark while using 8x less compute per token. That efficiency is what makes it practical for local use.
It Actually Runs on Your Phone
The E2B model (2.3B effective parameters) runs in under 1.5GB of memory using Google's LiteRT-LM runtime. That is less memory than most mobile games. It processes text, images, and audio entirely on-device with zero network calls.
This is not a demo or a research preview. Google integrated Gemma 4 into Android AICore, which means it has access to dedicated AI accelerators on Qualcomm and MediaTek chips. Near-zero latency. Works on airplane mode. Your data never leaves the device.
The E4B model (4.5B effective) is the sweet spot for phones with 6GB+ RAM. It handles richer reasoning, better vision processing, and more complex instructions while still running entirely offline.
Google claims 4x faster inference and 60% less battery consumption compared to previous Gemma versions.
140 Languages, 256K Context, Apache 2.0
Three specs that deserve their own section:
140 languages. Not 140 languages where it kind of works. Native multilingual training across the Gemma 4 family. This matters for anyone building products outside the English-speaking world, which is most of the world.
256K context window on the 26B MoE and 31B models. That is roughly 500 pages of text. Long enough to process entire codebases, legal documents, or book-length conversations in a single prompt. The smaller models get 128K, which is still massive for on-device inference.
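The "roughly 500 pages" figure holds up under standard rules of thumb. The conversion factors below (about 0.75 English words per token, about 400 words per printed page) are assumptions, not anything Google publishes:

```python
# Rough sanity check on the 256K-context claim.
# Assumed: ~0.75 English words per token, ~400 words per printed page.
tokens = 256 * 1024
words = tokens * 0.75
pages = words / 400
print(f"~{pages:.0f} pages")  # ~492 pages
```

Token-to-word ratios vary by language and tokenizer, so treat this as an order-of-magnitude check rather than a precise figure.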
Apache 2.0 license. This is the biggest change from Gemma 3, which had a restrictive custom license with usage limits. Apache 2.0 means no monthly active user caps, no acceptable use policy enforcement, full commercial freedom. You can fine-tune it, deploy it, sell products built on it, run it in any country. Same license as Linux, Kubernetes, and TensorFlow.
VentureBeat called the license change potentially more significant than the benchmarks. They are probably right. A model this capable under Apache 2.0 changes the economics of AI deployment for every startup that cannot afford API costs.
How to Run It Locally
The fastest path is Ollama:
```bash
# Install Ollama (if you haven't)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 (26B MoE recommended)
ollama pull gemma4

# Run it
ollama run gemma4
```
That gives you a local model running in your terminal. For integration with development tools:
Claude Code
```bash
# Set environment variable to point at local Ollama
# Note: Claude Code speaks the Anthropic API, so this assumes an
# Anthropic-compatible endpoint; a translation proxy in front of
# Ollama may be needed
export ANTHROPIC_BASE_URL=http://localhost:11434

# Claude Code now uses Gemma 4 locally
```
Python
```python
from ollama import chat

response = chat(
    model='gemma4',
    messages=[{'role': 'user', 'content': 'Explain quantum computing'}]
)
print(response.message.content)
```
Also available through LM Studio (GUI), llama.cpp, transformers.js (browser), MLX (Apple Silicon), and ONNX (edge devices).
How It Compares
|  | Gemma 4 26B | Llama 4 Scout | Qwen 3.5 |
|---|---|---|---|
| Architecture | MoE, 3.8B active | MoE, 17B active | Dense/MoE |
| Context | 256K | 512K | 128K |
| Languages | 140 | ~30 | 201 |
| License | Apache 2.0 | Llama License | Apache 2.0 |
| On-device | Yes (phone) | No | Limited |
| Audio input | Yes | No | Limited |
| Math (AIME) | 88.3% | ~85% | 48.7%* |
| Coding | 77.1% | ~75% | Higher* |
*Qwen 3.5 figures are for comparable model sizes. Qwen leads on coding benchmarks like SWE-bench.
Gemma 4 fills a gap that neither Llama 4 nor Qwen 3.5 covers: a model that is genuinely capable AND genuinely local. Llama 4 Scout is more powerful but needs serious hardware. Qwen 3.5 is competitive on benchmarks but does not have the same on-device optimization story. Gemma 4 is the first model in this generation that a phone can run well. The open source AI movement has been pushing toward this moment for years, and Gemma 4 might be the model that tips the balance.
What This Means
The practical implication is straightforward. If you are building an AI product and paying for API calls, you now have a free alternative that ranks in the global top 10. If you need offline capability, multilingual support, or data privacy (medical, legal, financial), you can run the whole thing locally without sending a single byte to an external server.
For AI agents specifically, Gemma 4 is significant. An agent running locally with 256K context, function calling support, and multimodal understanding does not need to phone home. Combine it with Browser Use for autonomous web browsing and the agent can operate entirely offline except when it needs to reach external websites. It can read documents, analyze images, transcribe audio, and execute multi-step plans entirely on the hardware it is running on.
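The function-calling loop can be sketched without a server: the model returns a tool call (a name plus arguments), and the agent executes it locally. Everything below, including the tool, the dispatcher, and the tool-call shape, is a hypothetical minimal sketch in the style of Ollama-compatible chat APIs, not Gemma 4's actual interface:

```python
# Minimal local tool dispatch for a function-calling agent.
# The tool-call shape ({'name': ..., 'arguments': {...}}) mirrors the
# style used by Ollama-compatible chat APIs; treat it as an assumption.

def get_word_count(text: str) -> int:
    """A trivial local tool the model might call."""
    return len(text.split())

TOOLS = {"get_word_count": get_word_count}

def dispatch(tool_call: dict):
    """Look up and execute a single tool call returned by the model."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# Simulated model output requesting a tool:
call = {"name": "get_word_count",
        "arguments": {"text": "runs entirely on device"}}
print(dispatch(call))  # 4
```

In a real agent loop, the result would be appended to the conversation as a tool message and sent back to the model for the next step, all without leaving the machine.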
The MoE architecture is the key innovation that makes all of this work. By only activating 3.8B of 26B parameters per token, you get large-model quality at small-model cost. That is the trick that turns a laptop into a frontier AI workstation.
Download from Ollama, Hugging Face, or Google AI for Developers.
Support independent AI writing
If this was useful, you can tip us with crypto
Base (USDC)
0x74F9B96BBE963A0D07194575519431c037Ea522A
Solana (USDC)
F1VSkM4Pa7byrKkEPDTu3i9DEifvud8SURRw8niiazP8