Google Gemma 4: Byte for Byte, the Most Capable Open AI Models Yet

Google DeepMind has released Gemma 4 — four open-weight models spanning from a phone-sized 2B to a workstation-class 31B, licensed under Apache 2.0 for the first time. The flagship 31B ranks third among all open models in the world. The smallest runs fully offline on a Raspberry Pi. And every single model supports multimodal input, native function calling, and a context window of up to 256K tokens.

- 400M+ total Gemma model downloads since launch
- 100K+ community model variants built on Gemma
- 140+ languages supported natively by Gemma 4

Every few months, the frontier of open-source AI moves. A new model family arrives that makes the previous benchmark numbers look ordinary, that demonstrates capabilities previously assumed to require proprietary systems, and that hands developers the kind of intelligence that once required a cloud API contract. On April 2, 2026, that moment belonged to Google DeepMind.

Google DeepMind released Gemma 4 — its most capable family of open-weight models to date, built on the same research foundations as the commercial Gemini 3 family, and available to download, run, and fine-tune under a fully permissive Apache 2.0 licence. Four models. Four deployment scenarios. Benchmarks that put proprietary models from 12 months ago to shame. A licence that removes every legal barrier enterprise teams had been waiting for Google to drop.

The release lands in a competitive open-model landscape that includes Meta’s Llama 4, Alibaba’s Qwen 3.5, and Mistral’s latest offerings — but Gemma 4 carves out a position no competitor currently occupies: frontier-level intelligence at every size tier, from a device you carry in your pocket to a server you run in a data centre, with day-one support across every major inference engine and training framework in existence.

The Four Models: From Phone to Workstation

Gemma 4 ships in four sizes, each designed for a distinct deployment scenario. All four are available in both pre-trained base and instruction-tuned variants, and all are released under the Apache 2.0 licence — a first for the Gemma family, replacing a custom Gemma licence that had restricted certain commercial uses and created legal friction for enterprise adoption.

| Model | Params | Active Params | Context | Best For | Audio |
| --- | --- | --- | --- | --- | --- |
| E2B | 2.3B eff. | 2.3B | 128K | Phones, Raspberry Pi | ✅ Native |
| E4B | 4.5B eff. | 4.5B | 128K | Edge devices, laptops | ✅ Native |
| 26B MoE | 26B total | 3.8B | 256K | Workstations, APIs | — |
| 31B Dense | 31B total | 31B | 256K | Servers, fine-tuning | — |

E2B and E4B: Intelligence at the Edge

The “E” in E2B and E4B stands for “effective” — a reference to Per-Layer Embeddings (PLE), a technique introduced in the previous Gemma 3n family that allows a model to operate with a much smaller active parameter footprint than its total parameter count might suggest. PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream, giving each decoder layer its own channel to receive token-specific information rather than requiring the embedding to front-load everything the model might ever need.
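The mechanism can be sketched in a few lines. This is a toy illustration with made-up dimensions, not Gemma 4's real configuration: alongside the shared token embedding, each decoder layer owns a small lookup table whose vector is projected up and added to that layer's input, so layers receive token-specific conditioning without one giant front-loaded embedding.

```python
import random

# Illustrative PLE sketch; VOCAB, D_MODEL, D_PLE, N_LAYERS are toy numbers.
VOCAB, D_MODEL, D_PLE, N_LAYERS = 1000, 64, 8, 4
random.seed(0)

def table(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

main_embed = table(VOCAB, D_MODEL)                          # shared token embedding
ple_embed = [table(VOCAB, D_PLE) for _ in range(N_LAYERS)]  # one small table per layer
ple_proj = [table(D_PLE, D_MODEL) for _ in range(N_LAYERS)] # per-layer up-projection

def up_project(v, proj):
    # (D_PLE) vector times (D_PLE x D_MODEL) matrix -> (D_MODEL) vector
    return [sum(v[i] * proj[i][j] for i in range(len(v))) for j in range(len(proj[0]))]

def condition(token_id, hidden, layer):
    """Add the layer-specific token conditioning to the residual stream."""
    cond = up_project(ple_embed[layer][token_id], ple_proj[layer])
    return [h + c for h, c in zip(hidden, cond)]

token = 42
hidden = main_embed[token][:]        # start from the shared embedding
for layer in range(N_LAYERS):
    hidden = condition(token, hidden, layer)

# Parameter accounting: the per-layer tables are tiny next to the main embedding,
# which is why the active footprint stays small.
ple_params = N_LAYERS * VOCAB * D_PLE
main_params = VOCAB * D_MODEL
```

The low-dimensional tables can also be streamed from slower storage per layer, which is the kind of trick that lets the "effective" parameter count diverge from the total.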

The practical result: the E2B runs entirely offline on a Raspberry Pi, an NVIDIA Jetson Orin Nano, or a mid-range Android smartphone. The E2B is up to 3x faster than the E4B, while the E4B offers meaningfully better reasoning at a still-minimal compute footprint. Google developed both models in close collaboration with its Pixel hardware team and silicon partners Qualcomm Technologies and MediaTek, meaning they are specifically optimized for the AI accelerators already present in hundreds of millions of Android devices.

Both edge models support native audio input — including automatic speech recognition and speech-to-translated-text — entirely on-device. The audio encoder has been compressed to 305 million parameters (down from 681 million in Gemma 3n) while reducing frame duration from 160ms to 40ms for more responsive, real-time transcription. They also support video input, processed as frame sequences, and multi-image prompting.

On Android specifically, Gemma 4 serves as the foundation for the next generation of Gemini Nano — Google’s on-device model embedded across Android features. Code written today against the Gemma 4 API in the AICore Developer Preview will automatically work on Gemini Nano 4-enabled devices arriving later in 2026. Google reports the new model is up to 4x faster than previous versions and uses up to 60% less battery during inference.

26B MoE: The Efficiency Champion

The 26B Mixture of Experts (MoE) model is arguably Gemma 4’s most interesting architectural achievement. Its total parameter count is 26 billion, but its “A4B” designation (4B-class active parameters) reveals what makes it remarkable: only 3.8 billion parameters are active during any single forward pass. The MoE routing mechanism selects which expert subnetworks to activate for each token, making the model nearly as fast as a native 4B model during inference while drawing, when needed, on the knowledge stored in the remaining 22 billion or so parameters.
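A minimal sketch of how top-k expert routing keeps the active footprint small. The expert count, top-k value, and scalar "experts" below are illustrative assumptions, not Gemma 4's real configuration: per token, a router scores every expert, only the best-scoring k actually run, and their outputs are combined with renormalised weights.

```python
import math

# Toy top-k MoE routing; N_EXPERTS and TOP_K are illustrative, not Gemma 4's.
N_EXPERTS, TOP_K = 8, 2

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits):
    """Return (expert_index, weight) pairs for the top-k experts only."""
    probs = softmax(router_logits)
    top = sorted(range(N_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)            # renormalise over chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    return sum(w * experts[i](x) for i, w in route(router_logits))

# Scalar functions stand in for expert feed-forward networks.
experts = [lambda x, k=k: (k + 1) * x for k in range(N_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
out = moe_layer(1.0, experts, logits)
```

Because only TOP_K of the N_EXPERTS subnetworks execute per token, compute scales with active parameters while total parameters set the ceiling on stored knowledge, which is exactly the trade the 26B/3.8B split exploits.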

The 26B MoE currently ranks sixth among all open models on the Arena AI text leaderboard — with only 3.8 billion active parameters per inference. That is, as the Hugging Face team described it in their launch post, mind-blowing: a model that competes with the world’s top open models while running at the speed and cost of a much smaller one.

31B Dense: The Flagship

The 31B Dense model is Gemma 4’s flagship — a full-density transformer designed for workstation GPUs, server deployments, and fine-tuning. It currently ranks third among all open models in the world on the Arena AI text leaderboard, with an estimated LMArena ELO score of 1,452. A single 31B Dense model can run on a laptop with sufficient VRAM, though users wanting to utilize its full 256K context window will need substantial system RAM — early testing on a machine with 128GB found the model could claim over 80GB for itself at maximum context.

The Benchmarks: Numbers That Would Have Been Frontier Last Year

The benchmark performance of Gemma 4’s flagship models represents a generational leap for the open-weight category — posting numbers on reasoning, coding, and mathematics that would have been considered proprietary-model territory twelve months ago:

| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | What It Tests | Gemma 4 Rank |
| --- | --- | --- | --- | --- |
| MMLU Pro | 85.2% | — | Knowledge & reasoning | Top tier |
| AIME 2026 | 89.2% | — | Advanced mathematics | Top tier |
| LiveCodeBench v6 | 80.0% | — | Real-world coding | Frontier |
| Codeforces ELO | 2,150 | — | Competitive coding | Frontier |
| MMMU Pro (vision) | 76.9% | — | Visual reasoning | SOTA |
| MATH-Vision | 85.6% | — | Visual math problems | SOTA |
| Arena AI (text) | #3 open | #6 open | Human-rated chat | #3 & #6 |
| LMArena ELO | 1,452 | 1,441 | Human preference | Top open |

The AIME 2026 score of 89.2% on the 31B Dense model deserves particular emphasis. AIME (American Invitational Mathematics Examination) is one of the most rigorous publicly available mathematical reasoning benchmarks — designed to challenge the top 5% of high school mathematics students in the United States. An 89.2% score from a 31-billion-parameter open model is an extraordinary result, reflecting Google DeepMind’s focus on what it describes as “intelligence per parameter” as a core design objective for the Gemma 4 family.

“Purpose-built for advanced reasoning and agentic workflows, Gemma 4 delivers an unprecedented level of intelligence-per-parameter. Gemma 4 outcompetes models 20x its size.”

— Google DeepMind, Gemma 4 Launch Blog, April 2, 2026

Multimodal by Default: Text, Images, Video, and Audio

All four Gemma 4 models are natively multimodal — meaning they process text and images together in a single prompt, without separate modules or pipeline overhead. The vision architecture supports variable aspect-ratio image input with configurable visual token budgets ranging from 70 to 1,120 tokens per image. This flexibility lets developers trade off detail against compute: lower budgets work for classification and simple captioning; higher budgets are needed for OCR, document parsing, and fine-grained visual analysis.
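In practice the budget choice reduces to a small lookup. The 70-1,120 range comes from the article, but the task-to-budget mapping and the helper function below are illustrative assumptions, not a real Gemma 4 API:

```python
# Hypothetical helper for picking a per-image visual token budget.
# Only the 70-1,120 range is from the article; the mapping is a guess.
BUDGETS = {
    "classification": 70,       # coarse labels need little visual detail
    "captioning": 256,
    "ocr": 1120,                # fine-grained text needs the full budget
    "document_parsing": 1120,
}

def visual_token_budget(task, default=256):
    """Clamp the chosen budget to the supported 70-1,120 range."""
    budget = BUDGETS.get(task, default)
    return max(70, min(budget, 1120))
```

The point of the knob is cost: at 70 tokens an image adds almost nothing to the prompt, while a 1,120-token OCR pass costs sixteen times as much context and compute.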

The vision capabilities in Gemma 4 represent a meaningful improvement over the previous generation. The earlier Gemma 3n architecture struggled with OCR and document understanding — a known limitation that developers frequently cited as a gap between Gemma’s text performance and its visual performance. Gemma 4’s new vision encoder closes much of that gap, with MMMU Pro reaching 76.9% and MATH-Vision hitting 85.6%.

The E2B and E4B edge models go further, adding native audio processing. This includes automatic speech recognition and speech-to-translated-text — fully on-device, with no cloud round-trip. The capability enables real-time multilingual speech understanding directly on a smartphone or edge device. Gemma 4 natively supports 140-plus languages across the entire model family, a foundation for localized, multilingual applications that was a frequent developer request following Gemma 3.

Native Function Calling: Built for the Agentic Era

One of the most significant technical improvements in Gemma 4 is native function calling support across all four models. Unlike previous approaches that relied on instruction-following prompts to coax models into structured tool use — a fragile technique that required careful prompt engineering and often produced unreliable results — Gemma 4’s function calling was trained into the model from the ground up, drawing on research from Google’s dedicated FunctionGemma release in late 2025.

All four Gemma 4 models output native structured JSON for function calls, support multi-turn agentic flows with multiple tools active simultaneously, and work with every major agent framework — including OpenClaw, Hermes, and Pi via llama.cpp’s OpenAI-compatible server. For enterprise teams that have been building tool-using agents on top of larger, more expensive proprietary models, Gemma 4’s native function calling removes a significant barrier to moving those workloads to locally-deployed open models.
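A round trip in the OpenAI-compatible chat format looks roughly like the sketch below. The model tag "gemma4" and the `get_weather` tool are illustrative assumptions, not identifiers from the release; the shape of the `tools` schema and the `tool_calls` response follows the OpenAI-compatible convention the article says llama.cpp's server exposes.

```python
import json

# Tool schema sent with the request (hypothetical get_weather tool).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "gemma4",  # assumed tag, not confirmed by the article
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
}

# A response message in the structured shape the article describes:
# the model emits JSON for the call rather than free text to be parsed.
response_message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": json.dumps({"city": "Oslo"}),
        },
    }],
}

call = response_message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
```

The caller then executes the named function with `args`, appends the result as a `tool` role message, and lets the model continue, which is the multi-turn loop the agent frameworks automate.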

Google DeepMind researchers Clement Farabet and Olivier Lacombe noted that Gemma 4’s agentic capabilities were a central design priority — not a feature added after the fact. The models have been trained specifically to handle the multi-step planning, tool selection, and structured output generation that agentic workflows require. Combined with the 256K token context windows available on the larger models, this means a Gemma 4 deployment can hold an entire codebase or a large document set in context while executing multi-step agent plans.

The Licence Change That May Matter More Than the Benchmarks

VentureBeat’s analysis of the Gemma 4 launch made a point worth repeating: the switch to Apache 2.0 licensing may ultimately matter more to the open-source AI ecosystem than any of the benchmark numbers. The previous Gemma licence was a custom Google agreement that, while reasonably permissive, included clauses around acceptable use that required legal interpretation, restricted certain redistribution scenarios, and created friction for enterprise legal teams unfamiliar with non-standard software licences.

Apache 2.0 is the de facto standard for permissive open-source software licensing. It places no restrictions on commercial use, redistribution, or the creation of derivative works. Fine-tuned versions of Gemma 4 can be deployed commercially without licensing ambiguity. Plugins, wrappers, and integrations can be released under any licence. Enterprise procurement teams can evaluate it against a well-understood legal framework rather than a proprietary Google agreement.

The timing is notable as a competitive signal. As some Chinese AI labs — most notably Alibaba’s Qwen family — have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction. Gemma 4 is the most capable model Google has released under open weights, and it ships under the most permissive licence any Gemma model has ever used. Google is explicitly betting that openness is a sustainable competitive strategy, not a temporary concession.

“Building the future of AI requires a collaborative approach. We believe in empowering the developer ecosystem without restrictive barriers. That’s why Gemma 4 is released under a commercially permissive Apache 2.0 license.”

— Google DeepMind, Gemma 4 Launch, April 2, 2026

Day-One Ecosystem: Run It Everywhere

One of the consistent failures of previous open model launches — including several Google ones — has been the lag between model release and tooling support. Developers eager to run a new model locally would find their preferred inference engine unsupported, or fine-tuning libraries unavailable, or deployment frameworks incompatible. Gemma 4 has eliminated this gap almost entirely.

On launch day, Gemma 4 had full support across: Hugging Face Transformers, TRL, Transformers.js, and Candle; llama.cpp (with OpenAI-compatible server); Ollama; MLX (for Apple Silicon); LM Studio; vLLM; NVIDIA NIM and NeMo; SGLang; Unsloth; Keras; Google Colab; Vertex AI; and Docker. The model weights are available on Hugging Face, Kaggle, and Ollama.

Performance figures from early testing reflect the breadth of that support. Using LM Studio on a machine with 128GB RAM, the 31B Dense model achieved over 10 tokens per second — faster than a human can read. The 26B MoE and E4B models exceed 40 tokens per second on the same hardware; the E2B tops 60 tokens per second. For context: real-time conversational AI typically needs around 10–20 tokens per second to feel natural to users. Gemma 4’s smaller models are operating well above that threshold even on consumer-grade hardware.

NVIDIA’s involvement warrants separate mention. The company has specifically optimized Gemma 4 for RTX PCs and DGX Spark — NVIDIA’s latest personal AI supercomputer, which targets researchers and developers who want datacenter-level inference in a desktop form factor. NVIDIA published dedicated guidance for running Gemma 4 on local RTX hardware on launch day, signalling that the collaboration between Google DeepMind and NVIDIA on Gemma 4 goes beyond casual support.

The Gemmaverse: 400 Million Downloads and 100,000 Variants

The scale of developer engagement with the Gemma model family since its original 2024 launch is itself a significant story. Developers have downloaded Gemma models over 400 million times across all generations. More than 100,000 community variants — fine-tuned versions, domain-adapted models, quantized versions, and experimental derivatives — have been created and shared in what Google calls the “Gemmaverse.”

Hugging Face’s team, which collaborated with Google on the Gemma 4 release, noted in their launch post that pre-release testing left them struggling to find compelling fine-tuning examples “because they are so good out of the box.” That is a meaningful signal about the instruction-tuned quality of the Gemma 4 release: models that require minimal post-training to perform well on specialized tasks dramatically lower the barrier for developers who want to build domain-specific AI products.

Gemma 4 in Context: The Open Model Race in 2026

The open-model landscape of early 2026 is the most competitive it has ever been. Meta’s Llama 4 Scout offers a 10-million-token context window — a technical feat that no competitor has matched. Alibaba’s Qwen 3.5 fields a 397-billion-parameter flagship model at the top of the performance range. DeepSeek V3 continues to perform strongly across reasoning and coding benchmarks.

Gemma 4’s position in this field is distinct rather than dominant across every dimension. It is not the largest model available. It does not have the longest context window. But it occupies a position that no single competitor currently owns: it delivers the best intelligence-per-parameter at the small-to-medium size tier, runs on more device categories than any comparable model, ships under the most permissive licence of any major open model, and has the deepest day-one ecosystem support of any open release in recent memory.

Holger Mueller, an analyst with Constellation Research, summarized the strategic picture: “Google is building its lead in AI, not only by pushing Gemini, but also open models with the Gemma 4 family. These are important for building an ecosystem of AI developers, and will help the company to tap into functional and vertical use cases on different device form factors. Google set a high bar with its previous Gemma 3 release, and so there’s a lot of expectation with this release.”

What Gemma 4 Means for Developers, Businesses, and the Open AI Era

For developers, Gemma 4 means that frontier-level AI capabilities are now available to run locally, fine-tune freely, and deploy commercially — with no API costs, no rate limits, no data privacy concerns from cloud round-trips, and no custom licence to have lawyers review. The combination of Apache 2.0, day-one tooling support, and genuinely competitive benchmark scores makes Gemma 4 the strongest case Google has yet made for its open model programme.

For businesses, Gemma 4 changes the economics of AI deployment. The 26B MoE model, running at 3.8 billion active parameters, delivers top-ten-open-model performance at a fraction of the inference cost of larger dense models. For organizations with consistent, high-volume inference workloads — content moderation, document processing, customer service automation — the ability to run a top-ten model on existing hardware rather than paying per-token API costs could represent meaningful savings at scale.
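The arithmetic behind that claim is simple to sketch. Every number below is an illustrative placeholder, not a quoted price from the article or any provider; the point is only the shape of the break-even calculation.

```python
# Back-of-envelope break-even sketch for local vs API inference.
# All figures are placeholders for illustration only.
API_COST_PER_M_TOKENS = 0.50        # $ per million tokens (placeholder)
TOKENS_PER_MONTH = 2_000_000_000    # high-volume workload (placeholder)
LOCAL_MONTHLY_COST = 400.0          # amortised hardware + power (placeholder)

api_monthly = API_COST_PER_M_TOKENS * TOKENS_PER_MONTH / 1_000_000
savings = api_monthly - LOCAL_MONTHLY_COST   # positive means local wins
```

With these placeholder inputs the API bill is $1,000 a month against $400 locally; the real comparison depends entirely on workload volume, hardware amortisation, and the per-token price actually on offer.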

For the open AI ecosystem broadly, Gemma 4 signals that Google DeepMind has decided openness is not a marketing strategy but a genuine competitive commitment. With 400 million downloads behind it and 100,000 community variants in the Gemmaverse, the Gemma family has already demonstrated that an open model can generate an ecosystem. Gemma 4, the best and most permissively licensed version yet, is Google’s bet that this ecosystem will only grow — and that the developers who build on it today will be a durable source of competitive advantage in the years ahead.
