AI DEVELOPMENT

Gemma 4 Models: Which One Should You Actually Use?

Google released Gemma 4 on April 2, 2026, in four model sizes - E2B, E4B, 26B MoE, and 31B Dense. After running all four locally, the 26B MoE variant is the best pick for most developers. It activates only 4B parameters per token, so it runs at near-4B speeds while delivering quality close to a 13B model.

April 6, 2026
12 min read
AI, Open Source, Local LLM
Tags: Gemma 4, Open Source, Ollama, MoE, Local AI
TL;DR
  • Gemma 4 ships in 4 sizes: E2B (edge), E4B (edge), 26B MoE, and 31B Dense - all under Apache 2.0
  • The 26B MoE is the sweet spot for most developers - only 4B params active per token means fast inference with strong quality
  • E4B runs well on laptops with 4-6GB RAM, making it the easiest starting point for local experiments
  • All models support native tool use, 256K context, vision input, and 140+ languages

What Changed from Gemma 3 to Gemma 4

Gemma 3 gave us 4B, 12B, and 27B parameter models with 128K context and basic multimodality. Gemma 4 restructures the entire lineup. Google split the family into two tiers: compact edge models (E2B, E4B) designed for phones and IoT, and larger models (26B MoE, 31B Dense) for GPUs and workstations. The naming scheme changed too - "E" means effective parameters, "A" means active parameters.

The 31B Dense variant currently ranks 3rd among open models on the Arena AI Text leaderboard, according to Google's official announcement. That puts it ahead of most open-weight competitors at similar sizes. Google claims up to 4x faster inference and 60% lower battery consumption compared to Gemma 3, and the MoE variant does feel noticeably snappier than Gemma 3 when running locally.

The context window doubled from 128K to 256K tokens. Native audio input is new - Gemma 3 could only handle text and images. Vision capabilities carried over and improved. The license stays Apache 2.0, which means you can use these models commercially without restrictions. That's a real advantage over some competitors with more restrictive community licenses.

| Feature | Gemma 3 | Gemma 4 |
|---|---|---|
| Model sizes | 4B, 12B, 27B | E2B, E4B, 26B MoE, 31B Dense |
| Context window | 128K tokens | 256K tokens |
| Modalities | Text + Vision | Text + Vision + Audio |
| Languages | 140+ | 140+ |
| Native tool use | Limited | Full structured tool use |
| License | Apache 2.0 | Apache 2.0 |

Gemma 4 Model Sizes Explained

The four Gemma 4 variants target different hardware and use cases. Understanding the naming convention saves a lot of confusion. "E" stands for effective parameters (what actually runs during inference on edge devices), while the 26B model uses "A" for active parameters in its MoE architecture. Here's what each one does.

E2B - The Tiny One

2B effective parameters. Built for phones, IoT devices, and environments where every megabyte counts. You won't get impressive output quality here, but it runs on hardware that can't handle anything else. Think embedded assistants, on-device text classification, or quick summarization on a Raspberry Pi.

RAM: ~2GB | Best for: Mobile, IoT, embedded

E4B - The Laptop Pick

4B effective parameters. This is the model I recommend for getting started. It runs on any MacBook or Windows laptop with 8GB RAM and produces surprisingly good output for its size. I use it for quick code explanations and draft generation when I don't want to wait for a larger model.

RAM: 4-6GB | Best for: Laptops, quick prototyping

26B-A4B MoE - The Sweet Spot

26B total parameters but only 4B active per token. This is the model that surprised me most. It uses 8 experts plus 1 shared expert, routing each token through a small subset of the network. The result is inference speed comparable to a 4B model with output quality approaching a 13B model. On my M2 MacBook Pro with 16GB, it felt almost as fast as E4B but noticeably smarter.

RAM: 8-12GB | Best for: Local coding assistant, agents

31B Dense - The Heavyweight

Every inference uses all 31B parameters. This produces the best output quality in the Gemma 4 family, but you need serious hardware. A quantized version fits on a 32GB MacBook Pro, though you'll feel the latency compared to the MoE variant. If you have a dedicated GPU with 24GB+ VRAM, this is where you get the strongest results.

RAM: 20GB+ | Best for: Server deployment, max quality

| Model | Total Params | Active Params | RAM (Q4) | Context | Architecture |
|---|---|---|---|---|---|
| E2B | ~2B | 2B | ~2GB | 256K | Dense (edge) |
| E4B | ~4B | 4B | 4-6GB | 256K | Dense (edge) |
| 26B-A4B | 26B | ~4B | 8-12GB | 256K | MoE (8+1 experts) |
| 31B | 31B | 31B | 20GB+ | 256K | Dense |

According to HuggingFace's Gemma 4 analysis, the MoE architecture in the 26B model differs from the designs used by DeepSeek and Qwen. Instead of replacing MLP blocks with sparse experts, Gemma adds MoE blocks as separate layers alongside the standard MLP blocks and sums their outputs. This design choice keeps the base model intact while adding specialized capacity through experts.
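To make that architecture concrete, here's a toy sketch of one such layer: a dense MLP that always runs, a shared expert that sees every token, and a pool of routed experts of which only the top-k fire, with all outputs summed. The dimensions, activation, and routing details are illustrative only - Gemma 4's actual implementation differs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen for illustration only
D_MODEL, D_FF = 16, 32
N_EXPERTS, TOP_K = 8, 2

def mlp(x, w_in, w_out):
    """A plain two-layer feed-forward block (ReLU stands in for the real activation)."""
    return np.maximum(x @ w_in, 0.0) @ w_out

# One dense MLP (always runs), one shared expert, and 8 routed experts
dense_w = (rng.normal(size=(D_MODEL, D_FF)), rng.normal(size=(D_FF, D_MODEL)))
shared_w = (rng.normal(size=(D_MODEL, D_FF)), rng.normal(size=(D_FF, D_MODEL)))
expert_w = [(rng.normal(size=(D_MODEL, D_FF)), rng.normal(size=(D_FF, D_MODEL)))
            for _ in range(N_EXPERTS)]
router_w = rng.normal(size=(D_MODEL, N_EXPERTS))

def moe_layer(x):
    """Dense MLP output + shared expert + top-k routed experts, summed."""
    dense_out = mlp(x, *dense_w)        # standard MLP block, kept intact
    shared_out = mlp(x, *shared_w)      # shared expert sees every token
    logits = x @ router_w               # router scores one logit per expert
    top = np.argsort(logits)[-TOP_K:]   # pick the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    routed = sum(g * mlp(x, *expert_w[i]) for g, i in zip(gates, top))
    return dense_out + shared_out + routed  # outputs are summed, per the design above

token = rng.normal(size=D_MODEL)
out = moe_layer(token)
print(out.shape)  # (16,)
```

The key point the sketch shows: only TOP_K of the N_EXPERTS feed-forward blocks run per token, which is why a 26B-total model can have 4B-model inference cost.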

How to Run Gemma 4 Locally with Ollama

The fastest way to try Gemma 4 is through Ollama. One command downloads and runs the model. I tested all four variants on a 16GB M2 MacBook Pro, and the setup takes under 5 minutes for the smaller models.

First, install Ollama if you haven't already. Then pull whichever variant fits your hardware.

Terminal (bash):
# Install Ollama (macOS)
brew install ollama

# Start the Ollama server
ollama serve

# Run the default E4B model
ollama run gemma4

# Or pick a specific variant
ollama run gemma4:e2b    # Smallest, ~2GB
ollama run gemma4:e4b    # Good laptop model, ~4GB
ollama run gemma4:26b    # MoE sweet spot, ~8GB
ollama run gemma4:31b    # Best quality, ~20GB

Once the model is running, you can interact with it through the terminal or call it from your code using Ollama's REST API. Here's a Python example that sends a coding question to the 26B MoE model.

gemma4_test.py (Python):
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:26b",
        "prompt": "Write a Python function that checks if a number is prime. Keep it simple.",
        "stream": False
    }
)

result = response.json()
print(result["response"])

The E4B model downloaded in about 2 minutes on my connection and started generating responses immediately. The 26B MoE took longer to download (it's a bigger file even though inference is fast), but once loaded, response times felt comparable to the E4B. That's the MoE advantage - the download is bigger, but the runtime cost is small.
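For interactive use you'll usually want streaming instead of waiting for the full reply. Ollama's generate endpoint streams newline-delimited JSON objects, each carrying a "response" text fragment. Here's a small sketch; the endpoint URL and field names follow Ollama's API, but treat the details as something to verify against the docs for your Ollama version.

```python
import json
import requests

def extract_chunk(line: bytes) -> str:
    """Pull the text fragment out of one NDJSON line from Ollama's stream."""
    payload = json.loads(line)
    return payload.get("response", "")

def stream_generate(prompt, model="gemma4:26b",
                    url="http://localhost:11434/api/generate"):
    """Yield text chunks as the model produces them, instead of blocking on the full reply."""
    with requests.post(url, json={"model": model, "prompt": prompt, "stream": True},
                       stream=True) as resp:
        for line in resp.iter_lines():
            if line:  # skip keep-alive blank lines
                yield extract_chunk(line)

# Usage (with the Ollama server running):
#   for chunk in stream_generate("Explain MoE routing in two sentences."):
#       print(chunk, end="", flush=True)
```

Streaming makes the MoE variant's speed advantage visible immediately - tokens start appearing before the full response is done.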

Tip: Use the 26B MoE for Local Development

If you have 16GB of RAM, skip the E4B and go straight to the 26B MoE. The quality jump is significant, and it won't feel slower in practice since only 4B parameters activate per token. You get noticeably stronger reasoning for nearly the same latency, at the cost of a larger download and a somewhat higher memory footprint.

Gemma 4 vs Llama 4 vs Mistral Small 4

Three major open model families dropped updates in early 2026. Each targets different strengths. I've been running all three locally, and here's how they compare in practice.

| Feature | Gemma 4 (26B MoE) | Llama 4 Scout | Mistral Small 4 |
|---|---|---|---|
| Active params | ~4B | 17B | 24B |
| Context window | 256K | 10M | 256K |
| License | Apache 2.0 | Llama Community | Apache 2.0 |
| Architecture | MoE | MoE | Dense |
| Edge models | Yes (E2B, E4B) | No | No |
| Vision | Yes | Yes | Yes |
| Native tool use | Yes | Yes | Yes |

Gemma 4 wins on parameter efficiency. No other family offers models as small as E2B that still produce usable output. If you're building for mobile or embedded devices, Gemma is your only serious option among the big three. The MoE architecture also means the 26B model runs like a 4B model in terms of speed and memory.

Llama 4 wins on context length. The 10 million token context window on Llama 4 Scout is in a different league. If you're processing entire codebases, long documents, or need to maintain very long conversations, Llama is the clear choice. But you need server hardware to run it - there's no laptop-friendly Llama 4 variant.

Mistral Small 4 wins on coding quality per parameter. In my testing, Mistral produces the cleanest code output when you compare similar-sized models. The 24B dense architecture means every parameter contributes to every token, and it shows in structured output tasks. According to BenchLM's 2026 rankings, Mistral consistently scores well on code generation benchmarks relative to its size.

Which Gemma 4 Model Should You Pick?

This is the question I get most. The answer depends on two things: your available hardware and what you're building. Here's my decision framework after running all four variants for different tasks.

Building a mobile or embedded app?

Pick E2B if RAM is extremely tight (under 4GB), or E4B if you can spare 4-6GB. Google specifically optimized these for Android via the AICore Developer Preview and AI Edge Gallery. The E4B handles basic coding tasks, text classification, and short summarization well enough for on-device features.

Want a local coding assistant on your laptop?

The 26B MoE is your best bet. On a 16GB MacBook, it runs at interactive speeds while producing output quality that actually helps with real coding problems. I've been using it alongside Claude Code for tasks that don't need cloud connectivity, and it handles code review, refactoring suggestions, and test generation reasonably well.

Deploying to a server for production use?

Go with 31B Dense if you have the hardware (24GB+ GPU VRAM). It consistently produces the best output in the Gemma 4 family. For high-throughput serving where latency matters more, the 26B MoE is more cost-effective since it processes tokens faster with fewer active parameters.

Just experimenting and want to try it quickly?

Start with E4B. It downloads fast, runs on almost anything, and gives you a feel for the Gemma 4 experience. You can always upgrade to the 26B MoE once you're ready to commit more resources. A single ollama run gemma4 command gets you going.
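The decision framework above boils down to a few RAM thresholds, which you can condense into a small helper. The function name and cutoffs are my own, taken from the RAM figures quoted earlier in this post, not from any official sizing guide.

```python
def pick_gemma4_model(ram_gb: float, needs_max_quality: bool = False) -> str:
    """Map available RAM (and a quality preference) to a Gemma 4 Ollama tag,
    using the approximate RAM figures from the sizing table above."""
    if needs_max_quality and ram_gb >= 20:
        return "gemma4:31b"   # dense heavyweight, best output quality
    if ram_gb >= 12:
        return "gemma4:26b"   # MoE sweet spot for laptops with headroom
    if ram_gb >= 6:
        return "gemma4:e4b"   # the laptop pick
    return "gemma4:e2b"       # tiny edge model for constrained devices

print(pick_gemma4_model(16))        # gemma4:26b
print(pick_gemma4_model(32, True))  # gemma4:31b
```

If your use case sits between two tiers, try the smaller model first - upgrading later is a one-line change.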

Gemma 4 Tool Use and Agentic Workflows

The feature that excites me most about Gemma 4 is native structured tool use. All four variants can accept a function schema and return valid JSON matching that schema. No prompt engineering hacks, no output parsing tricks. You define a tool, and the model calls it correctly.

This matters because it makes local AI agents practical. I've been building MCP servers for various integrations, and Gemma 4's tool use works well enough to drive simple agentic workflows without sending data to a cloud API. Here's what a basic tool definition looks like when calling Gemma 4 through Ollama.

gemma4_tool_use.py (Python):
import requests
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. 'Pune' or 'San Francisco'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [
            {"role": "user", "content": "What's the weather in Pune?"}
        ],
        "tools": tools,
        "stream": False
    }
)

result = response.json()
print(json.dumps(result["message"]["tool_calls"], indent=2))

The model returns a properly structured tool call with the location extracted from the user's message. This is the same pattern that powers MCP server integrations - you define tools, the model decides when to call them, and your code handles execution. Running this locally with Gemma 4 means sensitive data never leaves your machine.
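The execution side is a plain dispatch table: map tool names to Python callables, run the call the model requested, and send the result back as a tool-role message. This sketch assumes the tool_calls shape shown above; the get_weather stub and TOOL_REGISTRY name are mine, and some runtimes return "arguments" as a JSON string rather than a dict, so the code handles both.

```python
import json

def get_weather(location: str, unit: str = "celsius") -> dict:
    # Stub implementation; a real version would call a weather API
    return {"location": location, "temp": 31, "unit": unit}

# Registry mapping tool names (from the schema above) to Python callables
TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(call: dict) -> dict:
    """Run one entry from message.tool_calls and wrap the result
    as a tool-role message to append to the conversation."""
    fn = call["function"]
    args = fn["arguments"]
    if isinstance(args, str):   # some runtimes serialize arguments as a JSON string
        args = json.loads(args)
    result = TOOL_REGISTRY[fn["name"]](**args)
    return {"role": "tool", "content": json.dumps(result)}

# Example: the kind of call the weather question above produces
call = {"function": {"name": "get_weather", "arguments": {"location": "Pune"}}}
print(execute_tool_call(call))
```

Append the returned message to the conversation and call the chat endpoint again, and the model folds the tool result into its final answer.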

I've tested tool use across all four variants. The 26B MoE and 31B Dense handle multi-tool scenarios reliably - they pick the right tool and format parameters correctly. The E4B works for single-tool cases but sometimes struggles with complex schemas that have optional nested fields. The E2B is too small for reliable tool use in my experience.

Get Started with Gemma 4

Install Ollama, run ollama run gemma4:26b, and you'll have a capable local AI model running in minutes. Check out my MCP projects to see how tool use and agentic patterns work in practice.