IBM just dropped the full technical details on their Granite 4.1 family of LLMs, and honestly, it’s refreshing to see a team that actually puts data quality ahead of just throwing more compute at the problem. The models—3B, 8B, and 30B dense decoder-only transformers—are trained from scratch on about 15 trillion tokens, but the real story is how they curated and sequenced that data.
The Architecture: Nothing Fancy, Just Solid Choices
Granite 4.1 uses a straightforward dense transformer with Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings. No Mixture-of-Experts, no exotic routing mechanisms. The 8B model has 40 layers, 32 attention heads, 8 KV heads, and an embedding size of 4096. The 30B doubles the layers to 64 and bumps the MLP hidden size to 32768. It’s a clean, well-understood design that lets the data do the heavy lifting.
Pre-Training: Five Phases of Progressive Data Refinement
This is where Granite 4.1 separates itself from the pack. Instead of one monolithic training run, IBM split pre-training into five distinct phases, each with a different data mixture and learning rate schedule. The goal is to start broad, then progressively shift toward higher-quality, more specialized data.
Phase 1 (10T tokens) is your standard web-scale training: ~59% CommonCrawl, 20% code, 7% math, 10.5% technical content, 2% multilingual, and 1.5% domain-specific. It’s a solid foundation, but nothing groundbreaking.
Phase 2 (2T tokens) is where things get interesting. They crank up math to 35% (a 5x increase) and code to 30% (1.5x increase). CommonCrawl drops to just 12%, replaced by high-quality subsets. They also start introducing 9% synthetic data. This is a deliberate pivot toward reasoning-heavy capabilities.
Phase 3 (2T tokens) marks the transition to “mid-training” with an exponential decay learning rate. The data mix becomes more balanced: CommonCrawl-HQ, math, and code each get ~16.67%, but now we see long chain-of-thought data (12.5%) and instruction tuning data (12% combined). This is where the model starts learning how to reason step-by-step.
Phase 4 (0.5T tokens) is a refinement phase with linear decay to zero. The mix shifts again: CommonCrawl-HQ jumps to 40%, code and math each get 20%, and instruction data makes up 14%. It’s a final polish on the highest-quality data available.
Phase 5 is long-context extension, taking the context window from 4K all the way to 512K tokens through staged steps: 32K, 128K, then 512K. For the 512K step, they use 80% books and 20% code repository data (only for 8B and 30B). After each stage, they do a model merge to preserve short-context performance. The RULER benchmark results are solid: the 8B base model scores 83.6 at 32K, 79.1 at 64K, and 73.0 at 128K.
Post-Training: SFT + GRPO with DAPO Loss
After pre-training, the models go through supervised fine-tuning on ~4.1 million curated samples. IBM used an LLM-as-Judge framework to filter and rank this data, which is a sensible approach—letting a strong model evaluate the quality of training examples.
Then comes reinforcement learning via on-policy GRPO (Group Relative Policy Optimization) with DAPO loss (Yu et al., 2025). This is a multi-stage pipeline designed to systematically strengthen math, coding, instruction following, and general chat. The 8B instruct model reportedly matches or surpasses the previous Granite 4.0-H-Small (a 32B-A9B MoE model), which is impressive for a dense 8B model.
What I Like and What I’d Question
I appreciate the transparency here. IBM published the exact data mixtures per phase, the learning rate schedules, and the architectural choices. That’s rare and valuable for the community. The five-phase approach makes intuitive sense: start broad, then focus on reasoning, then polish with high-quality data, then extend context.
That said, 15T tokens for a 30B model feels like a lot. For comparison, Llama 3 70B was trained on 15T tokens. I’d be curious to see ablation studies showing whether the five-phase pipeline actually outperforms a simpler two-phase approach (pre-training + high-quality annealing) on downstream tasks. The model merge after each LCE stage is a clever trick, but it adds complexity.
Also, the long-context data mix for 512K—80% books and 20% code—seems heavily biased toward narrative and structured content. I wonder how it handles long-form technical documents or multi-turn conversations at that context length.
Bottom Line
Granite 4.1 is a well-engineered family of models that prioritizes data quality and progressive training over architectural novelty. The 8B model punching above its weight class is a good sign that the data strategy works. All models are released under Apache 2.0, which is a nice bonus for enterprise users. If you’re building applications that need strong reasoning and long-context handling without the overhead of a 70B+ model, these are worth a serious look.
Comments (0)
Login Log in to comment.
Be the first to comment!