LLM Architecture Innovation Timeline

Tracing the evolution of LLM architectures through key innovations

Total Innovations: 68
Years of Evolution: 3
Model Families: 17

Impact legend:
High Impact (MoE, GQA, SSM)
Medium Impact (RoPE, SwiGLU)
Standard Features
2022
Multi-lingual
BLOOM
46 Languages
bigscience high
RAIL
BLOOM
BigScience Responsible AI License (RAIL)
bigscience low
176B Params
BLOOM
Largest Open Multilingual Model
bigscience high
2023
RoPE
LLaMA-2
Rotary Position Embedding
LLaMA medium
SwiGLU
LLaMA-2
SwiGLU Activation
LLaMA medium
RoPE
Qwen-7B
Rotary Position Embedding
Qwen medium
Sliding Window
Mistral-7B
Sliding Window Attention
LLaMA medium
RoPE
Mistral-7B
Rotary Position Embedding
LLaMA medium
RoPE Scaling
Mistral-Instruct
Extended Context via RoPE Scaling
LLaMA medium
GLM Embedding
ChatGLM
General Language Model Pretraining
GLM medium
Multi-Query Attention
ChatGLM2
Multi-Query Attention
GLM medium
Long Context
ChatGLM2
32K Context
GLM high
GQA
ChatGLM3
Grouped Query Attention
GLM high
Self-Extension
ChatGLM3
Extended Context 128K
GLM low
Long Context
Yi
200K Context Window
01-ai high
RoPE
Yi
Rotary Position Embedding
01-ai medium
Long Context
Kimi
128K Context Window
Kimi high
Textbooks
Phi-1
High-Quality Textbook Data
microsoft medium
Code Data
Phi-1
Synthetic Code Generation
microsoft low
Small Scale
Phi-2
2.7B Parameter Efficiency
microsoft low
FIM
Starcoder
Fill-in-the-Middle
bigcode medium
Long Context
Starcoder
8K Context
bigcode high
Billion-scale
Falcon
Web Data Filtering
tiiuae low
FlashAttention
Falcon
FlashAttention
tiiuae low
GQA
Falcon-40B
Grouped Query Attention (40B)
tiiuae high
WKV
RWKV
Weighted Key-Value
RWKV high
RNN-Transformer
RWKV
RNN-Transformer Hybrid
RWKV high
Linear Complexity
RWKV
O(n) for Long Context
RWKV high
Long Context
InternLM
8K-32K Context
internlm high
Open Weights
InternLM
Fully Open Source
internlm medium
BaiChuan
Baichuan
Bilingual (ZH/EN)
baichuan-inc low
Dynamic NTK
Baichuan
Dynamic NTK Scaling
baichuan-inc medium
2.0
Baichuan2
Improved Training Data
baichuan-inc low
GQA
Baichuan2
Grouped Query Attention
baichuan-inc high
Open Source
Skywork
Fully Open Weights
Skywork low
Long Context
Skywork
4K-16K Context
Skywork high
2024
GQA
LLaMA-3
Grouped Query Attention
LLaMA high
8K Context
LLaMA-3
8K Context Length
LLaMA high
Long Context
LLaMA-3.1
128K Extended Context
LLaMA high
GQA
Qwen1.5
Grouped Query Attention
Qwen high
GQA
Qwen2
Grouped Query Attention
Qwen high
BF16
Qwen2
BFloat16 Support
Qwen low
MoE
Qwen2.5
Mixture of Experts (optional)
Qwen high
Long Context
Qwen2.5
128K Context
Qwen high
MoE
DeepSeek-MoE
Mixture of Experts
DeepSeek high
Fine-grained Expert
DeepSeek-MoE
Fine-grained Expert Partitioning
DeepSeek medium
MLA
DeepSeek-V2
Multi-head Latent Attention
DeepSeek high
DeepSeek MoE
DeepSeek-V2
Custom MoE Architecture
DeepSeek medium
VL
DeepSeek-V2.5
Vision-Language Integration
DeepSeek medium
GLM-4
GLM-4
Full GLM Architecture
GLM low
Tool Use
GLM-4
Function Calling
GLM medium
GQA
Yi-1.5
Grouped Query Attention
01-ai high
Stronger Base
Yi-1.5
Improved Pretraining
01-ai low
3.8B > 7B
Phi-3
Outperforms Larger Models
microsoft low
Long Context
Phi-3
128K Context
microsoft high
GQA
Phi-3
Grouped Query Attention
microsoft high
Gemini Tech
Gemma
Based on Gemini Research
google low
Open Weights
Gemma
Open Model Weights
google medium
Gemma 2
Gemma-2
Improved Architecture
google low
GQA
Gemma-2
Grouped Query Attention
google high
SSM
Mamba
State Space Model
state-spaces high
Hardware-Aware
Mamba
Hardware-Aware Selective Scan
state-spaces low
Linear Complexity
Mamba
O(n) vs O(n²) Attention
state-spaces high
Slimp
Mamba-Slimp
Compression for Deployment
state-spaces low
MoE
InternLM2
Mixture of Experts
internlm high
100K Context
InternLM2
Extended to 100K
internlm low
MoE
Skywork-MoE
Mixture of Experts
Skywork high
SFT
Skywork-MoE
Supervised Fine-Tuning
Skywork low
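
Grouped Query Attention is the most frequent high-impact entry in this timeline. One way to see why is to compare how much key/value cache a single token needs under standard multi-head attention, GQA, and Multi-Query Attention. The sketch below is only a back-of-the-envelope calculation: the layer count, head counts, and head dimension are hypothetical values chosen for illustration and do not describe any model listed above.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache needed per token (bytes_per_value=2 assumes 16-bit values)."""
    # Both keys and values are cached in every layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
N_LAYERS = 32
N_QUERY_HEADS = 32
HEAD_DIM = 128

# Multi-head attention: one KV head per query head.
mha = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)
# Grouped Query Attention: 8 KV heads, each shared by a group of 4 query heads.
gqa = kv_cache_bytes_per_token(N_LAYERS, 8, HEAD_DIM)
# Multi-Query Attention: a single KV head shared by all query heads.
mqa = kv_cache_bytes_per_token(N_LAYERS, 1, HEAD_DIM)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumed numbers, the cache shrinks in proportion to the number of KV heads, which is what makes long context windows cheaper to serve.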