LLM Architecture Innovation Timeline

Total Innovations: 68

Years of Evolution: 3

Model Families: 17

High Impact (MoE, GQA, SSM)

Medium Impact (RoPE, SwiGLU)

Standard Features

2022

Multi-lingual

BLOOM

46 Languages

bigscience high

RAIL

BLOOM

BigScience Responsible AI License (RAIL)

bigscience low

176B Params

BLOOM

Largest Open Multilingual Model

bigscience high

2023

RoPE

LLaMA-2

Rotary Position Embedding

LLaMA medium
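The rotary position embedding named in the card above can be sketched as follows. This is a minimal pure-Python illustration, not any model's actual code: the function names `rope` and `dot`, the adjacent-pair layout, and the base of 10000 are assumptions made for the example (real implementations differ in how they pair dimensions).

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (v0, v1), (v2, v3), ... by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        # Angle grows with position; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # hypothetical query vector
k = [0.2, -1.0, 0.7, 0.1]   # hypothetical key vector

# Both pairs of positions are 3 apart, so the attention scores should match.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 13), rope(k, 10))
print(abs(s1 - s2) < 1e-9)
```

The point of the design is visible in the last lines: the query-key score depends only on the relative offset between positions, not on the absolute positions themselves.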

SwiGLU

LLaMA-2

SwiGLU Activation

LLaMA medium
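A SwiGLU feed-forward block, as named in the card above, gates one linear projection with the SiLU of another before projecting back down. The sketch below uses plain lists as matrices; the names `swiglu_ffn`, `W_gate`, `W_up`, and `W_down` and the tiny weights are hypothetical, chosen only to keep the example self-contained.

```python
import math

def silu(x):
    # SiLU (also called Swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # gating branch
    up = matvec(W_up, x)                          # value branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gate
    return matvec(W_down, hidden)                 # project back to model width

# Hypothetical weights: model width 2, hidden width 3.
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

y = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
print(len(y))
```

Compared with a plain two-matrix feed-forward layer, this variant carries a third weight matrix, which is why models that use it often shrink the hidden width to keep the parameter count comparable.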

RoPE

Qwen-7B

Rotary Position Embedding

Qwen medium

Sliding Window

Mistral-7B

Sliding Window Attention

LLaMA medium
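Sliding window attention restricts each token to a fixed number of recent tokens instead of the full prefix. A minimal sketch of the corresponding causal mask, assuming a hypothetical helper `sliding_window_mask` and a toy window of 3 (production windows are far larger):

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # 'x' marks an allowed query-key pair, '.' a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed entries, so attention cost per token stays constant as the sequence grows. Information from older tokens can still propagate indirectly through stacked layers.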

RoPE

Mistral-7B

Rotary Position Embedding

LLaMA medium

RoPE Scaling

Mistral-Instruct

Extended Context via RoPE Scaling

LLaMA medium

GLM Embedding

ChatGLM

General Language Model Pretraining

GLM medium

Multi-Query Attention

ChatGLM2

Multi-Query Attention

GLM medium

Long Context

ChatGLM2

32K Context

GLM high

GQA

ChatGLM3

Grouped Query Attention

GLM high
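Grouped query attention lets several query heads share one key/value head; multi-query attention is the special case with a single key/value head. The sketch below shows the head mapping and the resulting key/value cache size. The helper names and the example dimensions are assumptions for illustration, not any specific model's configuration.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the key/value head that a given query head reads from."""
    assert num_query_heads % num_kv_heads == 0
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, bytes_per_value=2):
    # Factor 2 covers keys and values; bytes_per_value=2 assumes 16-bit storage.
    return 2 * num_kv_heads * head_dim * seq_len * num_layers * bytes_per_value

# 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for(h, 8, 2) for h in range(8)]
print(mapping)  # → [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical model shape: 32 layers, head_dim 128, 8192-token context.
full = kv_cache_bytes(num_kv_heads=32, head_dim=128, seq_len=8192, num_layers=32)
grouped = kv_cache_bytes(num_kv_heads=8, head_dim=128, seq_len=8192, num_layers=32)
print(full // grouped)  # → 4
```

The cache shrinks in direct proportion to the number of key/value heads, which is the main reason the technique appears alongside long-context entries in this timeline.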

Self-Extension

ChatGLM3

Extended Context 128K

GLM low

Long Context

Yi

200K Context Window

01-ai high

RoPE

Yi

Rotary Position Embedding

01-ai medium

Long Context

Kimi

128K Context Window

Kimi high

Textbooks

Phi-1

High-Quality Textbook Data

microsoft medium

Code Data

Phi-1

Synthetic Code Generation

microsoft low

Small Scale

Phi-2

2.7B Parameter Efficiency

microsoft low

FIM

Starcoder

Fill-in-the-Middle

bigcode medium
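Fill-in-the-middle training rearranges a document so the model sees the prefix and suffix first and then generates the missing middle. A minimal sketch of building such a prompt in prefix-suffix-middle order follows. The sentinel strings used as defaults are an assumption; check the tokenizer of the model in use for its actual special tokens.

```python
def make_fim_prompt(prefix, suffix,
                    prefix_tok="<fim_prefix>",
                    suffix_tok="<fim_suffix>",
                    middle_tok="<fim_middle>"):
    """Build a prefix-suffix-middle prompt; the model continues after middle_tok."""
    return f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}"

prompt = make_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

Whatever the model generates after the final sentinel is spliced between the prefix and suffix, which is what makes this format useful for editor-style code completion.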

Long Context

Starcoder

8K Context

bigcode high

Billion-scale

Falcon

Web Data Filtering

tiiuae low

FlashAttention

Falcon

FlashAttention

tiiuae low

GQA

Falcon-40B

Grouped Query Attention (40B)

tiiuae high

WKV

RWKV

Weighted Key-Value

RWKV high

RNN-Transformer

RWKV

RNN-Transformer Hybrid

RWKV high

Linear Complexity

RWKV

O(n) for Long Context

RWKV high

Long Context

InternLM

8K-32K Context

internlm high

Open Weights

InternLM

Fully Open Source

internlm medium

BaiChuan

Baichuan

Bilingual (ZH/EN)

baichuan-inc low

Dynamic NTK

Baichuan

Dynamic NTK Scaling

baichuan-inc medium
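Dynamic NTK scaling enlarges the RoPE frequency base only when the input exceeds the trained context length. The sketch below uses one commonly seen formulation; the function name, the parameter names, and the exact formula are assumptions for illustration and may not match what any particular model ships.

```python
def dynamic_ntk_base(seq_len, max_trained_len, head_dim, base=10000.0, factor=2.0):
    """Return the RoPE base to use for a sequence of the given length."""
    if seq_len <= max_trained_len:
        return base  # within the trained range: leave RoPE untouched
    # Grow the base smoothly with the amount by which the context is exceeded.
    scale = (factor * seq_len / max_trained_len) - (factor - 1.0)
    return base * scale ** (head_dim / (head_dim - 2))

for n in (2048, 4096, 8192, 16384):
    print(n, round(dynamic_ntk_base(n, max_trained_len=4096, head_dim=128), 1))
```

Because the base is recomputed from the current sequence length, short inputs behave exactly as during training, and only long inputs get stretched position frequencies.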

2.0

Baichuan2

Improved Training Data

baichuan-inc low

GQA

Baichuan2

Grouped Query Attention

baichuan-inc high

Open Source

Skywork

Fully Open Weights

Skywork low

Long Context

Skywork

4K-16K Context

Skywork high

2024

GQA

LLaMA-3

Grouped Query Attention

LLaMA high

8K Context

LLaMA-3

8K Context Length

LLaMA high

Long Context

LLaMA-3.1

128K Extended Context

LLaMA high

GQA

Qwen1.5

Grouped Query Attention

Qwen high

GQA

Qwen2

Grouped Query Attention

Qwen high

BF16

Qwen2

BFloat16 Support

Qwen low

MoE

Qwen2.5

Mixture of Experts (optional)

Qwen high

Long Context

Qwen2.5

128K Context

Qwen high

MoE

DeepSeek-MoE

Mixture of Experts

DeepSeek high
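A mixture-of-experts layer sends each token to a few experts picked by a router and mixes their outputs. The sketch below shows plain top-k routing on scalar inputs. It is a simplification under stated assumptions: the router logits are passed in directly (a real router computes them from the token with a learned projection), the experts are toy functions, and no load-balancing or fine-grained partitioning is modeled.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k=2):
    """Route x to the top_k experts and return the weighted sum of their outputs."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts
    out = sum((probs[i] / norm) * experts[i](x) for i in chosen)
    return out, chosen

# Hypothetical experts and router scores for a single token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
out, chosen = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, top_k=2)
print(chosen, out)
```

Only the selected experts run for a given token, so total parameter count can grow with the number of experts while per-token compute stays close to that of a model with `top_k` experts.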

Fine-grained Expert

DeepSeek-MoE

Fine-grained Expert Partitioning

DeepSeek medium

MLA

DeepSeek-V2

Multi-head Latent Attention

DeepSeek high

DeepSeek MoE

DeepSeek-V2

Custom MoE Architecture

DeepSeek medium

VL

DeepSeek-V2.5

Vision-Language Integration

DeepSeek medium

GLM-4

GLM-4

Full GLM Architecture

GLM low

Tool Use

GLM-4

Function Calling

GLM medium

GQA

Yi-1.5

Grouped Query Attention

01-ai high

Stronger Base

Yi-1.5

Improved Pretraining

01-ai low

3.8B > 7B

Phi-3

Outperform Larger Models

microsoft low

Long Context

Phi-3

128K Context

microsoft high

GQA

Phi-3

Grouped Query Attention

microsoft high

Gemini Tech

Gemma

Based on Gemini Research

google low

Open Weights

Gemma

Open Model Weights

google medium

Gemma 2

Gemma-2

Improved Architecture

google low

GQA

Gemma-2

Grouped Query Attention

google high

SSM

Mamba

State Space Model

state-spaces high

Hardware-Aware

Mamba

Hardware-Aware Selective Scan

state-spaces low

Linear Complexity

Mamba

O(n) vs O(n²) Attention

state-spaces high
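The O(n) claim in the card above comes from processing the sequence as a recurrence with a fixed-size state, rather than comparing every token with every other token. A minimal sketch of such a scan with a scalar state follows. It uses fixed coefficients `a`, `b`, and `c`, which are assumptions for the example; a selective model makes them depend on the input, and real kernels are heavily optimized.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:          # one constant-cost update per token
        h = a * h + b * x
        ys.append(c * h)
    return ys

print(ssm_scan([1.0, 0.0, 0.0], a=0.5))  # → [1.0, 0.5, 0.25]

# Work grows linearly with length n, versus n * n pairwise scores in full attention.
for n in (1_000, 10_000):
    print(n, "scan steps:", n, "attention pairs:", n * n)
```

The state `h` summarizes everything seen so far, so memory during generation stays constant regardless of how long the context becomes.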

Slimp

Mamba-Slimp

Compression for Deployment

state-spaces low

MoE

InternLM2

Mixture of Experts

internlm high

100K Context

InternLM2

Extended to 100K

internlm low

MoE

Skywork-MoE

Mixture of Experts

Skywork high

SFT

Skywork-MoE

Supervised Fine-Tuning

Skywork low