Model Backbone
End-to-End Pipeline
Decoder-Only
Text Input
Tokenizer
Vocab: 248320
qwen3_5_moe Core
MoE
Embedding
d=4096
Layers (×364)
Layer 1
⋮
Layer 364
RMSNorm
Linear
Logits
Loss
Output
Decoder Layer
Single Transformer Block
Transformer Block
Hidden_states
[B, L, D]
LayerNorm
Pre-Attn
Self-Attention
MHA · 32 heads
E1
E2
E3
E4
E5
E6
E7
E8
top-2 of 8 active
copy
+
LayerNorm
Pre-MLP
MLP
ReLU
copy
+
Hidden_states
[B, L, D]
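
The "top-2 of 8 active" label in the block above can be illustrated with a small routing sketch. This is a minimal, hedged example and not the model's actual router: the function names `top_k_route` and `moe_forward`, the toy experts, and the choice to softmax-normalise over only the selected logits are all assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    # Indices of the k largest logits, highest first (ties keep index order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Assumption: softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Eight hypothetical experts (E1..E8): expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top_k_route(logits)                     # experts at index 1 and 4
mixed = moe_forward([1.0, 2.0], logits, experts)
```

Only two of the eight experts run per token, which is why the active compute per token is far smaller than the total parameter count suggests.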
Micro-Architecture
Component Details
Attn
Hidden_states
Query
Key
Value
Apply_rotary_pos_emb
Query
Key
Compute_module
Query
Key
Dot_attn
Attention_weight
+
Softmax
Matmul
O_Linear
Output
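
The Attn flow above (Dot_attn, "+", Softmax, Matmul) can be sketched for a single head as follows. This is an illustrative sketch under stated assumptions: the "+" step is read here as an additive mask, the rotary embedding and the `O_Linear` projection are omitted, and the `attention` helper and its `causal` flag are hypothetical names, not taken from the model's code.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v, causal=True):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Dot_attn: scaled query-key scores for position i.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # "+": additive mask (assumed); -inf hides future positions when causal.
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Matmul: weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal_out = attention(q, k, v, causal=True)
full_out = attention(q, k, v, causal=False)
```

With `causal=True` the first position can only attend to itself, so its output equals the first value vector; with `causal=False` it also mixes in the second.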
MLP
HS
Linear
Act
Linear
×
Linear
HS
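
The MLP flow above (Linear, Act, a parallel Linear, "×", then a final Linear) matches the shape of a gated feed-forward layer. The sketch below is a hedged illustration of that reading: the names `gated_mlp`, `w_gate`, `w_up`, and `w_down` are assumptions, and ReLU is used only because the diagram lists it as the activation.

```python
def linear(w, x):
    """Matrix-vector product; w is a list of rows, no bias."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def relu(x):
    return [max(0.0, v) for v in x]

def gated_mlp(x, w_gate, w_up, w_down):
    gate = relu(linear(w_gate, x))              # Linear -> Act
    up = linear(w_up, x)                        # parallel Linear
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise "×"
    return linear(w_down, hidden)               # final Linear back to hidden size

x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
y = gated_mlp(x, w_gate, w_up, w_down)
```

The gate branch decides which hidden units pass through; units whose gate is zero after the activation contribute nothing to the output.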
Architecture DNA
Model Fingerprint
📐 Dimensions
Hidden dim: 4096
Heads: 32
KV heads: 32 (MHA)
Head dim: 128
Layers: 364
Vocab: 248k
⚡ Compute Budget
FP16 mem: 734.2 GB
Params: 394.2B
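
The memory figure follows from the parameter count at two bytes per FP16 weight. The short check below is a sketch under the assumption that only weights are counted (no activations, KV cache, or optimizer state); the variable names are illustrative.

```python
params = 394.2e9        # parameter count from the diagram (rounded)
bytes_per_param = 2     # FP16 stores each weight in 2 bytes

total_bytes = params * bytes_per_param
gb = total_bytes / 1e9        # decimal gigabytes
gib = total_bytes / 2**30     # binary gibibytes

print(f"{gb:.1f} GB (decimal), {gib:.2f} GiB (binary)")
```

The binary value lands close to the 734.2 shown in the diagram, which suggests the figure is expressed in GiB (2^30 bytes); the small difference is consistent with the parameter count being rounded to 394.2B.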
🔬 Mechanism
Attention: MHA
Position: Learned
Norm: LayerNorm
Activation: ReLU
Masking: Causal
Extras: MoE
⚖ Architecture Comparison
| Feature | Dec-Only | Enc-Only | Enc–Dec |
|---|---|---|---|
| Causal Mask | ✔ | – | ✔ |
| Bidirectional | – | ✔ | ✔ |
| Cross-Attention | – | – | ✔ |
| Autoregressive | ✔ | – | ✔ |
| Text Generation | ✔ | – | ✔ |
| Seq2Seq Tasks | – | – | ✔ |
| Classification | ✔ | ✔ | ✔ |