qwen3_5_moe
Params 394.17B
Layers 364
Head Dim 128
Mem FP16 734.2 GiB
Model Backbone
End-to-End Pipeline
Decoder-Only
Text Input
Tokenizer Vocab: 248320
qwen3_5_moe Core
MoE
Embedding d=4096
Hidden States [B, L, 4096]
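The shapes above can be sketched as a plain row lookup: a table of size vocab × d maps token ids of shape [B, L] to hidden states of shape [B, L, d]. This is a minimal illustration, not the model's actual code; the helper names `embed` and `embedding_param_count` and the toy table are assumptions, and only the vocab size (248320) and width (4096) come from this page.

```python
VOCAB_SIZE = 248320  # tokenizer vocab size listed on this page
D_MODEL = 4096       # embedding / hidden width listed on this page


def embedding_param_count(vocab_size, d_model):
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


def embed(token_ids, table):
    """Map token ids of shape [B, L] to vectors of shape [B, L, d] by row lookup."""
    return [[table[t] for t in seq] for seq in token_ids]


# Toy table (hypothetical sizes): 10 tokens, width 4.
toy_table = [[float(i)] * 4 for i in range(10)]
batch = [[1, 2, 3], [4, 5, 6]]  # B = 2, L = 3
hidden = embed(batch, toy_table)
shape = (len(hidden), len(hidden[0]), len(hidden[0][0]))

print(shape)                                        # (2, 3, 4)
print(embedding_param_count(VOCAB_SIZE, D_MODEL))   # 1017118720
```

At the sizes listed here, the embedding table alone would account for roughly 1.02B of the parameters.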
Layers (×364)
Layer 1
Layer 364
RMSNorm
Linear Logits
Loss
Output
Decoder Layer
Single Transformer Block
Transformer Block
Hidden_states [B, L, D]
LayerNorm Pre-Attn
Self-Attention MHA · 32 heads
MoE experts: E1, E2, E3, E4, E5, E6, E7, E8 (top-2 of 8 active)
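A rough sketch of what "top-2 of 8 active" means: a router scores all eight experts for each token, only the two highest-scoring experts run, and their outputs are mixed by the router weights. The function names, the toy experts, and the choice to renormalise the two selected weights so they sum to 1 are assumptions for illustration; the page does not say how this model's router combines experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: the selected weights are renormalised to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Eight toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

routes = top_k_route(logits)
y = moe_forward([1.0], logits, experts)
print([i for i, _ in routes])  # [1, 4]
```

Only two of the eight expert functions are called per token, which is why the active compute per token is much smaller than the total parameter count suggests.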
Hidden_states
copy
+
Hidden_states
LayerNorm Pre-MLP
MLP ReLU
Hidden_states
copy
+
Hidden_states [B, L, D]
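The "copy" and "+" nodes in the block above describe residual connections around two pre-normalised sublayers. A minimal sketch for a single hidden vector, assuming a parameter-free LayerNorm and with `token_mixer` and `mlp` as hypothetical stand-ins for the attention and MLP/MoE sublayers:

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable scale/shift (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Elementwise sum: the '+' node that merges the copied residual."""
    return [ai + bi for ai, bi in zip(a, b)]


def transformer_block(h, token_mixer, mlp):
    """Pre-norm block: h + sublayer(norm(h)), applied twice."""
    h = add(h, token_mixer(layer_norm(h)))  # LayerNorm Pre-Attn -> attention -> +
    h = add(h, mlp(layer_norm(h)))          # LayerNorm Pre-MLP  -> MLP       -> +
    return h


def zero_sublayer(x):
    """Stand-in sublayer that contributes nothing."""
    return [0.0 for _ in x]


def relu_sublayer(x):
    """Stand-in sublayer that applies ReLU elementwise."""
    return [max(0.0, v) for v in x]


h_in = [1.0, -2.0, 3.0, 0.5]
h_same = transformer_block(h_in, zero_sublayer, zero_sublayer)
h_relu = transformer_block(h_in, zero_sublayer, relu_sublayer)
```

With sublayers that output zeros, the block returns its input unchanged, which shows that the residual path carries the hidden state through regardless of what the sublayers do.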
Micro-Architecture
Component Details
Attn
Hidden_states
Query
Key
Value
Apply_rotary_pos_emb
Query
Key
Compute_module
Query
Key
Dot_attn
Attention_weight
+
Softmax
Matmul
O_Linear
Output
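The attention path above (rotary embedding on Query and Key, dot product, additive mask at the "+" node, softmax, matmul with Value) can be sketched for one head as follows. This is an illustration under assumptions: Q, K, and V are taken as already projected, the rotary variant shown rotates consecutive pairs (some implementations pair the first and second halves instead), the causal mask in the usage lines is a choice made for the example, and the final O_Linear projection is omitted.

```python
import math


def rotary(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by a position-dependent angle.

    Assumes an even-length vector.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def softmax(xs):
    """Numerically stable softmax; -inf entries get weight 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, mask=None):
    """Single-head attention over L positions; mask[i][j] is added to score (i, j)."""
    L, d = len(q), len(q[0])
    q = [rotary(q[i], i) for i in range(L)]
    k = [rotary(k[j], j) for j in range(L)]
    out = []
    for i in range(L):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(L)]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[i])]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(L))
                    for c in range(len(v[0]))])
    return out  # an output projection (O_Linear) would follow here


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Causal mask (an assumption for this example): position i sees only j <= i.
causal = [[0.0 if j <= i else float("-inf") for j in range(3)] for i in range(3)]
out = attention(q, k, v, causal)
```

Under the causal mask the first position can only attend to itself, so its output equals its own value vector; passing `mask=None` gives unmasked attention over all positions.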
MLP
HS
Linear
Act
Linear
×
Linear
HS
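One reading of the MLP diagram above (Linear, Act, a second Linear, an elementwise "×", then a final Linear) is a gated MLP: one projection passes through the activation and multiplies a second projection before the result is projected back down. The sketch below follows that reading as an assumption; the names `W_gate`, `W_up`, and `W_down` are hypothetical, and ReLU is used because it is the activation this page lists.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in x]


def gated_mlp(x, W_gate, W_up, W_down):
    """down( act(gate(x)) * up(x) ), with '*' elementwise."""
    gate = relu(matvec(W_gate, x))   # Linear -> Act
    up = matvec(W_up, x)             # Linear
    mixed = [g * u for g, u in zip(gate, up)]  # the '×' node
    return matvec(W_down, mixed)     # final Linear back to hidden width


# Toy weights (hypothetical): hidden width 2, intermediate width 3.
W_gate = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
W_up = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

y = gated_mlp([1.0, 2.0], W_gate, W_up, W_down)
print(y)  # [9.0, 0.0]
```

In the toy run the second intermediate unit is zeroed by the gate, so it contributes nothing to the output even though its "up" projection is nonzero.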
Architecture DNA
Model Fingerprint
📐 Dimensions
Hidden dim: 4096
Heads: 32
KV heads: 32 (MHA)
Head dim: 128
Layers: 364
Vocab: 248k
⚡ Compute Budget
FP16 mem: 734.2 GiB
Params: 394.2B
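The FP16 memory figure is consistent with two bytes per parameter when expressed in binary gigabytes (GiB, 2^30 bytes). A quick check, assuming the 394.17B parameter count is exact and counting weights only (no activations, KV cache, or optimizer state); the function names are illustrative:

```python
PARAMS = 394.17e9          # parameter count listed on this page
BYTES_PER_PARAM_FP16 = 2   # FP16 stores each weight in 2 bytes


def weight_memory_gib(n_params, bytes_per_param):
    """Weight storage in binary gigabytes (GiB = 2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


def weight_memory_gb(n_params, bytes_per_param):
    """Weight storage in decimal gigabytes (GB = 10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


gib = weight_memory_gib(PARAMS, BYTES_PER_PARAM_FP16)
gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM_FP16)
print(round(gib, 1))  # 734.2
print(round(gb, 1))   # 788.3
```

The same weights come to about 788.3 in decimal gigabytes, so the unit matters when comparing against hardware memory sizes.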
🔬 Mechanism
Attention: MHA
Position: Learned
Norm: LayerNorm
Activation: ReLU
Masking: Bidirectional
Extras: MoE
⚖ Architecture Comparison
Feature Dec-Only Enc-Only Enc–Dec
Causal Mask
Bidirectional
Cross-Attention
Autoregressive
Text Generation
Seq2Seq Tasks
Classification