The Llama 3 paper details the recipe step by step, from training ⇒ scaling ⇒ inference (Grattafiori et al., 2024). They pre-train the 405B model on 15.6T tokens with an 8K-token context window.
The data mix: roughly 50% of tokens are general knowledge, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
They also use annealing data at the end of training to improve quality (Blakeney et al., 2024).
They also run their own scaling-law calculations instead of relying on the Chinchilla constants.
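For a flavour of what "their own scaling-law calculations" means in practice, the sketch below fits a power law N*(C) = A · C^α to compute-optimal points (the minima of IsoFLOP-style curves) and extrapolates it to a larger budget. The data points and resulting coefficients here are made-up placeholders, not the paper's fits; only the C ≈ 6·N·D compute approximation is standard.

```python
import numpy as np

# Hypothetical (compute budget in FLOPs, compute-optimal parameter count) pairs,
# standing in for the minima of IsoFLOP curves -- NOT the paper's measurements.
compute_budgets = np.array([1e19, 1e20, 1e21, 1e22])
optimal_params  = np.array([8e8, 2.7e9, 9e9, 3e10])

# Fit N*(C) = A * C^alpha by linear regression in log-log space.
alpha, log_A = np.polyfit(np.log(compute_budgets), np.log(optimal_params), 1)
A = np.exp(log_A)

# Extrapolate to a Llama-3-scale budget: C = 6 * N * D ≈ 6 * 405e9 * 15.6e12 FLOPs.
C_target = 6 * 405e9 * 15.6e12
N_star = A * C_target ** alpha
D_star = C_target / (6 * N_star)   # token budget implied by C ≈ 6 N D

print(f"fitted alpha = {alpha:.2f}")
print(f"compute-optimal params at C = {C_target:.2e}: {N_star:.2e}")
print(f"implied token budget: {D_star:.2e}")
```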
Architecture-wise, nothing special: a standard dense Transformer with Grouped-Query Attention (GQA) and SwiGLU FFNs.
| | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Model Dimension | 4,096 | 8,192 | 16,384 |
| FFN Dimension | 14,336 | 28,672 | 53,248 |
| Attention Heads | 32 | 64 | 128 |
| Key/Value Heads | 8 | 8 | 8 |
| Peak Learning Rate | 3 × 10⁻⁴ | 1.5 × 10⁻⁴ | 8 × 10⁻⁵ |
| Activation Function | SwiGLU | | |
| Vocabulary Size | 128,000 | | |
| Positional Embeddings | RoPE (θ = 500,000) | | |
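As a sanity check on the table above, the snippet below reconstructs approximate parameter counts from the listed hyperparameters (untied input/output embeddings, GQA attention, SwiGLU FFN; norm parameters ignored as negligible). This is my own back-of-the-envelope arithmetic, not code from the paper.

```python
# Approximate parameter counts implied by the architecture table above.
configs = {
    "8B":   dict(layers=32,  d_model=4096,  d_ffn=14336, n_heads=32,  n_kv_heads=8),
    "70B":  dict(layers=80,  d_model=8192,  d_ffn=28672, n_heads=64,  n_kv_heads=8),
    "405B": dict(layers=126, d_model=16384, d_ffn=53248, n_heads=128, n_kv_heads=8),
}
VOCAB = 128_000

for name, c in configs.items():
    head_dim = c["d_model"] // c["n_heads"]
    kv_dim = c["n_kv_heads"] * head_dim
    # Attention: Q and O projections are d_model x d_model, K and V are d_model x kv_dim.
    attn = 2 * c["d_model"] ** 2 + 2 * c["d_model"] * kv_dim
    # SwiGLU FFN: gate, up, and down projections.
    ffn = 3 * c["d_model"] * c["d_ffn"]
    per_layer = attn + ffn
    embeddings = 2 * VOCAB * c["d_model"]          # input embedding + output head
    total = c["layers"] * per_layer + embeddings
    print(f"{name}: ~{total / 1e9:.1f}B parameters")
```

The totals come out at roughly 8B, 70.6B, and 405.8B, which lines up with the model names.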
Training config (the paper tabulates, per training stage: GPUs, TP, CP, PP, DP, sequence length, batch size per DP rank, tokens per batch, achieved TFLOPs/GPU, and BF16 MFU; only the headline numbers are summarised below):
- a 16K-H100 cluster (notably a production cluster rather than a research cluster)
- 8 pods with 3,072 GPUs per pod, but roughly a 1:7 oversubscription ratio across pods (i.e., ~7× lower cross-pod bandwidth)
- pre-training took around 54 days (the paper reports this as a 54-day snapshot period of the run)
- the theoretical peak of an H100 is ~1,979 TFLOPs BF16 with sparsity (~989 TFLOPs dense)
- training days can be estimated from the total compute C ≈ 6 · N_params · N_tokens as: days ≈ 6 · N_params · N_tokens / (N_GPUs · FLOPs_per_GPU · MFU · 86,400); see the worked example below
- Model FLOPs Utilization (MFU) is usually 38-43% for these runs
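To make the training-days estimate above concrete, here is a rough back-of-the-envelope calculation using C ≈ 6 · N · D, the 16K-GPU cluster, the dense H100 BF16 peak, and the 38-43% MFU range. It deliberately ignores interruptions, restarts, and the stages run on fewer GPUs, so it only needs to land in the same ballpark as the reported timeline.

```python
# Back-of-the-envelope training-time estimate for the 405B model.
N_PARAMS = 405e9          # parameters
N_TOKENS = 15.6e12        # pre-training tokens
N_GPUS = 16_384           # H100s
PEAK_FLOPS = 989e12       # dense BF16 peak per GPU
SECONDS_PER_DAY = 86_400

total_flops = 6 * N_PARAMS * N_TOKENS
print(f"total training compute ≈ {total_flops:.2e} FLOPs")

for mfu in (0.38, 0.43):
    flops_per_day = N_GPUS * PEAK_FLOPS * mfu * SECONDS_PER_DAY
    print(f"MFU {mfu:.0%}: ≈ {total_flops / flops_per_day:.0f} days")
```

At 38-43% MFU this comes out on the order of two months of wall-clock compute, the same ballpark as the timeline above.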
- Schedule:
- linear warmup of 8,000 steps
- peak LR of 8 × 10⁻⁵, decayed with a cosine schedule to 8 × 10⁻⁷ over 1.2M steps (sketched in the snippet just after this list)
- initial batch size of 4M tokens with
seq_length=4096
- doubled to a batch size of 8M tokens with sequences of 8,192 tokens after pre-training on 252M tokens
- doubled again to a batch size of 16M tokens after pre-training on 2.87T tokens
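A minimal sketch of that learning-rate schedule (linear warmup to the peak, then cosine decay to a floor), assuming the warmup, peak, floor, and step counts listed above; this is a reimplementation for illustration, not the paper's training code.

```python
import math

# Llama-3-405B-style LR schedule: linear warmup, then cosine decay to a floor.
PEAK_LR = 8e-5
MIN_LR = 8e-7
WARMUP_STEPS = 8_000
TOTAL_STEPS = 1_200_000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for s in (0, 4_000, 8_000, 600_000, 1_200_000):
    print(f"step {s:>9}: lr = {learning_rate(s):.2e}")
```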
- Network configuration:
- a variant of NCCL (NCCLX)
- an RDMA over Converged Ethernet (RoCE) fabric built on Arista 7800 switches and the Minipack2 Open Compute Project (OCP) rack switch
- both RoCE and InfiniBand clusters are used (the 405B model is trained on the RoCE fabric)
- Topology:
- a three-layer Clos network; the 1:7 oversubscription mentioned above applies to the cross-pod links at the top layer (toy arithmetic below)
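For intuition on the 1:7 figure, the toy arithmetic below computes an oversubscription ratio for one Clos layer as downlink bandwidth (toward the GPUs) over uplink bandwidth (toward the next layer). The port counts and speeds are made-up illustrative values, not the actual fabric configuration.

```python
# Toy oversubscription calculation for one layer of a Clos fabric.
# All numbers below are illustrative placeholders, not the real cluster's specs.
downlink_ports = 112          # ports facing the lower layer (toward GPUs)
uplink_ports = 16             # ports facing the upper layer (toward other pods)
port_speed_gbps = 400         # same speed on both sides in this toy example

downlink_bw = downlink_ports * port_speed_gbps
uplink_bw = uplink_ports * port_speed_gbps
ratio = downlink_bw / uplink_bw
print(f"oversubscription ≈ 1:{ratio:.0f} (i.e., ~{ratio:.0f}x less cross-pod bandwidth)")
```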
- Training recipe: 4D parallelism with FSDP
- tensor parallelism: splits individual weight tensors into chunks spread across devices
- pipeline parallelism: partitions the model vertically into stages by layers, so different devices can process different stages of the full model pipeline in parallel
- context parallelism: divides the input context into segments, reducing the memory bottleneck for long sequence inputs
- FSDP: shards the model, optimizer states, and gradients while implementing data parallelism (each GPU processes a different slice of the data and they synchronize after every training step)
- They also do some network-aware parallelism configuration: the order [TP, CP, PP, DP] is chosen so that the most communication-hungry dimensions sit on the fastest links, but essentially the sharded weights are all-gathered for every forward pass
- FSDP is run in ZeRO-2 mode, not ZeRO-3 mode, i.e. they keep the weight tensors materialized after the forward pass instead of re-gathering them in the backward pass (a rank-layout sketch follows this list)
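To make the 4D layout concrete, the sketch below maps a global GPU rank to its (TP, CP, PP, DP) coordinates, with TP varying fastest so that TP peers share a single 8-GPU host and the fastest links. The group sizes (TP=8, CP=1, PP=16, DP=128 → 16,384 GPUs) are one plausible factorization used here purely for illustration; the paper tunes them per training stage.

```python
# Map a global rank to (tp, cp, pp, dp) coordinates for a [TP, CP, PP, DP] layout,
# with TP varying fastest (so TP peers sit on the same server / fastest links).
TP, CP, PP, DP = 8, 1, 16, 128     # illustrative group sizes
assert TP * CP * PP * DP == 16_384

def coords(global_rank: int) -> dict:
    tp, rest = global_rank % TP, global_rank // TP
    cp, rest = rest % CP, rest // CP
    pp, dp = rest % PP, rest // PP
    return dict(tp=tp, cp=cp, pp=pp, dp=dp)

# Example: ranks 0-7 form one TP group on a single 8-GPU host,
# while rank 8 starts the next pipeline stage.
for r in (0, 1, 7, 8, 127, 128, 16_383):
    print(r, coords(r))
```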
Bibliography
- Blakeney, C., Paul, M., Larsen, B. W., Owen, S., & Frankle, J. (2024). Does your data spark joy? Performance gains from domain upsampling at the end of training. arXiv preprint arXiv:2406.03476 [arxiv]
- Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., … Ma, Z. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 [arxiv]