Most people jumping into LLMs know that they are trained in massive data centers, often on expensive A100s that pack 40GiB of VRAM each. Getting into training is much more expensive than running a model: take a look at the lists below, which show that most models require multiple A100s.
What factors into hardware requirements?
LLMs consume memory as a consequence of many factors, the primary ones being the number of parameters, the precision of the data type used to store them, and the framework used to train.
There are many different data types available today; if you are familiar with data types in Computer Science, then FP32 and FP16 will be familiar to you. However, due to the ever increasing number of parameters, new floating point formats have been developed. These focus on enabling a larger range of numbers than FP16, which tops out at 65,504: newer types like TF32 can represent values up to roughly 3.4028235 × 10^38, while using slightly more memory.
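You can see these ranges for yourself; below is a minimal sketch assuming a PyTorch install (TF32 has no standalone tensor dtype in PyTorch, so it is noted in a comment instead).

```python
# Print the representable range and width of common training dtypes.
import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):<16} bits={info.bits:<3} max={info.max:.4e}")

# TF32 is a compute mode rather than a storage dtype: it keeps FP32's 8-bit
# exponent (so the same ~3.4e38 maximum) with a reduced 10-bit mantissa.
```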
The main ones used for training today are BFLOAT16 and TensorFloat-32 (TF32), both of which reduce memory consumption compared to plain FP32.
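As a very rough sanity check, you can multiply the parameter count by the bytes needed per parameter during training. The sketch below assumes a naive mixed-precision Adam setup at roughly 16 bytes per parameter (weights, gradients, optimizer moments and master weights); the measured figures below bake in framework-specific savings and activation memory, so treat this purely as a back-of-the-envelope estimate.

```python
# Back-of-the-envelope training memory estimate. The default of 16 bytes/parameter
# assumes naive mixed-precision Adam (2B weights + 2B grads + 4B+4B moments + 4B
# master weights); real numbers vary with framework, optimizer and batch size.
GIB = 1024 ** 3

def estimate_training_vram_gib(num_params: float, bytes_per_param: float = 16) -> float:
    return num_params * bytes_per_param / GIB

for name, params in [("LLAMA-6B", 6e9), ("LLAMA-13B", 13e9), ("GPT-NeoX 20B", 20e9)]:
    print(f"{name}: ~{estimate_training_vram_gib(params):.0f} GiB (before activations)")
```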
So how much memory do I need to train?
Below you will find how much memory, both VRAM and system memory (RAM), you will need. For smaller models we include FP32, since it is still usable there, but it does not make sense for larger models. We divide the models into two categories: single-system models, which can theoretically be trained on a single computer, and multi-system models.
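Before comparing against the tables, it helps to know what you actually have. A quick way to check total VRAM per GPU, assuming a CUDA-enabled PyTorch install:

```python
# Print the total VRAM of each visible CUDA device.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
```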
Single System
LLAMA-6B
  FP16:     60GiB VRAM, 32GiB RAM
  FP32:     80GiB VRAM, 32GiB RAM
  BFLOAT16: 60GiB VRAM, 32GiB RAM
  TF32:     70GiB VRAM, 32GiB RAM

LLAMA-13B
  FP16:     121GiB VRAM, 64GiB RAM
  FP32:     145GiB VRAM, 64GiB RAM
  BFLOAT16: 121GiB VRAM, 64GiB RAM
  TF32:     133GiB VRAM, 64GiB RAM

LLAMA-33B
  BFLOAT16: 310GiB VRAM, 256GiB RAM
  TF32:     OOM on a single system

GPT-NeoX 20B
  BFLOAT16: 205GiB VRAM, 128GiB RAM
  TF32:     60GiB VRAM, 128GiB RAM
Multi-System (VRAM > 320GiB)
LLAMA-65B
  BFLOAT16: 603GiB VRAM, 512GiB RAM
  TF32:     666GiB VRAM, 512GiB RAM

GPT-3 (~175B)
  BFLOAT16: 1630GiB VRAM
  TF32:     1790GiB VRAM
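Once the requirement exceeds what fits in one box, the weights, gradients and optimizer state have to be sharded across GPUs and nodes. Below is a minimal sketch of what that can look like with PyTorch FSDP; the tiny stand-in model and the launch command are placeholders, and the exact API varies between PyTorch versions.

```python
# Minimal multi-GPU/multi-node sharding sketch using PyTorch FSDP.
# Launch with e.g. `torchrun --nnodes=2 --nproc_per_node=8 train.py` (placeholder values).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Stand-in model: substitute your real transformer here.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Shard parameters, gradients and optimizer state across all ranks, training in BFLOAT16,
# so the per-GPU footprint is roughly the totals above divided by the number of GPUs.
model = FSDP(model, mixed_precision=MixedPrecision(param_dtype=torch.bfloat16))
```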
Why are these numbers so much larger than what I saw on HN?
A lot of the recent hype about LLMs has been about getting them to run on consumer hardware. That is a completely different problem from training: methods such as GPTQ shrink the number of bits required to run a model by compressing (quantizing) its weights. You still need to train the model at 16-bit precision (e.g. BFLOAT16) before you can run GPTQ to shrink it down to 3-4 bits instead of 16.
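To see why the inference numbers people post are so much smaller, compare the weight storage alone at different precisions. This is a rough illustration only, using an assumed ~7B-parameter model and ignoring activations, the KV cache and quantization metadata such as scales.

```python
# Rough weight-only memory comparison for a ~7B-parameter model at different precisions.
GIB = 1024 ** 3
params = 7e9

for label, bits in [("FP32", 32), ("BFLOAT16/FP16", 16), ("GPTQ 4-bit", 4), ("GPTQ 3-bit", 3)]:
    print(f"{label:>14}: ~{params * bits / 8 / GIB:.1f} GiB of weights")
```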