The Complete Guide to GPU Requirements for LLM Fine-tuning

When deciding on a GPU spec to train or fine-tune a model, you're likely going to need to hold onto the pod for hours or even days for your training run. Even a difference of a few cents per hour easily adds up, especially if you have a limited budget. On the other hand, you'll need to be sure that you have enough VRAM in your pod to even get the job off the ground in the first place. Here's the info you'll need to make an informed purchase of your GPU time.

Training speed - does it really matter?

When deciding between two GPU specs at competing price points (say, the A5000 vs. the A6000), you can get a rough estimate of their relative training speed by comparing core counts:

Core Architecture Impact:

  • A6000: 10752 CUDA cores
  • A5000: 8192 CUDA cores
  • Results in ~31% more cores for parallel processing

Tensor Core Impact:

  • A6000: 336 Tensor cores
  • A5000: 256 Tensor cores
  • Results in ~31% more ML-specialized cores

Be advised that you won't see the full 31% in practice due to overhead and other factors, but it's safe to assume you'll get at least three-quarters of it. However, at present the A5000 costs only half as much as the A6000 in Secure Cloud, so the lower-spec card is actually the better economic choice in this situation, assuming you don't need the A6000's extra VRAM. The broader point: from a purely economic standpoint, it's generally not advisable to pay for a higher-end card just to push your training along faster.
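If you want to sanity-check that math yourself, here's a back-of-the-envelope sketch. The hourly prices are placeholders (only the roughly 2:1 price ratio mentioned above matters), and the "75% of the raw core advantage" scaling is the rule of thumb from this section, not a benchmark result.

```python
# Back-of-the-envelope cost-efficiency check. Hourly prices are placeholders
# (only the 2:1 ratio from above matters); swap in current Secure Cloud rates.
a6000 = {"cuda_cores": 10752, "price_per_hr": 0.80}  # placeholder price
a5000 = {"cuda_cores": 8192, "price_per_hr": 0.40}   # placeholder price (half the A6000)

# Assume you realize ~75% of the raw core-count advantage, per the rule of thumb above.
raw_advantage = a6000["cuda_cores"] / a5000["cuda_cores"] - 1  # ~0.31
effective_speedup = 1 + 0.75 * raw_advantage                   # ~1.23x

# Relative cost to finish the same training job on each card.
a5000_job_cost = a5000["price_per_hr"]
a6000_job_cost = a6000["price_per_hr"] / effective_speedup

print(f"A5000: {a5000_job_cost:.2f}/job, A6000: {a6000_job_cost:.2f}/job")
# With these placeholder prices the A6000 finishes ~23% sooner but costs ~62%
# more per finished job, so the A5000 wins unless you need the extra VRAM.
```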

So it's not that GPU speed doesn't matter - it's that raw hardware speed matters far less than the techniques you apply. VRAM should be the focus of your decision, which leads us to...

Understanding the VRAM Equation

During inference, you primarily need memory for the model parameters and a small amount for activations. However, fine-tuning introduces several additional memory-hungry components:

The Core Components

  • Model Parameters: The foundation of your LLM. In FP32, each parameter requires 4 bytes; in FP16, 2 bytes.
  • Optimizer States: Often the largest memory consumer. Traditional optimizers like AdamW maintain multiple additional copies of the parameters (momentum and variance estimates, typically in FP32).
  • Gradients: Required for backpropagation. These usually match the precision of your model weights, adding another full copy of the parameters to your VRAM requirements.
  • Activations: The wild card in the equation. These vary with batch size, sequence length, and model architecture, and they can be managed through techniques like gradient checkpointing.

All told, these components add up to the ~16GB per 1B parameters rule of thumb. This seems like a lot - and it is - but you can reduce the footprint with implementations like flash attention (which might save you 10-20%). You can also use gradient checkpointing, which even by pessimistic estimates can cut your VRAM usage by half while lengthening your training run by 20-30%. That extra run time is easily compensated for by the fact that you can train on a lower GPU spec, or in a pod with fewer GPUs, along with other clawed-back gains such as lower overhead from needing fewer physical cards for the job.
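To make that rule of thumb concrete, here's a rough estimator sketch. The per-parameter byte counts assume mixed-precision training with AdamW (FP16 weights and gradients, FP32 master weights, two FP32 optimizer states); the activation overhead and savings factors are assumptions to tune for your own setup, and the checkpointing savings are applied to the activation term only, since that's the memory the technique actually reduces.

```python
def estimate_full_finetune_vram_gb(
    n_params_billion: float,
    activation_overhead: float = 0.25,  # assumed share for activations/buffers; tune for your run
    gradient_checkpointing: bool = False,
    flash_attention: bool = False,
) -> float:
    """Rough VRAM estimate for full fine-tuning with AdamW in mixed precision."""
    # Per-parameter bytes: 2 (FP16 weights) + 4 (FP32 master weights)
    #                    + 2 (FP16 gradients) + 8 (AdamW momentum + variance)
    bytes_per_param = 2 + 4 + 2 + 8  # = 16, i.e. the ~16GB per 1B rule
    base_gb = n_params_billion * bytes_per_param

    activations_gb = base_gb * activation_overhead
    if gradient_checkpointing:
        activations_gb *= 0.5   # pessimistic estimate: roughly half the activation memory
    if flash_attention:
        activations_gb *= 0.85  # assume ~15% savings, the midpoint of the 10-20% range

    return base_gb + activations_gb


print(estimate_full_finetune_vram_gb(7))                               # ~140 GB
print(estimate_full_finetune_vram_gb(7, gradient_checkpointing=True))  # ~126 GB
```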

Summing it all up

So, the 16GB per 1B rule of thumb gives you the maximum requirement for the job, which is a good starting point. If you're training a small, lightweight model like a 1.5B, something like gradient checkpointing can actually work against you: you won't save enough VRAM to drop from an A5000 to an A4000, so you just pay the increased training time for nothing. For larger models, however, it's almost always worth it, so long as you aren't under any pressing time constraints.
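One way to frame that decision is as a simple break-even check, sketched below. The 30% slowdown and the $0.40/hr figure are placeholder assumptions, not Secure Cloud quotes.

```python
# Break-even check: gradient checkpointing only pays off if it lets you drop to
# a GPU whose hourly price is low enough to absorb the slower run.
def break_even_price(current_price_per_hr: float, slowdown: float = 0.30) -> float:
    """Max hourly price the smaller GPU can cost and still save money overall."""
    return current_price_per_hr / (1 + slowdown)

# Placeholder example: if the card you'd otherwise need costs $0.40/hr, the
# cheaper card must come in under ~$0.31/hr - and checkpointing must actually
# free enough VRAM for the job to fit on it in the first place.
print(break_even_price(0.40))
```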

LoRA and QLoRA

If you're on a budget and are willing to manage the technical overhead of a few additional files, you can use LoRA and QLoRA to train at a deep discount. These methods freeze the original model weights and train only small low-rank decomposition matrices; instead of updating all parameters, you're updating just these small matrices. Let's look at what this means in practice (a code sketch follows each list below):

For a 7B parameter model with LoRA:

  • Original model stays frozen, requiring only about 14GB (in FP16)
  • LoRA matrices typically add less than 1% of the original parameter count
  • Optimizer states and gradients only needed for these small matrices
  • Total VRAM requirement drops to around 16-20GB
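Here's what that setup might look like with the Hugging Face transformers and peft libraries. This is a minimal sketch, not a tuned recipe: the base model name, rank, and target modules are illustrative assumptions.

```python
# Minimal LoRA setup sketch using Hugging Face transformers + peft.
# Model name, rank, and target modules are illustrative, not a tuned recipe.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # assumed 7B base model for illustration
    torch_dtype=torch.float16,            # frozen base weights in FP16 (~14GB)
)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank decomposition matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which projections receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```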

QLoRA takes this efficiency even further by combining LoRA with 4-bit quantization:

  • Original model is quantized to 4-bit precision, requiring only about 3.5GB
  • Still uses LoRA's efficient parameter updating
  • Keeps a small portion in higher precision for stability
  • Total VRAM requirement drops to around 8-10GB
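And the QLoRA variant of the same sketch, combining the LoRA adapters with 4-bit quantization via bitsandbytes. Again, the model name and hyperparameters are illustrative assumptions rather than a recommended configuration.

```python
# The same sketch, QLoRA-style: 4-bit base model via bitsandbytes plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as introduced in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in higher precision for stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # assumed 7B base model for illustration
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)  # casts norms/embeddings for stable k-bit training

model = get_peft_model(base, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
```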

This dramatic reduction in memory requirements has democratized LLM fine-tuning, making it possible to work with these models on even consumer-grade GPUs. For example, QLoRA enables fine-tuning of a 13B model on a single RTX 4090, which would be impossible with traditional methods.

When all is said and done, here's how those values might shake out across a range of model sizes. You can see how stark the differences get - for a 70B model, you could need either a 5xH200 pod or a humble A40, depending on the techniques used.

Method  Precision  1B     7B    14B    70B
Full    16-bit     10GB   67GB  134GB  672GB
LoRA    16-bit     2GB    15GB  30GB   146GB
QLoRA   8-bit      1.3GB  9GB   18GB   88GB
QLoRA   4-bit      0.7GB  5GB   10GB   46GB

Conclusion

So to sum it up - when training a model, the spec's raw speed matters far less than its VRAM and the memory-saving techniques you apply. It's almost always worth taking the speed and complexity tradeoffs in exchange for lower VRAM requirements - from there, it becomes a game of lowering that footprint as much as feasible, which means your run will cost you less in the end.