RunPod Blog
Moritz Wallawitsch
Introduction to vLLM and PagedAttention

What is vLLM? vLLM is an open-source LLM inference and serving engine built around PagedAttention, a novel memory-allocation algorithm for the KV cache. It can run your models with up to 24x higher throughput than HuggingFace Transformers (HF) and up to 3.5x higher throughput than HuggingFace Text Generation Inference (TGI).
31 May 2024 11 min read
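The core idea behind PagedAttention is to split each sequence's KV cache into fixed-size blocks that are allocated on demand, rather than reserving contiguous memory for the maximum sequence length up front. A minimal sketch of that allocation scheme, assuming a hypothetical `KVCacheAllocator` class (not vLLM's actual API):

```python
# Illustrative sketch of PagedAttention-style block allocation.
# KVCacheAllocator and its methods are hypothetical names, not vLLM's API.

class KVCacheAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per physical block
        self.free_blocks = list(range(num_blocks))
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve KV-cache space for one new token of a sequence.

        A physical block is taken from the free list only when the
        sequence crosses a block boundary, so memory grows on demand
        instead of being reserved for the maximum length up front.
        """
        blocks = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # block boundary: need a new block
            blocks.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

With a block size of 2, a 3-token sequence occupies only 2 blocks, and freeing the sequence returns them to the pool; this near-zero fragmentation is what lets vLLM batch many more concurrent sequences than engines that pre-allocate contiguous cache per request.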
RunPod Blog © 2025