The RTX 5090 Is Here: Serve 65,000+ Tokens per Second on RunPod

RunPod customers can now access the NVIDIA RTX 5090, NVIDIA's newest GPU and a strong fit for real-time LLM inference. With high throughput and 32 GB of VRAM, the 5090 can serve small and mid-sized AI models at scale. Whether you're deploying high-concurrency chatbots, inference APIs, or multi-model backends, this next-gen GPU delivers substantial performance with plenty of headroom to grow.

Small Models, Huge Gains

In our internal benchmarks, the 5090 proved to be a game-changer for small language models. Qwen2-0.5B achieved over 65,000 tokens per second and 250+ requests per second at 1024 concurrent prompts—a 2–3x leap compared to previous-gen cards. Similarly, Phi-3-mini-4k-instruct delivered 6,400 tokens/sec and ~25 requests/sec under the same conditions.
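
If you're wondering how the tokens-per-second and requests-per-second figures relate: assuming the throughput numbers count both input and output tokens, each request in this setup processes 256 tokens (128 in + 128 out), so 65,000 tokens/sec works out to roughly 254 requests/sec and 6,400 tokens/sec to about 25, which is consistent with the request rates above.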

Even at lower batch sizes, both models maintained high throughput, with Qwen2-0.5B consistently outperforming Phi-3 in raw speed. It’s worth noting that these numbers represent high-concurrency performance, which is ideal for inference APIs and production deployments. As expected across the industry, single-prompt throughput will be lower—but for real-world workloads at scale, the 5090 sets a new bar.

Efficient, Scalable, and Ready for Production

All tests were run using vLLM, a high-performance inference engine optimized for large-batch, low-latency workloads. During our benchmarks, CPU usage averaged just 3.2%, confirming that the system was well balanced and GPU-bound. Reported VRAM usage sat at 95%, which reflects vLLM's configured memory pre-allocation (gpu_memory_utilization=0.95); actual usage was closer to 18 GB of the card's 32 GB.

What really stands out is the linear scaling we observed as batch sizes increased—particularly with Qwen2-0.5B. Even at 1024 concurrent prompts, there was no significant drop-off in efficiency. For customers running production endpoints that need to scale gracefully with demand, this kind of performance is a major win.
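
If you want to reproduce this scaling behavior yourself, the sketch below sweeps batch sizes with vLLM's offline LLM API and prints aggregate throughput. It's a minimal illustration, not our exact benchmark harness: the model ID, prompt, and batch sizes are assumptions you can swap out.

```python
import time
from vllm import LLM, SamplingParams

# Illustrative model and settings; swap in your own. Single GPU, no quantization.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.95)

sampling = SamplingParams(max_tokens=128, temperature=0.0)
prompt = "Summarize the benefits of serverless GPU inference in one paragraph."

# Sweep batch sizes and report aggregate output tokens/sec and requests/sec.
for batch_size in (32, 128, 512, 1024):
    start = time.perf_counter()
    outputs = llm.generate([prompt] * batch_size, sampling)
    elapsed = time.perf_counter() - start

    out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size:4d}  {out_tokens / elapsed:8.0f} out tok/s  "
          f"{batch_size / elapsed:6.1f} req/s")
```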

How We Tested

The benchmark setup was simple but representative of typical chat and instruction-following use cases (a configuration sketch follows the list). We used:

  • vLLM as the backend (tensor parallelism = 1, single GPU)
  • 128 input + 128 output tokens per prompt
  • Default precision (dtype: auto, no quantization)
  • RTX 5090 config: gpu_memory_utilization=0.95
  • Roughly 18 GB of the available 32 GB of VRAM in use during the run
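
Put together, that configuration looks roughly like the sketch below. It's a minimal illustration rather than our exact harness: the model ID is an example, the dummy prompt is trimmed to about 128 tokens (re-tokenization can shift the count slightly), and ignore_eos forces a full 128-token completion.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "microsoft/Phi-3-mini-4k-instruct"  # example model; swap in your own

# Single RTX 5090: tensor parallelism of 1, default precision, 95% of VRAM reserved.
llm = LLM(model=MODEL,
          dtype="auto",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.95)

# Build a roughly 128-token input to match the benchmark's request shape.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
input_ids = tokenizer("benchmark " * 300)["input_ids"][:128]
prompt = tokenizer.decode(input_ids, skip_special_tokens=True)

# Force a full 128-token completion so every request has the same cost.
sampling = SamplingParams(max_tokens=128, ignore_eos=True)

outputs = llm.generate([prompt] * 1024, sampling)  # 1024 concurrent prompts
print(f"Completed {len(outputs)} requests")
```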

This means there’s still room to experiment—with quantized models, larger architectures, or even multi-model deployments on the same GPU.
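
For instance, an AWQ-quantized mid-size model would fit comfortably in the remaining VRAM. A minimal sketch, assuming an AWQ checkpoint exists for the model you want (the checkpoint ID below is illustrative, not something we benchmarked):

```python
from vllm import LLM, SamplingParams

# Illustrative AWQ checkpoint; not one of the benchmarked models above.
llm = LLM(model="Qwen/Qwen2-7B-Instruct-AWQ",
          quantization="awq",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.95)

out = llm.generate(["Explain KV-cache reuse in two sentences."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```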

What This Means for You

For anyone running high-volume inference, the RTX 5090 on RunPod offers a sweet spot of speed, efficiency, and cost-effectiveness. With throughput surpassing 250 RPS on small models, it crushes latency bottlenecks while reducing the number of pods needed for a given workload. That translates directly to lower costs per request—a win for startups and scaled deployments alike.

Perhaps more importantly, the 5090 gives you flexibility. With only 18 GB of memory used during these benchmarks, there’s ample room to scale up to larger models or serve multiple models from a single GPU. Whether you're deploying lightweight agents, chatbots, copilots, or custom inference APIs, this is the hardware you want powering your production stack.


Try the 5090 Today on RunPod

The RTX 5090 is now available for on-demand and containerized workloads on RunPod. You can deploy your model in minutes using our prebuilt templates—or spin up a custom container with vLLM or your backend of choice.


Want to test your own models on the 5090?

Ready to deploy? Get started with inference containers or explore our vLLM template to try it yourself.
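
Once your pod is running, any OpenAI-compatible client can talk to it. A minimal sketch, assuming the pod exposes vLLM's OpenAI-compatible server (the endpoint URL, API key, and model name below are placeholders for your own deployment):

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute the URL and credentials for your own pod.
client = OpenAI(base_url="https://<your-pod-endpoint>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-0.5B-Instruct",  # whatever model your pod is serving
    messages=[{"role": "user", "content": "Give me one fun fact about GPUs."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```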

Let us know what you build—we’re excited to see what 65,000 tokens per second makes possible.