Master the Art of Serverless Scaling: Optimize Performance and Costs on RunPod

In many sports – golf, baseball, tennis, among others – there is a "sweet spot" to aim for that produces the maximum lift or distance for the ball from the same amount of kinetic energy in the swing. An imperfect swing will still get you somewhere, just not as far as you had hoped.

It turns out that scaling your workload on RunPod Serverless isn't much different. Missing the mark on one side leads to unnecessary spend, while missing on the other leads to a degraded user experience. Several variables go into using your serverless spend efficiently – so let's walk through what it takes to develop your strategy.

Imagine that you are a developer tasked with creating a chatbot to field technical questions before they're forwarded to your support team. You're already saving quite a bit by having a RunPod-hosted LLM handle those as opposed to your support associates, but the real question is: could you save even more by tuning your worker scaling strategy?

Why scaling matters

While you could simply rent a GPU pod for the job – and in some cases that is preferable, mostly for tasks where you know you will be utilizing the GPU at full blast for hours on end, such as training a model – serverless scaling is a far more efficient use of resources for inference. It's rare for user-driven demand to arrive in nice, contiguous blocks that can be fed efficiently into a GPU pod. In reality, you'll often have stretches of time where the GPU sits idle, which is wasted spend.

Enter serverless AI functions, which simply use a worker to ingest a job, spend a few seconds (or minutes, if needed) on it, return your results, and shut down when complete. This ensures you only pay for the time you actually use, whereas with pods you pay for the hardware time whether it is being utilized or not.
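To make that concrete, here is a minimal sketch of what such a worker looks like with the RunPod Python SDK; the prompt handling and canned response below are placeholders rather than a real chatbot implementation:

# handler.py – minimal sketch of a serverless worker (assumes the runpod Python SDK is installed)
import runpod

def handler(job):
    # The job input is whatever your client sent to the endpoint,
    # e.g. {"input": {"prompt": "How do I reset my password?"}}
    prompt = job["input"].get("prompt", "")

    # Placeholder for your actual inference call (vLLM, transformers, etc.)
    answer = f"(model output for: {prompt})"

    # Whatever you return here is delivered as the job's output; once the
    # queue is empty the worker is free to shut down.
    return {"answer": answer}

runpod.serverless.start({"handler": handler})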

One of the most compelling examples is a chatbot application like the use case mentioned above. Because most of the elapsed time in a conversation is the user reading the AI's response, thinking it over, and typing a reply, the GPU sits idle for the vast majority of a session. For AI chatbot use, serverless could represent an 80% cost savings or more compared to using a pod server. dannysemi has provided a serverless proxy worker running on FastAPI and vLLM that enables chatbot use on SillyTavern and similar applications. Wing Lian also has an article on Medium demonstrating how you can use quantized models on RunPod to optimize response times, so your workers get the job done even faster.
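To put a rough number on that claim for your own workload, compare what a pod bills around the clock against what serverless bills for GPU-busy seconds only. The prices and utilization figure below are made-up placeholders – plug in your own endpoint's rates and traffic:

# Back-of-the-envelope comparison of pod vs. serverless cost for a chatbot.
# All numbers here are illustrative assumptions, not RunPod's actual pricing.
POD_PRICE_PER_HOUR = 0.50        # hypothetical cost of keeping a pod running
FLEX_PRICE_PER_SECOND = 0.00025  # hypothetical flex worker rate
HOURS_ONLINE = 24                # the pod is billed around the clock
GPU_BUSY_FRACTION = 0.10         # chatbots spend most of the time waiting on the user

pod_cost = POD_PRICE_PER_HOUR * HOURS_ONLINE
busy_seconds = HOURS_ONLINE * 3600 * GPU_BUSY_FRACTION
serverless_cost = busy_seconds * FLEX_PRICE_PER_SECOND

print(f"Pod:        ${pod_cost:.2f}/day")
print(f"Serverless: ${serverless_cost:.2f}/day")
print(f"Savings:    {100 * (1 - serverless_cost / pod_cost):.0f}%")

With these illustrative numbers the savings come out to roughly 82%, in line with the ballpark above; your own figures will vary with utilization and pricing.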

Tolerance of delays in your use case

Before you alter your strategy at all, consider your serverless use case. Are you serving results to a general public that expects them immediately? According to Google, 53% of mobile visits are abandoned if a site takes more than three seconds to load. If your use case is about attracting new users into your funnel, you may not want to risk any cold starts at all. On the other hand, if you are running inference on inputs whose results will be collected on a server for later human review, a slight delay is probably not a big deal. Either way, you will want to define a service level agreement with your users so that everyone is on the same page – without one, you are essentially scrambling in the dark without a plan. Once you have a specific target to aim for, you can proceed. In the case of a support chatbot, you'll need fairly immediate, if not immediate, results; there may be some tolerance (perhaps a couple of seconds), but excessive cold starting can cause delays that leave a bad impression.

In this entry, we'll go through the variables that go into your serverless strategy, along with their primary advantage, disadvantage, and example use case.

Active Workers

Active workers in serverless will bill for every second the worker is active, but at a 40% discount compared to a flex worker.

The primary advantage of active workers is that they are spun up and ready to accept work immediately - no cold starts, no delays, they are just ready to go.

The primary disadvantage is that they bill whether or not they have anything to work on, so any funds spent on an idle active worker are not devoted to actual inference time, only providing availability.

The use case for active workers is handling your baseline workload. Note that you can adjust your number of active workers at any time, either by editing the endpoint manually or through the GraphQL API.

You can send a GraphQL API call to update your number of active workers – these calls can be automated to fire at points in the day when you know your demand levels will change. Further examples can be found in our docs.

You might find cron or Postman useful for kicking off these worker updates at certain points in the day – for example, if support requests for your chatbot drop off after the end of the local business day, there's no need to have all of those workers running overnight.

# Example: update an endpoint's worker configuration via the GraphQL API
# (substitute your own endpoint id, template id, and API key)
curl --request POST \
  --header 'content-type: application/json' \
  --url "https://api.runpod.io/graphql?api_key=${YOUR_API_KEY}" \
  --data '{"query": "mutation { saveEndpoint(input: { id: \"i02xupws21hp6i\", gpuIds: \"AMPERE_16\", name: \"Generated Endpoint -fb\", templateId: \"xkhgg72fuo\", workersMax: 0 }) { id gpuIds name templateId workersMax } }"}'

Flex Workers

Flex workers spin up on-demand to handle incoming workloads when all of your active workers are currently occupied with tasks.

The primary advantage of flex workers is that they are convenient and able to handle an essentially infinite amount of additional work if needed. They do their job and shut down after a few seconds if no further work is received.

The primary disadvantage is that they are more expensive for inference time than active workers.

The use case of flex workers is to reduce friction for users and prevent churn (such as the user feeling that the service is slow or unresponsive and moving on to a competitor). Think of flex workers as your "goalkeeper" to prevent unexpected spikes in demand from impacting your users – especially users who are contacting your support for help and may already be carrying a certain level of frustration.

Idle Timeout

Idle timeout keeps a flex worker running and ready for new tasks after it has finished its current job.

The primary advantage of idle timeout is that it keeps you poised to handle workload spikes with your flex workers, instead of needing to spin up additional workers.

The primary disadvantage is that an idle flex worker has the greatest potential for waste: not only do you miss the discount you would get from an active worker, but the worker isn't accomplishing any useful work while it waits.

The use case of an idle timeout is similar to flex workers, helping to boost availability for your end users by keeping workers warm and available for further work.
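If you want to sanity-check what an idle timeout could cost you, the arithmetic is simple. The rate, timeout, and job count below are placeholder assumptions:

# Rough cost of the idle window at the end of each flex job (illustrative numbers only).
FLEX_PRICE_PER_SECOND = 0.00025   # hypothetical flex worker rate
IDLE_TIMEOUT_SECONDS = 5          # how long a worker waits for more work
JOBS_PER_DAY = 2000               # jobs that could end with an idle window

worst_case_idle_cost = FLEX_PRICE_PER_SECOND * IDLE_TIMEOUT_SECONDS * JOBS_PER_DAY
print(f"Worst-case daily idle spend: ${worst_case_idle_cost:.2f}")

This is a worst case: in practice, a new job often arrives before the timeout expires, which is exactly the availability you are paying for.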

Scale Type

When scaling your number of workers dynamically, you have two options: queue delay or request count. The purpose of these scale types is to grow your worker count smoothly so that workers are utilized efficiently, rather than simply adding a pile of extra workers that may never see more than one job.

Queue Delay adds additional workers after jobs have been waiting for a certain number of seconds.

Request Count adds additional workers based on the total number of jobs currently queued and in progress.

Request Count is something of a moving target based on how many jobs you currently have active, which makes it more challenging to use. However, it generally leads to more efficient use of resources. Queue Delay will likely be perceived as a smoother user experience.
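One way to build intuition for the difference: under a request-count policy, the desired worker count grows with outstanding requests, while under a queue-delay policy a worker is only added once requests have waited past a threshold. The formulas and thresholds below are illustrative assumptions, not RunPod's exact implementation:

import math

# Illustrative scaling policies (not RunPod's exact implementation).
def request_count_workers(outstanding_requests: int, target_per_worker: int = 4) -> int:
    # Scale workers with the number of queued + in-progress requests.
    return math.ceil(outstanding_requests / target_per_worker)

def queue_delay_should_scale(oldest_wait_seconds: float, delay_threshold: float = 4.0) -> bool:
    # Add a worker only once the oldest queued request has waited past the threshold.
    return oldest_wait_seconds > delay_threshold

print(request_count_workers(10))        # -> 3 workers for 10 outstanding requests
print(queue_delay_should_scale(6.2))    # -> True: a request has waited long enough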

What about Flashboot?

We go into Flashboot here and you can learn more about how it works – but long story short, there's no real reason not to keep it enabled. It doesn't cost anything, and it can easily reduce cold starts to a second or less. The only real caveat is that how much it improves your situation depends on a number of factors, such as how popular the image you're using is; images that get used more get cached more. It's also not a guarantee that a worker will be booted by a certain time, so while it's a boon for reducing overall delays, don't rely on it alone for a process that absolutely must be up and running by a deadline, because edge cases can still slip through. For a chatbot, though, this is exactly the use case Flashboot was designed for, and it will perform quite handily.

So what is right for your use case?

As mentioned above, the first thing is to find out what your service level agreement is with your users. Even if you don't have one defined, you still have an unspoken SLA governed by user behavior: the point at which your users start to feel dissatisfied with your service, even if they never express it outwardly, is effectively where that de facto service level agreement breaks.

While you could allocate enough workers to guarantee zero delays for any user at any point, doing so could be prohibitively expensive – and may not even be something your users actually expect.

Optimally, you would want to listen to and communicate with your users. Set up a poll or survey to find out how long a delay they would be willing to tolerate, and then tweak your worker counts and strategy to meet those expectations.

If you need an overview of serverless functions, check out Getting Started with RunPod Serverless – community member Ashley Kleynhans has provided an eight-minute rundown that gives a bird's-eye view of what you'll need to begin testing.

Getting Started

So now that you've learned about how to optimize RunPod Serverless for chatbot use, how might you demonstrate the value to your organization, and how much might you save using serverless compared to your previous implementation?

If you haven't begun exploring serverless yet, you may be interested in our vLLM Quick Deploy feature, which guides you through setting up an endpoint in just a few clicks. We also have a vibrant community of developers on our Discord where you can share your own story, seek feedback, and ask questions – feel free to join us here!

RunPod has many additional tools to help you get started with running serverless functions, and we're adding more all the time. It's never been easier to run serverless functions on the platform. Check out the following docs and community contributions: