16k Context LLM Models Now Available On RunPod
Hot on the heels of the 8192-token context SuperHOT model line, Panchovix has now released another set of models with an even larger context window, matching the 16384-token context possible in the latest version of text-generation-webui (Oobabooga). A context window this large is going to vastly improve performance in long, involved question-answer sessions and roleplay experiences. Here's what these models need to run successfully on the platform, since the widened context window comes with a few additional technical considerations.
VRAM Requirements
Depending on how much of the additional context window you need, you'll have to account for more VRAM than you're used to. For example, in my testing of the Panchovix/guanaco-33b-lxctx-PI-16384-LoRA-4bit-32g model, an empty context window used 55% of an A100's 80 GB of memory, which is about on par with a standard 2k-context 33b model. With a fully loaded 16k context window, though, usage spiked all the way up to 63%, meaning the extra context consumes around 6 GB of additional VRAM. If you've already been cutting it close on VRAM with your preferred model, it's something to keep in mind.
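If you want to sanity-check your headroom before loading one of these models, here's a minimal sketch using PyTorch's torch.cuda.mem_get_info. The ~6 GB overhead constant is just the spike observed in my testing above, not a universal figure, and the 44 GB example weight is derived from the 55%-of-80-GB measurement:

```python
import torch

# Extra VRAM observed going from an empty to a full 16k context window
# on a 33b 4-bit model; treat as a ballpark, not a guarantee.
EXTRA_CONTEXT_OVERHEAD_GB = 6

def has_headroom(required_model_gb: float, device: int = 0) -> bool:
    """Return True if the GPU has room for the model plus full-context overhead."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    free_gb = free_bytes / 1024**3
    needed_gb = required_model_gb + EXTRA_CONTEXT_OVERHEAD_GB
    print(f"GPU {device}: {free_gb:.1f} GB free, {needed_gb:.1f} GB needed")
    return free_gb >= needed_gb

# Example: the 33b 4-bit model above used ~44 GB loaded (55% of an 80 GB A100)
if has_headroom(44):
    print("Safe to load with a full 16k context window.")
```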
Higher perplexity
Perplexity is an objective measurement of how well an LLM predicts the next word given the context it has been provided; lower is better. A completely loaded context window means the model has far more material to attend to when producing acceptable results. This is fine if a model was originally built for that window and can be adjusted accordingly, but these are merges of existing models rather than brand-new models trained for 16k from scratch. That's not to say these merged models can't produce robust, impressive results, but it's a tradeoff to keep in mind when deciding whether the increased context will outweigh the drawbacks.
In my experience, for roleplay scenarios the boosted context is always worth the tradeoff. However, for short sentiment-analysis or question-answering tasks that don't require a lot of back and forth, and thus won't use that increased context window, you may be better off with the base model for a less "diluted" result. It all depends on your particular needs; it may be worth keeping both models in your toolbox and switching between them depending on whether the extra context applies to your scenario.
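If you want to quantify that tradeoff on your own data, here's a rough sketch loosely following the standard Hugging Face perplexity recipe: score a held-out text file with both the base model and the 16k merge and compare. The file name is a placeholder, and averaging loss across chunks is an approximation (a sliding window is more accurate but slower):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str, max_len: int = 2048) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls = []
    # Score in non-overlapping chunks; passing labels makes the model
    # return the shifted next-token loss for the chunk.
    for start in range(0, ids.size(1), max_len):
        chunk = ids[:, start : start + max_len]
        if chunk.size(1) < 2:  # can't compute a shifted loss on one token
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        nlls.append(out.loss)
    return torch.exp(torch.stack(nlls).mean()).item()

sample = open("held_out_text.txt").read()  # hypothetical evaluation file
print("16k merge:", perplexity("Panchovix/guanaco-33b-lxctx-PI-16384-LoRA-fp16", sample))
```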
Why the increased context window is important
Up until now, the vast majority of accessible LLMs that can run on a local PC or on RunPod hardware have been limited to a 2k context window. For a point of reference on how little that is: by this point in the article, an LLM outputting this text would have already used more than a quarter of a 2k window. Tack on additional context needs for other use cases, such as character sheets and speech examples for roleplay scenarios or instructions for a question-answering task, and you can see how quickly that window fills up. Once it does, the model begins forgetting the earliest things it said, and any further answers it gives will be suspect because the context they depend on has fallen out of the window.
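You can check for yourself how much of a window your own prompts consume by tokenizing them and comparing against the window size. A quick sketch, where the input file is hypothetical and any LLaMA-family tokenizer would work:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Panchovix/guanaco-33b-lxctx-PI-16384-LoRA-fp16"
)

# Hypothetical file holding a character sheet plus chat history
prompt = open("character_sheet_and_history.txt").read()
n_tokens = len(tokenizer(prompt).input_ids)

for window in (2048, 16384):
    print(f"{n_tokens} tokens = {n_tokens / window:.0%} of a {window}-token window")
```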
List of available models
Here's the list of 16k context models currently available from Panchovix (see the loading sketch after the list):
Panchovix/Wizard-Vicuna-30B-Uncensored-lxctx-PI-16384-LoRA-4bit-32g
Panchovix/guanaco-33b-lxctx-PI-16384-LoRA-4bit-32g
Panchovix/guanaco-33b-lxctx-PI-16384-LoRA-fp16
Panchovix/GPlatty-30B-lxctx-PI-16384-LoRA-fp16
Panchovix/Wizard-Vicuna-30B-Uncensored-lxctx-PI-16384-LoRA-fp16
Panchovix/airoboros-33b-gpt4-1.2-lxctx-PI-16384-LoRA-fp16
Panchovix/tulu-30B-lxctx-PI-16384-LoRA-fp16
Panchovix/GPlatty-30B-lxctx-PI-16384-LoRA-4bit-32g
Panchovix/airoboros-33b-gpt4-1.2-lxctx-PI-16384-LoRA-4bit-32g
Panchovix/tulu-30B-lxctx-PI-16384-LoRA-4bit-32g
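The "PI-16384" in these names indicates linear position interpolation out to 16384 tokens, i.e. a scaling factor of 8 over the 2048-token base. Here's a sketch of loading one of the fp16 variants with Hugging Face transformers, assuming a release recent enough to support the rope_scaling option for LLaMA models; whether a given model already ships this setting in its config is an assumption, so it's passed explicitly here. (In text-generation-webui, the equivalent is setting the maximum sequence length to 16384 and the positional-embedding compression factor to 8 in the loader settings.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Panchovix/guanaco-33b-lxctx-PI-16384-LoRA-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    # Linear position interpolation: 2048 * 8 = 16384 tokens
    rope_scaling={"type": "linear", "factor": 8.0},
)
```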
Questions?
Feel free to reach out to us over Discord, chat, or email if you need any help!