September 6th was a momentous day in large language model history, as Falcon 180B was released by the Technology Innovation Institute. To date, this is the single largest open-source LLM released to the public (edging out BLOOM-176B from 2022). For quite some time, whether because of technical concerns or simply market forces, open-source LLMs had a hard time breaking the 70B barrier, so it was surprising, though certainly not unwelcome, to see such a massive model unleashed to the public.
It's been reported that Falcon 180B has surpassed Llama-2 70B (no slouch in its own right) on the Hugging Face leaderboard. Having tested it myself in creative writing and roleplay exercises, I can add a more qualitative impression: Falcon 180B is excellent at avoiding the confusion and boredom traps that are more endemic to smaller models. It stays on task and remains coherent while providing unique output, even in situations where I tried to force boredom traps with inordinately harsh temperature and repetition penalty settings. In my experience, Falcon held up much better than 70B models designed for general use, or even 13B models like MythoMax that were specifically designed for roleplay.
You can get Falcon 180B from TII's Hugging Face repo. Be advised that it is a gated repo, though getting access is as simple as accepting their license agreement. Downloading gated models is a bit tricky: if you're using text-generation-webui, you'll need to supply the token from your Hugging Face account when downloading the model, e.g.:
HF_TOKEN="token_goes_here" python3 download-model.py tiiuae/falcon-180B-chat
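If you'd rather script the download yourself instead of going through download-model.py, something like the following should work. This is a minimal sketch, assuming the huggingface_hub library is installed and that your token is in the HF_TOKEN environment variable as in the command above; the local_dir path is a placeholder chosen for illustration.

```python
import os

REPO_ID = "tiiuae/falcon-180B-chat"  # same repo as in the command above


def resolve_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from the environment, failing early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a Hugging Face access token for an account "
            "that has accepted the model's license."
        )
    return token


def download_falcon(local_dir: str = "models/falcon-180B-chat") -> str:
    """Download every file in the gated repo and return the local folder path.

    local_dir is a hypothetical placeholder; point it wherever your loader
    expects to find models.
    """
    # Imported lazily so the token check can be used without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        token=resolve_token(),
        local_dir=local_dir,
    )
```

Calling download_falcon() kicks off the download. Checking the token up front means a missing or empty token fails immediately rather than partway into the transfer.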
Falcon 180B is quoted as requiring 400GB of VRAM for inference, which means you'll need at least five 80GB A100s before it gets off the ground in its original, unadulterated state. In my experience, five was not quite enough: the model would load, but the GPUs were sitting at 90% VRAM usage while idle, so I couldn't actually do anything with it. For workloads with any reasonable amount of context, you may need six or more. If you're going to load the model as is, A100s or H100s are your only reasonable choices. Theoretically, a pod with nine 48GB cards like L40s would also work, but I think spreading the load across that many cards would lead to an unusably slow tokens-per-second rate, since inference speed drops quite heavily as the model is spread over more physical cards, all other things being equal.
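The card counts above are just ceiling division on the quoted VRAM figure, and a quick sketch makes the arithmetic explicit. The 80GB A100 size and the 2 bytes per parameter (16-bit weights) are assumptions on my part; the 400GB figure is the quoted requirement.

```python
import math

QUOTED_VRAM_GB = 400  # quoted requirement for unquantized inference
A100_GB = 80          # assumption: the 80GB A100 variant
L40_GB = 48


def cards_needed(required_gb: float, card_gb: float) -> int:
    """Smallest number of cards whose combined VRAM covers required_gb."""
    return math.ceil(required_gb / card_gb)


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Size of the weights alone, in GB (1 billion params x 1 byte = 1 GB)."""
    return params_billion * bytes_per_param


print(cards_needed(QUOTED_VRAM_GB, A100_GB))  # 5
print(cards_needed(QUOTED_VRAM_GB, L40_GB))   # 9
print(weights_gb(180, 2))                     # 360 (assumes 16-bit weights)
```

Under those assumptions the weights alone come to 360GB, which is 90% of the 400GB that five A100s provide. That lines up with the idle usage described above and shows why five cards leave almost nothing for context, making six the realistic starting point.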
A pod like that would run you about $10/hr just to load the model, which might be prohibitively expensive for casual use cases. Fortunately, there are some other options.
Setting the model to load-in-4bit on the Model page in text-generation-webui will get it to load on just two A100s, or on four L40s or other 48GB cards. On two A100s, I got a text generation speed of about 4 t/s in text-generation-webui with Transformers, which is still pretty slow, but within the realm of usability for creative writing if you have patience.
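For anyone working outside the webui, loading in 4-bit directly through Transformers looks roughly like the sketch below. It assumes torch, transformers, accelerate, and bitsandbytes are installed and that you are already authenticated for the gated repo. It is an approximation of what the checkbox does, not necessarily the exact settings the webui applies.

```python
MODEL_ID = "tiiuae/falcon-180B-chat"


def load_falcon_4bit(model_id: str = MODEL_ID):
    """Load the tokenizer and a 4-bit quantized model spread across visible GPUs."""
    # Imported lazily: these are heavy dependencies and need a GPU environment.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # Quantize weights to 4-bit on load; compute in bfloat16 (an assumed choice).
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers across the available cards
    )
    return tokenizer, model
```

Calling load_falcon_4bit() still downloads the full-precision weights first, so disk space and download time are unchanged; only the VRAM footprint shrinks.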
Alternate quantizations: TheBloke has GGUF and GPTQ quantizations available. I wasn't able to get the GPTQ quant working, as it appears AutoGPTQ does not support Falcon yet (or so I was told by the error message I got when I tried). The GGUF quantization, however, should easily fit on a single A100 by offloading part of the model to the CPU and system memory of the pod.
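To make the CPU offloading idea concrete, here is a rough sketch using llama-cpp-python, which is one way (not the only one) to run GGUF files. The helper assumes every layer is about the same size, which is only approximately true, and the file size, layer count, and reserve figures in the example call are hypothetical placeholders rather than measurements of any particular quant.

```python
def gpu_layers_that_fit(
    file_gb: float, n_layers: int, vram_gb: float, reserve_gb: float = 8.0
) -> int:
    """Estimate how many layers fit in VRAM, leaving reserve_gb free for context."""
    per_layer_gb = file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


def load_gguf(model_path: str, n_gpu_layers: int, n_ctx: int = 2048):
    """Load a GGUF file, keeping n_gpu_layers on the GPU and the rest on the CPU."""
    # Imported lazily; requires llama-cpp-python built with GPU support.
    from llama_cpp import Llama

    return Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, n_ctx=n_ctx)


# Hypothetical numbers: a 100GB file with 80 layers on one 80GB card.
print(gpu_layers_that_fit(file_gb=100.0, n_layers=80, vram_gb=80.0))  # 57
```

Whatever doesn't fit on the card runs from system memory, so expect generation to slow down as fewer layers land on the GPU.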
Falcon 180B is an absolute beast of a model, and it's not often that the community has been able to get its hands on something so large. As such, we're still in something of an experimental phase, working out how best to handle such a huge model. It is, however, very exciting to see open options climbing closer and closer to the gargantuan sizes of closed-source models like OpenAI's.