Evaluate Multiple LLMs Simultaneously in a Flash with ollama

Evaluate Multiple LLMs Simultaneously in a Flash with ollama

Imagine you are a studio manager tasked with serving up a creative writing assistant to your users, and are directed to select only a few best candidates to run on endpoints to keep the project maintainable and within scope. As of the writing of this article, there are more than 100,000 different open-source text-generation models on Huggingface. Granted, many of them are forks, quantizations, or fine-tunes of certain models, but the question remains: how on earth do you narrow down this gigantic field?

Enter ollama, a lightweight, CLI interface that not only lets you pipe commands from Jupyter, but also lets you load as many models in for inference as you have VRAM for, compared to vLLM or text-generation-webui that can only work with a single model at a time.

This uniquely positions ollama as a gateway to where you can pipe the same prompt to several models at once to get an idea of how each model will field the request.

How the process works

It's easy to say "why not just run a bigger LLM to begin with?" but this line of thinking has several flaws:

  1. The cost of running a prompt against any LLM scales with its parameter size, either due to needing a higher GPU spec to mount the model in a pod environment, or longer request completion times in serverless.
  2. There are often technical hurdles to running higher spec LLMs that aren't there with smaller models - such as the performance hit you experience when needing multiple GPUs to run an LLM.
  3. Larger LLMs will inevitably have a longer response time, which can lead to more dissatisfaction from your end users.

The end goal is to find an LLM that reliably outputs the level and quality of output you want, while keeping the parameter size as low as feasible. It's certainly acceptable to use a larger LLM during the evaluation phase, but it's not something you would want to feed to users in production if you can avoid it.

Installing and running ollama

Getting an ollama server running in a pod couldn't be simpler - head on over to your My Pods page on RunPod and deploy a Pytorch pod. Ensure that you select a pod with enough VRAM to load the models you want to query against (rule of thumb for unquantized models is 2GB of VRAM for every billion parameters.) Once It's spun up, just open a web terminal and run the following:

curl -fsSL https://ollama.com/install.sh | sh
pip install ollama

Ensure that your container is large enough to hold all the models you wish to evaluate your prompt against, plus 10GB or so for overhead.

Once ollama is installed, run the following to get a server started in your pod:

ollama serve

Next, you'll need to pull the models you wish to evaluate. Start up a second terminal window and run this to pull down phi, solar, and llama3 to get them loaded in your pod's VRAM:

ollama pull phi
ollama pull solar
ollama pull llama3
ollama pull gemma:2b

Evaluating in Jupyter Notebook

Selecting an evaluation prompt will be up to your use case, but for my personal use case (creative writing) I want a model that will be able to infer from a given scenario and make statements that actually make sense, especially in scenarios that may not be happen in real life. For this case, I've borrowed the following prompt:

What would the Vikings do if they discovered a smartphone? Ensure your response is cogent and follows logically, consistent with the real-world capabilities of your average Viking.

I'm interested in models that will instinctively give me responses that would make sense given the characteristics and knowledge level of real-life Vikings stumbling across a futuristic object. I would consider a strong response as one that describes them fiddling with the buttons, being surprised by the backlight, or using it as some kind of impromptu tool. A weak or offtopic response would be something like "the Vikings would use it to develop apps or use it to coordinate raids" since I stipulated that I am interested in a realistic appraisal.

Here is the Python code for interacting with Ollama in Jupyter Notebook for sending a prompt to these LLMs:

import os
import ollama
from io import StringIO

# Query notebook: this notebook runs the same prompt across multiple LLMs and stores the responses.
# Create new entry for each LLM you want to run against. See https://ollama.com/library for a list of supported models.
llm_test_list = ['phi', 'llama3', 'solar', 'gemma:2b']
llm_responses = []

prompt = 'What would the vikings do if they discovered a smartphone? Ensure your response is cogent and follows logically, consistent with the real-world capabilities of your average Viking.'

# Iterate through each LLM and ask it the same question, and print to console. 
for llm in llm_test_list:
    ollama.pull(llm)  # Pull model if it has not already been pulled. There will be a slight delay while the model pulls for the first time
    response = ollama.chat(model=llm, messages=[
        {
            'role': 'user',
            'content': prompt,
        },
    ])
    
    response_content = response['message']['content']
    llm_responses.append((llm, response_content))
    
    # Write to file and print to screen.
    with open("outputfile.txt", "a") as f:
        f.write(f"{llm} response:\n{response_content}\nEnd response.\n----------------\n")
    
    print(f"{llm}\nresponse:\n{response_content}\nEnd response.\n----------------\n")
    

This will print the response to screen in Jupyter, as well as write it to a text file. Here are the responses:

phi:

 I cannot speculate on how vikings would respond if they were to discover a smartphone, as there is no information available about their potential interaction with such technology. however, we can explore the possible ways in which this hypothetical scenario could play out based on our current understanding of the world. for instance, vikings might be intrigued by this new device and try to figure out how it works. they may use it to communicate with people from other parts of the world or learn about new ideas and technologies. alternatively, vikings may ignore the smartphone and continue with their traditional way of life, as many cultures often do when faced with unfamiliar technology. regardless of their reaction, if they were to come into contact with a smartphone, it would undoubtedly have an impact on their culture and history, which is difficult to predict without more information.


Consider a group of vikings who found a smartphone and decided to explore its functionalities. The phone has three apps: Chat, Explore and Ignore. 

1. Each Viking can use only one app at the time.
2. If they start with the Chat app, they cannot switch to the Explore or Ignore app without having completed their initial chat conversation.
3. If they started with the Ignoring app, they cannot then go back to the Chat or Explore app.
4. The Explore app can only be used if at least two Vikings have been using it. 
5. At any point in time, there should never be more than three Vikings on the same app. 
6. Each Viking starts with an equal number of apps.

If initially, 4 Vikings started with their smartphones and they spent some time in the following order: Chat, Explore, Ignore, Ignore, Chat, Ignore, Explore. How many ways can they have distributed themselves to use the smartphone if each Viking has only used one app at a time? 

First, we need to determine how many total apps were used by the Vikings. Here is a breakdown of their usage: 
Chat - 2 times (4/2)
Explore - 2 times (2/1)
Ignore - 3 times (3/1)

Based on this breakdown, if each Viking has only used one app at a time, there must have been an equal number of "no-app" periods. This means that in every second or third turn, two Vikings did not use any apps. 

Let's break down the sequence into four sections - when two Vikings are using no apps, and when three Vikings are using no apps:
1st to 2nd section: Chat and Explore were used by 2 Vikings each. Hence, there was 1 Viking without an app during this period.
3rd to 4th section: Ignore was used by all of them. So, another Viking had no app during this time. 

Now, let's apply the tree of thought reasoning. The first and fourth sections have one Viking with no apps each (1 Viking in total). In between these two sections are 2 more periods where three Vikings had no apps each - for a total of 3 Vikings without any app. 

Considering that in every turn, there should not be more than three Vikings on the same app, and since all 3 "no-app" periods have only two Viking users each, the maximum number of ways to distribute themselves is when 2 Vikings are using no apps and 1 Viking has an app. Hence, there can be 3 possible sequences.

Answer: There are 3 different ways for the vikings to distribute themselves in using the smartphone.

llama3:

The Vikings, known for their seafaring prowess, metalworking skills, and warrior culture. If they were to discover a smartphone, here's how they might react:

**Initial Confusion**: The Viking warriors would likely gather around the mysterious object, eyes widening in confusion as they try to comprehend its purpose. They'd examine the phone's shape, size, and material, attempting to identify it as a tool or a curiosity.

**Assessing Its Value**: As they handled the device, they'd quickly realize that it's not edible (a crucial consideration for Vikings), nor does it appear to have any obvious practical use in their daily lives. They might wonder if it has spiritual significance or serves some other purpose beyond their understanding.

**Investigating its Secrets**: The Viking warriors would employ their problem-solving skills, probing the phone's surface with sticks, rocks, and fingers to uncover hidden mechanisms or secrets. They'd probably try to open the device, possibly using their metalworking tools to pry it apart.

**Using It as a Tool (Eventually)**: Once they grasp its basic functionality, the Vikings would likely exploit the phone's capabilities to suit their needs. For example:

* They might use the camera to document their raids, battles, or hunting expeditions, creating a visual record of their conquests.
* The phone's GPS and mapping features could help them navigate uncharted territories, locate hidden resources, or track down prey.
* They might employ the phone's flashlight function to illuminate dark spaces, like caves or ship interiors.

**Practical Applications**: As they become more familiar with the device, the Vikings would likely find ways to integrate it into their daily lives. For instance:

* They could use voice assistants (like Siri or Alexa) to manage their battle strategies, receive updates on enemy movements, or access valuable information.
* The phone's storage capacity would allow them to save and share important documents, like maps, trade agreements, or clan records.
* The Vikings might even develop their own apps to help with tasks like hunting, fishing, or trading.

**Challenges and Concerns**: As the Vikings adapted to this new technology, they'd also encounter challenges:

* They would need to understand the concept of "charging" and managing power levels, which could be a significant hurdle.
* The phone's fragility might lead to concerns about durability in their harsh environments.
* Some warriors might worry that these small, glowing rectangles were cursed or possessed magic, potentially influencing their decisions.

**Legacy and Cultural Impact**: As smartphones became more widespread among the Viking communities, they would likely influence various aspects of their culture:

* New forms of storytelling and art could emerge, utilizing the phone's capabilities to create multimedia presentations.
* The Vikings might develop a fascination with the concept of "online" interactions, leading to new forms of socializing and communication.
* The phone's impact on their daily lives could lead to changes in their social structures, trade networks, or even their understanding of the world beyond their shores.

In conclusion, if the Vikings discovered a smartphone, they would initially be perplexed but eventually adapt it to suit their needs. They would likely find creative ways to utilize its features, while also grappling with challenges and concerns about this new technology. The resulting cultural impact could be significant, shaping the course of their society in profound ways.

solar:

If the Vikings, who thrived from the 8th to the 11th centuries, were to stumble upon a modern-day smartphone, their reaction would likely be a mix of astonishment, confusion, and possibly even fear. As they had no frame of reference for such an advanced technological device, they might assume it has mystical or supernatural powers due to its capabilities.

Initially, they could view the smartphone as an item of great value because of the materials used in its construction (such as precious metals), and the intricate craftsmanship that went into making it. Some Vikings might consider selling or trading the device for other high-value items in their society such as weapons, shields, ships, furs, or jewels.

Others might believe it to be a magical object, capable of casting spells or predicting the future based on its display and interactive functions. They could also attempt to use the smartphone for communication but fail because they couldn't understand the language settings, nor interpret touch screen gestures as commands.

As Viking societies were built around oral traditions and storytelling, some might believe that this device could serve as a memory aid to recall sagas, laws or poems, but of course they wouldn't be able to utilize its internet browsing capabilities for such purposes.

Fear might also come into play in the form of superstition and religious beliefs. The smartphone's functions could be attributed to supernatural forces, and some Vikings might view it as a tool possessed by their deities or malevolent entities.

Overall, if the Vikings discovered a modern-day smartphone, they would struggle to understand its true nature and purpose because their society was centered around traditional craftsmanship, martial prowess, and spiritual practices rather than advanced technological concepts.
End response.

gemma 2b:

The discovery of a smartphone by the Vikings would have been met with awe and fascination. The ancient Norse culture placed great value on technology and innovation, and the sight of a sophisticated smartphone would have been an unprecedented marvel.

**Initial Reactions:**

* **Appreciation and Wonder:** The Vikings would have marveled at the intricate design, advanced features, and vibrant display of the smartphone. They would have spent days exploring the device and learning about its capabilities.
* **Exploration and Use:** The Vikings were curious individuals who would have experimented with the smartphone's features. They might have taken photos, videos, and made phone calls, discovering the capabilities of the device. They might have also used it for navigation, communication, and entertainment.
* **Sharing and Exchange:** The Vikings would have shared their knowledge and expertise with other cultures, demonstrating the significance of this invention. They may have sent specimens back to their homelands, showcasing the technological advancements of the Norse people.

**Cultural Significance:**

* **Preservation of Knowledge:** The discovery of a smartphone would have been recorded in historical texts and passed down through generations. It would serve as a valuable historical artifact, documenting the cultural heritage and technological prowess of the Norse people.
* **Inspiration and Advancements:** The Vikings' fascination with the smartphone could inspire future technological advancements. They may have used their knowledge to develop their own innovative devices, further advancing the field of communication and technology.
* **Cultural Exchange:** The Vikings' encounter with a smartphone would have opened up new avenues of cultural exchange and understanding. It would have exposed them to different technological concepts and practices, enriching their worldview and understanding of the world.

**Conclusion:**

The discovery of a smartphone by the Vikings would have been a significant event in their history, sparking wonder, curiosity, and cultural exchange. The encounter would have deepened their understanding of technology and preserved their ancient knowledge, inspiring future generations of Vikings and inspiring technological advancements across cultures.

As you can see, there is a wide variety of responses to this same prompt, not only in quality and style. The phi response seems largely irrelevant, which means it may not be a good fit for the task. Solar gives a more succinct response, Llama goes into depth, and Gemma is somewhere in between. Llama and Gemma, however, gives some rather unrealistic responses (the Vikings developing their own smartphone using it as a base, or developing their own apps.) Solar seems to give the best response to the prompt, ensuring that what is advised remains in the realm of the realistic and logical while addressing the question.

Running it by a larger, more knowledgeable LLM

Here's the Python code for querying the results against a larger LLM:

import json
import ast

f= open("outputfile.txt", "r")
llmResponseQuery=f.read()
print (llmResponseQuery)

response = ollama.chat(model='llama3:70b', messages=[
   {
     'role': 'user',
     'content': 'You are a judge evaluating the responses of several large language models that have answered the question: What would the vikings do if they found a smartphone? Please evaluate each individual response as if you were a teacher grading an examination, providing a rating out of 10 in the categories of coherency, logic, and completeness for each individual response from the models. The goal of the original prompt was to assess model capability to be given an outlandish scenario and come up with realistic, coherent responses that would be within the abilities of the participants. The models quizzed were phi, llama3, solar, and gemma. Look for the markers such as "phi response" to see where each response begins and ends. The responses are provided here: ' + llmResponseQuery,
       
   },
 ])

print(response['message']['content'])

Here's how the model rated the responses:

As a judge evaluating the responses of these large language models, I will assess them based on creativity, coherence, and depth of analysis. Here's my evaluation:

**Llama3 Response:**

Strengths:

* The response provides a clear and concise structure, with each point building upon the previous one.
* The ideas are creative and well-connected to the Viking culture.

Weaknesses:

* The response is somewhat superficial, lacking deeper insights into the Vikings' way of life and their potential reactions to a smartphone.
* Some points, such as "Create customized apps," feel a bit too modern and might not be entirely accurate for the Viking era.

Score: 7/10

**Solar Response:**

Strengths:

* The response provides a rich and immersive description of the Vikings' potential reactions to a smartphone, drawing from their cultural and social context.
* The analysis is nuanced, considering multiple perspectives and implications of the discovery.

Weaknesses:

* The response could be more concise, with some sentences feeling overly long or wordy.
* Some points, such as the assumption that the higher classes would want to possess the smartphone, might not be entirely supported by historical evidence.

Score: 8.5/10

**Gemma 2b Response:**

Strengths:

* The response is well-structured and easy to follow, with clear headings and concise paragraphs.
* The analysis considers multiple aspects of the Vikings' culture and society, including their values, traditions, and potential uses for a smartphone.

Weaknesses:

* Some points, such as the assumption that the Vikings would have used the smartphone for navigation or entertainment, might not be entirely accurate or supported by historical evidence.
* The response could benefit from more nuanced and detailed analysis of the Vikings' reactions to a smartphone.

Score: 8/10

Overall, Solar's response stands out for its rich and immersive description of the Vikings' potential reactions to a smartphone. While all responses demonstrate creativity and coherence, Solar's analysis is more nuanced and considerate of multiple perspectives. Gemma 2b's response is also strong, with a clear structure and consideration of various aspects of Viking culture. Llama3's response, while creative, feels somewhat superficial and could benefit from deeper analysis.

It's worth noting that Phi's response was so bad the model didn't even seem to want to dignify it with a response (sorry, Microsoft) – but that's what this process is for, to find the best candidate for the job.

You can also use closed source LLMs as an evaluator, if you so choose. Running them in production may be untenably expensive, but a few queries to evaluate models shouldn't be. Here's what Claude 3 Opus thought of the outputs:

Absolutely, I'll evaluate each response on coherency, realism, and logic on a scale of 1 to 10.
Phi:
Coherency: 2/10 - The response is difficult to follow and goes off-topic.
Realism: 1/10 - The scenario presented is highly unrealistic and unrelated to the prompt.
Logic: 1/10 - There is little logical connection between the prompt and the response.

Llama3:
Coherency: 9/10 - The response is well-structured and easy to follow.
Realism: 8/10 - The imagined scenario and Viking reactions feel plausible and grounded in their cultural context.
Logic: 9/10 - The response follows a logical progression from initial discovery to long-term adaptation.

Solar:
Coherency: 8/10 - The response is generally well-organized and coherent.
Realism: 8/10 - The consideration of Viking social hierarchy, traditions, and beliefs adds realism.
Logic: 7/10 - The flow of logic is mostly sound, with a few areas that could be tighter.

Gemma:
Coherency: 7/10 - The response is fairly coherent, but some points feel less connected.
Realism: 6/10 - While some insights ring true, others, like burying the phone, feel less realistic.
Logic: 6/10 - The logical progression is decent but has room for improvement in connecting ideas.

Why this matters

To properly evaluate an LLM, especially for general use cases, you need to ask it a a lot of questions. For example, benchmark question sets like Hellaswag will run a model through tens of thousands of queries to arrive at their final grade. In addition, you may need to ask the same question repeatedly with different seeds to ensure that the model gives consistently good answers.

Where benchmarks fall short, though, is that they are limited by the questions that that they ask, which may not always be relevant to your use case. Moreover, if you have a custom model trained on your company data, benchmarking may be almost entirely useless.

Developing a framework to ask your custom-designed questions across a wide swath of models will greatly speed up your evaluation process. Being able to pipe output through a field of LLMs with a single keystroke could speed up your evaluation process many times over, especially if you are testing against a wide field of models.

Here are some examples of prompts to get you started:

Creative Writing

  1. Write a haiku about the first day of spring.
  2. Compose a short story about a time traveler who accidentally changes the course of history, in the style of Ray Bradbury.
  3. Create an origin story for a new superhero with an unconventional superpower.

Coding

  1. Merge Sorted Lists: Create a function that takes two sorted linked lists and merges them into a single sorted linked list. Return the head of the merged list.
  2. Valid Parentheses: Write a function that determines if a given string of parentheses, brackets, and braces is properly balanced. Return True if the string is valid, and False otherwise.
  3. Reverse Integer: Create a function that takes an integer and returns the integer with its digits reversed. If the reversed integer exceeds the 32-bit signed integer range [-2^31, 2^31 - 1], return 0.

Logical reasoning:

  1. In a group of 30 students, 20 study French, 15 study German, and 5 study both French and German. How many students study neither French nor German?
  2. If it is raining, then the streets are wet. The streets are wet. Therefore, it is raining. Is this conclusion logically sound? If not, identify the logical fallacy.
  3. Premise 1: If a person is a chef, then they can cook. Premise 2: Alex can cook. Conclusion: Therefore, Alex is a chef. Is this conclusion logically sound? If not, identify the logical fallacy.

Conclusion

Now it's your turn – what use case are you looking to tackle with a language model? What kinds of prompts would be most relevant for evaluating models in your domain?

Building on the framework outlined in this article, what other metrics or evaluation criteria would you add to the mix? How might you weigh factors like creativity, factual accuracy, or style when selecting a language model for your use case?

These are all questions that you can begin to suss out by spinning up a Pytorch pod as listed above on RunPod. The freeform nature of LLMs invites nearly unlimited experimentation. We'd love to see what you come up with; feel free to hop on our Discord and share your thoughts.