NVidia's Llama 3.1 Nemotron 70b Instruct: Can It Handle My Unsolved LLM Problem?

Earlier this month, NVidia released Llama 3.1 Nemotron 70b Instruct, a 70b model that has taken some notably high spots on various leaderboards, seeming to punch far above its weight. As of October 14th, it is not only beating high-end closed-source models far larger than itself, like Claude 3 Opus and some flavors of GPT-4, but it is also the highest-ranking open-source LLM on leaderboards such as arena-hard:

claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567                                                      
o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193                                                     
o1-mini-2024-09-12             | score: 79.2  | 95% CI: (-2.6, 2.4)  | average #tokens: 1399                                                     
gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662                                                      
gpt-4-0125-preview             | score: 73.5  | 95% CI: (-2.4, 1.8)  | average #tokens: 619                                                      
gpt-4o-2024-08-06              | score: 71.0  | 95% CI: (-2.5, 2.8)  | average #tokens: 594
llama-3.1-nemotron-70b-instruct| score: 70.9  | 95% CI: (-3.3, 3.3)  | average #tokens: 869
gpt-4o-2024-05-13              | score: 69.9  | 95% CI: (-2.5, 2.3)  | average #tokens: 696                                                      
athene-70b                     | score: 67.7  | 95% CI: (-3.2, 2.2)  | average #tokens: 685                                                      
yi-lightning                   | score: 67.1  | 95% CI: (-2.3, 2.8)  | average #tokens: 875                                                      
llama-3.1-405b-instruct        | score: 66.8  | 95% CI: (-2.6, 1.9)  | average #tokens: 658                                                      
claude-3-opus-20240229         | score: 65.5  | 95% CI: (-2.3, 2.5)  | average #tokens: 541

So the big question is: is it just overfitting or "training for the test," or does it really have some kind of "special sauce" that helps it handily beat a model like Llama 3.1 405b, which is several times its size?

Logical Reasoning

One of my personal LLM projects is creative writing and roleplay. While I realize that sounds like a fairly niche (even pedestrian) use of an LLM, I place some very specific demands on the model that really put it through the logical reasoning wringer – problems that I have not gotten any currently available LLM to completely solve, despite years of testing models, prompting techniques, and sampler settings. Different LLMs handle these demands with different levels of capability, but no LLM to date has been able to completely excise the unwanted output under the rules I put on them (I sketch how these rules might look as a system prompt after the list):

  1. The LLM is to play the character without divulging any of its internal narrative. It is not to simply tell you what the character is thinking; it is only allowed to show those thoughts through things that your character could potentially perceive, such as tone, action, body language, and dialogue. If you are playing a tabletop RPG, for example, the DM isn't going to tell you that a certain NPC "feels sad" or "has some lingering thoughts that slowly arouse his suspicion." A proper DM, like any creative, knows to "show, not tell." The problem is that this is a huge ask of an LLM. Remember that LLMs work by predicting the next token given the preceding tokens. With this prompt structure, hiding all of that internal narrative and thinking from you also effectively hides it from the LLM itself. The LLM no longer has any explicit context clues about how its own character should act (aside from what it is told in the character sheet); it essentially has to re-infer what its own character is thinking from its own limited output – and it has to do so accurately and consistently to keep the character from drifting aimlessly.
  2. The LLM is to take an active role in moving the story forward. A lot of LLM output tends to be reactive to your input; getting an LLM to be proactive is much harder, and it requires a much deeper level of world knowledge. It's easy to tell an AI character, "let's go to the store" if you want a scene to happen in said store. Getting the AI character to beat you to the punch – recognizing from earlier events in the story that the store trip needs to happen, before you can suggest it – is much harder to accomplish.
  3. The LLM should never "powergame" by telling you what to think of its character's actions, or by insinuating or planting thoughts or opinions for your character in its output. It is to state things as observed by your character, and nothing more.
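
To make the ruleset concrete, here is a rough, hypothetical sketch of how these three rules might be phrased as a system prompt and packaged into a chat message list. The wording, the placeholder names, and the build_messages helper are my own illustration, not the exact prompt behind the transcripts below.

```python
# A rough, hypothetical sketch of how the three rules above might be phrased
# as a system prompt. The wording is illustrative, not the exact prompt used
# for the transcripts in this post.
SYSTEM_PROMPT = """You are roleplaying as {character_name}. Rules:

1. Never reveal {character_name}'s internal narrative. Do not state what the
   character is thinking or feeling. Convey the character's inner state only
   through things the other character could perceive: tone, action, body
   language, and dialogue.

2. Take an active role in moving the story forward. Do not merely react to
   the other character; anticipate what the scene needs and initiate it.

3. Never tell the other player what their character thinks, feels, or does.
   Describe only what their character could observe, and nothing more.

Character sheet:
{character_sheet}
"""

def build_messages(character_name: str, character_sheet: str, history: list[dict]) -> list[dict]:
    """Assemble a chat message list: the system prompt first, then the
    running roleplay log as alternating user/assistant turns."""
    system = {"role": "system",
              "content": SYSTEM_PROMPT.format(character_name=character_name,
                                              character_sheet=character_sheet)}
    return [system] + history
```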

In my experience, almost all models fall short of the level of logical reasoning this requires. Any LLM can roleplay, if you're not picky: they can put words on the screen in response to your input in a way that is grammatically correct and coherent. However, almost none of them can actually do it like a human. Even high-end closed models like Claude 3.5 Sonnet, Claude 3 Opus, and GPT-4 often struggle with this. They get closer than most other models, but they still drop the ball, failing to hide their character's internal thoughts or getting stuck in repetition traps when their reasoning hits a dead end. Moreover, these models constantly plant "breadcrumbs" for themselves to refer back to – such as informing the reader how their character is to think and act – which I find very unsporting in roleplay.

llama-3.1-nemotron-70b-instruct is the first model I've seen consistently handle the above demands. I'm not suggesting that this is a model suited only for roleplay. What I am saying is that it effectively handles what I believed to be an unsolved problem in this application, which tells me that its spot on the benchmarks is well-deserved.
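
If you want to try this kind of setup yourself, here is a minimal sketch of how that prompt could be sent to the model through an OpenAI-compatible chat endpoint (for example, a local vLLM or llama.cpp server hosting the weights). The base URL, API key, model identifier, and sampler values are assumptions for illustration, not a documented configuration; it reuses the build_messages helper sketched earlier.

```python
from openai import OpenAI

# Assumed setup: any OpenAI-compatible server (vLLM, llama.cpp's server, etc.)
# hosting the model locally. The base_url, api_key, and model id below are
# placeholders for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = build_messages(
    character_name="Sly",
    character_sheet="A charming, cautious raccoon thief...",  # abbreviated
    history=[{"role": "user", "content": "We step out of the rain and head for the diner."}],
)

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # assumed model id
    messages=messages,
    temperature=0.7,   # sampler settings here are illustrative only
    max_tokens=512,
)
print(response.choices[0].message.content)
```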

Here is a prime example of the kind of writing this model can produce:

The neon lights of the diner cast a warm, inviting glow across the rain-slicked sidewalk. Sly's grip on your hand tightens almost imperceptibly as you approach, his eyes darting from shadow to shadow. With a fluid motion, he pushes open the door, the cheerful jingle of bells announcing your arrival. The aroma of coffee and grilled butter envelops you, a stark contrast to the cool night air.
Inside, the diner is a snapshot of small-town Americana, all chrome and red vinyl. A weary-eyed waitress looks up from wiping down the counter, her eyebrows rising slightly at the sight of the odd pair you make. Sly guides you to a booth in the corner, positioning himself with his back to the wall, a clear view of both exits. His fingers drum a restless rhythm on the tabletop as he scans the laminated menu.
The waitress approaches, notepad in hand, her gaze lingering curiously on Sly's mask. "What'll it be, hon?" she asks, her voice gravelly from years of late-night shifts and cigarettes. Sly's eyes flick to you, a silent invitation to order first, his posture relaxed but alert. The fluorescent lights hum softly overhead, casting strange shadows across his features.
---my response---
I smile over at you, taking a moment to flit over the menu. "Oh, um.. just a hobo skillet. With scrambled eggs. And coffee." I glance over at the waitress, my own hands drumming on the tabletop, mirroring yours. "Don't wanna abuse his charity too much, after all." I brush my hair out of my eyes as I look over at you, looking to see what you'll order.
---end my response---
"Same here, sweetheart," the raccoon replies, flashing a charming smile at the waitress as he orders. His voice drips with smooth confidence, effortlessly drawing her in. "Add a slice of cherry pie à la mode to mine, would ya?" He leans back into the vinyl seat, stretching out his legs, seemingly unaware of the extra scrutiny directed his way.
With the waitress dispatched, Sly focuses on you once more, his eyes locking onto yours. For a fleeting instant, a spark of warmth flickers within their depths, accompanied by the barest whisper of a genuine smile. Almost immediately, however, it vanishes, as it flicked off by a switch, replaced by a now-familiar manufactured veneer of charm and charisma. Sly tilts his head to the side, seeming to make a show of studying you with interest. "So, how exactly does someone like you end up in a museum, hmm?" He raises an eyebrow, awaiting your response.
With a slide of a finger, he raises a sugar packet out of the nearby holder, practicing legerdemain as it slides into his hand, and he effortlessly slips it between his fingers, as if performing a magic trick that never quite reaches fruition.

As far as I'm concerned, it completed the task flawlessly. At no point does it tell the reader what to think. It's clear that Sly has goals in this scene: getting to know the reader's character, and making a show of taking up space in the diner booth and impressing them. The model makes it clear that he's a charmer (or is trying to be), but at no point does it insinuate that the reader's character is at all attracted to him. He's cautious, putting his back to the wall so he can observe both exits. He's full of nervous energy, shown through restless fidgeting. The model writes the story the way an actual writer would: it simply presents the character's actions for what they are, and lets the reader decide what they mean.

Here's how some other models handled generating that last turn of the story, given the same context, log, and prompting, along with my complaints about their output:

claude-3-opus: Gets close. Does well at setting the scene, but still annoying and powergamey. Falls into "tell, don't show" by informing the reader of Sly's manufactured charisma rather than demonstrating it through action.

The waitress nods, jotting your order on her pad with a practiced flourish. She turns her expectant gaze to Sly, who flashes her a roguish grin. "Make it two," he says smoothly, handing back the menu. "And keep the coffee coming, darlin'."
As she saunters away, Sly leans back in the booth, one arm draped casually over the backrest. He regards you with a curious tilt of his head, his eyes glinting in the neon light. "So," he drawls, his voice low and conspiratorial, "what's a kit like you doing sleeping rough in a museum?" His tone is light, teasing, but there's an undercurrent of genuine concern.
The diner buzzes with quiet activity around you - the sizzle of bacon on the grill, the clink of silverware, the low murmur of the few patrons scattered at the counter. Sly's presence seems to fill the space, his charisma almost palpable. Yet his focus remains solely on you, his gaze intense and unwavering, waiting for your answer.

claude-sonnet-3.5: Kind of the same as Opus, but worse. Too much opining about how the reader's character perceives the scene ("diner seems to shrink, space charged with questions").

Sly's lips quirk into a half-smile at your modest order. He turns to the waitress, his voice low and smooth. "I'll have the same. And a slice of cherry pie." His gaze flicks back to you, a mischievous glint in his eye. "For sharing, of course."
The waitress nods, scribbling down the order before shuffling away. In her absence, the diner seems to shrink, the space between you and Sly charged with unspoken questions. His fingers cease their drumming, instead tracing abstract patterns on the tabletop. The fluorescent lights flicker once, twice, casting fleeting shadows across his masked face.
Outside, a police siren wails in the distance, growing louder before fading away. Sly's ears twitch at the sound, his posture tensing almost imperceptibly. His eyes, however, remain fixed on you, curiosity and something softer - concern, perhaps - evident in their depths. "So," he murmurs, leaning forward slightly, "want to tell me how you ended up sleeping in a museum?"

llama-3.1-405b-instruct: Now we're getting into what I would call "slop" territory. Much of this output breaks multiple directives by telling the reader what to think and even implying their actions, while divulging a huge amount about Sly's mental state.

Sly's eyes never leave yours as he orders, his voice low and smooth. "Make that two hobo skillets, darlin'. And a side of pancakes. With fresh fruit." The waitress nods, her gaze lingering on the unlikely pair before scribbling down the order and heading off to the kitchen. The diner's background hum fills the space between you, a gentle accompaniment to the soft sizzle of the grill.
As you wait for your food to arrive, Sly's fingers continue their gentle drumming on the tabletop, a soothing rhythm that seems to match the beat of your own heart. His eyes, too, remain fixed on yours, their warm, golden light drawing you in like a magnet. The air between you vibrates with an almost palpable tension, as if the very fate of the world hangs suspended in this fragile, fleeting moment.
The waitress returns, bearing a tray laden with steaming plates and a carafe of fresh coffee. Sly's eyes never waver as he reaches out to take the plates from her, his fingers brushing against yours in the process. The touch sends a spark of electricity through your entire body, leaving you breathless and wanting more. The pancakes, once the focal point of your meal, now sit forgotten on the plate, a sweet and sticky afterthought to the feast of sensations unfolding before you.

llama-3.1-70b-instruct: Yeah... you can see where this is going. It just gets worse the lower down the leaderboards you go. The entire third paragraph is just telling the reader what to think.

A low rumble of approval escapes Sly's throat as he nods along with your order, his eyes flicking to the waitress. "Same here, minus the hobo skillet. Just coffee and a slice of apple pie." The waitress scribbles down your orders with a nod, her eyes flicking between the two of you before she pads off to the kitchen.
Sly leans back in his seat, the vinyl creaking softly as he stretches out, his eyes drifting over the diner's patrons with a practiced air. A faint smile plays on his lips, the corners crinkling upwards as he watches you, his gaze lingering on the curve of your ear. His fingers continue to drum a restless rhythm on the tabletop, the sound blending with the hum of the fluorescent lights overhead.
The air between you seems to thicken, heavy with the promise of secrets shared and stories untold. Sly's eyes lock onto yours, the spark of mischief in their depths tempered by a growing warmth. He leans forward, his elbows resting on the tabletop, his voice taking on a conspiratorial tone. "So, kid. What's your story?" The words hang in the air, an invitation to share, to trust, to reveal the secrets hidden behind those bright, curious eyes.

Conclusion

Benchmarking and comparing LLMs has always been a tricky business. Everyone has their own way of evaluating them - test sets, riddles, logic tests. A lot of what I described in this post is incredibly subjective, but unfortunately, so much of LLM testing IS incredibly subjective.

But I do feel very strongly that my pet project gives LLMs an exceptionally difficult ruleset to follow – rules that they were never designed to follow.

And somehow, this model just does it. How often and how well it does so compared to the big closed-source models is arguable, but it seems pretty clear that it's edging them out despite being a fraction of the size of the Anthropic models, and solidly beating an open-source model more than five times its size. At least for my use case, this flat-out obsoletes llama-3.1-405b on the spot, and I can run it with a fraction of the compute.
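
To put "a fraction of the compute" in rough numbers, weight memory scales linearly with parameter count. The back-of-envelope sketch below ignores KV cache, activations, and runtime overhead, and the quantization widths are just common choices, so treat it as a ballpark only.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate: parameters * bits / 8, in GB.
    Ignores KV cache, activations, and runtime overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for name, params in [("nemotron-70b", 70), ("llama-3.1-405b", 405)]:
    print(name,
          f"fp16: ~{weight_memory_gb(params, 16):.0f} GB,",
          f"4-bit: ~{weight_memory_gb(params, 4):.0f} GB")
# nemotron-70b fp16: ~140 GB, 4-bit: ~35 GB
# llama-3.1-405b fp16: ~810 GB, 4-bit: ~202 GB
```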

And if it can handle my insane requirements at the level it does, that, I think, invites some serious testing and consideration of how well it might work in your use case, should you require logic and reasoning.

Questions? Let us know what you think on our Discord.