Your 260k dictionary is breaking Gemma 4's back.

#25
by phil111 - opened

When looking into why Gemma 4 has an unusually high hallucination rate for its size in popular domains of knowledge, such as popular movies, TV shows, and music, I noticed that the token pools are filled with highly probable, yet completely unrelated, tokens (e.g. foreign-language and coding tokens). And the higher the quantization quality (e.g. Q8), the worse it got.

This is extremely odd and isn't found in any of the other AI models I tested. The reason this should never happen, let alone regularly, is that the entire prompt is in English, is about an English-language show, and the AI's entire response up to the nonsensical token was in English. This overwhelming context should force the probability of any foreign token like ভ down to well under 0.01%, as is the case with every other LLM I tested. Yet these nonsense tokens from completely unrelated domains like foreign languages and coding regularly pop up with impossibly high probabilities, in this case 19.77%. And they don't even make sense in the English context (the model was looking for a last name and returned the 24th consonant of the Bengali alphabet).
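For anyone who wants to look at the candidate pools themselves, here is a minimal sketch of how they can be inspected with llama-cpp-python; the GGUF filename and prompt are placeholders rather than my exact files or prompts, and KoboldCpp and the llama.cpp server expose the same token-probability information through their own options.

```python
# Minimal sketch: inspect the candidate token pool with llama-cpp-python.
# The GGUF filename below is a placeholder; any Gemma GGUF quant works the same way.
import math
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-26b-a4b-it-Q4_K_M.gguf", n_ctx=4096)

out = llm.create_completion(
    prompt="The actress who played Lacey Burrows on Corner Gas is",
    max_tokens=5,
    temperature=0.0,
    logprobs=10,  # return the top-10 candidate tokens at each position
)

# Print each position's candidate pool so out-of-place tokens (Cyrillic,
# Bengali, code fragments, etc.) and their probabilities are easy to spot.
for pos, candidates in enumerate(out["choices"][0]["logprobs"]["top_logprobs"]):
    print(f"position {pos}:")
    for tok, lp in sorted(candidates.items(), key=lambda kv: -kv[1]):
        print(f"  {tok!r}: {math.exp(lp) * 100:.2f}%")
```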

This doesn't happen when I stick to Gemma 4's over-fit domains (e.g. math, coding, science, and story writing). But outside those domains, even when asking non-esoteric, basic questions about the most popular movies, shows, and games known to countless millions, not only do hallucinations spike by orders of magnitude, but they regularly include tokens that should have essentially zero chance of ever making it into a token pool given the 100% English context.

Balanced models like Llama 3.1 and Mixtral 8x7b, and even your own Gemini, show the same pattern of hallucinations across all domains, including coding, science, math, and pop culture. So this is clearly the result of your massive 260k vocabulary, coupled with grossly lopsided training on select domains like coding, math, and science at the expense of humanity's most popular domains of knowledge like movies and music, then made worse by trying to fit in multimodal and universal language support. As a consequence, this model falls apart in a way that every other model I tested, including Gemma 3 and 2, does not.

And the same thing happens with the dense 31b and E4B/E2B Gemma 4 models, so it isn't due to the MoE architecture. In short, you've pushed over-fitting as far as it can go. It's time to stop test-maxing MMLU-Pro, SWE-Bench, etc. and work towards making a general-purpose LLM for the >95% of the population that isn't a hardcore coding nerd.

I keep coming across the consequences of the extremely lopsided training of Gemma 4.

For example, when unambiguously requesting a joke about a random subject (e.g. "Write a joke..."), the jokes are at best superficially humorous unless the model parrots an existing joke. That's expected, since even the SOTA models can't generate original humor yet. However, about one time in ten, when the joke starts with a one-liner, the cadence seeds a poem without any attempt at humor; and about as often, when it's building up to the punchline of a drawn-out joke, it never finds any humor and just transitions into a humorless story or explanation of the subject, then ends.

Just as grossly over-training knowledge from select domains caused a collapse into nonsense when transitioning to other popular domains of knowledge (e.g. random foreign-language and coding tokens), the gross over-fitting of a handful of tasks (e.g. coding, math, science, story writing, and poetry) is causing those over-fit tasks to bleed into all other tasks.

And I could go on and on, such as thinking being applied nonsensically to the non-overfit tasks, causing frequent infinite looping; and even when thinking does stop, it more often than not outputs responses inferior to what's generated with thinking disabled.

Another issue is story writing. While story quality improved in Gemma 4, this is backfiring: what makes a good story is the added detail, but since Gemma 4 is profoundly ignorant across most popular domains of knowledge, the added details are commonly incorrect, resulting in nonsensical tangents that ruin the stories. You can't effectively write a story about something you have little or incorrect knowledge of.

Point being, the beauty of AI is its organic nature: how it brings language, logic, knowledge, coding rules, and more together to produce coherent responses that align with the user's prompts. Magic starts happening when you train democratically across all popular tasks and domains of knowledge. When you instead train heavily for a handful of tasks and domains, not only does the model cease to function as a general-purpose AI, but even the over-fit domains start to suffer. Story writing and coding both touch every aspect of humanity, so the stories and code produced by grossly over-fit LLMs are usually way off topic.

Hi @phil111
Thank you for providing a detailed breakdown of your findings regarding Gemma 4. To help us investigate the token probability issues, domain drift, and infinite looping further, could you please share the specific prompts and logs you used, along with the exact steps to reproduce them? I'd like to consistently reproduce these errors before escalating to our development team. Also, are you experiencing this only with the gemma-4-26B-A4B-it model? Again, we appreciate you taking the time to provide this feedback.

I appreciate you taking the time to consider my feedback.

Note: The following issue is no less pronounced when using the recommended settings (temp 1, top-p 0.95, and top-k 46) or higher quantizations (tested up to Q8).

My Setup: G4-26b Q4_K_M (Bartowski GGUF) on llama.cpp build 8746 via KoboldCpp. Sampling: temp 1 with min-p 0.2 (no repeat penalty, presence penalty, DRY, etc.). I tried higher quants, different sources (e.g. Unsloth UD), various settings, and several llama.cpp builds, and I use the standard instruct template with no system prompt (other than adding the think tag when thinking is enabled).
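For reference, here's roughly how that sampler configuration maps onto a llama.cpp server request. This is only a sketch: it assumes llama-server running on its default local port, the prompt is just an example, and KoboldCpp accepts equivalent sampler fields through its own API.

```python
# Sketch of the sampler configuration above as a llama.cpp server request.
import requests

payload = {
    "prompt": "Who played Lacey Burrows on the TV show Corner Gas?",
    "n_predict": 64,
    "temperature": 1.0,
    "min_p": 0.2,   # my setup: temp 1 with min-p 0.2 and no penalties
    "top_p": 1.0,   # the recommended settings from the note above use top-p 0.95
    "top_k": 0,     # ...and top-k 46 instead of min-p
    "n_probs": 10,  # also return the top-10 candidate tokens per position
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
print(resp.json()["content"])
```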

Token Issue: G4-26b regularly outputs nonsensical tokens (e.g. foreign-language tokens) when it's not sure. This doesn't happen despite a lot of coding, science, and technology use, nor does it happen with the most wildly popular areas of pop culture, such as Madonna or the TV show Friends. G4 still mostly hallucinates, even for those wildly popular areas, but there's almost never a nonsensical token (e.g. a foreign-language token). However, when covering pop culture in general, or digging just a bit deeper into even the top 1% most popular movies, shows, music, etc. (e.g. a secondary character from a very popular TV show like Friends), the flood of hallucinations starts to include nonsensical tokens.

Sample Prompt (a moderately popular TV show): "What are the 6 main characters, and the actors who portrayed them, on the TV show Corner Gas? Don't add details, just list them. And what year did the show first air?"

Brent Butt – Brent Butt
Lacey Burrows – Елена (Lacey) Burrows...

[Image: Token]

And if I use the recommended settings (temp 1, top-p 0.95), even more nonsensical tokens are considered in response to the same prompt, as shown in the following image.

[Image: Token2]

This also happens with direct questions about a secondary character from a very popular show whose main cast G4 can correctly list. For example...

Sample Prompt (a secondary character from a very popular TV show): "Who played Robert Barone’s wife on the TV show Everybody Loves Raymond?"

[Image: Token3]

I ask this question because LLMs tend to just hallucinate the lead female character (Debra) and the actress who portrayed her (Patricia Heaton) instead of the lesser-known character being asked about (Robert's wife, not Ray's). Anyways, it's understandable that G4 doesn't know Monica Horan played Amy MacDougall, since she's a secondary character. However, once again, when G4 tries to recall weakly held information it keeps doing things like switching languages (last token in the above image).

I re-tested E4B and Gemma 31b, and while Gemma 31b hallucinates just as much as 26b-a4b when it comes to pop culture, and pretty much all E4B does is hallucinate about pop culture, their token pools make sense (e.g. all the hallucinations are English names in the E4B example below). So I guess this means it's likely an MoE issue as you suspected.

[Image: Token4]

I'll come back later and post examples of task switching (e.g. jokes turning into poems or stories), but I suspect the dense versions are at least more immune to this as well. It makes some sense that when the router comes across hard-to-recognize information it saw very little of during training (pop culture, or writing jokes), it would be far more likely to activate the wrong experts, hence returning nonsensical foreign-language/coding/math tokens, or transitioning into writing a poem or story instead of a joke about something random like a galaxy.
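To make that routing intuition concrete, here's a toy sketch of top-k expert routing. The dimensions, expert count, and weights are invented and have nothing to do with Gemma's actual configuration; it just shows how a router turns its logits into a small set of active experts, and why an unfamiliar hidden state can push probability mass toward the wrong ones.

```python
# Toy sketch of top-k MoE routing; sizes and expert count are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))              # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(hidden):
    logits = hidden @ router_w                                 # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]                               # pick the top-k experts
    weights = probs[top] / probs[top].sum()                    # renormalize their weights
    # The token's output is a weighted mix of only the chosen experts,
    # so a bad routing decision directly shapes the candidate tokens.
    out = sum(w * (hidden @ experts[i]) for w, i in zip(weights, top))
    return out, top

_, chosen = moe_layer(rng.normal(size=d_model))
print("experts chosen for this token:", chosen)
```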

It seems the topic-shifting issue worked itself out. I ran my task scripts ~100x and G4 26b didn't fall off the rails once. Perhaps there was a relevant llama.cpp fix. I also noticed G4 wasn't accurately recalling past context before, but now it does. So it's possible this was related to the task-shifting issue (e.g. writing a poem when asked to write a joke).

Anyways, I test tasks (e.g. writing jokes, stories, and poems, and defining terms) using scripts that randomly create a near-infinite number of prompts from sentence fragments and subject databases. About a tenth of the time, joke prompts would turn into poems or stories, or, instead of defining a term, it would start writing code.
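The scripts are nothing fancy; roughly this idea (the fragments and subjects below are placeholders, not the actual databases):

```python
# Rough sketch of the prompt-generator idea used for testing.
import random

TASKS = [
    "Write a joke about {subject}.",
    "Write a short poem about {subject}.",
    "Write a two-paragraph story involving {subject}.",
    "Define {subject} in one sentence.",
]
SUBJECTS = ["a galaxy", "a cafeteria tray", "the TV show Corner Gas", "a stack (data structure)"]

def random_prompt(rng=random):
    # Combine a random task fragment with a random subject.
    return rng.choice(TASKS).format(subject=rng.choice(SUBJECTS))

for _ in range(5):
    print(random_prompt())
```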

It still shifts tasks and makes mistakes that the dense G4s don't appear to make, but the mistakes and task shifts are understandable considering the ambiguity; and frankly, they're entertaining.

For example, when asked to define two terms it cajoled me instead of defining cajole. I didn't laugh at any of G4's jokes, but I laughed at this.

Prompt: "Define stack in information technology and cajole, but only after first defining another related college-level term. Then end by sharing an interesting related fact."

Response: "To understand a stack, we must first define the concept of LIFO (Last-In, First-Out), a fundamental principle in computer science where the most recent element added to a data structure is the first one to be removed.

A stack is a linear data structure that operates strictly according to this LIFO principle. Imagine a physical stack of cafeteria trays: you can only add a new tray to the top (a push operation), and if you want to take a tray, you must take the one currently on top (a pop operation). In information technology, stacks are used extensively for managing function calls in programming languages, handling undo mechanisms in software, and managing expression evaluation.

Now, please, I implore you—don't just walk away with this dry definition! Stay a little longer, dive deeper into the world of data structures, and let the elegance of algorithmic complexity captivate your mind. You know you want to; once you start exploring how memory is managed, you won't be able to stop!

Interesting Related Fact:
The “Call Stack”..."

Anyways, the task-shifting mistakes are no longer egregious unless there's a lot of ambiguity in the prompt, and mitigating ambiguity is the user's responsibility. Still, the MoE shifts tasks in response to less ambiguity than the dense versions do.

The real oddity is why G4 regularly picks completely nonsensical tokens, such as foreign-language, coding, and math tokens, when it has low confidence within domains like pop culture. This doesn't make sense, since even in ignorance hundreds of English tokens should have far higher probabilities given the all-English context about a popular English-language subject.
