The Training Data Problem
Why AI hallucination is not an easy fix
The entire history of the internet may seem like a huge amount of information, but it’s not unlimited. On any topic of marginal interest, there isn’t actually that much material, and humanity can’t produce new text much faster than it already does. Hence, we’ve hit the training data ceiling.
And a model trained by gradient descent will ALWAYS produce a result that looks like all the other results. Even if there is effectively zero training data on a topic, it will still speak about it confidently. It’s just all completely made up.
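To make that concrete, here is a minimal sketch in Python of why the output mechanism always commits to an answer. The six-word vocabulary and the random logits are made up for illustration, standing in for a prompt the model has essentially no training data for: the final softmax step turns whatever scores come out of the network into a probability distribution, and a distribution always has a most-likely token, whether or not anything behind it is grounded in data.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical vocabulary; the logits below are random, standing in for a
# topic the model knows essentially nothing about.
vocab = ["yes", "no", "maybe", "Paris", "1987", "precedent"]
rng = np.random.default_rng(0)
logits = rng.normal(size=len(vocab))

probs = softmax(logits)
best = vocab[int(np.argmax(probs))]

# There is no "I don't know" state here: the scores always normalize to a
# distribution that sums to 1, and that distribution always has a winner.
print(f"probabilities sum to {probs.sum():.2f}")
print(f"most likely next token: '{best}' with p={probs.max():.2f}")
```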
The approach was originally developed because modern fighter jets are so unstable that a human being can’t react fast enough to keep them in the air at all. So the algorithm takes the stick inputs as a general idea of what the pilot wants, and interprets them into the signals sent to the actuators. In other words, it takes a very small amount of data and converts it into a very large amount of data. But everything outside the specific training data is always interpolation.
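The same point can be shown with a toy curve-fitting example. This is an analogy rather than an actual LLM: fit a smooth model to samples from a limited range, and it will keep returning perfectly confident numbers outside that range too, they just no longer have anything to do with reality.

```python
import numpy as np

# Toy stand-in for "training data": noisy samples of a known function
# taken only on a limited interval.
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, np.pi, 50)
y_train = np.sin(x_train) + rng.normal(scale=0.05, size=x_train.shape)

# Fit a smooth model to the observed range.
coeffs = np.polyfit(x_train, y_train, deg=5)
model = np.poly1d(coeffs)

# Inside the training range the interpolation is close to the truth...
x_in = np.pi / 2
print(f"inside  range: model={model(x_in):+.2f}  truth={np.sin(x_in):+.2f}")

# ...but outside it the model still returns a perfectly smooth, confident
# number that has drifted far from reality.
x_out = 2.5 * np.pi
print(f"outside range: model={model(x_out):+.2f}  truth={np.sin(x_out):+.2f}")
```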
The fascinating thing is that vector embeddings created from natural language turned out to have a remarkable property: their interpolations are still valid natural language. However, it doesn’t work the same way for the concept of truth. If an AI doesn’t know something about, say, a legal situation and can’t find a precedent, it might take a similar article name from the field of medicine and then claim it’s a precedent for court, simply because the embeddings happened to be close. As a huge number of lawyers have already found out the hard way.
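Here is a rough sketch of how that retrieval failure happens, using invented three-dimensional vectors in place of real embeddings: nearest-neighbour search always returns the closest stored item, and nothing in the cosine-similarity math knows or cares that the closest item comes from the wrong field.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" invented for illustration; real embedding models
# produce vectors with hundreds or thousands of dimensions.
corpus = {
    "Smith v. Jones, contract dispute (law)":          np.array([0.9, 0.1, 0.3]),
    "Informed consent in clinical trials (medicine)":  np.array([0.2, 0.9, 0.4]),
}

# A legal question for which the corpus holds no genuine precedent.
query = np.array([0.3, 0.8, 0.5])

# Nearest-neighbour retrieval always returns *something*: whichever stored
# vector happens to be closest, regardless of whether it is actually a
# relevant legal precedent.
best = max(corpus, key=lambda title: cosine(query, corpus[title]))
print("retrieved as 'precedent':", best)
```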
Here’s a good example, based on a Finnish pun; below is part of an AI’s analysis of it. The analysis will probably sound convincing to someone who doesn’t know Finnish, but it’s pure nonsense on ALL levels. Not even the simplest truth-claims in it hold: “sipiin” doesn’t mean “into a barrel”. It doesn’t mean anything at all.
Original Finnish Joke:
“Auto ajoi Mississippi-joen yli. Joe kuoli välittömästi.”
This is a pun (sanavitsi) in Finnish. The humor relies on the name “Mississippi” sounding like the Finnish phrase “missi sipiin”, which means “a beauty pageant contestant into a barrel”.
Therefore, the first sentence has a double meaning to a Finnish listener: “A car drove over the Mississippi River” and “A car drove a beauty queen into a barrel.”
The punchline “Joe died instantly” then becomes a darkly humorous, non-sequitur conclusion to the absurd “beauty queen into a barrel” scenario. The name “Joe” might even be a nod to the American setting of the river.
Now, this doesn’t mean the problem cannot be solved. But solving the problem requires reinventing the LLM and rebuilding it from the ground up on the basis of something other than the interpolation algorithm. And given the way in which the AI investment bubble appears to be popping, it seems unlikely that anyone is going to initiate a ground-up rebuild anytime soon.



"But solving the problem requires reinventing the LLM and rebuilding it from the ground up on the basis of something other than the interpolation algorithm..."
i never knew that 🤔
Perhaps as an analogue to the above, and related to the example, this digital "instability" somewhat mimics the F-16 Fighting Falcon's physical flight characteristics. The fighter was designed to be unstable in flight, in that without positive control by the pilot the plane will rapidly diverge from manually set flight paths. This allows for the rapid intended changes in flight path (turns, climbs, dives) for which the Falcon is well known.
While not exactly "intended" by the LLM, the AI similarly diverges when it runs out of real data and has to make a path to an answer, rapidly satisfying the request.