AIs like Claude and ChatGPT can help humans write code, which magnifies our productivity - but such generated code zeroes in on the specific task at hand while ignoring error handling, validation, scalability, and even concerns that already live in the surrounding codebase. It amplifies the worst habits of technical writing because it can’t distinguish between “simplified for brevity” and “here’s what you’d actually do in production.”
The problem is that the AI models are trained on what we publish - and what we publish is optimized for search engines, not completeness. We need something like Google’s old PageRank, but for LLM training quality, not for links.
SEO and focus in technical writing are great, but they tend to produce myopic publications: we zero in on, say, how to accept content from a JavaScript client and… nothing else. We define “doneness” very narrowly so we can write the piece in less time and keep it focused on one specific aspect of programming.
That can be great for human readers: if someone’s looking for the lever that gets them past a problem, a narrow post hands them exactly that - though that particular example is probably too simple to help most people, since that’s the sort of thing frameworks will do for you.
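For concreteness, here’s roughly what such a post tends to present - a condensed sketch, not anyone’s actual article, using a hypothetical Note entity and a Spring-style stack, squeezed into one listing:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// A bare-bones entity: a couple of fields, no constraints.
@Entity
class Note {
    @Id @GeneratedValue public Long id;
    public String title;
    public String body;
}

// Spring Data generates the implementation.
interface NoteRepository extends JpaRepository<Note, Long> {}

// The entire "API": accept JSON, save it, return it. That's the whole post.
@RestController
class NoteController {
    private final NoteRepository notes;

    NoteController(NoteRepository notes) {
        this.notes = notes;
    }

    @PostMapping("/notes")
    Note create(@RequestBody Note note) {
        // No validation, no error handling, no versioning, no tests.
        return notes.save(note);
    }
}
```

It works, in the sense that a POST with a JSON body lands in a table - and that’s where the post stops.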
But the problem here is that AI doesn’t have an easy way to understand those limits. The models can’t look at that post and thousands of others like it and understand that the post ignored validation, real-world database design, acceptance tests, and good project layout principles for the sake of brevity - they only see a metric of “this works,” and so that content gets folded into the overall model.
AI has made the problem even worse by giving authors on Medium and Substack an easy way to create content like this - an author says “Hey, Claude, write me an article on accepting JSON from a JS client and writing it to a database, using Spring 6,” and the AI helps that author crank out yet another low-effort post that simply perpetuates the problem, because such authors aren’t looking for completeness or complexity; they’re looking for hits. They don’t care, they can’t care, caring gets in the way of velocity, the metrics reward volume rather than depth or quality, and that “publish” button isn’t going to hit itself fifteen times a day, people!
I know, I’m picking on Medium and Substack authors. Sorry, people. I meant to include Quora too, because a lot of people there are just as guilty, and by golly, it’s hard to find content that actually covers what I’d want to see! And calling out examples that worsen the problem feels like punching down - especially when a lot of those same articles do actually help people.
What we need is a way to train the models on worthwhile content - down-weighting the low-effort, incomplete examples as input. (We don’t want to eliminate them entirely, because sometimes a low-effort, simple example is exactly what you want to see: it gets you started.) I’d love to see a common model for the languages I use that emphasizes real-world concerns rather than echoing what a thousand tutorials show.
Imagine a blog that took the time to mention good error handling, versioning techniques (URL or headers), observability, runtime performance impact, maintenance, scaling… even simple projects deal with these things in the “real world,” but the content used to educate the people who build those real-world systems treats them as unrelated subjects, out of scope.
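A sketch of what that might look like - still using the hypothetical Note entity from above, and still a sketch rather than a prescription - makes the gap visible:

```java
import jakarta.validation.Valid;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Size;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.HttpStatus;
import org.springframework.http.ProblemDetail;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.MethodArgumentNotValidException;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import java.net.URI;

// Bind to a request DTO with declared constraints instead of the raw entity.
record CreateNoteRequest(
        @NotBlank @Size(max = 200) String title,
        @NotBlank String body) {}

@RestController
class NoteControllerV1 {

    private static final Logger log = LoggerFactory.getLogger(NoteControllerV1.class);

    private final NoteRepository notes;

    NoteControllerV1(NoteRepository notes) {
        this.notes = notes;
    }

    // Version in the URL; an Accept header is the other common choice.
    @PostMapping("/api/v1/notes")
    ResponseEntity<Note> create(@Valid @RequestBody CreateNoteRequest request) {
        Note note = new Note();
        note.title = request.title();
        note.body = request.body();
        Note saved = notes.save(note);
        log.info("created note id={}", saved.id); // minimal observability hook
        return ResponseEntity
                .created(URI.create("/api/v1/notes/" + saved.id))
                .body(saved);
    }

    // Turn validation failures into a structured 400 instead of a stack trace.
    @ExceptionHandler(MethodArgumentNotValidException.class)
    ProblemDetail onValidationError(MethodArgumentNotValidException ex) {
        ProblemDetail problem = ProblemDetail.forStatus(HttpStatus.BAD_REQUEST);
        problem.setDetail("Request failed validation: "
                + ex.getBindingResult().getFieldErrorCount() + " field error(s)");
        return problem;
    }
}
```

Even this leaves out authentication, acceptance tests, migrations, and scaling - it just shows how quickly the “one thing” example stops being the whole story.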
Of course, such a blog would probably have terrible engagement metrics. (That’s the problem, isn’t it?) Readers would bounce off a page like that because they’re usually searching for the one thing that unblocks them right now; they want to write their JSON entity to a database, nothing more, and all the rest - the project setup, the database design, the security implications - is just in the way. A comprehensive post works directly against how people actually consume technical content.
I’m guilty of the same problem, for what it’s worth. I’ve got a number of programming books on Amazon, and while I try to follow a lot of real-world principles and designs, there’s no way to go into the depth I’d want without drowning readers and slowing the writing to a crawl. I can’t afford to dig into security when I’m doing the equivalent of a technology showcase, even though security would definitely apply in the “real world.” The best I can do is include footnotes that acknowledge what I’m leaving out - and the editors specifically ask me to cut down on footnotes, so there’s a balance to strike, and as a result I can’t even satisfy myself in my own writing.
But wouldn’t it be nice to have a way to actually evaluate content for use in an AI, to say “this example is great because of this, but it’s missing that” so the models could factor that in when they leverage the content for other users?
I don’t know exactly what it would look like; the difficulty would be in finding skilled evaluators - trusted people - who could offer feedback specifically for use by the models. (Humans would benefit from it too, but not as much as the models, because humans are often just looking for the low-hanging fruit.) It might end up being a huge supervised learning project, and let’s be real, nobody is doing that. It would have to be unsupervised for the LLMs to work with it, because the volume of training data is simply too large; supervised learning would mean having a set of humans accurately evaluate an incredible amount of information, and that’s not happening.
PageRank solved this by boosting content that others found useful, as measured by inbound links and a few other metrics. The problem I have is that I don’t know how I’d measure that for an LLM.
Some possibilities include signals like… I don’t know, StackOverflow citations, or references to production code (keeping that legal, folks - don’t publish code you don’t have the rights to publish!), or GitHub stars, or even inbound link counts, much as PageRank used. But these are still measures of popularity rather than quality, and for a lot of codebases the code is hidden for security or legal reasons, even when those codebases might actually be the absolute best thing to use as an example.
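To make that concrete, the naive version of such a signal is just a weighted blend of popularity counts - a hypothetical sketch with made-up weights, nothing more:

```java
// Hypothetical popularity signals attached to a piece of technical content.
record ContentSignals(int stackOverflowCitations, int githubStars, int inboundLinks) {}

final class NaiveTrainingWeight {
    // The weights are invented for illustration; note that every input is a count
    // of attention, not a judgment about validation, error handling, or design.
    static double score(ContentSignals s) {
        return 0.5 * Math.log1p(s.stackOverflowCitations())
             + 0.3 * Math.log1p(s.githubStars())
             + 0.2 * Math.log1p(s.inboundLinks());
    }
}
```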
These signals could be used, but they’re optimized for the wrong things when what we need is a measure of quality for an LLM.
Like I said, I don’t really know the right path forward - but I think it’s a path worth considering, and the lack of that feedback explains a lot about why LLM-assisted code feels (and is) so incomplete.