
The Repo Is Right There. Why Are You Checking Their CV?
In 2026, AI researchers and engineers use the same words to mean opposite things. This is not a communication problem. It is an incentive problem with a vocabulary leak, and it’s where most AI projects actually fail.

Choose A, B or C

Someone posts this on LinkedIn:
“I implemented a numerical algorithm to validate string theory vacuum consistency. Built with NumPy, not PyTorch. Repo below.”
What do you do first?
A) Open their profile. University, publications, credentials.
B) Click the repo. Read the README. Try to run it.
C) Scroll past.
C means you’re not in this field. Fine.
But if you work in AI, you chose A or B. That instinct reveals more about where this industry is cracking than any benchmark announcement in 2026.
Your answer depends on your destination

If you chose A, you are probably operating inside a world where that instinct makes sense. That’s not a criticism — it’s the logic of the job.
A researcher’s destination is reputation. Citations, SOTA results, conference invitations. Checking credentials is rational in that world. A paper from an unknown author with no affiliation is genuinely harder to evaluate. The system makes sense, inside that world.
An engineer’s destination is something that runs. Ships. Doesn’t break at 2am. The only signal that matters is the working artifact. That system makes sense, inside a different world.
Neither instinct is wrong. The problem is that both worlds now use the same words for different things.
I’ve built enough systems that looked correct in testing and failed in production to know the difference. A result that doesn’t survive real conditions isn’t a result. It’s a failure with good cosmetics. That understanding changes how you read a repo. It changes which questions you even think to ask.
A researcher reading the same repo is asking: is this idea novel? Does this advance the field? Real questions. Just not the ones I need at 2am.
This looks like a communication problem on the surface. It isn’t. It is an incentive problem with a vocabulary leak. Before 2025, the gap was mostly about speed — how long research took to reach production. What changed is the nature of the gap itself. It became semantic. The same words now describe different realities depending on who’s using them.
That leak has a cost. And it accumulates.
The field is drowning in unconsumed insight

The AI field is not short on intelligence. It is drowning in unconsumed insight.
Ideas that are real, verified, published, cited. Ideas that never survive the journey from the lab to somewhere something actually needs to run.
Most people assume the bottleneck is the model. It isn’t.
ML code is roughly 5% of a production system.[1] The other 95% is data pipelines, monitoring, error handling, cost controls. That’s where technical debt compounds. That’s also where researchers rarely look, because it doesn’t generate publishable results.
A paper reports near-perfect efficiency on a benchmark dataset. Clean data, controlled compute, no legacy systems, no users doing unexpected things. When that paper meets a real system, the promised behavior often doesn’t just degrade. It simply doesn’t appear.
Think of every research paper as a Michelin-starred recipe. Precise, rigorous, tested — in the chef’s kitchen. When someone else tries to cook the same dish, the recipe isn’t wrong. It was just never written for their kitchen, their ingredients, their equipment. Most AI papers are the same. The research works. The conditions don’t transfer.
The recipe problem doesn’t stay in the lab. It follows the system into production — and there, the stakes change.
💡 Related Post:
When the Michelin Recipe Fails in Your Kitchen
Capability and reliability do not scale on the same curve

Autonomous task duration in AI systems grew from 4 minutes in early 2024 to 14.5 hours by February 2026. Real progress. But chain 20 steps together in production, and end-to-end success rates fall to around 36%, even at 95% per-step reliability. Errors compound. 91% of ML models degrade after deployment. Fewer than 10% of enterprises have anything resembling agent governance.[2]
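That 36% isn’t mysterious. It’s per-step reliability compounding multiplicatively across the chain. A minimal sketch of the arithmetic:

```python
# End-to-end success of a multi-step chain is the product of per-step reliabilities.
per_step = 0.95   # 95% reliability at each step
steps = 20
end_to_end = per_step ** steps
print(f"{end_to_end:.1%}")  # 35.8% -- the ~36% figure above
```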
Capability and reliability do not scale on the same curve.
There’s also a perception problem that makes this harder to see. The METR study from 2025 tracked experienced software developers completing realistic engineering tasks — debugging, feature implementation, code review — using Cursor Pro with Claude 3.5/3.7 Sonnet. Objective measurement: 19% slower. Self-reported: 20% faster. A 39-point gap.[3] Not a technology failure. It’s what happens when your destination (feeling productive) diverges from the destination that matters (the system works, the product ships).
Researchers and engineers are measuring the same industry with instruments calibrated for different purposes. Both instruments work correctly. They just don’t measure the same thing.
So who closes the gap? It starts with the vocabulary.
When the same word means opposite things

In October 2025, Andrej Karpathy appeared on the Dwarkesh Patel podcast and defined what he meant by “agent.” An employee. An intern you could actually delegate to. A system with enough intelligence, memory, and reliability to do the work.
His listeners immediately noticed the problem. Blogger Simon Willison wrote: “It turns out Andrej is using a different definition of agents to the one that I prefer.”
Nobody was wrong. They were just using the same word from different worlds.
To a researcher, an agent is an autonomous goal-directed entity with planning capabilities and long-horizon reasoning. That definition comes from decades of AI literature and it’s precise within that tradition.
To an engineer in 2026, an agent is something different. It’s a system with tool-calling, retry logic, and an observability layer that you’ve tuned until it stops crashing in ways that cost money. A 70% end-to-end success rate is cause for genuine celebration. Shipping at 36% is sometimes acceptable if the fallback is graceful enough.
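In code, the engineer’s object looks less like a planner and more like plumbing. A minimal sketch, where call_tool, fallback_response, and TransientError are hypothetical stand-ins for whatever your stack actually uses:

```python
import time

class TransientError(Exception):
    """Stand-in for the failures you retry: timeouts, rate limits, 5xx responses."""

def call_tool(task: str) -> str:
    # Hypothetical stand-in for the real model/tool invocation.
    raise TransientError("simulated timeout")

def fallback_response(task: str) -> str:
    # Graceful degradation: a cached answer, a cheaper model, a human handoff.
    return f"fallback for: {task}"

def run_agent_step(task: str, max_retries: int = 3) -> str:
    """One step of a production 'agent': tool call, retries, observable fallback."""
    for attempt in range(max_retries):
        try:
            return call_tool(task)
        except TransientError:
            time.sleep(2 ** attempt)  # exponential backoff before the next try
    return fallback_response(task)    # the graceful fallback that makes 36% shippable

print(run_agent_step("summarize the incident log"))
```

None of this appears in a paper’s definition of an agent. All of it decides whether the thing survives contact with users.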
These aren’t different opinions about the same thing. They’re different objects that inherited the same name.
The same drift happened to “scaling,” “alignment,” and “AGI.” Each word started in one world and got adopted by the other without the definition transferring. When a researcher and an engineer read the same post about agents and each thinks the other doesn’t understand, they’re often both right. They’re evaluating different claims.
“Agents in papers vs agents in prod — night and day.” That line circulated on r/AI_Agents in late 2025. It kept being cited not because it was clever. Because it named something people had been experiencing without language for it.
So who builds the translation layer?
The bridge gets built by individuals

The obvious answer is institutions. Academic-industry partnerships, joint research labs, government-funded translation programs.
In practice, it rarely works cleanly. Authorship disputes, IP complications, publication timelines running perpendicular to shipping timelines. The institutions themselves are optimized for the wrong destination to close this gap.
What actually closes the gap looks nothing like an institution. It looks like one person who needed the thing to run, and had enough theoretical grounding to make it run right.
Ollama is one example. A small team was frustrated that running models locally required too much configuration. Now it’s one command:

ollama run llama3

100,000+ GitHub stars. Not a university project. Just people who wanted it to work on their own machines.

The honest counter: Ollama eventually attracted investment. So did Hugging Face, vLLM, and most of the other tools that closed the research-production gap. Institutions end up involved.
The claim is about sequence. In each case, someone decided the paper wasn’t enough and built the bridge before any institution sanctioned it. The initial translation from “this works in a paper” to “this works in your terminal” came from individuals whose primary motivation was running the thing. Funding arrived after the bridge existed, not before.
Big tech noticed. And drew exactly the wrong conclusion.
What the talent wars actually show

If individuals close the gap, the logic goes, buy the individuals. Pay enough, and the outcome follows.
Meta reportedly paid around $200M to bring Ruoming Pang from Apple in July 2025. He moved to OpenAI seven months later. The reporting around why is incomplete. But even if the details are contested, the pattern is familiar to anyone who has watched this play out inside organizations.
Money can relocate a person. It can’t relocate their definition of done.
OpenAI hired Pang. Both companies keep acquiring research prestige and attempting to convert it to production output. The conversion step is where the collision lives, every time.
The ARR numbers — OpenAI at roughly $25B, Anthropic at roughly $19B — tell a limited story. OpenAI has the user base and capital. What’s factually observable is that Anthropic’s strongest growth is in enterprise and coding.
Those buyers write reliability requirements into procurement documents. Whether that reflects the research-production gap being priced by the market, or simply different go-to-market choices, the data doesn’t settle it.
This is the actual frontier
Research produces ideas in one language. Production tries to implement them in another. The translation layer between them still doesn’t exist. That gap is where most AI projects actually fail — not in the model, not in the benchmark, but in the distance between the two.
The next meaningful advances won’t come from bigger benchmarks. They’ll come from whoever closes the distance between “works in the paper” and “works in the hospital, the factory, the government office, the small team with one overworked engineer.”
Nobody is writing LinkedIn posts about that work. It doesn’t look like progress from the outside. It just looks like something finally running.
The insight exists. The kitchen keeps failing it.
The two clicks that reveal everything
So back to that LinkedIn post. The string theory algorithm. NumPy. Repo below.
A) You opened their profile.
B) You clicked the repo.
Has your answer changed?
If it has, even slightly — that’s not small. The gap between those two instincts is the gap this field most needs to close.
References
1. Sculley, D., et al. “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015. The 5% figure (ML code as a fraction of the total production system) originates here and has been widely cited since.
2. Radoff, Jon. “The State of AI Agents in 2026.” Metavert Meditations (Substack), February 24, 2026.
3. Becker, Joel, et al. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” METR, July 2025. arXiv:2507.09089.
4. Willison, Simon. “Notes from the Dwarkesh Patel / Andrej Karpathy podcast.” simonwillison.net, October 2025.