Flamehaven.space
What AI Changed About Research Code — and What It Didn’t


The old bottleneck was writing the code. The new bottleneck is proving that the code still means what the theory meant.


A note on scope: this is not an essay about AI writing papers, replacing scientists, or somehow “doing science” on its own. It is about something quieter, but perhaps more consequential. The distance between theory, implementation, and verification is shrinking faster than research culture has adjusted to. In some domains, that distance has become short enough that the bottleneck is no longer writing research code at all. The bottleneck is knowing whether the code still means what the theory meant. The audience I have in mind is practical rather than abstract: undergraduate students beginning to use AI for scientific coding, graduate students building research pipelines, and working researchers trying to turn papers, equations, or biological heuristics into executable systems without losing scientific fidelity along the way.


1. The Day an Equation Had to Wait

Imagine being a researcher in the late 1950s.
You have an equation you want to test.
Not a philosophy.
Not a grand theory of science.
Just one equation, and one stubborn question:
If I translate this into a machine, what happens?
You do not open a notebook.
You do not paste the equation into a coding assistant.
You do not try five variations before lunch.
You prepare punched cards.
Then you hand them over and wait.
That waiting did more than slow research down.
It shaped thought itself.
When feedback takes a day instead of a second, you think differently.
You check your signs on paper.
You rehearse assumptions in your head.
You become conservative about experimentation, because every mistake costs real time.
When IBM released FORTRAN in 1957, it was revolutionary because it shortened that painful distance. Scientists, engineers, and mathematicians could finally express computation in something closer to mathematical notation rather than raw machine instructions.
The name itself — Formula Translation — captured the dream. The machine had become a little more reachable. The equation no longer had to travel quite so far before becoming executable.
But the deeper problem did not disappear.
The real cost was never only typing.
It was translation.
Between the blackboard and the machine lived what we might call the translation tax: the time, energy, and meaning lost when theory had to be handed off to implementation, and then handed off again to interpretation.
Often the person who understood the theory was not the person who wrote the code. Often the person who wrote the code was not the person who judged whether the result made sense. Scientific computing was not just computation. It was staged translation.
For decades, much of progress in scientific software came from shrinking that tax.

2. We Have Been Trying to Shrink This Gap for a Long Time

Interactive workstations helped.
MATLAB helped.
The scientific Python stack helped.
Jupyter helped even more.
Each step made it easier to move from idea to execution.
Each step tightened the loop.
A notebook was especially important because it let three things sit in one place:
  • Explanation
  • Code
  • Output
The markdown cell could say what you meant.
The code cell could attempt to do it.
The plot could show the consequences.
This felt, for good reason, like a small intellectual reunification.
But even then, a gap remained.
The equation was still one thing.
The implementation was still another.
A researcher still had to cross the bridge between them manually.
The notebook made the bridge shorter, but it did not eliminate it.
That distinction matters because the hardest errors in scientific software are often not syntax errors. They are fidelity errors.
The code runs.
The shapes match.
The plots look plausible.
The tests pass.
And yet the thing the code computes is not quite the thing the theory described.
That problem existed before AI.
AI did not invent it.
AI changed how fast we can now run into it.

3. The Old Bottleneck Was Writing the Code

A useful way to understand the present moment is this:
  • The old bottleneck was often, Can we implement this at all?
  • The new bottleneck is increasingly, Did we actually implement the thing we think we implemented?
This is the real shift.
AI-assisted tools can now turn descriptions into code with astonishing speed.
  • A student can take a paper, summarize the core idea in English, add a few equations, and get a working scaffold in minutes.
  • A researcher can sketch a scoring rule, a simulation routine, or a filtering step and immediately test it.
  • A small lab can now build software prototypes that once required a much longer technical handoff.
That matters.
It lowers the cost of realization.
More ideas can become executable.
More hypotheses can become testable.
More variants can be explored in one sitting.
But the speed moved the failure boundary.
When code is produced at the speed of explanation, the most dangerous error is no longer a compiler crash. It is something quieter and more respectable-looking: an answer that looks correct, not because the scientific object was truly implemented, but because the benchmark was too forgiving to expose its absence.
Sometimes a missing idea survives simply because all the early test cases live in worlds where the right answer was going to look quiet anyway.

4. A Physics Story: When Zero Was Right for the Wrong Reason

A useful example comes from a specialized corner of high-energy physics.
The field itself is not the point here. The point is the pattern: a hard mathematical object can appear to be implemented long before the code is actually forced to reveal it.
In one physics-oriented verification engine, the system claimed to evaluate several nontrivial mathematical terms — curvature contributions, flux contributions, and other pieces that should matter once the background becomes complicated enough.
Early in development, some of those pieces were placeholders. That is normal. Engineering often begins that way.
The interesting part came later.
Some of those placeholders kept “passing.” Not because they had been fully implemented, but because the standard benchmark cases lived in quiet regions of the theory.
In those simplified settings, the correct answer really could be zero. A missing contribution and a genuine vanishing contribution became observationally identical.
This is exactly the sort of mistake that ordinary testing can miss.
A unit test can confirm that the function returns an array of the right shape.
Static analysis can confirm that the code is structurally sound.
A regression suite can confirm that familiar examples still produce familiar outputs.
And still the central scientific question remains unanswered:
Is the mathematical object actually alive in the code path?
That question is harder, but more important.
It asks not merely whether the code returns the expected value in an easy case, but whether the code can be forced to reveal the presence of the thing it claims to compute.
If a curvature term is real, choose a case where curvature cannot stay quiet.
If a field contribution is real, choose a case where removing it changes the answer.
If nothing changes when theory says something should, then the code may still be carrying a beautifully organized absence.
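This pattern can be sketched as a toy test in Python. Everything here is illustrative: `ricci_scalar_2sphere` and `action_density` are stand-ins for a real verification engine, and the 2-sphere is chosen only because its curvature cannot stay quiet at any finite radius.

```python
def ricci_scalar_2sphere(radius):
    """Scalar curvature of a 2-sphere of radius r: R = 2 / r^2, nonzero for any finite r."""
    return 2.0 / radius**2

def action_density(radius, include_curvature=True):
    """Toy 'action density' with an optional curvature contribution.

    Stands in for a real evaluation engine; include_curvature=False
    simulates a placeholder that silently returns zero.
    """
    curvature_term = ricci_scalar_2sphere(radius) if include_curvature else 0.0
    return curvature_term  # other contributions omitted in this toy

# Forcing case: on a curved background, the term cannot stay quiet.
assert action_density(radius=1.0) != 0.0, "curvature term is absent from the code path"

# Quiet case: in the near-flat limit (huge radius), zero is essentially the
# right answer, so passing here proves nothing about the term's presence.
assert abs(action_density(radius=1e6)) < 1e-9
```

The second assertion is the trap from the story above: it passes whether or not the curvature term exists, which is exactly why it cannot be the only test.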

5. A Biology Story: Beautiful Structures Can Still Be Wrong

Biology gives us the same lesson in a more visual form.
A predicted protein structure can look beautiful on screen and still fail the most basic test of physical reality.
Too many atoms may be trying to occupy the same space at once. A binding geometry may look persuasive until one model disagrees sharply with another. A candidate may appear coherent until a downstream filter asks a harder question than the generator did.
Here the danger is not that the system produces obvious nonsense. The danger is that it produces plausible structure.
And plausible structure is exactly what tempts human beings to relax too early.
In this kind of workflow, disagreement becomes interesting.
If two strong models produce materially different answers, that discrepancy is not merely noise to be averaged away. Sometimes it is the most scientifically valuable object in the room.
It tells you that the output is underdetermined, or that the data are thin, or that the verification layer is weaker than the generation layer.
That is why biology offers such a good mirror for this problem. The eye can be fooled by elegance. So can the benchmark. So can the code review.
A structure that looks convincing may still collapse under steric clashes — physically impossible atomic overlaps — or fail a downstream constraint that the generator never really understood.
Once again, the problem is not that the code produced nothing. It is that it produced something persuasive enough to delay suspicion.
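As a minimal sketch of that kind of downstream filter, here is a toy steric clash check in Python. The radii table is illustrative rather than a curated force-field set, and a real pipeline would also exclude covalently bonded pairs, which this sketch ignores.

```python
import itertools
import math

# Illustrative van der Waals radii in angstroms (not a curated set).
VDW_RADIUS = {"C": 1.7, "N": 1.55, "O": 1.52, "H": 1.2}

def steric_clashes(atoms, tolerance=0.4):
    """Return atom pairs closer than the sum of their vdW radii minus a tolerance.

    atoms: list of (element, x, y, z) tuples. Bonded pairs are not excluded
    here, which a real structural filter would have to handle.
    """
    clashes = []
    for (e1, *p1), (e2, *p2) in itertools.combinations(atoms, 2):
        dist = math.dist(p1, p2)
        if dist < VDW_RADIUS[e1] + VDW_RADIUS[e2] - tolerance:
            clashes.append(((e1, tuple(p1)), (e2, tuple(p2)), round(dist, 2)))
    return clashes

# Two carbons 1.0 A apart are physically impossible; the filter must fire.
assert steric_clashes([("C", 0, 0, 0), ("C", 1.0, 0, 0)])
# Two carbons 4.0 A apart are fine; the filter must stay silent.
assert not steric_clashes([("C", 0, 0, 0), ("C", 4.0, 0, 0)])
```

The point of the two assertions is symmetry: a filter that never fires is as suspect as one that fires everywhere, and both failure modes can hide behind a plausible-looking structure.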

6. What Kind of Verification Culture Should Follow This?

If research code can now be produced at the speed of explanation, what kind of verification culture should follow it?
Not formal proof in every case. That is unrealistic.
Not blind faith in passing tests. That is no longer sufficient.
What seems increasingly necessary is a style of verification that stays tightly coupled to the scientific object itself.
In practice, the habit looks something like this:
  1. Name the object. Be specific about what the code claims to compute. In physics, this might be a curvature term, a flux contribution, or a consistency condition — a check on whether the background hangs together mathematically. In biology, it might be a structural filter, an interaction rule, or a plausibility constraint.
  2. Get it running quickly. Build the implementation, manually or with AI assistance. The goal at this stage is not perfection. The goal is to get something executable into the world so that it can be interrogated.
  3. Ask the code a question it cannot answer correctly unless the object is really there. This is the step many people skip. If a tensor term is real, choose a case where it cannot stay zero for trivial reasons. If a biological constraint is real, feed the system a case that should trigger a visible problem: steric clashes, geometric strain, or disagreement between strong models.
  4. Turn nearby signals on and off. In a physics workflow, this may mean checking what changes when curvature, topology, or the dilaton — a scalar field term in the consistency equations — are varied separately. In a biology workflow, it may mean comparing the output before and after a structural filter, or asking whether the disagreement disappears when a constraint layer is removed.
  5. Interpret unchanged behavior as evidence. If nothing changes when theory says something should, do not treat that as a comforting pass. Treat it as evidence that the claimed object may not yet exist in the implementation path.
  6. Patch and repeat. The patch may be mathematical, architectural, or procedural. What matters is that each cycle makes the link between the scientific claim and the executable behavior harder to fake.
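The on-and-off habit above can be compressed into one small harness: run the pipeline with and without the claimed contribution, and treat unchanged output as evidence of absence. The `pipeline(x, enabled)` interface here is a hypothetical sketch, chosen only to make the toggle explicit; in a real system the toggle might be a config flag or a swapped module.

```python
def ablation_check(pipeline, inputs, component):
    """Run pipeline with and without a component; flag silent no-ops.

    pipeline(x, enabled) is a hypothetical interface: the caller supplies a
    function that can toggle one named contribution on or off.
    """
    for x in inputs:
        with_term = pipeline(x, enabled=True)
        without_term = pipeline(x, enabled=False)
        if with_term == without_term:
            # Unchanged output where theory says the term matters is evidence
            # that the claimed object may not exist in the code path.
            raise AssertionError(
                f"component {component!r} had no effect on input {x!r}"
            )
    return True

# Toy pipeline: a 'flux term' that actually does work when enabled.
def toy_pipeline(x, enabled):
    flux = 0.5 * x if enabled else 0.0
    return x + flux

assert ablation_check(toy_pipeline, inputs=[1.0, 2.0], component="flux term")
```

Note the design choice: the harness raises on sameness, not on difference. A conventional regression test rewards stability; this one deliberately rewards sensitivity, which is what the fidelity question demands.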
This is not glamorous. It does not sound like frontier AI. But it may become one of the most important habits of the next phase of research software.
A passing result is no longer the end of scrutiny. Sometimes it is the beginning.

7. The Feynman Question Hiding Inside the AI Problem

At this point, the problem stops looking purely technical.
It becomes epistemic.
Richard Feynman once put the underlying danger very plainly:
“The first principle is that you must not fool yourself — and you are the easiest person to fool.”
That warning feels newly relevant here.
The problem with AI-assisted research code is not merely that it can generate mistakes. Human programmers have always generated mistakes.
The deeper problem is that AI can generate mistakes wrapped in competence.
It can produce code that makes it easier to reassure ourselves too early.
That is why the real question is not, “Can the model write the code?”
The real question is closer to a Feynman question:
How do you know the code is right for the right reason?
Or more sharply:
What question could you ask the code that it could not answer correctly unless the missing idea were genuinely present?
That is not only a debugging question.
It is a scientific question.
And it may become one of the defining questions for students and researchers who grow up with AI as part of their normal research environment.

8. What AI Changed About Research Code — and What It Didn’t

AI changed the economics of implementation.
It changed the speed at which theory can become executable.
It changed how many variants a small team can explore in one sitting.
It changed the practical cost of turning paper ideas into runnable artifacts.
But it did not eliminate ambiguity.
It did not make passing results trustworthy by default.
It did not remove the need for carefully designed hard cases, meaningful constraints, or domain-specific suspicion.
The old translation tax has not vanished so much as changed shape. Part of it has been compressed. Another part has reappeared as a fidelity problem: not can we write the code? but does the code still faithfully instantiate the thing the theory intended?
That may be one of the central methodological questions of this new era of scientific software.
A fair objection belongs here.
Sometimes AI-generated code really is easier to verify.
That is true. It is also not enough.
Readability improves review. It does not guarantee that the scientific object named in the paper is actually instantiated in the logic that drives the result. In some cases, the polished appearance creates a new hazard: the code looks rigorous before it has earned scientific trust.
This is why what we might call Theory-Coupled Verification is beginning to matter. Not verification as an afterthought — a test suite written once and trusted forever — but verification that stays actively coupled to the meaning of each computational object, asking at each stage whether the implementation can be forced to behave as the theory demands.

9. What Changes Next

We have been translating thought into machinery for decades.
What feels new is not that translation itself exists, but that the machinery now answers back fast enough that verification can happen in the same working cycle as implementation.
That creates a real opportunity — especially for smaller labs, individual researchers, and students who can now turn an idea into executable code without waiting for a long technical handoff.
But the opportunity is narrow if the verification habit does not mature with it.
Faster generation can just as easily produce more polished mistakes. So the real gain is not speed alone. It is the chance to build a better discipline around implementation fidelity: asking not just whether an output looks right, but whether the scientific object we care about is actually doing work inside the code.
That is a modest claim.
But it may be the more useful one.
It is also where a fair skepticism begins.
If this is more than a good-sounding philosophy, then it should be visible in code, in tests, in failure cases, and in the way a system changes after being proven wrong.
That is where I want to go next: away from the general argument and into the working artifacts.
In the next piece, I will show what this looks like when the method is forced into practice — through actual patterns from Flamehaven-TOE and RexSyn, where verification stopped being a principle and became part of the algorithm.
