Unlocking Ephemeral Testing with Generative AI: Part Two
Because zero marginal cost reshapes what tests are even for
This article was originally published on the Airbyte Blog.
In Part 1, we talked about using LLMs to generate ephemeral tests that unfreeze legacy code — basically treating your AI as a bored-but-willing intern who will brute-force the emergent contract you’re too scared to guess at.
But there’s an adjacent point I keep circling back to: if the marginal cost of writing a test is now effectively zero, we’ve been massively underusing tests everywhere, not just in fossilized code.
Most of us only write tests when we “need to.” Translation: right before we touch a landmine. But writing tests as part of the exploratory process? Understanding new tools? Probing edge cases you didn’t think of? Historically that’s been too slow, too expensive, and too annoying. Now it’s just a cheap prompt away.
1. Testing unfamiliar systems: documentation by empiricism
This is basically the same move as Part 1 but for greenfield work.
You pick up a new library, framework, or middleware — say, the thing at my job that magically turns incoming JSON into Kotlin objects, complete with an undocumented set of delightful quirks that everyone politely ignores.
Before, your options were:
Read the docs (ha).
Read the source (double ha).
Ship something, hit prod, and learn what actually happens (the traditional method).
Now you can just have an LLM carpet-bomb the thing with tests and infer the behavioral terrain map. Ask it for dozens of permutations: missing fields, extra fields, weird nesting, bad types, whitespace crimes — all the delightful real-world entropy the docs never mention.
Most of these tests you’ll delete. A few you might keep, because they reveal a “fun” subtlety that Future Developer (who is just Present Developer plus six months of memory decay) is going to trip over.
It’s like doing reconnaissance on a foreign API. Except now the recon is both free and tireless.
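To make the recon concrete, here's a minimal sketch of what those LLM-generated probes can look like. Everything here is hypothetical: `parse_user` is a stand-in (built on Python's stdlib `json`) for whatever middleware you're actually mapping; in practice you'd point the probes at the real deserializer and read the results as a behavioral terrain map.

```python
import json

# Hypothetical stand-in for the JSON-to-object middleware; swap in the
# real deserializer you're exploring.
def parse_user(raw: str) -> dict:
    data = json.loads(raw)
    return {"name": data.get("name", ""), "age": data.get("age")}

# Carpet-bomb probes: each records what actually happens, not what the
# docs claim.
probes = {
    "missing field": '{"name": "Ada"}',
    "extra field": '{"name": "Ada", "age": 36, "shoe_size": 7}',
    "weird nesting": '{"name": {"first": "Ada"}, "age": 36}',
    "bad type": '{"name": "Ada", "age": "thirty-six"}',
    "whitespace crimes": '  {"name"  :\n"Ada"}  ',
}

results = {}
for label, raw in probes.items():
    try:
        results[label] = parse_user(raw)
    except Exception as exc:  # capture the failure mode instead of hiding it
        results[label] = f"raised {type(exc).__name__}"

for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Running this surfaces exactly the kind of "fun" subtlety worth keeping a test for — here, for instance, that a nested object or a string-typed age sails straight through without complaint.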
2. Testing things you didn’t think of
At a recent conference, George Fraser from Fivetran said something to the effect of:
“I get an LLM’s opinion on everything I do, because sometimes it notices something I don’t.”
That stuck with me.
Not because I need more opinions in my life (I write software; opinions are my primary export), but because it’s the exact same philosophy as ephemeral testing:
Ask the model to test your code for cases you never considered.
Worst case: the tests are useless and you don’t commit them.
Best case: it surfaces a weird edge case that would have cost you a day of debugging and a grumpy Slack thread.
Treat your LLM like an extremely pedantic coworker who specializes in pointing out the one thing you forgot. The key difference is that you don’t owe the LLM coffee or emotional labor.
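Here's a hypothetical example of the kind of thing that pedantic coworker catches. Both the function and the scenario are invented for illustration; the point is that the flagged edge case becomes a cheap, permanent regression guard.

```python
def batting_average(hits: int, at_bats: int) -> float:
    # Original version divided unconditionally. An LLM review flagged
    # the at_bats == 0 case (a player who hasn't batted yet), which
    # would have raised ZeroDivisionError in production.
    if at_bats == 0:
        return 0.0
    return hits / at_bats

# The test the model proposed, kept as a regression guard:
assert batting_average(0, 0) == 0.0
assert batting_average(27, 100) == 0.27
```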
3. The next frontier: acceptance testing via prompt
This last bit is more speculative, but I’m increasingly convinced that some tests shouldn’t be written in code at all.
Long-lived automated tests age poorly. They embed outdated assumptions in a thousand helper functions and silently pass even after they stop exercising the actual code path.
But a prompt like:
“When I update the number of records moved in a replication job, the job summary returns the updated count.”
…is short, human-readable, and tightly scoped to intent, not implementation.
Imagine a small suite of English prompts that represent the product’s core acceptance criteria. As part of CI, you ask an LLM to execute those prompts against your system and confirm that reality matches the story.
Two nice properties fall out of this:
It detects mismatches between the code and the canonical user-facing behavior. Maybe the old tests still hit the v1 endpoint, while your docs point to v2. A human might miss that; an LLM poking at the surface won’t.
Anyone can write prompts. We needed QA teams because the tools for encoding and checking product-level behavior required a lot of technical expertise. If PMs, designers, and support can maintain their own prompts describing the workflows they care about, that unlocks entirely new ways of approaching QA.
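A minimal sketch of what that suite might look like in CI. Everything here is assumed: the prompt wording, the `check_prompt` function, and the pass/fail plumbing are stand-ins; a real checker would hand each prompt to an LLM driving your actual system (staging environment, API surface, whatever you trust it to poke).

```python
# Hypothetical: acceptance criteria as plain-English prompts, stored as
# data that a CI job can feed to an LLM one at a time.
ACCEPTANCE_PROMPTS = [
    "When I update the number of records moved in a replication job, "
    "the job summary returns the updated count.",
    "Creating a connection with invalid credentials shows a clear error, "
    "not a silent failure.",
]

def check_prompt(prompt: str) -> bool:
    # Stub: a real implementation would ask an LLM to execute the prompt
    # against the running system and confirm reality matches the story.
    return True

failures = [p for p in ACCEPTANCE_PROMPTS if not check_prompt(p)]
if failures:
    raise SystemExit(f"{len(failures)} acceptance prompt(s) failed")
print("all acceptance prompts passed")
```

Note what the suite is made of: requirements a PM could edit in a pull request, with zero helper-function archaeology.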
Of course, there’s an obvious landmine: LLMs are nondeterministic, and flaky tests are how engineering teams slowly lose their will to live.
So I’m not claiming victory here. This idea needs more experimenting — guardrails, temperature controls, scenario anchoring, maybe multiple-run consensus. But the shape of the opportunity is interesting: acceptance tests that read like requirements, not code.
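The multiple-run consensus idea, at least, is easy to sketch. This is a toy: `flaky_llm_check` simulates a nondeterministic checker that's right most of the time (here via a seeded random stand-in), and `consensus` takes the majority verdict across several runs so a single flaky response doesn't fail the build.

```python
import random
from collections import Counter

# Stand-in for a nondeterministic LLM-driven check; passes ~80% of the
# time, seeded so this sketch is reproducible.
def flaky_llm_check(prompt: str, rng: random.Random) -> bool:
    return rng.random() < 0.8

def consensus(prompt: str, runs: int = 5, seed: int = 0) -> bool:
    # Run the same check several times and return the majority verdict.
    rng = random.Random(seed)
    verdicts = [flaky_llm_check(prompt, rng) for _ in range(runs)]
    return Counter(verdicts).most_common(1)[0][0]

print(consensus("job summary returns the updated count"))
```

An odd run count avoids ties; in a real setup you'd likely also pin temperature and anchor each run to the same scenario fixture.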
The bigger shift: when tests hit zero marginal cost
This whole series boils down to a simple economic shift: writing tests just dropped from “painful but virtuous” to basically zero marginal cost. And any time something useful collapses to zero marginal cost (think Ben Thompson and Aggregation Theory), the right question isn’t “How do we do the old thing cheaper?” It’s “What new behaviors does this unlock?”
That’s the invitation here. Once tests are cheap, they stop being artifacts you carefully curate and start becoming probes — disposable instruments for exploring unfamiliar code, mapping legacy behavior, surfacing missed edges, or even expressing acceptance criteria in plain English. The examples in these posts are just early sketches. The real point is: we should get weird and creative again. When tests cost nothing, the space of things worth testing suddenly gets a lot bigger.
