Unlocking Ephemeral Testing with Generative AI: Part One
Because sometimes the best tests are the ones you leave behind
This article was originally published on the Airbyte Blog.
Every codebase has that file.
The one everyone tiptoes around in code review.
The one that’s “critical” and “old” and “owned by… someone who left three years ago.”
The one with no tests, no docs, and a suspicious # TODO: refactor from 2019.
What happens to that code? It freezes.
No one wants to touch it. Or when they do touch it, production catches fire and we all learn, again, that fear is a perfectly rational emotion.
AI lets us unfreeze this code by generating piles of tests that reveal the emergent contract of the existing code before we rewrite anything.
This post is the first in a small series on “ephemeral testing”: an exploration of how generative AI unlocks new approaches to testing. In the past, tests were slow to write and expensive to maintain. With LLMs, they’re cheap — which opens up entirely new ways to use them.
Legacy code without context: the perfect AI target
The pattern is simple:
You find some untested, low-context, scary code.
You want to modify it — for correctness, performance, or sanity.
You have no idea how it’s actually being used in the wild.
Hyrum’s Law whispers: “If an API can be depended on, it will be.”
Translation: someone is relying on behavior you don’t know exists.
Historically, the “responsible” move here is:
Write a couple of tests for the obvious cases.
Make your change.
Pray.
What you’re really doing here is empirically mapping the code’s behavior — reverse-engineering its actual contract by peppering it with tests. It’s useful, but it’s also tedious enough that most developers will only scratch the surface before giving up.
Enter generative AI: a junior engineer who never gets bored and will happily spit out 77 test cases for a single function without blinking.
The date parser from hell
Here’s a real example I used.
We’ve got a Python function whose job is:
“Take any string that looks like a time, date, or datetime and return an ISO 8601 datetime.”
Here’s the original implementation:
from dateutil import parser


def parse_to_iso8601(date_string):
    """
    Date parser implementation. Converts any string that looks like a
    time, date, or datetime into an ISO 8601 datetime.
    """
    if not isinstance(date_string, str):
        return None
    try:
        dt = parser.parse(date_string)
        return dt.isoformat()
    except (ValueError, TypeError, parser.ParserError):
        return None

A couple of problems:
It uses dateutil.parser, which is flexible but slow.
The function takes a plain str. No type hints, no constraints.
This code is presumably used all over the place, in ways no one fully remembers.
There are zero tests.
This is a classic Hyrum’s Law trap. If it “kind of works” for arbitrary date-ish strings, someone, somewhere, is passing it absolute garbage in production — and depending on how it behaves.
I’d like to replace dateutil with the more efficient ciso8601 library. Easy change, right?
Sure. If you don’t care what breaks.
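To see why, it helps to poke at just how permissive dateutil is. Here's an illustrative probe (the inputs are my own picks, not an exhaustive corpus), assuming python-dateutil is installed:

```python
# Probe dateutil's permissiveness: wildly different spellings of the
# same date all parse successfully.
from dateutil import parser

for s in ["2024-01-15", "01/15/2024", "Jan 15, 2024", "20240115"]:
    print(f"{s!r} -> {parser.parse(s).isoformat()}")
```

Every one of those strings comes back as the same ISO datetime. A strict ISO 8601 parser will reject most of them outright.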
Point the AI at it: “Please describe all my sins”
Instead of guessing which formats matter, I ask an LLM to do the annoying part for me.
Here’s the prompt I actually used:
I am about to rewrite the implementation of the function in date_parser.py. The previous developer didn’t write any tests. I don’t have full context into all of the places where this function is being used. Because the input is a string, the breadth of inputs coming in could be quite broad. I need you to write tests that capture all of the existing behavior of the current implementation. I especially need to test every permutation of date, time, and datetime inputs. I will then use these tests to make sure that my new implementation doesn’t cause a regression.
The model happily generated 77 tests in seconds.
They covered:
Date formats: 2024-01-15, 01/15/2024, 15/01/2024, 20240115, Jan 15, 2024, January 15th, 2024, etc.
Times: 14:30, 2:30 PM, 14:30:45.123456, “midnight”, “noon”.
Datetimes: ISO strings, “January 15, 2024 2:30 PM”, 2024-01-15 23:59:59, slash formats with times.
Timezones: Z, +05:00, -08:00, “EST”.
Edge cases: leap years, invalid dates, numeric strings, weird whitespace.
Invalid input types: None, 123, {}, True, etc.
Here’s a tiny sample (out of those 77):
class TestDateOnly:
    def test_iso_date_format(self):
        assert parse_to_iso8601("2024-01-15") == "2024-01-15T00:00:00"

    def test_slash_date_format_mdy(self):
        assert parse_to_iso8601("01/15/2024") == "2024-01-15T00:00:00"

    def test_written_date_format(self):
        assert parse_to_iso8601("January 15, 2024") == "2024-01-15T00:00:00"


class TestTimeOnly:
    def test_24hour_time_with_seconds(self):
        result = parse_to_iso8601("14:30:45")
        assert result is not None
        assert "14:30:45" in result

    def test_12hour_time_pm(self):
        result = parse_to_iso8601("2:30 PM")
        assert result is not None
        assert "14:30:00" in result


class TestInvalidInputs:
    def test_non_string_input_integer(self):
        assert parse_to_iso8601(123) is None

    def test_invalid_date_values(self):
        assert parse_to_iso8601("2024-13-01") is None

Do I know that this is perfect coverage? No.
Do I know it’s vastly better than the 3 tests I would’ve manually written before getting bored? Absolutely.
Now swap the implementation
Here’s the new implementation using ciso8601:
import ciso8601


def parse_to_iso8601(date_string):
    if not isinstance(date_string, str):
        return None
    try:
        dt = ciso8601.parse_datetime(date_string)
        if dt is None:
            return None
        return dt.isoformat()
    except (ValueError, TypeError):
        return None

Same interface. Stricter, faster parser.
I run the AI-generated tests against this new version.
42 tests fail.
Perfect.
Not because I enjoy failure (I work in software, I get plenty), but because this is exactly the information I actually need:
Where does ciso8601 behave differently from dateutil?
Which formats were silently “working” before that will now break?
Which of those differences do I care about, and which are acceptable changes?
Reading failing tests as a behavioral diff
The failing tests become a behavioral diff between “legacy weirdness” and “new stricter behavior.”
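You don't even need a test runner to see this diff: run the same corpus through both implementations and report where they disagree. Here's a minimal sketch of that idea, using `datetime.fromisoformat` as a stand-in for the strict new parser (ciso8601 differs in the details, but the shape of the diff is the same):

```python
# Sketch of a behavioral diff: run the same inputs through the lenient
# legacy parser and a strict replacement, and report disagreements.
# datetime.fromisoformat stands in for the strict parser here.
from datetime import datetime
from dateutil import parser


def legacy(s):
    try:
        return parser.parse(s).isoformat()
    except (ValueError, TypeError, parser.ParserError):
        return None


def strict(s):
    try:
        return datetime.fromisoformat(s).isoformat()
    except ValueError:
        return None


def behavioral_diff(corpus):
    """Map each disagreeing input to (legacy_result, strict_result)."""
    return {s: (legacy(s), strict(s))
            for s in corpus
            if legacy(s) != strict(s)}


diff = behavioral_diff(["2024-01-15", "01/15/2024", "Jan 15, 2024"])
for s, (old, new) in diff.items():
    print(f"{s!r}: legacy={old!r} strict={new!r}")
```

Every entry in that dict is a decision waiting to be made.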
Now I get to work through them deliberately:
Maybe I don’t care that “January 2024” used to be accepted and now isn’t. Mark that as an intentional breaking change.
Maybe I do care that “2024-01-15 14:30” used to parse fine and now fails. I can either:
adjust the new implementation, or
add a small compatibility shim, or
explicitly document the supported formats.
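That shim could be as simple as trying the strict parser first and falling back to the lenient legacy one only when it fails. A hypothetical sketch, again using `datetime.fromisoformat` as a stand-in for ciso8601 so the fallback logic is the focus:

```python
# Hypothetical compatibility shim: strict parser on the fast path,
# lenient legacy parser as a fallback for formats we chose to keep.
# datetime.fromisoformat stands in for ciso8601 here.
from datetime import datetime
from dateutil import parser


def parse_to_iso8601_compat(date_string):
    if not isinstance(date_string, str):
        return None
    try:
        # Fast path: strict ISO 8601 parsing.
        return datetime.fromisoformat(date_string).isoformat()
    except ValueError:
        pass
    try:
        # Slow path: legacy formats we explicitly still support.
        return parser.parse(date_string).isoformat()
    except (ValueError, TypeError, parser.ParserError):
        return None
```

Well-formed ISO inputs get the fast parser; everything else pays the old dateutil cost, which keeps the legacy behavior without giving up the speedup where it matters.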
The key point: with almost no manual effort, I’ve surfaced dozens of behavioral differences I’d never have thought to test.
Without these tests, I would have:
Shipped the new implementation.
Broken a bunch of obscure paths.
Found out from angry users and mysterious alerts.
With the ephemeral tests, I instead:
See the blast radius before I ship.
Choose which behaviors to preserve.
Turn accidental behavior into intentional behavior.
“Ephemeral” tests: why we don’t keep them all
Crucially, I don’t intend to commit all 77 tests.
Most of them are scaffolding:
They exist to help me understand current behavior.
They help me safely refactor.
Once I’ve decided what behavior I actually support, their job is done.
In practice, I’ll:
Keep a subset of tests that define the intended contract going forward.
Drop the ones that encode legacy quirks I’ve explicitly chosen to remove.
Possibly regenerate a smaller, cleaner test suite that matches the new behavior, also with AI’s help.
This is why I call them ephemeral tests:
They’re part of the development process, not necessarily part of the enduring test suite.
They’re like temporary scaffolding around a building: essential while you’re doing the work, ugly if you leave them up forever.
Why this is such a big unlock for teams
As a leader, I see this pattern all the time:
Engineers are scared to touch old, critical code.
The lack of tests becomes a psychological barrier, not just a technical one.
Refactors get kicked down the road because “it’s risky” and everyone’s busy.
Generative AI doesn’t magically make that risk go away — but it gives us a cheap, fast way to map it.
Now, when someone on the team has to change a scary subsystem, I can give them a playbook:
Identify the untested surface you’re about to touch.
Ask an LLM to generate aggressive characterization tests for the current behavior.
Make your changes.
Run the tests.
Inspect what broke:
Decide what to keep.
Decide what to drop.
Turn surprises into choices.
Keep only the tests that define the contract you care about.
Instead of “I’m afraid to change this,” the conversation becomes:
“Here are the 17 behaviors that will change if we ship this.
We care about these 5. The rest are deprecated weirdness.”
That’s a very different kind of engineering culture.
