Can you transform unreliable agentic tests into deterministic, repeatable tests using a “run twice” pattern?
I wondered if this works… so I tried it.
Why did Claude Code suddenly start building a test automation framework I didn’t ask for?
Towards the end of Experiment #2, things went sideways. I asked Claude Code to create a slash command for test automation. Instead, it started generating commands for converting tests to YAML, creating action libraries, building test runners… way beyond what I’d asked for.
Then I realized: I’d forgotten about a markdown document in my project folder. A document about “deterministic test automation” that I’d been exploring with Claude Code in a previous session. When I said “read all the files in this project,” it read that forgotten document.
And that context guided everything that followed.
That last one – the forgotten document – is the key to understanding this entire experiment.
Here’s how I got this working (or rather, how it got itself working):
Installation:
Chrome DevTools MCP already configured from Experiment #2. Nothing new needed.
Configuration:
Discovered Claude Code had read a “deterministic testing” markdown file I’d forgotten about. That file explained concepts like:
Starting Point:
Picking up where Experiment #2 unexpectedly pivoted – with three slash commands Claude Code had generated:
/run-test – Execute markdown test case using Chrome DevTools MCP
/convert-test-to-yaml – Create deterministic YAML specification
/create-action – Build reusable action library
I followed the three-step workflow Claude Code had created to see what would actually happen.
What I asked:
What happened:
Claude Code executed the markdown test case using Chrome DevTools MCP. It:
But it didn’t just run the test. It also:
test-evidence folder
Wait, it created a complete test execution framework just from the markdown?
That’s what I wasn’t expecting. It built evidence capture, reporting, and verification tracking automatically.
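To make "evidence capture and verification tracking" a little more concrete, here is a minimal Python sketch of the idea. It is not the code Claude Code generated: the function name `save_evidence`, the JSON layout, and the test name are all assumptions made for illustration. Only the `test-evidence` folder name comes from the run above.

```python
import json
from pathlib import Path


def save_evidence(folder, test_name, verifications):
    """Write one JSON report per test run and return its path.

    `verifications` is a list of (description, passed) pairs.
    The report layout is hypothetical, not the format Claude Code produced.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    report = {
        "test": test_name,
        # The run passes only if every tracked verification passed.
        "passed": all(passed for _, passed in verifications),
        "verifications": [
            {"description": description, "passed": passed}
            for description, passed in verifications
        ],
    }
    path = folder / f"{test_name}.json"
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```

Calling `save_evidence("test-evidence", "add-account", [("account appears in list", True)])` would leave a small report on disk that can be compared between runs.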
What I asked:
What happened:
Claude Code created a structured YAML specification. It:
test-specs folder
navigate – Go to URL
click-button – Click elements
select-dropdown – Choose options
fill-text – Enter data
verify-text – Assert expected content
The YAML spec captured everything from the agentic run – navigation steps, form interactions, verifications – but structured it as commands that could execute deterministically.
This is a “run twice” pattern – agentic first to explore, then deterministic for reliability.
The first run uses the LLM to figure out how to interact with the application. The second run captures that as a deterministic specification.
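A rough sketch of what the deterministic half could look like, written in Python rather than taken from the generated framework. The five action names are the ones listed above. Everything else is an assumption for illustration: the field names (`url`, `target`, `value`, `expected`), the example URL and values, and the `FakeBrowser` stand-in, which only records calls instead of driving a real browser. The spec is shown as a Python list, which is the shape a YAML file of steps would load into.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBrowser:
    """Stand-in for a real browser driver; it only records what it is asked to do."""
    page_text: str = "Account added"
    log: list = field(default_factory=list)

    def navigate(self, url):
        self.log.append(("navigate", url))

    def click(self, target):
        self.log.append(("click", target))

    def select(self, target, value):
        self.log.append(("select", target, value))

    def fill(self, target, value):
        self.log.append(("fill", target, value))

    def text(self):
        return self.page_text


# Hypothetical spec: field names, URL, and values are illustrative only.
SPEC = [
    {"action": "navigate", "url": "https://example.test/accounts"},
    {"action": "click-button", "target": "Add account"},
    {"action": "select-dropdown", "target": "Account type", "value": "Investment"},
    {"action": "fill-text", "target": "Account name", "value": "Demo"},
    {"action": "verify-text", "expected": "Account added"},
]


def run_spec(spec, browser):
    """Execute each step in order. No LLM is involved at this stage."""
    for step in spec:
        action = step["action"]
        if action == "navigate":
            browser.navigate(step["url"])
        elif action == "click-button":
            browser.click(step["target"])
        elif action == "select-dropdown":
            browser.select(step["target"], step["value"])
        elif action == "fill-text":
            browser.fill(step["target"], step["value"])
        elif action == "verify-text":
            if step["expected"] not in browser.text():
                raise AssertionError(f"expected text not found: {step['expected']!r}")
        else:
            raise ValueError(f"unknown action: {action!r}")
    return browser.log
```

Because the runner only dispatches on fixed action names, two runs of the same spec against the same application state issue the same calls in the same order. That is the property the agentic first run cannot guarantee.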
What I asked:
What happened:
Claude Code extracted a reusable action from the test case. It:
reusable-actions folder
Here’s what’s interesting: the original test case was specific – “add an investment account with these exact values.” But the action it extracted was generic – “add any type of account with any values.”
That’s pretty impressive!
It’s building a test automation framework without being explicitly asked – extracting patterns and creating reusable components that could work for multiple test scenarios.
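One way to picture the jump from a specific test to a generic action is a step list with placeholders that get filled in per test. The sketch below is an assumption about how that could work, not the action file Claude Code wrote: the `ADD_ACCOUNT` steps, the `$account_type` and `$account_name` placeholders, and the `expand_action` helper are all hypothetical.

```python
from string import Template

# Hypothetical reusable action: the "add account" steps with placeholders
# where the original test case had hard-coded values.
ADD_ACCOUNT = [
    {"action": "click-button", "target": "Add account"},
    {"action": "select-dropdown", "target": "Account type", "value": "$account_type"},
    {"action": "fill-text", "target": "Account name", "value": "$account_name"},
    {"action": "verify-text", "expected": "$account_name"},
]


def expand_action(action_steps, **params):
    """Return concrete steps by substituting params into every string field.

    Template.substitute raises KeyError when a placeholder has no value,
    so a forgotten parameter fails loudly instead of reaching the browser.
    """
    expanded = []
    for step in action_steps:
        expanded.append(
            {key: Template(value).substitute(params) for key, value in step.items()}
        )
    return expanded
```

With this shape, `expand_action(ADD_ACCOUNT, account_type="Investment", account_name="Demo")` reproduces the original specific test, and any other account type reuses the same steps with different values.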
After running through the three-command workflow, some clear patterns emerged:
Works well for:
Gets messy with:
Surprises:
⚡ Quick Verdict:
🟡 AMBER: “Really interesting, no idea where this goes”
The Good:
The Concerns:
Would I use this?
Maybe. I still need to complete the framework and validate that it works.
The “run twice” pattern feels right: agentic for exploration, deterministic for reliability. If the YAML specs actually execute reliably, this could be valuable.
For what?
When would I NOT use it?
What I’m still curious about and want to test further…
Context matters enormously.
A forgotten document about deterministic testing guided Claude Code to build an entire framework I wasn’t expecting.
I was just exploring ideas with Claude Code – writing down thoughts about moving from agentic to deterministic testing, discussing action libraries, thinking about YAML specifications. That document sat in my project folder. I moved it to a temp location and forgot about it.
When I started Experiment #2 and said “read all the files in this project,” Claude Code read that document. And when things went off the rails, that context guided it to build exactly what I’d been theorizing about.
This suggests a pattern: have Claude Code help you explore ideas and concepts, keep those documents in context, and let that guide future implementation.
I’m working this out as I go along, but the “run twice” pattern that emerged – agentic for exploration, then deterministic for reliability – feels like a real insight.
If you’re interested in trying this, these are the exact prompts I used:
Important Note: These slash commands were generated by Claude Code at the end of Experiment #2. They were guided by a forgotten “deterministic testing” markdown document that was sitting in my project context.
If you want to try this approach, you’d need to:
Want to try this yourself?
Watch this space – this needs follow-up to see if it actually delivers on the promise. I need to:
Let me know if you’ve experimented with agentic vs deterministic testing approaches – I’m especially curious about whether the “run twice” pattern resonates with your experience.