AI Experiment #3: When Claude Code Builds a Framework You Didn’t Ask For

Can you transform unreliable agentic tests into deterministic, repeatable tests using a “run twice” pattern?

I wondered if this works… so I tried it.

The Question

Why did Claude Code suddenly start building a test automation framework I didn’t ask for?

Towards the end of Experiment #2, things went sideways. I asked Claude Code to create a slash command for test automation. Instead, it started generating commands for converting tests to YAML, creating action libraries, building test runners… way beyond what I’d asked for.

Then I realized: I’d forgotten about a markdown document in my project folder. A document about “deterministic test automation” that I’d been exploring with Claude Code in a previous session. When I said “read all the files in this project,” it read that forgotten document.

And that context guided everything that followed.

What I’m Using

Chrome DevTools MCP (already configured from Experiment #2)
Claude Code
Financial Dashboard app (v0 app for testing)
Test case markdown documents
Hidden context: a forgotten “deterministic testing” document

That last one – the forgotten document – is the key to understanding this entire experiment.

The Setup

Here’s how I got this working (or rather, how it got itself working):

Installation:
Chrome DevTools MCP already configured from Experiment #2. Nothing new needed.

Configuration:
I discovered that Claude Code had read a “deterministic testing” markdown file I’d forgotten about. That file explained concepts like:

Moving from agentic (AI-driven, somewhat unpredictable) to deterministic (scripted, reliable) testing
Creating action libraries
Building YAML specifications
Test execution frameworks

Starting Point:
Picking up where Experiment #2 unexpectedly pivoted – with three slash commands Claude Code had generated:

/run-test – Execute markdown test case using Chrome DevTools MCP
/convert-test-to-yaml – Create deterministic YAML specification
/create-action – Build reusable action library

The Experiment

I followed the three-step workflow Claude Code had created to see what would actually happen.

Try #1: Run the agentic test

What I asked:

/run-test test-cases/test-management/TC-001-add-investment-account.md

What happened:
Claude Code executed the markdown test case using Chrome DevTools MCP. It:

Navigated to the accounts dashboard
Clicked the “Add Account” button
Selected “Investment” account type
Filled in account name: “Main Investment”
Filled in description: “My Investment Account”
Submitted the form
Verified the account was created
Hovered over the account to show highlight effect
Completed all 13 test steps

But it didn’t just run the test. It also:

Created a test-evidence folder
Captured screenshots at key points
Generated a comprehensive test execution report with:
- Summary
- Execution details
- Step-by-step verification results
- Evidence captured
- Observations
- Issues (none found)
- Test status: PASSED

Wait, it created a complete test execution framework just from the markdown?

That’s what I wasn’t expecting. It built evidence capture, reporting, and verification tracking automatically.
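To give a feel for what "evidence capture, reporting, and verification tracking" amounts to, here is a minimal sketch in Python. It is not the code Claude Code generated; the class names (StepResult, ExecutionReport), the report layout, and the screenshot path are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepResult:
    number: int
    description: str
    passed: bool
    screenshot: Optional[str] = None  # hypothetical path inside test-evidence/


@dataclass
class ExecutionReport:
    test_id: str
    steps: List[StepResult] = field(default_factory=list)

    @property
    def status(self) -> str:
        # A run passes only if at least one step ran and every step passed.
        if self.steps and all(step.passed for step in self.steps):
            return "PASSED"
        return "FAILED"

    def to_markdown(self) -> str:
        lines = [
            f"Test Execution Report: {self.test_id}",
            f"Test status: {self.status}",
            "",
            "Step-by-step verification results:",
        ]
        for step in self.steps:
            mark = "PASS" if step.passed else "FAIL"
            evidence = f" (evidence: {step.screenshot})" if step.screenshot else ""
            lines.append(f"{step.number}. [{mark}] {step.description}{evidence}")
        return "\n".join(lines)
```

The point of the sketch is that the report is just structured data about each step, so the same structure could be filled in by an agentic run or by a scripted one.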

Try #2: Convert to YAML specification

What I asked:

/convert-test-to-yaml test-cases/test-management/TC-001-add-investment-account.md

What happened:
Claude Code created a structured YAML specification. It:

Created a test-specs folder
Generated a YAML file with:
- Metadata (test ID, functional area, timestamp)
- Preconditions (verified text on page)
- Steps using Chrome DevTools MCP commands as primitives:
  - navigate – Go to URL
  - click-button – Click elements
  - select-dropdown – Choose options
  - fill-text – Enter data
  - verify-text – Assert expected content
- Assertions for each step
- Evidence collection points (screenshots)
- Custom actions needed
Generated a conversion notes document explaining the structure

The YAML spec captured everything from the agentic run – navigation steps, form interactions, verifications – but structured it as commands that could execute deterministically.
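To make that structure concrete, here is a minimal sketch of what such a spec could look like, written as a Python dict so it can be checked without a YAML library. The action names come from the list above; everything else (the field names, the localhost URL, the button and field labels) is an assumption on my part, not the contents of the generated file.

```python
# Step primitives named in the generated spec.
ALLOWED_ACTIONS = {"navigate", "click-button", "select-dropdown", "fill-text", "verify-text"}

# Hypothetical spec shaped like the YAML described above.
# Field names, URL, and UI labels are assumptions for illustration.
SPEC = {
    "metadata": {"test_id": "TC-001", "functional_area": "test-management"},
    "steps": [
        {"action": "navigate", "url": "http://localhost:3000/accounts"},
        {"action": "click-button", "target": "Add Account"},
        {"action": "select-dropdown", "target": "Account Type", "value": "Investment"},
        {"action": "fill-text", "target": "Account Name", "value": "Main Investment"},
        {"action": "fill-text", "target": "Description", "value": "My Investment Account"},
        {"action": "click-button", "target": "Submit"},
        {"action": "verify-text", "expected": "Main Investment"},
    ],
}


def validate_spec(spec):
    """Return a list of problems; an empty list means the spec looks well formed."""
    problems = []
    if "test_id" not in spec.get("metadata", {}):
        problems.append("metadata.test_id is missing")
    for index, step in enumerate(spec.get("steps", []), start=1):
        action = step.get("action")
        if action not in ALLOWED_ACTIONS:
            problems.append(f"step {index}: unknown action {action!r}")
    return problems
```

A check like validate_spec is cheap to run before any browser is involved, which is one way to start building trust in generated specs.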

This is a “run twice” pattern – agentic first to explore, then deterministic for reliability.

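The first run uses the LLM to figure out how to interact with the application. The conversion step then captures those interactions as a deterministic specification, so that a second run could replay them without the LLM making any decisions.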
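Here is a minimal sketch of what the deterministic half of the pattern might look like: a runner that replays recorded steps in order with no LLM involved. No runner exists in my project yet, so this is purely illustrative. FakeBrowser is a hypothetical stand-in for a real browser driver, and the step fields and URL are assumptions.

```python
class FakeBrowser:
    """Hypothetical stand-in for a real browser driver; records calls and fakes page text."""

    def __init__(self):
        self.log = []
        self.page_text = ""

    def navigate(self, url):
        self.log.append(f"navigate {url}")

    def click(self, target):
        self.log.append(f"click {target}")

    def select(self, target, value):
        self.log.append(f"select {target}={value}")

    def fill(self, target, value):
        self.log.append(f"fill {target}={value}")
        self.page_text += value + "\n"  # pretend typed values appear on the page


def run_steps(browser, steps):
    """Replay steps in order; return PASSED or the first failing step."""
    for index, step in enumerate(steps, start=1):
        action = step["action"]
        if action == "navigate":
            browser.navigate(step["url"])
        elif action == "click-button":
            browser.click(step["target"])
        elif action == "select-dropdown":
            browser.select(step["target"], step["value"])
        elif action == "fill-text":
            browser.fill(step["target"], step["value"])
        elif action == "verify-text":
            if step["expected"] not in browser.page_text:
                return f"FAILED at step {index}"
        else:
            raise ValueError(f"step {index}: unknown action {action!r}")
    return "PASSED"


# Steps as they might be captured from the agentic run (values are illustrative).
STEPS = [
    {"action": "navigate", "url": "http://localhost:3000/accounts"},
    {"action": "click-button", "target": "Add Account"},
    {"action": "fill-text", "target": "Account Name", "value": "Main Investment"},
    {"action": "verify-text", "expected": "Main Investment"},
]
```

Calling run_steps twice with fresh FakeBrowser instances produces identical call logs, which is the property the pattern is after: the same spec always drives the same sequence of actions.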

Try #3: Create reusable action library

What I asked:

/create-action "add a new account through the dashboard modal"

What happened:
Claude Code extracted a reusable action from the test case. It:

Created a reusable-actions folder
Generated an “add account” action with:
- Parameters (account type, name, description, initial value)
- Implementation steps
- Selectors for UI elements
- Error handling
- Success criteria

Here’s what’s interesting: the original test case was specific – “add an investment account with these exact values.” But the action it extracted was generic – “add any type of account with any values.”

That’s pretty impressive!

It’s building a test automation framework without being explicitly asked – extracting patterns and creating reusable components that could work for multiple test scenarios.
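The difference between the specific test case and the generic action can be sketched as a parameterized function. This is my own illustration, not the generated action file: the function name, the step fields, and the UI labels ("Account Type", "Initial Value", "Submit") are assumptions, and the "Savings" account is an invented second scenario.

```python
def add_account(account_type, name, description="", initial_value=None):
    """Build the steps for adding any type of account through the dashboard modal."""
    steps = [
        {"action": "click-button", "target": "Add Account"},
        {"action": "select-dropdown", "target": "Account Type", "value": account_type},
        {"action": "fill-text", "target": "Account Name", "value": name},
    ]
    # Optional fields only produce steps when a value is supplied.
    if description:
        steps.append({"action": "fill-text", "target": "Description", "value": description})
    if initial_value is not None:
        steps.append({"action": "fill-text", "target": "Initial Value", "value": str(initial_value)})
    steps.append({"action": "click-button", "target": "Submit"})
    # Success criterion: the new account name is visible afterwards.
    steps.append({"action": "verify-text", "expected": name})
    return steps


# The original test case becomes one call with its exact values.
investment_steps = add_account("Investment", "Main Investment", "My Investment Account")

# A different scenario reuses the same action with other values.
savings_steps = add_account("Savings", "Rainy Day", initial_value=500)
```

Seen this way, TC-001 is just one set of arguments, and new test cases become new calls rather than new scripts.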

Patterns I Noticed

After running through the three-command workflow, some clear patterns emerged:

Works well for:

Converting agentic (unreliable) runs into deterministic specifications (reliability still to be validated)
Extracting reusable actions from specific test cases
Capturing test evidence automatically (screenshots, execution reports)
Parameterizing test data for reuse
Building action libraries incrementally from executed tests

Gets messy with:

Understanding what Claude Code actually built (lots of files generated across multiple folders)
Framework emerged from accidental context (the forgotten document) – not planned

Surprises:

Claude Code built an entire framework I wasn’t planning
“Run twice” pattern emerged: agentic → deterministic
Context from a forgotten document guided the entire implementation
Three-command workflow was created automatically
Framework suggested building a YAML test runner as the next step
It generalized specific test cases into reusable actions

The Honest Take

⚡ Quick Verdict:
🟡 AMBER: “Really interesting, no idea where this goes”

The Good:

Targets a real problem: converting unreliable agentic tests to reliable deterministic tests
Three-step workflow makes sense: run → convert → extract actions
Automatically captures test evidence and execution reports
Creates parameterizable, reusable test specifications
Could enable a test case library that runs consistently
Extracts reusable patterns from specific implementations

The Concerns:

Completely unplanned – emerged from forgotten context
Haven’t validated that the YAML specs actually work, because no YAML test runner has been built yet
Lots of generated files to understand and validate
Unknown if this scales to multiple test cases
Need to understand what Claude Code built before trusting it

Would I use this?
Maybe. I need to complete the framework and validate that it works first.

The “run twice” pattern feels right: agentic for exploration, deterministic for reliability. If the YAML specs actually execute reliably, this could be valuable.

For what?

When you need reliable test execution but want the speed of agentic test creation
Building test automation frameworks from exploratory testing
Creating action libraries from executed test patterns
Capturing test specifications from manual testing sessions

When would I NOT use it?

When I need deterministic tests immediately (framework not complete yet)
For simple tests where plain Playwright would be faster
Until I understand what Claude Code actually built and validate it works

Still Curious About

What I’m still curious about and want to test further…

Can the YAML specs actually execute reliably?
How would the YAML test runner work? (Claude Code suggested this as next step)
Does this scale to dozens or hundreds of test cases?
Can I build an action library that covers most test scenarios?
Is the “run twice” pattern a general principle for AI test automation?
What happens if the application UI changes – can the YAML specs adapt?
Could this work for API testing or just UI testing?

The Main Lesson

Context matters enormously.

A forgotten document about deterministic testing guided Claude Code to build an entire framework I wasn’t expecting.

I was just exploring ideas with Claude Code – writing down thoughts about moving from agentic to deterministic testing, discussing action libraries, thinking about YAML specifications. That document sat in my project folder. I moved it to a temp location and forgot about it.

When I started Experiment #2 and said “read all the files in this project,” Claude Code read that document. And when things went off the rails, that context guided it to build exactly what I’d been theorizing about.

This suggests a pattern: have Claude Code help you explore ideas and concepts, keep those documents in context, and let that guide future implementation.

I’m working this out as I go along, but the “run twice” pattern that emerged – agentic for exploration, then deterministic for reliability – feels like a real insight.

The Prompts I Actually Used

If you’re interested in trying this, these are the exact prompts I used:

/run-test test-cases/test-management/TC-001-add-investment-account.md
/convert-test-to-yaml test-cases/test-management/TC-001-add-investment-account.md
/create-action "add a new account through the dashboard modal"

Important Note: These slash commands were generated by Claude Code at the end of Experiment #2. They were guided by a forgotten “deterministic testing” markdown document that was sitting in my project context.

If you want to try this approach, you’d need to:

Set up Chrome DevTools MCP (see Experiment #2)
Create markdown test case documents
Give Claude Code context about deterministic testing concepts
Let it build the framework (or create the slash commands yourself)

Resources

Chrome DevTools MCP: https://github.com/ChromeDevTools/chrome-devtools-mcp
Financial Dashboard (demo app): v0 Demo app
Experiment #2 (where this started): https://www.testmanagement.com/blog/2025/10/test-cases-automated-in-minutes-with-chrome-devtools-mcp/

Want to try this yourself?

Watch this space – this needs follow-up to see if it actually delivers on the promise. I need to:

Validate the YAML specs actually work
Build (or have Claude Code build) the YAML test runner
Test with multiple test cases
See if the action library approach scales

Let me know if you’ve experimented with agentic vs deterministic testing approaches – I’m especially curious about whether the “run twice” pattern resonates with your experience.