AI Experiment #4: The Test Automation Compiler

October 27, 2025

Can you treat test cases like source code that compiles into automation?

I wondered if this might work, so I tried it.


The Question

How far can I take building a fully “prompt-driven” test automation system?

In previous experiments, I’ve been exploring different aspects of AI-driven test automation:

  • Experiment #1: Creating test cases from screen recordings with FFMPEG
  • Experiment #2: Running test cases directly with Chrome DevTools MCP
  • Experiment #3: The “run twice” pattern – agentic discovery then deterministic YAML

This experiment brings it all together into a systematic approach I’m calling The Test Automation Compiler.

The core idea: Treat markdown test cases as source code that gets compiled into executable automation through a defined process – just like a programming language compiler turns source code into machine code.

I don’t honestly know if this is a good approach yet. Intuition just tells me it’s worth trying!


What I’m Using

  • Chrome DevTools MCP (configured from Experiment #2)
  • Claude Code
  • Financial Dashboard demo application
  • Markdown test case documents
  • Test Automation Compiler strategy document (version 2.0)

That last one is crucial – I took everything learned from Experiments #2 and #3, plus a document I’d developed on deterministic AI test automation, refined it all with Claude Code’s help, and built a comprehensive strategy document that defines the entire compilation philosophy and process.


The Setup

Here’s how the Test Automation Compiler works:

The Core Philosophy:

Human Intent (Markdown) → Compiler (AI Discovery) → Machine Code (YAML) → Execution (Deterministic)
           ↑                                                                        ↓
           └─────────────────── Feedback Loop (Continuous Learning) ────────────────┘

Traditional test automation requires translating human test cases into code – a manual, error-prone process that creates two artifacts that then need maintaining in parallel. The Test Automation Compiler approach treats this as a compilation problem instead. Your markdown test case is the source code. An AI discovery run acts as the compiler, learning implementation details through intelligent exploration. The generated YAML is the compiled output – optimized, deterministic and ready to execute. Not ready to execute in the traditional sense, but ready to be executed by an AI coding engine with an MCP connection to a browser.

This shift in approach could have profound implications. Just as a programming language compiler converts high-level code into machine instructions, this system converts human-readable test cases into a script that can be run. Much as modern compilers include optimization passes, the learning phase extracts the most efficient patterns from discovery. The feedback loop acts like profiling tools, identifying where the compiled code needs refinement. The result: a single source of truth (your markdown test case) that stays in sync with automation through systematic recompilation.

The Three Pillars:

  1. Automation-Aware Test Case Creation

    • Markdown test cases written with automation in mind
    • Structured natural language (GIVEN/WHEN/THEN)
    • Clear element identification
    • Specific assertions

    Not all test cases are created equal. A test case written as “Click the blue button” is human-readable but nightmarish to automate. A test case written as “CLICK button labeled ‘Submit’” is both human-readable AND automatable. This approach ensures test cases are written in a structured format that humans can read naturally while AI can parse reliably. It’s similar in spirit to BDD but not as prescriptive – an automation-aware format that amounts to “compiler-friendly” test documentation (there’s a sketch of this format right after this list).

  2. Run Twice Pattern (from Experiment #3)

    • Run 1: Discovery – AI explores and learns
    • Run 2: Validation – YAML executes deterministically

    The first run is exploratory – the AI coder with an MCP DevTools connection tries multiple approaches, measures actual timing, discovers the most reliable selectors, and logs everything. It’s a bit like a compiler’s analysis phase, understanding the structure before generating code. The second run validates that what was learned actually works deterministically. This two-phase approach separates the intelligent discovery (which can be non-deterministic) from the production execution (which must be deterministic). You get the benefits of AI exploration without the unreliability of agentic execution in production.

  3. Intelligent Feedback Loop

    • Learn from execution results
    • Update YAML for implementation changes
    • Update markdown for business logic changes
    • Continuous improvement

    Over time, the system learns from failures and successes. When a selector breaks, the feedback loop updates the YAML without touching the markdown test case. When business logic changes, the markdown is updated and the YAML recompiled. The feedback loop identifies the maintenance your automated tests need and carries it out. The system gets smarter with every execution, learning which patterns work and which need adjustment.

    That’s the idea I still have a bit to work out and finish off here. Watch out for Experiment #5.
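To make the first pillar concrete, here’s a minimal sketch of what an automation-aware test case might look like. The scenario, labels and step wording are illustrative assumptions on my part – they’re not the actual contents of TC-001:

TC-001: Add Investment Account

GIVEN the Financial Dashboard is open on the Accounts page

WHEN I CLICK the button labeled "Add Account"
AND I SELECT "Investment" in the dropdown labeled "Account Type"
AND I TYPE "Growth Portfolio" into the field labeled "Account Name"
AND I CLICK the button labeled "Save"

THEN a success message containing "Account added" is displayed
AND the accounts list shows a row labeled "Growth Portfolio"

The point is that every step names an action (CLICK, SELECT, TYPE), a target identified by a visible label, and an explicit assertion – things the compiler can latch onto reliably.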

The Four Commands:

I built four Claude Code slash commands that implement the compilation pipeline:

  • /discover – Discovery execution (Run 1)
  • /learn – Extract automation patterns
  • /generate – Create YAML from learnings
  • /validate – Validation execution (Run 2)

The Experiment

I took a markdown test case through this complete 4-step compilation process.

Step 1: Discovery Execution

What I asked:

/discover TC-001-add-investment-account.md

What happened:

Claude Code took the markdown test case and executed it using Chrome DevTools MCP. But this wasn’t just a simple run – it was a discovery session designed to learn everything about automating this test:

Discovery captured:

  • Multiple selector strategies attempted
  • Successful selectors with confidence scores
  • Timing requirements measured in milliseconds
  • Screenshots at each step
  • Patterns in the application behavior
  • Modal animations (600ms discovered)
  • Form validation delays (1000ms required)
  • Success indicators and their duration

The output: discovery-log.json – a comprehensive record of everything learned during the first run.

This discovery run is analyzing the application’s structure, parsing the UI patterns, measuring the real-world behavior. It’s not just blindly executing steps; it’s building an internal model of how this application works so it can generate optimal automation instructions. The discovery log is a structured representation of everything needed for YAML script generation.

The key insight here is that this first run is intelligent exploration, not just execution.
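I won’t reproduce the real discovery log, but a trimmed, hypothetical excerpt gives a feel for the kind of structure it captures. The field names and selector values below are my illustration of the idea, not the exact schema Claude Code produced:

{
  "test_case": "TC-001-add-investment-account",
  "steps": [
    {
      "step": 1,
      "intent": "Open the Add Account modal",
      "selectors_attempted": [
        { "selector": "button[data-testid='add-account']", "success": true, "confidence": 0.95 },
        { "selector": "//button[contains(text(),'Add Account')]", "success": true, "confidence": 0.80 }
      ],
      "timing_ms": { "modal_animation": 600 },
      "screenshot": "step-01.png"
    }
  ],
  "patterns": {
    "form_validation_delay_ms": 1000,
    "success_indicator": "toast message, visible for a measured duration"
  }
}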


Step 2: Learning Extraction

What I asked:

/learn discovery-log.json

What happened:

Claude Code analyzed the discovery log and extracted automation patterns specific to this application:

Learnings extracted:

  • Most reliable selectors identified
  • Optimal wait strategies (immediate vs timed)
  • Element interaction patterns
  • Data transformation rules
  • Application-specific quirks
  • Best practices for this UI

The output: learnings.json – distilled intelligence ready for YAML generation.

This learning extraction step optimises the automation specifically for our test case and our application. It looks at all the attempted selectors and picks the most reliable ones. It analyzes timing patterns and calculates optimal wait strategies. It identifies application-specific behaviors (like a 600ms modal animation) that need special handling. This is more than just data aggregation – it’s intelligent pattern recognition that extracts reusable automation knowledge from raw execution data. The learnings become the “optimisation rules” that guide YAML generation (the next step).
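Again purely as an illustration (the real learnings.json schema may differ), the distilled output might look something like this:

{
  "preferred_selectors": {
    "add_account_button": "button[data-testid='add-account']",
    "account_type_dropdown": "select#account-type"
  },
  "wait_strategies": {
    "after_modal_open": { "type": "fixed", "ms": 600 },
    "after_form_submit": { "type": "fixed", "ms": 1000 },
    "default": { "type": "immediate" }
  },
  "quirks": [
    "Modal animation takes ~600ms before fields are interactable",
    "Form validation needs ~1000ms before submission completes"
  ]
}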


Step 3: YAML Generation

What I asked:

/generate learnings.json TC-001-add-investment-account.md

What happened:

Claude Code combined:

  1. The learnings (implementation details)
  2. The original markdown test case (requirements)

To generate a deterministic YAML specification.

The YAML included:

  • Structured steps using discovered selectors
  • Optimized waits based on measured timing
  • Proper sequencing learned from discovery
  • Fallback strategies for flaky elements
  • Confidence scores for each selector
  • Execution notes documenting quirks

The output: test-spec.yaml – the “compiled” version of the markdown test case.

This is the YAML script generation phase – where high-level requirements (markdown) and implementation intelligence (learnings) combine to produce a script that can be followed by AI coding tools. The YAML specification includes everything needed for deterministic execution: precise selectors with confidence scores, optimized wait times based on measured behavior, proper sequencing learned from discovery, and even fallback strategies for unreliable elements. It’s structured, readable, and maintainable – just like well-written code. But unlike hand-written automation, this is “compiled” output based on actual observed behavior.

The YAML reads like hand-crafted automation code, but it was generated entirely from the discovery process. It’s deterministic, optimized, and includes all the learned implementation details.
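To give a flavour of the compiled artifact, here’s a hypothetical fragment in the same spirit – the keys, selectors and values are illustrative assumptions, not the exact specification my /generate command emitted:

test: TC-001-add-investment-account
source: TC-001-add-investment-account.md
compiled_from: learnings.json
steps:
  - id: open-add-account-modal
    action: click
    selector: "button[data-testid='add-account']"
    fallback_selector: "//button[contains(text(),'Add Account')]"
    confidence: 0.95
    wait_after_ms: 600     # modal animation measured during discovery
  - id: choose-account-type
    action: select
    selector: "select#account-type"
    value: Investment
  - id: submit-form
    action: click
    selector: "button[type='submit']"
    wait_after_ms: 1000    # form validation delay measured during discovery
    notes: submission only completes once validation has run
assertions:
  - type: text_visible
    value: "Account added"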


Step 4: Validation Execution

What I asked:

/validate test-spec.yaml

What happened:

Claude Code executed the generated YAML specification to validate that the compilation was successful.

Validation performed:

  • Executed all steps deterministically
  • Compared results to discovery run
  • Verified repeatability
  • Scored reliability across multiple dimensions
  • Identified maintenance points
  • Confirmed CI/CD readiness

Validation Results:

Dimension       | Score | Notes
Reliability     | 10/10 | All steps execute consistently
Timing          | 10/10 | Optimal waits discovered
Data Handling   | 10/10 | Correct transformations
Determinism     | 10/10 | No randomness in execution
Maintainability | 10/10 | Well-structured, documented

Overall Assessment: Production-ready, fully deployable to CI/CD

Maintenance Notes: Selectors need monitoring for UI changes (which is true for any automation).

The validation step completes the compilation cycle by proving the generated YAML actually works. It’s comparing results against the discovery run, checking for deterministic behavior, confirming there’s no randomness in execution. The scoring system provides objective metrics across multiple dimensions, giving you confidence the automation is production-ready. This isn’t just a binary pass/fail – it’s a comprehensive quality assessment that identifies potential maintenance points before they become problems. The 10/10 scores across all dimensions mean the compilation was successful: the YAML faithfully represents the markdown test case with optimal implementation.
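The validation results themselves could be captured as another artifact in the chain, keeping the whole compilation auditable. Something along these lines – again a hypothetical shape, not the exact report the command produced:

validation:
  spec: test-spec.yaml
  compared_against: discovery-log.json
  scores:
    reliability: 10
    timing: 10
    data_handling: 10
    determinism: 10
    maintainability: 10
  verdict: production-ready
  maintenance_notes:
    - Monitor selectors for UI changes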


Patterns I Noticed

After completing the 4-step compilation workflow:

Works well for:

  • Converting test documentation directly into automation without coding
  • Maintaining single source of truth (markdown test case)
  • Creating deterministic execution from agentic discovery
  • Systematic approach with clear, defined stages
  • Production-ready output with confidence scores
  • Auditable compilation process (all artifacts saved)

Might not work quite so well for:

  • Currently Chrome-only (a Chrome DevTools MCP limitation)
  • Needs extensive testing with more complex scenarios
  • Unknown performance in production GitHub workflows
  • No feedback loop implemented yet (planned next)

Surprises:

  • The 4-command workflow felt natural and systematic
  • 10/10 reliability score on first attempt with no tweaking
  • Generated YAML was readable and maintainable by humans
  • The “compiler” concept works as both metaphor and reality
  • Validation report proactively identified future maintenance points
  • Discovery log captured quirks I wouldn’t have thought to document

The Honest Take

⚡ Quick Verdict:
🟡 AMBER: “Interesting... an approach that’s taking shape! Need to work in the feedback and learning layer!”

The Good:

  • Single source of truth: markdown test case is THE documentation
  • No coding required: entire workflow is prompt-driven
  • Systematic process: 4 clear steps with defined inputs/outputs
  • Production-ready: 10/10 reliability score, validated for CI/CD
  • Intelligent compilation: learns optimal selectors and timing automatically
  • Self-validating: system knows its own reliability and limitations
  • Auditable: every stage produces artifacts for review

The Concerns:

  • Still needs extensive testing with complex scenarios (multi-step flows, error cases)
  • Chrome-only currently (Chrome DevTools MCP limitation)
  • Unknown reliability in production CI/CD workflows
  • Feedback loop not yet implemented (continuous learning needed)
  • Need to test with UI changes to verify maintenance approach

Would I use this?
Maybe – the approach is solid and the 4-step workflow makes intuitive sense. I need to:

  1. Validate with more complex tests
  2. Run in actual CI/CD pipeline
  3. Add the feedback loop
  4. Test Playwright MCP for multi-browser

I think the foundation we have here is strong enough to convince me to invest more time in this approach.

The Main Lesson

The “Test Automation Compiler” concept isn’t just a metaphor – it’s a working proof of concept.

Treating test cases as source code that compiles into automation provides several benefits:

  1. Clear mental model: Everyone understands compilers
  2. Defined stages: Each step has specific inputs/outputs
  3. Separation of concerns: Requirements (markdown) vs implementation (YAML)
  4. Optimization opportunity: Learning phase extracts best practices
  5. Validation built-in: Compiler verifies its own output

The three pillars work together:

  • Automation-Aware Creation ensures good input (well-structured test cases)
  • Run Twice Pattern provides intelligent translation (discovery → YAML)
  • Feedback Loop enables continuous improvement (not yet implemented but designed)

I think the concept and foundation are solid. From markdown to production-ready automation with a systematic compilation process and no code written.

Conclusion

This experiment demonstrates that the Test Automation Compiler approach isn’t just an interesting idea – it’s a practical, working proof of concept. The compiler metaphor proved to be more than just a convenient analogy; it’s a useful description of what’s happening. We’re taking human-readable test documentation (source code) and systematically transforming it through analysis, optimization, and code generation phases into deterministic, production-ready automation (executable machine code). When I say “executable machine code” I really mean a script that’s reliably executable by an AI coding engine with an MCP connection to a browser.

The four-command workflow provides a clear, repeatable process – one that separates concerns: discovery for learning, extraction for optimization, generation for compilation, and validation for quality assurance.

What makes this approach fundamentally different from traditional test automation is the elimination of parallel artifacts. There’s no separate test documentation that falls out of sync with test code. There’s no manual translation step where implementation details get lost or misinterpreted. Although it could be argued that losing this translation step means you’re missing a human review and analysis step that traditionally would find issues – both in the test case and the application under test.

The markdown test case becomes THE documentation, and the YAML specification is automatically compiled from observed behavior rather than assumed implementation. When the application changes, you update the markdown and recompile – just like updating source code and rebuilding. When implementation details change (selectors, timing), the feedback loop updates the YAML without touching your documentation (at least that’s what I’m hoping once this stage is implemented).

This single source of truth approach, combined with intelligent compilation and continuous learning, represents a genuinely interesting way to think about test automation maintenance and sustainability.