AI Experiment #5: Can Test Automation AI Learn From Its Own Failures?

November 3, 2025

Can this prompt-driven test automation system scale to complex applications using lessons learnt loops?

I wondered if AI could fail at automating a complex test case, learn from that failure, and succeed on the second try. Here’s my attempt at building a process with a feedback loop that achieves that.


The Question

Can this prompt-driven test automation system scale to complex applications using lessons learnt loops? I know from experience that AG Grid scenarios are really difficult to automate – drag-and-drop, row grouping, complex UI interactions. They’re automation nightmares. So what happens when AI comes up against these sorts of challenges?

What I’m looking at is scenarios where you encounter a complex application that defeats standard automation approaches. Where you can get Claude Code to complete some deep research and work out solutions. Solutions you can then build into a learning loop. And then, maybe, a growing knowledge base of lessons that improves over time, making each subsequent test easier to automate.

Let’s see if we can get this to work.


What I’m Using

  • Chrome DevTools MCP for browser automation
  • Claude Code with ultra-thinking capability for deep research
  • AG Grid example application (demanding complex automation interactions)
  • Test case markdown documents from video recordings
  • The deterministic prompt-driven automation framework from Experiment #4

I picked AG Grid specifically because I know it’s going to fail with the standard, simple approaches used in test automation. The drag-and-drop implementation in AG Grid is non-standard, the components are complex, and it’s exactly the kind of application that makes automation engineers question their career choices.


The Setup

Here’s how the Lessons Learnt Loop works:

The Core Philosophy:

Test Case (Markdown) → Discovery (AI Execution) → Failure Analysis → Lessons Document → Enhanced Discovery
           ↑                                                                              ↓
           └──────────────────── Success with Learned Patterns ─────────────────────────┘

Traditional test automation often fails when encountering complex UI patterns – and then you’re stuck. You either spend hours debugging selector strategies, or you give up and mark it as “manual only.” The Lessons Learnt Loop treats this as a learning problem instead. Your first execution attempt gathers data about what didn’t work. An ultra-thinking phase analyzes why it failed and documents solutions. The second attempt uses these lessons to succeed where the first attempt failed.

This shift in thinking has interesting implications. Just like a student learns from mistakes, our automation system can build knowledge about specific testing challenges. The AG Grid drag-and-drop that defeats standard automation becomes a documented, solved problem. The lessons learnt document becomes organizational knowledge that can be shared across teams. The result: automation that gets smarter with each challenge it encounters.

The Three Key Components:

  1. Failure Detection and Analysis

    • Identify exactly what failed and why
    • Capture error patterns and behaviors
    • Document the gap between expected and actual

    Not all failures are created equal. An “element not found” error is different from “drag started but drop didn’t register.” This approach captures the nuances of complex failures – particularly important with frameworks like AG Grid that implement custom event handling. The failure analysis isn’t just logging errors; it’s understanding the underlying cause. This deep understanding is what enables the ultra-thinking phase to generate meaningful solutions.

  2. Ultra-Thinking Research Phase

    • Deep dive into framework documentation
    • Analyze alternative approaches
    • Generate multiple solution strategies
    • Document findings in structured format

    The ultra-thinking capability is like having a senior automation engineer research the problem for you. It doesn’t just try random alternatives – it systematically investigates the framework’s architecture, understands the implementation choices, and proposes solutions based on that understanding. For AG Grid, this meant discovering the custom event system and proposing API-based alternatives. This isn’t trial and error; it’s informed problem-solving.

  3. Knowledge Integration

    • Feed lessons back into discovery
    • Apply learned patterns in execution
    • Build reusable knowledge base

    The lessons learnt document isn’t just a one-time fix – it becomes part of the automation’s knowledge base. Future test cases can reference these lessons, avoiding the same failures. Over time, you build a comprehensive understanding of your application’s quirks and complexities. This is organizational learning embedded in your automation system.

The Four Commands Enhanced:

Building on the Test Automation Compiler from Experiment #4, I’m using the same four commands but with lessons learnt integration:

  • /discover – Discovery execution (now accepts lessons learnt documents)
  • /learn – Extract automation patterns (incorporates lessons)
  • /generate – Create YAML from learnings (uses lesson-based strategies)
  • /validate – Validation execution (confirms lesson effectiveness)

The Experiment

I took an AG Grid test case through the complete learning loop process – expecting failure, researching solutions, and applying lessons.

Step 1: The Expected Failure

What I asked:

/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md

What happened:

Claude Code attempted to execute the test case using standard Chrome DevTools MCP approaches. The test involved dragging column headers into the row grouping panel – a seemingly simple interaction that’s actually one of AG Grid’s most complex features.

The discovery phase captured:

  • Successful navigation to the AG Grid example
  • Correct identification of column headers
  • Successful initiation of drag events
  • Complete failure on the drop action

“It’s not behaving in the way our AI agent was expecting. It’s a non-standard implementation of the drag and drop components. It can drag, but it can’t actually drop.”

This wasn’t a surprise – I specifically chose AG Grid because I knew it would fail. AG Grid uses a custom drag-and-drop implementation that doesn’t respond to standard browser events. The drag appears to work visually, but the drop zone doesn’t accept the element. This is exactly the kind of complex scenario that makes automation engineers either write custom JavaScript handlers or give up entirely.
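
To make that concrete, here is a minimal sketch of the kind of standard approach that fails here. The selectors are assumptions on my part (the real run drove the browser through Chrome DevTools MCP rather than raw DOM calls), but the underlying problem is the same: HTML5 drag events never reach AG Grid’s own drag machinery.

// A minimal sketch of the "standard" drag-and-drop attempt that fails.
// Selectors are illustrative assumptions; the real test drove the browser via Chrome DevTools MCP.
const header = document.querySelector('.ag-header-cell[col-id="country"]');
const dropZone = document.querySelector('.ag-column-drop'); // the row grouping panel

// Dispatching HTML5 drag events looks plausible, but AG Grid never acts on them:
header.dispatchEvent(new DragEvent('dragstart', { bubbles: true }));
dropZone.dispatchEvent(new DragEvent('dragover', { bubbles: true }));
dropZone.dispatchEvent(new DragEvent('drop', { bubbles: true }));

// AG Grid drives column drags from its own mouse event handling with custom hit
// detection, so the drop zone never registers anything.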

The key insight here is that the failure was informative. We learned exactly what doesn’t work and why – setting up the next phase perfectly.


Step 2: Ultra-Thinking Research

What I asked:

please can you examine why this drag-and-drop for the row grouping didn't work,
ultrathink and come up with solutions. Please write these solutions to the
file aggrid-lessons-learnt.md

What happened:

Claude Code spent five minutes in deep research mode. This wasn’t just error analysis – it was comprehensive investigation into AG Grid’s architecture and implementation choices.

The ultra-thinking phase discovered:

  • Root Cause: AG Grid doesn’t use native HTML5 drag-and-drop APIs
  • Implementation: Custom event handling with synthetic events
  • Complexity: Multiple ambiguous drop zones with custom hit detection
  • Event Sequence: Specific order of mouse events required that standard automation misses

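To make that last point concrete, a UI-level workaround would have to synthesize the mouse sequence by hand, roughly along the lines below. The selectors, offsets and event targets are assumptions on my part, and this fragility is exactly why the lessons document steered toward the API instead.

// Hypothetical sketch of driving AG Grid's custom drag with synthetic mouse events.
// Selectors, coordinates and timing are assumptions; hit detection makes this fragile.
const header = document.querySelector('.ag-header-cell[col-id="country"]');
const panel = document.querySelector('.ag-column-drop');
const from = header.getBoundingClientRect();
const to = panel.getBoundingClientRect();

const fire = (target, type, x, y) =>
  target.dispatchEvent(new MouseEvent(type, { bubbles: true, clientX: x, clientY: y }));

fire(header, 'mousedown', from.x + 5, from.y + 5);   // start the grid's own drag
fire(document, 'mousemove', to.x + 10, to.y + 10);   // move into the drop panel
fire(document, 'mouseup', to.x + 10, to.y + 10);     // release over the panel
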
The research went deeper, analyzing AG Grid’s documentation and finding:

  • The grid exposes a comprehensive API for programmatic control
  • Column state can be managed without UI interaction
  • Row grouping can be applied through the applyColumnState() method
  • The API approach is actually more reliable than UI automation

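Assuming a reference to the grid API is in hand (the discovery later recovered one through React Fiber internals, which is very much app- and version-specific), the row grouping step collapses to a single call. The colId below is an assumption about the example app’s column definitions:

// Hedged sketch: apply row grouping by Country through the AG Grid API instead of drag-and-drop.
// `gridApi` is assumed to already be in hand; 'country' is an assumed colId.
gridApi.applyColumnState({
  state: [{ colId: 'country', rowGroup: true }],
  defaultState: { rowGroup: false },   // clear grouping on all other columns
});
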
The output was a comprehensive lessons learnt document with:

  1. Detailed explanation of why standard approaches fail
  2. Multiple solution strategies ranked by reliability
  3. Code examples for API-based approaches
  4. Fallback strategies if API isn’t available
  5. Timing considerations for animations and state changes

This research phase is transformative. Instead of blindly trying different selectors or timing strategies (the traditional debugging approach), we’re building understanding of the underlying system. The lessons learnt document isn’t just a workaround – it’s a knowledge artifact that captures expert-level understanding of AG Grid automation.


Step 3: The Informed Second Attempt

What I asked:

/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

What happened:

I completely cleared the context – this was a fresh start with no memory of the previous failure. The only difference was the inclusion of the lessons learnt document.

This second discovery run was radically different:

  • Identified AG Grid’s API access pattern through React Fiber
  • Used gridApi.applyColumnState() instead of drag-and-drop
  • Applied row grouping programmatically in milliseconds
  • Verified the UI updated correctly
  • Expanded grouped rows using node.setExpanded(true)

“It’s been through all of the test steps. It says that it matches exactly what was expected in the test case.”

The transformation was impressive. What failed completely on the first attempt now executed flawlessly. The drag-and-drop interaction that would have required complex custom code was replaced with a simple API call. The visual result was exactly what the drag would have produced: the Country column appeared in the row groups panel, the data reorganized into groups, and the Belgium group expanded to show Isabella Kingston’s data.

But here’s the crucial point: the test case didn’t change. The markdown still described dragging and dropping. The AI learned to interpret “drag Country to row groups” as “apply row grouping by Country” and implement it in the most reliable way possible.

That’s pretty impressive!
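
For completeness, here is roughly what the expansion step amounts to once the grouping is applied. The group key is an assumption based on the test data described above, and again gridApi is assumed to be already obtained:

// Hedged sketch: expand the Belgium group node programmatically.
// For grouped rows, node.key holds the group value; 'Belgium' comes from the test case.
gridApi.forEachNode((node) => {
  if (node.group && node.key === 'Belgium') {
    node.setExpanded(true);
  }
});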


Step 4: Complete the 4-Phase Workflow

What I asked:

/learn TestManagement\discoveries\TC-002-discovery-log.json aggrid-lessons-learnt.md

Then generate the YAML specification:

/generate TestManagement\learnings\TC-002-learnings.json
          TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

Finally, validate the compiled automation:

/validate TestManagement\specs\TC-002-test-spec.yaml

What happened:

The complete 4-phase workflow (Discover → Learn → Generate → Validate) now incorporated the lessons learnt at every stage:

Learning Extraction Enhanced:
The /learn command didn’t just extract patterns from the discovery log – it cross-referenced them with the lessons learnt document. It understood that certain UI interactions should be replaced with API calls. It learned that timing for AG Grid animations needs special handling. It identified which selectors were reliable and which were fragile. The output wasn’t just a pattern list – it was an intelligent synthesis of discovered behavior and documented solutions.

YAML Generation Transformed:
The /generate command produced a radically different YAML specification than it would have without the lessons. Instead of drag-and-drop instructions, it contained API calls. Instead of complex event sequences, it had simple state changes. The YAML included:

  • API access patterns through React Fiber
  • Programmatic column state management
  • Direct node manipulation for row expansion
  • Optimized timing based on measured animations
  • Fallback strategies for unreliable elements

Validation Results:
The validation phase confirmed that our learning loop worked perfectly:

✅ Validation Complete - TC-002 PASSED

Outcome: ✅ PASS (13/13 steps successful)

📊 Key Metrics
| Metric                  | Result                        |
|-------------------------|-------------------------------|
| Total Steps             | 13 (Setup: 3, Test Steps: 10) |
| Successful Steps        | 13                            |
| Failed Steps            | 0                             |
| Assertions Passed       | 4/4                           |
| Deterministic Execution | ✅ YES                         |
| Spec Followed Exactly   | ✅ YES                         |
| Fallbacks Used          | ❌ NONE                        |
| Improvisation           | ❌ NONE                        |

The validation was twice as fast as discovery because it executed deterministically without exploration. Every step passed on the first attempt. No fallbacks were needed. The API approach was not just a workaround – it was demonstrably superior to UI automation.

The Run Twice Pattern Validated:
Both the discovery run (with lessons) and the validation run produced identical functional outcomes:

  • Belgium group expanded ✅
  • Isabella Kingston data visible and correct ✅
  • 20 country groups created ✅
  • Row grouping by Country active ✅

This proves the YAML specification successfully captured ALL automation knowledge, including the lessons learnt. The compilation from markdown to YAML was complete and correct.


Patterns I Noticed

After watching this 4-phase workflow (Discover → Learn → Generate → Validate) complete, some clear patterns emerged:

Works well for:

  • Complex UI components that have API alternatives
  • Applications where standard automation approaches fail initially
  • Building reusable knowledge bases for specific frameworks (AG Grid, Kendo UI, etc.)
  • Scenarios where you need to “teach” the automation about application quirks
  • Catching up on technical debt where automation has been deferred

Gets messy with:

  • Applications without good API access (purely visual interfaces)
  • Scenarios where the UI approach is the only way to test user experience

Surprises:

  • The ultra-thinking capability produced genuinely useful, actionable research
  • The lessons learnt document was comprehensive enough to use immediately
  • Validation was 2x faster than discovery (530ms vs 1,127ms)
  • The API approach was actually MORE reliable than UI automation
  • The YAML specification captured ALL automation knowledge perfectly

The Honest Take

⚡ Quick Verdict:
🟢 GREEN: “I’m going to start using this in production” – for catching up on technical debt

The Good:

  • Self-healing automation: Tests that learn from failure and fix themselves
  • Knowledge preservation: Lessons learnt documents capture expert knowledge permanently
  • Speed improvement: 2x faster execution after learning (530ms vs 1,127ms)
  • Reliability boost: API approach more stable than UI automation
  • Reusable solutions: Lessons apply to all similar test cases
  • Systematic process: Clear fail → research → learn → succeed workflow
  • Production-ready: 13/13 steps passing with deterministic execution

The Concerns:

  • API vs UI testing: The API approach doesn’t test exactly what the end user uses
  • Research time: Ultra-thinking takes 5+ minutes per complex problem (that’s quicker than me doing it!)
  • Framework-specific: Lessons are tied to specific frameworks (AG Grid, Kendo, etc.)
  • Not universal: Only works when alternative approaches exist (API, keyboard nav, etc.)

Would I use this?
Absolutely. I’m already planning to scale this up for production use. The ability to fail, learn, and succeed systematically is transformative for complex test automation. This isn’t just a clever experiment – it’s a practical solution to real problems I face daily.

Here’s where I’d use it immediately:

  • Legacy application automation where standard approaches have failed
  • Complex UI frameworks (AG Grid, Kendo UI, DevExpress) that defeat simple automation
  • Technical debt catch-up – finally automate those “too hard” test cases
  • Knowledge base building – create a library of lessons for the entire team
  • Onboarding acceleration – new team members inherit accumulated automation knowledge

When would I NOT use it?
There are clear boundaries to this approach:

  • Pure UI validation – when you must test exact user interactions, not API equivalents
  • Simple applications – overhead isn’t justified for basic forms and buttons
  • Visual regression testing – pixel-perfect validation needs different tools
  • One-off tests – the learning investment doesn’t pay off for single-use cases

Still Curious About

What I’m still curious about and want to test further:

  • Shared knowledge libraries: Can we build a centralized repository of lessons learnt that multiple teams can contribute to and benefit from? Imagine a GitHub repo of automation lessons for every major framework.
  • Other complex frameworks: Would this pattern work with Kendo UI, DevExpress, Telerik, or other notoriously difficult frameworks? Each has its own quirks that might benefit from documented lessons.
  • Lessons evolution: How many test cases need to fail and be fixed before the lessons document becomes truly comprehensive? Is there a point of diminishing returns?
  • Playwright MCP integration: Could we swap in Playwright MCP instead of Chrome DevTools for cross-browser testing? That would open up Firefox and Safari automation with the same learning approach.
  • Lessons versioning: How do we handle framework updates that invalidate lessons? Can we version lessons alongside application versions?

The Main Lesson

The Lessons Learnt Loop isn’t just error recovery – it’s systematic knowledge building for test automation.

Traditional automation fails and stays failed. This approach fails, learns why, documents solutions, and succeeds. The difference is profound:

  1. Failure becomes valuable: Each failure generates knowledge that prevents future failures
  2. Knowledge persists: Solutions are documented, not trapped in one engineer’s head
  3. Complexity becomes manageable: Even AG Grid’s notorious drag-and-drop can be automated reliably
  4. Teams scale better: Junior engineers inherit senior engineers’ solutions
  5. Maintenance simplifies: When something breaks, check if there’s already a lesson for it

The pattern demonstrated here – fail → research → document → succeed – mirrors how human experts develop. We’re essentially teaching our automation system to become an expert through experience. The ultra-thinking phase acts like a senior engineer researching a problem. The lessons learnt document captures that expertise permanently. The second attempt applies that expertise successfully.

This approach transforms test automation from a brittle, high-maintenance burden into a learning system that gets smarter over time. Every challenge makes it stronger. Every failure makes it wiser.


Conclusion

This experiment proves that test automation with AI can literally learn from its own failures. The Lessons Learnt Loop isn’t just a workaround for complex scenarios – it’s a fundamental shift in how we approach automation challenges.

What started as an expected failure with AG Grid’s drag-and-drop became a demonstration of systematic problem-solving. The automation failed, researched why, documented solutions, and succeeded on the second attempt. The real success was turning an “impossible to automate” scenario into a solved problem.

The implications extend beyond this single test case. Every organization has applications with quirky behaviors that defeat standard automation. Every team has that list of “manual only” test cases they’ve given up on. The Lessons Learnt Loop offers a path forward: systematic learning that turns automation failures into documented solutions.

What makes this approach particularly powerful is its alignment with how organizations actually work. Senior engineers naturally build mental models of application quirks. The Lessons Learnt Loop captures that knowledge explicitly, making it shareable, reusable, and permanent. When that senior engineer leaves, their automation knowledge stays.

The four-phase workflow (Discover → Learn → Generate → Validate) now has an enhancement: front-load the earlier phases with lessons learnt. This creates a positive feedback loop where automation gets progressively smarter. Each failure contributes to future success. Each lesson learnt benefits all subsequent tests.

I’m moving this from experiment to production. The approach is mature enough, the results are compelling enough, and the need is definitely there. Those AG Grid tests that have been “manual only” for years? They’re about to become automated.


The Prompts I Actually Used

If you’re interested in trying this, these are the exact prompts I used:

# Initial discovery (expected to fail)
/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md

# Ultra-thinking research after failure
please can you examine why this drag-and-drop for the row grouping didn't work,
ultrathink and come up with solutions. Please write these solutions to the
file aggrid-lessons-learnt.md

# Second discovery with lessons learnt
/discover TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

# Learning phase with lessons
/learn TestManagement\discoveries\TC-002-discovery-log.json aggrid-lessons-learnt.md

# Generation phase with lessons
/generate TestManagement\learnings\TC-002-learnings.json
          TestManagement\test-cases\TC-002-aggrid-column-management-and-row-grouping.md
          aggrid-lessons-learnt.md

# Validation to confirm it works
/validate TestManagement\specs\TC-002-test-spec.yaml


Want to try this yourself?

It really was simple to get set up once you understand the pattern: Fail → Research → Learn → Succeed. I’ll get this packaged up and released on GitHub very soon. Then you can try it and let me know what happens – I’m especially curious about which frameworks defeat your first attempt and whether ultra-thinking helps you too.