
Beyond Prompt Engineering: How We Built a Deterministic AI-Driven Test Automation System at Scale

Samiksha Survade, Principal SDET

Everyone is experimenting with AI in engineering workflows. And most teams hit the same wall very quickly: the AI sounds confident, produces plausible output, and still gets enough things wrong to become unreliable in production.

The instinct is usually to write a bigger prompt. A longer one. A smarter one. That instinct alone is not enough.

Most engineering teams already have years of testing knowledge documented somewhere: Xray, TestRail, spreadsheets, Confluence pages, or legacy automation repositories.

The problem is that these test cases were written for humans, not for systems. Different engineers describe the same workflow differently. Important assumptions remain implicit. Validation logic often lives in tribal knowledge or hidden framework behavior instead of structured definitions. What works well for manual testing becomes ambiguous when translated into executable automation.

We ran into this exact problem while trying to scale automation for complex backup and restore workflows at Druva.

Our workflows were highly stateful and execution-heavy:

  • backup and restore operations were tightly coupled within a single execution flow

  • later verification steps depended on outputs from earlier operations

  • verification logic spanned filesystem checks, database validation, and checksum comparison

  • end-to-end execution required coordination across multiple tools and services

We already had a Pytest-based automation framework capable of executing these workflows reliably. The challenge was scaling test creation itself.

To solve that, we introduced a JSON-driven execution model where AI generated structured workflows that could be executed by a central orchestrator and modular tooling layer.

The transition from manual workflow authoring to AI-assisted generation exposed an important architectural gap: human-written test definitions lacked the structural consistency required for reliable large-scale execution.

Addressing that gap required moving beyond prompt refinement and designing explicit execution boundaries around workflow generation.

Designing the Execution System First

Before solving generation, we had to answer a more fundamental question:

If AI generates test cases, what exactly is responsible for executing them?

The first architectural decision we made was to separate generation from execution.

At a high level, the system consists of:

  • an AI assembly layer

  • a structured context library

  • a lightweight orchestrator

  • a modular tool ecosystem

  • validation and quality gates

This separation turned out to be critical.

The AI layer is responsible for producing structured workflows. The framework is responsible for executing them deterministically. The two operate independently. Once we separated generation from execution, the architecture became significantly easier to reason about, validate, and scale.


The Orchestrator and Tooling Layer

Each generated test case is represented as a sequence of structured execution steps.

Every step follows a strict contract:

  • an action

  • a params block

  • verification requirements

  • references to execution context where required
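
As an illustration only, a contract like the one above could be modeled as a small data class with a strict parser. The field names `verify` and `context_refs` and the function `parse_step` are assumptions made for this sketch; only `action`, `params`, and `description` appear in the step template shown later in this article.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Step:
    """One execution step. 'verify' and 'context_refs' are hypothetical names."""
    action: str
    params: dict[str, Any] = field(default_factory=dict)
    verify: list[str] = field(default_factory=list)        # verification requirements
    context_refs: list[str] = field(default_factory=list)  # keys read from shared context
    description: str = ""


def parse_step(raw: dict[str, Any]) -> Step:
    """Build a Step from decoded JSON, rejecting anything outside the contract."""
    allowed = {"action", "params", "verify", "context_refs", "description"}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    action = raw.get("action")
    if not isinstance(action, str) or not action:
        raise ValueError("step needs a non-empty 'action' string")
    return Step(
        action=action,
        params=dict(raw.get("params", {})),
        verify=list(raw.get("verify", [])),
        context_refs=list(raw.get("context_refs", [])),
        description=str(raw.get("description", "")),
    )
```

Rejecting unknown fields at parse time is one way to keep a generated step from carrying structure the executor does not understand.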

The orchestrator itself is intentionally lightweight.

Its responsibility is not to interpret business logic. It simply:

  • reads a step

  • resolves references from shared execution context

  • routes the step to the correct tool

  • captures outputs

  • propagates state into downstream steps

  • validates execution results

Keeping orchestration deterministic was important. Once orchestration logic starts making dynamic decisions, debugging failures becomes difficult very quickly.

The shared execution context became especially important for long-running workflows. Outputs generated during backup operations were often required by downstream restore and verification steps. Instead of hardcoding those values, the orchestrator dynamically propagated them between steps during execution.
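
The loop described above can be sketched in a few lines. This is a simplified illustration, not the actual orchestrator: the `$ctx.<key>` reference syntax, the function names, and the demo tools are all assumptions, and result validation is left out for brevity.

```python
from typing import Any, Callable

Tool = Callable[[dict[str, Any]], dict[str, Any]]


def resolve(value: Any, context: dict[str, Any]) -> Any:
    """Replace "$ctx.<key>" strings with values captured from earlier steps."""
    if isinstance(value, str) and value.startswith("$ctx."):
        key = value[len("$ctx."):]
        if key not in context:
            raise KeyError(f"unresolved context reference: {key}")
        return context[key]
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    return value


def run_workflow(steps: list[dict[str, Any]], tools: dict[str, Tool]) -> dict[str, Any]:
    """Execute steps strictly in order: no branching and no business logic."""
    context: dict[str, Any] = {}
    for index, step in enumerate(steps):
        action = step["action"]
        if action not in tools:
            raise ValueError(f"step {index}: unsupported action {action!r}")
        params = resolve(step.get("params", {}), context)
        outputs = tools[action](params)
        context.update(outputs)  # propagate state to downstream steps
    return context


# Hypothetical demo tools: the restore step consumes the backup step's output.
demo_tools: dict[str, Tool] = {
    "backup": lambda params: {"backup_job_id": "job-1"},
    "restore": lambda params: {"restored_from": params["job"]},
}
demo_steps = [
    {"action": "backup", "params": {}},
    {"action": "restore", "params": {"job": "$ctx.backup_job_id"}},
]
final_context = run_workflow(demo_steps, demo_tools)
```

Because references are resolved only from values that earlier steps actually produced, a step that asks for a missing value fails immediately with a clear error instead of running with a stale or hardcoded one.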

The tooling layer itself was organized by responsibility:

  • client tools for backup and restore operations

  • filesystem tools for file creation and mutation

  • verification tools for checksum and metadata validation

  • database tools for validating internal state

  • service tools for service orchestration

  • mocked server tools for simulating backend responses

Each action mapped directly to a tool capability. The orchestrator only knew how to invoke the tool and process the result; it contained no implementation-specific logic.
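
One common way to get this one-to-one mapping is a registry keyed by action name. The sketch below assumes a decorator-based registry; `REGISTRY`, `tool`, `dispatch`, and both example tools are hypothetical and stand in for the real tool implementations.

```python
import hashlib
from typing import Any, Callable

Tool = Callable[[dict[str, Any]], dict[str, Any]]
REGISTRY: dict[str, Tool] = {}


def tool(action: str) -> Callable[[Tool], Tool]:
    """Register a function as the handler for exactly one action name."""
    def decorator(fn: Tool) -> Tool:
        if action in REGISTRY:
            raise ValueError(f"duplicate action: {action}")
        REGISTRY[action] = fn
        return fn
    return decorator


@tool("compute_checksum")  # hypothetical verification tool
def compute_checksum(params: dict[str, Any]) -> dict[str, Any]:
    data = params["data"].encode("utf-8")
    return {"checksum": hashlib.sha256(data).hexdigest()}


@tool("verify_checksum")  # hypothetical verification tool
def verify_checksum(params: dict[str, Any]) -> dict[str, Any]:
    return {"checksum_match": params["expected"] == params["actual"]}


def dispatch(action: str, params: dict[str, Any]) -> dict[str, Any]:
    """Look up the handler and call it; the caller never sees tool internals."""
    handler = REGISTRY.get(action)
    if handler is None:
        raise ValueError(f"unsupported action: {action}")
    return handler(params)
```

With this layout, adding a capability means registering one more function, and an action the framework does not implement is rejected by name before anything runs.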

This separation, together with the mocked server tools, allowed us to validate complete data paths without depending on live infrastructure.

The Shift from Generation to Assembly

Once the execution framework was stable, we focused on improving AI-generated workflows.

Our initial approach relied heavily on prompts and reference examples. The results were not production-ready.

The issue was not syntax. Most generated JSON looked structurally valid during review.

The failures appeared during execution:

  • missing execution dependencies

  • invalid action sequencing

  • inconsistent mocks

  • unsupported actions

  • incorrect state propagation across steps

We kept refining prompts, adding examples, simplifying mocks, and introducing additional validation layers.

Each improvement solved one class of failures while exposing another.

At that point, we changed the approach entirely.

Instead of asking AI to generate workflows freely, we constrained it to assemble workflows from predefined building blocks.

That changed the system in several important ways:

  • actions had to exist in the framework

  • step structure came from predefined templates

  • only explicitly allowed fields could be modified

  • validation rules were enforced before execution

The model stopped inventing structure.

It started assembling known-good components within predefined execution boundaries.

That shift removed a large class of structural failures and dramatically improved consistency.

The Context Library

One of the biggest breakthroughs came when we stopped treating the AI like a chatbot and started treating it like a new engineer joining the team.

A new engineer cannot contribute effectively with just a prompt. They need onboarding, domain understanding, rules, constraints, and examples of how the system works.

We organized that knowledge into three layers.

Layer 1: concepts — The Interpretation Layer

This layer defines how the system interprets workflows.

It includes:

  • glossary definitions

  • backup and restore semantics

  • domain rules

  • flow constraints

For example:

  • successful backups must be followed by restore and verification

  • failed or aborted backups should not proceed into restore validation flows

These rules are enforced consistently across generated workflows.
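
The two flow constraints above are simple enough to check mechanically. The following sketch assumes simplified action names (`backup`, `restore`, `verify_restore`) and an `expected_outcome` field; none of these names come from the real framework.

```python
from typing import Any


def check_flow(steps: list[dict[str, Any]]) -> list[str]:
    """Return flow-rule violations for an ordered list of steps."""
    violations: list[str] = []
    for i, step in enumerate(steps):
        if step["action"] != "backup":
            continue
        # Actions that run after this backup and before the next one.
        following: list[str] = []
        for later in steps[i + 1:]:
            if later["action"] == "backup":
                break
            following.append(later["action"])
        outcome = step.get("expected_outcome", "success")
        if outcome == "success":
            if "restore" not in following:
                violations.append(f"step {i}: successful backup has no restore")
            elif "verify_restore" not in following[following.index("restore") + 1:]:
                violations.append(f"step {i}: restore is never verified")
        elif "restore" in following or "verify_restore" in following:
            violations.append(
                f"step {i}: {outcome} backup must not reach restore validation"
            )
    return violations
```

A workflow of `backup`, `restore`, `verify_restore` passes, while an aborted backup followed by a restore is reported as a violation.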

Layer 2: action_rules — The Contract Layer

This layer defines:

  • available actions

  • required parameters

  • supported combinations

  • execution constraints

For example, a restore verification step requires:

  • device references

  • restore job references

  • expected validation details

This layer effectively acts as a contract between generation and execution.
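
A contract of this kind can be expressed as plain data plus a small checker. In the sketch below, the action name `verify_restore` and the parameter names are placeholders chosen for illustration, since the article does not list the real ones.

```python
from typing import Any

# Hypothetical contract entries: one per action, listing required and
# optional parameter names.
ACTION_RULES: dict[str, dict[str, set[str]]] = {
    "verify_restore": {
        "required": {"device_ref", "restore_job_ref", "expected"},
        "optional": {"timeout_seconds"},
    },
}


def check_contract(action: str, params: dict[str, Any]) -> list[str]:
    """Compare a step's parameters against the contract for its action."""
    rule = ACTION_RULES.get(action)
    if rule is None:
        return [f"action {action!r} is not in the contract"]
    supplied = set(params)
    problems = [
        f"missing required param {name!r}"
        for name in sorted(rule["required"] - supplied)
    ]
    allowed = rule["required"] | rule["optional"]
    problems += [
        f"unsupported param {name!r}" for name in sorted(supplied - allowed)
    ]
    return problems
```

Keeping the rules as data means the same table can be shown to the model as context and enforced by the framework, so both sides read from one definition.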

Layer 3: step_templates — The Structure Layer

This layer defines the exact structure of each execution step.

Example:

{

  "action": "trigger_backup_operation_on_client",

  "params": {

    "verify_mapckpt_exists": true,

    "wait_time_before_abort_backup": 30

  },

  "description": "Triggers backup and validates mapckpt behavior during execution."

}

The model does not generate this structure dynamically.

It selects the appropriate template and fills only the allowed fields.

That distinction turned out to be critical.
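
To make "fills only the allowed fields" concrete, here is a minimal sketch built around the template shown above. The `editable_params` set and the `assemble_step` function are assumptions; the article does not say which fields of this template are actually editable.

```python
import copy
from typing import Any

STEP_TEMPLATES: dict[str, dict[str, Any]] = {
    "trigger_backup_operation_on_client": {
        "template": {
            "action": "trigger_backup_operation_on_client",
            "params": {
                "verify_mapckpt_exists": True,
                "wait_time_before_abort_backup": 30,
            },
            "description": "Triggers backup and validates mapckpt behavior during execution.",
        },
        # Hypothetical: the only parameter the model is allowed to change.
        "editable_params": {"wait_time_before_abort_backup"},
    },
}


def assemble_step(action: str, overrides: dict[str, Any]) -> dict[str, Any]:
    """Copy a known-good template and apply only permitted overrides."""
    entry = STEP_TEMPLATES.get(action)
    if entry is None:
        raise ValueError(f"no template for action {action!r}")
    illegal = set(overrides) - entry["editable_params"]
    if illegal:
        raise ValueError(f"fields not editable: {sorted(illegal)}")
    step = copy.deepcopy(entry["template"])  # never mutate the shared template
    step["params"].update(overrides)
    return step
```

Calling `assemble_step("trigger_backup_operation_on_client", {"wait_time_before_abort_backup": 60})` returns a full step with only the wait time changed, while an attempt to add a new key or change the action is rejected.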

Quality Gates

Even with structured assembly in place, we added validation at multiple stages of the pipeline.

Pre-generation

An analysis layer detects:

  • ambiguities

  • contradictory expectations

  • incomplete scenarios

  • technically infeasible workflows

Post-generation

A custom linter validates:

  • schema correctness

  • action validity

  • parameter structure

  • mock consistency
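
A post-generation linter along these lines might look like the sketch below. The workflow shape (`steps`, `mocks`, and a per-step `mock` reference) is an assumption made for illustration; the real schema is not shown in this article.

```python
from typing import Any


def lint_workflow(workflow: dict[str, Any], known_actions: set[str]) -> list[str]:
    """Collect every finding instead of stopping at the first one."""
    steps = workflow.get("steps")
    if not isinstance(steps, list) or not steps:
        return ["workflow: 'steps' must be a non-empty list"]
    findings: list[str] = []
    mocks = workflow.get("mocks", {})
    if not isinstance(mocks, dict):
        findings.append("workflow: 'mocks' must be an object")
        mocks = {}
    used_mocks: set[str] = set()
    for i, step in enumerate(steps):
        where = f"steps[{i}]"
        if not isinstance(step, dict):
            findings.append(f"{where}: must be an object")
            continue
        action = step.get("action")
        if not isinstance(action, str) or action not in known_actions:
            findings.append(f"{where}: unknown action {action!r}")
        if not isinstance(step.get("params", {}), dict):
            findings.append(f"{where}: 'params' must be an object")
        mock = step.get("mock")
        if mock is None:
            continue
        if not isinstance(mock, str) or mock not in mocks:
            findings.append(f"{where}: mock {mock!r} is not defined")
        else:
            used_mocks.add(mock)
    # Mocks that no step references usually point to an inconsistent scenario.
    for name in sorted(set(mocks) - used_mocks):
        findings.append(f"mocks.{name}: defined but never used")
    return findings
```

Reporting all findings with their location makes it practical to feed the list back into a regeneration pass or to a reviewer in one go.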

Runtime validation

Execution-time checks validate:

  • sequencing

  • state transitions

  • intermediate outputs

  • verification dependencies

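The pre- and post-generation gates kept structurally invalid workflows from reaching execution, and the runtime checks caught sequencing and state errors that only surface while a workflow runs.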

The Evolution of the System

The framework evolved in phases.

Phase 1 — Monolithic prompts

Large prompts with reference examples produced approximately 20% usable output.

Phase 2 — Simplified mocks and validation

Reducing mock complexity and adding verification layers improved reliability to roughly 60%.

Phase 3 — Template-driven assembly

Introducing structured templates and execution contracts increased reliability to around 80%.

Phase 4 — Context-driven deterministic assembly

The full Context Library, deterministic orchestration, and constrained assembly pushed generation accuracy beyond 95%.

The improvements did not come from prompting alone.

They came from progressively reducing ambiguity, limiting uncontrolled generation, and enforcing deterministic execution boundaries.

Impact

The results significantly changed the scale and speed at which we could build automation:

  • 700+ test cases generated

  • 95%+ generation accuracy

  • ~70% reduction in manual effort

  • 450 complex backup and restore workflows automated in approximately 1.5 months

Once the execution model and assembly pipeline were established, scaling new scenarios became substantially faster and more predictable.

What Other Teams Can Reuse

The implementation itself is not plug-and-play.

But the architectural pattern is reusable.

The key ideas are:

  • separate generation from execution

  • constrain output structure

  • standardize execution contracts

  • introduce deterministic orchestration

  • teach domain context explicitly

  • validate aggressively at every stage

The biggest lesson for us was this:

Large language models are very good at producing plausible output.

Production systems require deterministic behavior.

Bridging that gap required much more than better prompting. It required designing a system where:

  • execution is deterministic

  • workflows are structured

  • contracts are explicit

  • and the model has limited room to invent behavior

That architectural shift made the difference between an interesting prototype and a production-ready system we could actually scale.
