Beyond Prompt Engineering: How We Built a Deterministic AI-Driven Test Automation System at Scale

June 21, 2026 Samiksha Survade, Principal SDET

Everyone is experimenting with AI in engineering workflows. And most teams hit the same wall very quickly: the AI sounds confident, produces plausible output, and still gets enough things wrong to become unreliable in production.

The instinct is usually to write a bigger prompt. A longer one. A smarter one. That instinct alone is not enough.

Most engineering teams already have years of testing knowledge documented somewhere: Xray, TestRail, spreadsheets, Confluence pages, or legacy automation repositories.

The problem is that these test cases were written for humans, not for systems. Different engineers describe the same workflow differently. Important assumptions remain implicit. Validation logic often lives in tribal knowledge or hidden framework behavior instead of structured definitions. What works well for manual testing becomes ambiguous when translated into executable automation.

We ran into this exact problem while trying to scale automation for complex backup and restore workflows at Druva.

Our workflows were highly stateful and execution-heavy:

backup and restore workflows were tightly connected across execution flow
later verification steps depended on outputs from earlier operations
verification logic spanned filesystem checks, database validation, and checksum comparison
end-to-end execution required coordination across multiple tools and services

We already had a Pytest-based automation framework capable of executing these workflows reliably. The challenge was scaling test creation itself.

To solve that, we introduced a JSON-driven execution model where AI generated structured workflows that could be executed by a central orchestrator and modular tooling layer.

The transition from manual workflow authoring to AI-assisted generation exposed an important architectural gap: human-written test definitions lacked the structural consistency required for reliable large-scale execution.

Addressing that gap required moving beyond prompt refinement and designing explicit execution boundaries around workflow generation.

Designing the Execution System First

Before solving generation, we had to answer a more fundamental question:

If AI generates test cases, what exactly is responsible for executing them?

The first architectural decision we made was to separate generation from execution.

At a high level, the system consists of:

an AI assembly layer
a structured context library
a lightweight orchestrator
a modular tool ecosystem
validation and quality gates

This separation turned out to be critical.

The AI layer is responsible for producing structured workflows. The framework is responsible for executing them deterministically. The two operate independently. Once we separated generation from execution, the architecture became significantly easier to reason about, validate, and scale.

The Orchestrator and Tooling Layer

Each generated test case is represented as a sequence of structured execution steps.

Every step follows a strict contract:

an action
a params block
verification requirements
references to execution context where required
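
To make the idea of a strict step contract concrete, here is a minimal Python sketch. It is not the framework's actual code: the `verify` key, the `Step` class, and `parse_step` are hypothetical names chosen for illustration, and only `action`, `params`, and `description` appear in the example later in this post.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumed key set: "verify" is a hypothetical name for verification requirements.
ALLOWED_KEYS = {"action", "params", "verify", "description"}


@dataclass(frozen=True)
class Step:
    action: str
    params: dict[str, Any] = field(default_factory=dict)
    verify: dict[str, Any] = field(default_factory=dict)
    description: str = ""


def parse_step(raw: dict[str, Any]) -> Step:
    """Turn a raw JSON object into a Step, rejecting anything off-contract."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    if not isinstance(raw.get("action"), str) or not raw["action"]:
        raise ValueError("step needs a non-empty string 'action'")
    for key in ("params", "verify"):
        if not isinstance(raw.get(key, {}), dict):
            raise ValueError(f"'{key}' must be an object")
    return Step(
        action=raw["action"],
        params=dict(raw.get("params", {})),
        verify=dict(raw.get("verify", {})),
        description=raw.get("description", ""),
    )
```

Rejecting unknown keys outright, rather than ignoring them, is one way to keep a model from quietly inventing structure.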

The orchestrator itself is intentionally lightweight.

Its responsibility is not to interpret business logic. It simply:

reads a step
resolves references from shared execution context
routes the step to the correct tool
captures outputs
propagates state into downstream steps
validates execution results

Keeping orchestration deterministic was important. Once orchestration logic starts making dynamic decisions, debugging failures becomes difficult very quickly.

The shared execution context became especially important for long-running workflows. Outputs generated during backup operations were often required by downstream restore and verification steps. Instead of hardcoding those values, the orchestrator dynamically propagated them between steps during execution.
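
The loop described above (read a step, resolve references, route to a tool, capture outputs, propagate state, validate) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the real orchestrator: the `$ctx.<key>` reference syntax, the `verify` block, and the function names are all hypothetical.

```python
from typing import Any, Callable

# A tool takes resolved params and returns a dict of named outputs.
Tool = Callable[[dict[str, Any]], dict[str, Any]]


def resolve(value: Any, context: dict[str, Any]) -> Any:
    """Replace "$ctx.<key>" strings with values captured from earlier steps."""
    if isinstance(value, str) and value.startswith("$ctx."):
        return context[value[len("$ctx."):]]  # KeyError signals a missing dependency
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    return value


def run_workflow(steps: list[dict[str, Any]], tools: dict[str, Tool]) -> dict[str, Any]:
    """Execute steps in order; no branching, no business logic."""
    context: dict[str, Any] = {}
    for index, step in enumerate(steps):
        tool = tools.get(step["action"])
        if tool is None:
            raise LookupError(f"step {index}: unsupported action {step['action']!r}")
        outputs = tool(resolve(step.get("params", {}), context))
        # Validate the result against the step's declared expectations.
        for key, expected in step.get("verify", {}).items():
            if outputs.get(key) != expected:
                raise AssertionError(f"step {index}: {key}={outputs.get(key)!r}, expected {expected!r}")
        context.update(outputs)  # propagate state to downstream steps
    return context
```

In this sketch a restore step could declare `{"job_id": "$ctx.backup_job_id"}` and receive whatever job ID the earlier backup step returned, with nothing hardcoded in the workflow.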

The tooling layer itself was organized by responsibility:

client tools for backup and restore operations
filesystem tools for file creation and mutation
verification tools for checksum and metadata validation
database tools for validating internal state
service tools for service orchestration
mocked server tools for simulating backend responses

Each action mapped directly to a tool capability. The orchestrator only knew how to invoke the tool and process the result — it did not contain implementation-specific logic.

This separation allowed us to validate complete data paths without depending on live infrastructure.

The Shift from Generation to Assembly

Once the execution framework was stable, we focused on improving AI-generated workflows.

Our initial approach relied heavily on prompts and reference examples. The results were not production-ready.

The issue was not syntax. Most generated JSON looked structurally valid during review.

The failures appeared during execution:

missing execution dependencies
invalid action sequencing
inconsistent mocks
unsupported actions
incorrect state propagation across steps

We kept refining prompts, adding examples, simplifying mocks, and introducing additional validation layers.

Each improvement solved one class of failures while exposing another.

At that point, we changed the approach entirely.

Instead of asking AI to generate workflows freely, we constrained it to assemble workflows from predefined building blocks.

That changed the system in several important ways:

actions had to exist in the framework
step structure came from predefined templates
only explicitly allowed fields could be modified
validation rules were enforced before execution

The model stopped inventing structure.

It started assembling known-good components within predefined execution boundaries.

That shift removed a large class of structural failures and dramatically improved consistency.

The Context Library

One of the biggest breakthroughs came when we stopped treating the AI like a chatbot and started treating it like a new engineer joining the team.

A new engineer cannot contribute effectively with just a prompt. They need onboarding, domain understanding, rules, constraints, and examples of how the system works.

We organized that knowledge into three layers.

Layer 1: concepts — The Interpretation Layer

This layer defines how the system interprets workflows.

It includes:

glossary definitions
backup and restore semantics
domain rules
flow constraints

For example:

successful backups must be followed by restore and verification
failed or aborted backups should not proceed into restore validation flows

These rules are enforced consistently across generated workflows.
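
As an illustration, the two example rules could be checked mechanically along these lines. The action names (`backup`, `restore`, `verify_restore`) and the `expected_outcome` field are hypothetical stand-ins, not the framework's real vocabulary.

```python
def check_flow(steps: list[dict]) -> list[str]:
    """Return a list of rule violations; an empty list means the flow is valid."""
    problems: list[str] = []
    for i, step in enumerate(steps):
        if step["action"] != "backup":
            continue
        # Only the steps between this backup and the next one belong to it.
        following: list[str] = []
        for later in steps[i + 1:]:
            if later["action"] == "backup":
                break
            following.append(later["action"])
        outcome = step.get("expected_outcome", "success")
        if outcome == "success":
            if "restore" not in following:
                problems.append(f"step {i}: successful backup has no restore")
            elif "verify_restore" not in following[following.index("restore") + 1:]:
                problems.append(f"step {i}: restore is never verified")
        elif "restore" in following or "verify_restore" in following:
            problems.append(f"step {i}: {outcome} backup must not reach restore validation")
    return problems
```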

Layer 2: action_rules — The Contract Layer

This layer defines:

available actions
required parameters
supported combinations
execution constraints

For example, a restore verification step requires:

device references
restore job references
expected validation details

This layer effectively acts as a contract between generation and execution.
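
One way such a contract could be expressed is as plain data plus a small check. The sketch below assumes hypothetical action and parameter names (`verify_restore`, `device_ref`, `restore_job_ref`, `expected`, `backup_job_ref`) that mirror the restore verification example; the real rule format is not shown in this post.

```python
# Hypothetical contract table: each action lists the parameters it requires.
ACTION_RULES: dict[str, dict[str, set[str]]] = {
    "verify_restore": {"required": {"device_ref", "restore_job_ref", "expected"}},
    "restore": {"required": {"device_ref", "backup_job_ref"}},
}


def missing_params(step: dict) -> list[str]:
    """List required parameters that the step does not provide."""
    rule = ACTION_RULES.get(step["action"])
    if rule is None:
        raise LookupError(f"no contract for action {step['action']!r}")
    return sorted(rule["required"] - set(step.get("params", {})))
```

Because the rules are data rather than code, the same table can be given to the model as context and used by the validator.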

Layer 3: step_templates — The Structure Layer

This layer defines the exact structure of each execution step.

Example:

{
  "action": "trigger_backup_operation_on_client",
  "params": {
    "verify_mapckpt_exists": true,
    "wait_time_before_abort_backup": 30
  },
  "description": "Triggers backup and validates mapckpt behavior during execution."
}

The model does not generate this structure dynamically.

It selects the appropriate template and fills only the allowed fields.

That distinction turned out to be critical.
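
A rough Python sketch of "select a template and fill only the allowed fields" follows. It reuses the example step above; the `STEP_TEMPLATES` layout, the `editable_params` set, and `assemble_step` are assumptions made for illustration, including the choice of which fields are editable.

```python
import copy

# Hypothetical template store keyed by action name.
STEP_TEMPLATES = {
    "trigger_backup_operation_on_client": {
        "template": {
            "action": "trigger_backup_operation_on_client",
            "params": {
                "verify_mapckpt_exists": True,
                "wait_time_before_abort_backup": 30,
            },
            "description": "Triggers backup and validates mapckpt behavior during execution.",
        },
        # Assumed: only these params may be changed by the model.
        "editable_params": {"verify_mapckpt_exists", "wait_time_before_abort_backup"},
    },
}


def assemble_step(action: str, overrides: dict) -> dict:
    """Copy the template for `action` and apply only permitted overrides."""
    entry = STEP_TEMPLATES.get(action)
    if entry is None:
        raise LookupError(f"no template for action {action!r}")
    illegal = set(overrides) - entry["editable_params"]
    if illegal:
        raise ValueError(f"fields not editable: {sorted(illegal)}")
    step = copy.deepcopy(entry["template"])  # never mutate the shared template
    step["params"].update(overrides)
    return step
```

With this shape, the model's output shrinks to an action name plus a handful of overrides, and anything outside that set fails before execution.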

Quality Gates

Even with structured assembly in place, we added validation at multiple stages of the pipeline.

Pre-generation

An analysis layer detects:

ambiguities
contradictory expectations
incomplete scenarios
technically infeasible workflows

Post-generation

A custom linter validates:

schema correctness
action validity
parameter structure
mock consistency
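
A post-generation linter of this kind might look roughly like the sketch below, which covers schema correctness, action validity, and parameter structure. Mock consistency is left out because this post does not describe how mocks are represented. The function name and message wording are hypothetical.

```python
import json


def lint_workflow(text: str, known_actions: set[str]) -> list[str]:
    """Collect every issue in a generated workflow instead of stopping at the first."""
    try:
        workflow = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(workflow, list) or not workflow:
        return ["workflow must be a non-empty list of steps"]
    issues: list[str] = []
    for i, step in enumerate(workflow):
        if not isinstance(step, dict):
            issues.append(f"step {i}: must be an object")
            continue
        action = step.get("action")
        if not isinstance(action, str):
            issues.append(f"step {i}: missing 'action'")
        elif action not in known_actions:
            issues.append(f"step {i}: unsupported action '{action}'")
        if not isinstance(step.get("params", {}), dict):
            issues.append(f"step {i}: 'params' must be an object")
    return issues
```

Returning the full list of issues, rather than raising on the first, makes it practical to feed the result back into a regeneration attempt.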

Runtime validation

Execution-time checks validate:

sequencing
state transitions
intermediate outputs
verification dependencies

These validation layers prevented structurally invalid workflows from reaching execution.

The Evolution of the System

The framework evolved in phases.

Phase 1 — Monolithic prompts

Large prompts with reference examples produced approximately 20% usable output.

Phase 2 — Simplified mocks and validation

Reducing mock complexity and adding verification layers improved reliability to roughly 60%.

Phase 3 — Template-driven assembly

Introducing structured templates and execution contracts increased reliability to around 80%.

Phase 4 — Context-driven deterministic assembly

The full Context Library, deterministic orchestration, and constrained assembly pushed generation accuracy beyond 95%.

The improvements did not come from prompting alone.

They came from progressively reducing ambiguity, limiting uncontrolled generation, and enforcing deterministic execution boundaries.

Impact

The results significantly changed the scale and speed at which we could build automation:

700+ test cases generated
95%+ generation accuracy
~70% reduction in manual effort
450 complex backup and restore workflows automated in approximately 1.5 months

Once the execution model and assembly pipeline were established, scaling new scenarios became substantially faster and more predictable.

What Other Teams Can Reuse

The implementation itself is not plug-and-play.

But the architectural pattern is reusable.

The key ideas are:

separate generation from execution
constrain output structure
standardize execution contracts
introduce deterministic orchestration
teach domain context explicitly
validate aggressively at every stage

The biggest lesson for us was this:

Large language models are very good at producing plausible output.

Production systems require deterministic behavior.

Bridging that gap required much more than better prompting. It required designing a system where:

execution is deterministic
workflows are structured
contracts are explicit
and the model has limited room to invent behavior

That architectural shift made the difference between an interesting prototype and a production-ready system we could actually scale.
