The Orchestrator and Tooling Layer
Each generated test case is represented as a sequence of structured execution steps.
Every step follows a strict contract:
The orchestrator itself is intentionally lightweight.
Its responsibility is not to interpret business logic. It simply:
reads a step
resolves references from shared execution context
routes the step to the correct tool
captures outputs
propagates state into downstream steps
validates execution results
Keeping orchestration deterministic was important. Once orchestration logic starts making dynamic decisions, the same workflow can take different paths on different runs, and failures become hard to reproduce and debug.
The shared execution context became especially important for long-running workflows. Outputs generated during backup operations were often required by downstream restore and verification steps. Instead of hardcoding those values, the orchestrator dynamically propagated them between steps during execution.
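The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production orchestrator: the `${name}` reference syntax, the `status` and `outputs` result keys, the short action names, and the `fake_backup` and `fake_restore` tools are all hypothetical and exist only for this example.

```python
import re
from typing import Any, Callable, Dict, List

# Hypothetical reference syntax: "${name}" pulls a value from shared context.
REF = re.compile(r"^\$\{(\w+)\}$")

Tool = Callable[[Dict[str, Any]], Dict[str, Any]]


def resolve(params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Replace "${name}" placeholders with values captured from earlier steps."""
    resolved = {}
    for key, value in params.items():
        match = REF.match(value) if isinstance(value, str) else None
        if match:
            name = match.group(1)
            if name not in context:
                raise KeyError(f"unresolved reference: {name}")
            resolved[key] = context[name]
        else:
            resolved[key] = value
    return resolved


def run(steps: List[Dict[str, Any]], tools: Dict[str, Tool]) -> Dict[str, Any]:
    """Execute steps strictly in order: no branching, no business logic."""
    context: Dict[str, Any] = {}
    for index, step in enumerate(steps):
        tool = tools[step["action"]]                       # route to the tool
        params = resolve(step.get("params", {}), context)  # resolve references
        result = tool(params)                              # invoke, capture outputs
        if result.get("status") != "ok":                   # validate the result
            raise RuntimeError(f"step {index} ({step['action']}) failed: {result}")
        context.update(result.get("outputs", {}))          # propagate state downstream
    return context


# Stand-in tools (assumptions for the demo, not real framework tools).
def fake_backup(params):
    return {"status": "ok", "outputs": {"snapshot_id": "snap-001"}}


def fake_restore(params):
    return {"status": "ok", "outputs": {"restored_from": params["snapshot_id"]}}


steps = [
    {"action": "backup", "params": {}},
    {"action": "restore", "params": {"snapshot_id": "${snapshot_id}"}},
]
final_context = run(steps, {"backup": fake_backup, "restore": fake_restore})
print(final_context)
```

The restore step never hardcodes the snapshot identifier. It names the value it needs, and the loop supplies whatever the backup step actually produced during that run.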
The tooling layer itself was organized by responsibility:
client tools for backup and restore operations
filesystem tools for file creation and mutation
verification tools for checksum and metadata validation
database tools for validating internal state
service tools for service orchestration
mocked server tools for simulating backend responses
Each action mapped directly to a tool capability. The orchestrator only knew how to invoke the tool and process the result; it contained no implementation-specific logic.
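One way to keep implementation detail out of the orchestrator is a registry that maps each action name to a tool function. The sketch below assumes such a registry; the `tool` decorator, the action names `create_file` and `verify_checksum`, and the result shape are hypothetical. Only the category names come from the list above.

```python
import hashlib
from typing import Any, Callable, Dict, Tuple

ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]

# action name -> (category, callable). Action names are illustrative only.
REGISTRY: Dict[str, Tuple[str, ToolFn]] = {}


def tool(category: str, action: str):
    """Register a function as the implementation of exactly one action."""
    def decorator(func: ToolFn) -> ToolFn:
        if action in REGISTRY:
            raise ValueError(f"duplicate action: {action}")
        REGISTRY[action] = (category, func)
        return func
    return decorator


@tool("filesystem", "create_file")
def create_file(params):
    # In-memory stand-in for a real file write.
    return {"status": "ok", "outputs": {"content": params["content"]}}


@tool("verification", "verify_checksum")
def verify_checksum(params):
    actual = hashlib.sha256(params["content"].encode()).hexdigest()
    status = "ok" if actual == params["expected"] else "failed"
    return {"status": status, "outputs": {"checksum": actual}}


def invoke(action: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """All the orchestrator needs to know: look up the action and call it."""
    if action not in REGISTRY:
        return {"status": "failed", "error": f"unsupported action: {action}"}
    _category, func = REGISTRY[action]
    return func(params)


expected = hashlib.sha256(b"hello").hexdigest()
ok_result = invoke("verify_checksum", {"content": "hello", "expected": expected})
bad_result = invoke("format_disk", {})
print(ok_result["status"], bad_result["status"])
```

With this shape, adding a capability means registering a new function. The `invoke` path does not change, which is what keeps the orchestrator small.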
This separation allowed us to validate complete data paths without depending on live infrastructure.
The Shift from Generation to Assembly
Once the execution framework was stable, we focused on improving AI-generated workflows.
Our initial approach relied heavily on prompts and reference examples. The results were not production-ready.
The issue was not syntax. Most generated JSON looked structurally valid during review.
The failures appeared during execution:
missing execution dependencies
invalid action sequencing
inconsistent mocks
unsupported actions
incorrect state propagation across steps
We kept refining prompts, adding examples, simplifying mocks, and introducing additional validation layers.
Each improvement solved one class of failures while exposing another.
At that point, we changed the approach entirely.
Instead of asking AI to generate workflows freely, we constrained it to assemble workflows from predefined building blocks.
That changed the system in several important ways:
actions had to exist in the framework
step structure came from predefined templates
only explicitly allowed fields could be modified
validation rules were enforced before execution
The model stopped inventing structure.
It started assembling known-good components within predefined execution boundaries.
That shift removed a large class of structural failures and dramatically improved consistency.
The Context Library
One of the biggest breakthroughs came when we stopped treating the AI like a chatbot and started treating it like a new engineer joining the team.
A new engineer cannot contribute effectively with just a prompt. They need onboarding, domain understanding, rules, constraints, and examples of how the system works.
We organized that knowledge into three layers.
Layer 1: concepts — The Interpretation Layer
This layer defines how the system interprets workflows.
It includes:
For example:
These rules are enforced consistently across generated workflows.
Layer 2: action_rules — The Contract Layer
This layer defines:
available actions
required parameters
supported combinations
execution constraints
For example, a restore verification step requires:
This layer effectively acts as a contract between generation and execution.
Layer 3: step_templates — The Structure Layer
This layer defines the exact structure of each execution step.
Example:
{
  "action": "trigger_backup_operation_on_client",
  "params": {
    "verify_mapckpt_exists": true,
    "wait_time_before_abort_backup": 30
  },
  "description": "Triggers backup and validates mapckpt behavior during execution."
}
The model does not generate this structure dynamically.
It selects the appropriate template and fills only the allowed fields.
That distinction turned out to be critical.
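A rough sketch of what "select a template and fill only the allowed fields" can look like in code follows. The template is the example step above; the `editable_params` set, and the choice of `wait_time_before_abort_backup` as the one editable field, are assumptions made for illustration rather than the framework's actual rules.

```python
import copy
from typing import Any, Dict

# Template copied from the example step. Which fields are editable is an
# assumption made for this sketch.
TEMPLATES = {
    "trigger_backup_operation_on_client": {
        "step": {
            "action": "trigger_backup_operation_on_client",
            "params": {
                "verify_mapckpt_exists": True,
                "wait_time_before_abort_backup": 30,
            },
            "description": "Triggers backup and validates mapckpt behavior during execution.",
        },
        "editable_params": {"wait_time_before_abort_backup"},
    }
}


def assemble_step(action: str, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Build a step from a template, rejecting anything the template disallows."""
    if action not in TEMPLATES:
        raise ValueError(f"unknown action: {action}")
    template = TEMPLATES[action]
    illegal = set(overrides) - template["editable_params"]
    if illegal:
        raise ValueError(f"fields not editable: {sorted(illegal)}")
    step = copy.deepcopy(template["step"])  # never mutate the shared template
    step["params"].update(overrides)
    return step


step = assemble_step(
    "trigger_backup_operation_on_client",
    {"wait_time_before_abort_backup": 60},
)
print(step["params"])
```

The model's contribution shrinks to two choices, an action name and a small set of overrides, and both are checked before a step exists. An invented action or an invented field fails immediately instead of surfacing later during execution.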
Quality Gates
Even with structured assembly in place, we added validation at multiple stages of the pipeline.
Pre-generation
An analysis layer detects:
Post-generation
A custom linter validates:
schema correctness
action validity
parameter structure
mock consistency
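The four linter checks listed above could be sketched roughly as follows. Everything about the workflow shape here is an assumption: the top-level `mocks` and `steps` keys, the per-step `mock` field, and the `ACTION_RULES` table are hypothetical, and "mock consistency" is interpreted narrowly as "every mock a step refers to is defined".

```python
from typing import Any, Dict, List

# Hypothetical contract: action -> {param name: expected type}.
ACTION_RULES = {
    "trigger_backup_operation_on_client": {
        "verify_mapckpt_exists": bool,
        "wait_time_before_abort_backup": int,
    },
}
REQUIRED_KEYS = {"action", "params", "description"}


def lint(workflow: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the workflow may run."""
    errors: List[str] = []
    mocks = set(workflow.get("mocks", {}))
    for i, step in enumerate(workflow.get("steps", [])):
        missing = REQUIRED_KEYS - set(step)             # schema correctness
        if missing:
            errors.append(f"step {i}: missing keys {sorted(missing)}")
            continue
        rules = ACTION_RULES.get(step["action"])        # action validity
        if rules is None:
            errors.append(f"step {i}: unsupported action {step['action']!r}")
            continue
        if not isinstance(step["params"], dict):        # parameter structure
            errors.append(f"step {i}: params must be an object")
            continue
        for name, value in step["params"].items():
            if name not in rules:
                errors.append(f"step {i}: unknown param {name!r}")
            elif type(value) is not rules[name]:        # exact match: True is not an int
                errors.append(f"step {i}: param {name!r} has wrong type")
        mock = step.get("mock")                         # mock consistency
        if mock is not None and mock not in mocks:
            errors.append(f"step {i}: mock {mock!r} is not defined")
    return errors


workflow = {
    "mocks": {"server_ok": {}},
    "steps": [
        {
            "action": "trigger_backup_operation_on_client",
            "params": {"verify_mapckpt_exists": True, "wait_time_before_abort_backup": "30"},
            "description": "wait time given as a string",
            "mock": "server_ok",
        },
        {"action": "reboot_datacenter", "params": {}, "description": "invented action"},
    ],
}
problems = lint(workflow)
for problem in problems:
    print(problem)
```

Collecting every problem instead of stopping at the first one makes the linter's output usable as feedback for a regeneration pass as well as a gate.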
Runtime validation
Execution-time checks validate:
These validation layers prevented structurally invalid workflows from reaching execution.
The Evolution of the System
The framework evolved in phases.
Phase 1 — Monolithic prompts
Large prompts with reference examples produced approximately 20% usable output.
Phase 2 — Simplified mocks and validation
Reducing mock complexity and adding verification layers improved reliability to roughly 60%.
Phase 3 — Template-driven assembly
Introducing structured templates and execution contracts increased reliability to around 80%.
Phase 4 — Context-driven deterministic assembly
The full Context Library, deterministic orchestration, and constrained assembly pushed generation accuracy beyond 95%.
The improvements did not come from prompting alone.
They came from progressively reducing ambiguity, limiting uncontrolled generation, and enforcing deterministic execution boundaries.
Impact
The results significantly changed the scale and speed at which we could build automation:
700+ test cases generated
95%+ generation accuracy
~70% reduction in manual effort
450 complex backup and restore workflows automated in approximately 1.5 months
Once the execution model and assembly pipeline were established, scaling new scenarios became substantially faster and more predictable.
What Other Teams Can Reuse
The implementation itself is not plug-and-play.
But the architectural pattern is reusable.
The key ideas are:
separate generation from execution
constrain output structure
standardize execution contracts
introduce deterministic orchestration
teach domain context explicitly
validate aggressively at every stage
The biggest lesson for us was this:
Large language models are very good at producing plausible output.
Production systems require deterministic behavior.
Bridging that gap required much more than better prompting. It required designing a system where:
That architectural shift made the difference between an interesting prototype and a production-ready system we could actually scale.