Hualin Luan Cloud Native · Quant Trading · AI Engineering

AI-TDD: Requirement Contracts and Multi-Dimensional Evidence Acceptance

AI-TDD turns human intent into a Manifest requirement contract, then accepts AI-generated work through evidence chains and Gate verdicts.

Published: 5/27/2026
Category: guide
Reading time: 83 min read


Introduction: from code-level TDD to requirement-contract-driven development

From the GitHub Copilot technical preview to GPT-4, Claude Code, Codex, and similar tools in day-to-day development, AI coding has evolved far beyond simple code completion. It now reaches into requirement understanding, implementation generation, test fixing, and broader engineering collaboration.

The real shift is no longer “Can AI write code?” but “Does the code AI writes stay within the requirement boundary?”

Current mainstream large language models can reliably generate large amounts of runnable code, but syntactic correctness does not equal requirement alignment. Code-level TDD has a structural limitation here: it lacks a global view of the requirement boundary.

Problem boundaries precede solutions.

This principle matters even more in the AI era. Software engineering has long understood that most of the work lies in understanding the problem before trying to solve it. Once AI can generate code quickly and cheaply, success depends less on raw implementation speed and more on whether the problem boundary is defined clearly enough.

Ambiguous requirements lead to AI improvisation, and improvisation often deviates from expectations. AI-TDD is built on this principle: AI generation without a Manifest as its contract is unanchored improvisation. The essence of a Manifest is to turn vague requirement boundaries into a machine-readable, verifiable contract matrix.

A typical failure scenario: lessons from an AI-generated user module

Below is a composite failure scenario constructed from multiple common risk points. A product owner describes the requirement to AI: “Implement a user registration feature with email verification and password encryption.”

AI generates the code, and the team quickly deploys. However, within a week of going live, the following issues occur:

  1. Security vulnerability: AI stored passwords as MD5 hashes instead of using bcrypt. The passwords are nominally “encrypted,” but MD5 hashes are vulnerable to rainbow table attacks
  2. Concurrency issues: When many users register simultaneously, the system creates duplicate accounts because AI didn’t implement email uniqueness constraints
  3. Boundary overflow: Users can upload avatar images of arbitrary sizes, causing storage costs to spiral out of control — because the requirement didn’t specify “limit image size”
  4. Feature drift: AI implemented OAuth social login, but this was functionality the team planned for “next quarter”

Root cause: Natural language requirements contain ambiguity, implicit assumptions, and fuzzy boundaries. AI fills these gaps according to its own “understanding,” and this understanding deviates from the team’s true intent.

If AI-TDD Manifest had been used:

must:
  - id: MUST-REG-EMAIL-001
    text: Users can register via email
    validation: "email format complies with RFC 5322, password length 8-32 characters"
  - id: MUST-REG-PASSWORD-001
    text: Password must be encrypted for storage
    validation: "Use bcrypt, cost factor >= 12, not MD5 or SHA"
  - id: MUST-REG-UNIQUE-001
    text: Prevent duplicate registration
    validation: "Database-level unique index + application-level atomic check"

outOfScope:
  - id: OUT-AUTH-OAUTH-001
    text: OAuth social login not supported
    reason: "Not included in this iteration, moved to REQ-OAUTH-001"

AI would generate implementations under explicit contract constraints, rather than improvising.
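A validation such as MUST-REG-PASSWORD-001 becomes useful once it can be checked mechanically. The following is a minimal sketch, assuming the stored value uses the standard bcrypt string layout (version prefix, two-digit cost, 53 characters of salt and digest). The helper name checkStoredHash and the sample values are hypothetical, and the check inspects only the format of the stored hash, not any password.

```javascript
// Hypothetical helper for MUST-REG-PASSWORD-001: checks that a stored value
// looks like a bcrypt hash with a cost factor of at least `minCost`.
// It inspects the hash format only; it never verifies a password.
const BCRYPT_PATTERN = /^\$2[aby]\$(\d{2})\$[./A-Za-z0-9]{53}$/;

function checkStoredHash(storedHash, minCost = 12) {
  const match = BCRYPT_PATTERN.exec(storedHash);
  if (!match) {
    return { ok: false, reason: "not a bcrypt hash" };
  }
  const cost = Number(match[1]);
  if (cost < minCost) {
    return { ok: false, reason: `cost factor ${cost} is below ${minCost}` };
  }
  return { ok: true, reason: `bcrypt with cost factor ${cost}` };
}

// Placeholder values that only imitate the shape of real hashes.
const md5Like = "0".repeat(32); // 32 hex characters, the shape of an MD5 digest
const weakBcrypt = "$2b$10$" + "a".repeat(53);
const strongBcrypt = "$2b$12$" + "a".repeat(53);

console.log(checkStoredHash(md5Like).reason);
console.log(checkStoredHash(weakBcrypt).reason);
console.log(checkStoredHash(strongBcrypt).reason);
```

A check like this would sit behind an evidence item rather than replace the contract: the Manifest states the rule, and the script turns it into a reproducible pass or fail.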

Limitations of code-level TDD

Traditional TDD (Test-Driven Development) centers on “test first”: write tests, then implement. This practice worked in the manual coding era because the same developers wrote both the tests and the implementation, keeping a consistent picture of the requirements in their heads.

But when AI becomes the implementation agent, problems arise:

Test code can only cover local aspects, not constrain the global picture.

Imagine this scenario: You ask AI to implement a user registration module. You write a unit test verifying that the API “returns an error when the password is empty.” AI generates code, and the test passes. But after going live, you discover:

  • AI didn’t implement email format validation (because no corresponding test existed)
  • AI stored passwords in plaintext (tests only verified API behavior, not storage logic)
  • AI didn’t handle race conditions for concurrent registration (tests didn’t cover concurrency scenarios)

This is the essential limitation of code-level TDD: tests can only verify “what I’ve tested,” while AI implements according to “its own understanding of the complete requirement.” The gap between these is the risk.

Manifest as the requirement contract matrix

There’s a long-standing consensus in software engineering: the later a defect is discovered, the higher the cost to fix it. In AI-generated code scenarios, this problem is amplified because models can quickly expand vague requirements into large amounts of seemingly runnable implementations.

AI-TDD’s core breakthrough: elevate acceptance criteria from “code-level test cases” to “requirement-level contract matrices.”

We call this the AI-TDD Gate Manifest — a machine-readable requirement contract checklist generated during the requirement confirmation phase.

Manifest is not simply a “test list” but a requirement contract matrix containing MUST, NEG (MUST NOT negative assertions), OUT (OUT OF SCOPE boundaries), TRACE, EVD, ACC/E2E, FAIL/EDGE, CMD, ART, and TASK namespaces. The complete technical definitions and Schema specifications for these dimensions will be expanded in Chapter 4.

Key insight: Manifest should reach a sufficiently complete state before execution begins. If requirements are added during execution, the model is more likely to miss key boundaries or experience implementation drift.

Core principles of AI-TDD

AI-TDD is not “write tests first, then let AI generate” but rather “define problem boundaries first, then verify implementation within boundaries.”

Six cognitive stages: Requirement Confirmation → Architecture Confirmation → Implementation Readiness → Execution Closure → Audit Review → Delivery Closeout

Two key gates: Implementation Readiness Gate (entry gate, expected status AI-TDD-RED, shortened below to TDD-RED) and Delivery Closeout Gate (delivery gate, expected status AI-TDD-GREEN, shortened below to TDD-GREEN), ensuring “no complete Manifest, no execution” and “unverified Manifest items, no delivery.”

The goal of this framework is straightforward: replace vague natural language requirements with machine-readable Manifest, establish a complete requirement contract before execution begins, and use reproducible evidence chains to decide whether the current implementation is deliverable.

AI-TDD’s first-class definition: requirement-contract-driven + multi-dimensional evidence acceptance

This article does not define AI-TDD as “let AI run TDD.” It defines AI-TDD as:

Encode human intent as a requirement contract first, then prove the current implementation satisfies that contract through a multi-dimensional evidence chain.

In engineering terms, the chain is:

Requirement Contract
→ Contract Slice
→ Evidence Chain
→ Gate Verdict
→ Human Decision

The chain has five layers:

| Layer | Typical namespace | Role |
| --- | --- | --- |
| Declaration | MUST / NEG / OUT | Defines what must be done, what must not happen, and what is out of scope |
| Slice | TRACE / TASK | Breaks requirements into traceable, acceptable contract slices |
| Scenario | ACC / E2E / EDGE / FAIL | Defines happy paths, end-to-end paths, edge cases, and failure paths |
| Evidence | EVD / CMD / ART / hash / receipt | Defines what counts as proof, how to reproduce it, and where artifacts live |
| State | AI-TDD-RED / IMPLEMENTING / CLOSEOUT_CANDIDATE / AI-TDD-GREEN / CLOSED | Lets Gates decide whether the contract lifecycle can move forward |

The TRACE row is AI-TDD’s smallest contract unit. Traditional TDD’s atomic unit is the test case. BDD’s atomic unit is the scenario. AI-TDD’s atomic unit is the contract slice: it must state which MUST/NEG it covers, which scenario verifies it, which evidence is required, which commands run, and which artifacts are produced. OUT boundaries do not belong in covers; they bind to scope-audit evidence through scopeAuditRefs or an equivalent field.
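The slice rules above can be sketched as a small validator. This is an illustration under assumptions, not a schema definition: the function name validateTraceRow is hypothetical, the field names (covers, scopeAuditRefs, evidenceRefs, commandRefs, artifactRefs) follow the Quick Start Manifest in this article, and scenario references are left out because that example does not define a field for them.

```javascript
// Hypothetical validator for one TRACE row (contract slice).
// Returns a list of problems; an empty list means the row is well-formed.
function validateTraceRow(row) {
  const problems = [];
  const covers = row.covers ?? [];
  const scopeAuditRefs = row.scopeAuditRefs ?? [];

  // A slice must bind to at least one MUST/NEG item or one OUT boundary.
  if (covers.length === 0 && scopeAuditRefs.length === 0) {
    problems.push("row covers nothing");
  }
  // OUT boundaries belong in scopeAuditRefs, never in covers.
  for (const id of covers) {
    if (!/^(MUST|NEG)-/.test(id)) problems.push(`covers may not contain ${id}`);
  }
  for (const id of scopeAuditRefs) {
    if (!id.startsWith("OUT-")) problems.push(`scopeAuditRefs may not contain ${id}`);
  }
  // Evidence, command, and artifact bindings are all required.
  for (const field of ["evidenceRefs", "commandRefs", "artifactRefs"]) {
    if (!Array.isArray(row[field]) || row[field].length === 0) {
      problems.push(`${field} is empty`);
    }
  }
  return problems;
}

// A deliberately broken row: an OUT id placed in covers, and no command bound.
const badRow = {
  id: "TRACE-CALC-SCOPE-001",
  covers: ["OUT-CALC-OPS-001"],
  evidenceRefs: ["EVD-CALC-SCOPE-001"],
  commandRefs: [],
  artifactRefs: ["ART-CALC-SCOPE-REPORT-001"],
};
console.log(validateTraceRow(badRow));
```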

Multi-dimensional evidence acceptance must also be stated upfront: test pass is one form of evidence, not delivery itself. A delivery can only be called AI-TDD-GREEN after the current attempt closes its TRACE -> EVD -> CMD -> ART evidence chain, the Gate returns a pass verdict, and a recorded Human Decision accepts the result.

Three-minute reading path: if you only want the core thesis, read this first-class definition, then the Quick Start section “Accept delivery through the evidence chain,” then Chapter 6’s Delivery Closeout Gate, and finally use the glossary to check the chain Requirement Contract -> Contract Slice -> Evidence Chain -> Gate Verdict -> Human Decision.

Quick start: get started with AI-TDD

Don’t want to read theory first? No problem. Follow this example to understand AI-TDD’s core principles.

Example: implementing a calculator with AI-TDD

Your requirement: “Implement an addition function”

Problems with traditional approaches: Telling AI directly to “write an addition function” might result in:

  • Not handling non-numeric inputs
  • Not clarifying numeric boundary conditions
  • Adding unnecessary features (subtraction, multiplication)

AI-TDD approach: Write Manifest first, then let AI generate.

Key state flow: Manifest → AI-TDD-RED / TDD-RED (tests exist/implementation missing) → CLOSEOUT_CANDIDATE (registered validations pass, delivery verdict not yet closed) → AI-TDD-GREEN / TDD-GREEN (evidence chain, Gate Verdict, and Human Decision close) → Delivery
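One way to make this flow enforceable is a guard that only allows a state to move one step forward. The sketch below is an assumption-laden illustration: the state names come from the State row of the layer table, while the strictly forward, one-step rule and the name canAdvance are hypothetical simplifications (a real Gate may also need a path back to implementation after a failed verdict).

```javascript
// Lifecycle order, using the state names from the State layer.
const LIFECYCLE = [
  "AI-TDD-RED",
  "IMPLEMENTING",
  "CLOSEOUT_CANDIDATE",
  "AI-TDD-GREEN",
  "CLOSED",
];

// Assumption: a transition is legal only if it moves exactly one step forward.
function canAdvance(from, to) {
  const fromIndex = LIFECYCLE.indexOf(from);
  const toIndex = LIFECYCLE.indexOf(to);
  return fromIndex !== -1 && toIndex === fromIndex + 1;
}

console.log(canAdvance("AI-TDD-RED", "IMPLEMENTING"));        // true
console.log(canAdvance("IMPLEMENTING", "AI-TDD-GREEN"));      // false: skips CLOSEOUT_CANDIDATE
console.log(canAdvance("CLOSEOUT_CANDIDATE", "AI-TDD-GREEN")); // true
```

The point of the guard is that passing tests alone can never jump an attempt straight to AI-TDD-GREEN.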

Step 1: Create a Manifest file

Create calculator-manifest.yaml:

manifest:
  version: "1.0.0"
  project: "calculator-demo"
  requirementId: "REQ-CALC-001"
  title: "Addition Function"

  acceptanceCriteria:
    must:
      - id: "MUST-CALC-ADD-001"
        description: "Accept two numeric parameters, return their sum"
        validation: "add(2, 3) === 5"

      - id: "MUST-CALC-FLOAT-001"
        description: "Handle floating-point numbers"
        validation: "add(0.1, 0.2) close to 0.3 (considering floating-point precision)"

    mustNot:
      - id: "NEG-CALC-TYPE-001"
        description: "Do not accept non-numeric inputs"
        validation: "add('a', 1) throws TypeError"

    outOfScope:
      - id: "OUT-CALC-OPS-001"
        description: "Subtraction, multiplication, division"
        reason: "This iteration only implements addition"

    evidence:
      - id: "EVD-CALC-UNIT-001"
        type: "unit_test"
        description: "Addition, floating-point, and non-numeric input tests all pass"
        requiredCommandRefs: ["CMD-CALC-TEST-001"]
        artifactRefs: ["ART-CALC-TEST-REPORT-001"]

      - id: "EVD-CALC-SCOPE-001"
        type: "scope_audit"
        description: "Implementation contains no subtraction, multiplication, or division behavior"
        requiredCommandRefs: ["CMD-CALC-SCOPE-001"]
        artifactRefs: ["ART-CALC-SCOPE-REPORT-001"]

    traceRows:
      - id: "TRACE-CALC-ADD-001"
        covers: ["MUST-CALC-ADD-001", "MUST-CALC-FLOAT-001"]
        evidenceRefs: ["EVD-CALC-UNIT-001"]
        commandRefs: ["CMD-CALC-TEST-001"]
        artifactRefs: ["ART-CALC-TEST-REPORT-001"]

      - id: "TRACE-CALC-NEG-001"
        covers: ["NEG-CALC-TYPE-001"]
        evidenceRefs: ["EVD-CALC-UNIT-001"]
        commandRefs: ["CMD-CALC-TEST-001"]
        artifactRefs: ["ART-CALC-TEST-REPORT-001"]

      - id: "TRACE-CALC-SCOPE-001"
        scopeAuditRefs: ["OUT-CALC-OPS-001"]
        evidenceRefs: ["EVD-CALC-SCOPE-001"]
        commandRefs: ["CMD-CALC-SCOPE-001"]
        artifactRefs: ["ART-CALC-SCOPE-REPORT-001"]

    commands:
      - id: "CMD-CALC-TEST-001"
        run: "npm test -- calculator.test.js --runInBand > artifacts/calculator-test-report.txt"
        producesArtifactRefs: ["ART-CALC-TEST-REPORT-001"]
      - id: "CMD-CALC-SCOPE-001"
        run: "node scripts/check-calculator-scope.mjs > artifacts/calculator-scope-report.txt"
        producesArtifactRefs: ["ART-CALC-SCOPE-REPORT-001"]

    artifacts:
      - id: "ART-CALC-TEST-REPORT-001"
        path: "artifacts/calculator-test-report.txt"
      - id: "ART-CALC-SCOPE-REPORT-001"
        path: "artifacts/calculator-scope-report.txt"

Key points:

  • must: What addition should do (2+3=5, floating-point handling)
  • mustNot: What addition should not do (not accepting strings)
  • outOfScope: Explicitly exclude other operations (preventing AI improvisation)
  • traceRows: Bind requirement boundaries to evidence, commands, and artifacts
  • evidence / commands / artifacts: Define what counts as proof, how to reproduce it, and where the evidence is stored
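Because every binding in the Manifest is an id reference, its completeness can be checked before any code is generated. The following sketch assumes the acceptanceCriteria section has already been parsed into a plain object (for example by a YAML parser, which is not shown). The function name checkManifest and the trimmed sample data are hypothetical.

```javascript
// Hypothetical integrity check over a parsed acceptanceCriteria object.
// Reports MUST/NEG/OUT items with no TRACE binding, and TRACE refs that
// point at ids the Manifest never defines.
function checkManifest(criteria) {
  const idsOf = (list) => new Set((list ?? []).map((item) => item.id));
  const known = {
    evidenceRefs: idsOf(criteria.evidence),
    commandRefs: idsOf(criteria.commands),
    artifactRefs: idsOf(criteria.artifacts),
  };
  const rows = criteria.traceRows ?? [];
  const covered = new Set(rows.flatMap((row) => row.covers ?? []));
  const audited = new Set(rows.flatMap((row) => row.scopeAuditRefs ?? []));

  const uncovered = [
    ...[...(criteria.must ?? []), ...(criteria.mustNot ?? [])].filter((item) => !covered.has(item.id)),
    ...(criteria.outOfScope ?? []).filter((item) => !audited.has(item.id)),
  ].map((item) => item.id);

  const dangling = [];
  for (const row of rows) {
    for (const [field, ids] of Object.entries(known)) {
      for (const ref of row[field] ?? []) {
        if (!ids.has(ref)) dangling.push(`${row.id} -> ${ref}`);
      }
    }
  }
  return { uncovered, dangling };
}

// Trimmed calculator Manifest with two deliberate gaps.
const criteria = {
  must: [{ id: "MUST-CALC-ADD-001" }, { id: "MUST-CALC-FLOAT-001" }],
  mustNot: [{ id: "NEG-CALC-TYPE-001" }],
  outOfScope: [{ id: "OUT-CALC-OPS-001" }],
  evidence: [{ id: "EVD-CALC-UNIT-001" }],
  commands: [{ id: "CMD-CALC-TEST-001" }],
  artifacts: [{ id: "ART-CALC-TEST-REPORT-001" }],
  traceRows: [
    {
      id: "TRACE-CALC-ADD-001",
      covers: ["MUST-CALC-ADD-001"],
      evidenceRefs: ["EVD-CALC-UNIT-001"],
      commandRefs: ["CMD-CALC-TEST-001"],
      artifactRefs: ["ART-CALC-TEST-REPORT-001"],
    },
    {
      id: "TRACE-CALC-SCOPE-001",
      scopeAuditRefs: ["OUT-CALC-OPS-001"],
      evidenceRefs: ["EVD-CALC-SCOPE-001"], // not defined in `evidence` above
      commandRefs: ["CMD-CALC-TEST-001"],
      artifactRefs: ["ART-CALC-TEST-REPORT-001"],
    },
  ],
};

console.log(checkManifest(criteria));
```

Run against the full calculator-manifest.yaml from Step 1, both lists should come back empty, which is one reasonable precondition for entering TDD-RED.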

Step 2: AI generates test code

Give the Manifest to AI with this prompt:

Based on the following Manifest, generate test code, ensuring coverage of NEG (MUST NOT) negative scenarios:

[Paste Manifest]

AI will generate tests similar to:

// calculator.test.js
const { add } = require("./calculator");

describe("Addition Function", () => {
  // MUST-CALC-ADD-001
  test("2 + 3 = 5", () => {
    expect(add(2, 3)).toBe(5);
  });

  // MUST-CALC-FLOAT-001
  test("floating-point handling", () => {
    expect(add(0.1, 0.2)).toBeCloseTo(0.3, 10);
  });

  // NEG-CALC-TYPE-001
  test("reject non-numeric inputs", () => {
    expect(() => add("a", 1)).toThrow(TypeError);
  });
});

Note: This is a conceptual example, not a fully runnable project scaffold. To execute it locally, you still need to initialize a test runner such as Jest or Vitest and define the npm test script.
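The floating-point test depends on what “close to” means. As a sketch, the rule Jest documents for toBeCloseTo can be written out by hand: the difference must be smaller than half of 10 to the power of minus the digit count. The helper name isCloseTo is hypothetical and exists only to make the tolerance visible.

```javascript
// Sketch of the closeness rule behind MUST-CALC-FLOAT-001.
// Mirrors Jest's documented criterion for toBeCloseTo:
// |expected - received| < 10^(-digits) / 2.
function isCloseTo(received, expected, digits = 2) {
  return Math.abs(expected - received) < Math.pow(10, -digits) / 2;
}

console.log(0.1 + 0.2 === 0.3);             // false: binary floating-point rounding
console.log(isCloseTo(0.1 + 0.2, 0.3, 10)); // true: the difference is far below 5e-11
console.log(isCloseTo(0.3001, 0.3, 10));    // false: the difference is about 1e-4
```

Writing the tolerance into the Manifest validation, rather than leaving it as “close to,” removes one more gap that AI would otherwise fill on its own.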

Development-phase red check: You may run npm test first to quickly confirm the tests fail. For the formal evidence chain, use the Manifest-registered CMD-CALC-TEST-001 to produce ART-CALC-TEST-REPORT-001.

Expected result: All tests fail (TDD-RED status) ✅

  • The add function doesn’t exist yet → failure is expected
  • This proves test code is ready, implementation can begin

Step 3: AI generates implementation code

Prompt:

Based on the following Manifest and test code, implement the add function, ensuring all tests pass:

[Paste Manifest]
[Paste test code]

AI will generate:

// calculator.js
function add(a, b) {
  if (typeof a !== "number" || typeof b !== "number") {
    throw new TypeError("Parameters must be numbers");
  }
  return a + b;
}

module.exports = { add };

Run the registered command: CMD-CALC-TEST-001

Expected result: All tests pass, and the candidate implementation enters closeout candidate status. It is not AI-TDD-GREEN yet because delivery evidence and Human Decision are not closed.

Step 4: Accept delivery through the evidence chain

A delivery cannot rely on the sentence “tests passed.” It must show that the current attempt’s contract slices are closed:

| Contract item | TRACE / Scope Audit | EVD | CMD | ART | Gate Verdict | Human Decision |
| --- | --- | --- | --- | --- | --- | --- |
| MUST-CALC-ADD-001 / MUST-CALC-FLOAT-001 | TRACE-CALC-ADD-001 | EVD-CALC-UNIT-001 | CMD-CALC-TEST-001 | ART-CALC-TEST-REPORT-001 | pass | accept, with decisionTimestamp and HD-CALC-001 receipt |
| NEG-CALC-TYPE-001 | TRACE-CALC-NEG-001 | EVD-CALC-UNIT-001 | CMD-CALC-TEST-001 | ART-CALC-TEST-REPORT-001 | pass | accept, confirming the negative path belongs to the current attempt |
| OUT-CALC-OPS-001 | TRACE-CALC-SCOPE-001.scopeAuditRefs | EVD-CALC-SCOPE-001 | CMD-CALC-SCOPE-001 | ART-CALC-SCOPE-REPORT-001 | pass | accept, confirming the scope-audit receipt is archived |

The checklist should be written as evidence closure, not as a generic todo list:

  • Every MUST/NEG has TRACE coverage
  • Every TRACE binds to EVD
  • Every EVD traces to a CMD run in the current attempt
  • Every CMD produces auditable ART
  • The scope audit for OUT-CALC-OPS-001 confirms AI did not add out-of-scope operations
  • Gate Verdict is pass
  • Human Decision is accept, with timestamp, actor, and receipt artifact recorded

Conclusion: Delivery Closeout Gate can return AI-TDD-GREEN only after the current attempt’s evidence chain closes, Gate Verdict is pass, and Human Decision is accept.
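The closure checklist can be expressed as a single function over recorded run results. This is a minimal sketch under stated assumptions: the record shapes (runs keyed by CMD id, with attemptId, exitCode, and artifacts), the humanDecision fields, and the name closeoutVerdict are all hypothetical and are not part of any published AI-TDD schema.

```javascript
// Hypothetical Delivery Closeout check for one attempt.
// `runs` maps a CMD id to the result recorded when that command last ran.
function closeoutVerdict({ traceRows, runs, attemptId, humanDecision }) {
  const gaps = [];
  for (const row of traceRows) {
    if ((row.evidenceRefs ?? []).length === 0) gaps.push(`${row.id}: no EVD bound`);
    for (const cmd of row.commandRefs ?? []) {
      const run = runs[cmd];
      if (!run || run.attemptId !== attemptId) {
        gaps.push(`${row.id}: ${cmd} not run in this attempt`);
      } else if (run.exitCode !== 0) {
        gaps.push(`${row.id}: ${cmd} exited with ${run.exitCode}`);
      } else if ((run.artifacts ?? []).length === 0) {
        gaps.push(`${row.id}: ${cmd} produced no ART`);
      }
    }
  }
  const gateVerdict = gaps.length === 0 ? "pass" : "fail";
  // A Human Decision counts only when it is an accept with actor, timestamp, and receipt.
  const humanAccepted =
    humanDecision?.decision === "accept" &&
    Boolean(humanDecision.actor) &&
    Boolean(humanDecision.timestamp) &&
    Boolean(humanDecision.receipt);
  return { gateVerdict, humanAccepted, green: gateVerdict === "pass" && humanAccepted, gaps };
}

const traceRows = [
  { id: "TRACE-CALC-ADD-001", evidenceRefs: ["EVD-CALC-UNIT-001"], commandRefs: ["CMD-CALC-TEST-001"] },
  { id: "TRACE-CALC-SCOPE-001", evidenceRefs: ["EVD-CALC-SCOPE-001"], commandRefs: ["CMD-CALC-SCOPE-001"] },
];
// The scope audit result is stale: it was recorded in an earlier attempt.
const runs = {
  "CMD-CALC-TEST-001": { attemptId: "attempt-2", exitCode: 0, artifacts: ["ART-CALC-TEST-REPORT-001"] },
  "CMD-CALC-SCOPE-001": { attemptId: "attempt-1", exitCode: 0, artifacts: ["ART-CALC-SCOPE-REPORT-001"] },
};

const result = closeoutVerdict({ traceRows, runs, attemptId: "attempt-2", humanDecision: null });
console.log(result.gateVerdict, result.gaps);
```

In this sample the Gate fails even though every command once exited with 0, because evidence from a previous attempt does not close the current one.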

CMD-CALC-SCOPE-001 should not directly use rg 'subtract|multiply|divide' as the passing condition because rg returns exit code 1 when no matches are found. The scope audit script should explicitly convert “no forbidden symbols” into exit code 0 and write ART-CALC-SCOPE-REPORT-001. For example:

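// scripts/check-calculator-scope.mjs
import { readFileSync } from 'node:fs';

// Read the implementation under audit.
const source = readFileSync('calculator.js', 'utf8');
// Plain substring match on the operation names excluded by OUT-CALC-OPS-001.
const forbidden = ['subtract', 'multiply', 'divide'].filter((name) => source.includes(name));

if (forbidden.length > 0) {
  // Write to stdout so the redirected report artifact also records the failure.
  console.log(`Out-of-scope operations found: ${forbidden.join(', ')}`);
  process.exit(1);
}

// No forbidden names found: the script ends with exit code 0.
console.log('Scope audit passed: only addition is implemented.');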

Comparison: traditional approach vs AI-TDD

| Dimension | Traditional Approach | AI-TDD |
| --- | --- | --- |
| Requirement expression | “Write an addition function” (vague) | Manifest YAML (precise) |
| AI output | May include subtraction, multiplication | Strictly addition only (clear boundaries) |
| Error handling | May be omitted | NEG (MUST NOT) enforces requirements |
| Acceptance criteria | Manual judgment “seems correct” | TRACE/EVD/CMD/ART evidence-chain closure |

Next steps

If this example interests you, continue reading Chapter 1 to understand why AI-TDD solves the core problems of AI-generated code.


Background verification: AI-TDD concepts predate the terminology

First, let’s clarify terminology boundaries: as of now, AI-TDD is not an industry-standard term with unified definitions like TDD, BDD, or CI/CD. Public materials do contain exact matches, such as the 2023 open-source project di-sukharev/AI-TDD and team training courses on “AI Test-Driven Development (AI-TDD).” But these are mostly tool names, course names, or community practice names, insufficient to prove that AI-TDD has formed recognized standards.

Therefore, when this article uses AI-TDD, it’s not treating it as an existing term formally defined by authoritative bodies, but rather using it to summarize a verifiable engineering trend: as AI assumes more code generation work, TDD evolves from “testing practice before manual coding” to “contract mechanism constraining AI generation behavior.”

In 2021, GitHub Copilot was released as a technical preview, allowing large numbers of developers to directly see that “large models can generate code.” At this point, industry focus was primarily on efficiency: can AI complete code, reduce boilerplate work, and help developers implement faster?

In 2023, the question began shifting from “can it generate” to “is it generated correctly.” After GPT-4’s release, large models demonstrated strong capabilities in various professional and academic tasks, but hallucinations, incorrect reasoning, and requirement misunderstandings persisted. In August 2023, Paul Sobocinski published “TDD with GitHub Copilot” on Martin Fowler’s website, systematically pointing out that LLMs provide irrelevant information and even hallucinate, making TDD more necessary when using AI coding assistants. The key judgment was that tests are not just feedback mechanisms but also ways to break problems into smaller pieces and let AI gradually approach correct implementation.

In 2023, the di-sukharev/AI-TDD open-source project explored the workflow of “humans write tests, GPT writes code until tests pass.” This project doesn’t represent industry consensus, but it shows that the abbreviation AI-TDD wasn’t coined after the fact. It was used early to describe specific practices combining AI and TDD.

From 2024 to 2026, academia began researching this direction with more cautious names, such as Generative AI for Test Driven Development, Test-Driven Development for Code Generation, Tests as Prompt, Test-Driven Agentic Development, and AI-native TDD framework. These papers may not use the abbreviation AI-TDD, but they collectively point to a trend: tests are not just verification artifacts but can serve as inputs for models to understand tasks, constrain generation, and expose errors. In other words, tests are evolving from “post-development checking tools” to “pre-AI-generation contract languages.”

In 2025, discussions around AI coding agents further intensified. Some engineering blogs and team practices began reinterpreting TDD as a quality-control mechanism for AI-generated code; Kent Beck discussed TDD’s relationship with AI agents in The Pragmatic Engineer interview, emphasizing tests’ constraining value in AI collaboration. At this point, the practice of “define executable verification first, then let AI implement” began evolving from personal techniques to team workflows, but the more rigorous statement should be: the industry is forming a practice spectrum of “AI-assisted TDD / TDD with AI agents / AI-native TDD governance” rather than having converged on a single AI-TDD standard term.

In February 2026, Thoughtworks held “The Future of Software Development Retreat” in Deer Valley, Utah. The report contained a crucial judgment: as AI assumes more code production work, engineering rigor doesn’t disappear but migrates to specifications, tests, constraints, and risk management. The report specifically noted that TDD can produce significantly better results for AI coding agents because when tests exist before code, agents cannot make incorrect implementations pass by writing tests that “verify incorrect behavior.”

This development trajectory shows that the AI-TDD discussed in this article is not a recitation of an existing standard term but a further abstraction of the above practice spectrum: from code-level testing to requirement-level contracts; from local assertions to global Manifest; from “test-driven implementation” to “contract-driven generation.” This is the background and naming boundary for the AI-TDD Gate Manifest proposed in this article.


Chapter 1: Why requirement-contract-driven AI-TDD is needed

1.1 Essential limitations of code-level TDD

Why doesn’t traditional TDD work in the AI era? The core problem isn’t the “test first” concept itself, but that tests can only cover code locally and cannot control AI’s global understanding of requirements.

Limitation 1: Locality of Test Coverage

Unit tests can only cover local code behavior. A user registration feature might require:

  • Input validation (unit tests can cover)
  • Database operations (integration tests needed)
  • Concurrency control (concurrency tests needed)
  • Security compliance (security scanning needed)
  • Performance metrics (performance tests needed)

Developers often only write the test types they’re most familiar with (usually unit tests), omitting other dimensions.

Limitation 2: Implicitness of Requirement Understanding

Test code itself is an “encoding” of requirements. But test code cannot answer:

  • “Which requirements does this test cover?”
  • “Which requirements are not yet covered?”
  • “Are there conflicts between requirements?”

When AI generates code, it sees isolated tests, not the global picture of requirements.

Limitation 3: Missing Boundary Definitions

Code-level TDD is good at defining “what to do” but not good at defining:

  • “What not to do” (OUT OF SCOPE)
  • “To what extent” (EVD)
  • “How to trace” (TRACE rows)

AI will “fill in” these missing definitions according to its own understanding, which is where risk lies.

1.2 Manifest: from test checklist to contract matrix

The essence of Manifest is transforming implicit knowledge in human brains into machine-readable explicit contracts.

Manifest vs Test Code:

| Dimension | Test Code | Manifest |
| --- | --- | --- |
| Abstraction level | Code-level (How) | Requirement-level (What) |
| Coverage scope | Local | Global |
| Human readability | Poor (requires code knowledge) | Good (structured document) |
| Machine readability | Good (executable) | Good (parsable) |
| Traceability | Weak (test→requirement?) | Strong (TRACE rows) |
| Completeness check | Hard (coverage ≠ requirement coverage) | Easy (acceptance item checklist) |

Core value of Manifest:

  1. Global view: Define “what all requirements are” before execution begins
  2. Machine-readable: AI can parse Manifest and understand the global picture of requirements
  3. Completeness checking: Can automatically check “which requirements are not yet covered by tests”
  4. Contract authority: Once confirmed, Manifest becomes a formal contract between humans and machines
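The completeness check in point 3 can be approximated with very little code if tests carry their requirement ids, as the calculator tests do in comments. The sketch below is a rough illustration: findUntestedIds is a hypothetical name, and matching ids in test source text is an assumption that only works when the team keeps those id comments accurate.

```javascript
// Hypothetical completeness check: which MUST/NEG ids never appear in the test source?
function findUntestedIds(manifestIds, testSource) {
  const idPattern = /\b(?:MUST|NEG)-[A-Z0-9]+(?:-[A-Z0-9]+)*\b/g;
  const referenced = new Set(testSource.match(idPattern) ?? []);
  return manifestIds.filter((id) => !referenced.has(id));
}

// A test file that forgot the floating-point requirement.
const testSource = `
  // MUST-CALC-ADD-001
  test("2 + 3 = 5", () => {});
  // NEG-CALC-TYPE-001
  test("reject non-numeric inputs", () => {});
`;

const manifestIds = ["MUST-CALC-ADD-001", "MUST-CALC-FLOAT-001", "NEG-CALC-TYPE-001"];
console.log(findUntestedIds(manifestIds, testSource)); // [ 'MUST-CALC-FLOAT-001' ]
```

Line coverage tools cannot produce this answer, because they measure which code ran, not which requirement was exercised.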

1.3 AI-TDD Gate: Manifest execution engine

AI-TDD Gate is Manifest’s technical implementation layer, responsible for:

1. Manifest Parsing and Registration

  • Read ai-tdd-manifest.yaml
  • Register all MUST/NEG/OUT/EVD/ACC/E2E acceptance items
  • Establish TRACE row mappings

2. Acceptance Test Generation

  • Automatically generate test code frameworks based on Manifest
  • Associate E2E/ACC test suites
  • Generate test coverage requirements

3. Gate State Determination

  • Run all tests
  • Determine TDD-RED (entry) or TDD-GREEN (delivery)
  • Block non-compliant items

4. Traceability Recording

  • Record results of each gate run
  • Maintain requirement→test→result traceability chains
  • Support post-hoc auditing
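The registration responsibility can be pictured as indexing acceptance items by the namespace prefix of their ids. This is only a sketch of one possible approach: registerItems is a hypothetical name, and rejecting duplicates and unknown prefixes is an assumption about how a Gate might behave, not a documented rule.

```javascript
// Namespaces the Gate registers, as listed above.
const NAMESPACES = ["MUST", "NEG", "OUT", "EVD", "ACC", "E2E"];

// Hypothetical registration step: group item ids by namespace prefix,
// rejecting duplicates and ids whose prefix is not a known namespace.
function registerItems(items) {
  const registry = new Map(NAMESPACES.map((namespace) => [namespace, []]));
  const rejected = [];
  const seen = new Set();
  for (const item of items) {
    const namespace = item.id.split("-")[0];
    if (!registry.has(namespace) || seen.has(item.id)) {
      rejected.push(item.id);
      continue;
    }
    seen.add(item.id);
    registry.get(namespace).push(item.id);
  }
  return { registry, rejected };
}

const { registry, rejected } = registerItems([
  { id: "MUST-CALC-ADD-001" },
  { id: "NEG-CALC-TYPE-001" },
  { id: "OUT-CALC-OPS-001" },
  { id: "EVD-CALC-UNIT-001" },
  { id: "MUST-CALC-ADD-001" }, // duplicate id
  { id: "TODO-1" },            // unknown namespace
]);

console.log(registry.get("MUST"), rejected);
```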

Chapter 2: First principles of architecture

2.1 From “solving problems” to “defining problem boundaries”

Before diving into AI-TDD’s technical details, it helps to ground the methodology in a more basic architectural idea: first principles.

Ultimate first principle: human cognitive capacity is limited

This assumption sits underneath much of software engineering. If human cognitive capacity were effectively unlimited, many architectural principles, patterns, and methodologies would matter much less. We would need far less decomposition, abstraction, and modularization.

Fundamental Goal Derived from Ultimate Principle: Managing Complexity

Because human cognition is limited, systems become hard to maintain, modify, and predict once complexity rises beyond what people can reliably reason about. At its core, architecture is an attempt to keep system complexity within human operating range.

Core Principle for Achieving the Goal: Problem Boundaries Precede Solutions

This is the most fundamental lever we have for managing complexity. The main source of complexity is often not the solution itself, but unclear problem boundaries. Only after those boundaries are explicit can we tell which complexity is necessary and which is self-inflicted. Any solution detached from the problem boundary introduces avoidable complexity.

From an engineering perspective, verifiability is often as important as raw model capability. If a task’s boundaries are unclear and cannot be checked, improvements in model architecture alone may still fail to produce reliable outcomes. Clear boundaries and verification criteria make model behavior easier to guide and evaluate.

Traditional architectural pitfall: over-focusing on “how”

Software architecture has long leaned toward one cognitive bias: we spend too much time on “how to solve the problem” and not enough on “where the problem boundary actually is.”

Traditional architecture design process:

Traditional Waterfall Architecture Design: Business requirements → Technology selection → System architecture → Module division → Interface design → Coding implementation

In this process, architects focus on “how to design” (How) at each stage:

  • Technology selection stage: Which framework? (Spring Boot / Express / Django)
  • System architecture stage: How to divide services? (Monolithic / Microservices / Serverless)
  • Module division stage: Where are module boundaries? (By domain / By function / By team)
  • Interface design stage: What API format? (REST / GraphQL / gRPC)
  • Coding implementation stage: How to specifically implement? (Design patterns / Code standards)

Fundamental flaw: once AI becomes the implementation agent, this pattern starts to break down. AI can generate endless “how” options, but if the “what” is unclear, any of those “how” options may be wrong. The traditional process does not define the problem boundary explicitly enough.

First Principle: Problem Boundaries Precede Solutions

AI-TDD’s architectural mental model is based on a simple but profound insight:

Clear problem boundary definition is more important than elegant solutions.

This doesn’t make solutions unimportant. It means any solution becomes risky when the boundaries are vague, because AI will fill in the gaps on its own, and those guesses may conflict with human intent.

From “Solving Problems” to “Defining Problem Boundaries”

The difference between these two architectural design paradigms determines whether AI-generated code “aligns with requirements” or “improvises”:

Paradigm A: Solution-Driven (Traditional)

Requirement: Implement user registration feature
Architect thinks:
- Which framework to use? (Spring Boot / Express / Django)
- How to design the database? (users table, password field)
- What API interface format? (POST /api/users)
- Which middleware is needed? (Redis cache, RabbitMQ message queue)

Paradigm B: Boundary-Definition-Driven (AI-TDD)

Requirement: Implement user registration feature
Architect thinks:
- MUST: What must be supported? (email registration, password encryption, duplicate detection)
- NEG / MUST NOT: What is prohibited? (plaintext storage, SQL injection, concurrent race conditions)
- OUT OF SCOPE: What is explicitly excluded? (OAuth, phone verification)
- EVD: How to verify? (unit tests, security scans, performance tests)
- BOUNDARY: Where are system boundaries? (Only responsible for registration, not email sending?)

Paradigm A’s result is a design solution. Paradigm B’s result is a contract matrix.

Key differences:

  • Paradigm A’s deliverable is “advice” — AI can reference or deviate from it
  • Paradigm B’s deliverable is “constraints” — AI must implement within boundaries; deviation equals failure

Five Dimensions of Boundary Definition

AI-TDD’s Manifest defines problem boundaries from five dimensions:

  1. Functional Boundary (MUST)

    • What the system must do
    • Each MUST has corresponding acceptance criteria and evidence requirements
  2. Negative And Scope Boundary (NEG / MUST NOT + OUT OF SCOPE)

    • What the system is prohibited from doing
    • Explicitly excluded functions and scenarios
    • This dimension is most easily overlooked in traditional architecture
  3. Evidence Boundary (EVD)

    • What constitutes proof of “completion”
    • Not “I think it’s complete” but “these commands all return 0”
  4. Traceability Boundary (TRACE rows)

    • Mapping relationships between requirements and verification
    • Ensures every requirement has test coverage and every test corresponds to a requirement
  5. State Boundary (TDD-RED/GREEN)

    • Entry state: Must be TDD-RED (tests exist but fail)
    • Delivery state: Must be TDD-GREEN (the current attempt’s TRACE/EVD/CMD/ART evidence chain is closed, Gate Verdict is pass, and Human Decision is accept)

Why Boundary Definition is Crucial in the AI Era

The fundamental difference between human developers and AI lies in how they autonomously fill gaps:

  • Human developers: Fill based on experience, intuition, and team norms. If boundaries are fuzzy, humans ask for clarification
  • AI: Fill based on probability, training data, and pattern matching. If boundaries are fuzzy, AI “guesses” the most likely implementation

A real case of fuzzy boundaries:

Requirement: “Implement file upload functionality”

Architect A (traditional thinking) designed the upload API, storage solution, and file type validation, but left several boundaries undefined:

  • Single file size limit?
  • Total storage quota?
  • Virus scanning requirements?

The AI-generated code implemented “file upload” with no size limit. After launch, users uploaded 10GB files and crashed the server. This was not the AI’s error: it completed the task of “implementing file upload,” but the architect never defined the boundaries of “what may be uploaded.”

Suppose the AI-TDD Manifest had defined:

must:
  - id: MUST-UPLOAD-FILE-001
    text: Support file upload
    validation: "Single file max 100MB, supports jpg/png/pdf formats"

outOfScope:
  - id: OUT-UPLOAD-VIDEO-001
    text: Video file upload not supported
    reason: "Not included in this iteration, moved to REQ-VIDEO-001"

evidence:
  - id: EVD-UPLOAD-PASS-001
    text: Upload 100MB file succeeds
    threshold: "Upload time < 30s"

  - id: EVD-UPLOAD-REJECT-001
    text: Upload 110MB file fails
    oracle: "Returns 413 Payload Too Large"

AI would then generate the implementation within these boundary constraints, and uploads over 100MB would be explicitly rejected.
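
The boundary in the Manifest above can be sketched as a check that runs before anything is stored. This is a minimal sketch, not the article's implementation: the function name `validate_upload`, the reading of 1 MB as 1024 × 1024 bytes, the 201 success code, and the 415 response for unsupported formats are all assumptions. Only the 100MB limit, the jpg/png/pdf formats, and the 413 response come from the Manifest.

```python
from pathlib import PurePath

# Assumption: 1 MB is taken as 1024 * 1024 bytes; the Manifest only says "100MB".
MAX_UPLOAD_BYTES = 100 * 1024 * 1024
# Formats listed in MUST-UPLOAD-FILE-001.
ALLOWED_EXTENSIONS = {".jpg", ".png", ".pdf"}


def validate_upload(filename: str, size_bytes: int) -> int:
    """Return the HTTP status a hypothetical upload endpoint would answer with."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return 413  # Payload Too Large, the oracle of EVD-UPLOAD-REJECT-001
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return 415  # Unsupported Media Type (assumed; not stated in the Manifest)
    return 201  # Created (assumed success code)
```

With this check in place, a 110MB file maps directly onto the rejection evidence, and a video file is refused because its extension is outside the declared formats.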

Shift in Architect Responsibilities

Under the AI-TDD paradigm, architect responsibilities shift from “designing optimal solutions” to “defining the clearest possible boundaries”:

| Traditional Responsibility | AI-TDD Responsibility |
| --- | --- |
| Selecting technology stack | Defining technology constraints (MUST use certain technology) |
| Designing module interfaces | Defining interface contracts (MUST satisfy certain contracts) |
| Writing architecture documents | Generating machine-readable Manifest |
| Reviewing code implementation | Accepting TDD-GREEN evidence |

This doesn’t reduce the architect’s value. It raises the bar for architectural precision: from “advice” to “contract,” and from “documentation” to “executable specification.”

2.2 Essence of human-AI collaboration: why natural language alone is often insufficient as an AI execution contract

AI-TDD uses structured YAML rather than natural language as the main requirement carrier. That is not just a tooling preference. It follows from a more basic observation about how human-AI collaboration actually communicates.

Three Communication Traps of Natural Language

Human communication works with natural language because people share context, experience, and the ability to infer intent. AI does not share those strengths, which makes natural language much riskier as an execution contract.

Trap 1: Amplification of Ambiguity

Humans can usually resolve natural-language ambiguity through context, but for AI the same ambiguity can silently change the implementation:

Human instruction: "Implement a high-performance cache"

Possible AI interpretations:
- Use Redis (most common)
- Use Memcached (lightweight)
- Use local memory Map (simplest)
- Use Caffeine (Java ecosystem)
- Use multi-level cache (most "high-performance")

Human expectation: Probably means Redis cluster + local cache multi-level solution
AI implementation: Chose simplest HashMap because instruction didn't explicitly exclude it

Trap 2: Invisibility of Implicit Assumptions

Human communication relies heavily on implicit assumptions. Consider this sentence:

"Implement user login, must be secure"

Human-understood implicit assumptions:

  • “Secure” means passwords must be hashed for storage (bcrypt, not MD5)
  • “Secure” means preventing SQL injection
  • “Secure” means having anti-brute-force mechanisms (failure count limits)
  • “Secure” means using HTTPS
  • “Secure” means sessions have expiration times

AI lacks this implicit knowledge. Without explicit definition, AI might:

  • Use MD5 for password storage (seen in training data)
  • Forget to handle SQL injection
  • Have no failure count limits
  • Use HTTP instead of HTTPS

Trap 3: Impossibility of Completeness Verification

Natural language requirement documents have a fundamental problem: you cannot automatically verify whether they are “complete.”

Requirements document:
- Users can register via email
- Users can log in via email
- Users can change passwords

Complete? To an AI, yes.
But missing:
- Password strength requirements?
- Email verification process?
- Session expiration policy?
- Account deletion functionality?
- Concurrent login handling?

Humans can ask “Are these requirements complete?” but no automated tool can scan natural language documents and answer “The requirement on page 3 paragraph 2 has no corresponding test coverage.”

Four Communication Advantages of Structured Contracts

AI-TDD uses YAML Manifest instead of natural language requirements not because YAML is “cooler” but because it solves the above traps of natural language:

Advantage 1: Elimination of Ambiguity

# Not "high-performance cache" but:
must:
  - id: MUST-CACHE-001
    text: Use Redis as cache storage
    validation: "Redis cluster mode, minimum 3 nodes"

  - id: MUST-CACHE-002
    text: Local second-level cache
    validation: "Caffeine cache, TTL 5 minutes"

outOfScope:
  - id: OUT-CACHE-MEMCACHED-001
    text: Do not use Memcached
    reason: "Team technology stack unified on Redis"

No ambiguity: Not “high-performance” (what is high-performance?) but explicit “Redis cluster + Caffeine second-level cache.”

Advantage 2: Making Implicit Assumptions Explicit

must:
  - id: MUST-AUTH-001
    text: Password encrypted storage
    validation: "Use bcrypt, cost factor >= 12"
    # Explicit definition, not implicit assumption

  - id: MUST-AUTH-002
    text: Prevent SQL injection
    validation: "All database operations use parameterized queries"
    # Not "be secure" but explicit constraints

outOfScope:
  - id: OUT-AUTH-SSO-001
    text: SSO single sign-on not supported
    reason: "Enterprise feature, moved to REQ-ENTERPRISE-001"
    # Explicit exclusion, not omission
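
The MUST-AUTH-002 constraint above (“All database operations use parameterized queries”) can be illustrated with a short sketch. It assumes Python's standard `sqlite3` module and a hypothetical `users` table; the function names `find_user` and `demo_connection` are illustrative and not part of the article.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, email: str):
    """Look up a user with a parameterized query (the MUST-AUTH-002 constraint)."""
    # The "?" placeholder makes the driver treat `email` as data, never as SQL.
    cur = conn.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cur.fetchone()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with one hypothetical user for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))
    return conn
```

A classic injection string such as `' OR '1'='1` is matched literally against the `email` column and therefore finds no row, which is the kind of observable behavior an evidence item can assert on.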

Advantage 3: Machine-Verifiable Completeness

must:
  - id: MUST-REG-EMAIL-001
    text: Users can register via email
    evidenceRefs: [EVD-REG-TEST-001]
    coveredByTraceRows: [TRACE-REG-001] # ← Must have Trace coverage

traceRows:
  - id: TRACE-REG-001
    covers: [MUST-REG-EMAIL-001] # ← Reverse declaration of coverage
    evidenceRefs: [EVD-REG-TEST-001]
    acceptanceRefs: [ACC-REG-001] # ← Must have tests

acceptanceTests:
  - id: ACC-REG-001
    file: tests/acceptance/registration.test.ts

Gate can automatically verify:

  • “Does MUST-REG-EMAIL-001 have Trace coverage?”
  • “Does TRACE-REG-001 have EVD support?”
  • “Does EVD have CMD verification?”
  • “Does ACC-REG-001 test file exist?”

Completeness is not a “feeling” but a machine-checkable fact.
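
The Gate questions above can be sketched as a small structural check. This is a hedged illustration rather than the actual AI-TDD Gate: it assumes the Manifest has already been parsed into a plain dict, the name `check_trace_coverage` is hypothetical, and it covers only the MUST↔TRACE, TRACE→EVD, and TRACE→ACC references (it does not check EVD→CMD bindings or whether test files exist on disk).

```python
def check_trace_coverage(manifest: dict) -> list[str]:
    """Return human-readable findings; an empty list means the checks passed."""
    findings = []
    trace_rows = {row["id"]: row for row in manifest.get("traceRows", [])}
    acceptance_ids = {acc["id"] for acc in manifest.get("acceptanceTests", [])}

    for must in manifest.get("must", []):
        covering = must.get("coveredByTraceRows", [])
        if not covering:
            findings.append(f"{must['id']}: no Trace coverage")
        for trace_id in covering:
            row = trace_rows.get(trace_id)
            if row is None:
                findings.append(f"{must['id']}: unknown trace row {trace_id}")
            elif must["id"] not in row.get("covers", []):
                # Forward reference exists, but the reverse declaration is missing.
                findings.append(f"{trace_id}: does not declare {must['id']} in covers")

    for trace_id, row in trace_rows.items():
        if not row.get("evidenceRefs"):
            findings.append(f"{trace_id}: no EVD support")
        for acc_id in row.get("acceptanceRefs", []):
            if acc_id not in acceptance_ids:
                findings.append(f"{trace_id}: unknown acceptance test {acc_id}")
    return findings


# Mirrors the registration Manifest fragment above, written as a plain dict.
MANIFEST = {
    "must": [{
        "id": "MUST-REG-EMAIL-001",
        "evidenceRefs": ["EVD-REG-TEST-001"],
        "coveredByTraceRows": ["TRACE-REG-001"],
    }],
    "traceRows": [{
        "id": "TRACE-REG-001",
        "covers": ["MUST-REG-EMAIL-001"],
        "evidenceRefs": ["EVD-REG-TEST-001"],
        "acceptanceRefs": ["ACC-REG-001"],
    }],
    "acceptanceTests": [{"id": "ACC-REG-001",
                         "file": "tests/acceptance/registration.test.ts"}],
}
```

Calling `check_trace_coverage(MANIFEST)` returns an empty list for the fragment above. Removing `coveredByTraceRows` from the MUST, or `covers` from the trace row, produces a finding instead of a silent gap.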

Advantage 4: Precise Definition of Execution Semantics

Natural language tells you “what to do” but not “how to do it” or “to what extent.” The Manifest defines execution semantics precisely through a multi-layer structure:

# Not just "registration feature" but:
must:
  - id: MUST-REG-001
    text: Users can register via email
    validation: |
      1. Email format complies with RFC 5322
      2. Password length 8-32 characters
      3. Must contain uppercase, lowercase letters and numbers
    evidenceRefs: [EVD-REG-001, EVD-REG-002]
    coveredByTraceRows: [TRACE-REG-001]

evidence:
  - id: EVD-REG-001
    text: Registration success evidence
    requiredCommandRefs: [CMD-REG-TEST-001]
    artifactRefs: [ART-TEST-REPORT-001]

  - id: EVD-REG-002
    text: Registration failure boundary evidence
    requiredCommandRefs: [CMD-REG-TEST-002]
    # Negative tests: invalid email, weak password, duplicate registration

Each layer has clear execution semantics. AI does not have to “guess” what “valid registration” means, because all validation logic is explicitly defined in the Manifest.
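
To show how little room such a validation field leaves for interpretation, here is a minimal sketch of the password rules from MUST-REG-001. The function name `password_violations` and the wording of the messages are assumptions. The email rule is deliberately left out, because faithful RFC 5322 validation is considerably more involved than a few lines.

```python
def password_violations(password: str) -> list[str]:
    """List which MUST-REG-001 password rules a candidate password breaks."""
    violations = []
    if not 8 <= len(password) <= 32:
        violations.append("length must be 8-32 characters")
    if not any(ch.isupper() for ch in password):
        violations.append("missing uppercase letter")
    if not any(ch.islower() for ch in password):
        violations.append("missing lowercase letter")
    if not any(ch.isdigit() for ch in password):
        violations.append("missing number")
    return violations
```

An empty result means the password satisfies the rules. A non-empty result names the exact rule that failed, which is the kind of output a negative test under EVD-REG-002 could assert on.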

New Communication Protocol for Human-AI Collaboration

AI-TDD essentially defines a new communication protocol for human-AI collaboration:

| Traditional Communication | AI-TDD Communication |
| --- | --- |
| Natural language requirement documents | YAML Manifest contracts |
| “You know what I mean” implicit assumptions | Explicit MUST/NEG/OUT constraints |
| Manual completeness review | Automatic TRACE rows verification |
| “Probably complete” feeling | TDD-GREEN objective evidence |
| Post-implementation manual acceptance | Implementation Readiness Gate pre-interception |

Key insight:

  • Humans excel at: Fuzzy intent, creative thinking, value judgment, boundary trade-offs
  • AI excels at: Pattern matching, large-scale generation, consistent execution, rapid iteration
  • Manifest’s role: Transform human fuzzy intent into contracts AI can precisely execute

Not replacing human judgment, but precisely transmitting human judgment

AI-TDD’s goal is not to let AI replace human architectural decisions but to let AI precisely execute human architectural decisions.

Natural language suits human-to-human communication because humans have consensus and context. YAML Manifest suits human-AI collaboration because it eliminates ambiguity, makes constraints explicit, and enables verifiable completeness.

This is not a downgrade in communication but an upgrade in precision — from “roughly understood” to “precise contract.”

2.3 Contract as code: Manifest as the bridge between human intent and AI execution

Manifest’s role in AI-TDD can be summarized in one sentence: Contract as Code.

More precisely, Manifest can be treated as the “source code” of human intent, AI-TDD Gate plays a role similar to a “compiler and runtime,” and generated implementation is the execution result.

Intent Decay in Traditional Software Development

In traditional software development, requirements pass from humans to implementation through multiple “translations,” each potentially causing intent decay:

Human intent (Product Manager)
    ↓ Translate to natural language
Requirements document (PRD)
    ↓ Translate to technical language
Technical solution (Architecture Doc)
    ↓ Translate to code
Implementation code (Source Code)
    ↓ Translate to machine instructions
Executable program (Binary)

Each “translation” introduces noise:

  • PM’s “high-performance” understood by architect as “use cache”
  • “Use cache” understood by developer as “add Redis”
  • “Add Redis” implemented as “single-node Redis” rather than “cluster”
  • Final performance is not “high”

Intent Decay Amplified in the AI Era

When AI becomes the implementation agent, the problem worsens:

Human intent
    ↓ Natural language Prompt
AI understanding (probability model inference)
    ↓ Generate code
Implementation code

AI’s understanding is not translation but inference: probabilistic inference based on training data. This inference introduces additional noise:

  • “High-performance” → AI infers “use most familiar technology”
  • “Secure” → AI infers “common security measures”
  • “Scalable” → AI infers “microservices architecture” (potentially over-engineered)

Manifest: Eliminate Translation Layers, Directly Encode Intent

AI-TDD’s solution is to eliminate intermediate translation layers, making human intent directly machine-executable contracts:

Human intent (architectural decision)
    ↓ Direct encoding
Manifest contract (YAML)
    ↓ AI-TDD Gate parsing and execution
AI generation (under contract constraints)
    ↓ Verification
TDD-GREEN evidence

Key difference: Manifest doesn’t “describe” intent, it “encodes” intent.

Three Levels of Contract as Code

Level 1: Syntax Layer — Structured Encoding

The Manifest’s use of YAML is deliberate. The choice rests on the following technical considerations:

# Human-readable
must:
  - id: MUST-REG-EMAIL-001 # Unique identifier
    text: "Users can register via email" # Human intent
    validation: "email format complies with RFC 5322" # Machine-verifiable constraint
    evidenceRefs: [EVD-REG-TEST-001] # Explicit dependencies
    coveredByTraceRows: [TRACE-REG-001] # Bidirectional traceability

  • Unique identifier (ID): Each requirement has a globally unique ID, referenceable and traceable
  • Human-readable text: Preserves human-understandable descriptions
  • Machine-verifiable validation: Precise verification conditions
  • Explicit dependencies (evidenceRefs): No implicit dependencies, all dependencies explicitly declared
  • Bidirectional traceability (coveredByTraceRows): Establishes complete requirement↔verification linkage

Level 2: Semantic Layer — Executable Contracts

Manifest isn’t just “documentation”; it has clear execution semantics:

# These fields aren't decorations, they're executable instructions
traceRows:
  - id: TRACE-REG-001
    covers: [MUST-REG-EMAIL-001, MUST-REG-PASSWORD-001] # ← Executor checks: Are these MUSTs defined?
    evidenceRefs: [EVD-REG-TEST-001] # ← Executor checks: Are these EVDs provided?
    deliveryEvidenceCommandRefs: [CMD-REG-TEST-001] # ← Executor runs: Execute these commands
    acceptanceRefs: [ACC-REG-001] # ← Executor verifies: Do these tests pass?

When AI-TDD Gate parses Manifest:

  1. Static verification: Check if all references exist (MUST, EVD, CMD)
  2. Dependency resolution: Build dependency graphs of requirements→evidence→commands→tests
  3. Execution scheduling: Run commands in Trace order, collect evidence
  4. State determination: Determine TDD-RED or TDD-GREEN based on evidence
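
Steps 3 and 4 above can be sketched as follows. This is a simplified illustration under stated assumptions: commands are given as a hypothetical mapping from CMD id to an argument list, the function names are invented for the example, and the state decision looks only at exit codes. Per the State Boundary definition earlier in the article, a real TDD-GREEN additionally requires a closed evidence chain, a passing Gate Verdict, and an accepting Human Decision.

```python
import subprocess
import sys


def run_commands(commands: dict[str, list[str]]) -> dict[str, int]:
    """Run each CMD entry and record its exit code as evidence."""
    results = {}
    for cmd_id, argv in commands.items():
        completed = subprocess.run(argv, capture_output=True)
        results[cmd_id] = completed.returncode
    return results


def determine_state(trace_row: dict, exit_codes: dict[str, int]) -> str:
    """TDD-GREEN only if every command the trace row requires exited with 0."""
    required = trace_row.get("deliveryEvidenceCommandRefs", [])
    if not required:
        return "TDD-RED"  # no command evidence declared, so nothing is proven
    missing = [cmd for cmd in required if cmd not in exit_codes]
    if missing:
        raise KeyError(f"undefined command references: {missing}")
    return "TDD-GREEN" if all(exit_codes[c] == 0 for c in required) else "TDD-RED"


# Stand-in commands: a Python one-liner that exits with a chosen status.
DEMO_COMMANDS = {
    "CMD-REG-TEST-001": [sys.executable, "-c", "raise SystemExit(0)"],
    "CMD-REG-TEST-002": [sys.executable, "-c", "raise SystemExit(1)"],
}
```

Treating an empty command list as TDD-RED is a design choice in this sketch: a trace row that declares no commands has produced no evidence, so it should not pass by default.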

Level 3: Meta-Semantic Layer — Contract Meta-Programming

Manifest’s most powerful feature is self-reference — Manifest can describe its own completeness requirements:

# Manifest describes "what makes a Manifest complete"
implementationConfirmation:
  status: user_confirmed

  must:
    - id: MUST-META-001
      text: Manifest must contain Trace coverage for all MUSTs
      oracle: "Every MUST has coveredByTraceRows"

    - id: MUST-META-002
      text: Manifest must contain boundary declarations for all NEG (MUST NOT) items
      oracle: "Every NEG is bound to EVD or FAIL paths; OUT OF SCOPE only records scope boundaries"

  closeoutReadinessPreview:
    requiredCommands: [CMD-VALIDATE-MANIFEST-001]
    # Manifest can define "how to validate itself"

This self-reference enables AI-TDD to programmatically verify structural contract completeness. Human review is still required for business semantics, but reference integrity, coverage relationships, and evidence binding can be checked by machines.

Contract as Code vs Code as Contract

Some might ask: Why not use code (test code) directly as the contract? That’s what traditional TDD does.

Key differences:

| Code as Contract (Traditional TDD) | Contract as Code (AI-TDD) |
| --- | --- |
| Contract is test code | Contract is YAML Manifest |
| Humans read tests to infer requirements | Machines directly parse contracts |
| Requirements implicit in tests | Requirements explicitly declared |
| Completeness cannot be automatically checked | TRACE rows can be automatically verified |
| Negative behavior hard to express | NEG / MUST NOT and OUT OF SCOPE natively supported |
| State only “pass/fail” | TDD-RED/GREEN explicit states |

Test code is a tool for “verifying implementation,” while Manifest is a tool for “encoding intent.” They complement each other: Manifest defines “what should be done,” and tests verify “whether it was done.”

Manifest as the Primary Source of Truth for Intent

In AI-TDD, Manifest should be maintained as the primary source of truth for the acceptance contract:

Manifest-Centric Architecture: All participants (humans, AI, gates) get information from the same Manifest

| Participant | How They Use Manifest | Output |
| --- | --- | --- |
| Humans | Read HTML confirmation pages | Decisions and confirmations |
| AI | Parse YAML for execution | Implementation code |
| AI-TDD Gate | Verify execution status | Gate reports and evidence |

Advantage: Version drift between “requirements understood by humans” and “code implemented by AI” becomes easier to reduce and trace. All participants work from the same contract.

Practical Insight: Writing Manifest is Programming

Write Manifest as code:

  1. Version control: Manifest under Git management, every modification has history
  2. Code review: Manifest changes need Review, just like code Review
  3. Automated checking: CI/CD pipelines automatically verify Manifest syntax and semantics
  4. Refactoring support: When requirements change, Manifest refactoring corresponds to test and implementation refactoring
  5. Documentation as code: Manifest is self-documenting, no additional “requirement documents” needed

Key insight:

Manifest is not “better requirement documentation” but a new type of programming language — a domain-specific language (DSL) specifically for expressing human intent and machine contracts.

Learning AI-TDD isn’t learning how to “write better documents” but learning how to think and encode in the language of contracts.


Chapter 3: AI-TDD vs existing technical solutions

3.1 From TDD to BDD to AI-TDD: evolution of methodologies

Traditional TDD, BDD, and AI-TDD can be read as three stages in the evolution of software engineering practice. To understand what AI-TDD adds, we need to look at the blind spots that TDD and BDD expose in the AI era. Both methods assume “humans write the tests, and humans write the implementation.” Once AI becomes the implementation agent, that assumption weakens.

Figure 1: Evolution from TDD to BDD to AI-TDD. This figure answers one question: how do TDD, BDD, and AI-TDD evolve across abstraction levels?

Traditional TDD: test-driven development

Traditional TDD is built on a simple idea: write the test first, then implement. That model works well in a manual coding environment, but it faces structural challenges once implementation is delegated to AI.

Three Implicit Assumptions of Traditional TDD:

  1. Test writer and implementer are the same person — requirements are consistent in the developer’s mind
  2. Test code can express complete requirements — correct behavior can be inferred from tests
  3. Test omissions can be discovered in code review — humans can identify missing test scenarios

In the AI era, all three assumptions fail:

  1. Test writer is human, implementer is AI — their “understanding of requirements” may be completely different
  2. AI cannot see the global picture of tests — it only sees isolated test cases, not the constraint relationships between requirements
  3. AI-generated code may pass existing tests but miss key scenarios — code reviewers struggle to discover AI’s “implicit assumptions”

Traditional TDD vs AI-TDD comparison:

| Dimension | Traditional TDD | AI-TDD |
| --- | --- | --- |
| Requirement carrier | Test code + developer memory | Manifest contract matrix |
| Completeness verification | Depends on manual review | Machine automatic checking (TRACE rows) |
| Boundary definition | Implicit (no test = not done) | Explicit (OUT OF SCOPE must be declared) |
| Negative constraints | Difficult to express | Native support (NEG / MUST NOT) |
| State flow | No explicit state | TDD-RED→TDD-GREEN gate-driven |
| Execution agent | Human developer | AI execution, human confirmation |
| Acceptance criteria | Tests pass | All Manifest EVD verified |

BDD/Gherkin: behavior-driven development

BDD (Behavior-Driven Development) uses natural language to describe behavior in an attempt to narrow the gap between business stakeholders and developers:

Feature: User Registration
  Scenario: Valid registration
    Given user inputs valid email and password
    When submit registration request
    Then return 201 Created
    And password must be encrypted for storage

BDD’s Core Assumptions vs AI Era Reality:

BDD makes three core assumptions:

  1. Natural language is clear enough — business people, developers, and testers can understand the same set of descriptions
  2. Step Definitions are accurate mappings — natural language can be accurately converted to code implementation
  3. Humans are execution agents — developers manually implement functionality according to BDD descriptions

In an AI-heavy workflow, each of those assumptions becomes less stable:

Challenge 1: Natural Language Ambiguity is Amplified by AI

BDD’s Given/When/Then is clear enough for humans but full of ambiguity for AI:

# Clear to humans, ambiguous to AI
Then password must be encrypted for storage

AI might understand:

  • Return encrypted password in response (wrong)
  • Store bcrypt hash in database (correct, but what hash strength?)
  • Use AES encryption for storage (wrong, reversible encryption)
  • Use SHA256 (wrong, unsuitable for passwords)

Challenge 2: BDD’s Traceability is Unidirectional

BDD’s trace chain is Scenario → Step Definitions → Code implementation. That mapping is mostly one-way and lacks strong reverse verification:

  • Code reviewers can ask: “Which Scenario does this function correspond to?” — requires manual lookup
  • AI cannot achieve: “Given code implementation, automatically check if it satisfies all Scenarios”
  • No global contract matrix to verify “whether all requirements are covered”

Challenge 3: BDD Struggles to Express Negative Contracts

BDD excels at expressing “what should happen” but not “what is prohibited”:

# Awkward negative expression
Scenario: Should not store plaintext password
  Given user registration succeeds
  Then plaintext password should not exist in database

# Even more awkward boundary exclusion
Scenario: Does not include social login
  Given this is standard registration flow
  Then OAuth option should not appear

These “negative” scenarios are awkward in BDD by design. BDD is good at expressing intended behavior, not at declaring hard negative constraints.

BDD/Gherkin vs AI-TDD Gate Manifest Comparison:

| Dimension | BDD/Gherkin | AI-TDD Gate Manifest |
| --- | --- | --- |
| Format | Natural language (Given/When/Then) | Structured YAML (machine-readable) |
| Abstraction layer | Behavior description | Requirement contract matrix |
| Completeness verification | Depends on manual Scenario coverage review | Machine automatic checking (TRACE rows) |
| Traceability | Weak (Scenario→Step Definitions) | Strong (5D trace matrix) |
| State management | No built-in state machine | TDD-RED/TDD-GREEN explicit state flow |
| Gate control | None | Implementation Readiness Gate + Delivery Closeout Gate |
| Evidence requirements | Tests pass | Must associate EVD, ART, CMD |
| Negative constraints | Difficult to express (Given not… awkward) | Native support (NEG / MUST NOT + OUT OF SCOPE) |
| Execution constraints | Loose (Steps can be omitted) | Strict (Traces must cover all MUSTs/NEGs) |

Key differences explained:

1. Machine readability vs human readability

BDD prioritizes human readability:

# Human-friendly, machine difficult to precisely parse
Then password must be encrypted for storage and comply with security standards

AI-TDD prioritizes machine readability:

# Machine-readable, semantically precise
must:
  - id: MUST-REG-PASSWORD-001
    text: Password must be encrypted for storage
    validation: "Database stores bcrypt hash, not plaintext, cost factor >= 12"
    evidenceRefs: [EVD-SEC-PASSWORD-001]

When AI becomes the implementation agent, machine readability matters more. AI needs precise constraints, not descriptions that depend on human interpretation.

2. Contract Completeness Mechanism

BDD has no “is the contract complete” checking mechanism:

  • You can write 100 Scenarios and still miss critical negative constraints
  • No machine can ask: “Which requirements have no Scenario coverage?”

AI-TDD’s TRACE rows enforce bidirectional traceability:

must:
  - id: MUST-REG-EMAIL-001
    text: Accept valid email and password
    evidenceRefs: [EVD-REG-TEST-001]
    coveredByTraceRows: [TRACE-REG-001] # Must have Trace coverage

traceRows:
  - id: TRACE-REG-001
    covers: [MUST-REG-EMAIL-001] # Reverse declaration of coverage
    evidenceRefs: [EVD-REG-TEST-001]
    deliveryEvidenceCommandRefs: [CMD-REG-TEST-001]
    acceptanceRefs: [ACC-REG-001]

Gate can automatically verify: “Does MUST-REG-EMAIL-001 have Trace coverage? Does TRACE-REG-001 have EVD support?”

3. First-Class Status of Negative Behavior

Negative behaviors can only be awkwardly expressed in BDD. In AI-TDD, negative behavior is core to the contract:

mustNot:
  - id: NEG-SEC-PLAIN-PASSWORD-001
    text: Prohibit storing plaintext password
    evidenceRefs: [EVD-SEC-PASSWORD-001]
    oracle: negative control oracle

outOfScope:
  - id: OUT-SEC-STRESS-001
    text: SQL-injection stress testing is outside this delivery slice
    reason: Moved to the dedicated security task

edgeCases:
  - id: EDGE-REG-CONCURRENCY-001
    category: explicit_error_case_mapping
    condition: Duplicate email during concurrent registration
    expectedBehavior: Fail closed
    forbiddenBehavior: Silently overwrite existing account

Why must negative behavior be explicitly defined in the AI era?

Because AI “creatively” fills requirement gaps. If you don’t explicitly say “prohibit plaintext storage,” AI might choose the simplest implementation (plaintext storage) to make functionality “work quickly.” Negative contracts are the boundary fences that keep AI from taking that kind of liberty.
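
A negative contract such as NEG-SEC-PLAIN-PASSWORD-001 needs an oracle that can actually fail. The sketch below shows one crude form of such a negative control: it inspects a stored record and reports whether the raw password appears in any field. The names are hypothetical, and `hashlib.pbkdf2_hmac` is used only as a standard-library stand-in for the bcrypt hashing the article's Manifest requires.

```python
import hashlib
import os


def hash_password(password: str) -> dict:
    """Build the record a hypothetical registration handler would store."""
    salt = os.urandom(16)
    # Stand-in for bcrypt so the sketch runs on the standard library alone.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return {"salt": salt.hex(), "password_hash": digest.hex()}


def violates_plaintext_ban(record: dict, password: str) -> bool:
    """Negative control: True if any stored field contains the raw password."""
    return any(password in str(value) for value in record.values())
```

A substring check like this is only a first line of defense. It proves that the obvious violation is absent; it does not prove that the hashing scheme itself is adequate, which is why the Manifest pairs the NEG item with a MUST that names the algorithm and cost factor.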

4. TDD-RED/GREEN State Machine

BDD has no state concept — Scenarios either pass or fail. AI-TDD has explicit state flow:

acceptanceTests:
  - id: ACC-REG-001
    file: tests/acceptance/user-registration.test.ts
    covers: [MUST-REG-EMAIL-001]
    expectedPreImplementationState: expected_red # Entry gate must be TDD-RED
    commandRefs: [CMD-REG-TEST-001]

Value of TDD-RED state: During the implementation readiness phase, all acceptance tests must be red. This proves:

  • Tests are actually running (not fake tests)
  • Implementation is actually missing (not pre-implemented)
  • Manifest is actually being used for verification

BDD cannot distinguish “tests not yet written” from “tests written but implementation missing” — both are “failures” in BDD but fundamentally different states in AI-TDD.
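
The distinction between those states can be made concrete with a small classifier. This is a sketch under assumptions: the function name and the returned status strings (other than `expected_red`, which appears in the Manifest) are invented for illustration, and the caller is assumed to know whether the test file exists and what exit code the test command produced.

```python
from typing import Optional


def entry_gate_status(test_exists: bool, exit_code: Optional[int]) -> str:
    """Classify an acceptance test before implementation starts.

    exit_code is the result of running the test command, or None if it never ran.
    """
    if not test_exists:
        return "blocked: test not written"
    if exit_code is None:
        return "blocked: test not executed"
    if exit_code == 0:
        return "blocked: already green, implementation may pre-exist"
    return "expected_red"
```

Only the last branch lets the entry gate open. The other three are all “not red” for different reasons, and each points to a different corrective action.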

5. Evidence Chain Completeness

BDD’s “pass” is binary: test pass = complete. AI-TDD’s “pass” is a multi-layer evidence chain:

Five-Dimension Evidence Chain Structure (core five layers):

| Level | Element | Role |
| --- | --- | --- |
| L1 | MUST (Requirement) | Business commitment |
| L2 | TRACE (Execution Slice) | Execution tracking |
| L3 | EVD (Evidence Definition) | Verification criteria |
| L4 | CMD (Verification Command) | Reproducible execution |
| L5 | ART (Deliverable) | Physical output |

The complete evidence chain also includes: ACC (automated verification), EXIT CODE (objective result), AUDIT RECEIPT (audit traceability).

Each link can be independently audited and verified. BDD Scenario passes can lose credibility when over-mocked or over-stubbed; AI-TDD evidence chains raise the cost of fabrication through commands, artifacts, and audit records, but still require review discipline.
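
Independent auditability means a tool can walk the chain and name the first missing link. The sketch below does that for the five core layers. It assumes the Manifest is held as dicts keyed by id, and the top-level `commands` and `artifacts` keys as well as the function name are hypothetical; the reference field names follow the article's YAML examples.

```python
from typing import Optional


def first_broken_link(manifest: dict, must_id: str) -> Optional[str]:
    """Walk MUST -> TRACE -> EVD -> CMD/ART and report the first gap, if any."""
    must = manifest["must"].get(must_id)
    if must is None:
        return f"L1: {must_id} is not defined"
    trace_ids = must.get("coveredByTraceRows", [])
    if not trace_ids:
        return f"L2: {must_id} has no TRACE row"
    for trace_id in trace_ids:
        trace = manifest["traceRows"].get(trace_id)
        if trace is None:
            return f"L2: {trace_id} is not defined"
        evd_ids = trace.get("evidenceRefs", [])
        if not evd_ids:
            return f"L3: {trace_id} has no EVD"
        for evd_id in evd_ids:
            evd = manifest["evidence"].get(evd_id)
            if evd is None:
                return f"L3: {evd_id} is not defined"
            cmd_ids = evd.get("requiredCommandRefs", [])
            if not cmd_ids or any(c not in manifest["commands"] for c in cmd_ids):
                return f"L4: {evd_id} lacks a defined CMD"
            art_ids = evd.get("artifactRefs", [])
            if not art_ids or any(a not in manifest["artifacts"] for a in art_ids):
                return f"L5: {evd_id} lacks a defined ART"
    return None


# A closed chain for one requirement; values are placeholders.
CHAIN = {
    "must": {"MUST-REG-001": {"coveredByTraceRows": ["TRACE-REG-001"]}},
    "traceRows": {"TRACE-REG-001": {"evidenceRefs": ["EVD-REG-001"]}},
    "evidence": {"EVD-REG-001": {
        "requiredCommandRefs": ["CMD-REG-TEST-001"],
        "artifactRefs": ["ART-TEST-REPORT-001"],
    }},
    "commands": {"CMD-REG-TEST-001": "hypothetical test command"},
    "artifacts": {"ART-TEST-REPORT-001": "hypothetical report path"},
}
```

A return value of `None` means the structural chain is closed. It says nothing about whether the commands pass or the artifacts are genuine, which is where exit codes, audit receipts, and human review come in.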

AI-TDD: contract-driven development

Traditional TDD and BDD are “description-driven” — describe what should happen, then expect implementation to match the description. AI-TDD is “contract-driven” — define complete contract matrices before execution, then force implementation to satisfy the contract.

Evolutionary Relationship of Three Methods:

| Era | Methodology | Core Characteristics | Abstraction Layer Elevation |
| --- | --- | --- | --- |
| 2000s | Traditional TDD | Test first, red then green | Baseline layer |
| 2010s | BDD | Natural language behavior description, business participation | Abstraction layer elevated |
| 2020s | AI-TDD | Structured contract definition, AI execution | Abstraction layer elevated again + machine-readable constraints |

Evolution Path: Traditional TDD (test-driven) → BDD (behavior-driven) → AI-TDD (contract-driven)

Core Problems Solved by AI-TDD:

  1. Solving “test coverage ≠ requirement coverage”

    • BDD: 100 Scenarios pass, may still miss critical constraints
    • AI-TDD: TRACE rows force every MUST/NEG to have coverage, gate automatically checks
  2. Solving “AI understanding deviation”

    • BDD: Natural language ambiguity amplified by AI
    • AI-TDD: Machine-readable YAML eliminates ambiguity, validation field precisely constrains
  3. Solving “negative behavior difficult to express”

    • BDD: “Should not” can only be awkwardly expressed
    • AI-TDD: NEG / MUST NOT and OUT OF SCOPE are first-class citizens with dedicated verification strategies
  4. Solving “incomplete evidence chain”

    • BDD: Test pass = complete
    • AI-TDD: Five-dimension trace matrix, each layer auditable
  5. Solving “missing state management”

    • BDD: No state concept
    • AI-TDD: TDD-RED→TDD-GREEN explicit state flow, dual gate control

Migration Recommendations:

If you already have TDD or BDD practices, you can migrate to AI-TDD along this path:

  1. Keep existing tests/Scenarios as human communication tools — but don’t treat them as contracts
  2. Add AI-TDD Manifest on top — define machine-readable contracts with YAML
  3. Extract MUST/NEG from tests/Scenarios — ask yourself: “Which MUST does this correspond to? Which NEGs are missing?”
  4. Supplement missing OUT OF SCOPEs — explicitly declare features not included in this iteration
  5. Add Trace coverage — ensure every requirement has corresponding acceptance tests and evidence

Key insight:

  • Traditional TDD test code is a verification tool
  • BDD Scenarios are a communication tool
  • AI-TDD Manifest is the contract carrier defined in this article
  • All three can coexist: Scenarios for human communication, tests for verification, Manifest for machine execution

For teams that already practice TDD, the migration can be rolled out in three phases:

Phase 1: Parallel Operation

  • Keep existing TDD tests, continue running them
  • Select 1 new feature, try writing AI-TDD Manifest
  • Compare differences between the two approaches: “test code” vs “contract matrix”

Phase 2: Problem Discovery

  • Focus on negative scenarios not covered by existing tests
  • Document “requirement drift” cases in AI-generated code
  • Identify which NEG (MUST NOT) items should be explicitly defined in Manifest

Phase 3: Gradual Switch

  • Prioritize AI-TDD for new features
  • Supplement Manifest for legacy features during refactoring
  • Establish internal Manifest templates and best practices

Common Migration Pitfalls:

  • ❌ Pitfall 1: Rewrite all tests at once → ✅ Should run in parallel, gradual switch
  • ❌ Pitfall 2: Delete existing tests → ✅ Keep TDD tests as regression tests
  • ❌ Pitfall 3: Manifest too detailed → ✅ Focus on boundary definition, not implementation details

Chapter 4: AI-TDD Gate Manifest detailed explanation

4.0 Manifest schema specification (v1.0)

Before looking at specific examples, we need to clarify Manifest’s Schema specifications:

Field Naming Conventions:

  • Use camelCase (e.g., mustNot, outOfScope, traceRows)
  • Top-level container: manifest → acceptanceCriteria
  • Core dimensions: must, mustNot (holds NEG-*), outOfScope (holds OUT-*), evidence (holds EVD-*), and traceRows (holds TRACE-*)
  • Status fields: gateStatus → implementationReadiness / closeout

Version Notes:

  • version: "1.0.0" follows SemVer specification
  • Major: Requirement scope changes (MUST/OUT OF SCOPE additions/deletions)
  • Minor: Verification condition refinement (validation modifications)
  • Patch: Text description optimization
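
The version rules above can be expressed as a small helper. This is a sketch only: the function name and the change labels `"scope"`, `"validation"`, and `"text"` are hypothetical names for the three cases listed, and pre-release or build metadata suffixes are not handled.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next Manifest version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "scope":        # MUST / OUT OF SCOPE added or removed
        return f"{major + 1}.0.0"
    if change == "validation":   # validation condition refined
        return f"{major}.{minor + 1}.0"
    if change == "text":         # wording only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change!r}")
```

Resetting the lower components on a major or minor bump follows the usual SemVer convention.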

4.1 Core structure of Manifest

Figure 2: Manifest Structure Analysis. This figure answers one question: what contract layers make up a Manifest?

File: ai-tdd-manifest.yaml

manifest:
  version: "1.0.0"
  project: "user-service"
  requirementId: "REQ-001"
  title: "User Registration"

  # Requirement contract matrix
  acceptanceCriteria:
    # MUST: Conditions that must be satisfied
    must:
      - id: "MUST-REQ-001"
        description: "Accept valid email and password"
        validation: "email format complies with RFC 5322, password length 8-32 characters"
        test: "test_valid_registration"
        priority: "P0"

      - id: "MUST-REQ-002"
        description: "Reject existing email"
        validation: "Query database, duplicate email returns 409 Conflict"
        test: "test_duplicate_email"
        priority: "P0"

      - id: "MUST-REQ-003"
        description: "Password must be encrypted for storage"
        validation: "Database stores bcrypt hash, not plaintext"
        test: "test_password_hashing"
        priority: "P0"

    # NEG / MUST NOT: Blocking negative assertions
    mustNot:
      - id: "NEG-SEC-001"
        description: "Prohibit storing plaintext password"
        validation: "Database fields do not contain plaintext password"
        test: "test_no_plaintext_storage"

      - id: "NEG-SEC-002"
        description: "Prohibit SQL injection"
        validation: "All database operations use parameterized queries"
        test: "test_sql_injection_prevention"

    # OUT OF SCOPE: Current-iteration scope boundary, not a completion blocker
    outOfScope:
      - id: "OUT-AUTH-001"
        description: "Social login (OAuth)"
        reason: "Not included in this iteration, moved to REQ-010"

      - id: "OUT-PHONE-001"
        description: "Phone verification code registration"
        reason: "Not included in this iteration, moved to REQ-011"

NEG vs OUT OF SCOPE Semantic Distinction Guide:

mustNot / NEG (Prohibited):

  • Scenario 1: Absolute prohibitions at security/compliance level (never allowed even in future iterations)
  • Scenario 2: Technical constraint-induced prohibitions (e.g., “prohibit storing passwords in plaintext”)
  • Characteristic: Has validation conditions, can be violated; violation blocks completion

outOfScope / OUT (Scope boundary):

  • Scenario: Explicit exclusion at functional level (not implemented in current iteration but possible in future)
  • Characteristic: Has reason explanation and follow-up plans (moved to REQ-XXX)
  • Note: OUT-* is a scope boundary, not a completion blocker; it should not appear in covers

Simple judgment: If this feature “might be done later” → use outOfScope; if “should never be done” → use mustNot.
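
The rule that OUT-* ids should not appear in covers lends itself to a simple lint. The sketch below assumes the Manifest has already been loaded into a Python dict and that covers lists live on trace rows, failure paths, and edge cases; the function name and the TRACE-REG-002 row are hypothetical.

```python
def out_ids_in_covers(manifest: dict) -> list[str]:
    """List every place where an OUT-* id is wrongly used as a completion blocker."""
    problems = []
    for section in ("traceRows", "failurePaths", "edgeCases"):
        for row in manifest.get(section, []):
            for covered in row.get("covers", []):
                if covered.startswith("OUT-"):
                    problems.append(f"{row['id']} covers {covered}")
    return problems


manifest = {
    "traceRows": [
        {"id": "TRACE-REG-001", "covers": ["MUST-REQ-001", "NEG-SEC-001"]},
        {"id": "TRACE-REG-002", "covers": ["OUT-AUTH-001"]},  # hypothetical bad row
    ]
}
print(out_ids_in_covers(manifest))
```

An empty result means no scope-boundary item has leaked into the blocking part of the contract.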

The same Manifest must also declare evidence, the trace matrix, and gate status explicitly:

manifest:
  acceptanceCriteria:
    # EVD: Assertions requiring evidence
    evidence:
      - id: "EVD-TEST-001"
        type: "test_coverage"
        description: "Code coverage >= 80%"
        threshold: "80%"
        artifact: "coverage_report.html"

      - id: "EVD-SEC-001"
        type: "security_scan"
        description: "No high-severity security vulnerabilities"
        tool: "bandit"
        artifact: "security_scan_report.json"

      - id: "EVD-PERF-001"
        type: "performance_test"
        description: "Registration API response time < 200ms (P99)"
        threshold: "200ms"
        artifact: "perf_test_report.html"

  # TRACE rows: executable slices across requirements, tasks, evidence, commands, acceptance, and artifacts
  traceRows:
    - id: "TRACE-REG-001"
      covers: ["MUST-REQ-001", "NEG-SEC-001"]
      taskRefs: ["TASK-REG-001"]
      evidenceRefs: ["EVD-TEST-001", "EVD-SEC-001"]
      contractValidationCommandRefs: ["CMD-REG-TEST-001"]
      acceptanceRefs: ["ACC-REG-001", "E2E-REG-001"]
      artifactRefs: ["ART-REG-REPORT-001"]
      status: "pending"

  # Failure paths: required failure scenarios attached to NEG items
  failurePaths:
    - id: "FAIL-SEC-001"
      covers: ["NEG-SEC-001"]
      expectedFailure: "Plaintext password write attempt is blocked by tests"
      evidenceRefs: ["EVD-SEC-001"]

  # Edge cases: implementation boundaries that do not expand business scope
  edgeCases:
    - id: "EDGE-REG-001"
      covers: ["MUST-REQ-001"]
      case: "Email uniqueness still holds after case normalization"
      evidenceRefs: ["EVD-TEST-001"]

  tasks:
    - id: "TASK-REG-001"
      description: "Implement the email-registration path and negative security constraints"

  # End-to-end test suites
  e2eTestSuites:
    - name: "user_registration_flow"
      path: "tests/e2e/user_registration.test.js"
      scenarios: ["happy_path", "duplicate_email", "invalid_input"]

  # Acceptance test suites
  accTestSuites:
    - name: "user_registration_acc"
      path: "tests/acc/user_registration.test.py"
      criteria: "all_pass"

  # Gate status
  gateStatus:
    implementationReadiness:
      status: "TDD-RED"
      lastRun: "2026-05-27T10:00:00Z"
      summary:
        total: 15
        passed: 0
        failed: 15
        pending: 0

    closeout:
      status: "TDD-GREEN"
      closeoutAttemptId: "closeout-20260527-160000"
      lastRun: "2026-05-27T16:00:00Z"
      summary:
        total: 15
        passed: 15
        failed: 0
        pending: 0

  deliveryEvidence:
    closeoutAttemptId: "closeout-20260527-160000"
    manifestHash: "sha256:..."
    sourceSnapshotHash: "sha256:..."
    commandRunRefs: ["RUN-REG-TEST-001", "RUN-SEC-SCAN-001"]
    artifactRefs:
      - id: "ART-REG-REPORT-001"
        sha256: "sha256:..."

  gateVerdict:
    status: "pass"
    evaluatedAt: "2026-05-27T16:02:00Z"
    receiptArtifactRef: "ART-CLOSEOUT-REPORT-001"

  humanDecision:
    decision: "accept"
    actor: "tech-lead"
    decidedAt: "2026-05-27T16:05:00Z"
    receiptArtifactRef: "ART-HUMAN-DECISION-001"

Five-Dimension Trace Matrix Figure 3: This figure answers one question: how do requirement, execution, evidence, command, and artifact form a trace chain?
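
One way to check the requirement side of the trace matrix is to compare the declared MUST/NEG ids against the covers lists of the trace rows. This is a sketch under the assumption that the YAML has been loaded into a dict with the same keys as the example; the function name is made up for illustration.

```python
def uncovered_items(manifest: dict) -> list[str]:
    """Return MUST-* and NEG-* ids that no TRACE row covers."""
    criteria = manifest["acceptanceCriteria"]
    required = [
        item["id"]
        for key in ("must", "mustNot")
        for item in criteria.get(key, [])
    ]
    covered = {
        covered_id
        for row in manifest.get("traceRows", [])
        for covered_id in row.get("covers", [])
    }
    return [item_id for item_id in required if item_id not in covered]


# Ids taken from the example Manifest, which declares a single trace row.
example = {
    "acceptanceCriteria": {
        "must": [{"id": "MUST-REQ-001"}, {"id": "MUST-REQ-002"}, {"id": "MUST-REQ-003"}],
        "mustNot": [{"id": "NEG-SEC-001"}, {"id": "NEG-SEC-002"}],
    },
    "traceRows": [{"id": "TRACE-REG-001", "covers": ["MUST-REQ-001", "NEG-SEC-001"]}],
}
print(uncovered_items(example))
```

With only TRACE-REG-001 declared, the sketch reports MUST-REQ-002, MUST-REQ-003, and NEG-SEC-002 as still needing trace rows, which is the kind of gap a Gate would be expected to surface.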

4.2 Manifest generation process

A Manifest is not written in one pass; it is built up through iterative atomic decomposition.

Iteration 1: Requirement Draft

  • Human provides: “Implement user registration feature”
  • AI generates: Manifest draft (containing basic MUST items)
  • Human reviews: Confirm/modify/supplement

Iteration 2: Boundary Clarification

  • Human supplements: “Need email verification, password strength check”
  • AI updates: Adds MUST items, generates OUT OF SCOPE list
  • Human reviews: Confirms exclusions

Iteration 3: Evidence Definition

  • Human supplements: “Need 80% code coverage”
  • AI updates: Adds EVD items
  • Human reviews: Confirms thresholds

Iteration 4: Trace Establishment

  • AI generates: TRACE rows draft
  • Human reviews: Confirms requirement↔test↔evidence mapping relationships

Iteration 5: Final Confirmation

  • Human reads: Requirement confirmation page HTML
  • Human decision: Confirm Manifest complete, proceed to next stage

Key Principles:

  • Manifest must reach “complete” state before execution begins
  • Definition of “complete”: MUST/NEG/OUT/EVD/TRACE rows all defined, no omissions
  • Incomplete Manifest not allowed to pass Implementation Readiness Gate
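
The structural half of "complete" can be checked mechanically. The sketch below only verifies that each of the five sections is present and non-empty; whether anything was omitted still needs human review. The key paths follow the example Manifest, while the names SECTION_PATHS and missing_sections are assumptions.

```python
# Where each contract dimension lives in the example Manifest layout.
SECTION_PATHS = {
    "MUST": ("acceptanceCriteria", "must"),
    "NEG": ("acceptanceCriteria", "mustNot"),
    "OUT": ("acceptanceCriteria", "outOfScope"),
    "EVD": ("acceptanceCriteria", "evidence"),
    "TRACE": ("traceRows",),
}


def missing_sections(manifest: dict) -> list[str]:
    """Return the labels of sections that are absent or empty."""
    missing = []
    for label, path in SECTION_PATHS.items():
        node = manifest
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if not node:
            missing.append(label)
    return missing


draft = {"acceptanceCriteria": {"must": [{"id": "MUST-REQ-001"}]}}
print(missing_sections(draft))  # a non-empty list means the Gate should stay closed
```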

4.3 Machine readability of Manifest

Manifest must be machine-readable, meaning:

1. Structured Format

  • Use YAML or JSON
  • Have strict Schema definitions
  • Can be parsed and validated by programs

2. Clear Field Semantics

  • Each field has clear types and constraints
  • Support automated validation (e.g., JSON Schema validation)

3. Executable References

  • MUST items associated with specific test file paths
  • EVD items associated with specific verification tools
  • TRACE rows can automatically check coverage

4. Queryable State

  • Can programmatically query “which MUST items not yet verified”
  • Can programmatically query “TRACE rows coverage”
  • Can programmatically determine “whether entry allowed” or “whether delivery allowed”
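
As a small illustration of "queryable state", the sketch below parses a JSON form of the Manifest with the standard library and asks which MUST items are not yet verified. It assumes that a MUST item counts as verified once a trace row covering it has status "verified"; that status value and the TRACE-REG-002 row are hypothetical.

```python
import json

MANIFEST_JSON = """
{
  "acceptanceCriteria": {
    "must": [
      {"id": "MUST-REQ-001", "test": "test_valid_registration"},
      {"id": "MUST-REQ-002", "test": "test_duplicate_email"}
    ]
  },
  "traceRows": [
    {"id": "TRACE-REG-001", "covers": ["MUST-REQ-001"], "status": "verified"},
    {"id": "TRACE-REG-002", "covers": ["MUST-REQ-002"], "status": "pending"}
  ]
}
"""


def unverified_must_ids(manifest: dict) -> list[str]:
    """Return MUST ids that no verified trace row covers yet."""
    verified = set()
    for row in manifest.get("traceRows", []):
        if row.get("status") == "verified":
            verified.update(row.get("covers", []))
    return [m["id"] for m in manifest["acceptanceCriteria"]["must"] if m["id"] not in verified]


manifest = json.loads(MANIFEST_JSON)
print(unverified_must_ids(manifest))  # prints ['MUST-REQ-002']
```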

Chapter 5: Six mental models of AI-TDD

AI-TDD Six Mental Models and Gate Flow Figure 4: AI-TDD six mental models and gate flow, showing the two gates and their expected statuses: Entry Gate maps to AI-TDD-RED, and Delivery Gate maps to AI-TDD-GREEN.

The six mental models form AI-TDD’s cognitive abstraction layer. Together, they describe the six recurring thought patterns in human-AI collaboration.

5.1 Requirement confirmation

This is the first implementation link of the “problem boundaries precede solutions” principle. The sole goal of requirement confirmation is to clearly define the business boundaries of the problem.

Cognitive core: transformation from fuzzy intent to machine-readable contract

Core idea: use iterative atomic decomposition during requirement confirmation to produce a complete AI-TDD Gate Manifest, which becomes the requirement contract matrix.

Key insights:

  • AI can assist in generating Manifest, but humans have final authority over requirement definition
  • Manifest should be sufficiently complete before execution begins; otherwise, omissions and drift become more likely
  • Manifest is not natural language documentation but machine-readable structured contracts

Key actions:

  1. Atomic Requirement Decomposition

    Decompose “implement user registration feature” into:

    • MUST-REG-INPUT-001: Accept email and password inputs
    • MUST-REG-EMAIL-FORMAT-001: Verify email format is valid
    • MUST-REG-PASSWORD-LENGTH-001: Verify password length ≥ 8
    • MUST-REG-UNIQUE-001: Reject existing email
    • MUST-REG-PASSWORD-HASH-001: Password must be encrypted for storage
    • NEG-SEC-PLAIN-PASSWORD-001: Prohibit storing plaintext password
    • NEG-SEC-SQL-INJECTION-001: Prohibit SQL injection
    • OUT-AUTH-SOCIAL-001: Social login (moved to REQ-010)
    • OUT-PHONE-VERIFY-001: Phone verification (moved to REQ-011)
  2. Define EVD (Verifiable Evidence)

    • EVD-TEST-COV-001: Code coverage >= 80%
    • EVD-SEC-SCAN-001: No high-severity security vulnerabilities
    • EVD-PERF-REG-001: Registration API response time < 200ms (P99)
  3. Establish TRACE rows (Traceability Matrix)

    Build complete requirement↔test↔evidence traceability so every MUST item has both test coverage and evidence support.

  4. Generate Requirement Confirmation Page (HTML)

    Generate human-readable requirement confirmation pages containing:

    • Business goals and background
    • Complete Manifest preview
    • MUST/NEG/OUT checklists
    • EVD requirements
    • TRACE rows coverage
  5. Human Confirmation and Decision

    ⚠️ Human-in-the-loop critical link:

    • Humans read requirement confirmation page HTML
    • Check Manifest completeness
    • Make decision: “Manifest is complete, can proceed to next stage”

Exit criteria:

  • Manifest completely generated (MUST/NEG/OUT/EVD/TRACE rows all defined)
  • Requirement confirmation page HTML generated
  • Human decision passed

5.2 Architecture confirmation

On the basis of clear business boundaries, further define technical boundaries. Clarify technical solutions and interface contracts, demarcating system boundaries with the external world.

Cognitive core: mapping from problem space to solution space

Core idea: within Manifest constraints, define technical solutions and interface contracts, then generate an architecture confirmation page for human decision-making.

Key insights:

  • Manifest defines “what to do,” architecture confirmation defines “how to do it”
  • AI can provide multiple architecture options, but humans have architectural decision authority

Key actions:

  1. Technical Solution Evaluation

    • AI provides 2-3 alternative architecture options
    • Compare pros/cons, risks, resource requirements
  2. Define Interface Contracts

    • API interface definitions (OpenAPI/Swagger)
    • Data model definitions (JSON Schema)
    • Error code conventions
  3. Generate Architecture Confirmation Page (HTML)

    Generate human-readable architecture confirmation pages containing:

    • Recommended architecture solution and rationale
    • Alternative solution comparison
    • Risk assessment
    • Interface contract preview
  4. Human Confirmation and Decision

    ⚠️ Human-in-the-loop critical link: technical decision-makers read the architecture confirmation page HTML and decide which solution to adopt.

Exit criteria: technical solution confirmed + architecture confirmation page HTML generated + human decision passed + interface contracts defined


5.3 Implementation readiness

Verify whether boundaries can be covered by tests. Based on defined boundaries (Manifest), generate acceptance test baselines that sufficiently cover the current scope, confirming boundaries are verifiable.

Cognitive core: state transition from contract definition to execution readiness

Core idea: generate acceptance-test baselines from the completed Manifest, then run the AI-TDD Gate to establish the TDD-RED entry state.

Key insights:

  • Tests are not “written while generating” but generated from the confirmed Manifest scope
  • TDD-RED state for newly generated or linked acceptance tests is a necessary condition for entry

Key actions:

  1. Generate Test Code Based on Manifest

    Tests are not written by hand; the AI-TDD Gate parses the Manifest and generates:

    • Unit tests (covering MUST items)
    • Integration tests (covering interface contracts)
    • E2E tests (covering end-to-end scenarios)
    • Acceptance tests (covering ACC items)
  2. Register Tests to AI-TDD Gate

    All tests must be registered in AI-TDD Gate, establishing associations with Manifest’s TRACE rows.

  3. Run AI-TDD Gate (Entry Run)

    First run of AI-TDD Gate, at this point:

    • Test code has been generated
    • Implementation code not yet generated
    • New or associated acceptance tests should be in expected failure state (TDD-RED)
  4. Confirm TDD-RED State

    Checkpoints:

    • Manifest completeness check ✓
    • Test registration completeness check ✓
    • TDD-RED state confirmation

Exit Criteria: AI-TDD Gate status is TDD-RED for the current acceptance scope

Concept Clarification: TDD-RED here refers to “first run of Implementation Readiness Gate, verifying test baselines completely generated but implementation not yet started.” This is the prerequisite for entering the next stage (Execution Closure), not the final delivery standard.


Sanitized Example: Stale Attempt Detection Mechanism

The following snippet comes from a sanitized closeout runner requirement contract, illustrating how to prevent reuse of historical success records:

- id: FAIL-RUNNER-STALE-ATTEMPT-001
  title: stale_attempt_reused
  trigger: >-
    deliveryEvidence.requiredCommands[].lastRunRef.closeoutAttemptId
    inconsistent with current attempt
  expectedBehavior: >-
    AI TDD closeout must report stale_attempt or
    current_attempt_command_missing
  forbiddenBehavior: Historical success satisfies current closeout

This failure path defines a key safeguard: Each closeout attempt has a unique ID; historical success cannot prove current iteration. This prevents the classic problem of “passing off old test reports” — even if previous tests all passed, as long as the current attempt’s evidence chain is incomplete, re-verification is required.
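
A stale-attempt check along these lines could look like the sketch below. The field path requiredCommands[].lastRunRef.closeoutAttemptId and the two finding names come from the snippet above; the function name and the per-command "id" key are assumptions.

```python
def stale_attempt_findings(delivery_evidence: dict, current_attempt_id: str) -> list[str]:
    """Flag required commands whose last run does not belong to the current attempt."""
    findings = []
    for command in delivery_evidence.get("requiredCommands", []):
        last_run = command.get("lastRunRef")
        if last_run is None:
            findings.append(f"current_attempt_command_missing:{command['id']}")
        elif last_run.get("closeoutAttemptId") != current_attempt_id:
            findings.append(f"stale_attempt:{command['id']}")
    return findings


evidence = {
    "requiredCommands": [
        {"id": "CMD-REG-TEST-001",
         "lastRunRef": {"closeoutAttemptId": "closeout-20260526-090000"}},  # older attempt
        {"id": "CMD-SEC-SCAN-001"},  # never run in any attempt
    ]
}
print(stale_attempt_findings(evidence, "closeout-20260527-160000"))
```

Any non-empty result means a historical success is being offered as proof for the current closeout, which the failure path forbids.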


5.4 Execution closure (Bounded Packet Closure)

AI generates the implementation under boundary constraints and is strictly prohibited from crossing them. Any implementation that deviates from the boundary definitions is considered a failure.

Cognitive core: bounded convergence under Manifest constraints

Core idea: AI generates implementation within Manifest constraints and iterates until registered validations pass, producing a closeout candidate without drifting outside the agreed boundary.

Key insights:

  • AI generation must be constrained by Manifest, cannot improvise
  • Iteration goal is “make all acceptance items defined in Manifest pass”

Key actions:

  1. Use Manifest as Prompt Context

    Prompt given to AI contains:

    • Full Manifest text
    • Current list of failing tests
    • Failure reasons
  2. AI Generates Implementation

    AI uses the Manifest as the global requirement context and generates implementation code within that frame.

  3. Run AI-TDD Gate (Iteration Run)

    Run all tests to get current status (TDD-RED or partial GREEN).

  4. Iterative Correction

    If tests fail:

    • Feed failure information back to AI
    • AI regenerates corrected code
    • Run AI-TDD Gate again
    • Until registered validations pass and a closeout candidate is produced

Exit criteria: the candidate implementation satisfies Manifest-linked tests and enters closeout candidate status. Final AI-TDD-GREEN still requires evidence-chain closure, Gate Verdict, and Human Decision during delivery closeout.
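
The prompt-and-iterate loop described in the key actions can be sketched as follows. This is an outline only: generate stands in for whatever AI coding tool a team uses, run_gate stands in for the team's own Gate script, and both names, like the prompt layout, are assumptions rather than a real API.

```python
from typing import Callable, Optional


def build_prompt(manifest_text: str, failures: dict[str, str]) -> str:
    """Combine the full Manifest with the failing tests and their reasons."""
    lines = ["## Manifest (requirement contract)", manifest_text, "## Currently failing tests"]
    lines += [f"- {test}: {reason}" for test, reason in failures.items()]
    return "\n".join(lines)


def iterate_until_candidate(
    manifest_text: str,
    initial_failures: dict[str, str],
    generate: Callable[[str], str],             # hypothetical AI call: prompt -> code
    run_gate: Callable[[str], dict[str, str]],  # hypothetical Gate run: code -> failures
    max_rounds: int = 5,
) -> tuple[Optional[str], int]:
    """Regenerate until the Gate reports no failures or the round budget is spent."""
    failures = dict(initial_failures)
    for round_no in range(1, max_rounds + 1):
        candidate = generate(build_prompt(manifest_text, failures))
        failures = run_gate(candidate)
        if not failures:
            return candidate, round_no  # closeout candidate, not yet AI-TDD-GREEN
    return None, max_rounds
```

The round budget is a design choice added here so that a non-converging loop returns control to a human instead of running indefinitely.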


Sanitized Example: Fail-Fast Execution Strategy and Evidence Atomicity

The following snippet comes from the same class of sanitized requirement contracts, showing how AI-TDD handles command execution failures:

- id: MUST-RUNNER-FAILFAST-001
  text: >-
    Dynamic runner must record successful required commands as current attempt's
    artifact-bound deliveryEvidence.requiredCommands[]; when any required command
    first fails, must immediately stop remaining required commands, write failed
    evidence packet before returning, and return non-zero.
  evidenceRefs:
    - EVD-RUNNER-SUCCESS-001
    - EVD-RUNNER-FAILURE-001
  coveredByTraceRows:
    - TRACE-RUNNER-FAILFAST-001

Note the key design:

  • coveredByTraceRows: Every MUST must be covered by at least one trace
  • evidenceRefs: Every trace must produce verifiable evidence
  • Fail-Fast: Stop immediately on first command failure, preventing “partial success” false states
  • Artifact-Bound: Evidence must be bound to artifacts (file hashes), not just exit codes

This contrasts sharply with the traditional “write code first, then add tests” reverse flow — in AI-TDD, requirements and evidence are defined before implementation; implementation must converge to predefined verification standards.
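
A fail-fast runner in the spirit of MUST-RUNNER-FAILFAST-001 might be sketched like this. The packet fields are assumptions, and hashing captured stdout is only a stand-in for binding evidence to a real artifact file such as a test report.

```python
import hashlib
import subprocess
import sys


def run_required_commands(commands, attempt_id):
    """Run commands in order; stop at the first failure and return (exit_code, packets)."""
    packets = []
    for command_id, argv in commands:
        result = subprocess.run(argv, capture_output=True)
        packets.append({
            "commandId": command_id,
            "closeoutAttemptId": attempt_id,
            "exitCode": result.returncode,
            # Stand-in for an artifact hash: digest of the captured output.
            "outputSha256": "sha256:" + hashlib.sha256(result.stdout).hexdigest(),
            "status": "passed" if result.returncode == 0 else "failed",
        })
        if result.returncode != 0:
            return 1, packets  # failed packet is recorded before returning non-zero
    return 0, packets


commands = [
    ("CMD-A", [sys.executable, "-c", "print('ok')"]),
    ("CMD-B", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("CMD-C", [sys.executable, "-c", "print('never runs')"]),
]
exit_code, packets = run_required_commands(commands, "closeout-20260527-160000")
print(exit_code, [p["status"] for p in packets])
```

In this demo CMD-C never runs because CMD-B fails first, so no "partial success" record can be produced for the attempt.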


5.5 Audit review

Verify whether the implementation strictly follows the boundary definitions. A critical multi-agent audit keeps the execution agent from settling into quick self-consistency through self-checking, and confirms that boundaries did not drift during execution.

Cognitive core: critical multi-agent audit that counters the execution agent’s tendency toward rapid self-consistency when it checks its own work

Core idea: introduce independent audit agents to review execution results critically and reduce the blind spots of a single execution agent.

Key insights:

  • Execution Agents easily fall into “rapid self-consistency” during iteration
  • Audit Agents are independent from execution Agents, providing critical perspectives

Key actions:

  1. Multi-Agent Cross Audit

    • Audit Agents independently review execution results
    • Check whether NEG (MUST NOT) items defined in Manifest are truly not violated
    • Check architecture consistency
  2. Findings Recording and RCA

    • Record audit-discovered issues (Findings)
    • Conduct root cause analysis (RCA)
  3. Scoring and Improvement Recommendations

    • Score execution quality
    • Generate improvement recommendations

Exit Criteria: Audit Agents have no blocking Findings, or all Findings have been corrected and passed re-verification


Production Case: Negative Verification and “What Must Fail”

AI-TDD not only defines “what should be done” (MUST) but more importantly defines “what must fail under what circumstances” (NEG + FAILURE PATH). The following snippet demonstrates this core idea:

# Negative assertion: Define what is insufficient to constitute completion proof
- id: NEG-EVD-EXITCODE-001
  text: Exit code only proof shall not satisfy deliveryEvidence.requiredCommands[] or closeout
  whyItBlocksCompletion: >-
    Successful process missing artifact-bound proof is not current attempt delivery evidence

# Corresponding failure path: Define specific trigger conditions and expected behaviors
- id: FAIL-RUNNER-CMD-001
  title: required_command_failed
  trigger: Some required command returns non-zero
  expectedBehavior: >-
    Runner must immediately stop remaining required-command execution, write failed evidence packet
    before returning that does not go through successful implementation evidence ingest path for failed command
  forbiddenBehavior: >-
    Runner continues executing remaining required commands, omits failed evidence packet,
    submits failed packet through implementation_evidence_ingested

The profound aspect of this design: It not only verifies “success” but strictly defines “failure.”

In traditional development, we often say “tests passed.” But in AI-TDD, we must ask: “Under what circumstances do tests not count even if they pass?”

  • Only exit code? Doesn’t count
  • No artifact binding? Doesn’t count
  • Historical run results? Doesn’t count
  • Current attempt mismatch? Doesn’t count

The essence of audit review is verifying whether the boundary of “what must fail” is strictly enforced.
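
The "doesn't count" list above can be turned into an explicit rejection check. The evidence record shape (exitCode, artifactSha256, closeoutAttemptId) and the reason labels are assumptions chosen for this sketch.

```python
def rejection_reasons(evidence: dict, current_attempt_id: str) -> list[str]:
    """Explain why a passing-looking evidence record still does not count."""
    reasons = []
    if evidence.get("exitCode") != 0:
        reasons.append("command_failed")
    if not evidence.get("artifactSha256"):
        reasons.append("exit_code_only")  # no artifact-bound proof
    if evidence.get("closeoutAttemptId") != current_attempt_id:
        reasons.append("stale_or_mismatched_attempt")
    return reasons


old_green_run = {"exitCode": 0, "closeoutAttemptId": "closeout-20260526-090000"}
print(rejection_reasons(old_green_run, "closeout-20260527-160000"))
```

An audit agent would accept a record only when this list is empty.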


5.6 Delivery closeout

Final verification of whether boundaries have been completely implemented. Full regression testing confirms all requirements within boundaries have been achieved, no drift.

Cognitive core: full regression verification and formal closure

Core idea: Re-run AI-TDD Gate full tests to confirm no drift occurred during execution, generate delivery confirmation page, and formally complete delivery closeout.

Key insights:

  • Delivery confirmation is not simple “package output” but full regression verification
  • Must re-run all acceptance tests registered in Manifest

Key actions:

  1. Re-run AI-TDD Gate full tests

    ⚠️ Key distinction:

    • Implementation Readiness Gate: First run, expected TDD-RED
    • Delivery Closeout Gate: Re-run, must be TDD-GREEN

    Ensure:

    • Tests passed during execution closure phase still pass (no regression)
    • All acceptance items in Manifest have status “VERIFIED”
  2. Generate delivery closeout page (HTML)

    Generate human-readable delivery confirmation pages containing:

    • Complete Manifest status (all VERIFIED)
    • Test result summary (all green)
    • Code quality metrics
    • Audit records and Findings
  3. Human final review

    Humans read delivery confirmation page HTML for final gatekeeping.

  4. Formal delivery closeout

    After the Human Decision is recorded as accept, mark the task formally complete and archive all project assets.

Exit Criteria: current-attempt evidence chain closed + Gate Verdict is pass + Human Decision is recorded as accept + delivery confirmation page generated


Production Case: Gate Verdict, Oracle, Artifact, and Hash Chain Verification

The following snippet shows AI-TDD delivery confirmation’s core verification structure:

# Evidence definition: What constitutes "deliverable"
- id: EVD-CLOSEOUT-GATE-001
  text: Only updated records passing the team's custom closeout gate are allowed closeout pass
  gate: vitest
  oracle: Runner returns 0 only when closeout report decision=pass and closeoutReadinessReport.ready=true
  requiredCommandRefs:
    - CMD-TEST-CLOSEOUT-GATES
  artifactRefs:
    - ART-AI-TDD-CLOSEOUT-REPORT

Coordinated with the triple hash verification mechanism at the top of the document:

# Source document hash: Ensures requirements not tampered
sourceDocumentHash: sha256:51f17f2172e951599153bbb744877386cd582f33e393bc597c6ea926e143a878

# Implementation confirmation hash: Ensures scope confirmed
implementationConfirmationHash: sha256:c1ca24b3b7bc63cfc5d8eee55b72e510cdbcb03c8cb4ceee9bc3806b15c537d3

# Confirmation page hash: Ensures rendering result consistent
confirmationPageHash: sha256:f00fbd4f6610bdf5ef0f87dbd9cfe0f1b69ecb2777e70d1a57021e2f1f75dd0f

Hash Calculation Notes (avoiding circular dependencies):

sourceDocumentHash calculation method:

  • SHA256 on ai-tdd-manifest.yaml file content
  • Note: Set this field value to empty string or placeholder when calculating to avoid circular dependency
  • Storage location: Recommended in separate .manifest.lock file or Git tag

implementationConfirmationHash calculation method:

  • SHA256 of the HTML confirmation page generated during the requirement confirmation phase
  • Purpose: Verify confirmation page not tampered

confirmationPageHash calculation method:

  • SHA256 on delivery confirmation page
  • Purpose: Verify final deliverable integrity

The essence of this design is the verdict chain (Evidence → Gate → Oracle → Artifact):

  • Gate: the team’s custom closeout gate is the final arbiter. The gate name in this example is project-specific pseudocode, not a public CLI command guaranteed to exist
  • Oracle: decision=pass && ready=true is the clear pass criterion
  • Artifact: ART-AI-TDD-CLOSEOUT-REPORT is an auditable evidence file
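
The oracle rule can be written down as a tiny function. The report layout below (a decision field plus a nested closeoutReadinessReport.ready flag) is inferred from the oracle text in the snippet and is an assumption about the report's actual structure.

```python
def oracle_exit_code(closeout_report: dict) -> int:
    """Return 0 only when decision is 'pass' and the readiness report says ready."""
    decision_ok = closeout_report.get("decision") == "pass"
    ready = closeout_report.get("closeoutReadinessReport", {}).get("ready") is True
    return 0 if decision_ok and ready else 1


report = {"decision": "pass", "closeoutReadinessReport": {"ready": False}}
print(oracle_exit_code(report))  # prints 1: a pass decision alone is not enough
```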

Significance of hash chain:

  • If source document changes → sourceDocumentHash changes → must re-confirm
  • If implementation scope changes → implementationConfirmationHash changes → must re-verify
  • If rendering result tampered → confirmationPageHash changes → must regenerate

The essence of delivery confirmation is verifying the integrity and consistency of the entire evidence chain.
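
For the case where sourceDocumentHash is stored inside the Manifest itself, the placeholder approach from the calculation notes can be sketched as below. The regular expression assumes the field sits on a single line; the function names are hypothetical.

```python
import hashlib
import re

# Matches the whole "sourceDocumentHash: ..." line without crossing line boundaries.
HASH_FIELD = re.compile(r"^([ \t]*sourceDocumentHash:[ \t]*).*$", re.MULTILINE)


def source_document_hash(manifest_text: str) -> str:
    """Hash the Manifest text with its own hash field blanked to an empty string."""
    blanked = HASH_FIELD.sub(lambda m: m.group(1) + '""', manifest_text)
    return "sha256:" + hashlib.sha256(blanked.encode("utf-8")).hexdigest()


def needs_reconfirmation(manifest_text: str, recorded_hash: str) -> bool:
    """True when the Manifest content no longer matches the recorded hash."""
    return source_document_hash(manifest_text) != recorded_hash
```

Because the field is blanked before hashing, writing the computed value back into the file does not change the hash, which removes the circular dependency. Storing the hash in a separate lock file, as recommended above, avoids the blanking step entirely.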


Chapter 6: Two key gates

How this chapter maps to Chapter 5: of the six cognitive stages introduced in Chapter 5, the end of stage 3 (Implementation Readiness) maps to Section 6.1, “Implementation Readiness Gate,” and the end of stage 6 (Delivery closeout) maps to Section 6.2, “Delivery Closeout Gate.” These two gates are the key control points that keep state transitions governed.

6.1 Gate 1: Implementation Readiness Gate

TDD State Machine Figure 5: AI-TDD state machine, from MANIFEST_DRAFT → AI-TDD-RED → IMPLEMENTING → CLOSEOUT_CANDIDATE → AI-TDD-GREEN → CLOSED.

Position: After “Implementation Readiness,” before “Execution Closure”

Core Function: Ensure “Manifest is complete, test baselines established” before allowing AI to begin generating implementation

Entry Criteria:

  • Requirement confirmation completed (Manifest completely generated)
  • Architecture confirmation completed (technical solution and interface contracts defined)
  • Implementation readiness completed (test code based on Manifest generated)

Exit Criteria: New or associated acceptance tests should enter expected TDD-RED state

Precise Definition of TDD-RED:

Necessary Conditions:

  1. Test code based on Manifest has been registered to AI-TDD Gate
  2. Implementation code not yet generated or empty (no pre-implementation exists)
  3. New or associated acceptance tests return expected failures (not passes)

Exclusion Conditions (following cases do not count as TDD-RED):

  • Test failure due to test code errors → Should be considered BLOCKED state, needs test fix
  • Test failure due to environment configuration issues → Should be considered ERROR state, needs environment fix
  • Test failure due to unavailable external dependencies → Should be considered DEFERRED state, needs dependency wait

Key Principle: TDD-RED is “expected failure,” not “erroneous failure.”
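
The distinction between expected and erroneous failure can be made explicit in code. The state names come from the conditions above; the cause labels and function names are assumptions, and how a team detects each cause is left to its own Gate scripts.

```python
STATE_BY_CAUSE = {
    "implementation_missing": "TDD-RED",   # expected failure
    "test_code_error": "BLOCKED",          # fix the test first
    "environment_error": "ERROR",          # fix the environment first
    "dependency_unavailable": "DEFERRED",  # wait for the dependency
}


def classify_failure(cause: str) -> str:
    """Map one test failure cause to the state it should produce."""
    return STATE_BY_CAUSE[cause]


def entry_is_tdd_red(causes: list[str]) -> bool:
    """True only when there are failures and every one is an expected failure."""
    return bool(causes) and all(classify_failure(c) == "TDD-RED" for c in causes)


print(entry_is_tdd_red(["implementation_missing", "environment_error"]))  # prints False
```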

Core Checkpoints:

  1. Manifest Completeness Check

    • All MUST items defined and associated with tests
    • All NEG (MUST NOT) items defined
    • All OUT OF SCOPE items explicitly excluded
    • All EVD items defined with thresholds
    • TRACE rows established for the acceptance items declared in the current Manifest
  2. AI-TDD Gate Registration Check

    • All acceptance items registered to AI-TDD Gate
    • E2E TEST SUITES registered
    • ACC TEST SUITES registered
  3. TDD-RED State Confirmation

    • Run AI-TDD Gate full tests
    • New or associated acceptance tests should be in expected failure state (TDD-RED)
    • If new acceptance tests already pass, need to confirm whether pre-implementation exists, tests are invalid, or acceptance items are duplicated

Gate Status:

  • 🔴 TDD-RED: Expected state, allowed to enter execution closure
  • 🔴 BLOCKED: Manifest incomplete, entry prohibited

Failure Handling:

  • Manifest incomplete → Return to requirement confirmation for re-iteration
  • Pre-implementation exists → Clean pre-implementation code

6.2 Gate 2: Delivery Closeout Gate

Position: After “Audit Review,” before formal delivery

Core Function: Ensure all Manifest contract slices are proven by the current attempt’s evidence chain, with no drift, no scope creep, and no stale evidence reuse

Entry Criteria:

  • Execution closure completed (candidate implementation satisfies Manifest-linked tests)
  • Audit review completed (no blocking Findings)
  • Delivery preparation completed (confirmation page, command output, evidence artifacts, and hash snapshots prepared)
  • This closeout evaluation has a unique closeoutAttemptId
  • Manifest/source hash, implementation snapshot hash, required command run refs, and artifact refs are recorded in the candidate delivery record

Exit Criteria: All Manifest-linked TRACE rows reach evidence-chain closure, scope audits cover every OUT, Gate Verdict is pass, and Human Decision is accept before the Gate can return AI-TDD-GREEN

Core Checkpoints:

  1. Manifest Acceptance Status Check

    • All MUST items status is “VERIFIED”
    • All NEG (MUST NOT) items verified as not violated
    • All OUT OF SCOPE items pass scope audit, confirming implementation did not overstep
    • All TRACE rows cover the current Manifest’s MUST/NEG items, and scope audit covers OUT items
  2. Multi-dimensional evidence-chain check

    • Every TRACE has corresponding EVD
    • Every EVD has CMD runs from the current attempt
    • Every CMD has auditable ART, such as test reports, security scan reports, scope diffs, or closeout reports
    • Critical artifact hashes match the delivery confirmation page
  3. Registered validation rerun check, necessary but insufficient

    • Re-run all acceptance tests registered in Manifest
    • Manifest-associated acceptance tests must pass as one input to the evidence chain
    • If any test fails, delivery prohibited
    • Test pass alone cannot produce AI-TDD-GREEN; evidence, artifacts, hashes, anti-stale checks, and Human Decision must also pass
  4. Anti-forgery checks

    • Old runs, old reports, and old screenshots cannot prove the current attempt
    • Tests that are not bound to Manifest TRACE rows do not count as acceptance evidence
    • Exit code alone is not proof of completion without ART, hash, or receipt
    • Human confirmation must leave a decision, timestamp, confirmation page, or equivalent receipt
  5. Delivery closeout page completeness

    • Delivery confirmation page HTML generated
    • Contains complete TRACE rows
    • Contains audit records
  6. Human final review

    • Humans read delivery confirmation page HTML
    • Make decision: “Can deliver” or “Return for correction”
    • The decision is written to the confirmation page, audit receipt, or equivalent delivery record

Gate Status:

  • 🟢 TDD-GREEN + Human pass: Allowed formal delivery closeout
  • 🔴 TDD-RED: Tests not passed, delivery prohibited
  • 🔴 BLOCKED: Human review not passed
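
The "necessary but insufficient" role of test results can be sketched as a status function. The input fields are assumptions, and so is the choice to report BLOCKED when tests pass but evidence, Gate Verdict, or Human Decision is missing; the status list above only names BLOCKED for a failed human review.

```python
from dataclasses import dataclass


@dataclass
class CloseoutInputs:
    tests_passed: bool
    evidence_chain_closed: bool
    stale_evidence_found: bool
    gate_verdict: str    # "pass" or anything else
    human_decision: str  # "accept", "reject", or "" when not yet recorded


def delivery_gate_status(inputs: CloseoutInputs) -> str:
    """Green requires tests, evidence, verdict, and human acceptance together."""
    if not inputs.tests_passed:
        return "TDD-RED"
    evidence_ok = inputs.evidence_chain_closed and not inputs.stale_evidence_found
    if evidence_ok and inputs.gate_verdict == "pass" and inputs.human_decision == "accept":
        return "TDD-GREEN"
    return "BLOCKED"  # assumption: any other gap blocks delivery


tests_only = CloseoutInputs(True, False, False, "pass", "accept")
print(delivery_gate_status(tests_only))  # prints BLOCKED
```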

Chapter 7: Toolchain and implementation

7.1 AI-TDD toolchain ecosystem

AI-TDD Toolchain Ecosystem Figure 6: This figure answers one question: what layers make up the AI-TDD toolchain?

The AI-TDD toolchain is best understood as a layered system, with core engines at the bottom and higher-level Skills on top.

7.2 Core engine layer

1. Manifest Parser

  • YAML contract parsing
  • ID extraction and matrix building
  • Schema validation

2. Gate Controller

  • Gate state management
  • TDD-RED/GREEN state determination
  • Entry/delivery interception logic

3. Trace Engine

  • Five-dimension trace matrix execution
  • Execution slice management
  • Dependency graph building

4. Evidence Log

  • Evidence hash storage
  • Tamper-proof verification
  • Audit traceability

7.3 Currently executable CLI and integration boundary

The currently executable CLI is provided by the npm package bmad-speckit-sdd-flow, with bmad-speckit as the main entry point. Based on the commands verified for this article, the public CLI currently covers setup checks, version reporting, dashboard lifecycle, and broader orchestration surfaces. Lower-level Gate execution belongs to the project integration layer and should not be read as a stable public command exposed directly by this repository.

| Command | Function | Example |
| --- | --- | --- |
| bmad-speckit check | Installation/environment check | npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check |
| bmad-speckit version | View CLI version | npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit version |
| bmad-speckit dashboard-status | View dashboard status | npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit dashboard-status |

Note: Gate-related capabilities should be entered through the CLI workflow first. The examples below show stable CLI entry points. Teams still need to wire the concrete Gate scripts, tests, and evidence artifacts that exist in their own projects.

CLI usage examples

Step 1: Environment check

$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
CLI version: 1.1.0
Template version: 1.1.0
Selected AI: cursor-agent
Subagent support: native
Check OK.

Step 2: Verify the installed CLI version

$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit version
CLI version: 1.1.0
Template version: 1.1.0
Node version: v22.x

Step 3: Inspect dashboard server status

$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit dashboard-status
{
  "ok": false,
  "mode": "stopped",
  "healthy": false
}

Note: These commands reflect the public CLI behavior verified while editing this article. They do not parse Manifest files or execute project-specific delivery Gates by themselves. Concrete entry and delivery gate execution depends on how a team wires Gate scripts, test commands, and evidence artifacts in its own project.

Actual workflow

In practice, the AI-TDD workflow is completed through CLI entry + underlying Gate scripts + AI coding tools:

Three stages:

  1. Implementation Readiness Gate - Enter through the bmad-speckit workflow first. Teams can then wire their own local Gate scripts to verify Manifest completeness. The expected status is AI-TDD-RED (new acceptance tests exist, but the implementation does not yet satisfy them)
  2. AI Generates Implementation - Developers paste the Manifest into AI tools such as Claude or Cursor as prompt context, and the AI generates code within those boundary constraints
  3. Delivery Closeout Gate - Again, enter through the bmad-speckit workflow first. Teams can wire local delivery Gate scripts to verify the current attempt’s TRACE/EVD/CMD/ART closure, rerun registered validations, check artifact/hash/receipt records, and reject stale evidence reuse. The expected status is AI-TDD-GREEN only when Gate Verdict is pass and Human Decision is accept

The complete state flow appears in Figure 5: AI-TDD State Machine (/images/content/standalone/ai-tdd-framework/tdd-state-machine-en.svg), showing the flow MANIFEST_DRAFT → AI-TDD-RED → IMPLEMENTING → CLOSEOUT_CANDIDATE → AI-TDD-GREEN → CLOSED, including reject/revise and reconfirm loopback paths.
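
The state flow can be sketched as a small transition table. The forward path follows the states named in this section. The loopback targets are assumptions, since the text mentions reject/revise and reconfirm paths without stating where they land, and the names TRANSITIONS and advance are hypothetical.

```python
# Allowed transitions. The forward path follows the article's state flow.
# Loopback targets are ASSUMPTIONS; adjust them to match your own state machine.
TRANSITIONS = {
    "MANIFEST_DRAFT": {"AI-TDD-RED"},
    "AI-TDD-RED": {"IMPLEMENTING", "MANIFEST_DRAFT"},        # reconfirm loopback (assumed)
    "IMPLEMENTING": {"CLOSEOUT_CANDIDATE"},
    "CLOSEOUT_CANDIDATE": {"AI-TDD-GREEN", "IMPLEMENTING"},  # reject/revise loopback (assumed)
    "AI-TDD-GREEN": {"CLOSED"},
    "CLOSED": set(),
}


def advance(current: str, target: str) -> str:
    """Return the target state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = "MANIFEST_DRAFT"
for step in ["AI-TDD-RED", "IMPLEMENTING", "CLOSEOUT_CANDIDATE", "AI-TDD-GREEN", "CLOSED"]:
    state = advance(state, step)
print(state)  # CLOSED
```

Keeping the table explicit makes skipped stages, such as jumping from MANIFEST_DRAFT straight to AI-TDD-GREEN, fail loudly instead of silently.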

Key Boundaries:

  1. Developers manually maintain Manifest YAML files
  2. CLI is the stable entry point; lower-level Gate scripts should be wired and maintained inside each team’s own project
  3. AI code generation is completed through external tools (Claude, Cursor, etc.), not built-in commands

7.4 Common errors and solutions

The examples in this section show illustrative integration-layer error shapes. They are not commands guaranteed to exist in this repository, and teams should replace them with the actual Gate scripts wired in their own projects.

Error 1: Manifest File Not Found

$ <your-team-readiness-gate-command>
 Gate BLOCKED: Manifest file not found

Expected path: ./my-project/ai-tdd-manifest.yaml
Actual status: File does not exist

Fix:

  1. Confirm ai-tdd-manifest.yaml exists in the project root.
  2. Or point the Gate command at the correct project directory, for example through a working-directory option if your script provides one.

Error 2: Gate Status Not as Expected

$ <your-team-readiness-gate-command> --requirement-record ./records/REQ-001.json

🟢 Gate status: TDD-GREEN (already implemented)
 Gate BLOCKED: Expected status is TDD-RED

Problem analysis:

  • All tests passed, indicating implementation is already complete.
  • But Implementation Readiness Gate requires “tests exist but implementation is still missing.”

Fix:

  1. Check whether historical implementation code already exists.
  2. If you need to restart, clean implementation files while preserving tests.
  3. Or skip the readiness stage and continue from the appropriate delivery closeout flow.
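
The readiness rule behind this error can be sketched as follows. This is an illustrative function, not part of the bmad-speckit CLI: the name readiness_verdict, the input shape (test id mapped to pass/fail), and the message strings are all assumptions.

```python
def readiness_verdict(acceptance_results: dict[str, bool]) -> str:
    """Map acceptance test outcomes (test id -> passed?) to a readiness verdict."""
    if not acceptance_results:
        # Without tests there is no acceptance baseline to be RED against.
        return "BLOCKED: no acceptance tests registered"
    if all(acceptance_results.values()):
        # Everything already passes, so the implementation appears to exist.
        return "BLOCKED: expected TDD-RED but all tests pass (TDD-GREEN)"
    return "PASS: TDD-RED"


print(readiness_verdict({"ACC-001": False, "ACC-002": True}))  # PASS: TDD-RED
print(readiness_verdict({"ACC-001": True, "ACC-002": True}))
```

The second call reproduces the Error 2 situation: the gate blocks because nothing is left for the implementation step to satisfy.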

Error 3: Insufficient Test Coverage

$ <your-team-delivery-closeout-command> --requirement-record ./records/REQ-001.json
 Gate BLOCKED: TRACE rows coverage insufficient

Uncovered MUST items:
  - MUST-REG-PASSWORD-LENGTH-001: No associated test

Fix:

  1. Add the missing test binding in ai-tdd-manifest.yaml.
  2. Re-run the Manifest generation and Gate checking workflow.
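
A coverage check of this kind can be sketched in a few lines. The Manifest is represented here as an already-parsed dictionary; the "trace", "requirement", and "tests" keys and the ACC-REG-EMAIL-001 id are hypothetical and should be replaced with whatever schema your project uses.

```python
def uncovered_must_items(manifest: dict) -> list[str]:
    """Return MUST ids that have no TRACE row with at least one bound test."""
    must_ids = [item["id"] for item in manifest.get("must", [])]
    covered = {
        row["requirement"]
        for row in manifest.get("trace", [])
        if row.get("tests")  # an empty test list does not count as coverage
    }
    return [must_id for must_id in must_ids if must_id not in covered]


manifest = {
    "must": [
        {"id": "MUST-REG-EMAIL-VERIFY-001"},
        {"id": "MUST-REG-PASSWORD-LENGTH-001"},
    ],
    "trace": [
        {"requirement": "MUST-REG-EMAIL-VERIFY-001", "tests": ["ACC-REG-EMAIL-001"]},
        {"requirement": "MUST-REG-PASSWORD-LENGTH-001", "tests": []},
    ],
}
print(uncovered_must_items(manifest))  # ['MUST-REG-PASSWORD-LENGTH-001']
```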

AI-TDD’s Skill system consists of extensible Agent capability modules:

| Skill | Function | Scenario |
| --- | --- | --- |
| req-trace-matrix | Requirement trace matrix generation | Generate Manifest from requirement documents |
| contract-authoring | Contract writing assistance | Manifest writing and verification |
| reverse-audit | Reverse audit | Check if code satisfies Manifest |
| checkpoint-gate | Checkpoint gate | Automated acceptance |
| encoding-integrity | Encoding integrity check | File encoding and format verification |

7.5 External integration layer

AI-TDD integrates with existing tool ecosystems:

  • Test Frameworks: Vitest, Jest, Playwright
  • Version Control: Git (commit hash as evidence)
  • LLM APIs: Claude, Codex, Gemini
  • YAML Parsers: js-yaml, PyYAML
  • Hash Algorithms: SHA-256 (makes evidence tampering detectable)
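
As an illustration of the hash-based evidence idea, the sketch below binds an evidence id to an artifact's SHA-256 digest and a commit hash, then detects a later change to the artifact. The record layout, the function names, the EVD-REG-001 id, and the placeholder commit value are assumptions, not a format defined by the CLI.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_evidence_record(evd_id: str, artifact: Path, commit: str) -> dict:
    """Bind an evidence id to the artifact hash and the commit it came from."""
    return {
        "evd": evd_id,
        "artifact": str(artifact),
        "commit": commit,
        "sha256": sha256_of_file(artifact),
    }


def evidence_is_intact(record: dict) -> bool:
    """Re-hash the artifact and compare it with the recorded digest."""
    return sha256_of_file(Path(record["artifact"])) == record["sha256"]


# Demo: the record stops verifying once the artifact changes.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "test-report.txt"
    artifact.write_text("3 passed, 0 failed", encoding="utf-8")
    record = make_evidence_record("EVD-REG-001", artifact, commit="0000000")
    intact_before = evidence_is_intact(record)
    artifact.write_text("2 passed, 1 failed", encoding="utf-8")
    intact_after = evidence_is_intact(record)

print(intact_before, intact_after)  # True False
```

A hash makes tampering detectable, not impossible. The records themselves still need to live somewhere reviewers trust, such as the Git history.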

7.6 Implementation roadmap

Phase 1: Pilot Project

  1. Select a small functional module
  2. Write first AI-TDD Manifest
  3. Run Implementation Readiness Gate
  4. AI generates implementation and iterates to TDD-GREEN

Phase 2: Team Rollout

  1. Establish team Manifest writing standards
  2. Build CI/CD gate pipelines
  3. Train team on AI-TDD workflow

Phase 3: Scale Application

  1. Establish organization-level Skill library
  2. Integrate existing requirement management systems
  3. Establish audit and measurement systems

7.7 Best practices

1. Manifest Writing Principles

  • Every MUST must have a corresponding EVD
  • Every OUT OF SCOPE item must explain the reason and any follow-up plan
  • Use explicit validation conditions; avoid vague descriptions
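
The first two principles lend themselves to an automated lint. The sketch below works on an already-parsed Manifest dictionary; the "evd", "out", "requirement", "reason", and "follow_up" keys and the example ids are hypothetical and stand in for your project's actual schema.

```python
def lint_manifest(manifest: dict) -> list[str]:
    """Report MUST items without EVD and OUT items without reason or follow-up."""
    problems = []
    evidenced = {evd["requirement"] for evd in manifest.get("evd", [])}
    for item in manifest.get("must", []):
        if item["id"] not in evidenced:
            problems.append(f"{item['id']}: no EVD item")
    for item in manifest.get("out", []):
        if not item.get("reason"):
            problems.append(f"{item['id']}: missing reason")
        if not item.get("follow_up"):
            problems.append(f"{item['id']}: missing follow-up plan")
    return problems


example = {
    "must": [{"id": "MUST-A-001"}, {"id": "MUST-B-001"}],
    "evd": [{"id": "EVD-A-001", "requirement": "MUST-A-001"}],
    "out": [{"id": "OUT-X-001", "reason": "deferred to next iteration"}],
}
for problem in lint_manifest(example):
    print(problem)
```

The third principle, avoiding vague descriptions, still needs human review; a lint can only check that a validation field is present, not that it is meaningful.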

2. Gate Configuration Recommendations

  • Implementation readiness gate must be TDD-RED (acceptance tests exist but fail because the implementation is missing)
  • Delivery gate must be TDD-GREEN (evidence chain closed, Gate Verdict is pass, Human Decision is accept)
  • Set gate timeouts to avoid infinite blocking
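
One way to enforce a gate timeout is to run the team's Gate command through a wrapper that converts a hang into an explicit verdict. This is a sketch under assumptions: run_gate is a hypothetical helper, and mapping a timeout to "blocked" is a design choice rather than documented behavior.

```python
import subprocess
import sys


def run_gate(command: list[str], timeout_seconds: float) -> str:
    """Run a gate command and map its outcome to pass / failed / blocked."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return "blocked: timeout"
    return "pass" if completed.returncode == 0 else "failed"


# Stand-in commands: a real project would pass its own gate script here.
print(run_gate([sys.executable, "-c", "pass"], timeout_seconds=30))
print(run_gate([sys.executable, "-c", "import time; time.sleep(5)"], timeout_seconds=0.5))
```

Treating a timeout as blocked rather than failed keeps "the gate could not decide" distinct from "the gate decided against delivery".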

3. Team Collaboration Model

  • Product managers responsible for requirement confirmation phase
  • Architects responsible for architecture confirmation phase
  • Development engineers responsible for implementation readiness and execution closure
  • QA/Audit Agents responsible for audit review phase

Chapter 8: AI-TDD Gate mechanism detailed explanation

8.1 Core functions of AI-TDD Gate

The AI-TDD Gate is the Manifest’s technical implementation layer:

1. Manifest Parsing

  • Read ai-tdd-manifest.yaml
  • Parse MUST/NEG/OUT/EVD
  • Establish TRACE rows

2. Test Code Generation

  • Automatically generate test code frameworks based on Manifest
  • Associate E2E/ACC test suites
  • Generate coverage check configurations

3. Gate Execution

  • Run full tests
  • Determine TDD-RED or TDD-GREEN
  • Generate gate reports

4. Status Tracking

  • Record results of each gate run
  • Maintain requirement→test→result traceability chains
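
The "Test Code Generation" function can be illustrated with a small generator that turns MUST items into a Vitest skeleton of pending tests. This is a sketch, not the Gate's actual generator: vitest_skeleton is a hypothetical name, the Manifest is assumed to be parsed into a dictionary with "must" entries carrying "id" and "description", and Vitest is simply one of the test frameworks listed in section 7.5.

```python
import json


def vitest_skeleton(manifest: dict, suite_name: str) -> str:
    """Emit one pending Vitest test (it.todo) per MUST item."""
    lines = [
        'import { describe, it } from "vitest";',
        "",
        f"describe({json.dumps(suite_name)}, () => {{",
    ]
    for item in manifest.get("must", []):
        title = f'{item["id"]}: {item["description"]}'
        # json.dumps yields a correctly escaped double-quoted string literal.
        lines.append(f"  it.todo({json.dumps(title)});")
    lines.append("});")
    return "\n".join(lines)


manifest = {
    "must": [
        {"id": "MUST-SUM-RESULT-001", "description": "Calculate sum of array elements"},
    ]
}
print(vitest_skeleton(manifest, "array sum"))
```

Each it.todo entry keeps the MUST id in the test title, which gives the requirement→test link a stable key once the pending test is filled in.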

8.2 CI/CD integration

The following example is an integration sketch. It shows how a team might connect the bmad-speckit CLI, a manual AI implementation step, and the team’s own Gate execution layer in CI. It is not a copy-paste-complete workflow for this repository.

GitHub Actions integration sketch:

name: AI-TDD Gate

on: [pull_request]

jobs:
  implementation-readiness:
    name: "Implementation Readiness Gate"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check AI-TDD CLI Environment
        run: npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check

      - name: Run team-specific Implementation Readiness Gate
        run: echo "Replace with your project's actual readiness gate command"

  ai-implementation:
    name: "AI Implementation"
    needs: implementation-readiness
    runs-on: ubuntu-latest
    steps:
      - name: AI Generate with Manifest Context
        run: echo "Use Claude/Cursor/Codex with the confirmed Manifest as context, then commit generated implementation."

  delivery-closeout:
    name: "Delivery Closeout Gate"
    needs: ai-implementation
    runs-on: ubuntu-latest
    steps:
      - name: Run team-specific Delivery Closeout Gate
        run: echo "Replace with your project's actual closeout gate command"
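
CI steps decide pass or fail from the process exit code, so a team-specific closeout command mainly needs to map its verdict to 0 or non-zero. The sketch below shows one possible shape. The record fields (evidence_chain_closed, gate_verdict, human_decision) are hypothetical, and the --requirement-record flag mirrors the placeholder commands used in section 7.4 rather than a real CLI option.

```python
import argparse
import json
from pathlib import Path


def closeout_status(record: dict) -> str:
    """AI-TDD-GREEN only when evidence, Gate Verdict, and Human Decision all close."""
    closed = (
        record.get("evidence_chain_closed") is True
        and record.get("gate_verdict") == "pass"
        and record.get("human_decision") == "accept"
    )
    return "AI-TDD-GREEN" if closed else "CLOSEOUT_CANDIDATE"


def main(argv: list[str]) -> int:
    """Parse arguments, evaluate the record, and return a CI-friendly exit code."""
    parser = argparse.ArgumentParser(description="Hypothetical delivery closeout gate")
    parser.add_argument("--requirement-record", required=True, type=Path)
    args = parser.parse_args(argv)
    record = json.loads(args.requirement_record.read_text(encoding="utf-8"))
    status = closeout_status(record)
    print(f"Gate status: {status}")
    return 0 if status == "AI-TDD-GREEN" else 1


# When saved as a script, finish with: import sys; sys.exit(main(sys.argv[1:]))
```

A CI step would then replace the echo placeholder with a call to this script, and the job fails automatically whenever the record is still only a closeout candidate.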

Conclusion: requirement-contract-driven AI-TDD and multi-dimensional evidence acceptance

Within this article’s framing, the core AI-TDD claim can be summarized in one sentence: AI generation without Manifest as contract is unanchored improvisation; delivery without an evidence chain is optimistic judgment that cannot be reviewed.

Problem boundaries precede solutions: a first principle of architecture in the AI era

AI can help write code, draft documents, run analysis, and even assist with architectural design. What it cannot do independently is define the business boundary for a team. Problem boundaries are not purely technical; they also encode value judgments, priorities, and tradeoffs.

Software engineering has long held that code should be written for human understanding, not only machine execution. In the AI era, that idea can be reframed as:

“Any AI can write code that solves problems. Excellent architects define the boundaries of problems worth solving.”

The six mental models form the cognitive framework for human-AI collaboration. Each one follows the same line: lock the requirement contract first, then collect reviewable evidence.

  • Requirement Confirmation: Atomic decomposition, generating machine-readable Manifest contract matrices
  • Architecture Confirmation: Defining technical solutions
  • Implementation Readiness: Generate acceptance test baselines that sufficiently cover the current Manifest scope
  • Execution Closure: AI generates implementation under Manifest constraints
  • Audit Review: Multi-Agent critical audit
  • Delivery Closeout: Re-run full Manifest validation, confirming TRACE/EVD/CMD/ART evidence, Gate Verdict, and Human Decision are closed

The two key gates act as the quality firewall:

  • Implementation Readiness Gate: Check whether Manifest is complete, with the expected state AI-TDD-RED / TDD-RED
  • Delivery Closeout Gate: Check whether Manifest-linked evidence chains are closed, with the expected state AI-TDD-GREEN / TDD-GREEN

AI-TDD application prospects in large-scale engineering

If AI continues to take on more implementation work, AI-TDD may become especially useful in the following settings:

1. Enterprise-Level Microservice Architecture

In complex multi-service systems, AI-TDD’s TRACE rows can help establish cross-service requirement traceability. Each service’s Manifest defines its interface contracts, verified end-to-end through unified Gates.

2. Safety-Critical Systems

For safety-critical domains like finance, healthcare, and aviation, AI-TDD’s negative contracts (NEG / MUST NOT) and scope-boundary (OUT OF SCOPE) audit mechanisms can strengthen security-constraint management. Each security constraint needs explicit verification evidence.

3. Compliance-Driven Development

In scenarios with strict compliance requirements like GDPR and SOX, AI-TDD’s EVD mechanism can preserve compliance audit material, helping prove that each compliance item has corresponding verification.

4. Open Source Project Collaboration

AI-TDD’s Manifest can become an open source project’s “social contract”: contributors compare PR scope against the Manifest before submitting, reducing the risk of violating project constraints.

Core principles review

  1. The Manifest must be completely defined before execution begins
  2. The Manifest is machine-readable, not a natural-language description
  3. The AI-TDD Gate runs against the Manifest, not against scattered tests
  4. Without a complete Manifest, work must not enter execution
  5. While the Manifest has unverified items, delivery is prohibited

This is AI-TDD’s value: moving AI-generated code quality from natural-language guessing to requirement-contract governance, and from local tests to TRACE/EVD/CMD/ART evidence-chain acceptance.


Chapter 9: Frequently Asked Questions (FAQ)

Q1: How much additional development time does AI-TDD add?

A: It adds upfront time for clarification and Manifest drafting, but it often reduces rework later. The payoff depends on requirement complexity, team proficiency, test infrastructure, and how deeply AI participates in implementation.

Taking the calculator example from the Quick Start chapter:

  • Traditional approach: Directly let AI generate → discover boundary gaps later → rework
  • AI-TDD: Write Manifest → generate tests → generate implementation → accept by evidence

The payoff is usually easiest to see when the project has many boundaries, high rework costs, or explicit audit-evidence requirements.


Q2: How detailed does Manifest need to be?

A: Focus on boundaries, not implementation.

Over-detailed (bad):

must:
  - id: "MUST-SUM-LOOP-001"
    description: "Use for loop to traverse array, accumulate each element to accumulator variable"
    # Too specific! Limits AI's implementation approach

Appropriately detailed (good):

must:
  - id: "MUST-SUM-RESULT-001"
    description: "Calculate sum of array elements"
    validation: "sum([1,2,3]) === 6, time complexity O(n)"
    # Only define boundaries (input/output/performance), not implementation

Principle: Manifest defines “what to do” and “what not to do,” AI decides “how to do it.”
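
To make the principle concrete, here is a sketch of an acceptance check derived from the validation field of MUST-SUM-RESULT-001. The function names are hypothetical. The check inspects only the observable result, so two different implementations both satisfy it; the O(n) bound from the validation text is not measured here.

```python
def sum_builtin(values):
    """One possible implementation: delegate to the built-in."""
    return sum(values)


def sum_loop(values):
    """Another possible implementation: explicit accumulation."""
    total = 0
    for value in values:
        total += value
    return total


def satisfies_must_sum_result_001(implementation) -> bool:
    """Boundary check from the Manifest: sum([1,2,3]) must equal 6."""
    return implementation([1, 2, 3]) == 6


print(satisfies_must_sum_result_001(sum_builtin), satisfies_must_sum_result_001(sum_loop))  # True True
```

A check written against the over-detailed MUST-SUM-LOOP-001 would have to inspect the source for a for loop, which would reject sum_builtin even though its behavior is correct.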


Q3: What if AI repeatedly cannot make tests green?

A: It depends on the failure mode:

Situation 1: AI-generated implementation has logic errors

  • Human intervention, give AI more specific hints
  • Split complex MUSTs into multiple simple MUSTs
  • Use audit Agent to provide feedback

Situation 2: Tests themselves have problems

  • Check if MUST’s validation is achievable
  • Check if there are contradictory requirements, such as MUST-REQ-A-001 and MUST-REQ-A-002
  • Modify Manifest, regenerate tests

Situation 3: AI capability boundaries

  • Some complex algorithms may exceed current AI capabilities
  • Mark this part as OUT OF SCOPE, manually implement
  • Continue AI-TDD process for remaining parts

Q4: How do existing projects migrate to AI-TDD?

A: A practical migration path is to start in “shadow mode”:

Phase 1: Parallel Verification

  • Keep existing development process unchanged
  • Select 1 new feature, implement simultaneously with AI-TDD and traditional methods
  • Compare quality and time differences

Phase 2: Selective Application

  • Prioritize AI-TDD for new features
  • Supplement Manifest for legacy features during refactoring
  • Keep existing tests as regression tests

Phase 3: Full Switch (ongoing)

  • Establish internal team templates
  • Consolidate best practices
  • Prepare new-hire training materials

Do not:

  • ❌ Rewrite all code at once
  • ❌ Delete existing tests
  • ❌ Force the team to use AI-TDD for every task

Q5: How does AI-TDD integrate with existing CI/CD?

A: AI-TDD can be integrated into existing pipelines as independent verification stages. The two snippets below are integration sketches, not copy-paste-complete configs for this repository. The explicit bmad-speckit check step is only a setup smoke check; it does not execute project-specific AI-TDD Gates. Replace placeholder commands with the actual Gate commands or CI tasks used in your own project.

GitLab CI integration sketch:

# .gitlab-ci.yml
stages:
  - validate # New: AI-TDD validation
  - test
  - deploy

ai-tdd-validate:
  stage: validate
  script:
    - npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
    - echo "Replace with your project's actual implementation readiness gate command"
  only:
    - merge_requests

ai-tdd-closeout:
  stage: test
  script:
    - echo "Replace with your project's actual delivery closeout gate command"
  only:
    - main

GitHub Actions integration sketch:

# .github/workflows/ai-tdd.yml
name: AI-TDD Gate
on: [pull_request]

jobs:
  ai-tdd-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: AI-TDD validation
        run: |
          npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
          echo "Replace with your project's actual implementation readiness gate command"

Q6: Is AI-TDD suitable for small teams (1-3 people)?

A: It depends on project complexity.

Suitable Scenarios:

  • Projects with clear requirement boundaries (e.g., API services, tool libraries)
  • Projects that need long-term maintenance (>3 months)
  • Projects with frequent requirement changes (the Manifest helps track changes)

May Not Be Suitable:

  • One-time scripts/tools
  • Pure prototype exploration
  • Personal learning projects

Recommendation: Even a one-person team can start with a minimal mode: write only MUST items first, and add the rest later.


Q7: Does AI-TDD require special AI models?

A: Usually not. As long as a model can understand structured context, generate code, and iterate on failure feedback, it can participate in an AI-TDD workflow. More complex projects may still need models with stronger context windows, tool calling, and code capabilities.

Model Selection Recommendations:

  • Code generation models: Prefer models that support tool calling (Function Calling), which makes integration with Gate systems easier
  • Context length: Complex Manifests may need 128K+ tokens of context, so prefer models with long context windows
  • Multimodal capabilities: If the workflow must process screenshots or design drafts, choose a multimodal model

Prompt tips (model-independent):

  • State clearly in the prompt that generation must be based on the Manifest that follows
  • Require the AI to strictly respect OUT OF SCOPE items and not implement them
  • Require the AI to read the NEG (MUST NOT) items first and ensure none is violated
  • Provide a brief explanation of the Manifest schema to help the AI understand the structure
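
These tips can be folded into a reusable prompt template. The sketch below is one possible wording, not a prescribed prompt: build_prompt is a hypothetical helper, and the lower-case neg, out, and evd keys in the schema note are assumed to mirror the must key used in this article's Manifest examples.

```python
def build_prompt(manifest_yaml: str) -> str:
    """Wrap a Manifest in model-independent instructions."""
    return "\n".join([
        "Generate the implementation based on the following Manifest.",
        "First read every NEG (MUST NOT) item and make sure none is violated.",
        "Strictly respect OUT OF SCOPE items: do not implement them.",
        # Schema note: key names other than 'must' are assumptions.
        "Schema: must = required behavior, neg = forbidden behavior, "
        "out = excluded scope, evd = verification evidence.",
        "",
        "Manifest:",
        manifest_yaml.strip(),
    ])


manifest_yaml = """
must:
  - id: "MUST-SUM-RESULT-001"
    description: "Calculate sum of array elements"
"""
print(build_prompt(manifest_yaml))
```

Keeping the instructions in code rather than retyping them per session means every implementation attempt receives the same constraints in the same order.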

Q8: How should the AI-TDD Manifest be managed in Git?

A: Common strategies include the following:

Branch Strategy:

main
├── feature/login-manifest      # Only change Manifest
├── feature/login-impl          # AI generates implementation (based on Manifest)
└── hotfix/fix-auth-bug        # Emergency fix

Commit Conventions:

# Manifest changes
feat(manifest): Add user registration MUST-REG-EMAIL-VERIFY-001

# Implementation changes (generated by AI)
feat(impl): AI implements MUST-REG-EMAIL-VERIFY-001 email verification

# Gate passed
gate(pass): Implementation Readiness Gate TDD-RED

# Audit passed
audit(pass): 3 Agents reviewed, 0 blockers

Version Tags:

# Manifest confirmed version
git tag -a manifest-v1.0.0 -m "Requirement confirmation passed"

# Delivery version
git tag -a release-v1.0.0 -m "AI-TDD Delivery Closeout Gate passed"
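
The commit conventions above can be enforced with a small check, for example from a commit-msg hook or a CI step. The pattern below accepts only the four prefixes shown in this answer; the function name is hypothetical, and teams that add further prefixes would need to extend the pattern.

```python
import re

# Accepts: feat(manifest), feat(impl), gate(pass), audit(pass), then ": <text>".
COMMIT_PATTERN = re.compile(
    r"^(?:feat\((?:manifest|impl)\)|gate\(pass\)|audit\(pass\)): \S.*$"
)


def is_valid_commit_subject(subject: str) -> bool:
    """Check a commit subject line against the AI-TDD conventions."""
    return COMMIT_PATTERN.match(subject) is not None


print(is_valid_commit_subject("feat(manifest): Add user registration MUST-REG-EMAIL-VERIFY-001"))  # True
print(is_valid_commit_subject("update stuff"))  # False
```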

Chapter 10: Limitations and applicability boundaries

AI-TDD is not a silver bullet. Before adopting it broadly, teams need a clear view of its limits and where it fits best.

10.1 Unsuitable scenarios

Scenario 1: Highly Exploratory Prototype Development

  • When you’re not even sure “what to do” and need rapid trial-and-error
  • Manifest’s predefined nature becomes a constraint on innovation
  • Recommendation: Traditional AI-assisted coding is more suitable

Scenario 2: Very Small Projects (<500 lines of code)

  • Time to write Manifest may exceed directly writing code
  • Methodology overhead not worthwhile in small projects
  • Recommendation: Only use when project complexity exceeds threshold (e.g., >5 interaction interfaces)

Scenario 3: Pure Frontend UI Development

  • UI/UX requirements are highly subjective and difficult to describe precisely with MUST/EVD
  • Requirements such as “beautiful” or “smooth” are hard to quantify for verification
  • Recommendation: Visual-first development may be more suitable; use AI-TDD more selectively for backend or rule-heavy logic

10.2 Usage costs

Learning Cost:

  • Teams need dedicated time to learn Manifest syntax and AI-TDD workflow
  • It helps to have at least one team member who knows the workflow well and can keep templates, gates, and audit expectations consistent

Time Cost:

  • Writing Manifest and confirmation processes increase upfront time
  • If requirement boundaries are complex and rework costs are high, later rework usually decreases
  • Net benefits are usually more visible in large, long-term, or high-risk projects

Tool Cost:

  • Need runnable AI-TDD Gate infrastructure, whether built internally or integrated from an existing team pipeline
  • Multi-Agent audit requires additional token consumption

10.3 Failure risks

Risk 1: Manifest itself incorrectly defined

  • If human understanding of requirements is wrong, AI-TDD can only accelerate wrong implementation
  • Mitigation: Strengthen manual review in requirement confirmation phase

Risk 2: AI cannot understand complex constraints

  • Some domain-specific logic (e.g., financial compliance rules) may exceed current AI capabilities
  • Mitigation: Split complex constraints into simpler sub-constraints

Risk 3: Over-reliance on tools

  • Teams may fall into “using AI-TDD for the sake of using AI-TDD”
  • Mitigation: Regular retrospectives to confirm methodology indeed brings value

10.4 Gradual adoption path

First phase: Single MUST item trial

  • Choose a simple feature (e.g., logging)
  • Only define 1 MUST, experience complete workflow

Second phase: Add NEG / MUST NOTs

  • Increase negative constraints (e.g., “prohibit hardcoded keys”)
  • Experience TDD-RED interception effects

Third phase: Complete workflow

  • Add OUT OF SCOPE, EVD, TRACE rows
  • Run complete AI-TDD Gate

Team rollout phase: Broaden adoption

  • Summarize lessons learned from the pilot
  • Develop team internal Manifest templates and best practices

Appendix A: Glossary

| Term | English | Definition |
| --- | --- | --- |
| AI-TDD | AI-TDD defined in this article | Manifest-level AI-TDD with AI as execution agent and Manifest contract at the core; not yet standardized industry terminology |
| Manifest | AI-TDD Gate Manifest | YAML-format requirement contract matrix containing MUST/NEG/OUT/TRACE/EVD/ACC/E2E/FAIL/EDGE/CMD/ART/TASK |
| MUST | Must Requirement | Functional requirements that system must implement |
| NEG | Must Not / Negative Assertion | Blocking negative assertion; MUST NOT is the conceptual alias and machine IDs use NEG-* |
| OUT | Out Of Scope Boundary | Functional boundary excluded from the current iteration; older “not completed” wording should migrate to OUT OF SCOPE / OUT-* |
| TRACE | Trace Row | Contract-slice index binding MUST/NEG, TASK, scenario layer ACC/E2E/EDGE/FAIL, and evidence layer EVD/CMD/ART; OUT binds through scope-audit refs |
| EVD | Evidence | Requirement verification evidence items, including thresholds and artifacts |
| ACC | Acceptance Test | Automated test or check tied to acceptance |
| E2E | End-to-End Test | End-to-end scenario verification |
| FAIL | Failure Path | Required failure path attached to NEG-* |
| EDGE | Edge Case | Boundary condition or exceptional input scenario |
| CMD | Command | Reproducible verification command |
| ART | Artifact | Auditable delivery or evidence file |
| TASK | Task | Implementation task or execution slice |
| TDD-RED | TDD-RED | Implementation readiness gate status: the new or linked acceptance tests exist and fail because the implementation has not yet satisfied them |
| TDD-GREEN | TDD-GREEN | Delivery gate status: the current-attempt evidence chain is closed, Gate Verdict is pass, and Human Decision is accept |
| CLOSEOUT_CANDIDATE | Closeout Candidate | Intermediate state where registered validations pass, but the evidence chain, Gate Verdict, and Human Decision are not all closed |
| Implementation Readiness Gate | Implementation Readiness Gate | Gate that must be passed before execution begins, requiring TDD-RED status |
| Delivery Closeout Gate | Delivery Closeout Gate | Gate that must be passed before delivery, requiring evidence chain, Gate Verdict, and Human Decision closure |
| Requirement Contract | Requirement Contract | Human-confirmed and versioned Manifest that constrains the AI execution scope |
| Contract Slice | Contract Slice | Smallest acceptable unit centered on a TRACE row, binding requirement, scenario, evidence, command, and artifact |
| Evidence Chain | Evidence Chain | Reviewable proof chain composed of TRACE, EVD, CMD, ART, and their hashes or receipts |
| Gate Verdict | Gate Verdict | Gate decision such as pass, blocked, or failed based on the current attempt’s evidence chain |
| Human Decision | Human Decision | Recorded human accept/reject decision based on confirmation page and audit evidence |
| Bounded Packet Closure | Bounded Packet Closure | AI iterates generation under Manifest constraints until the candidate implementation satisfies registered validations and enters closeout |
| Human-in-the-loop | Human-in-the-loop | Mechanism where critical decision points must be confirmed by humans |
| Contract as Code | Contract as Code | Manifest is executable encoding of human intent |
| Reverse Audit | Reverse Audit | Audit method tracing from code implementation to verify whether it satisfies Manifest |
| Trace Row | Trace Row | Minimum requirement verification unit that can be independently executed |
| Skill | Skill | AI-TDD’s extensible Agent capability modules |
| bmad | BMAD CLI | AI-TDD’s command line tool |

References and extensions

Core theoretical literature

Beck, K. (2002). Test-driven development: By example. Addison-Wesley.

Brooks, F. P. Jr. (1975). The mythical man-month: Essays on software engineering. Addison-Wesley.

GitHub. (2021, June 29). Introducing GitHub Copilot: Your AI pair programmer. https://github.blog/news-insights/product-news/introducing-github-copilot-ai-pair-programmer/

Humble, J., & Farley, D. (2010). Continuous delivery: Reliable software releases through build, test, and deployment automation. Addison-Wesley.

OpenAI. (2023). GPT-4 Technical Report. arXiv. https://arxiv.org/abs/2303.08774

Sobocinski, P. (2023, August 17). TDD with GitHub Copilot. Martin Fowler. https://martinfowler.com/articles/exploring-gen-ai/06-tdd-with-coding-assistance.html

Sukharev, D. (2023). AI-TDD: CLI for TDD - you write the test, GPT writes the code to pass it. GitHub. https://github.com/di-sukharev/AI-TDD

Piya, S., & Sullivan, A. (2023). LLM4TDD: Best Practices for Test Driven Development Using Large Language Models. arXiv. https://arxiv.org/abs/2312.04687

Mathews, N. S., & Nagappan, M. (2024). Test-Driven Development for Code Generation. arXiv. https://arxiv.org/abs/2402.13521

Cui, Y. (2025). Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation. arXiv. https://arxiv.org/abs/2505.09027

Schneider, J. G., Borjigin, A., Kamal, M., & Grundy, J. (2026). Test-Driven Agentic Development for Automated Software Evolution. arXiv. https://arxiv.org/abs/2604.26615

Beck, K. (2025, June 11). TDD, AI agents and coding with Kent Beck [Interview]. The Pragmatic Engineer. https://newsletter.pragmaticengineer.com/p/tdd-ai-agents-and-coding-with-kent

Thoughtworks. (2026). The Future of Software Engineering: Retreat Findings and Strategic Insights. https://www.thoughtworks.com/content/dam/thoughtworks/documents/report/tw_future%20_of_software_development_retreat_%20key_takeaways.pdf


This article represents original framework thinking. It defines AI-TDD around Manifest requirement contracts, TRACE/EVD/CMD/ART evidence chains, Gate Verdict, and Human Decision. The term is still evolving — feedback and corrections are welcome.
