Article
AI-TDD: Requirement Contracts and Multi-Dimensional Evidence Acceptance
AI-TDD turns human intent into a Manifest requirement contract, then accepts AI-generated work through evidence chains and Gate verdicts.
Introduction: from code-level TDD to requirement-contract-driven development
From the GitHub Copilot technical preview to GPT-4, Claude Code, Codex, and similar tools in day-to-day development, AI coding has evolved far beyond simple code completion. It now reaches into requirement understanding, implementation generation, test fixing, and broader engineering collaboration.
The real shift is no longer “Can AI write code?” but “Does the code AI writes stay within the requirement boundary?”
The industry’s focus has shifted from “Can AI write code?” to “Is the code AI writes correct?” Current mainstream large language models can reliably generate large amounts of runnable code, but syntactic correctness does not equal requirement alignment. Code-level TDD still has a structural limitation here because it lacks a global view of the requirement boundary.
Problem boundaries precede solutions.
This principle matters even more in the AI era. Software engineering has long understood that most of the work lies in understanding the problem before trying to solve it. Once AI can generate code quickly and cheaply, success depends less on raw implementation speed and more on whether the problem boundary is defined clearly enough.
Ambiguous requirements lead to AI improvisation, and improvisation often deviates from expectations. AI-TDD is built on this principle: AI generation without Manifest as a contract is unanchored improvisation. The essence of Manifest is to transform vague requirement boundaries into machine-readable, verifiable contract matrices.
A typical failure scenario: lessons from an AI-generated user module
Below is a composite failure scenario constructed from multiple common risk points. A product owner describes the requirement to AI: “Implement a user registration feature with email verification and password encryption.”
AI generates the code, and the team quickly deploys. However, within a week of going live, the following issues occur:
- Security vulnerability: AI used MD5 instead of bcrypt to store passwords. While “encrypted,” they are vulnerable to rainbow table attacks
- Concurrency issues: When many users register simultaneously, the system creates duplicate accounts because AI didn’t implement email uniqueness constraints
- Boundary overflow: Users can upload avatar images of arbitrary sizes, causing storage costs to spiral out of control — because the requirement didn’t specify “limit image size”
- Feature drift: AI implemented OAuth social login, but this was functionality the team planned for “next quarter”
Root cause: Natural language requirements contain ambiguity, implicit assumptions, and fuzzy boundaries. AI fills these gaps according to its own “understanding,” and this understanding deviates from the team’s true intent.
If AI-TDD Manifest had been used:
must:
- id: MUST-REG-EMAIL-001
text: Users can register via email
validation: "email format complies with RFC 5322, password length 8-32 characters"
- id: MUST-REG-PASSWORD-001
text: Password must be encrypted for storage
validation: "Use bcrypt, cost factor >= 12, not MD5 or SHA"
- id: MUST-REG-UNIQUE-001
text: Prevent duplicate registration
validation: "Database-level unique index + application-level atomic check"
outOfScope:
- id: OUT-AUTH-OAUTH-001
text: OAuth social login not supported
reason: "Not included in this iteration, moved to REQ-OAUTH-001"
AI would generate implementations under explicit contract constraints, rather than improvising.
Limitations of code-level TDD
Traditional TDD (Test-Driven Development) centers on “test first”: write tests, then implement. This practice works in the manual coding era because developers write tests and implement themselves, maintaining consistent requirements in their minds.
But when AI becomes the implementation agent, problems arise:
Test code can only cover local aspects, not constrain the global picture.
Imagine this scenario: You ask AI to implement a user registration module. You write a unit test verifying “returns error when empty password is input.” AI generates code, the test passes. But after going live, you discover:
- AI didn’t implement email format validation (because no corresponding test existed)
- AI stored passwords in plaintext (tests only verified API behavior, not storage logic)
- AI didn’t handle race conditions for concurrent registration (tests didn’t cover concurrency scenarios)
This is the essential limitation of code-level TDD: tests can only verify “what I’ve tested,” while AI implements according to “its own understanding of the complete requirement.” The gap between these is the risk.
Manifest as the requirement contract matrix
There’s a long-standing consensus in software engineering: the later a defect is discovered, the higher the cost to fix it. In AI-generated code scenarios, this problem is amplified because models can quickly expand vague requirements into large amounts of seemingly runnable implementations.
AI-TDD’s core breakthrough: elevate acceptance criteria from “code-level test cases” to “requirement-level contract matrices.”
We call this the AI-TDD Gate Manifest — a machine-readable requirement contract checklist generated during the requirement confirmation phase.
Manifest is not simply a “test list” but a requirement contract matrix containing MUST, NEG (MUST NOT negative assertions), OUT (OUT OF SCOPE boundaries), TRACE, EVD, ACC/E2E, FAIL/EDGE, CMD, ART, and TASK namespaces. The complete technical definitions and Schema specifications for these dimensions will be expanded in Chapter 4.
Key insight: Manifest should reach a sufficiently complete state before execution begins. If requirements are added during execution, the model is more likely to miss key boundaries or experience implementation drift.
Core principles of AI-TDD
AI-TDD is not “write tests first, then let AI generate” but rather “define problem boundaries first, then verify implementation within boundaries.”
Six cognitive stages: Requirement Confirmation → Architecture Confirmation → Implementation Readiness → Execution Closure → Audit Review → Delivery Closeout
Two key gates: Implementation Readiness Gate (entry gate, expected status AI-TDD-RED, shortened below to TDD-RED) and Delivery Closeout Gate (delivery gate, expected status AI-TDD-GREEN, shortened below to TDD-GREEN), ensuring “no complete Manifest, no execution” and “unverified Manifest items, no delivery.”
The goal of this framework is straightforward: replace vague natural language requirements with machine-readable Manifest, establish a complete requirement contract before execution begins, and use reproducible evidence chains to decide whether the current implementation is deliverable.
AI-TDD’s first-class definition: requirement-contract-driven + multi-dimensional evidence acceptance
This article does not define AI-TDD as “let AI run TDD.” It defines AI-TDD as:
Encode human intent as a requirement contract first, then prove the current implementation satisfies that contract through a multi-dimensional evidence chain.
In engineering terms, the chain is:
Requirement Contract
→ Contract Slice
→ Evidence Chain
→ Gate Verdict
→ Human Decision
The chain has five layers:
| Layer | Typical namespace | Role |
|---|---|---|
| Declaration | MUST / NEG / OUT | Defines what must be done, what must not happen, and what is out of scope |
| Slice | TRACE / TASK | Breaks requirements into traceable, acceptable contract slices |
| Scenario | ACC / E2E / EDGE / FAIL | Defines happy paths, end-to-end paths, edge cases, and failure paths |
| Evidence | EVD / CMD / ART / hash / receipt | Defines what counts as proof, how to reproduce it, and where artifacts live |
| State | AI-TDD-RED / IMPLEMENTING / CLOSEOUT_CANDIDATE / AI-TDD-GREEN / CLOSED | Lets Gates decide whether the contract lifecycle can move forward |
The TRACE row is AI-TDD’s smallest contract unit. Traditional TDD’s atomic unit is the test case. BDD’s atomic unit is the scenario. AI-TDD’s atomic unit is the contract slice: it must state which MUST/NEG it covers, which scenario verifies it, which evidence is required, which commands run, and which artifacts are produced. OUT boundaries do not belong in covers; they bind to scope-audit evidence through scopeAuditRefs or an equivalent field.
Multi-dimensional evidence acceptance must also be stated upfront: test pass is one form of evidence, not delivery itself. A delivery can only be called AI-TDD-GREEN after the current attempt closes its TRACE -> EVD -> CMD -> ART evidence chain, the Gate returns a pass verdict, and a recorded Human Decision accepts the result.
Three-minute reading path: if you only want the core thesis, read this first-class definition, then the Quick Start section “Accept delivery through the evidence chain,” then Chapter 6’s Delivery Closeout Gate, and finally use the glossary to check the chain Requirement Contract -> Contract Slice -> Evidence Chain -> Gate Verdict -> Human Decision.
Quick start: get started with AI-TDD
Don’t want to read theory first? No problem. Follow this example to understand AI-TDD’s core principles.
Example: implementing a calculator with AI-TDD
Your requirement: “Implement an addition function”
Problems with traditional approaches: Telling AI directly to “write an addition function” might result in:
- Not handling non-numeric inputs
- Not clarifying numeric boundary conditions
- Adding unnecessary features (subtraction, multiplication)
AI-TDD approach: Write Manifest first, then let AI generate.
Key state flow: Manifest → AI-TDD-RED / TDD-RED (tests exist/implementation missing) → CLOSEOUT_CANDIDATE (registered validations pass, delivery verdict not yet closed) → AI-TDD-GREEN / TDD-GREEN (evidence chain, Gate Verdict, and Human Decision close) → Delivery
Step 1: Create a Manifest file
Create calculator-manifest.yaml:
manifest:
version: "1.0.0"
project: "calculator-demo"
requirementId: "REQ-CALC-001"
title: "Addition Function"
acceptanceCriteria:
must:
- id: "MUST-CALC-ADD-001"
description: "Accept two numeric parameters, return their sum"
validation: "add(2, 3) === 5"
- id: "MUST-CALC-FLOAT-001"
description: "Handle floating-point numbers"
validation: "add(0.1, 0.2) close to 0.3 (considering floating-point precision)"
mustNot:
- id: "NEG-CALC-TYPE-001"
description: "Do not accept non-numeric inputs"
validation: "add('a', 1) throws TypeError"
outOfScope:
- id: "OUT-CALC-OPS-001"
description: "Subtraction, multiplication, division"
reason: "This iteration only implements addition"
evidence:
- id: "EVD-CALC-UNIT-001"
type: "unit_test"
description: "Addition, floating-point, and non-numeric input tests all pass"
requiredCommandRefs: ["CMD-CALC-TEST-001"]
artifactRefs: ["ART-CALC-TEST-REPORT-001"]
- id: "EVD-CALC-SCOPE-001"
type: "scope_audit"
description: "Implementation contains no subtraction, multiplication, or division behavior"
requiredCommandRefs: ["CMD-CALC-SCOPE-001"]
artifactRefs: ["ART-CALC-SCOPE-REPORT-001"]
traceRows:
- id: "TRACE-CALC-ADD-001"
covers: ["MUST-CALC-ADD-001", "MUST-CALC-FLOAT-001"]
evidenceRefs: ["EVD-CALC-UNIT-001"]
commandRefs: ["CMD-CALC-TEST-001"]
artifactRefs: ["ART-CALC-TEST-REPORT-001"]
- id: "TRACE-CALC-NEG-001"
covers: ["NEG-CALC-TYPE-001"]
evidenceRefs: ["EVD-CALC-UNIT-001"]
commandRefs: ["CMD-CALC-TEST-001"]
artifactRefs: ["ART-CALC-TEST-REPORT-001"]
- id: "TRACE-CALC-SCOPE-001"
scopeAuditRefs: ["OUT-CALC-OPS-001"]
evidenceRefs: ["EVD-CALC-SCOPE-001"]
commandRefs: ["CMD-CALC-SCOPE-001"]
artifactRefs: ["ART-CALC-SCOPE-REPORT-001"]
commands:
- id: "CMD-CALC-TEST-001"
run: "npm test -- calculator.test.js --runInBand > artifacts/calculator-test-report.txt"
producesArtifactRefs: ["ART-CALC-TEST-REPORT-001"]
- id: "CMD-CALC-SCOPE-001"
run: "node scripts/check-calculator-scope.mjs > artifacts/calculator-scope-report.txt"
producesArtifactRefs: ["ART-CALC-SCOPE-REPORT-001"]
artifacts:
- id: "ART-CALC-TEST-REPORT-001"
path: "artifacts/calculator-test-report.txt"
- id: "ART-CALC-SCOPE-REPORT-001"
path: "artifacts/calculator-scope-report.txt"
Key points:
must: What addition should do (2+3=5, floating-point handling)mustNot: What addition should not do (not accepting strings)outOfScope: Explicitly exclude other operations (preventing AI improvisation)traceRows: Bind requirement boundaries to evidence, commands, and artifactsevidence/commands/artifacts: Define what counts as proof, how to reproduce it, and where the evidence is stored
Step 2: AI generates test code
Give the Manifest to AI with this prompt:
Based on the following Manifest, generate test code, ensuring coverage of NEG (MUST NOT) negative scenarios:
[Paste Manifest]
AI will generate tests similar to:
// calculator.test.js
describe("Addition Function", () => {
// MUST-CALC-ADD-001
test("2 + 3 = 5", () => {
expect(add(2, 3)).toBe(5);
});
// MUST-CALC-FLOAT-001
test("floating-point handling", () => {
expect(add(0.1, 0.2)).toBeCloseTo(0.3, 10);
});
// NEG-CALC-TYPE-001
test("reject non-numeric inputs", () => {
expect(() => add("a", 1)).toThrow(TypeError);
});
});
Note: This is a conceptual example, not a fully runnable project scaffold. To execute it locally, you still need to import
add, initialize a test runner such as Jest or Vitest, and define thenpm testscript.
Development-phase red check: You may run npm test first to quickly confirm the tests fail. For the formal evidence chain, use the Manifest-registered CMD-CALC-TEST-001 to produce ART-CALC-TEST-REPORT-001.
Expected result: All tests fail (TDD-RED status) ✅
- The
addfunction doesn’t exist yet → failure is expected - This proves test code is ready, implementation can begin
Step 3: AI generates implementation code
Prompt:
Based on the following Manifest and test code, implement the add function, ensuring all tests pass:
[Paste Manifest]
[Paste test code]
AI will generate:
// calculator.js
function add(a, b) {
if (typeof a !== "number" || typeof b !== "number") {
throw new TypeError("Parameters must be numbers");
}
return a + b;
}
module.exports = { add };
Run the registered command: CMD-CALC-TEST-001
Expected result: All tests pass, and the candidate implementation enters closeout candidate status. It is not AI-TDD-GREEN yet because delivery evidence and Human Decision are not closed.
Step 4: Accept delivery through the evidence chain
A delivery cannot rely on the sentence “tests passed.” It must show that the current attempt’s contract slices are closed:
| Contract item | TRACE / Scope Audit | EVD | CMD | ART | Gate Verdict | Human Decision |
|---|---|---|---|---|---|---|
MUST-CALC-ADD-001 / MUST-CALC-FLOAT-001 | TRACE-CALC-ADD-001 | EVD-CALC-UNIT-001 | CMD-CALC-TEST-001 | ART-CALC-TEST-REPORT-001 | pass | accept, with decisionTimestamp and HD-CALC-001 receipt |
NEG-CALC-TYPE-001 | TRACE-CALC-NEG-001 | EVD-CALC-UNIT-001 | CMD-CALC-TEST-001 | ART-CALC-TEST-REPORT-001 | pass | accept, confirming the negative path belongs to the current attempt |
OUT-CALC-OPS-001 | TRACE-CALC-SCOPE-001.scopeAuditRefs | EVD-CALC-SCOPE-001 | CMD-CALC-SCOPE-001 | ART-CALC-SCOPE-REPORT-001 | pass | accept, confirming the scope-audit receipt is archived |
The checklist should be written as evidence closure, not as a generic todo list:
- Every
MUST/NEGhasTRACEcoverage - Every
TRACEbinds toEVD - Every
EVDtraces to aCMDrun in the current attempt - Every
CMDproduces auditableART - The scope audit for
OUT-CALC-OPS-001confirms AI did not add out-of-scope operations - Gate Verdict is
pass - Human Decision is
accept, with timestamp, actor, and receipt artifact recorded
Conclusion: Delivery Closeout Gate can return AI-TDD-GREEN only after the current attempt’s evidence chain closes, Gate Verdict is pass, and Human Decision is accept.
CMD-CALC-SCOPE-001 should not directly use rg 'subtract|multiply|divide' as the passing condition because rg returns exit code 1 when no matches are found. The scope audit script should explicitly convert “no forbidden symbols” into exit code 0 and write ART-CALC-SCOPE-REPORT-001. For example:
// scripts/check-calculator-scope.mjs
import { readFileSync } from 'node:fs';
const source = readFileSync('calculator.js', 'utf8');
const forbidden = ['subtract', 'multiply', 'divide'].filter((name) => source.includes(name));
if (forbidden.length > 0) {
console.error(`Out-of-scope operations found: ${forbidden.join(', ')}`);
process.exit(1);
}
console.log('Scope audit passed: only addition is implemented.');
Comparison: traditional approach vs AI-TDD
| Dimension | Traditional Approach | AI-TDD |
|---|---|---|
| Requirement expression | ”Write an addition function” (vague) | Manifest YAML (precise) |
| AI output | May include subtraction, multiplication | Strictly addition only (clear boundaries) |
| Error handling | May be omitted | NEG (MUST NOT) enforces requirements |
| Acceptance criteria | Manual judgment “seems correct” | TRACE/EVD/CMD/ART evidence-chain closure |
Next steps
If this example interests you, continue reading Chapter 1 to understand why AI-TDD solves the core problems of AI-generated code.
Background verification: AI-TDD concepts predate the terminology
First, let’s clarify terminology boundaries: as of now, AI-TDD is not an industry-standard term with unified definitions like TDD, BDD, or CI/CD. Public materials do contain exact matches, such as the 2023 open-source project di-sukharev/AI-TDD and team training courses on “AI Test-Driven Development (AI-TDD).” But these are mostly tool names, course names, or community practice names, insufficient to prove that AI-TDD has formed recognized standards.
Therefore, when this article uses AI-TDD, it’s not treating it as an existing term formally defined by authoritative bodies, but rather using it to summarize a verifiable engineering trend: as AI assumes more code generation work, TDD evolves from “testing practice before manual coding” to “contract mechanism constraining AI generation behavior.”
In 2021, GitHub Copilot was released as a technical preview, allowing large numbers of developers to directly see that “large models can generate code.” At this point, industry focus was primarily on efficiency: can AI complete code, reduce boilerplate work, and help developers implement faster?
In 2023, the question began shifting from “can it generate” to “is it generated correctly.” After GPT-4’s release, large models demonstrated strong capabilities in various professional and academic tasks, but hallucinations, incorrect reasoning, and requirement misunderstandings persisted. In August 2023, Paul Sobocinski published “TDD with GitHub Copilot” on Martin Fowler’s website, systematically pointing out that LLMs provide irrelevant information and even hallucinate, making TDD more necessary when using AI coding assistants. The key judgment was that tests are not just feedback mechanisms but also ways to break problems into smaller pieces and let AI gradually approach correct implementation.
In 2023, the di-sukharev/AI-TDD open-source project explored the workflow of “humans write tests, GPT writes code until tests pass.” This project doesn’t represent industry consensus, but it shows that the abbreviation AI-TDD wasn’t coined after the fact. It was used early to describe specific practices combining AI and TDD.
From 2024 to 2026, academia began researching this direction with more cautious names, such as Generative AI for Test Driven Development, Test-Driven Development for Code Generation, Tests as Prompt, Test-Driven Agentic Development, and AI-native TDD framework. These papers may not use the abbreviation AI-TDD, but they collectively point to a trend: tests are not just verification artifacts but can serve as inputs for models to understand tasks, constrain generation, and expose errors. In other words, tests are evolving from “post-development checking tools” to “pre-AI-generation contract languages.”
In 2025, discussions around AI coding agents further intensified. Some engineering blogs and team practices began reinterpreting TDD as a quality-control mechanism for AI-generated code; Kent Beck discussed TDD’s relationship with AI agents in The Pragmatic Engineer interview, emphasizing tests’ constraining value in AI collaboration. At this point, the practice of “define executable verification first, then let AI implement” began evolving from personal techniques to team workflows, but the more rigorous statement should be: the industry is forming a practice spectrum of “AI-assisted TDD / TDD with AI agents / AI-native TDD governance” rather than having converged on a single AI-TDD standard term.
In February 2026, Thoughtworks held “The Future of Software Development Retreat” in Deer Valley, Utah. The report contained a crucial judgment: as AI assumes more code production work, engineering rigor doesn’t disappear but migrates to specifications, tests, constraints, and risk management. The report specifically noted that TDD can produce significantly better results for AI coding agents because when tests exist before code, agents cannot make incorrect implementations pass by writing tests that “verify incorrect behavior.”
This development trajectory shows that the AI-TDD discussed in this article is not a recitation of an existing standard term but a further abstraction of the above practice spectrum: from code-level testing to requirement-level contracts; from local assertions to global Manifest; from “test-driven implementation” to “contract-driven generation.” This is the background and naming boundary for the AI-TDD Gate Manifest proposed in this article.
Chapter 1: Why requirement-contract-driven AI-TDD is needed
1.1 Essential limitations of code-level TDD
Why doesn’t traditional TDD work in the AI era? The core problem isn’t the “test first” concept itself, but that tests can only cover code locally and cannot control AI’s global understanding of requirements.
Limitation 1: Locality of Test Coverage
Unit tests can only cover local code behavior. A user registration feature might require:
- Input validation (unit tests can cover)
- Database operations (integration tests needed)
- Concurrency control (concurrency tests needed)
- Security compliance (security scanning needed)
- Performance metrics (performance tests needed)
Developers often only write the test types they’re most familiar with (usually unit tests), omitting other dimensions.
Limitation 2: Implicitness of Requirement Understanding
Test code itself is an “encoding” of requirements. But test code cannot answer:
- “Which requirements does this test cover?”
- “Which requirements are not yet covered?”
- “Are there conflicts between requirements?”
When AI generates code, it sees isolated tests, not the global picture of requirements.
Limitation 3: Missing Boundary Definitions
Code-level TDD is good at defining “what to do” but not good at defining:
- “What not to do” (OUT OF SCOPE)
- “To what extent” (EVD)
- “How to trace” (TRACE rows)
AI will “fill in” these missing definitions according to its own understanding, which is where risk lies.
1.2 Manifest: from test checklist to contract matrix
The essence of Manifest is transforming implicit knowledge in human brains into machine-readable explicit contracts.
Manifest vs Test Code:
| Dimension | Test Code | Manifest |
|---|---|---|
| Abstraction level | Code-level (How) | Requirement-level (What) |
| Coverage scope | Local | Global |
| Human readability | Poor (requires code knowledge) | Good (structured document) |
| Machine readability | Good (executable) | Good (parsable) |
| Traceability | Weak (test→requirement?) | Strong (TRACE rows) |
| Completeness check | Hard (coverage ≠ requirement coverage) | Easy (acceptance item checklist) |
Core value of Manifest:
- Global view: Define “what all requirements are” before execution begins
- Machine-readable: AI can parse Manifest and understand the global picture of requirements
- Completeness checking: Can automatically check “which requirements are not yet covered by tests”
- Contract authority: Once confirmed, Manifest becomes a formal contract between humans and machines
1.3 AI-TDD Gate: Manifest execution engine
AI-TDD Gate is Manifest’s technical implementation layer, responsible for:
1. Manifest Parsing and Registration
- Read
ai-tdd-manifest.yaml - Register all MUST/NEG/OUT/EVD/ACC/E2E acceptance items
- Establish TRACE rows mappings
2. Acceptance Test Generation
- Automatically generate test code frameworks based on Manifest
- Associate E2E/ACC test suites
- Generate test coverage requirements
3. Gate State Determination
- Run all tests
- Determine TDD-RED (entry) or TDD-GREEN (delivery)
- Block non-compliant items
4. Traceability Recording
- Record results of each gate run
- Maintain requirement→test→result traceability chains
- Support post-hoc auditing
Chapter 2: First principles of architecture
2.1 From “solving problems” to “defining problem boundaries”
Before diving into AI-TDD’s technical details, it helps to ground the methodology in a more basic architectural idea: first principles.
Ultimate first principle: human cognitive capacity is limited
This assumption sits underneath much of software engineering. If human cognitive capacity were effectively unlimited, many architectural principles, patterns, and methodologies would matter much less. We would need far less decomposition, abstraction, and modularization.
Fundamental Goal Derived from Ultimate Principle: Managing Complexity
Because human cognition is limited, systems become hard to maintain, modify, and predict once complexity rises beyond what people can reliably reason about. At its core, architecture is an attempt to keep system complexity within human operating range.
Core Principle for Achieving the Goal: Problem Boundaries Precede Solutions
This is the most fundamental lever we have for managing complexity. The main source of complexity is often not the solution itself, but unclear problem boundaries. Only after those boundaries are explicit can we tell which complexity is necessary and which is self-inflicted. Any solution detached from the problem boundary introduces avoidable complexity.
From an engineering perspective, verifiability is often as important as raw model capability. If a task’s boundaries are unclear and cannot be checked, improvements in model architecture alone may still fail to produce reliable outcomes. Clear boundaries and verification criteria make model behavior easier to guide and evaluate.
Traditional architectural pitfall: over-focusing on “how”
Software architecture has long leaned toward one cognitive bias: we spend too much time on “how to solve the problem” and not enough on “where the problem boundary actually is.”
Traditional architecture design process:
Traditional Waterfall Architecture Design: Business requirements → Technology selection → System architecture → Module division → Interface design → Coding implementation
In this process, architects focus on “how to design” (How) at each stage:
- Technology selection stage: Which framework? (Spring Boot / Express / Django)
- System architecture stage: How to divide services? (Monolithic / Microservices / Serverless)
- Module division stage: Where are module boundaries? (By domain / By function / By team)
- Interface design stage: What API format? (REST / GraphQL / gRPC)
- Coding implementation stage: How to specifically implement? (Design patterns / Code standards)
Fundamental flaw: once AI becomes the implementation agent, this pattern starts to break down. AI can generate endless “how” options, but if the “what” is unclear, any of those “how” options may be wrong. The traditional process does not define the problem boundary explicitly enough.
First Principle: Problem Boundaries Precede Solutions
AI-TDD’s architectural mental model is based on a simple but profound insight:
Clear problem boundary definition is more important than elegant solutions.
This doesn’t make solutions unimportant. It means any solution becomes risky when the boundaries are vague, because AI will fill in the gaps on its own, and those guesses may conflict with human intent.
From “Solving Problems” to “Defining Problem Boundaries”
The difference between these two architectural design paradigms determines whether AI-generated code “aligns with requirements” or “improvises”:
Paradigm A: Solution-Driven (Traditional)
Requirement: Implement user registration feature
Architect thinks:
- Which framework to use? (Spring Boot / Express / Django)
- How to design the database? (users table, password field)
- What API interface format? (POST /api/users)
- Which middleware is needed? (Redis cache, RabbitMQ message queue)
Paradigm B: Boundary-Definition-Driven (AI-TDD)
Requirement: Implement user registration feature
Architect thinks:
- MUST: What must be supported? (email registration, password encryption, duplicate detection)
- NEG / MUST NOT: What is prohibited? (plaintext storage, SQL injection, concurrent race conditions)
- OUT OF SCOPE: What is explicitly excluded? (OAuth, phone verification)
- EVD: How to verify? (unit tests, security scans, performance tests)
- BOUNDARY: Where are system boundaries? (Only responsible for registration, not email sending?)
Paradigm A’s result is a design solution. Paradigm B’s result is a contract matrix.
Key differences:
- Paradigm A’s deliverable is “advice” — AI can reference or deviate from it
- Paradigm B’s deliverable is “constraints” — AI must implement within boundaries; deviation equals failure
Five Dimensions of Boundary Definition
AI-TDD’s Manifest defines problem boundaries from five dimensions:
-
Functional Boundary (MUST)
- What the system must do
- Each MUST has corresponding acceptance criteria and evidence requirements
-
Negative And Scope Boundary (NEG / MUST NOT + OUT OF SCOPE)
- What the system is prohibited from doing
- Explicitly excluded functions and scenarios
- This dimension is most easily overlooked in traditional architecture
-
Evidence Boundary (EVD)
- What constitutes proof of “completion”
- Not “I think it’s complete” but “these commands all return 0”
-
Traceability Boundary (TRACE rows)
- Mapping relationships between requirements and verification
- Ensures every requirement has test coverage and every test corresponds to a requirement
-
State Boundary (TDD-RED/GREEN)
- Entry state: Must be TDD-RED (tests exist but fail)
- Delivery state: Must be TDD-GREEN (the current attempt’s
TRACE/EVD/CMD/ARTevidence chain is closed, Gate Verdict ispass, and Human Decision isaccept)
Why Boundary Definition is Crucial in the AI Era
The fundamental difference between human developers and AI lies in how they autonomously fill gaps:
- Human developers: Fill based on experience, intuition, and team norms. If boundaries are fuzzy, humans ask for clarification
- AI: Fill based on probability, training data, and pattern matching. If boundaries are fuzzy, AI “guesses” the most likely implementation
A real case of fuzzy boundaries:
Requirement: “Implement file upload functionality”
Architect A (traditional thinking) designed the upload API, storage solution, and file type validation. But missed one boundary:
- Single file size limit?
- Total storage quota?
- Virus scanning requirements?
AI-generated code implemented “file upload” without size limits. After going live, users uploaded 10GB files causing server crashes. This wasn’t AI’s error — AI completed the task of “implementing file upload,” but the architect didn’t define the boundaries of “what to upload.”
If AI-TDD Manifest defined:
must:
- id: MUST-UPLOAD-FILE-001
text: Support file upload
validation: "Single file max 100MB, supports jpg/png/pdf formats"
outOfScope:
- id: OUT-UPLOAD-VIDEO-001
text: Video file upload not supported
reason: "Not included in this iteration, moved to REQ-VIDEO-001"
evidence:
- id: EVD-UPLOAD-PASS-001
text: Upload 100MB file succeeds
threshold: "Upload time < 30s"
- id: EVD-UPLOAD-REJECT-001
text: Upload 110MB file fails
oracle: "Returns 413 Payload Too Large"
AI would generate implementations under these boundary constraints, uploads over 100MB would be explicitly rejected.
Shift in Architect Responsibilities
Under the AI-TDD paradigm, architect responsibilities shift from “designing optimal solutions” to “defining the clearest possible boundaries”:
| Traditional Responsibility | AI-TDD Responsibility |
|---|---|
| Selecting technology stack | Defining technology constraints (MUST use certain technology) |
| Designing module interfaces | Defining interface contracts (MUST satisfy certain contracts) |
| Writing architecture documents | Generating machine-readable Manifest |
| Reviewing code implementation | Accepting TDD-GREEN evidence |
This doesn’t reduce the architect’s value. It raises the bar for architectural precision: from “advice” to “contract,” and from “documentation” to “executable specification.”
2.2 Essence of human-AI collaboration: why natural language alone is often insufficient as an AI execution contract
AI-TDD uses structured YAML rather than natural language as the main requirement carrier. That is not just a tooling preference. It follows from a more basic observation about how human-AI collaboration actually communicates.
Three Communication Traps of Natural Language
Human communication works with natural language because people share context, experience, and the ability to infer intent. AI does not share those strengths, which makes natural language much riskier as an execution contract.
Trap 1: Amplification of Ambiguity
Natural language ambiguity can usually be resolved through context for humans but is fatal for AI:
Human instruction: "Implement a high-performance cache"
AI possible understandings:
- Use Redis (most common)
- Use Memcached (lightweight)
- Use local memory Map (simplest)
- Use Caffeine (Java ecosystem)
- Use multi-level cache (most "high-performance")
Human expectation: Probably means Redis cluster + local cache multi-level solution
AI implementation: Chose simplest HashMap because instruction didn't explicitly exclude it
Trap 2: Invisibility of Implicit Assumptions
Human communication relies heavily on implicit assumptions. Consider this sentence:
"Implement user login, must be secure"
Human-understood implicit assumptions:
- “Secure” means passwords must be encrypted for storage (bcrypt, not MD5)
- “Secure” means preventing SQL injection
- “Secure” means having anti-brute-force mechanisms (failure count limits)
- “Secure” means using HTTPS
- “Secure” means sessions have expiration times
AI lacks this implicit knowledge. Without explicit definition, AI might:
- Use MD5 for password storage (seen in training data)
- Forget to handle SQL injection
- Have no failure count limits
- Use HTTP instead of HTTPS
Trap 3: Impossibility of Completeness Verification
Natural language requirement documents have a fundamental problem: you cannot automatically verify whether they are “complete.”
Requirements document:
- Users can register via email
- Users can log in via email
- Users can change passwords
Complete? Yes for AI.
But missing:
- Password strength requirements?
- Email verification process?
- Session expiration policy?
- Account deletion functionality?
- Concurrent login handling?
Humans can ask “Are these requirements complete?” but no automated tool can scan natural language documents and answer “The requirement on page 3 paragraph 2 has no corresponding test coverage.”
Four Communication Advantages of Structured Contracts
AI-TDD uses YAML Manifest instead of natural language requirements not because YAML is “cooler” but because it solves the above traps of natural language:
Advantage 1: Elimination of Ambiguity
# Not "high-performance cache" but:
must:
- id: MUST-CACHE-001
text: Use Redis as cache storage
validation: "Redis cluster mode, minimum 3 nodes"
- id: MUST-CACHE-002
text: Local second-level cache
validation: "Caffeine cache, TTL 5 minutes"
outOfScope:
- id: OUT-CACHE-MEMCACHED-001
text: Do not use Memcached
reason: "Team technology stack unified on Redis"
No ambiguity: Not “high-performance” (what is high-performance?) but explicit “Redis cluster + Caffeine second-level cache.”
Advantage 2: Explicitation of Implicit Assumptions
must:
- id: MUST-AUTH-001
text: Password encrypted storage
validation: "Use bcrypt, cost factor >= 12"
# Explicit definition, not implicit assumption
- id: MUST-AUTH-002
text: Prevent SQL injection
validation: "All database operations use parameterized queries"
# Not "be secure" but explicit constraints
outOfScope:
- id: OUT-AUTH-SSO-001
text: SSO single sign-on not supported
reason: "Enterprise feature, moved to REQ-ENTERPRISE-001"
# Explicit exclusion, not omission
Advantage 3: Machine-Verifiable Completeness
must:
- id: MUST-REG-EMAIL-001
text: Users can register via email
evidenceRefs: [EVD-REG-TEST-001]
coveredByTraceRows: [TRACE-REG-001] # ← Must have Trace coverage
traceRows:
- id: TRACE-REG-001
covers: [MUST-REG-EMAIL-001] # ← Reverse declaration of coverage
evidenceRefs: [EVD-REG-TEST-001]
acceptanceRefs: [ACC-REG-001] # ← Must have tests
acceptanceTests:
- id: ACC-REG-001
file: tests/acceptance/registration.test.ts
Gate can automatically verify:
- “Does MUST-REG-EMAIL-001 have Trace coverage?”
- “Does TRACE-REG-001 have EVD support?”
- “Does EVD have CMD verification?”
- “Does ACC-REG-001 test file exist?”
Completeness is not a “feeling” but a machine-checkable fact.
Advantage 4: Precise Definition of Execution Semantics
Natural language tells you “what to do” but not “how to do it” and “to what extent.” Manifest precisely defines execution semantics through multi-layer structure:
# Not just "registration feature" but:
must:
- id: MUST-REG-001
text: Users can register via email
validation: |
1. Email format complies with RFC 5322
2. Password length 8-32 characters
3. Must contain uppercase, lowercase letters and numbers
evidenceRefs: [EVD-REG-001, EVD-REG-002]
coveredByTraceRows: [TRACE-REG-001]
evidence:
- id: EVD-REG-001
text: Registration success evidence
requiredCommandRefs: [CMD-REG-TEST-001]
artifactRefs: [ART-TEST-REPORT-001]
- id: EVD-REG-002
text: Registration failure boundary evidence
requiredCommandRefs: [CMD-REG-TEST-002]
# Negative tests: invalid email, weak password, duplicate registration
Each layer has clear execution semantics, AI won’t “guess” what “valid registration” is — all validation logic is explicitly defined in Manifest.
New Communication Protocol for Human-AI Collaboration
AI-TDD essentially defines a new communication protocol for human-AI collaboration:
| Traditional Communication | AI-TDD Communication |
|---|---|
| Natural language requirement documents | YAML Manifest contracts |
| ”You know what I mean” implicit assumptions | Explicit MUST/NEG/OUT constraints |
| Manual completeness review | Automatic TRACE rows verification |
| ”Probably complete” feeling | TDD-GREEN objective evidence |
| Post-implementation manual acceptance | Implementation Readiness Gate pre-interception |
Key insight:
- Humans excel at: Fuzzy intent, creative thinking, value judgment, boundary trade-offs
- AI excels at: Pattern matching, large-scale generation, consistent execution, rapid iteration
- Manifest’s role: Transform human fuzzy intent into contracts AI can precisely execute
Not replacing human judgment, but precisely transmitting human judgment
AI-TDD’s goal is not to let AI replace human architectural decisions but to let AI precisely execute human architectural decisions.
Natural language suits human-to-human communication because humans have consensus and context. YAML Manifest suits human-AI collaboration because it eliminates ambiguity, makes constraints explicit, and enables verifiable completeness.
This is not a downgrade in communication but an upgrade in precision — from “roughly understood” to “precise contract.”
2.3 Contract as code: Manifest as the bridge between human intent and AI execution
Manifest’s role in AI-TDD can be summarized in one sentence: Contract as Code.
More precisely, Manifest can be treated as the “source code” of human intent, AI-TDD Gate plays a role similar to a “compiler and runtime,” and generated implementation is the execution result.
Intent Decay in Traditional Software Development
In traditional software development, requirements pass from humans to implementation through multiple “translations,” each potentially causing intent decay:
Human intent (Product Manager)
↓ Translate to natural language
Requirements document (PRD)
↓ Translate to technical language
Technical solution (Architecture Doc)
↓ Translate to code
Implementation code (Source Code)
↓ Translate to machine instructions
Executable program (Binary)
Each “translation” introduces noise:
- PM’s “high-performance” understood by architect as “use cache”
- “Use cache” understood by developer as “add Redis”
- “Add Redis” implemented as “single-node Redis” rather than “cluster”
- Final performance is not “high”
Intent Decay Amplified in the AI Era
When AI becomes the implementation agent, the problem worsens:
Human intent
↓ Natural language Prompt
AI understanding (probability model inference)
↓ Generate code
Implementation code
AI’s understanding is not translation but inference — probability inference based on training data. This inference introduces additional noise:
- “High-performance” → AI infers “use most familiar technology”
- “Secure” → AI infers “common security measures”
- “Scalable” → AI infers “microservices architecture” (potentially over-engineered)
Manifest: Eliminate Translation Layers, Directly Encode Intent
AI-TDD’s solution is to eliminate intermediate translation layers, making human intent directly machine-executable contracts:
Human intent (architectural decision)
↓ Direct encoding
Manifest contract (YAML)
↓ AI-TDD Gate parsing and execution
AI generation (under contract constraints)
↓ Verification
TDD-GREEN evidence
Key difference: Manifest doesn’t “describe” intent, it “encodes” intent.
Three Levels of Contract as Code
Level 1: Syntax Layer — Structured Encoding
Manifest uses YAML syntax, not by accident. The choice of YAML is based on these technical considerations:
# Human-readable
must:
- id: MUST-REG-EMAIL-001 # Unique identifier
text: "Users can register via email" # Human intent
validation: "email format complies with RFC 5322" # Machine-verifiable constraint
evidenceRefs: [EVD-REG-TEST-001] # Explicit dependencies
coveredByTraceRows: [TRACE-REG-001] # Bidirectional traceability
- Unique identifier (ID): Each requirement has a globally unique ID, referenceable and traceable
- Human-readable text: Preserves human-understandable descriptions
- Machine-verifiable validation: Precise verification conditions
- Explicit dependencies evidenceRefs: No implicit dependencies, all dependencies explicitly declared
- Bidirectional traceability coveredByTraceRows: Establishes complete requirement↔verification linkage
Level 2: Semantic Layer — Executable Contracts
Manifest isn’t just “documentation,” it has clear execution semantics:
# These fields aren't decorations, they're executable instructions
traceRows:
- id: TRACE-REG-001
covers: [MUST-REG-EMAIL-001, MUST-REG-PASSWORD-001] # ← Executor checks: Are these MUSTs defined?
evidenceRefs: [EVD-REG-TEST-001] # ← Executor checks: Are these EVDs provided?
deliveryEvidenceCommandRefs: [CMD-REG-TEST-001] # ← Executor runs: Execute these commands
acceptanceRefs: [ACC-REG-001] # ← Executor verifies: Do these tests pass?
When AI-TDD Gate parses Manifest:
- Static verification: Check if all references exist (MUST, EVD, CMD)
- Dependency resolution: Build dependency graphs of requirements→evidence→commands→tests
- Execution scheduling: Run commands in Trace order, collect evidence
- State determination: Determine TDD-RED or TDD-GREEN based on evidence
Level 3: Meta-Semantic Layer — Contract Meta-Programming
Manifest’s most powerful feature is self-reference — Manifest can describe its own completeness requirements:
# Manifest describes "what makes a Manifest complete"
implementationConfirmation:
status: user_confirmed
must:
- id: MUST-META-001
text: Manifest must contain Trace coverage for all MUSTs
oracle: "Every MUST has coveredByTraceRows"
- id: MUST-META-002
text: Manifest must contain boundary declarations for all NEG (MUST NOT) items
oracle: "Every NEG is bound to EVD or FAIL paths; OUT OF SCOPE only records scope boundaries"
closeoutReadinessPreview:
requiredCommands: [CMD-VALIDATE-MANIFEST-001]
# Manifest can define "how to validate itself"
This self-reference enables AI-TDD to programmatically verify structural contract completeness. Human review is still required for business semantics, but reference integrity, coverage relationships, and evidence binding can be checked by machines.
Contract as Code vs Code as Contract
Some might ask: Why not use code (test code) directly as the contract? That’s what traditional TDD does.
Key differences:
| Code as Contract (Traditional TDD) | Contract as Code (AI-TDD) |
|---|---|
| Contract is test code | Contract is YAML Manifest |
| Humans read tests to infer requirements | Machines directly parse contracts |
| Requirements implicit in tests | Requirements explicitly declared |
| Completeness cannot be automatically checked | TRACE rows can be automatically verified |
| Negative behavior hard to express | OUT OF SCOPE natively supported |
| State only “pass/fail” | TDD-RED/GREEN explicit states |
Test code is a tool for “verifying implementation,” Manifest is a tool for “encoding intent.” They complement each other: Manifest defines “what should be done,” tests verify “whether it was done.”
Manifest as the Primary Source of Truth for Intent
In AI-TDD, Manifest should be maintained as the primary source of truth for the acceptance contract:
Manifest-Centric Architecture: All participants (humans, AI, gates) get information from the same Manifest
| Participant | How They Use Manifest | Output |
|---|---|---|
| Humans | Read HTML confirmation pages | Decisions and confirmations |
| AI | Parse YAML for execution | Implementation code |
| AI-TDD Gate | Verify execution status | Gate reports and evidence |
Advantage: Version drift between “requirements understood by humans” and “code implemented by AI” becomes easier to reduce and trace. All participants work from the same contract.
Practical Insight: Writing Manifest is Programming
Write Manifest as code:
- Version control: Manifest under Git management, every modification has history
- Code review: Manifest changes need Review, just like code Review
- Automated checking: CI/CD pipelines automatically verify Manifest syntax and semantics
- Refactoring support: When requirements change, Manifest refactoring corresponds to test and implementation refactoring
- Documentation as code: Manifest is self-documenting, no additional “requirement documents” needed
Key insight:
Manifest is not “better requirement documentation” but a new type of programming language — a domain-specific language (DSL) specifically for expressing human intent and machine contracts.
Learning AI-TDD isn’t learning how to “write better documents” but learning how to think and encode in the language of contracts.
Chapter 3: AI-TDD vs existing technical solutions
3.1 From TDD to BDD to AI-TDD: evolution of methodologies
Traditional TDD, BDD, and AI-TDD can be read as three stages in the evolution of software engineering practice. To understand what AI-TDD adds, we need to look at the blind spots that TDD and BDD expose in the AI era. Both methods assume “humans write the tests, and humans write the implementation.” Once AI becomes the implementation agent, that assumption weakens.
Figure 1: This figure answers one question: how do TDD, BDD, and AI-TDD evolve across abstraction levels?
Traditional TDD: test-driven development
Traditional TDD is built on a simple idea: write the test first, then implement. That model works well in a manual coding environment, but it faces structural challenges once implementation is delegated to AI.
Three Implicit Assumptions of Traditional TDD:
- Test writer and implementer are the same person — requirements are consistent in the developer’s mind
- Test code can express complete requirements — correct behavior can be inferred from tests
- Test omissions can be discovered in code review — humans can identify missing test scenarios
In the AI era, all three assumptions fail:
- Test writer is human, implementer is AI — their “understanding of requirements” may be completely different
- AI cannot see the global picture of tests — it only sees isolated test cases, not the constraint relationships between requirements
- AI-generated code may pass existing tests but miss key scenarios — code reviewers struggle to discover AI’s “implicit assumptions”
Traditional TDD vs AI-TDD comparison:
| Dimension | Traditional TDD | AI-TDD |
|---|---|---|
| Requirement carrier | Test code + developer memory | Manifest contract matrix |
| Completeness verification | Depends on manual review | Machine automatic checking (TRACE rows) |
| Boundary definition | Implicit (no test = not done) | Explicit (OUT OF SCOPE must be declared) |
| Negative constraints | Difficult to express | Native support (NEG / MUST NOT) |
| State flow | No explicit state | TDD-RED→TDD-GREEN gate-driven |
| Execution agent | Human developer | AI execution, human confirmation |
| Acceptance criteria | Tests pass | All Manifest EVD verified |
BDD/Gherkin: behavior-driven development
BDD (Behavior-Driven Development) uses natural language to describe behavior in an attempt to narrow the gap between business stakeholders and developers:
Feature: User Registration
Scenario: Valid registration
Given user inputs valid email and password
When submit registration request
Then return 201 Created
And password must be encrypted for storage
BDD’s Core Assumptions vs AI Era Reality:
BDD makes three core assumptions:
- Natural language is clear enough — business people, developers, and testers can understand the same set of descriptions
- Step Definitions are accurate mappings — natural language can be accurately converted to code implementation
- Humans are execution agents — developers manually implement functionality according to BDD descriptions
In an AI-heavy workflow, each of those assumptions becomes less stable:
Challenge 1: Natural Language Ambiguity is Amplified by AI
BDD’s Given/When/Then is clear enough for humans but full of ambiguity for AI:
# Clear to humans, ambiguous to AI
Then password must be encrypted for storage
AI might understand:
- Return encrypted password in response (wrong)
- Store bcrypt hash in database (correct, but what hash strength?)
- Use AES encryption for storage (wrong, reversible encryption)
- Use SHA256 (wrong, unsuitable for passwords)
Challenge 2: BDD’s Traceability is Unidirectional
BDD’s trace chain is Scenario → Step Definitions → Code implementation. That mapping is mostly one-way and lacks strong reverse verification:
- Code reviewers can ask: “Which Scenario does this function correspond to?” — requires manual lookup
- AI cannot achieve: “Given code implementation, automatically check if it satisfies all Scenarios”
- No global contract matrix to verify “whether all requirements are covered”
Challenge 3: BDD Cannot Express Negative Contracts
BDD excels at expressing “what should happen” but not “what is prohibited”:
# Awkward negative expression
Scenario: Should not store plaintext password
Given user registration succeeds
Then plaintext password should not exist in database
# Even more awkward boundary exclusion
Scenario: Does not include social login
Given this is standard registration flow
Then OAuth option should not appear
These “negative” scenarios are awkward in BDD by design. BDD is good at expressing intended behavior, not at declaring hard negative constraints.
BDD/Gherkin vs AI-TDD Gate Manifest Comparison:
| Dimension | BDD/Gherkin | AI-TDD Gate Manifest |
|---|---|---|
| Format | Natural language (Given/When/Then) | Structured YAML (machine-readable) |
| Abstraction layer | Behavior description | Requirement contract matrix |
| Completeness verification | Depends on manual Scenario coverage review | Machine automatic checking (TRACE rows) |
| Traceability | Weak (Scenario→Step Definitions) | Strong (5D trace matrix) |
| State management | No built-in state machine | TDD-RED/TDD-GREEN explicit state flow |
| Gate control | None | Implementation Readiness Gate + Delivery Closeout Gate |
| Evidence requirements | Tests pass | Must associate EVD, ART, CMD |
| Negative constraints | Difficult to express (Given not… awkward) | Native support (NEG / MUST NOT + OUT OF SCOPE) |
| Execution constraints | Loose (Steps can be omitted) | Strict (Traces must cover all MUSTs/NEGs) |
Key differences explained:
1. Machine readability vs human readability
BDD prioritizes human readability:
# Human-friendly, machine difficult to precisely parse
Then password must be encrypted for storage and comply with security standards
AI-TDD prioritizes machine readability:
# Machine-readable, semantically precise
must:
- id: MUST-REG-PASSWORD-001
text: Password must be encrypted for storage
validation: "Database stores bcrypt hash, not plaintext, cost factor >= 12"
evidenceRefs: [EVD-SEC-PASSWORD-001]
When AI becomes the implementation agent, machine readability matters more. AI needs precise constraints, not descriptions that depend on human interpretation.
2. Contract Completeness Mechanism
BDD has no “is the contract complete” checking mechanism:
- You can write 100 Scenarios and still miss critical negative constraints
- No machine can ask: “Which requirements have no Scenario coverage?”
AI-TDD’s TRACE rows enforce bidirectional traceability:
must:
- id: MUST-REG-EMAIL-001
text: Accept valid email and password
evidenceRefs: [EVD-REG-TEST-001]
coveredByTraceRows: [TRACE-REG-001] # Must have Trace coverage
traceRows:
- id: TRACE-REG-001
covers: [MUST-REG-EMAIL-001] # Reverse declaration of coverage
evidenceRefs: [EVD-REG-TEST-001]
deliveryEvidenceCommandRefs: [CMD-REG-TEST-001]
acceptanceRefs: [ACC-REG-001]
Gate can automatically verify: “Does MUST-REG-EMAIL-001 have Trace coverage? Does TRACE-REG-001 have EVD support?”
3. First-Class Status of Negative Behavior
Negative behaviors can only be awkwardly expressed in BDD. In AI-TDD, negative behavior is core to the contract:
mustNot:
- id: NEG-SEC-PLAIN-PASSWORD-001
text: Prohibit storing plaintext password
evidenceRefs: [EVD-SEC-PASSWORD-001]
oracle: negative control oracle
outOfScope:
- id: OUT-SEC-STRESS-001
text: SQL-injection stress testing is outside this delivery slice
reason: Moved to the dedicated security task
edgeCases:
- id: EDGE-REG-CONCURRENCY-001
category: explicit_error_case_mapping
condition: Duplicate email during concurrent registration
expectedBehavior: Fail closed
forbiddenBehavior: Silently overwrite existing account
Why must negative behavior be explicitly defined in the AI era?
Because AI “creatively” fills requirement gaps. If you don’t explicitly say “prohibit plaintext storage,” AI might choose the simplest implementation (plaintext storage) to make functionality “work quickly.” Negative contracts are boundary fences preventing AI from over-freedom.
4. TDD-RED/GREEN State Machine
BDD has no state concept — Scenarios either pass or fail. AI-TDD has explicit state flow:
acceptanceTests:
- id: ACC-REG-001
file: tests/acceptance/user-registration.test.ts
covers: [MUST-REG-EMAIL-001]
expectedPreImplementationState: expected_red # Entry gate must be TDD-RED
commandRefs: [CMD-REG-TEST-001]
Value of TDD-RED state: During implementation readiness phase, tests must all be red. This proves:
- Tests are actually running (not fake tests)
- Implementation is actually missing (not pre-implemented)
- Manifest is actually being used for verification
BDD cannot distinguish “tests not yet written” from “tests written but implementation missing” — both are “failures” in BDD but fundamentally different states in AI-TDD.
5. Evidence Chain Completeness
BDD’s “pass” is binary: test pass = complete. AI-TDD’s “pass” is a multi-layer evidence chain:
Five-Dimension Evidence Chain Structure (core five layers):
| Level | Element | Role |
|---|---|---|
| L1 | MUST (Requirement) | Business commitment |
| L2 | TRACE (Execution Slice) | Execution tracking |
| L3 | EVD (Evidence Definition) | Verification criteria |
| L4 | CMD (Verification Command) | Reproducible execution |
| L5 | ART (Deliverable) | Physical output |
The complete evidence chain also includes: ACC (automated verification), EXIT CODE (objective result), AUDIT RECEIPT (audit traceability).
Each link can be independently audited and verified. BDD Scenario passes can lose credibility when over-mocked or over-stubbed; AI-TDD evidence chains raise the cost of fabrication through commands, artifacts, and audit records, but still require review discipline.
AI-TDD: contract-driven development
Traditional TDD and BDD are “description-driven” — describe what should happen, then expect implementation to match the description. AI-TDD is “contract-driven” — define complete contract matrices before execution, then force implementation to satisfy the contract.
Evolutionary Relationship of Three Methods:
| Era | Methodology | Core Characteristics | Abstraction Layer Elevation |
|---|---|---|---|
| 2000s | Traditional TDD | Test first, red then green | Baseline layer |
| 2010s | BDD | Natural language behavior description, business participation | Abstraction layer elevated |
| 2020s | AI-TDD | Structured contract definition, AI execution | Abstraction layer elevated again + machine-readable constraints |
Evolution Path: Traditional TDD (test-driven) → BDD (behavior-driven) → AI-TDD (contract-driven)
Core Problems Solved by AI-TDD:
-
Solving “test coverage ≠ requirement coverage”
- BDD: 100 Scenarios pass, may still miss critical constraints
- AI-TDD: TRACE rows force every MUST/NEG to have coverage, gate automatically checks
-
Solving “AI understanding deviation”
- BDD: Natural language ambiguity amplified by AI
- AI-TDD: Machine-readable YAML eliminates ambiguity, validation field precisely constrains
-
Solving “negative behavior difficult to express”
- BDD: “Should not” can only be awkwardly expressed
- AI-TDD: NEG / MUST NOT and OUT OF SCOPE are first-class citizens with dedicated verification strategies
-
Solving “incomplete evidence chain”
- BDD: Test pass = complete
- AI-TDD: Five-dimension trace matrix, each layer auditable
-
Solving “missing state management”
- BDD: No state concept
- AI-TDD: TDD-RED→TDD-GREEN explicit state flow, dual gate control
Migration Recommendations:
If you already have TDD or BDD practices, the path to migrate to AI-TDD:
- Keep existing tests/Scenarios as human communication tools — but don’t treat them as contracts
- Add AI-TDD Manifest on top — define machine-readable contracts with YAML
- Extract MUST/NEG from tests/Scenarios — ask yourself: “Which MUST does this correspond to? Which NEGs are missing?”
- Supplement missing OUT OF SCOPEs — explicitly declare features not included in this iteration
- Add Trace coverage — ensure every requirement has corresponding acceptance tests and evidence
Key insight:
- Traditional TDD test code is a verification tool
- BDD Scenarios are a communication tool
- AI-TDD Manifest is the contract carrier defined in this article
- All three can coexist: Scenarios for human communication, tests for verification, Manifest for machine execution
3.2 Migration example: recommended process from TDD to AI-TDD
If you already have TDD practices, we recommend migrating to AI-TDD in the following steps:
Phase 1: Parallel Operation
- Keep existing TDD tests, continue running them
- Select 1 new feature, try writing AI-TDD Manifest
- Compare differences between the two approaches: “test code” vs “contract matrix”
Phase 2: Problem Discovery
- Focus on negative scenarios not covered by existing tests
- Document “requirement drift” cases in AI-generated code
- Identify which NEG (MUST NOT) items should be explicitly defined in Manifest
Phase 3: Gradual Switch
- Prioritize AI-TDD for new features
- Supplement Manifest for legacy features during refactoring
- Establish internal Manifest templates and best practices
Common Migration Pitfalls:
- ❌ Pitfall 1: Rewrite all tests at once → ✅ Should run in parallel, gradual switch
- ❌ Pitfall 2: Delete existing tests → ✅ Keep TDD tests as regression tests
- ❌ Pitfall 3: Manifest too detailed → ✅ Focus on boundary definition, not implementation details
Chapter 4: AI-TDD Gate Manifest detailed explanation
4.0 Manifest schema specification (v1.0)
Before looking at specific examples, we need to clarify Manifest’s Schema specifications:
Field Naming Conventions:
- Use camelCase (e.g.,
mustNot,outOfScope,traceRows) - Top-level container:
manifest→acceptanceCriteria - Core dimensions:
must,mustNot(holdsNEG-*),outOfScope(holdsOUT-*),evidence(holdsEVD-*), andtraceRows(holdsTRACE-*) - Status fields:
gateStatus→implementationReadiness/closeout
Version Notes:
version: "1.0.0"follows SemVer specification- Major: Requirement scope changes (MUST/OUT OF SCOPE additions/deletions)
- Minor: Verification condition refinement (validation modifications)
- Patch: Text description optimization
4.1 Core structure of Manifest
Figure 2: This figure answers one question: what contract layers make up a Manifest?
File: ai-tdd-manifest.yaml
manifest:
version: "1.0.0"
project: "user-service"
requirementId: "REQ-001"
title: "User Registration"
# Requirement contract matrix
acceptanceCriteria:
# MUST: Conditions that must be satisfied
must:
- id: "MUST-REQ-001"
description: "Accept valid email and password"
validation: "email format complies with RFC 5322, password length 8-32 characters"
test: "test_valid_registration"
priority: "P0"
- id: "MUST-REQ-002"
description: "Reject existing email"
validation: "Query database, duplicate email returns 409 Conflict"
test: "test_duplicate_email"
priority: "P0"
- id: "MUST-REQ-003"
description: "Password must be encrypted for storage"
validation: "Database stores bcrypt hash, not plaintext"
test: "test_password_hashing"
priority: "P0"
# NEG / MUST NOT: Blocking negative assertions
mustNot:
- id: "NEG-SEC-001"
description: "Prohibit storing plaintext password"
validation: "Database fields do not contain plaintext password"
test: "test_no_plaintext_storage"
- id: "NEG-SEC-002"
description: "Prohibit SQL injection"
validation: "All database operations use parameterized queries"
test: "test_sql_injection_prevention"
# OUT OF SCOPE: Current-iteration scope boundary, not a completion blocker
outOfScope:
- id: "OUT-AUTH-001"
description: "Social login (OAuth)"
reason: "Not included in this iteration, moved to REQ-010"
- id: "OUT-PHONE-001"
description: "Phone verification code registration"
reason: "Not included in this iteration, moved to REQ-011"
NEG vs OUT OF SCOPE Semantic Distinction Guide:
mustNot / NEG (Prohibited):
- Scenario 1: Absolute prohibitions at security/compliance level (never allowed even in future iterations)
- Scenario 2: Technical constraint-induced prohibitions (e.g., “prohibit storing passwords in plaintext”)
- Characteristic: Has validation conditions, can be violated; violation blocks completion
outOfScope / OUT (Scope boundary):
- Scenario: Explicit exclusion at functional level (not implemented in current iteration but possible in future)
- Characteristic: Has reason explanation and follow-up plans (moved to REQ-XXX)
- Note:
OUT-*is a scope boundary, not a completion blocker; it should not appear incoversSimple judgment: If this feature “might be done later” → use outOfScope; if “should never be done” → use mustNot.
The same Manifest also needs to explicitly write evidence, trace matrix, and gate status:
manifest:
acceptanceCriteria:
# EVD: Assertions requiring evidence
evidence:
- id: "EVD-TEST-001"
type: "test_coverage"
description: "Code coverage >= 80%"
threshold: "80%"
artifact: "coverage_report.html"
- id: "EVD-SEC-001"
type: "security_scan"
description: "No high-severity security vulnerabilities"
tool: "bandit"
artifact: "security_scan_report.json"
- id: "EVD-PERF-001"
type: "performance_test"
description: "Registration API response time < 200ms (P99)"
threshold: "200ms"
artifact: "perf_test_report.html"
# TRACE rows: executable slices across requirements, tasks, evidence, commands, acceptance, and artifacts
traceRows:
- id: "TRACE-REG-001"
covers: ["MUST-REQ-001", "NEG-SEC-001"]
taskRefs: ["TASK-REG-001"]
evidenceRefs: ["EVD-TEST-001", "EVD-SEC-001"]
contractValidationCommandRefs: ["CMD-REG-TEST-001"]
acceptanceRefs: ["ACC-REG-001", "E2E-REG-001"]
artifactRefs: ["ART-REG-REPORT-001"]
status: "pending"
# Failure paths: required failure scenarios attached to NEG items
failurePaths:
- id: "FAIL-SEC-001"
covers: ["NEG-SEC-001"]
expectedFailure: "Plaintext password write attempt is blocked by tests"
evidenceRefs: ["EVD-SEC-001"]
# Edge cases: implementation boundaries that do not expand business scope
edgeCases:
- id: "EDGE-REG-001"
covers: ["MUST-REQ-001"]
case: "Email uniqueness still holds after case normalization"
evidenceRefs: ["EVD-TEST-001"]
tasks:
- id: "TASK-REG-001"
description: "Implement the email-registration path and negative security constraints"
# End-to-end test suites
e2eTestSuites:
- name: "user_registration_flow"
path: "tests/e2e/user_registration.test.js"
scenarios: ["happy_path", "duplicate_email", "invalid_input"]
# Acceptance test suites
accTestSuites:
- name: "user_registration_acc"
path: "tests/acc/user_registration.test.py"
criteria: "all_pass"
# Gate status
gateStatus:
implementationReadiness:
status: "TDD-RED"
lastRun: "2026-05-27T10:00:00Z"
summary:
total: 15
passed: 0
failed: 15
pending: 0
closeout:
status: "TDD-GREEN"
closeoutAttemptId: "closeout-20260527-160000"
lastRun: "2026-05-27T16:00:00Z"
summary:
total: 15
passed: 15
failed: 0
pending: 0
deliveryEvidence:
closeoutAttemptId: "closeout-20260527-160000"
manifestHash: "sha256:..."
sourceSnapshotHash: "sha256:..."
commandRunRefs: ["RUN-REG-TEST-001", "RUN-SEC-SCAN-001"]
artifactRefs:
- id: "ART-REG-REPORT-001"
sha256: "sha256:..."
gateVerdict:
status: "pass"
evaluatedAt: "2026-05-27T16:02:00Z"
receiptArtifactRef: "ART-CLOSEOUT-REPORT-001"
humanDecision:
decision: "accept"
actor: "tech-lead"
decidedAt: "2026-05-27T16:05:00Z"
receiptArtifactRef: "ART-HUMAN-DECISION-001"
Figure 3: This figure answers one question: how do requirement, execution, evidence, command, and artifact form a trace chain?
4.2 Manifest generation process
Manifest is not written in one go but generated through atomic decomposition iteration.
Iteration 1: Requirement Draft Human provides: “Implement user registration feature” AI generates: Manifest draft (containing basic MUST items) Human reviews: Confirm/modify/supplement
Iteration 2: Boundary Clarification Human supplements: “Need email verification, password strength check” AI updates: Adds MUST items, generates OUT OF SCOPE list Human reviews: Confirms exclusions
Iteration 3: Evidence Definition Human supplements: “Need 80% code coverage” AI updates: Adds EVD items Human reviews: Confirms thresholds
Iteration 4: Trace Establishment AI generates: TRACE rows draft Human reviews: Confirms requirement↔test↔evidence mapping relationships
Iteration 5: Final Confirmation Human reads: Requirement confirmation page HTML Human decision: Confirm Manifest complete, proceed to next stage
Key Principles:
- Manifest must reach “complete” state before execution begins
- Definition of “complete”: MUST/NEG/OUT/EVD/TRACE rows all defined, no omissions
- Incomplete Manifest not allowed to pass Implementation Readiness Gate
4.3 Machine readability of Manifest
Manifest must be machine-readable, meaning:
1. Structured Format
- Use YAML or JSON
- Have strict Schema definitions
- Can be parsed and validated by programs
2. Clear Field Semantics
- Each field has clear types and constraints
- Support automated validation (e.g., JSON Schema validation)
3. Executable References
- MUST items associated with specific test file paths
- EVD items associated with specific verification tools
- TRACE rows can automatically check coverage
4. Queryable State
- Can programmatically query “which MUST items not yet verified”
- Can programmatically query “TRACE rows coverage”
- Can programmatically determine “whether entry allowed” or “whether delivery allowed”
Chapter 5: Six mental models of AI-TDD
Figure 4: AI-TDD six mental models and gate flow, showing the two gates and their expected statuses: Entry Gate maps to
AI-TDD-RED, and Delivery Gate maps to AI-TDD-GREEN.
The six mental models form AI-TDD’s cognitive abstraction layer. Together, they describe the six recurring thought patterns in human-AI collaboration.
5.1 Requirement confirmation
This is the first implementation link of the “problem boundaries precede solutions” principle. The sole goal of requirement confirmation is to clearly define the business boundaries of the problem.
Cognitive core: transformation from fuzzy intent to machine-readable contract
Core idea: use iterative atomic decomposition during requirement confirmation to produce a complete AI-TDD Gate Manifest, which becomes the requirement contract matrix.
Key insights:
- AI can assist in generating Manifest, but humans have final authority over requirement definition
- Manifest should be sufficiently complete before execution begins; otherwise, omissions and drift become more likely
- Manifest is not natural language documentation but machine-readable structured contracts
Key actions:
-
Atomic Requirement Decomposition
Decompose “implement user registration feature” into:
- MUST-REG-INPUT-001: Accept email and password inputs
- MUST-REG-EMAIL-FORMAT-001: Verify email format is valid
- MUST-REG-PASSWORD-LENGTH-001: Verify password length ≥ 8
- MUST-REG-UNIQUE-001: Reject existing email
- MUST-REG-PASSWORD-HASH-001: Password must be encrypted for storage
- NEG-SEC-PLAIN-PASSWORD-001: Prohibit storing plaintext password
- NEG-SEC-SQL-INJECTION-001: Prohibit SQL injection
- OUT-AUTH-SOCIAL-001: Social login (moved to REQ-010)
- OUT-PHONE-VERIFY-001: Phone verification (moved to REQ-011)
-
Define EVD (Verifiable Evidence)
- EVD-TEST-COV-001: Code coverage >= 80%
- EVD-SEC-SCAN-001: No high-severity security vulnerabilities
- EVD-PERF-REG-001: Registration API response time < 200ms (P99)
-
Establish TRACE rows (Traceability Matrix)
Build complete requirement↔test↔evidence traceability so every MUST item has both test coverage and evidence support.
-
Generate Requirement Confirmation Page (HTML)
Generate human-readable requirement confirmation pages containing:
- Business goals and background
- Complete Manifest preview
- MUST/NEG/OUT checklists
- EVD requirements
- TRACE rows coverage
-
Human Confirmation and Decision
⚠️ Human-in-the-loop critical link:
- Humans read requirement confirmation page HTML
- Check Manifest completeness
- Make decision: “Manifest is complete, can proceed to next stage”
Exit criteria:
- Manifest completely generated (MUST/NEG/OUT/EVD/TRACE rows all defined)
- Requirement confirmation page HTML generated
- Human decision passed
5.2 Architecture confirmation
On the basis of clear business boundaries, further define technical boundaries. Clarify technical solutions and interface contracts, demarcating system boundaries with the external world.
Cognitive core: mapping from problem space to solution space
Core idea: within Manifest constraints, define technical solutions and interface contracts, then generate an architecture confirmation page for human decision-making.
Key insights:
- Manifest defines “what to do,” architecture confirmation defines “how to do it”
- AI can provide multiple architecture options, but humans have architectural decision authority
Key actions:
-
Technical Solution Evaluation
- AI provides 2-3 alternative architecture options
- Compare pros/cons, risks, resource requirements
-
Define Interface Contracts
- API interface definitions (OpenAPI/Swagger)
- Data model definitions (JSON Schema)
- Error code conventions
-
Generate Architecture Confirmation Page (HTML)
⚠️ Human-in-the-loop critical link:
- Recommended architecture solution and rationale
- Alternative solution comparison
- Risk assessment
- Interface contract preview
-
Human Confirmation and Decision
Technical decision-makers read architecture confirmation page HTML, make decisions on which solution to adopt.
Exit criteria: technical solution confirmed + architecture confirmation page HTML generated + human decision passed + interface contracts defined
5.3 Implementation readiness
Verify whether boundaries can be covered by tests. Based on defined boundaries (Manifest), generate acceptance test baselines that sufficiently cover the current scope, confirming boundaries are verifiable.
Cognitive core: state transition from contract definition to execution readiness
Core idea: generate acceptance-test baselines from the completed Manifest, then run the AI-TDD Gate to establish the TDD-RED entry state.
Key insights:
- Tests are not “written while generating” but generated from the confirmed Manifest scope
- TDD-RED state for newly generated or linked acceptance tests is a necessary condition for entry
Key actions:
-
Generate Test Code Based on Manifest
Not manually writing tests, but AI-TDD Gate automatically parsing Manifest to generate:
- Unit tests (covering MUST items)
- Integration tests (covering interface contracts)
- E2E tests (covering end-to-end scenarios)
- Acceptance tests (covering ACC items)
-
Register Tests to AI-TDD Gate
All tests must be registered in AI-TDD Gate, establishing associations with Manifest’s TRACE rows.
-
Run AI-TDD Gate (Entry Run)
First run of AI-TDD Gate, at this point:
- Test code has been generated
- Implementation code not yet generated
- New or associated acceptance tests should be in expected failure state (TDD-RED)
-
Confirm TDD-RED State
Checkpoints:
- Manifest completeness check ✓
- Test registration completeness check ✓
- TDD-RED state confirmation ✓
Exit Criteria: AI-TDD Gate status is TDD-RED for the current acceptance scope
Concept Clarification: TDD-RED here refers to “first run of Implementation Readiness Gate, verifying test baselines completely generated but implementation not yet started.” This is the prerequisite for entering the next stage (Execution Closure), not the final delivery standard.
Sanitized Example: Stale Attempt Detection Mechanism
The following snippet comes from a sanitized closeout runner requirement contract, illustrating how to prevent reuse of historical success records:
- id: FAIL-RUNNER-STALE-ATTEMPT-001
title: stale_attempt_reused
trigger: >-
deliveryEvidence.requiredCommands[].lastRunRef.closeoutAttemptId
inconsistent with current attempt
expectedBehavior: >-
AI TDD closeout must report stale_attempt or
current_attempt_command_missing
forbiddenBehavior: Historical success satisfies current closeout
This failure path defines a key safeguard: Each closeout attempt has a unique ID; historical success cannot prove current iteration. This prevents the classic problem of “passing off old test reports” — even if previous tests all passed, as long as the current attempt’s evidence chain is incomplete, re-verification is required.
5.4 Execution closure (Bounded Packet Closure)
AI generates implementation under boundary constraints, strictly prohibited from crossing boundaries. Any implementation deviating from boundary definitions is considered a failure.
Cognitive core: bounded convergence under Manifest constraints
Core idea: AI generates implementation within Manifest constraints and iterates until registered validations pass, producing a closeout candidate without drifting outside the agreed boundary.
Key insights:
- AI generation must be constrained by Manifest, cannot improvise
- Iteration goal is “make all acceptance items defined in Manifest pass”
Key actions:
-
Use Manifest as Prompt Context
Prompt given to AI contains:
- Full Manifest text
- Current list of failing tests
- Failure reasons
-
AI Generates Implementation
AI uses the Manifest as the global requirement context and generates implementation code within that frame.
-
Run AI-TDD Gate (Iteration Run)
Run all tests to get current status (TDD-RED or partial GREEN).
-
Iterative Correction
If tests fail:
- Feed failure information back to AI
- AI regenerates corrected code
- Run AI-TDD Gate again
- Until registered validations pass and a closeout candidate is produced
Exit criteria: the candidate implementation satisfies Manifest-linked tests and enters closeout candidate status. Final AI-TDD-GREEN still requires evidence-chain closure, Gate Verdict, and Human Decision during delivery closeout.
Sanitized Example: Fail-Fast Execution Strategy and Evidence Atomicity
The following snippet comes from the same class of sanitized requirement contracts, showing how AI-TDD handles command execution failures:
- id: MUST-RUNNER-FAILFAST-001
text: >-
Dynamic runner must record successful required commands as current attempt's
artifact-bound deliveryEvidence.requiredCommands[]; when any required command
first fails, must immediately stop remaining required commands, write failed
evidence packet before returning, and return non-zero.
evidenceRefs:
- EVD-RUNNER-SUCCESS-001
- EVD-RUNNER-FAILURE-001
coveredByTraceRows:
- TRACE-RUNNER-FAILFAST-001
Note the key design:
coveredByTraceRows: Every MUST must be covered by at least one traceevidenceRefs: Every trace must produce verifiable evidence- Fail-Fast: Stop immediately on first command failure, preventing “partial success” false states
- Artifact-Bound: Evidence must be bound to artifacts (file hashes), not just exit codes
This contrasts sharply with the traditional “write code first, then add tests” reverse flow — in AI-TDD, requirements and evidence are defined before implementation; implementation must converge to predefined verification standards.
5.5 Audit review
Verify whether implementation strictly follows boundary definitions. Multi-Agent critical audit to avoid execution agent self-checking rapid self-consistency, ensuring boundaries were not drifted during execution.
Cognitive core: multi-agent critical audit, avoiding execution agent self-checking rapid self-consistency
Core idea: introduce independent audit agents to review execution results critically and reduce the blind spots of a single execution agent.
Key insights:
- Execution Agents easily fall into “rapid self-consistency” during iteration
- Audit Agents are independent from execution Agents, providing critical perspectives
Key actions:
-
Multi-Agent Cross Audit
- Audit Agents independently review execution results
- Check whether NEG (MUST NOT) items defined in Manifest are truly not violated
- Check architecture consistency
-
Findings Recording and RCA
- Record audit-discovered issues (Findings)
- Conduct root cause analysis (RCA)
-
Scoring and Improvement Recommendations
- Score execution quality
- Generate improvement recommendations
Exit Criteria: Audit Agents have no blocking Findings, or all Findings have been corrected and passed re-verification
Production Case: Negative Verification and “What Must Fail”
AI-TDD not only defines “what should be done” (MUST) but more importantly defines “what must fail under what circumstances” (NEG + FAILURE PATH). The following snippet demonstrates this core idea:
# Negative assertion: Define what is insufficient to constitute completion proof
- id: NEG-EVD-EXITCODE-001
text: Exit code only proof shall not satisfy deliveryEvidence.requiredCommands[] or closeout
whyItBlocksCompletion: >-
Successful process missing artifact-bound proof is not current attempt delivery evidence
# Corresponding failure path: Define specific trigger conditions and expected behaviors
- id: FAIL-RUNNER-CMD-001
title: required_command_failed
trigger: Some required command returns non-zero
expectedBehavior: >-
Runner must immediately stop remaining required-command execution, write failed evidence packet
before returning that does not go through successful implementation evidence ingest path for failed command
forbiddenBehavior: >-
Runner continues executing remaining required commands, omits failed evidence packet,
submits failed packet through implementation_evidence_ingested
The profound aspect of this design: It not only verifies “success” but strictly defines “failure.”
In traditional development, we often say “tests passed.” But in AI-TDD, we must ask: “Under what circumstances do tests not count even if they pass?”
- Only exit code? Doesn’t count
- No artifact binding? Doesn’t count
- Historical run results? Doesn’t count
- Current attempt mismatch? Doesn’t count
The essence of audit review is verifying whether the boundary of “what must fail” is strictly enforced.
5.6 Delivery closeout
Final verification of whether boundaries have been completely implemented. Full regression testing confirms all requirements within boundaries have been achieved, no drift.
Cognitive core: full regression verification and formal closure
Core idea: Re-run AI-TDD Gate full tests to confirm no drift occurred during execution, generate delivery confirmation page, and formally complete delivery closeout.
Key insights:
- Delivery confirmation is not simple “package output” but full regression verification
- Must re-run all acceptance tests registered in Manifest
Key actions:
-
Re-run AI-TDD Gate full tests
⚠️ Key distinction:
- Implementation Readiness Gate: First run, expected TDD-RED
- Delivery Closeout Gate: Re-run, must be TDD-GREEN
Ensure:
- Tests passed during execution closure phase still pass (no regression)
- All acceptance items in Manifest have status “VERIFIED”
-
Generate delivery closeout page (HTML)
Generate human-readable delivery confirmation pages containing:
- Complete Manifest status (all VERIFIED)
- Test result summary (all green)
- Code quality metrics
- Audit records and Findings
-
Human final review
Humans read delivery confirmation page HTML for final gatekeeping.
-
Formal delivery closeout
After the Human Decision is recorded as accept, mark the task formally complete and archive all project assets.
Exit Criteria: current-attempt evidence chain closed + Gate Verdict is pass + Human Decision is recorded as accept + delivery confirmation page generated
Production Case: Gate Verdict, Oracle, Artifact, and Hash Chain Verification
The following snippet shows AI-TDD delivery confirmation’s core verification structure:
# Evidence definition: What constitutes "deliverable"
- id: EVD-CLOSEOUT-GATE-001
text: Only updated records passing the team's custom closeout gate are allowed closeout pass
gate: vitest
oracle: Runner returns 0 only when closeout report decision=pass and closeoutReadinessReport.ready=true
requiredCommandRefs:
- CMD-TEST-CLOSEOUT-GATES
artifactRefs:
- ART-AI-TDD-CLOSEOUT-REPORT
Coordinated with the triple hash verification mechanism at the top of the document:
# Source document hash: Ensures requirements not tampered
sourceDocumentHash: sha256:51f17f2172e951599153bbb744877386cd582f33e393bc597c6ea926e143a878
# Implementation confirmation hash: Ensures scope confirmed
implementationConfirmationHash: sha256:c1ca24b3b7bc63cfc5d8eee55b72e510cdbcb03c8cb4ceee9bc3806b15c537d3
# Confirmation page hash: Ensures rendering result consistent
confirmationPageHash: sha256:f00fbd4f6610bdf5ef0f87dbd9cfe0f1b69ecb2777e70d1a57021e2f1f75dd0f
Hash Calculation Notes (avoiding circular dependencies):
sourceDocumentHashcalculation method:
- SHA256 on ai-tdd-manifest.yaml file content
- Note: Set this field value to empty string or placeholder when calculating to avoid circular dependency
- Storage location: Recommended in separate
.manifest.lockfile or Git tag
implementationConfirmationHashcalculation method:
- SHA256 on requirement confirmation phase generated HTML confirmation page
- Purpose: Verify confirmation page not tampered
confirmationPageHashcalculation method:
- SHA256 on delivery confirmation page
- Purpose: Verify final deliverable integrity
The essence of this design is the verdict chain (Evidence → Gate → Oracle → Artifact):
- Gate: the team’s custom closeout gate is the final arbiter. The gate name in this example is project-specific pseudocode, not a public CLI command guaranteed to exist
- Oracle:
decision=pass && ready=trueis the clear pass criterion - Artifact:
ART-AI-TDD-CLOSEOUT-REPORTis an auditable evidence file
Significance of hash chain:
- If source document changes →
sourceDocumentHashchanges → must re-confirm - If implementation scope changes →
implementationConfirmationHashchanges → must re-verify - If rendering result tampered →
confirmationPageHashchanges → must regenerate
The essence of delivery confirmation is verifying the integrity and consistency of the entire evidence chain.
Chapter 6: Two key gates
How this chapter maps to Chapter 5: of the six cognitive stages introduced in Chapter 5, the end of stage 3 (Implementation Readiness) maps to Section 6.1, “Implementation Readiness Gate,” and the end of stage 6 (Delivery closeout) maps to Section 6.2, “Delivery Closeout Gate.” These two gates are the key control points that keep state transitions governed.
6.1 Gate 1: Implementation Readiness Gate
Figure 5: AI-TDD state machine, from MANIFEST_DRAFT →
AI-TDD-RED → IMPLEMENTING → CLOSEOUT_CANDIDATE → AI-TDD-GREEN → CLOSED.
Position: After “Implementation Readiness,” before “Execution Closure”
Core Function: Ensure “Manifest is complete, test baselines established” before allowing AI to begin generating implementation
Entry Criteria:
- Requirement confirmation completed (Manifest completely generated)
- Architecture confirmation completed (technical solution and interface contracts defined)
- Implementation readiness completed (test code based on Manifest generated)
Exit Criteria: New or associated acceptance tests should enter expected TDD-RED state
Precise Definition of TDD-RED:
Necessary Conditions:
- Test code based on Manifest has been registered to AI-TDD Gate
- Implementation code not yet generated or empty (no pre-implementation exists)
- New or associated acceptance tests return expected failures (not passes)
Exclusion Conditions (following cases do not count as TDD-RED):
- Test failure due to test code errors → Should be considered
BLOCKEDstate, needs test fix- Test failure due to environment configuration issues → Should be considered
ERRORstate, needs environment fix- Test failure due to unavailable external dependencies → Should be considered
DEFERREDstate, needs dependency waitKey Principle: TDD-RED is “expected failure,” not “erroneous failure.”
-
Manifest Completeness Check
- All MUST items defined and associated with tests
- All NEG (MUST NOT) items defined
- All OUT OF SCOPE items explicitly excluded
- All EVD items defined with thresholds
- TRACE rows established for the acceptance items declared in the current Manifest
-
AI-TDD Gate Registration Check
- All acceptance items registered to AI-TDD Gate
- E2E TEST SUITES registered
- ACC TEST SUITES registered
-
TDD-RED State Confirmation
- Run AI-TDD Gate full tests
- New or associated acceptance tests should be in expected failure state (TDD-RED)
- If new acceptance tests already pass, need to confirm whether pre-implementation exists, tests are invalid, or acceptance items are duplicated
Gate Status:
- 🔴 TDD-RED: Expected state, allowed to enter execution closure
- 🔴 BLOCKED: Manifest incomplete, entry prohibited
Failure Handling:
- Manifest incomplete → Return to requirement confirmation for re-iteration
- Pre-implementation exists → Clean pre-implementation code
6.2 Gate 2: Delivery Closeout Gate
Position: After “Audit Review,” before formal delivery
Core Function: Ensure all Manifest contract slices are proven by the current attempt’s evidence chain, with no drift, no scope creep, and no stale evidence reuse
Entry Criteria:
- Execution closure completed (candidate implementation satisfies Manifest-linked tests)
- Audit review completed (no blocking Findings)
- Delivery preparation completed (confirmation page, command output, evidence artifacts, and hash snapshots prepared)
- This closeout evaluation has a unique
closeoutAttemptId - Manifest/source hash, implementation snapshot hash, required command run refs, and artifact refs are recorded in the candidate delivery record
Exit Criteria: All Manifest-linked TRACE rows reach evidence-chain closure, scope audits cover every OUT, Gate Verdict is pass, and Human Decision is accept before the Gate can return AI-TDD-GREEN
Core Checkpoints:
-
Manifest Acceptance Status Check
- All MUST items status is “VERIFIED”
- All NEG (MUST NOT) items verified as not violated
- All OUT OF SCOPE items pass scope audit, confirming implementation did not overstep
- All TRACE rows cover the current Manifest’s MUST/NEG items, and scope audit covers OUT items
-
Multi-dimensional evidence-chain check
- Every TRACE has corresponding EVD
- Every EVD has CMD runs from the current attempt
- Every CMD has auditable ART, such as test reports, security scan reports, scope diffs, or closeout reports
- Critical artifact hashes match the delivery confirmation page
-
Registered validation rerun check, necessary but insufficient
- Re-run all acceptance tests registered in Manifest
- Manifest-associated acceptance tests must pass as one input to the evidence chain
- If any test fails, delivery prohibited
- Test pass alone cannot produce
AI-TDD-GREEN; evidence, artifacts, hashes, anti-stale checks, and Human Decision must also pass
-
Anti-forgery checks
- Old runs, old reports, and old screenshots cannot prove the current attempt
- Tests that are not bound to Manifest TRACE rows do not count as acceptance evidence
- Exit code alone is not proof of completion without ART, hash, or receipt
- Human confirmation must leave a decision, timestamp, confirmation page, or equivalent receipt
-
Delivery closeout page completeness
- Delivery confirmation page HTML generated
- Contains complete TRACE rows
- Contains audit records
-
Human final review
- Humans read delivery confirmation page HTML
- Make decision: “Can deliver” or “Return for correction”
- The decision is written to the confirmation page, audit receipt, or equivalent delivery record
Gate Status:
- 🟢 TDD-GREEN + Human pass: Allowed formal delivery closeout
- 🔴 TDD-RED: Tests not passed, delivery prohibited
- 🔴 BLOCKED: Human review not passed
Chapter 7: Toolchain and implementation
7.1 AI-TDD toolchain ecosystem
Figure 6: This figure answers one question: what layers make up the AI-TDD toolchain?
The AI-TDD toolchain is best understood as a layered system, with core engines at the bottom and higher-level Skills on top.
7.2 Core engine layer
1. Manifest Parser
- YAML contract parsing
- ID extraction and matrix building
- Schema validation
2. Gate Controller
- Gate state management
- TDD-RED/GREEN state determination
- Entry/delivery interception logic
3. Trace Engine
- Five-dimension trace matrix execution
- Execution slice management
- Dependency graph building
4. Evidence Log
- Evidence hash storage
- Tamper-proof verification
- Audit traceability
7.3 Currently executable CLI and integration boundary
The currently executable CLI is provided by the npm package bmad-speckit-sdd-flow, with bmad-speckit as the main entry point. Based on the commands verified for this article, the public CLI currently covers setup checks, version reporting, dashboard lifecycle, and broader orchestration surfaces. Lower-level Gate execution belongs to the project integration layer and should not be read as a stable public command exposed directly by this repository.
| Command | Function | Example |
|---|---|---|
bmad-speckit check | Installation/environment check | npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check |
bmad-speckit version | View CLI version | npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit version |
bmad-speckit dashboard-status | View dashboard status | npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit dashboard-status |
Note: Gate-related capabilities should be entered through the CLI workflow first. The examples below show stable CLI entry points. Teams still need to wire the concrete Gate scripts, tests, and evidence artifacts that exist in their own projects.
CLI usage examples
Step 1: Environment check
$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
CLI version: 1.1.0
Template version: 1.1.0
Selected AI: cursor-agent
Subagent support: native
Check OK.
Step 2: Verify the installed CLI version
$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit version
CLI version: 1.1.0
Template version: 1.1.0
Node version: v22.x
Step 3: Inspect dashboard server status
$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit dashboard-status
{
"ok": false,
"mode": "stopped",
"healthy": false
}
Note: These commands reflect the public CLI behavior verified while editing this article. They do not parse Manifest files or execute project-specific delivery Gates by themselves. Concrete entry and delivery gate execution depends on how a team wires Gate scripts, test commands, and evidence artifacts in its own project.
Actual workflow
In practice, the AI-TDD workflow is completed through CLI entry + underlying Gate scripts + AI coding tools:
Three stages:
- Implementation Readiness Gate - Enter through the
bmad-speckitworkflow first. Teams can then wire their own local Gate scripts to verify Manifest completeness. The expected status isAI-TDD-RED(new acceptance tests exist, but implementation is not yet satisfied) - AI Generates Implementation - Developers paste the Manifest into AI tools such as Claude or Cursor as prompt context, and the AI generates code within those boundary constraints
- Delivery Closeout Gate - Again, enter through the
bmad-speckitworkflow first. Teams can wire local delivery Gate scripts to verify the current attempt’sTRACE/EVD/CMD/ARTclosure, rerun registered validations, check artifact/hash/receipt records, and reject stale evidence reuse. The expected status isAI-TDD-GREENonly when Gate Verdict ispassand Human Decision isaccept
Complete state flow appears in Figure 5: AI-TDD State Machine (/images/content/standalone/ai-tdd-framework/tdd-state-machine-en.svg), showing the flow from MANIFEST_DRAFT → AI-TDD-RED → IMPLEMENTING → CLOSEOUT_CANDIDATE → AI-TDD-GREEN → CLOSED, including reject/revise and reconfirm loopback paths.
Key Boundaries:
- Developers manually maintain Manifest YAML files
- CLI is the stable entry point; lower-level Gate scripts should be wired and maintained inside each team’s own project
- AI code generation is completed through external tools (Claude, Cursor, etc.), not built-in commands
7.4 Common errors and solutions
The examples in this section show illustrative integration-layer error shapes. They are not commands guaranteed to exist in this repository, and teams should replace them with the actual Gate scripts wired in their own projects.
Error 1: Manifest File Not Found
$ npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
❌ Gate BLOCKED: Manifest file not found
Expected path: ./my-project/ai-tdd-manifest.yaml
Actual status: File does not exist
Fix:
- Confirm
ai-tdd-manifest.yamlexists in the project root. - Or use
--cwdto specify the correct project directory.
Error 2: Gate Status Not as Expected
$ <your-team-readiness-gate-command> --requirement-record ./records/REQ-001.json
🟢 Gate status: TDD-GREEN (already implemented)
❌ Gate BLOCKED: Expected status is TDD-RED
Problem analysis:
- All tests passed, indicating implementation is already complete.
- But Implementation Readiness Gate requires “tests exist but implementation is still missing.”
Fix:
- Check whether historical implementation code already exists.
- If you need to restart, clean implementation files while preserving tests.
- Or skip the readiness stage and continue from the appropriate delivery closeout flow.
Error 3: Insufficient Test Coverage
$ <your-team-delivery-closeout-command> --requirement-record ./records/REQ-001.json
❌ Gate BLOCKED: TRACE rows coverage insufficient
Uncovered MUST items:
- MUST-REG-PASSWORD-LENGTH-001: No associated test
Fix:
- Add the missing test binding in
ai-tdd-manifest.yaml. - Re-run the Manifest generation and Gate checking workflow.
AI-TDD’s Skill system is extensible Agent capability modules:
| Skill | Function | Scenario |
|---|---|---|
| req-trace-matrix | Requirement trace matrix generation | Generate Manifest from requirement documents |
| contract-authoring | Contract writing assistance | Manifest writing and verification |
| reverse-audit | Reverse audit | Check if code satisfies Manifest |
| checkpoint-gate | Checkpoint gate | Automated acceptance |
| encoding-integrity | Encoding integrity check | File encoding and format verification |
7.5 External integration layer
AI-TDD integration with existing tool ecosystems:
- Test Frameworks: Vitest, Jest, Playwright
- Version Control: Git (commit hash as evidence)
- LLM APIs: Claude, Codex, Gemini
- YAML Parsers: js-yaml, PyYAML
- Hash Algorithms: SHA256 (evidence tamper-proofing)
7.6 Implementation roadmap
Phase 1: Pilot Project
- Select a small functional module
- Write first AI-TDD Manifest
- Run Implementation Readiness Gate
- AI generates implementation and iterates to TDD-GREEN
Phase 2: Team Rollout
- Establish team Manifest writing standards
- Build CI/CD gate pipelines
- Train team on AI-TDD workflow
Phase 3: Scale Application
- Establish organization-level Skill library
- Integrate existing requirement management systems
- Establish audit and measurement systems
7.7 Best practices
1. Manifest Writing Principles
- Every MUST must have corresponding EVD
- Every OUT OF SCOPE must explain reason and follow-up plans
- Use explicit validation conditions, avoid vague descriptions
2. Gate Configuration Recommendations
- Implementation readiness gate must be TDD-RED (tests exist but fail)
- Delivery gate must be TDD-GREEN (all verifications pass)
- Set gate timeouts to avoid infinite blocking
3. Team Collaboration Model
- Product managers responsible for requirement confirmation phase
- Architects responsible for architecture confirmation phase
- Development engineers responsible for implementation readiness and execution closure
- QA/Audit Agents responsible for audit review phase
Chapter 8: AI-TDD Gate mechanism detailed explanation
8.1 Core functions of AI-TDD Gate
AI-TDD Gate is Manifest’s technical implementation layer:
1. Manifest Parsing
- Read
ai-tdd-manifest.yaml - Parse MUST/NEG/OUT/EVD
- Establish TRACE rows
2. Test Code Generation
- Automatically generate test code frameworks based on Manifest
- Associate E2E/ACC test suites
- Generate coverage check configurations
3. Gate Execution
- Run full tests
- Determine TDD-RED or TDD-GREEN
- Generate gate reports
4. Status Tracking
- Record results of each gate run
- Maintain requirement→test→result traceability chains
8.2 CI/CD integration
The following example is an integration sketch. It shows how a team might connect the bmad-speckit CLI, a manual AI implementation step, and the team’s own Gate execution layer in CI. It is not a copy-paste-complete workflow for this repository.
GitHub Actions integration sketch:
name: AI-TDD Gate
on: [pull_request]
jobs:
implementation-readiness:
name: "Implementation Readiness Gate"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Check AI-TDD CLI Environment
run: npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
- name: Run team-specific Implementation Readiness Gate
run: echo "Replace with your project's actual readiness gate command"
ai-implementation:
name: "AI Implementation"
needs: implementation-readiness
runs-on: ubuntu-latest
steps:
- name: AI Generate with Manifest Context
run: echo "Use Claude/Cursor/Codex with the confirmed Manifest as context, then commit generated implementation."
delivery-closeout:
name: "Delivery Closeout Gate"
needs: ai-implementation
runs-on: ubuntu-latest
steps:
- name: Run team-specific Delivery Closeout Gate
run: echo "Replace with your project's actual closeout gate command"
Conclusion: requirement-contract-driven AI-TDD and multi-dimensional evidence acceptance
Within this article’s framing, the core AI-TDD claim can be summarized in one sentence: AI generation without Manifest as contract is unanchored improvisation; delivery without an evidence chain is optimistic judgment that cannot be reviewed.
Problem boundaries precede solutions: a first principle of architecture in the AI era
AI can help write code, draft documents, run analysis, and even assist with architectural design. What it cannot do independently is define the business boundary for a team. Problem boundaries are not purely technical; they also encode value judgments, priorities, and tradeoffs.
Software engineering has long held that code should be written for human understanding, not only machine execution. In the AI era, that idea can be reframed as:
“Any AI can write code that solves problems. Excellent architects define the boundaries of problems worth solving.”
The six mental models form the cognitive framework for human-AI collaboration. Each one follows the same line: lock the requirement contract first, then collect reviewable evidence.
- Requirement Confirmation: Atomic decomposition, generating machine-readable Manifest contract matrices
- Architecture Confirmation: Defining technical solutions
- Implementation Readiness: Generate acceptance test baselines that sufficiently cover the current Manifest scope
- Execution Closure: AI generates implementation under Manifest constraints
- Audit Review: Multi-Agent critical audit
- Delivery Closeout: Re-run full Manifest validation, confirming TRACE/EVD/CMD/ART evidence, Gate Verdict, and Human Decision are closed
The two key gates act as the quality firewall:
- Implementation Readiness Gate: Check whether Manifest is complete, with the expected state AI-TDD-RED / TDD-RED
- Delivery Closeout Gate: Check whether Manifest-linked evidence chains are closed, with the expected state AI-TDD-GREEN / TDD-GREEN
AI-TDD application prospects in large-scale engineering
If AI continues to take on more implementation work, AI-TDD may become especially useful in the following settings:
1. Enterprise-Level Microservice Architecture
In complex multi-service systems, AI-TDD’s TRACE rows can help establish cross-service requirement traceability. Each service’s Manifest defines its interface contracts, verified end-to-end through unified Gates.
2. Safety-Critical Systems
For safety-critical domains like finance, healthcare, and aviation, AI-TDD’s negative contracts (NEG / MUST NOT) and scope-boundary (OUT OF SCOPE) audit mechanisms can strengthen security-constraint management. Each security constraint needs explicit verification evidence.
3. Compliance-Driven Development
In scenarios with strict compliance requirements like GDPR and SOX, AI-TDD’s EVD mechanism can preserve compliance audit material, helping prove that each compliance item has corresponding verification.
4. Open Source Project Collaboration
AI-TDD’s Manifest can become an open source project’s “social contract”: contributors compare PR scope against the Manifest before submitting, reducing the risk of violating project constraints.
Core principles review
- Manifest must be completely defined before execution begins
- Manifest is machine-readable, not natural language description
- AI-TDD Gate runs based on Manifest, not based on scattered tests
- No complete Manifest, prohibited from entering execution
- Manifest has unverified items, prohibited from delivery
This is AI-TDD’s value: moving AI-generated code quality from natural-language guessing to requirement-contract governance, and from local tests to TRACE/EVD/CMD/ART evidence-chain acceptance.
Chapter 9: Frequently Asked Questions (FAQ)
Q1: How much additional development time does AI-TDD add?
A: It adds upfront time for clarification and Manifest drafting, but it often reduces rework later. The payoff depends on requirement complexity, team proficiency, test infrastructure, and how deeply AI participates in implementation.
Taking the calculator example from the Quick Start chapter:
- Traditional approach: Directly let AI generate → discover boundary gaps later → rework
- AI-TDD: Write Manifest → generate tests → generate implementation → accept by evidence
The payoff is usually easiest to see when the project has many boundaries, high rework costs, or explicit audit-evidence requirements.
Q2: How detailed does Manifest need to be?
A: Focus on boundaries, not implementation.
❌ Over-detailed (bad):
must:
- id: "MUST-SUM-LOOP-001"
description: "Use for loop to traverse array, accumulate each element to accumulator variable"
# Too specific! Limits AI's implementation approach
✅ Appropriately detailed (good):
must:
- id: "MUST-SUM-RESULT-001"
description: "Calculate sum of array elements"
validation: "sum([1,2,3]) === 6, time complexity O(n)"
# Only define boundaries (input/output/performance), not implementation
Principle: Manifest defines “what to do” and “what not to do,” AI decides “how to do it.”
Q3: What if AI repeatedly cannot make tests green?
A: It depends on the failure mode:
Situation 1: AI-generated implementation has logic errors
- Human intervention, give AI more specific hints
- Split complex MUSTs into multiple simple MUSTs
- Use audit Agent to provide feedback
Situation 2: Tests themselves have problems
- Check if MUST’s validation is achievable
- Check if there are contradictory requirements, such as
MUST-REQ-A-001andMUST-REQ-A-002 - Modify Manifest, regenerate tests
Situation 3: AI capability boundaries
- Some complex algorithms may exceed current AI capabilities
- Mark this part as OUT OF SCOPE, manually implement
- Continue AI-TDD process for remaining parts
Q4: How do existing projects migrate to AI-TDD?
A: A practical migration path is to start in “shadow mode”:
Phase 1: Parallel Verification
- Keep existing development process unchanged
- Select 1 new feature, implement simultaneously with AI-TDD and traditional methods
- Compare quality and time differences
Phase 2: Selective Application
- Prioritize AI-TDD for new features
- Supplement Manifest for legacy features during refactoring
- Keep existing tests as regression tests
Phase 3: Full Switch (ongoing)
- Establish internal team templates
- Form best practices
- New hire training materials
Do not:
- ❌ Rewrite all code at once
- ❌ Delete existing tests
- ❌ Force the team to use AI-TDD for every task
Q5: How does AI-TDD integrate with existing CI/CD?
A: AI-TDD can be integrated into existing pipelines as independent verification stages. The two snippets below are integration sketches, not copy-paste-complete configs for this repository. The explicit bmad-speckit check step is only a setup smoke check; it does not execute project-specific AI-TDD Gates. Replace placeholder commands with the actual Gate commands or CI tasks used in your own project.
GitLab CI integration sketch:
# .gitlab-ci.yml
stages:
- validate # New: AI-TDD validation
- test
- deploy
ai-tdd-validate:
stage: validate
script:
- npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
- echo "Replace with your project's actual implementation readiness gate command"
only:
- merge_requests
ai-tdd-closeout:
stage: test
script:
- echo "Replace with your project's actual delivery closeout gate command"
only:
- main
GitHub Actions integration sketch:
# .github/workflows/ai-tdd.yml
name: AI-TDD Gate
on: [pull_request]
jobs:
ai-tdd-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: AI-TDD validation
run: |
npx --yes --package bmad-speckit-sdd-flow@latest bmad-speckit check
echo "Replace with your project's actual implementation readiness gate command"
Q6: Are small teams (1-3 people) suitable for AI-TDD?
A: It depends on project complexity.
Suitable Scenarios:
- Projects with clear requirement boundaries (e.g., API services, tool libraries)
- Need long-term maintenance (>3 months)
- Frequent requirement changes (Manifest helps track changes)
May Not Be Suitable:
- One-time scripts/tools
- Pure prototype exploration
- Personal learning projects
Recommendation: Even a one-person team can start with a minimal mode: write only MUST items first, and add the rest later.
Q7: Does AI-TDD require special AI models?
A: Usually not. As long as a model can understand structured context, generate code, and iterate on failure feedback, it can participate in an AI-TDD workflow. More complex projects may still need models with stronger context windows, tool calling, and code capabilities.
Model Selection Recommendations:
- Code generation models: Choose models supporting tool calling (Function Calling) for easy integration with Gate systems
- Context length: Complex Manifests may need 128K+ context, choose models supporting long contexts
- Multimodal capabilities: If needing to process screenshots, design drafts, etc., choose multimodal-capable models
Prompt tips (model-independent):
- Clearly indicate in Prompt “generate based on following Manifest”
- Require AI to “strictly follow OUT OF SCOPE, don’t implement”
- Require AI to “first understand NEG (MUST NOT), ensure not violated”
- Provide brief Manifest Schema explanations to help AI understand structure
Q8: What is AI-TDD’s Manifest Git management strategy?
A: Common strategies include the following:
Branch Strategy:
main
├── feature/login-manifest # Only change Manifest
├── feature/login-impl # AI generates implementation (based on Manifest)
└── hotfix/fix-auth-bug # Emergency fix
Commit Conventions:
# Manifest changes
feat(manifest): Add user registration MUST-REG-EMAIL-VERIFY-001
# Implementation changes (generated by AI)
feat(impl): AI implements MUST-REG-EMAIL-VERIFY-001 email verification
# Gate passed
gate(pass): Implementation Readiness Gate TDD-RED
# Audit passed
audit(pass): 3 Agents reviewed, 0 blockers
Version Tags:
# Manifest confirmed version
git tag -a manifest-v1.0.0 -m "Requirement confirmation passed"
# Delivery version
git tag -a release-v1.0.0 -m "AI-TDD Delivery Closeout Gate passed"
Chapter 10: Limitations and applicability boundaries
AI-TDD is not a silver bullet. Before adopting it broadly, teams need a clear view of its limits and where it fits best.
10.1 Unsuitable scenarios
Scenario 1: Highly Exploratory Prototype Development
- When you’re not even sure “what to do” and need rapid trial-and-error
- Manifest’s predefined nature becomes a constraint on innovation
- Recommendation: Traditional AI-assisted coding is more suitable
Scenario 2: Very Small Projects (<500 lines of code)
- Time to write Manifest may exceed directly writing code
- Methodology overhead not worthwhile in small projects
- Recommendation: Only use when project complexity exceeds threshold (e.g., >5 interaction interfaces)
Scenario 3: Pure Frontend UI Development
- UI/UX requirements are highly subjective, difficult to describe precisely with MUST/EVD
- “Beautiful” “Smooth” etc. requirements cannot be quantified for verification
- Recommendation: Visual-first development may be more suitable; use AI-TDD more selectively for backend or rule-heavy logic
10.2 Usage costs
Learning Cost:
- Teams need dedicated time to learn Manifest syntax and AI-TDD workflow
- It helps to have at least one team member who knows the workflow well and can keep templates, gates, and audit expectations consistent
Time Cost:
- Writing Manifest and confirmation processes increase upfront time
- If requirement boundaries are complex and rework costs are high, later rework usually decreases
- Net benefits are usually more visible in large, long-term, or high-risk projects
Tool Cost:
- Need runnable AI-TDD Gate infrastructure, whether built internally or integrated from an existing team pipeline
- Multi-Agent audit requires additional token consumption
10.3 Failure risks
Risk 1: Manifest itself incorrectly defined
- If human understanding of requirements is wrong, AI-TDD can only accelerate wrong implementation
- Mitigation: Strengthen manual review in requirement confirmation phase
Risk 2: AI cannot understand complex constraints
- Some domain-specific logic (e.g., financial compliance rules) may exceed current AI capabilities
- Mitigation: Split complex constraints into simpler sub-constraints
Risk 3: Over-reliance on tools
- Teams may fall into “using AI-TDD for the sake of using AI-TDD”
- Mitigation: Regular retrospectives to confirm methodology indeed brings value
10.4 Gradual adoption path
First phase: Single MUST item trial
- Choose a simple feature (e.g., logging)
- Only define 1 MUST, experience complete workflow
Second phase: Add NEG / MUST NOTs
- Increase negative constraints (e.g., “prohibit hardcoded keys”)
- Experience TDD-RED interception effects
Third phase: Complete workflow
- Add OUT OF SCOPE, EVD, TRACE rows
- Run complete AI-TDD Gate
Team rollout phase: Broaden adoption
- Summarize lessons learned from the pilot
- Develop team internal Manifest templates and best practices
Appendix A: Glossary
| Term | English | Definition |
|---|---|---|
| AI-TDD | AI-TDD defined in this article | Manifest-level AI-TDD with AI as execution agent and Manifest contract at the core; not yet standardized industry terminology |
| Manifest | AI-TDD Gate Manifest | YAML-format requirement contract matrix containing MUST/NEG/OUT/TRACE/EVD/ACC/E2E/FAIL/EDGE/CMD/ART/TASK |
| MUST | Must Requirement | Functional requirements that system must implement |
| NEG | Must Not / Negative Assertion | Blocking negative assertion; MUST NOT is the conceptual alias and machine IDs use NEG-* |
| OUT | Out Of Scope Boundary | Functional boundary excluded from the current iteration; older “not completed” wording should migrate to OUT OF SCOPE / OUT-* |
| TRACE | Trace Row | Contract-slice index binding MUST/NEG, TASK, scenario layer ACC/E2E/EDGE/FAIL, and evidence layer EVD/CMD/ART; OUT binds through scope-audit refs |
| EVD | Evidence | Requirement verification evidence items, including thresholds and artifacts |
| ACC | Acceptance Test | Automated test or check tied to acceptance |
| E2E | End-to-End Test | End-to-end scenario verification |
| FAIL | Failure Path | Required failure path attached to NEG-* |
| EDGE | Edge Case | Boundary condition or exceptional input scenario |
| CMD | Command | Reproducible verification command |
| ART | Artifact | Auditable delivery or evidence file |
| TASK | Task | Implementation task or execution slice |
| TDD-RED | TDD-RED | Implementation readiness gate status: the new or linked acceptance tests exist and fail because the implementation has not yet satisfied them |
| TDD-GREEN | TDD-GREEN | Delivery gate status: the current-attempt evidence chain is closed, Gate Verdict is pass, and Human Decision is accept |
| CLOSEOUT_CANDIDATE | Closeout Candidate | Intermediate state where registered validations pass, but the evidence chain, Gate Verdict, and Human Decision are not all closed |
| Implementation Readiness Gate | Implementation Readiness Gate | Gate that must be passed before execution begins, requiring TDD-RED status |
| Delivery Closeout Gate | Delivery Closeout Gate | Gate that must be passed before delivery, requiring evidence chain, Gate Verdict, and Human Decision closure |
| Requirement Contract | Requirement Contract | Human-confirmed and versioned Manifest that constrains the AI execution scope |
| Contract Slice | Contract Slice | Smallest acceptable unit centered on a TRACE row, binding requirement, scenario, evidence, command, and artifact |
| Evidence Chain | Evidence Chain | Reviewable proof chain composed of TRACE, EVD, CMD, ART, and their hashes or receipts |
| Gate Verdict | Gate Verdict | Gate decision such as pass, blocked, or failed based on the current attempt’s evidence chain |
| Human Decision | Human Decision | Recorded human accept/reject decision based on confirmation page and audit evidence |
| Bounded Packet Closure | Bounded Packet Closure | AI iterates generation under Manifest constraints until the candidate implementation satisfies registered validations and enters closeout |
| Human-in-the-loop | Human-in-the-loop | Mechanism where critical decision points must be confirmed by humans |
| Contract as Code | Contract as Code | Manifest is executable encoding of human intent |
| Reverse Audit | Reverse Audit | Audit method tracing from code implementation to verify whether it satisfies Manifest |
| Trace Row | Trace Row | Minimum requirement verification unit that can be independently executed |
| Skill | Skill | AI-TDD’s extensible Agent capability modules |
| bmad | BMAD CLI | AI-TDD’s command line tool |
References and extensions
Core theoretical literature
Beck, K. (2002). Test-driven development: By example. Addison-Wesley.
Brooks, F. P. Jr. (1975). The mythical man-month: Essays on software engineering. Addison-Wesley.
GitHub. (2021, June 29). Introducing GitHub Copilot: Your AI pair programmer. https://github.blog/news-insights/product-news/introducing-github-copilot-ai-pair-programmer/
Humble, J., & Farley, D. (2010). Continuous delivery: Reliable software releases through build, test, and deployment automation. Addison-Wesley.
OpenAI. (2023). GPT-4 Technical Report. arXiv. https://arxiv.org/abs/2303.08774
Sobocinski, P. (2023, August 17). TDD with GitHub Copilot. Martin Fowler. https://martinfowler.com/articles/exploring-gen-ai/06-tdd-with-coding-assistance.html
Sukharev, D. (2023). AI-TDD: CLI for TDD - you write the test, GPT writes the code to pass it. GitHub. https://github.com/di-sukharev/AI-TDD
Piya, S., & Sullivan, A. (2023). LLM4TDD: Best Practices for Test Driven Development Using Large Language Models. arXiv. https://arxiv.org/abs/2312.04687
Mathews, N. S., & Nagappan, M. (2024). Test-Driven Development for Code Generation. arXiv. https://arxiv.org/abs/2402.13521
Cui, Y. (2025). Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation. arXiv. https://arxiv.org/abs/2505.09027
Schneider, J. G., Borjigin, A., Kamal, M., & Grundy, J. (2026). Test-Driven Agentic Development for Automated Software Evolution. arXiv. https://arxiv.org/abs/2604.26615
Beck, K. (2025, June 11). TDD, AI agents and coding with Kent Beck [Interview]. The Pragmatic Engineer. https://newsletter.pragmaticengineer.com/p/tdd-ai-agents-and-coding-with-kent
Thoughtworks. (2026). The Future of Software Engineering: Retreat Findings and Strategic Insights. https://www.thoughtworks.com/content/dam/thoughtworks/documents/report/tw_future%20_of_software_development_retreat_%20key_takeaways.pdf
This article represents original framework thinking. It defines AI-TDD around Manifest requirement contracts, TRACE/EVD/CMD/ART evidence chains, Gate Verdict, and Human Decision. The term is still evolving — feedback and corrections are welcome.
Reading path
Continue along this topic path
Follow the recommended order for AI engineering practice instead of jumping through random articles in the same topic.
Next step
Go deeper into this topic
If this article is useful, continue from the topic page or subscribe to follow later updates.
Loading comments...
Comments and discussion
Sign in with GitHub to join the discussion. Comments are synced to GitHub Discussions