S4U Methodology — Showcase & Evidence

Case studies and evidence narratives were moved out of the agent-facing canon (2026-06-12, canon v3). Agents load rules (operating-card.md) and procedures (skills/); humans read this page for the story and the numbers.

Case study: regulatory compliance as architecture

In one line: when a system operates under regulation, the regulation is an architectural constraint, not a feature to bolt on later.

This is an anonymized illustration of one team adopting the methodology on a high-stakes regulated system — an AI-assisted decisioning platform working under multiple overlapping regulatory frameworks. The lesson transfers to any domain where an automated decision can have legal or safety consequences: model the regulatory requirements into the schema, the API contract, and the audit trail from day one. Below is how the constraints shaped the design.


Applicable Regulations

EU AI Act (Annex III — High-Risk AI System). An AI-driven risk-assessment system of this kind qualifies as high-risk. The Act imposes specific requirements:

  • Article 11 (Technical Documentation): Complete documentation of the AI system's design, development, and performance. Every model version, every prompt template, every training dataset must be recorded.
  • Article 12 (Automatic Logging): All AI operations must be logged automatically, with sufficient granularity to reconstruct any individual decision.
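  • Article 13 (Transparency): Deployers must receive enough information to interpret the system's output and use it appropriately, including how AI decisions affect the people assessed. (Telling people that they are interacting with an AI system is a separate duty under Article 50.)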
  • Article 14 (Human Oversight): Human officers must be able to override any AI decision. The system must support meaningful human review, not rubber-stamping.
  • Article 15 (Accuracy & Robustness): Ongoing monitoring of AI system accuracy, with documented methodology for measuring and reporting performance.

GDPR (Articles 22, 25, 30, 35). Automated decision-making about individuals triggers specific rights:

  • Article 22: Right to human review of automated decisions. Every AI-generated risk assessment must be reviewable by a human compliance officer.
  • Article 25: Data protection by design. Privacy considerations are architectural constraints, not post-hoc additions.
  • Article 30: Records of processing activities. Every data processing operation must be documented.
  • Article 35: Data Protection Impact Assessment required for high-risk processing.

Sector record-keeping regulation. Domain-specific compliance regimes typically add record-keeping and audit-trail requirements:

  • Long-horizon minimum retention for all decision records.
  • An audit trail sufficient to support mandatory regulatory reporting.
  • A documented, risk-based assessment methodology — the methodology itself must be auditable, not just the results.

The Five Requirements for Every AI Output

Every AI-driven decision, recommendation, or risk assessment in the system must satisfy five requirements. These are non-negotiable architectural constraints — the system is designed so that producing an AI output without meeting all five requirements is structurally impossible.

1. Input Provenance. What data was the decision based on? Every AI output records its input sources as SourcedFact and EvidenceReference objects — structured references to the specific documents, database records, or external data sources that contributed to the decision. A regulator reviewing an AI risk assessment can trace every claim to its source data.
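A minimal sketch of how provenance can be made structural rather than conventional. The names SourcedFact and EvidenceReference come from the text above; every field name, the RiskAssessment wrapper, and the example locator format are assumptions made for illustration, not the team's actual types.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvidenceReference:
    """Pointer to one concrete source (document, database record, external feed)."""
    source_type: str  # hypothetical values: "document", "db_record", "external"
    locator: str      # hypothetical: a URI or primary key identifying the source


@dataclass(frozen=True)
class SourcedFact:
    """A single claim plus the evidence it rests on."""
    claim: str
    evidence: Tuple[EvidenceReference, ...]

    def __post_init__(self) -> None:
        # Refuse to construct a fact that has no provenance at all.
        if not self.evidence:
            raise ValueError("SourcedFact requires at least one EvidenceReference")


@dataclass(frozen=True)
class RiskAssessment:
    """Hypothetical AI output type that cannot exist without sourced facts."""
    summary: str
    facts: Tuple[SourcedFact, ...]

    def __post_init__(self) -> None:
        if not self.facts:
            raise ValueError("RiskAssessment requires at least one SourcedFact")

    def sources(self) -> List[str]:
        """Flatten every locator so a reviewer can trace each claim to its source."""
        return [ref.locator for fact in self.facts for ref in fact.evidence]
```

Because the check lives in the constructor, an output with no sources fails at creation time instead of relying on a reviewer to notice the omission.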

2. Model Identification. Which model, which version, which prompt template? Every AI execution records the model identifier, the prompt template name, and the prompt version ID. When a prompt template is updated, the prompt_version_id foreign key ensures that historical decisions reference the exact prompt that was used, not the current version.
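One way to picture the prompt_version_id foreign key, sketched with Python's built-in sqlite3 module. Only the prompt_version_id column name comes from the text; the table names, the other columns, and the helper functions are assumptions, and the real system's database and schema may differ.

```python
import sqlite3

PROMPT_SCHEMA = """
CREATE TABLE prompt_versions (
    id            INTEGER PRIMARY KEY,
    template_name TEXT NOT NULL,
    body          TEXT NOT NULL
);
CREATE TABLE ai_executions (
    id                INTEGER PRIMARY KEY,
    model_id          TEXT NOT NULL,
    prompt_version_id INTEGER NOT NULL REFERENCES prompt_versions(id),
    output            TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(PROMPT_SCHEMA)
    return conn


def publish_prompt(conn: sqlite3.Connection, template_name: str, body: str) -> int:
    """Editing a template inserts a new row; existing versions are never rewritten."""
    cur = conn.execute(
        "INSERT INTO prompt_versions (template_name, body) VALUES (?, ?)",
        (template_name, body),
    )
    return cur.lastrowid


def record_execution(conn: sqlite3.Connection, model_id: str,
                     prompt_version_id: int, output: str) -> int:
    """Store one AI execution; the FK rejects an unknown prompt version."""
    cur = conn.execute(
        "INSERT INTO ai_executions (model_id, prompt_version_id, output) VALUES (?, ?, ?)",
        (model_id, prompt_version_id, output),
    )
    return cur.lastrowid


def prompt_used(conn: sqlite3.Connection, execution_id: int) -> str:
    """Return the exact prompt text a historical execution ran with."""
    row = conn.execute(
        "SELECT pv.body FROM ai_executions e "
        "JOIN prompt_versions pv ON pv.id = e.prompt_version_id WHERE e.id = ?",
        (execution_id,),
    ).fetchone()
    return row[0]
```

The design choice worth noting is that a template update is modelled as a new version row, so the foreign key on an old execution keeps resolving to the text that was actually sent.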

3. Chain of Thought. The full reasoning captured. The agent framework's full-message-history capture records the complete conversation between the system and the AI model — the prompt, the model's response, any tool calls, and the final output. This is stored as immutable evidence, not as a summary. (Lesson, not a product endorsement: pick a framework that hands you the raw message trace rather than a summarized one.)

4. Confidence Scoring. Quantified certainty with documented methodology. Every AI assessment includes a confidence score with a methodology reference explaining how the score was calculated. The scoring methodology is itself an auditable artifact — a regulator can evaluate not just the score but the method that produced it.
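A sketch of a score that always carries its methodology reference. The weighted-mean method, the class names, and the identifier format are all assumptions chosen for illustration; the source does not say how the real scores are calculated.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ScoringMethodology:
    """Hypothetical auditable method: a named, versioned set of signal weights."""
    methodology_id: str        # hypothetical identifier, e.g. "weighted-signals-v1"
    weights: Dict[str, float]  # signal name -> weight; weights must sum to 1.0

    def __post_init__(self) -> None:
        if abs(sum(self.weights.values()) - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1.0")


@dataclass(frozen=True)
class ConfidenceScore:
    value: float         # in [0, 1]
    methodology_id: str  # which documented method produced the value


def score(signals: Dict[str, float], method: ScoringMethodology) -> ConfidenceScore:
    """Weighted mean of per-signal confidences, each expected in [0, 1]."""
    missing = set(method.weights) - set(signals)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in method.weights:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"signal {name!r} is outside [0, 1]")
    value = sum(method.weights[name] * signals[name] for name in method.weights)
    return ConfidenceScore(round(value, 4), method.methodology_id)
```

Since score() is the only function that builds a ConfidenceScore here, a number cannot reach a reviewer without the identifier of the method that produced it.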

5. Immutable Audit Log. An append-only audit_events table records every state transition in the compliance workflow. The table schema prevents updates and deletes — the immutability guarantee is enforced at the database level, not by application code that might forget to call the audit function. This means the audit trail is tamper-evident: any gap in the sequence of events is detectable.
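A sketch of database-level append-only enforcement, using sqlite3 triggers as a stand-in. Only the audit_events table name comes from the text; the columns, trigger names, and the sequence-gap check are assumptions, and a production database would more likely combine triggers with revoked UPDATE and DELETE privileges.

```python
import sqlite3
from typing import List, Optional, Tuple

AUDIT_SCHEMA = """
CREATE TABLE audit_events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    case_id    TEXT NOT NULL,
    from_state TEXT,
    to_state   TEXT NOT NULL
);
CREATE TRIGGER audit_events_no_update
BEFORE UPDATE ON audit_events
BEGIN
    SELECT RAISE(ABORT, 'audit_events is append-only');
END;
CREATE TRIGGER audit_events_no_delete
BEFORE DELETE ON audit_events
BEGIN
    SELECT RAISE(ABORT, 'audit_events is append-only');
END;
"""


def open_audit_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(AUDIT_SCHEMA)
    return conn


def append_event(conn: sqlite3.Connection, case_id: str,
                 from_state: Optional[str], to_state: str) -> int:
    """Record one state transition; INSERT is the only write the table allows."""
    cur = conn.execute(
        "INSERT INTO audit_events (case_id, from_state, to_state) VALUES (?, ?, ?)",
        (case_id, from_state, to_state),
    )
    return cur.lastrowid


def sequence_gaps(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Return (before, after) pairs wherever the sequence numbers are not contiguous."""
    seqs = [row[0] for row in conn.execute("SELECT seq FROM audit_events ORDER BY seq")]
    return [(a, b) for a, b in zip(seqs, seqs[1:]) if b != a + 1]
```

Any UPDATE or DELETE against the table raises a database error regardless of which code path issued it, which is the point of enforcing the rule below the application layer.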


The Non-Suppression Principle

The system can ADD scrutiny but NEVER suppress risk signals. This is the foundational design constraint for all AI-driven risk assessment in the platform.

Concretely: if an AI model identifies a risk indicator (a sanctions match, a negative media mention, a registration anomaly), the system records the indicator and presents it to the compliance officer. A subsequent AI analysis that does not find the same risk does not remove the indicator — it adds a second opinion alongside the first. The compliance officer sees both and makes the final determination.

Any AI recommendation that would reduce the level of scrutiny applied to a case must be:

  1. Traceable to specific evidence that justifies the reduction (not just "the model thinks the risk is low")
  2. Flagged for human review before taking effect
  3. Recorded in the audit trail with the full reasoning chain

This principle exists because the regulatory consequences of suppressing a legitimate risk signal (a missed mandatory filing, a compliance failure) are orders of magnitude more severe than the operational cost of investigating a false positive. The system is architecturally biased toward caution.
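The three conditions above can be sketched as a small case-file object. Every name here (CaseFile, Finding, ReductionProposal, the integer scrutiny level, the string audit entries) is hypothetical; this shows one possible shape of the rule, not the platform's implementation.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Finding:
    indicator: str  # e.g. "sanctions_match"
    present: bool   # True: the analysis raised the signal; False: it did not find it
    source: str     # which analysis produced this opinion


@dataclass(frozen=True)
class ReductionProposal:
    new_level: int
    evidence_refs: Tuple[str, ...]
    reasoning: str


class CaseFile:
    def __init__(self, scrutiny_level: int) -> None:
        self.scrutiny_level = scrutiny_level
        self.findings: List[Finding] = []
        self.pending: List[ReductionProposal] = []
        self.audit: List[str] = []

    def add_finding(self, finding: Finding) -> None:
        # Findings are only appended: a later "not found" sits beside
        # an earlier "found" instead of replacing it.
        self.findings.append(finding)
        self.audit.append(f"finding:{finding.indicator}:{finding.present}:{finding.source}")

    def raised_indicators(self) -> Set[str]:
        # An indicator stays raised if any analysis ever raised it.
        return {f.indicator for f in self.findings if f.present}

    def propose_level(self, new_level: int,
                      evidence_refs: Tuple[str, ...] = (), reasoning: str = "") -> str:
        if new_level >= self.scrutiny_level:
            # Adding (or keeping) scrutiny takes effect immediately.
            self.scrutiny_level = new_level
            self.audit.append(f"level_raised:{new_level}")
            return "applied"
        if not evidence_refs or not reasoning.strip():
            raise ValueError("a reduction needs specific evidence and recorded reasoning")
        self.pending.append(ReductionProposal(new_level, tuple(evidence_refs), reasoning))
        self.audit.append(f"reduction_proposed:{new_level}:{reasoning}")
        return "pending_human_review"

    def approve_reduction(self, proposal: ReductionProposal, officer_id: str) -> None:
        # Only an explicit human approval lowers the level.
        self.pending.remove(proposal)
        self.scrutiny_level = proposal.new_level
        self.audit.append(f"reduction_approved:{proposal.new_level}:{officer_id}")
```

The asymmetry is deliberate: raising the level is a one-step call, while lowering it needs evidence, a queue entry, and a named officer.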

Evidence: the system implemented all five requirements structurally — an evidence-bundle type system carrying sourced facts and evidence references, a prompt_version_id foreign key across every AI-execution record, raw model-message capture, a confidence-scoring methodology, and an append-only audit_events table whose schema forbids updates and deletes. The recurring lesson: each requirement is enforced by a type, a foreign key, or a table constraint — not by a convention an engineer might forget. See the Evidence metrics section below for the architectural-rigor metrics.


Evidence metrics (former appendix-f)

In one line: every claim about a codebase should ship with the command that produced it, so a skeptic can re-run it.

Do this: when you make an evidence-grade claim in your own docs, pin it to a commit and paste the exact collection command. The point of this section is the command set below — they are codebase-agnostic and reproducible — not the specific numbers any one project happened to report. Run them against your own repo to get your own numbers.

Scope used by these commands: backend/app/ for production Python code, frontend/src/ for production TypeScript (excluding tests, dependencies, and generated files).


1. Codebase Scale

Measure: total LOC, backend vs frontend split, API endpoint count, router-file count, ORM model count, service-module count, and Pydantic model-file count. Each measure comes from one reproducible command below; run them on your own repo.

Collection commands

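# Backend file count and LOC
find backend/app -name "*.py" -not -path "*__pycache__*" | wc -l
# -print0 / xargs -0 survives spaces in paths; cat | wc -l gives one total even when xargs splits the file list
find backend/app -name "*.py" -not -path "*__pycache__*" -print0 | xargs -0 cat | wc -l

# Frontend file count and LOC (test files excluded, per the scope above)
find frontend/src \( -name "*.ts" -o -name "*.tsx" \) -not -name "*.test.*" -not -name "*.spec.*" | wc -l
find frontend/src \( -name "*.ts" -o -name "*.tsx" \) -not -name "*.test.*" -not -name "*.spec.*" -print0 | xargs -0 cat | wc -l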

# API endpoints
grep -rE "@router\.(get|post|put|patch|delete)" backend/app/api/ --include="*.py" | wc -l

# API router files
find backend/app/api -name "*.py" -not -name "__init__.py" -not -path "*/deps/*" | wc -l

# ORM models
grep -c "^class.*Base):" backend/app/db/models.py

# Service modules
find backend/app/services -name "*.py" -not -name "__init__.py" | wc -l

# Pydantic model files (package __init__.py excluded, as in the counts above)
find backend/app/models -name "*.py" -not -name "__init__.py" | wc -l

2. Development Velocity

Measure: total commits, calendar vs active development days, commits per active day, the share of AI co-authored commits, and whether commits follow the Conventional Commits style (feat/fix/docs/test with scope). The commands below derive the raw counts from git history; commits per active day is total commits divided by active development days.

Collection commands

# Total commits (on the checked-out branch; the default branch may be main or master)
git rev-list --count HEAD

# First and last commit dates
git log --reverse --format="%ai" | head -1
git log -1 --format="%ai"

# Active development days
git log --format="%ad" --date=short | sort -u | wc -l

# AI co-authored commits (current branch only; -i matches any trailer capitalization)
git log -i --grep="Co-Authored-By" --oneline | wc -l
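Two of the measures listed above (commits per active day and the share of conventionally formatted subjects) have no command of their own. A minimal sketch of deriving them from git log output follows. The helper names are hypothetical, and the regular expression assumes that "with scope" means the scope is mandatory.

```python
import re
import subprocess
from typing import Iterable, List

# Assumption: a subject counts only if it has a type AND a scope, e.g. "feat(api): add X".
CONVENTIONAL = re.compile(r"^(feat|fix|docs|test)\([^)]+\): \S")


def git_lines(*args: str) -> List[str]:
    """Run a git command in the current repo and return its non-empty output lines."""
    result = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]


def commits_per_active_day(commit_dates: Iterable[str]) -> float:
    """commit_dates holds one YYYY-MM-DD string per commit."""
    dates = list(commit_dates)
    active_days = len(set(dates))
    return len(dates) / active_days if active_days else 0.0


def conventional_share(subjects: Iterable[str]) -> float:
    """Fraction of commit subjects matching the type(scope): pattern."""
    subject_list = list(subjects)
    if not subject_list:
        return 0.0
    matching = sum(1 for s in subject_list if CONVENTIONAL.match(s))
    return matching / len(subject_list)
```

Inside a repository, commits_per_active_day(git_lines("log", "--format=%ad", "--date=short")) and conventional_share(git_lines("log", "--format=%s")) give the two figures for the current branch.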

3. Testing & Quality

Measure: backend/frontend test-file and test-function counts, the number of documented MOCK APPROVED comments (a proxy for honest mocking discipline), and how many files use testcontainers (a proxy for real-service integration testing). The commands below count each.

Collection commands

# Backend test files
find backend/tests -name "test_*.py" | wc -l

# Backend test functions
grep -r "def test_" backend/tests/ | wc -l

# Frontend test files
find frontend/src \( -name "*.test.*" -o -name "*.spec.*" \) | wc -l

# Documented mock approvals (total comments)
grep -r "MOCK APPROVED" backend/tests/ | wc -l

# Files containing mock approvals
grep -rl "MOCK APPROVED" backend/tests/ | wc -l

# Files using testcontainers
grep -rlE "testcontainers|TestContainer|PostgresContainer" backend/ --include="*.py" | wc -l

4. Architectural Rigor

Measure: number of Architecture Decision Records (with supersession tracking), Alembic migration count, and RLS-protected table count. These proxy for decision discipline, schema-change discipline, and tenant-isolation coverage. The commands below count each.

Collection commands

# ADR count
ls docs/adr/ | grep -c "^ADR-"

# Alembic migrations
ls backend/alembic/versions/*.py | wc -l

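# RLS tables: no single command covers this. Count explicit statements with grep,
# then add by hand the tables listed in TENANT_TABLES and DIAGNOSTICS_TABLES
# (in the source project these sit in migrations 023-030).
grep -rhi "ENABLE ROW LEVEL SECURITY" backend/alembic/versions/ | wc -l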

5. Living Documentation

Measure: total doc pages, the breakdown by audience (architecture, ADR, API reference, strategy), and the code-to-documentation commit ratio. A ratio near 1:1 indicates documentation is treated as a same-tier deliverable. Track these in your own docs repo over time.


6. Methodology Infrastructure

Measure: number of custom agent definitions (global vs project), persistent memory files, lifecycle skills, quality-gate hook layers, and MCP server integrations. These quantify how much of the methodology is wired into tooling rather than left to discipline.


7. Updated Metrics (March 2026)

Measure, over time: architecture pages carrying structured frontmatter, components mapped in the architecture index, backend documentation coverage (documented/total), and cross-tool compatibility via an AGENTS.md symlink. Coverage is a ratchet — record it each cycle and watch the trend rather than chasing a one-time target.


Velocity Context

In one line: the headline figure is the co-authoring ratio: a large majority of commits are AI co-authored, under human architecture and review.

The collaboration model the numbers reflect: the human architects, reviews, and validates; the AI implements, tests, and iterates. A single architect working this way sustains a commit cadence that would otherwise need a small team, without surrendering design authority or review.

This methodology specification was itself designed using the brainstorm-to-spec-to-plan lifecycle it describes, authored collaboratively with an AI pair, and reviewed by fresh-context agents — a practical demonstration of the process.


Measuring whether the methodology is working

In one line: this project ships NO borrowed metrics — instead, run scripts/methodology-health.sh on your own repo and watch the trend over cycles.

A methodology cannot honestly claim effectiveness by quoting one project's headline numbers; those numbers belong to that project, not to you. So the de-projected canon deliberately reports none. The instrument it ships instead is a reproducible, project-agnostic report you run against your repo:

bash scripts/methodology-health.sh <your-project-root>

It emits, from git and the file tree alone:

  • Feature flags currently defined versus retired in the last 30 days — the consolidation signal (do flags get removed once their work lands, or accumulate forever?).
  • ADRs added in the last 30 days — whether decisions are being recorded as they happen.
  • Design artifacts versus recent feature branches — a rough read on the one-design-artifact-per-feature rule.
  • Gate-config presence — whether .claude/settings.json hooks are actually wired, not just intended.
  • Oversized modules (over 3000 lines) — the structural-decay signal.

Mechanism: the report is generated by a checked-in script, so the numbers are reproducible and a skeptic can re-run them. Track the TREND across cycles, not any single absolute value — flags-retired-per-cycle going up, ADRs/month staying non-zero, design-artifact-per-feature holding near one, the gate config staying present.

What it does not measure: it counts artifacts, not their quality. A high ADR count says decisions are being written down, not that they are good decisions; a retired flag says the flag was removed, not that the consolidation was sound. The instrument is a tripwire for drift, not a verdict on correctness — that judgment still needs review. (Counting gate-saved incidents is intentionally out of scope until hooks persist block-events.)
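To make one of the listed signals concrete, here is a sketch of the oversized-module check in Python. It is not the shipped scripts/methodology-health.sh, whose internals are not shown here. The function name, the default suffix list, and the decision to count raw lines (blank lines and comments included) are assumptions.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def oversized_modules(root: str, threshold: int = 3000,
                      suffixes: Sequence[str] = (".py", ".ts", ".tsx")) -> List[Tuple[str, int]]:
    """Return (relative path, line count) for every source file longer than threshold."""
    results: List[Tuple[str, int]] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > threshold:
            results.append((str(path.relative_to(root)), line_count))
    return results
```

Recording len(oversized_modules(".")) once per cycle gives the trend line described above. Like the rest of the instrument, it counts files and says nothing about whether a long module is badly designed.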