Building Vault-Agent in the open
Data Vault 2.0 is the methodology of choice for enterprise warehouses that must stay auditable, historized, and resilient to change, and it is common at banks, insurers, and pharma companies in Switzerland and the wider DACH region. But the initial modeling is slow and unforgiving: identifying business keys, structuring hubs, links, and satellites, and wiring up the loading logic is repetitive, error-prone work that can consume weeks of senior-architect time before a single row is loaded.
That is exactly the shape of problem agentic AI is good at — if you respect a few constraints.
Why Data Vault, of all things
Most “AI for X” attempts fail because the underlying task is fuzzy, unverifiable, or has no ground truth. Data Vault is the opposite:
- It is pattern-based. Hubs, links, and satellites follow standardized rules. Patterns are exactly what you can encode, check, and automate.
- It is verifiable. The rules (one hub per business key, links carry no descriptive attributes, satellites split by rate of change) can be checked deterministically — so an independent validator can catch what an LLM gets wrong.
- It separates the easy from the hard. The Raw Vault is integration-light and pattern-driven; the genuinely hard business logic lives downstream in the Business Vault and the marts. A responsible tool automates the former and assists the latter — it never pretends to own business rules.
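To make "verifiable" concrete, here is a minimal sketch of what a deterministic check of the three rules named above could look like. It is an illustration, not Vault-Agent's actual validator: the `Hub`, `Link`, `Satellite`, and `Model` classes, the `validate` function, and the "slow"/"fast" change-rate labels are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical, simplified model types; a real model carries far more metadata.
@dataclass(frozen=True)
class Hub:
    name: str
    business_key: str


@dataclass(frozen=True)
class Link:
    name: str
    hubs: tuple[str, ...]                       # names of the connected hubs
    descriptive_attributes: tuple[str, ...] = ()


@dataclass(frozen=True)
class Satellite:
    name: str
    parent: str                                 # hub or link being described
    attributes: tuple[tuple[str, str], ...]     # (column, change-rate class)


@dataclass(frozen=True)
class Model:
    hubs: tuple[Hub, ...] = ()
    links: tuple[Link, ...] = ()
    satellites: tuple[Satellite, ...] = ()


def validate(model: Model) -> list[str]:
    """Return readable rule violations; an empty list means the model passes."""
    violations: list[str] = []

    # Rule 1: one hub per business key.
    owner: dict[str, str] = {}
    for hub in model.hubs:
        if hub.business_key in owner:
            violations.append(
                f"business key '{hub.business_key}' has two hubs: "
                f"'{owner[hub.business_key]}' and '{hub.name}'"
            )
        else:
            owner[hub.business_key] = hub.name

    # Rule 2: links carry no descriptive attributes.
    for link in model.links:
        if link.descriptive_attributes:
            violations.append(
                f"link '{link.name}' carries descriptive attributes: "
                f"{', '.join(link.descriptive_attributes)}"
            )

    # Rule 3: a satellite holds attributes of a single change-rate class.
    for sat in model.satellites:
        rates = sorted({rate for _, rate in sat.attributes})
        if len(rates) > 1:
            violations.append(
                f"satellite '{sat.name}' mixes change rates: {', '.join(rates)}"
            )

    return violations


# A deliberately flawed proposal that breaks each rule once.
proposal = Model(
    hubs=(Hub("hub_customer", "customer_id"), Hub("hub_client", "customer_id")),
    links=(Link("link_customer_account", ("hub_customer", "hub_account"), ("opened_on",)),),
    satellites=(
        Satellite("sat_customer", "hub_customer",
                  (("date_of_birth", "slow"), ("last_login_at", "fast"))),
    ),
)

for problem in validate(proposal):
    print(problem)
```

Because the check is plain code with no model call in it, the same proposal always yields the same list of violations, which is what lets it act as an independent gate on LLM output.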
The stance: assist, don’t replace
Vault-Agent is a multi-agent pipeline that reads requirements, proposes a model, generates AutomateDV/dbt code, and documents every decision as an ADR. The load-bearing design choice is assist + human ratification + rules-as-code: the LLM proposes, an independent validator gates, and a human signs off at a checkpoint. Non-determinism is quarantined to the proposal stage; everything downstream is deterministic and reviewable.
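The propose, gate, ratify sequence can be sketched in a few lines. This is an assumption-laden illustration rather than the project's real pipeline: `Proposal`, `Outcome`, `run_checkpoint`, and the stub functions are hypothetical names, and the "artifact" here is just a stable JSON record standing in for generated code and the ADR.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    model: dict        # proposed hubs, links, and satellites
    rationale: str     # why the proposer chose this shape (ADR material)


@dataclass(frozen=True)
class Outcome:
    status: str                        # "blocked", "declined", or "ratified"
    violations: tuple[str, ...] = ()
    artifact: Optional[str] = None


def run_checkpoint(
    propose: Callable[[], Proposal],
    validate: Callable[[dict], list],
    ratify: Callable[[Proposal], bool],
) -> Outcome:
    # The only non-deterministic step: in practice, an LLM call.
    proposal = propose()

    # Independent, rule-based gate; the human never sees an invalid model.
    violations = validate(proposal.model)
    if violations:
        return Outcome("blocked", tuple(violations))

    # Human checkpoint: nothing is emitted without an explicit sign-off.
    if not ratify(proposal):
        return Outcome("declined")

    # Deterministic from here on: same ratified proposal, same artifact.
    artifact = json.dumps(
        {"model": proposal.model, "rationale": proposal.rationale},
        sort_keys=True,
        indent=2,
    )
    return Outcome("ratified", artifact=artifact)


# Stubs standing in for the LLM proposer and the rule validator.
def stub_propose() -> Proposal:
    return Proposal(
        {"hubs": [{"name": "hub_customer", "business_key": "customer_id"}]},
        "customer_id is stable across both source systems",
    )


def stub_validate(model: dict) -> list:
    return [] if model.get("hubs") else ["model has no hubs"]


outcome = run_checkpoint(stub_propose, stub_validate, ratify=lambda p: True)
print(outcome.status)
print(outcome.artifact)
```

Passing the three stages in as functions keeps the non-deterministic part swappable and testable with stubs, while the ordering (validator before human, human before output) stays fixed in one place.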
It is not a push-button warehouse, and it does not claim to be. The value lies in the slow, senior-architect front end of the work: turning intent into a scoped, contract-backed, documented model. Emitting SQL is the easy part.
Why in the open
I’m building this transparently, including its limits: a pre-mortem of where it could fail, a reality test on deliberately messy multi-source input, and an honest backlog. Openly naming limitations builds credibility; it is not a weakness. In a field full of “AI-ready” marketing, showing how and why the agent reasons is the differentiator.
More to come as the project grows. The code, the architecture decisions, and the findings all live in the repository.