Architecture RFC · ENG-5996
Data Model · Foundations

From a Rigid Schema to a Customer-Configurable Ontology

A constrained property graph for Vallor: keep stable identity as columns, keep dynamic attributes as schema-typed JSONB, and add a first-class labeled edge primitive for relationships. Engineering owns a small set of primitives; the infinite per-customer process variation lives in data, not migrations.

Recommendation: Property Graph
Not: Classic EAV
Primitives to own: ~4
Migration shape: Strangler-fig

The Problem We Actually Have

Three years in, the original assumption — that contract management is rigid and uniform — has proven false. Every company manages contracts differently, and the technology moves faster than our data model can flex. The variation is not noise; it is the domain.

Counterparties vary
One supplier, many suppliers, a customer, or a single person. Sometimes there is no company counterparty at all.
"A contract" is not a file
Sometimes many documents represent one contract. We hardcoded contract = PDF/DOCX. Files should be a primitive.
Roles are customer-defined
"Owner" vs "manager" vs something else — or a whole team. Some customers have no owner concept at all.
The cost of the rigid assumption

Today, every new way a customer thinks about contracts becomes a schema change and an engineering ticket. At venture scale, that puts engineering on the critical path of every customer's idiosyncratic process map. This is the tech debt to pay down. Dynamic fields were the first response to exactly this pressure: field shapes that differ wildly across industries, and even across teams in one org, which we can hand to an LLM as JSON Schema.

The Concept, in One Picture

Everything in the problem statement is one shape: a property graph, that is, typed nodes connected by typed, labeled edges.

Property graph: nodes (entities) and edges (relationships) are both first-class and can both hold key/value properties. Unlike RDF triples, properties live on the element, not as separate statements. The model behind Neo4j and AWS Neptune. vs. EAV: a property graph keeps attributes and relationships in separate, typed structures.

The same contract, expressed for three different customers, changes only its edges and labels:

[Figure: a Contract node (with dynamic_fields) in focus, connected to Person, Team, Company (×N), File (×N), and Chat nodes by edges labeled owner, manager, supplier, part_of, and about.]
Indigo = the entity in focus (node + its dynamic_fields). Cyan = related nodes. Pills = org-registered edge labels. Change a customer's process by changing edges and labels — never the schema.
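As a minimal sketch of what "changing edges and labels" means in data, the snippet below relates one person to one contract under two labels. The `NodeRef` and `Edge` shapes and the `labelsBetween` helper are hypothetical names for illustration, not the production schema.

```typescript
// Hypothetical in-memory shapes for illustration; not the production schema.
type NodeRef = { type: string; id: string };
type Edge = { source: NodeRef; relationshipType: string; target: NodeRef };

const contract: NodeRef = { type: "contract", id: "c1" };
const alice: NodeRef = { type: "person", id: "p1" };

// Same two entities, two relationships: the label is the only difference.
const edges: Edge[] = [
  { source: alice, relationshipType: "owner", target: contract },
  { source: alice, relationshipType: "manager", target: contract },
];

// Returns every label under which `source` is related to `target`.
function labelsBetween(all: Edge[], source: NodeRef, target: NodeRef): string[] {
  return all
    .filter(
      (e) =>
        e.source.type === source.type &&
        e.source.id === source.id &&
        e.target.type === target.type &&
        e.target.id === target.id,
    )
    .map((e) => e.relationshipType);
}

const labels = labelsBetween(edges, alice, contract);
```

Adding a third kind of relationship here means appending another edge with a new label; neither type changes.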
"What is a contract owner other than a person related to a contract with the label 'owner'? How is that different from a manager — the same person entity, related to the same contract, with the label 'manager'?" — That observation is the property-graph model. The label is the only thing that varies, and the label is data.
We Already Grew the Parts

We have, piecemeal and without naming it, already built the four ingredients of this graph:

| Ingredient already in the repo | What it is today | What it becomes in the graph |
| --- | --- | --- |
| dynamic_fields + extraction_schema | Per-org JSON-Schema metamodel + JSONB values on contract, redline_project, organization_company, task… | Node attributes, typed by the org's schema. Keep as-is. |
| task_entity | (task_id, entity_type, entity_id): a polymorphic link, hardcoded to one source type. | The edge primitive, generalized to any source + a label. |
| organization_label | Per-org registry of customer-defined labels with type, category, cardinality (single/multi). | The relationship-type registry: the seed of an ontology. |
| entity_event | Polymorphic audit trail keyed by (entity_type, entity_id). | Proof polymorphic references work in our stack at scale. |
Evidence — this is not greenfield

We did not build EAV. We independently evolved the scaffolding of a constrained property graph. The work is not "adopt a new pattern" — it is "promote four ad-hoc pieces into two deliberate primitives." That de-risks the effort: polymorphic refs, the access patterns, and the per-org metamodel are already proven in production.

core/migrations/1739999145835_extraction-schema.mjs · 1777403653000_task-standalone.mjs · 1763994801816_create-organization-label.mjs · 1777403658000_entity-event-audit-trail.mjs
The Decision — and the One Thing EAV Gets Wrong

EAV can model relationships (the value holds a reference to another entity — "EAV with relationships"). So the question is not "can it," it is "should it." EAV's defining move is collapsing attributes and relationships into one untyped (subject, predicate, object) table. That is exactly the move to avoid — the two have different needs:

[Figure: the per-org Ontology Registry (legal node types · edge types · allowed endpoints · cardinality) validates two separate structures.
Nodes (attributes): stable identity columns + dynamic_fields (JSONB, JSON-Schema typed). KEEP · already in production · co-located read · GIN-indexed.
Edges (relationships): first-class labeled edge, entity_edge (generalized task_entity). BUILD · the missing primitive · bidirectional · indexed both ways.]
The registry is the schema EAV lacks — expressed as data, editable without a migration. It governs attributes and relationships separately, each typed for its own access pattern.

So the recommendation is not EAV and not "keep extending dynamic_fields" (it structurally cannot link). It is: keep dynamic_fields for attributes, add a first-class labeled edge primitive for relationships, governed by a per-org registry that constrains which edges are legal — the same role extraction_schema plays for attributes.

Why the registry keeps this from becoming "EAV soup"

EAV famously lacks a schema; our registry is the schema. Without it you get the OTLT failure mode: undiscoverable, unconstrained, unqueryable. The registry makes this a constrained graph.

EAV / OTLT failure mode: the "One True Lookup Table" anti-pattern, where everything dissolves into one untyped table, nothing is discoverable, and every query becomes bespoke self-joins. Well-documented precisely because teams keep getting burned (Celko, "SQL for Smarties"; the MUCK / OTLT critique).

Mirrors extraction-schema.mjs · organization_label (type, category, kind: single|multi)
The Edge Primitive (Illustrative)
// packages/database/src/schema/entity-edge.ts — Drizzle definition; queried via Kysely
entity_edge (
  organization_id   not null,
  source_type      not null,  // 'contract' | 'person' | 'team' | 'file' | 'chat' | …
  source_id        not null,
  relationship_type not null,  // 'owner' | 'manager' | 'supplier' | 'part_of' | … (org-registered)
  target_type      not null,
  target_id        not null,
  dynamic_fields    jsonb,    // attributes OF the relationship (e.g. ownership start date)
  unique (source_type, source_id, relationship_type, target_type, target_id),
  index (source_type, source_id),  index (target_type, target_id),
  index (organization_id, relationship_type)
)
  • Nodes = existing core tables. Keep stable identity columns; attach dynamic_fields for the variable parts.
  • Edges = the table above. Bidirectionally indexed; the edge itself can hold attributes.
  • Registry validated with Zod at the boundary, like the attribute metamodel.
    Why Zod, not a Postgres enum/FK: the codebase parses at the boundary and bans pgEnum + raw casts by hook. A polymorphic target can't carry a real FK anyway, so registry + Zod is the strongest integrity available, and it stays editable as data. (.agents/rules/block-postgres-enum.md · block-as-type-cast.md)
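The boundary check described in the bullets above can be sketched as follows. This is a dependency-free illustration under stated assumptions: the `EdgeTypeRule` shape, its `allowedSources` / `allowedTargets` fields, and `parseEdge` are hypothetical names, and in the codebase the same rule would presumably be expressed as a Zod schema rather than hand-written checks.

```typescript
// Hypothetical registry entry; in production this would be per-org data parsed with Zod.
type EdgeTypeRule = {
  relationshipType: string;
  allowedSources: string[];
  allowedTargets: string[];
};

type EdgeInput = {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
};

type ParseResult = { ok: true; edge: EdgeInput } | { ok: false; error: string };

// Accepts an edge only if its label is registered and both endpoints are legal for that label.
function parseEdge(registry: EdgeTypeRule[], input: EdgeInput): ParseResult {
  const rule = registry.find((r) => r.relationshipType === input.relationshipType);
  if (!rule) {
    return { ok: false, error: `unregistered relationship type: ${input.relationshipType}` };
  }
  if (!rule.allowedSources.includes(input.sourceType)) {
    return { ok: false, error: `source type not allowed: ${input.sourceType}` };
  }
  if (!rule.allowedTargets.includes(input.targetType)) {
    return { ok: false, error: `target type not allowed: ${input.targetType}` };
  }
  return { ok: true, edge: input };
}

const registry: EdgeTypeRule[] = [
  { relationshipType: "owner", allowedSources: ["person", "team"], allowedTargets: ["contract"] },
  { relationshipType: "supplier", allowedSources: ["company"], allowedTargets: ["contract"] },
];

const accepted = parseEdge(registry, {
  sourceType: "team", sourceId: "t1", relationshipType: "owner", targetType: "contract", targetId: "c1",
});
const rejected = parseEdge(registry, {
  sourceType: "person", sourceId: "p1", relationshipType: "outside_counsel", targetType: "contract", targetId: "c1",
});
```

Registering "outside_counsel" would be one more entry in `registry`; `parseEdge` itself would not change.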
Your Cases, Resolved in the Model
| The customer's reality | How the graph represents it (zero schema change) |
| --- | --- |
| Owner vs. manager | Same entities, different edge label. (person)-[owner]→(contract) vs (person)-[manager]→(contract). |
| No "owner" concept here | Their registry omits owner. Nothing forced, no null field. Absence costs nothing: you store edges that exist. |
| A team or division owns it | Polymorphic source: (team)-[owner]→(contract). Same edge type, different source node type. |
| 1 / N suppliers / a customer / a person counterparty | Cardinality is just "how many edges of that type exist." (company)-[supplier]→(contract) ×N, or (person)-[counterparty]→(contract). |
| Many documents = one contract | file becomes a node type. Many (file)-[part_of]→(contract) edges. "Contract = PDF" dissolves. |
| Files relate to companies, chats, files… | A file is a node; its relations are edges. "Polymorphic, graph, EAV, or join?" → all the same answer at different zoom: a node + a generalized polymorphic edge = a graph. |
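The cardinality row above can be read as a count over existing edges. The sketch below assumes, purely for illustration, that "single" means at most one edge of a given type may point at a given target; the real semantics of the single/multi flag on organization_label may differ, and `wouldViolateCardinality` is a hypothetical helper.

```typescript
type Cardinality = "single" | "multi";

type Edge = {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
};

// Assumption: "single" = at most one edge of this relationship type per target node.
function wouldViolateCardinality(existing: Edge[], candidate: Edge, cardinality: Cardinality): boolean {
  if (cardinality === "multi") return false;
  return existing.some(
    (e) =>
      e.relationshipType === candidate.relationshipType &&
      e.targetType === candidate.targetType &&
      e.targetId === candidate.targetId,
  );
}

const existing: Edge[] = [
  { sourceType: "person", sourceId: "p1", relationshipType: "owner", targetType: "contract", targetId: "c1" },
  { sourceType: "company", sourceId: "co1", relationshipType: "supplier", targetType: "contract", targetId: "c1" },
];

const secondOwner: Edge = {
  sourceType: "person", sourceId: "p2", relationshipType: "owner", targetType: "contract", targetId: "c1",
};
const secondSupplier: Edge = {
  sourceType: "company", sourceId: "co2", relationshipType: "supplier", targetType: "contract", targetId: "c1",
};

const ownerClash = wouldViolateCardinality(existing, secondOwner, "single");
const supplierClash = wouldViolateCardinality(existing, secondSupplier, "multi");
// An org with no owner concept simply has no owner edges, so nothing can clash.
const noOwnerOrg = wouldViolateCardinality([], secondOwner, "single");
```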
Why This Is More LLM-Manageable, Not Less

This answers the venture-scale pain directly. Collapsing the world to a uniform vocabulary shrinks the LLM's surface to ~3 tools:

createNode(type, attributes)          // attributes validated against the org's JSON Schema
setAttributes(node, attributes)      // same metamodel dynamic_fields already uses
createEdge(source, relationshipType, target)  // validated against the org's edge registry

The LLM reads the org's ontology (node types + edge-type registry + attribute schemas, all data, all expressible as JSON Schema) and maps a customer's documents and intent onto nodes and edges. No bespoke per-customer code.

Why JSON Schema is the unlock: models consume and emit JSON Schema natively (structured-output / tool-arg format). The org's entire ontology becomes a prompt-able spec, and the LLM maps a customer's documents onto nodes + edges against it. The same bet dynamic_fields already made for attributes, extended to relationships. (extraction_schema = JSON Schema Draft 2020-12 + x-metadata)
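To make "the ontology is expressible as JSON Schema" concrete, here is one possible way to derive the argument schema for a createEdge tool from the edge registry. The `EdgeTypeRule` shape and `createEdgeArgsSchema` are hypothetical; the output uses only standard JSON Schema 2020-12 keywords (`oneOf`, `const`, `enum`). Some structured-output APIs accept only a subset of JSON Schema, so the shape may need flattening for a given provider.

```typescript
// Hypothetical registry entry; field names are illustrative.
type EdgeTypeRule = {
  relationshipType: string;
  allowedSources: string[];
  allowedTargets: string[];
};

// Schema for a node reference restricted to the given node types.
function nodeRefSchema(types: string[]) {
  return {
    type: "object",
    properties: { type: { enum: types }, id: { type: "string" } },
    required: ["type", "id"],
    additionalProperties: false,
  };
}

// One branch per registered relationship type, so illegal endpoint pairings cannot validate.
function createEdgeArgsSchema(registry: EdgeTypeRule[]) {
  return {
    $schema: "https://json-schema.org/draft/2020-12/schema",
    oneOf: registry.map((rule) => ({
      type: "object",
      properties: {
        relationshipType: { const: rule.relationshipType },
        source: nodeRefSchema(rule.allowedSources),
        target: nodeRefSchema(rule.allowedTargets),
      },
      required: ["relationshipType", "source", "target"],
      additionalProperties: false,
    })),
  };
}

const schema = createEdgeArgsSchema([
  { relationshipType: "owner", allowedSources: ["person", "team"], allowedTargets: ["contract"] },
  { relationshipType: "part_of", allowedSources: ["file"], allowedTargets: ["contract"] },
]);
```

Because the schema is generated from registry rows, an admin adding a relationship type would change what the model is allowed to emit without any code change.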

The strategic payoff

Engineering owns ~4 primitives. Customers' infinite process variation lives in the ontology + the graph instances — never in a migration. A new relationship kind ("outside counsel") is a registry edit, made by an admin or proposed by an LLM, not an engineering ticket.

Constrained Property Graph vs. Classic EAV
| Dimension | Constrained property graph (recommended) | Classic EAV / triples |
| --- | --- | --- |
| Attributes vs. relationships | Separate, each typed for its access pattern | Collapsed into one untyped table |
| Typing | Schema-typed JSON Schema per org + Zod | Stringly-typed one value column |
| "Chats/files for contract X" | Indexed edge lookup, both directions | Self-join soup |
| Attribute read | One row + GIN (co-located JSONB) | N rows reassembled per entity |
| Referential integrity | App-layer + registry (poly target: a wash) | App-layer (poly target: a wash) |
| Cardinality / required rules | Expressible in the registry | Not expressible |
| Fit with existing tooling | Reuses dynamic_fields, task_entity, entity_event | Third pattern; entities become islands |
| LLM surface | ~3 uniform ops over a JSON-Schema ontology | Generic triples; ambiguous to map reliably |

The graph takes the generality you want from EAV/graph while preserving the typing, tooling, and projection disciplines we've invested in. On the one dimension EAV could theoretically win — a real FK on the link — it doesn't: a polymorphic target can't carry a true FK in any encoding short of per-type tables, so EAV doesn't even collect its usual prize.

Honest Tradeoffs — Go In With Eyes Open
| Cost | What it means | Mitigation |
| --- | --- | --- |
| Query complexity & perf | SELECT owner FROM contract becomes an edge join (or a recursive CTE for multi-hop traversals). Hot read paths get more expensive. | Denormalized projections back into columns / search index; we already do this. |
| No real FKs | Polymorphic edges can't enforce target existence at the DB. | Registry + Zod validation; background GC for dangling edges. |
| Compile-time → runtime typing | Trade contract.owner: Person for runtime-validated edges. | Parse every edge against the registry at the boundary. |
| Paradigm migration | 3 years of code assumes contract.counterparty, contract = file, etc. | Strangler-fig (below). Never big-bang. |
| Over-generalization | A fully generic graph can become undiscoverable "soup." | The per-org registry constrains it; keep stable core as columns. |
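The "background GC for dangling edges" mitigation could look roughly like the sketch below. It is an in-memory illustration only: `findDanglingEdges` and the `liveNodes` set are hypothetical, and a real job would presumably run per-type existence queries in batches rather than hold every node key in memory.

```typescript
type Edge = {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
};

// Composite key standing in for a polymorphic (type, id) reference.
function nodeKey(type: string, id: string): string {
  return `${type}:${id}`;
}

// An edge is dangling when either endpoint no longer exists.
function findDanglingEdges(edges: Edge[], liveNodes: Set<string>): Edge[] {
  return edges.filter(
    (e) =>
      !liveNodes.has(nodeKey(e.sourceType, e.sourceId)) ||
      !liveNodes.has(nodeKey(e.targetType, e.targetId)),
  );
}

const liveNodes = new Set<string>([nodeKey("contract", "c1"), nodeKey("person", "p1")]);

const edges: Edge[] = [
  { sourceType: "person", sourceId: "p1", relationshipType: "owner", targetType: "contract", targetId: "c1" },
  // The company behind this edge has been deleted, so no FK would have caught it.
  { sourceType: "company", sourceId: "co9", relationshipType: "supplier", targetType: "contract", targetId: "c1" },
];

const dangling = findDanglingEdges(edges, liveNodes);
```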
Live warning from our own codebase

The projection layer is where this architecture bites. Our existing dynamic_fields→column sync triggers are currently broken for 10/12 fields (snake- vs camel-case key mismatch; only name/title sync) and are being torn out in ENG-5965. Budget explicitly for the read-projection strategy. Classic EAV with self-joins would make this worse, not better.

contract sync trigger audit · ENG-5965 · orphan fns: delete_embedding, safe_sync_extracted_date
The Hybrid Boundary & Migration Path
Hybrid beats purist — where to draw the line

Do not dissolve everything into nodes/edges. Keep genuinely-stable identity as real columns (an org is an org; a contract existing is a fact; organization_id, timestamps, tenant ownership). Only the variable stuff goes graph/dynamic: counterparties, roles, custom files, custom attributes. Drawing that boundary is the whole art — the OTLT anti-pattern is what happens when you erase it.

Strangler-fig sequence (no big-bang)
| Phase | Action | Outcome |
| --- | --- | --- |
| 1 | Ship entity_edge + the per-org relationship-type registry. Generalize task_entity as the first consumer. | New edges exist alongside legacy columns. Nothing breaks. |
| 2 | Promote file to a first-class node; model contract↔file as part_of edges. Dual-write. | "Contract = PDF" retired behind the scenes. |
| 3 | Move counterparty / owner / manager onto edges. Project edges → legacy columns for back-compat reads. | Readers migrate incrementally; UI unchanged. |
| 4 | Expose the 3-op LLM vocabulary over the org ontology. Open registry editing to admins. | Customer process variation leaves the engineering critical path. |
| 5 | Retire legacy columns as readers cut over; keep only hybrid-boundary core columns. | Steady state: small primitives, data-driven ontology. |
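Phase 3's "project edges → legacy columns" step can be sketched as a pure function from a contract's incoming edges to the shape legacy readers expect. `LegacyContractView`, `ownerId`, and `supplierIds` are hypothetical stand-ins for whatever the legacy columns are actually called.

```typescript
type Edge = {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
};

// Hypothetical legacy read shape; field names are illustrative only.
type LegacyContractView = {
  contractId: string;
  ownerId: string | null;
  supplierIds: string[];
};

// Derives the legacy view from edges pointing at the given contract.
function projectContract(contractId: string, edges: Edge[]): LegacyContractView {
  const incoming = edges.filter((e) => e.targetType === "contract" && e.targetId === contractId);
  // Assumption: the legacy owner column can only hold a person, so a team owner projects to null.
  const owner = incoming.find((e) => e.relationshipType === "owner" && e.sourceType === "person");
  return {
    contractId,
    ownerId: owner ? owner.sourceId : null,
    supplierIds: incoming.filter((e) => e.relationshipType === "supplier").map((e) => e.sourceId),
  };
}

const edges: Edge[] = [
  { sourceType: "person", sourceId: "p1", relationshipType: "owner", targetType: "contract", targetId: "c1" },
  { sourceType: "company", sourceId: "co1", relationshipType: "supplier", targetType: "contract", targetId: "c1" },
  { sourceType: "company", sourceId: "co2", relationshipType: "supplier", targetType: "contract", targetId: "c1" },
  { sourceType: "team", sourceId: "t1", relationshipType: "owner", targetType: "contract", targetId: "c2" },
];

const c1View = projectContract("c1", edges);
const c2View = projectContract("c2", edges);
```

The team-owned contract shows the lossy part of back-compat projection: the graph can say more than the legacy column can hold, which is one reason to budget explicitly for the projection strategy.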
What ENG-5996 Prototypes First

The first draft PR is a thin vertical slice of Phase 1 + part of Phase 3 — enough to see the model working end to end for the two relationships you named:

contract ↔ company
(company)-[supplier|customer]→(contract) via entity_edge. Many-to-many, multiple labels, zero schema change to add a new counterparty kind.
contract ↔ owner (person)
(person)-[owner|manager]→(contract). Same primitive, different label — proving the "owner vs manager is just a label" claim in running code.
Prototype scope (deliberately thin)

In: entity_edge migration · Zod entity + relationship-type schemas in @vallor/types · access-controlled Kysely query helpers (create / list-by-source / list-by-target) · an oRPC procedure to relate a contract to a company and an owner · a minimal seed of the registry · unit tests. Out (for the first PR): UI, the LLM op vocabulary, file-as-node, and legacy-column projection — those are later phases.
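For orientation, the three query helpers in scope (create / list-by-source / list-by-target) are sketched below against an in-memory store instead of Kysely. Everything here is hypothetical: the `InMemoryEdgeStore` class, its method names, and the choice to scope every key by `organizationId` as a stand-in for access control. The two maps play the role of the two lookup indexes in the illustrative entity_edge definition.

```typescript
type Edge = {
  organizationId: string;
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
};

class InMemoryEdgeStore {
  private bySource = new Map<string, Edge[]>();
  private byTarget = new Map<string, Edge[]>();
  private seen = new Set<string>();

  // Returns false when the identical edge already exists (mirrors a unique constraint).
  create(edge: Edge): boolean {
    const identity = [
      edge.organizationId, edge.sourceType, edge.sourceId,
      edge.relationshipType, edge.targetType, edge.targetId,
    ].join("|");
    if (this.seen.has(identity)) return false;
    this.seen.add(identity);
    this.push(this.bySource, this.key(edge.organizationId, edge.sourceType, edge.sourceId), edge);
    this.push(this.byTarget, this.key(edge.organizationId, edge.targetType, edge.targetId), edge);
    return true;
  }

  listBySource(organizationId: string, type: string, id: string): Edge[] {
    return this.bySource.get(this.key(organizationId, type, id)) ?? [];
  }

  listByTarget(organizationId: string, type: string, id: string): Edge[] {
    return this.byTarget.get(this.key(organizationId, type, id)) ?? [];
  }

  private key(organizationId: string, type: string, id: string): string {
    return `${organizationId}|${type}|${id}`;
  }

  private push(index: Map<string, Edge[]>, key: string, edge: Edge): void {
    const bucket = index.get(key);
    if (bucket) bucket.push(edge);
    else index.set(key, [edge]);
  }
}

const store = new InMemoryEdgeStore();
const supplierEdge: Edge = {
  organizationId: "org1", sourceType: "company", sourceId: "co1",
  relationshipType: "supplier", targetType: "contract", targetId: "c1",
};
const firstInsert = store.create(supplierEdge);
const duplicateInsert = store.create(supplierEdge);
store.create({
  organizationId: "org1", sourceType: "person", sourceId: "p1",
  relationshipType: "owner", targetType: "contract", targetId: "c1",
});
const aboutContract = store.listByTarget("org1", "contract", "c1");
const otherTenant = store.listByTarget("org2", "contract", "c1");
```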

Re-enriched from ENG-5996 (originally "use EAV model") · branch + draft PR via worktree
Vallor Architecture RFC · Draft for discussion · Internal · ENG-5996 · 2026-06-23