Architecture RFC · ENG-5996

Data Model · Foundations

From a Rigid Schema to a Customer-Configurable Ontology

A constrained property graph for Vallor: keep stable identity as columns, keep dynamic attributes as schema-typed JSONB, and add a first-class labeled edge primitive for relationships. Engineering owns a small set of primitives; per-customer process variation, however unbounded, lives in data, not migrations.

Recommendation: Property Graph · Not: Classic EAV

Primitives to own

Migration shape: Strangler-fig

The Problem We Actually Have

Three years in, the original assumption — that contract management is rigid and uniform — has proven false. Every company manages contracts differently, and the technology moves faster than our data model can flex. The variation is not noise; it is the domain.

Counterparties vary

One supplier, many suppliers, a customer, or a single person. Sometimes there is no company counterparty at all.

"A contract" is not a file

Sometimes many documents represent one contract. We hardcoded contract = PDF/DOCX. Files should be a primitive.

Roles are customer-defined

"Owner" vs "manager" vs something else — or a whole team. Some customers have no owner concept at all.

The cost of the rigid assumption

Today, every new way a customer thinks about contracts becomes a schema change and an engineering ticket. At venture scale, that puts engineering on the critical path of every customer's idiosyncratic process map. This is the tech debt to pay down. Dynamic fields were the first response to exactly this pressure: field shapes that differ wildly across industries, and even across teams in one org, expressed as a JSON Schema we can hand to an LLM.

The Concept, in One Picture

Everything in the problem statement is one shape: a property graph, meaning typed nodes connected by typed, labeled edges.

Property graph: nodes (entities) and edges (relationships) are both first-class and can both hold key/value properties. Unlike RDF triples, properties live on the element, not as separate statements. It is the model behind Neo4j and, alongside RDF, Amazon Neptune. Versus EAV: it keeps attributes and relationships in separate, typed structures.

The same contract, expressed for three different customers, changes only its edges and labels:

Indigo = the entity in focus (node + its dynamic_fields). Cyan = related nodes. Pills = org-registered edge labels. Change a customer's process by changing edges and labels — never the schema.
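To make the "only the label varies" claim concrete, here is a minimal sketch of the node and edge shapes. All type and variable names (`NodeRef`, `GraphNode`, `Edge`, `differingFields`) are hypothetical illustrations, not identifiers from the codebase; only the node / labeled-edge shape is taken from this RFC.

```typescript
// Hypothetical names throughout; only the node/edge/label shape comes from the RFC.
type NodeRef = { type: string; id: string };

interface GraphNode extends NodeRef {
  dynamicFields: Record<string, unknown>; // schema-typed attributes
}

interface Edge {
  source: NodeRef;
  relationshipType: string; // org-registered label, stored as data
  target: NodeRef;
}

const alice: GraphNode = { type: "person", id: "p1", dynamicFields: { name: "Alice" } };
const msa: GraphNode = { type: "contract", id: "c1", dynamicFields: { title: "MSA" } };

const ref = (n: GraphNode): NodeRef => ({ type: n.type, id: n.id });

// Customer A calls the role "owner"; customer B calls it "manager".
const edgeForCustomerA: Edge = { source: ref(alice), relationshipType: "owner", target: ref(msa) };
const edgeForCustomerB: Edge = { ...edgeForCustomerA, relationshipType: "manager" };

// Returns the names of the fields on which two edges differ.
function differingFields(a: Edge, b: Edge): string[] {
  const diffs: string[] = [];
  if (a.source.type !== b.source.type || a.source.id !== b.source.id) diffs.push("source");
  if (a.relationshipType !== b.relationshipType) diffs.push("relationshipType");
  if (a.target.type !== b.target.type || a.target.id !== b.target.id) diffs.push("target");
  return diffs;
}
```

Comparing the two customers' edges reports a difference in `relationshipType` only: the same person node, the same contract node, a different label.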

"What is a contract owner other than a person related to a contract with the label 'owner'? How is that different from a manager — the same person entity, related to the same contract, with the label 'manager'?" — That observation is the property-graph model. The label is the only thing that varies, and the label is data.

We Already Grew the Parts

We have, piecemeal and without naming it, already built the four ingredients of this graph:

Ingredient already in the repo	What it is today	What it becomes in the graph
dynamic_fields + extraction_schema	Per-org JSON-Schema metamodel + JSONB values on contract, redline_project, organization_company, task…	Node attributes, typed by the org's schema. Keep as-is.
task_entity	(task_id, entity_type, entity_id) — a polymorphic link, hardcoded to one source type.	The edge primitive, generalized to any source + a label.
organization_label	Per-org registry of customer-defined labels with type, category, cardinality (single/multi).	The relationship-type registry — the seed of an ontology.
entity_event	Polymorphic audit trail keyed by (entity_type, entity_id).	Proof polymorphic references work in our stack at scale.

Evidence — this is not greenfield

We did not build EAV. We independently evolved the scaffolding of a constrained property graph. The work is not "adopt a new pattern" — it is "promote four ad-hoc pieces into two deliberate primitives." That de-risks the effort: polymorphic refs, the access patterns, and the per-org metamodel are already proven in production.

core/migrations/1739999145835_extraction-schema.mjs · 1777403653000_task-standalone.mjs · 1763994801816_create-organization-label.mjs · 1777403658000_entity-event-audit-trail.mjs
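As a rough illustration of what "generalizing task_entity" could mean, the sketch below maps one existing link row onto the proposed edge shape. The row interfaces use camelCase for readability, `organizationId` is passed in because the RFC does not say whether task_entity carries it, and the `"relates_to"` label is a placeholder assumption: the RFC does not name a label for task links.

```typescript
// Shape of today's link row, per the RFC: (task_id, entity_type, entity_id).
interface TaskEntityRow {
  taskId: string;
  entityType: string;
  entityId: string;
}

// Shape of the proposed edge row (the dynamic_fields column is omitted for brevity).
interface EntityEdgeRow {
  organizationId: string;
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
}

// ASSUMPTION: "relates_to" is a placeholder label; the RFC does not name one.
const DEFAULT_TASK_LABEL = "relates_to";

function taskEntityToEdge(
  row: TaskEntityRow,
  organizationId: string,
  relationshipType: string = DEFAULT_TASK_LABEL,
): EntityEdgeRow {
  return {
    organizationId,
    sourceType: "task", // previously implicit in the table name
    sourceId: row.taskId,
    relationshipType,
    targetType: row.entityType,
    targetId: row.entityId,
  };
}

const migrated = taskEntityToEdge(
  { taskId: "t1", entityType: "contract", entityId: "c1" },
  "org1",
);
```

The only information the old row lacks is the source type (hardcoded to task) and a label, which is why the generalization is additive rather than a rewrite.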

The Decision — and the One Thing EAV Gets Wrong

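EAV can model relationships (the value holds a reference to another entity, "EAV with relationships"). So the question is not "can it," it is "should it." EAV's defining move is collapsing attributes and relationships into one untyped (entity, attribute, value) table, the same shape as an RDF (subject, predicate, object) triple. That is exactly the move to avoid, because the two have different needs: attributes are read together with their entity, so they belong co-located in that row's JSONB, while relationships are traversed from either end, so they need their own table indexed in both directions.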

The registry is the schema EAV lacks — expressed as data, editable without a migration. It governs attributes and relationships separately, each typed for its own access pattern.

So the recommendation is not EAV and not "keep extending dynamic_fields" (it structurally cannot link). It is: keep dynamic_fields for attributes, add a first-class labeled edge primitive for relationships, governed by a per-org registry that constrains which edges are legal — the same role extraction_schema plays for attributes.
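A minimal sketch of how a per-org registry might constrain which edges are legal. The registry entry shape (`RelationshipTypeDef` with `sourceTypes` / `targetTypes`) is an assumption for illustration; the RFC only says the registry governs legal edges, and the real check would presumably be a Zod schema rather than hand-written code.

```typescript
// ASSUMPTION: a registry entry lists which node types may sit on each end of a label.
interface RelationshipTypeDef {
  relationshipType: string;
  sourceTypes: string[];
  targetTypes: string[];
}

interface EdgeInput {
  sourceType: string;
  relationshipType: string;
  targetType: string;
}

type Validation = { ok: true } | { ok: false; reason: string };

function validateEdge(registry: RelationshipTypeDef[], edge: EdgeInput): Validation {
  const def = registry.find((d) => d.relationshipType === edge.relationshipType);
  if (!def) {
    return { ok: false, reason: `unregistered relationship type: ${edge.relationshipType}` };
  }
  if (!def.sourceTypes.includes(edge.sourceType)) {
    return { ok: false, reason: `${edge.sourceType} cannot be the source of ${edge.relationshipType}` };
  }
  if (!def.targetTypes.includes(edge.targetType)) {
    return { ok: false, reason: `${edge.targetType} cannot be the target of ${edge.relationshipType}` };
  }
  return { ok: true };
}

// One org's registry: this customer has owners (a person or a team) but no "manager" concept.
const orgRegistry: RelationshipTypeDef[] = [
  { relationshipType: "owner", sourceTypes: ["person", "team"], targetTypes: ["contract"] },
  { relationshipType: "supplier", sourceTypes: ["company"], targetTypes: ["contract"] },
];

const teamOwner = validateEdge(orgRegistry, { sourceType: "team", relationshipType: "owner", targetType: "contract" });
const unknownLabel = validateEdge(orgRegistry, { sourceType: "person", relationshipType: "manager", targetType: "contract" });
const wrongSource = validateEdge(orgRegistry, { sourceType: "person", relationshipType: "supplier", targetType: "contract" });
```

Adding a "manager" concept for this org would then be one new registry entry, with no change to the validation code.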

Why the registry keeps this from becoming "EAV soup"

EAV famously lacks a schema; our registry is the schema. Without it you get the OTLT failure mode: undiscoverable, unconstrained, unqueryable. The registry makes this a constrained graph.

EAV / OTLT failure mode: the "One True Lookup Table" anti-pattern, where everything dissolves into one untyped table, nothing is discoverable, and every query becomes bespoke self-joins. Well-documented precisely because teams keep getting burned (Celko, "SQL for Smarties"; the MUCK / OTLT critique).

Mirrors extraction-schema.mjs · organization_label (type, category, kind: single|multi)

The Edge Primitive (Illustrative)

// packages/database/src/schema/entity-edge.ts (Drizzle definition; queried via Kysely)
entity_edge (
  organization_id    not null,
  source_type        not null,  // 'contract' | 'person' | 'team' | 'file' | 'chat' | …
  source_id          not null,
  relationship_type  not null,  // 'owner' | 'manager' | 'supplier' | 'part_of' | … (org-registered)
  target_type        not null,
  target_id          not null,
  dynamic_fields     jsonb,     // attributes OF the relationship (e.g. ownership start date)
  unique (source_type, source_id, relationship_type, target_type, target_id),
  index (source_type, source_id),
  index (target_type, target_id),
  index (organization_id, relationship_type)
)

Nodes = existing core tables. Keep stable identity columns; attach dynamic_fields for the variable parts.
Edges = the table above. Bidirectionally indexed; the edge itself can hold attributes.
Registry validated with Zod at the boundary, like the attribute metamodel.

Why Zod, not a Postgres enum/FK: the codebase parses at the boundary and bans pgEnum + raw casts by hook (.agents/rules/block-postgres-enum.md · block-as-type-cast.md). A polymorphic target can't carry a real FK anyway, so registry + Zod is the strongest integrity available, and it stays editable as data.
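To show what the unique constraint and the two directional indexes buy, here is an in-memory sketch of the edge table. `EdgeStore` and its methods are hypothetical; the `Map`s merely stand in for the Postgres unique constraint and the `(source_type, source_id)` / `(target_type, target_id)` indexes, and nothing here reflects the actual Kysely helpers.

```typescript
interface EdgeRow {
  organizationId: string;
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
  dynamicFields?: Record<string, unknown>; // attributes of the relationship itself
}

const nodeKey = (type: string, id: string): string => JSON.stringify([type, id]);

// Mirrors the composite unique constraint on the illustrative table.
const edgeKey = (e: EdgeRow): string =>
  JSON.stringify([e.sourceType, e.sourceId, e.relationshipType, e.targetType, e.targetId]);

class EdgeStore {
  private readonly byKey = new Map<string, EdgeRow>();
  private readonly bySource = new Map<string, EdgeRow[]>();
  private readonly byTarget = new Map<string, EdgeRow[]>();

  // Returns false when an identical edge already exists (the unique constraint).
  create(edge: EdgeRow): boolean {
    const key = edgeKey(edge);
    if (this.byKey.has(key)) return false;
    this.byKey.set(key, edge);
    EdgeStore.push(this.bySource, nodeKey(edge.sourceType, edge.sourceId), edge);
    EdgeStore.push(this.byTarget, nodeKey(edge.targetType, edge.targetId), edge);
    return true;
  }

  listBySource(type: string, id: string): EdgeRow[] {
    return this.bySource.get(nodeKey(type, id)) ?? [];
  }

  listByTarget(type: string, id: string): EdgeRow[] {
    return this.byTarget.get(nodeKey(type, id)) ?? [];
  }

  private static push(index: Map<string, EdgeRow[]>, key: string, edge: EdgeRow): void {
    const bucket = index.get(key);
    if (bucket) bucket.push(edge);
    else index.set(key, [edge]);
  }
}

const store = new EdgeStore();
const ownerEdge: EdgeRow = {
  organizationId: "org1",
  sourceType: "person", sourceId: "p1",
  relationshipType: "owner",
  targetType: "contract", targetId: "c1",
  dynamicFields: { since: "2026-01-01" },
};
const firstInsert = store.create(ownerEdge);
const duplicateInsert = store.create({ ...ownerEdge });
store.create({
  organizationId: "org1",
  sourceType: "file", sourceId: "f1",
  relationshipType: "part_of",
  targetType: "contract", targetId: "c1",
});
```

Both "what does this person relate to" (`listBySource`) and "what relates to this contract" (`listByTarget`) are single lookups, which is the point of indexing the edge in both directions.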

Your Cases, Resolved in the Model

The customer's reality	How the graph represents it — zero schema change
Owner vs. manager	Same entities, different edge label. (person)-[owner]→(contract) vs (person)-[manager]→(contract).
No "owner" concept here	Their registry omits owner. Nothing forced, no null field. Absence costs nothing — you store edges that exist.
A team or division owns it	Polymorphic source: (team)-[owner]→(contract). Same edge type, different source node type.
1 / N suppliers / a customer / a person counterparty	Cardinality is just "how many edges of that type exist." (company)-[supplier]→(contract) ×N, or (person)-[counterparty]→(contract).
Many documents = one contract	file becomes a node type. Many (file)-[part_of]→(contract) edges. "Contract = PDF" dissolves.
Files relate to companies, chats, files…	A file is a node; its relations are edges. "Polymorphic, graph, EAV, or join?" → all the same answer at different zoom: a node + a generalized polymorphic edge = a graph.
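The "cardinality is just how many edges exist" row can be sketched as a count plus a registry rule. This assumes the single/multi cardinality recorded on organization_label is enforced per target node; that interpretation, and the names `countEdges` and `canAddEdge`, are assumptions for illustration.

```typescript
type Cardinality = "single" | "multi";

interface LabeledEdge {
  sourceId: string;
  relationshipType: string;
  targetId: string;
}

// How many edges with this label point at the target.
function countEdges(edges: LabeledEdge[], targetId: string, relationshipType: string): number {
  return edges.filter(
    (e) => e.targetId === targetId && e.relationshipType === relationshipType,
  ).length;
}

// ASSUMPTION: "single" means at most one edge of that label per target node.
function canAddEdge(
  edges: LabeledEdge[],
  targetId: string,
  relationshipType: string,
  cardinality: Cardinality,
): boolean {
  return cardinality === "multi" || countEdges(edges, targetId, relationshipType) === 0;
}

const contractEdges: LabeledEdge[] = [
  { sourceId: "company-1", relationshipType: "supplier", targetId: "c1" },
  { sourceId: "company-2", relationshipType: "supplier", targetId: "c1" },
  { sourceId: "person-1", relationshipType: "owner", targetId: "c1" },
];

const supplierCount = countEdges(contractEdges, "c1", "supplier");
const ownersOnOtherContract = countEdges(contractEdges, "c2", "owner");
const secondOwnerAllowed = canAddEdge(contractEdges, "c1", "owner", "single");
const thirdSupplierAllowed = canAddEdge(contractEdges, "c1", "supplier", "multi");
```

A contract with no owner is simply a contract with zero `owner` edges; there is no null column to carry.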

Why This Is More LLM-Manageable, Not Less

This answers the venture-scale pain directly. Collapsing the world to a uniform vocabulary shrinks the LLM's surface to ~3 tools:

createNode(type, attributes)          // attributes validated against the org's JSON Schema
setAttributes(node, attributes)      // same metamodel dynamic_fields already uses
createEdge(source, relationshipType, target)  // validated against the org's edge registry

The LLM reads the org's ontology (node types + edge-type registry + attribute schemas, all data, all expressible as JSON Schema) and maps a customer's documents and intent onto nodes and edges. No bespoke per-customer code.

Why JSON Schema is the unlock: models consume and emit JSON Schema natively (structured-output / tool-arg format). The org's entire ontology becomes a prompt-able spec, and the LLM maps a customer's documents onto nodes + edges against it. It is the same bet dynamic_fields already made for attributes, extended to relationships (extraction_schema = JSON Schema Draft 2020-12 + x-metadata).
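One way the edge registry could be turned into a prompt-able spec is to generate the JSON Schema for the `createEdge` tool arguments from it. This is a sketch under assumptions: the registry entry shape and the function names are hypothetical, and the argument layout (`source` / `relationshipType` / `target`) is inferred from the three-op signature in this section rather than taken from an existing tool definition.

```typescript
// ASSUMPTION: same hypothetical registry entry shape as a per-org edge registry might use.
interface RelationshipTypeDef {
  relationshipType: string;
  sourceTypes: string[];
  targetTypes: string[];
}

type JsonSchema = { [key: string]: unknown };

const unique = (values: string[]): string[] => Array.from(new Set(values));

// JSON Schema for a { type, id } node reference limited to the given node types.
function nodeRefSchema(types: string[]): JsonSchema {
  return {
    type: "object",
    properties: {
      type: { type: "string", enum: unique(types) },
      id: { type: "string" },
    },
    required: ["type", "id"],
    additionalProperties: false,
  };
}

// Builds the argument schema for createEdge(source, relationshipType, target).
function createEdgeToolSchema(registry: RelationshipTypeDef[]): JsonSchema {
  const sourceTypes = registry.reduce<string[]>((acc, d) => acc.concat(d.sourceTypes), []);
  const targetTypes = registry.reduce<string[]>((acc, d) => acc.concat(d.targetTypes), []);
  return {
    $schema: "https://json-schema.org/draft/2020-12/schema",
    type: "object",
    properties: {
      source: nodeRefSchema(sourceTypes),
      relationshipType: { type: "string", enum: registry.map((d) => d.relationshipType) },
      target: nodeRefSchema(targetTypes),
    },
    required: ["source", "relationshipType", "target"],
    additionalProperties: false,
  };
}

const sampleRegistry: RelationshipTypeDef[] = [
  { relationshipType: "owner", sourceTypes: ["person", "team"], targetTypes: ["contract"] },
  { relationshipType: "supplier", sourceTypes: ["company"], targetTypes: ["contract"] },
];

const toolSchema = createEdgeToolSchema(sampleRegistry);
```

This schema constrains each argument independently; whether a given label is legal for a given source type would still have to be checked against the registry when the call arrives.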

The strategic payoff

Engineering owns ~4 primitives. Customers' infinite process variation lives in the ontology + the graph instances — never in a migration. A new relationship kind ("outside counsel") is a registry edit, made by an admin or proposed by an LLM, not an engineering ticket.

Constrained Property Graph vs. Classic EAV

Dimension	Constrained property graph (recommended)	Classic EAV / triples
Attributes vs. relationships	Separate, each typed for its access pattern	Collapsed into one untyped table
Typing	Per-org JSON Schema + Zod	One stringly-typed value column
"Chats/files for contract X"	Indexed edge lookup, both directions	Self-join soup
Attribute read	One row + GIN (co-located JSONB)	N rows reassembled per entity
Referential integrity	App-layer + registry (poly target — a wash)	App-layer (poly target — a wash)
Cardinality / required rules	Expressible in the registry	Not expressible
Fit with existing tooling	Reuses dynamic_fields, task_entity, entity_event	Third pattern; entities become islands
LLM surface	~3 uniform ops over a JSON-Schema ontology	Generic triples; ambiguous to map reliably

The constrained graph keeps the generality you want from EAV while preserving the typing, tooling, and projection disciplines we've invested in. On the one dimension where EAV could theoretically win, a real FK on the link, it doesn't: a polymorphic target can't carry a true FK in any encoding short of per-type tables, so EAV doesn't even collect its usual prize.
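The "attribute read" and "typing" rows of the comparison can be illustrated with a small contrast. The row shapes and sample values below are made up for the example; the point is only that classic EAV stores one string per attribute and must reassemble the entity, while a co-located JSONB document comes back in one row with its JSON types intact.

```typescript
// Classic EAV: one row per attribute, one string-typed value column.
interface EavRow {
  entityId: string;
  attribute: string;
  value: string;
}

// Reassembles an entity from its N attribute rows.
function reassemble(rows: EavRow[], entityId: string): Record<string, string> {
  const entity: Record<string, string> = {};
  for (const row of rows) {
    if (row.entityId === entityId) entity[row.attribute] = row.value;
  }
  return entity;
}

const eavRows: EavRow[] = [
  { entityId: "c1", attribute: "title", value: "MSA" },
  { entityId: "c1", attribute: "amount", value: "250000" },
  { entityId: "c1", attribute: "autoRenew", value: "true" },
];

// Co-located JSONB: one row carries the whole attribute document.
interface ContractRow {
  id: string;
  dynamicFields: Record<string, unknown>;
}

const contractRow: ContractRow = {
  id: "c1",
  dynamicFields: { title: "MSA", amount: 250000, autoRenew: true },
};

const fromEav = reassemble(eavRows, "c1");
const eavAmountType = typeof fromEav["amount"];
const jsonbAmountType = typeof contractRow.dynamicFields["amount"];
```

With EAV, the reader has to know out of band that `amount` should be parsed as a number; with the JSONB document, the type survives storage and can be checked against the org's JSON Schema.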

Honest Tradeoffs — Go In With Eyes Open

Cost	What it means	Mitigation
Query complexity & perf	SELECT owner FROM contract becomes an edge join, or a recursive CTE for multi-hop traversals. Hot read paths get more expensive.	Denormalized projections back into columns / search index, which we already do.
No real FKs	Polymorphic edges can't enforce target existence at the DB.	Registry + Zod validation; background GC for dangling edges.
Compile-time → runtime typing	Trade contract.owner: Person for runtime-validated edges.	Parse every edge against the registry at the boundary.
Paradigm migration	3 years of code assumes contract.counterparty, contract = file, etc.	Strangler-fig (below). Never big-bang.
Over-generalization	A fully generic graph can become undiscoverable "soup."	The per-org registry constrains it; keep stable core as columns.
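The "background GC for dangling edges" mitigation could look roughly like the sketch below. It assumes the job can obtain the set of node keys that currently exist; how that set is built (per-type queries, batching) is out of scope here, and `findDanglingEdges` is a hypothetical name.

```typescript
interface Edge {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
}

const nodeKey = (type: string, id: string): string => `${type}:${id}`;

// Returns edges whose source or target no longer exists.
// Without a real FK, nothing in the database removes these for us.
function findDanglingEdges(edges: Edge[], existingNodes: Set<string>): Edge[] {
  return edges.filter(
    (e) =>
      !existingNodes.has(nodeKey(e.sourceType, e.sourceId)) ||
      !existingNodes.has(nodeKey(e.targetType, e.targetId)),
  );
}

const existingNodes = new Set<string>([nodeKey("person", "p1"), nodeKey("contract", "c1")]);

const allEdges: Edge[] = [
  { sourceType: "person", sourceId: "p1", relationshipType: "owner", targetType: "contract", targetId: "c1" },
  { sourceType: "person", sourceId: "p2", relationshipType: "manager", targetType: "contract", targetId: "c1" },
  { sourceType: "file", sourceId: "f1", relationshipType: "part_of", targetType: "contract", targetId: "c9" },
];

const dangling = findDanglingEdges(allEdges, existingNodes);
```

Here the second edge is dangling because person p2 is gone, and the third because neither the file nor the contract exists.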

Live warning from our own codebase

The projection layer is where this architecture bites. Our existing dynamic_fields→column sync triggers are currently broken for 10/12 fields (snake- vs camel-case key mismatch; only name/title sync) and are being torn out in ENG-5965. Budget explicitly for the read-projection strategy. Classic EAV with self-joins would make this worse, not better.

contract sync trigger audit · ENG-5965 · orphan fns: delete_embedding, safe_sync_extracted_date
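The class of bug described in the warning can be reproduced in a few lines. This is an illustration under assumptions, not the actual trigger logic: it assumes the JSON keys are camelCase and the columns snake_case (the RFC does not say which side uses which), and `effectiveDate` / `renewalTerm` are made-up field names. Single-word keys such as name and title are spelled identically in both conventions, which is consistent with those being the only fields that sync.

```typescript
// ASSUMPTION: hypothetical column list; only name and title are named in the RFC.
const LEGACY_COLUMNS = ["name", "title", "effective_date", "renewal_term"];

function camelToSnake(key: string): string {
  return key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
}

// Naive projection: looks each column name up directly in the JSON document.
function naiveProject(fields: Record<string, unknown>, columns: string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const column of columns) {
    if (column in fields) out[column] = fields[column];
  }
  return out;
}

// Projection that normalizes JSON keys to snake_case before matching columns.
function normalizedProject(fields: Record<string, unknown>, columns: string[]): Record<string, unknown> {
  const bySnakeKey: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    bySnakeKey[camelToSnake(key)] = fields[key];
  }
  return naiveProject(bySnakeKey, columns);
}

const sampleFields = {
  name: "Acme MSA",
  title: "Master Services Agreement",
  effectiveDate: "2026-01-01",
  renewalTerm: "12 months",
};

const naiveColumns = Object.keys(naiveProject(sampleFields, LEGACY_COLUMNS));
const normalizedColumns = Object.keys(normalizedProject(sampleFields, LEGACY_COLUMNS));
```

The naive projection silently drops every multi-word field, which is why a read-projection strategy needs an explicit, tested key mapping rather than an implicit naming convention.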

The Hybrid Boundary & Migration Path

Hybrid beats purist — where to draw the line

Do not dissolve everything into nodes/edges. Keep genuinely-stable identity as real columns (an org is an org; a contract existing is a fact; organization_id, timestamps, tenant ownership). Only the variable stuff goes graph/dynamic: counterparties, roles, custom files, custom attributes. Drawing that boundary is the whole art — the OTLT anti-pattern is what happens when you erase it.

Strangler-fig sequence (no big-bang)

Phase	Action	Outcome
1	Ship entity_edge + the per-org relationship-type registry. Generalize task_entity as the first consumer.	New edges exist alongside legacy columns. Nothing breaks.
2	Promote file to a first-class node; model contract↔file as part_of edges. Dual-write.	"Contract = PDF" retired behind the scenes.
3	Move counterparty / owner / manager onto edges. Project edges → legacy columns for back-compat reads.	Readers migrate incrementally; UI unchanged.
4	Expose the 3-op LLM vocabulary over the org ontology. Open registry editing to admins.	Customer process variation leaves the engineering critical path.
5	Retire legacy columns as readers cut over; keep only hybrid-boundary core columns.	Steady state: small primitives, data-driven ontology.
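Phase 3's "project edges → legacy columns" step could be sketched as a pure function from a contract's incoming edges to the old column values. The legacy view shape is an assumption based on the `contract.owner` and `contract.counterparty` references in this RFC, and picking the lowest source id when several edges exist is an arbitrary tie-break chosen only to keep the sketch deterministic.

```typescript
interface Edge {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
}

// ASSUMPTION: legacy readers expect one owner and one counterparty per contract.
interface LegacyContractView {
  id: string;
  owner: string | null;
  counterparty: string | null;
}

// Picks one source id among the contract's incoming edges with any of the given labels.
function projectSingle(edges: Edge[], contractId: string, labels: string[]): string | null {
  const ids = edges
    .filter(
      (e) =>
        e.targetType === "contract" &&
        e.targetId === contractId &&
        labels.includes(e.relationshipType),
    )
    .map((e) => e.sourceId)
    .sort();
  return ids[0] ?? null;
}

function projectLegacyContract(edges: Edge[], contractId: string): LegacyContractView {
  return {
    id: contractId,
    owner: projectSingle(edges, contractId, ["owner"]),
    counterparty: projectSingle(edges, contractId, ["supplier", "customer", "counterparty"]),
  };
}

const graphEdges: Edge[] = [
  { sourceType: "person", sourceId: "p1", relationshipType: "owner", targetType: "contract", targetId: "c1" },
  { sourceType: "person", sourceId: "p9", relationshipType: "manager", targetType: "contract", targetId: "c1" },
  { sourceType: "company", sourceId: "co1", relationshipType: "supplier", targetType: "contract", targetId: "c1" },
];

const legacyC1 = projectLegacyContract(graphEdges, "c1");
const legacyC2 = projectLegacyContract(graphEdges, "c2");
```

Legacy readers keep seeing `owner` and `counterparty` while writers move to edges; labels with no legacy equivalent, such as `manager`, simply do not project.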

What ENG-5996 Prototypes First

The first draft PR is a thin vertical slice of Phase 1 + part of Phase 3 — enough to see the model working end to end for the two relationships you named:

contract ↔ company

(company)-[supplier|customer]→(contract) via entity_edge. Many-to-many, multiple labels, zero schema change to add a new counterparty kind.

contract ↔ owner (person)

(person)-[owner|manager]→(contract). Same primitive, different label — proving the "owner vs manager is just a label" claim in running code.

Prototype scope (deliberately thin)

In: entity_edge migration · Zod entity + relationship-type schemas in @vallor/types · access-controlled Kysely query helpers (create / list-by-source / list-by-target) · an oRPC procedure to relate a contract to a company and an owner · a minimal seed of the registry · unit tests.

Out (for the first PR): UI, the LLM op vocabulary, file-as-node, and legacy-column projection. Those are later phases.
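To illustrate the "parse at the boundary" step the prototype relies on, here is a dependency-free stand-in for what the Zod edge schema would do with an untrusted payload. In the real slice this would presumably be a Zod schema in @vallor/types; the hand-written `parseEdgeInput` below, its result shape, and the field names are assumptions for illustration. It uses type guards rather than casts, in keeping with the no-cast rule mentioned earlier.

```typescript
interface EdgeInput {
  sourceType: string;
  sourceId: string;
  relationshipType: string;
  targetType: string;
  targetId: string;
}

type ParseResult<T> = { success: true; data: T } | { success: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

// Turns an unknown payload into a typed EdgeInput, or explains why it cannot.
function parseEdgeInput(input: unknown): ParseResult<EdgeInput> {
  if (!isRecord(input)) return { success: false, error: "expected an object" };
  const { sourceType, sourceId, relationshipType, targetType, targetId } = input;
  if (!isNonEmptyString(sourceType)) return { success: false, error: "sourceType must be a non-empty string" };
  if (!isNonEmptyString(sourceId)) return { success: false, error: "sourceId must be a non-empty string" };
  if (!isNonEmptyString(relationshipType)) return { success: false, error: "relationshipType must be a non-empty string" };
  if (!isNonEmptyString(targetType)) return { success: false, error: "targetType must be a non-empty string" };
  if (!isNonEmptyString(targetId)) return { success: false, error: "targetId must be a non-empty string" };
  return { success: true, data: { sourceType, sourceId, relationshipType, targetType, targetId } };
}

const goodPayload: unknown = {
  sourceType: "person",
  sourceId: "p1",
  relationshipType: "owner",
  targetType: "contract",
  targetId: "c1",
};
const parsedGood = parseEdgeInput(goodPayload);
const parsedIncomplete = parseEdgeInput({ sourceType: "person" });
const parsedWrongShape = parseEdgeInput("not an object");
```

Shape parsing like this only proves the payload is well-formed; whether the label is registered for the org is a separate check against the registry.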

Re-enriched from ENG-5996 (originally "use EAV model") · branch + draft PR via worktree

Architecture RFC · Draft for discussion · Internal · ENG-5996 · 2026-06-23