From person-dependent to system-supported: a framework for governing data definitions

A six-step framework for moving institutional definitions out of one analyst's head and into governed, survivable infrastructure

Clema Research Team

June 26, 2026

A gap in infrastructure, not knowledge

Most institutional research teams already know how to write a good definition. Ask any seasoned analyst what "full-time" means in an IPEDS submission versus a state report, and they can tell you precisely. The knowledge exists. What is usually missing is the structure that keeps that knowledge findable, current, and survivable after the person who holds it leaves.

That is the pattern across our research with 20 IR and IE professionals in 13 states. Creating definitions is achievable. Maintenance is where efforts collapse. A 300-definition document built by one IT staffer, then shelved when that person left after about a year, is not a documentation success; it is a maintenance failure waiting to be discovered. As one finding from the study puts it plainly: the institutional intelligence gap is not a gap in knowledge, it is a gap in infrastructure.

If you have read the first two posts in this series, you have the diagnosis. The first explained why one common term carries several valid definitions and why that fragments reporting. The second framed key-person dependency as the institutional intelligence gap and gave you a self-assessment. This post is about the build: how to move from person-dependent to system-supported without buying a single tool first.

The anatomy of a definition you can govern

A definition you can actually govern carries two dimensions at once: the functional (what a non-technical consumer needs to understand the number) and the technical (the logic a system needs to reproduce it). The first post lists the seven parts in full. The governance point is narrower: most documented definitions in the study captured the business meaning and stopped there, leaving out exactly the half that lets a definition survive a staff change or a system migration. The governance metadata (who owns it, when it was last reviewed, what changed and why) is the part that decays first and is hardest to reconstruct after the fact.

The seven parts, viewed through a governance lens

Term name

The label as it appears in reports and systems. Trivial until two systems use the same word for different things; then the name is the first place ambiguity hides.

Plain-language business definition

Written for the requestor, not the analyst. If a dean cannot read it and know what they are getting, it will not be adopted, and an unused definition ends up quietly competing with shadow versions.

Technical calculation and logic

Inclusion and exclusion criteria spelled out. This is the half that lets any analyst reproduce the number, and the half most documentation skips.

Source system and field

Banner, PeopleSoft, or Workday, plus the specific field. Without it, a migration forces you to rediscover lineage you already paid to learn once.

Reporting context

IPEDS, state, internal, or accreditation. The same term legitimately differs across these; naming the context is how you keep multiple valid definitions from reading as contradictions.

Ownership

Who is responsible, who approves, who gets notified on change. Tie this to a position, not a person, so the role outlives the individual who currently fills it.

Review history

Last reviewed, what changed, and why. A quarter of the study's participants kept no change record at all, which means their reasoning evaporated the moment the author moved on.
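As a rough illustration, the seven parts could be held in one structured record so that gaps are easy to spot. This is a minimal sketch, not a schema from the study or from any tool: the class names, field names, the missing_parts helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ReviewEntry:
    """One row of review history: when, what changed, and why."""
    reviewed_on: date
    change: str
    reason: str


@dataclass
class GovernedDefinition:
    """Hypothetical record holding the seven parts of one definition."""
    term_name: str
    business_definition: str      # plain language, written for the requestor
    technical_logic: str          # inclusion and exclusion criteria
    source_system: str            # e.g. Banner, PeopleSoft, or Workday
    source_field: str
    reporting_context: str        # IPEDS, state, internal, or accreditation
    owner_position: str           # a position, not a person
    review_history: list[ReviewEntry] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Return the names of any parts that are still empty."""
        missing = [name for name, value in vars(self).items()
                   if name != "review_history" and not str(value).strip()]
        if not self.review_history:
            missing.append("review_history")
        return missing


# Illustrative values only; none of them come from the study.
full_time = GovernedDefinition(
    term_name="Full-time student",
    business_definition="A student carrying a full course load in the term.",
    technical_logic="",  # the half most documentation skips
    source_system="Banner",
    source_field="(hypothetical field name)",
    reporting_context="IPEDS",
    owner_position="Director of Institutional Research",
)
print(full_time.missing_parts())
```

A record like this makes the governance half visible: an entry with a business meaning but no logic or review history shows up as incomplete instead of passing as done.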

Of the 20 respondents, exactly one maintained a dictionary that covered both the functional and technical dimensions with active governance, versioning, and campus-wide accessibility. That office documented 495 definitions and sat in the lowest-risk tier. The lesson is not the number. It is that completeness plus governance, not volume, is what separated the one survivable dictionary from the rest.

The six-step framework

Data lineage

Where data came from and how it moved

Trace where each field originated, how it was transformed, and what it is for. Humans supply the "why" no system can infer; the rule a column encodes lives in someone's memory of a policy decision, not in the data itself. AI can organize that into an auditable flowchart once a person has supplied the reasoning. This is where you stop losing context every time a system changes.

Addresses self-assessment factors 4 (governance and institutional memory) and 5 (reproducible and survivable).
AI organizes and visualizes; humans supply intent.
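One way to picture lineage is as a small graph that maps each field to the fields it was derived from, with the human-supplied reasoning stored alongside. The sketch below assumes that shape; the field names, the LINEAGE and INTENT dictionaries, and the trace_upstream function are hypothetical.

```python
# Hypothetical lineage: each field maps to the fields it was derived from.
LINEAGE = {
    "report.full_time_flag": ["warehouse.credit_hours"],
    "warehouse.credit_hours": ["sis.registered_credits"],
    "sis.registered_credits": [],
}

# The "why" a person supplies; it cannot be inferred from the data itself.
INTENT = {
    "report.full_time_flag": "Threshold set by a policy decision; record it here.",
}


def trace_upstream(field_name: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return field_name followed by everything upstream, nearest first."""
    ordered, seen, queue = [], set(), [field_name]
    while queue:
        current = queue.pop(0)
        if current in seen:       # guards against circular references
            continue
        seen.add(current)
        ordered.append(current)
        queue.extend(lineage.get(current, []))
    return ordered


for step in trace_upstream("report.full_time_flag", LINEAGE):
    print(step, "|", INTENT.get(step, "intent not yet documented"))
```

The printout doubles as a to-do list: every step that reports "intent not yet documented" is context that still lives only in someone's memory.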

Governance

Who decides, and what the rules are

Decide where data sits, which policies apply, where it should and should not be displayed, and who holds authority over changes. In practice this covers FERPA flags, role-based access, update schedules, and approval workflows. Governance is the factor that predicted a team's challenge profile more strongly than team size did; a small office with governance outperformed a large one without it.

Addresses factor 4 (governance and institutional memory).
Humans set policy; tools enforce it.
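"Humans set policy; tools enforce it" can be read as policy written down as data plus a few small checks. The following is a sketch under assumptions: the FieldPolicy class, the role names, and the rule that FERPA-flagged fields never appear on public pages are illustrative choices, not rules taken from the research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPolicy:
    """Hypothetical governance policy for one field."""
    field_name: str
    ferpa_protected: bool
    allowed_roles: frozenset[str]   # roles that may see the field
    change_approver: str            # a position, not a person


def can_view(policy: FieldPolicy, role: str) -> bool:
    """A role may view a field only if the policy lists it."""
    return role in policy.allowed_roles


def can_display_publicly(policy: FieldPolicy) -> bool:
    """Assumption for this sketch: FERPA-flagged fields stay off public pages."""
    return not policy.ferpa_protected


student_id = FieldPolicy(
    field_name="student_id",
    ferpa_protected=True,
    allowed_roles=frozenset({"ir_analyst", "registrar"}),
    change_approver="Data Governance Council chair",
)

print(can_view(student_id, "ir_analyst"))
print(can_view(student_id, "dean"))
print(can_display_publicly(student_id))
```

The policy object is the human decision; the two functions are the enforcement. Changing who may see a field means editing the policy, not the code.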

Audit

Find the conflicts before they reach leadership

Flag redundancies and conflicting definitions, sample for quality, and decide what to fill, null out, or consolidate. Reviewing 500 columns by hand is not feasible, so AI earns its place here: it surfaces anomalies at scale and generates a prioritized list. A person still makes every keep-or-merge call.

Addresses factors 1 (documented) and 5 (reproducible and survivable).
AI surfaces anomalies at scale; humans adjudicate.
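The conflict check itself can be simple even when the catalog is large. Here is a minimal sketch that groups entries by term and reporting context and flags any group holding more than one version of the logic. The CATALOG rows and the find_conflicts function are hypothetical, and comparing logic as normalized text is a deliberate simplification.

```python
from collections import defaultdict

# Hypothetical catalog rows: (term, reporting context, technical logic, location).
CATALOG = [
    ("enrolled student", "internal", "registered on census date", "warehouse view"),
    ("enrolled student", "internal", "any registration in term", "dashboard SQL"),
    ("enrolled student", "IPEDS", "registered on census date", "IPEDS extract"),
    ("full-time", "IPEDS", "meets federal credit threshold", "IPEDS extract"),
]


def find_conflicts(rows):
    """Group by (term, context); flag groups holding more than one logic."""
    logic_by_key = defaultdict(set)
    for term, context, logic, _location in rows:
        logic_by_key[(term.lower(), context)].add(logic.strip().lower())
    conflicts = {key: sorted(logics)
                 for key, logics in logic_by_key.items() if len(logics) > 1}
    # Most competing versions first, so reviewers see the worst cases early.
    return sorted(conflicts.items(), key=lambda item: -len(item[1]))


for (term, context), versions in find_conflicts(CATALOG):
    print(f"{term} [{context}]: {len(versions)} competing versions")
```

Grouping by context as well as term is the design choice that matters: the same term differing between IPEDS and internal reporting is legitimate, so only disagreement within one context is flagged. The list it produces is a queue for a person, not a verdict.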

Definition

Draft fast, approve deliberately

With lineage in hand, AI can draft both the business and technical definitions, and for IPEDS it can flag which columns map to which federal requirements. Every draft then passes through a human approval workflow: propose, review, approve, publish, notify. The speed comes from the draft; the trust comes from the review.

Addresses factors 1 (documented), 2 (accessible and integrated), and 4 (governance and institutional memory).
AI drafts; humans approve before anything publishes.
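The propose, review, approve, publish, notify sequence behaves like a small state machine in which no stage can be skipped. The sketch below assumes that reading; the Stage enum, the advance function, and the actor_is_human flag are hypothetical stand-ins for whatever identity check a real workflow would use.

```python
from enum import Enum


class Stage(Enum):
    PROPOSED = "proposed"
    IN_REVIEW = "in review"
    APPROVED = "approved"
    PUBLISHED = "published"
    NOTIFIED = "notified"


# Each stage may only move to the next one; nothing skips review.
NEXT_STAGE = {
    Stage.PROPOSED: Stage.IN_REVIEW,
    Stage.IN_REVIEW: Stage.APPROVED,
    Stage.APPROVED: Stage.PUBLISHED,
    Stage.PUBLISHED: Stage.NOTIFIED,
}


def advance(current: Stage, actor_is_human: bool) -> Stage:
    """Move a draft one step forward, requiring a human at the approval step."""
    if current not in NEXT_STAGE:
        raise ValueError(f"{current.value} is the final stage")
    if current is Stage.IN_REVIEW and not actor_is_human:
        raise PermissionError("only a human reviewer can approve a definition")
    return NEXT_STAGE[current]


stage = Stage.PROPOSED                           # an AI draft enters here
stage = advance(stage, actor_is_human=False)     # queued for review
stage = advance(stage, actor_is_human=True)      # a person approves
print(stage.value)
```

Because the only path to "published" runs through "approved", a fast draft cannot reach readers without the deliberate step in between.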

Democratization

One source, two audiences

Serve different audiences from the same underlying definitions: a plain-language glossary for requestors and a full technical dictionary for analysts. Only 15% of the institutions studied maintained a campus-wide-accessible dictionary; in most offices the dictionary's only reader was the IR team itself. Closing that gap is how documentation stops being private and starts reducing the request load.

Addresses factors 2 (accessible and integrated) and 3 (literacy and adoption).
The same definition, rendered for two reading levels.
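Serving two audiences from one source mostly means keeping one record and writing two renderers. A minimal sketch, assuming the definition is stored as a flat record; the keys, the sample values, and both render functions are hypothetical.

```python
# Hypothetical definition record; keys and values are illustrative only.
DEFINITION = {
    "term": "Enrolled student",
    "business": "A student counted as enrolled for the term.",
    "logic": "registered on census date; excludes audit-only registrations",
    "source": "Banner (hypothetical field)",
    "context": "internal",
    "owner": "Director of Institutional Research",
}


def render_glossary(d: dict) -> str:
    """Plain-language view for requestors: meaning and context only."""
    return f"{d['term']} ({d['context']}): {d['business']}"


def render_technical(d: dict) -> str:
    """Full view for analysts: everything needed to reproduce the number."""
    lines = [render_glossary(d)]
    lines += [f"  {label}: {d[key]}"
              for label, key in (("Logic", "logic"),
                                 ("Source", "source"),
                                 ("Owner", "owner"))]
    return "\n".join(lines)


print(render_glossary(DEFINITION))
print(render_technical(DEFINITION))
```

The technical view starts from the glossary line rather than restating it, so the two audiences can never drift apart on what the term means.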

Maintenance

The step that keeps the other five alive

Keep an audit log of changes, send notifications when a definition updates, and run a dashboard of the most-referenced and most-overdue definitions. A quarter of respondents had no change history at all. Without maintenance, a dictionary begins degrading the moment it is published, which is precisely how the 300-definition shelf-document became worthless.

Addresses factor 4 (governance and institutional memory).
Mostly process; tooling automates the reminders and logs.
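The audit log and the most-overdue list need very little machinery. The sketch below assumes a 90-day review interval as a stand-in for a quarterly cadence; the terms, dates, and function names are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)   # assumed quarterly cadence

# Hypothetical catalog: term -> date of last review.
LAST_REVIEWED = {
    "enrolled student": date(2026, 1, 10),
    "full-time": date(2026, 5, 1),
    "first-generation": date(2025, 9, 15),
}

change_log: list[dict] = []


def record_change(term: str, changed_on: date, change: str, reason: str) -> None:
    """Append to the audit log and reset the term's review clock."""
    change_log.append({"term": term, "date": changed_on,
                       "change": change, "reason": reason})
    LAST_REVIEWED[term] = changed_on


def most_overdue(today: date) -> list[tuple[str, int]]:
    """Return (term, days overdue) for overdue terms, worst first."""
    overdue = [(term, (today - reviewed - REVIEW_INTERVAL).days)
               for term, reviewed in LAST_REVIEWED.items()
               if today - reviewed > REVIEW_INTERVAL]
    return sorted(overdue, key=lambda pair: -pair[1])


print(most_overdue(date(2026, 6, 26)))
```

Recording the reason with every change is the point of the log: it preserves the reasoning that otherwise leaves with the author.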

The framework is deliberately tool-agnostic. It pairs institutional process with selective AI rather than handing the whole problem to software. In the study, only one office ran anything close to a continuous-improvement model, updating definitions through a monthly KPI review by a governance council. That was the outlier, not because the others lacked tools, but because they lacked the process the framework formalizes.

The intelligence gap is the AI-readiness gap

Here is the part that makes this urgent rather than tidy. Adopting AI analytics, natural-language querying, or predictive models on top of an undocumented definition layer does not give you faster answers. It gives you faster wrong answers. A tool querying a warehouse where "enrolled student" has several undocumented meanings will return whichever one the query happens to hit, with no flag and no audit trail. The number looks authoritative because a machine produced it.

AI accelerates the work but does not replace the judgment. In one test from the research, roughly 30% of AI-generated definitions were usable on a first pass; the rest needed editing by someone with deep institutional knowledge. Scale that to a 500-definition dictionary and about 150 are ready immediately while 350 still need human context. That ratio is the whole argument: AI is a powerful drafting and auditing partner, and a poor substitute for the person who knows why a column named "master" is less authoritative than an unlabeled computed flag sitting next to it.

So closing the intelligence gap is not a prerequisite for using AI. It is a prerequisite for trusting what AI produces. The gap was always there; AI does not create the risk, it magnifies it. The institutions that document and govern their definitions first are the ones that will be able to believe their own dashboards once the querying gets automated.

Twelve best practices and what breaks without them

Best practice	What breaks when you skip it
Check the definition, do not assume it	Two analysts answer the same request differently because each assumed a different meaning.
Read the documentation before touching the data	Skipping the codebook is the single largest source of downstream error in a new dataset.
Identify which variables drive decisions	Governance effort spreads evenly across trivial and high-stakes fields, so the board-report number gets no more care than a footnote.
Blank is not zero	Missing values get counted as real zeros, quietly biasing rates and totals.
Designate one authoritative source per variable	The same term defined in three places becomes three competing claims with no tiebreaker.
Document computed columns with logic, ownership, and intended use	A derived field outlives the memory of how it was built, and no one dares change it.
Know where each variable originates	A system migration forces you to rediscover lineage you already learned and lost.
Disclose when defaults or assumptions are applied	Consumers treat an assumption-laden number as ground truth and build decisions on it.
Tie definitions to self-interest so people adopt them	A technically perfect dictionary goes unopened while shadow versions multiply in spreadsheets.
Assign ownership to positions, not people	Ownership leaves with the individual, and the definition is orphaned on their last day.
Use federal and state standards as a baseline	Internal definitions drift away from IPEDS and state reporting, breaking peer benchmarking.
Remember that a tool is not a framework	You put every definition in the platform, then ask "now what?", because the process was never built.

"We put them all in the tool and then... now what?"
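"Blank is not zero" is the easiest of the twelve to show with numbers. A minimal sketch with made-up credit loads, where None stands for a missing value; the function names and the data are hypothetical.

```python
from typing import Optional


def mean_ignoring_blanks(values: list[Optional[float]]) -> Optional[float]:
    """Average only the values that were actually recorded."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None


def mean_blanks_as_zero(values: list[Optional[float]]) -> float:
    """The mistake: every blank is silently counted as a real zero."""
    return sum(v or 0.0 for v in values) / len(values)


# Hypothetical credit loads; None means the value is missing, not zero.
credit_hours = [12.0, 15.0, None, 9.0, None]

print(mean_ignoring_blanks(credit_hours))   # 12.0
print(mean_blanks_as_zero(credit_hours))    # 7.2
```

Two missing values out of five pull the average from 12.0 down to 7.2, and nothing in the output signals that the lower figure rests on an assumption.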

Where to start, by tier

Large gap: document, then assign

For the 55% whose definitions live in one person's head

You do not need a budget, a tool, or a committee to start. The hardest move is the first one, from undocumented to documented, and a shared spreadsheet clears it.

Document what is in your head today; a shared Google Sheet is enough to begin.
Identify your five highest-risk variables: federal submissions, board reports, accreditation evidence.
Assign ownership of each to a position, not a person.

Moderate gap: connect, govern, consolidate

For the 40% with definitions that exist but stay fragmented

The raw material is there. The work now is making it accessible, governed, and singular, so experienced staff are no longer the only routing layer.

Connect definitions to the reports that use them by embedding in tooltips, headers, and the catalog.
Establish a review cadence, quarterly at minimum, so updates stop waiting for something to break.
Audit for fragmentation: where one term is defined in three places, consolidate to a single authoritative version.

Small gap: prepare for AI, then teach the field

For the 5% who are documented, governed, and survivable

You have the layer most institutions are still building. The next return comes from making it machine-readable and proving the case publicly.

Prepare the definition layer for AI: machine-readable formats, API-accessible catalogs, standardized metadata schemas.
Measure and publish adoption so the value is visible, not assumed.
Become a case study for the field; the sector has few worked examples.
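"Machine-readable" can start as small as a validated JSON export. The sketch below assumes a flat record and a required-key list; the key names, the sample record, and the to_machine_readable function are hypothetical and do not follow any published metadata standard.

```python
import json

# Hypothetical schema: this required-key list is an assumption.
REQUIRED_KEYS = {"term", "business_definition", "technical_logic",
                 "source_system", "reporting_context", "owner_position",
                 "last_reviewed"}


def to_machine_readable(definition: dict) -> str:
    """Validate the metadata keys, then serialize to stable, sorted JSON."""
    missing = REQUIRED_KEYS - definition.keys()
    if missing:
        raise ValueError(f"missing metadata: {sorted(missing)}")
    return json.dumps(definition, indent=2, sort_keys=True)


# Illustrative record only.
record = {
    "term": "Enrolled student",
    "business_definition": "A student counted as enrolled for the term.",
    "technical_logic": "registered on census date",
    "source_system": "Banner",
    "reporting_context": "internal",
    "owner_position": "Director of Institutional Research",
    "last_reviewed": "2026-06-26",
}

payload = to_machine_readable(record)
print(payload)
```

Sorting the keys keeps the output stable between exports, which makes changes to a definition easy to diff and review.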

Whichever tier you are in, the returns compound. Each definition you formalize lowers the cost of the next one, because the process, the template, and the precedent already exist. The first definition is expensive; the fiftieth is nearly free. That is the quiet argument for starting now rather than waiting for a migration or an accreditation year to force the issue. For more on turning that documentation into something leadership trusts, see our work on data-informed advocacy and best practices for IR and IE request workflows.

Where Clema fits in

Clema is an AI data-intelligence platform built for IR and IE teams, and it maps onto the framework rather than replacing it. It connects to nine federal data sources, including IPEDS, College Scorecard, EADA, Pell Grants, DAPIP, and PSEO, and lets you query institutional and federal data in plain language. It flags discrepancies across sources, which is the audit step at scale. It serves both a technical dictionary and a consumer-facing glossary from the same underlying definitions, which is the democratization step. And it builds institutional memory that persists beyond any single person's tenure, which is the part the research kept identifying as the missing infrastructure. Thirty-five institutions use it today.

The honest framing matches the AI-readiness argument above: the tool accelerates lineage, audit, drafting, and maintenance, and your team still supplies the judgment about which definition is right for which context. Clema makes the framework cheaper to run. It does not run it for you.

Read the full research

The whitepaper behind this series draws on 20 IR and IE interviews across 13 states: the three-tier intelligence gap, the cost model, the five-factor self-assessment, and the full six-step framework.

Read the whitepaper

See it on your own definitions

Bring your five highest-risk variables and we will show you how Clema drafts, audits, and governs them, with your team keeping every approval.

Book a demo

Written by

Clema Research Team

The Clema research team publishes original analysis and practical guides for institutional research and institutional effectiveness professionals.

Frequently asked questions

What is the difference between creating and maintaining data definitions?

Creating a definition is a one-time act of writing down what a term means and how it is calculated. Maintaining it means keeping it current, governed, and findable as systems and staff change. The research found that creation is achievable for most teams, while maintenance is where efforts collapse; a dictionary degrades from the moment it is published if no one owns its upkeep.

Do I need to buy a governance tool before I can start?

No. The framework is tool-agnostic, and the recommended first move for a large-gap team is simply documenting what is in your head in a shared spreadsheet. A tool is not a framework; buying software without a process produces a full catalog that no one knows how to maintain. Start with the process and add tooling once it is doing real work.

Where does AI actually help in governing definitions?

AI is strongest at organizing lineage into an auditable view, surfacing conflicts across hundreds of columns at scale, and drafting first-pass definitions. In testing, about 30% of AI-generated definitions were usable on a first pass; the remaining 70% needed editing by someone with deep institutional knowledge. AI accelerates the work, but every definition should pass through human approval before it is published.

Why is the institutional intelligence gap also an AI-readiness gap?

An AI tool querying a warehouse with undocumented definitions returns whichever meaning the query happens to use, with no flag and no audit trail, so it produces faster wrong answers. Closing the gap is not a prerequisite for using AI, but it is a prerequisite for trusting what AI produces. AI magnifies the existing risk rather than creating it.

Which step of the framework should we tackle first?

It depends on your starting tier. Large-gap teams should document their highest-risk variables and assign ownership to positions, not people. Moderate-gap teams should connect definitions to the reports that use them and set a quarterly review cadence. Small-gap teams should prepare their definition layer for AI with machine-readable, API-accessible formats. Returns compound at every tier, because each definition you formalize lowers the cost of the next.
