Predicting graduation within 150% time: course combinations, confidence, and committee-ready explanations

A graduation model evaluated on 36,000 unseen students reached ~0.911 AUC. Here is what it sees that cohort reports do not.

Clema Research Team

June 5, 2026

Introduction

Retention prediction is the use case most institutions start with. Graduation prediction is the use case that changes academic strategy. Where retention is about which students are at risk of not returning next term, graduation is about which course combinations, program pathways, and student profiles actually lead to a credential within a defensible horizon. The output lands in different rooms (the provost's office, the dean of academic affairs, curriculum committees) and gets weighed against different decisions (program redesign, prerequisite changes, advising guidance).

This post is the practical anatomy of a working graduation model. Example figures throughout come from a model evaluated on 36,000 unseen students from a six-year institutional dataset, reaching approximately 0.911 AUC. For the term-to-term version of the same exercise, see the retention prediction deep-dive.

Why 150% time is the standard horizon

Graduation rates are measured at multiple horizons (100%, 125%, 150%, 200%), but 150% is the federal-reporting standard and the one most commonly used in board-level conversations. For a four-year bachelor's program, 150% time means graduation within six years. For a two-year associate's program, three years.
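The horizon arithmetic is simple enough to sketch in a few lines. The helper names below (`horizon_years`, `graduated_within_horizon`) are hypothetical, and the month-level comparison is a simplifying assumption for illustration, not the official federal cohort-tracking rule.

```python
from datetime import date
from typing import Optional


def horizon_years(program_years: float, pct: float = 1.5) -> float:
    """Normal program length scaled by the reporting horizon (150% by default)."""
    return program_years * pct


def graduated_within_horizon(entry: date, completion: Optional[date],
                             program_years: float, pct: float = 1.5) -> bool:
    """True if the credential was earned within the horizon (month granularity)."""
    if completion is None:
        return False  # no credential on record counts as not graduated
    allowed_months = round(horizon_years(program_years, pct) * 12)
    elapsed_months = (completion.year - entry.year) * 12 + (completion.month - entry.month)
    return elapsed_months <= allowed_months


bachelor_horizon = horizon_years(4)   # six years
associate_horizon = horizon_years(2)  # three years
on_time = graduated_within_horizon(date(2018, 8, 20), date(2024, 5, 15), 4)
too_late = graduated_within_horizon(date(2018, 8, 20), date(2025, 5, 15), 4)
```

The outcome column in the model is this Y/N flag, computed once per student.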

The horizon is long enough to be realistic (most students who graduate, graduate within it) and short enough to be operationally useful (it does not require waiting a decade to score a cohort). It also lines up with how the rest of the sector talks about graduation, which matters when the model's output ends up in a board packet.

The catch with a six-year horizon is that the model is making a prediction whose ground truth is many years out. That changes how the model is built, how it is validated, and how often it needs to be retrained.

The data shape

Where the retention model lives at the student-term grain, the graduation model is built at the student grain. One row per student, with features summarizing the student's full institutional history through the prediction point. After encoding, the graduation dataset reaches approximately 240,000 rows and 326 columns from a six-year institutional dataset of roughly 40,000 students per year.

Property	Graduation model
Grain	One row per student
Rows after encoding	~240K
Columns after encoding	~326
Outcome variable	Graduated within 150% time (Y/N)
Class balance (graduated vs not)	~55.6% graduate / ~44.4% not
Test set used for evaluation	36,000 unseen students
Reported AUC	~0.911

The class balance for graduation (about 56% to 44%) is much closer to even than retention (87% to 13%). That makes the operating-metric choice less fraught: accuracy is more informative here than it is for retention, though recall still matters for catching the students whose predicted graduation probability is dropping.
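One way to see why accuracy is more informative here is to compare it with the trivial baseline of always predicting the majority class. The sketch below uses only figures quoted in this post (the class shares and the ~86.6% reported accuracy); the function name is hypothetical.

```python
def majority_baseline_accuracy(positive_share: float) -> float:
    """Accuracy of a 'model' that always predicts the larger class."""
    return max(positive_share, 1.0 - positive_share)


graduation_baseline = majority_baseline_accuracy(0.556)  # ~55.6% graduate
retention_baseline = majority_baseline_accuracy(0.87)    # ~87% retained

reported_accuracy = 0.866
lift_over_graduation_baseline = reported_accuracy - graduation_baseline
lift_over_retention_baseline = reported_accuracy - retention_baseline
```

On the graduation problem, 86.6% accuracy sits about 31 points above the baseline. The same number on an 87/13 retention problem would be slightly below what always guessing "retained" achieves, which is why accuracy alone is a weak headline metric there.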

What dominates a graduation prediction

A graduation model at this scale typically uses 9 source features plus 8 engineered features, for 17 features per student before encoding. The dominant signals on a well-built model are:

Course-difficulty of the degree (the average DFW rate across courses the student has taken or is taking) is consistently the strongest single predictor. Students taking harder course mixes have lower predicted graduation rates, after controlling for GPA and credits. This is what makes the model useful for curriculum decisions, not just student-level ones.

Degree-level DFW count (the number of D, F, or Withdrawal grades a student has accumulated through their program) is the second-strongest signal. Each additional DFW in the academic record materially lowers the predicted graduation probability.

High-school GPA and first-generation status contribute meaningfully but less than the in-program performance features. The model has learned, sensibly, that what a student does at the institution matters more than what they did before they arrived.

Cumulative credits earned and cumulative percent credit completion round out the top of the importance ranking. Both are obvious in retrospect: students who are on pace with credits graduate; students who are not, do not.
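A minimal sketch of how the in-program features above could be derived from course-level records follows. The record layout, the field names, and the assumption that F and W grades earn no credit while D does are all illustrative, not the actual pipeline schema.

```python
from collections import defaultdict

DFW_GRADES = {"D", "F", "W"}
NO_CREDIT_GRADES = {"F", "W"}  # assumption: a D still earns credit


def student_features(records, course_dfw_rate):
    """Collapse course-level rows into one feature row per student."""
    by_student = defaultdict(list)
    for row in records:
        by_student[row["student_id"]].append(row)

    features = {}
    for student_id, rows in by_student.items():
        attempted = sum(r["credits"] for r in rows)
        earned = sum(r["credits"] for r in rows if r["grade"] not in NO_CREDIT_GRADES)
        features[student_id] = {
            # Course difficulty: mean historical DFW rate of the courses taken
            "avg_course_dfw_rate": sum(course_dfw_rate[r["course"]] for r in rows) / len(rows),
            # Degree-level DFW count
            "dfw_count": sum(1 for r in rows if r["grade"] in DFW_GRADES),
            "credits_earned": earned,
            "pct_credit_completion": earned / attempted,
        }
    return features


# Hypothetical catalog-level DFW rates and transcript rows
course_dfw_rate = {"BIO101": 0.30, "MATH301": 0.20, "CS201": 0.10}
records = [
    {"student_id": "s1", "course": "BIO101", "grade": "B", "credits": 4},
    {"student_id": "s1", "course": "MATH301", "grade": "F", "credits": 3},
    {"student_id": "s1", "course": "CS201", "grade": "A", "credits": 3},
    {"student_id": "s2", "course": "BIO101", "grade": "D", "credits": 4},
]
features = student_features(records, course_dfw_rate)
```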

Course-combination effects

The reason graduation prediction is more interesting than retention prediction at the program level is that it can surface course-combination effects: specific pairings of courses that, taken together, change the predicted outcome more than either course taken alone.

The reason this is hard to do without a per-student model is the confounding problem. Strong students tend to choose harder course combinations, which makes those combinations look causally protective of graduation when they are really just selected by students who would graduate anyway. A gradient-boosted model can estimate the contribution of the combination conditional on GPA, attendance, and credit load, which strips out the selection effect that runs through those measured features. A simple correlation analysis cannot.
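The selection effect can be illustrated with a toy example. The counts below are synthetic and chosen so the arithmetic is easy to follow, and plain stratification by GPA band stands in here for the conditioning a gradient-boosted model performs across many features at once.

```python
# Synthetic cells: (gpa_band, took_combination, students, graduates)
cells = [
    ("high", True, 80, 72),
    ("high", False, 20, 18),
    ("low", True, 20, 10),
    ("low", False, 80, 40),
]


def grad_rate(rows):
    """Pooled graduation rate across the given cells."""
    students = sum(r[2] for r in rows)
    graduates = sum(r[3] for r in rows)
    return graduates / students


# Naive comparison: ignores who chooses the combination
naive_gap = (grad_rate([r for r in cells if r[1]])
             - grad_rate([r for r in cells if not r[1]]))

# Stratified comparison: same question, asked within each GPA band
stratified_gap = {
    band: grad_rate([r for r in cells if r[0] == band and r[1]])
          - grad_rate([r for r in cells if r[0] == band and not r[1]])
    for band in ("high", "low")
}
```

In this toy data the naive gap is 24 points in favor of the combination, while the gap within each GPA band is zero: the whole apparent effect comes from high-GPA students being more likely to take the pairing.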

A real, model-derived statement looks like this:

“BIO101 is dangerous specifically for first-generation students with cumulative GPA below 2.5 who already have a prior-term DFW. Their graduation probability drops by 18%.”

That sentence is the unit of action for the academic affairs office. It does not say "BIO101 is a hard course," which the catalog already shows. It says "BIO101 is dangerous for this specific student profile, and the effect is large enough to matter." With that, advising can flag exactly which incoming students need an extra conversation before they enroll, and the curriculum committee has a concrete case for either restructuring the course or building a co-requisite support track.

The breakdown that produces a single per-student prediction comes from SHAP. A worked example for a 67.3% graduation probability:

Driver	Contribution	What it means
Base rate at the institution	Starting point	The institution-wide average graduation probability for students with similar starting profiles. The model anchors here.
MATH301 + CS201 combination	+14.2%	This specific course pairing has historically been associated with higher completion for similar students. The model adds the contribution.
GPA 2.9	+6.1%	Slightly above the cohort-relative average for this program. Small positive contribution.
Attendance 74%	-8.4%	Below the institutional median. Largest single negative driver in this profile.
First-generation status	-2.6%	After controlling for everything else, the model still finds a modest first-gen graduation gap at this institution. Worth attention from a student-support standpoint.
Final prediction	67.3%	Base rate plus the contributions. The advisor now knows attendance is the biggest lever, not GPA or course choice.
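The worked example relies on the additive property of SHAP values: the base value plus the per-feature contributions equals the model output. The sketch below checks that arithmetic using only the figures quoted above. It assumes the contributions are percentage points on the probability scale (tree explainers often report log-odds by default, so this presumes the explainer was configured to explain probabilities), and the base rate it prints is implied by the arithmetic rather than stated in the example.

```python
# Contributions from the worked example, in percentage points (assumed scale)
contributions = {
    "MATH301 + CS201 combination": 14.2,
    "GPA 2.9": 6.1,
    "Attendance 74%": -8.4,
    "First-generation status": -2.6,
}
final_prediction = 67.3

total_contribution = sum(contributions.values())
# Additivity: base + sum(contributions) = prediction, so the base is implied
implied_base_rate = final_prediction - total_contribution

# The driver an advisor would look at first: the most negative contribution
largest_negative_driver = min(contributions, key=contributions.get)

print(f"net contribution: {total_contribution:+.1f} points")
print(f"implied base rate: {implied_base_rate:.1f}%")
print(f"largest negative driver: {largest_negative_driver}")
```

Under those assumptions the four drivers net to +9.3 points, which implies a base rate of about 58.0% for this profile.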

Reading the metrics

When a graduation model is reported as having about 86.6% accuracy, 91% recall, and 0.911 AUC, here is what those numbers actually say in plain English.

Metric	What it measures	What it means in this problem
Accuracy	Share of all predictions that are correct	About 87 of every 100 students are classified correctly. Useful here because the class balance is close to even.
Recall	Of students who actually did not graduate, how many were flagged	About 9 of every 10 non-graduating students are flagged. The metric to optimize when the cost of missing one is high.
Precision	Of students flagged at risk of not graduating, how many actually did not	About 93 of every 100 flagged are genuinely at risk in the reported figures.
F1	Balance of precision and recall	Catching real non-graduators without flooding the advising office with false alarms.
AUC-ROC	Ability to rank a random non-graduator above a random graduator	About 0.911. Given one random non-graduator and one random graduator, the model assigns the lower graduation probability to the non-graduator about 91 times out of 100. Strong on this kind of data.
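The definitions in the table can be made concrete with a short sketch. Non-graduation is treated as the positive class, matching the table. The confusion-matrix counts and the probabilities are small made-up numbers for illustration, not the reported model's results, and the function names are hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Metrics with 'did not graduate' as the positive (flagged) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


def pairwise_auc(non_grad_probs, grad_probs) -> float:
    """Share of (non-graduate, graduate) pairs where the non-graduate gets the
    lower predicted graduation probability; ties count as half."""
    wins = 0.0
    for n in non_grad_probs:
        for g in grad_probs:
            if n < g:
                wins += 1.0
            elif n == g:
                wins += 0.5
    return wins / (len(non_grad_probs) * len(grad_probs))


# Illustrative counts: 100 actual non-graduates, 100 actual graduates
metrics = classification_metrics(tp=90, fp=7, fn=10, tn=93)

# Illustrative predicted graduation probabilities
auc = pairwise_auc(non_grad_probs=[0.2, 0.4, 0.7], grad_probs=[0.5, 0.6, 0.9])
```

In the toy AUC example, seven of the nine possible pairs are ordered correctly. The one non-graduate scored at 0.7 outranks two of the graduates, and those two pairs are the misses.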

The single most important sentence in the metrics conversation:

“Precise does not mean certain. It means the uncertainty is quantified and attributable, rather than guessed.”

A 67.3% predicted graduation probability does not mean "this student will probably graduate." It means the model, based on patterns from similar students whose outcomes are known, estimates a 67.3% chance, and the contributions of each driver can be inspected. That is a fundamentally different kind of claim than a confident bullet on a slide, and it is what makes the output defensible in a committee.

The min-group floor

The most common way a graduation model produces a misleading insight is small-group inference. A pipeline that surfaces a "course combination with 96% graduation rate" based on nine students has surfaced noise, not signal. The combination might genuinely be predictive, or it might be that those nine students happened to be the strongest cohort of the year, or that one outlier is pulling the average.

The right defense is procedural, not algorithmic: a minimum group size floor (commonly n ≥ 20) enforced on any aggregated output. Findings below the threshold get flagged as low-confidence and excluded from action recommendations.
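A floor like this is a few lines of code at the reporting layer. The sketch below is one possible shape, with hypothetical field names and combination labels. It also attaches a Wilson score interval to each finding, which is an addition for illustration (the floor itself is just the `n` check) and shows how wide the uncertainty is for a nine-student group.

```python
import math

MIN_GROUP_SIZE = 20  # the commonly used floor mentioned above


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def apply_floor(findings, min_n: int = MIN_GROUP_SIZE):
    """Annotate each aggregated finding; only groups at or above the floor are actionable."""
    annotated = []
    for f in findings:
        annotated.append({
            **f,
            "rate": f["graduates"] / f["n"],
            "interval": wilson_interval(f["graduates"], f["n"]),
            "low_confidence": f["n"] < min_n,
            "actionable": f["n"] >= min_n,
        })
    return annotated


# Hypothetical aggregated findings
findings = [
    {"combination": "COMBO_A", "n": 9, "graduates": 9},
    {"combination": "COMBO_B", "n": 240, "graduates": 180},
]
results = apply_floor(findings)
```

Even a perfect nine-for-nine group has an interval whose lower bound is near 70%, so it stays in the output as flagged context rather than as a recommendation.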

This is the kind of detail that gets skipped in vendor demos and shows up six months later as a board-level embarrassment ("we recommended every freshman take this combination, based on nine students"). Asking how the vendor handles the small-group case is one of the highest-leverage questions in a procurement conversation.

Drift and the year-3 retrain rule

Predictive validity decays with horizon, and graduation prediction is the longest-horizon model an institution typically runs. Three forces drive the decay.

Concept drift: the relationship between features and outcomes shifts over time. Pre-COVID and post-COVID cohorts behave differently, and a model trained on one will not perform as well on the other.

Life-event unpredictability: financial, health, and personal events not captured in any feature set compound over a six-year horizon. The longer the horizon, the more of the outcome is determined by things the model cannot see.

Data staleness: the oldest cohort in the training data may no longer represent today's entrants. Curricular changes, admissions changes, and demographic shifts all matter.
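One common way to put a number on data staleness is the population stability index (PSI), which compares how a feature is distributed in the training cohorts versus today's entrants. This is a swapped-in monitoring technique offered as a sketch, not something the model described here is stated to use; the band shares below are made up.

```python
import math


def population_stability_index(expected_shares, actual_shares, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    psi = 0.0
    for expected, actual in zip(expected_shares, actual_shares):
        expected = max(expected, eps)  # guard against empty bins
        actual = max(actual, eps)
        psi += (actual - expected) * math.log(actual / expected)
    return psi


# Hypothetical share of students in low / middle / high high-school GPA bands
training_cohort_shares = [0.20, 0.50, 0.30]
current_entrant_shares = [0.30, 0.50, 0.20]

psi_shifted = population_stability_index(training_cohort_shares, current_entrant_shares)
psi_unchanged = population_stability_index(training_cohort_shares, training_cohort_shares)
```

An unchanged distribution scores zero, and larger values mean a bigger shift. Rule-of-thumb thresholds vary by team, so any cutoff for triggering a review is a local choice rather than a standard.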

The practical sweet spot looks like this: confidence is high at roughly 1 to 2 years out, and predictions are still operationally usable at 3 to 4 years if the underlying data is stable. By year three, retrain regardless. Ideally, retrain each term or year as new data lands: reload the data, re-evaluate, and swap in the new model if it scores better on the held-out test set.
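The retrain-and-swap rule can be written down as two small decisions. The sketch below is a minimal version with hypothetical function names; it assumes AUC on the same held-out set is the comparison metric, which is one reasonable reading of "scores better."

```python
def needs_retrain(model_age_years: float, new_data_available: bool,
                  max_age_years: float = 3.0) -> bool:
    """Retrain whenever new data lands, and unconditionally once the model hits the age cap."""
    return new_data_available or model_age_years >= max_age_years


def choose_model(champion_auc: float, challenger_auc: float) -> str:
    """Swap in the retrained model only if it scores strictly better on held-out data."""
    return "challenger" if challenger_auc > champion_auc else "champion"


# A model with fresh data available is retrained even though it is only a year old
retrain_young = needs_retrain(model_age_years=1.0, new_data_available=True)
# A three-year-old model is retrained even if no new data has landed
retrain_stale = needs_retrain(model_age_years=3.0, new_data_available=False)
# The retrained candidate only replaces the current model when it wins
deployed = choose_model(champion_auc=0.911, challenger_auc=0.905)
```

Keeping the current model when the candidate does not win means a retrain can never silently make production worse, while the age cap keeps a stale model from lingering just because nobody scheduled the work.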

A graduation model that has not been retrained in three years is a stale model, even if its original validation numbers were strong. The retraining cadence is part of the design, not an afterthought. It is also the work most likely to get crowded out: in our 50-interview study, ad-hoc requests consumed 40-60% of IR team capacity, and recurring model maintenance is exactly the kind of strategic work that load pushes off the calendar.

Correlation, not causation

Every output of a graduation model carries a structural caveat that is worth surfacing in plain language. The model identifies course-combination effects, demographic patterns, and predictor relationships that are associated with graduation in the training data. It does not prove that making a course a prerequisite, or changing an admissions criterion, would change the outcome.

The right framing for the academic affairs office is: "this is the strongest empirical signal we have for further investigation," not "this is the cause." The pipeline itself should surface this caveat in its own output, not bury it in the methodology section.

In practice, the model's findings tend to be the starting point for a small pilot or a more rigorous quasi-experimental study, not the end of the conversation. That is the right use of the output.

Where Clema fits in

Clema builds graduation models in the shape described above: gradient-boosted trees as the predictor, SHAP for per-student explanations, a min-group floor enforced on aggregated findings, and a retraining cadence built into the pipeline from day one. Each model is partner-only; an institution's data trains its own model, not a shared cross-school one.

For the term-to-term version, the retention prediction deep-dive covers the parallel exercise. For the institution-level questions a graduation model does not answer (enrollment forecasting, program viability, transfer yield), the enrollment and program viability post is the companion. For the model-selection argument behind the choice of gradient-boosted trees, see the XGBoost vs LLM post.

Sources

NCES IPEDS, graduation rate component. XGBoost documentation. SHAP documentation. Education Department, definition of 150% time. Georgia State University, completion outcomes. scikit-learn, model evaluation guide.

Frequently asked questions

Why is 150% time the standard horizon for graduation prediction?

150% time is the federal-reporting standard and the horizon most used in board-level conversations: six years for a four-year bachelor's program, three years for a two-year associate's. It is long enough to be realistic, since most students who graduate do so within it, and short enough to be operationally useful without waiting a decade to score a cohort. It also matches how the rest of the sector talks about graduation.

What dominates a graduation prediction?

Course-difficulty of the degree, the average DFW rate across courses the student has taken, is consistently the strongest single predictor. Degree-level DFW count is second. High-school GPA and first-generation status contribute meaningfully but less than in-program performance, and cumulative credits earned and percent completion round out the top. The model has learned that what a student does at the institution matters more than what came before.

Why does identifying course-combination effects require a per-student model?

Because of confounding. Strong students tend to choose harder course combinations, which makes those combinations look causally protective when they are really just selected by students who would graduate anyway. A gradient-boosted model can estimate the contribution of a combination conditional on GPA, attendance, and credit load. A simple correlation analysis cannot separate the effect from the selection.

How does the model avoid drawing conclusions from tiny groups?

Through a minimum group size floor, commonly n of at least 20, enforced on any aggregated output. A "course combination with 96% graduation rate" based on nine students is noise, not signal, since one outlier could pull the average. Findings below the threshold are flagged as low-confidence and excluded from action recommendations. Asking how a vendor handles the small-group case is one of the highest-leverage procurement questions.

How often does a graduation model need to be retrained?

By year three, retrain regardless. Predictive validity decays with horizon because of concept drift (pre-COVID versus post-COVID cohorts), life-event unpredictability over six years, and data staleness as old cohorts stop representing today's entrants. Confidence is high at 1 to 2 years out and still usable at 3 to 4 if data is stable. Ideally, retrain each term or year and swap in the new model if it scores better on held-out data.
