Introduction
Retention prediction is the use case most institutions start with. Graduation prediction is the use case that changes academic strategy. Where retention is about which students are at risk of not returning next term, graduation is about which course combinations, program pathways, and student profiles actually lead to a credential within a defensible horizon. The output lands in different rooms (the provost's office, the dean of academic affairs, curriculum committees) and gets weighed against different decisions (program redesign, prerequisite changes, advising guidance).
This post is the practical anatomy of a working graduation model. Example figures throughout come from a model evaluated on 36,000 unseen students from a six-year institutional dataset, reaching approximately 0.911 AUC. For the term-to-term version of the same exercise, see the retention prediction deep-dive.
Why 150% time is the standard horizon
Graduation rates are measured at multiple horizons (100%, 125%, 150%, 200%), but 150% is the federal-reporting standard and the one most commonly used in board-level conversations. For a four-year bachelor's program, 150% time means graduation within six years. For a two-year associate's program, three years.
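The horizon arithmetic can be pinned down in a few lines. This is a minimal sketch, assuming the nominal program length is known in years; the function name `horizon_years` is hypothetical and not taken from any reporting tool.

```python
def horizon_years(nominal_years, percent_of_time=150):
    """Years a student has to finish and still count as a graduate at this horizon."""
    return nominal_years * percent_of_time / 100

print(horizon_years(4))       # four-year bachelor's at 150% time
print(horizon_years(2))       # two-year associate's at 150% time
print(horizon_years(4, 200))  # the same bachelor's at the 200% horizon
```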
The horizon is long enough to be realistic (most students who graduate, graduate within it) and short enough to be operationally useful (it does not require waiting a decade to score a cohort). It also lines up with how the rest of the sector talks about graduation, which matters when the model's output ends up in a board packet.
The catch with a six-year horizon is that the model is making a prediction whose ground truth is many years out. That changes how the model is built, how it is validated, and how often it needs to be retrained.
The data shape
Where the retention model lives at the student-term grain, the graduation model is built at the student grain. One row per student, with features summarizing the student's full institutional history through the prediction point. After encoding, the graduation dataset reaches approximately 240,000 rows and 326 columns from a six-year institutional dataset of roughly 40,000 students per year.
| Property | Graduation model |
|---|---|
| Grain | One row per student |
| Rows after encoding | ~240K |
| Columns after encoding | ~326 |
| Outcome variable | Graduated within 150% time (Y/N) |
| Class balance (graduated vs not) | ~55.6% graduate / ~44.4% not |
| Test set used for evaluation | 36,000 unseen students |
| Reported AUC | ~0.911 |
The class balance for graduation (about 56% to 44%) is much closer to even than retention (87% to 13%). That makes the operating-metric choice less fraught: accuracy is more informative here than it is for retention, though recall still matters for catching the students whose predicted graduation probability is dropping.
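One way to see why accuracy carries more information here is to compare it with a model that always predicts the more common class. The sketch below uses only the class balances and the accuracy figure quoted in this post; the function name is hypothetical.

```python
def majority_baseline(positive_share):
    """Accuracy of a model that always predicts the more common class."""
    return max(positive_share, 1 - positive_share)

graduation_baseline = majority_baseline(0.556)  # ~55.6% graduate
retention_baseline = majority_baseline(0.87)    # ~87% retained

# Lift of the reported ~86.6% accuracy over always guessing "graduates"
lift_points = (0.866 - graduation_baseline) * 100

print(f"graduation baseline: {graduation_baseline:.1%}")
print(f"retention baseline:  {retention_baseline:.1%}")
print(f"lift over baseline:  {lift_points:.1f} points")
```

On retention data, a do-nothing classifier already scores about 87%, so a high accuracy figure says little. On graduation data the same trick scores only about 56%, so an accuracy in the mid-80s reflects real signal.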
What dominates a graduation prediction
A graduation model at this scale typically uses 9 source features plus 8 engineered features, for 17 features per student before encoding. The dominant signals on a well-built model are:
Course-difficulty of the degree (the average DFW rate across courses the student has taken or is taking) is consistently the strongest single predictor. Students taking harder course mixes have lower predicted graduation rates, after controlling for GPA and credits. This is what makes the model useful for curriculum decisions, not just student-level ones.
Degree-level DFW count (the number of D, F, or Withdrawal grades a student has accumulated through their program) is the second-strongest signal. Each additional DFW in the academic record materially lowers the predicted graduation probability.
High-school GPA and first-generation status contribute meaningfully but less than the in-program performance features. The model has learned, sensibly, that what a student does at the institution matters more than what they did before they arrived.
Cumulative credits earned and cumulative percent credit completion round out the top of the importance ranking. Both are obvious in retrospect: students who are on pace with credits graduate; students who are not, do not.
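The in-program features above can be derived from ordinary course-level records. What follows is a rough sketch, not the production pipeline: the record layout, the per-course DFW rates, and the rule that F and W earn no credit are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical course-level records: (student_id, course, grade, credits_attempted)
records = [
    ("s1", "BIO101", "B", 4),
    ("s1", "MATH301", "F", 3),
    ("s1", "CS201", "A", 3),
    ("s2", "BIO101", "W", 4),
    ("s2", "CS201", "D", 3),
]
# Hypothetical historical DFW rate per course (share of enrollments ending in D, F, or W)
course_dfw_rate = {"BIO101": 0.30, "MATH301": 0.20, "CS201": 0.10}

DFW_GRADES = {"D", "F", "W"}
NO_CREDIT_GRADES = {"F", "W"}  # assumption: a D still earns credit

def student_features(records, course_dfw_rate):
    """Collapse course-level rows into one feature dict per student."""
    by_student = defaultdict(list)
    for student_id, course, grade, credits in records:
        by_student[student_id].append((course, grade, credits))

    features = {}
    for student_id, rows in by_student.items():
        attempted = sum(credits for _, _, credits in rows)
        earned = sum(c for _, g, c in rows if g not in NO_CREDIT_GRADES)
        features[student_id] = {
            "dfw_count": sum(1 for _, g, _ in rows if g in DFW_GRADES),
            "avg_course_dfw_rate": sum(course_dfw_rate[c] for c, _, _ in rows) / len(rows),
            "credits_earned": earned,
            "pct_credits_completed": earned / attempted,
        }
    return features

features = student_features(records, course_dfw_rate)
for student_id, row in features.items():
    print(student_id, row)
```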
Course-combination effects
The reason graduation prediction is more interesting than retention prediction at the program level is that it can surface course-combination effects: specific pairings of courses that, taken together, change the predicted outcome more than either course taken alone.
This is hard to do without a per-student model because of confounding. Strong students tend to choose harder course combinations, which makes those combinations look causally protective of graduation when they are really just selected by students who would graduate anyway. A gradient-boosted model can estimate the combination's contribution while holding GPA, attendance, and credit load constant, so the estimate is not simply a proxy for those observed student characteristics. A simple correlation analysis cannot.
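A toy illustration of the confounding problem, using invented counts: the combination has no effect within either GPA band, yet the naive comparison shows a large gap. Plain stratification stands in here for what the model does with many features at once; it is not the gradient-boosted method itself.

```python
# Synthetic counts, invented for illustration:
# (gpa_band, took_combo) -> (students, graduates)
cells = {
    ("high", True): (80, 72),
    ("high", False): (20, 18),
    ("low", True): (20, 8),
    ("low", False): (80, 32),
}

def grad_rate(took_combo, gpa_band=None):
    """Graduation rate for takers or non-takers, optionally within one GPA band."""
    picked = [
        counts
        for (band, took), counts in cells.items()
        if took == took_combo and gpa_band in (None, band)
    ]
    students = sum(n for n, _ in picked)
    graduates = sum(g for _, g in picked)
    return graduates / students

# Naive comparison ignores GPA entirely
naive_gap = grad_rate(True) - grad_rate(False)

# Stratified comparison holds the GPA band fixed
gap_by_band = {
    band: grad_rate(True, band) - grad_rate(False, band)
    for band in ("high", "low")
}

print(f"naive gap: {naive_gap:+.0%}")
for band, gap in gap_by_band.items():
    print(f"gap within {band}-GPA band: {gap:+.0%}")
```

The whole naive gap comes from who chooses the combination: most takers are in the high-GPA band, and high-GPA students graduate at a higher rate whether or not they take it.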
A real, model-derived statement looks like this:
“BIO101 is dangerous specifically for first-generation students with cumulative GPA below 2.5 who already have a prior-term DFW. Their graduation probability drops by 18%.”
That sentence is the unit of action for the academic affairs office. It does not say "BIO101 is a hard course," which the catalog already shows. It says "BIO101 is dangerous for this specific student profile, and the effect is large enough to matter." With that, advising can flag exactly which incoming students need an extra conversation before they enroll, and the curriculum committee has a concrete case for either restructuring the course or building a co-requisite support track.
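Turning a finding like that into an advising flag is a simple filter. The sketch below encodes the three conditions from the example statement; the `Student` fields and the function name are hypothetical, and a real system would read them from the student information system.

```python
from dataclasses import dataclass

@dataclass
class Student:
    student_id: str
    first_generation: bool
    cumulative_gpa: float
    prior_term_dfw: int      # D, F, or W grades in the previous term
    planned_courses: tuple   # courses the student intends to enroll in

def needs_bio101_conversation(student):
    """True when the student matches the at-risk profile and plans to take BIO101."""
    return (
        "BIO101" in student.planned_courses
        and student.first_generation
        and student.cumulative_gpa < 2.5
        and student.prior_term_dfw >= 1
    )

students = [
    Student("s1", True, 2.3, 1, ("BIO101", "ENG102")),
    Student("s2", True, 3.1, 0, ("BIO101",)),
    Student("s3", False, 2.2, 2, ("BIO101",)),
    Student("s4", True, 2.1, 1, ("HIST110",)),
]
flagged_ids = [s.student_id for s in students if needs_bio101_conversation(s)]
print(flagged_ids)
```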
The breakdown that produces a single per-student prediction comes from SHAP. A worked example for a 67.3% graduation probability:
| Driver | Contribution | Interpretation |
|---|---|---|
| Base rate at the institution | Starting point | The institution-wide average graduation probability for students with similar starting profiles. The model anchors here. |
| MATH301 + CS201 combination | +14.2% | This specific course pairing has historically been associated with higher completion for similar students. The model adds the contribution. |
| GPA 2.9 | +6.1% | Slightly above the cohort-relative average for this program. Small positive contribution. |
| Attendance 74% | -8.4% | Below the institutional median. Largest single negative driver in this profile. |
| First-generation status | -2.6% | After controlling for everything else, the model still finds a modest first-gen graduation gap at this institution. Worth attention from a student-support standpoint. |
| Final prediction | 67.3% | Base rate plus the contributions. The advisor now knows attendance is the biggest lever, not GPA or course choice. |
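The additivity in the worked example can be checked directly. The sketch below assumes the contributions are expressed in percentage points on the probability scale, as they are presented here; SHAP values for tree models are often computed in log-odds instead, in which case they would not sum this way. The base rate is not stated in the example, so the value computed below is only what the other figures imply.

```python
# Contributions from the worked example, in percentage points
contributions = {
    "MATH301 + CS201 combination": 14.2,
    "GPA 2.9": 6.1,
    "Attendance 74%": -8.4,
    "First-generation status": -2.6,
}
final_prediction = 67.3

net_contribution = sum(contributions.values())
implied_base_rate = final_prediction - net_contribution  # implied, not reported
largest_negative = min(contributions, key=contributions.get)

print(f"net contribution:  {net_contribution:+.1f} points")
print(f"implied base rate: {implied_base_rate:.1f}%")
print(f"largest negative driver: {largest_negative}")
```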
Reading the metrics
When a graduation model is reported as having about 86.6% accuracy, 91% recall, and 0.911 AUC, here is what those numbers actually say in plain English.
| Metric | What it measures | What the reported figure means in this problem |
|---|---|---|
| Accuracy | Share of all predictions that are correct | About 87 of every 100 students are classified correctly (86.6%). Useful here because the class balance is close to even. |
| Recall | Of students who actually did not graduate, how many were flagged | About 9 of every 10 non-graduating students are flagged. The metric to optimize when the cost of missing one is high. |
| Precision | Of students flagged at risk of not graduating, how many actually did not | About 93 of every 100 flagged are genuinely at risk in the reported figures. |
| F1 | Balance of precision and recall | Catching real non-graduators without flooding the advising office with false alarms. |
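| AUC-ROC | Ability to rank a random non-graduator above a random graduate in risk | About 0.911. Given one randomly chosen non-graduator and one randomly chosen graduate, the model assigns the non-graduator the lower graduation probability about 91 times out of 100. Strong on this kind of data. |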
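To make the definitions concrete, here is a small from-scratch sketch of the same metrics, with "did not graduate" as the positive class. The eight students, their probabilities, and the 0.5 flagging threshold are invented for illustration and have no connection to the reported figures.

```python
def classification_metrics(did_not_graduate, flagged):
    """Accuracy, recall, precision, and F1 with 'did not graduate' as the positive class."""
    pairs = list(zip(did_not_graduate, flagged))
    tp = sum(1 for y, f in pairs if y and f)
    fp = sum(1 for y, f in pairs if not y and f)
    fn = sum(1 for y, f in pairs if y and not f)
    tn = sum(1 for y, f in pairs if not y and not f)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

def auc(did_not_graduate, p_graduate):
    """Share of (non-graduate, graduate) pairs ranked correctly; ties count half."""
    non_grads = [p for y, p in zip(did_not_graduate, p_graduate) if y]
    grads = [p for y, p in zip(did_not_graduate, p_graduate) if not y]
    wins = sum((n < g) + 0.5 * (n == g) for n in non_grads for g in grads)
    return wins / (len(non_grads) * len(grads))

# Toy data, invented for illustration
p_graduate = [0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9, 0.35]
did_not_graduate = [1, 1, 1, 0, 0, 0, 0, 0]
flagged = [p < 0.5 for p in p_graduate]  # assumed threshold of 0.5

metrics = classification_metrics(did_not_graduate, flagged)
ranking_auc = auc(did_not_graduate, p_graduate)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {ranking_auc:.3f}")
```

Accuracy, recall, precision, and F1 all depend on where the flagging threshold sits. AUC does not, because it looks only at how the predicted probabilities rank students against each other.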
The single most important sentence in the metrics conversation:
“Precise does not mean certain. It means the uncertainty is quantified and attributable, rather than guessed.”
A 67.3% predicted graduation probability does not mean "this student will probably graduate." It means the model, based on patterns from similar students whose outcomes are known, estimates a 67.3% chance, and the contributions of each driver can be inspected. That is a fundamentally different kind of claim than a confident bullet on a slide, and it is what makes the output defensible in a committee.
The min-group floor
The most common way a graduation model produces a misleading insight is small-group inference. A pipeline that surfaces a "course combination with 96% graduation rate" based on nine students has surfaced noise, not signal. The combination might genuinely be predictive, or it might be that those nine students happened to be the strongest cohort of the year, or that one outlier is pulling the average.
The right defense is procedural, not algorithmic: a minimum group size floor (commonly n ≥ 20) enforced on any aggregated output. Findings below the threshold get flagged as low-confidence and excluded from action recommendations.
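A floor like this is a few lines of code wherever aggregated findings are produced. The sketch below assumes each finding carries its group size `n`; the finding structure, group names, and rates are hypothetical.

```python
MIN_GROUP_SIZE = 20  # the commonly used floor mentioned above

def apply_min_group_floor(findings, floor=MIN_GROUP_SIZE):
    """Split findings into actionable ones and ones flagged as low-confidence."""
    actionable, low_confidence = [], []
    for finding in findings:
        target = actionable if finding["n"] >= floor else low_confidence
        target.append(finding)
    return actionable, low_confidence

# Hypothetical aggregated findings
findings = [
    {"group": "MATH301 + CS201", "n": 212, "grad_rate": 0.78},
    {"group": "ART110 + PHIL205", "n": 9, "grad_rate": 1.00},
    {"group": "BIO101 + CHEM101", "n": 20, "grad_rate": 0.61},
]

actionable, low_confidence = apply_min_group_floor(findings)
print("actionable:", [f["group"] for f in actionable])
print("low confidence:", [f["group"] for f in low_confidence])
```

The nine-student group is kept and labeled rather than deleted, so analysts can still see it while it stays out of the recommendations.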
This is the kind of detail that gets skipped in vendor demos and shows up six months later as a board-level embarrassment ("we recommended every freshman take this combination, based on nine students"). Asking how the vendor handles the small-group case is one of the highest-leverage questions in a procurement conversation.
Drift and the year-3 retrain rule
Predictive validity decays with horizon, and graduation prediction is the longest-horizon model an institution typically runs. Three forces drive the decay.
Concept drift: the relationship between features and outcomes shifts over time. Pre-COVID and post-COVID cohorts behave differently, and a model trained on one will not perform as well on the other.
Life-event unpredictability: financial, health, and personal events not captured in any feature set compound over a six-year horizon. The longer the horizon, the more of the outcome is determined by things the model cannot see.
Data staleness: the oldest cohort in the training data may no longer represent today's entrants. Curricular changes, admissions changes, and demographic shifts all matter.
The practical sweet spot looks like this. Predictions are high-confidence at roughly one to two years out, and still operationally usable at three to four years out if the underlying data is stable. Separately, the model itself should never go more than three years without retraining, regardless of how stable the data looks. Ideally, retrain every term or year as new data lands: reload the data, refit, re-evaluate, and swap in the new model only if it scores better on the held-out test set.
A graduation model that has not been retrained in three years is a stale model, even if its original validation numbers were strong. The retraining cadence is part of the design, not an afterthought.
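The cadence can be written down as two small rules. This is a sketch under stated assumptions: both function names are hypothetical, AUC stands in for whatever comparison metric the institution uses, and both models are assumed to be scored on the same held-out test set.

```python
def needs_retrain(last_trained_year, current_year, max_age_years=3):
    """Hard rule: a model at or past the age limit is retrained regardless of its scores."""
    return current_year - last_trained_year >= max_age_years

def pick_model(current_auc, retrained_auc):
    """Swap only when the retrained model scores strictly better on the same held-out set."""
    return "retrained" if retrained_auc > current_auc else "current"

print(needs_retrain(2021, 2024))   # three years old
print(needs_retrain(2023, 2024))   # one year old
print(pick_model(0.911, 0.915))    # retrained model scores higher
print(pick_model(0.911, 0.905))    # retrained model scores lower
```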
Correlation, not causation
Every output of a graduation model carries a structural caveat that is worth surfacing in plain language. The model identifies course-combination effects, demographic patterns, and predictor relationships that are associated with graduation in the training data. It does not prove that making a course a prerequisite, or changing an admissions criterion, would change the outcome.
The right framing for the academic affairs office is: "this is the strongest empirical signal we have for further investigation," not "this is the cause." The pipeline itself should surface this caveat in its own output, not bury it in the methodology section.
In practice, the model's findings tend to be the starting point for a small pilot or a more rigorous quasi-experimental study, not the end of the conversation. That is the right use of the output.
Where Clema fits in
Clema builds graduation models in the shape described above: gradient-boosted trees as the predictor, SHAP for per-student explanations, a min-group floor enforced on aggregated findings, and a retraining cadence built into the pipeline from day one. Each model is partner-only; an institution's data trains its own model, not a shared cross-school one.
For the term-to-term version, the retention prediction deep-dive covers the parallel exercise. For the institution-level questions a graduation model does not answer (enrollment forecasting, program viability, transfer yield), the enrollment and program viability post is the companion. For the model-selection argument behind the choice of gradient-boosted trees, see the XGBoost vs LLM post.