Predicting term-to-term retention: what a working model actually looks at

Inside a retention model evaluated on 276,519 unseen student-term records.

Clema Research Team
June 5, 2026

Introduction

Retention prediction is the use case most IR teams start with, and for good reason. The decision window is short (between this term and the next), the financial argument is the easiest to make in a deans' meeting, and the data is among the cleanest the institution has.

This post is the practical anatomy of a working retention model. Not a pitch. Not a methodology paper. A walk through the features, engineering choices, and timing decisions that actually determine whether the model's output ends up on an advisor's desk on Monday morning. The example figures throughout come from a model evaluated on 276,519 unseen student-term records from a six-year institutional dataset.

What term-to-term retention means in practice

Term-to-term retention is the simplest version of the persistence question. For each enrolled student, the model predicts a probability that the student will be enrolled again in the next regular term. Yes or no, with a quantified confidence.

It is not graduation prediction, which has a multi-year horizon and a different feature mix. It is not cohort retention as reported to IPEDS, which is an aggregate, not a per-student score. It is the specific, narrow question: will this individual be back next term, and how sure is the model.

The reason it is worth modeling at all is that whether to come back is the single most expensive decision the student is about to make, both for the student and for the institution. Catching the signal a term in advance is the difference between an intervention and a post-mortem.

The data shape

A six-year institutional dataset at roughly 40,000 students per year produces around 240,000 student-level rows. For retention modeling, those rows are expanded to the student-term grain, where every student contributes one row per enrolled term. After encoding, the retention dataset reaches approximately 1.84 million rows and 631 columns. The graduation dataset is built differently and is covered in the graduation prediction deep-dive.

Property | Retention model
Grain | One row per student per term
Rows after encoding | ~1.84M
Columns after encoding | ~631
Outcome variable | Retained to next term (Y/N)
Class balance (retained vs not) | ~87% retained / ~13% not
Test set used for evaluation | 276,519 unseen student-term records
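
The expansion to the student-term grain described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline behind the figures in this post: the field names (`student_id`, `terms_enrolled`), the term codes, and the sample records are all hypothetical.

```python
# Hypothetical ordered list of regular terms at the institution.
TERM_CALENDAR = ["2021FA", "2022SP", "2022FA", "2023SP"]

# Hypothetical student-level records: one row per student.
students = [
    {"student_id": "S1", "terms_enrolled": ["2021FA", "2022SP", "2022FA"]},
    {"student_id": "S2", "terms_enrolled": ["2021FA"]},
]


def to_student_term_rows(student_rows, calendar=TERM_CALENDAR):
    """Expand student-level rows to one row per student per enrolled term."""
    # Map each term to the regular term that follows it.
    next_term = dict(zip(calendar, calendar[1:]))
    rows = []
    for student in student_rows:
        enrolled = set(student["terms_enrolled"])
        for term in student["terms_enrolled"]:
            if term not in next_term:
                # Last term in the calendar: the outcome is not observed yet.
                continue
            rows.append({
                "student_id": student["student_id"],
                "term": term,
                "retained_next_term": next_term[term] in enrolled,
            })
    return rows


student_term_rows = to_student_term_rows(students)
for row in student_term_rows:
    print(row)
```

The sketch ignores graduates, who would otherwise be mislabeled as not retained; a real pipeline would need to exclude or relabel them.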

Two things to notice. First, the dataset is large enough that gradient-boosted trees are clearly the right tool. The XGBoost vs LLM post walks through why this matters and what the benchmark literature shows.

Second, the class balance (87% retained vs 13% not) is the single most important number in that table. It is the reason raw accuracy is misleading and the reason recall, not accuracy, is the operating metric for this problem.

Features that move the needle

A retention model at this scale typically uses 12 source features plus 16 engineered features, for 28 features per student-term row before encoding. The categories are familiar to any IR team: student background, academic performance, course load, risk signals, and program context.

The features that consistently dominate (across years, across cohorts, and across institutions) are academic-progress features. Cumulative credits earned and cumulative percent credit completion sit at the top. Prior-term credits and the GPA family (current term, prior term, cohort-relative) follow.

Demographic features (gender, race/ethnicity) sit near zero in importance. This is what you want: the model relies on performance and progress, not on demographic shortcuts. It is also one of the easier things to defend in a faculty governance conversation.

Engineering decisions that matter

The choices that distinguish a working model from a mediocre one are mostly upstream of the algorithm. They are in how the features are constructed.

1. DFW rate as a difficulty proxy

Historical share of D, F, and Withdrawal grades

Instead of labeling courses with a coarse STEM vs non-STEM flag, the model uses each course's historical DFW rate as a continuous measure of difficulty. Courses above a 0.30 DFW rate are treated as "hard." This single change captures something the categorical flag cannot: that intro physics and a freshman writing seminar can have very different difficulty profiles even when they share a department code.
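
A minimal sketch of the DFW-rate feature, assuming a flat history of (course, grade) records. The course codes and grades below are made up for illustration; only the 0.30 threshold comes from the text.

```python
from collections import defaultdict

HARD_DFW_THRESHOLD = 0.30  # threshold stated in the text
DFW_GRADES = {"D", "F", "W"}

# Hypothetical historical grade records: (course_id, grade).
grade_history = [
    ("PHYS101", "A"), ("PHYS101", "D"), ("PHYS101", "F"),
    ("PHYS101", "W"), ("PHYS101", "B"),
    ("WRIT100", "A"), ("WRIT100", "B"), ("WRIT100", "C"),
    ("WRIT100", "B"), ("WRIT100", "F"),
]


def dfw_rates(records):
    """Return each course's share of D, F, and W grades."""
    totals = defaultdict(int)
    dfw = defaultdict(int)
    for course, grade in records:
        totals[course] += 1
        if grade in DFW_GRADES:
            dfw[course] += 1
    return {course: dfw[course] / totals[course] for course in totals}


rates = dfw_rates(grade_history)
# Continuous difficulty measure plus the "hard" flag derived from it.
is_hard = {course: rate > HARD_DFW_THRESHOLD for course, rate in rates.items()}
print(rates)
print(is_hard)
```

To avoid leaking the outcome, the rate would presumably be computed only from terms before the one being predicted. The sketch also treats plus and minus grades (D+, D-) as out of scope.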

2. Lag and cumulative features

Trajectory, not snapshot

Prior-term GPA, prior-term credits, prior DFW count, and running cumulative credit completion let the model see a student's trajectory over time, not just a current snapshot. Tree models do not natively know about sequence, so these engineered features are what encode "this student has been sliding for two terms" rather than just "this student's current GPA is 2.6."
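
The lag and cumulative features can be sketched as a single pass over one student's terms in chronological order. The column names and the three sample terms are hypothetical. The sketch also assumes the cumulative values include the current term, which only fits an end-of-term prediction.

```python
# Hypothetical term records for one student, in chronological order.
terms = [
    {"term": "2021FA", "gpa": 3.1, "attempted": 15, "earned": 15, "dfw": 0},
    {"term": "2022SP", "gpa": 2.8, "attempted": 15, "earned": 12, "dfw": 1},
    {"term": "2022FA", "gpa": 2.6, "attempted": 15, "earned": 9, "dfw": 2},
]


def add_lag_and_cumulative(term_rows):
    """Attach prior-term and running cumulative features to each term row."""
    out = []
    prev = None
    cum_attempted = 0
    cum_earned = 0
    for row in term_rows:
        cum_attempted += row["attempted"]
        cum_earned += row["earned"]
        out.append({
            **row,
            # Lag features: None for the first term, where no prior exists.
            "prior_gpa": prev["gpa"] if prev else None,
            "prior_credits": prev["earned"] if prev else None,
            "prior_dfw": prev["dfw"] if prev else None,
            # Running totals up to and including this term.
            "cum_credits_earned": cum_earned,
            "cum_pct_completion": (
                cum_earned / cum_attempted if cum_attempted else None
            ),
        })
        prev = row
    return out


featured = add_lag_and_cumulative(terms)
print(featured[-1])
```

In the last row, the prior GPA of 2.8 next to a current GPA of 2.6 and a falling completion rate is what lets a tree model see the slide rather than only the snapshot.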

3. Cohort-relative GPA

Performance in context

A 2.8 GPA in a program where the cohort average is 3.2 is a different signal than a 2.8 in a program where the cohort average is 2.5. Comparing each student's GPA against peers from the same entry term and program lets the model judge performance in context, which materially changes which students get flagged.
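
One way to express this feature is the difference between a student's GPA and the mean GPA of peers with the same entry term and program. The text does not say whether the model uses a plain difference, a z-score, or a percentile, so the plain difference here is an assumption. The sample rows are hypothetical and chosen to reproduce the 3.2 and 2.5 cohort averages from the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: two students with the same 2.8 GPA in different cohorts.
rows = [
    {"student_id": "A", "entry_term": "2021FA", "program": "BIO", "gpa": 2.8},
    {"student_id": "B", "entry_term": "2021FA", "program": "BIO", "gpa": 3.4},
    {"student_id": "C", "entry_term": "2021FA", "program": "BIO", "gpa": 3.4},
    {"student_id": "D", "entry_term": "2021FA", "program": "ART", "gpa": 2.8},
    {"student_id": "E", "entry_term": "2021FA", "program": "ART", "gpa": 2.2},
]


def add_cohort_relative_gpa(student_rows):
    """Add each student's GPA minus the mean GPA of their cohort."""
    groups = defaultdict(list)
    for r in student_rows:
        groups[(r["entry_term"], r["program"])].append(r["gpa"])
    cohort_mean = {key: mean(gpas) for key, gpas in groups.items()}
    return [
        {
            **r,
            "gpa_vs_cohort": r["gpa"]
            - cohort_mean[(r["entry_term"], r["program"])],
        }
        for r in student_rows
    ]


relative = {r["student_id"]: r["gpa_vs_cohort"]
            for r in add_cohort_relative_gpa(rows)}
print(relative)
```

Students A and D share a 2.8 GPA, but A sits about 0.4 below the cohort mean while D sits about 0.3 above it.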

4. CIP family grouping

Signal over noise

Hundreds of distinct program codes collapse into a small set of CIP families (STEM, SOCIAL, HUMANITIES, APPLIED). This trades off some specificity for a much stronger signal. The model can learn that program types behave differently, without overfitting to programs that have only a handful of students per year.
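
The grouping can be sketched as a lookup from the two-digit CIP series to a family. The four family names come from the text; which series lands in which family is an institutional choice, so the assignments below, the `UNMAPPED` fallback, and the function name are hypothetical.

```python
# Hypothetical assignment of two-digit CIP series to the four families.
CIP_FAMILY = {
    "11": "STEM",        # Computer and information sciences
    "14": "STEM",        # Engineering
    "26": "STEM",        # Biological and biomedical sciences
    "27": "STEM",        # Mathematics and statistics
    "40": "STEM",        # Physical sciences
    "42": "SOCIAL",      # Psychology
    "45": "SOCIAL",      # Social sciences
    "23": "HUMANITIES",  # English language and literature
    "38": "HUMANITIES",  # Philosophy and religious studies
    "54": "HUMANITIES",  # History
    "13": "APPLIED",     # Education
    "51": "APPLIED",     # Health professions
    "52": "APPLIED",     # Business, management, marketing
}


def cip_family(cip_code, default="UNMAPPED"):
    """Collapse a CIP code such as '26.0101' to its family label."""
    # The two-digit series is the part before the dot.
    series = cip_code.split(".")[0].zfill(2)
    return CIP_FAMILY.get(series, default)


for code in ["26.0101", "45.1001", "52.0201", "99.9999"]:
    print(code, cip_family(code))
```

Returning an explicit fallback label rather than guessing a family keeps unmapped programs visible for review.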

5. Leakage control

The model cheating, prevented

Some features sit too close to the outcome and would let the model "cheat" by indirectly seeing the answer. For graduation prediction, total credit completion is the most common offender. The pipeline removes these deliberately. The cost is a few points of test accuracy. The benefit is that the score actually generalizes to next year's students.

Class imbalance and why 87% accuracy can mislead

If 87% of students are retained, a model that predicts "retained" for every single student scores 87% accuracy. It also catches zero real dropouts. That is the central problem with using raw accuracy on this kind of dataset, and it is why the operating metric for retention is recall (the share of real dropouts that the model catches), not accuracy.
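
The arithmetic is easy to check. The sketch below uses 100 hypothetical students at the 87/13 split and scores the "predict retained for everyone" baseline. Coding non-return as 1 is an assumption made here for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of true positives that the predictions catch."""
    actual = sum(1 for t in y_true if t == positive)
    caught = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    return caught / actual if actual else 0.0


# 0 = retained, 1 = did not return (assumed coding).
y_true = [0] * 87 + [1] * 13
# Baseline that predicts "retained" for every student.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred, positive=1)
print(baseline_accuracy)  # 0.87
print(baseline_recall)    # 0.0
```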

The way a good retention pipeline handles this is not by oversampling the minority class. SMOTE-style techniques fabricate fractional categories on one-hot encoded course features and tend to make the model worse, not better. The approach that holds up is a dampened class reweighting: XGBoost's scale_pos_weight set to the square root of the negative-to-positive ratio. This lifts the minority class without flipping the natural prior, and it pairs well with stratified group K-fold cross-validation that preserves the dropout ratio across folds without duplicating students.
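
A minimal sketch of the dampened weight, assuming non-return is coded as the positive class (label 1) so that `scale_pos_weight` applies to the minority. The counts are the 87/13 split scaled to 100 students. The `params` dictionary is illustrative and is not passed to XGBoost here.

```python
import math


def dampened_scale_pos_weight(n_negative, n_positive):
    """Square root of the negative-to-positive ratio."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return math.sqrt(n_negative / n_positive)


n_retained = 87  # negative class under the assumed coding
n_dropout = 13   # positive class under the assumed coding

full_ratio = n_retained / n_dropout
weight = dampened_scale_pos_weight(n_retained, n_dropout)
print(round(full_ratio, 2))  # 6.69
print(round(weight, 2))      # 2.59

# Illustrative parameter dictionary; scale_pos_weight and
# binary:logistic are standard XGBoost parameter names.
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": weight,
}
```

The full ratio of about 6.69 would weight each non-returning student almost seven times as heavily as a retained one; the square root brings that down to about 2.59. For the cross-validation side, scikit-learn's `StratifiedGroupKFold` is one available implementation of a stratified split that keeps each student's rows in a single fold.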

The practical result is a model that achieves roughly 86.6% accuracy and 91 to 92% recall on the held-out test set. The accuracy number is similar to "predict retained for everyone," but the recall number is what tells you the model is actually doing the job.

The prediction-timing question

The single most consequential design question for a retention model is when in the student timeline the prediction is made. The answer determines whether the system supports intervention or only retrospective analysis. The same model with the same features, run at different points in the term, produces very different operational value.

Prediction point | Features available | Intervention window
Start of term | Prior GPA, Pell status, admit type, course load and difficulty, prior DFW. No current-term grades or attendance yet. | Full semester. Proactive check-in with incoming students.
Mid-term (week 6 to 8) | Adds mid-term grades, attendance, early DFW signals. | Half a semester. Identify who is struggling now.
End of term | All features, including term GPA, this-term DFW, credits earned. | Between-terms window only. Predict next-term non-return after the current term is over.

The trade-off is real and direct: earlier predictions have less information, so the AUC is lower; later predictions have more information but leave almost no window to act. The right answer for most institutions is to run more than one. A start-of-term model triggers proactive outreach for the highest-risk incoming students. A mid-term model picks up new signals from current performance. The end-of-term model is the most accurate but is best used for cohort analysis and program-level conclusions, not for individual intervention.
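
Running more than one model mostly comes down to restricting each one to the features that exist at its prediction point. A sketch of that restriction follows, with hypothetical feature names loosely based on the three prediction points described above.

```python
PREDICTION_POINTS = ["start", "midterm", "end"]

# Hypothetical feature names, tagged with the earliest point at which
# each one is available.
FEATURE_AVAILABILITY = {
    "prior_gpa": "start",
    "pell_status": "start",
    "admit_type": "start",
    "course_load": "start",
    "course_difficulty": "start",
    "prior_dfw": "start",
    "midterm_grades": "midterm",
    "attendance": "midterm",
    "early_dfw_signals": "midterm",
    "term_gpa": "end",
    "term_dfw": "end",
    "credits_earned": "end",
}


def features_at(point):
    """Return the features that are known by the given prediction point."""
    cutoff = PREDICTION_POINTS.index(point)
    return sorted(
        name
        for name, available in FEATURE_AVAILABILITY.items()
        if PREDICTION_POINTS.index(available) <= cutoff
    )


for point in PREDICTION_POINTS:
    print(point, len(features_at(point)))
```

Keeping the availability tags in one place makes it harder for an end-of-term feature to slip into a start-of-term model by accident.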

Most institutions that fail to operationalize a retention model do so because they only built the end-of-term version. The accuracy looked good in the demo, but the output arrived too late to matter.

What the model found

A representative finding from running the model across the population: academic-progress features dominate the retention prediction. Cumulative credits earned, cumulative percent credit completion, prior-term credits, and the GPA family (current term, prior term, cohort-relative) account for most of the model's predictive signal.

Course difficulty (the DFW-rate engineered feature) contributes meaningfully, especially in combination with low cumulative credit completion. Demographic features sit near zero, which is the right answer.

The most operationally useful output is not the population-level summary but the per-student SHAP decomposition. A real, model-derived advising statement looks like this:

"BIO101 is dangerous specifically for first-generation students with cumulative GPA below 2.5 who already have a prior-term DFW. Their predicted next-term retention drops by 18%."

That sentence is what makes the model useful. It tells the advising office exactly which students to talk to before they enroll in the next BIO101 section, and what specifically to talk to them about.

The XGBoost vs LLM post covers how SHAP produces statements like that in plain English, including a worked example of a single 67.3% prediction broken down by driver.
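
The property that makes a per-student decomposition possible is additivity: a baseline value plus the per-feature contributions equals the model's raw output for that student. For boosted-tree classifiers the contributions are commonly expressed in log-odds. The sketch below illustrates only that bookkeeping. It does not call the SHAP library, and the baseline, feature names, and contribution values are made up.

```python
import math


def sigmoid(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical baseline (log-odds of retention across the population)
# and hypothetical per-feature contributions for one student.
base_log_odds = 1.9
contributions = {
    "cum_pct_completion": -0.9,
    "prior_dfw": -0.5,
    "course_dfw_rate": -0.3,
    "term_gpa": 0.2,
}


def explain(base, contribs):
    """Return the predicted probability and drivers ranked by magnitude."""
    log_odds = base + sum(contribs.values())
    drivers = sorted(
        contribs.items(), key=lambda kv: abs(kv[1]), reverse=True
    )
    return sigmoid(log_odds), drivers


probability, drivers = explain(base_log_odds, contributions)
print(f"predicted retention probability: {probability:.3f}")
for name, value in drivers:
    direction = "lowers" if value < 0 else "raises"
    print(f"{name} {direction} the log-odds by {abs(value):.1f}")
```

Ranking the contributions by magnitude is what turns a bare score into a short list of reasons an advisor can act on.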

How leadership should read the output

Three questions to ask when a retention model lands on the leadership table.

What is the recall, not the accuracy? Anything in the 90% range on recall is operationally useful. Anything below 70% is going to miss real dropouts in volumes that matter.

When in the term is the prediction made? If the answer is "end of term," ask why there is not also a start-of-term version. The intervention window is the whole point.

How is the score decomposed? If the model returns only a number, with no SHAP-style explanation of what drove it, it is going to be hard to operationalize. Advisors do not act on bare numbers; they act on reasons.

Two questions to ignore in the first conversation.

How sophisticated is the model? The right model is the one that works on the data the institution has. XGBoost is the safe choice at this scale, and the XGBoost vs LLM post covers why.

How does it compare to the incumbent platform? The platform comparison matters at the procurement stage, not the model-evaluation stage. The cost and vendor reality post handles that question separately.

Where Clema fits in

Clema builds retention models in the shape described above: XGBoost as the predictor, SHAP for per-student explanations, recall as the operating metric, and partner-only training so the model is built on (and stays with) the institution's own data. We tend to build two variants where it makes sense (a start-of-term model for proactive intervention and an end-of-term model for cohort and program analysis), and we route the output to the office that actually owns the decision.

For the broader strategic case, the reactive-to-proactive playbook is the companion. For the model-science question, the XGBoost vs LLM post. For the graduation version of this same exercise, the graduation prediction deep-dive.

See a retention model on your institution's data

Walk through what a working retention model looks like with the SIS, LMS, and aid data your IR team already collects.

Book a Clema demo
