Predicting term-to-term retention: what a working model actually looks at

Inside a retention model evaluated on 276,519 unseen student-term records.

June 5, 2026



Introduction

Retention prediction is the use case most IR teams start with, and for good reason. The decision window is short (between this term and the next), the financial argument is the easiest to make in a deans' meeting, and the data is among the cleanest the institution has. It is also the use case most teams never reach: in our interviews with 50+ IR professionals across 19 states, ad-hoc requests consumed 40-60% of team capacity on average, which is precisely the time a modeling project needs.

This post is the practical anatomy of a working retention model. Not a pitch. Not a methodology paper. A walk through the features, engineering choices, and timing decisions that actually determine whether the model's output ends up on an advisor's desk on Monday morning. The example figures throughout come from a model evaluated on 276,519 unseen student-term records from a six-year institutional dataset.

What term-to-term retention means in practice

Term-to-term retention is the simplest version of the persistence question. For each enrolled student, the model predicts a probability that the student will be enrolled again in the next regular term. Yes or no, with a quantified confidence.

It is not graduation prediction, which has a multi-year horizon and a different feature mix. It is not cohort retention as reported to IPEDS, which is an aggregate, not a per-student score. It is the specific, narrow question: will this individual be back next term, and how sure is the model.

The reason it is worth modeling at all is that "will be back" is the single most expensive decision the student is about to make, both for the student and for the institution. Catching the signal a term in advance is the difference between an intervention and a post-mortem.

The data shape

A six-year institutional dataset at roughly 40,000 students per year produces around 240,000 student-level rows. For retention modeling, those rows are expanded to the student-term grain, where every student contributes one row per enrolled term. After encoding, the retention dataset reaches approximately 1.84 million rows and 631 columns. The graduation dataset is built differently and is covered in the graduation prediction deep-dive.

Property	Retention model
Grain	One row per student per term
Rows after encoding	~1.84M
Columns after encoding	~631
Outcome variable	Retained to next term (Y/N)
Class balance (retained vs not)	~87% retained / ~13% not
Test set used for evaluation	276,519 unseen student-term records

Two things to notice. First, the dataset is large enough that gradient-boosted trees are clearly the right tool. The XGBoost vs LLM post walks through why this matters and what the benchmark literature shows.

Second, the class balance (87% retained vs 13% not) is the single most important number in that table. It is the reason raw accuracy is misleading and the reason recall, not accuracy, is the operating metric for this problem.

Features that move the needle

A retention model at this scale typically uses 12 source features plus 16 engineered features, for 28 features per student-term row before encoding. The categories are familiar to any IR team: student background, academic performance, course load, risk signals, and program context.

The features that consistently dominate (across years, across cohorts, and across institutions) are academic-progress features. Cumulative credits earned and cumulative percent credit completion sit at the top. Prior-term credits and the GPA family (current term, prior term, cohort-relative) follow.

Demographic features (gender, race/ethnicity) sit near zero in importance. This is what you want: the model relies on performance and progress, not on demographic shortcuts. It is also one of the easier things to defend in a faculty governance conversation.

Engineering decisions that matter

The choices that distinguish a working model from a mediocre one are mostly upstream of the algorithm. They are in how the features are constructed.

DFW rate as a difficulty proxy

Historical share of D, F, and Withdrawal grades

Instead of labeling courses with a coarse STEM vs non-STEM flag, the model uses each course's historical DFW rate as a continuous measure of difficulty. Courses above a 0.30 DFW rate are treated as "hard." This single change captures something the categorical flag cannot: that two courses can have very different difficulty profiles even when they carry the same flag or share a department code.
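A minimal sketch of how a DFW-rate feature could be computed, assuming historical grades are available as (course, grade) pairs. The function name, the "W" code for withdrawals, and the sample records are illustrative assumptions, not details of the pipeline described here; only the 0.30 cutoff comes from the text.

```python
from collections import defaultdict

DFW_GRADES = {"D", "F", "W"}   # assumed grade codes; adjust to the SIS's own
HARD_THRESHOLD = 0.30          # cutoff stated in the text


def dfw_rates(records):
    """Return {course: share of D/F/W grades} from (course, grade) pairs."""
    totals = defaultdict(int)
    dfw = defaultdict(int)
    for course, grade in records:
        totals[course] += 1
        if grade in DFW_GRADES:
            dfw[course] += 1
    return {course: dfw[course] / totals[course] for course in totals}


# Illustrative history only, not real data.
history = [
    ("PHYS101", "A"), ("PHYS101", "D"), ("PHYS101", "F"),
    ("PHYS101", "W"), ("PHYS101", "B"),
    ("WRIT100", "A"), ("WRIT100", "B"), ("WRIT100", "B"),
    ("WRIT100", "C"), ("WRIT100", "F"),
]

rates = dfw_rates(history)
# Continuous rate is the feature; the boolean is the "hard" label from the text.
is_hard = {course: rate > HARD_THRESHOLD for course, rate in rates.items()}
```

In practice the rate would presumably be computed only from terms before the one being scored, so that a course's current grades do not feed its own difficulty measure.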

Lag and cumulative features

Trajectory, not snapshot

Prior-term GPA, prior-term credits, prior DFW count, and running cumulative credit completion let the model see a student's trajectory over time, not just a current snapshot. Tree models do not natively know about sequence, so these engineered features are what encode "this student has been sliding for two terms" rather than just "this student's current GPA is 2.6."
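As a rough illustration of lag and cumulative features, the sketch below walks one student's terms in chronological order. The column names are hypothetical, "prior DFW count" is interpreted here as the running count across all earlier terms, and the cumulative totals are assumed to include the current term (an end-of-term view); a start-of-term model would need totals that stop at the previous term.

```python
def add_trajectory_features(terms):
    """Add lag and cumulative columns to one student's chronological term rows.

    Each input row is a dict with 'gpa', 'credits_attempted',
    'credits_earned', and 'dfw_count' (hypothetical column names).
    """
    out = []
    prev = None
    cum_attempted = 0
    cum_earned = 0
    prior_dfw = 0
    for row in terms:
        cum_attempted += row["credits_attempted"]
        cum_earned += row["credits_earned"]
        new = dict(row)
        # Lag features: values from the immediately preceding term, if any.
        new["prior_term_gpa"] = prev["gpa"] if prev else None
        new["prior_term_credits"] = prev["credits_earned"] if prev else None
        # Running count of D/F/W grades in all earlier terms.
        new["prior_dfw_count"] = prior_dfw
        # Cumulative features, including the current term (assumption).
        new["cum_credits_earned"] = cum_earned
        new["cum_pct_completion"] = (
            cum_earned / cum_attempted if cum_attempted else None
        )
        out.append(new)
        prior_dfw += row["dfw_count"]
        prev = row
    return out


# Illustrative student who has been sliding for two terms.
student_terms = [
    {"gpa": 3.1, "credits_attempted": 15, "credits_earned": 15, "dfw_count": 0},
    {"gpa": 2.8, "credits_attempted": 15, "credits_earned": 12, "dfw_count": 1},
    {"gpa": 2.6, "credits_attempted": 15, "credits_earned": 9, "dfw_count": 2},
]
featured = add_trajectory_features(student_terms)
```

The third row now carries the slide explicitly: a prior-term GPA above the current one, fewer prior-term credits than attempted, and a falling completion rate.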

Cohort-relative GPA

Performance in context

A 2.8 GPA in a program where the cohort average is 3.2 is a different signal than a 2.8 in a program where the cohort average is 2.5. Comparing each student's GPA against peers from the same entry term and program lets the model judge performance in context, which materially changes which students get flagged.
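One way to express this, sketched under the assumption that "cohort-relative" means the student's GPA minus the mean GPA of peers with the same entry term and program. A ratio or z-score would be equally plausible; the field names and sample rows are made up to mirror the 3.2 and 2.5 cohort averages above.

```python
from collections import defaultdict
from statistics import mean


def cohort_relative_gpa(rows):
    """Add 'gpa_vs_cohort': GPA minus the mean GPA of the same entry term and program."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["entry_term"], r["program"])].append(r["gpa"])
    cohort_means = {key: mean(values) for key, values in groups.items()}
    return [
        {**r, "gpa_vs_cohort": r["gpa"] - cohort_means[(r["entry_term"], r["program"])]}
        for r in rows
    ]


# Illustrative rows: two students with the same 2.8 GPA in different cohorts.
rows = [
    {"student": "a1", "entry_term": "2022FA", "program": "NURS", "gpa": 2.8},
    {"student": "a2", "entry_term": "2022FA", "program": "NURS", "gpa": 3.2},
    {"student": "a3", "entry_term": "2022FA", "program": "NURS", "gpa": 3.6},
    {"student": "b1", "entry_term": "2022FA", "program": "GENS", "gpa": 2.8},
    {"student": "b2", "entry_term": "2022FA", "program": "GENS", "gpa": 2.5},
    {"student": "b3", "entry_term": "2022FA", "program": "GENS", "gpa": 2.2},
]
relative = {r["student"]: r["gpa_vs_cohort"] for r in cohort_relative_gpa(rows)}
```

Student a1 comes out about 0.4 below the cohort mean and student b1 about 0.3 above it, even though both hold a 2.8.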

CIP family grouping

Signal over noise

Hundreds of distinct program codes collapse into a small set of CIP families (STEM, SOCIAL, HUMANITIES, APPLIED). This trades off some specificity for a much stronger signal. The model can learn that program types behave differently, without overfitting to programs that have only a handful of students per year.
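A sketch of the grouping step, keyed on the two-digit CIP series at the front of each program code. Which series belongs to which family is an assumption made for illustration, as is the "OTHER" fallback; the source names only the four families.

```python
# Two-digit CIP series -> family. The assignment is illustrative, not the
# mapping used by the pipeline described in this post.
CIP_FAMILY = {
    "11": "STEM",        # Computer and information sciences
    "14": "STEM",        # Engineering
    "26": "STEM",        # Biological and biomedical sciences
    "27": "STEM",        # Mathematics and statistics
    "40": "STEM",        # Physical sciences
    "42": "SOCIAL",      # Psychology
    "45": "SOCIAL",      # Social sciences
    "23": "HUMANITIES",  # English language and literature
    "38": "HUMANITIES",  # Philosophy and religious studies
    "54": "HUMANITIES",  # History
    "13": "APPLIED",     # Education
    "51": "APPLIED",     # Health professions
    "52": "APPLIED",     # Business
}


def cip_family(cip_code, default="OTHER"):
    """Map a CIP code such as '26.0101' to its family via the two-digit series."""
    series = cip_code.split(".")[0].zfill(2)
    return CIP_FAMILY.get(series, default)


families = [cip_family(code) for code in ("26.0101", "52.0201", "45.1001")]
```

Programs with only a handful of students per year then share a family-level signal instead of each getting a sparse column of its own.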

Leakage control

The model cheating, prevented

Some features sit too close to the outcome and would let the model "cheat" by indirectly seeing the answer. For graduation prediction, total credit completion is the most common offender. The pipeline removes these deliberately. The cost is a few points of test accuracy. The benefit is that the score actually generalizes to next year's students.

Class imbalance and why 87% accuracy can mislead

If 87% of students are retained, a model that predicts "retained" for every single student scores 87% accuracy. It also catches zero real dropouts. That is the central problem with using raw accuracy on this kind of dataset, and it is why the operating metric for retention is recall (the share of real dropouts that the model catches), not accuracy.
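The arithmetic is easy to check on a toy population. The sketch below uses 1,000 made-up students in the same 87/13 split, with 1 standing for "did not return" (an assumed label coding), and scores the "predict retained for everyone" baseline.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    caught = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    accuracy = correct / len(y_true)
    recall = caught / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Illustrative population: 870 retained (0), 130 not retained (1).
y_true = [0] * 870 + [1] * 130
always_retained = [0] * len(y_true)

baseline_accuracy, baseline_recall = accuracy_and_recall(y_true, always_retained)
```

The baseline scores 87% accuracy with 0% recall, which is why accuracy alone cannot distinguish a useful model from a constant one.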

A good retention pipeline does not handle this by oversampling the minority class. SMOTE-style techniques fabricate fractional categories on one-hot encoded course features and tend to make the model worse, not better. The approach that holds up is a dampened class reweighting: XGBoost's scale_pos_weight set to the square root of the negative-to-positive ratio, with non-return treated as the positive class. This lifts the minority class without flipping the natural prior, and it pairs well with stratified group K-fold cross-validation, which preserves the dropout ratio across folds and keeps all of a student's terms in the same fold.
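To make the dampening concrete, here is a small sketch of the weight calculation. It assumes non-return is coded as the positive class, so the negative-to-positive ratio is retained over not retained; the counts are illustrative, and the params dict is only a fragment of what would be handed to XGBoost.

```python
import math


def dampened_scale_pos_weight(n_negative, n_positive):
    """Square root of the negative-to-positive ratio."""
    return math.sqrt(n_negative / n_positive)


# Illustrative counts in the 87/13 split; positive class = did not return.
n_retained, n_not_retained = 870, 130

full_ratio = n_retained / n_not_retained            # roughly 6.7
weight = dampened_scale_pos_weight(n_retained, n_not_retained)  # roughly 2.6

# Fragment of an XGBoost parameter dict; scale_pos_weight is the
# library's own parameter name, the rest of the config is omitted.
params = {"objective": "binary:logistic", "scale_pos_weight": weight}
```

Using the full ratio would weight each non-returning student about 6.7 times a retained one; the square root cuts that to about 2.6, which is the sense in which the reweighting is dampened.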

The practical result is a model that achieves roughly 86.6% accuracy and 91 to 92% recall on the held-out test set. The accuracy is about what "predict retained for everyone" would score (87%), but that baseline has zero recall. The recall number is what tells you the model is actually doing the job.

The prediction-timing question

The single most consequential design question for a retention model is when in the student timeline the prediction is made. The answer determines whether the system supports intervention or only retrospective analysis. The same model with the same features, run at different points in the term, produces very different operational value.

Prediction point	Features available	Intervention window
Start of term	Prior GPA, Pell status, admit type, course load and difficulty, prior DFW. No current-term grades or attendance yet.	Full semester. Proactive check-in with incoming students.
Mid-term (week 6 to 8)	Adds mid-term grades, attendance, early DFW signals.	Half a semester. Identify who is struggling now.
End of term	All features, including term GPA, this-term DFW, credits earned.	Between-terms window only. Predict next-term non-return after the current term is over.

The trade-off is real and direct: earlier predictions have less information, so the AUC is lower; later predictions have more information but leave almost no window to act. The right answer for most institutions is to run more than one. A start-of-term model triggers proactive outreach for the highest-risk incoming students. A mid-term model picks up new signals from current performance. The end-of-term model is the most accurate but is best used for cohort analysis and program-level conclusions, not for individual intervention.
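One way to keep several prediction points honest is to declare, per point, which features are allowed, and to filter every row through that list before scoring. The sketch below assumes hypothetical feature names loosely based on the table above; it is not the configuration of any particular pipeline.

```python
# Feature names are illustrative stand-ins for the table's entries.
START_OF_TERM = [
    "prior_gpa", "pell_status", "admit_type",
    "course_load", "course_difficulty", "prior_dfw",
]
MID_TERM = START_OF_TERM + ["midterm_grades", "attendance", "early_dfw_signals"]
END_OF_TERM = MID_TERM + ["term_gpa", "term_dfw", "credits_earned"]

FEATURES_BY_POINT = {
    "start_of_term": START_OF_TERM,
    "mid_term": MID_TERM,
    "end_of_term": END_OF_TERM,
}


def features_for(row, prediction_point):
    """Keep only the features that exist at the given prediction point."""
    allowed = FEATURES_BY_POINT[prediction_point]
    return {name: row[name] for name in allowed if name in row}


# Illustrative row containing both early and late information.
row = {"prior_gpa": 2.9, "course_load": 15, "attendance": 0.82, "term_gpa": 2.4}
early_view = features_for(row, "start_of_term")
late_view = features_for(row, "end_of_term")
```

The start-of-term view drops attendance and term GPA even though they are present in the row, so a model trained for that point cannot lean on information it will not have when it runs.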

Most institutions that fail to operationalize a retention model do so because they only built the end-of-term version. The accuracy looked good in the demo and the output arrived too late to matter.

What the model found

A representative finding from running the model across the population: academic-progress features dominate the retention prediction. Cumulative credits earned, cumulative percent credit completion, prior-term credits, and the GPA family (current term, prior term, cohort-relative) account for most of the model's predictive signal.

Course difficulty (the DFW-rate engineered feature) contributes meaningfully, especially in combination with low cumulative credit completion. Demographic features sit near zero, which is the right answer.

The most operationally useful output is not the population-level summary but the per-student SHAP decomposition. A real, model-derived advising statement looks like this:

“BIO101 is dangerous specifically for first-generation students with cumulative GPA below 2.5 who already have a prior-term DFW. Their predicted next-term retention drops by 18%.”

That sentence is what makes the model useful. It tells the advising office exactly which students to talk to before they enroll in the next BIO101 section, and what specifically to talk to them about.

The XGBoost vs LLM post covers how SHAP produces statements like that in plain English, including a worked example of a single 67.3% prediction broken down by driver.
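For readers who want the mechanics, the sketch below shows the additive structure such an explanation rests on: a base value plus per-feature contributions sums to the model's raw score, which a logistic link turns into a probability. It assumes contributions in log-odds units, which is how tree explainers typically report them for a logistic objective, and every number and feature name is invented for illustration. Which outcome the probability refers to depends on how the label is coded.

```python
import math


def explain(base_value, contributions, top_n=3):
    """Combine additive contributions into a probability and rank the drivers.

    base_value and contributions are assumed to be in log-odds units.
    """
    margin = base_value + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-margin))
    drivers = sorted(
        contributions.items(), key=lambda kv: abs(kv[1]), reverse=True
    )[:top_n]
    return probability, drivers


# Made-up contributions for one student; negative values push the score down.
contributions = {
    "cum_pct_completion": -0.80,
    "prior_term_gpa": -0.50,
    "course_load": 0.20,
    "gender": -0.05,
}
probability, drivers = explain(base_value=1.90, contributions=contributions)
```

The ranked drivers are what an advisor would see as reasons; the near-zero demographic term falls out of the top three on its own.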

How leadership should read the output

Three questions to ask when a retention model lands on the leadership table.

What is the recall, not the accuracy? Anything in the 90% range on recall is operationally useful. Anything below 70% is going to miss real dropouts in volumes that matter.

When in the term is the prediction made? If the answer is "end of term," ask why there is not also a start-of-term version. The intervention window is the whole point.

How is the score decomposed? If the model returns only a number, with no SHAP-style explanation of what drove it, it is going to be hard to operationalize. Advisors do not act on bare numbers; they act on reasons.

Two questions to ignore in the first conversation.

How sophisticated is the model? The right model is the one that works on the data the institution has. XGBoost is the safe choice at this scale, and the XGBoost vs LLM post covers why.

How does it compare to the incumbent platform? The platform comparison matters at the procurement stage, not the model-evaluation stage. The cost and vendor reality post handles that question separately.

Where Clema fits in

Clema builds retention models in the shape described above: XGBoost as the predictor, SHAP for per-student explanations, recall as the operating metric, and partner-only training so the model is built on (and stays with) the institution's own data. We tend to build two variants where it makes sense (a start-of-term model for proactive intervention and an end-of-term model for cohort and program analysis), and we route the output to the office that actually owns the decision.

For the broader strategic case, the reactive-to-proactive playbook is the companion. For the model-science question, the XGBoost vs LLM post. For the graduation version of this same exercise, the graduation prediction deep-dive.

See a retention model on your institution's data

Walk through what a working retention model looks like with the SIS, LMS, and aid data your IR team already collects.

Book a Clema demo

Sources

XGBoost documentation. SHAP documentation. scikit-learn, StratifiedGroupKFold reference. Georgia State University, student success outcomes. NCES Common Core of Data, retention definitions. IPEDS, persistence and retention.


Written by

Clema Research Team

The Clema research team publishes original analysis and practical guides for institutional research and institutional effectiveness professionals.

Frequently asked questions

What exactly does a term-to-term retention model predict?

For each enrolled student, it predicts the probability that the student will be enrolled again in the next regular term, with a quantified confidence. It is not graduation prediction, which has a multi-year horizon, and it is not the aggregate cohort retention reported to IPEDS. It is the narrow per-student question: will this individual be back next term, and how sure is the model.

Which features actually drive a retention prediction?

Academic-progress features dominate consistently across years, cohorts, and institutions. Cumulative credits earned and cumulative percent credit completion sit at the top, followed by prior-term credits and the GPA family (current term, prior term, cohort-relative). Demographic features like gender and race/ethnicity sit near zero in importance, which is what you want: the model relies on performance and progress, not demographic shortcuts.

Why is 87% accuracy misleading for retention?

If 87% of students are retained, a model that predicts retained for everyone scores 87% accuracy and catches zero real dropouts. That is why recall, the share of real dropouts the model catches, is the operating metric, not accuracy. A good pipeline handles the imbalance with dampened class reweighting (scale_pos_weight set to the square root of the negative-to-positive ratio), reaching roughly 86.6% accuracy and 91 to 92% recall.

When in the term should the prediction be made?

It depends on the intervention window you want. A start-of-term model uses prior GPA, Pell status, and course load to trigger proactive outreach across the full semester. A mid-term model adds grades and attendance to catch who is struggling now. An end-of-term model is most accurate but leaves only the between-terms window, so it is best for cohort analysis. Most institutions should run more than one.

Why do institutions fail to operationalize retention models?

The most common reason is that they only built the end-of-term version. The accuracy looked strong in the demo, but the output arrived too late to act on. Without an early-term model and a defined owner who receives the weekly list and closes the loop, the score never reaches an advisor's desk in time to change an outcome.

