What is the difference between reactive reporting and proactive prediction in IR?

Reactive reporting answers past-tense questions at the cohort level, such as last fall's retention rate. By the time the report exists, the students inside the cohort are already past the decision point. Proactive prediction changes the unit of analysis to the individual student, producing a per-student probability and the specific drivers behind it, so an advisor can act while there is still time.

Do we need to buy a platform to start doing predictive analytics?

No. The post treats predictive analytics as a question-selection problem, not a procurement problem. The model families that work best (gradient-boosted trees with SHAP) are open-source and free, and a working retention model can be trained and served for under $100 a month. The harder work is choosing one question, auditing existing data, and deciding who owns the output.

Which prediction should an IR team build first?

Retention is almost always the right starting point. The data is densest, the decision window between this term and the next is shortest, and the financial argument is the easiest to make to leadership. The post recommends resisting the urge to scope a portfolio of models on day one: start with one model, one question, one audience, and one closed loop, then add a second.

What does the Georgia State example actually show about the economic case?

Georgia State tracked 800-plus risk factors daily across 40,000-plus students, ran roughly 90,000 targeted interventions a year, and raised its four-year graduation rate by seven percentage points, with the largest gains among underserved students. The post estimates one point of retention is worth around $3.18M per year in recurring revenue. The expensive part Georgia State did, proving the approach works, is now a free fact.

Where does the time to build a first model come from?

The post points to the ad-hoc request load itself. The companion whitepaper found IR teams spend 550 to 5,333 hours per year on ad-hoc request management, and reclaiming 40 to 60 percent of that frees 28 to 41 working days per year for a small team, which is enough to ship a first model.

Reactive to proactive: a predictive analytics playbook for IR

Introduction

Most institutions already collect every input a serious prediction system needs. Attendance, term-by-term GPA, LMS activity, financial-aid records, course enrollment, program progress, mid-term grades. The data lands in the warehouse, gets pulled into a slide once a year, and then sits.

The gap is not the data. It is the question being asked of it. Almost every standing IR report answers a past-tense question: what was last fall's retention, what was the four-year graduation rate, how did this cohort compare to the last one. Useful, accurate, and almost always too late to do anything about the students inside it. Part of why it stays that way is capacity: in our interviews with 50+ IR professionals across 19 states, ad-hoc requests consumed 40-60% of team capacity on average, leaving little room for the forward-looking question.

This piece is about the other question. Which students, programs, and cohorts are heading where, while there is still time to change the answer. It is written for IR leadership, provosts, and VPs who own the decisions that prediction would actually inform.

The reporting gap

The gap between reporting and prediction is not technical sophistication. It is the unit of analysis.

Traditional IR reporting lives at the cohort level. "First-time, full-time freshmen entering fall 2023 had a 74% retention rate to fall 2024." That sentence is true, defensible, and almost impossible to act on. The cohort is already past the decision point. The students inside it are either back or not.

A predictive system changes the unit. Instead of describing the cohort, it scores each student inside it: this individual has an 85% modeled probability of not returning next term, and here are the three specific drivers behind that score. The same data, framed forward instead of backward, becomes a list of names an advisor can actually call.

The shift is threefold:

Reactive reporting becomes proactive prediction. Same warehouse, same data, different question. Stop asking "what was the retention rate" and start asking "which students are unlikely to return, and why."

Cohort summary becomes student-level score. A 74% cohort retention rate is a number on a slide. A student-level probability is a name on a call list, with a quantified risk attached.

Uniform treatment becomes targeted intervention. Every student getting the same outreach is the same as no student getting outreach. Prediction lets advising spend the half-hour on the students whose half-hour matters.

Why now

Three things have converged that did not use to be true at the same time.

The data is finally clean enough. After a decade of warehouse and SIS modernization, most institutions have multi-year, student-term-level data that can support a real model. Six years of records at 40,000 students per year produce roughly 240,000 student-level rows, which expand to nearly 1.84 million student-term rows once you bring in term-by-term features. That is well past the volume at which standard machine learning becomes workable.

The methods are mature and free. The model families that perform best on this kind of institutional data (gradient-boosted trees, with SHAP for per-prediction explanation) are open-source, well-documented, and have stable APIs. There is no proprietary stack required to do this well. The cost and vendor reality post breaks down the math, but the headline is: a working retention model can be trained and served for under $100 a month.

The audience is finally bought in. Five years ago, taking a per-student risk score into a deans' meeting was a fight. Today the question in those rooms is "why don't we have one yet?" The Georgia State case has done most of the persuasion work for the sector.

The economic case

The financial argument for predictive student-success work is now well established, and Georgia State University remains the most-cited single example. The numbers are striking enough that they have become the de facto baseline every other institution is benchmarked against.

Metric	Georgia State result
Risk factors tracked per student, daily	800+
Students monitored	40,000+
Targeted interventions per year	~90,000
Increase in four-year graduation rate	+7 percentage points
Group with the largest gains	Underserved students
Estimated revenue per 1-point retention gain	~$3.18M per year

The revenue line is the one provosts and CFOs tend to remember. At a typical mid-sized institution, one percentage point of retention is worth seven figures in annual recurring tuition and aid revenue, and that number compounds because retained students keep enrolling.
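The arithmetic behind a "one point of retention" figure can be sketched in a few lines. This is an illustration under stated assumptions, not the calculation behind the $3.18M estimate: the function name, the 10,000-student enrollment, and the $12,000 net revenue per student are all hypothetical inputs chosen only to show the shape of the math.

```python
def revenue_per_retention_point(enrolled_students: int,
                                net_revenue_per_student: float) -> float:
    """Annual revenue tied to one percentage point of retention.

    One point of retention is 1% of the enrolled population; each retained
    student is assumed to bring in the same net revenue per year.
    """
    students_retained = enrolled_students / 100
    return students_retained * net_revenue_per_student


# Hypothetical inputs, for illustration only.
example = revenue_per_retention_point(
    enrolled_students=10_000,
    net_revenue_per_student=12_000.0,
)
print(f"${example:,.0f} per year")  # prints "$1,200,000 per year"
```

Swapping in an institution's own enrollment and net revenue per student gives a first-pass number to take to leadership. The sketch ignores the compounding effect of retained students enrolling in later terms, so it understates the multi-year value.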

The argument is not that every institution needs to spend what Georgia State spent to build its system internally. It is the opposite. The expensive thing Georgia State did was figure out that the approach works. That is now a free fact. The remaining work for any other institution is the operational one: choose the prediction problems that matter most, point a model at them, and route the output to the people who can act on it.

Catalog of predictions IR teams already need

The prediction opportunities below were mapped from the standing IR conference circuit (SCAIR 37th Annual, NCAIR 2026, NEAIR 2025, CAIR 2025). Each one corresponds to a session topic or a recurring institutional need. The point of the table is to show that the demand is not abstract. IR teams are already presenting on these problems. They are just presenting on the past version of them.

Prediction	Data needed	Suitable model family
Term-to-term retention	Attendance, GPA, LMS activity, demographics, aid status, prior DFW history	Gradient boosting (XGBoost) with SHAP
Graduation within 150% time	Cumulative GPA, credits earned, course difficulty, program code, prior-term DFW count	Gradient boosting (XGBoost) with SHAP
Academic-probation outcome	Probation history, intervention attendance, GPA trend, study-center usage	Random Forest, XGBoost, LSTM for GPA trend
Scholarship eligibility loss (HOPE, LIFE, similar)	Credit hours per term, GPA thresholds, gateway-course grades, aid disbursement	Logistic Regression baseline, XGBoost
Transfer and admissions yield	NSC enrollment data, admitted-student records, aid offers, distance, HS GPA	Logistic Regression, XGBoost
Enrollment forecasting	Historical headcount by term, IPEDS migration tables, county HS graduation rates	ARIMA, Prophet, LSTM for complex trends
Program margin and viability	Enrollment per program, cost to run, tuition revenue, graduation numbers	Regression, time-series (ARIMA, Prophet)
Workforce outcomes	Program GPA, internship history, past placement (PSEO)	Random Forest, XGBoost
Re-engagement of stop-outs	Exit reason, credits completed, GPA at exit, time since leaving	Logistic Regression, XGBoost
CTE Perkins V threshold attainment	CTE enrollment, completion, licensure, placement, earnings, federal benchmarks	Logistic Regression per threshold, Random Forest, multi-output classification

Two patterns are worth noticing. First, student-level outcomes (retention, graduation, probation, scholarship loss, yield) cluster around gradient-boosted trees on tabular institutional data. That is one tool, applied to most of the questions. Second, institution-level forecasting (enrollment, program viability) is a different tool family entirely (time-series), and policy-impact questions push you toward econometric designs. The enrollment and program viability post walks through that second cluster.

One model does not fit all of these. The value is in matching the method to the question and not over-promising in either direction.

What "actionable" actually looks like

The single biggest difference between a cohort report and a prediction system is not the math. It is what an advisor reads on Monday morning.

A cohort report tells an advisor: "Your caseload's retention rate is 81%." A prediction system tells the same advisor: "These 14 students on your caseload have a higher than 70% modeled probability of not returning next term. For each one, here are the two or three specific factors pushing the score up."
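As a minimal sketch of that difference, the snippet below turns a set of per-student scores into one advisor's call list. The `StudentScore` record, the `call_list` helper, the advisor names, and the scores themselves are hypothetical; only the 70% threshold comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class StudentScore:
    student_id: str
    advisor: str
    p_not_returning: float  # modeled probability of not returning next term


def call_list(scores, advisor, threshold=0.70):
    """Students on one advisor's caseload above the threshold, highest risk first."""
    flagged = [
        s for s in scores
        if s.advisor == advisor and s.p_not_returning > threshold
    ]
    return sorted(flagged, key=lambda s: s.p_not_returning, reverse=True)


# Hypothetical scores, for illustration only.
scores = [
    StudentScore("S001", "advisor_a", 0.85),
    StudentScore("S002", "advisor_a", 0.40),
    StudentScore("S003", "advisor_a", 0.72),
    StudentScore("S004", "advisor_b", 0.91),
]

monday_list = call_list(scores, "advisor_a")
for s in monday_list:
    print(f"{s.student_id}: {s.p_not_returning:.0%}")
```

The cohort report would collapse these rows into one percentage. The call list keeps the rows and sorts them, which is the whole change in unit of analysis.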

A real, model-derived statement looks like this:

“85% probability of not returning, driven by a DFW pattern in semester 2 and a credit-load drop in semester 3.”

Three things make that sentence different from anything a cohort report can produce. It is per-student. It carries a quantified probability (85%, not "high risk"). And it names the specific drivers behind the score, which is what tells the advisor what to actually do about it. The mechanism that produces that explanation is called SHAP, and the XGBoost vs LLM post explains it in plain English.
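To make the last step concrete, here is a small sketch of how per-feature contributions could be turned into a sentence like the one above. It does not compute SHAP values; it assumes they already exist as a mapping from driver description to contribution. The `describe_prediction` helper and every number in `contributions` are hypothetical.

```python
def describe_prediction(probability, contributions, top_n=2):
    """Build an advisor-facing sentence from a score and its drivers.

    `contributions` maps a human-readable driver to its contribution to the
    score. Positive values push the risk up, negative values pull it down.
    """
    # Keep only the drivers that raise the risk, largest first.
    raising = {name: value for name, value in contributions.items() if value > 0}
    top = sorted(raising, key=raising.get, reverse=True)[:top_n]
    return (
        f"{probability:.0%} probability of not returning, driven by "
        + " and ".join(top)
        + "."
    )


# Hypothetical per-student contributions, for illustration only.
contributions = {
    "a DFW pattern in semester 2": 1.10,
    "a credit-load drop in semester 3": 0.75,
    "LMS activity": -0.20,
    "aid status": 0.05,
}

statement = describe_prediction(0.85, contributions)
print(statement)
```

Filtering to positive contributions is a design choice: the advisor needs to know what is pushing the score up, not every feature the model considered.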

That is the practical bar for "actionable." If the output cannot survive the test of "what does the advisor do at 9am on Monday," it is still reporting, regardless of how sophisticated the modeling underneath is.

How to get started without buying a platform

The most common mistake institutions make at this stage is treating predictive analytics as a procurement problem. It is not. It is a question-selection problem. Five practical steps:

Pick one question, not five. Retention is almost always the right place to start because the data is densest, the decision window is shortest, and the financial argument is the easiest to make to leadership. Resist the urge to scope a portfolio of models on day one.

Audit what data already exists. Most institutions discover they have everything they need sitting across the SIS, the LMS, and the financial-aid system. The integration is the work. The data is rarely missing.

Set the evaluation bar before you build. Pick the metric that matches the cost of being wrong. For retention, that is usually recall, the share of real dropouts the model catches, because missing a dropout is more expensive than a false alarm. Write the target down before you train anything.
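Recall is simple enough to compute by hand, which helps when explaining the target to a non-technical audience. The sketch below uses made-up labels for ten students (1 means the student did not return); the `recall` helper and the data are illustrative only.

```python
def recall(actual, predicted):
    """Share of real dropouts that the model flagged."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if true_positives + false_negatives == 0:
        raise ValueError("recall is undefined when there are no real dropouts")
    return true_positives / (true_positives + false_negatives)


# Hypothetical outcomes for ten students: 1 = did not return, 0 = returned.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

score = recall(actual, predicted)
print(f"recall = {score:.2f}")  # prints "recall = 0.75"
```

Here the model catches three of the four real dropouts and raises two false alarms. Under a recall target the missed dropout is the error that counts against the model, while the two false alarms cost some advisor time but do not lower the score.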

Plan for the explanation layer, not just the prediction. A score without a driver is useless to the advisor receiving it. Whatever path you take, make sure SHAP-style per-prediction explanations are part of the system.

Decide who owns the output. The model is the easy part. The harder question is which office (advising, retention, the dean's office) receives the weekly list, what they are expected to do with it, and how the loop closes. Without that, the model becomes another dashboard.

The time for all of this has to come from somewhere, and the most realistic source is the ad-hoc request load itself. Our whitepaper found IR teams spend 550 to 5,333 hours per year on ad-hoc request management; reclaiming 40-60% of that frees 28 to 41 working days per year for a small team, which is enough to ship a first model.
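The working-day figure can be checked with a quick calculation. The sketch below assumes an 8-hour working day, which is not stated in the source. Under that assumption, the 28 to 41 day range lines up with the low end of the reported hours (550 per year), which fits the "small team" framing.

```python
HOURS_PER_WORKING_DAY = 8  # assumption, not stated in the source


def days_reclaimed(adhoc_hours_per_year, percent_reclaimed):
    """Working days freed by reclaiming a percentage of ad-hoc request hours."""
    hours_freed = adhoc_hours_per_year * percent_reclaimed / 100
    return hours_freed / HOURS_PER_WORKING_DAY


low = days_reclaimed(550, 40)   # 27.5 days, about 28
high = days_reclaimed(550, 60)  # 41.25 days, about 41
print(f"{low} to {high} working days per year")
```

A team nearer the top of the range would free proportionally more time, so the small-team figure is the conservative case.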

For most mid-market institutions, the right early sequence is: one model, one question, one audience, one closed loop. Then a second.

Where Clema fits in

Clema builds AI tools for IR and IE teams that sit on top of the data the institution already owns. The predictive work we do follows the pattern above: per-student probabilities, SHAP-backed explanations, and a clear owner for the output. We deliberately keep partner data on a partner-only model rather than pooling it into a shared cross-school model, which is the part of the incumbent platform pitch most institutions are quietly uncomfortable with.

If you are at the stage of "we know we should be doing this, we just don't know where to start," that is the conversation we tend to have most often. The right answer is usually smaller than expected.


Sources

Georgia State University, Student Success Programs and outcomes.
Education Advisory Board, Navigate360 platform.
Civitas Learning, student success platform.
SCAIR, Southern Association for Institutional Research.
NCAIR, North Carolina Association for Institutional Research.
NEAIR, North East Association for Institutional Research.
CAIR, California Association for Institutional Research.
