Why gradient boosting beats LLMs for student retention prediction

A general-purpose chatbot cannot tell you which students will not return next term. Here is why, and what does.

June 5, 2026


Introduction

If you have sat through a vendor pitch in the last twelve months, you have probably heard some version of "we use AI to predict student outcomes," with "AI" doing a lot of unspecified work in that sentence. Sometimes it means a large language model wrapped around a dashboard. Sometimes it means a classical machine-learning model that does the actual prediction, with an LLM bolted on for the summary screen. The two are not equivalent, and the difference matters when the output is going to shape advising caseloads or program decisions.

This post is the short version of the argument for IR leadership: on structured institutional data, gradient-boosted trees (XGBoost, specifically) outperform LLMs by a clear margin, the benchmark literature cited below points consistently in that direction, and the reasons are practical, not ideological. None of this means LLMs have no role. It means they have a different one.

The temptation to throw an LLM at it

The pull toward LLMs for retention prediction is understandable. They are general-purpose, they produce fluent output, and they require less obvious infrastructure than a trained model. Hand the chatbot a CSV of student records, ask "who is most likely to drop out next term," and get an answer in seconds. It feels efficient. The pull is strongest where capacity is thinnest: our 50-interview study found that 50% of institutions run IR teams of 1 to 3 people, and a team that size rarely has a spare data scientist, so the chatbot shortcut looks like the only feasible path.

The output is also confident, which is part of the problem. An LLM will return a graduation rate for nine students with no warning that nine students cannot support a graduation rate. It will assign a "high risk" label without telling you how it got there. Unless sampling is pinned down, it can produce a different answer the second time you ask, and a third one the third time. In a procurement demo, this looks like sophistication. In production, it is the worst possible foundation for decisions that affect students.

The retention question is, at its core, a numbers problem over roughly 20 structured features. It is not a language problem. Asking a language model to solve it is using the wrong tool for the right reason.

Three things an LLM cannot do here

There are exactly three structural limitations that disqualify a general-purpose LLM as the predictor in this kind of system. Each one is fatal on its own.

Cannot control for confounders

Statistical isolation

Strong students tend to pick harder course combinations, so those combinations look like they cause higher graduation. A trained model can estimate the contribution of the combination after holding GPA, attendance, and credits constant. An LLM reading the same table has no mechanism for that adjustment, so any policy built on its output rests on a confound.
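To make the confound concrete, here is a minimal sketch using invented counts (not institutional data). It compares the naive graduation gap between students who took a combination and those who did not with the same gap computed inside each GPA band. The band names and every number are assumptions chosen to show the effect.

```python
from collections import defaultdict

# Hypothetical counts, invented for illustration only:
# (gpa_band, took_combo) -> (students, graduates)
COUNTS = {
    ("high", True): (80, 72),
    ("high", False): (20, 18),
    ("low", True): (20, 10),
    ("low", False): (80, 40),
}


def grad_rate(cells):
    """Pooled graduation rate over a list of (students, graduates) cells."""
    students = sum(n for n, _ in cells)
    graduates = sum(g for _, g in cells)
    return graduates / students


def naive_gap(counts):
    """Takers minus non-takers, ignoring GPA entirely."""
    takers = [cell for (_, took), cell in counts.items() if took]
    others = [cell for (_, took), cell in counts.items() if not took]
    return grad_rate(takers) - grad_rate(others)


def stratified_gaps(counts):
    """The same difference, computed separately inside each GPA band."""
    by_band = defaultdict(dict)
    for (band, took), cell in counts.items():
        by_band[band][took] = cell
    return {
        band: grad_rate([cells[True]]) - grad_rate([cells[False]])
        for band, cells in by_band.items()
    }


raw = naive_gap(COUNTS)
within = stratified_gaps(COUNTS)
print(f"naive gap: {raw:+.2f}")
for band, gap in within.items():
    print(f"gap within {band}-GPA band: {gap:+.2f}")
```

In this toy table the combination appears to add 24 points overall, yet the gap is zero inside each GPA band: the whole apparent effect comes from who chooses the combination. A trained model with GPA as a feature performs a more flexible version of the same adjustment.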

Cannot flag where its answer breaks down

Confidence quantification

Some course combinations appear only 8 to 10 times in the data. The right answer there is "we do not have enough evidence." A trained model can mark these as low-confidence. An LLM will return a confident graduation rate regardless, inviting a policy decision based on nine students.
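One common way to express "not enough evidence" is a confidence interval plus a minimum-support rule. The sketch below uses the Wilson score interval. The counts and the threshold of 30 students are assumptions for illustration, not values from any real model.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


MIN_SUPPORT = 30  # assumed threshold; a real system would choose this deliberately


def describe(label, graduates, students):
    """One-line summary with the interval and a low-confidence flag."""
    low, high = wilson_interval(graduates, students)
    flag = "LOW CONFIDENCE" if students < MIN_SUPPORT else "ok"
    rate = graduates / students
    return (f"{label}: {graduates}/{students} = {rate:.0%}, "
            f"95% interval {low:.0%} to {high:.0%} [{flag}]")


# Hypothetical counts for two course combinations with the same observed rate.
rare = wilson_interval(7, 9)
common = wilson_interval(700, 900)
print(describe("rare combo", 7, 9))
print(describe("common combo", 700, 900))
```

Both combinations show the same 78% observed rate, but the interval for nine students spans nearly half the probability scale, while the interval for 900 students is only a few points wide. That width is the information an unqualified graduation rate leaves out.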

Cannot answer follow-up policy questions from the data

Counterfactual analysis

"What if we made MATH301 a prerequisite?" or "Which combination works for students below 2.8 GPA?" are answered by filtering the data and re-running the model. An LLM generates plausible prose, none of it computed from the institution's records. Plausible is not the same as correct.
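The filtering half of such a follow-up question is ordinary data work. A minimal sketch, assuming a handful of invented records (MATH210 is a hypothetical course code added for contrast), shows the "below 2.8 GPA" question answered by computation, with the sample size reported next to every rate:

```python
from collections import Counter

# Hypothetical records, invented for illustration: (gpa, combination, graduated)
RECORDS = [
    (2.5, "MATH301+CS201", True),
    (2.7, "MATH301+CS201", True),
    (2.6, "MATH301+CS201", False),
    (2.4, "MATH210+CS201", False),
    (2.7, "MATH210+CS201", True),
    (2.2, "MATH210+CS201", False),
    (3.4, "MATH301+CS201", True),
    (3.6, "MATH210+CS201", True),
]


def rates_by_combo(records, max_gpa=None):
    """Graduation rate and support per combination, optionally for GPA below max_gpa."""
    totals, grads = Counter(), Counter()
    for gpa, combo, graduated in records:
        if max_gpa is not None and gpa >= max_gpa:
            continue  # outside the subgroup the question asks about
        totals[combo] += 1
        grads[combo] += int(graduated)
    return {combo: (grads[combo] / totals[combo], totals[combo]) for combo in totals}


below = rates_by_combo(RECORDS, max_gpa=2.8)
for combo, (rate, n) in below.items():
    print(f"{combo}: {rate:.0%} of {n} students below 2.8 GPA")
```

In a real pipeline the filtered subset would then be scored or refit with the trained model rather than summarized with raw rates, and with three students per group the honest answer here is still "not enough evidence."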

None of this is hypothetical. Each limitation has been demonstrated repeatedly in peer-reviewed benchmarks on exactly this kind of structured prediction problem.

What the benchmark literature actually says

The argument that LLMs win on structured tabular prediction has been made in print. Reading the actual studies, including the ones that argue the LLM case, produces a consistent pattern: classical machine learning wins on structured data with adequate volume, and LLMs only catch up in the tiny-data regime (a few dozen training rows at most) where there is essentially nothing else to fit.

Three results are worth knowing.

Study	Result	Why it is relevant
arXiv 2411.06469 (ICU mortality, MIMIC-III, ~46K admissions)	Classical ML models scored F1 between 59 and 65. All LLMs, including medical-domain models, scored between 20 and 43. The top LLM result was 43.0.	Binary outcome on structured EHR data. Structurally the same shape as graduate vs not-graduate.
JAMIA 32(5):811, 2025 (24-hour discharge / ICU transfer)	Gradient boosting AUROC 0.847 to 0.894. GPT-4 zero-shot AUROC 0.602 to 0.629. GPT-3.5 near random. Few-shot RAG did not meaningfully help.	Concludes that locally trained ML beats non-fine-tuned LLMs and that the model learns institution-specific patterns the LLM cannot.
arXiv 2406.12031 (TABULA-8B vs XGBoost)	LLM beats XGBoost only at 0 to 32 training rows. On numeric-only data it only matches. Commercial LLMs are worse than baselines beyond 3-shot.	Tested at 128 rows maximum. A real institutional dataset is ~1.5M rows. Far outside the LLM-favorable regime.

“On structured, numeric-heavy institutional data with hundreds of thousands of rows, gradient-boosted trees are the right tool. LLMs are complementary, not a replacement.”

The pro-LLM papers (UniPredict, TP-BERTa, GTL, the Amazon survey) reinforce the same boundary condition: LLM gains appear only in low-data, heavily categorical, or untuned-baseline settings. On numeric institutional data with enough rows, XGBoost wins or matches in every comparison cited here. An institution's six years of student-term records is not the LLM-favorable regime. It is the XGBoost-favorable regime by a wide margin.

Why XGBoost specifically

Saying "use gradient-boosted trees" still leaves a choice. The tree-based family includes Random Forest, the original GBM, AdaBoost, XGBoost, LightGBM, CatBoost, and NGBoost. They are not interchangeable, and the right pick depends on the dataset's size, feature mix, and operational constraints.

For a dataset on the order of 240,000 student-level rows expanding to roughly 1.84M student-term rows (six years at 40,000 students), with 19 to 20 features and a mix of numeric, binary, and categorical inputs plus class imbalance and an explainability requirement, the operating ranges look like this.

Model	Works to (rows)	Why	Struggles when
Logistic Regression	1K to 1M+	Linear, very scalable, the right baseline	Nonlinear patterns and feature interactions matter
Random Forest	1K to 200K	Handles interactions natively	Wide data and rare-class balancing needed
Gradient Boosting (older GBM)	1K to 50K	Reliable on small-to-medium data	Beyond medium scale
XGBoost	1K to 5M	Regularized, fast, mature tooling, excellent SHAP support	Needs tuning time to reach best results
LightGBM	5K to 20M	Leaf-wise growth, very fast at large scale	Small-data overfit
CatBoost	2K to 5M	Handles categoricals natively without manual encoding	Slower to train than XGBoost

The selection logic at this scale:

Logistic Regression is the right baseline but learns essentially one additive equation. With this much data, it cannot capture the higher-order interactions (e.g. attendance only helps if grades are adequate) that actually drive the outcome.

Random Forest handles interactions but is memory-heavy and harder to balance for the rare not-retained class on a wide course matrix.

CatBoost handles categoricals natively but trains slower at this scale than XGBoost with explicit encoding.

LightGBM is the planned successor once the dataset grows past roughly 6 to 7 million rows, where its leaf-wise growth becomes the speed advantage. For now, it is overkill.

XGBoost is the chosen model: strong on tabular data, robust regularization, built-in handling of missing values, a class-weighting parameter for imbalance, mature tooling, and excellent SHAP support. It is the safest high-accuracy choice at this scale.
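As a rough illustration of what that choice looks like in configuration, here is a sketch of a parameter dictionary of the kind that could be passed to `xgboost.XGBClassifier(**params)`. The parameter names are standard XGBoost options. The specific values, and the 87/13 class split used for the weight, are assumptions for illustration rather than a tuned configuration.

```python
def scale_pos_weight(n_negative, n_positive):
    """Negative-to-positive ratio, a common starting value for imbalanced data."""
    return n_negative / n_positive


# Assumed class split; the positive class is "not retained".
n_retained, n_not_retained = 87_000, 13_000

params = {
    "objective": "binary:logistic",   # output is a probability-like score
    "eval_metric": "aucpr",           # precision-recall AUC suits a rare positive class
    "max_depth": 6,                   # illustrative value, not tuned
    "learning_rate": 0.05,            # illustrative value, not tuned
    "subsample": 0.8,                 # row sampling per tree
    "colsample_bytree": 0.8,          # feature sampling per tree
    "reg_lambda": 1.0,                # L2 regularization on leaf weights
    "scale_pos_weight": scale_pos_weight(n_retained, n_not_retained),
}

# Missing values need no parameter here: XGBoost treats NaN as missing by
# default and learns a default split direction for it.
print(f"scale_pos_weight = {params['scale_pos_weight']:.2f}")
```

Up-weighting the rare class usually raises recall at the cost of distorting the raw scores, so a pipeline that reports probabilities to advisors would typically add a calibration step afterwards.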

Reading the risk score in plain English

A trained model returns a probability between 0 and 1 for each student. A 0.85 means an 85% modeled probability of the outcome, based on patterns from similar students whose outcomes are known, and it can be read that literally only if the model's scores are calibrated against observed outcomes. The most common piece of confusion when a risk score reaches leadership is what "85%" actually claims.

"Precise" does not mean "certain." It means the uncertainty has been quantified and is attributable to specific factors, rather than guessed. That distinction is the entire point.

When a model is reported as having about 86.6% accuracy, ~91% recall, and ~0.91 AUC, here is what those numbers actually say.

Metric	What it measures	What it means here
Accuracy	Share of all predictions that are correct	About 87 of every 100 students are classified correctly
Recall	Of students who actually dropped out, how many were caught	About 9 of every 10 real dropouts are flagged. The metric to optimize, because missing a dropout is worse than a false alarm.
Precision	Of students flagged at risk, how many actually were	Of 100 flagged, around 93 genuinely at risk in the reported figures
F1	Balance of precision and recall	Catching real dropouts without flooding advisors with false alarms
AUC-ROC	Ability to rank a random dropout above a random non-dropout	About 0.91. Picks the higher-risk student about 91 times out of 100.

Two things to note. First, accuracy alone is misleading when 87% of students are retained. A model that predicts "retained" for everyone would score 87% accuracy and catch zero real dropouts. The 86.6% accuracy quoted above is in fact marginally below that do-nothing baseline, which shows how little accuracy says about usefulness here. That is why recall, not accuracy, is the operating metric for this problem. The retention deep-dive post covers why and what to do about class imbalance.
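The baseline argument is easy to verify by hand. The sketch below computes the table's metrics from confusion-matrix counts for 1,000 hypothetical students at an 87% retention rate, treating "dropout" as the positive class. The second model's counts are invented purely for contrast.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Predict "retained" for everyone: all 130 dropouts are missed.
baseline = metrics(tp=0, fp=0, fn=130, tn=870)

# Hypothetical model: catches 104 of 130 dropouts, with 60 false alarms.
model = metrics(tp=104, fp=60, fn=26, tn=810)

for name, m in (("always-retained baseline", baseline), ("hypothetical model", model)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The baseline reaches 87% accuracy with zero recall, while the hypothetical model gains only a few points of accuracy but catches 80% of dropouts. When reading any reported figures, it is also worth checking which class is treated as positive, because precision and recall for "retained" look very different from precision and recall for "not retained" on an 87/13 split.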

Second, AUC of 0.91 is genuinely strong on this kind of data. The reason this is achievable is not magic. It is that there is enough signal in six years of attendance, grades, prior-term performance, and course difficulty for the model to rank students reliably.
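The ranking interpretation of AUC can be computed directly by counting pairs. This is a minimal sketch on invented risk scores. Production code would normally use a library routine such as scikit-learn's `roc_auc_score`, which gives the same quantity.

```python
def pairwise_auc(scores_pos, scores_neg):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the usual AUC-ROC definition.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical risk scores, invented for illustration.
dropout_scores = [0.9, 0.8, 0.4]
retained_scores = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(dropout_scores, retained_scores)
print(f"AUC = {auc:.3f}")
```

Of the 12 possible dropout/retained pairs here, the dropout is ranked higher in 11, so the AUC is 11/12. An AUC of 0.91 on real data is the same statement at population scale.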

SHAP: turning a number into an explanation

A bare 67.3% graduation probability is not useful to an advisor. The advisor needs to know what is dragging the number down or pushing it up, because that is where the conversation with the student starts.

SHAP (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions. The score is read as a base rate plus or minus the effect of each factor that the model considered, with each effect shown here in percentage points.

Base rate at the institution

Starting point

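The model starts at the institution-wide average graduation probability for similar students. Call it the baseline: 58.0% in this example, which is the value implied by the final prediction and the four contributions below.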

MATH301 + CS201 combination

+14.2%

This specific course combination has historically been associated with higher completion among similar students. The model adds the effect.

GPA 2.9

+6.1%

Slightly above the cohort-relative average for this program. Small positive contribution.

Attendance 74%

-8.4%

The single biggest drag. Attendance below the institutional median is the strongest negative signal in this profile.

First-generation status

-2.6%

A modest negative contribution. The model has learned that first-gen students at this institution complete at a slightly lower rate, after controlling for everything else.

Final prediction

67.3%

The base rate plus the four contributions. The advisor now knows that attendance is the largest drag on this student's score and the most obvious lever, while GPA and course choice are already working in the student's favor.
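The walkthrough above is plain addition, which a few lines can confirm. The sketch recovers the base rate implied by the four contributions and the final score, then picks out the largest negative driver. It assumes the contributions are expressed in percentage points of probability, as the walkthrough presents them.

```python
# Contributions from the walkthrough, in percentage points.
contributions = {
    "MATH301 + CS201 combination": 14.2,
    "GPA 2.9": 6.1,
    "Attendance 74%": -8.4,
    "First-generation status": -2.6,
}
final_prediction = 67.3

# Additivity: final = base + sum(contributions), so the base rate is implied.
base_rate = round(final_prediction - sum(contributions.values()), 1)

biggest_drag = min(contributions, key=contributions.get)
biggest_overall = max(contributions, key=lambda name: abs(contributions[name]))

print(f"implied base rate: {base_rate}%")
print(f"largest negative driver: {biggest_drag}")
print(f"largest driver by magnitude: {biggest_overall}")
```

One caveat for anyone reproducing this with the `shap` library: for a tree model with a logistic objective, the default tree explainer attributes the raw log-odds output, and the additivity holds in that space. Presenting contributions as probability points, as here, requires explaining the probability output explicitly or converting afterwards.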

The same machinery, run across the whole population, surfaces which factors dominate across the board. On a well-built retention model, academic-progress features (cumulative credits earned, cumulative percent credit completion, prior-term GPA) dominate. Demographic features (gender, race/ethnicity) typically sit near zero, which is what you want: the model is relying on performance and progress, not on demographic shortcuts.
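A common way to get that population-level view is the mean absolute SHAP value per feature. The sketch below applies it to three invented rows of per-student contributions. The feature names and values are assumptions, chosen only to mirror the pattern described above.

```python
# Hypothetical per-student SHAP contributions (percentage points), invented
# for illustration. One dict per student, one entry per feature.
SHAP_ROWS = [
    {"credits_earned": 9.1, "prior_term_gpa": 4.0, "attendance": -8.4, "gender": 0.2},
    {"credits_earned": -12.3, "prior_term_gpa": -5.5, "attendance": 3.0, "gender": -0.1},
    {"credits_earned": 7.6, "prior_term_gpa": 6.1, "attendance": -2.0, "gender": 0.3},
]


def mean_abs_contribution(rows):
    """Average magnitude of each feature's contribution across students."""
    features = rows[0].keys()
    return {f: sum(abs(row[f]) for row in rows) / len(rows) for f in features}


importance = mean_abs_contribution(SHAP_ROWS)
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
for feature, value in ranked:
    print(f"{feature}: {value:.2f}")
```

Taking the absolute value matters: a feature that pushes some students up and others down would average out to nearly zero otherwise, even though it drives many individual predictions.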

SHAP is what makes the prediction defensible in a committee meeting. Without it, the score is a black box. With it, every prediction is a conversation an advisor or dean can have.

LLMs in a supporting role

None of this excludes LLMs from the system. It defines what they should and should not do.

LLMs are excellent at two things in this stack. First, enriching course features. Pulling course descriptions and catalog metadata to tag courses by workload, prerequisite difficulty, and topic cluster is a language problem, and that is where LLMs shine. Those tags then become inputs to the gradient-boosted model.

Second, translating SHAP output into committee-ready language. Turning "feature X contributed -8.4% to student Y's prediction" into a paragraph an academic affairs committee can read and act on is exactly the kind of summarization LLMs are good at.
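In practice this hand-off can be as simple as formatting the model's numbers into a prompt and instructing the LLM not to add any of its own. The sketch below only builds the prompt text. The function name, the wording, and the student identifier are hypothetical, and no particular LLM API is assumed.

```python
def build_summary_prompt(student_id, base_pct, final_pct, contributions):
    """Format model output as a prompt; every number comes from the model."""
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    driver_lines = [
        f"- {name}: {value:+.1f} percentage points" for name, value in ordered
    ]
    return "\n".join([
        "Summarize this graduation prediction for an academic affairs committee.",
        "Use only the figures listed below and do not introduce new numbers.",
        f"Student: {student_id}",
        f"Baseline: {base_pct:.1f}%",
        f"Predicted graduation probability: {final_pct:.1f}%",
        "Drivers, largest first:",
        *driver_lines,
    ])


prompt = build_summary_prompt(
    student_id="S-001",  # hypothetical identifier
    base_pct=58.0,
    final_pct=67.3,
    contributions={
        "MATH301 + CS201 combination": 14.2,
        "GPA 2.9": 6.1,
        "Attendance 74%": -8.4,
        "First-generation status": -2.6,
    },
)
print(prompt)
```

Keeping the figures in the prompt and out of the LLM's discretion preserves the division of labor: the model computes, the LLM phrases.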

The division of labor is straightforward: classical machine learning does the prediction and the confidence estimation. The LLM does the language work around it. Mixing those up is how predictions stop being trustworthy.

Where Clema fits in

Clema's predictive models follow exactly this pattern. Gradient-boosted trees do the prediction, SHAP produces the per-student explanation, and we use LLMs only for the supporting tasks they actually do well (feature enrichment and natural-language summaries of the model output).

For the broader strategic case (catalog of predictions, economic argument, where to start), the reactive-to-proactive playbook is the companion to this post. For the cost question (open-source stack vs incumbent platforms at $30K to $200K per year), the cost and vendor reality post covers the math.


Sources

XGBoost documentation, Distributed (Deep) Machine Learning Community.

SHAP documentation, A unified approach to explaining model output.

arXiv 2411.06469, LLMs vs ML on clinical prediction.

JAMIA 32(5):811, comparing LLM and ML for early discharge prediction (2025).

arXiv 2406.12031, TABULA-8B tabular transfer learning.

scikit-learn, machine learning in Python.


Written by

Clema Research Team

The Clema research team publishes original analysis and practical guides for institutional research and institutional effectiveness professionals.

Frequently asked questions

Why can't we just hand an LLM a CSV of student records and ask who will drop out?

An LLM has three structural limits that disqualify it as the predictor. It cannot control for confounders, so it confuses correlation with cause. It cannot flag where its answer breaks down, so it returns a confident graduation rate for nine students. And it cannot answer follow-up policy questions from the data, generating plausible prose instead of figures computed from your records. The retention question is a numbers problem over 18 to 20 features, not a language problem.

Does the benchmark literature actually favor classical ML over LLMs on tabular data?

Yes, consistently. On ICU mortality data, classical ML scored F1 of 59 to 65 while the best LLM reached 43. On discharge prediction, gradient boosting hit AUROC 0.847 to 0.894 versus GPT-4 at 0.602 to 0.629. LLMs only catch up at a few dozen training rows at most. An institution's six years of student-term records, around 1.5 million rows, is firmly in the XGBoost-favorable regime.

Why XGBoost specifically rather than another gradient-boosting model?

At roughly 1.84 million student-term rows with 19 to 20 mixed features, class imbalance, and an explainability requirement, XGBoost is the safest high-accuracy choice. It is regularized, fast, handles missing values and class imbalance natively, has mature tooling, and offers excellent SHAP support. LightGBM is the planned successor once the dataset passes roughly 6 to 7 million rows; CatBoost trains slower at this scale.

What does an 85% risk score actually claim?

It means an 85% modeled probability of the outcome, based on patterns from similar students whose outcomes are known. Precise does not mean certain: it means the uncertainty has been quantified and attributed to specific factors rather than guessed. Accuracy alone is misleading when 87% of students are retained, which is why recall, the share of real dropouts caught, is the operating metric.

What role do LLMs play if they are not doing the prediction?

Two supporting roles they genuinely do well. First, enriching course features by pulling catalog metadata to tag courses by workload, prerequisite difficulty, and topic cluster, which then feed the gradient-boosted model. Second, translating SHAP output into committee-ready language. Classical ML does the prediction and confidence estimation; the LLM does the language work around it.

