Introduction
If you have sat through a vendor pitch in the last twelve months, you have probably heard some version of "we use AI to predict student outcomes," with "AI" doing a lot of unspecified work in that sentence. Sometimes it means a large language model wrapped around a dashboard. Sometimes it means a classical machine-learning model that does the actual prediction, with an LLM bolted on for the summary screen. The two are not equivalent, and the difference matters when the output is going to shape advising caseloads or program decisions.
This post is the short version of the argument for IR leadership: on structured institutional data, gradient-boosted trees (XGBoost, specifically) outperform LLMs by a clear margin, the benchmark literature reviewed below is consistent about it, and the reasons are practical, not ideological. None of this means LLMs have no role. It means they have a different one.
The temptation to throw an LLM at it
The pull toward LLMs for retention prediction is understandable. They are general-purpose, they produce fluent output, and they require less obvious infrastructure than a trained model. Hand the chatbot a CSV of student records, ask "who is most likely to drop out next term," and get an answer in seconds. It feels efficient.
The output is also confident, which is part of the problem. An LLM will return a graduation rate for nine students with no warning that nine students cannot support a graduation rate. It will assign a "high risk" label without telling you how it got there. It will produce a different answer the second time you ask, and a third one the third time. In a procurement demo, this looks like sophistication. In production, it is the worst possible foundation for decisions that affect students.
The retention question is, at its core, a numbers problem over 18 to 20 structured features. It is not a language problem. Asking a language model to solve it is using the wrong tool for the right reason.
Three things an LLM cannot do here
There are exactly three structural limitations that disqualify a general-purpose LLM as the predictor in this kind of system. Each one is fatal on its own.
Cannot control for confounders
Statistical isolation
Strong students tend to pick harder course combinations, so those combinations look like they cause higher graduation. A trained model can isolate the pure contribution of the combination after holding GPA, attendance, and credits constant. An LLM cannot separate cause from correlation, so any policy built on its output rests on a confound.
Cannot flag where its answer breaks down
Confidence quantification
Some course combinations appear only 8 to 10 times in the data. The right answer there is "we do not have enough evidence." A trained model can mark these as low-confidence. An LLM will return a confident graduation rate regardless, inviting a policy decision based on nine students.
Cannot answer follow-up policy questions from the data
Counterfactual analysis
"What if we made MATH301 a prerequisite?" or "Which combination works for students below 2.8 GPA?" are answered by filtering the data and re-running the model. An LLM generates plausible prose, none of it computed from the institution's records. Plausible is not the same as correct.
None of this is hypothetical. Each limitation has been demonstrated repeatedly in peer-reviewed benchmarks on exactly this kind of structured prediction problem.
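The "nine students" problem can be made concrete with a standard confidence interval. The sketch below uses the Wilson score interval, a common textbook method, as an assumed stand-in for whatever confidence flagging a production model would use. The counts (7 of 9 versus 700 of 900) and the 20-point width threshold are illustrative, not figures from any institution.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def describe(successes: int, n: int, max_width: float = 0.20) -> str:
    """Report a rate with its plausible range, flagging ranges too wide to act on."""
    low, high = wilson_interval(successes, n)
    text = f"{successes / n:.0%} (n={n}, plausible range {low:.0%} to {high:.0%})"
    if high - low > max_width:  # hypothetical threshold for "not enough evidence"
        text += ": insufficient evidence"
    return text

print(describe(7, 9))      # same observed rate, very wide range
print(describe(700, 900))  # same observed rate, narrow range
```

Both groups show the same observed graduation rate, but only the larger one supports a policy conclusion. That difference is exactly what a bare number hides.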
What the benchmark literature actually says
The argument that LLMs win on structured tabular prediction has been made in print. Reading the actual studies, including the ones that argue the LLM case, produces a consistent pattern: classical machine learning wins on structured data with adequate volume, and LLMs only catch up in the tiny-data regime (roughly 32 to 64 rows) where there is essentially nothing else to fit.
Three results are worth knowing.
| Study | Result | Why it is relevant |
|---|---|---|
| arXiv 2411.06469 (ICU mortality, MIMIC-III, ~46K admissions) | Classical ML models scored F1 between 59 and 65. All LLMs, including medical-domain models, scored between 20 and 43. The top LLM result was 43.0. | Binary outcome on structured EHR data. Structurally the same shape as graduate vs not-graduate. |
| JAMIA 32(5):811, 2025 (24-hour discharge / ICU transfer) | Gradient boosting AUROC 0.847 to 0.894. GPT-4 zero-shot AUROC 0.602 to 0.629. GPT-3.5 near random. Few-shot RAG did not meaningfully help. | Concludes that locally trained ML beats non-fine-tuned LLMs and that the model learns institution-specific patterns the LLM cannot. |
| arXiv 2406.12031 (TABULA-8B vs XGBoost) | LLM beats XGBoost only at 0 to 32 training rows. On numeric-only data it only matches. Commercial LLMs are worse than baselines beyond 3-shot. | Tested at 128 rows maximum. A real institutional dataset is ~1.5M rows. Far outside the LLM-favorable regime. |
“On structured, numeric-heavy institutional data with hundreds of thousands of rows, gradient-boosted trees are the right tool. LLMs are complementary, not a replacement.”
The pro-LLM papers (UniPredict, TP-BERTa, GTL, the Amazon survey) reinforce the same boundary condition: LLM gains appear only in low-data, heavily categorical, or untuned-baseline settings. On numeric institutional data with enough rows, XGBoost wins or matches in the comparisons these papers report. An institution's six years of student-term records is not the LLM-favorable regime. It is the XGBoost-favorable regime by a wide margin.
Why XGBoost specifically
Saying "use gradient-boosted trees" still leaves a choice. The tree-based family includes Random Forest, the original GBM, AdaBoost, XGBoost, LightGBM, CatBoost, and NGBoost. They are not interchangeable, and the right pick depends on the dataset's size, feature mix, and operational constraints.
For a dataset on the order of 240,000 student-level rows expanding to roughly 1.84M student-term rows (six years at 40,000 students), with 19 to 20 features and a mix of numeric, binary, and categorical inputs plus class imbalance and an explainability requirement, the operating ranges look like this.
| Model | Works to (rows) | Why | Struggles when |
|---|---|---|---|
| Logistic Regression | 1K to 1M+ | Linear, very scalable, the right baseline | Nonlinear patterns and feature interactions matter |
| Random Forest | 1K to 200K | Handles interactions natively | Data is wide and the rare class needs balancing |
| Gradient Boosting (older GBM) | 1K to 50K | Reliable on small-to-medium data | Beyond medium scale |
| XGBoost | 1K to 5M | Regularized, fast, mature tooling, excellent SHAP support | Needs tuning time to reach best results |
| LightGBM | 5K to 20M | Leaf-wise growth, very fast at large scale | Small-data overfit |
| CatBoost | 2K to 5M | Handles categoricals natively without manual encoding | Slower to train than XGBoost |
The selection logic at this scale:
Logistic Regression is the right baseline but learns essentially one additive equation. Unless interaction terms are engineered by hand, it cannot capture the higher-order interactions (e.g. attendance only helps if grades are adequate) that actually drive the outcome.
Random Forest handles interactions but is memory-heavy and harder to balance for the rare not-retained class on a wide course matrix.
CatBoost handles categoricals natively but trains slower at this scale than XGBoost with explicit encoding.
LightGBM is the planned successor once the dataset grows past roughly 6 to 7 million rows, where its leaf-wise growth becomes the speed advantage. For now, it is overkill.
XGBoost is the chosen model: strong on tabular data, robust regularization, native handling of missing values, built-in class weighting for imbalance (the scale_pos_weight parameter), mature tooling, and excellent SHAP support. It is the safest high-accuracy choice at this scale.
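As a rough illustration of what "handling class imbalance" looks like in practice, here is a minimal sketch of a starting parameter set. The parameter names are real XGBoost options. The values (depth, learning rate, regularization) are untuned assumptions, the helper name is hypothetical, and the class counts simply mirror the 87% retained figure used elsewhere in this post.

```python
def xgb_params(n_retained: int, n_not_retained: int) -> dict:
    """Hypothetical starting parameters, treating not-retained as the positive class."""
    return {
        "objective": "binary:logistic",  # output is a probability between 0 and 1
        "eval_metric": "aucpr",          # precision-recall AUC suits a rare positive class
        "tree_method": "hist",           # histogram-based splits, fast on large tables
        "max_depth": 6,                  # assumed value, to be tuned
        "eta": 0.1,                      # assumed learning rate, to be tuned
        "lambda": 1.0,                   # L2 regularization strength
        # Common heuristic: weight positives by the negative-to-positive ratio.
        "scale_pos_weight": n_retained / n_not_retained,
    }

params = xgb_params(n_retained=87_000, n_not_retained=13_000)
print(round(params["scale_pos_weight"], 2))  # 6.69
```

A dictionary like this would typically be passed to `xgboost.train` along with the training data. The point is that imbalance is handled with one explicit, auditable setting, not hidden inside the model.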
Reading the risk score in plain English
A trained model returns a probability between 0 and 1 for each student. A 0.85 means an 85% modeled probability of the outcome, based on patterns from similar students whose outcomes are known. The most common piece of confusion when a risk score reaches leadership is what "85%" actually claims.
"Precise" does not mean "certain." It means the uncertainty has been quantified and is attributable to specific factors, rather than guessed. That distinction is the entire point.
When a model is reported as having about 86.6% accuracy, ~91% recall, and ~0.91 AUC, here is what those numbers actually say.
| Metric | What it measures | What it means here |
|---|---|---|
| Accuracy | Share of all predictions that are correct | About 87 of every 100 students are classified correctly |
| Recall | Of students who actually dropped out, how many were caught | About 9 of every 10 real dropouts are flagged. The metric to optimize, because missing a dropout is worse than a false alarm. |
| Precision | Of students flagged at risk, how many actually were | Of 100 flagged, around 93 are genuinely at risk in the reported figures |
| F1 | Balance of precision and recall | Catching real dropouts without flooding advisors with false alarms |
| AUC-ROC | Ability to rank a random dropout above a random non-dropout | About 0.91. Picks the higher-risk student about 91 times out of 100. |
Two things to note. First, accuracy alone is misleading when 87% of students are retained. A model that predicts "retained" for everyone would score 87% accuracy and catch zero real dropouts. That is why recall, not accuracy, is the operating metric for this problem. The retention deep-dive post covers why and what to do about class imbalance.
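The accuracy trap is easy to reproduce. The sketch below builds a toy cohort of 100 students with the 87/13 split mentioned above and scores a "model" that predicts retained for everyone. The function name and the toy data are illustrative assumptions.

```python
def accuracy_and_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (accuracy, recall), where 1 marks a dropout and 0 a retained student."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(y_true), recall

y_true = [0] * 87 + [1] * 13   # 87 retained, 13 dropouts
always_retained = [0] * 100    # predicts "retained" for every student

acc, rec = accuracy_and_recall(y_true, always_retained)
print(acc, rec)  # 0.87 0.0
```

The do-nothing model looks respectable on accuracy and is useless on the one thing the system exists to do.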
Second, AUC of 0.91 is genuinely strong on this kind of data. The reason this is achievable is not magic. It is that there is enough signal in six years of attendance, grades, prior-term performance, and course difficulty for the model to rank students reliably.
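The "picks the higher-risk student" reading of AUC can be computed directly by comparing every dropout against every retained student. This is a minimal sketch on seven made-up risk scores, not the method a library would use at scale (libraries sort instead of enumerating pairs), but it gives the same number.

```python
from itertools import product

def pairwise_auc(scores_dropout: list[float], scores_retained: list[float]) -> float:
    """Share of (dropout, retained) pairs where the dropout scores higher; ties count half."""
    wins = 0.0
    for d, r in product(scores_dropout, scores_retained):
        if d > r:
            wins += 1.0
        elif d == r:
            wins += 0.5
    return wins / (len(scores_dropout) * len(scores_retained))

dropouts = [0.90, 0.80, 0.35]        # illustrative risk scores for actual dropouts
retained = [0.10, 0.20, 0.30, 0.40]  # illustrative risk scores for retained students

print(round(pairwise_auc(dropouts, retained), 2))  # 0.92 (11 of 12 pairs ranked correctly)
```

The one misranked pair is the dropout scored 0.35 against the retained student scored 0.40. AUC counts those mistakes and nothing else, which is why it is unaffected by how many students are retained overall.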
SHAP: turning a number into an explanation
A bare 67.3% graduation probability is not useful to an advisor. The advisor needs to know what is dragging the number down or pushing it up, because that is where the conversation with the student starts.
SHAP (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions. The score is read as a base rate plus or minus the effect of each factor the model considered.
Base rate at the institution
Starting point
The model starts at the institution-wide average graduation probability for similar students. Call it the baseline, since individual factors can move the score in either direction.
MATH301 + CS201 combination
+14.2%
This specific course combination has historically been associated with higher completion among similar students. The model adds the effect.
GPA 2.9
+6.1%
Slightly above the cohort-relative average for this program. Small positive contribution.
Attendance 74%
-8.4%
The single biggest drag. Attendance below the institutional median is the strongest negative signal in this profile.
First-generation status
-2.6%
A modest negative contribution. The model has learned that first-gen students at this institution complete at a slightly lower rate, after controlling for everything else.
Final prediction
67.3%
The base rate plus the four contributions. The advisor now knows that attendance is the single biggest lever for this student, not GPA or course choice.
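The additivity can be checked with plain arithmetic. The walkthrough does not state the base rate, so the sketch below backs it out from the final prediction and the four contributions. It treats the contributions as percentage points in probability space. That is an assumption: SHAP values for a gradient-boosted classifier are often computed in log-odds and converted for display.

```python
# Contributions from the worked example, in percentage points.
contributions = {
    "MATH301 + CS201 combination": 14.2,
    "GPA 2.9": 6.1,
    "Attendance 74%": -8.4,
    "First-generation status": -2.6,
}
final_prediction = 67.3

# Base rate implied by additivity: final = base + sum of contributions.
implied_base_rate = round(final_prediction - sum(contributions.values()), 1)
print(implied_base_rate)  # 58.0

# The most negative contribution is the first thing an advisor should look at.
biggest_drag = min(contributions, key=contributions.get)
print(biggest_drag)  # Attendance 74%
```

In other words, the example reads as 58.0 + 14.2 + 6.1 - 8.4 - 2.6 = 67.3, and every term in that sum is attributable to a named factor.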
The same machinery, run across the whole population, surfaces which factors dominate across the board. On a well-built retention model, academic-progress features (cumulative credits earned, cumulative percent credit completion, prior-term GPA) dominate. Demographic features (gender, race/ethnicity) typically sit near zero, which is what you want: the model is relying on performance and progress, not on demographic shortcuts.
SHAP is what makes the prediction defensible in a committee meeting. Without it, the score is a black box. With it, every prediction is a conversation an advisor or dean can have.
LLMs in a supporting role
None of this excludes LLMs from the system. It defines what they should and should not do.
LLMs are excellent at two things in this stack. First, enriching course features. Pulling course descriptions and catalog metadata to tag courses by workload, prerequisite difficulty, and topic cluster is a language problem, and that is where LLMs shine. Those tags then become inputs to the gradient-boosted model.
Second, translating SHAP output into committee-ready language. Turning "feature X contributed -8.4% to student Y's prediction" into a paragraph an academic affairs committee can read and act on is exactly the kind of summarization LLMs are good at.
The division of labor is straightforward: classical machine learning does the prediction and the confidence estimation. The LLM does the language work around it. Mixing those up is how predictions stop being trustworthy.
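One way to keep that boundary clean is to hand the LLM only numbers that have already been computed, and ask it to rephrase them. The sketch below builds such a request as plain text. The function name, student ID, and prompt wording are hypothetical, and no particular LLM API is assumed.

```python
def build_summary_request(student_id: str, prediction: float,
                          contributions: dict[str, float]) -> str:
    """Format already-computed model output as prompt text for an LLM to rephrase."""
    # Largest absolute effect first, so the summary leads with what matters most.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    lines = [f"- {name}: {value:+.1f} percentage points" for name, value in ranked]
    return (
        f"Student {student_id} has a modeled graduation probability of {prediction:.1f}%.\n"
        "Drivers, largest first:\n"
        + "\n".join(lines)
        + "\nRewrite this for an academic affairs committee. "
        "Do not add or change any numbers."
    )

prompt = build_summary_request(
    "S-0001",  # hypothetical identifier
    67.3,
    {"GPA 2.9": 6.1, "Attendance 74%": -8.4},
)
print(prompt)
```

The prediction and the drivers come from the trained model and SHAP. The LLM sees them as fixed inputs and contributes only the wording.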
Where Clema fits in
Clema's predictive models follow exactly this pattern. Gradient-boosted trees do the prediction, SHAP produces the per-student explanation, and we use LLMs only for the supporting tasks they actually do well (feature enrichment and natural-language summaries of the model output).
For the broader strategic case (catalog of predictions, economic argument, where to start), the reactive-to-proactive playbook is the companion to this post. For the cost question (open-source stack vs $30K to $200K incumbent platforms), the cost and vendor reality post covers the math.
See SHAP-backed predictions on real student data
Walk through what per-student risk scores and driver-level explanations look like with the data your IR team already has.
Book a Clema demo
Sources
XGBoost documentation, Distributed (Deep) Machine Learning Community.
SHAP documentation, A unified approach to explaining model output.
arXiv 2411.06469, LLMs vs ML on clinical prediction.
JAMIA, comparing LLM and ML for early discharge prediction (2025).
arXiv 2406.12031, TABULA-8B tabular transfer learning.
scikit-learn, machine learning in Python.