Introduction
The first surprise most IR leaders hit when they actually price out a predictive analytics program is the size of the gap between options. On one end, the open-source stack (XGBoost plus SHAP plus a small cloud instance) runs at roughly $40 to $95 per month, plus the engineering time to build and operate it. On the other end, the major incumbent platforms (EAB Navigate, Civitas Learning, Liaison Othot) run between $30,000 and $200,000 per year, depending on institution size and contract scope.
Annualized, the open-source cloud bill is roughly $480 to $1,140, so the gap runs from about 25-to-1 to more than 400-to-1 depending on which ends you compare. It is not a small difference, and the right answer for any given institution depends on what is actually being paid for in each case. This post is the comparison in detail, written for the Provost, VP, and CFO conversations where the build-vs-buy decision gets made.
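The ratio can be checked directly from the figures quoted above. This is a minimal sketch; the only step it adds is annualizing the monthly cloud cost, and the function and constant names are illustrative.

```python
# Figures quoted in the introduction; multiplying by 12 is the only added step.
OPEN_SOURCE_MONTHLY = (40, 95)        # USD per month, cloud cost only
INCUMBENT_ANNUAL = (30_000, 200_000)  # USD per year, license fees


def cost_ratio_range(monthly, annual):
    """Return (smallest, largest) ratio of annual license fee to annualized cloud cost."""
    low_cloud, high_cloud = (m * 12 for m in monthly)
    low_fee, high_fee = annual
    # Smallest ratio: cheapest license against the most expensive cloud bill.
    # Largest ratio: most expensive license against the cheapest cloud bill.
    return low_fee / high_cloud, high_fee / low_cloud


smallest, largest = cost_ratio_range(OPEN_SOURCE_MONTHLY, INCUMBENT_ANNUAL)
print(f"{smallest:.0f}x to {largest:.0f}x")
```

The spread is wide because both ends of each range move. Engineering time is excluded here, as it is in the headline cloud figure.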
What it actually costs to run a retention model
For a working retention model trained on roughly 2 to 3 million student-term rows with 19 features, the monthly cloud cost is small. Training a fresh model takes 5 to 15 minutes on a cloud CPU instance, and the trained model serves predictions in milliseconds for individual students, a fraction of a second for a thousand students, and 2 to 3 minutes to score the entire institution's population.
The cost-efficient pattern is to spin up a training instance for about thirty minutes per retraining cycle (a few dollars per run) and keep only a small always-on instance for serving.
| Component | AWS estimate | GCP estimate |
|---|---|---|
| Training instance (intermittent, ~2 hours/month) | ~$2 | ~$0.50 (Spot) |
| Serving instance (24/7) | ~$70 (m5.large) | ~$35 to $49 (e2-standard-2 with SUD) |
| Storage (model + data) | ~$20 (EBS) | ~$2.30 (100 GB GCS Standard) |
| Total monthly | ~$92/month (~$1,100/year) | ~$38 to $52/month (~$456 to $624/year) |
This is not a "sticker price hides the real cost" trick. The compute and storage genuinely cost this little. What this table does not include is the engineering time to build, validate, and operate the system, which is the real cost of the build path. A reasonable internal estimate is one mid-level data scientist or ML engineer at half-time for the first six months, then somewhere between a quarter-time and a half-time owner thereafter.
That is the honest comparison: roughly $500 to $1,100 per year in cloud cost, plus a quarter-time to half-time owner once the system is built, against tens to hundreds of thousands per year in license fees with the vendor doing the operational work.
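To put a number on the build path for your own institution, the cloud bill and the staffed fraction can be combined in one line of arithmetic. The sketch below is illustrative only: the loaded salary is a hypothetical placeholder, not a figure from this post, and should be replaced with your own.

```python
def annual_build_cost(cloud_monthly_usd, owner_fte_fraction, loaded_salary_usd):
    """Annual build-path cost: cloud bill plus the staffed fraction of one owner."""
    cloud = cloud_monthly_usd * 12
    staffing = owner_fte_fraction * loaded_salary_usd
    return cloud + staffing


# HYPOTHETICAL placeholder: substitute your institution's fully loaded salary.
EXAMPLE_LOADED_SALARY_USD = 120_000

# Steady state after the first six months: quarter-time to half-time owner,
# with the GCP low estimate and the AWS estimate from the table as cloud bounds.
steady_state_low = annual_build_cost(38, 0.25, EXAMPLE_LOADED_SALARY_USD)
steady_state_high = annual_build_cost(92, 0.50, EXAMPLE_LOADED_SALARY_USD)
print(steady_state_low, steady_state_high)
```

Whatever salary you plug in, staffing dominates the total and the cloud bill is close to a rounding error.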
The licensing picture
Every component of the core open-source predictive stack is under a permissive license (Apache 2.0, MIT, or BSD). None of them carry restrictions on commercial use, none require contributing changes back, and none impose copyleft on derived work. This matters in procurement because legal review tends to be where open-source proposals stall.
| Component | License | Origin | Commercial use |
|---|---|---|---|
| XGBoost | Apache 2.0 | University of Washington / open-source community | Yes, fully |
| LightGBM | MIT | Microsoft | Yes, fully |
| CatBoost | Apache 2.0 | Yandex | Yes, fully |
| scikit-learn | BSD | Open-source community | Yes, fully |
| SHAP | MIT | Open-source community | Yes, fully |
No proprietary model is strictly required to do this work. There are proprietary options worth knowing about (Kumo AI for relational foundation models with strong data-residency guarantees; Qlik Predict and Microsoft AI Builder for no-code prediction inside existing BI stacks), but they are accelerators rather than necessities. Most are not white-labelable, which limits how they show up in a vendor evaluation.
The incumbents
Three platforms dominate the higher-education predictive analytics conversation. Each is a real product with real customers, and each has earned its position. They also share a common cluster of weaknesses that drive the mid-market opportunity.
| Platform | Position | Approximate annual cost | Common weakness |
|---|---|---|---|
| EAB Navigate | Dominant student-success suite, 850+ institutions, 10M+ students | $50K to $200K | Aggregates institutional data for cross-school research, expensive for small schools, rigid workflows, limited modern GenAI integration, slow SIS integration |
| Civitas Learning | Pure analytics plus course-demand analytics, ~400 institutions | $30K to $150K | Long, expensive implementation, each deployment bespoke, reported downtime during setup |
| Liaison Othot | Enrollment and retention ML, community-college lean | Mid-market (custom) | Heavy on enrollment and financial-aid prediction, thinner on course-combination effects on graduation |
Where incumbents are strong and where they leave gaps
Where incumbents win, they win clearly. The advising workflow, the case-management tooling, the cross-campus reporting, the integration with existing student-success organizations: these are mature, deeply built, and they would be expensive to rebuild from scratch. If your institution is large, has the budget, and is willing to operate inside the vendor's data and workflow opinions, the incumbents are a reasonable answer.
The three places where incumbents consistently leave room are price, time-to-value, and data posture.
Price puts these platforms out of reach for a large segment of the mid-market. Community colleges, regional publics, and smaller privates frequently cannot justify a six-figure line item for one workflow, even if they would benefit. The open-source stack cuts that line item by one to two orders of magnitude.
Time-to-value is the second wedge. The reported implementation timelines for the major platforms are measured in months, not weeks, with several public examples of multi-quarter rollouts and substantial downtime during the SIS integration phase. For an institution that wants a working retention list this academic year, that is too slow.
Data posture has been the quietest of the three concerns, and it is increasingly the loudest. The incumbent platforms generally pool institutional data into shared models so that the system as a whole improves with each new partner. Whether that pooling is described as "anonymized" or "aggregated" varies. The discomfort for institutions is structural: their data is improving a system that their competitors will also use. A partner-only model (your data trains only your model) is a direct answer to that concern, and it is becoming a more frequent ask in IR procurement.
Partial competitors and why they are not the same thing
Two adjacent tools come up often enough to address. Neither is in the same category as the incumbents above, and neither replaces a real predictive system, but both get pitched as "we already have AI for this."
Julius AI ($70/month) is a conversational data-exploration tool. It is good at answering "what happened" questions over a dataset. It does not produce per-student risk scores, it does not quantify confidence, and it is not built for the train-and-serve loop a real prediction system needs.
Zoho Analytics ($60 to $145/month) is a general business-intelligence platform with some built-in prediction features. It is competent at BI. It is not higher-ed-specific, and it is not architected around the kind of feature engineering (DFW rates, cohort-relative GPA, lag features) that drives accuracy on student-level data.
Both are useful tools. Neither is a substitute for a properly built retention or graduation model.
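For readers who have not built these features, here is a minimal sketch of the three named above, in plain Python. The definitions are common working assumptions rather than the method of any specific platform: DFW rate as the share of D, F, or W grades, cohort-relative GPA as the difference from the cohort mean, and a lag feature as the previous term's value. The field names (`cohort`, `gpa`, `credits`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

DFW_GRADES = {"D", "F", "W"}  # exact letter match; plus/minus variants not handled


def dfw_rate(grades):
    """Share of grades that are D, F, or W (withdrawal); 0.0 for an empty list."""
    if not grades:
        return 0.0
    return sum(g in DFW_GRADES for g in grades) / len(grades)


def cohort_relative_gpa(rows):
    """Add gpa_vs_cohort: each student's GPA minus the mean GPA of their cohort."""
    by_cohort = defaultdict(list)
    for r in rows:
        by_cohort[r["cohort"]].append(r["gpa"])
    cohort_mean = {c: mean(v) for c, v in by_cohort.items()}
    return [{**r, "gpa_vs_cohort": r["gpa"] - cohort_mean[r["cohort"]]} for r in rows]


def add_lag(term_rows, field):
    """For one student's rows sorted by term, add the previous term's value of field."""
    out, prev = [], None
    for r in term_rows:
        out.append({**r, f"{field}_lag1": prev})  # None for the first term
        prev = r[field]
    return out


students = cohort_relative_gpa([
    {"student": "a", "cohort": "2023", "gpa": 3.0},
    {"student": "b", "cohort": "2023", "gpa": 2.0},
])
history = add_lag([{"term": 1, "credits": 12}, {"term": 2, "credits": 9}], "credits")
```

In practice these would be computed in SQL or a dataframe library against the SIS extract; the point is that each feature encodes higher-ed context a generic BI tool does not supply by default.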
Three wedges for mid-market institutions
For an institution priced out of, or uncomfortable with, the major platforms, three structural openings are worth naming.
Price
The open-source stack
A working retention or graduation model on commodity cloud infrastructure runs at roughly $40 to $95 per month, or about $480 to $1,140 per year. That is one to two orders of magnitude below incumbent annual contracts. It does not eliminate the engineering cost, but it eliminates the licensing line item entirely.
Time-to-value
Weeks, not quarters
Connecting to a partner's SIS extract and producing a first useful score in days rather than months is a direct answer to the most common incumbent complaint. The model itself is fast to train. The integration is the real work, and it is bounded.
Privacy posture
Partner-only models
Training a model on a partner institution's data, for that partner only, without pooling into a shared cross-school model, addresses the most common procurement objection to the incumbent platforms. It also tends to be easier to defend in board and faculty governance conversations.
The build vs buy decision
The honest version of the build-vs-buy conversation has three questions, not one.
Do you have the people? If your institution has at least a half-time data scientist or ML engineer comfortable with Python, model evaluation, and the operational work of retraining and monitoring, the build path is realistic. If not, build is risky regardless of what the cloud bill says.
Do you have the data plumbing? A working model needs a clean, repeatable extract from the SIS, the LMS, and the financial-aid system. If those extracts already exist for reporting, the marginal effort is small. If they do not, the integration is the project, and it is the same integration whether you build or buy.
Do you have an owner for the output? The model produces a weekly list. Someone has to receive it, route it, and close the loop. Without that, no platform (built or bought) produces results. This is the most common failure mode across the sector, and it has nothing to do with the technology.
If all three answers are yes, build is competitive. If one is no, partnering with a vendor that handles that specific gap (the data plumbing, the ML expertise, or the workflow ownership) is usually the right call. Buying the full stack from an incumbent is right when all three are weak and the budget is there to absorb the cost.
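The three questions can be written down as a small decision rule. This is a sketch of the logic above, not a scoring tool: the function name and return strings are illustrative, and the two-gap case, which the text does not spell out, is treated here as "partner" by assumption.

```python
def recommend_path(has_people, has_data_plumbing, has_output_owner):
    """Map the three build-vs-buy answers to a suggested path."""
    checks = [
        ("ML expertise", has_people),
        ("data plumbing", has_data_plumbing),
        ("workflow ownership", has_output_owner),
    ]
    gaps = [name for name, ok in checks if not ok]
    if not gaps:
        return "build"
    if len(gaps) == len(checks):
        return "buy the full stack (if the budget is there)"
    # One gap is the case described above; two gaps is an assumption of this sketch.
    return "partner to cover: " + ", ".join(gaps)


print(recommend_path(True, False, True))
```

Naming the gaps in the output keeps the conversation on which capability is missing and away from a bare yes or no.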
Where Clema fits in
Clema is built around the three wedges above. We use the open-source stack (XGBoost, SHAP, scikit-learn) so the underlying cost stays low and the model is auditable. We integrate fast (days, not months) because the IR-facing data extracts are usually closer to ready than institutions realize. And every model we train is partner-only: your data does not improve another institution's predictions.
For the broader strategic case (why predict at all, the catalog of predictions), the reactive-to-proactive playbook is the companion piece. For the model-science argument behind the open-source stack, the XGBoost vs LLM post covers why gradient-boosted trees outperform the alternatives on this kind of data.
See what build, buy, or partner looks like for your institution
Walk through the cost, timeline, and data-posture picture for a retention or graduation model on your specific data and team setup.
Book a Clema demo
Sources
EAB Navigate, Education Advisory Board student success platform. Civitas Learning, student success and analytics. Liaison Othot, predictive analytics for enrollment. XGBoost, Apache 2.0 license. SHAP, MIT license. Georgia State University Student Success. AWS pricing, EC2 m5 instances. Google Cloud pricing, Compute Engine.