300 Real-World Machine Learning Systems: How They Went From Zero to Production
A plain-language field guide based on case studies from Netflix, Airbnb, DoorDash, and 77 other companies
If you can read a college textbook, you can read this post.
Every example comes from the public engineering blogs and papers listed at the end—nothing is made up, nothing is exaggerated.
Table of Contents
- Why should you care about these 300 stories?
- The “elevator cheat sheet”: what problem each system solves in five words or less
- A bird’s-eye view of 10 industries and 300 lessons learned
- The universal seven-step playbook that keeps showing up
- Six stories told from the ground up
  - Recommendation: Spotify’s “next song” engine
  - Forecasting: DoorDash’s holiday surge predictor
  - Fraud: Stripe’s real-time transaction guard
  - Code generation: GitHub Copilot’s autocomplete brain
  - Computer vision: Zillow’s floor-plan-from-photo tool
  - Multimodal search: Airbnb’s “romantic cabin” sorter
- Frequently asked questions (from juniors, by juniors)
- Your own 30-day starter plan
- Closing thoughts
1. Why should you care about these 300 stories?
Think of this article as a recipe book for machine-learning systems.
Instead of “how to bake a cake,” you get:
- “how Spotify decides which song to play next,”
- “how Uber predicts arrival times within one minute,”
- “how Stripe spots a stolen credit card in 200 ms.”
Each recipe contains:
Ingredient | Example |
---|---|
Business goal | “Increase listening time by 5 %” |
Data sources | Play logs, weather, device type |
First model | Simple logistic regression |
Hurdles | Cold-start, data drift, latency |
Final architecture | Ensemble + feature store + canary |
By the end you will not be an expert, but you will never again have to ask, “What does a real system actually look like?”
2. The “elevator cheat sheet”
Keyword you would type into Google | One-line summary | Example company |
---|---|---|
recommendation engine | “Show the most relevant item first” | Netflix, Spotify |
demand forecasting | “Predict tomorrow’s orders” | Uber, Walmart |
fraud detection | “Block the weird transactions” | Stripe, PayPal |
code completion | “Guess the next line of code” | GitHub |
image recognition | “Understand what’s in a picture” | Apple, Zillow |
search ranking | “Return the best result in 3 s” | Etsy, LinkedIn |
multimodal AI | “Use text + image + audio together” | Airbnb, Meta |
3. A bird’s-eye view of 10 industries and 300 lessons learned
Below is a map, not a table of contents.
Use it to jump to the corner of the world that matches your job or curiosity.
Sector | Typical ML task | Quick examples (with link IDs) |
---|---|---|
FinTech | Fraud, risk, routing | Stripe Radar (#1), PayPal graph fraud (#193), Nubank phone routing (#90) |
E-commerce | Recommend, search, forecast | Walmart “complete the look” (#2), Etsy search-by-image (#63), Instacart availability (#31) |
Mobility & Delivery | ETA, supply/demand | Uber DeepETA (#114), DoorDash holiday surge (#42), Gojek Tensoba (#113) |
Streaming Media | Personalization, content analysis | Netflix in-video search (#40), Spotify podcast preview (#64) |
Travel & Hospitality | Price prediction, ranking | Airbnb categories (#29), Expedia CLV (#55) |
Social Platforms | Feed ranking, spam detection | LinkedIn feed (#32), Pinterest spam (#192) |
HealthTech | Sensor classification | Siemens test-suite optimization (#169) |
Gaming | Player modeling | King play-testing (#289) |
SaaS & Dev Tools | Code generation, ticket triage | GitHub Copilot (#53), Salesforce Slack summarizer (#80) |
Local Services | Menu ranking, delivery time | Foodpanda menu ranking (#8), Swiggy ETA (#57) |
4. The universal seven-step playbook that keeps showing up
Almost every case study fits this loop.
- Translate the business goal
  “More rides on Friday night” → “Increase Friday-night ride-request conversion by 3 %.”
- Inventory the data
  Make a list: user logs, GPS pings, payment history, weather, public holidays.
- Label or define the target
  Regression: ETA in minutes.
  Classification: fraud = 1, safe = 0.
- Build a baseline
  Start with logistic regression or gradient boosting, whatever runs in <1 hour on a laptop.
- Run a controlled experiment
  A/B test or shadow mode (DoorDash calls it “dark launch”).
- Production plumbing
  - Feature store (Redis, BigQuery)
  - Model registry (MLflow)
  - Canary deploy (5 % traffic)
- Continuous monitoring
  Watch data drift, latency, cost. When any metric jumps 10 %, page the owner.
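The last two steps of the loop can be sketched in a few lines. This is a minimal illustration, not any company's actual router: the function names `bucket`, `choose_model`, and `should_page` are made up for this post, and hashing a user ID is just one common way to get a stable 5 % slice.

```python
import hashlib

CANARY_FRACTION = 0.05  # 5 % of traffic, as in the playbook step above


def bucket(request_key: str) -> float:
    """Map a stable key (for example a user ID) to a number in [0, 1)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def choose_model(request_key: str, canary_fraction: float = CANARY_FRACTION) -> str:
    """Send a fixed slice of keys to the canary model, the rest to stable."""
    return "canary" if bucket(request_key) < canary_fraction else "stable"


def should_page(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """True when a metric moved more than `threshold` relative to its baseline."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > threshold
```

Because the bucket depends only on the key, the same user always lands on the same model, which keeps their experience consistent while the canary runs.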
5. Six stories told from the ground up
5.1 Recommendation: Spotify’s “next song” engine
Goal
Keep the user listening instead of hitting “skip.”
Data
- 30 billion play events/day
- Audio features (tempo, key, valence)
- Context: time of day, device, playlist origin
Version history
- v0: matrix factorization (2009)
- v1: Wide & Deep (2016)
- v2: Transformer + multi-task (2023)
Tricks that worked
- Cold-start: use audio features only until enough play data arrives.
- Data imbalance: down-weight top artists to avoid feedback loops.
- Latency: 100 ms p95, with embeddings pre-computed and served from Redis.
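The cold-start trick is easy to picture in code. The sketch below is an assumption-heavy toy: `MIN_PLAYS` is a made-up threshold (the source gives no number), the vectors stand in for audio features such as tempo, key, and valence, and `collaborative_score` stands in for whatever the trained model would return.

```python
import math

MIN_PLAYS = 50  # hypothetical threshold; the case study does not state one


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def track_score(user_taste_audio, track_audio, play_count, collaborative_score=None):
    """Score a track: audio similarity while the track is new, model score afterwards."""
    if play_count < MIN_PLAYS or collaborative_score is None:
        # Cold start: too little play data, so rely on audio features only.
        return cosine(user_taste_audio, track_audio)
    return collaborative_score
```

For example, `track_score([1, 0], [1, 0], play_count=3, collaborative_score=0.2)` ignores the model score and uses audio similarity, because the track has only three plays.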
5.2 Forecasting: DoorDash’s holiday surge predictor
Problem
Thanksgiving volume spikes 4×; naive scaling wastes food and driver time.
Model stack
- LightGBM for tabular history
- Prophet for weekly/annual seasonality
- Seq2Seq for city-level temporal patterns

The three models are blended into an ensemble with Bayesian weights.
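To make the blending step concrete, here is a minimal sketch. It does not reproduce Bayesian weighting; it swaps in a simpler stand-in, inverse-error weighting, where each model's weight is proportional to one over its validation error. The function names and the error numbers in the usage example are invented for illustration.

```python
def inverse_error_weights(validation_mae: dict) -> dict:
    """Turn per-model validation MAE into weights that sum to 1.

    A lower error gives a higher weight. This is a simple stand-in for
    the Bayesian weights mentioned in the case study.
    """
    if any(mae <= 0 for mae in validation_mae.values()):
        raise ValueError("MAE values must be positive")
    inverse = {name: 1.0 / mae for name, mae in validation_mae.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def blend(forecasts: dict, weights: dict) -> float:
    """Weighted average of the individual model forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Usage with made-up numbers: LightGBM was twice as accurate on validation,
# so it gets half of the total weight.
weights = inverse_error_weights({"lightgbm": 2.0, "prophet": 4.0, "seq2seq": 4.0})
orders = blend({"lightgbm": 100.0, "prophet": 120.0, "seq2seq": 140.0}, weights)
```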
Features
- Historical orders (3 years)
- Weather, school holidays
- Real-time driver count (Kafka stream)
Outcome
Thanksgiving 2023: average delivery time up 6 min, versus up 28 min on Thanksgiving 2022.
5.3 Fraud: Stripe’s real-time transaction guard
Window
200 ms to approve or decline.
Signals
- Location jump: IP vs. shipping address distance
- Device fingerprint change
- Velocity: 3+ attempts in 60 s
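The velocity signal is the simplest of the three to sketch. The class below is a toy, in-memory version built for this post; `VelocityCheck` and its method names are hypothetical, and it assumes timestamps arrive in non-decreasing order. A production system would keep these counters in a shared store rather than in process memory.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flag a card once it reaches `max_attempts` within a sliding time window."""

    def __init__(self, max_attempts: int = 3, window_seconds: float = 60.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # card_id -> recent timestamps

    def record(self, card_id: str, timestamp: float) -> bool:
        """Record one attempt; return True if the card is now over the limit."""
        recent = self._attempts[card_id]
        recent.append(timestamp)
        # Drop attempts that fell out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_attempts
```

The output of a rule like this would typically be one feature among many fed to the model, not a decision on its own.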
Model
- Gradient-boosted trees + graph neural network (cards, emails, IPs as nodes)
- SHAP values for human-readable reasons (required by regulators)
Result
False-positive rate cut by 30 % YoY without hurting conversion.
5.4 Code generation: GitHub Copilot’s autocomplete brain
Pipeline
- Pre-train a large code model on public GitHub code (Copilot launched on OpenAI Codex)
- Fine-tune on permissively licensed snippets
- Context window: current file + 20 lines above cursor + repo path
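To show what "context window" means in practice, here is a rough sketch of assembling the text sent to the model. The prompt layout is an assumption made for this post: the `# Path:` header line and the function name `build_context` are invented, and the real prompt format is not described in the source.

```python
def build_context(repo_path: str, file_text: str, cursor_line: int,
                  lines_above: int = 20) -> str:
    """Build a prompt from the file path and the lines just above the cursor.

    `cursor_line` is the 0-based index of the line being completed, so the
    window covers up to `lines_above` lines that come before it.
    """
    lines = file_text.splitlines()
    start = max(0, cursor_line - lines_above)
    window = lines[start:cursor_line]
    header = f"# Path: {repo_path}"  # hypothetical header format
    return "\n".join([header, *window])
```

Keeping the window short is one way to stay inside a latency budget: fewer input tokens mean less work before the first suggested token comes back.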
Serving
- KV-cache to reuse prefix tokens
- 8-bit quantization, single GPU
- 5 candidate completions, first-token latency 50 ms
Guardrails
- Deduplication against public code
- Sensitive-word filter
5.5 Computer vision: Zillow’s floor-plan-from-photo tool
Input
360° panorama from phone camera.
Steps
- Semantic segmentation (Detectron2) → walls, doors, windows
- Convert pixel mask to vector geometry
- Rule checker: doors must touch walls, rooms must form polygons
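The rule checker is plain geometry, so it fits in a short sketch. Everything here is a simplifying assumption: doors are treated as single points, walls as line segments, rooms as lists of corner points, and the tolerance `tol` is a made-up value in the same units as the coordinates. It does not check for self-intersecting room outlines.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate wall: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def door_touches_wall(door, walls, tol=0.05):
    """Rule 1: a door must lie on (or very near) at least one wall."""
    return any(point_segment_distance(door, a, b) <= tol for a, b in walls)


def is_valid_room(vertices, min_area=1e-6):
    """Rule 2: a room needs at least 3 corners and a non-zero area."""
    if len(vertices) < 3:
        return False
    doubled_area = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        doubled_area += x1 * y2 - x2 * y1  # shoelace formula
    return abs(doubled_area) / 2 >= min_area
```

Checks like these catch segmentation mistakes cheaply: a door floating in the middle of a room is almost always a model error, not an architectural choice.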
User impact
Brokers save ~30 min per listing.
5.6 Multimodal search: Airbnb’s “romantic cabin” sorter
Challenge
User types “romantic cabin with hot tub” and expects perfect matches.
Model
- Text tower: BERT on listing title/description
- Image tower: ResNet on photo embeddings
- Cross-attention layer to score text-image fit
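A full cross-attention layer is too much for a short example, so the sketch below swaps in something simpler: it assumes the query, the listing text, and the listing photos have already been turned into vectors of the same size, and it scores a listing by averaging two cosine similarities. The function names, the `text_weight` parameter, and the two-number vectors in the usage example are all invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def listing_score(query_vec, text_vec, image_vec, text_weight=0.5):
    """Blend how well the query matches the listing's text and its photos."""
    text_fit = cosine(query_vec, text_vec)
    image_fit = cosine(query_vec, image_vec)
    return text_weight * text_fit + (1.0 - text_weight) * image_fit


def rank(query_vec, listings):
    """Return listing IDs, best match first.

    `listings` maps listing_id -> (text_vec, image_vec).
    """
    scores = {lid: listing_score(query_vec, t, i) for lid, (t, i) in listings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Usage with toy embeddings: the cabin points the same way as the query.
order = rank([1.0, 0.0], {"loft": ([0.0, 1.0], [0.0, 1.0]),
                          "cabin": ([1.0, 0.0], [0.9, 0.1])})
```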
Gain
Booking conversion in the couples segment rose 12 %.
6. Frequently asked questions (from juniors, by juniors)
Q1: I only have a laptop. Can I still replicate these systems?
Yes. 80 % of the teams start with a 4-core CPU and <8 GB RAM. Move to GPU only after the baseline works.
Q2: What if my dataset is tiny?
- Transfer learning: use BERT for text, ResNet for images.
- Weak supervision: DoorDash generated 1 M pseudo-labels with simple rules.
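Weak supervision sounds fancy but often starts as a handful of if-statements that vote. The sketch below is a generic toy, not DoorDash's rules: the three rules, the field names `minutes` and `refunded`, and the label meaning (1 = late order, 0 = on time) are all made up to show the pattern.

```python
ABSTAIN = None  # a rule returns this when it has no opinion


def rule_long_delivery(order):
    """Hypothetical rule: very long deliveries are probably late."""
    return 1 if order["minutes"] > 60 else ABSTAIN


def rule_short_delivery(order):
    """Hypothetical rule: very short deliveries are probably on time."""
    return 0 if order["minutes"] < 30 else ABSTAIN


def rule_refund(order):
    """Hypothetical rule: a refund suggests something went wrong."""
    return 1 if order.get("refunded") else ABSTAIN


RULES = [rule_long_delivery, rule_short_delivery, rule_refund]


def pseudo_label(order, rules=RULES):
    """Majority vote over the rules that fire; abstain on no votes or a tie."""
    votes = [v for v in (rule(order) for rule in rules) if v is not ABSTAIN]
    ones = sum(votes)
    zeros = len(votes) - ones
    if ones == zeros:  # covers both "no votes" and a split vote
        return ABSTAIN
    return 1 if ones > zeros else 0
```

Rows where every rule abstains simply stay unlabeled; the pseudo-labels are noisy, so it helps to hand-check a small sample before training on them.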
Q3: The model degrades after launch. How do I catch it early?
Plot daily distribution drift (Kolmogorov–Smirnov distance). Netflix alerts at 0.1.
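The Kolmogorov–Smirnov distance is the largest gap between the cumulative distributions of two samples, and it can be computed without any libraries. This is a minimal sketch: `ks_distance` and `drift_alert` are names chosen for this post, and the default threshold simply reuses the 0.1 figure quoted above.

```python
def ks_distance(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    largest_gap = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    return largest_gap


def drift_alert(reference, current, threshold=0.1):
    """True when today's feature values drifted past the alert threshold."""
    return ks_distance(reference, current) > threshold
```

Run it once per feature per day, with last month's values as `reference` and today's as `current`. If SciPy is available, `scipy.stats.ks_2samp` returns the same statistic along with a p-value.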
Q4: How do I convince my manager to fund this?
Run a 2-week shadow mode and record “dollars saved” or “hours freed.” Stripe’s shadow run showed $3 M annual fraud loss reduction—budget approved overnight.
7. Your own 30-day starter plan
Week | Task | Tool suggestion | Success criterion |
---|---|---|---|
1 | Pick one use case | Copy DoorDash ETA or Spotify next-song | Write 200-word problem statement |
2 | Collect & label 1 k samples | CSV + Google Colab | Baseline MAE < 20 % worse than business heuristic |
3 | Train baseline | LightGBM or scikit-learn | Offline metric beats random |
4 | Shadow launch | FastAPI + Docker | Compare to current system, no user impact |
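The week 4 shadow launch boils down to one rule: the user always gets the current system's answer, and the new model's answer is only written to a log. Here is a framework-free sketch of that rule; `shadow_predict` and the log record layout are hypothetical, and in a real service the same function body would sit inside a FastAPI route handler.

```python
def shadow_predict(features, current_system, candidate_model, log):
    """Serve the current system's answer; record the candidate's answer on the side.

    A crash in the candidate must never affect the user, so any exception
    it raises is caught and logged instead of propagated.
    """
    served = current_system(features)
    try:
        shadow = candidate_model(features)
        error = None
    except Exception as exc:  # deliberately broad: shadow failures stay invisible to users
        shadow = None
        error = repr(exc)
    log.append({"features": features, "served": served, "shadow": shadow, "error": error})
    return served


# Usage with stand-in callables: an old heuristic and a new model.
log = []
answer = shadow_predict({"distance_km": 2}, lambda f: 10, lambda f: f["distance_km"] * 6, log)
```

After a couple of weeks, compare the `served` and `shadow` columns of the log against what actually happened to see which system was closer.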
8. Closing thoughts
This post is a map, not a miracle cure.
Whenever you feel lost, open the original case study, find the company that solved a problem like yours, and copy the parts that fit.
If you want the raw links in one place, head to HorizonX.live or the Evidently ML-system-design repo.
Pick one story, run the 30-day plan above, and next month you will have your own production story to tell.