300 Real-World Machine Learning Systems: From Concept to Production Excellence

高效码农

5 months ago

300 Real-World Machine Learning Systems: How They Went From Zero to Production

A plain-language field guide based on case studies from Netflix, Airbnb, DoorDash, and 77 other companies


If you can read a college textbook, you can read this post.
Every example comes from the public engineering blogs and papers listed at the end—nothing is made up, nothing is exaggerated.

Why should you care about these 300 stories?
The “elevator cheat sheet”: what problem each system solves in five words or less
A bird’s-eye view of 10 industries and 300 lessons learned
The universal seven-step playbook that keeps showing up
Six stories told from the ground up
- Recommendation: Spotify’s “next song” engine
- Forecasting: DoorDash’s holiday surge predictor
- Fraud: Stripe’s real-time transaction guard
- Code generation: GitHub Copilot’s autocomplete brain
- Computer vision: Zillow’s floor-plan-from-photo tool
- Multimodal search: Airbnb’s “romantic cabin” sorter
Frequently asked questions (from juniors, by juniors)
Your own 30-day starter plan
Closing thoughts

1. Why should you care about these 300 stories?

Think of this article as a recipe book for machine-learning systems.
Instead of “how to bake a cake,” you get

“how Spotify decides which song to play next,”
“how Uber predicts arrival times within one minute,”
“how Stripe spots a stolen credit card in 200 ms.”

Each recipe contains:

Ingredient	What it means
Business goal	“Increase listening time by 5 %”
Data sources	Play logs, weather, device type
First model	Simple logistic regression
Hurdles	Cold-start, data drift, latency
Final architecture	Ensemble + feature store + canary

By the end you will not be an expert, but you will never again have to ask, “What does a real system actually look like?”

2. The “elevator cheat sheet”

Keyword you would type into Google	One-line summary	Example company
recommendation engine	“Show the most relevant item first”	Netflix, Spotify
demand forecasting	“Predict tomorrow’s orders”	Uber, Walmart
fraud detection	“Block the weird transactions”	Stripe, PayPal
code completion	“Guess the next line of code”	GitHub
image recognition	“Understand what’s in a picture”	Apple, Zillow
search ranking	“Return the best result in 3 s”	Etsy, LinkedIn
multimodal AI	“Use text + image + audio together”	Airbnb, Meta

3. A bird’s-eye view of 10 industries and 300 lessons learned

Below is a map, not a table of contents.
Use it to jump to the corner of the world that matches your job or curiosity.

Sector	Typical ML task	Three quick examples (with link IDs)
FinTech	Fraud, risk, routing	Stripe Radar (#1), PayPal graph fraud (#193), Nubank phone routing (#90)
E-commerce	Recommend, search, forecast	Walmart “complete the look” (#2), Etsy search-by-image (#63), Instacart availability (#31)
Mobility & Delivery	ETA, supply/demand	Uber DeepETA (#114), DoorDash holiday surge (#42), Gojek Tensoba (#113)
Streaming Media	Personalization, content analysis	Netflix in-video search (#40), Spotify podcast preview (#64)
Travel & Hospitality	Price prediction, ranking	Airbnb categories (#29), Expedia CLV (#55)
Social Platforms	Feed ranking, spam detection	LinkedIn feed (#32), Pinterest spam (#192)
HealthTech	Sensor classification	Siemens test-suite optimization (#169)
Gaming	Player modeling	King play-testing (#289)
SaaS & Dev Tools	Code generation, ticket triage	GitHub Copilot (#53), Salesforce Slack summarizer (#80)
Local Services	Menu ranking, delivery time	Foodpanda menu ranking (#8), Swiggy ETA (#57)

4. The universal seven-step playbook that keeps showing up

Almost every case study fits this loop.

Step 1: Translate the business goal
“More rides on Friday night” → “Increase Friday-night ride-request conversion by 3 %.”
Step 2: Inventory the data
Make a list: user logs, GPS pings, payment history, weather, public holidays.
Step 3: Label or define the target
Regression: ETA in minutes.
Classification: fraud = 1, safe = 0.
Step 4: Build a baseline
Start with logistic regression or gradient boosting, whatever trains in under an hour on a laptop.
Step 5: Run a controlled experiment
A/B test or shadow mode (DoorDash calls it “dark launch”).
Step 6: Production plumbing
- Feature store (for example, backed by Redis or BigQuery)
- Model registry (MLflow)
- Canary deploy (5 % traffic)
Step 7: Continuous monitoring
Watch data drift, latency, and cost. When any metric jumps 10 %, page the owner.
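
The 5 % canary split and the 10 % alert rule above can be sketched in a few lines. This is a minimal illustration, not any company's implementation: the function names `in_canary` and `metric_jumped`, the hash-based bucketing, and the sample numbers are all assumptions made for the example.

```python
import hashlib


def in_canary(user_id: str, fraction: float = 0.05) -> bool:
    """Deterministically route a fixed share of users to the canary model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction


def metric_jumped(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """Return True when a metric moved by more than `threshold` relative to baseline."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > threshold


# Hypothetical usage: check the canary share and one latency reading.
users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u) for u in users) / len(users)
print(f"canary share: {share:.3f}")
print("page the owner:", metric_jumped(baseline=120.0, current=135.0))
```

Hashing the user ID keeps each user on the same side of the split across requests, which makes canary metrics easier to compare than a random split per request.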

5. Six stories told from the ground up

5.1 Recommendation: Spotify’s “next song” engine

Goal
Keep the user listening instead of hitting “skip.”

Data

30 billion play events/day
Audio features (tempo, key, valence)
Context: time of day, device, playlist origin

Version history

v0: matrix factorization (2009)
v1: Wide & Deep (2016)
v2: Transformer + multi-task (2023)

Tricks that worked

Cold-start: use audio features only until enough play data arrives.
Data imbalance: down-weight top artists to avoid feedback loops.
Latency: 100 ms p95—embeddings pre-computed, served from Redis.
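
The cold-start and pre-computation tricks can be illustrated with a small sketch. Everything here is hypothetical: the `MIN_PLAYS` threshold, the function names, and the in-memory dictionary that stands in for a cache such as Redis are assumptions, not Spotify's code.

```python
from typing import Dict, List, Optional

MIN_PLAYS = 1_000  # hypothetical cutoff for "enough play data"


def track_vector(audio_vec: List[float],
                 collab_vec: Optional[List[float]],
                 play_count: int) -> List[float]:
    """Use the audio-feature vector until enough play data has arrived."""
    if collab_vec is None or play_count < MIN_PLAYS:
        return audio_vec
    return collab_vec


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_candidates(user_vec: List[float],
                    cache: Dict[str, List[float]]) -> List[str]:
    """Score pre-computed track vectors against a user vector, best first."""
    return sorted(cache, key=lambda tid: dot(user_vec, cache[tid]), reverse=True)


# Hypothetical usage: vectors are computed offline and only looked up at request time.
cache = {
    "new_track": track_vector([0.9, 0.1], None, play_count=12),
    "old_track": track_vector([0.2, 0.2], [0.1, 0.8], play_count=50_000),
}
print(rank_candidates([1.0, 0.0], cache))
```

Because the vectors are built ahead of time, the request path only does lookups and dot products, which is what makes a tight latency budget realistic.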

5.2 Forecasting: DoorDash’s holiday surge predictor

Problem
Thanksgiving volume spikes 4×; naive scaling wastes food and driver time.

Model stack

LightGBM for tabular history
Prophet for weekly/annual seasonality
Seq2Seq for city-level temporal patterns
Ensemble blended with Bayesian weights.
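
To make the blending step concrete, here is a minimal sketch. The case study does not spell out how its Bayesian weights are computed, so this example swaps in a simpler stand-in: each model is weighted by the inverse of its validation error. The function names and all numbers are hypothetical.

```python
from typing import Dict


def inverse_error_weights(val_errors: Dict[str, float]) -> Dict[str, float]:
    """Give each model a weight proportional to 1 / validation error."""
    if any(err <= 0 for err in val_errors.values()):
        raise ValueError("validation errors must be positive")
    inverse = {name: 1.0 / err for name, err in val_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def blend(forecasts: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the per-model forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Hypothetical validation errors (e.g. MAE in orders) and one day's forecasts.
weights = inverse_error_weights({"lightgbm": 40.0, "prophet": 80.0, "seq2seq": 80.0})
forecast = blend({"lightgbm": 1000.0, "prophet": 1200.0, "seq2seq": 1100.0}, weights)
print(weights)
print(round(forecast, 1))
```

The model with the lowest validation error gets the largest say, but no single model is trusted completely.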

Features

Historical orders (3 years)
Weather, school holidays
Real-time driver count (Kafka stream)

Outcome
Thanksgiving 2023: average delivery time ran 6 min above normal, versus 28 min above normal in 2022.

5.3 Fraud: Stripe’s real-time transaction guard

Window
200 ms to approve or decline.

Signals

Location jump: IP vs. shipping address distance
Device fingerprint change
Velocity: 3+ attempts in 60 s
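
The velocity signal is the easiest of the three to sketch. The version below keeps a sliding window of attempt timestamps per card. The class name `VelocityCheck` and its interface are assumptions for illustration, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flag a card once it reaches `max_attempts` within `window_s` seconds."""

    def __init__(self, max_attempts: int = 3, window_s: float = 60.0):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._attempts = defaultdict(deque)  # card_id -> recent timestamps

    def record(self, card_id: str, ts: float) -> bool:
        """Record an attempt at time `ts` (seconds); return True if flagged."""
        window = self._attempts[card_id]
        window.append(ts)
        # Drop attempts that fell out of the time window.
        while window and ts - window[0] > self.window_s:
            window.popleft()
        return len(window) >= self.max_attempts


# Hypothetical usage: three attempts in 20 seconds trip the rule.
check = VelocityCheck()
for ts in (0.0, 10.0, 20.0):
    flagged = check.record("card-123", ts)
print("flagged:", flagged)
```

In a real system this flag would be one input feature for the model rather than a decision on its own.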

Model

Gradient-boosted trees + graph neural network (cards, emails, IPs as nodes)
SHAP values for human-readable reasons (required by regulators)

Result
False-positive rate cut by 30 % YoY without hurting conversion.

5.4 Code generation: GitHub Copilot’s autocomplete brain

Pipeline

Pre-train a code LLM on public GitHub code (Copilot launched on OpenAI Codex)
Fine-tune on permissively licensed snippets
Context window: current file + 20 lines above cursor + repo path

Serving

KV-cache to reuse prefix tokens
8-bit quantization, single GPU
5 candidate completions, first-token latency 50 ms
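
The serving list mentions 8-bit quantization without saying which scheme is used. As a rough illustration of the idea only, here is symmetric per-tensor quantization in plain Python. The scheme, the function names, and the sample weights are assumptions, and the input list is assumed to be non-empty.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical usage: each weight now needs 1 byte instead of 4.
original = [0.42, -1.30, 0.07, 0.91]
q, scale = quantize_int8(original)
restored = dequantize(q, scale)
print(q)
print([round(v, 3) for v in restored])
```

The rounding error per weight is at most half of `scale`, which is the trade made for the smaller memory footprint.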

Guardrails

Deduplication against public code
Sensitive-word filter

5.5 Computer vision: Zillow’s floor-plan-from-photo tool

Input
360° panorama from phone camera.

Steps

Semantic segmentation (Detectron2) → walls, doors, windows
Convert pixel mask to vector geometry
Rule checker: doors must touch walls, rooms must form polygons
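
The rule-checker step lends itself to a short geometric sketch. The representation below (walls as line segments, doors as points, rooms as vertex lists) and the tolerance values are assumptions for illustration, not Zillow's data model.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def door_touches_wall(door: Point, walls: List[Segment], tol: float = 0.05) -> bool:
    """A door is valid only if it lies within `tol` of some wall."""
    return any(point_segment_distance(door, a, b) <= tol for a, b in walls)


def polygon_area(vertices: List[Point]) -> float:
    """Shoelace formula; the last vertex connects back to the first."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def is_valid_room(vertices: List[Point], min_area: float = 1e-6) -> bool:
    """A room needs at least three vertices and a non-zero area."""
    return len(vertices) >= 3 and polygon_area(vertices) > min_area


# Hypothetical usage with a 4 x 3 room.
walls = [((0.0, 0.0), (4.0, 0.0)), ((4.0, 0.0), (4.0, 3.0))]
print(door_touches_wall((2.0, 0.01), walls))
print(is_valid_room([(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]))
```

Checks like these catch segmentation mistakes that a pixel-level accuracy metric would miss, such as a door floating in the middle of a room.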

User impact
Brokers save ~30 min per listing.

5.6 Multimodal search: Airbnb’s “romantic cabin” sorter

Challenge
User types “romantic cabin with hot tub” and expects perfect matches.

Model

Text tower: BERT on listing title/description
Image tower: ResNet embeddings of listing photos
Cross-attention layer to score text-image fit
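
A full cross-attention layer is too large for a short example, so the sketch below uses a simpler stand-in: cosine similarity between a query vector and each tower's output. It assumes the query, text, and image vectors already live in one shared embedding space. The function names and the equal weighting are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def listing_score(query_vec: List[float],
                  text_vec: List[float],
                  image_vecs: List[List[float]],
                  text_weight: float = 0.5) -> float:
    """Combine the text match with the best-matching photo."""
    text_sim = cosine(query_vec, text_vec)
    image_sim = max((cosine(query_vec, v) for v in image_vecs), default=0.0)
    return text_weight * text_sim + (1 - text_weight) * image_sim


# Hypothetical 3-dimensional embeddings for one query and one listing.
query = [0.9, 0.1, 0.3]
score = listing_score(query, [0.8, 0.2, 0.1], [[0.1, 0.9, 0.0], [0.7, 0.0, 0.5]])
print(round(score, 3))
```

Taking the best photo rather than the average means one strong hot-tub picture can carry a listing whose other photos are unrelated to the query.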

Gain
Couples segment booking conversion +12 %.

6. Frequently asked questions (from juniors, by juniors)

Q1: I only have a laptop. Can I still replicate these systems?
Yes. 80 % of the teams start with a 4-core CPU and <8 GB RAM. Move to GPU only after the baseline works.

Q2: What if my dataset is tiny?

Transfer learning: use BERT for text, ResNet for images.
Weak supervision: DoorDash generated 1 M pseudo-labels with simple rules.

Q3: The model degrades after launch. How do I catch it early?
Plot daily distribution drift (Kolmogorov–Smirnov distance). Netflix alerts at 0.1.
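
The drift check can be sketched without any libraries. The two-sample Kolmogorov–Smirnov statistic is the largest gap between the empirical CDFs of a reference sample and today's sample. The function names are illustrative, and the 0.1 threshold is simply the figure quoted above.

```python
from bisect import bisect_right
from typing import List


def ks_statistic(reference: List[float], current: List[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # Both CDFs are step functions, so the gap peaks at an observed value.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_alert(reference: List[float], current: List[float],
                threshold: float = 0.1) -> bool:
    """Return True when the KS statistic exceeds the alert threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical feature values: training-time sample vs. today's traffic.
training = [1.0, 2.0, 3.0, 4.0]
today = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(training, today), drift_alert(training, today))
```

With SciPy installed, `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.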

Q4: How do I convince my manager to fund this?
Run a 2-week shadow mode and record “dollars saved” or “hours freed.” Stripe’s shadow run showed $3 M annual fraud loss reduction—budget approved overnight.

7. Your own 30-day starter plan

Week	Task	Tool suggestion	Success criterion
1	Pick one use case	Copy DoorDash ETA or Spotify next-song	Write 200-word problem statement
2	Collect & label 1 k samples	CSV + Google Colab	Baseline MAE < 20 % worse than business heuristic
3	Train baseline	LightGBM or scikit-learn	Offline metric beats random
4	Shadow launch	FastAPI + Docker	Compare to current system, no user impact
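
The week-4 shadow launch boils down to one rule: the user always gets the current system's answer, and the candidate's answer is only logged. Below is a minimal sketch of that rule in plain Python, leaving out the FastAPI and Docker layers. The function names and the log format are assumptions.

```python
from typing import Any, Callable, Dict, List


def shadow_predict(request: Any,
                   current_model: Callable[[Any], float],
                   candidate_model: Callable[[Any], float],
                   log: List[Dict[str, Any]]) -> float:
    """Serve the current model's answer; record the candidate's for later."""
    served = current_model(request)
    try:
        shadow = candidate_model(request)
        error = None
    except Exception as exc:  # a broken candidate must never affect users
        shadow, error = None, repr(exc)
    log.append({"request": request, "served": served,
                "shadow": shadow, "error": error})
    return served


def mean_abs_gap(log: List[Dict[str, Any]]) -> float:
    """Average absolute difference between served and shadow predictions."""
    gaps = [abs(r["served"] - r["shadow"]) for r in log if r["shadow"] is not None]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Hypothetical usage: a heuristic ETA vs. a new model's ETA, in minutes.
log: List[Dict[str, Any]] = []
for distance_km in (2.0, 5.0, 8.0):
    shadow_predict(distance_km, lambda d: 10 + 3 * d, lambda d: 8 + 3.5 * d, log)
print(mean_abs_gap(log))
```

Once ground-truth outcomes arrive, the same log lets you compare both systems' errors side by side before any traffic is switched.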

8. Closing thoughts

This post is a map, not a miracle cure.
Whenever you feel lost, open the original case study, find the company that solved a problem like yours, and copy the parts that fit.

If you want the raw links in one place, head to HorizonX.live or the Evidently ML-system-design repo.

Pick one story, run the 30-day plan above, and next month you will have your own production story to tell.