300 Real-World Machine Learning Systems: How They Went From Zero to Production

A plain-language field guide based on case studies from Netflix, Airbnb, DoorDash, and 77 other companies

If you can read a college textbook, you can read this post.
Every example comes from the public engineering blogs and papers listed at the end—nothing is made up, nothing is exaggerated.


Table of Contents

  1. Why should you care about these 300 stories?
  2. The “elevator cheat sheet”: what problem each system solves in five words or less
  3. A bird’s-eye view of 10 industries and 300 lessons learned
  4. The universal seven-step playbook that keeps showing up
  5. Six stories told from the ground up

    • Recommendation: Spotify’s “next song” engine
    • Forecasting: DoorDash’s holiday surge predictor
    • Fraud: Stripe’s real-time transaction guard
    • Code generation: GitHub Copilot’s autocomplete brain
    • Computer vision: Zillow’s floor-plan-from-photo tool
    • Multimodal search: Airbnb’s “romantic cabin” sorter
  6. Frequently asked questions (asked by juniors, answered for juniors)
  7. Your own 30-day starter plan
  8. Closing thoughts

1. Why should you care about these 300 stories?

Think of this article as a recipe book for machine-learning systems.
Instead of “how to bake a cake,” you get:

  • “how Spotify decides which song to play next,”
  • “how Uber predicts arrival times within one minute,”
  • “how Stripe spots a stolen credit card in 200 ms.”

Each recipe contains:

| Ingredient | What it means |
|---|---|
| Business goal | “Increase listening time by 5 %” |
| Data sources | Play logs, weather, device type |
| First model | Simple logistic regression |
| Hurdles | Cold-start, data drift, latency |
| Final architecture | Ensemble + feature store + canary |

By the end you will not be an expert, but you will never again ask, “How does a real system actually look?”


2. The “elevator cheat sheet”

| Keyword you would type into Google | One-line summary | Example company |
|---|---|---|
| recommendation engine | “Show the most relevant item first” | Netflix, Spotify |
| demand forecasting | “Predict tomorrow’s orders” | Uber, Walmart |
| fraud detection | “Block the weird transactions” | Stripe, PayPal |
| code completion | “Guess the next line of code” | GitHub |
| image recognition | “Understand what’s in a picture” | Apple, Zillow |
| search ranking | “Return the best result in 3 s” | Etsy, LinkedIn |
| multimodal AI | “Use text + image + audio together” | Airbnb, Meta |

3. A bird’s-eye view of 10 industries and 300 lessons learned

Below is a map, not a table of contents.
Use it to jump to the corner of the world that matches your job or curiosity.

| Sector | Typical ML task | Three quick examples (with link IDs) |
|---|---|---|
| FinTech | Fraud, risk, routing | Stripe Radar (#1), PayPal graph fraud (#193), Nubank phone routing (#90) |
| E-commerce | Recommend, search, forecast | Walmart “complete the look” (#2), Etsy search-by-image (#63), Instacart availability (#31) |
| Mobility & Delivery | ETA, supply/demand | Uber DeepETA (#114), DoorDash holiday surge (#42), Gojek Tensoba (#113) |
| Streaming Media | Personalization, content analysis | Netflix in-video search (#40), Spotify podcast preview (#64) |
| Travel & Hospitality | Price prediction, ranking | Airbnb categories (#29), Expedia CLV (#55) |
| Social Platforms | Feed ranking, spam detection | LinkedIn feed (#32), Pinterest spam (#192) |
| HealthTech | Sensor classification | Siemens test-suite optimization (#169) |
| Gaming | Player modeling | King play-testing (#289) |
| SaaS & Dev Tools | Code generation, ticket triage | GitHub Copilot (#53), Salesforce Slack summarizer (#80) |
| Local Services | Menu ranking, delivery time | Foodpanda menu ranking (#8), Swiggy ETA (#57) |

4. The universal seven-step playbook that keeps showing up

Almost every case study fits this loop.

  1. Translate the business goal
    “More rides on Friday night” → “Increase Friday-night ride-request conversion by 3 %.”

  2. Inventory the data
    Make a list: user logs, GPS pings, payment history, weather, public holidays.

  3. Label or define the target
    Regression: ETA in minutes.
    Classification: fraud = 1, safe = 0.

  4. Build a baseline
    Start with logistic regression or gradient boosting—whatever runs in <1 hour on a laptop.

  5. Run a controlled experiment
    A/B test or shadow mode (DoorDash calls it “dark launch”).

  6. Production plumbing

    • Feature store (Redis, BigQuery)
    • Model registry (MLflow)
    • Canary deploy (5 % traffic)
  7. Continuous monitoring
    Watch data drift, latency, cost. When any metric jumps 10 %, page the owner.
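Step 4 of the playbook is often smaller than people expect. As a deliberately tiny illustration, here is a logistic-regression baseline in plain Python that trains in seconds on any laptop. The “fraud” features and toy data are invented for this sketch, not taken from any case study:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Logistic regression via plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

# Toy "fraud" rows: [amount_in_hundreds, attempts_in_last_minute]
X = [[0.5, 1], [0.7, 1], [4.0, 5], [3.5, 4], [0.6, 2], [5.0, 6]]
y = [0, 0, 1, 1, 0, 1]
w, b = train_logreg(X, y)
```

Once a baseline like this beats the business heuristic offline, the rest of the loop (experiment, plumbing, monitoring) has something concrete to compare against.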


5. Six stories told from the ground up

5.1 Recommendation: Spotify’s “next song” engine

Goal
Keep the user listening instead of hitting “skip.”

Data

  • 30 billion play events/day
  • Audio features (tempo, key, valence)
  • Context: time of day, device, playlist origin

Version history

  • v0: matrix factorization (2009)
  • v1: Wide & Deep (2016)
  • v2: Transformer + multi-task (2023)

Tricks that worked

  • Cold-start: use audio features only until enough play data arrives.
  • Data imbalance: down-weight top artists to avoid feedback loops.
  • Latency: 100 ms p95—embeddings pre-computed, served from Redis.
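The serving pattern behind those last two tricks — pre-computed vectors plus a popularity penalty — can be sketched in a few lines. The embeddings, track IDs, and the log-based penalty below are all invented for illustration; Spotify’s real vectors come from the trained model and are served from Redis:

```python
import math

# Hypothetical pre-computed track embeddings (in production: a Redis lookup).
track_embeddings = {
    "track_a": [0.9, 0.1, 0.3],
    "track_b": [0.2, 0.8, 0.5],
    "track_c": [0.85, 0.2, 0.25],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def next_song(current_track, popularity=None):
    """Rank candidates by embedding similarity, optionally
    down-weighting popular tracks to dampen feedback loops."""
    query = track_embeddings[current_track]
    scores = {}
    for tid, emb in track_embeddings.items():
        if tid == current_track:
            continue
        score = cosine(query, emb)
        if popularity:  # crude popularity penalty, invented for this sketch
            score /= 1.0 + math.log1p(popularity.get(tid, 0))
        scores[tid] = score
    return max(scores, key=scores.get)
```

Because everything here is a dictionary lookup plus a handful of multiplications, a p95 of 100 ms is mostly a network and caching problem, not a model problem.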

5.2 Forecasting: DoorDash’s holiday surge predictor

Problem
Thanksgiving volume spikes 4×; naive scaling wastes food and driver time.

Model stack

  • LightGBM for tabular history
  • Prophet for weekly/annual seasonality
  • Seq2Seq for city-level temporal patterns
    Ensemble blended with Bayesian weights.
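The blending step can be illustrated with a much simpler stand-in for the Bayesian weighting: weight each model by the inverse of its recent backtest error. The model names, forecast numbers, and error figures below are made up for the sketch:

```python
def blend_forecasts(forecasts, backtest_errors):
    """Weighted average of model forecasts; weights are proportional to
    1 / recent backtest error (a simple stand-in for Bayesian weighting)."""
    weights = {m: 1.0 / e for m, e in backtest_errors.items()}
    total = sum(weights.values())
    return sum(weights[m] / total * f for m, f in forecasts.items())

forecasts = {"lightgbm": 1200.0, "prophet": 1100.0, "seq2seq": 1350.0}
errors = {"lightgbm": 0.08, "prophet": 0.12, "seq2seq": 0.10}  # e.g. recent MAPE
blended = blend_forecasts(forecasts, errors)
```

The design intuition carries over to the real system: the model that has been most accurate lately gets the loudest vote.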

Features

  • Historical orders (3 years)
  • Weather, school holidays
  • Real-time driver count (Kafka stream)

Outcome
2023 Thanksgiving: +6 min average delivery vs. +28 min in 2022.

5.3 Fraud: Stripe’s real-time transaction guard

Window
200 ms to approve or decline.

Signals

  • Location jump: IP vs. shipping address distance
  • Device fingerprint change
  • Velocity: 3+ attempts in 60 s
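The velocity signal is essentially a sliding-window counter. A minimal sketch — the class name and defaults are invented, and Stripe’s real pipeline runs this kind of check against streaming infrastructure rather than in-process Python:

```python
import time
from collections import defaultdict, deque

class VelocityCheck:
    """Flags a card once it sees `limit` or more attempts within `window` seconds."""
    def __init__(self, limit=3, window=60.0):
        self.limit = limit
        self.window = window
        self.attempts = defaultdict(deque)

    def record(self, card_id, now=None):
        now = time.time() if now is None else now
        q = self.attempts[card_id]
        q.append(now)
        while q and now - q[0] > self.window:  # drop attempts outside the window
            q.popleft()
        return len(q) >= self.limit            # True -> suspicious
```

The deque keeps the check O(1) amortized per attempt, which matters when the whole decision has a 200 ms budget.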

Model

  • Gradient-boosted trees + graph neural network (cards, emails, IPs as nodes)
  • SHAP values for human-readable reasons (required by regulators)

Result
False-positive rate cut by 30 % YoY without hurting conversion.

5.4 Code generation: GitHub Copilot’s autocomplete brain

Pipeline

  • Pre-train a large code model (Copilot is built on OpenAI’s Codex and successor GPT models) on public GitHub code
  • Fine-tune on permissively licensed snippets
  • Context window: current file + 20 lines above cursor + repo path

Serving

  • KV-cache to reuse prefix tokens
  • 8-bit quantization, single GPU
  • 5 candidate completions, first-token latency 50 ms

Guardrails

  • Deduplication against public code
  • Sensitive-word filter
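The deduplication guardrail can be approximated with token-shingle hashing: index long token shingles of public code, then block any completion that reproduces one verbatim. This is a toy stand-in for illustration, not GitHub’s actual mechanism:

```python
import hashlib

def shingles(code, n=6):
    """All runs of n consecutive whitespace-separated tokens."""
    toks = code.split()
    return {" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))}

class PublicCodeFilter:
    """Blocks completions that share a long token shingle with indexed public code."""
    def __init__(self, public_snippets, n=6):
        self.n = n
        self.index = set()
        for snip in public_snippets:
            self.index |= {hashlib.sha1(s.encode()).hexdigest() for s in shingles(snip, n)}

    def is_blocked(self, completion):
        return any(hashlib.sha1(s.encode()).hexdigest() in self.index
                   for s in shingles(completion, self.n))
```

Hashing the shingles keeps the index compact and makes the lookup a set membership test rather than a string scan.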

5.5 Computer vision: Zillow’s floor-plan-from-photo tool

Input
360° panorama from phone camera.

Steps

  1. Semantic segmentation (Detectron2) → walls, doors, windows
  2. Convert pixel mask to vector geometry
  3. Rule checker: doors must touch walls, rooms must form polygons
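Rule checks like step 3 are plain geometry. A sketch of the “doors must touch walls” rule, with invented coordinates — walls as 2-D segments, a door reduced to its midpoint:

```python
def door_touches_wall(door, walls, tol=0.05):
    """door: (x, y) midpoint; walls: list of ((x1, y1), (x2, y2)) segments.
    True when the door lies within `tol` of some wall segment."""
    dx, dy = door
    for (x1, y1), (x2, y2) in walls:
        sx, sy = x2 - x1, y2 - y1
        length_sq = sx * sx + sy * sy
        # Project the door onto the segment, clamped to its endpoints.
        t = 0.0 if length_sq == 0 else max(
            0.0, min(1.0, ((dx - x1) * sx + (dy - y1) * sy) / length_sq))
        px, py = x1 + t * sx, y1 + t * sy
        if (dx - px) ** 2 + (dy - py) ** 2 <= tol ** 2:
            return True
    return False

# A 4 m x 3 m rectangular room.
walls = [((0, 0), (4, 0)), ((4, 0), (4, 3)), ((4, 3), (0, 3)), ((0, 3), (0, 0))]
```

A door the segmentation model floats into the middle of a room fails this check and gets snapped back or flagged for review.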

User impact
Brokers save ~30 min per listing.

5.6 Multimodal search: Airbnb’s “romantic cabin” sorter

Challenge
User types “romantic cabin with hot tub” and expects perfect matches.

Model

  • Text tower: BERT on listing title/description
  • Image tower: ResNet on photo embeddings
  • Cross-attention layer to score text-image fit
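A toy version of the two-tower scoring: compare a query embedding against the listing’s text embedding and its best-matching photo embedding, then blend. The vectors and the blend weight `alpha` are invented for this sketch; the real towers produce high-dimensional learned embeddings and the real fusion is a cross-attention layer, not a fixed blend:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score_listing(query_emb, text_emb, photo_embs, alpha=0.6):
    """Blend the text-tower score with the best image-tower match."""
    text_score = cosine(query_emb, text_emb)
    image_score = max(cosine(query_emb, p) for p in photo_embs)
    return alpha * text_score + (1 - alpha) * image_score

query = [1.0, 0.0]  # stand-in embedding for "romantic cabin with hot tub"
romantic_cabin = score_listing(query, [0.9, 0.1], [[1.0, 0.0], [0.5, 0.5]])
city_loft = score_listing(query, [0.0, 1.0], [[0.0, 1.0]])
```

Taking the max over photos is the key move: one photo of the hot tub is enough, even if the cover photo shows the driveway.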

Gain
Couples segment booking conversion +12 %.


6. Frequently asked questions (asked by juniors, answered for juniors)

Q1: I only have a laptop. Can I still replicate these systems?
Yes. 80 % of these teams started on a 4-core CPU with less than 8 GB of RAM. Move to a GPU only after the baseline works.

Q2: What if my dataset is tiny?

  • Transfer learning: use BERT for text, ResNet for images.
  • Weak supervision: DoorDash generated 1 M pseudo-labels with simple rules.

Q3: The model degrades after launch. How do I catch it early?
Plot daily distribution drift (Kolmogorov–Smirnov distance). Netflix alerts at 0.1.
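The KS distance needs no special library — it is just the largest gap between two empirical CDFs. A self-contained sketch, with toy samples and a threshold chosen purely for illustration:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: the maximum gap between
    the two empirical CDFs, evaluated at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Yesterday's feature values vs. today's; alert when the gap crosses your threshold.
yesterday = [0.1, 0.2, 0.2, 0.3, 0.4]
today = [0.6, 0.7, 0.8, 0.8, 0.9]
drift = ks_statistic(yesterday, today)  # fully disjoint samples give 1.0
```

Run this per feature per day, plot the series, and a drifting input shows up as a rising line long before the business metric moves.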

Q4: How do I convince my manager to fund this?
Run a 2-week shadow mode and record “dollars saved” or “hours freed.” Stripe’s shadow run showed $3 M annual fraud loss reduction—budget approved overnight.


7. Your own 30-day starter plan

| Week | Task | Tool suggestion | Success criterion |
|---|---|---|---|
| 1 | Pick one use case | Copy DoorDash ETA or Spotify next-song | Write 200-word problem statement |
| 2 | Collect & label 1 k samples | CSV + Google Colab | Baseline MAE < 20 % worse than business heuristic |
| 3 | Train baseline | LightGBM or scikit-learn | Offline metric beats random |
| 4 | Shadow launch | FastAPI + Docker | Compare to current system, no user impact |

8. Closing thoughts

This post is a map, not a miracle cure.
Whenever you feel lost, open the original case study, find the company that solved a problem like yours, and copy the parts that fit.

If you want the raw links in one place, head to HorizonX.live or the Evidently ML-system-design repo.

Pick one story, run the 30-day plan above, and next month you will have your own production story to tell.