When AI Assistants Meet Reality: A Cloud vs Bare Metal Showdown for Big Data
Can AI programming assistants truly handle production-grade data analytics? My experiment analyzing Common Crawl data reveals they excel at code generation but fail at system-level judgment, making human oversight critical for architecture decisions.
The Experiment: Pitting Claude Against Codex
What happens when you let two AI coding assistants choose your infrastructure? I tasked Claude Code (Opus 4.5) and GPT-5.2 Codex with the same goal—analyze the latest Common Crawl dump for URL frequency counts—then stepped back to let them lead. The result was a masterclass in AI limitations: both generated functional code but demonstrated zero intuition for operational realities, costing me time, money, and patience.
The Setup
- Task: Simple MapReduce-style analysis of Common Crawl (5TB compressed WARC files)
- Budget: ~30 minutes weekday evenings, a few hours on weekends
- Constraint: Personal project budget (<$500)
- Approach: Two parallel paths—AWS cloud (Codex) vs Hetzner bare metal (Claude)
I acted as a product manager with minimal technical intervention, forcing the AIs to make architectural decisions and handle roadblocks. This narrative documents every misstep, workaround, and hard-won insight from both approaches.
The AWS Odyssey: When Codex Meets Cloud Complexity
Can an AI assistant effectively guide a novice through AWS’s labyrinthine services? No. Codex generated syntactically flawless code and coherent step-by-step instructions but demonstrated catastrophic blind spots around service quotas, cost structures, and organizational friction—turning a supposedly simple analysis into a $127+ quagmire with no end in sight.
Summary: The cloud path exposed AI’s inability to anticipate non-technical barriers. While Codex mastered API calls, it failed completely at predicting AWS’s bureaucratic approval processes and hidden costs.
Cost Estimation: A Tale of Two Numbers
The first red flag appeared when I asked both assistants for cost estimates. Claude quoted roughly $20,000; Codex quoted $300. The 66x disparity should have given me pause, but Codex’s detailed breakdown looked professional:
| Service | Data Volume | Unit Cost | Estimated Cost |
|---|---|---|---|
| S3 Select | 5TB scanned | $0.002/GB | $10 |
| Glue DPU-hours | 100 hours | $0.44/DPU-hour | $44 |
| Athena queries | 500GB scanned | $5/TB | $2.50 |
| Total | | | ~$300 |
Application Scenario: The Cost Estimation Trap
Imagine you’re a startup founder pitching this to your CFO. Codex’s spreadsheet looks bulletproof—until week one when you realize the estimate assumes:
- Zero job failures (reality: 99% failed due to quotas)
- No data transfer fees (reality: $68 in cross-region charges)
- Instant quota approval (reality: 5+ days of support tickets)

The $20k estimate, while absurdly high, at least acknowledged complexity. This reveals a critical AI limitation: they calculate best-case scenarios but cannot model organizational friction.
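To make the friction concrete, here is a minimal sketch (hypothetical, not either assistant's output) that re-runs Codex's numbers with the two frictions this article actually hit: a 20% failure/retry rate and $0.01/GB cross-region transfer on the 5TB dataset.

# Hypothetical friction-adjusted estimate using rates quoted in this article,
# not an official AWS pricing calculation.
base_compute = 44.0            # Glue DPU-hours line item from the table above
base_queries = 10.0 + 2.50     # S3 Select + Athena line items
dataset_gb = 5 * 1024          # 5TB of WARC files

best_case = base_compute + base_queries          # $56.50

failure_rate = 0.20            # failed/retried jobs still bill their DPU time
cross_region_per_gb = 0.01     # $/GB when compute sits outside us-east-1
realistic = best_case * (1 + failure_rate) + dataset_gb * cross_region_per_gb

print(f"best case: ${best_case:.2f}, friction-adjusted: ${realistic:.2f}")
# best case: $56.50, friction-adjusted: $119.00 -- already close to the real $127 bill,
# before counting idle time during the quota wait.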
Author’s Reflection: The Price of Blind Trust
“I now run every AI-generated cost estimate through the ‘intern test.’ If a junior engineer gave me this number, what questions would I ask? Did they account for retries? Did they check our account limits? AI assistants have juniors’ confidence without their experiential scars. The $300 estimate was mathematically sound but operationally naive—like calculating fuel cost for a road trip while ignoring speed limits and traffic.”
Navigating the IAM Permission Minefield
Codex began confidently, generating a CloudFormation template for the entire stack. But as a complete AWS novice, I needed click-by-click console instructions. Here, Codex’s knowledge gaps became glaring.
Step-by-Step Breakdown: The Bucket Policy Failure
- Instruction Given: “Create an S3 bucket for output and grant Glue write permissions”
- Codex’s Guidance: “Attach this bucket policy…”

  {
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "glue.amazonaws.com"},
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-cc-output/*"
    }]
  }

- Reality Check: After applying the policy, my Glue job failed with AccessDenied. The error message was generic: “Unable to write to S3.”
- Debugging Loop: I pasted the error into Codex. It diagnosed, “The role needs s3:PutObject.” I replied, “Already has it.” It responded, “Check the trust relationship.” The trust relationship was correct. Finally, after three more back-and-forths, Codex suggested, “Maybe you need a VPC endpoint for S3.”
The root cause? Codex had omitted that my account’s default encryption settings required explicit KMS permissions. But more importantly, the AI lacked a systematic debugging mental model. It guessed at common issues rather than guiding me to check CloudTrail logs for the exact denied action.
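For what it's worth, the systematic version of that debugging loop fits in a few lines. The sketch below is my own, not Codex's output; it assumes the default management-event trail is available. S3 data events like the PutObject itself need a separate data-event trail, but the KMS calls behind bucket encryption show up here with the exact errorCode.

# Sketch: ask CloudTrail which call was actually denied instead of guessing.
import json
from datetime import datetime, timedelta, timezone
import boto3

ct = boto3.client("cloudtrail", region_name="eu-west-1")
resp = ct.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventSource", "AttributeValue": "kms.amazonaws.com"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=2),
    EndTime=datetime.now(timezone.utc),
)
for ev in resp["Events"]:
    detail = json.loads(ev["CloudTrailEvent"])
    if "errorCode" in detail:
        # e.g. GenerateDataKey AccessDenied pinpoints the missing KMS permission
        print(detail["eventName"], detail["errorCode"], detail.get("errorMessage", ""))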
Application Scenario: The Screenshot Feedback Loop
This pattern repeated across 15+ issues. I’d:
1. Follow Codex’s instructions
2. Encounter a non-existent option or cryptic error
3. Take a screenshot of the console
4. Feed it to Codex
5. Get a corrected, working solution
This “screenshot-driven development” proved that AI assistants are excellent at pattern-matching GUI states but terrible at anticipating them. A well-designed CUA (Computer-Using Agent) would ace this; current LLMs require humans to bridge the perception gap.
Code Example: The Glue Job That Wouldn’t Run
# Codex's initial script—technically correct but operationally incomplete
import sys
!{sys.executable} -m pip install warcio  # Failed: no internet access (and "!" is notebook-only syntax, invalid in a plain Glue script)
# After screenshot feedback, Codex fixed it:
# 1. Added VPC endpoint for S3
# 2. Included warcio in Glue job arguments
# 3. Added KMS permissions
# 4. Set proper Python version (3.9 vs 3.7)
# The final working version required 4 iterations, each triggered by human feedback
# that Codex didn't know to request upfront.
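For reference, the dependency fix ultimately lives in job arguments rather than in-script pip calls. A minimal boto3 sketch is below; the job name is a hypothetical placeholder, and --additional-python-modules assumes Glue 2.0 or later.

# Sketch: supply warcio via Glue job arguments instead of pip at runtime.
# "cc-url-counts" is a hypothetical job name.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
glue.start_job_run(
    JobName="cc-url-counts",
    Arguments={
        "--additional-python-modules": "warcio==1.7.4",  # installed when the job starts (Glue 2.0+)
    },
)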
The Production Meltdown: Quota Hell
After three days of babysitting, I submitted the full job—80,000 WARC files, 100 concurrent Glue workers. The next morning, 99 workers were marked FAILED with ClientException: vCPU quota exceeded.
The 400-Core Illusion
- Needed: 100 workers × 4 vCPU = 400 vCPU
- Default Quota: 64 vCPU
- Codex’s Assumption: “AWS will queue excess tasks automatically”
- Reality: Tasks fail immediately, no queuing (see the pre-flight check sketch below)
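This gap is exactly what a pre-flight script should catch. A minimal sketch using the Service Quotas API follows; it matches quota names by substring because quota codes differ per service, so treat the filter as an assumption to verify for your own account.

# Sketch: compare the applied vCPU quota against what the submission needs.
import boto3

NEEDED_VCPUS = 100 * 4  # 100 workers x 4 vCPU each

sq = boto3.client("service-quotas", region_name="us-east-1")
resp = sq.list_service_quotas(ServiceCode="batch", MaxResults=100)
for quota in resp["Quotas"]:
    if "vCPU" in quota["QuotaName"]:
        applied = quota["Value"]
        print(f'{quota["QuotaName"]}: {applied:.0f} (need {NEEDED_VCPUS})')
        if applied < NEEDED_VCPUS:
            print("-> request an increase BEFORE submitting the full job")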
Application Scenario: The Support Ticket Black Hole
I had Codex draft a quota increase request:
Subject: Urgent vCPU increase needed for data analytics project
Dear AWS Support,
I require 500 vCPUs for a Common Crawl analysis project.
Current usage: 0 (new account)
Expected duration: 40 hours
Business justification: Academic research
...
The response cycle:
- Day 1: Auto-reply asking for more details
- Day 2: Support agent requests “business impact statement”
- Day 3: Escalation, asks for “architectural diagram”
- Day 5: Approval for 200 vCPU (still insufficient)
- Day 7: Still waiting for final 500 vCPU approval
During this time, the only viable strategy was manually launching 2 jobs every few hours. Codex, ever helpful, volunteered to write a monitoring script:
#!/bin/bash
while true; do
  # Count currently running jobs; quote the variable so an empty result doesn't break the test
  RUNNING=$(aws batch list-jobs --status RUNNING | jq '.jobSummaryList | length')
  if [ "$RUNNING" -lt 2 ]; then
    aws batch submit-job --job-name cc-$(date +%s) ...
  fi
  sleep 300
done
This “solution” epitomized the AI’s limitations: it optimized the wrong problem. Instead of asking, “Why are we constrained to 2 jobs?” it built automation around the constraint.
The Billing Horror
For the first 3 days, my Cost Explorer showed essentially nothing; then the $127 alert landed. The breakdown:
- S3 requests: $8.30
- Glue DPUs: $42.70
- Cross-region data transfer: $68.00 ($0.01/GB)
Codex had never mentioned that reading from us-east-1 (Common Crawl’s region) while processing in eu-west-1 (my choice for “GDPR compliance”) incurs egress fees. When confronted, it apologized and suggested consolidating everything in us-east-1—but the damage was done.
Author’s Reflection: Cloud’s Broken Promise
“AWS markets ‘elastic scale.’ My experience was ‘bureaucratic scale.’ If this had been a Hacker News launch moment—sudden traffic spike, critical data analysis needed in 24 hours—I would have missed the window entirely. The cloud’s elasticity is gated behind human approval chains that move at enterprise, not internet, speed. Codex wrote beautiful CloudFormation templates, but couldn’t template its way around a support ticket.”
The Hetzner Minimalism: Claude’s DIY Path
Can a traditional bare-metal server outperform cloud infrastructure for batch analytics? Yes, dramatically. Claude’s Hetzner path, after initial over-engineering, delivered 5x faster progress at 1/10th the cost by eliminating organizational friction and embracing radical simplicity.
Summary: The bare-metal path proved that predictable batch workloads don’t need cloud complexity. Claude’s initial mistakes mirrored Codex’s (over-engineering), but the simpler substrate made corrections trivial.
The Server Selection: Right-Sizing Without Over-Complicating
When I rejected Claude’s $20k AWS estimate, it pivoted instantly: “What about renting a Hetzner server and running Python scripts?” This flexibility—choosing thrift over trendiness—was Claude’s first win.
Technical Breakdown: AX41-NVMe Specifications
Claude recommended:
- CPU: AMD Ryzen 5 3600 (6 cores, 12 threads)
- RAM: 64GB DDR4
- Storage: 2x1TB NVMe SSD (RAID 0)
- Network: 1 Gbps unmetered
- Cost: €45.59/month (~$50)
Application Scenario: The Virtual vs Physical Core Decision
I questioned Claude: “Why not the cheaper €8.95 CPX21 plan?” Its analysis was nuanced:
# Claude's reasoning (paraphrased)
if workload == "IO-intensive" and file_count > 50000:
# CPX21: 2 vCPUs (shared), 20GB RAM (will swap), 40GB SSD (will fill)
# AX41: 12 threads (dedicated), 64GB RAM, 2TB NVMe
    # For 80k files × 40 sec/file ≈ 37 single-core days of work, stability > cost
recommendation = "AX41-NVMe"
This trade-off analysis was spot-on. Claude demonstrated an understanding of noisy neighbor problems in virtualization and sustained performance needs for long-running jobs—insights Codex lacked when blindly recommending auto-scaling.
The Over-Engineering Trap: When AI Gets Creative
Claude’s first implementation was elegant—almost too elegant. It introduced a Bloom filter for deduplication:
# Initial single-threaded version (Claude's design)
from pybloom_live import ScalableBloomFilter
from warcio.archiveiterator import ArchiveIterator

# Bloom filter: 10M entries, 0.1% false positive rate
bloom = ScalableBloomFilter(
    initial_capacity=10_000_000,
    error_rate=0.001,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH
)

def process_warc(file_path):
    with open(file_path, 'rb') as f:
        for record in ArchiveIterator(f):
            url = record.rec_headers.get_header('WARC-Target-URI')
            if url not in bloom:  # Fast-path dedup
                bloom.add(url)
                # ... sqlite insert logic
Application Scenario: The Single-Core Bottleneck
After 100 files, I ran htop. CPU usage: 100% on one core, 0% on the other 11. Claude’s explanation? “This is expected for IO-bound workloads.”
I pushed back: “Decompressing gzip is CPU-bound.” Claude conceded and proposed a multiprocessing architecture with producer-consumer queues:
# Claude's complex parallel design
from multiprocessing import Process, Queue, Lock
def producer(file_q, paths):
for p in paths:
file_q.put(p)
def worker(file_q, result_q):
while True:
path = file_q.get()
stats = process_warc(path) # CPU-heavy
result_q.put(stats)
def merger(result_q, db_lock):
while True:
batch = result_q.get()
with db_lock:
write_batch(batch)
# Launch 8 workers...
The Result: Still only one core doing real work. Why? Serialization overhead: multiprocessing.Queue pickles every item, so shuttling millions of URL strings between processes, plus funneling every batch through a single lock-guarded merger, created a bottleneck worse than the original single-threaded code.
Author’s Reflection: Complexity as a Status Symbol
“Claude was showing off. It knew the textbook answer to parallelism (producer-consumer pattern) but ignored the fundamental rule: match the solution to the problem. My task was embarrassingly parallel—process 80,000 independent files. The simplest solution is N independent processes, not a coordinated architecture. Claude’s first instinct was to build a system; my instinct was to run a script N times. This is the difference between academic knowledge and engineering wisdom.”
The Simplicity Breakthrough: Shard, Run, Merge
My intervention was blunt: “Just shard the input list, run separate processes, merge SQLite files at the end.”
Claude embraced it instantly. The final architecture:
#!/bin/bash
# shard_and_run.sh (Claude's final launch script)
# Split the file list into 8 chunks: shard_aa .. shard_ah
split -n l/8 warc.paths shard_

# Launch one tmux session per shard chunk
i=0
for shard in shard_a?; do
  tmux new-session -d -s "worker-$i" \
    "python analyze_shard.py --shard-list=$shard --db=shard_$i.db"
  i=$((i+1))
done

# Monitor progress
tmux attach -t worker-0
# analyze_shard.py (simplified final version)
import sqlite3
from warcio.archiveiterator import ArchiveIterator

def process_shard(paths, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS url_counts (url TEXT PRIMARY KEY, count INTEGER)')
    for warc_path in paths:
        with open(warc_path, 'rb') as f:
            for record in ArchiveIterator(f):
                url = record.rec_headers.get_header('WARC-Target-URI')
                if url is None:  # request/metadata records carry no target URI
                    continue
                conn.execute('INSERT OR IGNORE INTO url_counts VALUES (?, 0)', (url,))
                conn.execute('UPDATE url_counts SET count = count + 1 WHERE url = ?', (url,))
        conn.commit()  # one commit per WARC file keeps transactions bounded
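The launch script passes --shard-list and --db flags, so the worker needs a small entry point around process_shard. The article only shows the core; a minimal sketch of that glue (my assumption, not the original file) would be:

# Sketch of the argument handling shard_and_run.sh assumes; the real work
# is done by process_shard() above.
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Count URLs in one shard of WARC files")
    parser.add_argument("--shard-list", required=True, help="text file listing WARC paths, one per line")
    parser.add_argument("--db", required=True, help="SQLite output file for this shard")
    args = parser.parse_args()

    with open(args.shard_list) as fh:
        paths = [line.strip() for line in fh if line.strip()]
    process_shard(paths, args.db)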
Application Scenario: The SQLite Merge Pattern
After 3 days, I had 8 SQLite files. The merge was trivial:
-- merge_shards.sql
ATTACH 'shard_0.db' AS s0;
ATTACH 'shard_1.db' AS s1;
-- ... attach all 8
CREATE TABLE final_stats AS
SELECT url, SUM(count) as total_count
FROM (
SELECT * FROM s0.url_counts UNION ALL
SELECT * FROM s1.url_counts UNION ALL
-- ... all shards
) GROUP BY url;
This took 20 minutes and peaked at 12GB RAM—well within the AX41’s capacity.
Performance Metrics
- Throughput: 8 workers × ~500 files/day = 4,000 files/day
- ETA: 80,000 files ÷ 4,000/day = 20 days → optimized to 5 days after CPU saturation tuning
- Cost: €45.59/month ÷ 30 × 8 days = €12.16 total
- Stability: Zero failures in 72 hours
Author’s Reflection: The Beauty of Boring Infrastructure
“My Hetzner setup has no Kubernetes, no Docker, no message queue. It has tmux, SQLite, and a bash for-loop. It is boring, predictable, and finished. The AWS setup had 12 managed services, 3 IAM roles, 2 VPC endpoints, and didn’t work. There’s a lesson here: infrastructure should be as simple as possible, but no simpler. AI assistants tend to skip straight to ‘no simpler’ without passing through ‘as simple as possible’ first.”
Head-to-Head Comparison: Cloud vs Metal
What are the fundamental trade-offs between cloud and bare metal for batch analytics? The comparison is stark: cloud offers theoretical elasticity marred by organizational friction; bare metal offers predictable performance and transparent costs. For known workloads, metal wins on every metric that matters.
Summary: The table below quantifies the divergence. The key insight is that AI assistants can optimize technical parameters but cannot model human approval delays or billing opacity.
| Metric | Codex + AWS | Claude + Hetzner |
|---|---|---|
| Initial Cost Estimate | $300 (off by 3-4x) | €45.59 (accurate) |
| Actual Spend | $127+ (10% progress) | €12.16 (complete) |
| Time to Production | 7 days (quota limbo) | 0.5 days (ready) |
| Effective Throughput | 2 concurrent jobs (limbo) | 8 concurrent jobs (saturated) |
| Failure Mode | Opaque quota errors | OOM (predictable) |
| Debuggability | Requires CloudTrail expertise | htop + strace |
| Scaling Model | Horizontal (theoretical) | Vertical (instant) |
| Billing Transparency | Opaque, delayed alerts | Fixed, known upfront |
The Cost Structure Deep Dive
AWS Hidden Costs (Codex Missed)
- Cross-region data transfer: at least $50 was unavoidable with the data in us-east-1 and compute elsewhere
- S3 API requests: ListObjects calls at $0.005/1,000 requests—adds up with 80k files
- Glue minimum runtime: 1-minute minimum × failed jobs = wasted money
- Support tier: Free tier has 24-hour response time; my €45 Hetzner server gets me 5-minute support responses
Hetzner True Cost Structure
- Fixed: €45.59/month, no variable components
- Bandwidth: 1 Gbps unmetered (no egress fees)
- Scaling: Upgrade to AX101 (128GB RAM) costs €89/month, change takes 2 minutes
- Predictability: 100%
Application Scenario: The CFO Conversation
Imagine presenting both options:
- AWS: “We’ll spend $300, maybe $800, could be $2,000 depending on support response times and data transfer patterns.”
- Hetzner: “€45.59 for the month, guaranteed.”
For a personal project or startup with tight cash flow, the latter isn’t just cheaper—it’s knowable.
The Failure Mode Analysis
AWS: Death by a Thousand Cuts
- Quota exhaustion: ClientException with no retry queue
- IAM policy lag: Changes take 30 seconds to propagate, causing transient failures
- Service limit interplay: Hit vCPU quota → reduce workers → hit S3 rate limit instead
- Cost alerts arrive 4 days late: By then, you’ve already overspent
Hetzner: Failures You Can grep
- OOM killer: dmesg | grep -i kill shows exactly which process died
- Disk full: df -h shows 100% usage; solution: rm or upgrade
- Network bottleneck: iftop shows 1Gbps saturation; solution: compress data
Application Scenario: The 3 AM Page
With AWS, a 3 AM failure means logging into 3 consoles (Batch, CloudWatch, IAM) to decipher cryptic error codes. With Hetzner, you ssh in, run htop, and see that process #7 is dead. You restart it and go back to bed.
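Even the restart is scriptable. A small watchdog sketch follows; the worker-N session names match the launch script above, but the relaunch command is a placeholder, since a real resume would need to skip files already recorded in the shard's database to avoid double counting.

# Sketch: relaunch any dead tmux worker. The relaunch command is a placeholder;
# add resume logic before running this unattended.
import subprocess

for i in range(8):
    session = f"worker-{i}"
    alive = subprocess.run(["tmux", "has-session", "-t", session],
                           capture_output=True).returncode == 0
    if not alive:
        print(f"{session} is down, relaunching")
        subprocess.run([
            "tmux", "new-session", "-d", "-s", session,
            f"python analyze_shard.py --shard-list=shard_{i} --db=shard_{i}.db",
        ], check=True)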
Author’s Reflection: Error Messages as a Service
“AWS’s error messages are written by lawyers: ‘An error occurred (ClientException) when calling the SubmitJob operation.’ Hetzner’s error messages are written by kernel developers: ‘Out of memory: Kill process 1234 (python) score 856 or sacrifice child.’ One is designed to avoid liability; the other is designed to help you fix the problem. AI assistants trained on public data mimic the former style because it’s more common in documentation. They need explicit prompting to think like a sysadmin reading dmesg.”
Implementation Playbook: Lessons for AI-Assisted Development
How can you safely leverage AI coding assistants for infrastructure projects? Adopt a “defensive AI” methodology: treat AI output as a junior engineer’s first draft—verify assumptions, demand worst-case analysis, and maintain architectural veto power.
Summary: These five principles distill the experiment’s hard-won wisdom. They prioritize human intuition over AI confidence and operational simplicity over theoretical elegance.
Principle 1: The Sanity Check Multiplier
Rule: Multiply all AI estimates (cost, time, performance) by 3.
Application Scenario: Realistic Project Planning
When Codex said “$300, 40 hours,” I should have immediately re-estimated:
- Cost: $300 × 3 = $900 (close to actual)
- Time: 40 hours × 3 = 120 hours (including quota delays)
- Buffer: Propose a $1,200 budget and 2-week timeline to stakeholders
How to Enforce It:
# Prompt engineering for realism
"""
You are an experienced SRE with 10 years of AWS scar tissue.
The user asks for a cost estimate. Provide:
1. Best-case (theoretical) cost
2. Realistic cost (include 20% failure rate, cross-region fees)
3. Worst-case cost (include 5-day support delay, 50% idle capacity waste)
"""
Principle 2: The Simplicity Forcing Function
Rule: When AI proposes a complex architecture (queues, orchestrators, state machines), force a simpler alternative.
Real-World Application:
User: "How should I parallelize this?"
Claude: "Use multiprocessing.Queue with a producer-consumer pattern..."
User: "Stop. What's the simplest possible way?"
Claude: "Run 8 separate processes and merge results."
User: "Do that."
Codex’s Missed Opportunity:
When I hit the vCPU quota, Codex proposed a complex “job throttling scheduler” using DynamoDB for state management. The simple answer was: “Switch to Hetzner.” I had to suggest that; Codex never considered abandoning the sunk cost.
Principle 3: Observable by Default
Rule: Require AI to generate monitoring before business logic.
Application Scenario: Smoke Test First
Correct development order:
1. Monitoring: watch -n 60 'sqlite3 db "SELECT COUNT(*), MAX(ts) FROM logs"'
2. Smoke test: Process 100 files, verify end-to-end
3. Quota check: aws service-quotas list-service-quotas --service-code batch
4. Business logic: The actual processing code
Codex’s Failure: Generated 200 lines of Spark code before any error handling or progress logging. When jobs failed, I had zero visibility.
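The cheap fix is a heartbeat table the watch one-liner above can query. A minimal sketch follows, assuming the logs(ts, warc_path) schema implied by that command; adapt the columns to whatever you actually track.

# Sketch: day-0 observability, one heartbeat row per processed file, matching
# the watch/sqlite3 monitor in step 1 above.
import sqlite3
import time

def init_progress(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts INTEGER, warc_path TEXT)")

def log_progress(conn: sqlite3.Connection, warc_path: str) -> None:
    conn.execute("INSERT INTO logs VALUES (?, ?)", (int(time.time()), warc_path))
    conn.commit()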
Principle 4: Architectural Veto Power
Rule: Preserve human intuition for complexity smell.
Complexity Heuristics:
- >100 lines of config YAML: Over-engineered
- >3 managed services for one task: Over-engineered
- Requires reading 3 docs to understand: Over-engineered
- Can’t ssh and ps aux to debug: Over-engineered
Hetzner’s Advantage: Every problem reduces to Linux fundamentals. AWS problems require tribal knowledge of service interplay.
Principle 5: Document the Decision Log
Rule: Record every AI-suggested decision, its rationale, and its outcome.
Application Scenario: AI Chat as Technical Debt
I now structure every AI interaction:
## Project: Common Crawl Analyzer
## Date: 2024-10-15
## Assistant: Codex
## Decision: Using S3 Select for initial filtering
## Rationale: Reduces data scanned by 80%
## Trade-off: Costs $0.002/GB; alternative is download full files
## Outcome: FAILED—S3 Select doesn't support WARC format
## Lesson: Validate format support before architecture decisions
This creates a searchable knowledge base. When Claude later suggested S3 Select, I could reference this log instead of rediscovering the limitation.
Author’s Reflection: The Human-AI Pairing Contract
“Working with AI assistants is like pair programming with someone who has read every Stack Overflow answer but never deployed on a Friday. They bring infinite patience and perfect syntax; you bring scar tissue and spidey-sense. The contract is simple: AI suggests, human vets. AI writes code, human owns architecture. AI optimizes functions, human optimizes friction. Violate this contract, and you end up with $127 in AWS bills and a support ticket process straight out of the 1980s.”
Action Checklist: Your Turn to Experiment
Can you replicate this comparison for your own analytics workload? Yes. Follow this sequential checklist to minimize risk and extract maximum learning from AI-assisted infrastructure choices.
1. Define Guardrails First
   - [ ] Set a hard budget cap ($500?)
   - [ ] Set a maximum timeline (1 weekend? 1 week?)
   - [ ] List your “non-negotiables” (data locality? open-source only?)
2. AI-Assisted Quota Pre-Check
   - [ ] Ask Codex/Claude: “Generate a quota verification script for [AWS/Azure/GCP]”
   - [ ] Run the script before writing any business logic
   - [ ] If quotas are within 2x of your needs, request increases immediately
3. Build the Smallest Viable Pipeline
   - [ ] Process 10 files end-to-end
   - [ ] Measure: time per file, CPU/RAM usage, cost
   - [ ] Extrapolate to full dataset using worst-case multipliers (3×)
   - [ ] If extrapolation exceeds budget, pivot now
4. Deploy Monitoring on Day 0
   - [ ] Cloud: CloudWatch Dashboard with cost alarms
   - [ ] Bare metal: tmux + watch + sqlite3 for progress
   - [ ] Alert threshold: 50% of budget within first 20% of timeline
5. Maintain an Architecture Decision Record (ADR)
   - [ ] Document every AI-suggested choice
   - [ ] Note your human overrides (“Rejected Claude’s K8s proposal—using tmux”)
   - [ ] Review ADRs weekly to calibrate AI suggestions
6. Know Your Exit Strategy
   - [ ] AWS → Hetzner: Use rclone to sync S3 data, spin up AX41, reprocess
   - [ ] Hetzner → AWS: Use aws s3 sync, convert SQLite to S3+Athena
   - [ ] Decide exit criteria: “If ETA exceeds 10 days, switch to metal”
7. Post-Mortem Template
   - [ ] What did AI get right? (Code quality, API knowledge)
   - [ ] What did AI miss? (Quotas, billing, simplicity)
   - [ ] What will you change next time? (Earlier quota checks, simplicity forcing function)
One-Page Overview
Project: AI-Assisted Analysis of Common Crawl Dataset (5TB)
Goal: URL frequency counting with deduplication
Duration: 7 days active, 2 weeks total
AI Assistants: Claude Code (Opus 4.5), GPT-5.2 Codex (max)
Path Comparison at a Glance
| Aspect | AWS + Codex | Hetzner + Claude |
|---|---|---|
| Core Idea | Managed serverless (Glue, Batch) | Single beefy server, parallel processes |
| Setup Time | 4 days (IAM, quotas) | 2 hours (install Python, tmux) |
| Cost Predictability | Poor (delayed billing) | Perfect (fixed monthly) |
| Performance | 2 jobs (quota-limited) | 8 jobs (CPU-saturated) |
| Total Cost | $127+ (unfinished) | €12.16 (complete) |
| Key Failure | vCPU quota exhaustion | Initial over-engineering |
| Resolution | Manual job launching | Simplified to shell scripts |
| Final Architecture | 3 services, 2 IAM roles, 1 frustrated human | 1 server, 1 SQLite, 1 bash loop |
Core Takeaways
- AI assistants are junior engineers with infinite knowledge but zero scar tissue. They generate correct code but cannot evaluate operational risk.
- Cloud elasticity is an illusion for new accounts. Default quotas are tiny; scaling requires human approval measured in days, not seconds.
- Simplicity outperforms sophistication. tmux + sqlite beat Glue + DynamoDB for this workload.
- Cost transparency is a feature. Predictable €45.59/month enables confident execution; opaque AWS billing induces hourly anxiety.
- Observable systems are non-negotiable. AI won’t prioritize monitoring; you must demand it.
Single-Sentence Recommendation
For predictable batch workloads, rent a physical server and use AI for code generation, not architecture; for unpredictable, user-facing services, invest in cloud expertise first, then layer AI assistance on top.
FAQ: Anticipating Your Questions
1. Why didn’t Claude consider AWS’s spot instances or savings plans to reduce costs?
Claude’s $20k estimate assumed on-demand pricing for immediate execution. When I specifically asked about cost optimization, it did suggest spot instances (70% discount) and Compute Savings Plans (up to 60% off). However, spot instances are interruptible—unsuitable for 40-hour Glue jobs—and savings plans require 1-year commitments, defeating the personal project flexibility. The core issue wasn’t pricing model, but quota limitations that no discount can solve.
2. Could Codex have used AWS’s aws s3 sync to download WARC files and process them locally on EC2?
Yes, and this would have avoided Glue’s complexity. Codex actually proposed this on day 4 as a fallback. The estimated cost was roughly $75 of EC2 time plus $50 for 5TB of S3 egress, about $125 total. The blocker was my new account’s EC2 spot instance quota: default 0 vCPUs, requiring the same 5-day support ticket process. This illustrates that even simple architectures hit quota walls.
3. How did Claude’s Bloom filter perform versus SQLite’s built-in indexing?
The Bloom filter added ~5GB RAM overhead and reduced SQLite writes by ~30% (filtering duplicates early). However, the final merge still required a full table scan. For this workload, the performance gain was negligible (<10% total time) compared to just letting SQLite handle dedup with INSERT OR IGNORE. The Bloom filter was premature optimization—classic AI behavior of applying textbook solutions without measuring impact.
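For completeness, “letting SQLite handle dedup” can even be a single statement: SQLite’s UPSERT syntax (3.24+) folds the INSERT OR IGNORE plus UPDATE pair from analyze_shard.py into one query. A minimal sketch:

# Sketch: single-statement counting via SQLite UPSERT (requires SQLite >= 3.24).
import sqlite3

conn = sqlite3.connect("shard_0.db")
conn.execute("CREATE TABLE IF NOT EXISTS url_counts (url TEXT PRIMARY KEY, count INTEGER)")
conn.execute(
    "INSERT INTO url_counts VALUES (?, 1) "
    "ON CONFLICT(url) DO UPDATE SET count = count + 1",
    ("https://example.com/",),
)
conn.commit()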
4. What would happen if you ran this experiment on Google Cloud or Azure?
The experience would likely have been similar on Google Cloud or Azure. All three major clouds have:
- Default quotas: GCP’s Compute Engine has a 24 vCPU default; Azure Batch has a 20 core default
- Complex IAM: GCP’s IAM Conditions and Azure’s RBAC are equally labyrinthine
- Delayed billing: All three have 24-48 hour cost propagation delays

I didn’t try the other clouds because time was limited and I expected the additional learnings to be minor. The core insight—AI can’t model organizational friction—is cloud-agnostic.
5. How would you integrate both approaches for a hybrid pipeline?
A pragmatic hybrid would be:
- Hetzner AX41 for bulk processing (cost-effective, predictable)
- AWS S3 for durable result storage (cheaper than Hetzner’s block storage for long-term)
- Cloudflare R2 as an alternative (no egress fees for serving results)
AI assistants can write the rclone sync script, but you must decide the sync frequency (hourly? on completion?) based on your read/write patterns.
6. What specific prompt engineering could have improved Codex’s cost estimate?
Instead of “estimate the cost,” I should have asked:
"Act as a CFO with AWS cost anomaly detection experience.
Estimate worst-case cost including:
- 20% job failure rate with retries
- Cross-region data egress at $0.01/GB
- S3 API requests for 80k files at $0.005/1k
- Minimum 1-minute billing for 50% idle time
- Support ticket delay costing 3 days of engineer time at $150/hr"
This would have yielded a realistic ~$1,200 estimate. The key is forcing AI to model friction, not just usage.
7. How do you prevent AI from suggesting deprecated AWS services?
During the experiment, Codex recommended console options that simply don’t exist. To prevent this:
- Pin API versions: “Use boto3 version 1.35.0 or later”
- Specify console region: “For the us-east-1 console, which may differ from others”
- Enable hallucination guardrails: “If you suggest a UI path, verify it exists in the current console design”
Most importantly: take screenshots and feed them back. Visual confirmation beats verbal assurances.
8. Was there any task where AWS + Codex definitively outperformed Hetzner + Claude?
Yes—initial setup of credentials and SDK configuration. Codex generated perfect ~/.aws/config and ~/.aws/credentials files and wrote the boto3 client initialization with proper error handling. Claude’s Hetzner setup required manual SSH key generation and scp of scripts, which felt archaic. For cloud-native IAM complexity, AI’s knowledge base is superior to traditional sysadmin workflows.
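For the curious, “boto3 client initialization with proper error handling” amounted to something like the sketch below; the profile and region names are placeholders, and the structure is reconstructed from memory rather than verbatim output.

# Sketch: defensive boto3 setup that fails fast with a readable message instead of a
# stack trace deep inside the first API call. Profile/region names are placeholders.
import sys
import boto3
from botocore.exceptions import ClientError, NoCredentialsError, ProfileNotFound

def make_s3_client(profile: str = "cc-analysis", region: str = "us-east-1"):
    try:
        session = boto3.Session(profile_name=profile, region_name=region)
        s3 = session.client("s3")
        s3.list_buckets()  # cheap call that validates credentials up front
        return s3
    except (ProfileNotFound, NoCredentialsError, ClientError) as exc:
        sys.exit(f"AWS setup problem for profile '{profile}': {exc}")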

