Exploring the Past: Crafting a 19th-Century “Time Capsule” Language Model
Introduction
Imagine stepping back in time to chat with someone from 19th-century London—an era of horse-drawn carriages, gas lamps, and the hum of the Industrial Revolution. What if an AI could bring that experience to life? That’s the heart of the TimeCapsule LLM project: a language model trained solely on texts from 1800 to 1850 London, designed to think, speak, and “live” like a person from that time. This article takes you through the project’s purpose, how it’s being built, and what it’s achieved so far—all while showing how technology can connect us to history in a fresh, authentic way.
What Is TimeCapsule LLM?
TimeCapsule LLM is a unique language model built from the ground up using the nanoGPT framework, created by Andrej Karpathy. Unlike most AI models that pull from a mix of modern and historical data, this one is different. It’s trained only on texts from London between 1800 and 1850—think books, newspapers, and legal papers from that period. The result? An AI that doesn’t know about smartphones, airplanes, or anything after 1850. It’s like a digital time traveler, speaking the English of the early 19th century and reflecting the knowledge and attitudes of that world.
Why Build a Time Capsule AI?
Starting Fresh for Authenticity
You might ask, “Why not tweak an existing AI like GPT-2 instead of starting from scratch?” Here’s the catch: modern AI models already know too much. Even if you adjust them, traces of today’s world—like the internet or 21st-century slang—stick around. TimeCapsule LLM wipes the slate clean. By training it only on historical texts, it stays true to the past, free from modern ideas, modern language, and vague guesses.
What the Project Aims to Do
This project has clear goals:
- No Modern Knowledge: The AI shouldn’t know about anything after 1850—like electricity as we understand it today or global events that came later.
- Old-Fashioned Language: It responds in the style of 19th-century English—formal, proper, and a bit quaint by today’s standards.
- Sticking to the Facts: It avoids making up details or guessing beyond what’s in its historical texts.
Picture it as a conversation with someone who’s never left 1840s London—a window into how they thought and spoke.
How the Project Is Coming Along
Picking the Perfect Era
The years 1800 to 1850 in London were a goldmine of change. The Industrial Revolution was in full swing—factories popping up, steam engines chugging along, and cities growing fast. It’s also a time with plenty of written records: novels, legal documents, and newspapers. These make it possible to train an AI that captures the spirit of the age.
Collecting the Right Texts
By July 9, 2025, the project had gathered 50 text files of public domain books from this period. The plan is to collect 500 to 600 texts eventually. Why so many? More texts mean the AI can learn a wider range of words, ideas, and styles—making it sharper and more accurate. A big challenge is keeping these texts pure. Modern notes or edits in old books could sneak in today’s language or ideas, so the team carefully cleans each file to preserve its historical flavor.
First Steps in Training
On July 13, 2025, the team ran the first training session with 187MB of text data. The early results were rough but exciting. For example, when asked, “Who art Henry?” the AI replied, “I know that man, I have did not a black, the storm.” It’s a bit jumbled—like a child learning to talk—but it shows the AI picking up the old-timey tone (“art” instead of “are”) and trying to respond based on what it knows. With more data and training, these responses will get clearer.
What’s Next?
The project has big plans ahead:
- More Texts: Expand the collection to 500-600 books for richer learning.
- Cleaner Data: Improve the process to strip out any modern bits that slip into the files.
- Better Model: Train a stronger Version 1 that can hold real conversations.
How You Can Build Your Own Time Capsule AI
Want to create an AI stuck in a different time or place? Here’s how to do it, step by step, based on the TimeCapsule LLM approach.
Step 1: Gather and Clean Historical Texts
- Finding Texts: Look for free, public domain works from your chosen era. A script like download_texts_improved.py can help grab these automatically from online archives.
- Cleaning Up: Use a tool like prepare_dataset.py to remove modern extras—think page numbers, editor’s notes, or scanning errors—so the AI only learns from the original words. A rough sketch of both steps appears after this list.
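The project’s own download_texts_improved.py and prepare_dataset.py aren’t reproduced here, but the idea is simple enough to sketch. The snippet below assumes Project Gutenberg as the source; the book IDs, URL pattern, and output folder are illustrative placeholders rather than the project’s actual choices.

```python
# Minimal sketch: fetch public domain texts and strip modern framing.
# Book IDs, URL pattern, and paths are illustrative placeholders.
import os
import re
import requests

BOOK_IDS = [1400, 580]  # hypothetical examples; pick era-appropriate works

def download_gutenberg(book_id: int) -> str:
    # One common plain-text URL pattern on Project Gutenberg
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

def strip_modern_framing(text: str) -> str:
    # Cut away the Project Gutenberg header and footer, which are modern boilerplate
    start = re.search(r"\*\*\* START OF .*? \*\*\*", text)
    end = re.search(r"\*\*\* END OF .*? \*\*\*", text)
    if start and end:
        text = text[start.end():end.start()]
    # Drop transcriber notes, a common source of modern wording
    text = re.sub(r"\[Transcriber's Note:.*?\]", "", text, flags=re.DOTALL)
    return text.strip()

if __name__ == "__main__":
    os.makedirs("corpus", exist_ok=True)
    for book_id in BOOK_IDS:
        clean = strip_modern_framing(download_gutenberg(book_id))
        with open(f"corpus/{book_id}.txt", "w", encoding="utf-8") as f:
            f.write(clean)
```

In practice the cleaning step is where most of the manual work goes: modern prefaces, footnotes, and OCR artifacts rarely follow a single pattern, so the regexes above are only a starting point.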
Step 2: Make a Custom Tokenizer
- Why It Matters: A tokenizer breaks text into pieces the AI can understand. For old texts, you need one tailored to their unique spelling, grammar, and vocabulary.
- How to Do It: Run a script like train_tokenizer.py or train_tokenizer_hf.py on your cleaned texts. This creates two files—vocab.json and merges.txt—that teach the AI your era’s language. A sketch of the idea appears after this list.
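The project’s train_tokenizer_hf.py isn’t reproduced here, but a byte-level BPE tokenizer trained with the Hugging Face tokenizers library produces exactly that pair of files. The folder names and vocabulary size below are assumptions for illustration, not the project’s settings.

```python
# Minimal sketch: train a byte-level BPE tokenizer on the cleaned corpus.
# Folder names and vocab_size are illustrative assumptions.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

corpus_files = [str(p) for p in Path("corpus").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=16000,                 # modest vocab for a modest corpus
    min_frequency=2,                  # skip one-off OCR artifacts
    special_tokens=["<|endoftext|>"],
)

# save_model writes vocab.json and merges.txt into the target directory
Path("tokenizer").mkdir(exist_ok=True)
tokenizer.save_model("tokenizer")
```

Training the tokenizer on period texts alone means spellings like “connexion” or “shew” get their own stable tokens instead of being split apart as oddities.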
Step 3: Train the Model
- Tools: Use nanoGPT or a similar framework that’s easy to work with.
- Gear: You don’t need a supercomputer. The project ran on a GeForce RTX 4060 GPU, an i5-13400F CPU, and 16GB of DDR5 RAM—stuff you might already have at home. A sketch of preparing the data for training appears after this list.
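nanoGPT expects its dataset as flat binary files of token IDs (train.bin and val.bin). Below is a minimal sketch of producing them with the tokenizer from the previous step; the directory name and the 90/10 split are assumptions, not the project’s actual preparation script.

```python
# Minimal sketch: encode the cleaned corpus into nanoGPT-style train.bin/val.bin.
# Assumes the tokenizer trained above; paths and split ratio are illustrative.
from pathlib import Path
import numpy as np
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("tokenizer/vocab.json", "tokenizer/merges.txt")

# Concatenate every cleaned text file into one long stream of token IDs
ids = []
for path in sorted(Path("corpus").glob("*.txt")):
    ids.extend(tokenizer.encode(path.read_text(encoding="utf-8")).ids)

ids = np.array(ids, dtype=np.uint16)   # nanoGPT stores token IDs as uint16
split = int(0.9 * len(ids))            # 90/10 train/validation split

out_dir = Path("data/london1800")
out_dir.mkdir(parents=True, exist_ok=True)
ids[:split].tofile(out_dir / "train.bin")
ids[split:].tofile(out_dir / "val.bin")
```

From there, training is a matter of pointing nanoGPT’s train.py at that data directory and scaling the model dimensions (layers, heads, embedding width) down to something an 8GB consumer GPU can comfortably hold.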
With these steps, you could build an AI for, say, Victorian England, ancient Rome (if you’ve got Latin texts), or any era with enough written records.
Common Questions Answered
Why Not Just Adjust an Existing Model?
Tweaking a ready-made AI (like fine-tuning or using a method called LoRA) keeps its modern roots intact. You can’t fully scrub out today’s knowledge that way. Starting fresh with only historical texts guarantees a pure, old-world perspective.
What Kinds of Texts Are Used?
The project pulls from books, legal papers, and newspapers written in London between 1800 and 1850. So far, 50 documents have been used out of a planned collection of several hundred, all carefully chosen to reflect the time.
How Big Is the First Model?
The initial version, called Version 0, has about 16 million parameters. That’s enough to experiment with, but it’s not ready for deep thinking yet—more data and training will beef it up.
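The exact architecture of Version 0 isn’t stated, but for any GPT-style model the parameter count follows almost entirely from the layer count, embedding width, and vocabulary size. The configuration below is a hypothetical one that lands in roughly the 16-million range, shown only to give a feel for the scale.

```python
# Back-of-the-envelope parameter count for a small GPT-style model.
# The numbers are hypothetical, not the project's actual Version 0 config.
def gpt_param_estimate(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    token_emb = vocab_size * n_embd     # token embedding (often tied with the output head)
    pos_emb = block_size * n_embd       # learned positional embedding
    per_layer = 12 * n_embd * n_embd    # attention (~4*d^2) + MLP (~8*d^2), ignoring biases/LayerNorm
    return token_emb + pos_emb + n_layer * per_layer

print(gpt_param_estimate(n_layer=6, n_embd=384, vocab_size=16000, block_size=256))
# -> roughly 16.9 million parameters
```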
What Do You Need to Run It?
The setup is pretty doable: a GeForce RTX 4060 GPU, i5-13400F CPU, and 16GB DDR5 RAM. No fancy lab required—just a decent home computer can handle the early stages.
Why This Project Matters
TimeCapsule LLM isn’t just a cool trick. It’s a way to step into the shoes of people long gone—to hear their words and glimpse their world through AI. As it grows, it could act like a digital historian, helping us explore how 19th-century Londoners saw life, from their daily routines to their big ideas.
This idea could stretch further. Imagine an AI trained on Shakespeare’s plays, ancient Chinese poetry, or colonial-era letters. It’s a tool for students, writers, or anyone curious about the past, blending tech and history in a way that’s both fun and useful.
Wrapping Up: A Digital Door to Yesterday
TimeCapsule LLM is still young, but it’s already opening a window to 19th-century London. It’s not about flashy tricks or quick wins—it’s a steady, thoughtful effort to link us with history through technology. Whether you’re a tech fan, a history buff, or just someone who loves a good story, this project offers a chance to peek into the past and even build your own bridge to another time.
This journey into the past is just beginning. With more texts and smarter training, TimeCapsule LLM could soon chat with us as naturally as a friend from 1840s London, sharing tales of a world we’ve only read about—until now.