From Wall-of-Text to Structured Gold: A Beginner-Friendly Guide to LangExtract
Audience: Junior-college graduates with basic Python
Goal: Extract structured data from any long document in under 30 minutes
Reading time: ~20 minutes for the first successful run
Table of Contents

1. Why LangExtract Exists
2. What It Actually Does
3. Your First Extraction in 5 Minutes
4. Handling Long Documents Without Headaches
5. Real-World Use Cases — Scripts, Medical Notes, Radiology Reports
6. FAQ Corner
7. Going Further — Local Models & Contributing Back
1. Why LangExtract Exists
Imagine these Monday-morning requests:
• “Turn this 150 000-word novel into a spreadsheet of every character and their relationships.”
• “Convert 300 free-text radiology reports into a searchable database.”
• “Pull out every drug name, dosage, and route from 10 years of clinical notes.”
Traditional routes mean weeks of regex writing or training a custom model.
LangExtract short-circuits the process: write a short prompt, give a handful of examples, and let a large language model (LLM) do the heavy lifting—no fine-tuning, no PhD required.
2. What It Actually Does
| Benefit | Plain-English Explanation | Typical Scene |
|---|---|---|
| Precise Source Grounding | Each extracted fact links back to the exact sentence it came from | Legal audits, fact-checking |
| Reliable Schema | Output is always the same JSON shape, ready for databases or BI tools | Production pipelines |
| Long-Document Optimised | Automatic chunking, parallel requests, and multi-pass scanning | Entire novels, EHR dumps |
| One-Click Visualisation | Generates a single HTML file with highlights and cards | Demos, stakeholder reviews |
| Model Flexibility | Works with Google Gemini, local Ollama, or any OpenAI-compatible endpoint | Cost and privacy control |
| Zero-Shot Friendly | Works even without examples; gets better with 3–5 high-quality ones | Brand-new domains |
| World-Knowledge Leverage | Lets the LLM fill in small gaps when your prompt explicitly allows it | Historical context, synonyms |
3. Your First Extraction in 5 Minutes
3.1 Install
# Create a clean environment
python -m venv lx_env
source lx_env/bin/activate # Windows: lx_env\Scripts\activate
pip install langextract
3.2 Get an API Key (Only for Cloud Models)
1. Visit AI Studio → create a key.
2. Save it in a .env file next to your script:

   LANGEXTRACT_API_KEY=your_real_key_here

3. Add .env to .gitignore so the key never reaches GitHub.

Using a local model via Ollama? Skip the key entirely.
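If you prefer not to add a dependency, a few lines of standard-library Python can load the key from .env into the environment (the popular python-dotenv package does the same job more robustly; this helper is a minimal sketch):

```python
import os

def load_env(path=".env"):
    """Load KEY=value pairs from a .env file into os.environ (minimal sketch)."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("LANGEXTRACT_API_KEY")  # None if the file or key is absent
```

Call load_env() before lx.extract() so the key is visible to the library.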
3.3 Write Ten Lines of Code
import langextract as lx
import textwrap

# 1. Describe what you want
prompt = textwrap.dedent("""\
    Extract characters, emotions and relationships in order of appearance.
    Use exact text—no paraphrasing, no overlapping entities.
    Add meaningful attributes to give context.""")

# 2. Show one high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotion": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"},
            ),
        ],
    )
]

# 3. Choose your text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# 4. Run
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # balanced speed & cost
)
Model choice notes:
- gemini-2.5-flash — default sweet spot.
- gemini-2.5-pro — heavier reasoning, slower.
- Tier-2 quota recommended for large jobs; see Google's rate-limit docs.
3.4 Save & Visualise
# Save results
lx.io.save_annotated_documents([result], "demo.jsonl")

# Build interactive HTML
html = lx.visualize("demo.jsonl")
with open("demo.html", "w", encoding="utf-8") as f:
    f.write(html)

Open demo.html in any browser; hover over highlights to see cards.
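The saved extractions are also plain data you can post-process in ordinary Python. The records below are illustrative sample dicts shaped like the demo's output (the exact JSONL schema may differ by version), but the counting idea carries over:

```python
from collections import Counter

# Illustrative records shaped like the demo's extractions (sample data only)
extractions = [
    {"extraction_class": "character", "extraction_text": "Lady Juliet"},
    {"extraction_class": "emotion", "extraction_text": "longingly"},
    {"extraction_class": "character", "extraction_text": "Romeo"},
]

# Tally how many entities of each class were found
counts = Counter(e["extraction_class"] for e in extractions)
print(counts)  # Counter({'character': 2, 'emotion': 1})
```

The same pattern works on the real JSONL: read each line with json.loads and feed the records into the Counter.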
4. Handling Long Documents Without Headaches
4.1 One-Line URL Processing
result = lx.extract(
text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
extraction_passes=3, # multi-pass recall boost
max_workers=20, # parallel requests
max_char_buffer=1000 # smaller chunks = higher precision
)
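To build intuition for what max_char_buffer controls, here is a toy chunker. It is not LangExtract's actual implementation (the library's chunking is more sophisticated), just an illustration of splitting text into bounded pieces at whitespace:

```python
def chunk_text(text, max_chars=1000):
    """Toy chunker: split text into pieces of <= max_chars, preferring spaces."""
    chunks = []
    while text:
        if len(text) <= max_chars:
            chunks.append(text)
            break
        # Break at the last space inside the window, if there is one
        cut = text.rfind(" ", 0, max_chars)
        if cut <= 0:
            cut = max_chars  # no space found: hard cut
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    return chunks

pieces = chunk_text("word " * 500, max_chars=100)
print(len(pieces))  # 25 chunks, each at most 100 characters
```

Smaller chunks give the model less text per request (higher precision, as the comment in the snippet above notes) at the cost of more requests overall.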
4.2 Rough Performance Table (Local Test)
| Text Size | Chunks | Passes | Wall Time | Entities Found |
|---|---|---|---|---|
| 147 843 chars | 20 | 3 | ~90 s | 600+ |
| 30 000 chars | 8 | 2 | ~25 s | 150+ |
| 5 000 chars | 1 | 1 | ~5 s | 30+ |
Times depend on network and quota; numbers above are on a 100 Mbps line with Tier-2 quota.
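A quick sanity check on the first row: 147 843 characters in roughly 90 seconds works out to about 1,600 characters per second end to end, a handy baseline when estimating your own jobs:

```python
# Throughput from the first table row (rough local test, not a benchmark)
chars, seconds = 147_843, 90
throughput = round(chars / seconds)
print(throughput)  # 1643 characters per second
```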
5. Real-World Use Cases
5.1 Full Novel — Romeo and Juliet
- Source: Project Gutenberg plain text
- Goal: character, emotion, and relationship timeline
- Outcome: JSONL ready for Neo4j import
- Official walkthrough
5.2 Medical — Medication Extraction
Disclaimer: Example is for capability demonstration only. Not for clinical decisions.
prompt = "Extract drug name, dose, route and frequency."
examples = [
    lx.data.ExampleData(
        text="Patient takes aspirin 100 mg orally twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="aspirin",
                attributes={
                    "dose": "100 mg",
                    "route": "oral",
                    "frequency": "twice daily",
                },
            )
        ],
    )
]
Output columns slot directly into hospital information systems.
More medication examples
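To turn those columns into a file other systems can ingest, the standard library's csv module is enough. The rows below are illustrative sample data mirroring the example above, not LangExtract's exact output format:

```python
import csv
import io

# Illustrative rows mirroring the medication example above (sample data only)
meds = [
    {"drug": "aspirin", "dose": "100 mg", "route": "oral", "frequency": "twice daily"},
    {"drug": "lisinopril", "dose": "10 mg", "route": "oral", "frequency": "once daily"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["drug", "dose", "route", "frequency"])
writer.writeheader()
writer.writerows(meds)
print(buf.getvalue())
```

In practice you would build the row dicts from each extraction's attributes and write to a real file rather than an in-memory buffer.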
5.3 Radiology Report Structuring — RadExtract Demo
No install, zero setup:
Try RadExtract on Hugging Face Spaces
6. FAQ Corner
Q1: I don’t have a GPU. Can I still run this?
Yes. Cloud models like Gemini do the compute; you only need internet.
Q2: Does my text leave my laptop?
Only if you use a cloud model. Choose Ollama or another local backend for full privacy.
Q3: How do I switch to Chinese prompts?
Write your prompt and examples in Chinese; Gemini handles it natively.
Q4: Can I change the output schema?
Absolutely—define any extraction_class names you like; the JSON shape matches automatically.
Q5: Is offline usage possible?
Yes. Spin up a local model with Ollama and point LangExtract to its endpoint.
Q6: How many examples are enough?
Zero works; 3–5 high-quality ones usually lift accuracy and consistency.
Q7: Which languages are supported?
The model extracts in whatever language your prompt and examples are written in.
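The local-model route from Q5 can be sketched as a set of keyword arguments for lx.extract. The parameter names below (model_url in particular) are assumptions rather than a verified API reference, so check the docs for your installed LangExtract version:

```python
# Hypothetical settings for a local Ollama backend (assumed parameter names)
local_model_kwargs = {
    "model_id": "gemma2:2b",                # any model already pulled with `ollama pull`
    "model_url": "http://localhost:11434",  # Ollama's default endpoint
}

# Sketch of usage, reusing prompt/examples/input_text from section 3:
# result = lx.extract(
#     text_or_documents=input_text,
#     prompt_description=prompt,
#     examples=examples,
#     **local_model_kwargs,
# )
```

With this setup no API key is needed and no text leaves your machine.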
7. Going Further — Local Models & Contributing Back
7.1 Install from Source (Dev + Test)
git clone https://github.com/google/langextract.git
cd langextract
# Basic editable install
pip install -e .
# With linting tools
pip install -e ".[dev]"
# With test suite
pip install -e ".[test]"
7.2 Run the Test Suite
pytest tests
# or the full CI matrix
tox # runs pylint + pytest on Python 3.10 & 3.11
7.3 Contribute
1. Fork the repo.
2. Create a feature branch: git checkout -b feature/my-idea
3. Add tests and make sure pytest passes.
4. Sign the Google CLA.
5. Open a pull request.
Closing Thoughts
LangExtract turns “weeks of regex and model training” into “one prompt and a coffee break.”
Next time you face a mountain of unstructured text, copy the ten-line snippet above and see if it saves you three days of work.
Happy extracting!