From Wall-of-Text to Structured Gold: A Beginner-Friendly Guide to LangExtract

Audience: Junior-college graduates with basic Python
Goal: Extract structured data from any long document in under 30 minutes
Reading time: ~20 minutes for the first successful run


Table of Contents

  1. Why LangExtract Exists
  2. What It Actually Does
  3. Your First Extraction in 5 Minutes
  4. Handling Long Documents Without Headaches
  5. Real-World Use Cases — Scripts, Medical Notes, Radiology Reports
  6. FAQ Corner
  7. Going Further — Local Models & Contributing Back

1. Why LangExtract Exists

Imagine these Monday-morning requests:

• “Turn this 150,000-word novel into a spreadsheet of every character and their relationships.”
• “Convert 300 free-text radiology reports into a searchable database.”
• “Pull out every drug name, dosage, and route from 10 years of clinical notes.”

Traditional routes mean weeks of regex writing or training a custom model.
LangExtract short-circuits the process: write a short prompt, give a handful of examples, and let a large language model (LLM) do the heavy lifting—no fine-tuning, no PhD required.


2. What It Actually Does

• Precise Source Grounding: each extracted fact links back to the exact sentence it came from. Typical scene: legal audits, fact-checking.
• Reliable Schema: output is always the same JSON shape, ready for databases or BI tools. Typical scene: production pipelines.
• Long-Document Optimised: automatic chunking, parallel requests, and multi-pass scanning. Typical scene: entire novels, EHR dumps.
• One-Click Visualisation: generates a single HTML file with highlights and cards. Typical scene: demos, stakeholder reviews.
• Model Flexibility: works with Google Gemini, local Ollama, or any OpenAI-compatible endpoint. Typical scene: cost and privacy control.
• Zero-Shot Friendly: works even without examples; gets better with 3–5 high-quality ones. Typical scene: brand-new domains.
• World-Knowledge Leverage: lets the LLM fill gentle gaps when your prompt explicitly allows it. Typical scene: historical context, synonyms.

3. Your First Extraction in 5 Minutes

3.1 Install

# Create a clean environment
python -m venv lx_env
source lx_env/bin/activate  # Windows: lx_env\Scripts\activate
pip install langextract

3.2 Get an API Key (Only for Cloud Models)

  1. Visit AI Studio → create a key.
  2. Save it in a .env file next to your script:
LANGEXTRACT_API_KEY=your_real_key_here
  3. Add .env to .gitignore so the key never reaches GitHub.

Using a local model via Ollama? Skip the key entirely.
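
If you would rather not add a dependency just to read the .env file, a stdlib-only loader is easy to sketch. The function name load_api_key is our own illustration, not part of LangExtract:

```python
import os


def load_api_key(env_file=".env"):
    """Return LANGEXTRACT_API_KEY from the environment or a .env file."""
    key = os.environ.get("LANGEXTRACT_API_KEY")
    if key:
        return key
    try:
        with open(env_file, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip comments and unrelated variables
                if line.startswith("LANGEXTRACT_API_KEY="):
                    return line.split("=", 1)[1]
    except FileNotFoundError:
        pass
    return None
```

Libraries such as python-dotenv do the same job more robustly; this sketch just shows the mechanics.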

3.3 Write Ten Lines of Code

import langextract as lx
import textwrap

# 1. Describe what you want
prompt = textwrap.dedent("""\
    Extract characters, emotions and relationships in order of appearance.
    Use exact text—no paraphrasing, no overlapping entities.
    Add meaningful attributes to give context.""")

# 2. Show one high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotion": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

# 3. Choose your text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# 4. Run
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"  # balanced speed & cost
)

Model choice notes
• gemini-2.5-flash — default sweet spot for speed and cost.
• gemini-2.5-pro — heavier reasoning, slower.
• Tier-2 quota is recommended for large jobs; see Google's rate-limit docs.
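
Once extract returns, the result carries a list of extractions shaped like the example data above. A quick way to sanity-check a run is to count entities per class; the helper below only assumes each item exposes an extraction_class field, as in the examples:

```python
from collections import Counter


def summarize(extractions):
    """Count extracted entities per class, e.g. {'character': 2, 'emotion': 1}."""
    return dict(Counter(e.extraction_class for e in extractions))


# After a run:
# print(summarize(result.extractions))
```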

3.4 Save & Visualise

# Save results
lx.io.save_annotated_documents([result], "demo.jsonl")

# Build interactive HTML
html = lx.visualize("demo.jsonl")
with open("demo.html", "w", encoding="utf-8") as f:
    f.write(html)

Open demo.html in any browser; hover over highlights to see cards.
(Figure: basic visualisation)


4. Handling Long Documents Without Headaches

4.1 One-Line URL Processing

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # multi-pass recall boost
    max_workers=20,        # parallel requests
    max_char_buffer=1000   # smaller chunks = higher precision
)
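
To build intuition for what max_char_buffer controls, here is a toy greedy chunker that packs whole sentences into fixed-size chunks. LangExtract's real chunking is internal and more sophisticated, so treat this purely as an illustration:

```python
import re


def chunk_text(text, max_chars=1000):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Illustrative only -- LangExtract's built-in chunker (controlled by
    max_char_buffer) is internal and handles more edge cases."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks give the model less text to attend to per request, which is why lowering max_char_buffer tends to raise precision at the cost of more API calls.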

4.2 Rough Performance Table (Local Test)

Text Size        Chunks   Passes   Wall Time   Entities Found
147,843 chars    20       3        ~90 s       600+
30,000 chars     8        2        ~25 s       150+
5,000 chars      1        1        ~5 s        30+

Times depend on network and quota; the numbers above were measured on a 100 Mbps connection with Tier-2 quota.


5. Real-World Use Cases

5.1 Full Novel — Romeo and Juliet

  • Source: Project Gutenberg plain-text
  • Goal: Character, emotion, relationship timeline
  • Outcome: JSONL ready for Neo4j import
  • Official walkthrough
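
As a sketch of the Neo4j step, the function below turns saved JSONL lines into Cypher MERGE statements for character nodes. The record shape ({"extractions": [...]}) mirrors the field names used throughout this guide; check it against your actual output file before relying on it:

```python
import json


def jsonl_to_cypher(jsonl_lines):
    """Emit Cypher MERGE statements for every 'character' extraction.

    Assumes each JSONL record looks like:
    {"extractions": [{"extraction_class": ..., "extraction_text": ...}, ...]}
    """
    statements = []
    for line in jsonl_lines:
        record = json.loads(line)
        for e in record.get("extractions", []):
            if e.get("extraction_class") == "character":
                # Escape single quotes for the Cypher string literal
                name = e["extraction_text"].replace("'", "\\'")
                statements.append(f"MERGE (:Character {{name: '{name}'}})")
    return statements
```

Relationship extractions could be handled the same way with MERGE on edges; this sketch sticks to nodes for brevity.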

5.2 Medical — Medication Extraction

Disclaimer: Example is for capability demonstration only. Not for clinical decisions.

prompt = "Extract drug name, dose, route and frequency."
examples = [
    lx.data.ExampleData(
        text="Patient takes aspirin 100 mg orally twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="aspirin",
                attributes={
                    "dose": "100 mg",
                    "route": "oral",
                    "frequency": "twice daily"
                }
            )
        ]
    )
]

Output columns slot directly into hospital information systems.
More medication examples
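
Flattening those attribute columns into a CSV takes only stdlib code. The helper assumes each Extraction object looks like the example above (extraction_text plus an attributes dict):

```python
import csv
import io


def medications_to_csv(extractions):
    """Flatten medication extractions into CSV rows: drug, dose, route, frequency."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["drug", "dose", "route", "frequency"])
    for e in extractions:
        attrs = e.attributes or {}
        writer.writerow([e.extraction_text,
                         attrs.get("dose", ""),
                         attrs.get("route", ""),
                         attrs.get("frequency", "")])
    return buf.getvalue()
```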

5.3 Radiology Report Structuring — RadExtract Demo

No install, zero setup:
Try RadExtract on Hugging Face Spaces


6. FAQ Corner

Q1: I don’t have a GPU. Can I still run this?
Yes. Cloud models like Gemini do the compute; you only need internet.

Q2: Does my text leave my laptop?
Only if you use a cloud model. Choose Ollama or another local backend for full privacy.

Q3: How do I switch to Chinese prompts?
Write your prompt and examples in Chinese; Gemini handles it natively.

Q4: Can I change the output schema?
Absolutely. Define any extraction_class names and attributes you like; the output JSON mirrors the structure of your examples.

Q5: Is offline usage possible?
Yes. Spin up a local model with Ollama and point LangExtract to its endpoint.
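
A local call might look like the sketch below. The model_url argument and the gemma2:2b model name follow common Ollama setups; confirm the exact parameter names against the LangExtract documentation for your installed version:

```python
def local_extract_params(prompt, examples,
                         model_id="gemma2:2b",
                         endpoint="http://localhost:11434"):
    """Assemble keyword arguments for an Ollama-backed lx.extract call.

    prompt_description / examples / model_id follow the cloud example
    earlier in this guide; model_url is assumed from typical local setups."""
    return {
        "prompt_description": prompt,
        "examples": examples,
        "model_id": model_id,      # any model you have pulled into Ollama
        "model_url": endpoint,     # local server: nothing leaves your machine
    }


# Usage (requires `ollama serve` running and langextract installed):
# result = lx.extract(text_or_documents=my_text,
#                     **local_extract_params(prompt, examples))
```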

Q6: How many examples are enough?
Zero works; 3–5 high-quality ones usually lift accuracy and consistency.

Q7: Which languages are supported?
The model extracts in whatever language your prompt and examples use.


7. Going Further — Local Models & Contributing Back

7.1 Install from Source (Dev + Test)

git clone https://github.com/google/langextract.git
cd langextract

# Basic editable install
pip install -e .

# With linting tools
pip install -e ".[dev]"

# With test suite
pip install -e ".[test]"

7.2 Run the Test Suite

pytest tests
# or the full CI matrix
tox  # runs pylint + pytest on Python 3.10 & 3.11

7.3 Contribute

  1. Fork the repo.
  2. Create a feature branch: git checkout -b feature/my-idea.
  3. Add tests → ensure pytest passes.
  4. Sign the Google CLA.
  5. Open a pull request.

Closing Thoughts

LangExtract turns “weeks of regex and model training” into “one prompt and a coffee break.”
Next time you face a mountain of unstructured text, copy the ten-line snippet above and see if it saves you three days of work.

Happy extracting!