From Wall-of-Text to Structured Gold: A Beginner-Friendly Guide to LangExtract

Audience: Junior-college graduates with basic Python
Goal: Extract structured data from any long document in under 30 minutes
Reading time: ~20 minutes for the first successful run


Table of Contents

  1. Why LangExtract Exists
  2. What It Actually Does
  3. Your First Extraction in 5 Minutes
  4. Handling Long Documents Without Headaches
  5. Real-World Use Cases — Scripts, Medical Notes, Radiology Reports
  6. FAQ Corner
  7. Going Further — Local Models & Contributing Back

1. Why LangExtract Exists

Imagine these Monday-morning requests:

• “Turn this 150,000-word novel into a spreadsheet of every character and their relationships.”
• “Convert 300 free-text radiology reports into a searchable database.”
• “Pull out every drug name, dosage, and route from 10 years of clinical notes.”

Traditional routes mean weeks of regex writing or training a custom model.
LangExtract short-circuits the process: write a short prompt, give a handful of examples, and let a large language model (LLM) do the heavy lifting—no fine-tuning, no PhD required.


2. What It Actually Does

• Precise Source Grounding: each extracted fact links back to the exact sentence it came from. Typical scene: legal audits, fact-checking.
• Reliable Schema: output is always the same JSON shape, ready for databases or BI tools. Typical scene: production pipelines.
• Long-Document Optimised: automatic chunking, parallel requests, and multi-pass scanning. Typical scene: entire novels, EHR dumps.
• One-Click Visualisation: generates a single HTML file with highlights and cards. Typical scene: demos, stakeholder reviews.
• Model Flexibility: works with Google Gemini, local Ollama, or any OpenAI-compatible endpoint. Typical scene: cost and privacy control.
• Zero-Shot Friendly: works even without examples; gets better with 3–5 high-quality ones. Typical scene: brand-new domains.
• World-Knowledge Leverage: lets the LLM fill gentle gaps when your prompt explicitly allows it. Typical scene: historical context, synonyms.

3. Your First Extraction in 5 Minutes

3.1 Install

# Create a clean environment
python -m venv lx_env
source lx_env/bin/activate  # Windows: lx_env\Scripts\activate
pip install langextract

3.2 Get an API Key (Only for Cloud Models)

  1. Visit AI Studio → create a key.
  2. Save it in a .env file next to your script:
LANGEXTRACT_API_KEY=your_real_key_here
  3. Add .env to .gitignore so the key never reaches GitHub.

Using a local model via Ollama? Skip the key entirely.
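
If you would rather not add a dependency just to read the .env file, a stdlib-only loader is easy to sketch. The function name load_api_key is our own illustration, not part of LangExtract:

```python
import os


def load_api_key(env_file=".env"):
    """Return LANGEXTRACT_API_KEY from the environment or a .env file."""
    key = os.environ.get("LANGEXTRACT_API_KEY")
    if key:
        return key
    try:
        with open(env_file, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip comments and unrelated variables
                if line.startswith("LANGEXTRACT_API_KEY="):
                    return line.split("=", 1)[1]
    except FileNotFoundError:
        pass
    return None
```

Libraries such as python-dotenv do the same job more robustly; this sketch just shows the mechanics.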

3.3 Write Ten Lines of Code

import langextract as lx
import textwrap

# 1. Describe what you want
prompt = textwrap.dedent("""\
    Extract characters, emotions and relationships in order of appearance.
    Use exact text—no paraphrasing, no overlapping entities.
    Add meaningful attributes to give context.""")

# 2. Show one high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotion": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

# 3. Choose your text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# 4. Run
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"  # balanced speed & cost
)

Model choice notes
• gemini-2.5-flash — default sweet spot for speed and cost.
• gemini-2.5-pro — heavier reasoning, slower.
• Tier-2 quota is recommended for large jobs; see Google's rate-limit docs.
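
Once extract returns, the result carries a list of extractions shaped like the example data above. A quick way to sanity-check a run is to count entities per class; the helper below only assumes each item exposes an extraction_class field, as in the examples:

```python
from collections import Counter


def summarize(extractions):
    """Count extracted entities per class, e.g. {'character': 2, 'emotion': 1}."""
    return dict(Counter(e.extraction_class for e in extractions))


# After a run:
# print(summarize(result.extractions))
```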

3.4 Save & Visualise

# Save results
lx.io.save_annotated_documents([result], "demo.jsonl")

# Build interactive HTML
html = lx.visualize("demo.jsonl")
with open("demo.html", "w", encoding="utf-8") as f:
    f.write(html)

Open demo.html in any browser; hover over highlights to see cards.
(Figure: basic visualisation)


4. Handling Long Documents Without Headaches

4.1 One-Line URL Processing

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # multi-pass recall boost
    max_workers=20,        # parallel requests
    max_char_buffer=1000   # smaller chunks = higher precision
)
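
To build intuition for what max_char_buffer controls, here is a toy greedy chunker that packs whole sentences into fixed-size chunks. LangExtract's real chunking is internal and more sophisticated, so treat this purely as an illustration:

```python
import re


def chunk_text(text, max_chars=1000):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Illustrative only -- LangExtract's built-in chunker (controlled by
    max_char_buffer) is internal and handles more edge cases."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks give the model less text to attend to per request, which is why lowering max_char_buffer tends to raise precision at the cost of more API calls.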

4.2 Rough Performance Table (Local Test)

Text Size        Chunks   Passes   Wall Time   Entities Found
147,843 chars    20       3        ~90 s       600+
30,000 chars     8        2        ~25 s       150+
5,000 chars      1        1        ~5 s        30+

Times depend on network and quota; the numbers above were measured on a 100 Mbps connection with Tier-2 quota.


5. Real-World Use Cases

5.1 Full Novel — Romeo and Juliet

  • Source: Project Gutenberg plain-text
  • Goal: Character, emotion, relationship timeline
  • Outcome: JSONL ready for Neo4j import
  • Official walkthrough
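
As a sketch of the Neo4j step, the function below turns saved JSONL lines into Cypher MERGE statements for character nodes. The record shape ({"extractions": [...]}) mirrors the field names used throughout this guide; check it against your actual output file before relying on it:

```python
import json


def jsonl_to_cypher(jsonl_lines):
    """Emit Cypher MERGE statements for every 'character' extraction.

    Assumes each JSONL record looks like:
    {"extractions": [{"extraction_class": ..., "extraction_text": ...}, ...]}
    """
    statements = []
    for line in jsonl_lines:
        record = json.loads(line)
        for e in record.get("extractions", []):
            if e.get("extraction_class") == "character":
                # Escape single quotes for the Cypher string literal
                name = e["extraction_text"].replace("'", "\\'")
                statements.append(f"MERGE (:Character {{name: '{name}'}})")
    return statements
```

Relationship extractions could be handled the same way with MERGE on edges; this sketch sticks to nodes for brevity.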

5.2 Medical — Medication Extraction

Disclaimer: Example is for capability demonstration only. Not for clinical decisions.

prompt = "Extract drug name, dose, route and frequency."
examples = [
    lx.data.ExampleData(
        text="Patient takes aspirin 100 mg orally twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="aspirin",
                attributes={
                    "dose": "100 mg",
                    "route": "oral",
                    "frequency": "twice daily"
                }
            )
        ]
    )
]

Output columns slot directly into hospital information systems.
More medication examples
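
Flattening those attribute columns into a CSV takes only stdlib code. The helper assumes each Extraction object looks like the example above (extraction_text plus an attributes dict):

```python
import csv
import io


def medications_to_csv(extractions):
    """Flatten medication extractions into CSV rows: drug, dose, route, frequency."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["drug", "dose", "route", "frequency"])
    for e in extractions:
        attrs = e.attributes or {}
        writer.writerow([e.extraction_text,
                         attrs.get("dose", ""),
                         attrs.get("route", ""),
                         attrs.get("frequency", "")])
    return buf.getvalue()
```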

5.3 Radiology Report Structuring — RadExtract Demo

No install, zero setup:
Try RadExtract on Hugging Face Spaces


6. FAQ Corner

Q1: I don’t have a GPU. Can I still run this?
Yes. Cloud models like Gemini do the compute; you only need internet.

Q2: Does my text leave my laptop?
Only if you use a cloud model. Choose Ollama or another local backend for full privacy.

Q3: How do I switch to Chinese prompts?
Write your prompt and examples in Chinese; Gemini handles it natively.

Q4: Can I change the output schema?
Absolutely. Define any extraction_class names and attributes you like; the output JSON mirrors the structure of your examples.

Q5: Is offline usage possible?
Yes. Spin up a local model with Ollama and point LangExtract to its endpoint.
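
A local call might look like the sketch below. The model_url argument and the gemma2:2b model name follow common Ollama setups; confirm the exact parameter names against the LangExtract documentation for your installed version:

```python
def local_extract_params(prompt, examples,
                         model_id="gemma2:2b",
                         endpoint="http://localhost:11434"):
    """Assemble keyword arguments for an Ollama-backed lx.extract call.

    prompt_description / examples / model_id follow the cloud example
    earlier in this guide; model_url is assumed from typical local setups."""
    return {
        "prompt_description": prompt,
        "examples": examples,
        "model_id": model_id,      # any model you have pulled into Ollama
        "model_url": endpoint,     # local server: nothing leaves your machine
    }


# Usage (requires `ollama serve` running and langextract installed):
# result = lx.extract(text_or_documents=my_text,
#                     **local_extract_params(prompt, examples))
```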

Q6: How many examples are enough?
Zero works; 3–5 high-quality ones usually lift accuracy and consistency.

Q7: Which languages are supported?
The model extracts in whatever language your prompt and examples use.


7. Going Further — Local Models & Contributing Back

7.1 Install from Source (Dev + Test)

git clone https://github.com/google/langextract.git
cd langextract

# Basic editable install
pip install -e .

# With linting tools
pip install -e ".[dev]"

# With test suite
pip install -e ".[test]"

7.2 Run the Test Suite

pytest tests
# or the full CI matrix
tox  # runs pylint + pytest on Python 3.10 & 3.11

7.3 Contribute

  1. Fork the repo.
  2. Create a feature branch: git checkout -b feature/my-idea.
  3. Add tests → ensure pytest passes.
  4. Sign the Google CLA.
  5. Open a pull request.

Closing Thoughts

LangExtract turns “weeks of regex and model training” into “one prompt and a coffee break.”
Next time you face a mountain of unstructured text, copy the ten-line snippet above and see if it saves you three days of work.

Happy extracting!