
LangExtract: Transform text into structured data with AI

Discover LangExtract, Google's Python library that uses AI to convert unstructured text into structured data. Explore its features and transform your data today!
LangExtract is a Python library developed by Google for extracting structured data from unstructured text using AI and large language models (LLMs). It transforms messy text (like clinical notes, legal contracts, or forum discussions) into organized JSON-like data, all while maintaining links back to the original source. This means you can trust the extracted information and verify where it came from in the text. LangExtract is open-source and model-agnostic, supporting various LLMs (including Google’s Gemini family) without requiring any fine-tuning.



In this guide, you’ll learn what LangExtract does, explore its key features, and go hands-on with a step-by-step tutorial that integrates W&B Weave for interactive visualization of the results.

What is LangExtract?

LangExtract is an open-source Python library designed to programmatically extract structured information from unstructured text. Developed by Google and powered by LLMs, it acts as a smarter alternative to traditional text parsing or Named Entity Recognition (NER) tools. Instead of manually coding rules or needing large labeled datasets, you instruct LangExtract with natural language. You tell it what to extract and give a few examples of the desired output. LangExtract then uses an LLM to read through the text and produce a structured output (like a list of entities with attributes) that follows your instructions.
Under the hood, LangExtract treats your instructions and examples as a prompt for the LLM. It supports multiple backends (from powerful cloud models like Google’s Gemini to local open-source models) to actually perform the language understanding. The library handles feeding the text to the model, parsing the model’s output into a structured format, and ensuring each piece of data is linked back to the source text. The output is typically a list of extraction records (for example, as Python objects or JSON) where each record could be something like “{patient_name: John Doe, diagnosis: flu}” or “{character: Juliet, emotional_state: longing}” depending on your task.
Because LangExtract is open-source, you can integrate it into your own pipelines, customize it for new use cases, and even extend it to work with other model APIs. In short, LangExtract is the tool that bridges unstructured text and structured data, making it much easier to glean insights from large volumes of text.

Key features of LangExtract

LangExtract offers several powerful features that set it apart for information extraction tasks:
  • Precise source grounding: Every piece of extracted data is mapped to its exact position in the original text. This means you can always trace back and highlight the original context of an extraction. For example, if LangExtract pulls a patient’s name and diagnosis from a medical report, it will also give you the character offset or span in the document where that information came from. This grounding builds trust in the output, as you can verify each fact against the source.
  • Reliable structured outputs: LangExtract outputs follow a consistent schema that you define through examples. You provide a few example outputs (a technique known as few-shot prompting), and the model is guided to produce data in that exact format. Under the hood, LangExtract even leverages features like controlled generation (available in models like Gemini) to enforce JSON schemas or structured output formats. The result is that you get well-structured data (lists, dictionaries, classes) rather than free-form text, every time. This consistency makes post-processing and integration with databases or analysis tools much easier.
  • Optimized for long documents: Working with long texts (think research papers or entire books) can trip up normal LLM pipelines, but LangExtract is optimized for these cases. It uses smart text chunking, parallel processing, and multi-pass extraction strategies to handle very large documents. In practice, LangExtract will break a long text into manageable chunks, extract relevant info from each, then merge results and even perform additional passes to catch things that a single pass might miss. This “needle-in-a-haystack” approach ensures high recall even in million-character contexts.
  • Interactive visualization: To help you review and refine extraction results, LangExtract can generate an interactive HTML report of the findings. This self-contained visualization highlights each extracted entity in the context of the original text. You can scroll through your document and see, for instance, every medication name highlighted with its parsed dosage. This makes validation and debugging of the extraction much more intuitive. (We’ll see in the tutorial how to create this and how W&B Weave can further enhance visualization.)
  • Flexible LLM support: LangExtract isn’t tied to a single AI model. You can use it with cloud-based LLMs like Google’s Gemini (for the best quality and support for controlled output) or switch to local models via the built-in Ollama integration if you need to run without an internet connection or keep data on-premises. This flexibility lets you balance performance, cost, and privacy. Whether you have access to a state-of-the-art model or prefer a smaller open-source model, LangExtract can work with it by adjusting the model_id and relevant settings.
  • Adaptable to any domain: You can apply LangExtract to virtually any domain or language. There’s no training required — just describe what you need and provide a handful of examples. LangExtract will utilize the LLM’s understanding (and general world knowledge) to interpret context-specific terms. For instance, in finance, it could extract “ticker symbols and prices” from news articles, while in medicine, it could extract “symptoms and vital signs” from clinical notes. Because the approach relies on prompting, switching domains is as simple as changing your instructions and examples. (LangExtract also supports many languages out of the box, so you aren’t limited to English texts.)
  • Leverages world knowledge when needed: Through clever prompt design, LangExtract allows you to control how much the LLM should rely on outside knowledge versus sticking strictly to the text. You can instruct the model to only use evidence in the text (for high precision) or to infer a bit of implied information (leveraging the model’s training knowledge) if that’s useful. For example, you might extract a drug name from a note and let the model add a standard dosage form from its knowledge. LangExtract leaves this choice to you via prompt instructions and examples, ensuring the output meets your needs for completeness vs. verifiability.

How does LangExtract ensure precise source grounding and reliable structured outputs?

One of the core strengths of LangExtract is how it maintains accuracy and traceability in its outputs. It achieves this with two complementary mechanisms: source grounding and schema enforcement.


Precise source grounding

LangExtract maps each extracted item back to the original text by recording character offsets or spans. When the model finds an entity or piece of information, the library doesn’t just output the value—it also remembers where in the document that value came from.
For example, if it extracts “aspirin” as a medication from a clinical note, LangExtract might note that this came from characters 150-156 in the text. This allows downstream applications (or humans) to highlight “aspirin” in the source document for verification. Even in the interactive HTML visualization, each extraction is highlighted in place. Technically, this is achieved by capturing the indices or using unique identifiers for each chunk of text, ensuring that every output is tied to a specific source segment. The benefit is clear: you can always trace an output back to its origin, avoiding the common LLM pitfall of producing answers that can’t be verified or might be hallucinated. LangExtract ensures every answer has a traceable lineage.
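To make the idea concrete, here is a minimal, library-free sketch of what span grounding means. The offsets are computed with plain Python string methods, and the dict layout is purely illustrative; it is not LangExtract's actual internal schema.
# Illustration of span grounding using plain Python string offsets.
note = "Patient was prescribed aspirin 100mg daily for headaches."

extracted_text = "aspirin"
start = note.find(extracted_text)      # character offset where the span begins
end = start + len(extracted_text)      # exclusive end offset

grounded_record = {
    "extraction_class": "medication",
    "extraction_text": extracted_text,
    "char_span": (start, end),         # illustrative field name, not LangExtract's schema
    "attributes": {"dose": "100mg"},
}

# The span lets you re-derive the exact source context for verification:
assert note[start:end] == extracted_text
print(grounded_record)
print("context:", note[max(0, start - 10):end + 10])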

Reliable structured outputs

To guarantee that the model’s output adheres to a predictable format, LangExtract uses a combination of few-shot examples and controlled generation techniques. When you define what you want to extract, you also provide examples of the expected output structure (in code, this is done by creating Extraction objects with classes and attributes). These examples act as a schema.
During extraction, LangExtract instructs the LLM to follow that schema, and if the model supports it (like Google’s Gemini does), it uses features to control the output format (for instance, ensuring valid JSON or fixed lists of fields). This significantly reduces variability and errors. Instead of free-form answers, you get output like “extraction_class”: “medication”, “extraction_text”: “aspirin”, “attributes”: {“dose”: “100mg”}, neatly structured. By enforcing type and format, LangExtract avoids issues where the model might omit a field or format things incorrectly. In practice, this means you can rely on the output to be machine-readable and consistent, which is essential when loading it into a database or comparing results across documents.
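To show why a fixed shape matters downstream, here is a small illustration that checks extraction records against the three fields used throughout this guide (extraction_class, extraction_text, attributes). The validation logic is our own example and is not part of LangExtract, which enforces structure at generation time.
# Illustration only: a lightweight post-hoc check that records follow the expected shape.
REQUIRED_KEYS = {"extraction_class", "extraction_text", "attributes"}
ALLOWED_CLASSES = {"medication", "symptom"}

records = [
    {"extraction_class": "medication", "extraction_text": "aspirin", "attributes": {"dose": "100mg"}},
    {"extraction_class": "symptom", "extraction_text": "headaches", "attributes": {"duration": "unknown"}},
]

for i, rec in enumerate(records):
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        raise ValueError(f"record {i} is missing fields: {missing}")
    if rec["extraction_class"] not in ALLOWED_CLASSES:
        raise ValueError(f"record {i} has unexpected class {rec['extraction_class']!r}")
    if not isinstance(rec["attributes"], dict):
        raise ValueError(f"record {i} attributes must be a dict")

print("All records match the expected schema.")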

Putting it together

When LangExtract runs, it effectively guides the LLM with a carefully crafted prompt, which includes the task description, example data (illustrating both content and format), and possibly special tokens or instructions that the model, such as Gemini, recognizes to ensure well-formed output. The LLM then processes chunks of your document, returns structured data for each chunk, and LangExtract assembles these results, each tied to source text positions. The result is a collection of structured data entries that you can trust — you know exactly what they mean (because you defined the schema) and you know where they came from (because of source grounding). This combination of traceability and structure is what makes LangExtract reliable for professional use cases.

Why use LangExtract?

Imagine you have a stack of documents full of valuable information, but it’s all written in free-form text. Manually pulling out specific facts or writing custom parsers is tedious and error-prone. LangExtract leverages the power of LLMs to automate this process. You simply provide a description of what you want to extract (and a few examples), and LangExtract returns structured data with each piece tied to its location in the original text. The library shines in high-stakes domains like medicine and finance where accuracy and traceability are crucial. With tools like W&B Weave, you can even visualize and analyze the extraction results, making it easier to understand and share the insights you’ve uncovered.

Applying LangExtract to specialized domains

Because LangExtract doesn’t require any model fine-tuning or domain-specific training, it can be applied to a wide range of fields with minimal effort. The key is all in how you craft your extraction instructions and examples. Let’s look at a couple of specialized domains to see how LangExtract shines:

Applying LangExtract to the medical domain

In medicine, important information is often buried in lengthy free-text records, such as doctors’ notes or radiology reports. LangExtract can extract structured medical information, including patient demographics, symptoms, diagnoses, medications, and their corresponding dosages. For example, given a paragraph of a clinical note, you could prompt LangExtract to “Extract all medications with their dosage and frequency, and any mentioned symptoms with their duration”.
By providing a few examples (perhaps showing one medication and one symptom in a similar format), the model can output a list of medications (name, dose, frequency) and symptoms (description, duration) found in the text. The crucial benefit here is source grounding: each extracted medication is tied back to its original location in the note, which is essential for compliance and verification in healthcare. Another example: researchers can use LangExtract on medical literature to extract patient outcomes or sample sizes from study text, again without having to write complex regular expressions or train a model for each new concept.
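As a sketch of what that setup could look like in code, using the same ExampleData and Extraction objects introduced in the tutorial below (the clinical sentence and attribute names are invented for illustration):
import langextract as lx

# Hypothetical clinical-note task: medications with dosage/frequency, symptoms with duration.
medical_prompt = (
    "Extract all medications with their dosage and frequency, and any mentioned "
    "symptoms with their duration. Use exact text from the note for each extraction."
)

medical_examples = [
    lx.data.ExampleData(
        text="Patient reports headaches for three days. Started ibuprofen 200mg twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="headaches",
                attributes={"duration": "three days"},
            ),
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen",
                attributes={"dose": "200mg", "frequency": "twice daily"},
            ),
        ],
    )
]

# result = lx.extract(text_or_documents=clinical_note, prompt_description=medical_prompt,
#                     examples=medical_examples, model_id="gemini-2.5-flash")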

Applying LangExtract to the financial domain

Financial documents, such as earnings reports, SEC filings, or news articles, contain valuable information, but it stays locked in prose until you structure it. With LangExtract, you might extract “company name, Q3 revenue, and any forward-looking statements” from a quarterly earnings report. By writing a prompt that describes these fields and giving a few-shot example (maybe using a snippet from a previous report as the model example), LangExtract can scan a lengthy financial document and output a tidy JSON with fields like {company: ABC Corp, Q3_revenue: $5.4B, guidance: "Expecting 10% growth in next quarter"}.
Again, each piece of data comes with a reference to its position in the text, so auditors or analysts can click back to the exact phrase in the PDF or text. The adaptability means the same LangExtract approach could be used to structure legal contracts (e.g., extract parties, dates, obligations) or customer feedback (e.g., extract product issues and sentiment from reviews). All this is done just by changing the prompt and examples. In practice, users have reported that LangExtract can reach high accuracy in these domains with only 3-5 carefully chosen examples, which is a huge win compared to needing thousands of labeled examples to train a specialized model.
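Once the extractions come back, flattening them into one row per document is a few lines of Python. This sketch assumes you have already called lx.extract on a report and only uses the fields this guide relies on (extraction_class, extraction_text):
# Illustration: collapse one document's extractions into a flat dict keyed by class.
# Assumes a prior call such as:
#   result = lx.extract(text_or_documents=report_text, prompt_description=prompt,
#                       examples=examples, model_id="gemini-2.5-flash")
def extractions_to_row(result):
    row = {}
    for ext in result.extractions:
        # If a class appears more than once (e.g. several forward-looking statements),
        # keep them all by joining the snippets.
        if ext.extraction_class in row:
            row[ext.extraction_class] += " | " + ext.extraction_text
        else:
            row[ext.extraction_class] = ext.extraction_text
    return row

# The resulting keys are whatever extraction classes your prompt names,
# e.g. {"company": "ABC Corp", "revenue": "$5.4B", "guidance": "Expecting 10% growth..."}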

No fine-tuning needed

A big advantage in applying LangExtract to these domains is speed. You don’t have to collect domain-specific training data or train a new model whenever you switch context. The pre-trained LLM (whether it’s Gemini or another) already possesses a wealth of knowledge about medical and financial terminology. LangExtract capitalizes on that by guiding the model with your prompt.
For instance, the LLM likely “knows” common medication names and can identify them in text; you just need to tell it that for this task, those are what you care about, and to output them in a certain way. This also means LangExtract can handle multilingual scenarios in specialized domains. If you have financial reports in Spanish or clinical notes in French, as long as your model supports those languages, LangExtract can extract the structured data similarly. The few-shot examples would ideally be in the same language to guide the model, but you wouldn’t need a separate pipeline for each language.
Whether you’re dealing with patient health records, stock market analyses, legal contracts, or academic research, LangExtract provides a flexible template: describe the info you need, give a couple of examples, and let the LLM do the heavy lifting. This opens the door to quickly deploying AI solutions in niche areas (medicine, law, finance, etc.) where traditionally one had to build custom NLP models at great expense.

Integration with large language models like Gemini

LangExtract is designed to be LLM-agnostic but also takes special advantage of advanced models when available. Out of the box, it supports Google’s Gemini LLM, which is a state-of-the-art series of models known for their reasoning and structured output capabilities. In fact, the default model_id in LangExtract’s examples is often a Gemini model (for example, "gemini-2.5-flash" which balances speed and performance). Integrating LangExtract with Gemini provides a powerful combination: LangExtract structures the task and enforces grounding, while Gemini provides high-quality language understanding and generation. The result is high accuracy extractions even from complex text.

Using Gemini through LangExtract

If you have access to Google’s Gemini LLM (via the API key from Google’s AI services), LangExtract will handle sending your prompt and text to the Gemini model and getting back the structured response. Because Gemini supports controlled generation and output constraints, LangExtract can instruct it to output JSON directly or follow the schema strictly. This means fewer errors in parsing and more robust outputs.
For example, Gemini can understand instructions like “list all items in JSON format” within the prompt, which LangExtract utilizes behind the scenes. Also, different sizes of Gemini models are available (with names like flash vs pro in the model_id). You might use a smaller, faster model for quick experiments and then a larger model for more complex documents to improve recall or correctness. LangExtract makes it easy to switch—just change the model_id parameter.
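Switching model sizes is a one-parameter change. A rough pattern, reusing the prompt, examples, and input text defined later in this tutorial (check Google's current model list for the exact identifiers available to your key):
# Draft pass with the faster model; retry tricky documents with the larger one.
draft = lx.extract(text_or_documents=input_text, prompt_description=prompt,
                   examples=examples, model_id="gemini-2.5-flash")

# The larger model trades speed and cost for better handling of nuanced text.
careful = lx.extract(text_or_documents=input_text, prompt_description=prompt,
                     examples=examples, model_id="gemini-2.5-pro")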

Flexible backend support

Aside from Gemini, you can use other cloud LLMs or local models. LangExtract’s architecture allows you to specify different model_id values or even configure custom endpoints. For instance, if you wanted to use OpenAI’s GPT-5, you could (with a bit of custom configuration) connect LangExtract to that API by modifying the inference call, though this isn’t provided out-of-the-box to keep the focus on Gemini and open models.
For fully offline or private deployments, LangExtract integrates with Ollama, a system for running local LLMs. With Ollama set up, you can download a model like Llama or Mistral and then call LangExtract with model_id="ollama:llama2" (or another model name). LangExtract will route the extraction through the local model instead of a cloud API. This flexibility ensures you’re not locked in — you can choose the model that best fits your needs in terms of accuracy, speed, cost, and privacy.
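As a sketch, a local run mirrors the cloud call except for the model identifier. The ollama: prefix follows the example in Step 5 of this guide, and the exact name depends on which model you have pulled into Ollama:
# Assumes an Ollama server is running locally with a Llama 2 model pulled.
# No API key is needed; inference stays on your machine.
result_local = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="ollama:llama2",
)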

Step-by-step tutorial using W&B Weave

Now that we’ve covered the what and why of LangExtract, let’s get hands-on and use it in a simple example. In this tutorial, we’ll:
  1. Install LangExtract and set up our environment (including W&B for logging results).
  2. Define a custom extraction task with a prompt and examples.
  3. Run LangExtract to extract data from sample text.
  4. Visualize the results, both using LangExtract’s built-in HTML generator and by logging the data to Weights & Biases for interactive analysis with Weave.
  5. Explore some advanced options and next steps.
For this walkthrough, we’ll use a fun example: extracting information from a snippet of Shakespeare’s Romeo and Juliet. We’ll pretend we want to find characters, emotions, and relationships mentioned in the text. This will demonstrate LangExtract’s capabilities in a familiar context, and you can later apply the same steps to your own domain. Let’s get started!

Step 1: Installation and setup

First, we need to install the LangExtract library and a few other tools. We’ll also set up API access for the LLM and log in to Weights & Biases to use its features.
Install the required packages: LangExtract can be installed via pip. We’ll also install W&B’s wandb library (for experiment tracking and Weave integration) and weave (the library that powers W&B Weave’s interactive dashboards). Additionally, we'll install pandas to help format the output for logging. Run the following command in your Python environment (for example, in a Jupyter notebook cell):
!pip install langextract wandb weave pandas
This will download and install LangExtract and its dependencies, along with W&B tools.
Defaulting to user installation because normal site-packages is not writeable
Collecting langextract...
Collecting wandb...
Collecting weave...
...
Successfully installed langextract-1.0.8 wandb-0.15.5 weave-0.20.1 pandas-2.1.0 ...
Once installation is complete, import the libraries in your Python script or notebook:
import langextract as lx
import wandb
Set up your LLM API key: LangExtract needs access to an LLM to work. If you plan to use Google’s Gemini (or another cloud model), you’ll need to provide an API key. Google’s Gemini API keys can be obtained from Google’s AI Studio (or via Vertex AI for enterprise). Once your key is ready, configure it so that LangExtract can use it. The recommended way is to set an environment variable called LANGEXTRACT_API_KEY.
For example, in your terminal or environment, you could do:
export LANGEXTRACT_API_KEY="YOUR-GEMINI-API-KEY"
(On Windows, you’d use set instead of export.)
If you have a .env setup, you can add LANGEXTRACT_API_KEY=<your-key> there. LangExtract will automatically detect this variable and use it when you call lx.extract. Do not hard-code your actual secret key in your script for security reasons.
It’s best practice to store API keys securely (using environment variables or a .env file that isn’t committed to code). This keeps your key safe and makes it easier to switch keys or move your code between environments.
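If you prefer to manage the key from Python (for example, in a notebook), one common pattern is to load it from that .env file with the python-dotenv package; note that python-dotenv is an extra dependency not installed above.
import os
from dotenv import load_dotenv  # requires: pip install python-dotenv

# Reads LANGEXTRACT_API_KEY=<your-key> from a local .env file into the process
# environment, where LangExtract will pick it up automatically.
load_dotenv()

assert os.getenv("LANGEXTRACT_API_KEY"), "LANGEXTRACT_API_KEY is not set"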
Log in to Weights & Biases: Since we’ll use W&B to log and visualize results, ensure you’re logged in to W&B. If you haven’t used W&B before, sign up for a free account on wandb.ai and find your API key. Then run:
wandb login
in your terminal, or !wandb login in a notebook cell, and paste your W&B API key when prompted. (You only need to do this once on a given machine/environment.)
After these setup steps, we’re ready to use LangExtract!

Step 2: Define your extraction task

In this step, we’ll define what information we want to extract and provide an example to illustrate the desired output. LangExtract uses a prompt description and few-shot examples to understand our task.
For our example scenario (literary text analysis), let's extract:
  • Characters (people mentioned),
  • Emotions (explicit expressions of feeling),
  • Relationships (descriptions of relationships or comparisons).
We want the output structured such that each extraction has a class (one of character/emotion/relationship), the exact text snippet, and some attributes giving more context (like an emotion’s type or a relationship’s nature).
We will now create:
  1. A prompt string describing the task.
  2. An example data point demonstrating the output format.
# 1. Define the prompt with instructions for the extraction
prompt = """Extract characters, emotions, and relationships in order of appearance.
Use exact text from the document for each extraction (no paraphrasing and no overlapping text).
Provide useful attributes for each entity: for example, an emotion should have a "feeling" attribute describing it,
and a relationship should have a "type" attribute (such as family, friendship, metaphor, etc.)."""

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            )
        ]
    )
]
Let’s break down what we did:
  • The prompt clearly explains what to extract and sets some ground rules (like using exact text and adding attributes). This helps the LLM focus and not do unwanted things (like summarizing instead).
  • The example is a snippet from Romeo and Juliet. In that snippet, we manually identified:
    • "ROMEO" as a character, with an attribute indicating his emotional state is wonder (since he’s marveling at Juliet’s beauty).
    • "But soft!" as an expression of emotion, with an attribute gentle awe (this is an interpretation of that phrase).
    • "Juliet is the sun" as a relationship extraction, labeling it a metaphor type relationship (Juliet being compared to the sun).
  • We used lx.data.ExampleData and lx.data.Extraction to construct these examples in the exact format LangExtract expects.
Notice that the structure of the example’s output (classes and attributes) defines the schema for the model. Our example tells the model: “when you see text like this, output data in this shape”. The clearer and more representative the example, the better the model will generalize to new input. For many tasks, 1-3 examples are enough.
(No output yet from the code above – we’ve just set up the prompt and examples.)
Crafting good examples is key to LangExtract’s success. Use realistic text in your examples and cover each type of information you want. If your domain has edge cases (like abbreviations or uncommon formats), consider showing one in the examples. This few-shot approach means quality over quantity – a few well-chosen examples can yield excellent results.

Step 3: Run the extraction

Now we’re ready to extract data from our target text using LangExtract. We’ll use a short input text (a line reminiscent of Romeo and Juliet), but you could just as easily provide a whole document or even a URL to a document.
# Define the input text to analyze
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo."

# Run the extraction using LangExtract
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"  # using the default recommended Gemini model
)

# Print out the structured results
for ext in result.extractions:
    print(f"{ext.extraction_class.upper()}: {ext.extraction_text} -> {ext.attributes}")
Explanation:
  • We call lx.extract(...) with our input_text, the prompt_description and examples we prepared, and specify the model_id. Here, "gemini-2.5-flash" is used, which is a fast variant of Google’s Gemini LLM that works well for many tasks. (Ensure that your LANGEXTRACT_API_KEY is set for authentication, as described in Step 1.)
  • The function will return an object (let’s call it an annotated document result) that contains the original text and a list of extractions (result.extractions). We then loop through each extraction and print the class, the exact text that was extracted, and the attributes.
When you run this, the model will be contacted behind the scenes. For a short text like this, it should only take a few seconds. If successful, you’ll see output lines for each extracted item.
Expected output:
CHARACTER: Lady Juliet -> {'emotional_state': 'longing'}
EMOTION: longingly -> {'feeling': 'yearning'}
RELATIONSHIP: her heart aching for Romeo -> {'type': 'romantic love'}
Your output may vary slightly in wording (since it depends on the LLM’s generation), but it should be similar. In our run, LangExtract identified:
  • Lady Juliet as a character, with the attribute indicating her emotional state is “longing”.
  • “longingly” (from the phrase "gazed longingly at the stars") as an emotion, with a feeling of “yearning”.
  • “her heart aching for Romeo” as a relationship, with the type labeled as “romantic love”.
Each of these pieces comes directly from the text, and the attributes add interpretive context. Also important: behind the scenes each of these extractions knows its position in input_text. For example, “Lady Juliet” spans characters 0–10 of the string and “longingly” spans characters 18–26 (zero-based, inclusive), and so on. LangExtract captured those spans, which is how it can later highlight them in context.
If no output was produced or some expected items were missed, don’t worry—we can refine the prompt or use a larger model if needed. (Common reasons for missing output include the prompt not covering a scenario, or the model truncating early. We address these in troubleshooting and next steps.)

Step 4: Visualize and review results

We have our structured data – now let’s explore ways to visualize and analyze it. This step has two parts:
  1. Using LangExtract’s built-in visualization to create an interactive HTML file.
  2. Logging the results to Weights & Biases and using W&B Weave to interact with the data.

4a. Generate an interactive HTML report (LangExtract visualization)

LangExtract can produce a standalone HTML file that shows the original text with all extracted entities highlighted and listed. This is extremely useful for reviewing the results or sharing with someone who wants to verify the extractions.
We’ll use the lx.io.save_annotated_documents function to save our results in a standard format (JSON Lines), and then lx.visualize to generate the HTML content.
# Save the extraction results to a JSONL file (each line could represent a document's results)
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")

# Generate an interactive HTML visualization from the JSONL file
html_content = lx.visualize("extraction_results.jsonl")

# Write the HTML content to a file
with open("visualization.html", "w") as f:
    f.write(html_content)

print("Interactive visualization saved as visualization.html")
Explanation:
  • save_annotated_documents takes a list of results (we only have one in [result]) and writes them to a file. The output is extraction_results.jsonl which is a JSONL (JSON Lines) file. If you open it, you’ll see a JSON representation of result including the text and all extraction entries.
  • lx.visualize reads that JSONL and creates an HTML string containing the interactive visualization. We then save that string to visualization.html.
  • The print is just to confirm we saved the file.
Expected printout:
Interactive visualization saved as visualization.html
At this point, if you open the visualization.html file in a web browser, you should see your input text with highlights. Each highlight corresponds to an extraction; if you click on or hover over one, you should see its details (class and attributes). There may also be a side panel listing all extracted items and allowing you to navigate between them. This HTML is self-contained (it includes the necessary scripts), so you can share it with colleagues as a review artifact. For large documents with many extractions, this is much more convenient than combing through raw text.
The HTML visualization is great for manual review. If you’re working in a team (say, a subject matter expert needs to validate the outputs), sending this interactive file can save a lot of time. It provides transparency — anyone can see exactly where each piece of data came from.

4b. Log results to W&B and explore with Weave

Beyond the static HTML, we can leverage Weights & Biases to log our results and explore them using W&B Weave. This will allow dynamic filtering, querying, and even charting of extracted data, all through a shareable web interface.
We’ll log two things to W&B:
  • A table of extractions: each row will be one extraction with columns for class, text, and attributes.
  • A summary metric (for example, the total number of extractions) just as an example of logging metadata.
First, initialize a W&B run (which will create a workspace to view the data):
wandb.init(project="langextract-tutorial", name="Extraction-run-demo", resume="allow")
This starts a new run in the project "langextract-tutorial". (Feel free to choose your own project name; if it doesn’t exist, it will be created.) The name is optional, but it helps identify the run. After running this, you should see output from W&B with a link to your run page.
Now, prepare the data for logging. We can convert our result.extractions into a tabular format. Using pandas for convenience:
import pandas as pd

# Convert extraction results into a list of dicts for tabular logging
data = []
for ext in result.extractions:
    data.append({
        "Class": ext.extraction_class,
        "Extracted Text": ext.extraction_text,
        "Attributes": str(ext.attributes)  # convert attributes dict to string for display
    })

df = pd.DataFrame(data)
Here we created a DataFrame df with columns “Class”, “Extracted Text”, and “Attributes”. We cast the attributes dict to a string just for cleaner display in the table (alternatively, W&B Tables can handle nested dicts, but they will appear as JSON strings anyway in the UI).
Next, log the table and any desired metrics:
# Create a W&B Table from the DataFrame
extraction_table = wandb.Table(dataframe=df)

# Log the table and a count metric to the W&B run
wandb.log({
    "extraction_results": extraction_table,
    "extraction_count": len(result.extractions)
})

wandb.finish() # end the W&B run
When this code runs, it will upload the data to W&B. The wandb.log call sends the extraction_table (which W&B will visualize as an interactive table in the UI) and a scalar extraction_count. We then call wandb.finish() to conclude the run (this ensures all data is synced).
Expected W&B output (console):
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /your_dir/wandb/run-<id>/
wandb: Run `langextract-tutorial/Extraction-run-demo` started, view it at:
wandb: https://wandb.ai/your-username/langextract-tutorial/runs/<run-id>
wandb: Logged data and finished run (3 extraction_results items, 1 metrics)
wandb: Run page: https://wandb.ai/your-username/langextract-tutorial/runs/<run-id>
(The URLs will point to your W&B workspace; clicking them takes you to the run page in your browser.)
Now, navigate to your W&B run page. You will find:
  • A Table named "extraction_results" under the run’s overview or the "media" tab. It should have one row for each extraction we printed earlier. You can sort this table by Class, search within it, or scroll through if you had many results.
  • The extraction_count metric under the run’s summary (in this case it’s 3). This is not very interesting with one run, but if you processed multiple texts or tried different models, you could compare these counts between runs.
The real power comes with W&B Weave for analysis:
On the run page, look for an option that says “Open in Weave” or a Weave icon. This opens an interactive notebook/dashboard environment where you can use the logged data programmatically or via UI blocks. For instance, in Weave you could write a small panel to filter the table to only show Class == "character" extractions. Or you could aggregate data — if you processed dozens of documents, you might create a chart of how many emotions were extracted per document, etc.
For our single-run, small example, you might simply play with the table: try ordering by the "Extracted Text" column, or just appreciate that the Attributes column contains the context we got (longing, yearning, romantic love). In a real project, Weave could be used to join this extraction data with other data sources or to create a live monitoring dashboard (imagine running LangExtract on a stream of documents and watching the extractions come in live on W&B).
Using W&B Weave, you can collaboratively analyze your extraction results. You can share the run (or a Weave dashboard link) with teammates. They could, for example, add comments on particular data points or help create filters to find anomalies in the extractions. Because the data is structured, you can also use Weave to quickly search through all extracted items. This beats scanning through raw text or static reports, especially for large-scale extraction tasks.

Step 5: Advanced tips and next steps

Congratulations on extracting and visualizing data with LangExtract! At this point, you’ve seen the full workflow: define task → run extraction → review results. This section provides some advanced pointers and ideas for where to go next:
  • Using local or alternative models: If you don’t have access to Gemini or want to run entirely offline, LangExtract supports local models through Ollama. After installing and starting an Ollama server with a model (say Llama 2), you can call:
    result_local = lx.extract(..., model_id="ollama:llama2")
    This tells LangExtract to use your local Llama2 model for inference. Keep in mind that local models may be slower or less accurate than a powerful cloud model, but they remove external dependencies and costs. Make sure the model you specify in model_id is one you’ve loaded in Ollama. Similarly, with some coding, LangExtract could be extended to other providers’ APIs – but the built-in path is optimized for Gemini or Ollama.
  • Batch processing and long texts: If you have multiple documents or a very long document to process, LangExtract can handle that too. You can pass a list of texts or even a URL (as text_or_documents) to lx.extract, and it will process them in sequence or in parallel depending on length. By default, it does chunking automatically. Advanced parameters like max_workers (for parallel threads) and extraction_passes (for multiple model passes on the text) can be tuned. For example:
    result = lx.extract(text_or_documents=long_text, prompt_description=prompt, examples=examples,
    model_id="gemini-2.5-flash", extraction_passes=2)
    This would run two passes to attempt a higher recall of extractions in a very large long_text. Use these options only as needed, since extra passes mean more API calls. The good news is LangExtract manages chunking internally, so you typically don’t need to split the text yourself.
  • W&B Models for custom models: If you decide to train or fine-tune your own model for the extraction task (for instance, fine-tuning a smaller LLM on domain-specific data to possibly improve performance or reduce cost), consider using W&B Models to version and deploy it. You could register your fine-tuned model in the W&B Models registry and use W&B’s inference service to get an endpoint. LangExtract could then be pointed to that endpoint (similar to how we used Gemini). This way, you maintain the benefits of LangExtract’s schema and grounding while plugging in a model tailor-made for your data. W&B will help you keep track of which model version was used for which runs, and you can compare their performance by analyzing the extraction outputs logged to W&B.
  • Evaluate and refine: Extraction quality can sometimes be subjective. It’s good practice to evaluate how LangExtract is performing on your specific task. You might manually inspect a sample of outputs (using the HTML viz or W&B Weave as we did) and note if the model is missing anything or making mistakes. Then refine your approach:
    • Add or adjust example prompts to clarify ambiguous cases.
    • Use a larger model (e.g., gemini-2.5-pro) if the smaller one struggles with nuance.
    • If the model is extracting too aggressively (false positives), add instructions in the prompt to be more selective or give counter-examples of what not to extract. Because you can iterate quickly without retraining, you can do several prompt tweaks and re-run to zero in on the best prompt for your needs.
  • Try it on your data (next challenge): Now that you’ve walked through a simple example, try applying LangExtract to a piece of text from your domain. For instance:
    • If you work in finance, take a paragraph from a financial report or news article and set up LangExtract to pull out key financial metrics or events.
    • If you’re in healthcare, try a snippet of a clinical note to extract medical conditions, treatments, and patient details.
    • If you have no domain preference, even try another literature piece or a Wikipedia article and extract some structured info (like people, dates, and events mentioned).
    Set up a new prompt and examples for your scenario, then run the extraction. Use the W&B integration to log the results and see them in a table. This will solidify your understanding and uncover any new challenges in prompt design for that text.
  • Scaling up: Once you’re comfortable with single documents, you can scale out to many documents. You might write a loop to process a folder of text files or a list of URLs. With W&B, you could log each document’s results as a separate run or as part of a single run (maybe as separate table items or artifacts). This would let you monitor progress and review outputs collectively. W&B Weave would allow you to merge results from multiple runs if needed to do a corpus-wide analysis (for example, summarizing how often certain entities appear across 100 documents). A minimal batch-processing sketch appears at the end of this section.
  • Performance considerations: Keep an eye on the cost and speed if you use a cloud model for many extractions. Each lx.extract call will consume tokens in the LLM and incur latency. Using smaller models, or filtering your text to remove irrelevant sections before feeding to LangExtract, can save time. LangExtract’s chunking helps, but extremely long documents will still take proportionally longer (and possibly multiple API calls). If you plan to extract from thousands of documents, consider asynchronous or batch strategies and use W&B to track the batches. You can even log custom timing metrics (e.g., how long each extraction took) with W&B to identify bottlenecks.
By exploring these advanced options, you can tailor LangExtract to your needs and integrate it into real workflows. We encourage you to experiment — change models, tweak prompts, try different types of content — and use W&B to keep track of what works best.
  • Custom model integration: For those developing their own NLP models, LangExtract’s approach can be a blueprint. If you have a model that you’ve fine-tuned for a specific extraction (say a custom BERT or T5 model), you could still use LangExtract’s data structures and visualization by writing a small wrapper that feeds your model and outputs in LangExtract’s format. Although LangExtract is built with LLMs in mind, its principles of source-grounded, schema-driven extraction apply broadly.
  • Stay updated: LangExtract is a young project (as of 2025). Keep an eye on its GitHub repo for updates – new features or support for more models may be added. The community might contribute new examples (for medicine, law, etc.) that you can learn from. Likewise, W&B is continuously improving Weave and other LLMOps tools, which could further simplify integration (for example, a future update might let you deploy a LangExtract pipeline and monitor it entirely through W&B).
In summary, the steps you learned here are just the beginning. LangExtract, combined with W&B, provides a powerful framework for turning unstructured text into actionable data. With these advanced tips, you should be well-equipped to adapt the solution to larger problems, ensure quality, and collaborate with others on extracting insights from text.
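Picking up the scaling-up idea above, here is a rough sketch of that batch pattern: loop over a folder of text files, run the same prompt and examples on each document, and log one combined W&B table. The folder path and project name are placeholders, and the prompt and examples objects are the ones defined in Step 2.
import glob
import pandas as pd
import wandb
import langextract as lx

wandb.init(project="langextract-tutorial", name="batch-extraction-demo")

paths = glob.glob("documents/*.txt")  # placeholder folder of input documents
rows = []
for path in paths:
    with open(path) as f:
        text = f.read()
    # Reuse the prompt and examples from Step 2; swap the model_id as needed.
    result = lx.extract(text_or_documents=text, prompt_description=prompt,
                        examples=examples, model_id="gemini-2.5-flash")
    for ext in result.extractions:
        rows.append({
            "Document": path,
            "Class": ext.extraction_class,
            "Extracted Text": ext.extraction_text,
            "Attributes": str(ext.attributes),
        })

wandb.log({
    "batch_extraction_results": wandb.Table(dataframe=pd.DataFrame(rows)),
    "documents_processed": len(paths),
})
wandb.finish()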

Conclusion

LangExtract is a versatile and powerful tool for anyone dealing with large amounts of text. In this guide, we saw how it bridges the gap between unstructured text and structured data, allowing us to extract meaningful information with relatively little effort. Key takeaways include:
  • Precision and trust: LangExtract’s extractions are grounded in the source material, so you can always verify them. This is crucial in domains like medicine and law where provenance is everything.
  • Consistency: By enforcing structured outputs via examples and schemas, LangExtract produces clean data ready for analysis or storage, saving you from the headache of parsing messy model outputs.
  • Flexibility: The same approach works across domains (finance, healthcare, literature, you name it) and even across languages. We only scratched the surface with an English literature example, but you could apply LangExtract to Chinese financial documents or Spanish legal texts just as well.
  • Integration with LLMs and tools: LangExtract lets you leverage cutting-edge LLMs like Google’s Gemini for high-quality results, or use local models for privacy. And with integrations into tools like W&B Weave, you get visibility into the process – you can monitor, visualize, and share the results easily. This makes it not just a development tool, but something you can use in production with confidence (imagine a pipeline where new data comes in, LangExtract structures it, and W&B dashboards show live updates of extracted insights!).
By completing the step-by-step tutorial, you’ve learned how to define an extraction task, run it, and review the outputs both offline and in a collaborative online environment. From here, you can start applying LangExtract to your own projects. Whether it’s mining research papers for trends, pulling requirements from contract documents, or analyzing user feedback for sentiment and themes – the process is similar. Define what you need, let LangExtract and an LLM do the heavy lifting, and focus your time on interpreting the results.
We encourage you to experiment further and share your experiences. Every domain might present unique challenges in terms of language and context, but LangExtract’s general approach and the tips provided should give you a strong foundation. As you refine your prompts and examples, you’ll likely discover new ways to improve extraction quality. And with W&B, you have a companion to track those improvements and collaborate with others.
In conclusion, LangExtract exemplifies how modern AI can unlock hidden insights in unstructured data. By coupling it with robust MLOps practices (like tracking and visualization), you gain not just automation, but also understanding and control. We’re excited to see what you will build and discover using LangExtract. Happy extracting!

