# LangExtract — How I Turned Messy Text Into Clean JSON for Free
One of my Telegram bots found this tool for me.
I have 7 bots running on my Mac Mini via NanoClaw. Riko is the research one — I send her links (changelogs, blog posts, releases) and she tells me if something matters for our stack.
A few weeks ago I sent her an article about structured extraction. She came back with: “This fits your RSS pipeline. Replace the raw prompts with example-based schemas. Consistent output, zero cost.”
She was right. That tool was LangExtract.
## The problem it solves
I run 11 n8n workflows. Many of them process text that isn’t structured: RSS articles, competitor pages, invoices, prospect websites. Before LangExtract, each workflow had a Gemini or Claude node with a long prompt: “Extract the following fields… return JSON… don’t add extra keys…”
It worked 80% of the time. The other 20%: wrong field names, hallucinated data, inconsistent formats. Every edge case meant a longer prompt. Longer prompts meant more edge cases. Classic loop.
LangExtract flips the approach. Instead of describing what you want in prose, you show it 2-3 examples of input → output. It learns the pattern and applies it consistently.
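To make that concrete, here's roughly what one of my schema files looks like — the input/output pair is the HubSpot example from later in this post, but the file layout shown here is illustrative of the idea, not LangExtract's canonical format:

```json
{
  "schema_name": "article-enrichment",
  "examples": [
    {
      "input": "HubSpot just launched Breeze, their new AI layer that sits across the entire CRM...",
      "output": {
        "topic": "CRM AI Integration",
        "category": "product_launch",
        "key_insight": "HubSpot embedding AI across entire CRM stack",
        "relevance_score": "high",
        "action_item": "Analyze Breeze feature gaps vs our custom stack"
      }
    }
  ]
}
```

Two or three of these pairs per schema is enough; the model infers the pattern from the examples instead of from prose instructions.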
## My setup
A FastAPI server on the Mac Mini, listening on `localhost:8765`. It starts at boot via launchd. One endpoint:

```
POST /extract
{ "text": "...", "schema_name": "invoice" }
```
7 schemas loaded at startup, each a JSON file with 2-3 real examples:
| Schema | What it does | Where it's used |
|---|---|---|
| article-enrichment | RSS articles → topic, insight, action item | n8n RSS pipelines (ITA + BCN) |
| prospect-audit | Company website → ICP fit, team size, funding, pain signals | `prospect-audit.py` CLI → Crono |
| invoice | Invoice text → supplier, VAT, amount, category | `parse-invoice.py` CLI |
| competitor-profile | Competitor page → pricing, features, positioning | CI Agent n8n pipeline |
| rss-funding | Funding news → company, round, amount, investors | ABM Signal Monitor |
| linkedin-post-analysis | Post text → hook type, CTA style, engagement pattern | Content strategy |
| content-repurpose | Blog/newsletter → key quotes, stats, social snippets | Content Repurposing Pipeline |
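Loading the schemas at startup is a few lines — a simplified sketch of that logic (the directory name is illustrative; in my setup each schema lives in its own JSON file):

```python
import json
from pathlib import Path

def load_schemas(schema_dir: str) -> dict:
    """Load every *.json schema file at startup, keyed by filename stem."""
    schemas = {}
    for path in Path(schema_dir).glob("*.json"):
        schemas[path.stem] = json.loads(path.read_text())
    return schemas

# At server startup: SCHEMAS = load_schemas("schemas/")
# The /extract endpoint then looks up SCHEMAS[request.schema_name]
# and passes its examples to LangExtract.
```

Adding an eighth schema means dropping one more JSON file in the directory and restarting — no code changes.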
Model: Gemma 3 4B running locally via Ollama. No API key, no rate limits, no cost. Gemini 2.0 Flash as fallback if the local model is down.
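The fallback itself is a few lines — a sketch of the pattern with the actual model calls stubbed out (in my server, the local call goes through Ollama's HTTP API and the cloud call through the Gemini SDK):

```python
from typing import Callable

def extract_with_fallback(
    text: str,
    local_call: Callable[[str], dict],
    cloud_call: Callable[[str], dict],
) -> dict:
    """Try the local Gemma model first; fall back to Gemini if it's down."""
    try:
        return local_call(text)
    except Exception:
        # Local Ollama unreachable or errored — use the cloud fallback.
        return cloud_call(text)
```

In practice `local_call` wraps a POST to Ollama's `/api/generate` endpoint and `cloud_call` wraps the Gemini API; both return the parsed JSON.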
## Real examples from my schemas
### RSS article → structured insight
I feed it an article like “HubSpot just launched Breeze, their new AI layer that sits across the entire CRM. Breeze Copilot for in-app AI, Breeze Agents for automation, Breeze Intelligence for enrichment…”
It returns:
```json
{
  "topic": "CRM AI Integration",
  "category": "product_launch",
  "key_insight": "HubSpot embedding AI across entire CRM stack — copilot, agents, and data enrichment bundled together",
  "relevance_score": "high",
  "action_item": "Analyze Breeze feature gaps vs our custom stack — potential content angle"
}
```
This goes straight into a Google Sheet. The n8n RSS pipeline processes articles from 9 subreddits and 20+ RSS feeds — all structured, all filterable.
### Prospect website → ICP fit score
Before a sales call, I run `python3 prospect-audit.py https://factorial.co`. It scrapes the homepage and `/about` page, sends the text to LangExtract, and returns:
```json
{
  "company_stage": "Series A",
  "team_size": "25",
  "location": "Barcelona",
  "funding": "€5M from Nauta Capital",
  "icp_fit": "strong",
  "pain_signals": "Expanding from SMB to enterprise — needs growth infrastructure",
  "outreach_angle": "Enterprise expansion playbook + positioning for upmarket move"
}
```
The same script also catches weak fits — a local web agency in Milan with 8 people and a WordPress site gets `icp_fit: "weak"` and `outreach_angle: "Not ICP — skip"`. Saves me the call.
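The scraping half of the script is minimal — a sketch with the HTTP fetch stubbed out so the shape is clear (the function names here are mine, illustrative; in the real script the fetch is a plain HTTP GET):

```python
def gather_site_text(base_url: str, fetch) -> str:
    """Concatenate homepage + /about text for a single extraction call.

    `fetch` is any callable that takes a URL and returns page text
    (requests, httpx, or a stub in tests).
    """
    home = fetch(base_url)
    about = fetch(base_url.rstrip("/") + "/about")
    return home + "\n\n" + about
```

One combined blob of text per prospect keeps it to a single LangExtract call, so the model sees funding info from `/about` alongside positioning from the homepage.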
### Spanish invoice → accounting data
I'm an autónomo (self-employed) in Spain. Every quarter: categorize expenses, map VAT rates, send to accountant. Invoices from Netlify, Qonto, Google, OpenAI — all different formats, some in English, some in Spanish.
LangExtract parses them all. A Qonto invoice:
```json
{
  "supplier": "Qonto S.A.S.",
  "invoice_number": "INV-2026-02-8847",
  "amount_total": "10.89",
  "currency": "EUR",
  "tax_rate": "21%",
  "category": "Software & SaaS"
}
```
The `parse-invoice.py` script adds a VAT mapping layer on top — it knows Google and LinkedIn are intracomunitario (intra-EU, casilla 10/11), while Netlify and Anthropic are extracomunitario (outside the EU). My accountant gets a clean spreadsheet instead of a pile of PDFs.
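That mapping layer is a plain lookup — a simplified sketch (supplier lists trimmed to the ones mentioned above; the "domestic" default is an assumption of this sketch, not a tax rule):

```python
# Simplified sketch of the VAT mapping inside parse-invoice.py.
INTRACOMUNITARIO = {"Google", "LinkedIn"}    # intra-EU: casilla 10/11
EXTRACOMUNITARIO = {"Netlify", "Anthropic"}  # outside the EU

def vat_treatment(supplier: str) -> str:
    """Map an extracted supplier name to its VAT treatment."""
    for name in INTRACOMUNITARIO:
        if name.lower() in supplier.lower():
            return "intracomunitario (casilla 10/11)"
    for name in EXTRACOMUNITARIO:
        if name.lower() in supplier.lower():
            return "extracomunitario"
    # Assumption in this sketch: anything unlisted is treated as domestic.
    return "domestic"
```

Substring matching keeps it robust against the legal-entity noise in invoice headers ("Google Ireland Ltd", "Netlify, Inc.").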
## Why examples beat prompts
After running both approaches across hundreds of extractions:
Prompts give you ~80% consistency. You spend the other 20% patching edge cases with more instructions, which create new edge cases.
Examples give you ~95%+ consistency. When you hit an edge case, you add one example. The old ones still work. No prompt drift.
The schema files are version-controlled. I can see exactly what each extraction is supposed to return. When something breaks, I compare the output to the example — the debugging surface is tiny.
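That comparison is mechanical enough to automate — a sketch of the check I mean (a hypothetical helper of mine, not part of LangExtract):

```python
def diff_against_example(output: dict, example_output: dict) -> dict:
    """Compare an extraction against a schema example's output shape."""
    out_keys, ex_keys = set(output), set(example_output)
    return {
        "missing": sorted(ex_keys - out_keys),  # fields the model dropped
        "extra": sorted(out_keys - ex_keys),    # fields it invented
    }
```

When something breaks, a non-empty `missing` or `extra` list points straight at the field to fix — usually by adding one more example to the schema file.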
## Three access patterns
Everything calls the same server:
- n8n workflows → HTTP Request node to `localhost:8765/extract`. RSS enrichment, CI Agent, content repurposing — all hit the same endpoint
- Python scripts → `prospect-audit.py`, `parse-invoice.py`. CLI tools for batch operations
- Claude Code via MCP → ad-hoc extraction when I need something quick without building a schema
One server, 7 schemas, three ways to call it. Total cost: $0.
I write about the systems I build every week in Inference. If you’re building with AI — not just prompting — this is for you.