00:00:00 You've probably tried turning emails, PDFs, or transcripts into structured data at some point, and it went sideways fast. Everyone thinks the hard part is building the app. It's not. It's the text, because a huge chunk of real-world data is unstructured, and most pipelines fall apart right here.
00:00:16 Now you'd expect the fix to be more rules, more NLP, but some devs are actually doing the opposite. This is LangExtract. It's a free Google open-source tool that's quietly growing, and fast. We have videos coming out all the time, so be sure to subscribe.
00:00:32 Okay, now LangExtract just sounds like another extraction library, and at first glance it kind of is, but here's what makes it different. LangExtract is a Python library that uses LLMs like Gemini or GPT to pull structured data out of messy text. So yes: entities, attributes, relationships, all turned into clean output like JSON or even interactive HTML.
00:00:58 The real reason devs care is that every single extraction is grounded back to the exact text span it came from. Which means instead of the model saying, "Trust me," it says, "Here's the exact sentence I used." That's the big change here.
00:01:11 Now the workflow here is basically something like: the prompt goes in, extraction happens, and then you get this structured output you can actually verify. Before I answer the big question of why devs are ditching old-school NLP for this, let me show you how it all works first so you can try it out.
00:01:27 All right, here's a simple example. On the screen, we've got some unstructured text I found, a set of clinical notes, and right now it's just text in a text file. A human can read it and pull out the important parts, but a computer sees it all as gibberish.
00:01:41 First off, I had to clone the Git repo and install the requirements, then I also needed to get my Gemini API key, which I just housed in a .env file. I then typed out this Python script to run it and described what I wanted to extract in my prompt. This is why you need some understanding of Python. All my entities, attributes, and relationships are written as this prompt. There is no training data, there is no model tuning. Then LangExtract runs and I get a structured JSON output.
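A script like the one described here might look something like the sketch below. This is a hedged reconstruction modeled on LangExtract's README-style usage, not the exact script from the video: the `lx.extract` call, the `ExampleData`/`Extraction` few-shot structure, and the `gemini-2.5-flash` model ID are assumptions that may differ in your installed version, and the clinical sentences are made up.

```python
import os

# Describe what to extract in plain language -- no training data, no model tuning.
prompt = (
    "Extract medications, dosages, and conditions from the clinical note. "
    "Use the exact words from the text for each extraction."
)

# One few-shot example as plain data (hypothetical clinical sentence).
few_shot = {
    "text": "Patient started on 250 mg amoxicillin for an ear infection.",
    "extractions": [
        {"class": "medication", "text": "amoxicillin", "attributes": {"dosage": "250 mg"}},
        {"class": "condition", "text": "ear infection", "attributes": {}},
    ],
}

note = "Pt reports chest pain; started on 81 mg aspirin daily."

# The actual call needs the Gemini key from the .env file, so it's guarded here.
if os.environ.get("GEMINI_API_KEY"):
    import langextract as lx  # pip install langextract

    examples = [
        lx.data.ExampleData(
            text=few_shot["text"],
            extractions=[
                lx.data.Extraction(
                    extraction_class=e["class"],
                    extraction_text=e["text"],
                    attributes=e["attributes"],
                )
                for e in few_shot["extractions"]
            ],
        )
    ]
    result = lx.extract(
        text_or_documents=note,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",
    )
    # Each extraction comes back with its class, exact text, and source span.
    for ex in result.extractions:
        print(ex.extraction_class, ex.extraction_text, ex.char_interval)
```

Note the few-shot example doing the work a training set would normally do: one annotated sentence tells the model both the schema and the "use exact words" convention.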
00:02:09 Now here's the part I want you to notice, because this is the whole point. Every extracted field here is linked back to the exact sentence it came from in my JSON. So whether you're reviewing it, debugging, or explaining it to someone else, you're no longer guessing.
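To make the grounding idea concrete, here is a tiny pure-Python sketch (not LangExtract itself) of what span-grounded output buys you: because each field carries character offsets into the source, a reviewer, or a test, can check it mechanically. The dict shape and offsets are hypothetical.

```python
note = "Patient started on 250 mg amoxicillin for an ear infection."

# A grounded extraction: the value plus the exact character span it came from.
extraction = {"class": "medication", "text": "amoxicillin", "start": 26, "end": 37}

def is_grounded(source: str, ex: dict) -> bool:
    """True only if the extracted text really appears at the claimed span."""
    return source[ex["start"]:ex["end"]] == ex["text"]

print(is_grounded(note, extraction))  # prints True; a hallucinated span prints False
```

That one-line check is the difference between "trust me" and "here's the exact sentence I used."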
00:02:23 But one of the coolest features I found was the interactive HTML page it auto-generates. This is where you can click an entity, see it highlighted in the original text, and run through a quick visual check of all the targeted words you were after. That's why it's huge for debugging, audits, reviews, that kind of thing. And if you need to do this at scale, batch mode lets you run it across thousands of documents more efficiently.
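For reference, here is roughly how that HTML review page gets produced. Again a hedged sketch based on the project's README-style helpers: `lx.io.save_annotated_documents`, `lx.visualize`, and their parameters are assumptions that may differ in your version, and the clinical text is invented.

```python
import os

JSONL_NAME = "results.jsonl"  # grounded extractions, one document per line
HTML_NAME = "review.html"     # clickable visualization built from the JSONL

if os.environ.get("GEMINI_API_KEY"):
    import langextract as lx  # pip install langextract

    result = lx.extract(
        text_or_documents="Pt reports chest pain; started on 81 mg aspirin daily.",
        prompt_description="Extract medications with dosages, using exact text.",
        examples=[
            lx.data.ExampleData(
                text="Patient given 250 mg amoxicillin.",
                extractions=[
                    lx.data.Extraction(
                        extraction_class="medication",
                        extraction_text="amoxicillin",
                        attributes={"dosage": "250 mg"},
                    )
                ],
            )
        ],
        model_id="gemini-2.5-flash",
    )
    # Save the grounded output, then render the clickable HTML review page.
    lx.io.save_annotated_documents([result], output_name=JSONL_NAME, output_dir=".")
    html = lx.visualize(JSONL_NAME)
    with open(HTML_NAME, "w") as f:
        f.write(html if isinstance(html, str) else html.data)
```

The JSONL file is also what you'd feed into downstream steps like a knowledge graph or search index.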
00:02:48 So yeah, this looks great, and it was really cool to try, especially the HTML part.
00:02:52 Okay, now why are devs ditching old-school NLP for this? It's because messy text isn't just annoying, right? It is annoying, but it's also expensive. It costs time and it breaks things. That's why we're seeing LangExtract show up where accuracy and traceability actually matter: things like extracting structured data from clinical notes while still being able to audit where it came from. That's huge. Or maybe turning feedback and support tickets into knowledge graphs instead of those giant CSV files.
00:03:20 With all the good we get from these kinds of tools, we also get some bad, and both will influence how you decide to use it. For the good, we have a lot here. The setup is simple, right? Pip install, write a prompt, go. Grounded outputs reduce LLM trust issues because you can verify everything, and you're not locked into one model: it works local or cloud, and both are going to work. It handles long documents better than most tools, it's free, it's open source, and it's moving fast.
00:03:45 There are some drawbacks you may feel, though. You still pay LLM costs at scale, and really noisy text can cause incomplete extractions. It's Python-first, so if you don't know Python there might be a bit of a learning curve, but Python's great. And it's not ideal for ultra-low-latency real-time apps.
00:04:01 Why should you care? Because LangExtract lowers the barrier to working with unstructured data without building custom models or fragile pipelines. It makes LLM output something you can actually trust in production, because it's tied back to where it came from, especially in sectors like finance, healthcare, and compliance, where it really does matter. Plus, it fits right into modern stacks: RAG, search, knowledge graphs, analytics, whatever you're building. If unstructured data is slowing you down, this tool can seriously level you up. If data is part of your job, and let's be real, it probably is, it's worth checking out. We'll see you in another video.