We walked through the door of an otherwise unremarkable industrial park suite into a gymnasium-sized room packed full of shelves. The shelves were packed floor-to-ceiling with banker's boxes. The banker's boxes were packed full of documents.
Back in the early 1990s, two large aerospace companies got crosswise with each other over some intellectual property claims. The dispute ended up in court. Discovery included more than a million pages of documents.
Through an interior door was another similarly sized room. This one was filled with scanners, workstations, optical drives, and high school students. Their job was to prepare all of this documentation for use at trial.
The process began with an expert reviewing each document and highlighting key words: names, products, processes, or anything else that might need to be cross-referenced at trial. The reviewed document was scanned, given a unique identifier, and stored on an optical disk by one student, then displayed on a workstation manned by another. The student at the workstation typed the keywords into an indexing program. Each document was keyed twice to catch errors. I think they must have employed most of the nearby high school, with several dozen students working shifts around the clock. Even then, it took more than a year to process all of the documents.
I was working as a consultant and developer for an image and database processing consultancy, and we were there to check out that indexing package. It was an impressive operation, especially considering that storage media capacity at the time was measured in kilobytes or single-digit megabytes.
I think back on that episode often, amazed at how differently the process would look today. Instead of just keywords, the entire document text would be stored and searchable. Nobody would have to type anything. High-speed scanners, high-capacity drives (or better yet, cloud storage), and OCR software are commonplace. The whole thing would be finished in a couple of weeks, tops. And to top it all off, you could ingest the documents into a large language model and answer questions in seconds.
But regardless of whether we do it the fast way or the slow way, all of that “unstructured data” has to be given some structure to make it usable.
Maybe I’m just old and feisty. Or not. I’ve had this opinion for a long time. Maybe that means I’ve been old and feisty for a long time. But talk about “unstructured data” has always annoyed me. “Unstructured data” is an oxymoron. All data has structure. Even if it’s not stored in a computer, data requires structure to be interpretable. And, of course, anything that gets processed by a computer has structure.
A Microsoft Word file is essentially a ZIP archive containing XML files that define the structure and content of the document. A JPEG file starts with a specific two-byte marker. An MP3 audio file consists of an optional header followed by a sequence of frames, each with its own header.
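The point is easy to demonstrate: even files we call "unstructured" announce their structure in their very first bytes. A minimal sketch in Python, using the standard signatures for these formats (the sample inputs below are fabricated fragments, not real files):

```python
# Identifying file types by their leading "magic bytes" -- the structural
# signatures that open supposedly unstructured files.

def identify(data: bytes) -> str:
    """Guess a file's type from its opening bytes."""
    if data[:2] == b"\xff\xd8":
        return "jpeg"        # JPEG files open with the SOI marker FF D8
    if data[:2] == b"PK":
        return "zip-based"   # .docx and friends are ZIP archives ("PK")
    if data[:3] == b"ID3" or data[:2] == b"\xff\xfb":
        return "mp3"         # ID3 tag header, or a bare MPEG frame sync
    return "unknown"

print(identify(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # jpeg
print(identify(b"PK\x03\x04"))                    # zip-based
```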
Even so-called “semi-structured” data like XML or JSON carries its structure within the file itself. Nevertheless, I can already hear you saying, “that’s not what we’re talking about when we say ‘unstructured.’” You mean that the content does not conform to a predefined data model or schema. I know, I know. That doesn’t make the term any less oxymoronic. And I just needed to get that off my chest.
Unstructured data is receiving renewed attention as a result of its use in training large language models. Books. Images. Papers. Websites. Audio and video. Chat. Bulletin board and social media posts. And emails. It all goes into the big large language data pot.
And not surprisingly, the quality of this unstructured data is also receiving renewed attention. When you write an email, do you think that it might someday be used to train a large language model at your company? One of the emerging uses for LLMs is to analyze communications between salespeople and potential clients to try to identify strategies and language that are more likely to result in a sale.
What is the “data quality” of your emails? Or presentations or papers? Have you ever gone back and looked at something you wrote a long time ago and realized that your thoughts on the subject had evolved in the interim? Those original, unevolved thoughts may be immortalized in an LLM somewhere.
When it comes to unstructured data, the same dimensions of Data Quality still apply: accuracy, completeness, consistency, timeliness, and so forth. It’s just that they are more often considered within the context of data capture, not data content. Was the OCR text extracted from the document accurately transcribed? Were all of the audio segments of the interview included?
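Capture quality like OCR accuracy can even be quantified. One common measure is character error rate (CER): the edit distance between the OCR output and a verified transcription, divided by the transcription's length. A minimal sketch, with illustrative strings rather than real OCR output:

```python
# Character error rate (CER) as a measure of OCR capture quality:
# Levenshtein edit distance between OCR output and ground truth,
# normalized by the length of the ground truth.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

truth = "Discovery included more than a million pages"
ocr   = "Disc0very 1ncluded more than a miilion pages"   # 3 misread chars
cer = edit_distance(ocr, truth) / len(truth)
print(f"CER: {cer:.3f}")
```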
But just as with structured data, the accuracy of the content must be considered as well. I would consider that to be at least as important. And much harder. When you have a database table, it’s easy to say that the airport_code field must contain a valid IATA airport code. You can cross-reference the customer account numbers on transactions against the customer account numbers in the customer master table.
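Those structured-data checks are easy to sketch. Here is a minimal illustration, with made-up tables and a tiny sample of IATA codes standing in for the real reference data:

```python
# Two classic structured-data quality checks: validating a field against
# a reference list, and cross-referencing a foreign key against a master
# table. All data here is illustrative.

VALID_IATA = {"JFK", "LAX", "ORD", "SEA"}   # sample of the IATA reference list
customers = {"C001", "C002"}                # keys in the customer master table

transactions = [
    {"id": 1, "customer": "C001", "airport_code": "JFK"},
    {"id": 2, "customer": "C003", "airport_code": "XXQ"},  # fails both checks
]

errors = []
for t in transactions:
    if t["airport_code"] not in VALID_IATA:
        errors.append((t["id"], "unknown airport code"))
    if t["customer"] not in customers:
        errors.append((t["id"], "no matching customer master record"))

print(errors)
```

The equivalent checks for a document, an article, or an email have no reference table to consult, which is exactly the problem.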
But how do you assess the accuracy and completeness, a.k.a. the quality, of a document? Or an article? Or a website? Or an email? How do you assess its reliability? How often do we discover that something that was an unquestionable news story yesterday was actually wrong? Perhaps intentionally wrong. Or maybe it was the other way around: something that was considered a tinfoil hat conspiracy theory yesterday is discovered to be true today. It wasn’t a conspiracy theory, it was a spoiler.
How many commercial large language models were trained with incorrect information? How do they get corrected? And who is the authority anyway? How do we know what’s been corrected and what’s still wrong? We truly have a crisis of authority (this is a topic I’ll be writing more about in the future).
One of my hobbies is family history research. Unstructured data is everywhere, but especially in family trees and in historical records. The objective is to maximize confidence in your family tree connections by finding as many confirming documentary sources as possible. Many times you can jumpstart your family tree by copying someone else’s.
The problem is that not all sources are created equally, and not all family trees are reliable. Family connections can be incorrect. Record transcriptions can be incorrect. Asserted facts might not have any associated sources. And some of the sources are just wrong. Over time you come to know which are reliable and which to avoid or ignore.
The trouble is that everybody is encouraged to leverage everyone else’s research. An incorrect association gets perpetuated when others copy it. Multiple family trees having the same association reinforces it. (Sounds like the same path that leads to LLM model collapse to me.) When the error is identified and corrected, the incorrect interpretations still far outweigh the correct ones. I’ve seen many instances where what looked like an outlier was actually the correction. (This is also an LLM training challenge.)
Are you going to go review all of your company’s memos and emails? Probably not. But it might be worthwhile to make your folks aware of how the artifacts they create, like emails, presentations, and reports, could impact the performance of the company’s AI models.
Back in the old days, people had to read through the documents and pull out the relevant or interesting information. At the same time, they would evaluate the quality, accuracy, and reliability of the document’s content. We can now do the same thing orders of magnitude faster using AI. Nevertheless, the Data Quality still needs to be validated. AI systems can evaluate the quality of unstructured data sources, but there still needs to be some level of human validation or confirmation. How much “hands off” are you comfortable with?
Are you evaluating the quality of the unstructured data you’re feeding to your large language models? Might be time to at least start thinking about it.