A RAG Engineer's Guide to Document Parsing
DATE: 12th October 2024
The foundation of any Retrieval-Augmented Generation (RAG) application begins with effective document parsing. Parsing goes beyond simple text extraction: it must preserve the structure, context, and relationships within the document to ensure accuracy. Get this wrong, and your entire RAG pipeline suffers. If you fail to get the information out of your document collection at this first step, which is where most RAG pipelines start, it is “garbage in, garbage out” and nothing downstream will work properly.
The main challenge lies in the fact that language models currently struggle with understanding complex visual documents that contain tables, charts, or figures. While it’s possible to feed a single page from a PDF into a language model like ChatGPT, scaling this to handle thousands or millions of pages is problematic. This can lead to downstream hallucinations—where the model misinterprets or fabricates details.
- Hallucinations refer to instances where a language model generates information that is either inaccurate, fabricated or not grounded in the provided data. This typically occurs when the model misinterprets the input or doesn’t have access to the full, correct context, leading it to “guess” or fabricate details in its output.
- Downstream Hallucinations in RAG systems mean that issues occurring early in the pipeline, such as incorrect or incomplete parsing of documents, can result in inaccurate responses from the model later on. This can happen for many reasons, such as misinterpretation of document structure, loss of key information, or ambiguity introduced by poorly parsed data.
Developers must therefore break apart complex documents, isolate various elements such as text blocks and tables, and convert these into formats, like plain text or JSON, that language models can understand. This process is not new; industries such as healthcare and retail have long used specialised vision models to parse medical bills and receipts. However, RAG poses a unique challenge because it often deals with highly varied content types, requiring flexibility and accuracy across different kinds of documents.
The ultimate goal is to convert these diverse documents into language model-ready data that can be stored in the RAG system for intelligent retrieval and generation.
Some Existing Parsing Strategies: Strengths and Limitations
- PyPDF
PyPDF is a Python library that has long been a go-to tool for basic PDF text extraction. While effective for simple, text-heavy PDFs, it struggles to preserve the structure of complex layouts and formatted text, often losing key information. PyPDF also lacks the capability to handle visual objects like charts, tables, or graphs, limiting its use for complex documents.
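For simple, text-heavy PDFs, a minimal pypdf extraction loop might look like the sketch below (assuming the pypdf package and a hypothetical local file named report.pdf):

```python
from pypdf import PdfReader  # pip install pypdf

# Hypothetical input file; any text-based PDF will do.
reader = PdfReader("report.pdf")

# Extract plain text page by page; note that tables, layout, and figures
# are flattened or dropped, which is exactly the limitation described above.
pages = [page.extract_text() or "" for page in reader.pages]
full_text = "\n\n".join(pages)
print(full_text[:500])
```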
- Tesseract
Tesseract is an OCR (Optical Character Recognition) engine designed to extract text from scanned documents and images. It excels at converting image-based text into a machine-readable format but has similar limitations when it comes to maintaining document structure, especially with complex tables and layouts. Like PyPDF, Tesseract doesn’t process visual elements like charts and figures, which may require additional post-processing to achieve usable results (a minimal usage sketch follows the links below).
Sample videos or blogs:
- GitHub – tesseract-ocr/tessdoc
- PyTesseract: Python Optical Character Recognition | Using Tesseract OCR with Python
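As a rough sketch of Tesseract-based OCR via the pytesseract wrapper referenced above, assuming the Tesseract binary is installed and the input is a hypothetical scanned page image:

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (also requires the Tesseract binary)

# Hypothetical scanned page; the output is raw text with no table or layout structure.
image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)
print(text)
```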
- Unstructured
Unstructured is a modern document parsing library that handles various document formats by combining text extraction, table detection, and layout analysis. While it performs better than traditional tools in dealing with structured data, it can still struggle with highly complex or non-standard document formats, especially when handling visual content like images or charts.
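A minimal sketch of Unstructured's high-level partitioning API, assuming the unstructured package (with its PDF extras) is installed and a hypothetical input file:

```python
from unstructured.partition.auto import partition  # pip install "unstructured[pdf]"

# partition() auto-detects the file type and returns a list of typed elements
# (titles, narrative text, tables, ...), each carrying text plus metadata.
elements = partition(filename="quarterly_report.pdf")

for element in elements:
    print(element.category, "->", element.text[:80])
```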
- LlamaParse
LlamaParse, from the LlamaIndex team, aims to solve these challenges by preserving the document’s structure, including tables and formatted text, and outputting results in a Markdown format compatible with language models. Although promising, LlamaParse is still being evaluated in real-world applications, and its capabilities are not yet fully established. A minimal usage sketch follows the links below.
Sample videos or blogs:
- GitHub – run-llama/llama_parse: Parse files for optimal RAG
- How to Use LlamaParse? 🦙 LlamaIndex Tutorial
- A quick walk-through of LlamaParse: simplified document parsing for generative AI applications
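As referenced above, a minimal LlamaParse sketch might look like the following, assuming the llama-parse package and a LlamaCloud API key (the file name is hypothetical):

```python
import os
from llama_parse import LlamaParse  # pip install llama-parse

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."  # placeholder, not a real key

# result_type="markdown" keeps tables and headings in an LLM-friendly format.
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("annual_report.pdf")

for doc in documents:
    print(doc.text[:500])
```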
- X-Ray
X-Ray by EyeLevel.ai offers a more advanced solution by using a fine-tuned vision model to identify text blocks, tables, charts, and other objects across various document types, including those with complex visuals. X-Ray extracts this data and converts it into language model-ready information, producing a JSON-like output with rich metadata, document summaries, and keyword extraction. However, as a relatively new technology, its practical applications and performance are still being explored by developers.
In summary, choosing the right parsing strategy is critical for ensuring that the data passed into the RAG system is both complete and contextually rich, ultimately boosting the performance and reliability of the language models it powers.
Error Analysis: Common Parsing Pitfalls
Several common parsing errors can significantly affect the performance of RAG systems.
- One frequent issue is table misinterpretation where parsers fail to accurately detect table structures, treating them as unstructured text. This can lead to incorrect answers when responding to queries involving tabular data.
- Another pitfall is the loss of formatting, where the document’s structure isn’t preserved, causing headers to mix with body text or labels to be misaligned with data, resulting in scrambled outputs. Header and footer confusion also arises when parsers mistakenly include these sections as part of the main content, potentially distorting the context of the extracted information and affecting the accuracy of downstream tasks.
- Additionally, image handling is a common weakness—most parsers either ignore embedded images and diagrams or attempt to process them through OCR, often misinterpreting them.
Developing Custom Parsing Strategies
For developers working with specialised document types, creating custom parsing strategies can enhance accuracy and efficiency.
- One effective approach is combining existing tools, using multiple parsers to handle different sections of a document and capitalising on their individual strengths.
For example, use PyPDF for text extraction and Tesseract OCR for scanning images within the same document, combining the outputs for more comprehensive parsing (as sketched below).
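A sketch of that combined approach, assuming pypdf's page.images accessor for embedded images and pytesseract for OCR (the file name is hypothetical):

```python
import io

from PIL import Image
from pypdf import PdfReader
import pytesseract

reader = PdfReader("mixed_content.pdf")
chunks = []

for page_number, page in enumerate(reader.pages, start=1):
    # Native text layer via pypdf.
    chunks.append(page.extract_text() or "")

    # OCR any embedded images (e.g., scanned figures) via Tesseract.
    for image_file in page.images:
        image = Image.open(io.BytesIO(image_file.data))
        ocr_text = pytesseract.image_to_string(image)
        if ocr_text.strip():
            chunks.append(f"[OCR, page {page_number}] {ocr_text.strip()}")

combined_text = "\n\n".join(chunks)
```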
- Another technique involves using regular expressions to extract specific information consistently found in documents, providing precise control over what is parsed.
For example, use regular expressions to pull out fields that follow a predictable pattern, such as invoice numbers, dates, or reference codes, directly from the parsed text (see the sketch below).
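For instance, a small sketch of regex-driven extraction over already-parsed text, using hypothetical patterns for invoice fields:

```python
import re

parsed_text = """
Invoice No: INV-2024-0042
Date: 12/10/2024
Total due: $1,250.00
"""

# Hypothetical patterns; tune them to the fields your documents actually contain.
patterns = {
    "invoice_number": r"Invoice No:\s*(INV-\d{4}-\d{4})",
    "date": r"Date:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total due:\s*\$([\d,]+\.\d{2})",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, parsed_text)
    extracted[field] = match.group(1) if match else None

print(extracted)  # {'invoice_number': 'INV-2024-0042', 'date': '12/10/2024', 'total': '1,250.00'}
```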
- For specialised domains, implementing domain-specific rules can improve the handling of industry-specific formats, ensuring more accurate extraction of relevant data.
For example, for legal documents, create rules to accurately identify and extract sections such as clauses, case numbers, or legal terms (a small rule-table sketch follows below).
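One way to encode such rules is as a simple rule table mapping field names to patterns; the sketch below uses hypothetical patterns for case numbers and clause headings:

```python
import re

# Hypothetical rules for legal documents; a real rule set would be built
# with a subject-matter expert and refined against sample filings.
LEGAL_RULES = {
    "case_number": re.compile(r"Case\s+No\.?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "clause_heading": re.compile(r"^(?:Clause|Section|Article)\s+\d+(?:\.\d+)*.*$", re.MULTILINE),
}

def apply_legal_rules(text: str) -> dict:
    """Apply each domain rule and collect every match per field."""
    return {name: rule.findall(text) for name, rule in LEGAL_RULES.items()}
```

Keeping the rules in a single table makes it easy to version them and extend the set without touching the extraction logic.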
- Additionally, machine learning augmentation allows developers to train models that recognize and extract complex patterns or structures unique to their documents, significantly enhancing parsing capabilities. These custom strategies offer more control and accuracy, particularly for non-standard or complex document types.
For example, train a model to detect and extract specific data such as medical codes or billing details from healthcare records with non-standard layouts.
ML Augmentation can be achieved by fulfilling requirements such as:
- Define the problem: The goal is to develop a machine learning model that can accurately identify and extract specific pieces of information from documents with inconsistent layouts. Non-standard layouts may include variations in formatting, structure, and presentation of data, making it difficult for traditional parsing methods to achieve high accuracy.
- Data Collection: Gather a diverse dataset, and ensure the dataset includes records with different layouts, font styles, tables, forms, and scanned documents.
- Annotation: Manually annotate the dataset to highlight key data points or use annotation tools like Label Studio, Prodigy, or VGG Image Annotator to facilitate the process.
- Choose the right model: Select an appropriate machine learning model, such as a Natural Language Processing (NLP) model capable of understanding both text and layout information, or an object detection model if working with scanned documents or images.
- Feature Engineering: Preprocess the collected documents. Convert scanned images to text using OCR (Optical Character Recognition) tools like Tesseract, segment documents into logical sections, and identify spatial information about where specific elements are located within the document.
- Training the Model: Use frameworks like TensorFlow or PyTorch to establish a training environment. Split the dataset into training, validation, and test sets. Then, train the chosen model on the annotated data.
- Evaluation and testing: After training, assess the model on the test set and analyse any misclassified examples to understand common errors and refine the model. The metrics that can be used are listed below, with a small computation sketch after the list:
- Precision: The fraction of extracted data points (e.g., medical codes and billing details) that are correct.
- Recall: The model’s ability to identify all relevant data points in the documents.
- F1 Score: The balance between precision and recall.
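A small sketch of that evaluation step, computing precision, recall, and F1 from counts of correct, spurious, and missed extractions (the numbers are hypothetical):

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Standard extraction metrics computed from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical test-set results: 90 correct extractions, 10 spurious, 20 missed.
p, r, f1 = precision_recall_f1(90, 10, 20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.90 recall=0.82 f1=0.86
```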
Best Practices for Selecting a Parsing Strategy
Choosing the right parsing strategy requires a two-step approach.
- First, conduct a visual inspection by running your documents through various parsers and manually reviewing the output. This allows you to quickly identify major issues with structure or formatting.
- Once you’ve shortlisted the most promising parsers, move to end-to-end testing by running your entire RAG pipeline with these parsers integrated. This process simulates real-world use cases, providing a more comprehensive understanding of how well each parsing strategy functions within the larger system.
To quantitatively compare the parsing strategies, measure their performance across several key metrics:
- Accuracy in table and graphical extraction: Measures the percentage of correctly extracted table and graphical elements compared to the total elements present in the document. This can be done as:
True Positives (TP): Number of correctly extracted elements (tables, charts).
False Positives (FP): Number of incorrectly extracted elements.
False Negatives (FN): Number of elements that were not extracted.
Accuracy = TP / (TP + FP + FN)
- Preservation of document structure: Does the parser maintain the logical flow of the document, keeping headers, footers, and formatting intact?
Qualitative Assessment: Manual review of parsed documents to ensure headers, footers, and sections are maintained. Use a checklist to evaluate if all the headers and footers are preserved and included wherever necessary and if the logical flow of the sections in the document is intact.
- Ability to turn extractions into LLM-friendly data: How effectively does the parser convert document elements into formats that are easy for language models to interpret, such as plain text or structured data like JSON?
Evaluation: Manually check if extracted data is formatted correctly for language models (e.g., in JSON or plain text). Use a validation set to quantify how many extractions meet format specifications.
Example Metric:
Format Accuracy = Number of Correctly Formatted Extractions / Total Extractions
- Parsing speed: How quickly does the parser process documents, especially at scale with large datasets?
Measurement: Track the time taken to parse a batch of documents (in seconds).
Calculation: Speed = Total Number of Documents / Total Time Taken (seconds)
- Consistency across different document types: Does the parser perform reliably with various types of documents (e.g., PDFs, scanned images, reports, etc.)?
Quantitative Assessment: Run the parser on various document types (e.g., PDF, scanned images, structured reports) and compare performance metrics (accuracy, precision, recall).
Stability Analysis: Calculate variance in performance metrics across document types. A lower variance indicates higher consistency.
Example Metric: Calculate the average performance metrics across all document types, using standard deviation to assess consistency.
- Ability to handle complex formatting: Can the parser accurately interpret non-standard or highly formatted documents, avoiding errors such as scrambled text or misplaced data?
Quantitative Metrics: Count instances of formatting errors (e.g., misplaced headers, scrambled text). Calculate the error rate:
Error Rate = Number of Formatting Errors / Total Documents Processed
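A small sketch pulling these metrics together for one benchmarking run, using hypothetical counts gathered during manual review:

```python
from dataclasses import dataclass

@dataclass
class ParserRun:
    """Counts collected while reviewing one parser over a benchmark document set."""
    true_positives: int      # correctly extracted tables/charts
    false_positives: int     # spurious extractions
    false_negatives: int     # missed elements
    correct_formats: int     # extractions that met the format spec
    total_extractions: int
    formatting_errors: int
    documents_processed: int
    total_seconds: float

    def extraction_accuracy(self) -> float:
        return self.true_positives / (self.true_positives + self.false_positives + self.false_negatives)

    def format_accuracy(self) -> float:
        return self.correct_formats / self.total_extractions

    def docs_per_second(self) -> float:
        return self.documents_processed / self.total_seconds

    def error_rate(self) -> float:
        return self.formatting_errors / self.documents_processed

# Hypothetical run over a 200-document benchmark.
run = ParserRun(170, 15, 25, 180, 195, 12, 200, 480.0)
print(f"accuracy={run.extraction_accuracy():.2f}  format={run.format_accuracy():.2f}  "
      f"speed={run.docs_per_second():.2f} docs/s  error_rate={run.error_rate():.2f}")
```

Running the same harness for each shortlisted parser gives a like-for-like comparison before committing to end-to-end pipeline tests.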
The Challenge of Evaluation
Evaluating parsing quality remains a complex, mostly manual task. Creating question-answer pairs for evaluation is time-consuming but essential for developing automated evaluation tools. Despite advancements, human oversight in parsing evaluation is still necessary to ensure accuracy, as fully automated solutions have yet to reach the required sophistication.
This limitation highlights a significant opportunity for innovation in the field. As technologies advance, there is potential for breakthroughs in automated parsing evaluation, and this post will be updated when a sufficiently advanced solution is discovered.
Conclusion
As the capabilities of Retrieval-Augmented Generation (RAG) applications continue to expand, document parsing remains a cornerstone of their success. There is significant potential for innovation in both parsing technologies and evaluation methods, offering new ways to improve performance and accuracy. For developers working on RAG systems, it is essential to prioritise parsing: carefully evaluate different parsing strategies and measure their impact on your specific use case. In the world of RAG, your system is only as good as the data you provide it, and everything starts with effective parsing.
The article was originally shared on Reddit.
We have added our own twists, with additional information to help the reader be better informed 🙂