A little walk through PDF Parsers

Note: This is entirely a Python-based note. There are, of course, many mechanisms to work with PDF files and their data in other languages.
PDFs are as commonplace as anything else in the world of files. Although far from perfect, these documents have become the de facto standard for sharing text-based documents with each other. As such, they hold an immense amount of the publicly accessible information out there, and while their creation is often based on usable data and sources, we only get to see the final, immutable(ish) document. Still, we have to make do, and use these files.
This quickly gets us to the question of how best to extract the information in a document so we can use it for further work.
Most simply, we can read the information and remember it. This is neat, but quickly runs into problems - do you remember, in detail, the file you read last Wednesday?
We can improve this step by taking notes, and highlighting pertinent sections of the document. This is an improvement, but leaves the data itself contained in the document.
You can tell your intern or analyst to transfer the data you are interested in to Excel sheets, recreating the tables and figures in a mutable format. Neat, but this relies on the availability of an analyst (whose time is not entirely cheap, at the end of the day).
We can try to improve the data extraction process by using OCR to make the text in a document selectable, so that at least parts of it can be copy/pasted to a new document. Great, but as we have surely experienced, far from flawless. We can improve the OCR software (i.e. try to download a better one), but we keep running into largely the same problems. OCR itself performs a relatively simple process on documents, which works decently for a clean, modern PDF, but runs into many issues with tables and column alignment.
Solutions?¶
With the advent of LLMs, we have gained a new tool that can significantly aid in the process of processing PDF data at scale, and therefore speed up research and analysis steps. Just like most things in the world of tools, there are many solutions that have been built and modified to achieve these tasks, and I couldn't possibly list all of them.
Below, I will provide a quick overview of the solutions I have recently tried, and their performance.
The next step will be to not only aid the parsing of PDF files into JSON or Markdown documents, but to work further with this data.
Disclaimer¶
Most of this is cobbled together from other sources. By no means am I a programmer, just curious. This code is probably clunky, not clean, and serious professionals may laugh at it.
I have tried my best to link these where relevant.
OpenAI API¶
We can use the OpenAI API to process our file as text (for the elements we can read) and images (for non-structured data). To my surprise, LLMs read PDF files primarily by looking at them as an image, not trying to OCR the text in them.
We can also instruct the client model to use this input data as content, along with a prompt that gives it more conditions on how to process / output the data.
# Imports
import os
import io
from dotenv import load_dotenv
import base64
import re
import json
import numpy as np
import pandas as pd
from tqdm import tqdm
from rich import print
from openai import OpenAI
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
import concurrent.futures
load_dotenv()
client = OpenAI()
# OpenAI Helper Functions
def convert_doc_to_images(path):
    images = convert_from_path(path)
    return images

def extract_text_from_doc(path):
    text = extract_text(path)
    return text
# Converting images to base64 encoded images in a data URI format to use with the ChatCompletions API
def get_img_uri(img):
    png_buffer = io.BytesIO()
    img.save(png_buffer, format="PNG")
    png_buffer.seek(0)
    base64_png = base64.b64encode(png_buffer.read()).decode('utf-8')
    data_uri = f"data:image/png;base64,{base64_png}"
    return data_uri
def analyze_image(data_uri):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"{data_uri}"
                        }
                    }
                ]
            },
        ],
        max_tokens=2000,
        temperature=0,
        top_p=0.1
    )
    return response.choices[0].message.content

def analyze_doc_image(img):
    img_uri = get_img_uri(img)
    data = analyze_image(img_uri)
    return data
With our helper functions defined, we prompt the LLM with instructions on how to understand and analyse the data.
This step will be particularly important for getting structured outputs in a format that is understandable and repeatable. There is little use for a system that parses data, but does so incoherently.
system_prompt = '''
You will be provided with an image of a PDF page or a slide. Your goal is to deliver a detailed and engaging presentation about the content you see, using clear and accessible language suitable for a 101-level audience.
If there is an identifiable title, start by stating the title to provide context for your audience.
Describe visual elements in detail:
- **Diagrams**: Explain each component and how they interact. For example, "The process begins with X, which then leads to Y and results in Z."
- **Tables**: Break down the information logically. For instance, "Product A costs X dollars, while Product B is priced at Y dollars."
Focus on the content itself rather than the format:
- **DO NOT** include terms referring to the content format.
- **DO NOT** mention the content type. Instead, directly discuss the information presented.
Keep your explanation comprehensive yet concise:
- Be exhaustive in describing the content, as your audience cannot see the image.
- Where appropriate, quote from the original text instead of summarizing.
- Exclude irrelevant details such as page numbers or the position of elements on the image.
- **DO NOT** alter any numbers or data presented in the image.
Use expert language:
- Explain technical terms or concepts in clear terms, but with the assumption that the reader is of a professional and educated audience.
Engage with the content:
- Interpret and analyze the information where appropriate, offering insights to help the audience understand its significance.
------
If there is an identifiable title, present the output in the following format:
{TITLE}
{Content description}
If there is no clear title, simply provide the content description.
'''
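Before running the full workflow, we can sanity-check the pipeline on a single page. A minimal sketch, reusing the helper functions above (the file path matches the test document used later):
# Sanity check: convert one PDF to images and analyze only its first page
test_imgs = convert_doc_to_images("data/example_pdfs/test_page.pdf")
print(analyze_doc_image(test_imgs[0]))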
For this example workflow, we will look at pages from a company annual report.
from IPython.display import Image
Image(filename='background/sample.png')
files_path = "data/example_pdfs"
all_items = os.listdir(files_path)
files = [item for item in all_items if os.path.isfile(os.path.join(files_path, item))]
docs = []
for f in files:
    path = f"{files_path}/{f}"
    doc = {
        "filename": f
    }
    text = extract_text_from_doc(path)
    doc['text'] = text
    imgs = convert_doc_to_images(path)
    pages_description = []
    print(f"Analyzing pages for doc {f}")
    # Concurrent execution
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(analyze_doc_image, img)
            for img in imgs
        ]
        with tqdm(total=len(imgs)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(1)
        # Collect results in page order (avoid shadowing the filename variable f)
        for future in futures:
            res = future.result()
            pages_description.append(res)
    doc['pages_description'] = pages_description
    docs.append(doc)
Analyzing pages for doc test_page.pdf
1it [00:29, 29.50s/it]
json_path = "data/output/tp_oai_parsed.json"
with open(json_path, 'w') as f:
    json.dump(doc, f)

with open('data/output/tp_oai_parsed.txt', 'w') as file:
    for page in doc['pages_description']:
        file.write(page + '\n\n')
Once the data is loaded and analysed, we can save it to a JSON file for further use, or extract the text into a simple, readable text file. We can do this for a specific file, or for all files if we have processed multiple (as sketched below).
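If we have processed multiple files, a minimal sketch that dumps the whole docs list at once (the output filename is just an example):
# Save the parsed output of every processed document into one JSON file
with open("data/output/all_docs_parsed.json", 'w') as f:
    json.dump(docs, f)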
Sample Text Output:¶
Other Financial Data (€ millions)
As of and for the year ended December 31, the financial data for 2022, 2021, and 2020 is presented as follows:
Gross Profit:
- 2022: €1,641.6 million
- 2021: €1,516.8 million
- 2020: €1,482.6 million
Profit for the Year from Continuing Operations (EBIT):
- 2022: €16.4 million
- 2021: €1.9 million
- 2020: €(25.6) million
EBITDA:
- 2022: €297.3 million
- 2021: €281.8 million
- 2020: €272.1 million
This is great, but limited to a full-document, text-based output. We cannot easily access individual parts of the data.
For financial applications and usability, we need to be able to load / process data.
Within PDFs, this is largely relevant to tables (forecasts, historical financials etc).
Importantly, we cannot rely on a consistent formatting of pages (as would be the case when, for example, processing invoice documents, or intake forms) based on which we could instruct the program to extract based on a schema.
Therefore, we need the ability to work with unstructured / inconsistent PDF data.
While the OpenAI client can handle this data, the 'easiest' approach leads us to the Unstructured library, which integrates nicely into LangChain.
Unstructured¶
We can use a version of the Unstructured API, which provides a decent overview of the document: we process the document, save the result into a JSON file, and then convert this to a markdown version.
This works... kind of. In my test cases, the output is decent, and probably good enough to be fed into an LLM for further use, but is not enjoyable to look at.
Back to the drawing board.
from langchain_unstructured import UnstructuredLoader

file_path = 'data/example_pdfs/test_page.pdf'
loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
    ocr_languages=["en"],
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
json_path = "data/output/tp_parsed_unstr.json"

# Convert Document objects to dictionaries
docs_serializable = [
    {
        "page_content": doc.page_content,
        "metadata": doc.metadata
    }
    for doc in docs
]
with open(json_path, 'w') as f:
    json.dump(docs_serializable, f)
INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 0
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 1 (1 total)
INFO: Determined optimal split size of 2 pages.
INFO: Document has too few pages (1) to be split efficiently. Partitioning without split.
Unstructured analyses files by partitioning a page based on the type of element (Title, Text, Image or Table). We can see the rendering of a page using the code below, which in this case shows quite nice recognition of the document. The parsing process takes about 10 seconds to run for this single-page file.
Code from https://python.langchain.com/docs/how_to/document_loader_pdf/#use-of-multimodal-models
from utility.u_render import render_uns_page
render_uns_page(docs, 1, file_path, print_text=False)
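To get a quick feel for the partitioning, we can tally the element categories Unstructured assigned on the page. A small sketch over the docs list from above:
from collections import Counter

# Count how many elements of each category (Title, Text, Image, Table) were found
category_counts = Counter(d.metadata.get("category") for d in docs)
print(category_counts)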
from IPython.display import HTML, display

segments = [
    doc.metadata
    for doc in docs
    if doc.metadata.get("page_number") == 1 and doc.metadata.get("category") == "Table"
]
display(HTML(segments[0]["text_as_html"]))

dataframes = []
for segment in segments:
    if segment.get("category") == "Table" and segment.get("text_as_html"):
        # Wrap the HTML string in a StringIO object to avoid deprecation warnings.
        tables = pd.read_html(io.StringIO(segment["text_as_html"]))
        dataframes.extend(tables)
As of and for the year ended December 31, | |||
---|---|---|---|
2022 | 2021 | 2020 | |
Gross profit @ | 1,641.6 | 1,516.8 | 1,482.6 |
Profit for the year romntinuing rion(EBT) (2) | 16.4 | 1.9 | (25.6) |
EBITDA | 297.3 | 281.8 | 272.1 |
Gross profit margin ) | 35.6% | 36.2% | 36.5% |
EBIT margin ¥ | 0.4% | 0.0% | (0.6%) |
EBITDA margin | 6.7% | 7.0% | 6.8% |
Capital expenditures ) | 200.3 | 217.1 | 177.8 |
Cash and bank balances | 311.2 | 440.8 | 401.7 |
) Bank loans, debentures and other marketable securities ® | 1,133.2 | 1,171.8 | 1,193.4 |
Financial debt | 1,156.1 | 1,200.6 | 1,218.1 |
Net financial debt | 844.9 | 759.9 | 816.4 |
We can display specific elements from the page in easily readable formats, and in the saved JSON file.
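Since pd.read_html has turned the tables into ordinary pandas DataFrames, we can also push them straight into a mutable format. A minimal sketch (the CSV paths are just examples):
# Export each recovered table as CSV for further analysis
for i, df in enumerate(dataframes):
    df.to_csv(f"data/output/tp_unstr_table_{i}.csv", index=False)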
Great in principle, but unfortunately we can see in the output table that Unstructured makes some errors in parsing the text (e.g. 'EBIT margin ¥').
Additionally, we have to consider cost. The API is priced at $2/1,000 pages for the 'fast' processing model, and $20/1,000 pages for the 'hi-res' model which is recommended for PDF files.
OpenParse¶
Another option comes in the form of OpenParse, from Sergey Filimonov. The library is designed for RAG and promises to provide a 'flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively'.
OpenParse can be quite significantly customised, and also allows for the use of OpenAI models. In its most straightforward form, for this example, we run it largely 'as-is', to trial its capability. The program runs on device primarily (as opposed to e.g. Unstructured, which uses a cloud API), and performance is therefore tied to your machine.
import openparse
from openparse import processing, DocumentParser

basic_doc_path = "data/example_pdfs/test_page.pdf"
parser = openparse.DocumentParser(
    table_args={"parsing_algorithm": "unitable"})
parsed_pmpdf = parser.parse(basic_doc_path)

doc = openparse.Pdf(file=basic_doc_path)
doc.display_with_bboxes(parsed_pmpdf.nodes)

for node in parsed_pmpdf.nodes:
    display(node)
INFO: Unsupported color space: ICCBased
INFO: Models loaded successfully 🚀: 0.61s
Finished loading models. Ready for inference.
Gross profit ( 1 ) | 1.641.6 | 1.516.8 | |
1.482.6 | Profit for the year from continuing operations ( EBIT ) ( 2 ) | 16.4 | 1.9 |
( 25.6 ) | EBITDA ( 2 ) | 297.3 | 281.8 |
272.1 | Gross profit margin ( 3 ) | 35.6 % | 36.2 % |
36.5 % | EBIT margin ( 3 ) | 0.4 % | 0.0 % |
( 0.6 %) | EBITDA margin ( 3 ) | 6.7 % | 7.0 % |
6.8 % | Capital expenditures ( 4 ) | 200.3 | 217.1 |
177.8 | Cash and bank balances | 311.2 | 440.8 |
401.7 | Bank loans, debentures and other marketable securities ( 5 ) | 1.133.2 | 1.171.8 |
1.193.4 | Financial debt ( 5 ) | 1.156.1 | 1.200.6 |
1.218.1 | Net financial debt ( 5 ) | 844.9 | 759.9 |
816.4 |
FY 2022 | FY 2021 | FY 2020 | |
Total operating income | 4.617.7 | 4.184.8 | 4.062.2 |
Adjusted for : | |||
Supplies | ( 2.976.2 ) | ( 2.668.0 ) | ( 2.579.6 ) |
Gross profit | 1.641.6 | 1.516.8 | 1.482.6 |
(2) “EBITDA” represents profit for the year from continuing operations (“EBIT”) after adding back depreciation and amortization expenses. Our management believes that EBITDA is meaningful for investors because it provides an analysis of our operating results, profitability and ability to service debt and because EBITDA is used by our chief operating decision makers to track our business evolution, establish operational and strategic targets and make important business decisions. EBITDA is also a measure commonly reported and widely used by analysts, investors and other interested parties in our industry. To facilitate the analysis of our operations, EBITDA excludes depreciation and amortization expenses from EBIT in order to eliminate the impact of general long-term capital investment. Although we are presenting EBITDA to enhance the understanding of our historical operating performance, EBITDA should not be considered an alternative to EBIT as an indicator of our operating performance, or an alternative to cash flows from ordinary operating activities as a measure of our liquidity. The following table presents the calculation of EBITDA:
Profit for the year from continuing operations ( EBIT ) | 16.4 | 1.9 | |
( 25.6 ) | Adjusted for : | Depreciation and amortization expenses | 280.9 |
279.9 | |||
297.7 | EBITDA | 297.3 | 281.8 |
272.1 |
(3)“Gross profit margin”is gross profit divided by total operating income. EBIT margin is EBIT divided by revenue.
EBITDA margin is EBITDA divided by revenue.
(4)“Capital expenditures”consist of expenditures in property plant and equipment, plus expenditures in intangible assets. See “Operating and Financial Review and Prospects—Key factors affecting our results of operations—Capital Expenditures”.
(5)“Bank loans, debentures and other marketable securities”consists of current and non-current payables under finance leases, the 2026 Notes, the 2028 Notes, the Senior Facilities Agreement, as well as other loans, credit lines, invoice discount lines, interest payable and financial remeasurements. Financial debt consists of bank loans, debentures and other marketable securities plus non-recourse factoring and other financial liabilities. Net financial debt consists of financial debt less cash and bank balances. The following table presents a calculation of net financial debt:
15
Classified as Public
We can see that it generally did pretty OK, but the tables are not usable in this form. I have tried some of the options suggested by the author to improve performance (one such option is sketched below), but have so far had no luck.
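For reference, one variation I tried: swapping the table parsing algorithm. My understanding of the OpenParse docs is that table_args also accepts 'pymupdf' and 'table-transformers' as algorithms, along with an output-format option; treat the exact parameter values below as assumptions rather than gospel.
# Hedged sketch: re-parse with a different table parsing algorithm
parser_alt = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",     # assumption: alternative to "unitable"
        "table_output_format": "markdown",  # assumption: emit tables as markdown
    }
)
parsed_alt = parser_alt.parse(basic_doc_path)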
Moving on...
Marker¶
Another alternative to Unstructured is Marker. It runs on-device, and is a bit slower.
It is also not entirely free for commercial usage, but this largely applies to companies with revenues >$5mn.
So, free for me, at least.
Marker allows for the extraction of a full PDF document, or the extraction of only the relevant tables (a sketch of the tables-only route follows the first run below).
Like OpenParse, the program largely runs on-device, and is significantly reliant on processing power. In my trials, I have found that real-world usage is best handled by 'proper' hardware like a GPU cluster, or lacking this, Google Colab. More on that below.
from datetime import datetime
from marker.converters.pdf import PdfConverter
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import json_to_html, text_from_rendered, save_output
from marker.config.parser import ConfigParser
# Input File Path
file_name = 'test_page'
input_file_path = 'data/example_pdfs/' + file_name + '.pdf'

# Output File Path
output_file_path = 'data/output/'

# Initial rendering through Marker, without additional settings
config = {
    "strip_existing_ocr": True,
    "force_ocr": True,
    "output_format": "json"}
config_parser = ConfigParser(config)

converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer()
)
rendered = converter(input_file_path)
text, _, images = text_from_rendered(rendered)
save_output(rendered, output_file_path, 'tp_marker_parsed')
Loaded layout model datalab-to/surya_layout on device mps with dtype torch.float16
Loaded texify model datalab-to/texify on device mps with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device mps with dtype torch.float16
Loaded table recognition model datalab-to/surya_tablerec on device mps with dtype torch.float16
Loaded detection model vikp/surya_det3 on device mps with dtype torch.float16
Recognizing layout: 100%|██████████| 1/1 [00:02<00:00, 2.61s/it]
100%|██████████| 1/1 [00:00<00:00, 10.28it/s]
Detecting bboxes: 100%|██████████| 1/1 [00:00<00:00, 1.28it/s]
Recognizing Text: 100%|██████████| 1/1 [00:20<00:00, 20.61s/it]
Recognizing equations: 0it [00:00, ?it/s]
Detecting bboxes: 100%|██████████| 1/1 [00:01<00:00, 1.49s/it]
Recognizing Text: 100%|██████████| 3/3 [00:41<00:00, 13.85s/it]
Recognizing tables: 100%|██████████| 1/1 [00:05<00:00, 5.92s/it]
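As an aside, this is where the tables-only route mentioned above comes in: Marker's TableConverter (imported above) is used much like PdfConverter. A minimal sketch:
# Tables-only extraction with Marker's TableConverter
table_converter = TableConverter(artifact_dict=create_model_dict())
table_rendered = table_converter(input_file_path)
table_text, _, _ = text_from_rendered(table_rendered)
print(table_text)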
If we output the PDF as JSON, we can apply a version of the rendering script from the Unstructured code.
We can see that Marker seems to do a great job of recognizing the elements on the page.
from utility.u_render import render_marker_page
render_marker_page(rendered, 1, file_location='data/example_pdfs/test_page.pdf',print_text=False)
from utility.u_jsonmd import json_to_markdown
sampletest = rendered.copy()
markdown_output = json_to_markdown(sampletest)
However, in markdown, the first table is lacking its descriptive column: we get only the numbers, not the line-item names.
Output¶
As of and for the year ended December 31, | ||
---|---|---|
2022 | 2021 | 2020 |
1,641.6 | 1,516.8 | 1,482.6 |
16.4 | 1.9 | (25.6) |
297.3 | 281.8 | 272.1 |
35.6% | 36.2% | 36.5% |
0.4% | 0.0% | (0.6%) |
6.7% | 7.0% | 6.8% |
200.3 | 217.1 | 177.8 |
311.2 | 440.8 | 401.7 |
1,133.2 | 1,171.8 | 1,193.4 |
1,156.1 | 1,200.6 | 1,218.1 |
844.9 | 759.9 | 816.4 |
Depending on the document, stripping and re-running OCR, while costing time, seems to help with overall accuracy.
Marker allows for the option to use the Gemini API to enhance extraction. This is as simple as adding your Gemini key to the environment, and adding the 'use_llm' flag to the config.
If desired, we can also load a specific Gemini Model based on preference. Gemini 1.5 Flash seems to work pretty well, too, and is cheaper to use than the latest models.
YMMV.
config = {
    "output_format": "markdown",
    "strip_existing_ocr": True,
    "force_ocr": True,
    "use_llm": True,
    "model_name": "gemini-1.5-flash"
}
config_parser = ConfigParser(config)

converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer()
)
rendered = converter(input_file_path)
text, _, images = text_from_rendered(rendered)
# save_as_markdown(rendered, output_file_path, file_name)
save_file_name = file_name +'_'+ datetime.now().strftime("%Y%m%d_%H%M%S")
save_output(rendered, output_file_path, save_file_name)
Loaded layout model datalab-to/surya_layout on device mps with dtype torch.float16
Loaded texify model datalab-to/texify on device mps with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device mps with dtype torch.float16
Loaded table recognition model datalab-to/surya_tablerec on device mps with dtype torch.float16
Loaded detection model vikp/surya_det3 on device mps with dtype torch.float16
Recognizing layout: 100%|██████████| 1/1 [00:01<00:00, 1.50s/it]
100%|██████████| 1/1 [00:00<00:00, 14.19it/s]
LLM layout relabelling: 0it [00:00, ?it/s]
Detecting bboxes: 100%|██████████| 1/1 [00:00<00:00, 2.28it/s]
Recognizing Text: 100%|██████████| 1/1 [00:14<00:00, 14.95s/it]
Recognizing equations: 0it [00:00, ?it/s]
Detecting bboxes: 100%|██████████| 1/1 [00:01<00:00, 1.31s/it]
Recognizing Text: 100%|██████████| 3/3 [00:41<00:00, 13.89s/it]
Recognizing tables: 100%|██████████| 1/1 [00:03<00:00, 3.49s/it]
LLMTableProcessor running: 3it [00:03, 1.30s/it]
LLMTableMergeProcessor running: 0it [00:00, ?it/s]
LLMHandwritingProcessor running: 3it [00:00, 1917.25it/s]
The resulting output, as we can see, renders the first table with more accuracy.
Still not flawless, but a significant step in the right direction, if what we want is to 'trust' this to run in the background as a step in the process.
Notably, both versions of Marker appear to be clearer in their text processing than the Unstructured approach, which mangled some column items.
Of course, we have to account for the cost of the Google API. The simple one-page test run used, in this case, approx. 4,000 input tokens.
Output¶
As of and for the year ended December 31, | |||
---|---|---|---|
2022 | 2021 | 2020 | |
Gross profit (1) | 1,641.6 | 1,516.8 | 1,482.6 |
Profit for the year from continuing operations (EBIT) (2) | 16.4 | 1.9 | (25.6) |
EBITDA (2) | 297.3 | 281.8 | 272.1 |
Gross profit margin (3) | 35.6% | 36.2% | 36.5% |
EBIT margin (3) | 0.4% | 0.0% | (0.6%) |
EBITDA margin (3) | 6.7% | 7.0% | 6.8% |
Capital expenditures (4) | 200.3 | 217.1 | 177.8 |
Cash and bank balances | 311.2 | 440.8 | 401.7 |
Bank loans, debentures and other marketable securities (5) | 1,133.2 | 1,171.8 | 1,193.4 |
Financial debt (5) | 1,156.1 | 1,200.6 | 1,218.1 |
Net financial debt (5) | 844.9 | 759.9 | 816.4 |
Batch Processing¶
The whole point of this is to find a solution that we can set to work on a whole batch of documents, saving their outputs for later use. Once we can cleanly and reliably extract the data from the confines of a PDF document, we will have far easier ways to proceed.
The above Marker script can very easily be adapted to a batch process.
Below is an extract of the script I use in Google Colab to iterate over a folder of data relatively quickly.
# Imports
import os
import json
from datetime import datetime
from marker.converters.pdf import PdfConverter
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import json_to_html, text_from_rendered, save_output
from marker.config.parser import ConfigParser
from google.colab import userdata
# Marker Configuration
config = {
    "output_format": "markdown",
    "strip_existing_ocr": True,
    "force_ocr": True,
    "use_llm": True,
    "google_api_key": userdata.get('GOOGLE_API_KEY'),
    "model_name": "gemini-2.0-flash",
}
config_parser = ConfigParser(config)

converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer()
)
timestmp = datetime.now().strftime("%Y%m%d_%H%M%S")

# File Locations
input_folder = '/content/drive/MyDrive/Colab Notebooks/data/input'
output_folder = '/content/drive/MyDrive/Colab Notebooks/data/output/axactor'
all_items = os.listdir(input_folder)
files = [item for item in all_items if os.path.isfile(os.path.join(input_folder, item))]

# Main Batch Process
for f in files:
    fpath = os.path.join(input_folder, f)
    print(f"Processing: {f}")
    result = converter(fpath)
    run_subfolder = os.path.join(output_folder, f)
    os.makedirs(run_subfolder, exist_ok=True)
    save_output(result, run_subfolder, f"{timestmp}_{f}")
Marker, as mentioned, takes a fair bit of time.
20 minutes, to be exact, for the initial 11-page run using Gemini (MacBook Air M2).
The whole process can be sped up considerably if you are on a machine with an actual GPU. Alternatively, even the free version of Google Colab with a T4 GPU Runtime cut processing time to just about 3 minutes.
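If Marker ends up on the wrong device, my understanding is that it honors a TORCH_DEVICE environment variable (treat the variable name as an assumption); it has to be set before the models are loaded:
import os

# Assumption: force Marker's models onto a CUDA GPU before create_model_dict() runs
os.environ["TORCH_DEVICE"] = "cuda"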
Still, the results are great.
The file is processed, as far as I can tell, without notable errors. One or two lines have been converted to Sanskrit, which is a bummer since I can't read it.
Even tables that have empty / subheader rows are appropriately allocated.
L4 GPU¶
To trial the speed of this process, I got some compute units (100 units = GBP 9.72) and set the program to work on a set of three quarterly reports (123 pages total). There's a good pricing overview here.
The process seems to be quite low-impact, costing just about 3 compute units for the three documents using an L4 GPU cluster, and taking just under 15 minutes. The resulting files are saved in folders for each original document, with a markdown version of the file and its metadata. Obviously, all of this could be tweaked for different workflows.
Next Steps¶
This is all just a setup stage for the process, and I am far from done with it (as you can probably tell). Now that we have the data in a usable format, the real meat of the process will be to further work on and through the data.
More to come, then...