🧾 Lab – PDF Images OCR & Structure Understanding to Get Data#

“Turning PDFs into structured data — because some businesses still think ‘Export to PDF’ is a database strategy.”


🎯 Lab Goals#

Welcome to AI vs. Paperwork. Your mission: teach a neural network to read PDF invoices, receipts, or reports like a caffeinated intern armed with a highlighter.

You’ll:

  1. Use Tesseract for Optical Character Recognition (OCR) and PyTorch for layout understanding

  2. Extract structured data (tables, totals, headings)

  3. Appreciate how chaotic real-world business documents are

  4. Laugh when your model confuses “$1,000” with “LOOO”


📦 Setup: PyTorch + OCR Tools#

Let’s grab a few key libraries first:

pip install torch torchvision torchaudio
pip install pytesseract pdf2image pillow opencv-python
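
Heads-up: pip only installs the Python wrappers. The Tesseract engine (for pytesseract) and Poppler (for pdf2image) have to be installed separately; the package names below assume Debian/Ubuntu (Homebrew users want tesseract and poppler):

sudo apt-get install tesseract-ocr poppler-utils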

🧠 Why PyTorch?#

Because TensorFlow would make you fill out three forms just to load a model. PyTorch lets you write code that feels like Python, not penance.

💬 “TensorFlow wants you to build a computational graph. PyTorch says: ‘Just run it, dude.’”


🧩 Step 1: Convert PDF Pages → Images#

from pdf2image import convert_from_path
from PIL import Image

pages = convert_from_path("sample_invoice.pdf")
for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")

You just turned your boss’s 200-page PDF into a stack of PNGs. Congratulations — now your GPU is crying.
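
If Tesseract chokes on these images later, the rendering resolution is a common culprit. A quick variant worth trying (300 DPI is a conventional starting point, not a magic number):

# Render at higher resolution -- sharper glyphs for OCR, at the cost of bigger PNGs
pages = convert_from_path("sample_invoice.pdf", dpi=300)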


👀 Step 2: Run OCR with Tesseract#

import pytesseract
import cv2

img = cv2.imread("page_0.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
text = pytesseract.image_to_string(gray)
print(text[:300])

This should output something like:

Invoice #12345
Customer: ACME Corp
Total: $4,200.00

…or if your PDF was messy:

Invoce #l23AS
Custom: ACNE Crop
Total: S42000,00

(Welcome to OCR reality.)
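
Before reaching for a neural network, a little preprocessing often rescues a surprising amount of text. A minimal sketch, with the blur kernel and confidence cutoff as assumptions to tune per document:

import cv2
import pytesseract

img = cv2.imread("page_0.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Denoise and binarize so Tesseract sees crisp black-on-white glyphs
gray = cv2.medianBlur(gray, 3)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# image_to_data returns word-level text, bounding boxes, and confidences
data = pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip() and float(conf) > 60:  # 60 is an arbitrary confidence cutoff
        print(word, conf)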


🤖 Step 3: Neural Network for Structure Understanding#

We’ll now train a small CNN+MLP hybrid to identify key layout elements:

  • Headers

  • Tables

  • Totals

  • Logos (because branding is life)


🔧 Define a Tiny CNN in PyTorch#

import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)   # single-channel (grayscale) patches in
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 32 * 32, 128)  # assumes 128x128 input: two 2x2 pools -> 32x32 maps
        self.fc2 = nn.Linear(128, 4)  # 4 layout types

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

🎨 Input: patches of document images
🎯 Output: “header”, “table”, “total”, “other”
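
How patches reach that 1×128×128 input is up to you. One minimal sketch is a sliding window over the page, reusing the LayoutNet defined above (window size and stride are arbitrary assumptions):

import cv2
import torch

CLASSES = ["header", "table", "total", "other"]

page = cv2.imread("page_0.png", cv2.IMREAD_GRAYSCALE)
model = LayoutNet()   # the (still untrained) network from above
model.eval()

patch_size, stride = 256, 256  # assumed values; tune for your documents
with torch.no_grad():
    for y in range(0, page.shape[0] - patch_size + 1, stride):
        for x in range(0, page.shape[1] - patch_size + 1, stride):
            patch = page[y:y + patch_size, x:x + patch_size]
            patch = cv2.resize(patch, (128, 128))  # match LayoutNet's expected input
            tensor = torch.from_numpy(patch).float().div(255).unsqueeze(0).unsqueeze(0)
            pred = model(tensor).argmax(dim=1).item()
            print(f"({x}, {y}) -> {CLASSES[pred]}")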


🧪 Step 4: Train (or Pretend to)#

model = LayoutNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Dummy loop
for epoch in range(3):
    optimizer.zero_grad()
    dummy_input = torch.randn(8, 1, 128, 128)
    dummy_target = torch.randint(0, 4, (8,))
    output = model(dummy_input)
    loss = criterion(output, dummy_target)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.3f}")

🧘‍♂️ The model doesn’t really learn much — but it feels smarter now. Just like some managers after an AI conference.
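
To make the loop learn something real, swap the random tensors for labelled patches. A minimal sketch assuming you have saved crops into one folder per class (the patches/ directory layout is an assumption, not part of this lab's data):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed layout: patches/header/*.png, patches/table/*.png, patches/total/*.png, patches/other/*.png
transform = transforms.Compose([
    transforms.Grayscale(),        # LayoutNet expects a single channel
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("patches", transform=transform)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

# model, optimizer, and criterion are the ones defined above
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.3f}")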


🧩 Step 5: Extract Data from Layout Predictions#

Once the network identifies regions, you can:

  • Run OCR again on those patches

  • Parse key-value pairs (e.g., “Total: $4200”)

  • Export as structured data (CSV, JSON, or to your BI tool)

structured_data = {
    "Invoice_No": "12345",
    "Customer": "ACME Corp",
    "Total": 4200.00
}
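
A minimal sketch of how raw OCR text might become a dict like the one above, using regex for the key-value pairs (the patterns assume the clean output from Step 2; real invoices usually need per-vendor rules):

import json
import re

def parse_invoice(text: str) -> dict:
    """Pull a few known fields out of OCR text with regex (field formats are assumptions)."""
    patterns = {
        "Invoice_No": r"Invoice\s*#\s*(\d+)",
        "Customer": r"Customer:\s*(.+)",
        "Total": r"Total:\s*\$?([\d,]+\.\d{2})",
    }
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            value = match.group(1).strip()
            result[field] = float(value.replace(",", "")) if field == "Total" else value
    return result

structured_data = parse_invoice(text)  # `text` from the OCR step above
with open("invoice.json", "w") as f:
    json.dump(structured_data, f, indent=2)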

🎉 Congratulations! You just saved your data team 6 hours of manual entry.


📊 Bonus: Business Value#

| Role | Why They'll Love It |
| --- | --- |
| Finance | Automates invoice processing |
| Analytics | Turns PDFs into usable datasets |
| Ops | Speeds up reporting |
| Interns | Fewer hours of "Ctrl+C → Excel" suffering |


🧠 Stretch Challenge#

  1. Fine-tune a pre-trained model like LayoutLMv3 or Donut for structured extraction (a loading sketch follows this list).

  2. Add a table detector using bounding boxes from detectron2.

  3. Use Hugging Face Transformers for OCR-based text classification.

  4. Build a dashboard to visualize extracted fields.
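
For stretch challenge 1, a minimal loading sketch with Hugging Face Transformers (the checkpoint name is real, but the four-label head is randomly initialized and the setup is only a starting point; you still have to fine-tune it on labelled documents):

from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# The processor runs Tesseract on the image by default to get words + boxes
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=4  # header / table / total / other
)

image = Image.open("page_0.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True)
outputs = model(**encoding)
predictions = outputs.logits.argmax(dim=-1)  # one (untrained) label per token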


😂 TL;DR#

| Step | Description | Feeling |
| --- | --- | --- |
| PDF → Image | The calm before the storm | 😇 |
| OCR | "Is that a 6 or a G?" | 😤 |
| CNN for Layout | "My AI understands headers better than my interns." | 😎 |
| Structured Data | "Finally, usable numbers!" | 🕺 |


🧭 Summary#

  • PyTorch makes experimenting fast and flexible

  • OCR brings text out of PDFs (even if it’s messy)

  • CNNs can help detect document structure

  • Business value: turn unstructured PDFs into gold mines of analytics


“In the end, it’s not about automating paperwork — it’s about freeing humans to make new paperwork.” 🧑‍💼💻

# Your code here