🧾 Lab – PDF Images OCR & Structure Understanding to Get Data


“Turning PDFs into structured data — because some businesses still think ‘Export to PDF’ is a database strategy.”

🎯 Lab Goals

Welcome to AI vs. Paperwork. Your mission: teach a neural network to read PDF invoices, receipts, or reports like a caffeinated intern armed with a highlighter.

You’ll:

Use Tesseract for Optical Character Recognition (OCR) and PyTorch to classify page layout
Extract structured data (tables, totals, headings)
Appreciate how chaotic real-world business documents are
Laugh when your model confuses “$1,000” with “LOOO”

📦 Setup: PyTorch + OCR Tools

Let’s grab a few key libraries first:

pip install torch torchvision torchaudio
pip install pytesseract pdf2image pillow opencv-python
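
One catch: pytesseract and pdf2image are thin wrappers. The first shells out to the Tesseract binary and the second to Poppler's command-line tools, and pip installs neither. Here is a quick sanity check, as a sketch. The helper name `missing_tools` is made up for this lab, and `pdftoppm` is assumed to be the Poppler executable your pdf2image version calls.

```python
import shutil

def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install these before continuing:", ", ".join(missing))
    else:
        print("Tesseract and Poppler found. Paperwork, beware.")
```

If something is missing, install it with your system package manager rather than pip.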

🧠 Why PyTorch?

Because TensorFlow would make you fill out three forms just to load a model. PyTorch lets you write code that feels like Python, not penance.

💬 “TensorFlow wants you to build a computational graph. PyTorch says: ‘Just run it, dude.’”

🧩 Step 1: Convert PDF Pages → Images

from pdf2image import convert_from_path

# Render every page of the PDF to a PIL image (requires Poppler on PATH)
pages = convert_from_path("sample_invoice.pdf")
for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")  # page_0.png, page_1.png, ...

You just turned your boss’s 200-page PDF into a stack of PNGs. Congratulations: now your RAM is crying.
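
Rendering all 200 pages in one call keeps every page image in memory at once. `convert_from_path` also accepts `dpi`, `first_page`, and `last_page`, so one way to go easier on your machine is to render a few pages at a time. This is a minimal sketch: `page_batches` and `save_pages_in_batches` are hypothetical helpers, the batch size of 10 is arbitrary, and you are assumed to know the page count up front.

```python
def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first, last) page ranges covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)

def save_pages_in_batches(pdf_path, total_pages, batch_size=10, dpi=200):
    """Render and save a PDF a few pages at a time (needs pdf2image + Poppler)."""
    # Imported lazily so page_batches() stays usable without pdf2image installed.
    from pdf2image import convert_from_path

    for first, last in page_batches(total_pages, batch_size):
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, page in enumerate(pages):
            # Keep the 0-based file names used in this lab: page_0.png, page_1.png, ...
            page.save(f"page_{first - 1 + offset}.png", "PNG")

print(list(page_batches(25, 10)))
```

A higher `dpi` usually helps OCR on small print, at the cost of bigger images and slower rendering.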

👀 Step 2: Run OCR with Tesseract

import pytesseract
import cv2

img = cv2.imread("page_0.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
text = pytesseract.image_to_string(gray)
print(text[:300])

This should output something like:

Invoice #12345
Customer: ACME Corp
Total: $4,200.00

…or if your PDF was messy:

Invoce #l23AS
Custom: ACNE Crop
Total: S42000,00

(Welcome to OCR reality.)
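
`image_to_string` hides how sure Tesseract was about each word. `pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)` returns parallel lists that include `text` and `conf`, which lets you flag suspicious words for a human to check. The sketch below works on a dict of that shape. The helper name `low_confidence_words` and the threshold of 60 are assumptions for illustration, and the sample data is hand-written so the code runs without Tesseract.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs that Tesseract was unsure about.

    `data` is a dict of parallel lists with "text" and "conf" keys, the shape
    returned by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if word.strip() and 0 <= conf < threshold:  # conf is -1 for non-word boxes
            flagged.append((word, conf))
    return flagged

# Hand-written stand-in for real OCR output.
sample = {"text": ["Invoce", "#l23AS", "", "Total:"], "conf": ["41", 38.5, "-1", 93]}
print(low_confidence_words(sample))
```

On a real page, pass the dict from `image_to_data` instead of `sample`, and tune the threshold against documents you have checked by hand.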

🤖 Step 3: Neural Network for Structure Understanding

We’ll now train a small CNN+MLP hybrid to identify key layout elements:

Headers
Tables
Totals
Everything else, such as logos (because branding is life)

🔧 Define a Tiny CNN in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 32 * 32, 128)  # 32 channels x 32 x 32; assumes 128x128 input
        self.fc2 = nn.Linear(128, 4)  # 4 layout types

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

🎨 Input: 128×128 grayscale patches of document images
🎯 Output: one of four classes: “header”, “table”, “total”, “other”
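
Where does the `32 * 32 * 32` in `fc1` come from? The 3×3 convolutions use `padding=1`, so they keep height and width unchanged, and each `max_pool2d(x, 2)` halves both. A small arithmetic sketch (the helper `flattened_size` is hypothetical, not part of PyTorch) shows what `fc1` must expect for a given patch size:

```python
def flattened_size(height, width, channels=32, num_pools=2, pool=2):
    """Number of features reaching fc1 for a given patch size.

    Only the max-pool layers shrink the feature map here; each one
    floor-divides height and width by `pool`.
    """
    for _ in range(num_pools):
        height //= pool
        width //= pool
    return channels * height * width

print(flattened_size(128, 128))  # 32768, i.e. 32 * 32 * 32
print(flattened_size(64, 64))    # 8192, so fc1 would need nn.Linear(8192, 128)
```

Feed the network any other patch size without changing `fc1` and PyTorch will raise a shape-mismatch error at the first linear layer.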

🧪 Step 4: Train (or Pretend to)

model = LayoutNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Dummy loop: random tensors stand in for real, labeled document patches
for epoch in range(3):
    optimizer.zero_grad()
    dummy_input = torch.randn(8, 1, 128, 128)
    dummy_target = torch.randint(0, 4, (8,))
    output = model(dummy_input)
    loss = criterion(output, dummy_target)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.3f}")

🧘‍♂️ The model doesn’t really learn much — but it feels smarter now. Just like some managers after an AI conference.
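
To swap the random tensors for real data, each page first has to be cut into 128×128 patches. One simple approach, sketched here with a hypothetical helper `patch_boxes`, is to tile the page on a fixed grid and drop the ragged edges:

```python
def patch_boxes(page_width, page_height, size=128, stride=128):
    """Return (left, top, right, bottom) boxes tiling a page, row by row.

    Partial patches at the right and bottom edges are skipped in this sketch.
    """
    boxes = []
    for top in range(0, page_height - size + 1, stride):
        for left in range(0, page_width - size + 1, stride):
            boxes.append((left, top, left + size, top + size))
    return boxes

boxes = patch_boxes(300, 260)
print(len(boxes), boxes[0], boxes[-1])
```

Each box is in the format PIL's `Image.crop` accepts, so `page.convert("L").crop(box)` yields one grayscale patch. You would still need a label for every patch, which is where the real work (and the real interns) come in.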

🧩 Step 5: Extract Data from Layout Predictions

Once the network identifies regions, you can:

Run OCR again on those patches
Parse key-value pairs (e.g., “Total: $4200”)
Export as structured data (CSV, JSON, or to your BI tool)

structured_data = {
    "Invoice_No": "12345",
    "Customer": "ACME Corp",
    "Total": 4200.00
}
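
How do you get from OCR text to that dictionary? For clean text, a few regular expressions go a long way. This is a minimal sketch that assumes the field labels look exactly like the sample invoice above; `FIELD_PATTERNS` and `parse_invoice` are made-up names, and real invoices will need more patterns (and more patience).

```python
import json
import re

FIELD_PATTERNS = {
    "Invoice_No": re.compile(r"Invoice\s*#\s*(\d+)"),
    "Customer": re.compile(r"Customer:\s*(.+)"),
    "Total": re.compile(r"Total:\s*\$?([\d,]+(?:\.\d+)?)"),
}

def parse_invoice(text):
    """Extract known fields from OCR text; fields that don't match are omitted."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[name] = match.group(1).strip()
    if "Total" in record:
        # Drop thousands separators before converting to a number.
        record["Total"] = float(record["Total"].replace(",", ""))
    return record

ocr_text = "Invoice #12345\nCustomer: ACME Corp\nTotal: $4,200.00\n"
print(json.dumps(parse_invoice(ocr_text), indent=2))
```

Run it on the messy OCR output from Step 2 and nothing matches, which is the honest answer: a missing field is easier to catch downstream than a confidently wrong one.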

🎉 Congratulations! You just saved your data team 6 hours of manual entry.

📊 Bonus: Business Value

Role	Why They’ll Love It
Finance	Automates invoice processing
Analytics	Turns PDFs into usable datasets
Ops	Speeds up reporting
Interns	Fewer hours of “Ctrl+C → Excel” suffering

🧠 Stretch Challenge

Fine-tune a pre-trained model like LayoutLMv3 or Donut for structured extraction.
Add a table detector using bounding boxes from detectron2.
Use Hugging Face Transformers for OCR-based text classification.
Build a dashboard to visualize extracted fields.

😂 TL;DR

Step	Description	Feeling
PDF → Image	The calm before the storm	😇
OCR	“Is that a 6 or a G?”	😤
CNN for Layout	“My AI understands headers better than my interns.”	😎
Structured Data	“Finally, usable numbers!”	🕺

🧭 Summary

PyTorch makes experimenting fast and flexible
OCR brings text out of PDFs (even if it’s messy)
CNNs can help detect document structure
Business value: turn unstructured PDFs into gold mines of analytics

“In the end, it’s not about automating paperwork — it’s about freeing humans to make new paperwork.” 🧑‍💼💻
