Extracting PDF Tables on Apple Silicon: olmOCR-2 vs PaddleOCR-VL
Compare olmOCR-2 and PaddleOCR-VL for PDF table extraction on a Mac. See which open-source VLM handles merged cells and numeric tables better.
Introduction
In a previous article, we tested three Python tools for PDF table extraction: Docling, Marker, and LlamaParse. None of them handled the test document perfectly: Docling hallucinated values, Marker merged columns on borderless rows, and LlamaParse added a duplicate empty column.
After publishing part 1, I came across two more tools that target the same problem and wanted to see how they perform compared to the ones we already tested:
olmOCR-2 from Allen Institute for AI, a 7B fine-tune of Qwen2.5-VL
PaddleOCR-VL 1.6 from Baidu, a 1B model with a layout-detection pipeline
Both claim state-of-the-art table extraction. We’ll test them on a Mac (Apple M5 Pro), using the same PDF as part 1, to see if they fix the failures we saw there. On Apple Silicon, olmOCR-2 runs as a GGUF quantization through llama.cpp, and PaddleOCR-VL runs on native CPU PaddlePaddle.
📖 Full version: This is a condensed version focusing on the hardest table in the document. For the first-table walkthrough and a full discussion of the tradeoffs, see the complete comparison.
The Test Document
For a fair comparison, we will use the same PDF as part 1: the Docling Technical Report from arXiv:
import urllib.request
source = "https://arxiv.org/pdf/2408.09869"
local_pdf = "docling_report.pdf"
urllib.request.urlretrieve(source, local_pdf)
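Before running either model, it can help to confirm that the download produced a PDF rather than an HTML error page. Here is a minimal sketch: `looks_like_pdf` is a hypothetical helper name, and the check only inspects the `%PDF-` magic bytes at the start of the file.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every PDF file begins with the header "%PDF-<version>"
        return f.read(5) == b"%PDF-"
```

Calling `looks_like_pdf(local_pdf)` right after the download gives a quick yes/no before spending minutes on inference.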
olmOCR-2: Qwen2.5-VL Fine-Tune
olmOCR-2 is Allen AI’s open-weight OCR model. It stands out for three reasons:
Vision-based: a 7B fine-tune of Qwen2.5-VL that reads each PDF page as an image
Cheap to run at scale: on a rented NVIDIA H100, olmOCR-2 processes a few pages per second, working out to about $2 per 10,000 pages in cloud costs
Strongest table benchmark: scores 84.9 on tables on its own olmOCR-Bench, the highest among open VLM-OCR models at release
olmOCR-2 takes the whole PDF page as an image and produces structured output in a single step. This is the same architecture as Docling’s VLM pipeline from part 1, just with a different model.
PDF page rendered as image
┌─────────────────────┐
│ Text paragraph...   │
│ Name     Score      │
│ Alice    92         │
│ Bob      85         │
└─────────────────────┘
           │
           ▼
One model reads the page
and writes the output
           │
           ▼
| Name  | Score |
|-------|-------|
| Alice | 92    |
| Bob   | 85    |
Download the GGUF and vision projector
To use olmOCR-2 with llama.cpp on Apple Silicon, install llama.cpp and download two files: the model weights and the vision projector (mmproj).
brew install llama.cpp
# Language model (Q8_0, ~8 GB)
curl -L -O https://huggingface.co/lmstudio-community/olmOCR-2-7B-1025-GGUF/resolve/main/olmOCR-2-7B-1025-Q8_0.gguf
# Vision projector (F16, ~1.4 GB)
curl -L -O https://huggingface.co/lmstudio-community/olmOCR-2-7B-1025-GGUF/resolve/main/mmproj-olmOCR-2-7B-1025-F16.gguf
Table extraction
olmOCR-2 reads images, not PDFs, so we’ll extract tables in three steps:
Convert each PDF page to an image
Run olmOCR-2 on each image and collect the output
Extract the tables from the combined output with a regex
For step 1, we will use pdf2image, which depends on the poppler system binary. Install both:
brew install poppler
pip install pdf2image
Now convert each page to a JPEG:
import subprocess
from pathlib import Path
from pdf2image import convert_from_path
images_dir = Path("images")
images_dir.mkdir(exist_ok=True)
pages = convert_from_path(local_pdf, dpi=200)
for i, page in enumerate(pages):
    page.save(images_dir / f"page_{i}.jpg")
olmOCR-2 doesn’t have a pure-Python API that runs on Apple Silicon, so we shell out to llama-mtmd-cli via subprocess for each page. The command for one page looks like this:
llama-mtmd-cli \
-m olmOCR-2-7B-1025-Q8_0.gguf \
--mmproj mmproj-olmOCR-2-7B-1025-F16.gguf \
--image page_0.jpg \
-p "Convert this page to markdown. Preserve tables exactly. Output tables in HTML format." \
--n-predict 3072
What each flag does:
-m: the language model weights (the .gguf we downloaded)
--mmproj: the vision encoder (the mmproj we downloaded)
--image: the input image to process
-p: the prompt sent to the model
--n-predict: the maximum number of tokens to generate (3072 is enough for most table-heavy pages)
Wrap it in a Python helper so we can loop over pages:
import re
def extract_with_olmocr(page_path: str) -> str:
    result = subprocess.run(
        [
            "llama-mtmd-cli",
            "-m", "olmOCR-2-7B-1025-Q8_0.gguf",
            "--mmproj", "mmproj-olmOCR-2-7B-1025-F16.gguf",
            "--image", page_path,
            "-p", "Convert this page to markdown. Preserve tables exactly. Output tables in HTML format.",
            "--n-predict", "3072",
        ],
        capture_output=True,
        text=True,
    )
    return result.stdout
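The helper above returns whatever is in stdout even if llama-mtmd-cli exits with an error, so a failed page would silently become an empty string. A hedged sketch of a stricter wrapper follows. `run_checked` is a hypothetical name, and the only assumption is that the CLI signals failure with a non-zero exit code.

```python
import subprocess
from typing import Optional, Sequence


def run_checked(cmd: Sequence[str], timeout: Optional[float] = None) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    result = subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        # Include the tail of stderr so the failing page is easy to diagnose
        raise RuntimeError(
            f"{cmd[0]} exited with code {result.returncode}: {result.stderr[-500:]}"
        )
    return result.stdout
```

Passing the same argument list used in `extract_with_olmocr` to `run_checked` would make a crashed page stop the loop instead of producing a gap in the combined output.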
Run the helper on every page and combine the outputs:
%%time
olmocr_output = "\n".join(
    extract_with_olmocr(str(images_dir / f"page_{i}.jpg")) for i in range(len(pages))
)
Wall time: 5min 34s
olmOCR-2’s output is mostly Markdown but tables come out as HTML blocks. Extract them with a regex:
all_tables = re.findall(r"<table>.*?</table>", olmocr_output, re.DOTALL)
print(f"Items tagged as table: {len(all_tables)}")
Items tagged as table: 4
Not every block tagged <table> is actually a table. olmOCR-2 misreads the author block on the title page as a table and outputs two copies of it. We filter both out:
incorrect_table_indices = (1, 2)
tables = [t for i, t in enumerate(all_tables) if i not in incorrect_table_indices]
print(f"Actual tables: {len(tables)}")
Actual tables: 2
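To inspect an extracted table programmatically rather than only rendering it, the HTML can be parsed into rows of cell text. The sketch below uses only the standard library's `html.parser`. `TableRows` and `table_to_rows` are hypothetical names, and it assumes simple, well-formed, non-nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_rows(html: str) -> list:
    """Parse one HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For example, `table_to_rows(tables[1])` would give a list of lists that is easier to compare against the source than rendered HTML.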
The output is HTML, so use IPython.display.HTML to see it rendered:
from IPython.display import display, HTML
Let’s look at the hardest table in the document: 12 rows of similar-looking numbers and no cell borders to mark column boundaries. Here’s the original from the PDF:
And here’s what olmOCR-2 extracted:
display(HTML(tables[1]))
Worked:
All 12 row labels (Caption, Footnote, ..., All) preserved
12 data rows extracted with one numeric value per cell
Didn’t work:
Two column headers are missing: Only 4 of the 6 columns have headers, so the class-label column and one of the model columns appear unlabeled.
MRCNN R101 is dropped from the header row: The numeric values in that column still appear, but they sit under the wrong header name.
Hyphenated ranges become decimals: Every entry in the “human” range column is wrong: 84-89 becomes 84.89, 83-91 becomes 83.91, and so on.
Numeric values drift in several cells: Most rows have at least one digit substitution (Page-footer 61.6 → 74.6, List-item 81.2 → 81.6, All-row 72.4 → 77.4).
Conclusion: olmOCR-2’s output looks clean but can be quietly wrong. It introduces character-level errors on dense numeric tables. Verify numeric values before trusting them.
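One way to do that verification is to type in a few reference rows by hand from the PDF and diff them against the extraction. Here is a minimal sketch, assuming both tables are already lists of rows of strings. `diff_cells` is a hypothetical helper, and it only compares rows that exist in both tables.

```python
def diff_cells(expected, extracted):
    """Return (row, col, expected, extracted) for every cell that differs."""
    mismatches = []
    for r, (exp_row, got_row) in enumerate(zip(expected, extracted)):
        width = max(len(exp_row), len(got_row))
        for c in range(width):
            # A missing cell on either side is reported as None
            exp = exp_row[c].strip() if c < len(exp_row) else None
            got = got_row[c].strip() if c < len(got_row) else None
            if exp != got:
                mismatches.append((r, c, exp, got))
    return mismatches
```

With the Page-footer substitution reported above, `diff_cells([["Page-footer", "61.6"]], [["Page-footer", "74.6"]])` returns `[(0, 1, "61.6", "74.6")]`, which pinpoints the cell to recheck.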
Performance
olmOCR-2 took 5 min 34 s for the 9-page PDF on an Apple M5 Pro (64 GB RAM), about 37 seconds per page through GGUF + llama.cpp.
For production on a Mac, switch to the native MLX build (mlx-community/olmOCR-2-7B-1025-8bit), which runs about 20% faster than GGUF.
PaddleOCR-VL 1.6: Pipeline VLM
PaddleOCR-VL is Baidu’s open-weight document parser. It stands out for three reasons:
A 1B fine-tune of ERNIE-4.5, the smallest model of the new VLM-OCR generation
Strong multilingual support including Chinese ancient documents, scans, and stamps (not tested in this article)
Mature ecosystem: PaddleOCR has 78.9k stars on GitHub and a long history of production deployment
Unlike olmOCR-2’s single-pass approach, PaddleOCR-VL splits table extraction into two stages:
Layout detection locates each text block, table, and figure on the page
Element-level VL recognition reads each detected region and converts it to text or structured Markdown
PDF page
┌─────────────────────┐
│ Text paragraph...   │
│ Name     Score      │
│ Alice    92         │
│ Bob      85         │
└─────────────────────┘
           │
           ▼
1. Layout detection identifies [TABLE] region
           │
           ▼
2. Element-level VL reads only the table region
           │
           ▼
| Name  | Score |
|-------|-------|
| Alice | 92    |
| Bob   | 85    |
Install
Pick the install that matches your hardware.
Apple Silicon (Mac):
pip install paddlepaddle
pip install -U "paddleocr[doc-parser]>=3.6.0"
Linux / Windows (NVIDIA):
pip install paddlepaddle-gpu==3.2.1
pip install -U "paddleocr[doc-parser]>=3.6.0"
This article uses PaddleOCR v3.6.0.
Table extraction
Unlike olmOCR-2, PaddleOCR-VL accepts a PDF path directly and returns a result object per page. No PDF-to-image conversion or subprocess loop required:
from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL(pipeline_version="v1.6")
Run the pipeline on the PDF:
%%time
results = pipeline.predict(local_pdf)
Wall time: 7min 56s
Each entry in results corresponds to one page of the PDF. Loop through them and collect the tables:
# Create an output directory for the per-page markdown files
paddle_output_dir = Path("paddle_output")
paddle_output_dir.mkdir(exist_ok=True)
# Save each page's markdown to disk
for res in results:
    res.save_to_markdown(save_path=str(paddle_output_dir))
# Find every HTML table block
all_paddle_tables = []
for md_file in sorted(paddle_output_dir.glob("*.md")):
    content = md_file.read_text()
    all_paddle_tables.extend(re.findall(r"<table[^>]*>.*?</table>", content, re.DOTALL))
print(f"Items tagged as table: {len(all_paddle_tables)}")
Items tagged as table: 3
Not every block PaddleOCR-VL tagged as a table is a unique table. The third item is a malformed near-duplicate of the second. Let’s filter it out:
incorrect_table_indices = (2,)
paddle_tables = [t for i, t in enumerate(all_paddle_tables) if i not in incorrect_table_indices]
print(f"Actual tables: {len(paddle_tables)}")
Actual tables: 2
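Hard-coded indices only work for this one document. As an alternative, near-duplicates can be flagged by text similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library. `drop_near_duplicates` and the 0.9 threshold are assumptions chosen for illustration, not values tuned on this PDF.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(tables, threshold=0.9):
    """Keep each table unless it is highly similar to one already kept."""
    kept = []
    for candidate in tables:
        is_duplicate = any(
            # autojunk=False avoids the popularity heuristic on long HTML strings
            SequenceMatcher(None, previous, candidate, autojunk=False).ratio() >= threshold
            for previous in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept
```

The comparison is pairwise, so it suits a handful of tables per document, not thousands. A threshold this high could still miss a badly malformed copy, so a manual look at the result remains worthwhile.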
Now the same hard table. Here’s the original from the PDF:
And here’s what PaddleOCR-VL extracted:
display(HTML(paddle_tables[1]))
Worked:
All 12 class-label rows plus the Total row are present (truncated above for space)
Hyphenated ranges preserved correctly as “84-89”, “40-61”, exactly where olmOCR-2 misread them as decimals
“n/a” entries preserved
All numeric values match the source
Didn’t work:
Header grouping is wrong: The two parent headers in the original PDF get split into three in the extraction: “Count” is absorbed into “% of Total”, and “triple inter-annotator mAP @ 0.5-0.95 (%)” is split into two separate parents.
Conclusion: PaddleOCR-VL is 7x smaller than olmOCR-2 (1B vs 7B parameters) and still more accurate on this PDF. All numeric values match the source, and the only real flaw is the mis-grouped multi-tier headers.
Performance
PaddleOCR-VL 1.6 took about 7 min 56 s for the full 9-page PDF on an Apple M5 Pro running CPU PaddlePaddle, roughly 53 seconds per page.
Even though the model is smaller than olmOCR-2, the pipeline overhead (layout detection plus element-level recognition) makes it slower per page than olmOCR-2 on this hardware.
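The per-page figures follow directly from the wall times and the 9-page length. Here is a quick check of the arithmetic, using only numbers reported in this article (`seconds_per_page` is a hypothetical helper):

```python
def seconds_per_page(minutes: int, seconds: int, pages: int) -> float:
    """Average wall-clock seconds per page."""
    return (minutes * 60 + seconds) / pages


olmocr = seconds_per_page(5, 34, 9)  # olmOCR-2: 5 min 34 s
paddle = seconds_per_page(7, 56, 9)  # PaddleOCR-VL: 7 min 56 s
print(f"olmOCR-2: {olmocr:.1f} s/page, PaddleOCR-VL: {paddle:.1f} s/page")
print(f"PaddleOCR-VL / olmOCR-2: {paddle / olmocr:.2f}x")
```

On these numbers, PaddleOCR-VL takes roughly 1.4 times as long per page as olmOCR-2 on this machine.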
Try It Yourself
These benchmarks are based on a single academic PDF tested on an Apple M5 Pro (64 GB RAM), with olmOCR-2 running as a GGUF Q8_0 quantization via llama.cpp and PaddleOCR-VL running on CPU PaddlePaddle. Table complexity, document language, scan quality, and hardware all affect the results. The best way to pick the right tool is to run each one on a sample of your own PDFs.
For the first-table walkthrough on each tool and the full runtime setup, see the complete comparison.
Originally published on CodeCut.




