pandas 3.0: 5-10x Faster String Operations with PyArrow
Plus: build readable regex with Pregex
Grab your coffee. Here are this week’s highlights.
📅 Today’s Picks
pandas 3.0: 5-10x Faster String Operations with PyArrow
Problem
Traditionally, pandas stores strings as object dtype, where each string is a separate Python object scattered across memory.
This makes string operations slow and the dtype ambiguous, since both pure string columns and mixed-type columns show up as object.
Solution
pandas 3.0 introduces a dedicated str dtype, backed by PyArrow when it is installed, which stores strings in contiguous memory blocks instead of individual Python objects.
Key benefits:
5-10x faster string operations because data is stored contiguously
50% lower memory usage by eliminating per-string Python object overhead
Clear distinction between string and mixed-type columns
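To see the dtype distinction on your own machine, here is a minimal sketch. The sample values are made up, and the dtype printed for the pure string column depends on your pandas version: it should read str on pandas 3.0 and object on earlier releases.

```python
import pandas as pd

# Pure string column: inferred as the dedicated str dtype on pandas 3.0,
# and as object on earlier versions.
names = pd.Series(["alice", "bob", "carol"])

# Mixed-type column: stays object regardless of version.
mixed = pd.Series(["alice", 42, 3.14])

names_dtype = str(names.dtype)
mixed_dtype = str(mixed.dtype)
print(f"pandas {pd.__version__}: names={names_dtype}, mixed={mixed_dtype}")

# The .str accessor is used the same way with either dtype.
upper = names.str.upper().tolist()
print(upper)

# deep=True counts the string payload, which makes it handy for
# comparing memory before and after upgrading.
print(names.memory_usage(deep=True))
```

Running the same snippet before and after an upgrade is a quick way to check how much of the speed and memory difference applies to your own data.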
📖 View Full Article | 🧪 Run code
Build Self-Documenting Regex with Pregex
Problem
Regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} are difficult to read and intimidating.
Team members without regex expertise might struggle to understand and modify these validation patterns.
Solution
Pregex transforms regex into readable Python code using descriptive components.
Key benefits:
Code that explains its intent without comments
Easy modification without regex expertise
Composable patterns for complex validation
Export to regex format when needed
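To get a feel for the idea of descriptive, composable components, here is a rough sketch that uses only Python's built-in re module. It does not use Pregex's own classes; the variable names are illustrative choices, and the pieces are simply the email pattern quoted above split into named parts.

```python
import re

# Named building blocks for the email pattern quoted above.
local_part = r"[a-zA-Z0-9._%+-]+"  # characters allowed before the @
domain = r"[a-zA-Z0-9.-]+"         # domain name, dots and hyphens allowed
top_level = r"[a-zA-Z]{2,}"        # top-level domain, at least two letters

# Compose the pieces into the full pattern.
email_pattern = local_part + "@" + domain + r"\." + top_level
email = re.compile(email_pattern)

is_valid = email.fullmatch("user@example.com") is not None
is_rejected = email.fullmatch("not-an-email") is None
print(email_pattern, is_valid, is_rejected)
```

Pregex takes this further by replacing the raw character classes with readable Python objects, so check the full article for the library's actual syntax.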
📖 View Full Article | 🧪 Run code
📚 Latest Deep Dives
PDF Table Extraction: Docling vs Marker vs LlamaParse Compared
PDF files do not store tables as structured data. Instead, they position text at specific coordinates on the page. Table extraction tools must reconstruct the structure by determining which values belong in which rows and columns.
The problem becomes even harder when tables include multi-level headers, merged cells, or complex layouts.
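To make the coordinate problem concrete, here is a toy sketch of row reconstruction. The fragments, coordinates, and the rebuild_rows helper are all hypothetical, and this is not how Docling, Marker, or LlamaParse work internally; it only shows why values must be grouped by position.

```python
# Hypothetical text fragments as (x, y, text), with made-up coordinates.
# In PDF space, y grows upward, so larger y means higher on the page.
fragments = [
    (50, 700, "Region"), (200, 700, "Sales"),
    (50, 680, "North"), (200, 681, "120"),
    (200, 660, "95"), (50, 659, "South"),
]

def rebuild_rows(fragments, y_tolerance=3):
    """Group fragments with nearby y values into rows, ordered left to right."""
    rows = []  # each entry is (row_y, [(x, text), ...])
    # Walk the page from top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # same row as the previous fragment
        else:
            rows.append((y, [(x, text)]))  # start a new row
    # Sort each row by x so cells read left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]

table = rebuild_rows(fragments)
for row in table:
    print(row)
```

Even this simple case needs a tolerance for slightly misaligned text, which hints at why merged cells and multi-level headers are so much harder.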
To explore this problem, I experimented with three tools designed for PDF table extraction: LlamaParse, Marker, and Docling.
This article shows the results of those experiments.
☕️ Weekly Finds
Lance [Data Processing] - Modern columnar data format for ML with 100x faster random access than Parquet
Mathesar [Dashboard] - Spreadsheet-like interface for PostgreSQL that lets anyone view, edit, and query data
dotenvx [DevOps] - A better dotenv with encryption, multiple environments, and cross-platform support


