pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool
Compare pandas, Polars, and DuckDB for data analysis. Learn when to use each tool based on data size, performance needs, and workflow preferences.
Personal Update
A quick personal update: I welcomed my baby boy last week, and both mom and baby are doing well.
I’ll be taking some time to enjoy this special chapter with my family, but I’ve prepared plenty of content in advance, so you can still expect the same high-quality posts and newsletters during my maternity leave.
Introduction
pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.
Each tool excels in different scenarios:
Tool Backend Execution Model Best For
pandas C/Python Eager, single-threaded Small datasets, prototyping, ML integration
Polars Rust Lazy/Eager, multi-threaded Large-scale analytics, data pipelines
DuckDB C++ SQL-first, multi-threaded SQL workflows, embedded analytics, file queriesThis guide compares all three tools with practical examples, helping you choose the right one for your workflow.
This is a condensed version focused on the performance benchmarks. For the full guide with side-by-side syntax comparisons, interoperability examples, and a decision matrix, see the complete comparison.
Tool Strengths at a Glance
pandas
pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:
Direct compatibility with scikit-learn, statsmodels, and visualization libraries
Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
Mature time series functionality
Familiar syntax that most data scientists already know
Polars
Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:
Speeds up operations by using all available CPU cores by default
Builds a query plan first, then executes only what’s needed
Streaming mode for processing datasets larger than RAM
Expressive method chaining with a pandas-like API
DuckDB
DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:
Native SQL syntax with full analytical query support
Queries CSV, Parquet, and JSON files directly without loading
Uses disk storage automatically when data exceeds available memory
Zero-configuration embedded database requiring no server setup
Setup
Install all three libraries:
pip install pandas polars duckdb
Generate sample data for benchmarking:
import pandas as pd
import numpy as np
np.random.seed(42)
n_rows = 5_000_000
data = {
"category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
"region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
"amount": np.random.rand(n_rows) * 1000,
"quantity": np.random.randint(1, 100, size=n_rows),
}
df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")
Created sales_data.csv with 5,000,000 rows
Data Loading Performance
pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.
pandas
Single-threaded CSV parsing loads data sequentially.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → ... → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘
pandas_time = %timeit -o pd.read_csv("sales_data.csv")
1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars
Multi-threaded parsing distributes file reading across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘
import polars as pl
polars_time = %timeit -o pl.read_csv("sales_data.csv")
137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
DuckDB
Similar to Polars, file reading is distributed across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘
import duckdb
duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()
762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")
Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas
While Polars leads with a 7.7× speedup in CSV reading, DuckDB’s 1.4× improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.
Query Optimization
pandas: No Optimization
pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows → 10M rows in memory │
│ Step 2: Filter (amount > 100) → 5M rows in memory │
│ Step 3: GroupBy → New DataFrame │
│ Step 4: Mean → Final result │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high - stores all intermediates)
def pandas_query():
return (
pd.read_csv("sales_data.csv")
.query('amount > 100')
.groupby('category')['amount']
.mean()
)
pandas_opt_time = %timeit -o pandas_query()
1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This approach has three problems:
Full CSV load: All rows are read before filtering
No predicate pushdown: Rows are filtered after loading the entire file into memory
No projection pushdown: All columns are loaded, even unused ones
You can manually add usecols to load fewer columns:
def pandas_query_optimized():
return (
pd.read_csv("sales_data.csv", usecols=["category", "amount"])
.query('amount > 100')
.groupby('category')['amount']
.mean()
)
pandas_usecols_time = %timeit -o pandas_query_optimized()
1.06 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is faster, but has two drawbacks:
Manual tracking: You must specify columns yourself; change the query, update
usecolsNo row filtering: All rows still load before the filter applies
Polars and DuckDB handle both automatically by analyzing your query before execution.
Polars: Lazy Evaluation
Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ scan_csv → filter → group_by → agg │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (filter during scan) │
│ • Projection pushdown (read only needed columns) │
│ • Multi-threaded execution (parallel across CPU cores) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low - no intermediate DataFrames)
query_pl = (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
)
# View the optimized query plan
print(query_pl.explain())
AGGREGATE[maintain_order: false]
[col("amount").mean().alias("avg_amount")] BY [col("category")]
FROM
Csv SCAN [sales_data.csv] [id: 4687118704]
PROJECT 2/4 COLUMNS
SELECTION: [(col("amount")) > (100.0)]
The query plan shows these optimizations:
Predicate pushdown:
SELECTIONfilters during scan, not after loadingProjection pushdown:
PROJECT 2/4 COLUMNSreads only what’s neededOperation reordering: Aggregate runs on filtered data, not the full dataset
Execute the optimized query:
def polars_query():
return (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
.collect()
)
polars_opt_time = %timeit -o polars_query()
148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
DuckDB: SQL Optimizer
DuckDB’s SQL optimizer applies similar optimizations automatically:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ SQL → Parser → Optimizer → Execution Plan │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (WHERE during scan) │
│ • Projection pushdown (SELECT only needed columns) │
│ • Vectorized execution (process 1024 rows per batch) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low - streaming execution)
def duckdb_query():
return duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
WHERE amount > 100
GROUP BY category
""").df()
duckdb_opt_time = %timeit -o duckdb_query()
245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let’s compare the performance of the optimized queries:
print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")
Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas
Polars outperforms DuckDB (9.9× vs 6.0×) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.
Memory Efficiency
pandas: Full Memory Load
pandas loads the entire dataset into RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████│ │
│ │██████████████████ ALL 10M ROWS ████████████████████████│ │
│ │████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 707,495 KB (entire dataset in memory) │
└─────────────────────────────────────────────────────────────┘
df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")
pandas memory usage: 707,495 KB
For larger-than-RAM datasets, pandas throws an out-of-memory error.
Polars: Streaming Mode
Polars can process data in streaming mode, handling chunks without loading everything:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │█ │ │
│ │ (result only) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 0.06 KB (streams chunks, keeps only result) │
└─────────────────────────────────────────────────────────────┘
result_pl_stream = (
pl.scan_csv("sales_data.csv")
.group_by("category")
.agg(pl.col("amount").mean())
.collect(streaming=True)
)
polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")
Polars result memory: 0.06 KB
For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:
(
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 500)
.sink_parquet("filtered_sales.parquet")
)
DuckDB: Automatic Spill-to-Disk
DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM Disk (if needed) │
│ ┌──────────────────────────┐ ┌──────────────────────┐ │
│ │█ │ │░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ (up to 500MB) │ → │ (overflow here) │ │
│ │ │ │ │ │
│ └──────────────────────────┘ └──────────────────────┘ │
│ Usage: 0.42 KB (spills to disk when RAM full) │
└─────────────────────────────────────────────────────────────┘
# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")
# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
GROUP BY category
""").df()
duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")
DuckDB result memory: 0.42 KB
DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.
print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")
pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)
Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas
The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and only keep the 4-row result in memory, while pandas must hold all 10 million rows to compute the same aggregation.
Final Thoughts
If your code is all written in pandas, you don’t need to rewrite it all. You can migrate where it matters:
Profile first: Find which pandas operations are slow
Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
Add DuckDB: When SQL is cleaner than chained DataFrame operations
Keep pandas for final ML steps. Convert with df.to_pandas() when needed.
For side-by-side syntax comparisons, interoperability examples, and a full decision matrix with recommendations by data size and workflow, see the complete comparison.
Originally published on CodeCut.
