pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Compare pandas, Polars, and DuckDB for data analysis. Learn when to use each tool based on data size, performance needs, and workflow preferences.

Jun 30, 2026

Personal Update

A quick personal update: I welcomed my baby boy last week, and both mom and baby are doing well.

I’ll be taking some time to enjoy this special chapter with my family, but I’ve prepared plenty of content in advance, so you can still expect the same high-quality posts and newsletters during my maternity leave.

Introduction

pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.

Each tool excels in different scenarios:

Tool    Backend   Execution Model             Best For
pandas  C/Python  Eager, single-threaded      Small datasets, prototyping, ML integration
Polars  Rust      Lazy/Eager, multi-threaded  Large-scale analytics, data pipelines
DuckDB  C++       SQL-first, multi-threaded   SQL workflows, embedded analytics, file queries

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

This is a condensed version focused on the performance benchmarks. For the full guide with side-by-side syntax comparisons, interoperability examples, and a decision matrix, see the complete comparison.

Tool Strengths at a Glance

pandas

pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

Direct compatibility with scikit-learn, statsmodels, and visualization libraries
Rich ecosystem of extensions (ydata-profiling, formerly pandas-profiling; pandasql; etc.)
Mature time series functionality
Familiar syntax that most data scientists already know

Polars

Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:

Speeds up operations by using all available CPU cores by default
Builds a query plan first, then executes only what’s needed
Streaming mode for processing datasets larger than RAM
Expressive method chaining with a pandas-like API

DuckDB

DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

Native SQL syntax with full analytical query support
Queries CSV, Parquet, and JSON files directly, without first loading them into a DataFrame
Uses disk storage automatically when data exceeds available memory
Zero-configuration embedded database requiring no server setup

Setup

Install all three libraries:

pip install pandas polars duckdb

Generate sample data for benchmarking:

import pandas as pd
import numpy as np

np.random.seed(42)
n_rows = 5_000_000

data = {
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
    "region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
    "amount": np.random.rand(n_rows) * 1000,
    "quantity": np.random.randint(1, 100, size=n_rows),
}

df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")

Created sales_data.csv with 5,000,000 rows

Data Loading Performance

pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.

pandas

Single-threaded CSV parsing loads data sequentially.

┌─────────────────────────────────────────────┐
│ CPU Core 1                                  │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → ... → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2  [idle]                          │
│ CPU Core 3  [idle]                          │
│ CPU Core 4  [idle]                          │
└─────────────────────────────────────────────┘

pandas_time = %timeit -o pd.read_csv("sales_data.csv")

1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars

Multi-threaded parsing distributes file reading across all available cores.

┌─────────────────────────────────────────────┐
│ CPU Core 1  ┌────────────────┐              │
│             │ ████████████   │              │
│ CPU Core 2  ┌────────────────┐              │
│             │ ████████████   │              │
│ CPU Core 3  ┌────────────────┐              │
│             │ ████████████   │              │
│ CPU Core 4  ┌────────────────┐              │
│             │ ████████████   │              │
└─────────────────────────────────────────────┘

import polars as pl

polars_time = %timeit -o pl.read_csv("sales_data.csv")

137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB

Similar to Polars, file reading is distributed across all available cores.

┌─────────────────────────────────────────────┐
│ CPU Core 1  ┌────────────────┐              │
│             │ ████████████   │              │
│ CPU Core 2  ┌────────────────┐              │
│             │ ████████████   │              │
│ CPU Core 3  ┌────────────────┐              │
│             │ ████████████   │              │
│ CPU Core 4  ┌────────────────┐              │
│             │ ████████████   │              │
└─────────────────────────────────────────────┘

import duckdb

duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()

762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")

Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas

Polars leads with a 7.7× speedup in CSV reading. DuckDB’s 1.4× figure also includes converting the full result to a pandas DataFrame with .df(), so it understates DuckDB’s raw parsing speed. DuckDB shines when querying files directly or running complex analytical queries.
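The %timeit magic used above only works inside IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module. The helper names average_seconds and speedup are illustrative, not part of any library, and the stand-in workloads are placeholders for the real loaders:

```python
import timeit


def average_seconds(func, repeat=7, number=1):
    """Mean wall-clock seconds per call over `repeat` runs of `number` calls each."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return sum(runs) / (repeat * number)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads so the sketch runs anywhere. To time the real loaders,
# pass e.g. lambda: pd.read_csv("sales_data.csv") instead.
slow = average_seconds(lambda: sum(range(200_000)))
fast = average_seconds(lambda: sum(range(20_000)))
print(f"Candidate is {speedup(slow, fast):.1f}x faster than baseline")
```

Like %timeit -o, this reports a mean over seven runs, but it does not report the standard deviation, so treat small differences with caution.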

Query Optimization

pandas: No Optimization

pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.

┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows          → 5M rows in memory          │
│ Step 2: Filter (amount > 100)  → ~4.5M rows in memory       │
│ Step 3: GroupBy                → New DataFrame              │
│ Step 4: Mean                   → Final result               │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high - stores all intermediates)

def pandas_query():
    return (
        pd.read_csv("sales_data.csv")
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_opt_time = %timeit -o pandas_query()

1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

Full CSV load: All rows are read before filtering
No predicate pushdown: Rows are filtered after loading the entire file into memory
No projection pushdown: All columns are loaded, even unused ones

You can manually add usecols to load fewer columns:

def pandas_query_optimized():
    return (
        pd.read_csv("sales_data.csv", usecols=["category", "amount"])
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_usecols_time = %timeit -o pandas_query_optimized()

1.06 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is faster, but has two drawbacks:

Manual tracking: You must specify columns yourself; change the query, update usecols
No row filtering: All rows still load before the filter applies

Polars and DuckDB handle both automatically by analyzing your query before execution.
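For comparison, the row-filtering drawback can be worked around by hand in pandas by reading the file in chunks and filtering each chunk before aggregating. The sketch below is one possible approach, not a benchmarked recommendation. The function name chunked_filtered_mean, the chunk size, and the small temporary sample file are assumptions made for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def chunked_filtered_mean(path, threshold=100, chunksize=50_000):
    """Mean of `amount` per category for rows with amount > threshold, read chunk by chunk."""
    totals = {}  # category -> [running sum, running count]
    reader = pd.read_csv(path, usecols=["category", "amount"], chunksize=chunksize)
    for chunk in reader:
        # Filter each chunk immediately so discarded rows never accumulate in memory
        kept = chunk[chunk["amount"] > threshold]
        grouped = kept.groupby("category")["amount"].agg(["sum", "count"])
        for category, row in grouped.iterrows():
            entry = totals.setdefault(category, [0.0, 0])
            entry[0] += row["sum"]
            entry[1] += row["count"]
    means = {category: s / n for category, (s, n) in totals.items()}
    return pd.Series(means, name="amount").sort_index()


# Small sample file so the sketch runs quickly (hypothetical data, same columns as above)
rng = np.random.default_rng(42)
n_sample = 20_000
sample = pd.DataFrame({
    "category": rng.choice(["Electronics", "Clothing", "Food", "Books"], size=n_sample),
    "region": rng.choice(["North", "South", "East", "West"], size=n_sample),
    "amount": rng.random(n_sample) * 1000,
    "quantity": rng.integers(1, 100, size=n_sample),
})
csv_path = os.path.join(tempfile.mkdtemp(), "sales_sample.csv")
sample.to_csv(csv_path, index=False)

chunked = chunked_filtered_mean(csv_path, chunksize=5_000)
full = pd.read_csv(csv_path).query("amount > 100").groupby("category")["amount"].mean()
print(chunked)
```

This keeps peak memory bounded by the chunk size, but the bookkeeping is manual and has to be rewritten for every new query shape.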

Polars: Lazy Evaluation

Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:

┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built:                                           │
│   scan_csv → filter → group_by → agg                        │
│                                                             │
│ Optimizations Applied:                                      │
│   • Predicate pushdown (filter during scan)                 │
│   • Projection pushdown (read only needed columns)          │
│   • Multi-threaded execution (parallel across CPU cores)    │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low - no intermediate DataFrames)

query_pl = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").mean().alias("avg_amount"))
)

# View the optimized query plan
print(query_pl.explain())

AGGREGATE[maintain_order: false]
  [col("amount").mean().alias("avg_amount")] BY [col("category")]
  FROM
  Csv SCAN [sales_data.csv] [id: 4687118704]
  PROJECT 2/4 COLUMNS
  SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

Predicate pushdown: SELECTION filters during scan, not after loading
Projection pushdown: PROJECT 2/4 COLUMNS reads only what’s needed
Operation reordering: Aggregate runs on filtered data, not the full dataset

Execute the optimized query:

def polars_query():
    return (
        pl.scan_csv("sales_data.csv")
        .filter(pl.col("amount") > 100)
        .group_by("category")
        .agg(pl.col("amount").mean().alias("avg_amount"))
        .collect()
    )

polars_opt_time = %timeit -o polars_query()

148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer

DuckDB’s SQL optimizer applies similar optimizations automatically:

┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built:                                           │
│   SQL → Parser → Optimizer → Execution Plan                 │
│                                                             │
│ Optimizations Applied:                                      │
│   • Predicate pushdown (WHERE during scan)                  │
│   • Projection pushdown (SELECT only needed columns)        │
│   • Vectorized execution (process 2048 rows per batch)      │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low - streaming execution)

def duckdb_query():
    return duckdb.sql("""
        SELECT category, AVG(amount) as avg_amount
        FROM 'sales_data.csv'
        WHERE amount > 100
        GROUP BY category
    """).df()

duckdb_opt_time = %timeit -o duckdb_query()

245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:

print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")

Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas

Polars outperforms DuckDB (9.9× vs 6.0×) on this workload, a single filter-then-aggregate pass over one CSV file. The ranking may differ for other query shapes, file formats, and hardware. DuckDB’s strength lies in complex SQL queries with joins and subqueries.

Memory Efficiency

pandas: Full Memory Load

pandas loads the entire dataset into RAM:

┌─────────────────────────────────────────────────────────────┐
│  RAM                                                        │
│  ┌────────────────────────────────────────────────────────┐ │
│  │████████████████████████████████████████████████████████│ │
│  │██████████████████ ALL 5M ROWS █████████████████████████│ │
│  │████████████████████████████████████████████████████████│ │
│  └────────────────────────────────────────────────────────┘ │
│  Usage: 707,495 KB (entire dataset in memory)               │
└─────────────────────────────────────────────────────────────┘

df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")

pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas typically fails with a MemoryError, or the operating system kills the process.
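When the data does fit in RAM but only barely, part of the footprint can sometimes be recovered inside pandas by storing repetitive string columns as categoricals. This is a minimal sketch on synthetic data with the same low-cardinality columns as the benchmark file; the actual saving depends on the pandas version and the data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_sample = 100_000

df = pd.DataFrame({
    "category": rng.choice(["Electronics", "Clothing", "Food", "Books"], size=n_sample),
    "region": rng.choice(["North", "South", "East", "West"], size=n_sample),
    "amount": rng.random(n_sample) * 1000,
})

before_kb = df.memory_usage(deep=True).sum() / 1e3

# Each distinct string is stored once; rows hold small integer codes instead
compact = df.astype({"category": "category", "region": "category"})
after_kb = compact.memory_usage(deep=True).sum() / 1e3

print(f"Before: {before_kb:,.0f} KB")
print(f"After:  {after_kb:,.0f} KB")
```

This reduces the footprint but does not remove the limit: the whole dataset still has to fit in memory.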

Polars: Streaming Mode

Polars can process data in streaming mode, handling chunks without loading everything:

┌─────────────────────────────────────────────────────────────┐
│  RAM                                                        │
│  ┌────────────────────────────────────────────────────────┐ │
│  │█                                                       │ │
│  │                    (result only)                       │ │
│  │                                                        │ │
│  └────────────────────────────────────────────────────────┘ │
│  Usage: 0.06 KB (streams chunks, keeps only result)         │
└─────────────────────────────────────────────────────────────┘

# Note: recent Polars releases deprecate streaming=True
# in favor of collect(engine="streaming")
result_pl_stream = (
    pl.scan_csv("sales_data.csv")
    .group_by("category")
    .agg(pl.col("amount").mean())
    .collect(streaming=True)
)

polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")

Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:

(
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 500)
    .sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk

DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:

┌─────────────────────────────────────────────────────────────┐
│  RAM                              Disk (if needed)          │
│  ┌──────────────────────────┐     ┌──────────────────────┐  │
│  │█                         │     │░░░░░░░░░░░░░░░░░░░░░░│  │
│  │     (up to 500MB)        │  →  │    (overflow here)   │  │
│  │                          │     │                      │  │
│  └──────────────────────────┘     └──────────────────────┘  │
│  Usage: 0.42 KB (spills to disk when RAM full)              │
└─────────────────────────────────────────────────────────────┘

# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")

# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
    SELECT category, AVG(amount) as avg_amount
    FROM 'sales_data.csv'
    GROUP BY category
""").df()

duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")

DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.

print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")

pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)

Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas

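The million-fold gap compares two different things: the size of the final 4-row result for Polars and DuckDB against the full 5-million-row DataFrame for pandas. It is not a measurement of peak memory during execution. The underlying point still holds: Polars and DuckDB can process the file in chunks and keep only the result in memory, while pandas must hold all 5 million rows to compute the same aggregation.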

Final Thoughts

If your codebase is built on pandas, you don’t need to rewrite all of it. Migrate where it matters:

Profile first: Find which pandas operations are slow
Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
Add DuckDB: When SQL is cleaner than chained DataFrame operations

Keep pandas for the final ML steps. Convert a Polars DataFrame with df.to_pandas(), or a DuckDB result with .df(), when needed.

For side-by-side syntax comparisons, interoperability examples, and a full decision matrix with recommendations by data size and workflow, see the complete comparison.

Originally published on CodeCut.
