
This Alternative to Pandas Processes Data 350 Times Faster

Learn How to Handle 111 Million Rows in Less Than 2 Seconds with Python

Elshad Karimov
2 min read · Apr 29, 2024

Everyone knows Pandas. It's a favorite among newcomers to data analytics, but it tends to lag on large datasets. Now, let's talk about DuckDB. Behind the technical jargon, it is an open-source, in-process relational OLAP database that can run entirely in memory and prioritizes speed. For analytical queries such as aggregations, it is often significantly faster than Pandas, particularly on large data volumes.


What's even better? DuckDB offers a Python library (the duckdb package). This means you can quickly replace sluggish Pandas operations with DuckDB queries, especially if you're familiar with SQL.

In this post, we’ll explore a head-to-head comparison of these two as we aggregate over 100 million rows of data. Let’s get started!

Using Pandas

Pandas is straightforward and excellent for smaller datasets. Here's how you might typically load and aggregate data with Pandas:

import pandas as pd

# Load data from a CSV file
df = pd.read_csv('large_dataset.csv')

# Aggregate data
result = df.groupby('category')['value'].sum()
print(result)
