Still Stuck with Pandas? Discover DuckDB: A Library That’s 20x Faster!
Sarthak Shah
3 min read · Jan 5, 2025

Introduction
Pandas has been the go-to library for Python data manipulation for years. It’s
user-friendly, versatile, and powerful. However, as datasets grow larger,
Pandas starts showing its limitations, particularly in terms of performance
and memory usage. But what if I told you there’s a library that’s up to 20x
faster and just as easy to integrate into your Python workflows?

Enter DuckDB, a lightweight, high-performance, in-process analytical
database designed for SQL-based operations directly within Python. In this
article, we’ll explore why DuckDB might just be the secret weapon your data
projects have been missing.

What is DuckDB?
DuckDB is an in-process SQL database management system tailored for
Online Analytical Processing (OLAP) workloads. It runs analytical queries on
structured data inside your Python process, works directly with Pandas
DataFrames, and reads file formats such as CSV and Parquet.

Unlike traditional database systems, DuckDB shines in local analytical tasks,
making it perfect for developers, data scientists and analysts working on
personal systems or small-scale projects.

Key Features of DuckDB

1. Blazing Speed
Optimized for columnar data.

Efficient query planning and execution.

Handles datasets that Pandas struggles with in a fraction of the time.

2. Seamless Integration
Directly supports Pandas DataFrames.

Familiar SQL syntax simplifies complex data transformations and
aggregations.

Imports CSV files directly into your database for quick and efficient
data loading.

3. In-Memory and On-Disk Flexibility
Works in-memory for fast operations.

Supports persistent storage for handling large datasets.

4. Cross-Platform Compatibility
Runs on Windows, macOS, and Linux.

Integrates with Python, R, and C++.

5. Lightweight and Portable
Minimal dependencies.

Easy to install and use without extensive setup.

Why DuckDB Over Pandas?

The Pandas Problem
Pandas is fantastic for small-to-medium datasets but:

Memory Intensive: It loads entire datasets into memory, which can be a
bottleneck for large files.

Performance Bottlenecks: Computationally intensive operations like
filtering and aggregations become sluggish with larger datasets.

The DuckDB Advantage
SQL-First Approach: Complex tasks become simpler with SQL queries.

Optimized for Scale: Handles datasets much larger than memory
efficiently.

Faster Execution: Can be 20x or more faster on large analytical queries,
depending on the workload and hardware.

Minimal Setup: DuckDB integrates seamlessly into Python workflows.

Getting Started with DuckDB in Python

1. Installation
pip install duckdb

2. Loading Data
From a CSV File:

import duckdb
conn = duckdb.connect()
conn.execute("CREATE TABLE cars AS SELECT * FROM read_csv_auto('cars.csv');")

From a Pandas DataFrame:

import pandas as pd
import duckdb
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = duckdb.query("SELECT * FROM df WHERE A > 1").fetchdf()
print(result)

3. In-Memory Mode
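For fast, throwaway analysis, use DuckDB’s in-memory mode. Calling
duckdb.connect() with no arguments also creates an in-memory database;
passing ":memory:" makes that choice explicit: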

conn = duckdb.connect(":memory:")

Comparing DuckDB and Pandas
Let’s compare the performance of DuckDB and Pandas using a practical
example:

Code Example

import pandas as pd
import numpy as np
import time
import duckdb

# Generate a large dataset (100 million rows; lower data_size if memory is tight)
data_size = 10**8
df = pd.DataFrame({
'A': np.random.rand(data_size),
'C': np.random.randint(0, 100, size=data_size),
'D': np.random.choice(['X', 'Y', 'Z'], size=data_size)
})

# Pandas Operation
start_time = time.time()
result_pandas = df[df['C'] > 50]['A'].mean()
end_time = time.time()
pandas_time = end_time - start_time
print(f"Pandas Result: {result_pandas}, Time taken: {pandas_time:.6f} seconds")

# DuckDB Operation (queries the local DataFrame `df` by name)
st = time.time()
# fetchone() returns a one-element tuple; [0] extracts the scalar average
result_duckdb = duckdb.query("SELECT AVG(A) FROM df WHERE C > 50").fetchone()[0]
et = time.time()
duckdb_time = et - st
print(f"DuckDB Result: {result_duckdb}, Time taken: {duckdb_time:.6f} seconds")

# Performance Comparison
print(f"DuckDB is {pandas_time / duckdb_time:.2f} times faster than Pandas!")

Results
For a dataset of 100 million rows, one run produced the following results
(exact timings vary with hardware and library versions):

Pandas: ~2.29 seconds.

DuckDB: ~0.09 seconds.

DuckDB is ~25x faster than Pandas!

Key Takeaways
DuckDB is a game-changer for large-scale data analysis in Python.

It integrates seamlessly with Pandas, making the transition effortless.

Its SQL-first approach and in-memory mode make it ideal for fast and
scalable operations.

Conclusion
DuckDB is not just a faster alternative to Pandas; it’s a powerful tool for
modern data analysis. Whether you’re working with small datasets or
handling large-scale operations, DuckDB’s speed, flexibility, and SQL-based
simplicity make it a must-have in your Python toolkit.

So, are you ready to embrace the future of data analysis and leave Pandas’
performance limitations behind? Give DuckDB a try and experience the
difference firsthand!
