Still Stuck With Pandas? Discover DuckDB
by Sarthak Shah
Introduction
Pandas has been the go-to library for Python data manipulation for years. It’s
user-friendly, versatile, and powerful. However, as datasets grow larger,
Pandas starts showing its limitations, particularly in terms of performance
and memory usage. But what if I told you there’s a library that’s up to 20x
faster and just as easy to integrate into your Python workflows?
What is DuckDB?
DuckDB is an in-process SQL database management system tailored for
Online Analytical Processing (OLAP) workloads. It's a powerful tool for
running analytical queries on structured data, offering seamless integration
with Python's Pandas library and supporting file formats such as CSV
and Parquet.
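For instance, DuckDB can run SQL directly against a file on disk without loading it into Pandas first. Here is a minimal sketch, assuming a hypothetical Parquet file named events.parquet:
import duckdb
# Query the Parquet file in place; DuckDB only reads the data the query needs
result = duckdb.query("SELECT COUNT(*) AS n_rows FROM 'events.parquet'").fetchdf()
print(result)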
1. Blazing Speed
A vectorized engine optimized for columnar data.
2. Seamless Integration
Directly supports Pandas DataFrames.
3. Effortless CSV Import
Built-in readers such as read_csv_auto simplify importing CSV files directly into your database,
ensuring quick and efficient data loading.
4. Cross-Platform Compatibility
Runs on Windows, macOS, and Linux.
1. Installation
pip install duckdb
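A quick way to confirm the install (and see which version you got) is to import the package and print its version string:
import duckdb
# Print the installed DuckDB version
print(duckdb.__version__)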
2. Loading Data
From a CSV File:
import duckdb
# connect() with no argument creates an in-memory database
conn = duckdb.connect()
# read_csv_auto infers column names and types from the CSV file
conn.execute("CREATE TABLE cars AS SELECT * FROM read_csv_auto('cars.csv');")
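Once the table exists, you can query it like any other SQL table. A small sketch that simply counts the rows of the cars table created above:
# fetchdf() returns the query result as a Pandas DataFrame
row_count = conn.execute("SELECT COUNT(*) AS n FROM cars").fetchdf()
print(row_count)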
From a Pandas DataFrame:
import pandas as pd
import duckdb
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
# Local DataFrames can be referenced by name straight from SQL
result = duckdb.query("SELECT * FROM df WHERE A > 1").fetchdf()
print(result)
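If you prefer working through an explicit connection rather than the module-level query helper, you can register the DataFrame as a named view first (the view name pandas_df below is arbitrary):
conn = duckdb.connect()
# Expose the DataFrame to this connection under an explicit view name
conn.register("pandas_df", df)
result = conn.execute("SELECT * FROM pandas_df WHERE A > 1").fetchdf()
print(result)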
3. In-Memory Mode
For blazing-fast operations, use DuckDB’s in-memory mode:
conn = duckdb.connect(":memory:")
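If you instead pass a file path, DuckDB persists the database on disk so tables survive between sessions (my_analysis.duckdb is just an example file name):
# Persistent, on-disk database
conn = duckdb.connect("my_analysis.duckdb")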
Code Example
import pandas as pd
import numpy as np
import time
import duckdb

# Sample dataset: 100 million rows of random values
n_rows = 100_000_000
df = pd.DataFrame({
    "A": np.random.rand(n_rows),
    "B": np.random.rand(n_rows),
    "C": np.random.rand(n_rows) * 100,
})

# Pandas Operation
start_time = time.time()
result_pandas = df[df['C'] > 50]['A'].mean()
end_time = time.time()
pandas_time = end_time - start_time
print(f"Pandas Result: {result_pandas}, Time taken: {pandas_time:.6f} seconds")

# DuckDB Operation
st = time.time()
result_duckdb = duckdb.query("SELECT AVG(A) FROM df WHERE C > 50").fetchall()[0][0]
et = time.time()
duckdb_time = et - st
print(f"DuckDB Result: {result_duckdb}, Time taken: {duckdb_time:.6f} seconds")

# Performance Comparison
print(f"DuckDB is {pandas_time / duckdb_time:.2f} times faster than Pandas!")
Results
For a dataset of 100 million rows, DuckDB returns the same average in a fraction of the time Pandas needs, consistent with the speedups of up to 20x mentioned in the introduction.
Key Takeaways
DuckDB is a game-changer for large-scale data analysis in Python.
Its SQL-first approach and in-memory mode make it ideal for fast and
scalable operations.
Conclusion
DuckDB is not just a faster alternative to Pandas; it’s a powerful tool for
modern data analysis. Whether you’re working with small datasets or
handling large-scale operations, DuckDB’s speed, flexibility, and SQL-based
simplicity make it a must-have in your Python toolkit.
So, are you ready to embrace the future of data analysis and leave Pandas’
performance limitations behind? Give DuckDB a try and experience the
difference firsthand!