Bringing the best of both worlds together
I’m keen on finding ways to speed up analytical tasks.
I’ve written several posts in the past about measuring memory usage, measuring running time, and tips to speed up data-intensive tasks. Since we use Pandas practically every day, I keep looking for ways to improve its performance.
This impressive Python library is feature-rich and reliable, but speed is not one of its superpowers. With a bit of optimization we can gain some speed, yet those techniques come with issues of their own.
Newer alternatives such as Polars are gaining momentum too. But, like many other data professionals, I still prefer Pandas.
That is one reason we use databases instead of in-memory dataframes. Databases are designed to retrieve data from disk and perform computations efficiently.
However, you don’t need a full database for most data science projects; that’s like using a sledgehammer to crack a nut. This is where in-memory databases such as SQLite come in.
In-memory databases don’t run a separate server process and don’t need a complex setup. Recent versions of Python ship with SQLite out of the box.
But SQLite has its drawbacks, too. Although you can query it with SQL, the speed benefit isn’t significant.
Fortunately, there is a modern solution for this: DuckDB. Like SQLite, the setup is super simple, and you can port the database along with your code. And you get far more speed than with the Pandas API or SQLite.
Querying a Pandas dataframe with SQL, using DuckDB
DuckDB is a database in its own right. However, you can also turn any dataframe into a DuckDB table and query it directly. Before doing any of that, here’s how we install it:
pip install duckdb
This may surprise you, but that’s how we install DuckDB. Although it’s a database, it’s just a Python package to install.
Since it is a simple in-memory database, there is none of the complex configuration you see with Postgres and other popular databases. There are a handful of parameters you can set, which you can find in the documentation.
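As an illustration only (this snippet is not from the original post), here is roughly what connecting with a couple of those parameters looks like. The option names threads and memory_limit are taken from DuckDB’s configuration docs, so double-check them against the version you have installed:
import duckdb

# The default connection is purely in-memory; nothing is written to disk.
con = duckdb.connect(database=":memory:")

# A file-backed connection with a couple of optional settings.
con = duckdb.connect(
    database="analytics.duckdb",
    config={"threads": "4", "memory_limit": "2GB"},
)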
Let’s suppose we have a taxi trip duration dataset. We need to compute the average trip duration of the trips that started west of longitude -73.95.
If you’re following along, you can generate a test dataset using the code below.
import pandas as pd
import numpy as np

# Define the number of rows in the dataset
num_rows = 10000000

# Generate a random pickup longitude for each row
pickup_longitude = np.random.uniform(low=-94.0, high=-38.0, size=num_rows)

# Generate a random trip duration for each row
trip_duration = np.random.normal(loc=10, scale=5, size=num_rows)

# Create a DataFrame with the pickup longitude and trip duration columns
df = pd.DataFrame(
    {"pickup_longitude": pickup_longitude, "trip_duration": trip_duration}
)
We use a timing decorator to measure how long each version takes to run.
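The decorator itself isn’t shown in the post, so the snippet below is only a minimal sketch of what timing_decorator might look like, written to match the output format you’ll see in a moment:
import functools
import time

def timing_decorator(func):
    # Print how long the wrapped function takes to run, then return its result.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        print(f"Function {func.__name__} took {elapsed} seconds to run.")
        return result
    return wrapper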
...
@timing_decorator
def find_avg_trip_duration_in_the_west():
    return df[df['pickup_longitude'] < -73.95]['trip_duration'].mean()

find_avg_trip_duration_in_the_west()
>> Function find_avg_trip_duration_in_the_west took 0.49195261001586914 seconds to run.
>> 9.995189356480168
The above code took about half a second to run on my computer. Here’s the same query using the DuckDB API instead of the Pandas API:
...
import duckdb

@timing_decorator
def find_avg_trip_duration_in_the_west():
    return duckdb.execute(
        'SELECT AVG(trip_duration) FROM df WHERE pickup_longitude < -73.95'
    ).df()

find_avg_trip_duration_in_the_west()
>> Function find_avg_trip_duration_in_the_west took 0.05432271957397461 seconds to run.
>> |    |   avg(trip_duration) |
>> |---:|---------------------:|
>> |  0 |             9.995189 |
As you can see, the DuckDB API runs roughly ten times faster than the native Pandas API.
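One convenience worth noting: .df() hands the result back as an ordinary Pandas DataFrame, so it plugs straight into existing Pandas code. If all you want is the scalar, DuckDB’s cursor-style fetchone() works too; a small sketch (the variable name avg_duration is just for illustration):
# The same query, returning a plain Python float instead of a DataFrame
avg_duration = duckdb.execute(
    "SELECT AVG(trip_duration) FROM df WHERE pickup_longitude < -73.95"
).fetchone()[0]
print(avg_duration)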
How does DuckDB differ from SQLite in performance?
SQLite is the most popular in-memory database right now. That’s why we get it with Python out of the box.
But DuckDB does better in two areas. First, if you don’t need to persist the data, you don’t need to create a database at all: you can query your Pandas dataframe directly, as if it were a table. That is what we did in the earlier example.
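If you do want persistence, DuckDB can also copy the dataframe into a file-backed database. The snippet below is only a sketch of that option; the file name taxi.duckdb and table name trips are made up for the example:
import duckdb

# Open (or create) a database file on disk and snapshot the dataframe into it
con = duckdb.connect("taxi.duckdb")
con.execute("CREATE TABLE trips AS SELECT * FROM df")

# Later, the table can be queried without the dataframe in memory
print(con.execute("SELECT COUNT(*) FROM trips").fetchone())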
The other benefit is that DuckDB is still faster than SQLite. In fact, replicating this post’s example with SQLite runs slower than Pandas, which makes SQLite the last option when performance is a priority. It still does a great job of persisting data and porting it along with your code, though.
Here’s the SQLite version of our example:
...
import sqlite3

conn = sqlite3.connect("taxi.db")

# Copy the dataframe into a SQLite table (replace it if the script is re-run)
df.to_sql("trips", conn, if_exists="replace")

@timing_decorator
def find_avg_trip_duration_in_the_west():
    cursor = conn.cursor()
    cursor.execute(
        "SELECT AVG(trip_duration) FROM trips WHERE pickup_longitude < -73.95"
    )
    result = cursor.fetchone()[0]
    cursor.close()
    return result

find_avg_trip_duration_in_the_west()
>> Function find_avg_trip_duration_in_the_west took 0.5150568962097168 seconds to run.
>> 9.995189
As you can see, running SQL on a Pandas dataframe through SQLite requires us to create a separate database instance on disk and insert the data into it. None of those steps is necessary with DuckDB.
Also, the SQLite query took 0.51 seconds, whereas the Pandas API took only 0.49 seconds.
Conclusion
Pandas is, beyond argument, one of the miracles that made Python a popular choice for data science. However, this mature library can make data-wrangling tasks slow.
In this post, we discussed using DuckDB’s simple query API to improve speed. If you know SQL, you can query your Pandas dataframe directly with SQL instead of the Pandas API, without any complex setup.
While DuckDB has its advantages, it isn’t a cure-all.
You still need scalable databases such as Postgres for large-scale projects. And the Pandas API may be quicker to develop with, since everyone who knows Python can read it.