Skip to content
This repository was archived by the owner on May 7, 2021. It is now read-only.
This repository was archived by the owner on May 7, 2021. It is now read-only.

Speed comparison of package compared to Pandas / Pyarrow_ops #4

@TomScheffers

Description

@TomScheffers

I wrote a comparison between this package and pandas / pyarrow_ops on a task where we join, group and aggregate two tables with ~300K rows. The Datafusion package is about 3-5 times slower than its alternatives.

What is causing this performance hit? Is it the serialization between C / Python or is it the performance of DataFusion itself?

import datafusion, time
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow_ops import head, join, groupby

f1 = "path to local parquet file"
f2 = "path to local parquet file"

# Pandas
time1 = time.time()
t1 = pq.read_table(f1, columns=['sku_key', 'option_key']).to_pandas()
t2 = pq.read_table(f2, columns=['sku_key', 'economical']).to_pandas()
r = t1.merge(t2, on=['sku_key']).groupby(['option_key']).agg({'economical': 'sum'})
print("Query in Pandas took:", time.time() - time1)
print(r.head())

# Pyarrow ops
time2 = time.time()
t1 = pq.read_table(f1, columns=['sku_key', 'option_key'])
t2 = pq.read_table(f2, columns=['sku_key', 'economical'])
r = groupby(join(t1, t2, on=['sku_key']), by=['option_key']).agg({'economical': 'sum'})
print("\nQuery in Pyarrow ops took:", time.time() - time2)
head(r)

# Datafusion
time3 = time.time()

f = datafusion.functions
ctx = datafusion.ExecutionContext()
ctx.register_parquet("skus", f1)
ctx.register_parquet("stock_current", f2)
result = ctx.sql("SELECT option_key, SUM(economical) as stock FROM stock_current as sc JOIN skus as sk USING (sku_key) GROUP BY option_key").collect()
r = pa.Table.from_batches(result)
print("\nQuery in DataFusion took:", time.time() - time3)
head(r)
Query in Pandas took: 0.6710879802703857
            economical
option_key
15847197             4
15978197           455
15984669            56
15985197           907
16066197           460

Query in Pyarrow ops took: 1.0179059505462646
Row  option_key  economical
0    15847197    4
1    15978197    455
2    15984669    56
3    15985197    907
4    16066197    460



Query in DataFusion took: 3.2192792892456055
Row  option_key  stock
0    26284326    0
1    25214207    -1
2    30372204    16
3    33163308    156
4    26880505    10

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions