🐼 From Pandas to Polars 🐻‍❄️
As datasets grow in size and complexity, performance and efficiency become critical in data processing. While Pandas has long been the go-to library for data manipulation in Python, it can struggle with speed and memory usage, especially on large datasets. Polars, a newer DataFrame library written in Rust, offers a faster, more memory-efficient alternative with support for lazy evaluation and multi-threading.
This guide explores how to convert Pandas DataFrames to Polars, and highlights key differences in syntax, performance, and functionality. Whether you're looking to speed up your data workflows or just exploring modern tools, understanding the transition from Pandas to Polars is a valuable step.
Table of Contents
- Installation and Setup
- Creating DataFrames
- Basic Operations
- Filtering Data
- Grouping and Aggregation
- Joining/Merging DataFrames
- Handling Missing Values
- String Operations
- Time Series Operations
- Performance Comparison
- API Philosophy Differences
- Migration Guide
- When to Keep Using Pandas
- Conclusion
Installation and Setup
Pandas
# Import pandas
import pandas as pd
Polars
# Import polars
import polars as pl
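Both libraries install straight from PyPI (pip install pandas polars). If you already have data in a Pandas DataFrame, Polars can ingest it directly; below is a minimal sketch of round-tripping between the two (pl.from_pandas may additionally require pyarrow to be installed):
# Convert an existing Pandas DataFrame to Polars and back
import pandas as pd
import polars as pl

df_pd = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

df_pl = pl.from_pandas(df_pd)   # Pandas -> Polars
df_back = df_pl.to_pandas()     # Polars -> Pandas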
Creating DataFrames
From dictionaries
Pandas
import pandas as pd
# Create DataFrame from dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df_pd = pd.DataFrame(data)
print(df_pd)
name age city
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
Polars
import polars as pl
# Create DataFrame from dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df_pl = pl.DataFrame(data)
print(df_pl)
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ name    ┆ age ┆ city        │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ i64 ┆ str         │
╞═════════╪═════╪═════════════╡
│ Alice   ┆ 25  ┆ New York    │
│ Bob     ┆ 30  ┆ Los Angeles │
│ Charlie ┆ 35  ┆ Chicago     │
│ David   ┆ 40  ┆ Houston     │
└─────────┴─────┴─────────────┘
Basic Operations
Selecting columns
Pandas
# Select a single column (returns Series)
series = df_pd['name']

# Select multiple columns
df_subset = df_pd[['name', 'age']]
Polars
# Select a single column (returns Series)
series = df_pl['name']

# Alternative method
series = df_pl.select(pl.col('name')).to_series()

# Select multiple columns
df_subset = df_pl.select(['name', 'age'])

# Alternative method
df_subset = df_pl.select(pl.col(['name', 'age']))
Adding a new column
Pandas
# Add a new column
df_pd['is_adult'] = df_pd['age'] >= 18

# Using assign (creates a new DataFrame)
df_pd = df_pd.assign(age_squared=df_pd['age'] ** 2)
Polars
# Add a new column
df_pl = df_pl.with_columns(
    pl.when(pl.col('age') >= 18).then(True).otherwise(False).alias('is_adult')
)

# Creating derived columns
df_pl = df_pl.with_columns(
    (pl.col('age') ** 2).alias('age_squared')
)

# Multiple columns at once
df_pl = df_pl.with_columns([
    pl.col('age').is_null().alias('age_is_null'),
    (pl.col('age') * 2).alias('age_doubled')
])
Basic statistics
Pandas
# Get summary statistics
summary = df_pd.describe()

# Individual statistics
mean_age = df_pd['age'].mean()
median_age = df_pd['age'].median()
min_age = df_pd['age'].min()
max_age = df_pd['age'].max()
Polars
# Get summary statistics
summary = df_pl.describe()

# Individual statistics
mean_age = df_pl.select(pl.col('age').mean()).item()
median_age = df_pl.select(pl.col('age').median()).item()
min_age = df_pl.select(pl.col('age').min()).item()
max_age = df_pl.select(pl.col('age').max()).item()
Filtering Data
Simple filtering
Pandas
# Filter rows
adults = df_pd[df_pd['age'] >= 18]

# Multiple conditions
filtered = df_pd[(df_pd['age'] > 30) & (df_pd['city'] == 'Chicago')]
Polars
# Filter rows
adults = df_pl.filter(pl.col('age') >= 18)

# Multiple conditions
filtered = df_pl.filter((pl.col('age') > 30) & (pl.col('city') == 'Chicago'))
Complex filtering
Pandas
# Filter with OR conditions
df_filtered = df_pd[(df_pd['city'] == 'New York') | (df_pd['city'] == 'Chicago')]

# Using isin
cities = ['New York', 'Chicago']
df_filtered = df_pd[df_pd['city'].isin(cities)]

# String contains
df_filtered = df_pd[df_pd['name'].str.contains('li')]
Polars
# Filter with OR conditions
df_filtered = df_pl.filter((pl.col('city') == 'New York') | (pl.col('city') == 'Chicago'))

# Using is_in
cities = ['New York', 'Chicago']
df_filtered = df_pl.filter(pl.col('city').is_in(cities))

# String contains
df_filtered = df_pl.filter(pl.col('name').str.contains('li'))
Grouping and Aggregation
Basic groupby
Pandas
# Group by one column and aggregate
city_stats = df_pd.groupby('city').agg({
    'age': ['mean', 'min', 'max', 'count']
})

# Reset index for flat DataFrame
city_stats = city_stats.reset_index()
Polars
# Group by one column and aggregate
city_stats = df_pl.group_by('city').agg([
    pl.col('age').mean().alias('age_mean'),
    pl.col('age').min().alias('age_min'),
    pl.col('age').max().alias('age_max'),
    pl.col('age').count().alias('age_count')
])
Joining/Merging DataFrames
Inner Join
Pandas
# Create another DataFrame
employee_data = {
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept': ['HR', 'IT', 'Finance', 'IT']
}
employee_df_pd = pd.DataFrame(employee_data)

salary_data = {
    'emp_id': [1, 2, 3, 5],
    'salary': [50000, 60000, 70000, 80000]
}
salary_df_pd = pd.DataFrame(salary_data)

# Inner join
merged_df = employee_df_pd.merge(
    salary_df_pd,
    on='emp_id',
    how='inner'
)
Polars
# Create another DataFrame
employee_data = {
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept': ['HR', 'IT', 'Finance', 'IT']
}
employee_df_pl = pl.DataFrame(employee_data)

salary_data = {
    'emp_id': [1, 2, 3, 5],
    'salary': [50000, 60000, 70000, 80000]
}
salary_df_pl = pl.DataFrame(salary_data)

# Inner join
merged_df = employee_df_pl.join(
    salary_df_pl,
    on='emp_id',
    how='inner'
)
Different join types
Pandas
# Left join
left_join = employee_df_pd.merge(salary_df_pd, on='emp_id', how='left')

# Right join
right_join = employee_df_pd.merge(salary_df_pd, on='emp_id', how='right')

# Outer join
outer_join = employee_df_pd.merge(salary_df_pd, on='emp_id', how='outer')
Polars
# Left join
left_join = employee_df_pl.join(salary_df_pl, on='emp_id', how='left')

# Right join
right_join = employee_df_pl.join(salary_df_pl, on='emp_id', how='right')

# Outer join (Polars calls this a 'full' join)
outer_join = employee_df_pl.join(salary_df_pl, on='emp_id', how='full')
Handling Missing Values
Checking for missing values
Pandas
# Check for missing values
missing_count = df_pd.isnull().sum()

# Check if any column has missing values
has_missing = df_pd.isnull().any().any()
Polars
# Check for missing values
missing_count = df_pl.null_count()

# Check if specific column has missing values
has_missing = df_pl.select(pl.col('age').is_null().any()).item()
Handling missing values
Pandas
# Drop rows with any missing values
df_pd_clean = df_pd.dropna()

# Fill missing values
df_pd_filled = df_pd.fillna({
    'age': 0,
    'city': 'Unknown'
})

# Forward fill
df_pd_ffill = df_pd.ffill()
Polars
# Drop rows with any missing values
df_pl_clean = df_pl.drop_nulls()

# Fill missing values
df_pl_filled = df_pl.with_columns([
    pl.col('age').fill_null(0),
    pl.col('city').fill_null('Unknown')
])

# Forward fill
df_pl_ffill = df_pl.with_columns([
    pl.col('age').fill_null(strategy='forward'),
    pl.col('city').fill_null(strategy='forward')
])
String Operations
Basic string operations
Pandas
# Convert to uppercase
df_pd['name_upper'] = df_pd['name'].str.upper()

# Get string length
df_pd['name_length'] = df_pd['name'].str.len()

# Extract substring
df_pd['name_first_char'] = df_pd['name'].str[0]

# Replace substrings
df_pd['city_replaced'] = df_pd['city'].str.replace('New', 'Old')
Polars
# Convert to uppercase
df_pl = df_pl.with_columns(pl.col('name').str.to_uppercase().alias('name_upper'))

# Get string length
df_pl = df_pl.with_columns(pl.col('name').str.len_chars().alias('name_length'))

# Extract substring
df_pl = df_pl.with_columns(pl.col('name').str.slice(0, 1).alias('name_first_char'))

# Replace substrings
df_pl = df_pl.with_columns(pl.col('city').str.replace('New', 'Old').alias('city_replaced'))
Advanced string operations
Pandas
# Split string
df_pd['first_word'] = df_pd['city'].str.split(' ').str[0]

# Pattern matching
has_new = df_pd['city'].str.contains('New')

# Extract with regex
df_pd['extracted'] = df_pd['city'].str.extract(r'(\w+)\s')
Polars
# Split string
df_pl = df_pl.with_columns(
    pl.col('city').str.split(' ').list.get(0).alias('first_word')
)

# Pattern matching
df_pl = df_pl.with_columns(
    pl.col('city').str.contains('New').alias('has_new')
)

# Extract with regex
df_pl = df_pl.with_columns(
    pl.col('city').str.extract(r'(\w+)\s').alias('extracted')
)
Time Series Operations
Date parsing and creation
Pandas
# Create DataFrame with dates
dates_pd = pd.DataFrame({
    'date_str': ['2023-01-01', '2023-02-15', '2023-03-30']
})

# Parse dates
dates_pd['date'] = pd.to_datetime(dates_pd['date_str'])

# Extract components
dates_pd['year'] = dates_pd['date'].dt.year
dates_pd['month'] = dates_pd['date'].dt.month
dates_pd['day'] = dates_pd['date'].dt.day
dates_pd['weekday'] = dates_pd['date'].dt.day_name()
Polars
# Create DataFrame with dates
dates_pl = pl.DataFrame({
    'date_str': ['2023-01-01', '2023-02-15', '2023-03-30']
})

# Parse dates
dates_pl = dates_pl.with_columns(
    pl.col('date_str').str.strptime(pl.Datetime, '%Y-%m-%d').alias('date')
)

# Extract components (dt.weekday() is ISO-numbered: 1 = Monday ... 7 = Sunday)
dates_pl = dates_pl.with_columns([
    pl.col('date').dt.year().alias('year'),
    pl.col('date').dt.month().alias('month'),
    pl.col('date').dt.day().alias('day'),
    pl.col('date').dt.weekday().replace_strict({
        1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday',
        5: 'Friday', 6: 'Saturday', 7: 'Sunday'
    }, default="unknown").alias('weekday')
])
Date arithmetic
Pandas
# Add days
dates_pd['next_week'] = dates_pd['date'] + pd.Timedelta(days=7)

# Date difference
date_range = pd.date_range(start='2023-01-01', end='2023-01-10')
df_dates = pd.DataFrame({'date': date_range})
df_dates['days_since_start'] = (df_dates['date'] - df_dates['date'].min()).dt.days
Polars
# Add days
dates_pl = dates_pl.with_columns(
    (pl.col('date') + pl.duration(days=7)).alias('next_week')
)

# Date difference
date_range = pd.date_range(start='2023-01-01', end='2023-01-10')  # Using pandas to generate the range
df_dates = pl.DataFrame({'date': date_range})
df_dates = df_dates.with_columns(
    (pl.col('date') - pl.col('date').min()).dt.total_days().alias('days_since_start')
)
Performance Comparison
This section demonstrates performance differences between pandas and polars for a large dataset operation.
import pandas as pd
import polars as pl
import time
import numpy as np
# Generate a large dataset (10 million rows)
n = 10_000_000
data = {
    'id': np.arange(n),
    'value': np.random.randn(n),
    'group': np.random.choice(['A', 'B', 'C', 'D'], n)
}

# Convert to pandas DataFrame
df_pd = pd.DataFrame(data)

# Convert to polars DataFrame
df_pl = pl.DataFrame(data)

# Benchmark: Group by and calculate mean, min, max
print("Running pandas groupby...")
start = time.time()
result_pd = df_pd.groupby('group').agg({
    'value': ['mean', 'min', 'max', 'count']
})
pd_time = time.time() - start
print(f"Pandas time: {pd_time:.4f} seconds")

print("Running polars groupby...")
start = time.time()
result_pl = df_pl.group_by('group').agg([
    pl.col('value').mean().alias('value_mean'),
    pl.col('value').min().alias('value_min'),
    pl.col('value').max().alias('value_max'),
    pl.col('value').count().alias('value_count')
])
pl_time = time.time() - start
print(f"Polars time: {pl_time:.4f} seconds")
print(f"Polars is {pd_time / pl_time:.2f}x faster")
Running pandas groupby...
Pandas time: 0.2320 seconds
Running polars groupby...
Polars time: 0.0545 seconds
Polars is 4.26x faster
Typically, for operations like this, Polars will be 3-10x faster than pandas, especially as data sizes increase. The performance gap widens further with more complex operations that can benefit from query optimization.
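As a rough illustration of where that extra headroom comes from, the same benchmark DataFrame can be queried through Polars' lazy API, which builds a query plan and lets the optimizer rearrange it (for example, pushing filters ahead of the aggregation) before anything runs. A minimal sketch reusing df_pl from above:
# Lazy version of the group-by: the plan is optimized before execution
lazy_result = (
    df_pl.lazy()
    .filter(pl.col('group') != 'D')                   # predicate pushdown happens automatically
    .group_by('group')
    .agg(pl.col('value').mean().alias('value_mean'))
    .collect()                                        # nothing executes until collect()
)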
API Philosophy Differences
Pandas and Polars differ in several fundamental aspects:
1. Eager vs. Lazy Execution
Pandas uses eager execution by default:
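A minimal sketch (hypothetical data.csv with value and group columns) — each statement runs immediately and materializes its result:
# Eager: every statement executes as soon as it is written
df = pd.read_csv('data.csv')
df = df[df['value'] > 0]
result = df.groupby('group')['value'].mean()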
Polars supports both eager and lazy execution:
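The same pipeline in Polars, sketched in both modes (same hypothetical file and columns); in the lazy version nothing is read or computed until .collect(), so the optimizer can prune columns and push the filter into the scan:
# Eager: pl.read_csv behaves much like pandas
df = pl.read_csv('data.csv')
result = df.filter(pl.col('value') > 0).group_by('group').agg(pl.col('value').mean())

# Lazy: pl.scan_csv builds a query plan instead of loading the file
result = (
    pl.scan_csv('data.csv')
    .filter(pl.col('value') > 0)
    .group_by('group')
    .agg(pl.col('value').mean())
    .collect()
)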
2. Method Chaining vs. Assignment
Pandas often uses assignment operations:
# Many pandas operations use in-place assignment
pd_df['new_col'] = pd_df['new_col'] * 2
pd_df['new_col'] = pd_df['new_col'].fillna(0)

# Some operations return new DataFrames
pd_df = pd_df.sort_values('new_col')
Polars consistently uses method chaining:
# All operations return new DataFrames and can be chained
pl_df = (pl_df
    .with_columns((pl.col('new_col') * 2).alias('new_col'))
    .with_columns(pl.col('new_col').fill_null(0))
    .sort('new_col')
)
3. Expression API vs. Direct References
Pandas directly references columns:
pd_df['result'] = pd_df['age'] + pd_df['new_col']
filtered = pd_df[pd_df['age'] > pd_df['age'].mean()]
Polars uses an expression API:
pl_df = pl_df.with_columns(
    (pl.col('age') + pl.col('new_col')).alias('result')
)
filtered = pl_df.filter(pl.col('age') > pl.col('age').mean())
Migration Guide
If you're transitioning from pandas to polars, here are key mappings between common operations:
Operation | Pandas | Polars |
---|---|---|
Read CSV | pd.read_csv('file.csv') | pl.read_csv('file.csv') |
Select columns | df[['col1', 'col2']] | df.select(['col1', 'col2']) |
Add column | df['new'] = df['col1'] * 2 | df.with_columns((pl.col('col1') * 2).alias('new')) |
Filter rows | df[df['col'] > 5] | df.filter(pl.col('col') > 5) |
Sort | df.sort_values('col') | df.sort('col') |
Group by | df.groupby('col').agg({'val': 'sum'}) | df.group_by('col').agg(pl.col('val').sum()) |
Join | df1.merge(df2, on='key') | df1.join(df2, on='key') |
Fill NA | df.fillna(0) | df.fill_null(0) |
Drop NA | df.dropna() | df.drop_nulls() |
Rename | df.rename(columns={'a': 'b'}) | df.rename({'a': 'b'}) |
Unique values | df['col'].unique() | df.select(pl.col('col').unique()) |
Value counts | df['col'].value_counts() | df.group_by('col').count() |
Key Tips for Migration
- Think in expressions: Use pl.col() to reference columns in operations
- Embrace method chaining: String operations together instead of creating intermediate variables
- Try lazy execution: For complex operations, use pl.scan_csv() and lazy operations (see the sketch after this list)
- Use with_columns(): Instead of direct assignment, use with_columns() for adding/modifying columns
- Learn the expression functions: Many operations like string manipulation use different syntax
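A minimal sketch pulling these tips into one pipeline (hypothetical sales.csv with city and amount columns):
# Lazy scan + expressions + with_columns + method chaining in one query
summary = (
    pl.scan_csv('sales.csv')                                  # lazy execution
    .with_columns(pl.col('city').str.to_uppercase())         # with_columns + expression functions
    .filter(pl.col('amount') > 0)                             # think in expressions
    .group_by('city')
    .agg(pl.col('amount').sum().alias('total_amount'))
    .collect()                                                # the whole chain runs once
)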
When to Keep Using Pandas
Despite Polars' advantages, pandas might still be preferred when:
- Working with existing codebases heavily dependent on pandas
- Using specialized libraries that only support pandas (some visualization tools)
- Dealing with very small datasets where performance isn't critical
- Using pandas-specific features without polars equivalents
- Working with time series data that benefits from pandas' specialized functionality
Conclusion
Polars offers significant performance improvements and a more consistent API compared to pandas, particularly for large datasets and complex operations. While the syntax differences require some adjustment, the benefits in speed and memory efficiency make it a compelling choice for modern data analysis workflows.
Both libraries have their place in the Python data ecosystem. Pandas remains the more mature option with broader ecosystem compatibility, while Polars represents the future of high-performance data processing. For new projects dealing with large datasets, Polars is increasingly becoming the recommended choice.