The article discusses the challenges of working with large datasets in Pandas and introduces Polars as an alternative whose syntax sits between Pandas and PySpark. It covers four key functions for data cleaning and analysis: filter, with_columns, group_by, and when. Polars offers a user-friendly API for handling large datasets, positioning it as a transition step from Pandas to PySpark.
4 Functions to Know If You Are Planning to Switch from Pandas to Polars
Data
First things first. We, of course, need data to learn how these functions work. I prepared sample data, which you can download from my datasets repository. The dataset we’ll use in this article is called “data_polars_practicing.csv”.
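As a minimal sketch, here is how you might load the dataset with Polars (the file name comes from the article; the path assumes the file sits in your working directory):

```python
import polars as pl

# Read the sample dataset into a Polars DataFrame
df = pl.read_csv("data_polars_practicing.csv")

# Take a quick look at the first rows
print(df.head())
```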
1. Filter
The first Polars function we’ll cover is filter. As its name suggests, it can be used for filtering DataFrame rows.
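Here is a small sketch of how filtering might look; the column names ("price" and "store") are hypothetical and only illustrate the pattern:

```python
import polars as pl

df = pl.read_csv("data_polars_practicing.csv")

# Keep rows where the hypothetical "price" column exceeds 100
expensive = df.filter(pl.col("price") > 100)

# Multiple conditions can be combined with & (and) and | (or)
expensive_in_store_a = df.filter(
    (pl.col("price") > 100) & (pl.col("store") == "A")
)
```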
2. with_columns
The with_columns function creates new columns in Polars DataFrames. A new column can be derived from other columns, such as extracting the year from a date value. We can also perform arithmetic operations involving multiple columns, or simply create a column with a constant value.
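A sketch of these three cases follows; the "date", "price", and "quantity" columns are assumptions for illustration, and try_parse_dates is used so the date column is read as a proper Date type:

```python
import polars as pl

df = pl.read_csv("data_polars_practicing.csv", try_parse_dates=True)

df = df.with_columns(
    # Extract the year from a hypothetical "date" column
    pl.col("date").dt.year().alias("year"),
    # Arithmetic involving multiple columns
    (pl.col("price") * pl.col("quantity")).alias("total"),
    # A column holding a constant value
    pl.lit(1).alias("flag"),
)
```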
3. group_by
The group_by function groups the rows based on the distinct values in a given column or columns. Then, we can calculate several different aggregations on each group such as mean, max, min, sum, and so on.
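The pattern might look like the sketch below; the "store", "price", and "quantity" columns are hypothetical:

```python
import polars as pl

df = pl.read_csv("data_polars_practicing.csv")

# Group by the hypothetical "store" column and aggregate each group
summary = df.group_by("store").agg(
    pl.col("price").mean().alias("avg_price"),
    pl.col("price").max().alias("max_price"),
    pl.col("quantity").sum().alias("total_quantity"),
)
```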
4. when
We can use the when function along with the with_columns function for creating conditional columns.
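As a rough sketch, a conditional column based on a hypothetical "price" column could be created like this:

```python
import polars as pl

df = pl.read_csv("data_polars_practicing.csv")

# Label each row depending on whether its price exceeds 100
df = df.with_columns(
    pl.when(pl.col("price") > 100)
    .then(pl.lit("expensive"))
    .otherwise(pl.lit("affordable"))
    .alias("price_category")
)
```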
Final words
I think of the Polars library as an intermediate step between Pandas and Spark. It works quite well with datasets that Pandas struggles with. I haven’t tested Polars with much larger datasets (e.g. billions of rows), but I don’t think it can be a replacement for Spark. With that being said, the syntax of Polars is very intuitive. It’s similar to both Pandas and PySpark SQL syntax. I think this also indicates that Polars is kind of a transition step from Pandas to PySpark (my subjective opinion).
Thank you for reading. Please let me know if you have any feedback.