In this article, we will explore how leveraging Predicate Pushdown can enhance the performance of your Spark SQL queries, providing insights into a powerful optimization technique for efficient data processing. We’ll address the questions:
- What is Predicate Pushdown?
- Why should you use it?
- And how do you implement it?
What is Predicate Pushdown?
Predicate Pushdown is an optimization strategy that pushes filtering predicates (conditions in WHERE clauses) as close to the data source as possible. By doing so, it minimizes the amount of data that needs to be loaded into memory and processed, resulting in improved performance and resource utilization.
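To make the idea concrete, here is a minimal sketch of the contrast (the file path people.parquet is purely illustrative; the article builds its own dataset below). Filtering in the driver after collecting forces Spark to read and ship every row, while expressing the filter on the DataFrame lets Spark hand the condition to the data source reader.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-concept").getOrCreate()

# Without pushdown: every row is read and shipped to the driver,
# and the condition is only evaluated afterwards in plain Python.
all_rows = spark.read.parquet("people.parquet").collect()
adults = [row for row in all_rows if row["age"] > 25]

# With pushdown: the condition travels with the scan, so the reader
# can skip data that cannot match before Spark ever materializes it.
adults_df = spark.read.parquet("people.parquet").filter("age > 25")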
Why use Predicate Pushdown?
Below are the three main benefits of using Predicate Pushdown.
- Reduced Data Movement: By applying filtering conditions at the source, Predicate Pushdown minimizes the amount of data transferred across the network, reducing I/O overhead.
- Enhanced CPU Efficiency: Filtering data closer to the source reduces the volume of data that needs to be processed in-memory, leading to more efficient CPU utilization.
- Storage Optimization: Predicate Pushdown can reduce the storage requirements by fetching only the necessary data, optimizing both memory and disk space.
How to Apply Predicate Pushdown?
To illustrate Predicate Pushdown, we’re going to use PySpark. If you choose to follow along and execute the code as you read, make sure you have a Python environment with the PySpark dependency installed. Alternatively, you can simply read the code, as it is straightforward to follow. You can find the complete source code for this article in my GitHub repository.
The versions used in this article are:
- Python: 3.11.6
- PySpark: 3.4.1
Let’s consider a scenario where we have a Parquet file named persons_data.parquet with columns: id, name, age, job_title. Our goal is to filter the data to include only records where the age is greater than 25.
The following code creates a DataFrame and writes it to a Parquet file that we will use to test Predicate Pushdown.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("predicate-push-demo") \
        .getOrCreate()

    data = [(0, "person1", 22, "Doctor"),
            (1, "person2", 35, "Singer"),
            (3, "person3", 42, "Teacher")]
    columns = ["id", "name", "age", "job_title"]
    df = spark.createDataFrame(data, columns)

    parquet_file_name = "persons_data.parquet"
    df.write.parquet(parquet_file_name, mode="overwrite")
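If you want to sanity-check what was written (assuming the same Spark session is still running), you can read the file back and display its contents:
# Read the freshly written Parquet file back and display its rows.
spark.read.parquet(parquet_file_name).show()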
In PySpark, predicate pushdown is generally enabled by default, particularly when working with file formats such as Parquet and ORC that support this optimization. However, you can explicitly confirm and configure it using the spark.conf settings. The specific parameter to set for Parquet files is spark.sql.parquet.filterPushdown.
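For example, you can inspect the current value of that flag and, if needed, set it explicitly. This is a minimal sketch using the standard spark.conf API; the flag defaults to "true" in recent Spark versions, so setting it is rarely necessary.
# Check whether Parquet filter pushdown is enabled for this session.
print(spark.conf.get("spark.sql.parquet.filterPushdown"))

# Enable it explicitly (it is already enabled by default).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")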
Now, let’s test the Predicate Pushdown.
df_with_pd = spark.read \
    .parquet(parquet_file_name) \
    .filter("age > 25") \
    .select("name", "job_title")

df_with_pd.explain(mode="extended")
The code above pushes the filtering operation down to the data source level, reading only the data where age > 25 directly from the Parquet file. This approach significantly reduces the volume of data that PySpark needs to read, process, and transfer over the network, resulting in improved performance.

If the output of the Physical Plan indicates pushed filters, such as PushedFilters: [IsNotNull(age), GreaterThan(age,25)], it means that the filter was pushed down to the Parquet level.
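Conversely, a predicate that Spark cannot translate into a data source filter will not be pushed down. A common example is a condition wrapped in a Python UDF, as in the sketch below (the UDF is purely illustrative): because Spark cannot look inside the UDF, the condition is evaluated only after the rows have been read, and it should no longer appear under PushedFilters.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# The same condition, but hidden inside a Python UDF: Spark cannot
# translate it into a Parquet filter, so the file is scanned first
# and the filter is applied afterwards.
is_older_than_25 = udf(lambda age: age > 25, BooleanType())

df_without_pd = spark.read \
    .parquet(parquet_file_name) \
    .filter(is_older_than_25(col("age"))) \
    .select("name", "job_title")

df_without_pd.explain(mode="extended")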
Please note that not all data formats support Predicate Pushdown. At the time of writing, some of the formats that do are:
- Parquet, ORC, and JDBC (already supported in versions before 3.0)
- JSON, CSV, and Avro (supported from version 3.1.0)
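Each of these formats is controlled by its own configuration flag. As a small illustration (only the Parquet and ORC keys are shown here; similar keys exist for other formats, but their exact names can vary between Spark versions), you can check them from the session configuration:
# Filter pushdown flags for Parquet and ORC; both should return "true"
# on a default Spark 3.4 installation.
print(spark.conf.get("spark.sql.parquet.filterPushdown"))
print(spark.conf.get("spark.sql.orc.filterPushdown"))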
For more information, refer to the Spark configurations documentation.