Using Future in Spark for Concurrency

Overview

In a Spark application written in Scala, scala.concurrent.Future is a way to achieve concurrency on the driver, allowing multiple jobs to be submitted at the same time instead of one after another. This is especially helpful when you need to run several independent queries without blocking the main thread.
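
For context, here is a minimal sketch of how scala.concurrent.Future behaves on its own, independent of Spark; the task bodies and sleep durations are illustrative only:

            // Minimal Future sketch (illustrative, not from the original code):
            // both tasks start immediately and run on the global thread pool.
            import scala.concurrent.{Await, Future}
            import scala.concurrent.duration.Duration
            import scala.concurrent.ExecutionContext.Implicits.global
            
            val taskA = Future { Thread.sleep(500); "result A" }  // starts right away
            val taskB = Future { Thread.sleep(500); "result B" }  // runs concurrently with taskA
            
            // Block only at the end, once both results are needed.
            println(Await.result(taskA, Duration.Inf))
            println(Await.result(taskB, Duration.Inf))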

In this code, I was addressing the question: "How can I run SQL queries in parallel in Spark using Scala?" I used Future to enable asynchronous execution. With Future, I could start both SQL queries without waiting for one to finish before starting the other, making the overall process more efficient.

Approach and Code Explanation

Here are the steps I took in the code:


            // Scala code example
            
            import org.apache.spark.sql.SparkSession
            import scala.concurrent.{Await, Future}
            import scala.concurrent.duration.Duration
            import scala.concurrent.ExecutionContext.Implicits.global
            
            val spark = SparkSession.builder.appName("Future Example").getOrCreate()
            
            // DataFrames
            val df1 = spark.read.option("header", "true").csv("people.csv")
            val df2 = spark.read.option("header", "true").csv("subjects.csv")
            
            // Register temp views so the DataFrames can be queried with SQL
            df1.createOrReplaceTempView("people")
            df2.createOrReplaceTempView("subjects")
            
            // Future for Query 1: starts executing immediately on a pool thread
            val query1Future = Future {
                spark.sql("SELECT name, age FROM people WHERE age > 30").show()
            }
            
            // Future for Query 2: runs concurrently with Query 1
            val query2Future = Future {
                spark.sql("SELECT p.name, s.subject FROM people p JOIN subjects s ON p.id = s.person_id").show()
            }
            
            // Wait for both queries to complete before stopping the session
            Await.result(query1Future, Duration.Inf)
            Await.result(query2Future, Duration.Inf)
            
            spark.stop()
            

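As a possible extension (a sketch under my own assumptions, not part of the original code), the two futures can be combined with Future.sequence and run on a dedicated thread pool, which gives more explicit control over driver-side parallelism; returning row counts instead of calling show() is purely an illustrative choice:

            // Sketch: await both queries together via Future.sequence and use a
            // dedicated thread pool (assumed choices, not from the original code).
            import java.util.concurrent.Executors
            import scala.concurrent.{Await, ExecutionContext, Future}
            import scala.concurrent.duration.Duration
            
            implicit val ec: ExecutionContext =
                ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))
            
            val q1 = Future { spark.sql("SELECT name, age FROM people WHERE age > 30").count() }
            val q2 = Future { spark.sql("SELECT p.name, s.subject FROM people p JOIN subjects s ON p.id = s.person_id").count() }
            
            // Future.sequence turns Seq[Future[Long]] into Future[Seq[Long]],
            // completing once both queries have finished.
            val counts = Await.result(Future.sequence(Seq(q1, q2)), Duration.Inf)
            println(s"Row counts: $counts")
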
Visuals

Below are images showing the code and results from executing the Spark SQL queries:

[Image: Spark SQL Execution 1]
[Image: Spark SQL Execution 2]