In Spark, Scala's Future is a way to achieve concurrency, allowing multiple tasks to run simultaneously instead of sequentially. This is especially helpful when you want to submit several queries at once without blocking the main thread.
In this code, I was addressing the question: "How can I run SQL queries in parallel in Spark using Scala?" I utilized Future to enable asynchronous execution. With Future, I could initiate both SQL queries without waiting for one to finish before starting the other, making the process more efficient.
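To see the Future mechanism on its own before the Spark example, here is a minimal sketch with no Spark involved; the task values and sleep durations are just placeholders:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Both futures start running immediately on the implicit ExecutionContext,
// so the two one-second tasks overlap instead of taking two seconds in total.
val taskA = Future { Thread.sleep(1000); "result A" }
val taskB = Future { Thread.sleep(1000); "result B" }

// Await.result blocks the calling thread until each Future finishes.
println(Await.result(taskA, Duration.Inf))
println(Await.result(taskB, Duration.Inf))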
Here are the steps I took in the code:
1. Created a SparkSession to interact with Spark.
2. Loaded two DataFrames, df1 and df2, representing data about people and subjects.
3. Registered two temp views, people and subjects, to be used in SQL queries.
4. Defined Query 1, which selects names and ages of people older than 30.
5. Defined Query 2, which joins the people and subjects tables and selects names and subjects.
6. Wrapped each query in a Future, allowing both queries to be executed concurrently.
7. Used Await.result to block the main thread until both futures completed, ensuring that the program doesn't exit prematurely.
// Scala code example
import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val spark = SparkSession.builder.appName("Future Example").getOrCreate()

// Load the CSV files into DataFrames
val df1 = spark.read.option("header", "true").csv("people.csv")
val df2 = spark.read.option("header", "true").csv("subjects.csv")

// Register temp views so the DataFrames can be queried with SQL
df1.createOrReplaceTempView("people")
df2.createOrReplaceTempView("subjects")

// Future for Query 1: names and ages of people older than 30
val query1Future = Future {
  spark.sql("SELECT name, age FROM people WHERE age > 30").show()
}

// Future for Query 2: join people and subjects, select names and subjects
val query2Future = Future {
  spark.sql("SELECT p.name, s.subject FROM people p JOIN subjects s ON p.id = s.person_id").show()
}

// Block the main thread until both queries complete so the program doesn't exit prematurely
Await.result(query1Future, Duration.Inf)
Await.result(query2Future, Duration.Inf)

spark.stop()
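If you need the query results back in the driver instead of just printing them, the same pattern works with the futures returning collected rows. This is a minimal variant sketch, assuming the same imports and temp views as above and run before spark.stop(); the value names q1, q2, and results are just placeholders:

// Each Future collects its query's rows back to the driver
val q1 = Future {
  spark.sql("SELECT name, age FROM people WHERE age > 30").collect()
}
val q2 = Future {
  spark.sql("SELECT p.name, s.subject FROM people p JOIN subjects s ON p.id = s.person_id").collect()
}

// Future.sequence combines the two futures into one, so a single Await covers both jobs
val results = Await.result(Future.sequence(Seq(q1, q2)), Duration.Inf)
results.foreach(rows => rows.foreach(println))

Because collect() is an action, each query's job is still triggered inside its own Future, so the two jobs can overlap on the Spark scheduler just as they do with show().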
Below are images showing the code and the results from executing the Spark SQL queries: