What Is Apache Spark ETL?

November 25, 2021  |  Mark Hillary

In my last article here I talked about the ETL process. The Extract, Transform, and Load (ETL) process is fundamental for data engineers when moving data between data warehouses or data lakes. It’s essentially a process where the data is extracted, cleaned up, and then dropped into a new place for further analysis.

ETL isn’t perfect, and it can be a very manual process. Often it requires a developer to code instructions for the transformation process. Multiple transformations on the same set of data can also create stale data, so regular maintenance is required to prevent duplicates and other issues.

ETL has been around since the 1990s, supporting an entire ecosystem of Business Intelligence (BI) and analytic tools. Traditional ETL has been valuable – and a dramatic improvement on those fixed databases of the 90s – but there are more modern ways of moving data and transforming it.

BI now often focuses on Big Data, data warehouses, and data lakes, with applications reduced to small services that access the data. Traditional ETL with manual processing is difficult to sustain in this highly unstructured environment.

Spark ETL provides a middleware framework for writing code that performs ETL processes faster and more reliably than before. Traditional ETL struggles to handle unstructured or semi-structured data, and with modern data use, handling it is becoming essential.

Often this unstructured data is being continuously created – imagine trying to create a sentiment analysis process that looks at all the real-time feedback coming from your customers across multiple social media platforms.

This stream of raw unstructured data needs to be turned into insight in real time. Spark lets developers build ETL pipelines that continuously clean, process, and aggregate streaming data before loading it into a data store.
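
As a rough illustration (not from the original article), here is a minimal PySpark Structured Streaming sketch of such a pipeline: it reads raw text from a socket source, applies a simple cleaning transformation, and continuously appends the results to a Parquet data store. The host, port, paths, and column handling are placeholder assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, current_timestamp

# Minimal sketch: continuously clean and load streaming text data.
spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

# Hypothetical source: raw feedback lines arriving on a TCP socket
# (in practice this would more likely be Kafka or a cloud queue).
raw = (spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9999)
       .load())

# Transform: normalise the text and stamp the ingestion time.
cleaned = (raw
           .withColumn("text", lower(col("value")))
           .withColumn("ingested_at", current_timestamp())
           .drop("value"))

# Load: append the cleaned records to a Parquet data store.
query = (cleaned.writeStream
         .format("parquet")
         .option("path", "/tmp/feedback_clean")
         .option("checkpointLocation", "/tmp/feedback_chkpt")
         .outputMode("append")
         .start())

query.awaitTermination()
```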

This blog offers a good explanation of the ten key concepts that underpin how Spark transforms traditional ETL. Let’s explore a brief summary of all ten:

1. Architecture

Unlike many traditional ETL tools, Spark uses a master/worker architecture.
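
As a hedged sketch of what that split looks like in practice, the driver program (your ETL code) connects to a cluster manager through a master URL, and the actual work runs on worker processes; only the URL changes between a laptop and a real cluster. The URLs and app name below are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# The driver (this script) connects to a cluster manager via a master URL;
# the actual work is executed by worker/executor processes.
spark = (SparkSession.builder
         .appName("architecture-sketch")
         # "local[*]" runs everything in one process for development; a standalone
         # cluster master would look like "spark://master-host:7077" (assumed hostname).
         .master("local[*]")
         .getOrCreate())

print(spark.sparkContext.master)   # shows which cluster manager the driver is attached to
spark.stop()
```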

2. Data structures

An RDD (Resilient Distributed Dataset) is the basic data structure in Spark. The name signifies that the data is recoverable from failures and is distributed across all the nodes.
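
A minimal sketch (toy data, local mode assumed): the collection is split into partitions that Spark spreads across the available nodes or cores, and the lineage recorded for the RDD is what makes lost partitions recoverable.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Distribute a toy collection across 4 partitions.
rdd = sc.parallelize(range(10), numSlices=4)

print(rdd.getNumPartitions())   # 4 partitions spread over the nodes/cores
print(rdd.glom().collect())     # see which elements landed in which partition

spark.stop()
```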

3. Spark configurations

To create an RDD/DataFrame and perform operations on it, every Spark application must connect to the Spark cluster by creating an object called SparkContext.
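
A minimal sketch of that connection step, assuming a local master and an arbitrary app name: the SparkConf carries the application's settings, and creating the SparkContext is what attaches the application to the cluster.

```python
from pyspark import SparkConf, SparkContext

# Configuration for this application: app name, master URL, and any tuning options.
conf = (SparkConf()
        .setAppName("config-sketch")
        .setMaster("local[*]"))           # assumption: run locally for illustration

# SparkContext is the connection to the cluster; RDDs are created through it.
sc = SparkContext(conf=conf)

# Nothing is read yet (lazy); the path is a hypothetical example.
lines = sc.textFile("/tmp/input.txt")

sc.stop()
```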

4. RDD transformations

Transformations are functions applied to an RDD to create one or more new RDDs. The operations are applied in parallel across the RDD’s partitions.
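
For example (a toy sketch, not taken from the article), map and filter are transformations: each call returns a new RDD, and the work runs in parallel over the partitions.

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("transform-sketch").setMaster("local[*]"))

lines = sc.parallelize(["10,ok", "11,error", "12,ok"])   # toy records: id,status

# Each transformation returns a new RDD; the original RDDs are never modified.
fields = lines.map(lambda line: line.split(","))          # RDD of [id, status]
ok_only = fields.filter(lambda rec: rec[1] == "ok")       # keep only the "ok" records
ids = ok_only.map(lambda rec: int(rec[0]))                # RDD of ids

print(ids.collect())   # [10, 12] -- collect() is an action that triggers execution
sc.stop()
```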

5. Lazy evaluation

Typically, an ETL job comprises transformations that are applied to the input data before loading the final data into the target datastore. Let’s assume we have an ETL job with one step to extract the data into the Spark environment, five steps of transformations, and one step to load the data into an external datastore. In Spark, the transformations are not executed/materialized until the action to load data into the external datastore is called.
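
A sketch of that shape of job (the input file, column names, and output path are assumptions): the five transformation lines only build an execution plan, and nothing runs until the final write, which is an action.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, upper

spark = SparkSession.builder.appName("lazy-sketch").master("local[*]").getOrCreate()

# Extract: hypothetical CSV with columns order_id, country, amount.
df = spark.read.option("header", True).csv("/tmp/raw_orders.csv")

# Five transformations -- nothing is executed yet, Spark only builds a plan.
step1 = df.dropna()
step2 = step1.withColumn("country", upper(trim(col("country"))))
step3 = step2.filter(col("amount").cast("double") > 0)
step4 = step3.withColumnRenamed("amount", "amount_usd")
step5 = step4.dropDuplicates(["order_id"])

# Load: this action finally triggers reading the file and running all five steps.
step5.write.mode("overwrite").parquet("/tmp/clean_orders")

spark.stop()
```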

6. Caching and broadcasting data

An RDD is a collection of partitions: immutable subsets of the data distributed across the nodes in the cluster. Spark distributes the tasks generated from the DAG (directed acyclic graph) of operations to the workers, to be applied to the partitions of data they hold. Caching keeps frequently reused data in executor memory, while broadcasting ships small lookup datasets to every worker once.
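
A small sketch of those two techniques (toy data, local mode assumed): cache() keeps a reused RDD in executor memory, and broadcast() distributes a small lookup table to every worker once instead of sending it with every task.

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("cache-broadcast-sketch").setMaster("local[*]"))

events = sc.parallelize([("US", 3), ("DE", 5), ("US", 7)])   # toy (country_code, count) pairs

# Caching: keep the partitioned RDD in executor memory because it is reused twice below.
events.cache()

# Broadcasting: ship a small lookup table to every worker once, instead of with every task.
country_names = sc.broadcast({"US": "United States", "DE": "Germany"})

named = events.map(lambda kv: (country_names.value[kv[0]], kv[1]))
print(named.collect())
print(events.count())   # second use of the cached RDD
sc.stop()
```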

7. SparkSQL

SparkSQL is a programming module used to load data into Spark from a variety of sources, and it provides an easy interface for running SQL queries on the loaded data.
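
A minimal sketch (the JSON file and its country column are assumptions): load a source into a DataFrame, register it as a temporary view, and query it with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").master("local[*]").getOrCreate()

# Load data from a source into a DataFrame (hypothetical JSON file with a country field).
customers = spark.read.json("/tmp/customers.json")

# Register it as a temporary view so it can be queried with plain SQL.
customers.createOrReplaceTempView("customers")

result = spark.sql("""
    SELECT country, COUNT(*) AS n_customers
    FROM customers
    GROUP BY country
    ORDER BY n_customers DESC
""")
result.show()

spark.stop()
```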

8. YARN

Each worker node in a Hadoop cluster has compute resources such as memory and CPU. YARN is the resource manager in Hadoop and the ultimate authority for allocating and managing those resources for Spark jobs.
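
As a rough sketch, an application can tell YARN how many executors it wants and how much memory and CPU each should get; YARN then grants (or queues) those containers on the worker nodes. The resource figures below are arbitrary examples, and the script is assumed to run on a node configured for the YARN cluster.

```python
from pyspark.sql import SparkSession

# Assumes the script runs where HADOOP_CONF_DIR points at the YARN configuration;
# the resource figures are arbitrary examples.
spark = (SparkSession.builder
         .appName("yarn-sketch")
         .master("yarn")
         .config("spark.executor.instances", "4")   # ask YARN for 4 executor containers
         .config("spark.executor.memory", "4g")     # memory per executor container
         .config("spark.executor.cores", "2")       # CPU cores per executor
         .getOrCreate())
```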

9. Spark History Server

For every Spark job submitted to the cluster, a web UI is launched to display useful information such as the list of jobs, stages, and tasks, memory and disk utilization, and other details for each executor assigned to the job.
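
That per-job web UI goes away when the application finishes; the Spark History Server can replay it afterwards, provided the job wrote event logs. A minimal sketch of enabling that (the log directory is an assumed example):

```python
from pyspark.sql import SparkSession

# Write event logs so the Spark History Server can show this job after it finishes.
# The HDFS directory is an assumed example; it must match the directory the
# History Server is configured to read (spark.history.fs.logDirectory).
spark = (SparkSession.builder
         .appName("history-sketch")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")
         .getOrCreate())
```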

10. Spark-submit

spark-submit is used to launch a Spark application on the cluster. A typical project may contain more than one job, and an ETL job for semi-structured/unstructured data may need to import helper functions to perform operations on the data.
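
A sketch of what such a job script might look like (the file names and the helper module are hypothetical); the comment at the top shows one typical way to launch it, shipping the helper code alongside the main script.

```python
# Launched on the cluster with something like (illustrative, not from the article):
#   spark-submit --master yarn --deploy-mode cluster --py-files helpers.zip etl_job.py

from pyspark.sql import SparkSession
# from helpers import clean_text   # hypothetical helper functions shipped via --py-files

def main():
    spark = SparkSession.builder.appName("etl-job-sketch").getOrCreate()

    # Hypothetical semi-structured input and output locations.
    df = spark.read.json("/tmp/raw_events.json")
    # df = clean_text(df)           # the imported helpers would be applied here
    df.write.mode("overwrite").parquet("/tmp/events_clean")

    spark.stop()

if __name__ == "__main__":
    main()
```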

Click here for more information on data engineering expertise at IBA Group.
