What Is Apache Spark ETL?
In my last article here I talked about the ETL process. The Extract, Transform, and Load (ETL) process is fundamental for data engineers moving data between data warehouses and data lakes. It’s essentially a process where data is extracted from a source, cleaned up, and then loaded into a new place for further analysis.
ETL isn’t perfect, and it can be a very manual process: it often requires a developer to hand-code the instructions for the transformation step. Multiple transformations on the same set of data can also leave stale data behind, so regular maintenance is required to prevent duplicates and other issues.
ETL has been around since the 1990s, supporting an entire ecosystem of Business Intelligence (BI) and analytics tools. Traditional ETL has been valuable – and a dramatic improvement on the fixed databases of the 90s – but there are more modern ways of moving and transforming data.
BI now often focuses on Big Data, data warehouses, and data lakes, with applications reduced to small services that access the data. Traditional ETL with manual processing is difficult to sustain in this highly unstructured environment.
Spark ETL provides a middleware framework for writing code that performs ETL processes faster and more reliably than before. Traditional ETL struggles to handle unstructured or semi-structured data, and with modern data use this capability is becoming essential.
Often this unstructured data is being continuously created – imagine trying to create a sentiment analysis process that looks at all the real-time feedback coming from your customers across multiple social media platforms.
This stream of raw, unstructured data needs to be turned into insight in real time. Spark allows the developer to build ETL pipelines that continuously clean, process, and aggregate streaming data before loading it into a data store.
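As a rough illustration, here is a minimal sketch of such a streaming pipeline using Spark Structured Streaming. The input path, the JSON schema, and the console sink are hypothetical placeholders, not a prescribed setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("feedback-stream-etl").getOrCreate()

# Hypothetical schema for raw feedback events arriving as JSON files.
schema = StructType([
    StructField("platform", StringType()),
    StructField("message", StringType()),
    StructField("created_at", TimestampType()),
])

raw = spark.readStream.schema(schema).json("/data/incoming/feedback/")

# Clean and aggregate: drop empty messages, count events per platform per minute.
counts = (raw
          .filter(F.col("message").isNotNull())
          .withWatermark("created_at", "10 minutes")
          .groupBy(F.window("created_at", "1 minute"), "platform")
          .count())

# Continuously load the aggregated results into a sink (console here, for illustration).
query = (counts.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```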
This blog offers a good explanation of the ten key concepts that underpin how Spark transforms traditional ETL. Here is a brief summary of all ten:
1. Architecture
Unlike many traditional ETL tools, Spark uses a master/worker architecture: a driver program coordinates the job, while executors on the worker nodes carry out the tasks in parallel.
2. Data structures
RDD (Resilient Distributed Dataset) is the basic data structure in Spark. The name signifies that the data is recoverable from failures and is distributed across all the nodes.
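As a quick illustration, here is a minimal sketch of creating an RDD in PySpark; the sample data and partition count are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local collection; Spark splits it into partitions
# that are distributed across the worker nodes (3 partitions here).
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)
print(rdd.getNumPartitions())  # 3

# If a node fails, the lost partitions can be recomputed from the RDD's
# lineage, which is what makes it "resilient".
```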
3. Spark configurations
To create an RDD/DataFrame and perform operations on it, every Spark application must make a connection to the Spark cluster by creating an object called SparkContext.
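For example, here is a minimal sketch of creating a SparkContext with an explicit configuration; the application name, the local master URL, and the input file are placeholders:

```python
from pyspark import SparkConf, SparkContext

# In a real cluster the master URL is usually supplied by spark-submit
# rather than hard-coded; "local[*]" simply runs Spark locally.
conf = SparkConf().setAppName("etl-job").setMaster("local[*]")
sc = SparkContext(conf=conf)

# With the context established, RDDs can be created and operated on.
lines = sc.textFile("input.txt")
print(lines.count())
```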
4. RDD transformations
Transformations are functions applied to an RDD to create one or more new RDDs. The operations are applied in parallel across the RDD’s partitions.
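A small sketch of chaining transformations, with made-up sample records:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-example")

# Each transformation returns a new RDD; the work is applied in
# parallel across the partitions of the parent RDD.
lines = sc.parallelize(["alice,42", "bob,", "carol,35"])
parsed = lines.map(lambda line: line.split(","))       # one new RDD
valid = parsed.filter(lambda fields: fields[1] != "")  # another new RDD
print(valid.collect())  # [['alice', '42'], ['carol', '35']]
```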
5. Lazy evaluation
Typically, an ETL job comprises transformations that are applied to the input data before the final data is loaded into the target datastore. Let’s assume we have an ETL job with one step to extract the data into the Spark environment, five transformation steps, and one step to load the data into an external datastore. In Spark, the transformations are not executed/materialized until the action that loads the data into the external datastore is called.
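A minimal sketch of this behaviour; the numbers, and the final take() standing in for the real load step, are purely illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-example")

# "Extract": create the input RDD.
numbers = sc.parallelize(range(1_000_000))

# Transformations: nothing runs yet; Spark only records the lineage.
squared = numbers.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Only this action triggers execution of the whole chain; in a real ETL
# job it would be the write to the external datastore.
print(evens.take(5))  # [0, 4, 16, 36, 64]
```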
6. Caching and broadcasting data
An RDD is a collection of partitions: immutable subsets of the data distributed across the nodes in the cluster. Spark distributes the tasks generated from the DAG to the workers, to be applied to the partitions of data they hold. Caching keeps frequently reused partitions in executor memory so they are not recomputed, while broadcasting ships a small, read-only dataset to every worker once instead of sending it with every task.
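A small sketch of both ideas in PySpark, with a toy dataset and lookup table of my own invention:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-broadcast-example")

events = sc.parallelize([("US", 3), ("DE", 5), ("US", 7)])

# cache() keeps the partitions in executor memory so that repeated
# actions do not recompute the lineage from scratch.
events.cache()
print(events.count())                          # first action computes and caches
print(events.map(lambda kv: kv[1]).sum())      # reuses the cached partitions

# broadcast() ships a read-only lookup table to every worker once,
# instead of including it in every task.
country_names = sc.broadcast({"US": "United States", "DE": "Germany"})
named = events.map(lambda kv: (country_names.value[kv[0]], kv[1]))
print(named.collect())
```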
7. SparkSQL
SparkSQL is a programming module used to load data into Spark from a variety of sources, and it provides a very easy interface for running SQL queries on the loaded data.
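For instance, a minimal sketch of loading a hypothetical CSV file and querying it with SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load data from a source (a placeholder CSV path) into a DataFrame,
# register it as a temporary view, and query it with plain SQL.
orders = spark.read.option("header", "true").csv("/data/orders.csv")
orders.createOrReplaceTempView("orders")

top_customers = spark.sql("""
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    ORDER BY order_count DESC
    LIMIT 10
""")
top_customers.show()
```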
8. YARN
Each worker node in a Hadoop cluster has compute resources such as memory and CPU. YARN is the resource manager in Hadoop and is the ultimate authority in allocating and managing those resources for Spark jobs.
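As a sketch, the resources a Spark job requests from YARN can be set through configuration; the values below are illustrative rather than recommendations, and in practice they are often passed on the spark-submit command line instead:

```python
from pyspark.sql import SparkSession

# Ask YARN for 4 executors, each with 4 GB of memory and 2 cores.
spark = (SparkSession.builder
         .appName("yarn-etl-example")
         .master("yarn")
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .getOrCreate())
```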
9. Spark History Server
For every Spark job submitted to the cluster, a web UI is launched to display useful information such as the list of jobs/stages/tasks, memory/disk utilization, and other details for each executor assigned to the job. The Spark History Server keeps this UI available after the application has finished, as long as the job writes event logs.
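The History Server reads Spark event logs, so the job has to write them; a minimal sketch of enabling this (the log directory is a placeholder):

```python
from pyspark.sql import SparkSession

# Event logging lets the Spark History Server reconstruct the web UI
# (jobs, stages, tasks, executor memory/disk usage) after the job ends.
spark = (SparkSession.builder
         .appName("history-enabled-etl")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")
         .getOrCreate())
```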
10. Spark-submit
Spark-submit is used to launch a Spark application on a cluster. A typical project may contain more than one job, and an ETL job for semi-structured/unstructured data may need to import helper functions to perform operations on the data.
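A minimal sketch of such a job file and the kind of spark-submit call that might launch it; the file names, paths, and the helpers.zip archive are hypothetical:

```python
# etl_job.py -- a minimal job that could be launched with spark-submit, e.g.:
#   spark-submit --master yarn --py-files helpers.zip etl_job.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("submit-example").getOrCreate()
    df = spark.read.json("/data/raw/events/")                        # extract
    cleaned = df.dropna(subset=["event_id"])                         # transform
    cleaned.write.mode("overwrite").parquet("/data/clean/events/")   # load
    spark.stop()

if __name__ == "__main__":
    main()
```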
Click here for more information on data engineering expertise at IBA Group.