Apache Spark

Apache Spark is an open source, general-purpose platform for cluster computing. Spark offers APIs in Java, Scala, Python, and R. Spark also provides higher-level tools to support SQL queries, streaming data, machine learning, and graph algorithms.

All code snippets are written in Python and use pyspark.

Clusters and Nodes

At its core, Spark is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many workers. Spark spreads data and computation over clusters with multiple nodes; each node processes its subset of the data in memory, which is what enables parallel computation. Information in this section is taken from Spark's Cluster Mode Overview.

A Spark application is coordinated by the SparkContext object in the main program (called the driver program). The SparkContext connects to a cluster manager (or master), which allocates resources on a set of worker nodes (or slaves). Spark then acquires executors on the worker nodes; the driver sends application code and tasks to the executors, which run the tasks and return results.

spark-cluster.png

The driver program has a web UI, typically on port 4040, that displays information about tasks, executors, and storage.

Spark Context

The Spark Context is the main entry point for Spark functionality. A Spark Context is essentially a client that tells Spark how to connect to the Spark cluster. The special "local" string tells Spark to run in local mode.

from pyspark import SparkContext

sc = SparkContext("local", "example")

A SparkContext object may also be instantiated with SparkConf.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("example")
sc = SparkContext(conf=conf)

Running Applications

Use bin/spark-submit to launch applications on a cluster. See Submitting Applications for more information.

# Run locally with 8 worker threads.
bin/spark-submit \
    --master local[8] \
    src/example.py \
    10 # program argument

# Run against a standalone master server.
bin/spark-submit \
    --master spark://207.184.161.138:7077 \
    --executor-memory 5G \
    src/example.py \
    10

Resilient Distributed Datasets (RDD)

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is a fault-tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Note that Spark now recommends the higher-level Dataset and DataFrame interfaces over RDDs. Information in this section is taken (often directly) from the RDD Programming Guide.

Create an RDD by parallelizing an existing collection or by referencing a dataset in an external storage system. Each method below takes an optional second argument that controls the number of partitions.

# RDD from an existing collection.
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# RDD from a local file.
rdd = sc.textFile("data.txt")

# RDD from an S3 Bucket.
rdd = sc.textFile("s3a://bucket/file.txt")

RDDs support two types of operations: transformations (e.g. map) and actions (e.g. reduce). All transformations in Spark are lazy; transformations are only computed when an action requires a result.

# Count number of characters.
rdd.map(len).reduce(lambda a, b: a + b)
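
Because transformations are lazy, chaining them only builds an execution plan; nothing runs until an action is called. A minimal sketch of this behavior (assuming the data.txt file from above):

# Transformations build a plan but compute nothing yet.
lines = sc.textFile("data.txt")
long_lines = lines.filter(lambda line: len(line) > 80)
lengths = long_lines.map(len)

# Actions trigger the actual computation.
n_long = lengths.count()
total_chars = lengths.reduce(lambda a, b: a + b)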

Spark SQL

Spark SQL is a Spark module for structured data processing and a higher-level interface than the RDD API. The SQL module provides a DataFrame abstraction built on top of RDDs, which facilitates SQL operations and enables additional optimizations. DataFrames require a SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
flights = spark.sql("FROM flights SELECT * LIMIT 10")
flights.show()

The sql method takes a query string and returns a DataFrame with the results. List all available tables with spark.catalog.listTables().
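
The flights table above is assumed to already exist in the catalog. A DataFrame read from a file can be registered as a temporary view so it can be queried with sql; the file name and columns here are hypothetical:

flights_df = spark.read.csv("flights.csv", header=True)
flights_df.createOrReplaceTempView("flights")

spark.catalog.listTables()                          # the "flights" view now appears
spark.sql("SELECT * FROM flights LIMIT 10").show()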

A Spark DataFrame may be created from an existing RDD, from a file, or programmatically. The SparkSession has a .read attribute that exposes methods for reading different data sources into Spark DataFrames.

airports = spark.read.csv("airports.csv")
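
A DataFrame can also be built programmatically with spark.createDataFrame, for example from a list of tuples or from an existing RDD; the data and column names below are illustrative:

# From a list of tuples with explicit column names.
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# From an existing RDD of tuples.
rdd = sc.parallelize([("SEA", 47.45), ("PDX", 45.59)])
airport_coords = spark.createDataFrame(rdd, ["code", "latitude"])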

DataFrames provide a domain-specific language for data manipulation, with methods such as select, filter, join, and groupBy. If the data fits in memory, a Spark DataFrame can also be converted to a pandas DataFrame with the .toPandas() method; spark.createDataFrame(..) does the reverse.
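
A short sketch of these methods, assuming the flights and airports DataFrames from above with illustrative column names (origin, dest, air_time, faa):

# Filter rows, then select columns.
long_flights = flights.filter("air_time > 120").select("origin", "dest", "air_time")

# Aggregate with groupBy.
by_origin = flights.groupBy("origin").count()

# Join on a key column.
joined = flights.join(airports, flights.origin == airports.faa, "left_outer")

# Convert a (small) result to a pandas DataFrame.
by_origin_pd = by_origin.toPandas()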

TODO Spark Streaming

TODO Spark ML

TODO Spark Graph

TODO Shared Variables: Broadcast, Accumulator

See Spark's documentation for more information.