Apache Spark Training


Contents of this module

  1. Spark Overview

  2. Use Cases

Big Data Platforms

  1. too much data for one machine

  2. processing speed

  3. scaling out vs. scaling up
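
To make the scale-out idea concrete, here is a minimal sketch in plain Python (not Spark code). It assumes the data can be split into independent partitions. Threads stand in for worker machines, and the function names `split_into_partitions`, `count_words`, and `scale_out_word_count` are made up for illustration. Each worker counts words in its own partition, and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin over num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Work done independently on one partition (one hypothetical 'machine')."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def scale_out_word_count(records, num_workers=4):
    """Count words per partition in parallel, then merge the partial counts."""
    partitions = split_into_partitions(records, num_workers)
    # Threads are only a stand-in here; a real cluster would use separate machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = [
    "spark runs on a cluster",
    "a cluster has many machines",
    "spark scales out",
]
result = scale_out_word_count(lines, num_workers=2)
print(result["spark"], result["cluster"], result["a"])  # prints: 2 2 2
```

Adding workers (scaling out) changes only `num_workers`, while scaling up would mean giving a single worker more memory or CPU. The result is the same regardless of how many partitions are used.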

Why Apache Spark?

  • next generation big data processing engine

  • large community

  • production ready since 05/2014

  • used by

    • Amazon

    • SAP

    • IBM


Spark history

  • Apache Hadoop started in 2006

  • Spark started by Matei Zaharia at UC Berkeley’s AMPLab in 2009

    • In the same year: Project Stratosphere started (later becoming Apache Flink)

  • 2010: open sourced under a BSD license

  • 2013: donated to the Apache Software Foundation

  • 2013: Databricks founded

  • May 2014: Spark 1.0

  • 2016: Spark 2.0: Datasets

  • 2017: Spark 2.2: Structured Streaming production ready

  • 2018: Spark 2.4

Apache Spark - Unified Framework

  • Scala, Java, Python, R

  • ease of use:

    • over 80 high-level operators

  • Batch Processing

  • Stream Analytics

  • Machine Learning

  • Interactive SQL


Benefits of a unified platform

  • Single consistent set of APIs

  • Shared abstraction: RDDs, DataFrames, or DStreams

  • Pipelines: Combine different types of processing in the same application

  • Performance: Only one-time initialization, data sharing in memory



  • Spark does not have its own storage layer

  • it can read from and write to

    • HDFS (Hadoop)

    • Local file system

    • NoSQL (HBase, Cassandra)

    • Cloud (Amazon S3)

Spark Use Cases

  • Data Integration and ETL

    • Description: Cleansing and combining data from diverse sources

    • Users: Palantir (data analytics platform)

  • Interactive analytics

    • Description: Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards

    • Users: Goldman Sachs (analytics platform), Huawei (query platform in the telecom sector)

  • High performance batch computation

    • Description: Run complex algorithms against large scale data

    • Users: Novartis (genomic research), MyFitnessPal (process food data)

  • Machine Learning

    • Description: Predict outcomes to make decisions based on input data

    • Users: Alibaba (marketplace analysis), Spotify (music recommendation)

  • Real-time stream processing

    • Description: Capturing and processing data continuously with low latency and high reliability

    • Users: Netflix (recommendation engine), British Gas (connected homes)

Short summary

  • Spark introduction, history and benefits

    • Next generation big data processing engine

    • Production-ready since 05/2014

    • Spark’s fast engine: 3x faster with 10x fewer resources than Hadoop [Daytona GraySort]

  • Overview of Spark components

    • Spark SQL, Spark Streaming, Spark ML, GraphX on top of Spark Core Engine

  • Several storage options, e.g. HDFS, Cassandra

  • Suitable for different use cases