Basics
Spark Overview
Use Cases
too much data for one machine
processing speed
scaling out vs. scaling up
next generation big data processing engine
large community
production-ready since 05/2014
used by
Amazon
SAP
IBM
Apache Hadoop started in 2006
Spark started by Matei Zaharia at UC Berkeley’s AMPLab in 2009
In the same year: Project Stratosphere started (later becoming Apache Flink)
2010: open-sourced under a BSD license
2013: donated to the Apache Software Foundation
2013: Databricks established
May 2014: Spark 1.0
2016: Spark 2.0: Datasets
2017: Spark 2.2: Structured Streaming no longer experimental
2018: Spark 2.4
Scala, Java, Python, R
ease of use:
more than 80 high-level operators
Batch Processing
Stream Analytics
Machine Learning
Interactive SQL
DIAGRAM HERE
Single consistent set of APIs
Shared abstractions: RDDs, DataFrames, DStreams
Pipelines: Combine different types of processing in the same application
Performance: Only one-time initialization, data sharing in memory
DIAGRAM HERE
Spark does not have its own storage layer
It can use:
HDFS (Hadoop)
Local file system
NoSQL (HBase, Cassandra)
Cloud (Amazon S3)
Use case | Description | Users |
---|---|---|
Data Integration and ETL | Cleansing and combining data from diverse sources | Palantir: Data analytics platform |
Interactive analytics | Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards | Goldman Sachs: Analytics platform |
High performance batch computation | Run complex algorithms against large scale data | Novartis: Genomic Research |
Machine Learning | Predict outcomes to make decisions based on input data | Alibaba: Marketplace Analysis |
Real-time stream processing | Capturing and processing data continuously with low latency and high reliability | Netflix: Recommendation Engine |
Spark introduction, history and benefits
Next generation big data processing engine
Production-ready since 05/2014
Spark’s fast engine: 3x faster with 10x fewer resources than Hadoop [Daytona-GreySort]
Overview of Spark components
Spark SQL, Spark Streaming, Spark ML, GraphX on top of Spark Core Engine
Several storage options, e.g. HDFS, Cassandra
Suitable for different use cases