Apache Samza is a distributed stream processing framework. Samza allows to build stateful applications that process data in real-time from multiple sources including Apache Kafka. Samza is battle-tested at scale, it supports flexible deployment options to run on YARN or as a standalone library.
Apache Samza provides following features:
Pluggability at every level
Samza as an embedded library
Write once, Run anywhere
Samza as a managed service
Samza provides a simple API to describe the application-logic in a manner independent of the data-source.
The same API can process both batch and streaming data.
Samza can be used to process and transform data from any source.
Samza offers built-in integrations with Apache Kafka, AWS Kinesis, Azure EventHubs, ElasticSearch and Apache Hadoop.
It’s quite easy to integrate with various data sources.
Samza can integrate effortlessly with an existing application to eliminate the need to spin up and operate a separate cluster for stream processing.
Samza can be used as a light-weight client-library embedded in Java/Scala applications.
Samza supports flexible deployment options to run applications anywhere.
It can be deployed on public clouds as well as on containerized environments and bare-metal hardware.
Samza can be run for stream-processing as a managed service.
Samza integrates with popular cluster-managers including Apache YARN.
Samza transparently migrates tasks along with their associated state in the event of failures.
Samza supports host-affinity and incremental checkpointing to enable fast recovery from failures.
Samza is battle-tested on applications that use several terabytes of state and run on thousands of cores.
Samza powers multiple large companies including LinkedIn, Uber, TripAdvisor, Slack etc.
Samza processes the data in the form of streams.
A stream is a collection of immutable messages, usually of the same type or category.
Each message in a stream is modelled as a key-value pair.
A Samza stream is sharded into multiple partitions for scaling how its data is processed.
Each partition is an ordered, replayable sequence of records.
When a message is written to a stream, it ends up in one of its partitions.
Each message in a partition is uniquely identified by an offset.
Samza supports both stateless and stateful stream processing.
Stateless processing, as the name implies, does not retain any state associated with the current message after it has been processed.
Stateful processing records some state about a message even after processing it. Samza offers a fault-tolerant, scalable state-store for this purpose.
All built-in Samza operators use processing time.
In processing time, the timestamp of a message is determined by when it is processed by the system.
In event time, the timestamp of an event is determined by when it actually occurred at the source.
Samza provides event-time based processing by its integration with Apache BEAM.
Samza supports at-least once processing.
Each message in the input stream is processed by the system at-least once.
This guarantees no data-loss even when there are failures.
Samza scales an application by logically breaking it down into multiple tasks.
A task is the unit of parallelism for an application.
Each task consumes data from one partition of input streams.
The assignment of partitions to tasks never changes.
If a task is on a machine that fails, the task is restarted elsewhere, still consuming the same stream partitions.
Since there is no ordering of messages across partitions, it allows tasks to execute entirely independent of each other without sharing any state.
Just like a task is the logical unit of parallelism for an application, a container is the physical unit.
Each worker is a JVM process, which runs one or more tasks.
An application typically has multiple containers distributed across hosts.
Each Samza application has a coordinator which manages the assignment of tasks across the individual containers.
The coordinator monitors the liveness of individual containers and redistributes the tasks among the remaining ones during a failure.
The coordinator itself is pluggable, enabling Samza to support multiple deployment options.
Samza guarantees that messages won’t be lost, even if a job crashes, if a machine dies, if there is a network fault, or something else goes wrong. To achieve this property, each task periodically persists the last processed offsets for its input stream partitions.
If a task needs to be restarted on a different worker due to a failure, it resumes processing from its latest checkpoint.
Samza’s checkpointing mechanism ensures each task also stores the contents of its state-store consistently with its last processed offsets. Checkpoints are flushed incrementally ie., the state-store only flushes the delta since the previous checkpoint instead of flushing its entire state.
Samza offers scalable, high-performance storage to build stateful stream-processing applications. This is implemented by associating each Samza task with its own instance of a local database (aka. a state-store).
The state-store associated with a particular task only stores data corresponding to the partitions processed by that task.
Samza transparently migrates the tasks from one machine to another. By giving each task its own state, tasks can be relocated without affecting the overall application.
Samza provides multiple programming APIs to fit a use case:
Java APIs: Samza’s provides two Java programming APIs that are ideal for building advanced Stream Processing applications.
Samza SQL: Samza SQL provides a declarative query language for describing the stream processing logic. It lets a user manipulate streams using SQL predicates and UDFs instead of working with the physical implementation details.
Apache Beam API: Samza also provides a Apache Beam runner to run applications written using the Apache Beam API. This is considered as an extension to the operators supported by the High Level Streams API in Samza.
Samza provides two Java programming APIs that are ideal for building advanced Stream Processing applications.
High Level Streams API: Samza’s flexible High Level Streams API can describe a complex stream processing pipeline in the form of a Directional Acyclic Graph (DAG) of operations on message streams. It provides a rich set of built-in operators that simplify common stream processing operations such as filtering, projection, repartitioning, joins, and windows.
Low Level Task API: Samza’s powerful Low Level Task API can be used to write an application in terms of processing logic for each incoming message.