Apache Flink supports both unbounded and bounded streams:
Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated.
Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations. Processing of bounded streams is also known as batch processing.
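The distinction can be illustrated with a minimal sketch (plain Python, not Flink code, with hypothetical function names): a bounded input can be fully ingested before computing, while an unbounded input must produce results incrementally as records arrive.

```python
def batch_sum(bounded_records):
    """Bounded stream: ingest all data first, then compute once."""
    data = list(bounded_records)   # ingest everything before any computation
    return sum(data)

def streaming_sums(unbounded_records):
    """Unbounded stream: emit an updated result after every record."""
    total = 0
    for record in unbounded_records:
        total += record
        yield total                # result is continuously refined

print(batch_sum([1, 2, 3, 4]))          # 10
print(list(streaming_sums([1, 2, 3])))  # [1, 3, 6]
```

The batch variant sees a complete, finite dataset; the streaming variant never assumes the input ends, which is why its result is a running sequence rather than a single value.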
Apache Flink is a distributed system and requires compute resources in order to execute applications.
Flink integrates with all common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes.
It can also be set up to run as a standalone cluster.
Flink is designed to run stateful streaming applications at any scale.
Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster.
A Flink application can leverage virtually unlimited amounts of CPUs, main memory, disk and network IO.
Stateful Flink applications are optimized for local state access.
Task state is always maintained in memory or, if the state size exceeds the available memory, in access-efficient on-disk data structures.
Pattern detection is a very common use case for event stream processing.
Flink’s CEP library provides an API to specify patterns of events (think of regular expressions or state machines).
The CEP library is integrated with Flink’s DataStream API, such that patterns are evaluated on DataStreams.
Applications for the CEP library include network intrusion detection, business process monitoring, and fraud detection.
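The state-machine view of pattern detection can be sketched in a few lines of plain Python (this is a toy stand-in, not Flink's CEP API): flag a small purchase immediately followed by a large one, a simplified fraud-detection pattern. The function name and thresholds are hypothetical.

```python
def detect_small_then_large(amounts, small=10, large=500):
    """Toy event pattern: a purchase below `small` immediately
    followed by a purchase above `large`."""
    matches = []
    prev = None
    for amount in amounts:
        # transition of a two-state machine: "seen small" -> "match"
        if prev is not None and prev < small and amount > large:
            matches.append((prev, amount))
        prev = amount
    return matches

print(detect_small_then_large([5, 900, 50, 3, 20]))  # [(5, 900)]
```

Flink's CEP library expresses such patterns declaratively and evaluates them over DataStreams, but conceptually each pattern compiles down to a state machine like this one.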
The DataSet API is Flink’s core API for batch processing applications.
The primitives of the DataSet API include map, reduce, (outer) join, co-group, and iterate.
All operations are backed by algorithms and data structures that operate on serialized data in memory.
These operations spill to disk if the data size exceeds the memory budget.
Algorithms of Flink’s DataSet API are based on database operators, like hybrid hash-join or external merge-sort.
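The core idea of a hash-join can be sketched in plain Python (a hypothetical in-memory version, not Flink's hybrid implementation, which additionally partitions and spills): build a hash table on one input, then probe it with the other.

```python
def hash_join(build_side, probe_side):
    """Toy equi-join: build a hash table on the first input,
    probe it with the second, yield joined tuples."""
    table = {}
    for key, value in build_side:          # build phase
        table.setdefault(key, []).append(value)
    for key, value in probe_side:          # probe phase
        for match in table.get(key, []):
            yield (key, match, value)

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]
print(list(hash_join(users, orders)))
# [(1, 'alice', 'book'), (1, 'alice', 'pen')]
```

A hybrid hash-join keeps as many hash-table partitions in memory as the budget allows and processes the rest from disk, which is what lets the DataSet API scale beyond main memory.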
Gelly is a library for scalable graph processing and analysis.
Gelly is implemented on top of and integrated with the DataSet API.
Because of this, Gelly benefits from the scalable and robust operators of the DataSet API.
Gelly features built-in algorithms, such as label propagation, triangle enumeration, and PageRank.
Gelly provides a Graph API that eases the implementation of custom graph algorithms.
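One of the built-in algorithms, label propagation, can be illustrated with a toy sketch (plain Python, not Gelly's distributed implementation): each vertex repeatedly adopts the most frequent label among its neighbors until labels stabilize.

```python
from collections import Counter

def label_propagation(edges, labels, max_iters=10):
    """Toy label propagation on an undirected graph.
    `labels` maps vertex -> initial label and is updated in place."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    for _ in range(max_iters):
        changed = False
        for v in sorted(neighbors):
            # adopt the most common label among this vertex's neighbors
            counts = Counter(labels[n] for n in neighbors[v])
            best = counts.most_common(1)[0][0]
            if labels[v] != best:
                labels[v] = best
                changed = True
        if not changed:         # labels converged
            break
    return labels

edges = [(1, 2), (2, 3), (4, 5)]
labels = {1: "a", 2: "a", 3: "b", 4: "c", 5: "c"}
print(label_propagation(edges, labels))
```

In Gelly the same iteration is expressed with the DataSet API's iterate primitive, so it runs in parallel across the cluster instead of in a single loop.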
Flink features two relational APIs, the Table API and SQL.
Both APIs are unified APIs for batch and stream processing.
Queries are executed with the same semantics on unbounded, real-time streams or bounded, recorded streams and produce the same results.
The Table API and SQL leverage Apache Calcite for parsing, validation, and query optimization.
Flink’s recovery mechanism is based on consistent checkpoints of an application’s state.
In case of a failure, the application is restarted and its state is loaded from the latest checkpoint.
In combination with resettable stream sources, this feature can guarantee exactly-once state consistency.
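The interplay of checkpoints and resettable sources can be sketched as a toy model (plain Python, not Flink's actual mechanism; all names are hypothetical): state is snapshotted together with the source offset, so after a failure the job reloads the snapshot and replays the source from that offset, and each record affects the state exactly once.

```python
class Job:
    def __init__(self, source):
        self.source = source        # resettable: can re-read any offset
        self.state = 0
        self.offset = 0
        self.checkpoint = (0, 0)    # (state, source offset)

    def snapshot(self):
        """Persist state together with the current read position."""
        self.checkpoint = (self.state, self.offset)

    def recover(self):
        """Reload state and rewind the source to the checkpointed offset."""
        self.state, self.offset = self.checkpoint

    def process(self, n, fail_at=None):
        for _ in range(n):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated failure")
            self.state += self.source[self.offset]
            self.offset += 1

job = Job([1, 2, 3, 4, 5])
job.process(2)                    # process records 1, 2
job.snapshot()                    # checkpoint: state=3, offset=2
try:
    job.process(3, fail_at=3)     # processes record 3, then fails
except RuntimeError:
    job.recover()                 # roll back to state=3, offset=2
job.process(3)                    # replay records 3, 4, 5
print(job.state)                  # 15 — each record counted exactly once
```

The record processed between the last checkpoint and the failure is rolled back and replayed, which is why the final state reflects every record exactly once despite the crash.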
Checkpointing the state of an application can be quite expensive if the application maintains terabytes of state.
Flink can perform asynchronous and incremental checkpoints, which keeps the impact of checkpointing on the application's latency SLAs very small.
Flink features transactional sinks for specific storage systems that guarantee that data is only written out exactly once, even in case of failures.
Flink is tightly integrated with cluster managers, such as Hadoop YARN, Mesos, or Kubernetes.
When a process fails, a new process is automatically started to take over its work.
Flink features a high-availability mode that eliminates all single-points-of-failure.
The HA-mode is based on Apache ZooKeeper, a battle-proven service for reliable distributed coordination.
Savepoints enable the following features:
Flink Version Updates
A/B Tests and What-If Scenarios
Pause and Resume