Question
In a (bigger) Flink job / Ververica Platform deployment, what happens if a task manager is lost due to a failure? Is it restarted on a different node? What happens to its state?
Answer
Apache Flink
Note: This section applies to Flink 1.5 or later.
If you lose one or more of the task managers your job was running on, the following happens, subject to the configured restart strategy (see the configuration sketch after this list):
- Flink stops the job on all other task managers running it.
- Flink tries to acquire the missing number of slots for the desired parallelism from the remaining task managers in the cluster, if available.
- If there are not enough task slots available, Flink asks the cluster manager (YARN, Mesos, native Kubernetes) to start new task manager(s); this is not available for standalone clusters.
- Flink restarts the whole job from its latest checkpoint or savepoint, whichever is more recent.
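The restart strategy itself can be set per job through the execution environment. Below is a minimal sketch using Flink's Java DataStream API; the attempt count, delay, and the placeholder pipeline are illustrative, not recommendations:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FixedDelayRestartJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // On failure, retry the whole job up to 3 times, waiting 10 seconds
        // between attempts; both values are illustrative placeholders.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        env.fromElements(1, 2, 3).print();  // placeholder pipeline
        env.execute("job-with-fixed-delay-restart");
    }
}
```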
Please note that Flink's fault-tolerance model is based on checkpoints: you configure how often checkpoints are triggered, and each operator then writes its Flink-managed state into a checkpoint on some distributed storage, essentially any distributed file system of your choice. Please refer to Flink's checkpointing documentation for configuration details; a minimal example follows below.
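As an illustration, checkpointing can also be enabled programmatically. This is a sketch only; the checkpoint interval and the HDFS URI are placeholders, and the file-system state backend shown here is one of several backends Flink offers:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (interval is illustrative).
        env.enableCheckpointing(60_000L);

        // Write checkpoints to a distributed file system; the URI is a
        // placeholder - point it at your own HDFS/S3/... location.
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        env.fromElements("a", "b", "c").print();  // placeholder pipeline
        env.execute("job-with-checkpointing");
    }
}
```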
Tip: Local recovery may considerably speed up recovery times on the remaining task managers and should be stable starting with Flink 1.6.3.
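Local recovery is controlled by the state.backend.local-recovery option, which would normally be set in flink-conf.yaml. The programmatic equivalent is sketched below, assuming Flink 1.6+ where the option is exposed as a boolean ConfigOption:

```java
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;

public class LocalRecoveryConfig {
    public static void main(String[] args) {
        // Programmatic equivalent of putting
        //   state.backend.local-recovery: true
        // into flink-conf.yaml, which is where the setting would normally live.
        Configuration conf = new Configuration();
        conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true);

        System.out.println(CheckpointingOptions.LOCAL_RECOVERY.key()
                + " = " + conf.getBoolean(CheckpointingOptions.LOCAL_RECOVERY));
    }
}
```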
Apache Flink with Ververica Platform
Note: This section applies to Ververica Platform 1.0 or later.
Ververica Platform uses Kubernetes for its deployments and creates TaskManager pods through a replica set derived from the deployment configuration. It is then the responsibility of Kubernetes to manage these pods, including any restarts necessary to sustain the configured number of replicas.
Tip: For faster recovery, spare TaskManager pods may be configured by setting the number of task managers manually in the deployment specification. By default, the number of task managers is equal to the parallelism, and there is one slot per task manager.
While pods are restarting, Flink will try to restart the job as described above. This may fail, for example, because Flink's restart strategy has been exhausted, in which case the Flink job enters a terminal state such as FAILED or CANCELLED. At that point, Ververica Platform steps in and tries to deploy the job according to its restore strategy in a completely new cluster; the old one is torn down.
Tip: Ververica Platform's restore strategy can serve as a safety net for Flink's restart strategy: if user code (or Flink itself) leaks memory in the task or job manager instances on every restart, the job will eventually fail to restart. A restart strategy that eventually fails the job will then cause Ververica Platform to start over with completely fresh instances (see the sketch after the next note).
Important: If Ververica Platform restarts a deployment, it cannot make use of Flink's local recovery! Consider configuring Flink's restart strategy so that the job first tries to restart in place, where local recovery can apply, as in the sketch below. See Flink's restart strategy documentation for various tuning options.
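Putting both tips together: a failure-rate restart strategy lets Flink retry in place (where local recovery can kick in), but still fails the job once failures become too frequent, handing control over to Ververica Platform's restore strategy. This is a minimal sketch and all thresholds are illustrative placeholders:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartThenRestore {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart in place as long as no more than 3 failures occur within
        // 5 minutes, waiting 10 seconds between attempts. If the rate is
        // exceeded, the job transitions to FAILED and Ververica Platform's
        // restore strategy takes over with a fresh cluster.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3,                                // max failures per interval
                Time.of(5, TimeUnit.MINUTES),     // measurement interval
                Time.of(10, TimeUnit.SECONDS)));  // delay between attempts

        env.fromElements(1, 2, 3).print();        // placeholder pipeline
        env.execute("restart-then-restore");
    }
}
```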