I want to configure Kubernetes High-Availability (HA) for my Flink cluster. Could you tell me how Ververica Platform's Kubernetes HA and Apache Flink's Kubernetes HA work? What are the differences between these two solutions?
Note: This article applies to Flink 1.8-1.14 running in Ververica Platform 2.0-2.6.
tldr; See Summary
What is High Availability in Flink in General
For a Flink cluster without HA, if the JM crashes, no new jobs can be submitted and all running jobs will fail. After HA is enabled, new jobs can be submitted immediately when a new JM is ready. Running jobs will continue to make progress before one more step - restarting from the previous checkpoint. Yes, you can consider it as the job's downtime. In Flink, you must always expect restarts and tune your application such that your SLAs still hold with a sporadic restart.
Apache Flink's Kubernetes HA
Apache Flink's Kubernetes HA can be activated by following Flink configuration:
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory high-availability.storageDir: /path/to/ha/store kubernetes.cluster-id: cluster-id
Typically, in this approach, there is a single leading JobManager at any time and multiple standby JobManagers to take over leadership in case the leader fails.
Ververica Platform's Kubernetes HA
Ververica Platform's Kubernetes HA can be activated by
kind: Deployment # this is Ververica Platform deployment, not the Kubernetes one spec: template: spec: flinkConfiguration: high-availability: vvp-kubernetes high-availability.storageDir: /path/to/ha/store # optional, see more below high-availability.vvp-kubernetes.job-graph-store.enabled: true # optional,see more below
Unlike Apache Flink's Kubernetes HA, this approach starts a single JobManager (JM) and manages it with a Kubernetes Job resource with the following specification:
kind: Job spec: completions: 1 parallelism: 1
Whenever that JM crashes, i.e., not exited with 0, the Kubernetes will launch another JM as a replacement. With the help of high availability storage, the new JM knows where to start. The high availability storage is configured automatically if Ververica Platform's Universal Blob Storage is configured.
In addition, with the configuration `high-availability.vvp-kubernetes.job-graph-store.enabled: true`, you can also store the generated job graph in the high availability storage. Upon JM failover, instead of re-executing your job's `main()` to generate the job graph, the stored job graph will be used.
In both HA approaches, when the active JM (the leading JM in the case of Apache Flink's Kubernetes HA, the single JM in the case of Ververica Platform's Kubernetes HA) is lost, the TaskManagers (TMs) who are talking to the failed JM will fail the tasks running on them. This leads to a job restart from the previous checkpoint. Flink's restart strategy controls how restarts are handled.
In terms of failover and recovery time, Apache Flink's Kubernetes HA can switch over to one of the standby JM quickly as it is started already. With Ververica Platform's Kubernetes HA, a new JM pod needs to be launched and started. But the execution of the job's `main()` can be skipped if you enable the job graph store. The actual recovery time depends on your job and your cluster environment.
Another difference is that Ververica Platform's Kubernetes HA runs a single leader election process for the entire JM process while Apache Flink's Kubernetes HA runs multiple of them, one for Dispatcher, one for ResourceManager, one for JobManager, etc. But Flink will follow the same route as Ververica Platform did to implement a multiple component leader election service in version 1.15. See FLINK-24038 for more details.
The biggest advantage of Ververica Platform's Kubernetes HA is that, if you set `RestoreStrategy=LATEST_STATE` in your deployment, even if the job fails completely (e.g., exhausted the configured restart attempts) after the underlying issue is fixed, the job can still restart from the latest state (i.e. the previously completed checkpoint or savepoint).
This table summarizes the comparison between both Kubernetes HA approaches.
|Apache Flink||Ververica Platform|
|The number of JMs||
one active, multiple standbys
|Is the job restarted upon failover?||yes|
|Is the job graph regenerated upon failover?||yes, `main()` is re-executed to generate the job graph||
Can (optionally) be stored in the HA store for re-use
|The number of leader election processes in JM||
multiple (Flink <1.15)
single (Flink 1.15 or later)
|Can recover from `LATEST_STATE`||
FLINK-24038: Flink moves to single leader election in Flink 1.15