I want to configure Kubernetes High-Availability (HA) for my Flink cluster. Could you tell me how Ververica Platform's Kubernetes HA and Apache Flink's Kubernetes HA work? What are the differences between these two solutions?
Note: This article applies to Flink 1.8 and later, running in Ververica Platform 2.0+.
Since Ververica Platform 2.8, we have integrated Apache Flink Kubernetes HA and adapted it to support the LATEST_STATE restore strategy in Ververica Platform. All users of Ververica Platform 2.8 or later are recommended to use Flink Kubernetes HA for their deployments.
tldr; See Summary
What is High Availability in Flink in General
For a Flink cluster without HA, if the JM crashes, no new jobs can be submitted and all running jobs will fail. After HA is enabled, new jobs can be submitted immediately when a new JM is ready. Running jobs will continue to make progress after one more step - restarting from the previous checkpoint. Yes, you can consider it as the job's downtime. In Flink, you must always expect restarts and tune your application such that your SLAs still hold with a sporadic restart.
Apache Flink's Kubernetes HA
Apache Flink's Kubernetes HA can be activated by the following Flink configuration:
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory high-availability.storageDir: /path/to/ha/store kubernetes.cluster-id: cluster-id
Typically, in this approach, there is a single leading JobManager at any time and multiple standby JobManagers to take over leadership in case the leader fails.
Ververica Platform's Kubernetes HA
Ververica Platform's Kubernetes HA can be activated by
##Ververica Platform 2.8 or later
kind: Deployment # this is Ververica Platform deployment, not the Kubernetes one spec: template: spec: flinkConfiguration: high-availability: kubernetes high-availability.storageDir: /path/to/ha/store # optional, see more below high-availability.vvp-kubernetes.job-graph-store.enabled: true # optional,see more below
##Ververica Platform 2.7 or earlier
kind: Deployment # this is Ververica Platform deployment, not the Kubernetes one spec: template: spec: flinkConfiguration: high-availability: vvp-kubernetes high-availability.storageDir: /path/to/ha/store # optional, see more below high-availability.vvp-kubernetes.job-graph-store.enabled: true # optional,see more below
Unlike Apache Flink's Kubernetes HA, this approach starts a single JobManager (JM) and manages it with a Kubernetes Job resource with the following specification:
kind: Job spec: completions: 1 parallelism: 1
Whenever that JM crashes, i.e., not exited with 0, Kubernetes will launch another JM as a replacement. With the help of high availability storage, the new JM knows where to start. The high availability storage is configured automatically if Ververica Platform's Universal Blob Storage is configured.
In addition, with the configuration `high-availability.kubernetes.job-graph-store.enabled: true` or `high-availability.vvp-kubernetes.job-graph-store.enabled: true`, you can also store the generated job graph in the high availability storage. Upon JM failover, instead of re-executing your job's `main()` to generate the job graph, the stored job graph will be used.
In all Kubernetes HA approaches described above, when the active JM (the leading JM in the case of Apache Flink's Kubernetes HA, the single JM in the case of Ververica Platform's original and new Kubernetes HA) is lost, the TaskManagers (TMs) who are talking to the failed JM will fail the tasks running on them. This leads to a job restart from the previous checkpoint. Flink's restart strategy controls how restarts are handled.
In terms of failover and recovery time, Apache Flink's Kubernetes HA can switch over to one of the stand-by JM quickly as it is already started. With Ververica Platform's Kubernetes HA, a new JM pod needs to be launched and started. But the execution of the job's `main()` can be skipped if you enable the job graph store. The actual recovery time depends on your job and your cluster environment.
Another difference is that Ververica Platform's Kubernetes HA runs a single leader election process for the entire JM process while Apache Flink's Kubernetes HA runs multiple of them, one for Dispatcher, one for ResourceManager, one for JobManager, etc. Flink 1.15 followed the same route as Ververica Platform did to implement a multiple-component leader election service. See FLINK-24038 for more details.
The biggest advantage of Ververica Platform's Kubernetes HA is that, if you set `RestoreStrategy=LATEST_STATE` in your deployment, even if the job fails completely (e.g., exhausted the configured restart attempts) after the underlying issue is fixed, the job can still restart from the latest state (i.e., the previously completed checkpoint or savepoint).
This table summarizes the comparison between both Kubernetes HA approaches.
|Apache Flink Kubernetes HA
|Ververica Platform Kubernetes HA
|The number of JMs
one active, multiple standbys
|Is the job restarted upon failover?
|Is the job graph regenerated upon failover?
|yes, `main()` is re-executed to generate the job graph
Can (optionally) be stored in the HA store for re-use
|The number of leader election processes in JM
multiple (Flink <1.15)
single (Flink 1.15 or later)
|Can recover from `LATEST_STATE`
FLINK-24038: Flink moves to single leader election in Flink 1.15