Question
What is the process for scaling a running Flink job in or out, and how do the different upgrade and restore strategies of Ververica Platform play together with this?
Answer
Plain Apache Flink
Note: This section applies to Flink 1.5 or later.
In order to re-scale any Flink job:
- take a savepoint,
- stop the job,
- restart from the previously taken savepoint using any
parallelism <= maxParallelism
.
Since Flink 1.5, flink modify --parallelism <newParallelism>
may be used to change the parallelism in one command. It will try to perform these actions in one go. If taking a savepoint fails, the whole operation will fail.
If you are using Flink 1.13 or later, the MVP (“minimum viable product”) feature of reactive mode can also be used to scale jobs up and down automatically by allocating more or fewer TM managers.
Apache Flink with Ververica Platform
Note: This section applies to Ververica Platform 1.3 or later.
Ververica Platform exposes rescaling operations with a few more options and works slightly differently: Simply edit an existing deployment, change the parallelism, and apply changes. Ververica Platform will perform the required operations in the background.
Two properties of the deployment influence the scaling behavior: spec.upgradeStrategy
and spec.restoreStrategy
.
- The restore strategy defines the state to start with when transitioning the job into Ververica Platform's
RUNNING
state. UseLATEST_STATE
, for example, to start from the latest successful checkpoint or savepoint known to Ververica Platform. - The upgrade strategy defines what happens if the deployment specification is updated, for example, by changing the parallelism.
STATELESS
will terminate the currently running job immediately and start a new job based on the restore strategy above.STATEFUL
will first take a savepoint and then continue the same way. If taking the savepoint fails (after retries), the upgrade process stops and the job continues running.
Ververica Platform will retry the operation in case of failures.
Tip: Ververica Platform 2.2 (and later) Enterprise Edition adds a new Autopilot feature which comes with autoscaling support for your Flink 1.11 or newer deployments.