What is the process for scaling a running Flink job in or out and how do the different upgrade and restore strategies of Ververica Platform play together with this?
Plain Apache Flink
Note: This section applies to Flink 1.5 - 1.10.
In order to rescale any Flink job:
- take a savepoint,
- stop the job,
- restart from the previously taken savepoint using any parallelism <= maxParallelism.
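The steps above can be sketched with the Flink CLI; the job ID, savepoint directory, savepoint path, and jar name below are placeholders, and exact flags vary slightly across the 1.5 - 1.10 range:

```shell
# 1. Take a savepoint of the running job; on success the CLI prints the savepoint path.
flink savepoint <jobId> [savepointDirectory]

# 2. Stop the job. (Alternatively, `flink cancel -s [savepointDirectory] <jobId>`
#    combines taking a savepoint with cancellation in one step.)
flink cancel <jobId>

# 3. Restart from the savepoint with the new parallelism,
#    which must be <= the job's maxParallelism.
flink run -s <savepointPath> -p <newParallelism> my-job.jar
```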
Since Flink 1.5, flink modify --parallelism <newParallelism> may be used to perform all of these steps with a single command. If taking the savepoint fails, the whole operation fails.
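As a sketch, with the job ID and target parallelism as placeholders:

```shell
# Rescale the job in one step: Flink takes a savepoint, stops the job,
# and restarts it from that savepoint with the new parallelism.
flink modify <jobId> --parallelism 8
```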
There have been discussions in the Flink community about supporting dynamic scaling of running jobs, but this is not available yet.
Apache Flink with Ververica Platform
Note: This section applies to Ververica Platform 1.3 - 2.1.
Ververica Platform exposes rescaling with a few more options and works slightly differently: simply edit an existing deployment, change its parallelism, and apply the changes. Ververica Platform performs the required operations in the background.
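As an illustration, the parallelism is part of the deployment specification; a minimal sketch of the relevant fields (field names follow the Ververica Platform deployment spec, the name and value are examples) might look like:

```yaml
kind: Deployment
metadata:
  name: my-flink-job        # example deployment name
spec:
  template:
    spec:
      parallelism: 8        # edit this value and apply the change to trigger a rescale
```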
Two properties of the deployment influence the scaling behaviour:
- The restore strategy defines which state to start from when the job is (re)started; for example, LATEST_STATE starts from the latest successful checkpoint or savepoint known to Ververica Platform.
- The upgrade strategy defines what happens if the deployment specification is updated, for example by changing the parallelism.
  - STATELESS will terminate the currently running job immediately and start a new job based on the restore strategy above.
  - STATEFUL will first take a savepoint and then continue the same way. If taking the savepoint fails (after retries), the upgrade process stops there and the job continues running.
Ververica Platform will retry the operation in case of failures.
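Both properties live in the deployment specification; a sketch under the assumption that the field names match the Ververica Platform deployment spec (the chosen values are examples):

```yaml
spec:
  restoreStrategy:
    kind: LATEST_STATE      # restore from the latest checkpoint or savepoint
  upgradeStrategy:
    kind: STATEFUL          # take a savepoint first, then restart with the new spec
```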