Issue
When a deployment starts in Ververica Platform, multiple Flink jobs may be started. The following TimeoutException may also appear in the Ververica Platform appmanager logs:
2020-04-14 15:15:23.873 DEBUG 1 --- [eduler-worker-1] c.d.a.c.c.drivers.ScheduledController : Exception while invoking the controller DeploymentControllerLogic{deploymentId=12d6b0e7-b9e8-4736-9ff4-d63688cc6882}. Will retry after backing off for 3210 milliseconds
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.google.common.base.Throwables.propagate(Throwables.java:241) ~[guava-28.0-jre.jar!/:na]
at com.dataartisans.appmanager.controller.api.MoreFutures.deref(MoreFutures.java:41) ~[appmanager-controller-api-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.core.JobRepositoryFacade.deref(JobRepositoryFacade.java:147) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.core.JobRepositoryFacade.create(JobRepositoryFacade.java:58) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.domain.deployment.ToRunning.createJob(ToRunning.java:248) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.domain.deployment.ToRunning.tryTransition(ToRunning.java:145) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.domain.deployment.DeploymentControllerLogic.dispatchOnSpecState(DeploymentControllerLogic.java:84) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.domain.deployment.DeploymentControllerLogic.invoke(DeploymentControllerLogic.java:72) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.core.drivers.ScheduledController.invoke(ScheduledController.java:56) ~[appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.core.drivers.SchedulerWorker.invokeSilently(SchedulerWorker.java:50) [appmanager-controller-2.1.0.jar!/:na]
at com.dataartisans.appmanager.controller.core.drivers.SchedulerWorker.run(SchedulerWorker.java:27) [appmanager-controller-2.1.0.jar!/:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_242]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_242]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_242]
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[na:1.8.0_242]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[na:1.8.0_242]
at com.dataartisans.appmanager.controller.api.MoreFutures.deref(MoreFutures.java:39) ~[appmanager-controller-api-2.1.0.jar!/:na]
... 12 common frames omitted
Environment
This happens in Ververica Platform 2.1.0 with persistent volumes backed by Amazon Elastic File System (EFS).
It may also happen in Ververica Platform 2.1.1 or later with EFS if the default timeout is too short; in that case, increase the timeout as described under Resolution.
Resolution
Upgrade to Ververica Platform 2.1.1 or later and, if necessary, set a longer timeout in your values.yaml, e.g.:
vvp:
  appmanager:
    persistence.repository.timeout.ms: 15000
The value is in milliseconds; the default is 10000 (10 seconds). Increase it further if the TimeoutException still occurs.
Cause
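Accessing persistent volumes backed by Amazon EFS can take longer than the appmanager's repository timeout allows, so the job creation request seen in the stack trace times out and is retried.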
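
The stack trace shows the exception being raised from a timed wait on a future (CompletableFuture.get with a timeout). As an illustration only, the following Python sketch mimics that pattern with the standard concurrent.futures module. It is not appmanager code: the function names, return values, and latencies are hypothetical, and the durations are scaled down to fractions of a second.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_repository_write(latency_s):
    """Hypothetical stand-in for a write to a slow persistent volume."""
    time.sleep(latency_s)
    return "job-created"


def create_job(latency_s, timeout_s):
    """Wait for the write for at most timeout_s seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_repository_write, latency_s)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # The caller stops waiting here; the write itself is not
            # cancelled and still runs to completion in the worker thread.
            return "timed-out"


if __name__ == "__main__":
    # Fast storage: the write finishes well within the timeout.
    print(create_job(latency_s=0.01, timeout_s=1.0))
    # Slow storage: the write outlasts the timeout, so the caller gives up.
    print(create_job(latency_s=0.3, timeout_s=0.05))
```

In this sketch, raising timeout_s above the storage latency makes the second call succeed, which is the effect the longer persistence.repository.timeout.ms value is meant to have.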