Data Flow now supports Pools

  • Services: Data Flow
  • Release Date: June 21, 2023

A Data Flow Pool is a group of pre-allocated Compute resources that you can use to run Data Flow-based Spark workloads with faster startup times.

Use cases:

  • Time-sensitive, large production workloads with many executors that need startup times measured in seconds.
  • Isolating critical production workloads from dynamic development workloads by allocating their resources from different Pools.
  • Cost and usage separation between development and production workloads, with IAM policies that let you submit specific Data Flow Runs to specific Pools.
  • Running a large number of Data Flow Runs back-to-back with less startup time.
  • Queueing Data Flow Runs in a Pool for efficient use of resources and cost control.
  • Automatically starting a Pool on a schedule and automatically terminating it after a period of idle time.
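The queueing and idle-termination use cases above can be pictured with a small toy model. This is only a sketch under stated assumptions: the `ToyPool` class, its fields, and the way it counts executors are hypothetical and are not the Data Flow API, nor a description of how the service schedules Runs internally.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Hypothetical model of a pool with a fixed number of pre-allocated executors."""
    capacity: int              # executors pre-allocated in the pool (assumed unit)
    idle_timeout_minutes: int  # stop the pool after this much idle time
    in_use: int = 0
    idle_minutes: int = 0
    running: bool = True
    queue: deque = field(default_factory=deque)

    def submit(self, run_name: str, executors: int) -> str:
        """Start the run if the pool has room; otherwise queue it."""
        if not self.running:
            raise RuntimeError("pool is stopped")
        if executors > self.capacity:
            raise ValueError("run needs more executors than the pool holds")
        # Runs start in submission order, so nothing may jump a non-empty queue.
        if not self.queue and self.in_use + executors <= self.capacity:
            self.in_use += executors
            self.idle_minutes = 0
            return "started"
        self.queue.append((run_name, executors))
        return "queued"

    def finish(self, executors: int) -> list:
        """Release executors and start queued runs that now fit, in order."""
        self.in_use -= executors
        started = []
        while self.queue and self.in_use + self.queue[0][1] <= self.capacity:
            name, needed = self.queue.popleft()
            self.in_use += needed
            started.append(name)
        return started

    def tick(self, minutes: int) -> None:
        """Advance time; an empty pool stops once it reaches the idle timeout."""
        if self.in_use == 0 and not self.queue:
            self.idle_minutes += minutes
            if self.idle_minutes >= self.idle_timeout_minutes:
                self.running = False
```

In this model, a Run that does not fit waits in the queue instead of failing, and a Pool with no active or queued Runs stops once the idle timeout elapses. Restarting a stopped Pool on a schedule is deliberately left out to keep the sketch short.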

For more information, see the Data Flow Service Limits documentation.