Migrate Data Flow to Spark 3.2.1
Follow these steps to migrate Data Flow to using Spark 3.2.1.
To use Data Flow with Delta Lake 1.2.1 and to integrate with Conda Pack, you must use at least version 3.2.1 of Spark with Data Flow.
Follow the instructions in the Spark 3.2.1 Migration Guide to upgrade to Spark 3.2.1.
In addition to the supported versions information in Before you Begin with Data Flow, the following library versions are the minimum supported by Data Flow with Spark 3.2.1 and with Spark 3.0.2.
Note
Build applications using the versions listed for Spark 3.0.2 before migrating to Spark 3.2.1.
Library | Spark 3.2.1 | Spark 3.0.2 |
---|---|---|
Python | 3.8.13 | 3.6.8 |
Java | 11 | 1.8.0_321 |
Hadoop | 3.3.1 | 3.2.0 |
Scala | 2.12.15 | 2.12.10 |
oci-hdfs | 3.3.1.0.3.2 | 3.2.1.3 |
oci-java-sdk | 2.45.0 | 1.25.2 |
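Before migrating, it can help to verify that an application's runtime libraries meet the Spark 3.2.1 minimums in the table. The following sketch is illustrative only (the `MINIMUMS` dict and helper names are not part of any Data Flow API); it compares dotted version strings numerically:

```python
# Minimum versions for Spark 3.2.1, mirroring the table above.
# This dict and the helpers below are illustrative, not a Data Flow API.
MINIMUMS = {
    "Python": "3.8.13",
    "Hadoop": "3.3.1",
    "Scala": "2.12.15",
}

def parse_version(text):
    """Split a dotted version string into a tuple of integers for comparison."""
    return tuple(int(part) for part in text.split("."))

def meets_minimum(actual, minimum):
    """Return True if the actual version is at least the minimum version."""
    return parse_version(actual) >= parse_version(minimum)

# Example: the Spark 3.0.2 Python version does not meet the 3.2.1 minimum.
print(meets_minimum("3.8.13", MINIMUMS["Python"]))  # True
print(meets_minimum("3.6.8", MINIMUMS["Python"]))   # False
```

Tuple comparison handles versions with differing part counts (such as the four-part oci-hdfs versions) as long as every part is numeric.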
Performance Settings for Spark 3.2.1
If using Spark 3.2.1, set two parameters to maximize performance.
By default, the Oracle Cloud Infrastructure Java SDK uses the ApacheConnector, which can buffer requests in memory. Instead, use the Jersey HttpUrlConnector by setting the following parameters:
spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED=true
spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED=true
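In a PySpark application these two properties can be supplied as ordinary Spark configuration entries. A minimal sketch, assuming the settings are applied when the session is built (the `apply_settings` helper is illustrative, not a Data Flow API):

```python
# The two Jersey connector settings from above, as a Spark configuration dict.
JERSEY_CONNECTOR_SETTINGS = {
    "spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
    "spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
}

def apply_settings(builder, settings):
    """Apply each key/value pair to a SparkSession.Builder-like object."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder

# With PySpark available, the settings would be applied like this:
# from pyspark.sql import SparkSession
# spark = apply_settings(SparkSession.builder, JERSEY_CONNECTOR_SETTINGS).getOrCreate()
```

The same key/value pairs can instead be set in the Spark configuration properties of a Data Flow application, so the code itself stays unchanged.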