Spark-Submit Functionality in Data Flow
Find out how to use Data Flow with Spark-submit.
Spark-Submit Compatibility
You can use spark-submit compatible options to run your applications with Data Flow.
Spark-submit is an industry-standard command for running applications on Spark clusters. Data Flow supports the following spark-submit compatible options:
- --conf
- --files
- --py-files
- --jars
- --class
- --driver-java-options
- --packages
- main-application.jar or main-application.py
- arguments to main-application: arguments passed to the main method of your main class (if any).
The --files option flattens your file hierarchy, so all files are placed
at the same level in the current working directory. To keep the file hierarchy, use
either archive.zip, or --py-files with a JAR, ZIP, or
EGG dependency module.
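For example, a minimal run submit sketch that keeps a Python package hierarchy by shipping it as a ZIP through --py-files (the bucket, namespace, and file names are placeholders):

oci data-flow run submit \
--compartment-id <compartment-id> \
--execute "--py-files oci://<bucket-name>@<namespace>/dependencies.zip
oci://<bucket-name>@<namespace>/main_application.py"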
The --packages option includes any other dependencies by
supplying a comma-delimited list of Maven coordinates, for example,
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.2. All
transitive dependencies are handled when using this option. With the
--packages option, each Run's driver pod needs to download
dependencies dynamically, which relies on network stability and access to Maven
Central or other remote repositories. Use the Data Flow Dependency Packager to generate a
dependency archive for production.
For all spark-submit options on Data Flow, the URI
must begin with oci://.... URIs starting with
local://... or hdfs://... aren't supported.
Use fully qualified domain names (FQDN) in the URIs. Load all files, including
the main application, to Oracle Cloud Infrastructure Object Storage.
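For example, a minimal run submit sketch that pulls the Kafka connector through --packages (the bucket, namespace, and application file name are placeholders):

oci data-flow run submit \
--compartment-id <compartment-id> \
--execute "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.2
oci://<bucket-name>@<namespace>/main_application.py"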
Creating a Spark-Submit Data Flow Application explains how to create an application in the
Console using spark-submit. You can also use
spark-submit with the Java SDK or from the CLI. If you're using the CLI, you don't have to
create a Data Flow Application to run your Spark
application with spark-submit compatible options on Data Flow. This is useful if you already have a working
spark-submit command in a different environment. When you follow the syntax of the
run submit command, an Application is created if one doesn't
already exist for the main-application URI.
Installing the Public CLI with the run submit Command
These steps are needed to install the public CLI with the run submit
command for use with Data Flow:
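A minimal sketch of one common install path, assuming a Python environment and token-based (security_token) authentication; the profile name and region are placeholders:

# Install the OCI CLI from PyPI (the Oracle installer script is an alternative).
pip install oci-cli

# Create a session-token profile to use with --auth security_token.
oci session authenticate --profile-name oci-cli --region <region>

# Confirm that the run submit command is available.
oci data-flow run submit --help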
Using Spark-submit in Data Flow
You can take your spark-submit CLI command and convert it into a compatible Data Flow
run submit command. If you already have a working Spark application
in any cluster, you're familiar with the spark-submit syntax. For example:
spark-submit --master spark://<IP-address>:port \
--deploy-mode cluster \
--conf spark.sql.crossJoin.enabled=true \
--files oci://file1.json \
--class org.apache.spark.examples.SparkPi \
--jars oci://file2.jar <path_to>/main_application_with-dependencies.jar 1000

The equivalent Data Flow command is:

oci data-flow run submit \
--compartment-id <compartment-id> \
--execute "--conf spark.sql.crossJoin.enabled=true
--files oci://<bucket-name>@<namespace>/path/to/file1.json
--jars oci://<bucket-name>@<namespace>/path/to/file2.jar
oci://<bucket-name>@<namespace>/path_to_main_application_with-dependencies.jar 1000"

To convert a spark-submit command:
- Upload all the files, including the main application, to Object Storage.
- Replace the existing URIs with the corresponding oci://... URIs.
- Remove any unsupported or reserved spark-submit parameters. For example, --master and --deploy-mode are reserved for Data Flow, so you don't need to populate them.
- Add the --execute parameter and pass in a spark-submit compatible command string. To build the --execute string, keep the supported spark-submit parameters, followed by main-application and its arguments, in sequence. Put them inside a quoted string (single quotes or double quotes).
- Replace spark-submit with the Oracle Cloud Infrastructure standard command prefix, oci data-flow run submit.
- Add the Oracle Cloud Infrastructure mandatory argument and parameter pairs for --profile, --auth security_token, and --compartment-id, as shown in the sketch after this list.
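Putting the steps together, the converted command looks like this sketch (the profile name, bucket, and namespace are placeholders):

oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--execute "--conf spark.sql.crossJoin.enabled=true
--files oci://<bucket-name>@<namespace>/path/to/file1.json
--jars oci://<bucket-name>@<namespace>/path/to/file2.jar
oci://<bucket-name>@<namespace>/path_to_main_application_with-dependencies.jar 1000"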
Run Submit Examples
Some examples of run submit in Data Flow.
Oci-cli Examples
Examples of run submit using oci-cli in Data Flow.
oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--execute "--conf spark.sql.crossJoin.enabled=true
--class org.apache.spark.examples.SparkPi
oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"

--jars, --files, and --py-files:

oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--execute "--jars oci://<bucket-name>@<tenancy-name>/a.jar
--files \"oci://<bucket-name>@<tenancy-name>/b.json\"
--py-files oci://<bucket-name>@<tenancy-name>/c.py
--conf spark.sql.crossJoin.enabled=true
--class org.apache.spark.examples.SparkPi
oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"

archiveUri, --jars, --files, and --py-files:

oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--archive-uri "oci://<bucket-name>@<tenancy-name>/mmlspark_original.zip" \
--execute "--jars local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar
--files \"local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar\"
--py-files local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar
--conf spark.sql.crossJoin.enabled=true
--class org.apache.spark.examples.SparkPi
oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"

jars, files, and py-files with an invalid Object Storage URI:

oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--archive-uri "oci://<bucket-name>@<tenancy-name>/mmlspark_original.zip" \
--execute "--jars oci://<bucket-name>@<tenancy-name>/fake.jar
--conf spark.sql.crossJoin.enabled=true
--class org.apache.spark.examples.SparkPi
oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"
#result
{'opc-request-id': '<opc-request-id>', 'code': 'InvalidParameter',
'message': 'Invalid OCI Object Storage uri. The object was not found or you are not authorized to access it.
{ResultCode: OBJECTSTORAGE_URI_INVALID,
Parameters: [oci://<bucket-name>@<tenancy-name>/fake.jar]}', 'status': 400}

To enable Resource Principal authentication, set the Spark property with the spark-submit --conf option by adding the following configuration to the --execute string:
--execute "--conf dataflow.auth=resource_principal --conf other-spark-property=other-value"
Oci-curl Example
An example of run submit using oci-curl in Data Flow.
oci-curl <IP-Address>:443 POST /Users/<user-name>/workspace/sss/dependency_test/spark-submit-test.json /latest/runs --insecure --noproxy <IP-Address>

where spark-submit-test.json contains:
{
"execute": "--jars local:///opt/dataflow/java/mmlspark_2.11-0.18.1.jar
--files \"local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar\"
--py-files local:///opt/spark/conf/spark.properties
--conf spark.sql.crossJoin.enabled=true
--class org.apache.spark.examples.SparkPi
oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10",
"displayName": "spark-submit-test",
"sparkVersion": "2.4",
"driverShape": "VM.Standard2.1",
"executorShape": "VM.Standard2.1",
"numExecutors": 1,
"logsBucketUri": "",
"freeformTags": {},
"definedTags": {}
}