Getting Started with Oracle Cloud Infrastructure Data Flow
This tutorial introduces you to Oracle Cloud Infrastructure Data Flow, a service that lets you run any Apache Spark Application at any scale with no infrastructure to deploy or manage.
If you've used Spark before, you'll get more out of this tutorial, but no prior Spark knowledge is required. All Spark applications and data have been provided for you. This tutorial shows how Data Flow makes running Spark applications easy, repeatable, secure, and simple to share across the enterprise. In this tutorial, you learn:
- How to use Java to perform ETL in a Data Flow Application.
- How to use SparkSQL in a SQL Application.
- How to create and run a Python Application to perform a simple machine learning task.
You can also complete this tutorial using spark-submit from the CLI, or using spark-submit with the Java SDK. Here's why Data Flow stands out:
- It's serverless, which means you don't need experts to provision, patch, upgrade or maintain Spark clusters. That means you focus on your Spark code and nothing else.
- It has simple operations and tuning. Access to the Spark UI is a click away and is governed by IAM authorization policies. If a user complains that a job is running too slowly, anyone with access to the Run can open the Spark UI and get to the root cause. For jobs that have already completed, accessing the Spark History Server is just as simple.
- It is great for batch processing. Application output is automatically captured and made available by REST APIs. Do you need to run a four-hour Spark SQL job and load the results into your pipeline management system? In Data Flow, it's just two REST API calls away (a short SDK sketch follows this list).
- It has consolidated control. Data Flow gives you a consolidated view of all Spark applications, who is running them, and how much they consume. Do you want to know which applications are writing the most data and who is running them? Simply sort by the Data Written column. Is a job running for too long? Anyone with the right IAM permissions can see the job and stop it.
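To make the "two REST API calls" point concrete, here is a minimal sketch using the OCI Python SDK (not part of the tutorial steps): it starts a Run for an existing Application, waits for it to finish, and lists the captured output logs. The OCIDs are placeholders, and the exact log names depend on your Run.

```python
# Minimal sketch using the OCI Python SDK; the application and compartment
# OCIDs below are placeholders you would replace with your own.
import oci

config = oci.config.from_file()                  # reads ~/.oci/config
client = oci.data_flow.DataFlowClient(config)

run = client.create_run(oci.data_flow.models.CreateRunDetails(
    application_id="ocid1.dataflowapplication.oc1..example",  # placeholder
    compartment_id="ocid1.compartment.oc1..example",          # placeholder
    display_name="tutorial-example-run",
)).data

# Wait for the Run to complete, then list its captured output logs.
run = oci.wait_until(client, client.get_run(run.id),
                     "lifecycle_state", "SUCCEEDED",
                     max_wait_seconds=3600).data
for log in client.list_run_logs(run.id).data:
    print(log.name)   # for example, spark_application_stdout.log.gz
```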
Before You Begin
To successfully perform this tutorial, you must have Set Up Your Tenancy and be able to Access Data Flow.
Before Data Flow can run, you must grant permissions that allow effective log capture and run management. See the Set Up Administration section of the Data Flow Service Guide, and follow the instructions given there. To open Data Flow in the Console:
- From the Console, click the navigation menu to display the list of available services.
- Click Analytics & AI.
- Under Big Data, click Data Flow.
- Click Applications.
1. ETL with Java
An exercise to learn how to create a Java application in Data Flow.
The steps here are for using the Console UI. You can complete this exercise using spark-submit from the CLI or spark-submit with the Java SDK.
The most common first step in data processing applications is to take data from some source and get it into a format that's suitable for reporting and other forms of analytics. In a database, you would load a flat file into the database and create indexes. In Spark, your first step is to clean and convert data from a text format into Parquet format. Parquet is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics. In this exercise, you take source data, convert it into Parquet, and then do a few interesting things with it. The dataset is the Berlin Airbnb Data dataset, downloaded from the Kaggle website under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license.
The data is provided in CSV format, and the first step is to convert it to Parquet and store it in object storage for downstream processing. A Spark application, called oow-lab-2019-java-etl-1.0-SNAPSHOT.jar, is provided to make this conversion. The objective is to create a Data Flow Application that runs this Spark application, and to run it with the correct parameters. Because you're starting out, this exercise guides you step by step and provides the parameters you need. Later you need to provide the parameters yourself, so you must understand what you're entering and why.
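The conversion itself is handled by the provided Java application (convert.Convert), so you don't write Spark code in this exercise. For orientation only, a rough PySpark sketch of the same CSV-to-Parquet idea looks like this; the read options and output path are assumptions you would adjust:

```python
# Illustrative sketch only; the tutorial uses the provided Java application.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet-sketch").getOrCreate()

input_path = ("oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/"
              "usercontent/kaggle_berlin_airbnb_listings_summary.csv")
output_path = "oci://<yourbucket>@<namespace>/optimized_listings"  # customize

# Read the CSV with a header row, let Spark infer column types, and
# rewrite the data as Parquet for efficient downstream reads.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(input_path))
df.write.mode("overwrite").parquet(output_path)
```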
Create a Data Flow Java Application from the Console, with spark-submit from the command line, or using the SDK.
Create a Java application in Data Flow from the Console.
Create a Data Flow Application.
- Navigate to the Data Flow service in the Console by expanding the hamburger menu on the top left and scrolling to the bottom.
- Highlight Data Flow, then select Applications. Choose a compartment where you want the Data Flow applications to be created. Finally, click Create Application.
- Select Java Application and enter a name for the Application, for example, Tutorial Example 1.
- Scroll down to Resource Configuration. Leave all these values as their defaults.
- Scroll down to Application Configuration. Configure the application as follows:
- File URL: the location of the JAR file in object storage. The location for this application is: oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar
- Main Class Name: Java applications need a Main Class Name, which depends on the application. For this exercise, enter convert.Convert
- Arguments: The Spark application expects two command line parameters, one for the input and one for the output. In the Arguments field, enter ${input} ${output}. You're prompted for default values, and it's a good idea to enter them now.
- The input and output arguments are:
  - Input: oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv
  - Output: oci://<yourbucket>@<namespace>/optimized_listings
Double-check the Application configuration to confirm it matches these values.
Note: You must customize the output path to point to a bucket in your tenancy.
- When done, click Create. When the Application is created, you see it in the Application list.
Congratulations! You've created your first Data Flow Application. Now you can run it.
Use spark-submit and the CLI to create a Java Application.
Complete the exercise to create a Java application in Data Flow using spark-submit and the Java SDK.
These are the files needed to run this exercise, and they're available at the following public Object Storage URIs:
- Input file in CSV format: oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv
- JAR file: oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar
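As a programmatic alternative to the CLI or Java SDK steps, the following sketch uses the OCI Python SDK and assumes the spark-submit compatible execute field on CreateRunDetails; the compartment OCID and the output bucket are placeholders.

```python
# Sketch of a spark-submit style Run via the OCI Python SDK (the exercise
# itself uses the CLI or Java SDK). OCIDs and the output bucket are placeholders.
import oci

client = oci.data_flow.DataFlowClient(oci.config.from_file())

execute_string = (
    "--class convert.Convert "
    "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/"
    "oow-lab-2019-java-etl-1.0-SNAPSHOT.jar "
    "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/"
    "kaggle_berlin_airbnb_listings_summary.csv "
    "oci://<yourbucket>@<namespace>/optimized_listings"
)

run = client.create_run(oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",   # placeholder
    display_name="tutorial-example-1-spark-submit",
    execute=execute_string,                             # spark-submit style options
)).data
print(run.id, run.lifecycle_state)
```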
Having created a Java application, you can run it.
- If you followed the steps precisely, all you need to do is highlight your Application in the list, click the Actions menu, and click Run.
- You can customize parameters before running the Application. In this case, you entered the precise values ahead of time, so you can start the Run by clicking Run.
While the Application is running, you can optionally load the Spark UI to monitor progress. From the Actions menu for the run in question, select Spark UI.
- You're automatically redirected to the Apache Spark UI, which is useful for debugging and performance tuning.
After a minute or so, your Run should show successful completion with a State of Succeeded. Drill into the Run to see more details, and scroll to the bottom to see a listing of logs.
When you click the spark_application_stdout.log.gz file, you see the log output from the conversion.
- You can also navigate to your output object storage bucket to confirm that new files have been created.
These new files are used by later applications. Ensure you can see them in your bucket before moving on to the next exercises. A quick way to check with the SDK is sketched below.
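One way to check, assuming you use the OCI Python SDK and the bucket from your output path, is a quick object listing:

```python
# List the Parquet files the Run wrote, using the OCI Object Storage SDK.
# Replace <yourbucket> with the bucket you used in the output path.
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

listing = object_storage.list_objects(namespace, "<yourbucket>",
                                      prefix="optimized_listings/").data
for obj in listing.objects:
    print(obj.name)
```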
2. SparkSQL Made Simple
In this exercise, you run a SQL script to perform basic profiling of a dataset.
This exercise uses the output you generated in 1. ETL with Java. You must have completed it successfully before you can try this one.
The steps here are for using the Console UI. You can complete this exercise using spark-submit from the CLI or spark-submit with the Java SDK.
As with other Data Flow Applications, SQL files are stored in object storage and might be shared among many SQL users. To help with this, Data Flow lets you parameterize SQL scripts and customize them at runtime. As with other applications, you can supply default values for parameters, which often serve as valuable clues to people running these scripts.
The SQL script is available for use directly in the Data Flow Application; you don't need to create a copy of it. The script is reproduced here to illustrate a few points.
Reference text of the SparkSQL Script:
- The script begins by creating the SQL tables we need. Currently, Data Flow doesn't have a persistent SQL catalog, so all scripts must begin by defining the tables they require.
- The table's location is set as ${location}. This is a parameter that the user supplies at runtime. It gives Data Flow the flexibility to use one script to process many different locations and to share code among different users. For this lab, we must customize ${location} to point to the output location we used in Exercise 1.
- As we'll see, the SQL script's output is captured and made available to us under the Run.
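A rough PySpark sketch of the pattern these points describe (not the provided SQL script; the table name and column names are assumptions about the converted dataset):

```python
# Illustrative sketch of the pattern described above, not the provided script.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# The ${location} parameter: the Parquet output from Exercise 1.
location = "oci://<yourbucket>@<namespace>/optimized_listings"

# No persistent SQL catalog, so the table is defined at the start of the script.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS listings
    USING PARQUET
    LOCATION '{location}'
""")

# A simple profiling query; column names are assumptions about the dataset.
spark.sql("""
    SELECT neighbourhood_group_cleansed, AVG(price) AS avg_price
    FROM listings
    GROUP BY neighbourhood_group_cleansed
    ORDER BY avg_price
""").show()
```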
- In Data Flow, create a SQL Application, select SQL as the type, and accept the default resources.
- Under Application Configuration, configure the SQL Application as follows:
- File URL: the location of the SQL file in object storage. The location for this application is: oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_sparksql_report.sql
- Arguments: The SQL script expects one parameter, the location of the output from the prior step. Click Add Parameter and enter a parameter named location with the value you used as the output path in step a, based on the template oci://[bucket]@[namespace]/optimized_listings
When you're done, confirm that the Application configuration matches these values.
- Customize the location value to a valid path in your tenancy.
- Save the Application and run it from the Applications list.
- After the Run is complete, open the Run:
- Navigate to the Run logs:
- Open spark_application_stdout.log.gz and confirm that the output agrees with the following output.
  Note: Your rows might be in a different order from the picture, but the values should agree.
- Based on your SQL profiling, you can conclude that, in this dataset, Neukölln has the lowest average listing price at $46.57, while Charlottenburg-Wilmersdorf has the highest average at $114.27. (Note: the source dataset has prices in USD rather than EUR.)
This exercise has shown some key aspects of Data Flow. When a SQL application is in place, anyone can run it without worrying about cluster capacity, data access and retention, credential management, or other security considerations. For example, a business analyst can easily use Spark-based reporting with Data Flow.
3. Machine Learning with PySpark
Use PySpark to perform a simple machine learning task over input data.
This exercise uses the output from 1. ETL with Java as its input data. You must have successfully completed the first exercise before you can try this one. This time, your objective is to identify the best bargains among the various Airbnb listings using Spark machine learning algorithms.
The steps here are for using the Console UI. You can complete this exercise using spark-submit from the CLI or spark-submit with the Java SDK.
A PySpark application is available for you to use directly in your Data Flow Applications. You don't need to create a copy.
Reference text of the PySpark script is provided here to illustrate a few points:
- The Python script expects a command line argument for the input path. When you create the Data Flow Application, you create a parameter that the user sets to this path.
- The script uses linear regression to predict a price per listing, and finds the best bargains by subtracting the predicted price from the list price. The most negative values, where the list price falls far below the prediction, indicate the best value, per the model (a minimal sketch of this approach follows this list).
- The model in this script is simplified, and only considers square footage. In a real setting, you would use more variables, such as the neighborhood and other important predictor variables.
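A minimal sketch of this approach (not the provided script), assuming the Exercise 1 output has numeric square_feet and price columns plus an id column:

```python
# Minimal sketch of the described approach, not the provided script.
# Assumes numeric square_feet and price columns and an id column.
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("bargain-sketch").getOrCreate()

# The input path arrives as the command line argument (the Application parameter).
df = spark.read.parquet(sys.argv[1]).dropna(subset=["square_feet", "price"])

# A deliberately simple model: square footage is the only feature.
assembled = VectorAssembler(inputCols=["square_feet"],
                            outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="price").fit(assembled)
predictions = model.transform(assembled)

# value = list price - predicted price; the most negative rows are the bargains.
(predictions
 .withColumn("value", col("price") - col("prediction"))
 .orderBy("value")
 .select("id", "price", "prediction", "square_feet")
 .show(10))
```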
Create a PySpark Application from the Console, with spark-submit from the command line, or using the SDK.
Create a PySpark application in Data Flow using the Console.
Create a PySpark application in Data Flow using spark-submit and the CLI.
Create a PySpark application in Data Flow using spark-submit and the SDK.
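For the SDK route, a hedged sketch using the OCI Python SDK follows; the script URI, OCIDs, shapes, Spark version, and parameter name are placeholders or assumptions you would replace with the values from this exercise.

```python
# Sketch only: create the PySpark Application with the OCI Python SDK.
# Every value marked "placeholder" or "assumption" must be replaced.
import oci

client = oci.data_flow.DataFlowClient(oci.config.from_file())

app = client.create_application(oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..example",            # placeholder
    display_name="Tutorial Example 3",
    language="PYTHON",
    file_uri="oci://<bucket>@<namespace>/path/to/the_pyspark_script.py",  # placeholder
    spark_version="3.2.1",            # assumption: a Spark version Data Flow supports
    driver_shape="VM.Standard2.1",    # assumption
    executor_shape="VM.Standard2.1",  # assumption
    num_executors=1,
    arguments=["${location}"],
    parameters=[oci.data_flow.models.ApplicationParameter(
        name="location",              # assumption: the input-path parameter
        value="oci://<yourbucket>@<namespace>/optimized_listings",
    )],
)).data
print(app.id)
```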
- Run the Application from the Application list.
When the Run completes, open it and navigate to the logs.
- Open the spark_application_stdout.log.gz file. Your output should be identical to the following:
From this output, you see that listing ID 690578 is the best bargain, with a predicted price of $313.70 compared to the list price of $35.00 and a listed square footage of 4639 square feet. If it sounds a little too good to be true, the unique ID means you can drill into the data to better understand whether it really is the steal of the century. Again, a business analyst could easily consume the output of this machine learning algorithm to further their analysis.
What's Next
Now you can create and run Java, Python, or SQL applications with Data Flow, and explore the results.
Data Flow handles all details of deployment, tear down, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.