Spark-SubmitおよびSDKの開始

Java SDKコードを使用し、spark-submitをexecute文字列とともに使用してデータ・フローでSparkアプリケーションの実行を開始する際に役立つチュートリアル。

SDKを使用してデータ・フローでspark-submitを開始します:Oracle Cloud Infrastructureデータ・フローの開始の既存のチュートリアルに従いますが、spark-submitコマンドの実行にはJava SDKを使用します。

開始する前に

Java SDKを使用してデータ・フローでspark-submitコマンドを使用する前に、前提条件を満たします。

テナンシを設定します。
キーを設定して構成します。

mavenプロジェクトを作成し、Oracle Cloud InfrastructureのJava SDK依存関係を追加します:

<dependency>
  <groupId>com.oracle.oci.sdk</groupId>
  <artifactId>oci-java-sdk-dataflow</artifactId>
  <version>${oci-java-sdk-version}</version>
</dependency>

1. Javaを使用したETL

Spark-submitおよびJava SDKを使用して、JavaでETLを実行します。

Spark-submitおよびJava SDKを使用して、Oracle Cloud Infrastructure Data Flowスタート・ガイド・チュートリアルの演習Javaを使用したETLを完了します。

テナンシを設定します。
入力および結果を保存できるバケットがオブジェクト・ストレージにない場合は、適切なフォルダ構造を持つバケットを作成する必要があります。この例では、フォルダ構造は/output/です。

次のコードを実行します:

public class ETLWithJavaExample {
 
  private static Logger logger = LoggerFactory.getLogger(ETLWithJavaExample.class);
  String compartmentId = "<compartment-id>"; // need to change comapartment id
 
  public static void main(String[] ars){
    System.out.println("ETL with JAVA Tutorial");
    new ETLWithJavaExample().createRun();
  }
 
  public void createRun(){
 
    ConfigFileReader.ConfigFile configFile = null;
    // Authentication Using config from ~/.oci/config file
    try {
      configFile = ConfigFileReader.parseDefault();
    }catch (IOException ie){
      logger.error("Need to fix the config for Authentication ", ie);
      return;
    }
 
    try {
    AuthenticationDetailsProvider provider =
        new ConfigFileAuthenticationDetailsProvider(configFile);
 
    // Creating a Data Flow Client
    DataFlowClient client = new DataFlowClient(provider);
    client.setRegion(Region.US_PHOENIX_1);
 
    // creation of execute String
    String executeString = "--class convert.Convert "
        + "--files oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv "
        + "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar "
        + "kaggle_berlin_airbnb_listings_summary.csv oci://<bucket-name>@<namespace-name>/output/optimized_listings";
 
    // Create Run details and create run.
    CreateRunResponse response;
     
    CreateRunDetails runDetails = CreateRunDetails.builder()
        .compartmentId(compartmentId).displayName("Tutorial_1_ETL_with_JAVA").execute(executeString)
        .build();
 
    CreateRunRequest runRequest = CreateRunRequest.builder().createRunDetails(runDetails).build();
    CreateRunResponse response = client.createRun(runRequest);
 
    logger.info("Successful run creation for ETL_with_JAVA with OpcRequestID: "+response.getOpcRequestId()
        +" and Run ID: "+response.getRun().getId());
 
    }catch (Exception e){
      logger.error("Exception creating run for ETL_with_JAVA ", e);
    }
 
  }
}

このチュートリアルを以前に実行したことがある場合は、出力ディレクトリoci://<bucket-name>@<namespace-name>/output/optimized_listingsの内容を削除して、チュートリアルが失敗しないようにします。

ノート

compartment-idを確認するには、ナビゲーション・メニューで「アイデンティティ」をクリックし、「コンパートメント」をクリックします。使用可能なコンパートメントがそれぞれのOCIDも含めてリストされます。

2: PySparkを使用した機械学習

Spark-submitおよびJava SDKを使用して、PySparkで機械学習を実行します。

Spark-submitおよびJava SDKを使用して、3を完了します。Oracle Cloud Infrastructure Data Flowスタート・ガイド・チュートリアルのPySparkによる機械学習。

演習1を完了します。この演習を試行する前に、Javaを使用したETLを実行します。結果がこの演習で使用されます。

次のコードを実行します:

public class PySParkMLExample {
 
  private static Logger logger = LoggerFactory.getLogger(PySParkMLExample.class);
  String compartmentId = "<compartment-id>"; // need to change comapartment id
 
  public static void main(String[] ars){
    System.out.println("ML_PySpark Tutorial");
    new PySParkMLExample().createRun();
  }
 
  public void createRun(){
 
    ConfigFileReader.ConfigFile configFile = null;
    // Authentication Using config from ~/.oci/config file
    try {
      configFile = ConfigFileReader.parseDefault();
    }catch (IOException ie){
      logger.error("Need to fix the config for Authentication ", ie);
      return;
    }
 
    try {
    AuthenticationDetailsProvider provider =
        new ConfigFileAuthenticationDetailsProvider(configFile);
 
    DataFlowClient client = new DataFlowClient(provider);
    client.setRegion(Region.US_PHOENIX_1);
 
    String executeString = "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_pyspark_ml.py oci://<bucket-name>@<namespace-name>/output/optimized_listings";
 
    CreateRunResponse response;
 
    CreateRunDetails runDetails = CreateRunDetails.builder()
        .compartmentId(compartmentId).displayName("Tutorial_3_ML_PySpark").execute(executeString)
        .build();
 
    CreateRunRequest runRequest = CreateRunRequest.builder().createRunDetails(runDetails).build();
    CreateRunResponse response = client.createRun(runRequest);
 
    logger.info("Successful run creation for ML_PySpark with OpcRequestID: "+response.getOpcRequestId()
        +" and Run ID: "+response.getRun().getId());
 
    }catch (Exception e){
      logger.error("Exception creating run for ML_PySpark ", e);
    }
 
 
  }
}

次の手順

Spark-submitおよびその他の状況ではCLIを使用します。

spark-submitおよびJava SDKを使用して、データ・フローでJava、PythonまたはSQLアプリケーションを作成および実行し、結果を調べることができます。データ・フローは、デプロイメント、分解、ログ管理、セキュリティおよびUIアクセスのすべての詳細を処理します。データ・フローでは、インフラストラクチャを気にすることなく、Sparkアプリケーションの開発に集中します。

Oracle Cloud Infrastructureドキュメント

Spark-SubmitおよびSDKの開始

開始する前に

1. Javaを使用したETL

2: PySparkを使用した機械学習

次の手順