Spark Oracle Datasource
Spark Oracle Datasource is an extension of Spark's JDBC datasource. In addition to all the options that the JDBC datasource provides, it simplifies connecting to Oracle databases from Spark by providing:
- Automatic download of the wallet from Autonomous Database Serverless, so there's no need to download the wallet and keep it in Object Storage or Vault.
- Automatic distribution of the wallet bundle from Object Storage to the driver and executors, without any custom code from users.
- Bundled JDBC driver JAR files (version 21.3.0.0), eliminating the need to download them and include them in your archive.zip file.
Use a Spark Oracle Datasource
There are two ways to use this datasource in Data Flow.
- In the Advanced Options section when creating, editing, or running an application, include the key spark.oracle.datasource.enabled with the value true. For more information, see the Create Applications section.
- Use the Oracle Spark datasource format. For example, in Scala:

```scala
val df = spark.read
  .format("oracle")
  .option("adbId", "autonomous_database_ocid")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
```

More examples in other languages are available in the Spark Oracle Datasource Examples section.
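The properties described below have read/write scope, so the same options apply when writing a DataFrame back to Oracle. A minimal sketch, reusing the same placeholder values as the read example above:

```scala
// Sketch: writing a DataFrame back with the same connection options used for reading.
// "autonomous_database_ocid", "schema.tablename", "username", and "password"
// are placeholders, as in the read example above.
df.write
  .format("oracle")
  .option("adbId", "autonomous_database_ocid")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .mode("append") // append to the existing table rather than overwrite it
  .save()
```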
The following three properties are available with the Oracle datasource, in addition to the properties provided by Spark's JDBC datasource:

Property Name | Default Setting | Description | Scope
---|---|---|---
walletUri | | An Object Storage or HDFS-compatible URL to the ZIP file of the Oracle Wallet needed for mTLS connections to an Oracle database. For more information on using the Oracle Wallet, see View TNS Names and Connection Strings for an Autonomous Database Serverless. | Read/write
connectionId | | The connection identifier alias from the tnsnames.ora file, as part of the Oracle wallet. For more information, see the Overview of Local Naming Parameters and the Glossary in the Oracle Database Net Services Reference. | Read/write
adbId | | The Oracle Autonomous Database OCID. For more information, see the Overview of Autonomous Database Serverless. | Read/write
Note
You can use Spark Oracle Datasource in Data Flow with Spark 3.0.2 and later versions.
The following limitations apply to the options:
- adbId and walletUri can't be used together.
- connectionId must be provided with walletUri, but is optional with adbId.
- adbId isn't supported for databases with SCAN (Single Client Access Name) addresses.
- adbId isn't supported for Autonomous Database on Dedicated Exadata Infrastructure.
To use Spark Oracle Datasource with Spark Submit, set the following option:

--conf spark.oracle.datasource.enabled=true
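Outside Data Flow (for example, when experimenting locally with the connector on the classpath), the same Spark property can in principle be set programmatically on the session builder. This is a sketch under that assumption, not Data Flow-specific guidance; in Data Flow itself the property belongs in Advanced Options or the Spark Submit command as described above:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: setting the property at session construction has the same effect
// as passing it via --conf on spark-submit.
val spark = SparkSession.builder()
  .appName("oracle-datasource-example")
  .config("spark.oracle.datasource.enabled", "true")
  .getOrCreate()
```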
Only the following database is supported with adbId:
- Autonomous Database Serverless
Note
If you have this database in a VCN private subnet, use a Private Network to allowlist the FQDN of the autonomous database's private endpoint.
The following databases can be used with the walletUri option:
- Autonomous Database Serverless
- Autonomous Database on Dedicated Infrastructure, including Exadata infrastructure
- Autonomous Transaction Processing on Dedicated Infrastructure
- On-premises Oracle databases that can be accessed from Data Flow's network, either through FastConnect or Site-to-Site VPN
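For the wallet-based path, a read might look like the following sketch. The Object Storage URL, wallet file name, and connection alias (db_high here) are placeholders, and, per the limitations above, connectionId is required whenever walletUri is used:

```scala
// Sketch: reading over mTLS with an Oracle Wallet ZIP stored in Object Storage.
// The bucket URL, wallet file name, and "db_high" alias are placeholder values.
val df = spark.read
  .format("oracle")
  .option("walletUri", "oci://bucket@namespace/Wallet_DB.zip")
  .option("connectionId", "db_high") // required when walletUri is set
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
```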