Using Cloud SQL with Big Data

You can use Oracle Cloud SQL with Oracle Big Data.

Note

Cloud SQL is not included with Big Data. You must pay an extra fee for use of Cloud SQL. See Adding Cloud SQL.

Cloud SQL Overview

Oracle Cloud SQL supports queries against non-relational data stored in multiple big data sources, including Apache Hive, HDFS, Oracle NoSQL Database, Apache Kafka, Apache HBase, and other object stores (Oracle Object Store and S3). It enables unified query for distributed data and therefore the ability to view and analyze data from disparate data stores seamlessly, as if it were all stored in an Oracle database.

Cloud SQL provides a Query Server that enables you to execute complex Oracle SQL statements against data in the Hadoop ecosystem, either manually or through your applications.

Cloud SQL provides enhancements to Oracle external tables that are specially designed for scale-out processing. An external table is an Oracle Database object that identifies and describes the location of data outside of a database. You can query an external table using the same SQL SELECT syntax that you use for any other database tables.

External tables use access drivers to parse the data outside the database. Each type of external data requires a unique access driver. Cloud SQL includes three access drivers for big data. There are drivers for:

Data that has metadata defined in Apache Hive
Accessing data stored in HDFS, with metadata specified only by an Oracle data administrator
Accessing data stored in the object stores, again with metadata specified only by an Oracle data administrator

Cloud SQL automatically defines external tables for metadata defined in the Hive metastore. Oracle Database schemas are created for Hive databases, and external tables are created for Hive tables.

If your Big Data cluster uses Cloudera Distribution including Hadoop, then Cloudera Sentry authorization policies are used to control access to data.

Components of a Cloud SQL Deployment

The Cloud SQL architecture consists of a Cloud SQL Query Server — an Oracle Database 19c-compatible query engine that works in conjunction with Cloud SQL processes (known as Cloud SQL cells) that run on the worker nodes of your Big Data Service cluster. The Query Server is installed on its own node of Big Data Service.

Since data in HDFS is stored in an undetermined format, SQL queries require some constructs to parse and interpret data for it to be processed in rows and columns. Cloud SQL leverages available Hadoop constructs to accomplish this, notably the InputFormat and SerDe Java classes, optionally through Hive metadata definitions. The Cloud SQL processing cells on the DataNodes are a layer on top of this generic Hadoop infrastructure. Three key features provided by the cells are Smart Scan, Storage Indexes, and Aggregation Offload. See Cloud SQL Query Processing for information about those.

About Cloud SQL Query Server

Oracle Cloud SQL Query Server is an Oracle Database instance provided when you install Cloud SQL on your Oracle Big Data Service.

You use Cloud SQL Query Server to query data stored in the cluster (in HDFS and Hive formats) using Oracle external tables. This enables you to take advantage of the full SQL capabilities provided by Oracle Database.

Query Server automatically defines external tables based on the metadata in the Hive metastore. You can also define additional external tables using the ORACLE_HDFS, ORACLE_HIVE, or ORACLE_BIGDATA access drivers. In the former case, Hive databases map to Oracle Database schemas, and the corresponding Hive tables are defined as Oracle external tables in those schemas. All data authorization is based on authorization rules in Hadoop such as Apache Sentry for clusters using Cloudera Distribution of Hadoop (CDH), or HDFS Access Control Lists (ACLs).

Cloud SQL Query Server supports the definition of external tables and views; it does not support storing data. Restarting the Query Server restores the database to a clean state. It preserves external tables (ORACLE_HIVE, ORACLE_HDFS, and ORACLE_BIGDATA types), associated statistics, user defined views, and credentials. A restart deletes regular tables containing user data.

To install Query Server, you must specify Cloud SQL as an option in Big Data Service through the service console. Query Server automatically creates Oracle external tables corresponding to the tables in the Hive metastore database(s) so that they are ready for querying. The set of external tables in the Query Server can be automatically kept up to date with the corresponding Hive metastore tables by running either the Restart this Cloud SQL Query Server or the Synchronize Hive Databases commands in Apache Ambari or Cloudera Manager management software. You can also use the dbms_bdsqs.sync_hive_databases PL/SQL API package procedure.

Important Terms and Concepts

It's helpful to know a little about edge nodes, cell nodes, and Hadoop cluster integration. These terms are key to understanding Oracle Cloud SQL Query Server.

Edge Nodes: An edge node in a Hadoop cluster is the interface between the Hadoop cluster and the outside network. Edge nodes are usually used to run client applications. Edge nodes can act as a data gateway by providing HDFS access through NFS or HttpFS, or by running REST servers. The Query Server node can be considered an edge node for Oracle Big Data Service. Hadoop data is not stored and processing does not take place on this node.
Cell Nodes: Cloud SQL cells run on the DataNodes, and allow for parts of query processing to be pushed down to the Hadoop cluster DataNodes where the data resides. This ensures both load distribution and reduction in the volume of data that needs to be sent to the database for processing. This can result in significant performance improvements on big data workloads.

Hadoop Cluster Integration

Cloud SQL includes the following three service roles that you can manage in Apache Ambari or Cloudera Manager:

Cloud SQL Query Server: Enables you to run SQL queries against the Hadoop cluster. Applications connect to this server using JDBC or SQL*Net.
Cloud SQL Agent: Manages the Cloud SQL installation.
Cloud SQL Server: Also known as Cloud SQL cells, allows for parts of query processing to get pushed down to the Hadoop cluster DataNodes where the data resides.

Specifying Hive Databases to Synchronize with Query Server

You can control the Hive databases that will synchronize with Oracle Cloud SQL Query Server. This can be especially important if you have a large number of Hive databases and tables but only require SQL access to some of them. Fewer databases and tables results in faster Hive metadata sync with Query Server.

Note

Cloud SQL Query Server is not intended to store internal data in Oracle tables. Whenever the Query Server is restarted, it's reset to its initial and clean state. This eliminates typical database maintenance such as storage management, database configuration, and so on. The goal of Query Server is to provide a SQL front end for data in Hadoop, object stores, Kafka, and NoSQL databases. Query Server is not a general purpose RDBMS.

Using Apache Ambari

In Apache Ambari, update Synchronized Hive Databases configuration parameter with a comma-separated list of Hive databases. This is recommended if you don't plan to access all of the Hive databases from Query Server. To synchonize all Hive databases in the metastore with Query Server, use the * wildcard character.

You can update the list of Hive databases to synchronize with Query Server by using the Synchronized Hive Databases configuration parameter in Apache Ambari as follows.

Log in to Apache Ambari using your login credentials.
In Apache Ambari, click Cloud SQL under Services from the side toolbar.
Click the Configs tab.
Expand the Advanced metadata-sync tab.
In the Synchronized Hive Databases text box, enter the names of the Hive databases separated by commas. For example: htdb0,htdb1
Click Save.

Using Cloudera Manager

In Cloudera Manager, update the sync_hive_db_list configuration parameter with a comma-separated list of Hive databases. This is recommended if you don't plan to access all of the Hive databases from Query Server. To synchronize all Hive databases in the metastore with Query Server, use the * wildcard character.

You can update the list of Hive databases to synchronize with Query Server by using the sync_hive_db_list configuration parameter in Cloudera Manager as follows.

Log in to Cloudera Manager using your login credentials.
In Cloudera Manager, use the Search field to search for the Synchronized Hive Databases configuration parameter. Enter /Synchronized Hive Databases (or enter part of the name until it's displayed in the list) in the Search field, and then press Enter.
Click Cloud SQL: Synchronized Hive Databases.
In the Synchronized Hive Databases text box, enter the names of the Hive databases separated by commas, such as htdb0,htdb1, and then click Save Changes. Only these two Hive databases will be synchronized with Query Server.