Using Cloud SQL with Big Data
You can use Oracle Cloud SQL with Oracle Big Data.
Cloud SQL is not included with Big Data. You must pay an extra fee for use of Cloud SQL. See Adding Cloud SQL.
Cloud SQL Overview
Oracle Cloud SQL supports queries against non-relational data stored in multiple big data sources, including Apache Hive, HDFS, Oracle NoSQL Database, Apache Kafka, Apache HBase, and other object stores (Oracle Object Store and S3). It enables unified query for distributed data and therefore the ability to view and analyze data from disparate data stores seamlessly, as if it were all stored in an Oracle database.
Cloud SQL provides a Query Server that enables you to execute complex Oracle SQL statements against data in the Hadoop ecosystem, either manually or through your applications.
Cloud SQL provides enhancements to Oracle external tables that are specially
designed for scale-out processing. An external table is an Oracle Database object
that identifies and describes the location of data outside of a database. You can
query an external table using the same SQL SELECT
syntax that you
use for any other database tables.
External tables use access drivers to parse the data outside the database. Each type of external data requires a unique access driver. Cloud SQL includes three access drivers for big data. There are drivers for:
-
Data that has metadata defined in Apache Hive
-
Accessing data stored in HDFS, with metadata specified only by an Oracle data administrator
-
Accessing data stored in the object stores, again with metadata specified only by an Oracle data administrator
Cloud SQL automatically defines external tables for metadata defined in the Hive metastore. Oracle Database schemas are created for Hive databases, and external tables are created for Hive tables.
If your Big Data cluster uses Cloudera Distribution including Hadoop, then Cloudera Sentry authorization policies are used to control access to data.
Components of a Cloud SQL Deployment
The Cloud SQL architecture consists of a Cloud SQL Query Server — an Oracle Database 19c-compatible query engine that works in conjunction with Cloud SQL processes (known as Cloud SQL cells) that run on the worker nodes of your Big Data Service cluster. The Query Server is installed on its own node of Big Data Service.
Since data in HDFS is stored in an undetermined format, SQL queries require some
constructs to parse and interpret data for it to be processed in rows and columns.
Cloud SQL leverages available Hadoop constructs to accomplish this,
notably the InputFormat
and SerDe
Java classes,
optionally through Hive metadata definitions. The Cloud SQL processing
cells on the DataNodes are a layer on top of this generic Hadoop infrastructure.
Three key features provided by the cells are Smart Scan, Storage Indexes, and
Aggregation Offload. See Cloud SQL Query Processing for information
about those.
About Cloud SQL Query Server
Oracle Cloud SQL Query Server is an Oracle Database instance provided when you install Cloud SQL on your Oracle Big Data Service.
You use Cloud SQL Query Server to query data stored in the cluster (in HDFS and Hive formats) using Oracle external tables. This enables you to take advantage of the full SQL capabilities provided by Oracle Database.
Query Server automatically defines external tables based on the metadata in the Hive
metastore. You can also define additional external tables using the
ORACLE_HDFS
, ORACLE_HIVE
, or
ORACLE_BIGDATA
access drivers. In the former case, Hive databases
map to Oracle Database schemas, and the corresponding Hive tables are defined as Oracle
external tables in those schemas. All data authorization is based on authorization rules
in Hadoop such as Apache Sentry for clusters using Cloudera Distribution of Hadoop
(CDH), or HDFS Access Control Lists (ACLs).
Cloud SQL Query Server supports the definition of external tables and views; it
does not support storing data. Restarting the Query Server restores the database to a
clean state. It preserves external tables (ORACLE_HIVE
,
ORACLE_HDFS
, and ORACLE_BIGDATA
types), associated
statistics, user defined views, and credentials. A restart deletes regular tables
containing user data.
To install Query Server, you must specify Cloud SQL as an option in Big Data
Service through the service console. Query Server automatically creates Oracle external
tables corresponding to the tables in the Hive metastore database(s) so that they are
ready for querying. The set of external tables in the Query Server can be automatically
kept up to date with the corresponding Hive metastore tables by running either the
Restart this Cloud SQL Query Server or the
Synchronize Hive Databases commands in Apache Ambari or
Cloudera Manager management software. You can also use the
dbms_bdsqs.sync_hive_databases
PL/SQL API package procedure.
Important Terms and Concepts
It's helpful to know a little about edge nodes, cell nodes, and Hadoop cluster integration. These terms are key to understanding Oracle Cloud SQL Query Server.
- Edge Nodes
- An edge node in a Hadoop cluster is the interface between the Hadoop cluster and the outside network. Edge nodes are usually used to run client applications. Edge nodes can act as a data gateway by providing HDFS access through NFS or HttpFS, or by running REST servers. The Query Server node can be considered an edge node for Oracle Big Data Service. Hadoop data is not stored and processing does not take place on this node.
- Cell Nodes
-
Cloud SQL cells run on the DataNodes, and allow for parts of query processing to be pushed down to the Hadoop cluster DataNodes where the data resides. This ensures both load distribution and reduction in the volume of data that needs to be sent to the database for processing. This can result in significant performance improvements on big data workloads.
Hadoop Cluster Integration
Cloud SQL includes the following three service roles that you can manage in Apache Ambari or Cloudera Manager:
- Cloud SQL Query Server: Enables you to run SQL queries against the Hadoop cluster. Applications connect to this server using JDBC or SQL*Net.
- Cloud SQL Agent: Manages the Cloud SQL installation.
- Cloud SQL Server: Also known as Cloud SQL cells, allows for parts of query processing to get pushed down to the Hadoop cluster DataNodes where the data resides.
Specifying Hive Databases to Synchronize with Query Server
You can control the Hive databases that will synchronize with Oracle Cloud SQL Query Server. This can be especially important if you have a large number of Hive databases and tables but only require SQL access to some of them. Fewer databases and tables results in faster Hive metadata sync with Query Server.
Cloud SQL Query Server is not intended to store internal data in Oracle tables. Whenever the Query Server is restarted, it's reset to its initial and clean state. This eliminates typical database maintenance such as storage management, database configuration, and so on. The goal of Query Server is to provide a SQL front end for data in Hadoop, object stores, Kafka, and NoSQL databases. Query Server is not a general purpose RDBMS.
In Apache Ambari, update Synchronized Hive Databases
configuration
parameter with a comma-separated list of Hive databases. This is recommended if you
don't plan to access all of the Hive databases from Query Server. To synchonize all
Hive databases in the metastore with Query Server, use the *
wildcard character.
You can update the list of Hive databases to synchronize with Query Server by using
the Synchronized Hive Databases
configuration parameter in Apache
Ambari as follows.
-
Log in to Apache Ambari using your login credentials.
-
In Apache Ambari, click Cloud SQL under Services from the side toolbar.
-
Click the Configs tab.
-
Expand the Advanced metadata-sync tab.
-
In the Synchronized Hive Databases text box, enter the names of the Hive databases separated by commas. For example: htdb0,htdb1
-
Click Save.
In Cloudera Manager, update the sync_hive_db_list
configuration
parameter with a comma-separated list of Hive databases. This is recommended if you
don't plan to access all of the Hive databases from Query Server. To synchronize all
Hive databases in the metastore with Query Server, use the *
wildcard character.
You can update the list of Hive databases to synchronize with Query Server by using
the sync_hive_db_list
configuration parameter in Cloudera Manager
as follows.
-
Log in to Cloudera Manager using your login credentials.
-
In Cloudera Manager, use the Search field to search for the Synchronized Hive Databases configuration parameter. Enter /Synchronized Hive Databases (or enter part of the name until it's displayed in the list) in the Search field, and then press Enter.
-
Click Cloud SQL: Synchronized Hive Databases.
-
In the Synchronized Hive Databases text box, enter the names of the Hive databases separated by commas, such as
htdb0,htdb1
, and then click Save Changes. Only these two Hive databases will be synchronized with Query Server.