Query Apache Iceberg Tables
Autonomous Database supports querying Apache Iceberg tables.
- About Querying Apache Iceberg Tables
  Autonomous Database supports querying of Apache Iceberg tables stored in Amazon Web Services (AWS) or in Oracle Cloud Infrastructure (OCI) Object Storage.
- Concepts Related to Querying Apache Iceberg Tables
  An understanding of the following concepts is helpful for querying Apache Iceberg tables.
- Examples: Querying Apache Iceberg Tables
  These examples show how to query Apache Iceberg tables on Amazon Web Services (AWS) and Oracle Cloud Infrastructure (OCI), using a data catalog and using direct URLs for the root metadata file.
Parent topic: Query External Data with Autonomous Database
About Querying Apache Iceberg Tables
Autonomous Database supports querying of Apache Iceberg tables stored in Amazon Web Services (AWS) or in Oracle Cloud Infrastructure (OCI) Object Storage.
Supported Configurations
- Iceberg tables on AWS:
- Iceberg tables registered with AWS Glue Data Catalog, created with Spark or Athena.
For more information, see Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel and Using Iceberg tables.
- Iceberg tables stored on AWS S3 by directly providing the URL for the root metadata file.
- Iceberg tables on OCI:
- Iceberg tables generated with OCI Data Flow using a Hadoop Catalog.
For more information, see Oracle Data Flow Examples and Using a Hadoop Catalog.
- Iceberg tables stored on OCI Object Storage by directly providing the URL for the root metadata file.
Restrictions
- Partitioned Iceberg tables
Oracle does not support partitioned Iceberg tables.
- Row-level updates of Iceberg tables
Oracle does not support merge-on-read for Iceberg table updates. Queries that encounter delete files in the Iceberg metadata fail. For more information about merge-on-read, see Enum RowLevelOperationMode.
- Schema evolution
The schema for Oracle external tables is fixed and reflects the Iceberg schema version current at external table creation time. Queries fail if the Iceberg metadata points to a different schema version than the one used to create the external table. If the Iceberg schema changes after the external table has been created, it is recommended that the external table be recreated, as sketched below.
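The following is a minimal, hypothetical sketch of that recreation step. The table name iceberg_sales, the credential AWS_CRED, and the Glue settings are placeholders; reuse the same CREATE_EXTERNAL_TABLE parameters that originally defined your external table.
BEGIN
   -- Drop the external table definition that carries the stale, fixed schema.
   EXECUTE IMMEDIATE 'DROP TABLE iceberg_sales';
   -- Recreate it; the new definition picks up the current Iceberg schema version.
   DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
     table_name      => 'iceberg_sales',
     credential_name => 'AWS_CRED',
     file_uri_list   => '',
     format          => '{"access_protocol":
                           {"protocol_type": "iceberg",
                            "protocol_config":
                              {"iceberg_catalog_type": "aws_glue",
                               "iceberg_glue_region": "us-west-1",
                               "iceberg_table_path": "my_db.iceberg_sales"}}}'
   );
END;
/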
Parent topic: Query Apache Iceberg Tables
Concepts Related to Querying Apache Iceberg Tables
An understanding of the following concepts is helpful for querying Apache Iceberg tables.
Iceberg Catalog
The Iceberg catalog is a service that manages table metadata, such as table snapshots, the table schema, and partitioning information. To query the latest snapshot for an Iceberg table, query engines must first access the catalog and obtain the location of the most recent metadata file. There are a number of available catalog implementations, including AWS Glue, Hive, Nessie, and Hadoop. Autonomous Database supports the AWS Glue catalog and the HadoopCatalog implementation used by Spark.
For more information, see Iceberg Catalog, Optimistic Concurrency, Catalog Implementations, and Class HadoopCatalog.
Metadata Files
The metadata file is a JSON document that keeps track of the table snapshots, partitioning scheme, and schema information. The metadata file is the entry point to a hierarchy of manifest lists and manifest files. The manifests track the table's data files along with information such as partitioning and column statistics. For more information, see the Iceberg Table Specification.
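As an abbreviated, purely illustrative sketch (all values are placeholders and many fields are omitted), a root metadata file has roughly this shape:
{
  "format-version": 2,
  "table-uuid": "…",
  "location": "s3://my-iceberg-bucket/iceberg-loc",
  "current-schema-id": 0,
  "schemas": [ { "schema-id": 0, "type": "struct", "fields": [ … ] } ],
  "default-spec-id": 0,
  "partition-specs": [ … ],
  "current-snapshot-id": 1234567890123456789,
  "snapshots": [
    { "snapshot-id": 1234567890123456789,
      "timestamp-ms": 1700000000000,
      "manifest-list": "s3://my-iceberg-bucket/iceberg-loc/metadata/snap-….avro" }
  ]
}
Each snapshot's manifest-list entry points to an Avro manifest list, whose entries in turn reference the manifest files that track the individual data files.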
Transactions
Iceberg supports row-level updates to tables using either copy-on-write or merge-on-read. Copy-on-write generates new data files that reflect the updated rows, while merge-on-read generates new "delete files" that must be merged with the data files during reading. Oracle supports copy-on-write. Queries on Iceberg tables fail if they encounter a delete file. For more information, see RowLevelOperationMode.
Schema Evolution
Iceberg supports schema evolution. Schema changes are reflected in the Iceberg metadata using a schema ID. Note that Oracle external tables have a fixed schema, determined by the most current schema version at table creation time. Iceberg queries fail when the queried metadata points to a different schema version compared to the one used at table creation time. For more information, see Schema Evolution.
Partitioning
Iceberg supports advanced partitioning options such as hidden partitioning and partition evolution that rely on processing/altering the table's metadata without costly data layout changes.
Parent topic: Query Apache Iceberg Tables
Examples: Querying Apache Iceberg Tables
These examples show how to query Apache Iceberg tables on Amazon Web Services (AWS) and Oracle Cloud Infrastructure (OCI), using a data catalog and using direct URLs for the root metadata file.
For detailed information on creating external tables for Apache Iceberg, see CREATE_EXTERNAL_TABLE Procedure for Apache Iceberg.
Query an Iceberg table on AWS using a Glue Data Catalog
In this example, we query the Iceberg table iceberg_parquet_time_dim. The table belongs to Glue database my-iceberg-db and is stored in folder s3://my-iceberg-bucket/iceberg-loc.
We can create an external table for iceberg_parquet_time_dim as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
table_name => 'iceberg_parquet_time_dim',
credential_name => 'AWS_CRED',
file_uri_list => '',
format =>
'{"access_protocol":
{
"protocol_type": "iceberg",
"protocol_config":
{
"iceberg_catalog_type": "aws_glue",
"iceberg_glue_region": "us-west-1",
"iceberg_table_path": "my-iceberg-db.iceberg_parquet_time_dim"
}
}
}'
);
END;
/
In the protocol_config section we specify that the table uses AWS Glue as the catalog type, and we set the catalog's region to us-west-1. The credential AWS_CRED is created using DBMS_CLOUD.CREATE_CREDENTIAL with an AWS API key. The Glue Data Catalog instance is determined by the account ID associated with AWS_CRED for region us-west-1, as there is a single Glue Data Catalog per region for each account. The iceberg_table_path element in the protocol_config section uses a $database_name.$table_name path to specify the Glue database name and table name. Finally, the column_list and field_list parameters are left null because the table's schema is automatically derived from the Iceberg metadata.
For further information on creating the credential, see CREATE_CREDENTIAL Procedure. For information on AWS Glue resources, see Specifying AWS Glue resource ARNs.
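As a brief sketch of that credential setup (the key values below are placeholders; username takes the AWS access key ID and password takes the secret access key):
BEGIN
  -- Illustrative only: create the AWS credential referenced above.
  DBMS_CLOUD.CREATE_CREDENTIAL (
    credential_name => 'AWS_CRED',
    username        => 'AKIA________________',          -- placeholder AWS access key ID
    password        => 'placeholder-secret-access-key'  -- placeholder AWS secret access key
  );
END;
/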
Query an Iceberg table on AWS using the location of the root metadata file
If we know the location of the metadata file for an Iceberg table, we can create an external table without specifying a catalog, as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
table_name => 'iceberg_parquet_time_dim',
credential_name => 'AWS_CRED',
file_uri_list => 'https://my-iceberg-bucket.s3.us-west-1.amazonaws.com/iceberg-loc/metadata/00004-1758ee2d-a204-4fd9-8d52-d17e5371a5ce.metadata.json',
format =>'{"access_protocol":{"protocol_type":"iceberg"}}');
END;
/
We use the file_uri_list parameter to specify the location of the metadata file in AWS S3 virtual-hosted-style URL format. For information on this format, see Methods for accessing an AWS S3 bucket.
In this example, the database accesses the metadata file directly, so there is no need to provide a protocol_config section in the format parameter. When using the metadata file location to create an external table, the database queries the most recent snapshot referenced by the metadata file. Subsequent updates to the Iceberg table that create new snapshots and new metadata files will not be visible to the database.
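Once the external table exists, it can be queried like any ordinary table. A simple sanity check, assuming the credential is valid and the metadata file is reachable, might be:
-- Hypothetical checks against the external table defined above
SELECT COUNT(*) FROM iceberg_parquet_time_dim;

SELECT *
FROM   iceberg_parquet_time_dim
FETCH  FIRST 10 ROWS ONLY;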
Query an Iceberg table that uses Hadoop Catalog on OCI
In this example, we query the Iceberg table icebergTablePy, created using OCI Data Flow, where Spark uses the HadoopCatalog implementation for the Iceberg catalog. HadoopCatalog uses a warehouse directory and puts the Iceberg metadata in a $database_name/$table_name subfolder under this directory. It also uses a version-hint.text file that contains the version number of the most recent metadata file. See Iceberg Support on OCI Data Flow for the example on GitHub.
The sample table db.icebergTablePy was created using a warehouse folder, named iceberg, in OCI bucket my-iceberg-bucket.
Create an external table for db.icebergTablePy as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
table_name => 'iceberg_parquet_time_dim3',
credential_name => 'OCI_CRED',
file_uri_list => '',
format =>'{"access_protocol":{"protocol_type":"iceberg",
"protocol_config":{"iceberg_catalog_type": "hadoop",
"iceberg_warehouse":"https://objectstorage.uk-cardiff-1.oraclecloud.com/n/my-tenancy/b/my-iceberg-bucket/o/iceberg",
"iceberg_table_path": "db.icebergTablePy"}}}');
END;
/
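Optionally, and assuming DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE is available and applicable to Iceberg-backed external tables in your environment, the new table definition can be checked before use:
BEGIN
  -- Illustrative only: scan the external table's source files and
  -- report rows that cannot be converted to the table's columns.
  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE (
    table_name => 'iceberg_parquet_time_dim3'
  );
END;
/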
Query an Iceberg table on OCI using the location of the root metadata file
We can query the Iceberg table described in the previous section by directly using the URL for the metadata file, as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
table_name => 'iceberg_parquet_time_dim4',
credential_name => 'OCI_CRED',
file_uri_list => 'https://objectstorage.uk-cardiff-1.oraclecloud.com/n/my-tenancy/b/my-iceberg-bucket/o/iceberg/db/icebergTablePy/metadata/v2.metadata.json',
format =>'{"access_protocol":{"protocol_type":"iceberg"}}'
);
END;
/
In this example, we use the file_uri_list parameter to specify the URI for the metadata file using the native OCI URI format. When using the metadata file URI, the external table always queries the latest snapshot stored in the specific file. Subsequent updates that generate new snapshots and new metadata files are not accessible to the query.
For more information on the native OCI URI format, see Cloud Object Storage URI Formats.
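As a hypothetical cross-check, and provided the Hadoop catalog's latest metadata file is still the v2.metadata.json used above, the catalog-based and metadata-file-based external tables should return the same data:
-- Hypothetical comparison of the two external tables defined above
SELECT (SELECT COUNT(*) FROM iceberg_parquet_time_dim3) AS via_hadoop_catalog,
       (SELECT COUNT(*) FROM iceberg_parquet_time_dim4) AS via_metadata_file
FROM   dual;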
Parent topic: Query Apache Iceberg Tables