Harvesting Object Storage Files as Logical Data Entities
Your data lake typically has many files that represent a single data set. The file naming conventions indicate that multiple files are part of a single logical data entity.
You can group multiple Object Storage files into logical data entities in your data catalog using filename patterns. A logical data entity is like any other data entity and can be used for search and discovery. Using logical data entities, you can organize your data lake content meaningfully and prevent the explosion of data entities and attributes in your data catalog.
Understanding Logical Data Entities
Consider the following set of files:
myserv/20191205_yny_myIOTSensor.json
myserv/20191105_yny_myIOTSensor.json
myserv/20191005_yny_myIOTSensor.json
myserv/20190905_yny_myIOTSensor.json
myserv/20191005_hyd_my2ndIOTSensor.json
myserv/20190905_hyd_my2ndIOTSensor.json
myserv/20191005_bom_my3rdIOTSensor.json
myserv/20190905_bom_my3rdIOTSensor.json
myserv/somerandomfile_2019AUG05.json
If you harvest these files in your Oracle Object Storage data source without creating filename patterns, Data Catalog creates nine individual data entities in your data catalog, one per file. A data source with hundreds of such files would result in hundreds of data entities in your data catalog.
Understanding Expressions
In Data Catalog, a filename pattern is defined using expressions.
An expression can have one or more components that you separate using a delimiter. Each component specifies a matching rule for the pattern. Filename patterns are created using Java regular expressions. You specify the regular expression that should be used to group your files into required logical data entities.
You can specify qualifiers that are used when parsing the expression. You can use the following qualifiers:
- bucketName: Use this qualifier to specify that the bucket name should be derived from the path that matches the given expression. The bucketName qualifier is used only once in the expression and always as the first component of the expression. The bucketName qualifier value can be static text or an expression.
- logicalEntity: Use this qualifier to specify that the logical data entity name should be derived from the path that matches the given expression. You can use logicalEntity multiple times in an expression. The logicalEntity qualifier values can consist of static text or expressions.
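One way to picture how qualifiers work is to treat each {qualifier:value} component as a capture group in an ordinary regular expression. The following is a minimal sketch of that idea in Python, not Data Catalog's actual implementation. The helper names (to_regex, extract_qualifiers) and the sample expression are hypothetical, Python's re module stands in for Java regular expressions (the simple constructs used here behave the same in both), and the sketch assumes that an expression must match the whole bucket/object path.

```python
import re

# Matches one {qualifier:value} component. Simplification: the value itself
# must not contain braces, so quantifiers such as [0-9]{4} are not supported.
QUALIFIER = re.compile(r"\{(bucketName|logicalEntity):([^{}]*)\}")


def to_regex(expression):
    """Turn a pattern expression into (compiled regex, qualifier names).

    Each qualifier becomes one capture group; names[i] is the qualifier
    that produced group i + 1. Assumes the rest of the expression has no
    capture groups of its own.
    """
    names = []

    def as_group(match):
        names.append(match.group(1))
        return "(" + match.group(2) + ")"

    return re.compile(QUALIFIER.sub(as_group, expression)), names


def extract_qualifiers(expression, path):
    """Return [(qualifier, matched text), ...], or None if path is unmatched."""
    pattern, names = to_regex(expression)
    match = pattern.fullmatch(path)
    if match is None:
        return None
    return list(zip(names, match.groups()))


# Hypothetical expression applied to two of the myserv files shown earlier.
EXPRESSION = "{bucketName:myserv}/[0-9]*_{logicalEntity:[a-z]*}_.*.json"
print(extract_qualifiers(EXPRESSION, "myserv/20191205_yny_myIOTSensor.json"))
print(extract_qualifiers(EXPRESSION, "myserv/somerandomfile_2019AUG05.json"))
```

Keeping the qualifier names in a separate list, rather than using named groups, allows logicalEntity to appear more than once in the same expression.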
Consider the following filenames:
bling_metering/1970120520_yny_hourly_region_res_delayed.json
bling_metering/1973110523_yny_hourly_region_res_delayed.json
bling_metering/1988101605_hyd_daily_region_res_delayed.json
bling_metering/1991042302_yny_hourly_region_res_delayed.json
bling_metering/2019073019_zrh_daily_region_res_delayed.json
bling_metering/2019073020_zrh_monthly_region_res_delayed.json
bling_metering/some_random_file_123.json
To derive logical data entities based on frequency (hourly, daily, monthly) mentioned in the filename, you can use the following pattern expression:
{bucketName:bling_metering}/[0-9]*_[a-z]*_{logicalEntity:[a-z]*}_.*.json
This expression uses the bucketName and logicalEntity qualifiers. In this example, [0-9]* matches any sequence of digits, [a-z]* matches any sequence of lowercase letters, and .* matches any sequence of characters. The expression results in the following logical data entities:
- bling_metering_monthly
  bling_metering/2019073020_zrh_monthly_region_res_delayed.json
- bling_metering_hourly
  bling_metering/1970120520_yny_hourly_region_res_delayed.json
  bling_metering/1973110523_yny_hourly_region_res_delayed.json
  bling_metering/1991042302_yny_hourly_region_res_delayed.json
- bling_metering_daily
  bling_metering/1988101605_hyd_daily_region_res_delayed.json
  bling_metering/2019073019_zrh_daily_region_res_delayed.json
- Unmatched
  bling_metering/some_random_file_123.json
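The grouping by frequency can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the harvester itself: the expression is translated by hand into a plain regular expression with one capture group per qualifier, the whole path must match, and the entity name is built by joining the bucket name and the logicalEntity value with an underscore, a rule inferred from the entity names listed above. The function name group_files is hypothetical.

```python
import re
from collections import defaultdict

FILES = [
    "bling_metering/1970120520_yny_hourly_region_res_delayed.json",
    "bling_metering/1973110523_yny_hourly_region_res_delayed.json",
    "bling_metering/1988101605_hyd_daily_region_res_delayed.json",
    "bling_metering/1991042302_yny_hourly_region_res_delayed.json",
    "bling_metering/2019073019_zrh_daily_region_res_delayed.json",
    "bling_metering/2019073020_zrh_monthly_region_res_delayed.json",
    "bling_metering/some_random_file_123.json",
]

# Hand translation of
# {bucketName:bling_metering}/[0-9]*_[a-z]*_{logicalEntity:[a-z]*}_.*.json
# where each qualifier becomes a capture group.
PATTERN = re.compile(r"(bling_metering)/[0-9]*_[a-z]*_([a-z]*)_.*.json")


def group_files(pattern, paths):
    """Return ({entity name: [paths]}, [unmatched paths])."""
    entities = defaultdict(list)
    unmatched = []
    for path in paths:
        match = pattern.fullmatch(path)
        if match:
            # Assumed naming rule: captured values joined with "_".
            entities["_".join(match.groups())].append(path)
        else:
            unmatched.append(path)
    return dict(entities), unmatched


entities, unmatched = group_files(PATTERN, FILES)
for name in sorted(entities):
    print(name, len(entities[name]))
print("Unmatched", unmatched)
```

In a plain regular expression, the dot before json is unescaped and therefore matches any single character; writing \.json would restrict the match to a literal dot.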
To derive logical data entities based on regions (yny, hyd, zrh) mentioned in the filename, you can use either of the following pattern expressions:
{bucketName:bling_metering}/[0-9]*_{logicalEntity:yny|hyd|zrh}_[a-z]*_region_res_delayed.json
{bucketName:bling_metering}/[0-9]*_{logicalEntity:[a-z]*}_[a-z]*_.*.json
Either expression results in the following logical data entities:
- bling_metering_zrh
  bling_metering/2019073020_zrh_monthly_region_res_delayed.json
  bling_metering/2019073019_zrh_daily_region_res_delayed.json
- bling_metering_yny
  bling_metering/1970120520_yny_hourly_region_res_delayed.json
  bling_metering/1973110523_yny_hourly_region_res_delayed.json
  bling_metering/1991042302_yny_hourly_region_res_delayed.json
- bling_metering_hyd
  bling_metering/1988101605_hyd_daily_region_res_delayed.json
- Unmatched
  bling_metering/some_random_file_123.json
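The two region expressions agree on this set of files, but they are not interchangeable in general. The sketch below, written in Python with hand-translated regular expressions (one capture group per qualifier), checks both points. The names EXPLICIT, GENERIC, and entity_name are hypothetical, the file with region fra is an invented example that is not part of the data set, and whole-path matching and underscore-joined names are assumptions inferred from the results above.

```python
import re

FILES = [
    "bling_metering/1970120520_yny_hourly_region_res_delayed.json",
    "bling_metering/1973110523_yny_hourly_region_res_delayed.json",
    "bling_metering/1988101605_hyd_daily_region_res_delayed.json",
    "bling_metering/1991042302_yny_hourly_region_res_delayed.json",
    "bling_metering/2019073019_zrh_daily_region_res_delayed.json",
    "bling_metering/2019073020_zrh_monthly_region_res_delayed.json",
    "bling_metering/some_random_file_123.json",
]

# First expression: regions listed explicitly.
EXPLICIT = re.compile(
    r"(bling_metering)/[0-9]*_(yny|hyd|zrh)_[a-z]*_region_res_delayed.json"
)
# Second expression: any run of lowercase letters counts as a region.
GENERIC = re.compile(r"(bling_metering)/[0-9]*_([a-z]*)_[a-z]*_.*.json")


def entity_name(pattern, path):
    """Return the derived entity name, or None if the path is unmatched."""
    match = pattern.fullmatch(path)
    return "_".join(match.groups()) if match else None


# Both expressions derive the same name for every file in the example.
same = all(entity_name(EXPLICIT, f) == entity_name(GENERIC, f) for f in FILES)
print("identical on the example files:", same)

# Invented file with a region that is not in the explicit list.
NEW_FILE = "bling_metering/2020010100_fra_daily_region_res_delayed.json"
print(entity_name(EXPLICIT, NEW_FILE))
print(entity_name(GENERIC, NEW_FILE))
```

Under these assumptions, the explicit list leaves files from a new region unmatched, while the generic form creates a new logical data entity for them. Which behavior is preferable depends on how tightly you want to control the entities that appear in the catalog.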
To derive logical data entities based on regions and frequency (hourly, daily, monthly) mentioned in the filename, you can use the following pattern expression:
{bucketName:bling_metering}/[0-9]*_{logicalEntity:[a-z]*}_{logicalEntity:[a-z]*}_region_res_delayed.json
The above expression uses the bucketName qualifier and two logicalEntity qualifiers. The expression results in the following logical data entities:
- bling_metering_zrh_monthly
  bling_metering/2019073020_zrh_monthly_region_res_delayed.json
- bling_metering_hyd_daily
  bling_metering/1988101605_hyd_daily_region_res_delayed.json
- bling_metering_zrh_daily
  bling_metering/2019073019_zrh_daily_region_res_delayed.json
- bling_metering_yny_hourly
  bling_metering/1970120520_yny_hourly_region_res_delayed.json
  bling_metering/1973110523_yny_hourly_region_res_delayed.json
  bling_metering/1991042302_yny_hourly_region_res_delayed.json
- Unmatched
  bling_metering/some_random_file_123.json
If no logicalEntity qualifier is specified, the filename pattern name is used as the logical data entity name. For example, consider the following expression for the filename pattern named bling pattern:
{bucketName:bling_metering}/[0-9]*_[a-z]*_[a-z]*_.*.json
The above expression uses the bucketName qualifier, but no logicalEntity qualifier. The expression results in the following logical data entity:
- bling pattern
  bling_metering/2019073020_zrh_monthly_region_res_delayed.json
  bling_metering/1970120520_yny_hourly_region_res_delayed.json
  bling_metering/1973110523_yny_hourly_region_res_delayed.json
  bling_metering/1991042302_yny_hourly_region_res_delayed.json
  bling_metering/1988101605_hyd_daily_region_res_delayed.json
  bling_metering/2019073019_zrh_daily_region_res_delayed.json
- Unmatched
  bling_metering/some_random_file_123.json
When you test this expression with no logicalEntity qualifier, the test results show the expression itself as the logical data entity name. On harvesting, however, the name of the filename pattern is used as the logical data entity name.
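The naming behavior on harvesting, including the fallback to the filename pattern name, can be summarized in a small Python sketch. This is a model inferred from the examples in this section rather than documented logic: the function harvested_name is hypothetical, expressions are translated by hand into regular expressions with one capture group per qualifier, the whole path is assumed to match, and names are assumed to be the captured values joined with underscores.

```python
import re


def harvested_name(pattern_name, regex, qualifiers, path):
    """Name of the logical data entity for path, or None if unmatched.

    regex is the expression with each qualifier written as a capture group;
    qualifiers lists the qualifier behind each group, in order.
    """
    match = re.fullmatch(regex, path)
    if match is None:
        return None
    if "logicalEntity" not in qualifiers:
        # Fallback: no logicalEntity qualifier, so use the pattern's name.
        return pattern_name
    # Assumed rule: bucket name, then logicalEntity values, joined by "_".
    return "_".join(match.groups())


PATH = "bling_metering/1988101605_hyd_daily_region_res_delayed.json"

# Expression with a bucketName qualifier only.
no_entity = harvested_name(
    "bling pattern",
    r"(bling_metering)/[0-9]*_[a-z]*_[a-z]*_.*.json",
    ["bucketName"],
    PATH,
)
# Expression with a bucketName qualifier and two logicalEntity qualifiers.
two_entities = harvested_name(
    "bling pattern",
    r"(bling_metering)/[0-9]*_([a-z]*)_([a-z]*)_region_res_delayed.json",
    ["bucketName", "logicalEntity", "logicalEntity"],
    PATH,
)
print(no_entity)
print(two_entities)
```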