Table Function Operators Reference

The table function operator provides a list of built-in table functions that you can use in a data flow to return data in the form of tables.

See Table Function Operator.

Here's a list of the built-in table functions that are supported by the table function operator.

Cache

Persists and caches a dataset according to the given memory storage level.

Returns a new dataset in the cached resilient distributed dataset (RDD).


Parameter	Description
`level`	The memory storage to use: `MEMORY_ONLY`: Store as deserialized Java objects in the JVM. If memory is insufficient, some RDD partitions are not cached, but are recomputed as and when needed. `MEMORY_AND_DISK`: Store as deserialized Java objects in the JVM. If memory is insufficient, some RDD partitions are stored on disk, and read from disk as and when needed. `MEMORY_ONLY_SER`: Store as deserialized Java objects. `MEMORY_AND_DISK_SER`: Store as deserialized Java objects. Partitions that don't fit are stored on disk. `DISK_ONLY`: Store the RDD partitions only on disk. `MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc`: Similar to other levels but each partition is replicated on two cluster nodes.

Cross tabulation analysis

Computes a pair-wise frequency table or contingency table from distinct values of the given two columns.

The number of distinct values for each column must be less than 1e4.

Returns a dataframe that contains the contingency table. In the contingency table:

The name of the first column is col1_col2, where col1 is the first column and col2 is the second column before transformation. The first column of each row are the distinct values of col1.
The other column names are the distinct values of col2.
The counts are returned as the long type.
Pairs without occurrences are given zero as the count.
The maximum number of non-zero pairs is 1e6.
Null elements are replaced by null.


Parameter	Description
`col1`	Name of the first column. See the note after this table for limitations. Distinct items from `col1` make the first item of each row in the returned dataframe.
`col2`	Name of the second column. See the note after this table for limitations. Distinct items from `co2` make the column names in the returned dataframe.

Note

Only the following characters are allowed in a column name:

lower and upper case letters a-z A-Z
numbers 0-9
underscore _

Example

Suppose col1 is age and col2 is salary.

Before transformation:

+---+------+
|age|salary|
+---+------+
|1  |100   |
|1  |200   |
|2  |100   |
|2  |100   |
|2  |300   |
|3  |200   |
|3  |300   |
+---+------+

After transformation:

+----------+---+---+---+
|age_salary|100|200|300|
+----------+---+---+---+
|2         |2  |0  |1  |
|1         |1  |1  |0  |
|3         |0  |1  |1  |
+----------+---+---+---+

Cube

Generates a multi-dimensional cube of all possible combinations using the given columns list and the calculations in the given aggregate function expressions.

The following aggregate functions are supported in the expressions:

AVG
COUNT
MEAN
MAX
MIN
SUM

Returns a dataframe after the cube transformation.


Parameter	Description
`cubeColumns`	The list of columns, separated by commas, by which to generate the multi-dimensional cube.
`aggExpressions`	The aggregate function expressions to perform on the columns. For example: `salary -> avg, age -> max`

Example

Before transformation:

+----------+-----+-------+------+---+
|department|group|region |salary|age|
+----------+-----+-------+------+---+
|Eng       |A    |local  |1000  |10 |
|Art       |B    |foreign|2000  |11 |
|Sport     |C    |local  |1500  |13 |
|Eng       |A    |foreign|800   |19 |
|Art       |A    |foreign|3000  |30 |
|Eng       |E    |foreign|1000  |21 |
|Art       |D    |local  |1000  |32 |
|Sport     |C    |local  |1000  |28 |
+----------+-----+-------+------+---+

If cubeColumns is department, region and aggExpressions issalary -> avg, age -> max (compute the average salary and maximum age), the after transformation cube is:

+----------+-------+-----------------+--------+
|department|region |avg(salary)      |max(age)|
+----------+-------+-----------------+--------+
|Eng       |foreign|900.0            |21      |
|Sport     |null   |1250.0           |28      |
|Art       |null   |2000.0           |32      |
|Eng       |null   |933.3333333333334|21      |
|null      |local  |1125.0           |32      |
|Art       |foreign|2500.0           |30      |
|null      |foreign|1700.0           |30      |
|Eng       |local  |1000.0           |10      |
|null      |null   |1412.5           |32      |
|Sport     |local  |1250.0           |28      |
|Art       |local  |1000.0           |32      |
+----------+-------+-----------------+--------+

Data generation

Generates synthetic data using the given file format and content. The supported file formats are:

CSV
JSON

Returns a data entity. The synthetic data source can be used in pipeline unit-tests where real data is not used or moved.


Parameter	Description
`format`	The type of file. Supported values are: `CSV` `JSON`
`content`	The file content for the given format.

Example

A CSV content example:

co1,co2,co3
1,B,3
A,2,C

Data generation result:

+---+---+---+
|co1|co2|co3|
+---+---+---+
|1  |B  |3  |
+---+---+---+
|A  |2  |C  |
+---+---+---+

Deduplication

Finds duplicates in the given column or columns list and returns a new dataset with the duplicate rows removed.


Parameter	Description
`columns`	A column name or a comma-separated list of column names.

Example

Before transformation:

+---+---+
|a  |b  |
+---+---+
|1  |2  |
|3  |4  |
|1  |3  |
|2  |4  |
+---+---+

If columns is a, the dataset after transformation is:

+---+---+
|a  |b  |
+---+---+
|1  |2  |
|3  |4  |
|2  |4  |
+---+---+

Frequent items

Finds frequent items in the given column or columns list using the given minimum frequency.

The frequent element count algorithm proposed by Karl et al. is used to find frequent items. False positives are possible.

Returns a dataframe with an array of frequent items for each column.


Parameter	Description
`freqItems`	A column name or a comma-separated list of column names for which you want to find frequent items.
`support`	Minimum frequency for an item to be considered frequent. The value must be greater than 1e-4 (decimal value of `0.0001`). For example, `0.6` represents 60%. That is, an item that appears in at least 60% of the rows is a frequent item.

Example

Before transformation:

+-------+--------------+
|user_id|favorite_fruit|
+-------+--------------+
|      1|         apple|
|      2|        banana|
|      3|         apple|
|      4|        orange|
|      5|        banana|
|      6|        banana|
|      7|         apple|
+-------+--------------+

If freqItems is favorite_fruit and support is 0.6, the dataframe returned after transformation is:

+------------------------+
|favorite_fruit_freqItems|
+------------------------+
|                [banana]|
+------------------------+

Na Drop

Removes rows that contain null or NaN values in the given column or columns list.

Returns a dataframe that excludes the rows.


Parameter	Description
`how`	How to determine if a row is to be dropped. Supported values are: `any`: Drop the row that contains any null or NaN values in the specified `cols`. `all`: Drop the row only if every specified `cols` is null or NaN for that row. The `how` parameter is ignored if the `minNonNulls` parameter is specified.
`cols`	A column name or a comma-separated list of column names.
`minNonNulls`	The minimum number of non-null and non-NaN values that a row can contain. Drops the rows that contain less than the specified minimum. The `minNonNulls` parameter overrides the `how` parameter.

Example

Before transformation:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

If how is any and cols is name, the dataframe returned after transformation is:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

When you use the how parameter with the value all, a row is dropped only if all its values are null. For example, before transformation:

+----+--------+----+
|  id|    name| age|
+----+--------+----+
|   1|John Doe|  30|
|   2|    null|null|
|   3|Jane Doe|  25|
|   4|Jake Doe|null|
|null|    null|null|
+----+--------+----+

After transformation:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1|John Doe|  30|
|  2|    null|null|
|  3|Jane Doe|  25|
|  4|Jake Doe|null|
+---+--------+----+

Na Fill

Replaces null values and returns a dataframe using the replaced values.


Parameter	Description
`replacement`	The key-value map to use for replacing null values. The key is a column name. The value is a replacement value. The key points to the replacement value. For example: `name -> Tom` means replace null values in the `name` column with `Tom`. A replacement value is cast to the column's data type. A replacement value must be of the following type: Int Long Float Double String Boolean

Example

Before transformation:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

If the key-value map is as follows:

id -> 3

name -> unknown

age -> 10

After transformation:

+---+-------+---+
|id |name   |age|
+---+-------+---+
|1  |Alice  |25 |
|2  |unknown|28 |
|3  |Bob    |10 |
|4  |Charlie|30 |
+---+-------+---+

Na Replace

Replaces a value with another value in the given column or columns by matching keys in the given key and replacement value map.

Returns a new dataframe that contains the replaced transformation.


Parameter	Description
`cols`	A column name or a comma-separated list of column names on which to apply replacement values. If `cols` is `*`, then replacement is applied on all String, Numeric, or Boolean columns.
`replacement`	The replacement key-value map to use. The key is a value to replace. The value is the replacement value. The map value can have nulls. The key points to the replacement value. For example: `Alice -> Tom` means replace `Alice` with `Tom`. The key and replacement value pair must have the same type. Only the following types are supported: Boolean Double String

Example

Before transformation:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

If cols is name, and replacement is Alice -> Tom, the dataframe after transformation is:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Tom    |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

Repartition

Divides a dataset by the given columns into the specified number of partitions.

Returns a new dataset that's hash partitioned. The exact number of partition as specified is returned.


Parameter	Description
`partitionColumns`	A column name or comma-separated list of column names by which the dataset is partitioned.
`numberOfPartitions`	The number of partitions to create.

Rollup

Generates a multi-dimensional rollup of possible combinations using the given columns list and the calculations in the given aggregate function expressions.

The following aggregate functions are supported in the expressions:

AVG
COUNT
MEAN
MAX
MIN
SUM

Returns a dataframe after the rollup transformation.


Parameter	Description
`rollupColumns`	The list of columns, separated by commas, by which to generate the multi-dimensional rollup.
`aggExpressions`	The aggregate function expressions to perform on the columns. For example: `salary -> avg, age -> max`

Example

Before transformation:

+----------+-----+-------+------+---+
|department|group|region |salary|age|
+----------+-----+-------+------+---+
|Eng       |A    |local  |1000  |10 |
|Art       |B    |foreign|2000  |11 |
|Sport     |C    |local  |1500  |13 |
|Eng       |A    |foreign|800   |19 |
|Art       |A    |foreign|3000  |30 |
|Eng       |E    |foreign|1000  |21 |
|Art       |D    |local  |1000  |32 |
|Sport     |C    |local  |1000  |28 |
+----------+-----+-------+------+---+

If rollupColumns is department, region and aggExpressions is salary -> avg, age -> max (compute the average salary and maximum age), the after transformation rollup is:

+----------+-------+-----------------+--------+
|department|region |avg(salary)      |max(age)|
+----------+-------+-----------------+--------+
|Eng       |foreign|900.0            |21      |
|Sport     |null   |1250.0           |28      |
|Art       |null   |2000.0           |32      |
|Eng       |null   |933.3333333333334|21      |
|Art       |foreign|2500.0           |30      |
|Eng       |local  |1000.0           |10      |
|null      |null   |1412.5           |32      |
|Sport     |local  |1250.0           |28      |
|Art       |local  |1000.0           |32      |
+----------+-------+-----------------+--------+

Spark SQL

Runs Spark SQL queries on incoming data, by first creating temporary views using the given table names.


Parameter	Description
`SQL`	The SQL statement or script to run. Example: `select * from table1`
`tableName`	A table name or comma-separated list of table names by which Spark creates temporary tables. Example: `table1, table2`

Stratified sampling

Generates a stratified sample without replacement based on the given sampling fraction for each stratum.

Returns a new dataframe that represents the stratified sample.


Parameter	Description
`strata_col`	The column that defines the strata.
`fractions`	The sampling fraction for each stratum, from `0.0` to `1.0`. For example, `0.1` returns 10% of the rows, and `1.0` returns 100% of the rows. If a fraction is not specified, zero is assumed.
`sample_size`	If `fractions` is not specified, specify a portion of the dataset to be sampled, from `0.0` to `1.0`.
`seed`	Use any random number in `seed` to return the same sample every time, keeping the result consistent.

Example

Before transformation:

+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  2|    1|
|  2|    1|
|  2|    3|
|  3|    2|
|  3|    3|
+---+-----+

If strata_col is key, and fractions is the following:

1 -> 1.0, 3 -> 0.5

After transformation:

+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+

Summary

Computes the given statistics for numeric and string columns.

If no statistics are provided, all the following are computed:

COUNT
MEAN
STDDEV
MIN
approximate quartiles (percentiles at 25%, 50%, and 75%)
MAX


Parameter	Description
`statistics`	A comma-separated list of statistics. Supported values are: `COUNT` `MEAN` `STDDEV` `MIN` one or more arbitrary approximate percentiles specified as a percentage (for example, `75%`) `MAX` Example: `count, mean, stddev, min, 27%, 41%, 95%, max`

Example

Before transformation:

+----------+------+-----------------+--------+
|department|gender|avg(salary)      |max(age)|
+----------+------+-----------------+--------+
|Eng       |female|900.0            |21      |
|Sport     |null  |1250.0           |28      |
|Art       |null  |2000.0           |32      |
|Eng       |null  |933.3333333333334|21      |
|Art       |female|2500.0           |30      |
|Eng       |male  |1000.0           |10      |
|null      |null  |1412.5           |32      |
|Sport     |male  |1250.0           |28      |
|Art       |male  |1000.0           |32      |
+----------+------+-----------------+--------+

Suppose statistics is count, mean, stddev, min, 27%, 41%, 95%, max.

After transformation:

+-------+----------+-----+------+----------------+----------------+
|summary|department|group|gender|salary          |age             |
+-------+----------+-----+------+----------------+----------------+
|count  |8         |8    |8     |8               |8               |
|mean   |null      |null |null  |1412.5          |20.5            |
|stddev |null      |null |null  |749.166203188585|8.76682056718072|
|min    |Art       |A    |female|800             |10              |
|27%    |null      |null |null  |1000            |13              |
|41%    |null      |null |null  |1000            |19              |
|95%    |null      |null |null  |3000            |32              |
|max    |Sport     |E    |male  |3000            |32              |
+-------+----------+-----+------+----------------+----------------+

Oracle Cloud Infrastructure Documentation

Table Function Operators Reference

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example