Table Function Operators Reference
The table function operator provides a list of built-in table functions that you can use in a data flow to return data in the form of tables.
Here's a list of the built-in table functions that are supported by the table function operator.
Persists and caches a dataset according to the given memory storage level
.
Returns a new dataset in the cached resilient distributed dataset (RDD).
Parameter | Description |
---|---|
level |
The memory storage to use:
|
Computes a pair-wise frequency table or contingency table from distinct values of the given two columns.
The number of distinct values for each column must be less than 1e4.
Returns a dataframe that contains the contingency table. In the contingency table:
- The name of the first column is
col1_col2
, wherecol1
is the first column andcol2
is the second column before transformation. The first column of each row are the distinct values ofcol1
. - The other column names are the distinct values of
col2
. - The counts are returned as the
long
type. - Pairs without occurrences are given zero as the count.
- The maximum number of non-zero pairs is 1e6.
- Null elements are replaced by
null
.
Parameter | Description |
---|---|
col1 |
Name of the first column. See the note after this table for limitations. Distinct items from |
col2 |
Name of the second column. See the note after this table for limitations. Distinct items from |
Only the following characters are allowed in a column name:
- lower and upper case letters
a-z
A-Z
- numbers
0-9
- underscore
_
Example
Suppose col1
is age
and col2
is salary
.
Before transformation:
+---+------+
|age|salary|
+---+------+
|1 |100 |
|1 |200 |
|2 |100 |
|2 |100 |
|2 |300 |
|3 |200 |
|3 |300 |
+---+------+
After transformation:
+----------+---+---+---+
|age_salary|100|200|300|
+----------+---+---+---+
|2 |2 |0 |1 |
|1 |1 |1 |0 |
|3 |0 |1 |1 |
+----------+---+---+---+
Generates a multi-dimensional cube of all possible combinations using the given columns list and the calculations in the given aggregate function expressions.
The following aggregate functions are supported in the expressions:
AVG
COUNT
MEAN
MAX
MIN
SUM
Returns a dataframe after the cube transformation.
Parameter | Description |
---|---|
cubeColumns |
The list of columns, separated by commas, by which to generate the multi-dimensional cube. |
aggExpressions |
The aggregate function expressions to perform on the columns. For example: |
Example
Before transformation:
+----------+-----+-------+------+---+
|department|group|region |salary|age|
+----------+-----+-------+------+---+
|Eng |A |local |1000 |10 |
|Art |B |foreign|2000 |11 |
|Sport |C |local |1500 |13 |
|Eng |A |foreign|800 |19 |
|Art |A |foreign|3000 |30 |
|Eng |E |foreign|1000 |21 |
|Art |D |local |1000 |32 |
|Sport |C |local |1000 |28 |
+----------+-----+-------+------+---+
If cubeColumns
is department, region
and aggExpressions
issalary -> avg, age -> max
(compute the average salary and maximum age), the after transformation cube is:
+----------+-------+-----------------+--------+
|department|region |avg(salary) |max(age)|
+----------+-------+-----------------+--------+
|Eng |foreign|900.0 |21 |
|Sport |null |1250.0 |28 |
|Art |null |2000.0 |32 |
|Eng |null |933.3333333333334|21 |
|null |local |1125.0 |32 |
|Art |foreign|2500.0 |30 |
|null |foreign|1700.0 |30 |
|Eng |local |1000.0 |10 |
|null |null |1412.5 |32 |
|Sport |local |1250.0 |28 |
|Art |local |1000.0 |32 |
+----------+-------+-----------------+--------+
Generates synthetic data using the given file format and content. The supported file formats are:
- CSV
- JSON
Returns a data entity. The synthetic data source can be used in pipeline unit-tests where real data is not used or moved.
Parameter | Description |
---|---|
format |
The type of file. Supported values are:
|
content |
The file content for the given format. |
Example
A CSV content example:
co1,co2,co3
1,B,3
A,2,C
Data generation result:
+---+---+---+
|co1|co2|co3|
+---+---+---+
|1 |B |3 |
+---+---+---+
|A |2 |C |
+---+---+---+
Finds duplicates in the given column or columns list and returns a new dataset with the duplicate rows removed.
Parameter | Description |
---|---|
columns |
A column name or a comma-separated list of column names. |
Example
Before transformation:
+---+---+
|a |b |
+---+---+
|1 |2 |
|3 |4 |
|1 |3 |
|2 |4 |
+---+---+
If columns
is a
, the dataset after transformation is:
+---+---+
|a |b |
+---+---+
|1 |2 |
|3 |4 |
|2 |4 |
+---+---+
Finds frequent items in the given column or columns list using the given minimum frequency.
The frequent element count algorithm proposed by Karl et al. is used to find frequent items. False positives are possible.
Returns a dataframe with an array of frequent items for each column.
Parameter | Description |
---|---|
freqItems |
A column name or a comma-separated list of column names for which you want to find frequent items. |
support |
Minimum frequency for an item to be considered frequent. The value must be greater than 1e-4 (decimal value of For example, |
Example
Before transformation:
+-------+--------------+
|user_id|favorite_fruit|
+-------+--------------+
| 1| apple|
| 2| banana|
| 3| apple|
| 4| orange|
| 5| banana|
| 6| banana|
| 7| apple|
+-------+--------------+
If freqItems
is favorite_fruit
and support
is 0.6
, the dataframe returned after transformation is:
+------------------------+
|favorite_fruit_freqItems|
+------------------------+
| [banana]|
+------------------------+
Removes rows that contain null or NaN values in the given column or columns list.
Returns a dataframe that excludes the rows.
Parameter | Description |
---|---|
how |
How to determine if a row is to be dropped. Supported values are:
The |
cols |
A column name or a comma-separated list of column names. |
minNonNulls |
The minimum number of non-null and non-NaN values that a row can contain. Drops the rows that contain less than the specified minimum. The |
Example
Before transformation:
+----+-------+----+
|id |name |age |
+----+-------+----+
|1 |Alice |25 |
|2 |null |28 |
|null|Bob |null|
|4 |Charlie|30 |
+----+-------+----+
If how
is any
and cols
is name
, the dataframe returned after transformation is:
+----+-------+----+
|id |name |age |
+----+-------+----+
|1 |Alice |25 |
|null|Bob |null|
|4 |Charlie|30 |
+----+-------+----+
When you use the how
parameter with the value all
, a row is dropped only if all its values are null. For example, before transformation:
+----+--------+----+
| id| name| age|
+----+--------+----+
| 1|John Doe| 30|
| 2| null|null|
| 3|Jane Doe| 25|
| 4|Jake Doe|null|
|null| null|null|
+----+--------+----+
After transformation:
+---+--------+----+
| id| name| age|
+---+--------+----+
| 1|John Doe| 30|
| 2| null|null|
| 3|Jane Doe| 25|
| 4|Jake Doe|null|
+---+--------+----+
Replaces null values and returns a dataframe using the replaced values.
Parameter | Description |
---|---|
replacement |
The key-value map to use for replacing null values. The key is a column name. The value is a replacement value. The key points to the replacement value. For example: A replacement value is cast to the column's data type. A replacement value must be of the following type:
|
Example
Before transformation:
+----+-------+----+
|id |name |age |
+----+-------+----+
|1 |Alice |25 |
|2 |null |28 |
|null|Bob |null|
|4 |Charlie|30 |
+----+-------+----+
If the key-value map is as follows:
id -> 3
name -> unknown
age -> 10
After transformation:
+---+-------+---+
|id |name |age|
+---+-------+---+
|1 |Alice |25 |
|2 |unknown|28 |
|3 |Bob |10 |
|4 |Charlie|30 |
+---+-------+---+
Replaces a value with another value in the given column or columns by matching keys in the given key and replacement value map.
Returns a new dataframe that contains the replaced transformation.
Parameter | Description |
---|---|
cols |
A column name or a comma-separated list of column names on which to apply replacement values. If |
replacement |
The replacement key-value map to use. The key is a value to replace. The value is the replacement value. The map value can have nulls. The key points to the replacement value. For example: The key and replacement value pair must have the same type. Only the following types are supported:
|
Example
Before transformation:
+----+-------+----+
|id |name |age |
+----+-------+----+
|1 |Alice |25 |
|2 |null |28 |
|null|Bob |null|
|4 |Charlie|30 |
+----+-------+----+
If cols
is name
, and replacement
is Alice -> Tom
, the dataframe after transformation is:
+----+-------+----+
|id |name |age |
+----+-------+----+
|1 |Tom |25 |
|2 |null |28 |
|null|Bob |null|
|4 |Charlie|30 |
+----+-------+----+
Divides a dataset by the given columns into the specified number of partitions.
Returns a new dataset that's hash partitioned. The exact number of partition as specified is returned.
Parameter | Description |
---|---|
partitionColumns |
A column name or comma-separated list of column names by which the dataset is partitioned. |
numberOfPartitions |
The number of partitions to create. |
Generates a multi-dimensional rollup of possible combinations using the given columns list and the calculations in the given aggregate function expressions.
The following aggregate functions are supported in the expressions:
AVG
COUNT
MEAN
MAX
MIN
SUM
Returns a dataframe after the rollup transformation.
Parameter | Description |
---|---|
rollupColumns |
The list of columns, separated by commas, by which to generate the multi-dimensional rollup. |
aggExpressions |
The aggregate function expressions to perform on the columns. For example: |
Example
Before transformation:
+----------+-----+-------+------+---+
|department|group|region |salary|age|
+----------+-----+-------+------+---+
|Eng |A |local |1000 |10 |
|Art |B |foreign|2000 |11 |
|Sport |C |local |1500 |13 |
|Eng |A |foreign|800 |19 |
|Art |A |foreign|3000 |30 |
|Eng |E |foreign|1000 |21 |
|Art |D |local |1000 |32 |
|Sport |C |local |1000 |28 |
+----------+-----+-------+------+---+
If rollupColumns
is department, region
and aggExpressions
is salary -> avg, age -> max
(compute the average salary and maximum age), the after transformation rollup is:
+----------+-------+-----------------+--------+
|department|region |avg(salary) |max(age)|
+----------+-------+-----------------+--------+
|Eng |foreign|900.0 |21 |
|Sport |null |1250.0 |28 |
|Art |null |2000.0 |32 |
|Eng |null |933.3333333333334|21 |
|Art |foreign|2500.0 |30 |
|Eng |local |1000.0 |10 |
|null |null |1412.5 |32 |
|Sport |local |1250.0 |28 |
|Art |local |1000.0 |32 |
+----------+-------+-----------------+--------+
Runs Spark SQL queries on incoming data, by first creating temporary views using the given table names.
Parameter | Description |
---|---|
SQL |
The SQL statement or script to run. Example: |
tableName |
A table name or comma-separated list of table names by which Spark creates temporary tables. Example: |
Generates a stratified sample without replacement based on the given sampling fraction for each stratum.
Returns a new dataframe that represents the stratified sample.
Parameter | Description |
---|---|
strata_col |
The column that defines the strata. |
fractions |
The sampling fraction for each stratum, from For example, If a fraction is not specified, zero is assumed. |
sample_size |
If |
seed |
Use any random number in |
Example
Before transformation:
+---+-----+
|key|value|
+---+-----+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 1|
| 2| 3|
| 3| 2|
| 3| 3|
+---+-----+
If strata_col
is key
, and fractions
is the following:
1 -> 1.0, 3 -> 0.5
After transformation:
+---+-----+
|key|value|
+---+-----+
| 1| 1|
| 1| 2|
| 3| 2|
+---+-----+
Computes the given statistics for numeric and string columns.
If no statistics are provided, all the following are computed:
COUNT
MEAN
STDDEV
MIN
- approximate quartiles (percentiles at 25%, 50%, and 75%)
MAX
Parameter | Description |
---|---|
statistics |
A comma-separated list of statistics. Supported values are:
Example: |
Example
Before transformation:
+----------+------+-----------------+--------+
|department|gender|avg(salary) |max(age)|
+----------+------+-----------------+--------+
|Eng |female|900.0 |21 |
|Sport |null |1250.0 |28 |
|Art |null |2000.0 |32 |
|Eng |null |933.3333333333334|21 |
|Art |female|2500.0 |30 |
|Eng |male |1000.0 |10 |
|null |null |1412.5 |32 |
|Sport |male |1250.0 |28 |
|Art |male |1000.0 |32 |
+----------+------+-----------------+--------+
Suppose statistics
is count, mean, stddev, min, 27%, 41%, 95%, max
.
After transformation:
+-------+----------+-----+------+----------------+----------------+
|summary|department|group|gender|salary |age |
+-------+----------+-----+------+----------------+----------------+
|count |8 |8 |8 |8 |8 |
|mean |null |null |null |1412.5 |20.5 |
|stddev |null |null |null |749.166203188585|8.76682056718072|
|min |Art |A |female|800 |10 |
|27% |null |null |null |1000 |13 |
|41% |null |null |null |1000 |19 |
|95% |null |null |null |3000 |32 |
|max |Sport |E |male |3000 |32 |
+-------+----------+-----+------+----------------+----------------+