Data Cleaning
Operators in the Data Cleaning category
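Subcategories
| Subcategory | Description |
|---|---|
| Join | Operators in the Join category |
| Set | Operators in the Set category |
| Aggregate | Operators in the Aggregate category |
| Sort | Operators in the Sort category |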
Operators
| Operator | Description |
|---|---|
| Distinct | Remove duplicate tuples |
| Filter | Performs a filter operation using OR between multiple predicates |
| Limit | Limit the number of output rows |
| Projection | Keeps or drops the selected columns |
| Type Casting | Cast between types |
Total: 5 operators
1 - Join
Operators in the Join category
Operators
| Operator | Description |
|---|---|
| Cartesian Product | Append fields together to get the cartesian product of two inputs |
| Hash Join | Join two inputs |
| Interval Join | Join two inputs with left table join key in the range of [right table join key, right table join key + constant value] |
Total: 3 operators
1.1 - Cartesian Product
Append fields together to get the cartesian product of two inputs
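A minimal sketch of the Cartesian product on rows modeled as dictionaries. The function name `cartesian_product` and the sample data are hypothetical, and the sketch assumes the two inputs have no field names in common.

```python
from itertools import product


def cartesian_product(left_rows, right_rows):
    """Pair every left row with every right row and append their fields."""
    # Assumption: field names do not clash between the two inputs.
    return [{**left, **right} for left, right in product(left_rows, right_rows)]


colors = [{"color": "red"}, {"color": "blue"}]
sizes = [{"size": "S"}, {"size": "M"}, {"size": "L"}]
pairs = cartesian_product(colors, sizes)
print(len(pairs))  # one output row per (left, right) combination
```

The output size is the product of the two input sizes, so this operator grows quickly on large inputs.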
1.2 - Hash Join
Join two inputs
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Left Input Attribute | ✓ | String | - | Attribute to be joined on the Left Input |
| Right Input Attribute | ✓ | String | - | Attribute to be joined on the Right Input |
| Join Type | ✓ | inner, left outer, right outer, full outer | inner | Select the join type to execute |
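The four join types can be illustrated with a small build-and-probe sketch. This is not the operator's implementation: `hash_join`, the sample rows, and the choice to pad unmatched sides with `None` are assumptions made for illustration, and the two inputs are assumed to have distinct column names.

```python
from collections import defaultdict


def hash_join(left, right, left_attr, right_attr, join_type="inner"):
    # Build phase: index the positions of right rows by their join key.
    index = defaultdict(list)
    for pos, row in enumerate(right):
        index[row[right_attr]].append(pos)

    # Null padding for the outer join types (assumed behavior).
    left_nulls = {name: None for row in left for name in row}
    right_nulls = {name: None for row in right for name in row}

    matched_right = set()
    out = []
    # Probe phase: look up each left row's key in the index.
    for lrow in left:
        positions = index.get(lrow[left_attr], [])
        for pos in positions:
            matched_right.add(pos)
            out.append({**lrow, **right[pos]})
        if not positions and join_type in ("left outer", "full outer"):
            out.append({**lrow, **right_nulls})
    if join_type in ("right outer", "full outer"):
        for pos, rrow in enumerate(right):
            if pos not in matched_right:
                out.append({**left_nulls, **rrow})
    return out


users = [{"uid": 1, "name": "Ann"}, {"uid": 2, "name": "Bo"}]
orders = [{"user_id": 1, "item": "pen"}, {"user_id": 3, "item": "ink"}]

inner = hash_join(users, orders, "uid", "user_id", "inner")
left_outer = hash_join(users, orders, "uid", "user_id", "left outer")
right_outer = hash_join(users, orders, "uid", "user_id", "right outer")
full_outer = hash_join(users, orders, "uid", "user_id", "full outer")
```

With this data, only `uid` 1 has a match, so the inner join returns one row, each one-sided outer join returns two, and the full outer join returns three.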
1.3 - Interval Join
Join two inputs with left table join key in the range of [right table join key, right table join key + constant value]
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Interval Constant | ✓ | Long | 10 | Width of the interval: left attribute lies between right attribute and right attribute + constant |
| Include Left Bound | ✓ | Boolean | true | Include the condition left attribute = right attribute |
| Include Right Bound | ✓ | Boolean | true | Include the condition left attribute = right attribute + constant |
| Time interval type | | TimeIntervalType | day | Year, Month, Day, Hour, Minute or Second |
| Left Input attr | ✓ | String (integer, long, double, timestamp) | - | Choose one attribute in the left table |
| Right Input attr | ✓ | String | - | Choose one attribute in the right table |
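The interval condition and the two bound flags can be sketched as a nested-loop join over numeric keys. The function `interval_join` and the sample data are hypothetical; the sketch ignores the time interval type and assumes plain numbers, so it only illustrates how the bounds are applied.

```python
import operator


def interval_join(left, right, left_attr, right_attr, constant=10,
                  include_left_bound=True, include_right_bound=True):
    # Pick inclusive or exclusive comparisons from the two bound flags.
    lower_ok = operator.ge if include_left_bound else operator.gt
    upper_ok = operator.le if include_right_bound else operator.lt
    out = []
    for lrow in left:
        for rrow in right:
            low = rrow[right_attr]
            value = lrow[left_attr]
            if lower_ok(value, low) and upper_ok(value, low + constant):
                out.append({**lrow, **rrow})
    return out


clicks = [{"click_ts": 100}, {"click_ts": 110}, {"click_ts": 111}]
sessions = [{"start": 100}]

closed = interval_join(clicks, sessions, "click_ts", "start")
half_open = interval_join(clicks, sessions, "click_ts", "start",
                          include_right_bound=False)
```

Here `closed` keeps the clicks at 100 and 110, while `half_open` drops the click at 110 because it sits exactly on the right bound.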
2 - Set
Operators in the Set category
Operators
| Operator | Description |
|---|---|
| Difference | Find the set difference of two inputs |
| Intersect | Take the intersection of two inputs |
| SymmetricDifference | Find the symmetric difference (the set of elements which are in either of the sets, but not in their intersection) of two inputs |
| Union | Unions the output rows from multiple input operators |
Total: 4 operators
2.1 - Difference
Find the set difference of two inputs
2.2 - Intersect
Take the intersection of two inputs
2.3 - SymmetricDifference
Find the symmetric difference (the set of elements which are in either of the sets, but not in their intersection) of two inputs
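The Difference, Intersect, and SymmetricDifference operators correspond to Python's built-in set operations. The sketch below models rows as tuples and assumes whole-row equality with duplicates collapsed; the sample data is made up.

```python
left = {(1, "a"), (2, "b"), (3, "c")}
right = {(2, "b"), (3, "c"), (4, "d")}

# Rows in the left input that do not appear in the right input.
difference = left - right

# Rows that appear in both inputs.
intersect = left & right

# Rows that appear in exactly one of the two inputs.
symmetric_difference = left ^ right
```

The symmetric difference equals the union of both one-sided differences, `(left - right) | (right - left)`.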
2.4 - Union
Unions the output rows from multiple input operators
3 - Aggregate
Operators in the Aggregate category
Operators
| Operator | Description |
|---|---|
| Aggregate | Calculate different types of aggregation values |
Total: 1 operator
3.1 - Aggregate
Calculate different types of aggregation values
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Aggregations | ✓ | List | - | Multiple aggregation functions (min: 1, aggregations cannot be empty) |
| ↳ Aggregate Func | ✓ | sum, count, average, min, max, concat | - | Sum, count, average, min, max, or concat |
| ↳ Attribute | ✓ | String | - | Column to aggregate |
| ↳ Result Attribute | ✓ | String | - | Column name of the aggregation result |
| Group By Keys | | List | - | Group by columns |
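A rough sketch of how a list of aggregations combines with group-by keys. The names `AGG_FUNCS` and `aggregate`, the comma separator for `concat`, and the sample data are assumptions; null handling is left out.

```python
from collections import defaultdict

# One callable per aggregate function named in the table above.
AGG_FUNCS = {
    "sum": sum,
    "count": len,
    "average": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    # Assumption: concat joins the values with a comma.
    "concat": lambda values: ",".join(str(v) for v in values),
}


def aggregate(rows, aggregations, group_by_keys=()):
    """aggregations: list of (func, attribute, result_attribute) triples."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_by_keys)].append(row)

    out = []
    for key, members in groups.items():
        result = dict(zip(group_by_keys, key))
        for func, attribute, result_attribute in aggregations:
            values = [m[attribute] for m in members]
            result[result_attribute] = AGG_FUNCS[func](values)
        out.append(result)
    return out


sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 5},
]
totals = aggregate(
    sales,
    [("sum", "amount", "total"), ("average", "amount", "mean")],
    ["region"],
)
```

With no group-by keys, all rows fall into a single group and the result is one row.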
4 - Sort
Operators in the Sort category
Operators
| Operator | Description |
|---|---|
| Sort | Sort based on the columns and sorting methods |
| Sort Partitions | Sort Partitions |
| Stable Merge Sort | Stable per-partition sort with multi-key ordering (incremental stack of sorted buckets) |
Total: 3 operators
4.1 - Sort
Sort based on the columns and sorting methods
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Attributes | ✓ | List | - | Columns to sort on |
| ↳ Attribute | ✓ | String | - | Attribute name to sort by |
| ↳ Sort Preference | ✓ | ASC, DESC | - | Sort preference (ASC or DESC) |
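Sorting on several attributes with a per-attribute preference can be sketched with repeated stable sorts. The helper `sort_rows` and the sample data are hypothetical, and the sketch assumes the first listed attribute is the most significant.

```python
def sort_rows(rows, attributes):
    """attributes: list of (attribute, preference) pairs, most significant first."""
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-key ordering.
    for attribute, preference in reversed(attributes):
        result.sort(key=lambda row: row[attribute],
                    reverse=(preference == "DESC"))
    return result


people = [
    {"name": "Ann", "age": 30},
    {"name": "Bo", "age": 25},
    {"name": "Cy", "age": 30},
]
ordered = sort_rows(people, [("age", "DESC"), ("name", "ASC")])
```

In this example, the two people aged 30 come first, ordered by name, followed by the person aged 25.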
4.2 - Sort Partitions
Sort Partitions
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Attribute | ✓ | String (integer, long, double) | - | Attribute to sort (must be numerical) |
| Attribute Domain Min | ✓ | Long | 0 | Minimum value of the domain of the attribute |
| Attribute Domain Max | ✓ | Long | 0 | Maximum value of the domain of the attribute |
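The domain bounds suggest a range-partitioned sort, although this page does not say so. The sketch below is written under that assumption: it splits the domain into equal-width ranges, routes each row to its range, and sorts each partition locally. The function `sort_partitions`, the `num_partitions` parameter, and the sample data are all hypothetical.

```python
def sort_partitions(rows, attribute, domain_min, domain_max, num_partitions=2):
    # Assumption: the domain [domain_min, domain_max] is split into
    # equal-width ranges, one per partition.
    width = (domain_max - domain_min + 1) / num_partitions
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = int((row[attribute] - domain_min) / width)
        # Clamp values that fall outside the declared domain.
        idx = min(max(idx, 0), num_partitions - 1)
        partitions[idx].append(row)
    # Each partition is sorted independently.
    return [sorted(p, key=lambda r: r[attribute]) for p in partitions]


readings = [{"v": 7}, {"v": 1}, {"v": 9}, {"v": 4}]
parts = sort_partitions(readings, "v", domain_min=0, domain_max=9)
```

Under this reading, concatenating the partitions in order gives a globally sorted result, because every value in one partition is smaller than every value in the next.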
4.3 - Stable Merge Sort
Stable per-partition sort with multi-key ordering (incremental stack of sorted buckets)
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Sort Keys | ✓ | List | - | List of attributes to sort by with ordering preferences |
| ↳ Attribute | ✓ | String | - | Attribute name to sort by |
| ↳ Sort Preference | ✓ | ASC, DESC | - | Sort preference (ASC or DESC) |
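One possible reading of "incremental stack of sorted buckets" is sketched below: each incoming row starts as a one-row bucket, and the two newest buckets are merged whenever they have the same size. This is an illustration, not the operator's code; `compare`, `merge`, `stable_merge_sort`, and the sample data are hypothetical.

```python
def compare(a, b, sort_keys):
    """Return -1, 0, or 1 according to the (attribute, preference) pairs."""
    for attribute, preference in sort_keys:
        if a[attribute] != b[attribute]:
            less = a[attribute] < b[attribute]
            if preference == "DESC":
                less = not less
            return -1 if less else 1
    return 0


def merge(older, newer, sort_keys):
    merged, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        # On ties take from the older bucket so arrival order is preserved.
        if compare(older[i], newer[j], sort_keys) <= 0:
            merged.append(older[i])
            i += 1
        else:
            merged.append(newer[j])
            j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    return merged


def stable_merge_sort(rows, sort_keys):
    stack = []  # sorted buckets, oldest at the bottom
    for row in rows:
        stack.append([row])
        # Merge the two newest buckets while they have equal size.
        while len(stack) >= 2 and len(stack[-1]) == len(stack[-2]):
            newer = stack.pop()
            older = stack.pop()
            stack.append(merge(older, newer, sort_keys))
    result = []
    while stack:
        # Everything already in result arrived after the popped bucket.
        result = merge(stack.pop(), result, sort_keys)
    return result


events = [
    {"team": "b", "score": 1, "id": 1},
    {"team": "a", "score": 2, "id": 2},
    {"team": "b", "score": 1, "id": 3},
    {"team": "a", "score": 2, "id": 4},
    {"team": "a", "score": 5, "id": 5},
]
ordered = stable_merge_sort(events, [("team", "ASC"), ("score", "DESC")])
```

Rows that tie on every sort key, such as ids 2 and 4, keep their arrival order, which is what makes the sort stable.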
6 - Filter
Performs a filter operation using OR between multiple predicates
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Predicates | ✓ | List | - | Multiple predicates in OR |
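| ↳ Attribute | ✓ | String | - | Attribute to test |
| ↳ Condition | ✓ | =, >, >=, <, <=, !=, is null, is not null | - | Comparison applied to the attribute |
| ↳ Value | | String | - | Value to compare against; not needed for is null and is not null |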
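Because the predicates are combined with OR, a row passes as soon as any single predicate holds. A minimal sketch, assuming rows are dictionaries and comparison values are already typed (the operator itself takes the value as a string); `CONDITIONS`, `matches`, and `filter_rows` are hypothetical names.

```python
import operator

# Comparison conditions from the table above, mapped to Python operators.
CONDITIONS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "!=": operator.ne,
}


def matches(row, predicate):
    attribute, condition, value = predicate
    field = row.get(attribute)
    if condition == "is null":
        return field is None
    if condition == "is not null":
        return field is not None
    # Assumption: a null field never satisfies a comparison.
    return field is not None and CONDITIONS[condition](field, value)


def filter_rows(rows, predicates):
    # OR semantics: keep the row if any predicate matches.
    return [row for row in rows if any(matches(row, p) for p in predicates)]


people = [
    {"age": 15, "city": "Oslo"},
    {"age": 40, "city": None},
    {"age": 70, "city": "Rome"},
]
kept = filter_rows(people, [("age", "<", 18), ("city", "is null", None)])
```

To express AND between predicates, one option is to chain several Filter operators, each with a single predicate.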
7 - Limit
Limit the number of output rows
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Limit | ✓ | Integer | 0 | The max number of output rows |
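The effect of the limit can be sketched with `itertools.islice`, which stops reading its input once the requested number of rows has been produced. The helper name `limit` is hypothetical.

```python
from itertools import islice


def limit(rows, n):
    """Return at most the first n rows of any iterable."""
    return list(islice(rows, n))


first_three = limit(range(1000), 3)
```

A limit of 0 yields no rows, and a limit larger than the input returns the whole input.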
8 - Projection
Keeps or drops the selected columns
| Property | Requirement | Type | Default | Description |
|---|---|---|---|---|
| Drop Option | ✓ | Boolean | false | Check to drop the selected attributes |
| Attributes | ✓ | List | - | Attributes to keep or drop |
| ↳ Attribute | ✓ | String | - | Attribute name in the schema |
| ↳ Alias | | String | - | Renamed attribute name |
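A small sketch of the keep and drop modes on dictionary rows. The helper `project` is hypothetical, and the sketch assumes aliases only apply when keeping attributes.

```python
def project(rows, attributes, drop=False):
    """attributes: list of (attribute, alias) pairs; alias may be None."""
    if drop:
        # Drop mode: remove the selected attributes and keep the rest.
        # Assumption: aliases are ignored in this mode.
        dropped = {attribute for attribute, _ in attributes}
        return [{k: v for k, v in row.items() if k not in dropped}
                for row in rows]
    # Keep mode: output only the selected attributes, renamed if aliased.
    return [{(alias or attribute): row[attribute]
             for attribute, alias in attributes}
            for row in rows]


records = [{"id": 1, "first_name": "Ann", "ssn": "x"}]
kept = project(records, [("id", None), ("first_name", "name")])
without_ssn = project(records, [("ssn", None)], drop=True)
```

In keep mode, the output columns follow the order of the attribute list rather than the input schema.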