Data Cleaning

Operators in the Data Cleaning category

Subcategories

Join, Set, Aggregate, Sort

Operators

| Operator | Description |
| --- | --- |
| Distinct | Remove duplicate tuples |
| Filter | Performs a filter operation using OR between multiple predicates |
| Limit | Limit the number of output rows |
| Projection | Keeps or drops the selected columns |
| Type Casting | Cast between types |

Total: 5 operators

1 - Join

Operators in the Join category

Operators

| Operator | Description |
| --- | --- |
| Cartesian Product | Append fields together to get the cartesian product of two inputs |
| Hash Join | Join two inputs |
| Interval Join | Join two inputs with left table join key in the range of [right table join key, right table join key + constant value] |

Total: 3 operators

1.1 - Cartesian Product

Append fields together to get the cartesian product of two inputs

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
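
As a rough illustration of what this operator computes, the sketch below pairs every row of one input with every row of the other and appends their fields. The function name `cartesian_product` and the dict-based row representation are assumptions made for this example; the page does not say how column name collisions are handled, so the example uses disjoint column names.

```python
from itertools import product

def cartesian_product(left, right):
    """Pair every left row with every right row and append the fields."""
    # Assumption: column names do not collide. If they did, the right
    # value would overwrite the left one in this sketch.
    return [{**l, **r} for l, r in product(left, right)]

users = [{"user": "ann"}, {"user": "bob"}]
sizes = [{"size": "S"}, {"size": "M"}, {"size": "L"}]

rows = cartesian_product(users, sizes)
print(len(rows))  # 2 * 3 = 6 rows
print(rows[0])    # {'user': 'ann', 'size': 'S'}
```

The output size is the product of the two input sizes, so this operator is best kept for small inputs.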

1.2 - Hash Join

Join two inputs

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Left Input Attribute | | String | - | Attribute to be joined on the Left Input |
| Right Input Attribute | | String | - | Attribute to be joined on the Right Input |
| Join Type | | inner, left outer, right outer, full outer | inner | Select the join type to execute |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
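
The following is a minimal sketch of the classic build-and-probe hash join, covering the four join types listed above. It is an illustration only, not the operator's actual implementation: the function name `hash_join`, the dict-based rows, and padding unmatched rows with `None` are all assumptions.

```python
from collections import defaultdict

def hash_join(left, right, left_attr, right_attr, join_type="inner"):
    # Build phase: index the right input by its join attribute.
    table = defaultdict(list)
    for r in right:
        table[r[right_attr]].append(r)

    left_cols = list(left[0]) if left else []
    right_cols = list(right[0]) if right else []
    matched_right = set()  # ids of right rows that found a partner
    out = []

    # Probe phase: look up each left row in the hash table.
    for l in left:
        partners = table.get(l[left_attr], [])
        for r in partners:
            matched_right.add(id(r))
            out.append({**l, **r})
        if not partners and join_type in ("left outer", "full outer"):
            # Keep the unmatched left row, padding right columns with None.
            out.append({**l, **{c: None for c in right_cols if c not in l}})

    if join_type in ("right outer", "full outer"):
        for r in right:
            if id(r) not in matched_right:
                # Keep the unmatched right row, padding left columns with None.
                out.append({**{c: None for c in left_cols if c not in r}, **r})
    return out

orders = [{"order": 1, "cust_id": 10}, {"order": 2, "cust_id": 99}]
customers = [{"id": 10, "name": "ann"}, {"id": 20, "name": "bob"}]

print(hash_join(orders, customers, "cust_id", "id"))
print(len(hash_join(orders, customers, "cust_id", "id", "full outer")))  # 3
```

Here `cust_id` plays the role of Left Input Attribute and `id` the role of Right Input Attribute.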

1.3 - Interval Join

Join two inputs with left table join key in the range of [right table join key, right table join key + constant value]

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Interval Constant | | Long | 10 | Left attribute lies in the range from right attribute to right attribute + constant |
| Include Left Bound | | Boolean | true | Include the condition left attribute = right attribute |
| Include Right Bound | | Boolean | true | Include the condition left attribute = right attribute + constant |
| Time interval type | | TimeIntervalType | day | Year, Month, Day, Hour, Minute or Second |
| Left Input attr | | String (integer, long, double, timestamp) | - | Choose one attribute in the left table |
| Right Input attr | | String | - | Choose one attribute in the right table |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
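
To make the interval condition and the two bound flags concrete, here is a small sketch for numeric keys. It uses a plain nested loop for clarity, which is not necessarily how the operator evaluates the join, and the function and parameter names are hypothetical. Timestamp keys combined with the Time interval type property are not covered.

```python
def interval_join(left, right, left_attr, right_attr, constant=10,
                  include_left_bound=True, include_right_bound=True):
    """Join rows where left key falls between right key and right key + constant."""
    out = []
    for l in left:
        for r in right:
            x = l[left_attr]
            lo = r[right_attr]
            hi = r[right_attr] + constant
            # Each flag decides whether equality at that end counts as a match.
            above = x >= lo if include_left_bound else x > lo
            below = x <= hi if include_right_bound else x < hi
            if above and below:
                out.append({**l, **r})
    return out

events = [{"t": 5}, {"t": 15}, {"t": 16}]
windows = [{"start": 5}]

print(interval_join(events, windows, "t", "start"))
# Matches t=5 and t=15 because both bounds are included by default.
print(interval_join(events, windows, "t", "start", include_left_bound=False))
# Matches only t=15.
```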

2 - Set

Operators in the Set category

Operators

| Operator | Description |
| --- | --- |
| Difference | Find the set difference of two inputs |
| Intersect | Take the intersection of two inputs |
| SymmetricDifference | Find the symmetric difference (the set of elements which are in either of the sets, but not in their intersection) of two inputs |
| Union | Unions the output rows from multiple input operators |

Total: 4 operators
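
The four set operators can be illustrated with Python's built-in set type, treating each row as a tuple. Two things are assumptions here because the page does not state them: that Difference means the first input minus the second, and that Union keeps duplicate rows (shown as simple concatenation).

```python
a = [(1, "x"), (2, "y"), (3, "z")]
b = [(2, "y"), (3, "z"), (4, "w")]

# Rows in a that do not appear in b (assumed direction: first minus second).
difference = set(a) - set(b)

# Rows present in both inputs.
intersect = set(a) & set(b)

# Rows in exactly one of the two inputs.
symmetric_difference = set(a) ^ set(b)

# Assumption: Union concatenates the inputs without removing duplicates.
union = a + b

print(difference)            # {(1, 'x')}
print(sorted(intersect))     # [(2, 'y'), (3, 'z')]
print(sorted(symmetric_difference))  # [(1, 'x'), (4, 'w')]
print(len(union))            # 6
```

If a deduplicated union is needed, this sketch could be followed by a Distinct step.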

2.1 - Difference

Find the set difference of two inputs

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |

2.2 - Intersect

Take the intersection of two inputs

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |

2.3 - SymmetricDifference

Find the symmetric difference (the set of elements which are in either of the sets, but not in their intersection) of two inputs

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |

2.4 - Union

Unions the output rows from multiple input operators

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |

3 - Aggregate

Operators in the Aggregate category

Operators

| Operator | Description |
| --- | --- |
| Aggregate | Calculate different types of aggregation values |

Total: 1 operator

3.1 - Aggregate

Calculate different types of aggregation values

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Aggregations | | List | - | Multiple aggregation functions (min: 1, aggregations cannot be empty) |
| ↳ Aggregate Func | | sum, count, average, min, max, concat | - | Sum, count, average, min, max, or concat |
| ↳ Attribute | | String | - | Column to apply the aggregation function to |
| ↳ Result Attribute | | String | - | Column name of the aggregation result |
| Group By Keys | | List | - | Group by columns |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
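
The properties above can be read as a list of (function, attribute, result attribute) triples applied per group. The sketch below follows that reading; the names `aggregate` and `AGG_FUNCS`, the comma separator used for `concat`, and the absence of null handling are assumptions made for illustration.

```python
from collections import defaultdict

# One Python callable per Aggregate Func option.
AGG_FUNCS = {
    "sum": sum,
    "count": len,
    "average": lambda vals: sum(vals) / len(vals),
    "min": min,
    "max": max,
    # Assumption: concat joins values with a comma.
    "concat": lambda vals: ",".join(str(v) for v in vals),
}

def aggregate(rows, aggregations, group_by_keys=()):
    """aggregations: list of (func, attribute, result_attribute) triples."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_by_keys)].append(row)

    out = []
    for key, members in groups.items():
        result = dict(zip(group_by_keys, key))
        for func, attr, result_attr in aggregations:
            result[result_attr] = AGG_FUNCS[func]([m[attr] for m in members])
        out.append(result)
    return out

sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 5},
]
totals = aggregate(
    sales,
    [("sum", "amount", "total"), ("average", "amount", "avg")],
    ["region"],
)
print(totals)
```

With no Group By Keys, all rows fall into a single group and one output row is produced (for non-empty input).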

4 - Sort

Operators in the Sort category

Operators

| Operator | Description |
| --- | --- |
| Sort | Sort based on the columns and sorting methods |
| Sort Partitions | Sort Partitions |
| Stable Merge Sort | Stable per-partition sort with multi-key ordering (incremental stack of sorted buckets) |

Total: 3 operators

4.1 - Sort

Sort based on the columns and sorting methods

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Attributes | | List | - | Columns to perform sorting on |
| ↳ Attribute | | String | - | Attribute name to sort by |
| ↳ Sort Preference | | ASC, DESC | - | Sort preference (ASC or DESC) |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
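
A multi-column sort with a per-column ASC or DESC preference can be sketched with repeated stable sorts. This example assumes the first attribute in the list is the most significant one, which the page does not state; the function name `sort_rows` is hypothetical.

```python
def sort_rows(rows, attributes):
    """attributes: list of (attribute, 'ASC' | 'DESC'), most significant first."""
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-key ordering.
    for attr, preference in reversed(attributes):
        result.sort(key=lambda r: r[attr], reverse=(preference == "DESC"))
    return result

people = [
    {"name": "ann", "age": 30},
    {"name": "bob", "age": 25},
    {"name": "cy", "age": 30},
]
ordered = sort_rows(people, [("age", "DESC"), ("name", "ASC")])
print([p["name"] for p in ordered])  # ['ann', 'cy', 'bob']
```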

4.2 - Sort Partitions

Sort Partitions

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Attribute | | String (integer, long, double) | - | Attribute to sort (must be numerical) |
| Attribute Domain Min | | Long | 0 | Minimum value of the domain of the attribute |
| Attribute Domain Max | | Long | 0 | Maximum value of the domain of the attribute |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
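
This page does not describe how Sort Partitions uses the domain bounds. One plausible reading, offered here purely as an assumption, is range partitioning: the domain between the minimum and maximum is split into equal-width ranges, each row is routed to the range containing its key, and each partition is sorted locally. The function name and the `num_partitions` parameter are hypothetical.

```python
def sort_partitions(rows, attr, domain_min, domain_max, num_partitions):
    """Range-partition rows on a numeric attribute, then sort each partition."""
    if domain_max <= domain_min:
        raise ValueError("domain_max must be greater than domain_min")
    width = (domain_max - domain_min) / num_partitions
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = int((row[attr] - domain_min) / width)
        # Clamp so the maximum value and out-of-domain values stay in range.
        idx = min(max(idx, 0), num_partitions - 1)
        partitions[idx].append(row)
    return [sorted(p, key=lambda r: r[attr]) for p in partitions]

data = [{"v": v} for v in (7, 1, 9, 4, 0, 5)]
parts = sort_partitions(data, "v", domain_min=0, domain_max=9, num_partitions=2)
print([[r["v"] for r in p] for p in parts])  # [[0, 1, 4], [5, 7, 9]]
```

Under this reading, reading the partitions in order gives a globally sorted result, which is why the domain bounds would need to reflect the actual data.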

4.3 - Stable Merge Sort

Stable per-partition sort with multi-key ordering (incremental stack of sorted buckets)

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Sort Keys | | List | - | List of attributes to sort by with ordering preferences |
| ↳ Attribute | | String | - | Attribute name to sort by |
| ↳ Sort Preference | | ASC, DESC | - | Sort preference (ASC or DESC) |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
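
The phrase "incremental stack of sorted buckets" suggests an incremental merge sort in which each arriving tuple starts as its own sorted bucket and equally sized neighbouring buckets are merged as they accumulate. The sketch below is one interpretation of that idea, not the operator's confirmed implementation; it takes a single key function rather than a list of sort keys to stay short.

```python
def merge(older, newer, key):
    """Stable merge: on ties, rows from the older bucket come first."""
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if key(newer[j]) < key(older[i]):
            out.append(newer[j])
            j += 1
        else:
            out.append(older[i])
            i += 1
    out.extend(older[i:])
    out.extend(newer[j:])
    return out

def stable_merge_sort(rows, key):
    stack = []  # sorted buckets, oldest at the bottom
    for row in rows:
        stack.append([row])
        # Merge the two newest buckets while they have the same size.
        while len(stack) >= 2 and len(stack[-1]) == len(stack[-2]):
            newer = stack.pop()
            older = stack.pop()
            stack.append(merge(older, newer, key))
    # Input exhausted: collapse whatever buckets remain.
    while len(stack) >= 2:
        newer = stack.pop()
        older = stack.pop()
        stack.append(merge(older, newer, key))
    return stack[0] if stack else []

rows = [("b", 2), ("a", 1), ("b", 1), ("a", 2), ("c", 0)]
print(stable_merge_sort(rows, key=lambda r: r[0]))
# [('a', 1), ('a', 2), ('b', 2), ('b', 1), ('c', 0)]
```

Only adjacent buckets are ever merged, so rows with equal keys keep their arrival order, which is what makes the sort stable.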

5 - Distinct

Remove duplicate tuples

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
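
Removing duplicate tuples can be sketched as keeping the first occurrence of each distinct row. The function name `distinct` is hypothetical, and keeping the first occurrence in input order is an assumption; the sketch also requires that all field values are hashable.

```python
def distinct(rows):
    """Keep the first occurrence of each row, comparing all fields."""
    seen = set()
    out = []
    for row in rows:
        # A dict is not hashable, so use its sorted items as a fingerprint.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            out.append(row)
    return out

visits = [
    {"user": "ann", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "ann", "page": "/home"},
]
print(distinct(visits))  # the repeated ann row is removed
```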

6 - Filter

Performs a filter operation using OR between multiple predicates

Input Properties

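| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Predicates | | List | - | Multiple predicates in OR |
| ↳ Attribute | | String | - | |
| ↳ Condition | | =, >, >=, <, <=, !=, is null, is not null | - | |
| ↳ Value | | String | - | |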

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
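
Since the predicates are combined with OR, a row passes as soon as any one predicate holds. The sketch below models that with `any`. Several details are assumptions: the Value is given as a string (as in the table) and converted to the type of the field being compared, comparisons against a null field never match, and the function names are hypothetical.

```python
import operator

CONDITIONS = {
    "=": operator.eq, ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le, "!=": operator.ne,
}

def matches(row, attribute, condition, value=None):
    field = row.get(attribute)
    if condition == "is null":
        return field is None
    if condition == "is not null":
        return field is not None
    if field is None:
        return False  # assumption: comparisons against null never match
    # Assumption: convert the string value to the field's own type.
    # This works for int, float, and str fields, but not for booleans.
    return CONDITIONS[condition](field, type(field)(value))

def filter_rows(rows, predicates):
    """Keep rows that satisfy at least one predicate (OR semantics)."""
    return [r for r in rows if any(matches(r, *p) for p in predicates)]

people = [
    {"name": "ann", "age": 30},
    {"name": "bob", "age": None},
    {"name": "cy", "age": 17},
]
kept = filter_rows(people, [("age", ">=", "18"), ("age", "is null")])
print([p["name"] for p in kept])  # ['ann', 'bob']
```

To express AND, two such filters would be chained one after the other.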

7 - Limit

Limit the number of output rows

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Limit | | Integer | 0 | The max number of output rows |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
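
A row limit can be sketched with `itertools.islice`, which stops reading the input once enough rows have been produced. The function name `limit` is hypothetical, and which rows are kept when the input has no defined order is not addressed by the page.

```python
from itertools import islice

def limit(rows, n):
    """Return at most the first n rows of an iterable."""
    return list(islice(rows, n))

def numbered_rows():
    # A generator, to show that rows beyond the limit are never produced.
    i = 0
    while True:
        yield {"row": i}
        i += 1

first_three = limit(numbered_rows(), 3)
print(first_three)  # [{'row': 0}, {'row': 1}, {'row': 2}]
```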

8 - Projection

Keeps or drops the selected columns

Input Properties

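| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Drop Option | | Boolean | false | Check to drop the selected attributes |
| Attributes | | List | - | |
| ↳ Attribute | | String | - | Attribute name in the schema |
| ↳ Alias | | String | - | Renamed attribute name |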

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
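
The two modes of this operator, keeping the selected attributes (optionally renamed) or dropping them, can be sketched as follows. The function name `project` and the use of `None` for "no alias" are assumptions; the sketch ignores aliases when dropping, since a dropped column has no output name.

```python
def project(rows, attributes, drop=False):
    """attributes: list of (attribute, alias); alias may be None or ''."""
    if drop:
        dropped = {attr for attr, _ in attributes}
        return [{k: v for k, v in row.items() if k not in dropped}
                for row in rows]
    # Keep mode: output only the listed attributes, renamed when an alias is set.
    return [{(alias or attr): row[attr] for attr, alias in attributes}
            for row in rows]

records = [{"id": 1, "first_name": "ann", "tmp": 0}]

print(project(records, [("id", None), ("first_name", "name")]))
# [{'id': 1, 'name': 'ann'}]
print(project(records, [("tmp", None)], drop=True))
# [{'id': 1, 'first_name': 'ann'}]
```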

9 - Type Casting

Cast between types

Input Properties

| Property | Requirement | Type | Default | Description |
| --- | --- | --- | --- | --- |
| TypeCasting Units | | List | - | Multiple type castings |
| ↳ Attribute | | String | - | Attribute for type casting |
| ↳ Cast type | | string, integer, long, double, boolean, timestamp, binary, large_binary | - | Result type after type casting |

Output Ports

| Port | Mode |
| --- | --- |
| 0 | Set Snapshot |
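
Each type casting unit pairs an attribute with a target type, which can be sketched as a lookup table of conversion functions. The conversion rules below are assumptions chosen for illustration (for example, ISO 8601 strings for timestamps, UTF-8 for binary, and "true" or "1" for boolean); the operator's actual parsing rules are not given on this page, and `large_binary` is left out.

```python
from datetime import datetime

def to_boolean(value):
    # Assumption: only "true" and "1" (case-insensitive) count as true.
    if isinstance(value, str):
        return value.strip().lower() in ("true", "1")
    return bool(value)

CASTS = {
    "string": str,
    "integer": int,
    "long": int,  # Python ints are unbounded, so integer and long coincide
    "double": float,
    "boolean": to_boolean,
    "timestamp": lambda v: datetime.fromisoformat(v),
    "binary": lambda v: str(v).encode("utf-8"),
}

def type_cast(rows, units):
    """units: list of (attribute, cast_type) pairs."""
    out = []
    for row in rows:
        new_row = dict(row)
        for attr, cast_type in units:
            if new_row[attr] is not None:  # leave nulls untouched
                new_row[attr] = CASTS[cast_type](new_row[attr])
        out.append(new_row)
    return out

raw = [{"id": "7", "price": "9.50", "active": "true",
        "created": "2024-01-15T08:30:00"}]
typed = type_cast(raw, [("id", "integer"), ("price", "double"),
                        ("active", "boolean"), ("created", "timestamp")])
print(typed[0])
```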