Pipeline Algebra

This section describes the Pipeline interface for vectorized data transformations. We will use the following definitions:

using DataKnots:
    @VectorTree,
    Pipeline,
    Runtime,
    adapt_missing,
    adapt_tuple,
    adapt_vector,
    block_any,
    block_cardinality,
    block_filler,
    block_length,
    block_lift,
    block_not_empty,
    chain_of,
    column,
    distribute,
    distribute_all,
    filler,
    flatten,
    get_by,
    group_by,
    lift,
    null_filler,
    pass,
    sieve_by,
    slice_by,
    sort_by,
    tuple_lift,
    tuple_of,
    unique_by,
    with_column,
    with_elements,
    wrap,
    x0toN,
    x1to1,
    x1toN

Lifting and Fillers

DataKnots stores structured data in a column-oriented format, serialized using specialized composite vector types. Consequently, operations on data must also be adapted to the column-oriented format.

In DataKnots, operations on column-oriented data are called pipelines. A pipeline is a vectorized transformation: it takes a vector of input values and produces a vector of the same size containing output values.

Any unary scalar function can be vectorized, which gives us a simple method for creating new pipelines. Consider, for example, the function titlecase(), which transforms the input string by capitalizing the first letter of each word and converting every other character to lowercase.

titlecase("JEFFERY A")      #-> "Jeffery A"

This function can be converted to a pipeline, or lifted, using the lift pipeline constructor.

p = lift(titlecase)
p(["JEFFERY A", "JAMES A", "TERRY A"])
#-> ["Jeffery A", "James A", "Terry A"]
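As a rough mental model, lifting a unary function amounts to applying it to every element of the input. The sketch below uses only Base Julia; `sketch_lift` is a hypothetical name introduced for illustration and is not how DataKnots implements `lift`.

```julia
# Hypothetical stand-in for `lift`: wrap a scalar function `f` into a
# function that maps `f` over every element of the input vector.
sketch_lift(f) = input -> map(f, input)

p = sketch_lift(titlecase)
out = p(["JEFFERY A", "JAMES A", "TERRY A"])
# `out` has the same length as the input, with one result per element.
```

The point of the sketch is the shape of the operation: one output element per input element, in the same order.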

A scalar function with N arguments can be lifted by tuple_lift to make a pipeline that transforms a TupleVector with N columns. For example, the binary predicate > gives rise to a pipeline tuple_lift(>) that transforms a TupleVector with two columns into a Boolean vector.

p = tuple_lift(>)
p(@VectorTree (Int, Int) [260004 200000; 185364 200000; 170112 200000])
#-> Bool[1, 0, 0]
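To make the column-wise evaluation concrete, the following Base-only sketch models a TupleVector as a plain tuple of equal-length column vectors. Both this representation and the name `sketch_tuple_lift` are assumptions made for illustration, not DataKnots internals.

```julia
# Hypothetical stand-in for `tuple_lift`: the input is modeled as a tuple of
# equal-length column vectors, and `f` receives one value from each column.
sketch_tuple_lift(f) = columns -> map(f, columns...)

salary = [260004, 185364, 170112]
threshold = [200000, 200000, 200000]

# Compare the two columns row by row.
flags = sketch_tuple_lift(>)((salary, threshold))
```

The function is never handed a tuple object per row; it is applied across columns, which is the idea behind a column-oriented layout.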

In a similar manner, a function with a vector argument can be lifted by block_lift to make a pipeline that expects a BlockVector input. For example, the function length(), which returns the length of a vector, can be converted to a pipeline block_lift(length) that transforms a block vector to an integer vector containing block lengths.

p = block_lift(length)
p(@VectorTree [String] [["JEFFERY A", "NANCY A"], ["JAMES A"]])
#-> [2, 1]
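One way to picture this is to model a block vector as a flat vector of elements plus a vector of offsets that marks where each block starts. That layout and the name `sketch_block_lift` are assumptions of this Base-only sketch.

```julia
# Hypothetical model of a block vector: block `i` occupies
# `elements[offsets[i]:offsets[i+1]-1]`.
offsets = [1, 3, 4]
elements = ["JEFFERY A", "NANCY A", "JAMES A"]

# Hypothetical stand-in for `block_lift`: apply `f` to a view of each block.
function sketch_block_lift(f)
    return (offs, elts) -> [f(view(elts, offs[i]:offs[i+1]-1))
                            for i in 1:length(offs)-1]
end

lengths = sketch_block_lift(length)(offsets, elements)
```

Using `view` avoids copying each block before the function is applied to it.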

Regular values, not just functions, can also give rise to pipelines. The filler constructor makes a pipeline from any scalar value. This pipeline maps any input vector to a vector filled with the given scalar.

p = filler(200000)
p(["JEFFERY A", "JAMES A", "TERRY A"])
#-> [200000, 200000, 200000]

Similarly, block_filler makes a pipeline from any vector value. This pipeline produces a BlockVector filled with the given vector.

p = block_filler(["POLICE", "FIRE"])
p(["GARRY M", "ANTHONY R", "DANA A"])
#-> @VectorTree (0:N) × String [["POLICE", "FIRE"], ["POLICE", "FIRE"], ["POLICE", "FIRE"]]

A variant of block_filler called null_filler makes a pipeline that produces a BlockVector filled with empty blocks.

p = null_filler()
p(["GARRY M", "ANTHONY R", "DANA A"])
#-> @VectorTree (0:1) × Bottom [missing, missing, missing]
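The three fillers share one property: the output depends only on the length of the input, not on its values. A minimal Base-only sketch, with hypothetical names and with blocks modeled as ordinary nested vectors, might look like this.

```julia
# Hypothetical stand-in for `filler`: ignore the input values and repeat
# `value` once per input element.
sketch_filler(value) = input -> fill(value, length(input))

# Hypothetical stand-in for `block_filler`: every output block is a copy of
# the same vector. Repeating an empty block plays the role of `null_filler`.
sketch_block_filler(block) = input -> [copy(block) for _ in input]

people = ["GARRY M", "ANTHONY R", "DANA A"]
scalars = sketch_filler(200000)(people)
blocks = sketch_block_filler(["POLICE", "FIRE"])(people)
empties = sketch_block_filler(String[])(people)
```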

Chaining Pipelines

Given a series of pipelines, the chain_of constructor creates their composition pipeline, which transforms the input vector by sequentially applying the given pipelines.

p = chain_of(lift(split), lift(first), lift(titlecase))
p(["JEFFERY A", "JAMES A", "TERRY A"])
#-> ["Jeffery", "James", "Terry"]

The degenerate composition of an empty sequence of pipelines has its own name, pass(). It passes its input to the output unchanged.

chain_of()
#-> pass()

p = pass()
p(["JEFFERY A", "JAMES A", "TERRY A"])
#-> ["JEFFERY A", "JAMES A", "TERRY A"]
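Sequential composition can be sketched in Base Julia as a left fold over the stages. The names `sketch_chain_of` and `elementwise` are hypothetical, and the sketch treats a pipeline as a plain function from vector to vector.

```julia
# Hypothetical stand-in for `chain_of`: feed the input through each stage in
# order. With no stages, `foldl` returns `init`, so the result is the
# identity, which matches the role of `pass()`.
sketch_chain_of(stages...) =
    input -> foldl((v, stage) -> stage(v), stages; init = input)

# Hypothetical elementwise wrapper, standing in for `lift`.
elementwise(f) = input -> map(f, input)

p = sketch_chain_of(elementwise(split), elementwise(first), elementwise(titlecase))
firsts = p(["JEFFERY A", "JAMES A", "TERRY A"])

# The empty chain returns its input unchanged.
same = sketch_chain_of()(["JEFFERY A"])
```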

In general, pipeline constructors that take one or more pipelines as arguments are called pipeline combinators. Combinators are used to assemble elementary pipelines into complex pipeline expressions.

Composite Vectors

In DataKnots, composite data is represented as a tree of vectors with regular Vector objects at the leaves and composite vectors, such as TupleVector and BlockVector, at the intermediate nodes. Pipelines that operate on and rearrange this tree are described here.

The tuple_of pipeline combinator permits us to construct a TupleVector. A TupleVector is a vector of tuples composed of a sequence of column vectors. Any collection of vectors can be used as columns as long as they all have the same length. One way to obtain N columns for a TupleVector is to apply N pipelines to the same input vector.

p = tuple_of(:first => chain_of(lift(split), lift(first), lift(titlecase)),
             :last => lift(last))
p(["JEFFERY A", "JAMES A", "TERRY A"])
#-> @VectorTree (first = String, last = Char) [(first = "Jeffery", last = 'A') … ]
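The idea of building columns by running several pipelines over one input can be sketched with a NamedTuple of vectors. This is a Base-only illustration with hypothetical names (`sketch_tuple_of`, `sketch_column`), not the TupleVector type itself.

```julia
# Hypothetical stand-in for `tuple_of`: run each labeled stage on the same
# input and collect the outputs as columns of a NamedTuple.
function sketch_tuple_of(stages::Pair{Symbol}...)
    labels = map(first, stages)
    return input -> NamedTuple{labels}(map(s -> last(s)(input), stages))
end

# Hypothetical stand-in for `column`: pick one column by its label.
sketch_column(label::Symbol) = columns -> getproperty(columns, label)

p = sketch_tuple_of(
    :first => (v -> map(s -> titlecase(first(split(s))), v)),
    :last => (v -> map(last, v)))
cols = p(["JEFFERY A", "JAMES A", "TERRY A"])
initials = sketch_column(:last)(cols)
```

Because every stage sees the same input, all resulting columns have the same length by construction.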

In the opposite direction, the column constructor makes a pipeline that extracts the specified column from the input TupleVector.

p = column(:salary)
p(@VectorTree (name=String, salary=Int) [("JEFFERY A", 101442), ("JAMES A", 103350), ("TERRY A", 93354)])
#-> [101442, 103350, 93354]

The wrap() pipeline primitive is used to create a BlockVector. A BlockVector is a vector of vectors serialized as a partitioned vector of elements. Any input vector can be transformed to a BlockVector by partitioning its elements into one-element blocks.

p = wrap()
p(["GARRY M", "ANTHONY R", "DANA A"])
#-> @VectorTree (1:1) × String ["GARRY M", "ANTHONY R", "DANA A"]
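If the partitioning is modeled as a vector of offsets over a flat vector of elements (an assumption of this sketch, with the hypothetical name `sketch_wrap`), wrapping only has to produce the offsets; the elements are reused as they are.

```julia
# Hypothetical model: block `i` spans `elements[offsets[i]:offsets[i+1]-1]`.
# Wrapping keeps the elements untouched and adds offsets 1, 2, ..., n+1,
# so every block holds exactly one element.
sketch_wrap(input) = (offsets = collect(1:length(input)+1), elements = input)

people = ["GARRY M", "ANTHONY R", "DANA A"]
wrapped = sketch_wrap(people)
```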

Dual to wrap() is the pipeline flatten(), which transforms a nested BlockVector by flattening its nested blocks.

p = flatten()
p(@VectorTree [[String]] [[["GARRY M"], ["ANTHONY R", "DANA A"]], [[], ["JOSE S"], ["CHARLES S"]]])
#-> @VectorTree (0:N) × String [["GARRY M", "ANTHONY R", "DANA A"], ["JOSE S", "CHARLES S"]]
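Under the same offsets-plus-elements model (an assumption of this sketch, not a statement about DataKnots internals), flattening two levels of blocks reduces to composing their offset vectors. The data below encodes the nested input from the example above.

```julia
# Hypothetical model: a block vector is a pair (offsets, elements), where
# block `i` spans `elements[offsets[i]:offsets[i+1]-1]`.
# Inner level: five blocks of names with lengths 1, 2, 0, 1, 1.
inner_offsets = [1, 2, 4, 4, 5, 6]
elements = ["GARRY M", "ANTHONY R", "DANA A", "JOSE S", "CHARLES S"]
# Outer level: two blocks, covering inner blocks 1:2 and 3:5.
outer_offsets = [1, 3, 6]

# Flattening composes the two offset vectors; the elements are not moved.
flat_offsets = inner_offsets[outer_offsets]

# Materialize the flattened blocks to inspect the result.
blocks = [elements[flat_offsets[i]:flat_offsets[i+1]-1]
          for i in 1:length(flat_offsets)-1]
```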

The distribute constructor makes a pipeline that rearranges a TupleVector with a BlockVector column. This operation exchanges their positions, pushing tuples down and pulling blocks up. Specifically, it takes each tuple, where a specific field must contain a block value, and transforms it to a block of tuples by distributing the block value over the tuple.

p = distribute(:employee)
p(@VectorTree (department = String, employee = [String]) [
    "POLICE"    ["GARRY M", "ANTHONY R", "DANA A"]
    "FIRE"      ["JOSE S", "CHARLES S"]]) |> display
#=>
@VectorTree of 2 × (0:N) × (department = String, employee = String):
 [(department = "POLICE", employee = "GARRY M"), (department = "POLICE", employee = "ANTHONY R"), (department = "POLICE", employee = "DANA A")]
 [(department = "FIRE", employee = "JOSE S"), (department = "FIRE", employee = "CHARLES S")]
=#
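The rearrangement can be pictured with flat columns. In the Base-only sketch below, the `employee` block column is modeled as offsets plus elements (an assumption made for illustration), and distributing amounts to repeating each department once per employee.

```julia
# Hypothetical model of the input: a plain `department` column plus an
# `employee` block column stored as (offsets, elements).
department = ["POLICE", "FIRE"]
employee_offsets = [1, 4, 6]
employee_elements = ["GARRY M", "ANTHONY R", "DANA A", "JOSE S", "CHARLES S"]

# Repeat each department once per employee in its block.
block_lengths = diff(employee_offsets)
department_repeated =
    [d for (d, n) in zip(department, block_lengths) for _ in 1:n]

# The offsets are reused unchanged and now delimit blocks of tuples.
tuples = collect(zip(department_repeated, employee_elements))
```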

Often we need to transform only a part of a composite vector, leaving the rest of the structure intact. This can be achieved using with_column and with_elements combinators. Specifically, with_column transforms a specific column of a TupleVector while with_elements transforms the vector of elements of a BlockVector.

p = with_column(:employee, with_elements(lift(titlecase)))
p(@VectorTree (department = String, employee = [String]) [
    "POLICE"    ["GARRY M", "ANTHONY R", "DANA A"]
    "FIRE"      ["JOSE S", "CHARLES S"]]) |> display
#=>
@VectorTree of 2 × (department = String, employee = (0:N) × String):
 (department = "POLICE", employee = ["Garry M", "Anthony R", "Dana A"])
 (department = "FIRE", employee = ["Jose S", "Charles S"])
=#
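A Base-only sketch of the same idea follows, with a table modeled as a NamedTuple of columns and a block column modeled as offsets plus elements. The names `sketch_with_column` and `sketch_with_elements` are hypothetical, as is the data layout.

```julia
# Hypothetical stand-in for `with_elements`: transform the elements of a
# block vector and keep the offsets as they are.
function sketch_with_elements(stage)
    return blocks -> (offsets = blocks.offsets, elements = stage(blocks.elements))
end

# Hypothetical stand-in for `with_column`: replace one column of a NamedTuple
# of columns and leave the others untouched.
function sketch_with_column(label::Symbol, stage)
    return columns -> merge(columns,
        NamedTuple{(label,)}((stage(getproperty(columns, label)),)))
end

table = (department = ["POLICE", "FIRE"],
         employee = (offsets = [1, 4, 6],
                     elements = ["GARRY M", "ANTHONY R", "DANA A", "JOSE S", "CHARLES S"]))

p = sketch_with_column(:employee, sketch_with_elements(v -> map(titlecase, v)))
result = p(table)
```

Only the innermost element vector is rebuilt; the `department` column and the offsets pass through untouched.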

Specialized Pipelines

Not every data transformation can be implemented with lifting. DataKnots provides pipeline constructors for some common transformation tasks.

For example, data filtering is implemented with the pipeline sieve_by(). As input, it expects a TupleVector of pairs, each containing a value and a Bool flag. sieve_by() transforms the input to a BlockVector containing 0- and 1-element blocks: when the flag is false, the pair is mapped to an empty block; otherwise, it is mapped to a one-element block containing the data value.

p = sieve_by()
p(@VectorTree (String, Bool) [("JEFFERY A", true), ("JAMES A", true), ("TERRY A", false)])
#-> @VectorTree (0:1) × String ["JEFFERY A", "JAMES A", missing]
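With blocks modeled as offsets over a flat element vector (an assumption of this sketch, not a description of the DataKnots implementation), this kind of filtering needs only a mask and a running count.

```julia
data = ["JEFFERY A", "JAMES A", "TERRY A"]
flags = [true, true, false]

# Keep only the flagged values as elements.
elements = data[flags]
# Each block has length 0 or 1, so the offsets are a running count of flags,
# starting from 1.
offsets = cumsum(vcat(1, flags))

# Materialize the blocks to inspect the result.
blocks = [elements[offsets[i]:offsets[i+1]-1] for i in 1:length(data)]
```

The output still has one block per input row, so the length of the vector is preserved even though some rows carry no value.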

If DataKnots does not provide a specific transformation, it is easy to create a new one. For example, let us create a pipeline constructor double which makes a pipeline that doubles the elements of the input vector.

We need to provide two definitions: one to create a Pipeline object and one to perform the transformation on the given input vector.

double() = Pipeline(double)
double(::Runtime, input::AbstractVector{<:Number}) = input .* 2

p = double()
p([260004, 185364, 170112])
#-> [520008, 370728, 340224]

It is also easy to create new pipeline combinators. Let us create a combinator twice, which applies the given pipeline to the input two times.

twice(p) = Pipeline(twice, p)
twice(rt::Runtime, input, p) = p(rt, p(rt, input))

p = twice(double())
p([260004, 185364, 170112])
#-> [1040016, 741456, 680448]

API Reference

DataKnots.Pipeline — Type

Pipeline(op, args...)

A pipeline object represents a vectorized data transformation.

Parameter op is a function that performs the transformation; args are extra arguments to be passed to the function.

The pipeline transforms any input vector by invoking op with the following arguments:

op(rt::Runtime, input::AbstractVector, args...)

The result of op must be the output vector, which should be of the same length as the input vector.
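The calling convention can be illustrated with a miniature, Base-only model. The types `MiniRuntime` and `MiniPipeline`, the helper functions, and the explicit length check are all hypothetical; they mirror the description above and are not the DataKnots implementation.

```julia
# Hypothetical miniature of the calling convention; not part of DataKnots.
struct MiniRuntime end

struct MiniPipeline
    op
    args::Tuple
    # Collect any extra arguments into a tuple.
    MiniPipeline(op, args...) = new(op, args)
end

# Invoke `op(rt, input, args...)` and check the output length (an assumption
# of this sketch, based on the requirement stated above).
function (p::MiniPipeline)(rt::MiniRuntime, input::AbstractVector)
    output = p.op(rt, input, p.args...)
    length(output) == length(input) ||
        error("output length differs from input length")
    return output
end

# Convenience entry point that supplies a fresh runtime.
(p::MiniPipeline)(input::AbstractVector) = p(MiniRuntime(), input)

double_op(::MiniRuntime, input) = input .* 2
twice_op(rt::MiniRuntime, input, q) = q(rt, q(rt, input))

p = MiniPipeline(twice_op, MiniPipeline(double_op))
out = p([1, 2, 3])
```

Storing the extra arguments in the pipeline object is what lets a combinator such as the hypothetical `twice_op` receive another pipeline and invoke it with the same runtime.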