Shapes and Signatures
To describe data shapes and pipeline signatures, we need the following definitions.
using DataKnots:
@VectorTree,
AnyShape,
BlockOf,
BlockVector,
IsFlow,
IsLabeled,
IsScope,
NoShape,
Signature,
TupleOf,
TupleVector,
ValueOf,
cardinality,
chain_of,
column,
columns,
compose,
context,
designate,
domain,
elements,
fits,
label,
labels,
print_graph,
replace_column,
replace_elements,
shapeof,
signature,
source,
subject,
target,
tuple_lift,
tuple_of,
wrap,
x0to1,
x0toN,
x1to1,
x1toN
Data Shapes
In DataKnots, the structure of composite data is represented using shape objects.
For example, consider a collection of departments with associated employees.
depts =
@VectorTree (name = (1:1)String,
employee = (1:N)(name = (1:1)String,
position = (1:1)String,
salary = (0:1)Int64,
rate = (0:1)Float64)) [
(name = "POLICE",
employee = [(name = "JEFFERY A", position = "SERGEANT", salary = 101442, rate = missing),
(name = "NANCY A", position = "POLICE OFFICER", salary = 80016, rate = missing)]),
(name = "FIRE",
employee = [(name = "JAMES A", position = "FIRE ENGINEER-EMT", salary = 103350, rate = missing),
(name = "DANIEL A", position = "FIRE FIGHTER-EMT", salary = 95484, rate = missing)]),
(name = "OEMC",
employee = [(name = "LAKENYA A", position = "CROSSING GUARD", salary = missing, rate = 17.68),
(name = "DORIS A", position = "CROSSING GUARD", salary = missing, rate = 19.38)])
]
In this collection, each department record has two fields: name and employee. Each employee record has four fields: name, position, salary, and rate. The employee field is plural; salary and rate are optional.
Physically, this collection is stored as a tree of interleaving TupleVector and BlockVector objects with regular Vector objects as the leaves. Its shape is described by a congruent tree composed of TupleOf, BlockOf and ValueOf objects.
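To make the column-oriented layout more concrete, here is a minimal plain-Julia sketch of how a block column could be stored: one flat vector of elements plus a vector of offsets marking where each block begins. The `Blocks` type, its fields `offs` and `elts`, and the helpers `nblocks` and `block` are hypothetical names chosen for illustration; this is a simplified model, not the DataKnots implementation.

```julia
# Hypothetical block layout: block i occupies elts[offs[i]:offs[i+1]-1].
struct Blocks{T}
    offs::Vector{Int}   # length = number of blocks + 1
    elts::Vector{T}     # all elements, stored contiguously
end

nblocks(b::Blocks) = length(b.offs) - 1
block(b::Blocks, i::Integer) = b.elts[b.offs[i]:b.offs[i+1]-1]

# Employee names grouped by department (cardinality 1:N).
employee_name = Blocks([1, 3, 5, 7],
                       ["JEFFERY A", "NANCY A", "JAMES A",
                        "DANIEL A", "LAKENYA A", "DORIS A"])

# Employee rates (cardinality 0:1): only the last two employees have one.
employee_rate = Blocks([1, 1, 1, 1, 1, 2, 3], [17.68, 19.38])

block(employee_name, 3)   #-> ["LAKENYA A", "DORIS A"]
block(employee_rate, 1)   #-> Float64[]
block(employee_rate, 6)   #-> [19.38]
```

In this model an optional value costs no storage when it is absent: an empty block is just two equal consecutive offsets.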
ValueOf corresponds to regular Julia Vector objects and specifies the type of the vector elements.
ValueOf(String)
#-> ValueOf(String)
BlockOf specifies the shape of the elements and the cardinality of a BlockVector. As a shorthand, a regular Julia type is accepted in place of a ValueOf shape, and the cardinality x0toN is assumed by default.
BlockOf(ValueOf(String), x1to1)
#-> BlockOf(String, x1to1)
TupleOf describes a TupleVector object with the given labels and the shapes of the columns.
emp_shp = TupleOf(:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int, x0to1),
:rate => BlockOf(Float64, x0to1))
Using nested shape objects, we can accurately specify the structure of a nested collection.
dept_shp = TupleOf(:name => BlockOf(String, x1to1),
:employee => BlockOf(emp_shp, x1toN))
#=>
TupleOf(:name => BlockOf(String, x1to1),
:employee => BlockOf(TupleOf(
:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int64, x0to1),
:rate => BlockOf(Float64, x0to1)),
x1toN))
=#
Traversing Nested Data
A field gives rise to a pipeline that maps the records to the field values. For example, the field employee corresponds to a pipeline which maps a collection of departments to associated employees.
dept_employee = column(:employee)
dept_employee(depts) |> display
#=>
@VectorTree of 3 × (1:N) × (name = (1:1) × String,
position = (1:1) × String,
salary = (0:1) × Int64,
rate = (0:1) × Float64):
[(name = "JEFFERY A", position = "SERGEANT", salary = 101442, rate = missing), (name = "NANCY A", position = "POLICE OFFICER", salary = 80016, rate = missing)]
[(name = "JAMES A", position = "FIRE ENGINEER-EMT", salary = 103350, rate = missing), (name = "DANIEL A", position = "FIRE FIGHTER-EMT", salary = 95484, rate = missing)]
[(name = "LAKENYA A", position = "CROSSING GUARD", salary = missing, rate = 17.68), (name = "DORIS A", position = "CROSSING GUARD", salary = missing, rate = 19.38)]
=#
The expected input and output of a pipeline can be specified by its signature.
dept_employee =
dept_employee |> designate(dept_shp, BlockOf(emp_shp, x1toN) |> IsFlow)
Here, we also annotate the output shape with IsFlow to indicate its special role in pipeline composition.
Two adjacent field pipelines may form a path. For example, consider the rate pipeline.
emp_rate =
column(:rate) |> designate(emp_shp, BlockOf(Float64, x0to1) |> IsFlow)
signature(emp_rate)
#=>
Signature(TupleOf(:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int64, x0to1),
:rate => BlockOf(Float64, x0to1)),
BlockOf(Float64, x0to1) |> IsFlow)
=#
We wish to form a path through the fields employee and rate. However, the pipelines dept_employee and emp_rate cannot be chained into chain_of(dept_employee, emp_rate) because their intermediate shapes do not match.
fits(target(dept_employee), source(emp_rate)) #-> false
On the other hand, these pipelines can be composed using the elementwise composition combinator.
dept_employee_rate = compose(dept_employee, emp_rate)
#=>
chain_of(column(:employee),
chain_of(with_elements(column(:rate)), flatten()))
=#
dept_employee_rate(depts)
#-> @VectorTree (0:N) × Float64 [[], [], [17.68, 19.38]]
signature(dept_employee_rate)
#=>
Signature(TupleOf(:name => BlockOf(String, x1to1),
:employee =>
BlockOf(TupleOf(
:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int64, x0to1),
:rate => BlockOf(Float64, x0to1)),
x1toN)),
BlockOf(Float64) |> IsFlow)
=#
Elementwise composition connects the pipelines by fusing their output flows. The least upper bound of the flow cardinalities is the cardinality of the fused flow.
dept_employee_card = cardinality(target(dept_employee))
#-> x1toN
emp_rate_card = cardinality(target(emp_rate))
#-> x0to1
dept_employee_rate_card = cardinality(target(dept_employee_rate))
#-> x0toN
dept_employee_card|emp_rate_card == dept_employee_rate_card
#-> true
Flow and Scope
Elementwise composition is a sequential composition with special handling of two types of containers: flow and scope.
The flow is a BlockVector that wraps the output of the pipeline. When two pipelines are composed, their output flows are fused together.
The scope is a TupleVector that augments the input data with extra context parameters. When pipelines are composed, the context is passed along the composition.
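The two roles can be illustrated with a small plain-Julia model that does not use DataKnots at all. In this sketch a "pipeline" is assumed to be an ordinary function of a value and a context that returns a vector of outputs: the returned vector stands in for the flow, and the context argument stands in for the scope. The names `compose_flows`, `employees`, `rate`, and `above`, as well as the context parameter `MIN`, are hypothetical.

```julia
# Hypothetical model: a pipeline is a function (value, ctx) -> Vector of outputs.
# Composition runs `g` on every output of `f`, passes the same context to both,
# and concatenates (fuses) the resulting flows into one flat vector.
compose_flows(f, g) = (x, ctx) -> [z for y in f(x, ctx) for z in g(y, ctx)]

# One department from the example collection, reduced to two fields.
oemc = (name = "OEMC",
        employee = [(name = "LAKENYA A", rate = 17.68),
                    (name = "DORIS A", rate = 19.38)])

employees(d, ctx) = d.employee                            # 1:N flow
rate(e, ctx) = ismissing(e.rate) ? Float64[] : [e.rate]   # 0:1 flow
above(x, ctx) = x > ctx.MIN ? [x] : Float64[]             # 0:1 flow, reads MIN

pipeline = compose_flows(compose_flows(employees, rate), above)
pipeline(oemc, (MIN = 18.0,))   #-> [19.38]
```

Only the last stage reads the context, yet every stage receives it; this is the sense in which the context is passed along the composition.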
For example, consider a pipeline that wraps the function round and expects the precision to be passed as a context parameter :P.
round_digits(x, d) = round(x, digits=d)
round_it =
chain_of(
tuple_of(column(1),
chain_of(column(2), column(:P))),
tuple_lift(round_digits),
wrap())
round_it(@VectorTree (Float64, (P = (1:1)Int,)) [(17.68, (P = 1,)), (19.38, (P = 1,))])
#-> @VectorTree (1:1) × Float64 [17.7, 19.4]
To use this pipeline in composition, we assign it a signature.
round_it =
round_it |> designate(TupleOf(Float64, TupleOf(:P => Float64)) |> IsScope,
BlockOf(Float64, x1to1) |> IsFlow)
When two pipelines have compatible intermediate domains, they can be composed.
domain(target(dept_employee_rate))
#-> ValueOf(Float64)
domain(source(round_it))
#-> ValueOf(Float64)
dept_employee_round_rate = compose(dept_employee_rate, round_it)
The composition also has a signature assigned to it. The input of the composition should contain the department data together with a parameter P.
signature(dept_employee_round_rate)
#=>
Signature(TupleOf(TupleOf(
:name => BlockOf(String, x1to1),
:employee =>
BlockOf(
TupleOf(
:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int64, x0to1),
:rate => BlockOf(Float64, x0to1)),
x1toN)),
TupleOf(:P => Float64)) |>
IsScope,
BlockOf(Float64) |> IsFlow)
=#
To run this pipeline, we pack the input data together with parameters.
slots = @VectorTree (P = Int,) [(P = 1,), (P = 1,), (P = 1,)]
input = TupleVector(:depts => depts, :slots => slots)
dept_employee_round_rate(input)
#-> @VectorTree (0:N) × Float64 [[], [], [17.7, 19.4]]
API Reference
DataKnots.AbstractShape — Type
AbstractShape
Describes the structure of column-oriented data.
DataKnots.AnyShape — Type
AnyShape()
Nothing is known about the data.
DataKnots.BlockOf — Type
BlockOf(elts::AbstractShape, card::Cardinality=x0toN)
Shape of a BlockVector.
DataKnots.IsLabeled — Type
sub |> IsLabeled(::Symbol)
The shape has an attached label.
DataKnots.NoShape — Type
NoShape()
Inconsistent constraints on the data.
DataKnots.Signature — Type
Signature(::AbstractShape, ::AbstractShape)
Shapes of a pipeline source and target.
DataKnots.fits — Function
fits(x::T, y::T) :: Bool
Checks if constraint x implies constraint y.
Test Suite
Cardinality
Cardinality constraints are partially ordered. For two Cardinality constraints, we can determine whether one is stricter than the other.
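One way to picture this partial order is to treat each cardinality as a pair of independent flags, "may be absent" and "may be plural". The plain-Julia sketch below is an illustrative model under that assumption; the constant and function names (`ONE_TO_ONE`, `card_fits`, `card_join`, and so on) are hypothetical and are not part of the DataKnots API.

```julia
# Hypothetical encoding: bit 1 set = optional (may be absent),
#                        bit 2 set = plural (may hold several values).
ONE_TO_ONE  = 0b00   # 1:1
ZERO_TO_ONE = 0b01   # 0:1
ONE_TO_N    = 0b10   # 1:N
ZERO_TO_N   = 0b11   # 0:N

# x is at least as strict as y when every flag set in x is also set in y.
card_fits(x, y) = (x | y) == y

# The least upper bound of two cardinalities is the union of their flags.
card_join(x, y) = x | y

card_fits(ZERO_TO_ONE, ONE_TO_N)                #-> false
card_fits(ONE_TO_ONE, ZERO_TO_N)                #-> true
card_join(ONE_TO_N, ZERO_TO_ONE) == ZERO_TO_N   #-> true
```

In this model 0:1 and 1:N are incomparable, since each carries a flag the other lacks. That is why the order is only partial.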
fits(x0to1, x1toN) #-> false
fits(x1to1, x0toN) #-> true
Data shapes
The structure of composite data is specified with shape objects.
A regular vector containing values of a specific type is indicated by the ValueOf shape.
str_shp = ValueOf(String)
#-> ValueOf(String)
eltype(str_shp)
#-> String
The structure of a BlockVector object is described using the BlockOf shape.
rate_shp = BlockOf(Float64, x0to1)
#-> BlockOf(Float64, x0to1)
cardinality(rate_shp)
#-> x0to1
elements(rate_shp)
#-> ValueOf(Float64)
eltype(rate_shp)
#-> Union{Missing, Float64}
For a TupleVector, the column shapes and their labels are described with TupleOf.
emp_shp = TupleOf(:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int, x0to1),
:rate => BlockOf(Float64, x0to1))
#=>
TupleOf(:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int64, x0to1),
:rate => BlockOf(Float64, x0to1))
=#
labels(emp_shp)
#-> [:name, :position, :salary, :rate]
label(emp_shp, 4)
#-> :rate
columns(emp_shp)
#-> DataKnots.AbstractShape[BlockOf(String, x1to1), BlockOf(String, x1to1), BlockOf(Int64, x0to1), BlockOf(Float64, x0to1)]
column(emp_shp, :rate)
#-> BlockOf(Float64, x0to1)
column(emp_shp, 4)
#-> BlockOf(Float64, x0to1)
It is possible to specify the shape of a TupleVector without labels.
cmp_shp = TupleOf(BlockOf(Int, x0to1), BlockOf(Int, x1to1))
#-> TupleOf(BlockOf(Int64, x0to1), BlockOf(Int64, x1to1))
In this case, the columns will be assigned ordinal labels.
label(cmp_shp, 1) #-> Symbol("#A")
label(cmp_shp, 2) #-> Symbol("#B")
Annotations
Any shape can be assigned a label using the IsLabeled annotation.
lbl_shp = BlockOf(String, x1to1) |> IsLabeled(:name)
subject(lbl_shp)
#-> BlockOf(String, x1to1)
label(lbl_shp)
#-> :name
A BlockOf shape is annotated with IsFlow to indicate that the container holds the output flow of a pipeline.
flw_shp = BlockOf(String, x1to1) |> IsFlow
subject(flw_shp)
#-> BlockOf(String, x1to1)
The shape of the flow elements can be easily accessed or replaced.
elements(flw_shp)
#-> ValueOf(String)
replace_elements(flw_shp, ValueOf(Int))
#-> BlockOf(Int64, x1to1) |> IsFlow
A TupleOf shape is annotated with IsScope to indicate that the container holds the scoping context of a pipeline.
scp_shp = TupleOf(Float64, TupleOf(:P => Int)) |> IsScope
subject(scp_shp)
#-> TupleOf(Float64, TupleOf(:P => Int64))
We can get the shapes of the input data and the context parameters.
context(scp_shp)
#-> TupleOf(:P => Int64)
column(scp_shp)
#-> ValueOf(Float64)
replace_column(scp_shp, ValueOf(Int))
#-> TupleOf(Int64, TupleOf(:P => Int64)) |> IsScope
Shape ordering
A single vector instance may satisfy many different shape constraints.
bv = BlockVector(:, ["Chicago"])
fits(bv, BlockOf(String, x1to1)) #-> true
fits(bv, BlockOf(AbstractString)) #-> true
fits(bv, AnyShape()) #-> true
For any two shape constraints, we can tell whether one of them is more specific than the other.
fits(ValueOf(Int), ValueOf(Number)) #-> true
fits(ValueOf(Int), ValueOf(String)) #-> false
fits(BlockOf(Int, x1to1),
BlockOf(Number, x0to1)) #-> true
fits(BlockOf(Int, x1toN),
BlockOf(Number, x0to1)) #-> false
fits(BlockOf(Int, x1to1),
BlockOf(String, x0to1)) #-> false
fits(TupleOf(BlockOf(Int, x1to1),
BlockOf(String, x0to1)),
TupleOf(BlockOf(Number, x1to1),
BlockOf(String, x0toN))) #-> true
fits(TupleOf(BlockOf(Int, x0to1),
BlockOf(String, x1to1)),
TupleOf(BlockOf(Number, x1to1),
BlockOf(String, x0toN))) #-> false
fits(TupleOf(BlockOf(Int, x1to1)),
TupleOf(BlockOf(Number, x1to1),
BlockOf(String, x0toN))) #-> false
Shapes of different kinds are typically not compatible with each other. The exceptions are AnyShape() and NoShape().
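The rule and its two exceptions can be modeled in a few lines of plain Julia using multiple dispatch. The miniature hierarchy below is a hypothetical sketch: the types `MiniValue`, `MiniBlock`, `MiniAny`, and `MiniNone` and the function `mini_fits` are invented for illustration, block cardinality is deliberately left out, and none of it reflects how DataKnots implements fits.

```julia
# Hypothetical miniature shape hierarchy (not the DataKnots types).
abstract type MiniShape end

struct MiniValue <: MiniShape   # plays the role of ValueOf
    ty::Type
end

struct MiniBlock <: MiniShape   # plays the role of BlockOf, cardinality omitted
    elts::MiniShape
end

struct MiniAny <: MiniShape end    # top: nothing is known about the data
struct MiniNone <: MiniShape end   # bottom: inconsistent constraints

# Default: shapes of different kinds do not fit each other.
mini_fits(x::MiniShape, y::MiniShape) = false
# Same kind: compare the components.
mini_fits(x::MiniValue, y::MiniValue) = x.ty <: y.ty
mini_fits(x::MiniBlock, y::MiniBlock) = mini_fits(x.elts, y.elts)
# Exceptions: every shape fits the top; the bottom fits every shape.
mini_fits(x::MiniShape, y::MiniAny) = true
mini_fits(x::MiniNone, y::MiniShape) = true
mini_fits(x::MiniNone, y::MiniAny) = true   # resolves the dispatch ambiguity

mini_fits(MiniValue(Int), MiniValue(Number))           #-> true
mini_fits(MiniValue(Int), MiniBlock(MiniValue(Int)))   #-> false
mini_fits(MiniValue(Int), MiniAny())                   #-> true
mini_fits(MiniNone(), MiniValue(Int))                  #-> true
```

The last method exists only because the two exception rules overlap when the bottom shape is compared with the top shape; without it Julia would report an ambiguous call.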
fits(ValueOf(Int), BlockOf(Int)) #-> false
fits(ValueOf(Int), AnyShape()) #-> true
fits(NoShape(), ValueOf(Int)) #-> true
Column labels are treated as additional shape constraints.
fits(TupleOf(:name => String),
TupleOf(:name => String)) #-> true
fits(TupleOf(String),
TupleOf(:position => String)) #-> false
fits(TupleOf(:name => String),
TupleOf(String)) #-> true
fits(TupleOf(:name => String),
TupleOf(:position => String)) #-> false
Similarly, annotations are treated as shape constraints.
fits(String |> IsLabeled(:name),
String |> IsLabeled(:name)) #-> true
fits(ValueOf(String),
String |> IsLabeled(:position)) #-> false
fits(String |> IsLabeled(:name),
ValueOf(String)) #-> true
fits(String |> IsLabeled(:name),
String |> IsLabeled(:position)) #-> false
fits(BlockOf(String, x1to1) |> IsFlow,
BlockOf(String, x0toN) |> IsFlow) #-> true
fits(BlockOf(String, x1to1),
BlockOf(String, x0toN) |> IsFlow) #-> false
fits(BlockOf(String, x1to1) |> IsFlow,
BlockOf(String, x0toN)) #-> true
fits(TupleOf(Int, TupleOf(:X => Int))
|> IsScope,
TupleOf(Int, TupleOf(:X => Int))
|> IsScope) #-> true
fits(TupleOf(Int, TupleOf(:X => Int)),
TupleOf(Int, TupleOf(:X => Int))
|> IsScope) #-> false
fits(TupleOf(Int, TupleOf(:X => Int))
|> IsScope,
TupleOf(Int, TupleOf(:X => Int))) #-> true
Shape of a vector
The function shapeof() determines the shape of a given vector.
shapeof(["GARRY M", "ANTHONY R", "DANA A"])
#-> ValueOf(String)
shapeof(@VectorTree ((1:1)String, (0:1)Int) [])
#-> TupleOf(BlockOf(String, x1to1), BlockOf(Int64, x0to1))
shapeof(@VectorTree (name = String, employee = [String]) [])
#-> TupleOf(:name => String, :employee => BlockOf(String))
Pipeline signature
A Signature object describes the shapes of a pipeline's input and output.
sig = Signature(ValueOf(UInt),
BlockOf(TupleOf(:name => BlockOf(String, x1to1),
:employee => BlockOf(UInt, x0toN))) |> IsFlow)
#=>
Signature(ValueOf(UInt64),
BlockOf(TupleOf(:name => BlockOf(String, x1to1),
:employee => BlockOf(UInt64))) |>
IsFlow)
=#
Components of the signature can be easily extracted.
target(sig)
#=>
BlockOf(TupleOf(:name => BlockOf(String, x1to1),
:employee => BlockOf(UInt64))) |>
IsFlow
=#
source(sig)
#-> ValueOf(UInt64)
Rendering as a graph
The function print_graph() visualizes a shape constraint as a tree.
print_graph(ValueOf(String))
#-> # String
print_graph(BlockOf(String, x1to1))
#-> # 1:1 × String
print_graph(BlockOf(String, x1to1) |> IsLabeled(:name))
#-> name 1:1 × String
print_graph(
TupleOf(
:name => String,
:position => String,
:salary => Int) |> IsLabeled(:employee))
#=>
employee
├╴name String
├╴position String
└╴salary Int64
=#
print_graph(
BlockOf(
TupleOf(
TupleOf(
:name => BlockOf(String, x1to1),
:position => BlockOf(String, x1to1),
:salary => BlockOf(Int, x0to1),
:rate => BlockOf(Float64, x0to1)) |> IsLabeled(:employee),
TupleOf(:mean_salary => BlockOf(Float64, x0to1))) |> IsScope,
x0toN) |> IsFlow)
#=>
# 0:N
├╴employee
│ ├╴name 1:1 × String
│ ├╴position 1:1 × String
│ ├╴salary 0:1 × Int64
│ └╴rate 0:1 × Float64
└╴#B
└╴mean_salary 0:1 × Float64
=#