Who is Highly Paid?

This tutorial is intended for new users of DataKnots. It does not give an exhaustive account of all the features, just the ones you are likely to use right away.

Problem Statement

Let's discuss a particular inquiry: Which City of Chicago employees have salary higher than the average for their department?

We use a tiny selection of public data from the City of Chicago. This dataset includes employees and their annual salary.

using CSV

employee_csv = """
    name,department,position,salary
    "ANTHONY A","POLICE","POLICE OFFICER",72510
    "DANIEL A","FIRE","FIRE FIGHTER-EMT",95484
    "JAMES A","FIRE","FIRE ENGINEER-EMT",103350
    "JEFFERY A","POLICE","SERGEANT",101442
    "NANCY A","POLICE","POLICE OFFICER",80016
    "ROBERT K","FIRE","FIRE FIGHTER-EMT",103272
    """ |> IOBuffer |> CSV.File

To query this employee data, we convert it to a DataKnot, or just knot. In the DataKnot constructor, employee_csv is provided a label of employee.

using DataKnots

chicago = DataKnot(:employee => employee_csv)

Then, to answer this inquiry, we query chicago as follows.

using Statistics: mean

@query chicago begin
    employee
    group(department)
    keep(mean_salary => mean(employee.salary))
    employee
    filter(salary > mean_salary)
end
#=>
  │ employee                                         │
  │ name       department  position           salary │
──┼──────────────────────────────────────────────────┼
1 │ JAMES A    FIRE        FIRE ENGINEER-EMT  103350 │
2 │ ROBERT K   FIRE        FIRE FIGHTER-EMT   103272 │
3 │ JEFFERY A  POLICE      SERGEANT           101442 │
=#

This overview will proceed by incrementally reconstructing this query, showing how an analyst could explore data and independently arrive at the answer above.

Basic Queries

DataKnots implements an algebra of queries. This algebra's elements, or queries, represent relationships among class entities and datatypes. Nouns, such as employee, department, and salary, are query primitives.

Let's query the chicago knot to list employee records.

@query chicago employee
#=>
  │ employee                                         │
  │ name       department  position           salary │
──┼──────────────────────────────────────────────────┼
1 │ ANTHONY A  POLICE      POLICE OFFICER      72510 │
2 │ DANIEL A   FIRE        FIRE FIGHTER-EMT    95484 │
3 │ JAMES A    FIRE        FIRE ENGINEER-EMT  103350 │
4 │ JEFFERY A  POLICE      SERGEANT           101442 │
5 │ NANCY A    POLICE      POLICE OFFICER      80016 │
6 │ ROBERT K   FIRE        FIRE FIGHTER-EMT   103272 │
=#

Verbs, such as group, keep, mean, and filter are query combinators. Combinators build new queries from existing ones. For example, count is a combinator.

@query chicago count(employee)
#=>
┼───┼
│ 6 │
=#

Query composition (.) is also a combinator, it builds a query that applies the output of one query as the input to another. The query employee.name lists all employee names.

@query chicago employee.name
#=>
  │ name      │
──┼───────────┼
1 │ ANTHONY A │
2 │ DANIEL A  │
3 │ JAMES A   │
4 │ JEFFERY A │
5 │ NANCY A   │
6 │ ROBERT K  │
=#

Within a multi-line macro block, each individual statement is composed with its predecessor. Hence, we could write the query above without the period delimiter.

@query chicago begin
    employee
    name
end

Often it's helpful to see the combined output from correlated queries. The record combinator, which is delimited with a pair of curly braces {}, is used to build queries that produce parallel results.

@query chicago employee{name, salary}
#=>
  │ employee          │
  │ name       salary │
──┼───────────────────┼
1 │ ANTHONY A   72510 │
2 │ DANIEL A    95484 │
3 │ JAMES A    103350 │
4 │ JEFFERY A  101442 │
5 │ NANCY A     80016 │
6 │ ROBERT K   103272 │
=#

Within a @query macro, constants, such as 100_000 are treated as query primitives. These constant queries produce the same output regardless of the input they receive.

@query chicago employee{name, salary, threshold => 100_000}
#=>
  │ employee                     │
  │ name       salary  threshold │
──┼──────────────────────────────┼
1 │ ANTHONY A   72510     100000 │
2 │ DANIEL A    95484     100000 │
3 │ JAMES A    103350     100000 │
4 │ JEFFERY A  101442     100000 │
5 │ NANCY A     80016     100000 │
6 │ ROBERT K   103272     100000 │
=#

Functions, such as titlecase, and operators, such as greater-than (>), are treated as query combinators; that is, they are used to build queries from component queries.

@query chicago begin
    employee
    {name => titlecase(name), highly_paid => salary > 100_000}
end
#=>
  │ employee               │
  │ name       highly_paid │
──┼────────────────────────┼
1 │ Anthony A        false │
2 │ Daniel A         false │
3 │ James A           true │
4 │ Jeffery A         true │
5 │ Nancy A          false │
6 │ Robert K          true │
=#

Since salary > 100_000 is a predicate query, we can use the filter combinator to build the query filter(salary > 100_000).

@query chicago employee.filter(salary > 100_000)
#=>
  │ employee                                         │
  │ name       department  position           salary │
──┼──────────────────────────────────────────────────┼
1 │ JAMES A    FIRE        FIRE ENGINEER-EMT  103350 │
2 │ JEFFERY A  POLICE      SERGEANT           101442 │
3 │ ROBERT K   FIRE        FIRE FIGHTER-EMT   103272 │
=#

In this section, we have built a query that produces highly-compensated employees. More broadly, we've demonstrated how an algebra of queries permits us to combine previously proven queries in an intuitive way.

Before moving on to the original inquiry, we need to discuss how queries see their input.

What is a DataKnot?

Input and output of queries are serialized as DataKnot objects. A DataKnot is a container that stores a hierarchy of labeled elements, where each element is either a scalar value, such as an integer or a string, or a collection of nested elements.

Recall the knot produced by the query employee{name, salary}.

@query chicago employee{name, salary}
#=>
  │ employee          │
  │ name       salary │
──┼───────────────────┼
1 │ ANTHONY A   72510 │
2 │ DANIEL A    95484 │
3 │ JAMES A    103350 │
4 │ JEFFERY A  101442 │
5 │ NANCY A     80016 │
6 │ ROBERT K   103272 │
=#

This knot contains 6 composite elements labeled employee, each employee having elements labeled name and salary. This element hierarchy can be visualized using show(::DataKnot, as=:flow)

    show(as=:flow, @query chicago employee{name, salary})
    #=>
    6-element DataKnot:
      employee:
        name: "ANTHONY A"
        salary: 72510
      employee:
        name: "DANIEL A"
        salary: 95484
      ⋮
    =#

The structure of a DataKnot is called its shape. In this case, the shape describes a hierarchy with 2 levels: zero or more elements with label employee, each employee element containing exactly one string element labeled name and exactly one integer element labeled salary.

show(as=:shape, @query chicago employee{name, salary})
#=>
6-element DataKnot:
  employee  0:N
  ├╴name    1:1 × String
  └╴salary  1:1 × Int64
=#

Not all knots have exactly two levels. Recall the query `count(employee) which produces a single, unlabeled numeric value.

@query chicago count(employee)
#=>
┼───┼
│ 6 │
=#

The shape of the knot reflects this structure. Here # is used in place of the missing label.

show(as=:shape, @query chicago count(employee))
#=>
1-element DataKnot:
  #  1:1 × Int64
=#

Our chicago knot has a hierarchy of three levels: a single unlabeled root element, branch level of employee elements, and a leaf level with name, department, position, and salary.

show(as=:shape, chicago)
#=>
1-element DataKnot:
  #               1:1
  └╴employee      0:N
    ├╴name        String
    ├╴department  String
    ├╴position    String
    └╴salary      Int64
=#

This knot could be shown as a hierarchy of elements. Observe that chicago has a single, unlabeled root element.

    show(as=:flow, chicago)
    #=>
    1-element DataKnot:
      #:
        employee:
          name: "ANTHONY A"
          department: "POLICE"
          position: "POLICE OFFICER"
          salary: 72510
        employee:
          name: "DANIEL A"
          department: "FIRE"
          position: "FIRE FIGHTER-EMT"
          salary: 95484
        ⋮
    =#

When a knot is displayed, its hierarchy is projected to a tabular form. For chicago, the root element gets its own row with employee elements packed into a single cell: employees are delimited by a semi-colon; and nested attributes are separated by a comma. For packed cells, such as employee, the header shows the subordinate labels within a pair of curly braces.

chicago
#=>
│ employee{name,department,position,salary}                           │
┼─────────────────────────────────────────────────────────────────────┼
│ ANTHONY A, POLICE, POLICE OFFICER, 72510; DANIEL A, FIRE, FIRE FIGH…│
=#

In this section, we have seen how DataKnots sees data as a hierarchical flow of labeled elements. In the next section, we show how hierarchies can be collapsed and created.

Hierarchical Transformations

DataKnots' combinators implement hierarchical transformations. Summary combinators, such as count, build queries that collapse a subtree into a single value. For example, we can compute average salary across employees with mean(employee.salary).

using Statistics: mean

@query chicago begin
    mean_salary => mean(employee.salary)
end
#=>
│ mean_salary │
┼─────────────┼
│     92679.0 │
=#

The group combinator builds queries that introduce a new level in the hierarchy by constructing grouping records for each unique element produced by its argument. For example, we could group employees by department.

@query chicago employee.group(department)
#=>
  │ department  employee{name,department,position,salary}             │
──┼───────────────────────────────────────────────────────────────────┼
1 │ FIRE        DANIEL A, FIRE, FIRE FIGHTER-EMT, 95484; JAMES A, FIR…│
2 │ POLICE      ANTHONY A, POLICE, POLICE OFFICER, 72510; JEFFERY A, …│
=#

In this tabular layout, grouping records are represented as table rows, and the corresponding employees are packed into a single cell. To see the knot in an unpacked form, we can display it as an element hierarchy.

    show(as=:shape, @query chicago employee.group(department))
    #=>
    2-element DataKnot:
      #:
        department: "POLICE"
        employee:
          name: "ANTHONY A"
          department: "POLICE"
          position: "POLICE OFFICER"
          salary: 72510
        employee:
          name: "JEFFERY A"
          department: "POLICE"
          position: "SERGEANT"
          salary: 201442
        ⋮
      #:
        department: "FIRE"
        employee:
          name: "DANIEL A"
          department: "FIRE"
          position: "FIRE FIGHTER-EMT"
          salary: 95484
        employee:
          name: "JAMES A"
          department: "FIRE"
          position: "FIRE ENGINEER-EMT"
          salary: 103350
        ⋮
    =#

Once constructed, grouping records can be used as any other input. In this next query, we show salaries of employees by department. Since each department in our reduced dataset happens to have 3 employees, our output has 3 salary entries.

@query chicago begin
    employee
    group(department)
    {department, employee.salary}
end
#=>
  │ department  salary                │
──┼───────────────────────────────────┼
1 │ FIRE        95484; 103350; 103272 │
2 │ POLICE      72510; 101442; 80016  │
=#

We can use summary operations relative to grouping records. In this next example, mean(employee.salary) is computed for each department.

@query chicago begin
    employee
    group(department)
    {department, mean_salary => mean(employee.salary)}
end
#=>
  │ department  mean_salary │
──┼─────────────────────────┼
1 │ FIRE           100702.0 │
2 │ POLICE          84656.0 │
=#

In this section, we have built a query that computes the average employee compensation for each department. Further, we've shown how group is used to transform hierarchies. Finally, we've demonstrated that grouping and summary combinators are independent, yet work fluidly together.

We're close to answering our original inquiry. We've built a query that filters employees. We've built a query that produces average salary by department. We need only connect these queries.

Query Context

DataKnots' queries are interpreted contextually, relative to the input that they receive. We've seen this earlier: depending where it is placed, mean(employee.salary) can produce either the average salary over the entire dataset, or averages within each department.

For our inquiry, we need to compare each employee's salary with the average salary. However, we cannot evaluate both salary and mean_salary in the same context.

@query chicago begin
    employee
    {name, salary, mean_salary => mean(employee.salary)}
end
#-> ERROR: cannot find "employee" ⋮

To evaluate an expression in one context and then make the value available in subsequent contexts, we could use the keep combinator. The next query computes mean_salary with respect to the entire dataset, and then uses this value in the context of each employee.

@query chicago begin
    keep(mean_salary => mean(employee.salary))
    employee
    {name, salary, mean_salary}
end
#=>
  │ employee                       │
  │ name       salary  mean_salary │
──┼────────────────────────────────┼
1 │ ANTHONY A   72510      92679.0 │
2 │ DANIEL A    95484      92679.0 │
3 │ JAMES A    103350      92679.0 │
4 │ JEFFERY A  101442      92679.0 │
5 │ NANCY A     80016      92679.0 │
6 │ ROBERT K   103272      92679.0 │
=#

However, the inquiry asks us to use average salary by department not across all employees. To adapt the previous query, we need to place our keep in the context of employee.group(department).

@query chicago begin
    employee
    group(department)
    keep(mean_salary => mean(employee.salary))
    employee
    {name, salary, mean_salary}
end
#=>
  │ employee                       │
  │ name       salary  mean_salary │
──┼────────────────────────────────┼
1 │ DANIEL A    95484     100702.0 │
2 │ JAMES A    103350     100702.0 │
3 │ ROBERT K   103272     100702.0 │
4 │ ANTHONY A   72510      84656.0 │
5 │ JEFFERY A  101442      84656.0 │
6 │ NANCY A     80016      84656.0 │
=#

We just need to add a filter to answer our initial inquiry: which employees have a salary that is higher than the average for their department?

@query chicago begin
    employee
    group(department)
    keep(mean_salary => mean(employee.salary))
    employee
    filter(salary > mean_salary)
end
#=>
  │ employee                                         │
  │ name       department  position           salary │
──┼──────────────────────────────────────────────────┼
1 │ JAMES A    FIRE        FIRE ENGINEER-EMT  103350 │
2 │ ROBERT K   FIRE        FIRE FIGHTER-EMT   103272 │
3 │ JEFFERY A  POLICE      SERGEANT           101442 │
=#

In this section, we've completed our query. Let's talk more about mechanics, how these queries operate.