Queries for Data Analysts

DataKnots is a toolkit for representing and querying complex structured data. It is designed for data analysts and domain experts (e.g. accountants, engineers, researchers) who need to build and share domain-specific queries.

This overview shows how typical query operations can be performed upon a simplified in-memory dataset using DataKnot's macro syntax.

Data Navigation

Consider a tiny cross-section of public data from Chicago, represented as nested Vector and NamedTuple objects.

department_data = [
  (name = "POLICE",
   employee = [
    (name = "ANTHONY A", position = "POLICE OFFICER", salary = 72510),
    (name = "JEFFERY A", position = "SERGEANT", salary = 101442),
    (name = "NANCY A", position = "POLICE OFFICER", salary = 80016)]),
  (name = "FIRE",
   employee = [
    (name = "DANIEL A", position = "FIREFIGHTER-EMT", salary = 95484),
    (name = "ROBERT K", position = "FIREFIGHTER-EMT", salary = 103272)])]

This hierarchical dataset contains a list of departments, with each department containing associated employees.

To query this dataset, we convert it into a DataKnot, or knot.

using DataKnots

chicago = DataKnot(:department => department_data)

Let's say we want to return the list of department names. We query the chicago knot with department.name.

@query chicago department.name
#=>
  │ name   │
──┼────────┼
1 │ POLICE │
2 │ FIRE   │
=#

The dotted notation lets one navigate a hierarchical dataset. Let's continue our dataset exploration by listing employee names.

@query chicago department.employee.name
#=>
  │ name      │
──┼───────────┼
1 │ ANTHONY A │
2 │ JEFFERY A │
3 │ NANCY A   │
4 │ DANIEL A  │
5 │ ROBERT K  │
=#

We could write the query above, without the period delimiter, as a multi-line macro block.

@query chicago begin
    department
    employee
    name
end

Navigation context matters. For example, employee tuples are not directly accessible from the root of the knot. When a field label, such as employee, can't be found, an appropriate error message is displayed.

@query chicago employee
#-> ERROR: cannot find "employee" ⋮

Instead, employee tuples can be queried by navigating through department tuples. When tuples are returned, they are displayed as a table.

@query chicago department.employee
#=>
  │ employee                           │
  │ name       position         salary │
──┼────────────────────────────────────┼
1 │ ANTHONY A  POLICE OFFICER    72510 │
2 │ JEFFERY A  SERGEANT         101442 │
3 │ NANCY A    POLICE OFFICER    80016 │
4 │ DANIEL A   FIREFIGHTER-EMT   95484 │
5 │ ROBERT K   FIREFIGHTER-EMT  103272 │
=#

Counting & Context

To count the number of departments in this chicago dataset we write the query count(department). Observe that the argument provided to count(), department, is itself a query.

@query chicago count(department)
#=>
┼───┼
│ 2 │
=#

We could also count the total number of employees across all departments.

@query chicago count(department.employee)
#=>
┼───┼
│ 5 │
=#

What if we wanted to count employees by department? Using query composition (.), we can perform count in a nested context.

@query chicago department.count(employee)
#=>
──┼───┼
1 │ 3 │
2 │ 2 │
=#

In this output, we see that one department has 3 employees, while the other has 2.

Records & Filters

Let's improve the previous query by including each department's name alongside employee counts. This can be done by constructing a record using paired curly brackets {}.

@query chicago department{name, count(employee)}
#=>
  │ department │
  │ name    #B │
──┼────────────┼
1 │ POLICE   3 │
2 │ FIRE     2 │
=#

To label a record field we use the Pair syntax, (=>).

@query chicago department{name, size => count(employee)}
#=>
  │ department   │
  │ name    size │
──┼──────────────┼
1 │ POLICE     3 │
2 │ FIRE       2 │
=#

Additionally, we could list the employee names associated with each of these departments.

@query chicago begin
   department
   { name,
     size => count(employee),
     employee_names => employee.name }
end
#=>
  │ department                                  │
  │ name    size  employee_names                │
──┼─────────────────────────────────────────────┼
1 │ POLICE     3  ANTHONY A; JEFFERY A; NANCY A │
2 │ FIRE       2  DANIEL A; ROBERT K            │
=#

In this display employee_names is a plural value. Hence the output cell for each department is delimited by a semi-colon.

We can extend the previous query to show only departments with more than 2 employees.

@query chicago begin
   department
   { name,
     size => count(employee),
     employee_names => employee.name }
   filter(size > 2)
end
#=>
  │ department                                  │
  │ name    size  employee_names                │
──┼─────────────────────────────────────────────┼
1 │ POLICE     3  ANTHONY A; JEFFERY A; NANCY A │
=#

The argument to filter can be any query expression that is valid for the current context.

@query chicago begin
    department
    filter(count(employee) < 3)
    { name,
      employee_names => employee.name }
end
#=>
  │ department               │
  │ name  employee_names     │
──┼──────────────────────────┼
1 │ FIRE  DANIEL A; ROBERT K │
=#

With these queries we've seen how to navigate, count, record and filter. These operations form the base of our query language.

Aggregate Queries

So far we've only seen elementwise queries which emit an output for each of its input elements. Informally, we can see this with the query department.count(employee).

@query chicago department.count(employee)
#=>
──┼───┼
1 │ 3 │
2 │ 2 │
=#

In this case, the query count(employee) input has two elements, one for each department. It it emits output elements for each, representing the number of employees for the given department.

Without arguments, count() counts the number of input elements it receives. These aggregate queries produce an output relative to their input as a whole.

@query chicago department.employee.count()
#=>
┼───┼
│ 5 │
=#

We may wish to count employees by department. Contrary to expectation, adding parentheses will not change the result.

@query chicago department.(employee.count())
#=>
┼───┼
│ 5 │
=#

To count employees in each department, we use each(), which evaluates its argument elementwise.

@query chicago department.each(employee.count())
#=>
──┼───┼
1 │ 3 │
2 │ 2 │
=#

Equivalently, we could use count(employee).

@query chicago department.count(employee)
#=>
──┼───┼
1 │ 3 │
2 │ 2 │
=#

Which variant of count to use depends upon what is notationally convenient: is the count of the input elements requested? or is a count of something relative to each input needed?

Paging Data

Sometimes query results can be quite large. In this case it's helpful to take or drop items from the input. Let's return only the first two of the employees.

@query chicago department.employee.take(2)
#=>
  │ employee                          │
  │ name       position        salary │
──┼───────────────────────────────────┼
1 │ ANTHONY A  POLICE OFFICER   72510 │
2 │ JEFFERY A  SERGEANT        101442 │
=#

A negative index counts records from the end of the input. So, to return all the records but the last two, we write:

@query chicago department.employee.take(-2)
#=>
  │ employee                          │
  │ name       position        salary │
──┼───────────────────────────────────┼
1 │ ANTHONY A  POLICE OFFICER   72510 │
2 │ JEFFERY A  SERGEANT        101442 │
3 │ NANCY A    POLICE OFFICER   80016 │
=#

To skip the first two records, returning the rest, we use drop.

@query chicago department.employee.drop(2)
#=>
  │ employee                          │
  │ name      position         salary │
──┼───────────────────────────────────┼
1 │ NANCY A   POLICE OFFICER    80016 │
2 │ DANIEL A  FIREFIGHTER-EMT   95484 │
3 │ ROBERT K  FIREFIGHTER-EMT  103272 │
=#

To return the 1st half of the employees in the database, we could use take with an argument that computes how many to take.

@query chicago begin
    department.employee
    take(count(department.employee) ÷ 2)
end
#=>
  │ employee                          │
  │ name       position        salary │
──┼───────────────────────────────────┼
1 │ ANTHONY A  POLICE OFFICER   72510 │
2 │ JEFFERY A  SERGEANT        101442 │
=#

Unlike filter, take and drop are aggregates because the output they generate depend not just upon each input element, but also upon the position of that element with respect to the entire input collection.

Grouping Data

So far, we've navigated and counted data by exploiting its hierarchical organization. But what if we want a query that isn't supported by the existing hierarchy? For example, how could we calculate the number of employees for each position?

A list of distinct positions could be obtained using unique.

@query chicago department.employee.position.unique()
#=>
  │ position        │
──┼─────────────────┼
1 │ FIREFIGHTER-EMT │
2 │ POLICE OFFICER  │
3 │ SERGEANT        │
=#

However, unique is not sufficient because positions are not associated to the respective employees. To associate employee records to their positions, we use group.

@query chicago begin
     department
     employee
     group(position)
     { position, employee.name }
end
#=>
  │ position         name               │
──┼─────────────────────────────────────┼
1 │ FIREFIGHTER-EMT  DANIEL A; ROBERT K │
2 │ POLICE OFFICER   ANTHONY A; NANCY A │
3 │ SERGEANT         JEFFERY A          │
=#

The complement of each group is often plural, and in this case, elements of from the complement are separated by the semi-colon. We could see this by counting employees in each position.

@query chicago begin
     department
     employee
     group(position)
     { position, count => count(employee) }
end
#=>
  │ position         count │
──┼────────────────────────┼
1 │ FIREFIGHTER-EMT      2 │
2 │ POLICE OFFICER       2 │
3 │ SERGEANT             1 │
=#

Here, group and unique are also aggregate. In particular, the output they produce is quite distinct from their input. Generally, it's the flexibility of aggregate queries like group(position) that provide the operational power of this query language.