Usage Guide

This guide will teach you how to assemble SQL queries using FunSQL.

Test Database

To demonstrate database queries, we need a test database. The database we use in this guide is a tiny 10 person sample of simulated patient data extracted from a much larger CMS DE-SynPuf dataset. For a database engine, we picked SQLite. Using SQLite in a guide is convenient because it does not require a database server to run and allows us to distribute the whole database as a single file. FunSQL supports SQLite and many other database engines. The techniques discussed here are not specific to SQLite and once you learn them, you will be able to apply them to any SQL database.

The data in the test database is stored in the format of the OMOP Common Data Model, an open source database schema for observational healthcare data. In this guide, we will only use a small fragment of the Common Data Model.

Fragment of the OMOP Common Data Model

The patient data, including basic demographic information, is stored in the table person. Patient addresses are stored in a separate table location, linked to person by the key column location_id.

The bulk of patient data consists of clinical events: visits to healthcare providers, recorded observations, diagnosed conditions, prescribed medications, etc. In this guide we only use two types of events, visits and conditions.

The specific type of the event (e.g., Inpatient visit or Essential hypertension condition) is indicated using a concept id column, which refers to the concept table. Different concepts may be related to each other. For instance, Essential hypertension is a Hypertensive disorder, which itself is a Disorder of cardiovascular system. Concept relationships are recorded in the corresponding table.

If you wish to follow along with the guide and run the examples, download the database file:

DATABASE = download("https://github.com/MechanicalRabbit/ohdsi-synpuf-demo/releases/download/20210412/synpuf-10p.sqlite")

All examples in this guide are tested on each update using the NarrativeTest package. To avoid downloading the database file all the time, we registered the download URL as an artifact and use Pkg.Artifacts API to fetch it:

using Pkg.Artifacts, LazyArtifacts

DATABASE = joinpath(artifact"synpuf-10p", "synpuf-10p.sqlite")
#-> ⋮

Using FunSQL

To interact with an SQLite database from Julia code, we need to install the SQLite package:

using Pkg

Pkg.add("SQLite")

With the package installed, we can open a database connection:

using FunSQL
using SQLite

conn = DBInterface.connect(FunSQL.DB{SQLite.DB}, DATABASE)

This call to DBInterface.connect creates a connection to the SQLite database, retrieves the catalog of available database tables, and returns a FunSQL connection object.

Some applications open many connections to the same database. For instance, a web application may open a new database connection on every incoming HTTP request. In this case, it may be worth to have all these connections to share the same database catalog. The application can start with loading the catalog using using reflect:

using FunSQL: reflect

catalog = reflect(DBInterface.connect(SQLite.DB, DATABASE))

Then whenever a new connection is created, this catalog object could be reused:

conn = FunSQL.DB(DBInterface.connect(SQLite.DB, DATABASE),
                 catalog = catalog)

Warning

Some database drivers, including the PostgreSQL client library LibPQ.jl, do not support DBInterface. For instructions on how to enable DBInterface for LibPQ, see this example.

Using the connection object, we can execute FunSQL queries. For example, the following query outputs the content of the table person:

using FunSQL: From

q = From(:person)

This query could be executed with DBInterface.execute:

res = DBInterface.execute(conn, q)

To display the result of a query, it is convenient to convert it to a DataFrame object:

using DataFrames

DataFrame(res)
#=>
10×18 DataFrame
 Row │ person_id  gender_concept_id  year_of_birth  month_of_birth  day_of_bir ⋯
     │ Int64      Int64              Int64          Int64           Int64      ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      1780               8532           1940               2             ⋯
   2 │     30091               8532           1932               8
   3 │     37455               8532           1913               7
   4 │     42383               8507           1922               2
   5 │     69985               8532           1956               7             ⋯
   6 │     72120               8507           1937              10
   7 │     82328               8532           1957               9
   8 │     95538               8507           1923              11
   9 │    107680               8532           1963              12             ⋯
  10 │    110862               8507           1911               4
                                                              14 columns omitted
=#

Instead of executing the query directly, we can render it to generate the corresponding SQL statement:

using FunSQL: render

sql = render(conn, q)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."gender_concept_id",
  "person_1"."year_of_birth",
  "person_1"."month_of_birth",
  "person_1"."day_of_birth",
  "person_1"."time_of_birth",
  "person_1"."race_concept_id",
  "person_1"."ethnicity_concept_id",
  "person_1"."location_id",
  "person_1"."provider_id",
  "person_1"."care_site_id",
  "person_1"."person_source_value",
  "person_1"."gender_source_value",
  "person_1"."gender_source_concept_id",
  "person_1"."race_source_value",
  "person_1"."race_source_concept_id",
  "person_1"."ethnicity_source_value",
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
=#

In fact, we do not need a database connection if all we want is to generate a SQL query. For this purpose, we only need a SQLCatalog object that describes the structure of the database tables and the target SQL dialect:

using FunSQL: SQLCatalog, SQLTable

catalog = SQLCatalog(SQLTable(:person, columns = [:person_id, :year_of_birth]),
                     dialect = :sqlite)

sql = render(catalog, q)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth"
FROM "person" AS "person_1"
=#

Why FunSQL?

Let us clarify the purpose of FunSQL. Consider a problem:

Find all patients born between 1930 and 1940 and living in Illinois, and for each patient show their current age (by the end of 2020).

The answer can be obtained with the following SQL query:

SELECT p.person_id, 2020 - p.year_of_birth AS age
FROM person p
JOIN location l ON (p.location_id = l.location_id)
WHERE (p.year_of_birth BETWEEN 1930 AND 1940) AND (l.state = 'IL')

The simplest way to incorporate this query into Julia code is to embed it as a string literal:

sql = """
SELECT p.person_id, 2020 - p.year_of_birth AS age
FROM person p
JOIN location l ON (p.location_id = l.location_id)
WHERE (p.year_of_birth BETWEEN 1930 AND 1940) AND (l.state = 'IL')
"""

DBInterface.execute(conn, sql) |> DataFrame
#=>
1×2 DataFrame
 Row │ person_id  age
     │ Int64      Int64
─────┼──────────────────
   1 │     72120     83
=#

With FunSQL, instead of embedding the SQL query directly into Julia code, we construct a query object:

using FunSQL: As, From, Fun, Get, Join, Select, Where

q = From(:person) |>
    Where(Fun.between(Get.year_of_birth, 1930, 1940)) |>
    Join(From(:location) |> Where(Get.state .== "IL") |> As(:location),
         on = Get.location_id .== Get.location.location_id) |>
    Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

The value of q is a composite object of type SQLNode. "Composite" means that q is assembled from components (also of type SQLNode), which themselves are either atomic or assembled from smaller components. Different kinds of components are created by SQLNode constructors such as From, Where, Fun, Get, etc.

We use the same DBInterface.execute method to serialize the query object as a SQL statement and immediately execute it:

DBInterface.execute(conn, q) |> DataFrame
#=>
1×2 DataFrame
 Row │ person_id  age
     │ Int64      Int64
─────┼──────────────────
   1 │     72120     83
=#

Why, instead of embedding a complete SQL query, we prefer to generate it through a query object? To justify this extra step, consider that in a real Julia program, any query is likely going to be parameterized:

Find all patients born between $start_year and $end_year and living in $states, and for each patient show the $output_columns.

If this is the case, the SQL query cannot be prepared in advance and must be assembled on the fly. While it is possible to assemble a SQL query from string fragments, it is tedious, error-prone and definitely not fun. FunSQL provides a more robust and effective approach: build the query as a composite data structure.

Here is how this parameterized query may be constructed with FunSQL:

function FindPatients(; start_year = nothing,
                        end_year = nothing,
                        states = String[])
    q = From(:person) |>
        Where(BirthRange(start_year, end_year))
    if !isempty(states)
        q = q |>
            Join(:location => From(:location) |>
                              Where(Fun.in(Get.state, states...)),
                 on = Get.location_id .== Get.location.location_id)
    end
    q
end

function BirthRange(start_year, end_year)
    p = true
    if start_year !== nothing
        p = Fun.and(p, Get.year_of_birth .>= start_year)
    end
    if end_year !== nothing
        p = Fun.and(p, Get.year_of_birth .<= end_year)
    end
    p
end

The function FindPatients effectively becomes a new SQLNode constructor, which can be used directly or as a component of a larger query.

Show all patient data.

q = FindPatients()

DBInterface.execute(conn, q) |> DataFrame
#=>
10×18 DataFrame
 Row │ person_id  gender_concept_id  year_of_birth  month_of_birth  day_of_bir ⋯
     │ Int64      Int64              Int64          Int64           Int64      ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      1780               8532           1940               2             ⋯
   2 │     30091               8532           1932               8
   3 │     37455               8532           1913               7
   4 │     42383               8507           1922               2
   5 │     69985               8532           1956               7             ⋯
   6 │     72120               8507           1937              10
   7 │     82328               8532           1957               9
   8 │     95538               8507           1923              11
   9 │    107680               8532           1963              12             ⋯
  10 │    110862               8507           1911               4
                                                              14 columns omitted
=#

Show all patients born in or after 1930.

q = FindPatients(start_year = 1930) |>
    Select(Get.person_id)

DBInterface.execute(conn, q) |> DataFrame
#=>
6×1 DataFrame
 Row │ person_id
     │ Int64
─────┼───────────
   1 │      1780
   2 │     30091
   3 │     69985
   4 │     72120
   5 │     82328
   6 │    107680
=#

Find all patients born between 1930 and 1940 and living in Illinois, and for each patient show their current age.

q = FindPatients(start_year = 1930, end_year = 1940, states = ["IL"]) |>
    Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

DBInterface.execute(conn, q) |> DataFrame
#=>
1×2 DataFrame
 Row │ person_id  age
     │ Int64      Int64
─────┼──────────────────
   1 │     72120     83
=#

Tabular Operations

Recall the query from the previous section:

Find all patients born between 1930 and 1940 and living in Illinois, and for each patient show their current age.

From(:person) |>
Where(Fun.between(Get.year_of_birth, 1930, 1940)) |>
Join(From(:location) |> Where(Get.state .== "IL") |> As(:location),
     on = Get.location_id .== Get.location.location_id) |>
Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

At the outer level, this query is constructed from tabular operations From, Where, Join, and Select arranged in a pipeline by the pipe (|>) operator. In SQL, a tabular operation takes a certain number of input datasets and produces an output dataset. It is helpful to visualize a tabular operation as a node with a certain number of input arrows and one output arrow.

From, Where, Select, and Join nodes

Then the whole query can be visualized as a pipeline diagram. Each arrow in this diagram represents a dataset, and each node represents an elementary data processing operation.

Query pipeline

The following tabular operations are available in FunSQL.

Constructor	Function
`Append`	concatenate datasets
`As`	wrap all columns in a nested record
`Define`	add an output column
`From`	produce the content of a database table
`Group`	partition the dataset into disjoint groups
`Iterate`	iterate a query
`Join`	correlate two datasets
`Limit`	truncate the dataset
`Order`	sort the dataset
`Partition`	relate dataset rows to each other
`Select`	specify output columns
`Where`	filter the dataset by the given condition
`With`	assign a name to a temporary dataset

`From`, `Select`, and `Define`

The From node outputs the content of a database table. The constructor takes one argument, the name of the table.

As opposed to SQL, FunSQL does not demand that all queries have an explicit Select. The following query will produce all columns of the table:

Show all patients.

using FunSQL: From

q = From(:person)

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  ⋮
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
=#

The From node also accepts a DataFrame or any argument supporting the Tables.jl interface, which is very convenient when you need to correlate database content with external data. Keep in mind that From serializes a DataFrame argument as a part of the query, so for a large DataFrame it is better to load it into the database and query it as a regular table.

df = DataFrame(person_id = ["SQL", "Julia", "FunSQL"],
               year_of_birth = [1974, 2012, 2021])

q = From(df)

render(conn, q) |> print
#=>
SELECT
  "values_1"."column1" AS "person_id",
  "values_1"."column2" AS "year_of_birth"
FROM (
  VALUES
    ('SQL', 1974),
    ('Julia', 2012),
    ('FunSQL', 2021)
) AS "values_1"
=#

It is possible for a query not to have a From node:

Show the current date and time.

using FunSQL: Select

q = Select(Fun.current_timestamp())

sql = render(q)

print(sql)
#-> SELECT CURRENT_TIMESTAMP AS "current_timestamp"

In this query, the Select node is not connected to any source of data. In such a case, it is supplied with a unit dataset containing one row and no columns. Hence this query will generate one row of output.

The same effect could be achieved with From(nothing):

q = From(nothing) |>
    Select(Fun.current_timestamp())

sql = render(q)

print(sql)
#-> SELECT CURRENT_TIMESTAMP AS "current_timestamp"

The Select node is used to specify the output columns. The name of the column is either derived from the expression or set explicitly with As (or the shorthand =>).

For each patient, show their ID and the current age.

q = From(:person) |>
    Select(Get.person_id,
           :age => 2020 .- Get.year_of_birth)

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  (2020 - "person_1"."year_of_birth") AS "age"
FROM "person" AS "person_1"
=#

To add a new column while preserving existing output columns, we use the Define node.

Show the patient data together with their current age.

using FunSQL: Define

q = From(:person) |>
    Define(:age => 2020 .- Get.year_of_birth)

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  ⋮
  "person_1"."ethnicity_source_concept_id",
  (2020 - "person_1"."year_of_birth") AS "age"
FROM "person" AS "person_1"
=#

Define could also be used to replace an existing column.

Hide the day of birth of patients born before 1930.

q = From(:person) |>
    Define(:day_of_birth => Fun.case(Get.year_of_birth .>= 1930,
                                     Get.day_of_birth,
                                     missing))

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  "person_1"."gender_concept_id",
  "person_1"."year_of_birth",
  "person_1"."month_of_birth",
  (CASE WHEN ("person_1"."year_of_birth" >= 1930) THEN "person_1"."day_of_birth" ELSE NULL END) AS "day_of_birth",
  ⋮
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
=#

`Join`

The Join node correlates the rows of two input datasets. Predominantly, Join is used for looking up table records by key. In the following example, Join associates each person record with their location using the key column location_id that uniquely identifies a location record.

Show all patients together with their state of residence.

From(:person) |>
Join(:location => From(:location),
     Get.location_id .== Get.location.location_id,
     left = true) |>
Select(Get.person_id, Get.location.state)

The modifier left = true tells Join that it must output all person records including those without the corresponding location. Since this is a very common requirement, FunSQL provides an alias:

using FunSQL: LeftJoin

From(:person) |>
LeftJoin(:location => From(:location),
         Get.location_id .== Get.location.location_id) |>
Select(Get.person_id, Get.location.state)

Since Join needs two input datasets, it must be attached to two input pipelines. The first pipeline is attached using the |> operator and the second one is provided as an argument to the Join constructor. Alternatively, both input pipelines can be specified as keyword arguments:

Join(over = From(:person),
     joinee = :location => From(:location),
     on = Get.location_id .== Get.location.location_id,
     left = true) |>
Select(Get.person_id, Get.location.state)

The output of Join combines columns of both input datasets, which will cause ambiguity if both datasets have a column with the same name. Such is the case in the previous example since both tables, person and location, have a column called location_id. To disambiguate them, we can place all columns of one of the datasets into a nested record. This is the action of the arrow (=>) operator or its full form, the As node:

using FunSQL: As

From(:person) |>
LeftJoin(From(:location) |> As(:location),
         on = Get.location_id .== Get.location.location_id) |>
Select(Get.person_id, Get.location.state)

Alternatively, we could use bound column references, which are described in a later section.

Scalar Operations

Many tabular operations including Join, Select and Where are parameterized with scalar operations. A scalar operation acts on an individual row of a dataset and produces a scalar value. Scalar operations are assembled from literal values, column references, and applications of SQL functions and operators. Below is a list of scalar operations available in FunSQL.

Constructor	Function
`Agg`	apply an aggregate function
`As`	assign a column alias
`Bind`	create a correlated subquery
`Fun`	apply a scalar function or a scalar operator
`Get`	produce the value of a column
`Lit`	produce a constant value
`Sort`	indicate the sort order
`Var`	produce the value of a query parameter

`Lit`: SQL Literals

The Lit node creates a literal value, although we could usually omit the constructor:

using FunSQL: Lit

Select(Lit(42))
Select(42)

The SQL value NULL is represented by the Julia constant missing:

q = Select(missing)

render(conn, q) |> print
#-> SELECT NULL AS "_"

`Get`: Column References

The Get node creates a column reference. The Get constructor admits several equivalent forms:

Get.year_of_birth
Get(:year_of_birth)
Get."year_of_birth"
Get("year_of_birth")

Such column references are resolved at the place of use against the input dataset. As we mentioned earlier, sometimes column references cannot be resolved unambiguously. To alleviate this problem, we can bind the column reference to the node that produces it:

Show all patients with their state of residence.

qₚ = From(:person)
qₗ = From(:location)
q = qₚ |>
    LeftJoin(qₗ, on = qₚ.location_id .== qₗ.location_id) |>
    Select(qₚ.person_id, qₗ.state)

The notation qₚ.location_id and qₗ.location_id is a syntax sugar for

Get(:location_id, over = qₚ)
Get(:location_id, over = qₗ)

`Fun`: SQL Functions and Operators

SQL functions and operators are represented using the Fun node, which also has several equivalent forms:

Fun.between(Get.year_of_birth, 1930, 1940)
Fun(:between, Get.year_of_birth, 1930, 1940)
Fun."between"(Get.year_of_birth, 1930, 1940)
Fun("between", Get.year_of_birth, 1930, 1940)

Certain SQL operators, notably logical and comparison operators, can be represented using Julia broadcasting notation:

Fun.">="(Get.year_of_birth, 1930)
Get.year_of_birth .>= 1930

Fun.and(Fun."="(Get.city, "CHICAGO"), Fun."="(Get.state, "IL"))
#? VERSION >= v"1.7"
Get.city .== "CHICAGO" .&& Get.state .== "IL"

We should note that FunSQL does not verify if a SQL function or an operator is used correctly or even whether it exists or not. In such a case, FunSQL will generate a SQL query that fails to execute:

q = From(:person) |>
    Select(Fun.frobnicate(Get.year_of_birth))

render(conn, q) |> print
#=>
SELECT frobnicate("person_1"."year_of_birth") AS "frobnicate"
FROM "person" AS "person_1"
=#

DBInterface.execute(conn, q)
#-> ERROR: SQLite.SQLiteException("no such function: frobnicate")

On the other hand, FunSQL will correctly serialize many SQL functions and operators that have irregular syntax including AND, OR, NOT, IN, EXISTS, CASE, and others.

Show the demographic cohort of each patient.

q = From(:person) |>
    Select(Fun.case(Get.year_of_birth .<= 1960, "boomer", "millenial"))

render(conn, q) |> print
#=>
SELECT (CASE WHEN ("person_1"."year_of_birth" <= 1960) THEN 'boomer' ELSE 'millenial' END) AS "case"
FROM "person" AS "person_1"
=#

Since FunSQL recognizes || as a logical or operator, it conflicts with those SQL dialects that use || for string concatenation. Other SQL dialects use function concat for this purpose. In FunSQL, always use Fun.concat, which will pick correct serialization depending on the target SQL dialect.

Show city, state.

q = From(:location) |>
    Select(Fun.concat(Get.city, ", ", Get.state))

render(conn, q) |> print
#=>
SELECT ("location_1"."city" || ', ' || "location_1"."state") AS "concat"
FROM "location" AS "location_1"
=#

When the name of the Fun node contains one or more ? symbols, this name serves as a template of a SQL expression. When the node is rendered, the ? symbols are substituted with the node arguments.

q = From(:person) |>
    Select(Fun."CAST(? AS TEXT)"(Get.year_of_birth))

render(conn, q) |> print
#=>
SELECT CAST("person_1"."year_of_birth" AS TEXT) AS "_"
FROM "person" AS "person_1"
=#

To decide how to render a Fun node, FunSQL checks the node name:

If the name has a specialized implementation of FunSQL.serialize!(), this implementation is used for rendering the node.
If the name contains one or more placeholders (?), the node is rendered as a template.
If the name contains only symbol characters, or if the name starts or ends with a space, the node is rendered as an operator.
Otherwise, the node is rendered as a function.

`Group` and Aggregate Functions

Group and aggregate functions are used for summarizing data to report totals, averages and so on. We start by applying the Group node to partition the input rows into disjoint groups. Then, for each group, we can calculate summary values using aggregate functions. In FunSQL, aggregate functions are created using the Agg node. In the following example, we use the aggregate function Agg.count, which simply counts the number of rows in each group.

Show the number of patients by the year of birth.

using FunSQL: Agg, Group

q = From(:person) |>
    Group(Get.year_of_birth) |>
    Select(Get.year_of_birth, Agg.count())

render(conn, q) |> print
#=>
SELECT
  "person_1"."year_of_birth",
  count(*) AS "count"
FROM "person" AS "person_1"
GROUP BY "person_1"."year_of_birth"
=#

DBInterface.execute(conn, q) |> DataFrame
#=>
10×2 DataFrame
 Row │ year_of_birth  count
     │ Int64          Int64
─────┼──────────────────────
   1 │          1911      1
   2 │          1913      1
   3 │          1922      1
⋮
=#

To indicate that aggregate functions must be applied to the dataset as a whole, we create a Group node without arguments. This is the case where FunSQL notation deviates from SQL, where we would omit the GROUP BY clause to achieve the same effect.

Show the average year of birth.

q = From(:person) |>
    Group() |>
    Select(Agg.avg(Get.year_of_birth))

render(conn, q) |> print
#=>
SELECT avg("person_1"."year_of_birth") AS "avg"
FROM "person" AS "person_1"
=#

DBInterface.execute(conn, q) |> DataFrame
#=>
1×1 DataFrame
 Row │ avg
     │ Float64
─────┼─────────
   1 │  1935.4
=#

In general, the arguments of the Group node form the grouping key so that two rows of the input dataset belongs to the same group when they have the same value of the grouping key. The output of Group contains all distinct values of the grouping key.

Show the US states that are present in the location records.

q = From(:location) |>
    Group(Get.state)

render(conn, q) |> print
#=>
SELECT DISTINCT "location_1"."state"
FROM "location" AS "location_1"
=#

DBInterface.execute(conn, q) |> DataFrame
#=>
10×1 DataFrame
 Row │ state
     │ String
─────┼────────
   1 │ MI
   2 │ WA
   3 │ FL
⋮
=#

FunSQL has no lexical limitations on the use of aggregate functions. While in SQL aggregate functions can only be used in the SELECT or HAVING clauses, there is no such restriction in FunSQL: they could be used in any context where an ordinary expression is permitted. The only requirement is that for each aggregate function, FunSQL can determine the corresponding Group node. It is convenient to imagine that the output of Group contains the grouped rows, which cannot be observed directly, but whose presence in the output allows us to apply aggregate functions.

In particular, we use a regular Where node where SQL would require a HAVING clause.

Show patients who saw a doctor within the last year.

q = From(:visit_occurrence) |>
    Group(Get.person_id) |>
    Where(Agg.max(Get.visit_end_date) .>= Fun.date("now", "-1 year"))

render(conn, q) |> print
#=>
SELECT "visit_occurrence_1"."person_id"
FROM "visit_occurrence" AS "visit_occurrence_1"
GROUP BY "visit_occurrence_1"."person_id"
HAVING (max("visit_occurrence_1"."visit_end_date") >= date('now', '-1 year'))
=#

When the output of Group is blocked by an As node, we need to traverse it with Get in order to use an aggregate function.

For each patient, show the date of their latest visit to a doctor.

q = From(:person) |>
    LeftJoin(:visit_group => From(:visit_occurrence) |> Group(Get.person_id),
             on = Get.person_id .== Get.visit_group.person_id) |>
    Select(Get.person_id,
           Get.visit_group |> Agg.max(Get.visit_start_date))

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  "visit_group_1"."max"
FROM "person" AS "person_1"
LEFT JOIN (
  SELECT
    max("visit_occurrence_1"."visit_start_date") AS "max",
    "visit_occurrence_1"."person_id"
  FROM "visit_occurrence" AS "visit_occurrence_1"
  GROUP BY "visit_occurrence_1"."person_id"
) AS "visit_group_1" ON ("person_1"."person_id" = "visit_group_1"."person_id")
=#

`Partition` and Window Functions

We can relate each row to other rows in the same dataset using the Partition node and window functions. We start by applying the Partition node to partition the input rows into disjoint groups. The rows in each group are reordered according to the given sort order. Unlike Group, which collapses each row group into a single row, the Partition node preserves the original rows, but allows us to relate each row to adjacent rows in the same partition. In particular, we can apply regular aggregate functions, which calculate the summary value of a subset of rows related to the current row.

In the following example, the rows visit_occurrence are partitioned per patient and ordered by the starting date of the visit. The frame clause specifies the subset of rows relative to the current row (the window frame) to be used by aggregate functions. In this example, the frame contains all rows prior to the current row.

For each visit, show the time passed since the previous visit.

using FunSQL: Partition

q = From(:visit_occurrence) |>
    Partition(Get.person_id,
              order_by = [Get.visit_start_date],
              frame = (mode = :rows, start = -Inf, finish = -1)) |>
    Select(Get.person_id,
           Get.visit_start_date,
           Get.visit_end_date,
           :gap => Fun.julianday(Get.visit_start_date) .- Fun.julianday(Agg.max(Get.visit_end_date)))

render(conn, q) |> print
#=>
SELECT
  "visit_occurrence_1"."person_id",
  "visit_occurrence_1"."visit_start_date",
  "visit_occurrence_1"."visit_end_date",
  (julianday("visit_occurrence_1"."visit_start_date") - julianday((max("visit_occurrence_1"."visit_end_date") OVER (PARTITION BY "visit_occurrence_1"."person_id" ORDER BY "visit_occurrence_1"."visit_start_date" ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)))) AS "gap"
FROM "visit_occurrence" AS "visit_occurrence_1"
=#

DBInterface.execute(conn, q) |> DataFrame
#=>
27×4 DataFrame
 Row │ person_id  visit_start_date  visit_end_date  gap
     │ Int64      String            String          Float64?
─────┼────────────────────────────────────────────────────────
   1 │      1780  2008-04-09        2008-04-13      missing
   2 │      1780  2008-04-10        2008-04-10           -3.0
   3 │      1780  2008-11-22        2008-11-22          223.0
   4 │      1780  2009-05-22        2009-05-22          181.0
⋮
=#

Query Parameters

A SQL query may include a reference to a query parameter. When we execute such a query, we must supply the actual values for all parameters used in the query. This is a restricted form of dynamic query construction directly supported by SQL syntax.

Show all patients born between $start_year and $end_year.

sql = """
SELECT p.person_id
FROM person p
WHERE p.year_of_birth BETWEEN ? AND ?
"""

DBInterface.execute(conn, sql, (1930, 1940)) |> DataFrame
#=>
3×1 DataFrame
 Row │ person_id
     │ Int64
─────┼───────────
   1 │      1780
   2 │     30091
   3 │     72120
=#

FunSQL can be used to construct a query with parameters. Similar to Get, parameter references are created using the Var node:

using FunSQL: Var

q = From(:person) |>
    Where(Fun.between(Get.year_of_birth, Var.START_YEAR, Var.END_YEAR)) |>
    Select(Get.person_id)

render(conn, q) |> print
#=>
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" BETWEEN ?1 AND ?2)
=#

While we specified parameters by name, in the generated SQL query the same parameters are numbered. FunSQL will automatically pack named parameters in the order in which they appear in the SQL query.

DBInterface.execute(conn, q, START_YEAR = 1930, END_YEAR = 1940) |> DataFrame
#=>
3×1 DataFrame
 Row │ person_id
     │ Int64
─────┼───────────
   1 │      1780
   2 │     30091
   3 │     72120
=#

Correlated Queries

An inner query is a SQL query that is included into the outer query as a part of a scalar expression. An inner query must either produce a single value or be used as an argument of a query operator, such as IN or EXISTS, which transforms the query output to a scalar value.

It is easy to assemble an inner query with FunSQL:

Find the oldest patients.

qᵢ = From(:person) |>
     Group() |>
     Select(Agg.min(Get.year_of_birth))

qₒ = From(:person) |>
     Where(Get.year_of_birth .== qᵢ) |>
     Select(Get.person_id, Get.year_of_birth)

render(conn, qₒ) |> print
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" = (
  SELECT min("person_2"."year_of_birth") AS "min"
  FROM "person" AS "person_2"
))
=#

DBInterface.execute(conn, qₒ) |> DataFrame
#=>
1×2 DataFrame
 Row │ person_id  year_of_birth
     │ Int64      Int64
─────┼──────────────────────────
   1 │    110862           1911
=#

Find patients with no visits to a healthcare provider.

qᵢ = From(:visit_occurrence) |>
     Select(Get.person_id)

qₒ = From(:person) |>
     Where(Fun.not_in(Get.person_id, qᵢ))

render(conn, qₒ) |> print
#=>
SELECT
  "person_1"."person_id",
  ⋮
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
WHERE ("person_1"."person_id" NOT IN (
  SELECT "visit_occurrence_1"."person_id"
  FROM "visit_occurrence" AS "visit_occurrence_1"
))
=#

DBInterface.execute(conn, qₒ) |> DataFrame
#=>
0×18 DataFrame
 Row │ person_id  gender_concept_id  year_of_birth  month_of_birth  day_of_bir ⋯
     │ Int64?     Int64?             Int64?         Int64?          Int64?     ⋯
─────┴──────────────────────────────────────────────────────────────────────────
                                                              14 columns omitted
=#

The inner query may depend on the data from the outer query. Such inner queries are called correlated. In FunSQL, correlated queries are created using the Bind node. Specifically, in the body of a correlated query we use query parameters to refer to the external data. The Bind node, which wrap the correlated query, binds each parameter to an expression evaluated in the context of the outer query.

Find all visits where at least one condition was diagnosed.

using FunSQL: Bind

CorrelatedCondition(person_id, start_date, end_date) =
    From(:condition_occurrence) |>
    Where(Fun.and(Get.person_id .== Var.PERSON_ID,
                  Fun.between(Get.condition_start_date, Var.START_DATE, Var.END_DATE))) |>
    Bind(:PERSON_ID => person_id,
         :START_DATE => start_date,
         :END_DATE => end_date)

q = From(:visit_occurrence) |>
    Where(Fun.exists(CorrelatedCondition(Get.person_id, Get.visit_start_date, Get.visit_end_date)))

render(conn, q) |> print
#=>
SELECT
  "visit_occurrence_1"."visit_occurrence_id",
  ⋮
  "visit_occurrence_1"."visit_source_concept_id"
FROM "visit_occurrence" AS "visit_occurrence_1"
WHERE (EXISTS (
  SELECT "condition_occurrence_1"."condition_occurrence_id"
  FROM "condition_occurrence" AS "condition_occurrence_1"
  WHERE
    ("condition_occurrence_1"."person_id" = "visit_occurrence_1"."person_id") AND
    ("condition_occurrence_1"."condition_start_date" BETWEEN "visit_occurrence_1"."visit_start_date" AND "visit_occurrence_1"."visit_end_date")
))
=#

`Order` and `Limit`

The database server emits the output rows in an arbitrary order. In fact, different runs of the same query may produce rows in a different order. To specify a particular order of output rows, we use the Order node.

Show patients ordered by the year of birth.

using FunSQL: Order

q = From(:person) |>
    Order(Get.year_of_birth)

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  ⋮
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth"
=#

The Asc and the Desc modifiers specify whether to sort the rows in an ascending or in a descending order.

Show patients ordered by the year of birth in the descending order.

using FunSQL: Desc

q = From(:person) |>
    Order(Get.year_of_birth |> Desc())

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  ⋮
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth" DESC
=#

The Limit node lets us take a slice of the input dataset. To make the output deterministic, Limit must be applied right after Order.

Show the top three oldest patients.

using FunSQL: Limit

q = From(:person) |>
    Order(Get.year_of_birth) |>
    Limit(1:3)

render(conn, q) |> print
#=>
SELECT
  "person_1"."person_id",
  ⋮
  "person_1"."ethnicity_source_concept_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth"
LIMIT 3
OFFSET 0
=#

`Append` and `Iterate`

The Append node concatenates two or more input datasets. Only the columns that are present in every input dataset will be included to the output.

Show all clinical events (visits and conditions) associated with each patient.

using FunSQL: Append

q = From(:visit_occurrence) |>
    Define(:type => "visit", :date => Get.visit_start_date) |>
    Append(From(:condition_occurrence) |>
           Define(:type => "condition", :date => Get.condition_start_date)) |>
    Order(Get.person_id, Get.date)

render(conn, q) |> print
#=>
SELECT
  "union_1"."visit_occurrence_id",
  "union_1"."person_id",
  "union_1"."provider_id",
  "union_1"."type",
  "union_1"."date"
FROM (
  SELECT
    "visit_occurrence_1"."visit_occurrence_id",
    "visit_occurrence_1"."person_id",
    "visit_occurrence_1"."provider_id",
    'visit' AS "type",
    "visit_occurrence_1"."visit_start_date" AS "date"
  FROM "visit_occurrence" AS "visit_occurrence_1"
  UNION ALL
  SELECT
    "condition_occurrence_1"."visit_occurrence_id",
    "condition_occurrence_1"."person_id",
    "condition_occurrence_1"."provider_id",
    'condition' AS "type",
    "condition_occurrence_1"."condition_start_date" AS "date"
  FROM "condition_occurrence" AS "condition_occurrence_1"
) AS "union_1"
ORDER BY
  "union_1"."person_id",
  "union_1"."date"
=#

DBInterface.execute(conn, q) |> DataFrame
#=>
53×5 DataFrame
 Row │ visit_occurrence_id  person_id  provider_id  type       date
     │ Int64                Int64      Int64        String     String
─────┼────────────────────────────────────────────────────────────────────
   1 │               88179       1780         5247  visit      2008-04-09
   2 │               88246       1780        61112  visit      2008-04-10
   3 │               88246       1780        61112  condition  2008-04-10
   4 │               88214       1780        12674  visit      2008-11-22
   5 │               88214       1780        12674  condition  2008-11-22
   6 │               88263       1780        61118  visit      2009-05-22
   7 │               88263       1780        61118  condition  2009-05-22
   8 │             1454922      30091        36303  visit      2008-11-12
  ⋮  │          ⋮               ⋮           ⋮           ⋮          ⋮
  47 │             5314671     110862         5159  condition  2008-09-07
  48 │             5314690     110862        31906  visit      2009-06-30
  49 │             5314690     110862        31906  condition  2009-06-30
  50 │             5314664     110862        31857  visit      2009-09-30
  51 │             5314664     110862        31857  condition  2009-09-30
  52 │             5314696     110862       192777  visit      2010-06-07
  53 │             5314696     110862       192777  condition  2010-06-07
                                                           38 rows omitted
=#

For a second example, consider the table concept, which contains the vocabulary of medical concepts (such as Myocardial Infarction). These concepts may be related to each other (Myocardial Infarction has a subtype Acute Myocardial Infarction), and their relationships are stored in the table concept_relationship. We can encapsulate construction of a query that finds immediate subtypes as the function:

SubtypesOf(base) =
    From(:concept) |>
    Join(From(:concept_relationship) |>
         Where(Get.relationship_id .== "Is a"),
         on = Get.concept_id .== Get.concept_id_1) |>
    Join(:base => base,
         on = Get.concept_id_2 .== Get.base.concept_id)

Show the concept "Myocardial Infarction" and its immediate subtypes.

base = From(:concept) |>
       Where(Get.concept_name .== "Myocardial infarction")

q = base |> Append(SubtypesOf(base))

DBInterface.execute(conn, q) |> DataFrame
#=>
2×10 DataFrame
 Row │ concept_id  concept_name                 domain_id  vocabulary_id  conc ⋯
     │ Int64       String                       String     String         Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    4329847  Myocardial infarction        Condition  SNOMED         Clin ⋯
   2 │     312327  Acute myocardial infarction  Condition  SNOMED         Clin
                                                               6 columns omitted
=#

But how can we fetch not just immediate, but all of the subtypes of a concept?

Show the concept "Myocardial Infarction" and all of its subtypes.

A good start is to repeatedly apply SubtypesOf and concatenate all the outputs:

base |>
Append(SubtypesOf(base),
       SubtypesOf(SubtypesOf(base)),
       SubtypesOf(SubtypesOf(SubtypesOf(base))),
       SubtypesOf(SubtypesOf(SubtypesOf(SubtypesOf(base)))))

However we do not know if 4 iterations of SubtypesOf is enough to fully traverse the concept hierarchy. Ideally, we should continue applying SubtypesOf until the last iteration produces an empty output. This is exactly the action of the Iterate node.

using FunSQL: Iterate

q = base |>
    Iterate(SubtypesOf(From(^)))

render(conn, q) |> print
#=>
WITH RECURSIVE "__1" ("concept_id", "concept_name", "domain_id", "vocabulary_id", "concept_class_id", "standard_concept", "concept_code", "valid_start_date", "valid_end_date", "invalid_reason") AS (
  SELECT
    "concept_1"."concept_id",
    "concept_1"."concept_name",
    "concept_1"."domain_id",
    "concept_1"."vocabulary_id",
    "concept_1"."concept_class_id",
    "concept_1"."standard_concept",
    "concept_1"."concept_code",
    "concept_1"."valid_start_date",
    "concept_1"."valid_end_date",
    "concept_1"."invalid_reason"
  FROM "concept" AS "concept_1"
  WHERE ("concept_1"."concept_name" = 'Myocardial infarction')
  UNION ALL
  SELECT
    "concept_2"."concept_id",
    "concept_2"."concept_name",
    "concept_2"."domain_id",
    "concept_2"."vocabulary_id",
    "concept_2"."concept_class_id",
    "concept_2"."standard_concept",
    "concept_2"."concept_code",
    "concept_relationship_2"."valid_start_date",
    "concept_relationship_2"."valid_end_date",
    "concept_relationship_2"."invalid_reason"
  FROM "concept" AS "concept_2"
  JOIN (
    SELECT
      "concept_relationship_1"."valid_start_date",
      "concept_relationship_1"."valid_end_date",
      "concept_relationship_1"."invalid_reason",
      "concept_relationship_1"."concept_id_2",
      "concept_relationship_1"."concept_id_1"
    FROM "concept_relationship" AS "concept_relationship_1"
    WHERE ("concept_relationship_1"."relationship_id" = 'Is a')
  ) AS "concept_relationship_2" ON ("concept_2"."concept_id" = "concept_relationship_2"."concept_id_1")
  JOIN "__1" AS "__2" ON ("concept_relationship_2"."concept_id_2" = "__2"."concept_id")
)
SELECT
  "concept_3"."concept_id",
  "concept_3"."concept_name",
  "concept_3"."domain_id",
  "concept_3"."vocabulary_id",
  "concept_3"."concept_class_id",
  "concept_3"."standard_concept",
  "concept_3"."concept_code",
  "concept_3"."valid_start_date",
  "concept_3"."valid_end_date",
  "concept_3"."invalid_reason"
FROM "__1" AS "concept_3"
=#

DBInterface.execute(conn, q) |> DataFrame
#=>
6×10 DataFrame
 Row │ concept_id  concept_name                       domain_id  vocabulary_id ⋯
     │ Int64       String                             String     String        ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    4329847  Myocardial infarction              Condition  SNOMED        ⋯
   2 │     312327  Acute myocardial infarction        Condition  SNOMED
   3 │     434376  Acute myocardial infarction of a…  Condition  SNOMED
   4 │     438170  Acute myocardial infarction of i…  Condition  SNOMED
   5 │     438438  Acute myocardial infarction of a…  Condition  SNOMED        ⋯
   6 │     444406  Acute subendocardial infarction    Condition  SNOMED
                                                               6 columns omitted
=#