Three ways to store hierarchical data in SQL

16 Jul 2024

At work, we recently needed to store hierarchical data both MySQL and SQLite, a situation that's worth a short post.

Case study

Let's imagine that we content to tag, and our table of tags looks like this:

id name
1 environment
2 biodiversity
3 deforestation
4 nuclear energy

Now we realise we want some tags to be subtags, so that deforestation and biodiversity are subtags of environment.

To do this, we want to turn the flat tags list into a hierarchy. Let's look at three ways of doing this.

Adjacency list model

In the adjacencly list model, we give every row a reference to its parent row, a parent_id column. The mental model is a linked list.

id name parent_id
1 environment NULL
2 biodiversity 1
3 deforestation 1
4 nuclear energy NULL

The rows without a parent_id are at the top of the hierarchy.

This makes inserting rows cheap but querying the whole hierarchy difficult, as you don’t know the number of joins needed. However, the situation improved as of MySQL 8.0 thanks to recursive common table expressions:1

WITH RECURSIVE all_children AS (
    -- base case, the parent
    SELECT *
    FROM tags
    WHERE name = 'environment'

    UNION ALL

    -- recursive case
    SELECT t.*
    FROM tags t
    INNER JOIN all_children c ON (t.parent_id = c.id)
)

SELECT * FROM all_children

Recursion always needs a base case and a recursive case. Here, the first part of the query selecting the parent is the base case, and the second recursive part selects all children of existing rows.

Nested set model

In the nested set model, every row is a set containing everything between two numbers, left and right. The mental model is a set of cups of different sizes so that some fit into others.

id name left right
1 environment 1 6
2 biodiversity 2 3
3 deforestation 4 5
4 nuclear energy 7 8

environment has the range 1 to 6, a span which includes the left and right values biodiversity and deforestation.

The numbering is what you get if you load the whole hierarchy into memory and walk the tree depth-first, numbering the left node on entry and right node on exit.

To get the children of environment, you query like this:

WITH parent AS (
    SELECT left, right
    FROM tags
    WHERE
        name = 'environment'
)

SELECT *
FROM tags
WHERE
    tags.left BETWEEN parent.left AND parent.right

In this model, you get nice reads, but the numbering system is fragile and needs to be recomputed for many or all rows when adding a new one.2 This makes writes slow and cumbersome, which is still fine if the table is usually read and only occasionally written to.

Fixed-depth hierarchy

Another model I've seen is to have a fixed number of levels in the hierarchy, and have a column for each level. The mental model for this is breadcrumbs.

For example, a table could look like this:

id level_1 level_2 level_3
1 environment NULL NULL
2 environment biodiversity NULL
3 environment deforestation NULL
4 nuclear energy NULL NULL

This is pretty easy to query:

SELECT * FROM tags
WHERE level_1 = 'environment'

Some downsides to this model. If you don’t know the tag level, querying becomes more complex and adds to your application logic. You have to decide decide the number of levels in advance, something you might not know. Updating a row means updating all of its children too. Adding an extra level requires a schema migration to add a new column for it. The same tag gets stored multiple times, which is a slight waste, but not a big deal if your table is small.

Conclusion

Overall, the adjacency list model with a parent_id is likely the best default. It’s compact, easy to understand, query, and insert into. Recursive queries in MySQL 8.0+ and SQLite make it possible to query the whole hierarchy in one go and are quite performant. You are not forced to decide the number of levels in advance, which is helpful.

Footnotes

  1. See: MySQL - Common table expressions

  2. It seems obvious that a floating point scheme could avoid this, but I've never seen one in the wild.