Understanding recursive queries in PostgreSQL

Perhaps 3-4 years ago in an interview, I was asked a query to print a product hierarchy or something in the context of an e-commerce site database. I couldn’t, of course, as I barely had a handle on basic aggregations at that time.

To be fair, I still don’t know whether there’s a non-brain-twisting query that’d print the entire hierarchy of a tree-like system. However, I do know now have to do a subset of it – i.e., show all the results within a parent category, with several nested subcategories.

A typical example comes from employee data, where employees are also managers. We can represent such a situation in the table by adding another manager_id column:

CREATE TABLE employees (
    employee_id serial PRIMARY KEY,
    full_name VARCHAR NOT NULL,
    manager_id INT

So, yes, the top-level employees will have a manager_id of 0 or NULL or something, while all others will have a valid integer. Now, suppose we have to find all the people that are directly or indirectly managed by employee id 2. How do we do it in a single query? a JOIN will not help us, simply because we don’t know how many times the query should run. We only know that it should start with id 2 and keep looking deeper until it reaches a dead-end in every search. Sounds a lot like recursion!

The nice thing here is that we don’t have to write a recursive query the way we write a function call (we also cannot, even if we try). SQL (and thus, PostgreSQL) has a special construct for such queries – the RECURSIVE keyword and a well-defined structure.

It’s hard to explain without an example, so let’s see the query right away:

WITH RECURSIVE subordinates_of_2 AS (
    FROM employees WHERE employee_id = 2
    FROM employees e JOIN subordinates_of_2 s2 ON e.manager_id = s2.employee_id
SELECT * FROM subordinates_of_2 WHERE employee_id <> 2;

The WITH ... part is called a Common Table Expression (CTE), and can be thought of a name assigned to the parenthesized query that we reference in the final expression (subordinates_of_2). Next comes the RECURSIVE keyword, which tells the PostgreSQL engine that the expression coming next needs to be executed recursively. Recursively how?

To tell Postgres to execute the query recursively, we have to tell it three things:

Head hurts? Mine does too! As a programmer, it’s kinda hard to see how recursion gets executed in this query, but you’ll have to trust the engine on this one. The query starts by generating a list of employees with id 2, and then keeps JOINing things until it doesn’t find any more employees who are managers. Since the final output includes the top-level manager as well, we’ve filtered it out using WHERE employee_id <> 2.

In terms of syntax, that’s all there is to a recursive query. And yes, I admit it’s a bit of a brain-bender. But it solves a use case that’s pretty much impossible to achieve in a single query otherwise. So, let’s rejoice! 🥳