Hierarchical Data in MySQL

For the purposes of this post, we’ll use a fake website and use the hierarchy in the image below.

Managing a site structure, or any other structure, in a database often gets overlooked. It’s pretty simple and a common standard to use what’s called the Adjacency List Model without sitting down and thinking about other possible solutions. The Adjacency List Model in a database table consists of an ID and a Parent ID. The Parent ID usually refers to the ID of another record in the same table. Here is an example of what one might look like:

------------------------------------------------
|  ID  |      page_name          |  parent_id  |
------------------------------------------------
|  1   |       Home              |  NULL       |
|  2   |       Books             |  1          |
|  3   |       Best Sellers      |  2          |
|  4   |       Cloths            |  1          |
|  5   |       Toys              |  1          |
|  6   |       Electronics       |  1          |
|  7   |       iPods             |  6          |
|  8   |       Computers         |  6          |
|  9   |       Desktops          |  8          |
|  10  |       Laptops           |  8          |
|  11  |       Netbooks          |  8          |
------------------------------------------------

Notice the Home page has an ID of 1 and no parent_id. Best Sellers has the parent_id of 2 which corresponds to the ID of 2 which belongs to Books. Desktops, Laptops and Netbooks all have a parent_id of 8 which corresponds to the ID of Computers, and Computers has 6 as the parent_id which corresponds to Electronics, and finally Electronics has 1 as its parent_id which corresponds to Home. This is the most common form of hierarchical data storage in a database. First thing you’ll notice is that it’s pretty simple. If I’m on a page and I need to find it’s parent, the SQL is easy.

Assume $page_id is the variable that corresponds to ID in the database. Lets retrieve its parent.

SELECT * FROM pages
WHERE page_id = (
	SELECT parent_id FROM pages WHERE page_id = $page_id
)

Similarly, we could get all the child pages

SELECT * FROM pages WHERE parent_id = $page_id

Now you might be thinking there’s nothing wrong with this. After all, it’s very simple and easy to understand, but what if you want to find the entire hierarchy? Suppose you’re on “Laptops” and you want to get the crumbtrail all the way up to the Home page. (For simplicity and clarity, I’ll use the page “name” rather than the ID’s in my examples).

SELECT pageLevel1.page_name AS lev1, pageLevel2.page_name as lev2,
	pageLevel3.page_name as lev3, pageLevel4.page_name as lev4
FROM pages AS pageLevel1
LEFT JOIN pages AS pageLevel2 ON pageLevel2.parent = pageLevel1.page_id
LEFT JOIN pages AS pageLevel3 ON pageLevel3.parent = pageLevel2.page_id
LEFT JOIN pages AS pageLevel4 ON pageLevel4.parent = pageLevel3.page_id
WHERE pageLevel1.page_name = 'Home';

The result is:

---------------------------------------------------------------------
|   lev1         |   lev2         |   lev3         |   lev4         |
---------------------------------------------------------------------
|   Home         |   Electronics  |   Computers    |   Laptops      |
---------------------------------------------------------------------

There’s a problem with this. Besides the large amount of joins, this assumes we know the depths of the hierarchy. In a website application we generate the crumbtrail, navigations, and hierarchy trees dynamically. This Adjacency List Model is too flat and has no easy way of getting the depth of a node.

The Adjacency List Model is great for finding parents of nodes, but not so great at finding grand parent nodes, great grand parent nodes, even second-cousin-once-removed nodes, or the uncle-no-one-talks-about nodes.

Luckily, there is another solution to our database hierarchical structures. It’s called the Nested Set Model. This model uses two fields to determine the spot in the hierarchy as oppose to a single parent_id: a left and a right. Our example above would look like this in a database using the Nested Set Model:

----------------------------------------------------
|  ID  |      page_name          |  lft  |	rgt	|
----------------------------------------------------
|  1   |       Home              |  1    |	22	|
|  2   |       Books             |  2    |	5	|
|  3   |       Best Sellers      |  3    |	4	|
|  4   |       Cloths            |  6    |	7	|
|  5   |       Toys              |  8    |	9	|
|  6   |       Electronics       |  10   |	21	|
|  7   |       iPods             |  11   |	12	|
|  7   |       Computers         |  13   |	20	|
|  9   |       Desktops          |  14   |	15	|
|  10  |       Laptops           |  16   |	17	|
|  11  |       Netbooks          |  18   |	19	|
----------------------------------------------------

If you’re unfamiliar with the Nested Set Model, this might be confusing. All you have to remember is that if an item’s “lft” and “rgt” fall between another rows lft and rgt, it is a sub-node of that row. (Notice this is not “left” and “right” as they are reserved words in SQL, so I used lft and rgt).

Here is a diagram of the the same site with the numbers added to give you an idea of the structure.

Since Home is the top of the hierarchy, it starts it’s lft with the lowest number, and rgt with the highest number; therefore any page that has an lft that’s GREATER than 1 and an rgt that’s LESS than 22 is a sub node of Home. Likewise, Books has 2 and 5 with Best Sellers having 3 and 4 making it a sub node of Books. We can then know that Best Sellers is also a sub node of Home because 3 and 4 fall between 1 and 22. Hopefully this is making some sense. If you need help following the flow of the above chart, follow the blue line of this chart. Notice the blue lines go in numeric order.

This might seem quite a bit tougher than the Adjacency List Model, and you’d be right. The implementation is tougher, but once you start using it in your application you won’t be sorry. Let’s try to get a crumbtrail from “Laptops” again.

SELECT parent.page_name
FROM pages AS node,
pages AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.page_name = 'Laptops'
ORDER BY parent.lft;

This creates the same results:

------------------
|   name         |
------------------
|   Home         |
|   Electronics  |
|   Computers    |
|   Laptops      |
------------------

This was much simpler. I just found “Laptops” lft and rgt and I’m basically saying “find me all the pages where the left is less than Laptop’s left and the right is greater than Laptop’s right”. This just works its way right up the hierarchy.

What about finding the entire depth of a tree with this model?

SELECT lft,rgt,page_id,page_name,
    (
		SELECT COUNT(page_id) FROM pages
		WHERE lft < node.lft 
		AND rgt > node.rgt
		AND node.lft <> 0
	)
	AS depth
FROM `pages` AS node
WHERE lft <> 0 ORDER BY lft

The subquery (line 2 above) is an easy way to find how deep inside the hierarchy we are. We can use this in our code to create a full tree:


$last_depth	= -1;
$output		= '';
$i		= 0;
$pageList = mysql_query("SELECT lft,rgt,page_id,page_name,
	(
		SELECT COUNT(page_id) FROM pages 
		WHERE lft < node.lft 
		AND rgt > node.rgt 
		AND node.lft <> 0
	) AS depth 
	FROM `pages` AS node 
	WHERE lft <> 0 ORDER BY lft");

while($row = mysql_fetch_array($pageList)) {
	if ($i==0){
		$liOffset = $row['depth'];// first (and therefore highest) in the hierarchy
	}
	$current_depth = $row['depth'];

	if ($last_depth > $current_depth) {
		$output .= str_repeat('</ul></li>',($last_depth-$current_depth));
	}

	$output .= "<li id='page_".$row['page_id']."'>".$row['page_name'];

	if ($row['rgt'] - 1 != $row['lft']) {
		$output .= '<ul>';
	} else {
		$output .= '</li>';
	}
	$last_depth = $current_depth;
	$i++;
}
while($last_depth >= ($liOffset+1)) {
	$output .= '</ul></li>';
	$last_depth--;
}
echo $output.'</ul>';

This will give you a perfect <ul> tree with just one, simple SQL statement. Hopefully, you’re starting to see the power of the Nested Set Model compared to the Adjacency List Model. The SQL can get more complicated, but it’s much easier on the MySQL engine. Here is an example of getting just the immediate sub nodes, for example, Books, Cloths, Toys, and Electronics from “Home” but not Best Sellers, iPods, Computers etc.

SELECT node.page_name, (COUNT(parent.page_name) - (sub_tree.depth + 1)) AS depth
FROM pages AS node,
	pages AS parent,
	pages AS sub_parent,
	(
		SELECT node.page_name, (COUNT(parent.page_name) - 1) AS depth
		FROM pages AS node,
		pages AS parent
		WHERE node.lft BETWEEN parent.lft AND parent.rgt
		AND node.page_name = 'Home'
		GROUP BY node.page_name
		ORDER BY node.lft
	)AS sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
	AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
	AND sub_parent.page_name = sub_tree.page_name
GROUP BY node.page_name
HAVING depth <= 1
ORDER BY node.lft;

The results are:

----------------------------
|   name         |  Depth  |
----------------------------
|   Home         |  0      |
|   Books        |  1      |
|   Cloths       |  1      |
|   Toys         |  1      |
|   Electronics  |  1      |
----------------------------

If you familiarize yourself with this model, many things will become second nature to you. You may ask yourself, “how do I add a new node in the Nested Set Model?”. I will leave you not with the SQL, but with the theory. If you wanted to add a node “Servers” under “Computers” to the right of “Netbooks” you would just make the new “lft” be 20 and the “rgt” be 21, and every other node that has an lft or rgt greater than or equal to 20 should have 2 more added to it.

The downside of the Nested Set Model

Although I feel that in many cases the Nested Set Model is superior to the Adjacency List Model, there are some negatives.

1. It is not widely used, therefore support when you run into issues is a limited. You may find some help online, but the vast majority of users and existing applications are not using this model.

2. Data corruption is possible. If you make a mistake in your code when you add a new node, move a node, delete a node etc. you may cause data corruption. For example, what happens if you try to move a Node to be a child of one of it’s on children? If you create your script objects right and always use them after extensively testing to make sure they always work, you should be fine. However if someone goes straight into the database and changes some random pages “lft” value to zero, the repercussions could involve restoring your database to a previous backup or meticulously going through the hierarchy in the back end and manually updating the nodes yourself. This is not a fun task!

3. It can be hard to learn. It’s a steep learning curve. I had to read the same article probably ten times before I fully grasped the SQL queries, but eventually it all clicked, and now that I use it in my applications I feel confident in many of the custom requests I receive.

5 thoughts on “Hierarchical Data in MySQL”

  1. Recently I was study the nested model, I agree with you when what happen when I want to move or delete some node, maybe I’m going to have a Data Corruption, but, what is the best Hierarchical Data Base to use if you know something? please, let me know before I continue with the nested model.
    Regards.

  2. Hello, Frank.

    The model provided in these examples are for a MySQL database. The nested set model can be implemented on virtually any relational database, but with modified SQL statements. If you want to use the nested set model, my recommendation is MySQL.

  3. Interesting, but what to do if I need to add a new node under “Electronics” (lets say “TVs”)?
    Do I have to modify “Electronics” and “Home”?
    Perhaps I didn’t get the point, yet.

  4. To become a node under “Electronics” lets assume you want it to be the first item, and is therefore pushed iPods to second place. You would:

    First find the “lft” value of Electronics (in the above example, it would be 10).
    Second, you want to create the empty ‘slot’ for TV’s by having a SQL statement that pushes every “lft” and “rgt” up by two where the lft or rgt is greater than the Electronic’s lft value (because your new Node of TV’s will have an lft and an rgt, they are updated by two).
    Third, you want to insert your TV node into the spot new spot (by the update in the last point, there is now no node with an lft or rgt of 11 or 12). Now insert TV with 11 and 12 as the lft & rgt respectively.

    The SQL could be combined, but I’ll split it into two.

    Create the slot based on the lft position of Electronics:


    UPDATE pages SET lft = lft+2 WHERE lft > (SELECT lft FROM pages WHERE page_name = 'Electronics');
    UPDATE pages SET rgt = rgt+2 WHERE rgt > (SELECT lft FROM pages WHERE page_name = 'Electronics');

    Insert TV node:


    INSERT INTO pages (page_name,lft,rgt)
    VALUES
    (
        'TV\'s',
        (
            SELECT lft FROM pages WHERE page_name = 'Electronics'
        )+1,
        (
            SELECT lft FROM pages WHERE page_name = 'Electronics'
        )+2
    );

    In a real-world scenario, you’d likely use the ID column rather than the page_name, but for sake of understanding it helps the illustration to use page_name.

Leave a Reply

Your email address will not be published. Required fields are marked *