So, I first wrote about this topic on the wp-hackers list back in January 2009, explaining some of the scaling issues involved with having ambiguous rewrite rules and loads of static Pages in WordPress. A year later the same topic came up again in the WPTavern Forums, and later I wrote a blog post about the issue in more detail. That post generated lots of questions and responses.
In August 2011, thanks to highly valuable input from Andy Skelton which gave me a critical insight needed to make it work, and with Jon Cave and Mark Jaquith doing testing (read: breaking my patches over and over again), I was able to create a patch which fixed the problem (note: my final patch was slightly over-complicated, Jon Cave later patched it again to simplify some of the handling). This patch is now in WordPress 3.3.
So I figured I’d write up a quick post explaining the patch, how it works, and the subsequent consequences of it.
Quick Summary of the Problem
The original underlying problem is that WordPress relies on a set of rules to determine what type of page you’re looking for, and it uses only the URL itself to do this. Basically, given /some/url/like/this, WordPress has to figure out a) what you’re trying to see and b) how to query for that data from the database. The only information it has to help it do this is called the “rewrite rules”, which is basically a big list of regular expressions that turn the “pretty” URL into variables used for the main WP_Query system.
The user of the WordPress system has direct access to exactly one of these rewrite rules, which is the “Custom Structure” on the Settings->Permalink page. This custom string can be used to change what the “single post” URLs look like.
The problem is that certain custom structures will interfere with existing structures. If you make a custom structure that doesn’t start with something easily identifiable, like a number, then the default rewrite rules wouldn’t be able to cope with it.
To work around this problem, WordPress detected it and uses a flag called “verbose_rewrite_rules”, which triggers everything into changing the list of rules into more verbose ones, making the ambiguous rules into unambiguous ones. It did this by the simple method of making all Pages into static rules.
This works fine, but it doesn’t scale to large numbers of Pages. Once you have about 50-100 static Pages or so, and you’re using an ambiguous custom structure, then the system tends to fall apart. Most of the time, the ruleset grows too large to fit into a single mySQL query, meaning that the rules can no longer be properly saved in the database and must be rebuilt each time. The most obvious effect when this happens is that the number of queries on every page load rises from the below 50 range to 2000+ queries, and the site slows down to snail speed.
The “Fix”
The solution to this problem is deeper than simple optimizations. Remember that I said “WordPress relies on a set of rules to determine what type of page you’re looking for, and it uses only the URL itself to do this”. Well, to fix the problem, we have to give WordPress more input than just the URL. Specifically, we make it able to find out what Pages exist in the database.
When you use an ambiguous custom structure, WordPress 3.3 still detects that, and it still sets the verbose_page_rules flag. However, the flag now doesn’t cause the Pages to be made unambiguous in the rules. Instead, it changes the way the rules work. Specifically, it causes the following to happen:
- The Page rules now get put in front of the Post rules, and
- The actual matching process can do database queries to determine if the Page exists.
So now what happens is that the Page matching rules are run first, and for an ambiguous case, they’ll indeed match the Page rule. However, for all Page matches, a call to the get_page_by_path function is made, to see if that Page actually exists. If the Page doesn’t exist in the database, then the rule gets skipped even though it matched, and then the Post’s custom structure rules take over and will match the URL.
The Insight
The first patch I made while at WordCamp Montreal used this same approach of calling get_page_by_path, but the problem with it was that get_page_by_path was a rather expensive function to call at the time, especially for long page URLs. It was still better than what existed already, so I submitted the patch anyway, but it was less than ideal.
When I was at WordCamp San Francisco in August, hanging around all these awesome core developers, Andy Skelton commented on it and suggested a different kind of query. His suggestion didn’t actually work out directly, but it did give me the final idea which I implemented in get_page_by_path. Basically, Andy suggested splitting the URL path up into components and then querying for each one. I realized that you could split the path up by components, query for all of them at once, and then do a loop through the URL components in reverse order to determine if the URL referred to a Page that existed in the database or not.
So basically, given a URL like /aaa/bbb/ccc/ddd, get_page_by_path now does this:
SELECT ID, post_name, post_parent FROM $wpdb->posts WHERE post_name IN ('aaa','bbb','ccc','ddd') AND (post_type = 'page' OR post_type = 'attachment')
The results of this are stored in an array of objects using the ID as the array keys (a clever trick Andrew Nacin pointed out to me at the time).
By then looping through that array only once with a foreach, and comparing to the reversed form of the URL (ddd, ccc, bbb, aaa) you can make an algorithm that basically works like this:
foreach(results as res) {
if (res->post_name = 'ddd') {
get the parent of res from the results array
(if it's not in the array, then it can't be the parent of ddd, which is ccc and should be in the array)
check to make sure parent is 'ccc',
loop back to get the parent of ccc and repeat the process until you run out of parents
}
}
This works because all the Pages in our /aaa/bbb/ccc/ddd hierarchy must be in the resulting array from that one query, if /aaa/bbb/ccc/ddd is a valid page. So you can quickly check, using that indexed ID key, to see if they are all there by working backwards. If they are all there, then you’ll eventually get to parent = zero (which is the root) and the post_name = ‘aaa’. If they’re not there, then the loop exits and you didn’t find the Page because it doesn’t actually exist.
So using this one query, you can check for the existence of a Page any number of levels deep fairly quickly and without lots of expensive database operations.
Consequences
There are still some drawbacks though.
In theory, you could break this by making lots and lots of Pages, if you also made their hierarchy go hundreds of levels deep and thus make the loop operation take a long time. This seems unlikely to me, or at least way more unlikely than somebody making a mere couple hundreds of Pages. Also, WordPress won’t let you use the same Page name twice on the same level, so you’d really have to try for it to make this take too long.
If you try to make a URL longer than around 900K or so, the query will break. Pretty sure it’d break before that though, and anyway most people can’t remember URLs with the contents of a whole book in them. 😉
This also adds one SQL operation to every single Post page lookup. However, this is still better than having it break and try to run a few thousand queries every time in order to build rewrite rules which it can’t ultimately save. And the SQL being used is relatively fast, since post_name and post_type are both indexed fields.
Basically, for the very few and specific cases that had the problem, the speedup is dramatic and immediate. For the cases that use unambiguous rules, nothing has changed at all.
There’s still some bits that need to be fixed. Some of the code is duplicated in a couple of places and that needs to be merged. The pagename rewrite rule is a bit of a hack to avoid clashing, but it works everywhere even if it does make the regexp purist groan with dismay (for critics of this, please know that I did indeed try to do this using a regexp comment to make the difference instead of the strange and silly expression, but it doesn’t work because the regexp needs to be in a PHP array key).
Anyway, there you have it. I wrote the patch, but at least 5 other core developers contributed ideas and put in the grunt work in testing the result. A lot of brain power from these guys went into what is such a small little thing, really. A bit obscure, but I figured some people might like to read about it. 🙂