Seems that some people don’t know much about kses. It’s really not all that complicated, but there doesn’t seem to be a lot of documentation around for it, so what the hell.
The kses package is short for “kses strips evil scripts”, and it is basically an HTML filtering mechanism. It can read HTML code, no matter how malformed it is, and filter out undesirable bits. The idea is to allow some safe subset of HTML through, so as to prevent various forms of attacks.
However, by necessity, it also includes a passable HTML parser, albeit not a complete one. Bits of it can be used for your own plugins and to make things a bit easier all around.
Note that the kses included in WordPress is a modified version of the original kses. I’ll only be discussing it here, not the original package.
Filtering
The basic use of kses is as a filter. It will eliminate any HTML that is not allowed. Here’s how it works:
$filtered = wp_kses($unfiltered, $allowed_html, $allowed_protocols);
Simple, no?
The allowed html parameter is an array of HTML that you want to allow through. The array can look sorta like this:
$allowed_html = array(
'a' => array(
'href' => array (),
'title' => array ()),
'abbr' => array(
'title' => array ()),
'acronym' => array(
'title' => array ()),
'b' => array(),
'blockquote' => array(
'cite' => array ()),
'cite' => array (),
'code' => array(),
'del' => array(
'datetime' => array ()),
'em' => array (), 'i' => array (),
'q' => array(
'cite' => array ()),
'strike' => array(),
'strong' => array(),
);
As you can see, it’s rather simple. The main array is a list of HTML tags. Each of those points to an array of allowable attributes for those tags. Each of those points to an empty array, because kses is somewhat recursive in this manner.
Any HTML that is not in that list will get stripped out of the string.
The allowed protocols is basically a list of protocols for links that it will allow through. The default is this:
array ('http', 'https', 'ftp', 'ftps', 'mailto', 'news', 'irc', 'gopher', 'nntp', 'feed', 'telnet')
Anything else goes away.
That $allowed_html I gave before may look familiar. It’s the default set of allowed HTML in comments on WordPress. This is stored in a WordPress global called $allowedtags. So you can use this easily like so:
global $allowedtags;
$filtered = wp_kses($unfiltered, $allowedtags);
This is so useful that WordPress 2.9 makes it even easier:
$filtered = wp_kses_data($unfiltered);
$filtered = wp_filter_kses($unfiltered); // does the same, but will also slash escape the data
That uses the default set of allowed tags automatically. There’s another set of defaults, the allowed post tags. This is the set that is allowed to be put into Posts by non-admin users (admins have the “unfiltered_html” capability, and can put anything they like in). There’s easy ways to use that too:
$filtered = wp_kses_post($unfiltered);
$filtered = wp_filter_post_kses($unfiltered); // does the same, but will also slash escape the data
Note that because of the way they are written, they make perfect WordPress filters as well.
add_filter('the_title','wp_kses_data');
This is exactly how WordPress uses them for several internal safety checks.
Now, this is all very handy, but what if I’m not filtering? What if I’m trying to get some useful information out of HTML? Well, kses can help you there too.
Parsing
As part of the filtering mechanism, kses includes a lot of functions to parse the data and to try to find HTML in there, no matter how mangled up and weird looking it might be.
One of these functions is wp_kses_split. It’s not something that is useful directly, but it is useful to understand how kses works. The wp_kses_split function basically finds anything that looks like an HTML tag, then passes it off to wp_kses_split2.
The wp_kses_split2 function takes that tag, cleans it up a bit, and perhaps even recursively calls kses on it again, just in case. But eventually, it calls wp_kses_attr. The wp_kses_attr is what parses the attributes of any HTML tag into chunks and then removes them according to your set of allowed rules. But here’s where we finally find something useful: wp_kses_hair.
The wp_kses_hair function can parse attributes of tags into PHP lists. Here’s how you can use it.
Let’s say we’ve got a post with a bunch of images in it. We’d like to find the source (src) of all those images. This code will do it:
global $post;
if ( preg_match_all('/<img (.+?)>/', $post->post_content, $matches) ) {
foreach ($matches[1] as $match) {
foreach ( wp_kses_hair($match, array('http')) as $attr)
$img[$attr['name']] = $attr['value'];
echo $img['src'];
}
}
What happened there? Well, quite a bit, actually.
First we used preg_match_all to find all the img tags in a post. The regular expression in preg_match_all gave us all the attributes in the img tags, in the form of a string (that is what the “(.+?)” was for). Next, we loop through our matches, and pass each one through wp_kses_hair. It returns an array of name and value pairs. A quick loop through that to set up a more useful $img array, and voila, all we have to do is to reference $img[‘src’] to get the content of the src attribute. Equally accessible is every other attribute, such as $img[‘class’] or $img[‘id’].
Here’s an example piece of code, showing how kses rejects nonsense:
$content = 'This is a test. <img src="test.jpg" class="testclass another" id="testid" fake fake... / > More';
if ( preg_match_all('/<img (.+?)>/', $content, $matches) ) {
foreach ($matches[1] as $match) {
foreach ( wp_kses_hair($match, array('http')) as $attr)
$img[$attr['name']] = $attr['value'];
print_r($img); // show what we got
}
}
The resulting output from the above:
Array
(
[src] => test.jpg
[class] => testclass another
[id] => testid
[fake] =>
)
Very nice and easy way to parse selected pieces of HTML, don’t you think?
Overriding kses
Want to apply some kind of filter of your own to things? WordPress kindly adds a filter hook to all wp_kses calls: pre_kses.
function my_filter($string) {
// do stuff to string
return $string;
}
add_filter('pre_kses', 'my_filter');
Or maybe you want to add your own tags to the allowed list? Like, what if you wanted comments to be able to have images in them, but (sorta) safely?
global $allowedtags;
$allowedtags['img'] = array( 'src' => array () );
What if you want total control? Well, there’s a CUSTOM_TAGS define. If you set that to true, then the $allowedposttags, $allowedtags, and $allowedentitynames variables won’t get set at all. Feel free to define your own globals. I recommend copying them out of kses.php and then editing them if you want to do this.
And of course, if you only want to do a small bit of quick filtering, this sort of thing is always a valid option as well:
// only allow a hrefs through
$filtered = wp_kses($unfiltered, array( 'a' => array( 'href' => array() ) ));
Hopefully that answers some kses questions. It’s not a complete HTML parser by any means, but for quick and simple tasks, it can come in very handy.
Note: kses is NOT 100% safe. It’s very good, but it’s not a full-fledged HTML parser. It’s just safer than not using it. There’s always the possibility that somebody can figure out a way to sneak bad code through. It’s just a lot harder for them to do it.
WordPress 3.0 Theme Tip: The Comment Form
WordPress 3.0 has something very handy that I want theme authors to start implementing as soon as possible.
To show exactly why it’s so useful, I modified my own theme to start using it.
Demonstration
So, here’s a big hunk of code I pulled out of my current theme’s comments.php. This hunk of code has only one purpose: To display the form area where people can leave a comment:
Nasty, eh? It’s a mess of if/else statements. It handles cases where the user is logged in or not, where the comments are open or closed, whether registration is required, etc. It’s confusing, difficult to modify, poor for CSS referencing…
Here’s what I replaced all that code with:
Now then, that’s much better, isn’t it?
The comment_form function is new to 3.0. Basically, it standardizes the comments form. It makes it wonderful for us plugin authors, since now we can easily modify the comments form with various hooks and things. I’ve already modified Simple Facebook Connect and Simple Twitter Connect to support this new approach; if you’re using a theme with this, then the user won’t have to modify it to have their buttons appear on the comments form.
Customizing
Since theme authors love to customize things, the comments form is also extremely customizable. Doing it, however, can be slightly confusing.
Inside the comments_form function, we find some useful hooks to let us change things around.
The first hook is comment_form_default_fields. This lets us modify the three main fields: author, email, and website. It’s a filter, so we can change things as they pass through it. The fields are stored in an array which contains the html that is output. So it looks sorta like this:
I truncated it for simplicity. But what this means is that code like this can modify the fields:
That sort of thing lets us add a new input field, or modify the existing ones, etc…
But fields aren’t the only thing we can change. There’s a comment_form_defaults filter too. It gets a lot of the surrounding text of the comments form. The defaults look sorta like this:
All the various pieces of html that are displayed as part of the comment form section are defined here. So those can be modified as you see fit. However, unlike the fields, adding new bits here won’t help us at all. The fields get looped through for displaying them, these are just settings that get used at various times.
But filters are not the only way to modify these. The comment_form function actually can take an array of arguments as the first parameter, and those arguments will modify the form. So if we wanted a simple change, like to change the wording of “Leave a Reply”, then we could do this:
This gives us a simple and easy way to make changes without all the trouble of filters. Nevertheless, those filters can be very useful for more complex operations.
But wait, there’s more!
As the comments form is being created, there’s a ton of action hooks being called, at every stage. So if you want to insert something into the form itself, there’s easy ways to do it.
A quick list of the action hooks. Most of them are self-explanatory.
CSS and other extras
Let’s not forget styling. All parts of the comments form have nice classes and id’s and such. Take a look at the resulting HTML source and you’ll find all the styling capabilities you like. Also, everything is properly semantic, using label tags and aria-required and so forth. All the text is run through the translation system for core translations.
So theme authors should start modifying their themes to use this instead of the existing big-ugly-comment-form code. Your users will thank you for it. Plugin authors will thank you for it. And really, it’s about time we made WordPress themes more about design and less about the nuts and bolts of the programming, no?
Category: Code, WordPress | 242 Comments