Seems that some people don’t know much about kses. It’s really not all that complicated, but there doesn’t seem to be a lot of documentation around for it, so what the hell.

The name kses is short for “kses strips evil scripts”, and the package is basically an HTML filtering mechanism. It can read HTML code, no matter how malformed it is, and filter out undesirable bits. The idea is to allow some safe subset of HTML through, so as to prevent various forms of attacks.

However, by necessity, it also includes a passable HTML parser, albeit not a complete one. Bits of it can be used for your own plugins and to make things a bit easier all around.

Note that the kses included in WordPress is a modified version of the original kses. I’ll only be discussing it here, not the original package.

Filtering

The basic use of kses is as a filter. It will eliminate any HTML that is not allowed. Here’s how it works:

$filtered = wp_kses($unfiltered, $allowed_html, $allowed_protocols);

Simple, no?

The allowed html parameter is an array of the HTML tags and attributes that you want to allow through. The array can look sorta like this:

$allowed_html = array(
	'a' => array(
		'href' => array (),
		'title' => array ()),
	'abbr' => array(
		'title' => array ()),
	'acronym' => array(
		'title' => array ()),
	'b' => array(),
	'blockquote' => array(
		'cite' => array ()),
	'cite' => array (),
	'code' => array(),
	'del' => array(
		'datetime' => array ()),
	'em' => array (), 'i' => array (),
	'q' => array(
		'cite' => array ()),
	'strike' => array(),
	'strong' => array(),
);

As you can see, it’s rather simple. The main array is a list of HTML tags. Each of those points to an array of allowable attributes for those tags. Each of those points to an empty array, because kses is somewhat recursive in this manner.

Any HTML that is not in that list will get stripped out of the string.
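To make the stripping behavior concrete, here is a minimal sketch. It assumes it runs inside WordPress (so wp_kses is loaded), and the sample string and the cut-down allowed list are made up for the example:

```php
<?php
// Assumes WordPress is loaded; this $allowed_html is a hypothetical, cut-down list.
$allowed_html = array(
	'a'      => array( 'href' => array(), 'title' => array() ),
	'strong' => array(),
);

$unfiltered = '<a href="http://example.com/" onclick="evil()">a <strong>link</strong></a> <script>alert(1)</script>';

// onclick is not listed for <a>, and <script> is not listed at all,
// so both are removed. The text inside the script tag stays behind as plain text.
$filtered = wp_kses( $unfiltered, $allowed_html );
echo $filtered;
```

Note that kses removes disallowed tags and attributes, not the text between the tags, so whatever was inside a stripped element still shows up as ordinary text.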

The allowed protocols parameter is a list of the URL protocols that kses will allow through in links. The default is this:

array ('http', 'https', 'ftp', 'ftps', 'mailto', 'news', 'irc', 'gopher', 'nntp', 'feed', 'telnet')

Anything else goes away.
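A quick sketch of the protocol check, again assuming WordPress is loaded. The sample links and the short protocol list are just for illustration:

```php
<?php
// Assumes WordPress is loaded. Only http and https links are allowed here.
$allowed_html      = array( 'a' => array( 'href' => array() ) );
$allowed_protocols = array( 'http', 'https' );

$unfiltered = '<a href="javascript:alert(1)">bad</a> <a href="https://example.com/">good</a>';

// The javascript: scheme is not in the list, so it gets stripped from the first href.
// The https link is on the list and is left alone.
echo wp_kses( $unfiltered, $allowed_html, $allowed_protocols );
```

Newer versions of WordPress also expose the default list through wp_allowed_protocols() and let you adjust it with the kses_allowed_protocols filter, which is usually tidier than passing your own array everywhere.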

That $allowed_html I gave before may look familiar. It’s the default set of allowed HTML in comments on WordPress. This is stored in a WordPress global called $allowedtags. So you can use this easily like so:

global $allowedtags;
$filtered = wp_kses($unfiltered, $allowedtags);

This is so useful that WordPress 2.9 makes it even easier:

$filtered = wp_kses_data($unfiltered);
$filtered = wp_filter_kses($unfiltered); // does the same, but expects slashed data and returns it slashed

That uses the default set of allowed tags automatically. There’s another set of defaults, the allowed post tags. This is the set that is allowed to be put into Posts by non-admin users (admins have the “unfiltered_html” capability, and can put anything they like in). There are easy ways to use that too:

$filtered = wp_kses_post($unfiltered);
$filtered = wp_filter_post_kses($unfiltered); // does the same, but expects slashed data and returns it slashed

Note that because of the way they are written, they make perfect WordPress filters as well.

add_filter('the_title','wp_kses_data');

This is exactly how WordPress uses them for several internal safety checks.
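The same functions work fine when called directly in your own code. Here is a rough sketch of filtering some submitted HTML once, on the way in, before saving it. The option name, the form field, and the function name are all hypothetical, and a real handler would also check a nonce and the user's capabilities:

```php
<?php
// Assumes WordPress is loaded. 'my_plugin_notice', the 'notice' form field,
// and my_plugin_save_notice() are hypothetical names used only for this sketch.
function my_plugin_save_notice() {
	if ( ! isset( $_POST['notice'] ) ) {
		return;
	}

	// WordPress adds slashes to $_POST, so remove them before filtering.
	$raw = wp_unslash( $_POST['notice'] );

	// Filter once on input, using the post-level set of allowed tags,
	// and store only the cleaned-up version.
	update_option( 'my_plugin_notice', wp_kses_post( $raw ) );
}
```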

Now, this is all very handy, but what if I’m not filtering? What if I’m trying to get some useful information out of HTML? Well, kses can help you there too.

Parsing

As part of the filtering mechanism, kses includes a lot of functions to parse the data and to try to find HTML in there, no matter how mangled up and weird looking it might be.

One of these functions is wp_kses_split. It’s not something that is useful directly, but it is useful to understand how kses works. The wp_kses_split function basically finds anything that looks like an HTML tag, then passes it off to wp_kses_split2.

The wp_kses_split2 function takes that tag, cleans it up a bit, and perhaps even recursively calls kses on it again, just in case. But eventually, it calls wp_kses_attr. The wp_kses_attr is what parses the attributes of any HTML tag into chunks and then removes them according to your set of allowed rules. But here’s where we finally find something useful: wp_kses_hair.

The wp_kses_hair function can parse the attributes of a tag into a PHP array. Here’s how you can use it.

Let’s say we’ve got a post with a bunch of images in it. We’d like to find the source (src) of all those images. This code will do it:

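global $post;
if ( preg_match_all('/<img (.+?)>/', $post->post_content, $matches) ) {
        foreach ($matches[1] as $match) {
                $img = array(); // reset for each tag so attributes don't carry over
                // second argument: protocols allowed in attribute values (https added so https URLs survive)
                foreach ( wp_kses_hair($match, array('http', 'https')) as $attr)
                	$img[$attr['name']] = $attr['value'];
                if ( isset($img['src']) )
                        echo $img['src'];
        }
}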

What happened there? Well, quite a bit, actually.

First we used preg_match_all to find all the img tags in a post. The regular expression in preg_match_all gave us all the attributes in the img tags, in the form of a string (that is what the “(.+?)” was for). Next, we loop through our matches, and pass each one through wp_kses_hair. It returns an array of name and value pairs. A quick loop through that to set up a more useful $img array, and voila, all we have to do is to reference $img['src'] to get the content of the src attribute. Equally accessible is every other attribute, such as $img['class'] or $img['id'].
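If you need this in more than one place, the steps above can be wrapped into a small helper. This is only a sketch: my_img_attributes() is a made-up name, and it assumes WordPress is loaded so that wp_kses_hair exists:

```php
<?php
// Assumes WordPress is loaded. my_img_attributes() is a hypothetical helper name.
function my_img_attributes( $html ) {
	$images = array();
	if ( preg_match_all( '/<img (.+?)>/', $html, $matches ) ) {
		foreach ( $matches[1] as $match ) {
			$img = array(); // start fresh for every tag
			foreach ( wp_kses_hair( $match, array( 'http', 'https' ) ) as $attr ) {
				$img[ $attr['name'] ] = $attr['value'];
			}
			$images[] = $img;
		}
	}
	return $images; // one name => value array per img tag found
}

// Usage: print the src of every image in the current post that has one.
global $post;
foreach ( my_img_attributes( $post->post_content ) as $img ) {
	if ( isset( $img['src'] ) ) {
		echo $img['src'], "\n";
	}
}
```

Returning one array per tag keeps the attributes of different images from bleeding into each other, which is easy to get wrong when a single $img variable is reused across the loop.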

Here’s an example piece of code, showing how kses rejects nonsense:

$content = 'This is a test. <img src="test.jpg" class="testclass another" id="testid" fake fake... / > More';
if ( preg_match_all('/<img (.+?)>/', $content, $matches) ) {
        foreach ($matches[1] as $match) {
                foreach ( wp_kses_hair($match, array('http')) as $attr)
                	$img[$attr['name']] = $attr['value'];
                print_r($img); // show what we got
        }
}

The resulting output from the above:

Array
(
    [src] => test.jpg
    [class] => testclass another
    [id] => testid
    [fake] =>
)

Very nice and easy way to parse selected pieces of HTML, don’t you think?

Overriding kses

Want to apply some kind of filter of your own to things? WordPress kindly adds a filter hook to all wp_kses calls: pre_kses.

function my_filter($string) {
	// do stuff to string
	return $string;
}
add_filter('pre_kses', 'my_filter');
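In current WordPress the pre_kses hook also passes along the allowed HTML and the allowed protocols, in case your filter wants to look at them. A sketch, with a hypothetical callback name and a made-up cleanup step:

```php
<?php
// Assumes WordPress is loaded. my_kses_prefilter() is a hypothetical callback.
function my_kses_prefilter( $string, $allowed_html, $allowed_protocols ) {
	// Example cleanup: drop HTML comments before kses looks at the string.
	// $allowed_html and $allowed_protocols are available here if you need them.
	return preg_replace( '/<!--.*?-->/s', '', $string );
}

// Priority 10, and ask for all three arguments.
add_filter( 'pre_kses', 'my_kses_prefilter', 10, 3 );
```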

Or maybe you want to add your own tags to the allowed list? Like, what if you wanted comments to be able to have images in them, but (sorta) safely?

global $allowedtags;
$allowedtags['img'] = array( 'src' => array () );
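Newer versions of WordPress also have a wp_kses_allowed_html filter, which lets you do the same thing without touching the global directly. The sketch below assumes that filter is available and that comment text is filtered under the pre_comment_content context; the callback name is made up:

```php
<?php
// Assumes a WordPress version that has the wp_kses_allowed_html filter.
// my_allow_comment_images() is a hypothetical callback name.
function my_allow_comment_images( $tags, $context ) {
	// Only change the list used when comment content is being filtered.
	if ( 'pre_comment_content' === $context ) {
		$tags['img'] = array( 'src' => array(), 'alt' => array() );
	}
	return $tags;
}
add_filter( 'wp_kses_allowed_html', 'my_allow_comment_images', 10, 2 );
```

Checking the context keeps the extra tag limited to comments instead of loosening every place that uses the default list.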

What if you want total control? Well, there’s a CUSTOM_TAGS define. If you set that to true, then the $allowedposttags, $allowedtags, and $allowedentitynames variables won’t get set at all. Feel free to define your own globals. I recommend copying them out of kses.php and then editing them if you want to do this.

And of course, if you only want to do a small bit of quick filtering, this sort of thing is always a valid option as well:

// only allow a hrefs through
$filtered = wp_kses($unfiltered, array( 'a' => array( 'href' => array() ) ));

Hopefully that answers some kses questions. It’s not a complete HTML parser by any means, but for quick and simple tasks, it can come in very handy.

Note: kses is NOT 100% safe. It’s very good, but it’s not a full-fledged HTML parser. It’s just safer than not using it. There’s always the possibility that somebody can figure out a way to sneak bad code through. It’s just a lot harder for them to do it.

29 Comments

  1. Otto, you have done an awesome job in explaining this complicated subject. Very interesting on how KSES can be used to parse attributes out of an HTML tag!

    Thank you! :-)

  2. Not sure if it was my request on trac that triggered you to write this or not but whatever the case, thank you. It was exactly the primer that I needed. Kudos.

  3. Are you aware of any ready-made solution to sort HTML tag attributes in WordPress? It looks to me that kses could be used for the parsing, and after parsing the attributes could be injected back in whatever order, e.g. alphabetical.

    I’m guessing there’s also some hook/filter to do it for the image or link injector, which is one part, but I’m also looking for a solution to do that for existing posts and content as well…

  4. Nice write-up!

    // Ulf (kses guy)

  5. Damn, this is handy, Otto!

    The other documentation on this is pretty horrid, so your post is a life saver when I’m trying to remember the correct syntax for this.

  6. So I have a question for you. I have been trying to embed Tweets in my .org blog in comments utilizing code from Twitter’s “Blackbird Pie” and the html will not show. Does that have anything to do with WordPress’ default kses settings? If so, what code would I implement and what file, and where in the file would this be applied?

  7. [...] for blog post entries. Anything other than what’s defined here will be torn asunder. (See Otto’s post for a more detailed description.) Many queries to the all powerful Google resulted in this post [...]

  8. Awesome function, I wonder why there is so little documentation about it. Filtering is really easy with this, no need to invent complex logic for filtering.

    Thanks a lot

  9. [...] KSES stands for “kses strips evil scripts” and is a filter for HTML. WordPress uses it by default to filter comments. Otto explains it very well in his post http://ottopress.com/2010/wp-quickie-kses/ [...]

  10. Awesome post. Thanks again Otto, you always save the day.

  11. Hi!
    Would something like this be valid ?

    wp_kses($unfiltered, '') if I don’t want any html tags allowed?

  12. Hi Otto,

    I’m trying to make a plugin for kses – basically I want to allow images, but I want to replace

    <img src="x" /> with <a href="x"><img src="x" /></a>

    So here’s where I am right now:

    global $allowedtags;
    $allowedtags['img'] = array( 'src' => array () );
    
    //take images, and enclose them in a link
    function zimgf($ztring){
    $result = preg_replace('/<img src="(.+)"(.+)\/>/Ui', '<a href="$1"><img src="$1"$2/></a>', $ztring);
    return $result;
    }
    
    add_filter('pre-kses', 'zimgf');
    

    Problem is, while this does allow the images through, it doesn’t do the replacing part (the regex works, tried it separately).

  13. Great stuff here! Kind of late, but thanks anyway…

    Now, I heard from Mark Jaquith you are not supposed to use kses for outputting stuff (performance wise). So if I need to output some user input html, how do I do that? I mean, how do I make sure I’m escaping right and letting that html pass through?

    • For performance, it’s best to use kses on the input before saving it to the database. WordPress does this via pre_* filters, such as pre_comment_content and similar. Thus, the “safe” content is what gets saved. Then it can just output it directly.

      • Otto! Thanks for replying to my previous comment..! Somehow I missed that answer before.
        So you say we should output directly from the DB without worrying for what we are outputting?

        • I wouldn’t exactly say that as a general case. But I also wouldn’t want to process every comment through kses every time they’re displayed. Lot of unnecessary processing there. If you are using kses to filter, and you filter before saving the data to the database, then you know it’s safe for displaying already.

          WordPress filters comments through kses on the pre_ filter, so it’s processed once. When it’s pulled for display, it’s not filtered through kses again at that time. But if you’re filtering something other than comments, you may need a different filtering strategy. You can’t generalize that sort of thing, filtering is all about context.

          • Yeap, I know. I was talking about sanitizing user’s html input, kind of the input that may go into a content editor. But I think I get the point. Sanitize whenever I can, trust WordPress output functions when I can use them and deal with the exceptions, right?

  14. Are there any security implications if I use wp_filter_nohtml_kses on something such as pre_comment_content like this: add_filter( 'pre_comment_content', 'wp_filter_nohtml_kses' );?

    • Not that I know of. Doing that would just strip all HTML from the submitted comments.

      • Thanks Otto. My aim is to strip HTML tags before they are saved to the database. I wanted to be sure that running wp_filter_nohtml_kses on pre_comment_content didn’t get in the way of any processing WordPress might do on the submitted comments.

        Thanks for the insight into the kses functions, been V helpful!

  15. Hi again Otto

    I’m running wp_filter_nohtml_kses on pre_post_content.

    Is there a tag I could use for pages only? HTML is stripped from both posts and pages when I use pre_post_content

  16. […] you’re interested in kses, check out this. kses is what parses the HTML from the comments to make sure nothing crazy is put on the […]
