Posts tagged ‘parsing’

Seems that some people don’t know much about kses. It’s really not all that complicated, but there doesn’t seem to be a lot of documentation around for it, so what the hell.

The kses package is short for “kses strips evil scripts”, and it is basically an HTML filtering mechanism. It can read HTML code, no matter how malformed it is, and filter out undesirable bits. The idea is to allow some safe subset of HTML through, so as to prevent various forms of attacks.

However, by necessity, it also includes a passable HTML parser, albeit not a complete one. Bits of it can be used for your own plugins and to make things a bit easier all around.

Note that the kses included in WordPress is a modified version of the original kses. I’ll only be discussing it here, not the original package.

Filtering

The basic use of kses is as a filter. It will eliminate any HTML that is not allowed. Here’s how it works:

$filtered = wp_kses($unfiltered, $allowed_html, $allowed_protocols);

Simple, no?

The allowed html parameter is an array of HTML that you want to allow through. The array can look sorta like this:

$allowed_html = array(
	'a' => array(
		'href' => array (),
		'title' => array ()),
	'abbr' => array(
		'title' => array ()),
	'acronym' => array(
		'title' => array ()),
	'b' => array(),
	'blockquote' => array(
		'cite' => array ()),
	'cite' => array (),
	'code' => array(),
	'del' => array(
		'datetime' => array ()),
	'em' => array (), 'i' => array (),
	'q' => array(
		'cite' => array ()),
	'strike' => array(),
	'strong' => array(),
);

As you can see, it’s rather simple. The main array is a list of HTML tags. Each of those points to an array of allowable attributes for those tags. Each of those points to an empty array, because kses is somewhat recursive in this manner.

Any HTML that is not in that list will get stripped out of the string.

The allowed protocols is basically a list of protocols for links that it will allow through. The default is this:

array ('http', 'https', 'ftp', 'ftps', 'mailto', 'news', 'irc', 'gopher', 'nntp', 'feed', 'telnet')

Anything else goes away.

That $allowed_html I gave before may look familiar. It’s the default set of allowed HTML in comments on WordPress. This is stored in a WordPress global called $allowedtags. So you can use this easily like so:

global $allowedtags;
$filtered = wp_kses($unfiltered, $allowedtags);

This is so useful that WordPress 2.9 makes it even easier:

$filtered = wp_kses_data($unfiltered);
$filtered = wp_filter_kses($unfiltered); // does the same, but will also slash escape the data

That uses the default set of allowed tags automatically. There’s another set of defaults, the allowed post tags. This is the set that is allowed to be put into Posts by non-admin users (admins have the “unfiltered_html” capability, and can put anything they like in). There’s easy ways to use that too:

$filtered = wp_kses_post($unfiltered);
$filtered = wp_filter_post_kses($unfiltered); // does the same, but will also slash escape the data

Note that because of the way they are written, they make perfect WordPress filters as well.

add_filter('the_title','wp_kses_data');

This is exactly how WordPress uses them for several internal safety checks.

Now, this is all very handy, but what if I’m not filtering? What if I’m trying to get some useful information out of HTML? Well, kses can help you there too.

Parsing

As part of the filtering mechanism, kses includes a lot of functions to parse the data and to try to find HTML in there, no matter how mangled up and weird looking it might be.

One of these functions is wp_kses_split. It’s not something that is useful directly, but it is useful to understand how kses works. The wp_kses_split function basically finds anything that looks like an HTML tag, then passes it off to wp_kses_split2.

The wp_kses_split2 function takes that tag, cleans it up a bit, and perhaps even recursively calls kses on it again, just in case. But eventually, it calls wp_kses_attr. The wp_kses_attr is what parses the attributes of any HTML tag into chunks and then removes them according to your set of allowed rules. But here’s where we finally find something useful: wp_kses_hair.

The wp_kses_hair function can parse attributes of tags into PHP lists. Here’s how you can use it.

Let’s say we’ve got a post with a bunch of images in it. We’d like to find the source (src) of all those images. This code will do it:

global $post;
if ( preg_match_all('/<img (.+?)>/', $post->post_content, $matches) ) {
        foreach ($matches[1] as $match) {
                foreach ( wp_kses_hair($match, array('http')) as $attr)
                	$img[$attr['name']] = $attr['value'];
                echo $img['src'];
        }
}

What happened there? Well, quite a bit, actually.

First we used preg_match_all to find all the img tags in a post. The regular expression in preg_match_all gave us all the attributes in the img tags, in the form of a string (that is what the “(.+?)” was for). Next, we loop through our matches, and pass each one through wp_kses_hair. It returns an array of name and value pairs. A quick loop through that to set up a more useful $img array, and voila, all we have to do is to reference $img['src'] to get the content of the src attribute. Equally accessible is every other attribute, such as $img['class'] or $img['id'].

Here’s an example piece of code, showing how kses rejects nonsense:

$content = 'This is a test. <img src="test.jpg" class="testclass another" id="testid" fake fake... / > More';
if ( preg_match_all('/<img (.+?)>/', $content, $matches) ) {
        foreach ($matches[1] as $match) {
                foreach ( wp_kses_hair($match, array('http')) as $attr)
                	$img[$attr['name']] = $attr['value'];
                print_r($img); // show what we got
        }
}

The resulting output from the above:

Array
(
    [src] => test.jpg
    [class] => testclass another
    [id] => testid
    [fake] =>
)

Very nice and easy way to parse selected pieces of HTML, don’t you think?

Overriding kses

Want to apply some kind of filter of your own to things? WordPress kindly adds a filter hook to all wp_kses calls: pre_kses.

function my_filter($string) {
	// do stuff to string
	return $string;
}
add_filter('pre_kses', 'my_filter');

Or maybe you want to add your own tags to the allowed list? Like, what if you wanted comments to be able to have images in them, but (sorta) safely?

global $allowedtags;
$allowedtags['img'] = array( 'src' => array () );

What if you want total control? Well, there’s a CUSTOM_TAGS define. If you set that to true, then the $allowedposttags, $allowedtags, and $allowedentitynames variables won’t get set at all. Feel free to define your own globals. I recommend copying them out of kses.php and then editing them if you want to do this.

And of course, if you only want to do a small bit of quick filtering, this sort of thing is always a valid option as well:

// only allow a hrefs through
$filtered = wp_kses($unfiltered, array( 'a' => array( 'href' => array() ) ));

Hopefully that answers some kses questions. It’s not a complete HTML parser by any means, but for quick and simple tasks, it can come in very handy.

Note: kses is NOT 100% safe. It’s very good, but it’s not a full-fledged HTML parser. It’s just safer than not using it. There’s always the possibility that somebody can figure out a way to sneak bad code through. It’s just a lot harder for them to do it.

Shortlink: