2009.11.04

How to convert HTML named entities to numbered entities in PHP

I recently (read: today) had an obnoxious problem: I was writing some code for creating an ATOM feed, and kept getting errors about entity-escaped values. Namely, things like &rsquo;, &bull;, etc. Even written as entities, Opera and IE7 did not recognize them. That makes sense in hindsight: XML only predefines &amp;, &lt;, &gt;, &quot; and &apos;, so the HTML named entities mean nothing to an XML parser. I read somewhere that it was necessary to convert the named entities to numbered entities. Great.

Well, PHP doesn't have a native function for this. Why, I do not know...there seem to be functions for many other things, and adding an argument to htmlentities that returns numbered entities would seem easy enough. Either way, I wrote a quick function that takes the htmlentities translation table, adds the entities that are missing from it, and runs the conversion to numbered entities. Check it:

function htmlentities_numbered($string)
{
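	// Request the ISO-8859-1 table explicitly: ord() below expects single-byte
	// characters, and newer PHP versions default to UTF-8 here.
	$table	=	get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');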
	$trans	=	array();
	foreach($table as $char => $ent)
	{
		$trans[$ent]	=	'&#'. ord($char) .';';
	}
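	$trans['&euro;']	=	'&#8364;';
	$trans['&sbquo;']	=	'&#8218;';
	$trans['&fnof;']	=	'&#402;';
	$trans['&bdquo;']	=	'&#8222;';
	$trans['&hellip;']	=	'&#8230;';
	$trans['&dagger;']	=	'&#8224;';
	$trans['&Dagger;']	=	'&#8225;';
	$trans['&circ;']	=	'&#710;';
	$trans['&permil;']	=	'&#8240;';
	$trans['&Scaron;']	=	'&#352;';
	$trans['&lsaquo;']	=	'&#8249;';
	$trans['&OElig;']	=	'&#338;';
	$trans['&lsquo;']	=	'&#8216;';
	$trans['&rsquo;']	=	'&#8217;';
	$trans['&ldquo;']	=	'&#8220;';
	$trans['&rdquo;']	=	'&#8221;';
	$trans['&bull;']	=	'&#8226;';
	$trans['&ndash;']	=	'&#8211;';
	$trans['&mdash;']	=	'&#8212;';
	$trans['&tilde;']	=	'&#732;';
	$trans['&trade;']	=	'&#8482;';
	$trans['&scaron;']	=	'&#353;';
	$trans['&rsaquo;']	=	'&#8250;';
	$trans['&oelig;']	=	'&#339;';
	$trans['&Yuml;']	=	'&#376;';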
	$string	=	strtr($string, $trans);
	return $string;
}
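
For comparison, here is a rough sketch of the same idea without a hand-maintained table. It is not the function above, just one possible alternative: it assumes PHP 7.2 or later with the mbstring extension, and the name named_entities_to_numeric is made up for illustration. Each named entity is decoded on its own with html_entity_decode and re-emitted as a numeric reference, so tags and unknown names are left untouched.

```php
<?php
// Hypothetical helper (not from the original post): converts HTML 4.01
// named entities to numeric references. Assumes PHP 7.2+ with mbstring.
function named_entities_to_numeric(string $text): string
{
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        function (array $m): string {
            // Decode this single entity to a UTF-8 character.
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            // Unknown names come back unchanged, so leave them as they are.
            if ($char === $m[0]) {
                return $m[0];
            }
            // Emit the Unicode code point as a decimal reference.
            return '&#' . mb_ord($char, 'UTF-8') . ';';
        },
        $text
    );
}

echo named_entities_to_numeric('It&rsquo;s &bull; 5&euro;'), "\n";
```

Note that this also rewrites &amp; as &#38; and so on, which is harmless in XML but may not be what you want if you prefer to keep the five predefined entities readable.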

Hope it's helpful.

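UPDATE - apparently, numbered entities in the 0x80 - 0x9F range (&#128; through &#159;) don't validate either: those positions only hold printable characters in Windows-1252, while in Unicode they are control characters. Fair enough, I've converted them all to their real Unicode code points (&#8364; for the euro sign, for example). All my ATOM feeds validate now (through feedvalidator.org).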
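
If a feed already contains references in that 0x80 - 0x9F range, something like the following could clean them up after the fact. This is only a sketch: the function name fix_cp1252_numeric_refs is invented for the example, it handles decimal references only, and the map simply lists the Windows-1252 positions alongside the Unicode code points of the same characters (the five undefined positions are left out).

```php
<?php
// Hypothetical helper (not from the original post): rewrites decimal
// references in the 128-159 range to the matching Unicode code points.
function fix_cp1252_numeric_refs(string $xml): string
{
    // Windows-1252 position => Unicode code point.
    $map = [
        128 => 8364, 130 => 8218, 131 => 402,  132 => 8222, 133 => 8230,
        134 => 8224, 135 => 8225, 136 => 710,  137 => 8240, 138 => 352,
        139 => 8249, 140 => 338,  142 => 381,  145 => 8216, 146 => 8217,
        147 => 8220, 148 => 8221, 149 => 8226, 150 => 8211, 151 => 8212,
        152 => 732,  153 => 8482, 154 => 353,  155 => 8250, 156 => 339,
        158 => 382,  159 => 376,
    ];

    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m) use ($map): string {
            $n = (int) $m[1];
            // Only touch references that have an entry in the map.
            return isset($map[$n]) ? '&#' . $map[$n] . ';' : $m[0];
        },
        $xml
    );
}

echo fix_cp1252_numeric_refs('&#147;quoted&#148; &#150; 5&#128;'), "\n";
```

References outside the map, such as &#233;, pass through unchanged.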