• 200911.03

    How to convert HTML named entities to numbered entities in PHP

    I recently (read: today) had an obnoxious problem: I'm writing some code for creating an Atom feed, and kept getting errors about entity-escaped values, namely things like &rsquo;, &bull;, etc. Even written as entities, Opera and IE7 did not recognize them. That makes sense in hindsight: XML itself only predefines &amp;, &lt;, &gt;, &quot; and &apos;, so the HTML names mean nothing to an XML parser. I read somewhere that it was necessary to convert the named entities to numbered entities. Great.

    Well, PHP doesn't have a native function for this. Why, I do not know... there seem to be functions for many other things, and adding an argument to htmlentities that returns numbered entities would seem easy enough. Either way, I wrote a quick function that takes the htmlentities translation table, adds any values that are missing from it, and runs the conversion to numbered entities. Check it:

    function htmlentities_numbered($string)
    {
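    	// Pin the table to ISO-8859-1 so every key is a single byte and ord() below
    	// returns its code point; newer PHP versions default to UTF-8 here.
    	$table	=	get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');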
    	$trans	=	array();
    	foreach($table as $char => $ent)
    	{
    		$trans[$ent]	=	'&#'. ord($char) .';';
    	}
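    	// Extra named entities (the Windows-1252 0x80 - 0x9F characters),
    	// mapped to their Unicode code points.
    	$trans['&euro;']	=	'&#8364;';
    	$trans['&sbquo;']	=	'&#8218;';
    	$trans['&fnof;']	=	'&#402;';
    	$trans['&bdquo;']	=	'&#8222;';
    	$trans['&hellip;']	=	'&#8230;';
    	$trans['&dagger;']	=	'&#8224;';
    	$trans['&Dagger;']	=	'&#8225;';
    	$trans['&circ;']	=	'&#710;';
    	$trans['&permil;']	=	'&#8240;';
    	$trans['&Scaron;']	=	'&#352;';
    	$trans['&lsaquo;']	=	'&#8249;';
    	$trans['&OElig;']	=	'&#338;';
    	$trans['&lsquo;']	=	'&#8216;';
    	$trans['&rsquo;']	=	'&#8217;';
    	$trans['&ldquo;']	=	'&#8220;';
    	$trans['&rdquo;']	=	'&#8221;';
    	$trans['&bull;']	=	'&#8226;';
    	$trans['&ndash;']	=	'&#8211;';
    	$trans['&mdash;']	=	'&#8212;';
    	$trans['&tilde;']	=	'&#732;';
    	$trans['&trade;']	=	'&#8482;';
    	$trans['&scaron;']	=	'&#353;';
    	$trans['&rsaquo;']	=	'&#8250;';
    	$trans['&oelig;']	=	'&#339;';
    	$trans['&Yuml;']	=	'&#376;';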
    	$string	=	strtr($string, $trans);
    	return $string;
    }
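
    If the mbstring extension is available, roughly the same result can probably be had without a hand-built table. The sketch below is only that, a sketch: entities_to_numeric is a made-up name, and it assumes the input is entity-escaped text in UTF-8 rather than markup. It decodes every entity to a real character, re-escapes the characters XML cares about, and then writes everything outside ASCII as a numeric reference.

```php
<?php
// Hypothetical helper (not the function above): named entities -> numeric references.
// Assumes UTF-8 text content, not markup, and that mbstring is loaded.
function entities_to_numeric($string)
{
	// 1. Turn named (and numeric) entities into real UTF-8 characters.
	$decoded = html_entity_decode($string, ENT_QUOTES | ENT_HTML401, 'UTF-8');

	// 2. Re-escape the characters that are markup in XML (&, <, >, quotes).
	$escaped = htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');

	// 3. Write every code point from 0x80 upward as a decimal reference.
	//    Map layout: start, end, offset, mask.
	$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
	return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo entities_to_numeric('Fish &amp; chips &bull; &euro;5 &rsquo;til late'), "\n";
// prints: Fish &amp; chips &#8226; &#8364;5 &#8217;til late
```

    Because step 2 escapes everything, a literal < in the input comes out as &lt;, so this only suits plain text fields and not content that is supposed to carry tags.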
    

    Hope it's helpful.

    UPDATE - apparently, even the numbered entities for the 0x80 - 0x9F range are not valid in the feed. Fair enough, I've converted them all to their Unicode code points. All my Atom feeds validate now (through feedvalidator.org).
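
    To illustrate the kind of fix-up this involves, here is a rough sketch. It assumes the offending references were Windows-1252 byte values (&#128; through &#159;), which is a guess based on the range above, and fix_cp1252_references is a hypothetical name. It needs mbstring and PHP 7.2 or later for mb_ord.

```php
<?php
// Hypothetical helper: rewrite &#128; ... &#159; to the Unicode code points of the
// Windows-1252 characters they were (presumably) meant to be.
function fix_cp1252_references($string)
{
	// Byte values that Windows-1252 leaves undefined; these are left untouched.
	$undefined = array(129, 141, 143, 144, 157);

	return preg_replace_callback('/&#(1[2-5][0-9]);/', function ($m) use ($undefined) {
		$n = (int) $m[1];
		if ($n < 128 || $n > 159 || in_array($n, $undefined, true)) {
			return $m[0];
		}
		// Read the single byte as Windows-1252, then ask for its Unicode code point.
		$char = mb_convert_encoding(chr($n), 'UTF-8', 'Windows-1252');
		return '&#' . mb_ord($char, 'UTF-8') . ';';
	}, $string);
}

echo fix_cp1252_references('&#146; &#149; &#128; &#160;'), "\n";
// prints: &#8217; &#8226; &#8364; &#160;
```

    References outside that range, such as &#160;, pass through unchanged.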
