Re: html_entity_decode + regex = profit?



ReGenesis0 wrote:
*bangs head against wall*

I'm stuck on a caret (^), and it's driving me nuts.

I've got an incoming string with a caret that's been escaped to
&#710;. The caret isn't a legal character where it's going (a text
string that will be used to generate a graphic with a font) so I have
to get rid of it.

Where are you getting your data from? Data should not be escaped until the point at which output occurs. You should still validate or sanitize the data coming in, though (e.g. prepared statements, mysql_real_escape_string(), intval(), etc.).

That way, you don't have to deal with unnecessary data alterations.
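
A minimal sketch of what "escape only at output" could look like. render_title() is a hypothetical helper made up for illustration, and it assumes the stored string is UTF-8:

```php
<?php
// Hypothetical helper: the raw string is stored and processed unescaped,
// and is only escaped at the moment it is written into HTML.
function render_title($rawTitle) {
    return '<h1>' . htmlspecialchars($rawTitle, ENT_QUOTES, 'UTF-8') . '</h1>';
}

$raw = 'Tom ^ Jerry <3';   // unescaped everywhere except the final output
echo render_title($raw);   // "<" becomes "&lt;" here and nowhere else
```

Anything that needs the real characters (like your image generator) then reads $raw directly and never sees an entity.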

My default solution was to check each character in the string against
a regex to see if it's one of the "allowed" characters. (The font
being used to generate the image only has ~60 chars, it's a display
font used for titles.) This regex, naturally, does NOT include a
caret as a legal character.

You can escape regex metacharacters with backslashes. Check the PHP manual's section on PCRE. You can also use preg_quote() to escape metacharacters in a string.

$regex = '/\\^/'; // literal caret
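
If you would rather build the pattern from your list of legal characters, something along these lines might work. This is only a sketch: $allowed is a made-up stand-in for your font's real character list, and keep_allowed() is a hypothetical helper name.

```php
<?php
// Hypothetical allowlist; the font's real ~60 characters would go here.
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?-';

// preg_quote() escapes regex metacharacters (including "-" in current PHP),
// so the list can be dropped into a character class as-is.
$class = preg_quote($allowed, '/');

// Remove every character that is NOT in the allowlist.
// The /u modifier makes PCRE treat the subject as UTF-8.
function keep_allowed($text, $class) {
    return preg_replace('/[^' . $class . ']/u', '', $text);
}

echo keep_allowed('HELLO^ WORLD!', $class); // prints "HELLO WORLD!"
```

Note that with /u, preg_replace() returns NULL if the input is not valid UTF-8, so that case is worth checking for.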

...and so, naturally, I get the number "710" showing up in the
resulting graphic.

html_entity_decode does nothing unless I set it to a restrictive
character set like UTF-8, which then eliminates some of the legal
characters I want to keep.

I can obviously just replace '&#710;' in the original string... but
the /point/ is... there are probably OTHER problem characters slipping
through my net. I want a universal solution.

I feel like this shouldn't be this complicated. I have a friggin LIST
of allowable characters-- but even if I TEST AGAINST THAT LIST, one by
one, garbage from these encoded characters slips through.

If you have a known list, you can probably avoid the cost of using regex:

<?php
// Map each known entity to its replacement;
// strtr() applies the whole map in a single pass.
$trans = array(
    '&#710;' => '^',
    // more, after the same manner ...
);
$decoded = strtr($data, $trans);
?>

I GATHER that this is a legacy of PHP's grotty character encoding. I
understand that. Is there ANYTHING I can do to convert an incoming
string so that each character == 1 character? Because (and if you'll
pardon my metaphorical black rage) THAT'S WHAT html_entity_decode is
SUPPOSED to do.
All I want to do is drop the whole damn thing into Unicode 16 or
something so that NO MATTER WHAT character I'm dealing with, be it a
circumflex, a greater-than, a euro, a bullet, a ~n, a Cyrillic
backwards R, or a Japanese 'ko,' ...is logically regarded BY PHP as a
SINGLE CHARACTER that can be subjected to logical scrutiny without
TEARING MY HAIR OUT.
RICH TEXT BITCHES. RICH TEXT!

*runs out of steam, panting*
*takes a breath, pushes back hair*

Sooo... am I missing something? Is this actually /possible/ and it's
just not working?
(And god DAMNIT, I can understand why PHP doesn't want to change the
existing behavior of html_entity_decode for all those legacy coders...
but why does there not seem to even be an OPTIONAL PARAMETER to force
it to convert ALL HTML entities instead of its baffling behavior of
just converting the 'most common'? That's insane.)

If you are in a position to help it, make sure the data isn't stored in an escaped form, as suggested above.
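
If you are stuck with escaped input, one possible approach is to decode everything into UTF-8 first and only then test against your list. A rough sketch, assuming the input is valid UTF-8 plus entities; clean_title() and $allowed are hypothetical names, not anything from your code:

```php
<?php
// Sketch: decode entities into UTF-8, then filter against an allowlist.
function clean_title($text, $allowed) {
    // With 'UTF-8' as the target charset, numeric entities above 255
    // (such as &#710; or &#8364;) decode to real characters.
    // ENT_QUOTES also decodes &#039; and &quot;.
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Drop anything outside the allowlist; /u matches whole UTF-8
    // characters rather than individual bytes.
    $class = preg_quote($allowed, '/');
    return preg_replace('/[^' . $class . ']/u', '', $decoded);
}

// Hypothetical allowlist standing in for the font's real characters.
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?-';
echo clean_title('HI&#710;&#8364; THERE', $allowed); // prints "HI THERE"
```

The order matters: once the entity has been decoded, the circumflex is a single character as far as the /u regex is concerned, so it fails the allowlist test as a whole instead of leaking its digits through.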

-Derik

--
Curtis
$eMail = str_replace('sig.invalid', 'gmail.com', $from);