Re: filter out "strange" text in perl ? íµ▓½τ┤░Φâ₧




anno4000@xxxxxxxxxxxxxxxxxxxxxx wrote:
Jack <jack_posemsky@xxxxxxxxx> wrote in comp.lang.perl.misc:
Hi

I am parsing a text file. In the data file, the field between two of my
delimiters looks like a NULL (nothing), but Perl sees a value there, and
when I print it to the screen it shows up as: íµ▓½τ┤░Φâ₧

Posting "strange characters" to Usenet is useless. Every news reader
will show something else. In fact, my reader shows your string
differently in the subject and the body of the message.

To communicate the data unambiguously you could print the numeric
value of each character:

printf "%d ", ord $_ for split //, $string;
print "\n";

Be sure to post not only the output but also the proglet that
generated it.
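
For posting, a self-contained variant of the one-liner above might look like the following. It is only a sketch: the sample value of $string (a UTF-8 byte order mark followed by "ok") and the helper name char_codes are made up for illustration, so substitute the real field from your data.

```perl
use strict;
use warnings;

# Hypothetical sample: the bytes EF BB BF (a UTF-8 byte order mark)
# followed by the text "ok". Replace this with the real field.
my $string = "\xEF\xBB\xBFok";

# Return the numeric value of each character, separated by spaces.
# The name char_codes is an assumption, not part of any library.
sub char_codes {
    my ($s) = @_;
    return join ' ', map { ord($_) } split //, $s;
}

print char_codes($string), "\n";   # prints: 239 187 191 111 107
```

Numbers above 127 in the output mean the field holds something other than plain ASCII, and that is the information a reader of the post needs.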

This is screwing up my program and I want to get rid of it! Does
anyone know how to automatically match or detect this so I can remove
it or deal with it?

What exactly do you mean by "this"? Simply deleting the exact sequence
of bytes wherever it appears would be a horrible solution, and probably
not a solution at all if similar but not identical strings appear
elsewhere.

You should find out why the disruptive strings are there in the first
place. Then there may be a realistic chance to get rid of them.
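
One way to start looking for the cause is to read the file as raw bytes and list every field that contains something other than printable ASCII, along with where it occurs. The sketch below assumes a single-character delimiter and uses made-up sample data; the function name suspicious_fields is hypothetical.

```perl
use strict;
use warnings;

# List every field that contains a byte outside printable ASCII
# (0x20-0x7E). $fh is any readable handle and $delim is the field
# delimiter; both are assumptions about the data being parsed.
sub suspicious_fields {
    my ($fh, $delim) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        # A limit of -1 keeps trailing empty fields.
        my @fields = split /\Q$delim\E/, $line, -1;
        for my $i (0 .. $#fields) {
            next unless $fields[$i] =~ /[^\x20-\x7E]/;
            my $hex = join ' ',
                map { sprintf '%02X', ord($_) } split //, $fields[$i];
            push @hits, sprintf 'line %d, field %d: %s', $., $i, $hex;
        }
    }
    return @hits;
}

# Demo on an in-memory file; the data is invented for illustration.
my $data = "a|b|c\nd|\xC2\xA0|f\n";
open my $fh, '<:raw', \$data or die "cannot open in-memory file: $!";
print "$_\n" for suspicious_fields($fh, '|');   # prints: line 2, field 1: C2 A0
close $fh;
```

Knowing which lines and fields are affected, and which bytes they hold, usually says more about where the data came from than the garbled screen output does.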

Anno

Ok then - does anyone know what the syntax is to detect:
1- ASCII
2- double byte characters
3- UTF-8

Thank you,

Jack
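
A rough way to tell these cases apart on raw bytes could look like the sketch below. The function name classify_bytes and its category labels are hypothetical. The decode function and the FB_CROAK and LEAVE_SRC constants come from the core Encode module. There is no general test for "double byte" data: a legacy two-byte encoding can only be checked by trying to decode it with a specific encoding name that you already suspect, so the sketch just reports anything that is neither ASCII nor valid UTF-8 as "other".

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Classify a string of raw bytes. The labels are illustrative only.
sub classify_bytes {
    my ($bytes) = @_;
    return 'empty' if $bytes eq '';
    # Pure ASCII: no byte above 0x7F.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;
    # Strict UTF-8 decoding croaks on malformed input under FB_CROAK.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'utf8' : 'other';
}

print classify_bytes("plain text"), "\n";   # prints: ascii
print classify_bytes("caf\xC3\xA9"), "\n";  # prints: utf8
print classify_bytes("caf\xE9"), "\n";      # prints: other
```

Passing the UTF-8 check does not prove the data was meant as UTF-8, since some byte sequences are valid in several encodings, so treat the result as a hint.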
