Re: comparing binary strings



On Mon, 10 Dec 2007 14:27:12 +0000, Ben Morrow wrote:

Quoth Yakov <iler.ml@xxxxxxxxx>:
, and I need to ensure that comparison (lt, gt) of those strings does not
depend on locale and is not interpreted as utf-8.

Don't use locales then. For the purposes of lt and gt and eq it's
irrelevant whether the strings are utf-8 encoded or not.
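
A minimal sketch of that point, assuming neither 'use locale' nor 'use bytes' is in effect. The sample values and the hex_of helper are made up for illustration; utf8::upgrade only changes how perl stores the string internally, not which characters it contains.

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as hex, one pair of digits per character.
sub hex_of {
    my ($s) = @_;
    return join '', map { sprintf '%02x', ord($_) } split //, $s;
}

# Arbitrary sample values; every character is in the range 0..255.
my @keys = ("\xe9\x00", "\xd0\xff", "\x7f\x01");

# Store the first one in perl's internal UTF-8 form. Its characters are unchanged.
utf8::upgrade($keys[0]);

# The default sort uses the same ordering as lt/gt: character by character,
# regardless of how each string happens to be stored internally.
my @sorted = sort @keys;

print join(' ', map { hex_of($_) } @sorted), "\n";   # prints "7f01 d0ff e900"
```

The upgraded string still sorts last, because 0xe9 is the largest first character, even though its internal bytes start with 0xc3.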

If you want to be able to output them as bytes, just don't insert any
chars > 255.

Don't attempt to mix POSIX locales and Perl's UTF8 support. The two
don't play well together at all yet.

Yup.

How can I "label" this string as "raw binary" in CreateEightByteString
so that any subsequent comparison is done on the raw 8 bytes, independent
of the program's locale?

You can't. You need to perform *the comparisons* under 'use bytes'.

No, you don't need to. The only time the encoding of the strings is
important is when you're passing them to external code as a C-style char*
pointer. Or at least it should be.

In general, I'd say: don't 'use bytes'. It breaks stuff - for instance, it
makes it impossible to compare a utf-8 encoded string with a non-utf8
encoded one.
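
To make the breakage concrete, here is a small sketch (the variable names and values are mine, chosen for illustration). It compares the same strings once with normal character semantics and once inside a 'use bytes' block:

```perl
use strict;
use warnings;

my $native   = "\xd0";      # one character, stored as the single byte D0
my $twin     = "\xe9";      # one character, stored as the single byte E9
my $upgraded = "\xe9";      # the same character as $twin ...
utf8::upgrade($upgraded);   # ... but now stored internally as the bytes C3 A9

# Character semantics (the default): internal storage is ignored.
my $eq_chars = ($twin eq $upgraded)   ? 1 : 0;
my $gt_chars = ($upgraded gt $native) ? 1 : 0;

my ($eq_bytes, $gt_bytes);
{
    use bytes;   # lexically scoped: comparisons now see the internal bytes
    $eq_bytes = ($twin eq $upgraded)   ? 1 : 0;
    $gt_bytes = ($upgraded gt $native) ? 1 : 0;
}

print "character semantics: eq=$eq_chars gt=$gt_chars\n";
print "under 'use bytes':   eq=$eq_bytes gt=$gt_bytes\n";
```

With character semantics both tests are true. Under 'use bytes' both come out false, because E9 is compared against C3 A9, and C3 A9 against D0.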

If you aren't using locales, and you call CreateEightByteString under
'use bytes', you will get byte-strings back. If you only mix these with
other byte-strings, they will stay that way.
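
A rough illustration of "they will stay that way". The thread doesn't show what CreateEightByteString actually does, so create_eight_byte_string below is a hypothetical stand-in that packs two 32-bit big-endian words; the point is only what happens when the result is mixed with other strings.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CreateEightByteString (not the original code):
# two unsigned 32-bit big-endian words, 8 bytes in total.
sub create_eight_byte_string {
    my ($hi, $lo) = @_;
    return pack('NN', $hi, $lo);
}

my $key = create_eight_byte_string(1, 0xFFFFFFFF);

my $with_bytes = $key . "\xff";       # byte-string . byte-string
my $with_wide  = $key . "\x{263A}";   # mixed with a character > 255

printf "key length:                 %d\n", length $key;
printf "key stored as utf8:         %s\n", utf8::is_utf8($key)        ? "yes" : "no";
printf "key . bytes stored as utf8: %s\n", utf8::is_utf8($with_bytes) ? "yes" : "no";
printf "key . wide stored as utf8:  %s\n", utf8::is_utf8($with_wide)  ? "yes" : "no";
```

Only the concatenation with a character above 255 forces perl to switch the internal representation; the first eight characters of the result are still the same values.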

As long as you're reading and writing to filehandles that have the
correct encoding layer, the internal encoding of the strings doesn't
matter to perl code.
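
For binary data the "correct layer" would be :raw. A sketch, assuming a temporary file is acceptable for the demonstration (the sample bytes are arbitrary):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $key  = "\x00\xff\x80\x7f\x01\x02\x03\xe9";   # arbitrary sample bytes
my $copy = $key;
utf8::upgrade($copy);   # same characters, different internal storage

# Write both variants through a :raw handle.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} $key, $copy;
close($out) or die "close failed: $!";

my $size = -s $path;

# Read everything back, again as raw bytes.
open(my $in, '<:raw', $path) or die "open failed: $!";
my $data = do { local $/; <$in> };
close($in);

print "bytes on disk: $size\n";
print "round trip ok: ", ($data eq $key x 2 ? "yes" : "no"), "\n";
```

Both copies come out as the same eight bytes on disk, since every character fits in a byte. A character above 255 would not fit, and perl would warn about a wide character on output.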

Joost.

