Re: Try this
- From: "mensanator@xxxxxxx" <mensanator@xxxxxxx>
- Date: Sun, 16 Sep 2007 22:55:02 -0700
On Sep 16, 9:27 pm, "Gabriel Genellina" <gagsl-...@xxxxxxxxxxxx>
wrote:
On Sun, 16 Sep 2007 21:58:09 -0300, mensana...@xxxxxxx
<mensana...@xxxxxxx> wrote:
I'm eagerly awaiting publication of your professional specification
for correctly detecting the encoding of an arbitrary stream of
bytes
The very presence of an algorithm to detect encoding is a bug.
Files with the .txt extension should always be treated as ANSI
even if they contain binary data.
Why ANSI?
Because that's the absence of encoding?
Because it's convenient to *you*?
No, it's ANSI unless told otherwise.
What about the rest of the world that doesn't speak
English or, even worse, doesn't use the Latin alphabet?
When the rest of the world creates the next
generation of computers, THEY can choose the
defaults.
What do you mean by "binary data"?
8-bit, ASCII is only 7-bit.
Notepad is not interpreting the file as
"binary", it's text,
And will treat non-ASCII data as if it were ASCII.
but interpreted using the wrong encoding.
So that's not a serious bug? To decide that a file
is Unicode despite the absence of the appropriate
markers?
If you want to understand what happens here: The Unicode block for 'CJK
Unified Han' goes from U+4E00 to U+9FFF and is the largest block in the
basic plane, with more than 20000 code points. The previous block contains
the famous 64 hexagrams, and the block before that is 'CJK Unified Han
Extension A' ranging from U+3400 to U+4DBF.
Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
0x4100-0x7AFF is totally contained inside the above Unicode blocks.
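The containment claim above can be checked mechanically. This is only a sketch: the block boundaries are the ones quoted in this post (with the hexagram block taken as U+4DC0 to U+4DFF), the little-endian byte order is an assumption about how Windows stores "Unicode" text, and the names `BLOCKS` and `in_han_blocks` are made up for illustration.

```python
import string

# Block ranges as quoted in the post; the hexagram block sits between the two Han blocks.
BLOCKS = [
    (0x3400, 0x4DBF),  # CJK Unified Han Extension A
    (0x4DC0, 0x4DFF),  # the 64 hexagrams
    (0x4E00, 0x9FFF),  # CJK Unified Han
]

def in_han_blocks(code_point):
    """Return True if code_point falls inside one of the blocks listed above."""
    return any(low <= code_point <= high for low, high in BLOCKS)

letters = string.ascii_letters.encode("ascii")

# Assumption: little-endian UTF16, so the first byte of each pair is the low
# byte and the second byte is the high byte of the resulting code point.
all_pairs_ok = all(
    in_han_blocks(first | (second << 8))
    for first in letters
    for second in letters
)
print(all_pairs_ok)
```

Every two-letter combination is tried, so the result does not depend on any particular phrase.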
Reading a small phrase containing only ASCII letters as if it were UTF16
would collapse each two letters into a single character, each character
being part of 'CJK Unified Han'. (Space and punctuation are allowed in odd
positions only, else the character would not belong to the Han blocks).
As every character goes into the same code block, the heuristic concludes
that the text is some Eastern language encoded in UTF16.
But...but...Notepad doesn't have a UTF16 option.
This is the "Well you are speed" phrase interpreted as UTF16:
u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
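For anyone who wants to reproduce that string, here is a minimal sketch in Python 3. It assumes the 18 bytes are read as little-endian UTF16 with no byte order mark; the variable names are only illustrative.

```python
# The 18 ASCII bytes exactly as they would sit in the .txt file.
raw = b"Well you are speed"

# Assumption: little-endian UTF16, no BOM, so each byte pair becomes one code point.
decoded = raw.decode("utf-16-le")

# The string quoted in the post, written as a Python 3 literal.
expected = "\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465"

print(len(raw), len(decoded))
print(decoded == expected)
print([hex(ord(ch)) for ch in decoded])
```

Nothing but the 18 bytes goes into the decode call, so the nine characters come from the byte pairs alone.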
How can you tell from that that it's UTF16? If there's
something stored in addition to those 18 bytes, you're
being misleading.
Notepad should never be
allowed to try to decide what the encoding is if the open
dialog has the encoding set to ANSI.
I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
that's exactly what happens. I have to explicitly select Unicode in order
to see those Han characters.
So which is worse, you having to tell it that it's
Unicode or Notepad deciding on its own that a file
is Unicode when it isn't?
--
Gabriel Genellina