Re: Try this



On Sep 16, 9:27 pm, "Gabriel Genellina" <gagsl-...@xxxxxxxxxxxx>
wrote:
On Sun, 16 Sep 2007 21:58:09 -0300, mensana...@xxxxxxx
<mensana...@xxxxxxx> wrote:

I'm eagerly awaiting publication of your professional specification
for correctly detecting the encoding of an arbitrary stream of
bytes

The very presence of an algorithm to detect encoding is a bug.
Files with the .txt extension should always be treated as ANSI
even if they contain binary data.

Why ANSI?

Because that's the absence of encoding?

Because it's convenient to *you*?

No, it's ANSI unless told otherwise.

What about the rest of the world that doesn't speak
English or, even worse, doesn't use the Latin alphabet?

When the rest of the world creates the next
generation of computers, THEY can choose the
defaults.

What do you mean by "binary data"?

8-bit, ASCII is only 7-bit.

Notepad is not interpreting the file as
"binary", it's text,

And will treat non-ASCII data as if it were ASCII.

but interpreted using the wrong encoding.

So that's not a serious bug? To decide that a file
is Unicode despite the absence of the appropriate
markers?
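
If the "markers" meant here are byte-order marks (an assumption; the thread does not say so), checking for them takes only a few lines. A rough Python 3 sketch, where `sniff_bom` is a made-up helper name and not anything Notepad is known to do:

```python
import codecs

# Byte-order marks and the codec each one implies. The UTF-32 marks are
# tested before the UTF-16 ones because BOM_UTF32_LE begins with BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return the codec name implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

For a file with no BOM, such as the 18-byte test phrase, `sniff_bom` returns None, so any encoding decision has to come from the user or from a guess.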


If you want to understand what happens here: The Unicode block for 'CJK
Unified Han' goes from U+4E00 to U+9FFF and is the largest block in the
basic plane, with more than 20000 code points. The previous block contains
the famous 64 hexagrams, and the block before that is 'CJK Unified Han
Extension A' ranging from U+3400 to U+4DBF.
Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
0x4100-0x7AFF is totally contained inside the above Unicode blocks.
Reading a small phrase containing only ASCII letters as if it were in UTF16
would collapse each pair of letters into a single character, each character
being part of 'CJK Unified Han'. (Space and punctuation are allowed in odd
positions only, else the character would not belong to the Han blocks).
As every character falls into the same code block, the heuristic concludes
that the text is some Eastern language encoded in UTF16.

But...but...Notepad doesn't have a UTF16 option.

This is the "Well you are speed" phrase interpreted as UTF16:
u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
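
The reinterpretation can be reproduced in a few lines. This is a minimal Python 3 sketch, assuming little-endian UTF-16 (the byte order that matches the escapes above); `in_han_blocks` is a made-up helper that only checks the two Han ranges quoted earlier and is not Notepad's actual heuristic:

```python
# The 18 ASCII bytes of the test phrase.
raw = b"Well you are speed"

# Reinterpret the same bytes as little-endian UTF-16: each pair of
# ASCII bytes collapses into one 16-bit code unit.
text = raw.decode("utf-16-le")

def in_han_blocks(ch):
    """True if ch lies in CJK Extension A or the main CJK Unified block."""
    cp = ord(ch)
    return 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF

print(len(raw), len(text))                      # byte count, character count
print(ascii(text))                              # the \uXXXX escapes
print(all(in_han_blocks(ch) for ch in text))    # every character is Han?
```

The 18 bytes become 9 characters, and every one of them passes the range check, which is why a block-based guess would call the text East Asian.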

How can you tell from that that it's UTF16? If there's
something stored in addition to those 18 bytes, you're
being misleading.


Notepad should never be
allowed to try to decide what the encoding is if the open
dialog has the encoding set to ANSI.

I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
that's exactly what happens. I have to explicitly select Unicode in order
to see those Han characters.

So which is worse, you having to tell it that it's
Unicode or Notepad deciding on its own that a file
is Unicode when it isn't?


--
Gabriel Genellina

