Re: how to remove c++ comments from a cpp file?



"Peter Otten" <__peter__@xxxxxx> escribió en el mensaje news:epd751$518$03$1@xxxxxxxxxxxxxxxxxxxx
> Laurent Rahuel wrote:
>
>> And using the codecs module
>
> Why would you de/encode at all?

I'd say the opposite: why not? This is the recommended practice: decode inputs as soon as possible, work on Unicode, and encode only when you write the output.
In this particular case it's not necessary, and you get the same results, but only because these two conditions are met:

- the encoding used is UTF-8
- we're looking for '//', and no Unicode character other than '/' itself contains the byte for '/' in its UTF-8 representation
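As a rough illustration of the decode-early, encode-late practice, here is a minimal Python 3 sketch. The function name strip_line_comments and its encoding parameter are made up for this example, and the stripping itself is deliberately naive: it ignores string literals and /* ... */ comments.

```python
def strip_line_comments(data: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: remove '//' comments from encoded source.

    Naive on purpose: it does not skip string literals or /* */ comments.
    """
    # Decode as soon as the input arrives...
    text = data.decode(encoding)
    # ...search for '//' on Unicode text, not on raw bytes...
    stripped = [line.split("//", 1)[0].rstrip() for line in text.splitlines()]
    # ...and encode only when producing the output.
    return "\n".join(stripped).encode(encoding)
```

Because the search runs on decoded text, the same function behaves the same way whether the caller passes "utf-8" or "utf-16" as the encoding.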

Looking for the byte sequence '//' in data encoded with a different encoding (like UTF-16 or UCS-2) could give false positives. And looking for other things (like '¡¡') in UTF-8 data could give false positives too.
The same applies if one wants to skip string literals by looking for '"' and '\\"'.
Anyway, for a toy script like this it perhaps doesn't matter at all, but one should be aware of the potential problems.
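The UTF-16 false positive mentioned above can be reproduced with a small Python 3 sketch. The sample string is an assumption chosen for the demonstration: it contains U+2F2F, a character from the Kangxi Radicals block whose single UTF-16 code unit is 0x2F2F, the same two bytes as '//'.

```python
# The text contains no '//' at all, only the single character U+2F2F.
text = "x = 1 \u2f2f"

data_utf16 = text.encode("utf-16-le")
data_utf8 = text.encode("utf-8")

# Searching the decoded text gives the right answer.
found_in_text = "//" in text

# Searching the raw UTF-16 bytes matches the two bytes of U+2F2F.
found_in_utf16_bytes = b"//" in data_utf16

# In UTF-8, every byte of a multi-byte character is >= 0x80,
# so the byte 0x2F can only come from a real '/'.
found_in_utf8_bytes = b"//" in data_utf8

print(found_in_text, found_in_utf16_bytes, found_in_utf8_bytes)
```

Only the byte-level search on the UTF-16 data reports a match, even though the text has no comment marker in it.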

--
Gabriel Genellina

