questions about StreamTokenizer

From: Christian Bongiorno (firstname_at_lastname.org)
Date: 05/09/04

    Date: Sun, 09 May 2004 02:48:13 GMT
    
    

    I am trying to use StreamTokenizer to parse email (a spam corpus) and I
    am running into some problems.

    First, StreamTokenizer classifies every character in the range 0x00 to
    0xFF, that is, ASCII and extended ASCII.

    However, I am noticing that occasionally, when it reports a TT_WORD
    token, sval contains characters with a value > 0xFF.

    I assumed the tokenizer would just treat these as whitespace, since I
    called whitespaceChars(128, 0xffff).
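    A minimal sketch that reproduces the behavior described above. The class
    name TokenizerDemo, the helper words() and the sample text are
    hypothetical, made up for illustration. It relies on how the OpenJDK
    implementation appears to work: the tokenizer keeps a 256-entry
    character table, whitespaceChars() clamps its upper bound to 255, and any
    character above 0xFF is treated as alphabetic regardless of the table.
    Other implementations may differ.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenizerDemo {
    // Hypothetical helper: returns every TT_WORD token found in the text.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        // Intent: treat everything above 7-bit ASCII as whitespace.
        // Assumption: the upper bound is silently clamped to 255.
        st.whitespaceChars(128, 0xFFFF);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00E9 (233) lies inside the table; \u4E2D lies above 0xFF.
        for (String w : words("caf\u00E9 spam\u4E2Dham")) {
            System.out.println(w);
        }
    }
}
```

    With this input, the character inside the 0x80 to 0xFF range splits the
    first word, while the character above 0xFF stays inside the second one.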

    So what I need to know is: am I missing something, or is there a better
    class that can actually deal with Unicode characters?
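    One possible workaround, sketched under the assumption that everything
    outside 7-bit ASCII can simply be discarded: wrap the input in a
    FilterReader that replaces such characters with a space before
    StreamTokenizer sees them. AsciiOnlyReader and words() are hypothetical
    names, not part of the JDK, and this is only one way to approach it. If
    the non-ASCII text matters, a different tokenizer would be needed.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: maps every char above 0x7F to a space.
public class AsciiOnlyReader extends FilterReader {
    public AsciiOnlyReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return (c > 0x7F) ? ' ' : c; // -1 (end of stream) passes through
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] > 0x7F) {
                buf[i] = ' ';
            }
        }
        return n;
    }

    // Returns every TT_WORD token found in the filtered text.
    static List<String> words(String text) {
        StreamTokenizer st =
            new StreamTokenizer(new AsciiOnlyReader(new StringReader(text)));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("caf\u00E9 spam\u4E2Dham"));
    }
}
```

    Filtering in the Reader keeps the tokenizer configuration untouched, at
    the cost of losing any information carried by the replaced characters.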

    Christian

