Re: producing junk when printing a string

From: Thomas Matthews (Thomas_MatthewsSpamBotsSuck_at_sbcglobal.net)
Date: 12/08/03


Date: Mon, 08 Dec 2003 21:26:19 GMT

Jumbo wrote:
> "Thomas Matthews" <Thomas_MatthewsSpitsOnSpamBots@sbcglobal.net> wrote in
> message news:K73Ab.3548$x54.1605@newssvr33.news.prodigy.com...
>>You are assuming the ASCII character set, which may not be used
>>on all platforms.
First off, judging by your replies to other posts, I assume you are one
of those types that must be correct all the time. So take the
following in stride.

Actually, the toupper and tolower functions simplify the parsing
of user input. Rather than having to compare against each case, you
let these functions take care of that. Less typing for the developer and
better quality code. Also, given a suitable locale, these functions
take care of things like accented characters and umlauts.
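
To illustrate the point about parsing user input, here is a minimal
sketch. The helper names to_upper_copy and is_yes are made up for this
example and do not come from the OP's code. The input is converted to
one case first, so only one spelling of each answer needs comparing.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns an upper-cased copy of the input.
std::string to_upper_copy(const std::string& text)
{
    std::string result(text);
    for (std::string::size_type i = 0; i < result.size(); ++i)
    {
        // Convert to unsigned char first: passing a negative char
        // value to std::toupper is undefined behaviour.
        result[i] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(result[i])));
    }
    return result;
}

// Hypothetical example: accepts "yes", "Yes", "YES", "y", "Y", ...
bool is_yes(const std::string& answer)
{
    const std::string upper = to_upper_copy(answer);
    return upper == "YES" || upper == "Y";
}
```

Without the conversion, is_yes would need a comparison for every
mixed-case spelling it wants to accept.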

>
>
> You are assuming his code is intended to work on all platforms.
> He is probably not assuming ASCII. He more than likely knows his platform is
> ASCII and is not assuming.
[snip -- platform specific sniveling]
I take it you haven't dealt with a Marketing department before.
They will always find a way to make more money from products already
developed, "Well, if it works on a PC, why doesn't it work on a Mac?".

One can write platform specific code, there is nothing wrong with
that. However, one must state in the source code and documentation,
and check during execution, that the code will only work for
the designated platform and no others.
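
One way such a check might look, as a sketch only: it assumes the code
depends on ASCII letter codes and that a C++11 compiler is available for
static_assert. The function name character_set_is_ascii_compatible is
invented for this example.

```cpp
// Assumption: this source file relies on ASCII letter codes.
// The build fails on a platform where that does not hold.
// static_assert requires C++11 or later.
static_assert('A' == 65 && 'Z' == 90 && 'a' == 97 && 'z' == 122,
              "This code requires an ASCII-compatible character set");

// Hypothetical run-time counterpart of the same check, for code
// that prefers to report the problem during execution.
bool character_set_is_ascii_compatible()
{
    return 'A' == 65 && 'Z' == 90 && 'a' == 97 && 'z' == 122;
}
```

A program could call character_set_is_ascii_compatible() at start-up
and refuse to continue with a clear message if it returns false.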

I write platform specific code on a daily basis. However, I've seen
and taken part in ports of existing code to new platforms. Any effort
spent on platform-independent coding saved us lots of development time.

By the way, the skill of making code readable is one that should
be learned at the start of programming. It is a habit that is worth
the investment.

>>Another mistake, is that you are assuming that a word is a sequence
>
>
> Coding for a specific character set is not a mistake.
Not as long as the intention is specified and the code either takes
measures to prevent other character sets from being used or warns that
the program does not work with other character sets. Sometimes this
documentation and these warnings are more effort than making the program
work with multiple character sets from the start.

>>of one or more letters. "Don't" that beat all!. In other words,
>>your algorithm doesn't account for contractions. It also considers
>>abbreviations (a.k.a.) as separate words. Your algorithm also
>>doesn't account for hyphenated words or words that are broken
>>across text lines.
>
>
> Good point this.
> What do you think should be considered a word sequence then?
> foo
> fo(line break)o
> foo-bar
> f.o.o
> foo()
> foo:
> foo:-
> foo,
> foo.
> "foo"
> foo!
> foo?
>
> I don't know maybe you can think of more but what I wanted to say to you
> was..
> What about the idea of checking for spaces to terminate a word sequence.
> Maybe this would be better but it would depend on the context of the data.
> Then perhaps you might want to call a sub process to further parse each word
> i.e: to strip full stops of the end etc.
Many words are not terminated by spaces (in the English language).
Like many sentences, words can be terminated by colons, commas, periods,
exclamation points, question marks, quotation marks, and others.
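
A rough sketch of that idea, not a complete word parser: anything that
is not a letter ends a word, except an apostrophe or hyphen that sits
between two letters. The name split_words and the joiner rule are
assumptions made for this example. Abbreviations such as "a.k.a." and
words broken across lines are still not handled.

```cpp
#include <cctype>
#include <string>
#include <vector>

namespace {
// True if c is a letter in the current locale.
bool is_letter(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
}

// Hypothetical helper: splits text into words, keeping contractions
// ("Don't") and hyphenated words ("foo-bar") in one piece.
std::vector<std::string> split_words(const std::string& text)
{
    std::vector<std::string> words;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        // An apostrophe or hyphen counts as part of the word only
        // when a word is in progress and a letter follows.
        const bool joiner = (c == '\'' || c == '-')
            && !current.empty()
            && i + 1 < text.size() && is_letter(text[i + 1]);
        if (is_letter(c) || joiner)
        {
            current += c;
        }
        else if (!current.empty())
        {
            // Any other character terminates the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
    {
        words.push_back(current);
    }
    return words;
}
```

With this sketch, split_words("Don't stop, foo-bar!") yields the three
words "Don't", "stop" and "foo-bar".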

I was just pointing out that the OP's algorithm had some holes in it.
I don't have the time to spend posting a complete word parsing
algorithm. Besides, other people have done it and posted it publicly.
Search the web.

>>When you do decide to use numbers in a program, prefer named
>>constants rather than "magic" numbers without any meaning:
>>const char ASCII_LETTER_A = 65; /* 0x41 */
>>const char ASCII_LETTER_Z = 90; /* 0x5A */
>>// ...
>> if ((c >= ASCII_LETTER_A && c <= ASCII_LETTER_Z)
>>//...
>
>
> Unecessary preferably a comment at top of page or something like this:
> /*############ CODE FOR ASCII ################## */
Actually, numbers should not be used when character constants
are better suited. Some people know the ASCII chart in decimal.
I have it completely memorized in hex. The decimal numbers took me a
while (convert to hex, then to ASCII). A trivial character
constant requires no translation: 'A' is self explanatory (and
is not dependent on the character set).
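
For comparison, a short sketch of both styles. The function names are
made up for this example. Note that the constant 'A' itself is portable,
but a range test from 'A' to 'Z' still assumes the upper-case letters
are contiguous, which holds for ASCII and not for EBCDIC. The isupper
function from <cctype> avoids that assumption.

```cpp
#include <cctype>

// Character constants instead of 65 and 90. Readable, but still
// assumes 'A'..'Z' are contiguous in the character set.
bool is_upper_by_range(char c)
{
    return c >= 'A' && c <= 'Z';
}

// Lets the library decide; no assumption about the layout of the
// character set. The conversion to unsigned char avoids undefined
// behaviour for negative char values.
bool is_upper_portable(char c)
{
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}
```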

>>The above is more readable than the number 65, whose meaning
>>must be deduced from the context. Which is more work on the
>>reader's part. Also, the identifiers don't change the code
>>size or execution time from using the numbers. Compilers
>>translate the identifiers to their numeric values automatically
>>for you.
>
>
> Yeah clearer coding is always a good thing in my mind too but remember that
> the codes' first objective is to be read by a compiler not a human. Some
> people prefer to have unreadable code because they like to be the only ones
> who can read it :o)
Generally, code that is readable by others (ideally even by someone with
little knowledge of programming) is more likely to be correct on the
first pass than cryptic code.

-- 
Thomas Matthews
C++ newsgroup welcome message:
          http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq:   http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
          http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
     http://www.josuttis.com  -- C++ STL Library book
     http://www.sgi.com/tech/stl -- Standard Template Library

