Re: Test for contiguous alphabet in character set

Stefan Krah wrote:
> Hello,
>
> I am currently writing code where it is convenient to convert
> char [a-zA-Z] to int [0-25]. The conversion function relies on
> a character set with contiguous alphabets.
>
> int set_mesg(Key *key, char *s)
> {
>     char *x;
>
>     if (strlen(s) != 3)
>         return 0;
>
>     x = s;
>     while (*x != '\0') {
>         if (!isalpha(*x))

FYI: Use `isalpha((unsigned char)*x)', and similarly
for the other <ctype.h> functions. Plain char may be signed,
and passing a negative value (other than EOF) to these
functions is undefined behavior.

>             return 0;
>         *x = tolower(*x);
>         x++;
>     }
>
>     x = s;
>     /* l_mesg, m_mesg, r_mesg are int */
>     key->l_mesg = *x++ - 'a';
>     key->m_mesg = *x++ - 'a';
>     key->r_mesg = *x - 'a';
>
>     return 1;
> }
>
>
> According to K&R2 (p43) contiguous alphabets cannot be safely assumed.
> This function would test the lowercase alphabet:
>
> int cont_lower_alpha(void)
> {
>     char a[26] = "abcdefghijklmnopqrstuvwxyz";
>     int i;
>
>     for (i = 0; i < 26; i++)
>         if (a[i] - 'a' != i)
>             return 0;
>
>     return 1;
> }
>
> Is there an easier way of doing this?

This tests contiguity of the lower-case alphabet, and
upper-case could be tested in the same way. But if the
test says "discontiguous," then what? Your real problem,
I think, is not to determine whether the alphabets are
contiguous, but to find some code that will work correctly
even if they are not contiguous.

One way would be to use the position of the character
in a reference string instead of the character's code. For
example, you could write

    const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    ...
    key->l_mesg = strchr(alphabet, *x++) - alphabet;

As written, this is unsafe if there's any chance that
the target character might not be found in alphabet[], because
strchr() would then return NULL and `NULL - alphabet' is
nonsense. (You may think this cannot happen since the code
has eliminated all non-alphabetics and converted everything
to lower-case, but keep in mind that isalpha() and tolower()
are locale-dependent. Letters like å, ç, ñ, and þ are found
in many character sets, and may be considered lower-case
alphabetics in some locales -- so they would pass through the
earlier portions of your code only to be found missing from
the alphabet[] array.) You could call strchr() and check the
result for NULL before trying to subtract, or you could make
sure that strchr() always finds the target character:

    char alphabetplus[] = "abcdefghijklmnopqrstuvwxyz?";
    int pos;
    ...
    alphabetplus[26] = *x;
    pos = strchr(alphabetplus, *x++) - alphabetplus;
    if (pos < 26)
        key->l_mesg = pos;
    else
        return 0; /* unknown lower-case alphabetic */
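
The other option mentioned above, checking strchr()'s result
for NULL before subtracting, might look like the following sketch.
The helper name letter_index() and its -1 "not found" convention
are my own assumptions for illustration, not part of the original
code. Note that strchr() also matches the string's terminating
'\0', so the sketch rejects that value explicitly.

```c
#include <string.h>

/* Hypothetical helper: returns 0..25 for 'a'..'z', -1 for anything else. */
static int letter_index(char c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    /* strchr() would "find" the terminating '\0', so reject it up front. */
    if (c == '\0')
        return -1;

    p = strchr(alphabet, c);
    if (p == NULL)
        return -1;              /* not one of the 26 letters */

    return (int)(p - alphabet); /* position in the reference string */
}
```

A caller would then test for -1 and return 0 from set_mesg()
instead of storing a meaningless value.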

If you intend to compute a large number of these message
codes, though, it is probably better to use a table:

    /* needs <limits.h> for UCHAR_MAX and <ctype.h> for toupper() */
    static char code[1+UCHAR_MAX];
    if (code['a'] == 0) {
        /* initialize table on the first call */
        const char alpha[] = "abcdefghijklmnopqrstuvwxyz";
        int i;
        for (i = 0; alpha[i] != '\0'; ++i) {
            code[alpha[i]] = i + 1;
            code[toupper(alpha[i])] = i + 1;
        }
    }
    ...
    if (code[(unsigned char)*x] == 0)
        return 0;       /* not a letter in a..z or A..Z */
    key->l_mesg = code[(unsigned char)*x] - 1;

(Note 1: Yes, I warned you to cast the argument of <ctype.h>
functions, yet I did not do so in the toupper() call. This
happens to be safe because I know that all the characters in
a..z are in the "basic execution" character set, and all these
are guaranteed to have non-negative code values. When you don't
have such knowledge of the input string, though, you must cast --
as in the references to the code[] array, although only the first
of those two is strictly necessary.)
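
For input whose contents you do not control, one way to keep
the cast from being forgotten is to hide it in small wrappers.
This is only a sketch; the names is_alpha_char() and
to_lower_char() are hypothetical, not standard functions.

```c
#include <ctype.h>

/* Hypothetical wrappers: convert to unsigned char before calling <ctype.h>. */
static int is_alpha_char(char c)
{
    return isalpha((unsigned char)c) != 0;
}

static char to_lower_char(char c)
{
    return (char)tolower((unsigned char)c);
}
```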

(Note 2: Observe that the table-based method eliminates
the need to weed out non-alphabetics and convert case. All
letters outside a..z and A..Z will be detected by virtue of
their zero code[] values, and for all the rest you will have
code['a'] == code['A'], code['b'] == code['B'], and so on.)
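
Put together, a table-based set_mesg() could look roughly like
the sketch below. The Key layout is an assumption (the original
only says the three fields are int), and I have made the string
parameter const because this version no longer needs to modify
it. I also validate all three characters before storing anything
and cast the table indices throughout; neither is required by
the fragment above.

```c
#include <ctype.h>
#include <limits.h>
#include <string.h>

/* Assumed layout: the original post only says these fields are int. */
typedef struct {
    int l_mesg, m_mesg, r_mesg;
} Key;

/* Returns 1 and fills in *key on success, 0 if s is not exactly three letters. */
int set_mesg(Key *key, const char *s)
{
    static unsigned char code[1 + UCHAR_MAX];
    int v[3];
    int i;

    if (code['a'] == 0) {
        /* build the table on the first call; 0 means "not a letter" */
        const char alpha[] = "abcdefghijklmnopqrstuvwxyz";
        for (i = 0; alpha[i] != '\0'; ++i) {
            code[(unsigned char)alpha[i]] = (unsigned char)(i + 1);
            code[toupper((unsigned char)alpha[i])] = (unsigned char)(i + 1);
        }
    }

    if (strlen(s) != 3)
        return 0;

    for (i = 0; i < 3; ++i) {
        int c = code[(unsigned char)s[i]];
        if (c == 0)
            return 0;           /* outside a..z and A..Z */
        v[i] = c - 1;
    }

    key->l_mesg = v[0];
    key->m_mesg = v[1];
    key->r_mesg = v[2];
    return 1;
}
```

With this version, "aBz" yields the codes 0, 1, 25, while "ab"
and "a1c" are rejected.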

--
Eric.Sosman@xxxxxxx

