Re: Binary-mode i/o, width of char, endianness

From: T Koster (reply-to-group_at_use.net)
Date: 03/01/05


Date: Tue, 01 Mar 2005 13:11:38 GMT

infobahn wrote:
> T Koster wrote:
>
>>I'm having some difficulty figuring out the most portable way to read 24
>>bits from a file. This is related to a Base-64 encoding.
>>
>>The file is opened in binary mode, and I'm using fread to read three
>>bytes from it. The question is though, where should fread put this? I
>>have considered two alternatives, but neither seems like a good idea:
>>
>>In most cases, the width of a char is 8 bits, so an array of 3 chars
>>would suffice, but the width of a char is guaranteed to be only *at
>>least* 8 bits, so the actual number of chars required would be 24 /
>>CHAR_BIT, rounded up. Since you can't round in a constant integral
>>expression, 3 chars is a good safe buffer size because it's guaranteed
>>to be at least 24 bits.
>
> To store BITS bits, you need at least (BITS + CHAR_BIT - 1) / CHAR_BIT
> bytes. If BITS is constant:
>
> #define BITS 24
>
> then:
>
> unsigned char buf[(BITS + CHAR_BIT - 1) / CHAR_BIT] = {0};
>
> is legal.

Ahh, good idea.
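
Spelled out, the rounding-up trick might look like the sketch below. The macro name BYTES_FOR_BITS and the helper buf_size are made up for illustration. The size is for the packed case, where every bit of each char is used.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

#define BITS 24

/* Chars needed to hold n bits, rounded up (hypothetical helper macro).
   Integer division truncates, so adding CHAR_BIT - 1 first rounds up. */
#define BYTES_FOR_BITS(n) (((n) + CHAR_BIT - 1) / CHAR_BIT)

/* The size is an integer constant expression, so this is a legal
   array declaration even at file scope. */
static unsigned char buf[BYTES_FOR_BITS(BITS)];

/* Report how many chars the buffer occupies on this implementation. */
size_t buf_size(void)
{
    return sizeof buf;
}
```

With CHAR_BIT equal to 8 this gives a 3-char buffer. If each octet from the file ends up in its own char (the "spread out" case mentioned further down), 3 chars are needed whatever CHAR_BIT is.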

>>However, since I need to be able to divide
>>those 24 bits into four 6-bit numbers, indices into the char array
>>become more complicated as the 6-bit numbers do not fall evenly on the
>>(presumably) 8-bit boundaries that indexes in the array would give me.
>
> So you need to mask and shift. If we assume that each octet of data
> is stored in a separate byte, then this isn't as hard as it sounds.
>
> /* 1. get bits 7 through 2 of first octet */
> num[0] = (buf[0] & 0xFC) >> 2;
> /* 2. get bits 1 and 0 of first octet, and bits 7 through 4 of
> second octet */
> num[1] = ((buf[0] & 0x03) << 4) | ((buf[1] & 0xF0) >> 4);
>
> etc.
>
>>If the width of a char is not 8 bits, then knowing which indices to look
>>at and shift/mask is even more difficult.
>
> See above if they're spread out, with 8 value bits to each byte
> (the remaining bits being unused). If they're packed in, you just
> have to be a little clever with CHAR_BIT. Once you start to analyse
> this problem, you'll see that it isn't as hard as it sounds.
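
For the "spread out" case, the full mask-and-shift for one 24-bit group could be sketched as follows. It assumes each octet sits in the low 8 bits of its own unsigned char; the function name split24 is hypothetical. The low two bits of the first octet are shifted left by 4 so that they land in bits 5 and 4 of the second 6-bit value.

```c
/* Split three octets into four 6-bit values (each in the range 0..63).
   Assumes one octet per unsigned char, held in its low 8 bits. */
void split24(const unsigned char in[3], unsigned char out[4])
{
    /* bits 7..2 of the first octet */
    out[0] = (in[0] & 0xFC) >> 2;
    /* bits 1..0 of the first octet, then bits 7..4 of the second */
    out[1] = ((in[0] & 0x03) << 4) | ((in[1] & 0xF0) >> 4);
    /* bits 3..0 of the second octet, then bits 7..6 of the third */
    out[2] = ((in[1] & 0x0F) << 2) | ((in[2] & 0xC0) >> 6);
    /* bits 5..0 of the third octet */
    out[3] = in[2] & 0x3F;
}
```

As a check, the octets 0x4D 0x61 0x6E ("Man" in ASCII) split into 19, 22, 5 and 46, which are the Base-64 indices of "TWFu".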

We seem to be using the term 'byte' with different meanings...see below.

>>As such, I thought of the
>>second option.
>>
>>The second option is to allocate the input buffer as simply one int
>>object that is guaranteed to be at least 24 bits wide: the long int,
>>which even has 8 bytes to spare.
>
> Well, at least 8 *bits* to spare. :-)

Certainly :)

>>fread can safely write 3 bytes of data
>>into a long int.
>
> Not necessarily. On platforms such as the kind you are worrying about
> (CHAR_BIT > 8), long int may well be fewer than four bytes wide!
>
> Consider a platform with 11-bit bytes. On such a platform, long ints
> may only occupy 3 bytes. On (perhaps more common) platforms with
> 16-bit or 32-bit bytes, long int may be only 2 bytes, or even 1 byte.

Hmmm, this appears to be becoming a question of terminology. I thought
that by definition, one byte is eight bits wide. I'm not using the C
type 'char' interchangeably with 'an int that is one _byte_ big'. When I
consider that CHAR_BIT may be greater than 8, I mean exactly that, and
not that a byte of storage on this platform has more than eight bits,
since I thought that was nonsense. That is, a char may occupy more than
one byte of storage, but a byte is still an 8-bit byte. Calling fread
and asking for three bytes implies that 24 bits will be read,
irrespective of platform, correct? As such, a long int, being
guaranteed to have at least 32 bits, is guaranteed to occupy at least
four bytes of storage, which is why I say that fread can safely store
three bytes (24 bits by definition) in a long int. Correct me if I'm
wrong here.
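
The quantities in question can be checked directly. This is a minimal sketch that assumes C's own terminology: sizeof counts in units of char, so sizeof(char) is always 1 and a C "byte" is CHAR_BIT bits wide, which need not be 8. The size arguments to fread are in those same units. Both helper names are hypothetical.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

/* How many chars (C bytes) a long occupies. This is 4 when CHAR_BIT
   is 8 and long is 32 bits, but may be fewer where CHAR_BIT is larger. */
size_t long_size_in_chars(void)
{
    return sizeof(long);
}

/* Storage bits in a long, counting any padding bits. */
size_t long_storage_bits(void)
{
    return sizeof(long) * CHAR_BIT;
}
```

The guarantee for long is on its range (at least 32 bits' worth), not on how many chars it spans, so long_size_in_chars() returning less than 4 is permitted.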

> I would stick to unsigned char for this project. Long ints will
> multiply your headaches, divide your attention, add to your
> worries, and subtract from your understanding (modulo their
> day-to-day uses, obviously).
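
Sticking with unsigned char, reading one group from the file might be sketched like this. The function name read_group is hypothetical. It assumes the stream was opened in binary mode and that each octet of the file arrives in its own char.

```c
#include <stdio.h>    /* FILE, fread */
#include <stddef.h>   /* size_t */

/* Read up to three chars from fp into buf and zero-fill whatever was
   not read, so a short final group can still be masked and shifted.
   Returns the number of chars actually read (0..3). */
size_t read_group(FILE *fp, unsigned char buf[3])
{
    size_t n = fread(buf, 1, 3, fp);
    size_t i;

    for (i = n; i < 3; i++)
        buf[i] = 0;
    return n;
}
```

A return value of less than 3 marks the final group, which is where Base-64 padding would apply; a return of 0 means end of file or a read error.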

Thanks,
Thomas.


