Re: ANSI C question about 'volatile'

From: Chris Torek (nospam_at_torek.net)
Date: 08/13/04


Date: 13 Aug 2004 20:17:55 GMT


>> Chris Torek wrote:
>> > ... the same C Standard says:
>> > ... What constitutes an access to an object that has
>> > volatile-qualified type is implementation-defined.
>> > which leaves the implementor a truck-sized loophole: he can simply
>> > define away all but one of the actual memory references, leaving
>> > only one of them as an "access".

>Eric Sosman <Eric.Sosman@sun.com> wrote in message
>news:<411CD4EA.5070402@sun.com>...
>> Hard to avoid such gaps, I think.
[example snipped, but it includes a large structure assignment]
>> Most implementations, I think, would be forced to define the
>> single C-level "access" to `s1' in terms of multiple hardware-
>> level "accesses."

Indeed.

In article <news:2d31a9f9.0408131053.a1e046@posting.google.com>
j0mbolar <j0mbolar@engineer.com> asked:
>this begs the question, what in the world is multiple hardware-level
>accesses? and how does this affect something volatile?

The *intent* of the C Standard is clear: the hardware has some
set(s) of instruction(s) that perform hardware-level access, and
there is some mapping from "hardware access" to "C code". That
mapping is allowed to be optimized as much as possible *except*
in the presence of "volatile" qualifiers, where the mapping should
be as direct as possible.

Suppose we have a conventional load/store architecture, for instance,
in which there are only "two kinds" of "hardware access": the "load"
and the "store". In assembly these are achieved via "ld" and "st"
instructions. Only one "bus width" is supported (the 32-bit-word),
so that:

    ld r1,(r2)
    st r1,(r3)

"means": "do a 32-bit bus access to the address given by r2, putting
the value retrieved into r1; then do a 32-bit bus access to the
address given by r3, storing the value now in r1".

Hardware *devices* may then respond in particular (and peculiar)
ways to these two hardware-level bus transactions.

Since the C compiler for this particular machine has 32-bit "int"s,
we can do the same in C with:

    int r1;
    volatile int *r2, *r3;
    r1 = *r2;
    *r3 = r1;

and "expect" the C compiler to generate the "obvious" code (although
the register numbers might change in the process). The C Standard
gives us (C programmers) "volatile" to do it, but does not promise
us that the compiler will accede to our wishes; it is up to us to
obtain a C compiler that actually does so.
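
As a minimal sketch, here is the fragment above wrapped in a function so that it compiles on its own. The name copy_word is made up for illustration, and whether the compiler really emits exactly one "ld" and one "st" is still up to the implementation:

```c
/* Hypothetical helper, not taken from any real driver: copy one word
 * from the location src to the location dst.  Because both pointers
 * are volatile-qualified, a cooperative compiler should emit exactly
 * one load and one store, in this order, and neither cache the value
 * nor drop either access. */
void copy_word(volatile int *src, volatile int *dst)
{
    int tmp;

    tmp = *src;     /* the "ld": one read access to *src */
    *dst = tmp;     /* the "st": one write access to *dst */
}
```

On a hosted system one can exercise this with ordinary variables standing in for the device registers; the values come out the same, and only the bus traffic differs.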

What happens, though, if we have a 16-bit or 8-bit hardware device
and have to connect it to this machine? The *machine* is PHYSICALLY
INCAPABLE of doing anything other than a 32-bit-wide access. How
can we take an AMD "Lance" Ethernet device, with its two 16-bit
registers, and make it work with this (MIPS-R2000-like) CPU?

The answer in this case was to put the 16-bit registers on 32-bit
boundaries:

    struct lance_registers {
        uint16_t pad1;
        uint16_t rap; /* Register Address Port */
        uint16_t pad2;
        uint16_t rdp; /* Register Data Port */
    }; /* (I might have the address and data ports backwards) */

This, however, is *not* how it is done on a conventional 80x86-like
CPU, which *does* have multiple different bus-size-transactions.
Here the compiler should use 16-bit bus accesses for 16-bit integers,
and 8-bit bus accesses for 8-bit integers, and the two "pad"s go
away in the structure.
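
A sketch of the two layouts side by side, assuming 8-bit bytes and a <stdint.h> that provides uint16_t. The struct tags lance_padded and lance_dense and the two helper functions are invented here for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout for the 32-bit-only machine: each 16-bit register sits at
 * byte offset 2 within its own 32-bit word. */
struct lance_padded {
    uint16_t pad1;
    uint16_t rap;   /* Register Address Port */
    uint16_t pad2;
    uint16_t rdp;   /* Register Data Port */
};

/* Layout for a CPU that can do 16-bit bus cycles: the pads go away. */
struct lance_dense {
    uint16_t rap;
    uint16_t rdp;
};

/* Byte offset of the data port in each layout. */
size_t padded_rdp_offset(void)
{
    return offsetof(struct lance_padded, rdp);
}

size_t dense_rdp_offset(void)
{
    return offsetof(struct lance_dense, rdp);
}
```

Under those assumptions the data port lands at byte offset 6 in the padded layout and at byte offset 2 in the dense one. The same driver source can use either, as long as the structure definition matches the way the board was wired.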

Moreover, the 80x86 has what are called "read-modify-write" bus
cycles, as did the PDP-11 and VAX. Some PDP-11 Unibus hardware
devices *required* certain operations to use these r/m/w cycles
to obtain predictable results. To get such a bus cycle, an assembler
programmer might use the "bis" or "bic" instructions on the VAX:

    bisw2 r1,(r6)

This instruction reads from the (presumably Unibus) location given
by r6, sets the bits given by r1, and writes the result back, all
within a single bus operation using the "r/m/w" cycle. The C programmer
familiar with all this would write the code as:

    *r6 |= r1;

and "expect" to get the same bisw2 instruction (provided r6 has
type "volatile unsigned short *" or similar). Writing:

    *r6 = *r6 | r1;

would instead produce an assembler sequence like:

    movzwl (r6),r0 # or perhaps just movw
    bisl2 r1,r0 # in which case this would be a bisw2
    movw r0,(r6)

Again, while "volatile" is *necessary* to tell the compiler "please
do not attempt to optimize this", it is not *sufficient* -- the
compiler must actually generate different code for the "|=" operation.
A similar compiler on a load/store architecture *cannot* generate
a single instruction for this, though, because there IS NO SUCH
SINGLE INSTRUCTION (and there are no r/m/w bus cycles).
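
The two spellings can be put side by side as functions. This is only a sketch: the function names are invented, and nothing in the C Standard obliges a compiler to turn the first one into a single r/m/w instruction even on a machine that has one.

```c
/* Hypothetical helpers; reg would normally point at a device register. */

/* Compound assignment: on a VAX-like machine a cooperative compiler
 * can emit a single "bisw2", i.e., one r/m/w bus cycle. */
void set_bits_rmw(volatile unsigned short *reg, unsigned short bits)
{
    *reg |= bits;
}

/* Spelled-out form: a separate read access, an OR done in a
 * temporary, and then a separate write access. */
void set_bits_split(volatile unsigned short *reg, unsigned short bits)
{
    unsigned short tmp;

    tmp = *reg;
    tmp |= bits;
    *reg = tmp;
}
```

Both functions leave the same value behind. They differ only in the bus transactions one hopes to see, which is exactly the part that C itself does not let us specify.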

The answers to j0mbolar's questions, then, are: "access" is really
defined by the hardware, and as C programmers, we have to know not
only what the hardware does, but also whether we can convince our
C compilers to generate the necessary code. When C's types and
operations "map nicely" onto the hardware, we can expect, and should
really demand, that our C compilers do the "obvious thing".

What about the cases where C's types and operations do not fit well
with the hardware-level operations? Consider the V8 SPARC's "ldstub"
(load/store unsigned byte) instruction, or V9's compare and swap;
the 80x86 compare-and-exchange instructions; and the MIPS and PowerPC
style "load linked / store conditional" pairs. The ldstub
instruction is defined as an atomic bus cycle that:

  - reads a byte from memory
  - stores 0xff into memory

and gives you the original byte in the register. If two devices
or processors attempt this at the "same time", and the byte is
originally not 0xff, one of them will "see" the original byte and
the other will see the 0xff. The compare and swap (aka compare
and exchange) instructions, which are more powerful, take two
registers and a memory location and atomically:

  - compare the first register with the memory value
  - if they are equal, change the memory value to the second register,
    but if they are not equal, leave the memory value alone
  - leave the result of the comparison or the original memory value
    (or both) in one of the registers and/or in some condition codes

The ll/sc sequence, which is perhaps the most powerful of all,
loads a value from memory into a register, and then later stores
a new value (as given by a register) into that memory location but
only if no one else has changed it yet. (This is done through the
cache protocols -- the CPU cache uses MESI or MOESI to cooperate
with other devices, and is alerted if the value gets changed between
the two separate instructions. While CAS can be used to implement
atomic adds and mutexes, LL+SC can be used to implement atomic
queues.)

The closest one can come to writing CAS in C, for instance, is:

    tmp = *mem;
    if (tmp == r1)
        *mem = r2;
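    r1 = tmp;

but in the hardware all of this happens in a single bus cycle, whereas
the C version is several separate accesses that another processor or
device can slip in between. There is no C operator that compresses
this down to one operation. The LL/SC sequence actually takes
multiple bus cycles and cannot be expressed at all in C.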
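
What *can* be written in C is a model of the end result, minus the atomicity. The sketch below uses invented names and is NOT safe with more than one processor or thread; it only shows what each primitive computes, and how a retry loop turns CAS into an "atomic add" once the real instruction is substituted.

```c
/* Non-atomic reference models.  All names here are hypothetical; a
 * real implementation needs the actual instruction (via assembly),
 * because plain C gives no guarantee these steps are indivisible. */

/* ldstub: hand back the old byte and leave 0xff in memory. */
unsigned char ldstub_model(volatile unsigned char *mem)
{
    unsigned char old;

    old = *mem;
    *mem = 0xff;
    return old;
}

/* A lock built on ldstub: 0 means free, 0xff means held.
 * Returns nonzero if the caller obtained the lock. */
int try_lock_model(volatile unsigned char *lock)
{
    return ldstub_model(lock) == 0;
}

/* Compare-and-swap: store newval only if *mem equals expected;
 * always hand back the value that was found in memory. */
int cas_model(volatile int *mem, int expected, int newval)
{
    int old;

    old = *mem;
    if (old == expected)
        *mem = newval;
    return old;
}

/* "Atomic add" built on CAS: re-read and retry until the swap
 * succeeds, i.e., until no one changed *mem in between. */
int add_via_cas(volatile int *mem, int delta)
{
    int old;

    do {
        old = *mem;
    } while (cas_model(mem, old, old + delta) != old);
    return old + delta;
}
```

With the models swapped out for the real instructions, the callers keep the same shape: try_lock_model() becomes a spin-lock attempt, and the loop in add_via_cas() only goes around again when some other processor got in first.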

Today, the usual tack for handling the "cannot be written in C at
all" instructions is to use assembly code -- either a C-callable
subroutine, or inline expansion.

-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: forget about it   http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

