Re: Kind of new: function implementation questions, MASM

From: Ross A. Finlayson (spamtrap_at_crayne.org)
Date: 10/01/04

  • Next message: David J. Craig: "Re: Can't get full drive transfer rate -- what am I doing wrong?"
    Date: Thu, 30 Sep 2004 22:09:12 +0000 (UTC)
    
    

    Tim Roberts wrote:

    > "Ross A. Finlayson" <spamtrap@crayne.org> wrote:
    > >
    > >OK. Can you explain some of the deal about address generation? That is to say,
    > >please explain for:
    > >
    > >mov ecx, [esp+ edx]
    > >vs.
    > >mov ecx, [esp+ 4*edx]
    > >vs.
    > >mov ecx, [esp+ 4*edx + 2]
    > >
    > >if the instruction with less stuff is faster than the instruction with more stuff
    > >or the address generation is constant time.
    >
    > This is a surprisingly complicated topic. Some addressing modes require
    > the use of an additional byte (the SIB byte, named for the three bitfields
    > it contains: scale, index, base). Instructions with a SIB require an extra
    > cycle (at least they did in the Pentium & Pentium II; I have lost track in
    > the P4).
    >
    > [reg] and [reg+reg] do not require a SIB byte. [reg+constant] does not
    > require a SIB, unless the register is esp. This is because "esp+constant"
    > was usurped as the signal that "a SIB byte follows". [reg+K*reg] and
    > [reg+K*reg+constant] requires a SIB.
    >
    > This extra cycle is ignorable, unless it is used multiple times in your
    > innermost loop, especially if it allows you to keep stuff in a register
    > without spilling to memory.
    >
    > >> >I'm basically looking at a bitscanner here, the data is in memory in
    > >> >big-endian order. I load the input data onto a register, the bytes are
    > >> >swapped. I can use bswap, which is a fast operation, ...
    > >>
    > >> It isn't on 486 and Pentium.
    > >
    > >Yeah, I want to avoid any instruction that is not available on a 386 on principle.
    >
    > "bswap" was introduced on the 486; what I meant by my statement was that
    > "it isn't FAST on 486 and Pentium".
    >
    > >I think it's probably a good idea to "reroll" the loop somewhat to keep it in the
    > >cache.
    >
    > Now, this optimization point is one that I agree with. If you make a
    > change that causes a single loop to exceed the cache size, you will be
    > very, very sorry. However, that requires a pretty big loop.
    >
    > >If/when I get to the point where I want to compare the various ways of
    > >implementing this thing, then I will want to collect some timing data. I'm not
    > >sure of the simple procedure to do that. I read something about RDTSC only being
    > >callable from real mode, I don't think that's accurate.
    >
    > No, it's available it all modes.
    >
    > >I think I would
    > >have to store the timestamp results if I want to call the function a million times
    > >to average its cycle count among system/platform task switching. Basically I want
    > >to instrument it simply.
    >
    > RDTSC is ideal for that. It usually isn't necessary to run a million loops
    > to eliminate task switching overhead; in my experience, after a dozen runs,
    > the wild points are usually very clear, and the standard deviation of the
    > rest is quite small.
    >
    > >The reason I use assembler is because this software needs to operate on the bit
    > >streams. There are shifts in the higher level languages, eg C, a midrange
    > >language, and some other notions about the machine code output, where a lot of
    > >pointer arithmetic is really close to the machine, and some gets farther with
    > >processor developments, but anyways as an example there is no rotate operator in
    > >C.
    >
    > I do not argue with this at all. I've done MPEG encoders/decoders, and the
    > bit extraction routines are FIRST in line for assemblerizing.
    >
    > Are you doing MPEG?
    > --
    > - Tim Roberts, timr@probo.com
    > Providenza & Boekelheide, Inc.

    Hi,

    Hey that helps me understand the address generation some more.

    So basically where possible I should try and use these forms:

    [reg]
    [reg+reg]
    [reg+imm]

    over

    [reg+ (1,2, 4, or 8) * reg]
    [reg+ (1,2, 4, or 8) * reg + imm]
    [esp + imm]

    Then, the address generation is more simple for the processor because there is no SIB
    byte in the encoded instruction.

    About the bitstream scanning, I'm gearing more towards JPEG than MPEG, and the
    concepts of the bitscanning are remarkably similar for any data format that is a
    sequence of 1's and 0's.

    In JPEG, there are basically markers. The marker is a marker code, often followed by
    a marker segment body. All of the markers are aligned on bytes. The byte 0x00 is
    called the zero or fill byte. The byte 0xFF is followed by the marker code. The
    marker code is a predefined and fixed value, all values not specified in T.81-T.89
    (JPEG, JBIG, JPEG compliance, JPEG extension, JBIG profiles, JPEG extension, JPEG-LS)
    or T.80x (JPEG2000) are reserved. So anyways there are markers that contain basically
    byte aligned or at worst nybble-aligned data. Then, among some of the markers there
    is the "entropy-coded data", the compressed data, compressed according to the
    parameters contained within the marker structures. In JPEG, an 0xFF byte is followed
    by a stuffed 0x00 byte to indicate it is not a marker within the entropy coded data,
    in JPEG-LS and JPEG2000 instead of a fill byte the following byte's high bit is
    cleared, where the reserved codes have the high bit set. That enables scanning the
    file for bytes to be able to determine whether a given 0xFF byte is part of a marker
    code or part of entropy-coded data. The marker, where it has a marker body, has a
    length value as the first marker segment structure, it is illegal or just plain wrong
    values of that indicator that cause problems with naive JPEG readers, bashing the
    stack, and trashing the instruction pointer.

    So anyways the markers are read plainly with pretty much just moving the elements
    according to the marker's data structure into place. The entropy-coded data is the
    data with dynamic variable length bit sequences, and one of the reasons I seek
    discussion of efficiently scanning these tokens of dynamic variable length bit
    sequences is to help understand how to read this data.

    In MPEG, I may be wrong, there are bitfields within some of the "marker" analogs:
    static fixed length bit sequences, and then of course the entropy-coded data is of
    dynamic variables length bit sequences.

    About RDTSC, Read Timestamp Counter, it reads a 64 bit value and bits the high 32 bits
    in EDX and the low 32 bits in EAX. So, if that is instrumented into a function then
    it would be along the lines of, say, a preprocessor conditional directive that stores
    EAX and EDX as they are probably being used by the function in the main:

    push eax
    push edx
    rdtsc
    ; here do something with the contents of eax and edx
    pop edx
    pop eax

    One notion of instrumenting the function from beginning to end is to put the
    timestamp count values into an array so that later the array contents can be analyzed
    to get an idea of average running time of the procedure. The instrumentation should
    be "minimally invasive" to not skew the results that already vary upon the current
    state of the pipeline and execution buffers and the cache. So I think where it says
    "do something with the contents of eax and edx" that means there is use of the data
    section variables because the stack is dynamic between instrumented function calls,
    and passing an extra array pointer to the function changes the call stack and
    signature and as well yada yada. So then the count of the timestamp array and the
    current offset and the beginning of the timestamp array, to not overload the timestamp
    array but rather make it a circular queue of sorts that can be inspected, would be
    basically aligned to the end of the data segment, or far off in the data segment, so
    that the function happily works with its beginning of the data segment.

    Besides RDTSC, there are other types of performance registers on the P6, there is some
    explanation of performance measurement techniques here:

    http://www.scl.ameslab.gov/Projects/Rabbit/

    That's about using Rabbit as opposed to instrumenting a cat, Acoustikitty.

    Anyways, basically it seems that esp, the stack pointer, is key to storing stack
    offsets, and that it should never be used if at all possible.

    In terms of assemblerizing, for almost all intents and purposes where the projects are
    in HLL's, they should mostly stay there, but when it gets down to bits, the HLL's are
    lacking standardized or regularized bit routines. Don't get me wrong, they're great
    and the routines are implementable using the HLL's, often more efficiently than
    introducing new tools and methods, but an inner loop executed zillions of time does
    help protect the environment, all things considered, when it is implemented to be
    efficient.

    Warm regards,

    Ross F.


  • Next message: David J. Craig: "Re: Can't get full drive transfer rate -- what am I doing wrong?"

    Relevant Pages

    • Re: 2.6.22 -mm merge plans
      ... markers infrastructure as soon as it hits mainline). ... All these companies would be really happy to have a marker ... "kprobes remains a vital foundation for SystemTap. ... XMC-safe instruction patching. ...
      (Linux-Kernel)
    • Re: [PATCH 0/4] Linux Kernel Markers
      ... would be built on top of the marker infrastructure. ... execute a serializing instruction ... instruction on every CPU even if we do not execute the trap handler. ...
      (Linux-Kernel)
    • [RFC patch 19/27] Markers - define non optimized marker
      ... To support the forthcoming "immediate values" marker optimization, ... * @format: format string ... Should be used for markers in code paths where instruction ...
      (Linux-Kernel)
    • [patch 29/37] Markers - define non optimized marker
      ... To support the forthcoming "immediate values" marker optimization, ... * @format: format string ... Should be used for markers in code paths where instruction ...
      (Linux-Kernel)
    • [patch for 2.6.26 1/7] Markers - define non optimized marker
      ... To support the forthcoming "immediate values" marker optimization, ... * @format: format string ... Should be used for markers in code paths where instruction ...
      (Linux-Kernel)