Re: Kind of new: function implementation questions, MASM

From: Tim Roberts (spamtrap_at_crayne.org)
Date: 09/28/04

    Date: Tue, 28 Sep 2004 06:07:25 +0000 (UTC)
    
    

    You know, Ross, you and I share a common attribute: we both have a bad habit
    of using 25 words to say something that could have been said in 5. Between
    the two of us, we are going to win the "average number of words per post"
    contest this week.

    "Ross A. Finlayson" <spamtrap@crayne.org> wrote:
    >
    >Hey thanks, there is more understanding here after reading your cogent
    >explanation.
    >
    >You mention the relocation table and the linker, I'm unfamiliar with the
    >semantics of relocatable object code, I guess that's what it is: object code
    >with a relocation table that contains offsets to items that are relative
    >offsets that the linker is to modify at link time.

    Yes. An object file is just a collection of linker tables. Some of those
    tables contain machine language instructions. Some of them contain
    relocation information. Some of them contain the external symbols we are
    advertising for others to use. Some of them contain the names of symbols
    we need to use. Some of them contain debug information.
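
    As a rough sketch of where those tables come from, here is a small
    NASM-style module. The names scan_bits and lookup_table are made up
    for illustration, and the exact table layout depends on the object
    format (COFF, ELF, OMF).

    ```nasm
            global  scan_bits           ; lands in the table of symbols we export
            extern  lookup_table        ; lands in the table of symbols we need

            section .text               ; the bytes below land in a code table
    scan_bits:
            mov     eax, [esp+4]        ; first argument, no frame pointer
            and     eax, 0FFh           ; keep the low byte as an index
            movzx   eax, byte [lookup_table+eax]
                                        ; the address of lookup_table is unknown
                                        ; here, so the assembler writes a
                                        ; placeholder and adds a relocation entry
                                        ; for the linker to patch
            ret
    ```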

    >I think I want to not use at all ebp in the function, but only use esp.

    Why would you make such a decision at this point? Again, that's Step 29,
    and you're on Step 5. Get the function WORKING, then decide whether you
    need an extra register.

    >It seems to me that if I don't actually touch ebp then that's OK. For
    >example instead of:
    >
    >%define varsize 24
    >
    >push ebp
    >mov ebp, esp
    >sub esp, varsize
    >
    >that I could instead just use
    >
    >sub esp, varsize
    >
    >Now, maybe that's incorrect,

    No, it's perfectly valid. Visual C++ does this at maximum optimization
    level. It's harder for an assembly programmer, because you have to track
    your own stack use. For example, if you push a temporary variable,
    suddenly the parameter that had been at [esp+44] is now at [esp+48].
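
    A minimal sketch of that bookkeeping, assuming a hypothetical cdecl
    function myfunc with one dword argument and your 24 bytes of locals:

    ```nasm
    %define varsize 24

    myfunc:
            sub     esp, varsize        ; locals at [esp+0] .. [esp+23],
                                        ; return address at [esp+24]
            mov     eax, [esp+28]       ; first argument: varsize + 4
            push    ebx                 ; esp drops by 4 ...
            mov     ebx, [esp+32]       ; ... so the same argument is now at +32
            add     eax, ebx            ; eax = 2 * argument
            pop     ebx                 ; offsets go back to what they were
            add     esp, varsize        ; release the locals
            ret
    ```

    Every push, pop, or other change to esp inside the function shifts
    all of the offsets, and nothing in the assembler will warn you when
    you get one wrong.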

    >where the notion there is that I am going to be storing a couple bits in one
    >of the table cell values, where instead of using some conditional jumps to
    >select which of the local variables to copy into the working register, I can
    >just use those two bits left in a masked register to calculate the offset from
    >esp in a short instruction, with a few less conditional jumps. That is to
    >say, take two bits that represent a number 0-3 in a register, and then use
    >them as part of the mov or lea instruction.
    >
    >and dx, 3 ; mask off two low bits
    >mov ecx, [esp+4*dx] ; copy correct variable into ecx

    Unfortunately, you cannot mix 16-bit and 32-bit registers in addressing
    like that. You would have to use

      and edx, 3
      mov ecx, [esp+4*edx]
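
    If the four variables do not start right at esp, a displacement can
    be folded into the same instruction. A small sketch, assuming (this
    layout is hypothetical) that the four dwords sit at [esp+8] through
    [esp+23]:

    ```nasm
            and     edx, 3              ; edx = 0..3, upper bits cleared
            mov     ecx, [esp+edx*4+8]  ; load the selected variable
            lea     eax, [esp+edx*4+8]  ; or take its address instead
    ```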

    >I am reading from Agner Fog's that there are various reasons why it is better
    >to use
    >
    >mov ecx, [eax] ; <- operand is eax
    >vs.
    >mov ecx, [ebp] ; <- operand is ebp
    >or
    >mov ecx, [esp] ; <- operand is esp
    >
    >because the first instruction has a shorter encoding. So if there's a free
    >register for this strategy I might copy esp into that.

    You're skipping ahead again. This is micro-micro-optimization.
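
    For reference, the difference in question is one byte per
    instruction. These are the standard 32-bit encodings:

    ```nasm
            mov     ecx, [eax]          ; 8B 08      2 bytes
            mov     ecx, [ebp]          ; 8B 4D 00   3 bytes, [ebp] needs a
                                        ;            zero displacement byte
            mov     ecx, [esp]          ; 8B 0C 24   3 bytes, esp as a base
                                        ;            needs a SIB byte
    ```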

    >2) About caching, is it FIFO or LIFO? That means, when the cache is full,
    >which line gets dumped/evicted? Is it the newest(LIFO) or oldest(FIFO), or
    >other?

    It's FIFO, but it's not as simple as that. Pentium caches are
    set-associative. That means that a given address cannot go just anywhere
    in the cache. If you have 512kB of 8-way associative cache, there are
    exactly 8 slots into which a given address can go.
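
    To make the arithmetic concrete, here is a sketch that assumes
    64-byte cache lines (the line size varies by processor, so treat it
    as an assumption). 512kB divided by 8 ways is 64kB per way, and 64kB
    divided by 64 bytes is 1024 sets. The helper name cache_set_of is
    made up, and real hardware generally indexes the cache by physical
    address rather than the address your program sees.

    ```nasm
    cache_set_of:                       ; hypothetical: address in, set number out
            mov     eax, [esp+4]        ; the address to classify
            shr     eax, 6              ; drop the 6 offset-within-line bits
            and     eax, 1023           ; keep 10 bits: set number 0..1023
            ret
    ```

    Under those assumptions, two addresses exactly 64kB apart produce the
    same set number, so they compete for the same 8 slots.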

    >If I have these labels on the data segment each 32 bytes of around 2048 bytes,
    >or
    >section .data align 64
    >data32_0: db 1, 2, 3, ... (32 bytes)
    >data32_1: ...
    >...
    >data32_63
    >
    >where I expect the data32_0 to be used more than data32_63, should I prefetch
    >data32_0 first or last?

    You probably shouldn't use the prefetch instructions at all. Intel's
    engineers are very smart people. The processor will keep stuff in the
    cache if it NEEDS to be in the cache. You certainly don't want to plop
    prefetch instructions into version 0.1 of your code, because almost any
    code change you make will change the cache behavior. Wait until the thing
    is coded and working, and THEN see if you might be able to prefetch
    something.

    >I'm basically looking at a bitscanner here, the data is in memory in
    >big-endian order. I load the input data onto a register, the bytes are
    >swapped. I can use bswap, which is a fast operation, ...

    It isn't on 486 and Pentium.
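
    For comparison, here is what bswap does to a big-endian dword, next
    to an equivalent sequence that avoids the instruction entirely. This
    is only a sketch: esi is assumed to point at the input, and no timing
    claim is made for either version, so measure on the processor you
    care about.

    ```nasm
            mov     eax, [esi]          ; memory bytes 12 34 56 78
                                        ; -> eax = 78563412h
            bswap   eax                 ; eax = 12345678h

            ; same result without bswap:
            mov     eax, [esi]          ; eax = 78563412h
            xchg    al, ah              ; eax = 78561234h
            rol     eax, 16             ; eax = 12347856h
            xchg    al, ah              ; eax = 12345678h
    ```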

    -- 
    - Tim Roberts, timr@probo.com
      Providenza & Boekelheide, Inc.
    
