Simple (I think) question I'm having the most impossible time with...



Essentially, I am writing a matrix-type operation in assembly for pure
speed. I've been reading the optimization guides, experimenting, and
making steady progress in cutting the original naive code down into a
fast routine. I moved everything over to aligned SSE instructions, which
gave a pretty large improvement: I load an SSE register with four
single-precision floats, do my math, store the result, and repeat. Now
I've moved on to memory/cache optimization.
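
For readers who want something concrete, here is a rough C-intrinsics
sketch of the load / math / store pattern described above. It is not my
actual assembly: the function name, the scale-and-bias arithmetic, and
the parameters are all made up for illustration. It assumes both arrays
are 16-byte aligned and that n is a multiple of 4. _mm_stream_ps is the
intrinsic that normally compiles to MOVNTPS, the store discussed later
in this post.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical kernel: out[i] = in[i] * scale + bias, four floats per step.
   Assumes in and out are 16-byte aligned and n is a multiple of 4. */
static void scale_bias_stream(const float *in, float *out, size_t n,
                              float scale, float bias)
{
    const __m128 vscale = _mm_set1_ps(scale);
    const __m128 vbias  = _mm_set1_ps(bias);

    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(in + i);               /* aligned load (MOVAPS)   */
        v = _mm_add_ps(_mm_mul_ps(v, vscale), vbias); /* placeholder "math"      */
        _mm_stream_ps(out + i, v);                    /* non-temporal store      */
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
}
```

The real routine does different arithmetic, but the memory traffic has
the same shape: one aligned 16-byte load and one non-temporal 16-byte
store per iteration.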

My 'test' program takes a very large array (~16 MB), does some math on
it, and puts the result into a similarly sized output array. I run this
subroutine 1000 times or so, to amortize the cost of reading in the
data, printing outputs, etc. The total run time is ~60 seconds or so.

Running under Linux with the time command, I end up with nearly 50% of
the execution time in kernel mode. I surmise that this cannot be
optimal. From my own (naive) attempts to diagnose the culprit, I've
come to the conclusion that a very large share (50% or more) of the
time in this loop is spent on the store instruction. That guess came
from a lot of comment-out voodoo, with my own understanding that a lot
of parallelism is lost when you comment out functionally independent
instructions.

So, my question is, what can I do to improve SSE store instructions? I
am currently using the MOVNTPS instruction. Is there any way to improve
on doing a single store per iteration? Is there some sort of prefetch
that would help? Or, perhaps, is something at a higher level in the
memory hierarchy causing this delay... page faults or TLB issues,
maybe? I'm not really sure how to even localize the problem to a
particular area of memory. Or, for that matter, I'm not really sure
anything _is_ wrong... 50% of the execution time being in kernel mode
seems extremely poor to me, but maybe that's how it's supposed to be?
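
One way I could imagine narrowing this down is to read the process
counters before and after the routine with getrusage(), which on Linux
reports user time, system time, and minor/major page-fault counts. The
sketch below is only that, a sketch: the struct, the measure() helper,
and the touch_pages() stand-in workload are hypothetical names, not
part of my program, and the 4096-byte stride assumes 4 KB pages.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Change in the counters of interest across one call (hypothetical type). */
struct usage_delta {
    double user_s;       /* user-mode CPU seconds   */
    double sys_s;        /* kernel-mode CPU seconds */
    long   minor_faults; /* faults served without I/O */
    long   major_faults; /* faults that needed I/O    */
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run fn(arg) and report how the process counters changed around it. */
static struct usage_delta measure(void (*fn)(void *), void *arg)
{
    struct rusage before, after;
    struct usage_delta d;

    getrusage(RUSAGE_SELF, &before);
    fn(arg);
    getrusage(RUSAGE_SELF, &after);

    d.user_s = tv_seconds(after.ru_utime) - tv_seconds(before.ru_utime);
    d.sys_s  = tv_seconds(after.ru_stime) - tv_seconds(before.ru_stime);
    d.minor_faults = after.ru_minflt - before.ru_minflt;
    d.major_faults = after.ru_majflt - before.ru_majflt;
    return d;
}

/* Stand-in workload: allocate a buffer and write one byte per 4 KB. */
static void touch_pages(void *arg)
{
    size_t bytes = *(size_t *)arg;
    volatile char *buf = malloc(bytes);
    if (buf == NULL)
        return;
    for (size_t i = 0; i < bytes; i += 4096)
        buf[i] = 1;
    free((void *)buf);
}
```

Calling measure() around each of the 1000 passes (in place of
touch_pages) would show whether the kernel time and the page faults
keep accumulating on every pass or only show up on the first one.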
