Re: which way is faster?



On Mon, 14 Jan 2008 15:01:52 +0100, Wolfgang Kern <nowhere@xxxxxxxx> wrote:


Wannabee wrote:

[about time]

:) I still keep making these "brainos" (n1 frank)

< http://szmyggenpv.com/downloads/ >

if the peekmessage.dispatch-using bitblit is the one you mean here,
then I won't be surprised that it is slow ...
Call the API for every single dot ?

It is ANTI ANTI aliasing. Just 1/4 of a dot every run.

I get a numeric figure of 970 (+/-2) on this test.

Then your VESA isn't fast at all. But your gc and Windows are.

btw: I needed a three-finger salute to end it.

ALT+F4

Where is yours?
To directly access a flat VideoRAM you need a 32-bit OS which allows
writing to this memory range (best without paging issues).

ok. But I still don't understand why you cannot just extract that code,
insert it in the DOS file, go to 32-bit flat mode and just do the blts
and write the numbers. After 26 years building an OS I imagine you
could do that inside of 10 minutes?

It is possible during a DOS600 session, but rarely within a windoze DOSbox.
First it needs to scan the VESA-BIOS for supported modes and create
a list with mode specific data and capabilities.
Then it must have:
full PM32 support with forward/backward links
any kind of memory manager which allows access to PCI-ranges,
which also asks for a PCI device detector ...
So what you ask me here would be a tiny OS on top of DOS.
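
To make the "list with mode specific data" a bit more concrete, here is a minimal C sketch of what one entry in such a list might hold and how a mode could be looked up in it. All names here (mode_entry, find_mode, the field names) are made up for illustration; they are not taken from KESYS or from the VESA BIOS structures, and filling the list from the BIOS is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mode record, built once by scanning the VESA BIOS. */
typedef struct {
    uint16_t mode;        /* BIOS mode number */
    uint16_t width;       /* visible pixels per line */
    uint16_t height;      /* visible lines */
    uint8_t  bpp;         /* bits per pixel */
    uint32_t pitch_bytes; /* bytes per scan line */
    uint32_t lfb_phys;    /* physical address of the flat framebuffer, 0 if none */
} mode_entry;

/* Return the first entry matching the requested geometry that also
   offers a flat (linear) framebuffer, or NULL if there is none. */
const mode_entry *find_mode(const mode_entry *list, size_t count,
                            uint16_t width, uint16_t height, uint8_t bpp)
{
    for (size_t i = 0; i < count; i++) {
        if (list[i].width == width && list[i].height == height &&
            list[i].bpp == bpp && list[i].lfb_phys != 0) {
            return &list[i];
        }
    }
    return NULL;
}
```

The point of keeping such a list is that a mode switch then only has to copy a few numbers (line size, line count, framebuffer start) into the drawing code.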

For next "x-mas" then. :))

I am pretty sure I could do that in a couple of hours, or a day,
if I had the info.
(even though I never did any DOS programming)

You can try ... ;)

Yes without sorted info it would have been.

Then how can I verify your findings?
With my app, the difference between an AMD64 and a 1500 MHz Athlon XP
is 4900 copies per second versus just 460+ per second.

That is using the OS BitBlt, which I have considered fast, and to which I
have few alternatives unless I use hardware acceleration.

Yours runs on a much slower computer, but achieves 1/5 of the AMD64 running
at >2 GHz.

I can hardly believe it. Your code is 6+ times faster than the Athlon XP.
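
One way to read the "1/5" and "6+ times" figures together is as clock-normalised throughput. The small C sketch below does that arithmetic. It assumes the 970 figure belongs to the 500 MHz machine mentioned further down and that the 460 figure belongs to the 1500 MHz Athlon XP; that pairing is an interpretation, not something the thread states outright, and the function name is made up.

```c
/* Ratio of copies-per-second per MHz between machine A and machine B.
   A result above 1.0 means A does more copies per clock than B. */
double per_mhz_ratio(double copies_a, double mhz_a,
                     double copies_b, double mhz_b)
{
    return (copies_a / mhz_a) / (copies_b / mhz_b);
}
```

With 970 copies at 500 MHz against 460 copies at 1500 MHz this gives a ratio a little above 6, while the raw 970 against 4900 is roughly 1/5.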

ok, the 32-bit colour part looks like:

usage:
MOV eax,00001019h |INT 7F ;set VESAmode to 1024*768,32
ecx= 0100 Ysize
ebx= 0100 Xsize
edx= 0 X+Yposition (Y in hw)
eax= 0 colour mask
esi= source ;btw: KESYS.bitmaps aren't stored upside down!
AND [vflag],0f0 ;clear all options
CALL draw_bmp
MOV eax,00001009h |INT 7F ;set VESAmode to 1024*768,8 again
_________
draw_bmp:
OR edi,ebx
OR edi,ecx

what the heck is this? (above)

Just initialising regs and setting Vmode,
or if you mean the two ORs, they check if both x+y are zero.

(YSize/XSize) I guess.

|JZ ret ;just in case
PUSH ebx
PUSH edx ;[esp]=Xpos [esp+2]=Ypos

;clip_it:
MOVZX eax,w[esp+2] ;eax= Ypos

Stack abuse? :D
Yes, classical 'LOCALs' in here ;)
I could replace it by MOV eax,edx |SHR eax,010
but the value is needed later on too.

MOV edx,0300 ;max lines (altered by Vmode)
ADD eax,ebx
CMP eax,edx |Jc L1>
SUB eax,ebx |MOV ebx,edx |SUB ebx,eax |JS L9>
L1:
MOVZX eax,w[esp] ;Xpos
MOV edx,01000 ;scan line size (altered by Vmode)

Does the Vmode change recode this one? SMC?

Yes, these immediate constant values are altered on Vmode changes.

Why?


ADD eax,ecx
CMP eax,edx |Jc L2>
SUB eax,ecx |MOV ecx,edx |SUB ecx,eax |JS L9>
L2:
MOVZX eax,w[esp+2]
IMUL eax,edx ;y*line size
LEA edi,[eax+screen_start] ;from VESA-info,(altered by Vmode)

nice.

MOVZX eax,w[esp] ;+x for 8-bit, +4*x for 32bit
TEST [Vflag],40h ;indicates 8/32 bit colours
JZ draw8 ;not shown yet
TEST [Vmode],04h ;indicates colour mask active
JNZ draw_32_eax ;not shown yet
LEA edi,[edi+eax*4];

;draw it:
PUSH ecx
L3:MOV eax,edi ;keep the line start
REP MOVSD
ADD eax,edx ;add scan line size
DEC ebx |MOV ecx,[esp]
MOV edi,eax |JNZ L3<
POP ecx
L9: POP edx |POP ebx
ret: RET
___________
You see it's not optimised at all,

? :D Looks very nice to me. short and excellent code I gather.

I could try to improve the loop
with MOVD/MOVNTQ or SSE 128-bit moves, even then any unaligned parts
may destroy the gain.
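
For anyone who doesn't read the asm easily, here is a rough C sketch of the same clip-and-copy idea for the 32-bit case: work out the destination address from y * line size + 4 * x, trim the block at the right and bottom edges, then copy one line at a time and step by the scan line size. The struct, the function name and the parameters are all made up for illustration. It is not a translation of the listing: in the listing ebx counts lines and ecx counts pixels per line, the limits are patched into the code on a mode change, and the source is read contiguously, whereas this sketch keeps the limits in a struct, clips against the width in pixels and steps the source by its full width.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of the active mode. For 1024*768,32 the
   values used in the listing are 0x300 (768) lines and 0x1000 (4096) bytes per line. */
typedef struct {
    uint8_t *screen_start; /* start of the flat framebuffer */
    uint32_t pitch_bytes;  /* scan line size in bytes */
    uint32_t width;        /* visible pixels per line */
    uint32_t height;       /* visible lines */
} video_mode;

/* Copy a w x h block of 32-bit pixels to (x, y), clipped at the right
   and bottom edges. Returns the number of lines actually drawn. */
int draw_bmp32(const video_mode *vm, const uint32_t *src,
               uint32_t x, uint32_t y, uint32_t w, uint32_t h)
{
    if (w == 0 || h == 0) return 0;                        /* nothing to draw */
    if (x >= vm->width || y >= vm->height) return 0;       /* fully off screen */

    uint32_t copy_w = (w > vm->width - x) ? vm->width - x : w;
    uint32_t copy_h = (h > vm->height - y) ? vm->height - y : h;

    uint8_t *dst = vm->screen_start
                 + (size_t)y * vm->pitch_bytes             /* y * line size */
                 + (size_t)x * 4;                          /* + 4*x for 32 bit */

    for (uint32_t line = 0; line < copy_h; line++) {
        memcpy(dst, src, (size_t)copy_w * 4);              /* plays the role of REP MOVSD */
        dst += vm->pitch_bytes;                            /* next scan line */
        src += w;                                          /* next source line */
    }
    return (int)copy_h;
}
```

A C compiler will usually turn the memcpy into whatever wide moves the target supports, so a version like this is not a fair stand-in for timing the hand-written loop; it only shows the logic.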

If you want your OS out of the picture, why don't you just write
it as a DOS image instead?

It won't work in plain DOS because it must use 32-bit code to
access a flat VRAM (usually above 2GB).
EMM and XMS won't do well here, because IRQs become disabled for too
long and may lock up some hardware then.

But it shouldn't be all that hard, should it? Run a .com and break the
barrier with your own code? Or am I speaking out of ignorance here?
I can't imagine it would be much of a job for you.

As said above, I would need to write a tiny OS on top of DOS.
I've planned to release a new DOS6 based DEMO soon anyway.

btw, I still have the copy of your demo. Will it run on that?

I think this DEMO was a version.000 or 001, so it won't contain
the bitmap draw nor any 32-bit colour support.

ok. I like your code, but would very much like to see it running
with printed numbers (fps), as fast as it can run. Since we can't do that
I would just have to trust you ... (I am not really hardwired for that)
:)

So you're pushing 1/5 of the performance of an AMD64 (400 MHz FSB) on a
500 MHz antique AMD?
And 6 times that of a 266 MHz FSB Athlon?
hmmm.....(teeth-gnashing sounds) ..... Get out of here!

I don't know how to interpret your '970' message.
My estimate was about three times faster than windoze.

This is 970 builds of the bitmap, each second.
What resolution?

What did you expect when you compare a HLL-driven peek&pokeOS solution
with one written in machine code running in 'un'-protected mode without
paging?

I had expected that the results would even out.
Given what you just said, maybe they do, if you use the same resolution that I do (1024x768x32) for the K7. I now even think that Windows may be faster than your VESA,
or very close. For all I know, it uses VESA, why not.

But that's hard to tell when you aren't able to provide code.
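
To show why the resolution matters so much here: the same copies-per-second figure means very different amounts of memory traffic depending on the size of what is copied. The C sketch below just does that multiplication. The function name is made up, and the two sizes used in the note after it (a full 1024x768x32 frame, and the 0100 x 0100 hex, i.e. 256x256, bitmap from the usage example) are only assumptions, since the thread does not say what size the 970 figure was measured with.

```c
#include <stdint.h>

/* Bytes written per second for a given blit rate and bitmap geometry. */
uint64_t blit_bytes_per_second(uint32_t blits_per_second,
                               uint32_t width, uint32_t height,
                               uint32_t bits_per_pixel)
{
    uint64_t bytes_per_blit = (uint64_t)width * height * (bits_per_pixel / 8);
    return bytes_per_blit * blits_per_second;
}
```

At 970 copies per second, a full 1024x768x32 frame would come to about 3.05 billion bytes per second, while a 256x256x32 bitmap would come to about 254 million, a factor of 12 apart for the same "970".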


Rewrite it as a DOS image that sets up the flat mode and VESA,
and runs the app.
I know you can do that easily. And I promise you, if you do that I'll read
the code in hex.

Again as above, perhaps I do it one day.

You could even do it in windows.

And I also will then restart the testing of the demo, if you want to.
(I now have enough hardware for dedicating a machine to testing).

This old Demo is almost obsolete by now, I'd wait for the new version.

You want to prove a point, you have the means (easily), so what's
stopping you?

I don't need to sell my solution in this NG, and for me it's enough
to know that my code performs much faster than windoze/L'unix/or else.

Until you post the resolution you ran my app in, that's unknown.

At this level, with code like yours, the code plays a lesser part and the hardware is the only limiting factor. This is not where asm has advantages. If your times are correct then the code only counts for 3.1% of the speed.

__
wolfgang