Amiga-Development



Author Topic: 68k Assembler help  (Read 18750 times)


Matt Hey

  • Sr. Member
  • ****
  • Posts: 293
Re: 68k Assembler help
« Reply #45 on: April 19, 2013, 03:16:40 AM »

Quote
The graphical issue is still there with your C2P. In your console, check that 'cl_test' is set to 1: open the console and type 'cl_test', and it will tell you what it is currently set to.

Ok. I see it now. I can't believe I played that much without noticing before :-[. Thanks.

Matt Hey

  • Sr. Member
  • ****
  • Posts: 293
Re: 68k Assembler help
« Reply #46 on: April 22, 2013, 03:47:26 AM »

Quote
I just downloaded it from EAB myself, ran it, and managed to capture the glitches in the attached screen grab. You should be able to see that half of the pillar contains a messed-up texture.

I have 2 new versions of DrawSpans16() in assembler to test. The 1st is my attempt at fixing the original assembler version and is the fastest (there is a FAST_DIV option too). The 2nd version is closer to the original C source but slower. Both versions do an integer division at the same time as an FDIV (for free, if I understand the docs) to create an integer reciprocal approximation that is used later to replace 2 integer divisions with multiplies. This is more accurate than John's version using 2 word divisions while hopefully saving ~40 cycles (~70 over the full-accuracy long divisions a compiler would use). It would be much easier and fully accurate if the 68060 had the integer 32x32=64 multiply in hardware for integer reciprocal approximations. I had to brush up on my fixed-point integer math for this one (never my strong point). There are some other modifications and there is still some room for cleanup, but I wanted to limit what I've messed up so far ;). Q2 did not like some of the places where I improved accuracy. Psychedelic wall effects, anyone?
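As a rough illustration of the reciprocal idea (a portable C sketch, not the actual DrawSpans16() assembler): one division produces a 16.16 fixed-point reciprocal, and later divisions by the same value become a multiply and a shift. The names recip16 and div_by_recip are hypothetical, and the operands are assumed to fit in 16 bits so the product fits in 32 bits, mirroring the lack of a hardware 32x32=64 multiply.

```c
#include <stdint.h>

/* Hypothetical helper: 16.16 fixed-point reciprocal of d.
   Assumes 1 <= d <= 65535. The result is truncated: floor(65536 / d). */
static inline uint32_t recip16(uint32_t d)
{
    return 0x10000u / d;
}

/* Approximate n / d with one multiply and one shift, given r = recip16(d).
   Assumes n <= 65535 so that n * r cannot overflow 32 bits. */
static inline uint32_t div_by_recip(uint32_t n, uint32_t r)
{
    return (n * r) >> 16;
}
```

Because the reciprocal is truncated, the result can be one too small: div_by_recip(300, recip16(3)) yields 99 where a true division gives 100. Whether errors of this kind are what shows up as stray pixels in the game is an assumption, but it shows why a reciprocal shortcut is not bit-identical to a division.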

It would be nice if the fps worked in case we ever get an optimized DrawSpans16() to work. Even if it's not an accurate fps, it probably would give us an indication of relative improvement. It might be easier if I had the source to make my own debug versions. The source should be tiny compared to DOSBox, right? Is it C++ again?

NovaCoder

  • Full Member
  • ***
  • Posts: 139
Re: 68k Assembler help
« Reply #47 on: April 22, 2013, 01:35:08 PM »

They didn't work, Matt, so I just sent you my workspace. It's developed in C, so that should make things easier.

I think you'll find the frame rate is more accurate now that I've made a tweak.

Your first ASM version has some strange glitches and the second version didn't do any texture mapping.
« Last Edit: April 23, 2013, 04:12:51 AM by NovaCoder »

Matt Hey

  • Sr. Member
  • ****
  • Posts: 293
Re: 68k Assembler help
« Reply #48 on: April 24, 2013, 11:47:28 AM »

Quote
They didn't work, Matt, so I just sent you my workspace. It's developed in C, so that should make things easier.

There's nothing like being able to do lots of fast builds ;).

Quote
I think you'll find the frame rate is more accurate now that I've made a tweak.

Well, it gives me a number now. I don't know if I would use the word accurate though. I get about 9fps on everything. I guess it does work better than my last assembler attempt though :-[.

Quote
Your first ASM version has some strange glitches and the second version didn't do any texture mapping.

Hopefully I have most of the problems sorted now. The 1st version is the slowest and highest quality, while the 2nd removes the clamping (no visual difference) and turns on FAST_DIV=1 (there are now 3 settings), which causes a few pixel artifacts at longer distances, mostly on edges. It's probably not worth being down and dirty for 9fps like the other though. If they work for you this time, then maybe you can have some fun playing with them. I couldn't really get an idea of speed on my system for comparison.



NovaCoder

  • Full Member
  • ***
  • Posts: 139
Re: 68k Assembler help
« Reply #49 on: April 24, 2013, 03:02:54 PM »

Hiya Matt,

Thanks for that, just tested them on my 1200 using my own map.

1st version: 9.7 FPS
2nd version (the one with the pixel artifacts): 9.8 FPS
John's original version: 9.7 FPS

Is there any way of getting more speed without losing accuracy? Have you had any more thoughts about using Frank's look-up tables from his asm?

Matt Hey

  • Sr. Member
  • ****
  • Posts: 293
Re: 68k Assembler help
« Reply #50 on: April 25, 2013, 03:57:05 AM »

Quote
Thanks for that, just tested them on my 1200 using my own map.

1st version: 9.7 FPS
2nd version (the one with the pixel artifacts): 9.8 FPS
John's original version: 9.7 FPS

You might make sure that this assembler code is called in all cases. There were several versions of the same function in the executable being called before (and the old ones can be removed). I'm surprised there isn't a bigger difference in speed from removing so much code, including divisions and branches (several with incorrect branch logic for 040/060).

You can set FAST_DIV=0 in the noclamp version and it should take care of the pixel artifacts but slow the code down some.

Quote
Is there any way of getting more speed without losing accuracy? Have you had any more thoughts about using Frank's look-up tables from his asm?

Probably not. My technique of replacing divisions with multiplications by a reciprocal estimate actually has more precision than the 2 divisions used, but the algorithm wants the calculation performed in exactly the way described. I ran into the same problem when reorganizing calculations: the changes improved precision and were faster, but they created problems as well. More precision is usually a good thing, but not always. A reciprocal lookup table shouldn't be any faster, as my integer division used to generate the reciprocal estimate should have run in parallel with a floating-point division according to the MC68060UM:

Quote
The floating-point pre-exception model of the MC68060 supports execution overlap
between multi-cycle floating-point instructions and the integer execute engines. Once a
multi-cycle floating-point instruction has started its execution, the primary and secondary
OEPs may continue to dispatch and complete integer instructions in parallel with the
floating-point instructions. The OEPs will stall only if another floating-point instruction is
encountered before the first floating-point instruction has completed its execution. The
floating-point instructions that permit this execution overlap are classified as
pOEP-but-allows-sOEP in Table 10-4.

The table could be generated with results that avoid the pixel problems, but there isn't likely to be a noticeable speed advantage, as there isn't any when using FAST_DIV>0. The 2 integer divisions don't cause much slowdown. They are <=22 cycles, pOEP only (no parallelism), but do have early exit points. My FAST_DIV>0 code should do 1xDIVU in parallel with an FDIV for free and then 2xMULS.L plus 4 shifts, which takes only 6 cycles total if I understand the documentation correctly. Much of the bounds checking should have been done in parallel with FDIV instructions, so there shouldn't be much of a problem there. Maybe the FDIV parallelism stops when a branch instruction is encountered. More tests with timed code would be good. I would expect my code to be 5-15% faster than John's code (which is pretty good), but a faster function might not make as much of a difference, especially if there are bottlenecks elsewhere like memory or cache size limitations, not that I would expect them to be a problem here.
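One way to see why a faster DrawSpans16() may barely move the FPS reading is Amdahl's law. The sketch below is plain C; the function names are hypothetical, and the 10% share of frame time and the 10% function speedup used in the note afterwards are made-up inputs for illustration, not measurements from this thread.

```c
/* Overall speedup when a fraction p (0..1) of the frame time
   is made s times faster and the rest is left unchanged. */
static double overall_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Frame rate to expect after the change, given the frame rate before it. */
static double fps_after(double fps_before, double p, double s)
{
    return fps_before * overall_speedup(p, s);
}
```

With those assumed inputs, fps_after(9.7, 0.10, 1.10) comes out near 9.79, which is about the smallest change a one-decimal FPS counter can show.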

I can see that some of the endian BYTESWAP code is awful, like _LongSwap and _ShortSwap. It's kind of funny because _FloatSwap isn't bad. I'm looking at the unoptimized versions, but I bet these can be improved. I can't tell how much they are being used because most functions go through a function list/table (C++ object-oriented style). This is very bad on the 68060 because there is no indirect branch prediction and these functions can't be inlined. The function lists have labels like _functionList, _mmoveList and _spawns. Both the integer and fp math routines are poor also. ___mulsi3 calls the utility.library SMult32() function (32x32=32, only needed by the 68000) instead of using MULS.L directly, which is faster and could be inlined. This function is called many times to do an integer 64x64=64, which could be done even in the slow fp unit in a fraction of the time. Many fp instructions call functions from 3 of the IEEE libraries when the FPU is turned on. GCC does not compress immediate floats or convert some FDIVs to FMULs like vasm does, which would help also. I believe this code could run significantly faster but would require a lot of improvements. A profiler might pinpoint some problems.
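On the FDIV-to-FMUL point: replacing x / d with x * (1/d) gives identical results only when 1/d is exactly representable, which for ordinary floats means d is a power of two. The check below is a portable C sketch under the assumption that float is a 32-bit IEEE 754 single; the function name is hypothetical, and it does not describe how vasm or any other assembler decides.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 when d is a normal power of two (positive or negative),
   so that multiplying by 1/d gives the same result as dividing by d.
   Assumes float is a 32-bit IEEE 754 single. */
static int fdiv_to_fmul_is_exact(float d)
{
    uint32_t bits;
    memcpy(&bits, &d, sizeof bits);            /* reinterpret, do not convert */

    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 0 = zero/denormal, 255 = inf/NaN */
    uint32_t mantissa = bits & 0x7FFFFFu;      /* all zero for a power of two */

    return mantissa == 0 && exponent != 0 && exponent != 0xFFu;
}
```

For example, a division by 4.0f can become a multiplication by 0.25f, while a division by 3.0f cannot be replaced without changing some results.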


Team Chaos Leader

  • Administrator
  • Sr. Member
  • *****
  • Posts: 484
  • JC + Asm Coder
Re: 68k Assembler help
« Reply #51 on: April 25, 2013, 05:06:20 AM »

Quote
I believe this code could run significantly faster but would require a lot of improvements.
I believe u just created a lot of new work for urself  ;D

Matt Hey

  • Sr. Member
  • ****
  • Posts: 293
Re: 68k Assembler help
« Reply #52 on: April 25, 2013, 08:42:38 AM »

Quote
I believe this code could run significantly faster but would require a lot of improvements.
I believe u just created a lot of new work for urself  ;D

Maybe. The byte swapping code is easy enough although I expect it would only get inlined once because of that blasted function table :(.

Code: [Select]
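/* Swap the two bytes of a 16-bit value: rotating a word by 8 exchanges its bytes. */
static inline unsigned short ShortSwap(unsigned short val)
{
    __asm__("rol.w #8,%0"
        : "=d"(val)     /* output: any data register */
        : "0"(val)      /* input: same register as the output */
        : "cc");        /* rol changes the condition codes */
    return val;
}

/* Reverse the four bytes of a 32-bit value (unsigned long is 32 bits on m68k). */
static inline unsigned long LongSwap(unsigned long val)
{
    /* swap the low word's bytes, exchange the word halves, swap the new low word's bytes */
    __asm__("rol.w #8,%0;swap %0;rol.w #8,%0"
        : "=d"(val)
        : "0"(val)
        : "cc");
    return val;
}

/* FloatSwap() reverses the same 4 bytes as LongSwap(), but the float has to be
   reinterpreted bit for bit; passing it straight to LongSwap() would convert its value. */
static inline float FloatSwap(float val)
{
    union { float f; unsigned long l; } u;
    u.f = val;
    u.l = LongSwap(u.l);
    return u.f;
}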

Here is the fine compiler output without optimization although the algorithm doesn't change  ::).

Code: [Select]
_ShortSwap:
   link     a5,#-4                      ; 117b8 : 4e55 fffc
   move.l   (8,a5),d0                   ; 117bc : 202d 0008
   move.w   d0,(-2,a5)                  ; 117c0 : 3b40 fffe
   move.w   (-2,a5),d1                  ; 117c4 : 322d fffe
   st       d0                          ; 117c8 : 50c0
   and.b    d0,d1                       ; 117ca : c200
   move.b   d1,(-3,a5)                  ; 117cc : 1b41 fffd
   move.w   (-2,a5),d0                  ; 117d0 : 302d fffe
   move.w   d0,d1                       ; 117d4 : 3200
   asr.w    #8,d1                       ; 117d6 : e041
   st       d0                          ; 117d8 : 50c0
   and.b    d0,d1                       ; 117da : c200
   move.b   d1,(-4,a5)                  ; 117dc : 1b41 fffc
   clr.l    d0                          ; 117e0 : 4280
   move.b   (-3,a5),d0                  ; 117e2 : 102d fffd
   move.l   d0,d1                       ; 117e6 : 2200
   lsl.l    #8,d1                       ; 117e8 : e189
   clr.w    d0                          ; 117ea : 4240
   move.b   (-4,a5),d0                  ; 117ec : 102d fffc
   add.w    d1,d0                       ; 117f0 : d041
   movea.w  d0,a0                       ; 117f2 : 3040
   move.l   a0,d0                       ; 117f4 : 2008
   unlk     a5                          ; 117f6 : 4e5d
   rts

_LongSwap:
   link     a5,#-4                      ; 11810 : 4e55 fffc
   move.l   d2,-(sp)                    ; 11814 : 2f02
   move.l   (8,a5),d1                   ; 11816 : 222d 0008
   st       d0                          ; 1181a : 50c0
   and.b    d0,d1                       ; 1181c : c200
   move.b   d1,(-1,a5)                  ; 1181e : 1b41 ffff
   move.l   (8,a5),d0                   ; 11822 : 202d 0008
   move.l   d0,d1                       ; 11826 : 2200
   asr.l    #8,d1                       ; 11828 : e081
   st       d0                          ; 1182a : 50c0
   and.b    d0,d1                       ; 1182c : c200
   move.b   d1,(-2,a5)                  ; 1182e : 1b41 fffe
   move.l   (8,a5),d0                   ; 11832 : 202d 0008
   move.l   d0,d1                       ; 11836 : 2200
   moveq    #$10,d2                     ; 11838 : 7410
   asr.l    d2,d1                       ; 1183a : e4a1
   st       d0                          ; 1183c : 50c0
   and.b    d0,d1                       ; 1183e : c200
   move.b   d1,(-3,a5)                  ; 11840 : 1b41 fffd
   move.l   (8,a5),d0                   ; 11844 : 202d 0008
   move.l   d0,d1                       ; 11848 : 2200
   moveq    #$18,d2                     ; 1184a : 7418
   asr.l    d2,d1                       ; 1184c : e4a1
   st       d0                          ; 1184e : 50c0
   and.b    d0,d1                       ; 11850 : c200
   move.b   d1,(-4,a5)                  ; 11852 : 1b41 fffc
   clr.l    d1                          ; 11856 : 4281
   move.b   (-1,a5),d1                  ; 11858 : 122d ffff
   moveq    #$18,d0                     ; 1185c : 7018
   lsl.l    d0,d1                       ; 1185e : e1a9
   clr.l    d0                          ; 11860 : 4280
   move.b   (-2,a5),d0                  ; 11862 : 102d fffe
   moveq    #$10,d2                     ; 11866 : 7410
   lsl.l    d2,d0                       ; 11868 : e5a8
   add.l    d0,d1                       ; 1186a : d280
   clr.l    d0                          ; 1186c : 4280
   move.b   (-3,a5),d0                  ; 1186e : 102d fffd
   lsl.l    #8,d0                       ; 11872 : e188
   add.l    d0,d1                       ; 11874 : d280
   clr.l    d0                          ; 11876 : 4280
   move.b   (-4,a5),d0                  ; 11878 : 102d fffc
   add.l    d0,d1                       ; 1187c : d280
   move.l   d1,d0                       ; 1187e : 2001
   move.l   (sp)+,d2                    ; 11880 : 241f
   unlk     a5                          ; 11882 : 4e5d
   rts

_FloatSwap:
   link     a5,#-8                      ; 11892 : 4e55 fff8
   move.l   (8,a5),(-4,a5)              ; 11896 : 2b6d 0008 fffc
   move.b   (-1,a5),(-8,a5)             ; 1189c : 1b6d ffff fff8
   move.b   (-2,a5),(-7,a5)             ; 118a2 : 1b6d fffe fff9
   move.b   (-3,a5),(-6,a5)             ; 118a8 : 1b6d fffd fffa
   move.b   (-4,a5),(-5,a5)             ; 118ae : 1b6d fffc fffb
   move.l   (-8,a5),d0                  ; 118b4 : 202d fff8
   unlk     a5                          ; 118b8 : 4e5d
   rts
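For checking the inline versions away from a 68k compiler, here is a portable C model of what the rol.w / swap / rol.w sequence does, next to the usual shift-and-mask form. This is a sketch only; all function names are hypothetical and it is not code from the Q2 port.

```c
#include <stdint.h>

/* Model of "rol.w #8": rotate a 16-bit word by 8, i.e. exchange its bytes. */
static uint16_t rol16_8(uint16_t w)
{
    return (uint16_t)((w << 8) | (w >> 8));
}

/* Model of the one-instruction ShortSwap(). */
static uint16_t short_swap_model(uint16_t v)
{
    return rol16_8(v);
}

/* Model of LongSwap(): rol.w #8 / swap / rol.w #8 on one data register. */
static uint32_t long_swap_model(uint32_t v)
{
    uint16_t lo = rol16_8((uint16_t)(v & 0xFFFFu)); /* rol.w works on the low word */
    uint16_t hi = (uint16_t)(v >> 16);
    /* swap: the halves change places, so the old high word is now the low word */
    hi = rol16_8(hi);                               /* second rol.w */
    return ((uint32_t)lo << 16) | hi;
}

/* Reference version using shifts and masks, one byte at a time. */
static uint32_t long_swap_shifts(uint32_t v)
{
    return (v << 24) | ((v & 0xFF00u) << 8) | ((v >> 8) & 0xFF00u) | (v >> 24);
}
```

Both long versions turn 0x11223344 into 0x44332211, and applying either one twice returns the original value.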

NovaCoder

  • Full Member
  • ***
  • Posts: 139
Re: 68k Assembler help
« Reply #53 on: April 25, 2013, 12:13:13 PM »

Quote
Hiya Matt,

Thanks for that, just tested them on my 1200 using my own map.

1st version: 9.7 FPS
2nd version (the one with the pixel artifacts): 9.8 FPS
John's original version: 9.7 FPS

Is there any way of getting more speed without losing accuracy? Have you had any more thoughts about using Frank's look-up tables from his asm?

Tried the 1st version again with each of the following settings:

FAST_DIV = 0   ;original 2xDIVS.W instructions
FAST_DIV = 1   ;faster but less accurate
FAST_DIV = 2   ;fastest but least accurate

FAST_DIV = 0 gives the same FPS as John's original.

FAST_DIV = 1 and FAST_DIV = 2 give an extra 10th of a frame per second.

It could well be that even FAST_DIV = 0 is faster than John's, but if the difference is less than a 10th of a frame you won't see it in the FPS reading.

It's probably like you said: the bottleneck is no longer in this function but elsewhere, and it's impossible to tell without a decent gcc profiler (which I don't believe exists).

I've actually removed a whole ton of code from the render loop to only see a tiny effect on the FPS, it's amazing how much code an 060 can process each frame.
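Without a profiler, one fallback is to time suspect functions by hand. Below is a minimal sketch in standard C using clock(); the prof_counter type and the prof_* names are hypothetical, and the resolution of clock() on a given Amiga C library is an assumption that may be too coarse for a single call, so totals over many frames are more meaningful.

```c
#include <time.h>

/* Accumulated time and call count for one instrumented function. */
typedef struct {
    clock_t       total;  /* sum of elapsed clock() ticks */
    unsigned long calls;  /* number of timed calls */
} prof_counter;

/* Call at the start of the region to be timed. */
static clock_t prof_begin(void)
{
    return clock();
}

/* Call at the end of the region with the value from prof_begin(). */
static void prof_end(prof_counter *c, clock_t start)
{
    c->total += clock() - start;
    c->calls++;
}

/* Total time spent in the region, in seconds. */
static double prof_seconds(const prof_counter *c)
{
    return (double)c->total / CLOCKS_PER_SEC;
}
```

Wrapping the body of a renderer function between prof_begin() and prof_end(), then printing prof_seconds() for each counter after a fixed demo run, gives a rough ranking of where the frame time goes.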

Veda

  • Hero Member
  • *****
  • Gender: Male
  • Posts: 1008
  • Sleep is overrated
Re: 68k Assembler help
« Reply #54 on: April 25, 2013, 12:59:06 PM »

Seems the Bus currently is the limiting factor.

Team Chaos Leader

  • Administrator
  • Sr. Member
  • *****
  • Posts: 484
  • JC + Asm Coder
Re: 68k Assembler help
« Reply #55 on: April 26, 2013, 02:17:03 AM »

Quote
I believe this code could run significantly faster but would require a lot of improvements.
I believe u just created a lot of new work for urself  ;D

Maybe. The byte swapping code is easy enough although I expect it would only get inlined once because of that blasted function table :(.

Here is the fine compiler output without optimization although the algorithm doesn't change  ::).


I was going to call the Code Police to have that compiler code arrested and shot. But then I noticed that you said it's "without optimization".  ;D