furrtek (./28) :
Noïce
I still have to update the fix page; I think I was confused by the cropping of some TVs. I have to check, but I think it's 40x28 in NTSC and 40x32 (not 30) in PAL. So 16 more pixels at the top and bottom.
    lea     $100000,a0              ; palette source
    lea     $400000,a1              ; palette destination
loop
    add.l   #$34,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=0
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=34
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=68
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=9c
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a5
    movem.l d0-d7/a2-a5,-(a1)       ; a1=d0
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=0
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=34
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=68
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=9c
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a5
    movem.l d0-d7/a2-a5,-(a1)       ; a1=d0
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=0
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=34
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=68
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=9c
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a5
    movem.l d0-d7/a2-a5,-(a1)       ; a1=d0
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=0
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=34
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=68
    add.l   #$68,a1
    movem.l (a0)+,d0-d7/a2-a6
    movem.l d0-d7/a2-a6,-(a1)       ; a1=9c
    add.l   #$64,a1
    movem.l (a0)+,d0-d7/a2-a5
    movem.l d0-d7/a2-a5,-(a1)       ; a1=d0
    add.l   #$30,a1
    cmp.l   #$102000,a0             ; end of copy?
    bne     loop
    rts

    lea     CHUNKY_BUFFER,a0
    lea     PALETTES+16*2*5,a1
    sub.l   #32,a1
    rept    250
    add.l   #64,a1
    sub.l   #2,a0
    movem.l (a0)+,d0-d7
    movem.l d0-d7,-(a1)
    endr

    lea     PALETTES+16*2*5+2,a0
    lea     CHUNKY_BUFFER,a1
    rept    250
    move.l  (a1)+,(a0)+
    move.l  (a1)+,(a0)+
    move.l  (a1)+,(a0)+
    move.l  (a1)+,(a0)+
    move.l  (a1)+,(a0)+
    move.l  (a1)+,(a0)+
    move.l  (a1)+,(a0)+
    move.w  (a1)+,(a0)+
    add.l   #2,a0
    endr
blastar (./33) :
I updated 'DIFF' to v1.1
- no more tearing
- uses 2x20 sprites (instead of 4x20)
- screen update is triggered (saves some frames)
blastar >
    lea     PALETTES+16*2*5+2,a0
    lea     CHUNKY_BUFFER,a1
    rept    250
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.w  (a1)+,(a0)+             ; 12 cycles
    add.l   #2,a0                   ; 16 cycles
    endr
168 cycles per palette

    lea     CHUNKY_BUFFER,a0
    lea     PALETTES+16*2*5,a1
    sub.l   #32,a1
    rept    250
    add.l   #64,a1                  ; 16 cycles
    sub.l   #2,a0                   ; 16 cycles
    movem.l (a0)+,d0-d7             ; 12 + (8 * 8) = 76 cycles
    movem.l d0-d7,-(a1)             ; 8 + (8 * 8) = 72 cycles
    endr
180 cycles per palette
Zerosquare (./41) :
These are very interesting suggestions!
./38 > Folco, lured by the scent of 68k assembler...
[cut blastar's code]
(hint : you can save 8 cycles by using addq.l #2,a0 instead of add.l #2,a0)
vs
[cut your movem code]
180 cycles per palette
-> You're right, movem is actually slower here... surprising!
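As a rough sketch, here is the move.l loop with the addq hint applied. The cycle counts are the standard 68000 timings and assume no wait states on the palette RAM, which has not been checked here.

```asm
    lea     PALETTES+16*2*5+2,a0    ; first palette, skipping color #0
    lea     CHUNKY_BUFFER,a1
    rept    250
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.l  (a1)+,(a0)+             ; 20 cycles
    move.w  (a1)+,(a0)+             ; 12 cycles
    addq.l  #2,a0                   ; 8 cycles (add.l #2,a0 takes 16)
    endr
```

That comes to 7 * 20 + 12 + 8 = 160 cycles per palette instead of 168.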
If you've got some time left before VBLANK, here's another strategy:
Preconvert your chunky buffer to a palette buffer (basically, insert color #0 before each 15-entry palette) before VBLANK. Then, during VBLANK, copy the whole buffer to the palette registers using movem. To copy 28 colors that way, the movem version (using 14 registers) needs 244 cycles, instead of 280 cycles for move.l (a1)+,(a0)+. So you can set more palettes during VBLANK, even though it uses more cycles per frame overall.
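A minimal sketch of the preconversion step, to be run outside VBLANK. PALETTE_BUFFER is a hypothetical work-RAM label (250 * 16 words) that does not appear in the thread, and the value written for color #0 is assumed not to matter.

```asm
; Assumptions: CHUNKY_BUFFER holds 250 * 15 words as in the code above;
; PALETTE_BUFFER is a hypothetical buffer of 250 * 16 words in work RAM.
    lea     CHUNKY_BUFFER,a0        ; source: 15 colors per palette
    lea     PALETTE_BUFFER,a1       ; destination: 16 colors per palette
    move.w  #250-1,d7               ; dbra loops until d7 reaches -1
preconvert
    clr.w   (a1)+                   ; color #0 placeholder
    move.l  (a0)+,(a1)+             ; colors 1-2
    move.l  (a0)+,(a1)+             ; colors 3-4
    move.l  (a0)+,(a1)+             ; colors 5-6
    move.l  (a0)+,(a1)+             ; colors 7-8
    move.l  (a0)+,(a1)+             ; colors 9-10
    move.l  (a0)+,(a1)+             ; colors 11-12
    move.l  (a0)+,(a1)+             ; colors 13-14
    move.w  (a0)+,(a1)+             ; color 15
    dbra    d7,preconvert

; VBLANK copy cost for 28 colors (14 longwords), standard 68000 timings:
;   movem.l (a0)+,<14 regs>   12 + 8 * 14 = 124 cycles
;   movem.l <14 regs>,-(a1)    8 + 8 * 14 = 120 cycles   -> 244 total
;   14 x move.l (a1)+,(a0)+   14 * 20                    -> 280 total
```

The preconversion itself costs extra cycles every frame, but they are spent outside VBLANK, which is the trade-off described above.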
I've thought about using self-modifying code too, but actually it's not faster than move.l (a1)+,(a0)+.
    add.l   a2,a1                   ; 8 cycles
    subq.l  #2,a0                   ; 8 cycles
    movem.l (a0)+,d0-d7             ; 12 + (8 * 8) = 76 cycles
    movem.l d0-d7,-(a1)             ; 8 + (8 * 8) = 72 cycles
164 cycles. Hey, 4 cycles faster!

    MOVEM.w CHUNKY_BUFFER,d0-d7/a0-a6           ; 80 cycles - read 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2       ; 76 cycles - write 15 words at +2 offset
    MOVEM.w CHUNKY_BUFFER+30,d0-d7/a0-a6        ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+34      ; offset 1*32+1*2
    MOVEM.w CHUNKY_BUFFER+60,d0-d7/a0-a6        ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+66      ; offset 2*32+2*2
    ...
156 cycles and 16 bytes of code per palette!

    MOVE.l  a7,STACKSTORE
    LEA     CHUNKY_BUFFER,a7
    MOVEM.w (a7)+,d0-d7/a0-a6                   ; 72 cycles - read 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2       ; 76 cycles - write 15 words at +2 offset
    MOVEM.w (a7)+,d0-d7/a0-a6                   ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+34      ; offset 1*32+1*2
    MOVEM.w (a7)+,d0-d7/a0-a6                   ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+66      ; offset 2*32+2*2
    ...
    MOVE.l  STACKSTORE,a7
    RTS

    LEA     image+64,a7                 ; 12 cycles
    MOVEM.l testdata,d0-d7/a0-a6        ; d0-d7: 0123456789abcde0 a0-a6: 123456789abcde read = 20 + (15 * 8) = 140
    MOVEM.l a0-a6,-(a7)                 ; a0-a6: 12 34 56 78 9a bc de (0) written = 64 cycles
    MOVE.w  d7,-(a7)                    ; d7.w: 0 written at correct position = 8 cycles
    MOVEM.l d0-d7,-(a7)                 ; d0-d7: _0123456789abcde (0) written = 72 cycles
    ; 296 cycles for 2 palettes -> 148 cycles/palette
    LEA     image+128,a7                ; 12 cycles
    MOVEM.l testdata+60,d0-d7/a0-a6     ; d0-d7: 0123456789abcde0 a0-a6: 123456789abcde read = 20 + (15 * 8) = 140
    MOVEM.l a0-a6,-(a7)                 ; a0-a6: 12 34 56 78 9a bc de (0) written = 64 cycles
    MOVE.w  d7,-(a7)                    ; d7.w: 0 written at correct position = 8 cycles
    MOVEM.l d0-d7,-(a7)                 ; d0-d7: _0123456789abcde (0) written = 72 cycles
    ; 296 cycles for 2 palettes -> 148 cycles/palette

    MOVE.l  a7,STACKSTORE
    MOVEM.l CHUNKY_BUFFER+64*0,d0-d7/a0-a7
    MOVEM.l d0-d7/a0-a7,PALETTES+16*2*5+64*0
    MOVEM.l CHUNKY_BUFFER+64*1,d0-d7/a0-a7
    MOVEM.l d0-d7/a0-a7,PALETTES+16*2*5+64*1
    ...
    MOVEM.l CHUNKY_BUFFER+64*124,d0-d7/a0-a7
    MOVEM.l d0-d7/a0-a7,PALETTES+16*2*5+64*124
    MOVE.l  STACKSTORE,a7

    MOVE.l  a7,STACKSTORE
    LEA     CHUNKY_BUFFER,a7
    MOVEM.l (a7)+,d0-d7/a0-a6
    MOVEM.l d0-d7/a0-a6,PALETTES+16*2*5+60*0
    MOVEM.l (a7)+,d0-d7/a0-a6
    MOVEM.l d0-d7/a0-a6,PALETTES+16*2*5+60*1
    ...
    MOVE.l  STACKSTORE,a7
blastar (./52) :
So for a copy loop including #0 this seems to be the final option.
you are right, this way it's a bit faster - 47 lines.
    MOVE.l  a7,STACKSTORE
    LEA     CHUNKY_BUFFER,a7
    LEA     PALETTES+0x1FE0,a6
    MOVEM.l (a7)+,d0-d7/a0-a5
    MOVEM.l d0-d7/a0-a5,-(a6)
    MOVEM.l (a7)+,d0-d7/a0-a5
    MOVEM.l d0-d7/a0-a5,-(a6)
    ...
    MOVE.l  STACKSTORE,a7
Dresdenboy (./53) :
blastar (./52) :
So for a copy loop including #0 this seems to be the final option.
you are right, this way it's a bit faster - 47 lines.
Did you also test the movem.W loop for skipping color #0? (Posts #43 and #45)
This means that, depending on the rendering requirements (whether color #0 rows or columns are possible), there are fast copy loops both with and without skipping color #0.
    MOVE.l  a7,STACKSTORE
    LEA     CHUNKY_BUFFER,a7
    MOVEM.w (a7)+,d0-d7/a0-a6
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*0
    MOVEM.w (a7)+,d0-d7/a0-a6
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*1
    ...
    MOVEM.w (a7)+,d0-d7/a0-a6
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*249
    MOVE.l  STACKSTORE,a7
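The 250 unrolled pairs elided by the "..." above could probably be generated by the assembler rather than typed out. The following is only a sketch: OFFSET is a hypothetical symbol, and redefining a symbol with `set` inside `rept` is assumed to be supported (the exact syntax varies between assemblers).

```asm
    move.l  a7,STACKSTORE           ; a7 is reused as the source pointer
    lea     CHUNKY_BUFFER,a7
OFFSET  set     0                   ; hypothetical assembly-time counter
    rept    250
    movem.w (a7)+,d0-d7/a0-a6       ; read 15 words (sign-extended into the registers)
    movem.w d0-d7/a0-a6,PALETTES+16*2*5+2+OFFSET    ; write 15 words, skipping color #0
OFFSET  set     OFFSET+32           ; next palette is 16 words further on
    endr
    move.l  STACKSTORE,a7           ; restore the stack pointer
```

While a7 points into the buffer, any interrupt or exception would push its stack frame there, so this presumably has to run with interrupts masked (for example from within the VBLANK handler).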
blastar (./56) :
The movem.w loop without color #0 and using A7 is not that slow: 49 lines
    MOVE.l  a7,STACKSTORE
    LEA     CHUNKY_BUFFER,a7
    MOVEM.w (a7)+,d0-d7/a0-a6
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*0
    MOVEM.w (a7)+,d0-d7/a0-a6
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*1
    ...
    MOVEM.w (a7)+,d0-d7/a0-a6
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*249
    MOVE.l  STACKSTORE,a7
Razoola (./55) :
The movem.w variant completely avoids reading or writing values twice and/or writing color #0. It reads 15 words and writes 15 words, and it reads linearly through (a7)+ with no address adjustments. That's where this method wins. It pays the movem initialization cycles twice per palette, but it also needs only 4 cycles per word.
Dresdenboy (./53) :
Those are not going to be faster than the longword method because there have to be twice as many opcodes. They were defo the best way, though, if blastar never added color #0 to the buffer.
blastar (./52) :
So for a copy loop including #0 this seems to be the final option.
you are right, this way it's a bit faster - 47 lines.
Did you also test the movem.W loop for skipping color #0? (Posts #43 and #45)
This means that, depending on the rendering requirements (whether color #0 rows or columns are possible), there are fast copy loops both with and without skipping color #0.