Out of interest, if you aren't doing so already, have you tried movem.l to copy the palettes? It should be faster than an unrolled move.l (a0)+,(a1)+.
For example, keep the palette in work RAM and use A0 and A1 as pointers, then do something like this (I think I got the A1 additions right). movem.l can't store with (a1)+, so each chunk steps A1 past its end and stores with -(a1); the trailing comments give A1's offset within each $100 block after the store.
lea $100000,a0 ; palette source
lea $400000,a1 ; palette destination
loop add.l #$34,a1 ; 13 longs = $34 bytes: point just past the first chunk
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=0
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=34
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=68
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=9c
add.l #$64,A1
movem.l (a0)+,d0-d7/a2-a5 ; only 12 longs here: 4*13+12 = 64 longs = $100 bytes
movem.l d0-d7/a2-a5,-(a1) ; a1=d0
add.l #$64,a1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=0
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=34
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=68
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=9c
add.l #$64,A1
movem.l (a0)+,d0-d7/a2-a5
movem.l d0-d7/a2-a5,-(a1) ; a1=d0
add.l #$64,a1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=0
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=34
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=68
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=9c
add.l #$64,A1
movem.l (a0)+,d0-d7/a2-a5
movem.l d0-d7/a2-a5,-(a1) ; a1=d0
add.l #$64,a1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=0
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=34
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=68
add.l #$68,A1
movem.l (a0)+,d0-d7/a2-a6
movem.l d0-d7/a2-a6,-(a1) ; a1=9c
add.l #$64,A1
movem.l (a0)+,d0-d7/a2-a5
movem.l d0-d7/a2-a5,-(a1) ; a1=d0
add.l #$30,A1
cmp.l #$102000,a0 ; end of copy?
bne loop
rts
[edited] I had a small mistake in the ASM which is now fixed.
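If your boundaries are OK, replacing add.l with add.w will save a few more cycles (on an address register, add.w sign-extends its operand and still updates all 32 bits, so these small constants are safe). When I wrote this I was comparing 68k copy speed against the CD DMA system and was moving large blocks. This code is faster than DMA, on the top loader at least. You can of course also unroll it completely and use A7 as well for a little extra, provided you save the stack pointer and keep interrupts off while it is in use.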
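Another way to trim the pointer updates, as an untested sketch rather than anything I have measured: movem.l can also store through (a1) and d16(a1), so the per-chunk add can be dropped and A1 bumped once per $100 block with lea. The addresses are the same ones used above; the label name "copy" is just a placeholder, and the syntax assumes an assembler that accepts labels in column 0 without a colon.

```asm
        lea     $100000,a0              ; palette source (work RAM, as above)
        lea     $400000,a1              ; palette destination
copy    movem.l (a0)+,d0-d7/a2-a6       ; read 13 longs ($34 bytes)
        movem.l d0-d7/a2-a6,(a1)        ; store at block offset $00
        movem.l (a0)+,d0-d7/a2-a6
        movem.l d0-d7/a2-a6,$34(a1)     ; store at block offset $34
        movem.l (a0)+,d0-d7/a2-a6
        movem.l d0-d7/a2-a6,$68(a1)     ; store at block offset $68
        movem.l (a0)+,d0-d7/a2-a6
        movem.l d0-d7/a2-a6,$9c(a1)     ; store at block offset $9c
        movem.l (a0)+,d0-d7/a2-a5       ; read 12 longs ($30 bytes)
        movem.l d0-d7/a2-a5,$d0(a1)     ; store at block offset $d0, block ends at $100
        lea     $100(a1),a1             ; advance to the next $100 block
        cmp.l   #$102000,a0             ; end of copy?
        bne     copy
        rts
```

Going by the standard 68000 timing tables, the d16(a1) store should cost 4 cycles more than the -(a1) form, against 12 to 16 cycles for each add it replaces. This version only unrolls one $100 block per pass, so repeating the body with larger offsets would be needed to match the loop overhead of the code above.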