
Questions about mips asm/inline asm and other stuff

Posted: Sat Oct 27, 2007 8:30 am
by ffgriever
OK, I know this is not strictly PS2 development related (it's more MIPS gcc specific), but right now I'm not sure whether these "problems" are more connected to the PS2 than I think.

Anyway, first question. Consider some simple code that fetches a word from a memory address that is not always aligned (and will in fact be unaligned most of the time).

Code: Select all

void fetch_and_do_something(u32 * srcdst)
{
   u32 tmpDest;
   __asm__ (
	   "lwl %0, 3(%1)\n\t"
	   "lwr %0, 0(%1)\n\t"
	   :"=r"(tmpDest)
	   :"r"(srcdst)
   );
  //do something here with tmpDest
  //...
  //store tmpDest to the same location using similar syntax (swl/swr)
}
The problem is that with no optimizations and with debugging info added (compiled with "ee-gcc -D_EE -G0 -g"), this code gives me TLB load exceptions at the lwr (bad address). Checking the assembly made the reason obvious. The code has been compiled into (simplified):

Code: Select all

#standard sp/fp stuff
sd $a0, srcdst_stack($sp)
lw $v0, srcdst_stack($sp)
lwl $v0, 3($v0)
lwr $v0, 0($v0)
Fine, so as I wrote, it's obvious why it leads to an exception: the lwl overwrites $v0, which is then used again as the base register for the lwr, and that shouldn't happen unless I ask for it. The funny part is that it doesn't happen when compiled with "-O2" or "-O3" (separate registers are used for the base and the loaded value). Am I doing something wrong? Is this normal behavior? The same happens with other instructions (lw/lh/lb, etc.).

I came up with something like this to work around the problem when debugging.

Code: Select all

void fetch_and_do_something(u32 * srcdst)
{
   u32 tmpDest;
   __asm__ (
	   "lwl $t9, 3(%1)\n\t"
	   "lwr $t9, 0(%1)\n\t"
	   "sw $t9, %0\n\t"
	   :"=m"(tmpDest)
	   :"r"(srcdst)
	   :"t9"
   );
  //do something here with tmpDest
  //...
  //store tmpDest to the same location using similar syntax (swl/swr)
}
But this one is less efficient: even with "-O3" it uses an additional daddu and always stores the result on the stack and then loads it again, whereas the previous version kept the value in a register throughout the function.

Well, I think I can have two versions of this simple code (one for debug, one for optimized compilation)... but it doesn't seem like a good idea.

Now, I have to say I'm not very good at MIPS assembly (I have some experience with the PSX) or at gcc-specific stuff (I mostly program for Win32, using MSVS most of the time, and recently Visual C# more than anything else).

Second one: what is the cost of lwl/lwr and swl/swr? Sure, I can run my own tests, but maybe someone already has (I didn't find specific data). What I mean is that it may simply be more efficient to use other methods. In this sample, the source/destination is not always word aligned but is always halfword aligned, so it's quite easy to load/store it another way, like:

Code: Select all

tmpDest = ((*(u16*)srcdst)&0xffff)|((*(((u16*)srcdst)+1))<<16);
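The same idea can be written as a pair of small helpers. This is only a sketch: the function names are made up for illustration, it assumes the low halfword sits at the lower address (as on a little-endian target like the EE), and it says nothing about whether this is faster than lwl/lwr, which would still need measuring.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit value from a halfword-aligned
 * address using two 16-bit loads (low halfword first). */
static inline uint32_t load_u32_from_halfwords(const uint16_t *src)
{
    return (uint32_t)src[0] | ((uint32_t)src[1] << 16);
}

/* Hypothetical helper: write a 32-bit value back to a halfword-aligned
 * address using two 16-bit stores (low halfword first). */
static inline void store_u32_as_halfwords(uint16_t *dst, uint32_t value)
{
    dst[0] = (uint16_t)(value & 0xffffu);
    dst[1] = (uint16_t)(value >> 16);
}
```

Taking a u16 pointer directly also avoids casting a u32 pointer to a u16 pointer at every call site.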
The third question is:

Do you know of an efficient way to convert 16-bit (5:5:5:1) textures into 32-bit (8:8:8:8) and vice versa? I've been able to come up with something like this (for 16-bit to 32-bit; for the other direction just reverse the order and pack instead of extend):

Code: Select all

u64 sec;
u128 tempColor;
u16 texture16bit[size]; //64b aligned
u32 texture32bit[size]; //128b aligned
for(pixel=0;pixel<size;pixel+=4)
{
   sec=*(u64*)&texture16bit[pixel];
   __asm__(
      "pexcw %1, %1\n\t"
      "pexch %1, %1\n\t"
      "pext5 %0, %1\n\t"
      :"=r"(tempColor):"r"(sec)
   );
   *(u128*)&texture32bit[pixel]=tempColor;
}
It converts four pixels at a time, so it's a little more efficient than what I had been doing until now, and it can be even better with unrolling up to four times (16 pixels at a time; more gives no further gain)... but it is not what I would expect.
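For checking the output of the multimedia-instruction path, a plain scalar reference may be handy. This is a sketch under assumptions that are not stated in the thread: red in the low five bits and alpha in bit 15 of the 16-bit pixel, and each 5-bit channel placed in the top bits of its byte with the low bits left at zero, which is how I understand pext5 to lay things out. The function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand one 1:5:5:5 pixel to 8:8:8:8. Assumed layout: R = bits 0-4,
 * G = bits 5-9, B = bits 10-14, A = bit 15. Low bits of each output
 * byte stay zero (no replication of the high bits). */
static inline uint32_t expand_1555_to_8888(uint16_t p)
{
    uint32_t r = (uint32_t)(p & 0x1f);
    uint32_t g = (uint32_t)((p >> 5) & 0x1f);
    uint32_t b = (uint32_t)((p >> 10) & 0x1f);
    uint32_t a = (uint32_t)((p >> 15) & 0x1);
    return (r << 3) | (g << 11) | (b << 19) | (a << 31);
}

/* Pack one 8:8:8:8 pixel to 1:5:5:5 by truncating each channel. */
static inline uint16_t pack_8888_to_1555(uint32_t p)
{
    uint32_t r = (p >> 3) & 0x1f;
    uint32_t g = (p >> 11) & 0x1f;
    uint32_t b = (p >> 19) & 0x1f;
    uint32_t a = (p >> 31) & 0x1;
    return (uint16_t)(r | (g << 5) | (b << 10) | (a << 15));
}

/* Convert a whole buffer, one pixel at a time. */
static void convert_16_to_32(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = expand_1555_to_8888(src[i]);
}
```

It is far slower than converting four pixels per instruction, so it is only meant for comparing results on a few test buffers.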

Thank you in advance.

Re: Questions about mips assembler/inline assembler

Posted: Sat Oct 27, 2007 9:19 am
by jimparis
ffgriever wrote: Anyway, first question. Consider some simple code that fetches a word from a memory address that is not always aligned (and will in fact be unaligned most of the time).
..
The problem is that with no optimizations and with debugging info added (compiled with "ee-gcc -D_EE -G0 -g"), this code gives me TLB load exceptions at the lwr (bad address). Checking the assembly made the reason obvious. The code has been compiled into (simplified):

Code: Select all

#standard sp/fp stuff
sd $a0, srcdst_stack($sp)
lw $v0, srcdst_stack($sp)
lwl $v0, 3($v0)
lwr $v0, 0($v0)
Hi,
If I understand things right, you need to mark the output as earlyclobber (&), because the output register is written before the asm has finished using the input operands. Something like:

Code: Select all

 __asm__ (
      "lwl %0, 3(%1)\n\t"
      "lwr %0, 0(%1)\n\t"
      :"=&r"(tmpDest)
      :"r"(srcdst)
   );

Posted: Sat Oct 27, 2007 5:50 pm
by ffgriever
Aww... shame on me. Such an easy mistake. I'm still not used to the gcc inline assembler. What's more, this is clearly stated in every gcc inline asm howto. I must have left my mind behind.

Thank you.

(PS. the other questions still apply)

Posted: Sat Oct 27, 2007 9:51 pm
by Mihawk
The IPU (Image Processing Unit) can do RGB32 to RGB16 (not the other way I believe) using the PACK command.

I used the IPU for color space conversion some two years ago but can't find the code;
otherwise I would have posted the relevant parts.

Posted: Mon Oct 29, 2007 6:57 am
by ffgriever
Thanks, but the IPU is busy at that time. The EE still has some idle time, which is why I'm using it to do some precalculations for the next frame. Plus, as you wrote, the IPU only supports packing to 16-bit, not extending to 32-bit (and I need to do both, sometimes on quite large amounts of data).