VFPU playground, code generation for gas-unsupported opcodes

Post by **mrbrown** » Wed Oct 12, 2005 6:11 am

By VFPU assembler I mean gas. If you saw what it took to implement support for register operands you'd have chickened out like I did :).

Oh and games always run in usermode. Just as there's no way for us to get into the kernel once in usermode, official games can't either.

jsgf · Post by **jsgf** » Wed Oct 12, 2005 9:59 am

holger wrote:
jsgf wrote:I would like to see a very simple, thin libvfpu which provides two things:

a set of macros to make inline assembler access to the VFPU easy (like gcc/icc's xmmintrin.h for SSE)

yes, this would be nice, but requires some work in the toolchain, so that gcc knows how to schedule the VFPU registers.

It doesn't require it. It would be nice to have a constraint for VFPU registers so that gcc can reorder the asm statements with respect to other code while maintaining dependencies properly, and doubly nice if there were a gcc type for VFPU register variables, but not essential. A plain asm() with memory-use constraints should be enough.

jsgf wrote:[*] a simple lightweight context switching mechanism to allow multiple libraries to share the VFPU without stomping on each other
becomes obsolete with the above...

Perhaps. But that's quite a bit more work.
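
As a rough illustration of the "plain asm() with memory-use constraints" idea, here is a minimal sketch. The names `vec4` and `vec4_op` are made up for this example, and the asm template is deliberately left empty so the snippet compiles on any gcc target; on the PSP the template would hold the actual VFPU load, operation and store.

```c
#include <assert.h>

/* Four floats, aligned the way quad VFPU loads/stores expect. */
typedef struct {
    float v[4];
} __attribute__((aligned(16))) vec4;

/*
 * Hypothetical wrapper (not part of any existing libvfpu).
 * The template is empty here; real code would put the VFPU
 * instructions in it. The constraints tell gcc which memory
 * the asm touches, so it can order surrounding code correctly
 * without knowing anything about VFPU registers.
 */
static inline void vec4_op(vec4 *dst, const vec4 *src)
{
    assert(dst != 0 && src != 0);
    __asm__ volatile (
        ""                /* VFPU instructions would go here */
        : "+m" (*dst)     /* *dst is read and written by the asm */
        : "m" (*src)      /* *src is read by the asm */
    );
}
```

With only memory operands, gcc never has to allocate a VFPU register itself; the cost is that every wrapper has to go through memory.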

holger · Post by **holger** » Wed Oct 12, 2005 11:15 pm

jsgf wrote:
holger wrote:
jsgf wrote:I would like to see a very simple, thin libvfpu which provides two things:

a set of macros to make inline assembler access to the VFPU easy (like gcc/icc's xmmintrin.h for SSE)

yes, this would be nice, but requires some work in the toolchain, so that gcc knows how to schedule the VFPU registers.
It doesn't require it. It would be nice to have a constraint for VFPU registers so that gcc can reorder the asm statements with respect to other code while maintaining dependencies properly, and doubly nice if there were a gcc type for VFPU register variables, but not essential. A plain asm() with memory-use constraints should be enough.

you can implement this by wrapping all __asm__ volatile (cgen_asm()) macros with inline functions, but it would not be of much use -- the great thing about intrinsics is that you get rid of the load of register scheduling...

jsgf · Post by **jsgf** » Thu Oct 13, 2005 1:16 am

holger wrote:you can implement this by wrapping all __asm__ volatile (cgen_asm()) macros with inline functions, but it would not be of much use -- the great thing about intrinsics is that you get rid of the load of register scheduling...

Yep, I'm with you there. But since that will require a non-trivial amount of gcc hacking, it would be nice to have a workable substitute for now, if nothing else so we can get a feeling for how and where the VFPU is actually useful.

holger · Post by **holger** » Thu Oct 13, 2005 2:28 am

I fear writing big sections of inline asm (macro-based right now, I hope gas support comes soon), or dynamic macro-based code generation, is the only option for now.
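
For readers new to the macro-based approach, here is a minimal sketch of how an opcode macro can be turned into a `.word` directive that gas will accept. `word_asm`, `WORD_STR` and `WORD_STR_` are hypothetical names for this example; the real `cgen_asm` in codegen.h may be defined differently.

```c
#include <string.h>

/* Opcode macro in the style used in this thread (vnrcp.s). */
#define vnrcp_s(vfpu_rd, vfpu_rs) (0xd0180000 | ((vfpu_rs) << 8) | (vfpu_rd))

/* Hypothetical helpers, not the real codegen.h definitions. */
#define WORD_STR_(x) #x
#define WORD_STR(x)  WORD_STR_(x)            /* expand x first, then stringify */
#define word_asm(x)  ".word " WORD_STR(x) "\n"

/* The text the assembler would see for "vnrcp.s reg1, reg2". */
static const char vnrcp_line[] = word_asm(vnrcp_s(1, 2));

/* Small helper so the generated text can be inspected. */
static int starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}
```

The opcode expression is left for the assembler to evaluate, so the same string could be placed inside an `__asm__ volatile (...)` statement on the PSP. It should not be executed on any other target.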

groepaz · Post by **groepaz** » Tue Oct 18, 2005 2:38 am

i have added the info from this thread to my doc, those who have messaged me before can get it from its previous location... let me know if there are any obvious errors :)

MrMr[iCE] · Post by **MrMr[iCE]** » Wed Oct 19, 2005 3:56 pm

I've written up a reference that shows how the registers are mapped in the various single/pair/triple/quad modes. This should help a bit when trying to juggle all those matrices/vectors around the vfpu register space.

http://bradburn.net/mr.mr/vfpu.html
http://bradburn.net/mr.mr/vfpu2.html <-- this one is nice for a cheat sheet

EDIT: I've also added this to the wiki.

holger · Post by **holger** » Thu Oct 20, 2005 5:41 am

nice explanation! but the wiki seems to be down at the moment...

MrMr[iCE] · Post by **MrMr[iCE]** » Thu Oct 20, 2005 10:12 am

Heh, the wiki just links to the first page I pasted above. I'll do a proper entry for the wiki later.

sherpya · Post by **sherpya** » Thu Oct 20, 2005 11:01 am

Why not add it directly to gas? Is that not possible?

MrMr[iCE] · Post by **MrMr[iCE]** » Thu Oct 20, 2005 12:34 pm

No, just very difficult to do... I'm not familiar with adding opcodes to gas, and I have no clue how to get gcc to schedule the register usage. That requires someone who really knows binutils and gcc. For now we'll stick to the macro stuff, much easier to use =)

nugi · Post by **nugi** » Sun Oct 23, 2005 5:09 am

Yeah~ good job.

I'm wondering how to load/save 1- or 2-byte integers to a vector register of the VFPU at once. For example, loading a 32-bit color value (RGBA) into the C000 register, and after some processing writing C000 back to memory (32 bits).

Is it possible? I can't find a way in the current codegen.h. hmm~~~

chp · Post by **chp** » Sun Oct 23, 2005 1:16 pm

I have added very preliminary VFPU support to GUM now, just as a working example. To enable this support, remove the comment from

Code:

//#define GUM_USE_VFPU

in gumInternal.h, rebuild the library and set THREAD_ATTR_VFPU in the desired program. I have tested a few of the samples and they have all run fine.

Only the stack-functions have been fixed so far, I intend to finish the rest of them tomorrow.

Thanks to holger and MrMr[iCE] for their work on this. libpspvgum from MrMr[iCE] was used as initial inspiration for this implementation.

MrMr[iCE] · Post by **MrMr[iCE]** » Sun Oct 23, 2005 2:46 pm

I've updated the wiki again, now there's information on loading/storing values into the vfpu.

holger · Post by **holger** » Mon Oct 24, 2005 6:29 am

maybe a note about lvl.q/lvr.q/svl.q/svr.q makes sense, to ease unaligned loads/stores. Semantics are similar to unaligned word loads/stores.

curlyfuzz · Post by **curlyfuzz** » Mon Oct 24, 2005 3:41 pm

MrMr[iCE] wrote:and here comes another run of ops I've tested:

Code:

/*
+-----------------------------------------+--+--------------+-+--------------+
|31                                    16 |15| 14         8 |7| 6         0  |
+-----------------------------------------+--+--------------+-+--------------+
| opcode 0xd0180000 (s)                   | 0| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
| opcode 0xd0180080 (p)                   | 0| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
| opcode 0xd0188000 (t)                   | 1| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
| opcode 0xd0188080 (q)                   | 1| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
+-----------------------------------------+--+--------------+-+--------------+

	NegativeReciprocal.Single/Pair/Triple/Quad

	vnrcp.s  %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal
	vnrcp.p  %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal
	vnrcp.t  %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal
	vnrcp.q  %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal

	%vfpu_rd:   VFPU Vector Target Register ([s|p|t|q]reg 0..127)
	%vfpu_rs:   VFPU Vector Source Register ([s|p|t|q]reg 0..127)

	vfpu_regs[%vfpu_rd] <- -1/vfpu_regs[%vfpu_rs]
*/

#define vnrcp_s(vfpu_rd, vfpu_rs) (0xd0180000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnrcp_p(vfpu_rd, vfpu_rs) (0xd0180080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnrcp_t(vfpu_rd, vfpu_rs) (0xd0188000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnrcp_q(vfpu_rd, vfpu_rs) (0xd0188080 | ((vfpu_rs) << 8) | (vfpu_rd))


/*
+-----------------------------------------+--+--------------+-+--------------+
|31                                    16 |15| 14         8 |7| 6         0  |
+-----------------------------------------+--+--------------+-+--------------+
| opcode 0xd01a0000 (s)                   | 0| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
| opcode 0xd01a0080 (p)                   | 0| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
| opcode 0xd01a8000 (t)                   | 1| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
| opcode 0xd01a8080 (q)                   | 1| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
+-----------------------------------------+--+--------------+-+--------------+

	NegativeSin.Single/Pair/Triple/Quad

	vnsin.s  %vfpu_rd, %vfpu_rs   ; calculate negative sin
	vnsin.p  %vfpu_rd, %vfpu_rs   ; calculate negative sin
	vnsin.t  %vfpu_rd, %vfpu_rs   ; calculate negative sin
	vnsin.q  %vfpu_rd, %vfpu_rs   ; calculate negative sin

	%vfpu_rd:   VFPU Vector Target Register ([s|p|t|q]reg 0..127)
	%vfpu_rs:   VFPU Vector Source Register ([s|p|t|q]reg 0..127)

	vfpu_regs[%vfpu_rd] <- -sin(vfpu_regs[%vfpu_rs])
*/

#define vnsin_s(vfpu_rd, vfpu_rs) (0xd01a0000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnsin_p(vfpu_rd, vfpu_rs) (0xd01a0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnsin_t(vfpu_rd, vfpu_rs) (0xd01a8000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnsin_q(vfpu_rd, vfpu_rs) (0xd01a8080 | ((vfpu_rs) << 8) | (vfpu_rd))


/*
+-----------------------------------------+--+--------------+-+--------------+
|31                                    16 |15| 14         8 |7| 6         0  |
+-----------------------------------------+--+--------------+-+--------------+
| opcode 0xd01c0000 (s)                   | 0| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
| opcode 0xd01c0080 (p)                   | 0| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
| opcode 0xd01c8000 (t)                   | 1| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
| opcode 0xd01c8080 (q)                   | 1| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
+-----------------------------------------+--+--------------+-+--------------+

	ReciprocalExp2.Single/Pair/Triple/Quad

	vrexp2.s  %vfpu_rd, %vfpu_rs   ; calculate 1/(2^y)
	vrexp2.p  %vfpu_rd, %vfpu_rs   ; calculate 1/(2^y)
	vrexp2.t  %vfpu_rd, %vfpu_rs   ; calculate 1/(2^y)
	vrexp2.q  %vfpu_rd, %vfpu_rs   ; calculate 1/(2^y)

	%vfpu_rd:   VFPU Vector Target Register ([s|p|t|q]reg 0..127)
	%vfpu_rs:   VFPU Vector Source Register ([s|p|t|q]reg 0..127)

	vfpu_regs[%vfpu_rd] <- 1/exp2(vfpu_regs[%vfpu_rs])
*/

#define vrexp2_s(vfpu_rd, vfpu_rs) (0xd01c0000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vrexp2_p(vfpu_rd, vfpu_rs) (0xd01c0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vrexp2_t(vfpu_rd, vfpu_rs) (0xd01c8000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vrexp2_q(vfpu_rd, vfpu_rs) (0xd01c8080 | ((vfpu_rs) << 8) | (vfpu_rd))

Seems like these opcodes are the same as for reciprocal/sin/exp2 but with the flag 0x00080000 ored into the opcode. (meaning to negate the input register before the calculation). Maybe this is a more general feature?
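
The observation is at least arithmetically plausible: -1/x = 1/(-x), -sin(x) = sin(-x) and 1/2^x = 2^(-x), so negating the input would explain all three variants. A small sketch of the bit arithmetic follows. `NEGATE_FLAG` is a made-up name, and the "base" values are simply what remains after clearing the flag, not confirmed encodings.

```c
#include <stdint.h>

/* Hypothetical name for the bit that differs between the variants. */
#define NEGATE_FLAG 0x00080000u

/* Single-element opcodes as listed in the quoted post. */
#define VNRCP_S  0xd0180000u
#define VNSIN_S  0xd01a0000u
#define VREXP2_S 0xd01c0000u

/* Clear the flag to get the candidate base opcode. */
static uint32_t clear_flag(uint32_t opcode)
{
    return opcode & ~NEGATE_FLAG;
}

/* Index of the lowest set bit; the mask must be non-zero. */
static int bit_index(uint32_t mask)
{
    int n = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        n++;
    }
    return n;
}
```

Clearing the flag gives 0xd0100000, 0xd0120000 and 0xd0140000 respectively, and `bit_index(NEGATE_FLAG)` locates the flag at bit 19.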

MrMr[iCE] · Post by **MrMr[iCE]** » Mon Oct 24, 2005 4:56 pm

Honestly I don't know. I'm using the opcode list in binutils, this is what you would see if these opcodes were found in a binary with psp-objdump.

groepaz · Post by **groepaz** » Mon Oct 24, 2005 5:52 pm

bit 24-26 seem to be more like an "extended opcode" field, not directly related to a specific feature...

edit: doh... 0x00080000 isn't bit 24-26 :=P there is indeed a small chance that what you say is true :)

jonny · Post by **jonny** » Fri Nov 04, 2005 3:23 am

is this the latest codegen?

http://svn.pspdev.org/filedetails.php?r ... =1133&sc=1

nugi · Post by **nugi** » Tue Dec 06, 2005 10:34 pm

No more opcodes from this thread?

Here are some more opcodes from my test code... no documentation, sorry~
I don't think there's much point in using gas to assemble VFPU code. Using the opcode macros was sufficient for me and helped me greatly. Thank you~
I hope this thread won't be closed for lack of contributions!

int to short.
#define vi2s_p(vfpu_rd,vfpu_rs) (0xd03f0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2s_q(vfpu_rd,vfpu_rs) (0xd03f8080 | ((vfpu_rs) << 8) | (vfpu_rd))

int to unsigned char.
#define vi2uc_q(vfpu_rd,vfpu_rs) (0xd03c8080 | ((vfpu_rs) << 8) | (vfpu_rd))

int to float.
#define vi2f_s(vfpu_rd,vfpu_rs,scale) (0xd2800000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2f_p(vfpu_rd,vfpu_rs,scale) (0xd2800080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2f_t(vfpu_rd,vfpu_rs,scale) (0xd2808000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2f_q(vfpu_rd,vfpu_rs,scale) (0xd2808080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))

float to int, round to nearest.
#define vf2in_s(vfpu_rd,vfpu_rs,scale) (0xd2000000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_p(vfpu_rd,vfpu_rs,scale) (0xd2000080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_t(vfpu_rd,vfpu_rs,scale) (0xd2008000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_q(vfpu_rd,vfpu_rs,scale) (0xd2008080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))

There are also vf2id and similar instructions with different rounding methods.

Some of the instructions I haven't used myself may be wrong.
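
To make the field layout of the scaled conversions easier to check, here is a small sketch that packs a vi2f.s word and pulls the fields back out. The extractor names are made up for this example, and the 5-bit width of the scale field is an assumption; the macros above only show that it starts at bit 16.

```c
#include <stdint.h>

/* Same encoding as the vi2f_s macro above. */
#define vi2f_s(vfpu_rd,vfpu_rs,scale) (0xd2800000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))

/* Hypothetical field extractors for inspecting a generated word. */
static unsigned field_rd(uint32_t word)
{
    return word & 0x7fu;            /* bits 6..0 */
}

static unsigned field_rs(uint32_t word)
{
    return (word >> 8) & 0x7fu;     /* bits 14..8 */
}

static unsigned field_scale(uint32_t word)
{
    return (word >> 16) & 0x1fu;    /* assumed: 5 bits, 20..16 */
}
```

For example, `vi2f_s(4, 8, 31)` produces 0xd29f0804, which decodes back to rd = 4, rs = 8, scale = 31.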

Brunni · Post by **Brunni** » Sun Dec 11, 2005 10:26 pm

Hello. I have some problems with lv_q. For example, if I do this:

Code:

float vfpu_add(float f1, float f2)
{
   vfpu_vars[0] = f1;
   vfpu_vars[1] = f2;
   register void *ptr __asm ("a0") = vfpu_vars;
   __asm__ volatile (
      cgen_asm(lv_q(0, 0, R_a0, 0))
      cgen_asm(vadd_s(124, 0, 1))
      cgen_asm(sv_q(31, 0 * 4, R_a0, 0))
   : "=r"(ptr) : "r"(ptr) : "memory");
   return vfpu_vars[0];
}

This doesn't work as expected (i.e. add f1 and f2 and return the result). It seems that f1 is not loaded: if I load something else into register 1 beforehand, it keeps that value after lv_q.
Instead, something like this works:

Code:

float vfpu_add(float f1, float f2)
{
   vfpu_vars[0] = f1;
   vfpu_vars[1] = f2;
   register void *ptr __asm ("a0") = vfpu_vars;
   __asm__ volatile (
      cgen_asm(lv_s(0, 0, R_a0, 0))
      cgen_asm(lv_s(1, 1, R_a0, 0))
      cgen_asm(vadd_s(124, 0, 1))
      cgen_asm(sv_q(31, 0 * 4, R_a0, 0))
   : "=r"(ptr) : "r"(ptr) : "memory");
   return vfpu_vars[0];
}

But in the .h file it's indicated:

Code:

    lv.q %vfpu_rt, offset(%base)

   %fpu_rt:   VFPU Vector Target Register (column0-31/row32-63)

But here fpu_rt seems to be just the register number... so maybe it's my fault (I'm a real beginner), or there is really something I didn't understand. If someone could help me please...
Thanks in advance ^^

nugi · Post by **nugi** » Mon Dec 12, 2005 2:17 am

Some information and suggestions.
1. The final address must be aligned to 16 bytes for the quad version (q) and to 4 bytes for the single version (s).
2. Use the Q_C000-style register names defined in codegen.h rather than raw register numbers.
3. Example: lv_q(Q_C000, 0, R_a0)
This loads 16 bytes of data (4 floats) into the Q_C000 register from the address pointed to by R_a0 (== vfpu_vars).
That is, if float vfpu_vars[4]={100, 101, 102, 103}, then S_S000=100, S_S001=101... As noted, vfpu_vars must be 16-byte aligned.
4. A simple test program (newvfpu.c?) will help greatly! Find it in this thread and use it.
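
The alignment requirement in point 1 can be met and checked along these lines. This is a sketch assuming gcc's `aligned` attribute; `is_aligned` is a made-up helper, and the array contents just mirror the example values above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-byte aligned storage for one quad load/store (gcc attribute). */
static float vfpu_vars[4] __attribute__((aligned(16))) =
    {100.0f, 101.0f, 102.0f, 103.0f};

/* Hypothetical helper: true if p is a multiple of alignment,
   which must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

`is_aligned(vfpu_vars, 16)` holds for the array itself, while `&vfpu_vars[1]` is only 4-byte aligned, so it would be acceptable for the single version but not for the quad version.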
