VFPU playground, code generation for gas-unsupported opcodes
jsgf wrote: I would like to see a very simple, thin libvfpu which provides two things:
- a set of macros to make inline assembler access to the VFPU easy (like gcc/icc's xmmintrin.h for SSE)
holger wrote: yes, this would be nice, but requires some work in the toolchain, so that gcc knows how to schedule the VFPU registers.
It doesn't require it. It would be nice to have a constraint for VFPU registers so that gcc can reorder the asm statements with respect to other code while maintaining dependencies properly, and doubly nice if there were a gcc type for VFPU register variables, but not essential. A plain asm() with memory-use constraints should be enough.
jsgf wrote:
- a simple lightweight context switching mechanism to allow multiple libraries to share the VFPU without stomping on each other
holger wrote: becomes obsolete with the above...
Perhaps. But that's quite a bit more work.

jsgf wrote: It doesn't require it. It would be nice to have a constraint for VFPU registers so that gcc can reorder the asm statements with respect to other code while maintaining dependencies properly, and doubly nice if there were a gcc type for VFPU register variables, but not essential. A plain asm() with memory-use constraints should be enough.
you can implement this by wrapping all __asm__ volatile (cgen_asm()) macros with inline functions, but it would not be of much use -- the great thing about intrinsics is that you get rid of the load of register scheduling...

holger wrote: you can implement this by wrapping all __asm__ volatile (cgen_asm()) macros with inline functions, but it would not be of much use -- the great thing about intrinsics is that you get rid of the load of register scheduling...
Yep, I'm with you there. But since that will require a non-trivial amount of gcc hacking, it would be nice to have a workable substitute for now, if nothing else so we can get a feeling for how and where the VFPU is actually useful.
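To make the "wrap the cgen_asm() macros in inline functions" idea concrete, here is a rough sketch. The names hyp_cgen_asm, HYP_STR, HYP_XSTR and hyp_vnrcp_s_1_2 are made up for illustration and are not taken from codegen.h; the real cgen_asm() may be defined differently. The register numbers 1 and 2 are arbitrary. The point it shows is the limitation holger mentions: the register numbers are baked into the opcode word, so gcc cannot pick them for you.

```c
#include <assert.h>
#include <string.h>

/* Opcode builder in the style posted in this thread (vnrcp.s). */
#define vnrcp_s(vfpu_rd, vfpu_rs) (0xd0180000 | ((vfpu_rs) << 8) | (vfpu_rd))

/* Hypothetical stand-in for cgen_asm(): turn an opcode expression into a
   ".word <expr>" line and let the assembler evaluate the expression. */
#define HYP_STR(x)  #x
#define HYP_XSTR(x) HYP_STR(x)
#define hyp_cgen_asm(op) ".word " HYP_XSTR(op) "\n"

/* The text that would be handed to the assembler for
   vnrcp.s with target register 1 and source register 2. */
static const char *hyp_vnrcp_s_text(void)
{
    return hyp_cgen_asm(vnrcp_s(1, 2));
}

#if defined(__mips__)
/* Thin wrapper, only built for MIPS targets. The register numbers must be
   compile-time constants because they are part of the opcode word. */
static inline void hyp_vnrcp_s_1_2(void)
{
    __asm__ volatile (hyp_cgen_asm(vnrcp_s(1, 2)) : : : "memory");
}
#endif

/* Host-side sanity check of the generated assembler text. */
static void hyp_self_check(void)
{
    assert(strncmp(hyp_vnrcp_s_text(), ".word ", 6) == 0);
    assert(strstr(hyp_vnrcp_s_text(), "0xd0180000") != NULL);
}
```

A wrapper like this keeps call sites readable, but each register combination needs its own function (or a macro that takes constant register numbers), which is why real intrinsics would still need gcc support.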
I've written up a reference that shows how the registers are mapped in the various single/pair/triple/quad modes. This should help a bit when trying to juggle all those matrices/vectors around the vfpu register space.
http://bradburn.net/mr.mr/vfpu.html
http://bradburn.net/mr.mr/vfpu2.html <-- this one is nice for a cheat sheet
EDIT: I've also added this to the wiki.
load 1,2 byte integer?
Yeah~ good job.
I'm wondering how to load/save a 1- or 2-byte integer to/from a VFPU vector register at once. For example, loading a 32-bit color value (RGBA) into the C000 register and, after some processing, writing C000 back to memory (32 bits).
Is it possible? I can't find a way in the current codegen.h. Hmm~~~
I have added very preliminary VFPU support to GUM now, just as a working example. To enable this support, remove the comment from
Code: Select all
//#define GUM_USE_VFPU
in gumInternal.h, rebuild the library and set THREAD_ATTR_VFPU in the desired program. I have tested a few of the samples and they have all run fine.
Only the stack functions have been fixed so far; I intend to finish the rest of them tomorrow.
Thanks to holger and MrMr[iCE] for their work on this. libpspvgum from MrMr[iCE] was used as initial inspiration for this implementation.
GE Dominator
MrMr[iCE] wrote: and here comes another run of ops I've tested:
Code: Select all
/*
  +-----------------------------------------+--+--------------+-+--------------+
  |31                                    16 |15| 14         8 |7| 6          0 |
  +-----------------------------------------+--+--------------+-+--------------+
  |          opcode 0xd0180000 (s)          | 0| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
  |          opcode 0xd0180080 (p)          | 0| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
  |          opcode 0xd0188000 (t)          | 1| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
  |          opcode 0xd0188080 (q)          | 1| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
  +-----------------------------------------+--+--------------+-+--------------+

  NegativeReciprocal.Single/Pair/Triple/Quad

    vnrcp.s %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal
    vnrcp.p %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal
    vnrcp.t %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal
    vnrcp.q %vfpu_rd, %vfpu_rs   ; calculate negative reciprocal

    %vfpu_rd: VFPU Vector Target Register ([s|p|t|q]reg 0..127)
    %vfpu_rs: VFPU Vector Source Register ([s|p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- -1/vfpu_regs[%vfpu_rs]
*/
#define vnrcp_s(vfpu_rd, vfpu_rs) (0xd0180000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnrcp_p(vfpu_rd, vfpu_rs) (0xd0180080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnrcp_t(vfpu_rd, vfpu_rs) (0xd0188000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnrcp_q(vfpu_rd, vfpu_rs) (0xd0188080 | ((vfpu_rs) << 8) | (vfpu_rd))

/*
  +-----------------------------------------+--+--------------+-+--------------+
  |31                                    16 |15| 14         8 |7| 6          0 |
  +-----------------------------------------+--+--------------+-+--------------+
  |          opcode 0xd01a0000 (s)          | 0| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
  |          opcode 0xd01a0080 (p)          | 0| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
  |          opcode 0xd01a8000 (t)          | 1| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
  |          opcode 0xd01a8080 (q)          | 1| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
  +-----------------------------------------+--+--------------+-+--------------+

  NegativeSin.Single/Pair/Triple/Quad

    vnsin.s %vfpu_rd, %vfpu_rs   ; calculate negative sin
    vnsin.p %vfpu_rd, %vfpu_rs   ; calculate negative sin
    vnsin.t %vfpu_rd, %vfpu_rs   ; calculate negative sin
    vnsin.q %vfpu_rd, %vfpu_rs   ; calculate negative sin

    %vfpu_rd: VFPU Vector Target Register ([s|p|t|q]reg 0..127)
    %vfpu_rs: VFPU Vector Source Register ([s|p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- -sin(vfpu_regs[%vfpu_rs])
*/
#define vnsin_s(vfpu_rd, vfpu_rs) (0xd01a0000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnsin_p(vfpu_rd, vfpu_rs) (0xd01a0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnsin_t(vfpu_rd, vfpu_rs) (0xd01a8000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vnsin_q(vfpu_rd, vfpu_rs) (0xd01a8080 | ((vfpu_rs) << 8) | (vfpu_rd))

/*
  +-----------------------------------------+--+--------------+-+--------------+
  |31                                    16 |15| 14         8 |7| 6          0 |
  +-----------------------------------------+--+--------------+-+--------------+
  |          opcode 0xd01c0000 (s)          | 0| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
  |          opcode 0xd01c0080 (p)          | 0| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
  |          opcode 0xd01c8000 (t)          | 1| vfpu_rs[6-0] |0| vfpu_rd[6-0] |
  |          opcode 0xd01c8080 (q)          | 1| vfpu_rs[6-0] |1| vfpu_rd[6-0] |
  +-----------------------------------------+--+--------------+-+--------------+

  ReciprocalExp2.Single/Pair/Triple/Quad

    vrexp2.s %vfpu_rd, %vfpu_rs  ; calculate 1/(2^y)
    vrexp2.p %vfpu_rd, %vfpu_rs  ; calculate 1/(2^y)
    vrexp2.t %vfpu_rd, %vfpu_rs  ; calculate 1/(2^y)
    vrexp2.q %vfpu_rd, %vfpu_rs  ; calculate 1/(2^y)

    %vfpu_rd: VFPU Vector Target Register ([s|p|t|q]reg 0..127)
    %vfpu_rs: VFPU Vector Source Register ([s|p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- 1/exp2(vfpu_regs[%vfpu_rs])
*/
#define vrexp2_s(vfpu_rd, vfpu_rs) (0xd01c0000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vrexp2_p(vfpu_rd, vfpu_rs) (0xd01c0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vrexp2_t(vfpu_rd, vfpu_rs) (0xd01c8000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vrexp2_q(vfpu_rd, vfpu_rs) (0xd01c8080 | ((vfpu_rs) << 8) | (vfpu_rd))

Seems like these opcodes are the same as the ones for reciprocal/sin/exp2, but with the flag 0x00080000 ORed into the opcode (meaning: negate the input register before the calculation). Maybe this is a more general feature?
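The 0x00080000 observation is easy to check with plain bit arithmetic. The sketch below only uses the three opcode bases posted above; the names VFPU_NEGATE_FLAG, hyp_has_negate and hyp_strip_negate are made up for this example. Whether the stripped values really are the reciprocal/sin/exp2 opcodes should be confirmed against codegen.h.

```c
#include <assert.h>
#include <stdint.h>

/* Flag value taken from the post above; the name is hypothetical. */
#define VFPU_NEGATE_FLAG 0x00080000u

/* Single-element opcode bases of vnrcp, vnsin and vrexp2 as posted above. */
static const uint32_t hyp_neg_opcodes[3] = {
    0xd0180000u, /* vnrcp.s  */
    0xd01a0000u, /* vnsin.s  */
    0xd01c0000u  /* vrexp2.s */
};

/* Nonzero if the supposed "negate input" bit is set in the opcode. */
static int hyp_has_negate(uint32_t opcode)
{
    return (opcode & VFPU_NEGATE_FLAG) != 0;
}

/* Clear the flag to get the candidate non-negating opcode. */
static uint32_t hyp_strip_negate(uint32_t opcode)
{
    return opcode & ~VFPU_NEGATE_FLAG;
}

/* All three posted opcodes carry the flag; stripping it gives
   0xd0100000, 0xd0120000 and 0xd0140000 respectively. */
static void hyp_check_negate_flag(void)
{
    int i;
    for (i = 0; i < 3; i++)
        assert(hyp_has_negate(hyp_neg_opcodes[i]));
    assert(hyp_strip_negate(hyp_neg_opcodes[0]) == 0xd0100000u);
    assert(hyp_strip_negate(hyp_neg_opcodes[1]) == 0xd0120000u);
    assert(hyp_strip_negate(hyp_neg_opcodes[2]) == 0xd0140000u);
}
```

If the flag is a general feature, ORing it into other single-operand opcodes and testing the result on hardware would be the way to find out.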
Thread down?
No more opcodes from this thread?
Here are some more opcodes from my test code... no documentation, sorry~
I don't think there is much sense in waiting for gas to assemble VFPU code. The way of using VFPU code from this thread was sufficient for me and helped me greatly. Thank you~
I hope this thread won't be closed for lack of contributions!
int to short.
#define vi2s_p(vfpu_rd,vfpu_rs) (0xd03f0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2s_q(vfpu_rd,vfpu_rs) (0xd03f8080 | ((vfpu_rs) << 8) | (vfpu_rd))
int to unsigned char.
#define vi2uc_q(vfpu_rd,vfpu_rs) (0xd03c8080 | ((vfpu_rs) << 8) | (vfpu_rd))
int to float.
#define vi2f_s(vfpu_rd,vfpu_rs,scale) (0xd2800000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2f_p(vfpu_rd,vfpu_rs,scale) (0xd2800080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2f_t(vfpu_rd,vfpu_rs,scale) (0xd2808000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vi2f_q(vfpu_rd,vfpu_rs,scale) (0xd2808080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
float to int round to near.
#define vf2in_s(vfpu_rd,vfpu_rs,scale) (0xd2000000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_p(vfpu_rd,vfpu_rs,scale) (0xd2000080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_t(vfpu_rd,vfpu_rs,scale) (0xd2008000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_q(vfpu_rd,vfpu_rs,scale) (0xd2008080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
There are also other float-to-int instructions, such as vf2id, with different rounding methods.
Some of the instructions above that I have not used myself may be wrong.
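As a quick check of how these macros pack their fields, here is a small host-side sketch. It copies two of the macros above and adds hypothetical helper functions (the hyp_ names are not from codegen.h). The 5-bit width assumed for the scale field is my assumption; the macros themselves only show that it starts at bit 16.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the post above. */
#define vi2f_q(vfpu_rd,vfpu_rs,scale) (0xd2808080 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vf2in_s(vfpu_rd,vfpu_rs,scale) (0xd2000000 | ((scale) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))

/* Hypothetical checked encoder: rejects out-of-range fields before packing.
   Register numbers are 7 bits (0..127); scale < 32 is an assumption. */
static uint32_t hyp_encode_vi2f_q(unsigned rd, unsigned rs, unsigned scale)
{
    assert(rd < 128 && rs < 128 && scale < 32);
    return (uint32_t)vi2f_q(rd, rs, scale);
}

/* Field extractors matching the shifts used by the macros. */
static unsigned hyp_field_rd(uint32_t op)    { return op & 0x7f; }
static unsigned hyp_field_rs(uint32_t op)    { return (op >> 8) & 0x7f; }
static unsigned hyp_field_scale(uint32_t op) { return (op >> 16) & 0x1f; }

static void hyp_check_encoding(void)
{
    uint32_t op = hyp_encode_vi2f_q(0, 4, 8);
    /* 0xd2808080 | (8 << 16) | (4 << 8) | 0 */
    assert(op == 0xd2888480u);
    assert(hyp_field_rd(op) == 0);
    assert(hyp_field_rs(op) == 4);
    assert(hyp_field_scale(op) == 8);
    /* 0xd2000000 | (0 << 16) | (2 << 8) | 1 */
    assert((uint32_t)vf2in_s(1, 2, 0) == 0xd2000201u);
}
```

Running a check like this on the PC before trying an opcode on the PSP catches swapped arguments early, since rd comes first in the macro but sits in the lowest bits of the word.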
Hello. I have some problems with lv_q. For example, if I do this:
Code: Select all
float vfpu_add(float f1, float f2)
{
    vfpu_vars[0] = f1;
    vfpu_vars[1] = f2;
    register void *ptr __asm ("a0") = vfpu_vars;
    __asm__ volatile (
        cgen_asm(lv_q(0, 0, R_a0, 0))
        cgen_asm(vadd_s(124, 0, 1))
        cgen_asm(sv_q(31, 0 * 4, R_a0, 0))
        : "=r"(ptr) : "r"(ptr) : "memory");
    return vfpu_vars[0];
}

This won't work as expected (i.e. add f1 and f2 and return the result). It seems that f1 is not loaded; if I load something else in register 1 before, it will keep this value after lv_q.
Instead, something like this will work:
Code: Select all
float vfpu_add(float f1, float f2)
{
    vfpu_vars[0] = f1;
    vfpu_vars[1] = f2;
    register void *ptr __asm ("a0") = vfpu_vars;
    __asm__ volatile (
        cgen_asm(lv_s(0, 0, R_a0, 0))
        cgen_asm(lv_s(1, 1, R_a0, 0))
        cgen_asm(vadd_s(124, 0, 1))
        cgen_asm(sv_q(31, 0 * 4, R_a0, 0))
        : "=r"(ptr) : "r"(ptr) : "memory");
    return vfpu_vars[0];
}

But in the .h file it's indicated:
Code: Select all
lv.q %vfpu_rt, offset(%base)
%fpu_rt: VFPU Vector Target Register (column0-31/row32-63)

But here fpu_rt seems to be just the register number... so maybe it's my fault (I'm a real beginner), or there is really something I didn't understand. If someone could help me please...
Thanks in advance ^^
Sorry for my bad English
Oldschool library for PSP - PC version released
Hmm....
Some information and suggestions.
1. Before you start: the final address must be 16-byte aligned for the quad version (q) and 4-byte aligned for the single version (s).
2. Use the Q_C000-style register names defined in codegen.h rather than raw register numbers.
3. Example: lv_q(Q_C000, 0, R_a0)
Loads 16 bytes of data (4 floats) into the Q_C000 register from the address pointed to by R_a0 (== vfpu_vars).
That is, if float vfpu_vars[4] = {100, 101, 102, 103}, then S_S000=100, S_S001=101, ... As noted, vfpu_vars must be 16-byte aligned.
4. A simple test program (newvfpu.c?) will help greatly! Find it in this thread and use it.
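The alignment rule in point 1 can be checked on the C side before any VFPU code runs. The sketch below assumes gcc (it uses the aligned attribute); the helper names hyp_is_aligned, hyp_ok_for_lv_q and hyp_ok_for_lv_s are made up for this example, and vfpu_vars is declared here the way the example in point 3 describes it.

```c
#include <assert.h>
#include <stdint.h>

/* Buffer for quad loads/stores, forced onto a 16-byte boundary
   (gcc-specific attribute). */
static float vfpu_vars[4] __attribute__((aligned(16))) =
    { 100.0f, 101.0f, 102.0f, 103.0f };

/* Nonzero if pointer p is a multiple of n bytes. */
static int hyp_is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p % n) == 0;
}

/* The final address (base + byte offset) must be 16-byte aligned
   for the quad version. */
static int hyp_ok_for_lv_q(const void *base, int offset)
{
    return hyp_is_aligned((const char *)base + offset, 16);
}

/* The single version only needs 4-byte alignment. */
static int hyp_ok_for_lv_s(const void *base, int offset)
{
    return hyp_is_aligned((const char *)base + offset, 4);
}

static void hyp_check_alignment(void)
{
    assert(hyp_ok_for_lv_q(vfpu_vars, 0));   /* start of the buffer */
    assert(!hyp_ok_for_lv_q(vfpu_vars, 4));  /* second float: not 16-aligned */
    assert(hyp_ok_for_lv_s(vfpu_vars, 4));   /* but fine for a single load */
}
```

Without the attribute, a plain float array is only guaranteed 4-byte alignment, so a quad load from it may or may not work depending on where the linker happens to place it.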