VFPU help

PacManFan · Post by **PacManFan** » Sun Jan 22, 2006 10:27 am

Hello everyone,
PacManFan here,
I've been following the VFPU discussions, and I'm trying to replace all the messy GTE matrix multiplication code in PSPSOne (pcsx PSP port) with some optimized VFPU code.

Without getting into too much detail, I need a function that takes the form of this:

void MatrixMultiplyVector(float *matrix3x3,float *inVector3,float *outVector3){
//some vfpu magic here

}
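A plain C baseline is handy for checking whatever the VFPU version returns. This is only a minimal sketch: the name MatrixMultiplyVector_ref is made up here, and it assumes the 3x3 matrix is stored row-major as 9 consecutive floats, which may not match the layout the GTE code actually uses.

```c
#include <stddef.h>

/* Plain C baseline: out = M * in for a 3x3 matrix.
 * Assumption: matrix3x3 holds 9 floats in row-major order,
 * so element [row][col] lives at index row * 3 + col. */
void MatrixMultiplyVector_ref(const float *matrix3x3,
                              const float *inVector3,
                              float *outVector3)
{
    float tmp[3];
    for (size_t row = 0; row < 3; row++) {
        tmp[row] = matrix3x3[row * 3 + 0] * inVector3[0]
                 + matrix3x3[row * 3 + 1] * inVector3[1]
                 + matrix3x3[row * 3 + 2] * inVector3[2];
    }
    /* Copy at the end so inVector3 and outVector3 may point to the same storage. */
    for (size_t i = 0; i < 3; i++)
        outVector3[i] = tmp[i];
}
```

Running the same inputs through both versions and comparing the three outputs is a quick sanity check.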

From browsing through some of the gum vfpu code, I saw how to load up a matrix:

void sceGumLoadMatrix(const ScePspFMatrix4* m)
{
register ScePspFMatrix4* r __asm("a0") = GUM_ALIGNED_MATRIX();
memcpy(r,m,sizeof(ScePspFMatrix4));

__asm__ volatile (
cgen_asm(lv_q(Q_C300,0,R_a0,0))
cgen_asm(lv_q(Q_C310,4,R_a0,0))
cgen_asm(lv_q(Q_C320,8,R_a0,0))
cgen_asm(lv_q(Q_C330,12,R_a0,0))
: "=r"(r) : "r"(r), "r"(m) : "memory");

gum_matrix_update[gum_current_mode] = 1;
}

and I had also seen a transform function, but I wasn't sure how to get the results back out:

void vfpu_transform(float x, float y, float z) {
vfpu_vars[0] = x;
vfpu_vars[1] = y;
vfpu_vars[2] = z;
vfpu_vars[3] = 1.0;
register void *ptr __asm ("a0") = vfpu_vars;
__asm__ volatile (
cgen_asm(lv_q(Q_R700, 0, R_a0, 0))
cgen_asm(vtfm4_q(Q_R701, Q_M500, Q_R700))
cgen_asm(vtfm4_q(Q_R702, Q_M300, Q_R701))
: "=r"(ptr) : "r"(ptr) : "memory");
}

Can someone help me put these 2 together?

-PacManFan

korskarn · Post by **korskarn** » Sun Jan 22, 2006 2:07 pm

but I wasn't sure how to get the results back out

The lv_q instruction is "Load Quad Word to VFPU", so to get the result out you do the opposite, "Store Quad Word from VFPU", which is sv_q.

The parameters of the load instruction are:
1. The destination VFPU register,
2. Offset from the source address in the CPU register
3. CPU register containing the source address
4. I don't know... ??? the lv.q instruction has only 3 parameters...

The sv.q parameters should be the same (except that data goes from VFPU to CPU instead!)

The transform function:
A vector is loaded from the CPU into Q_R700 in the VFPU and multiplied by matrix Q_M500; the result is stored in vector Q_R701, which in turn is transformed by matrix Q_M300 and stored in vector Q_R702.

How to understand the registers is a little complicated...
The VFPU has 128 registers that each contain a 32-bit float. You can picture them as eight 4x4 matrices placed side by side. The register name indicates where the upper-left corner of your vector/matrix is in that layout, and the letter before the number says how to access it.

Registers that are Q_R*** represent a row vector, Q_C*** is a column vector, Q_M*** is a matrix with row vectors, Q_E*** is a matrix with column vectors.

The number always has 3 digits: the leftmost one goes from 0 to 7 and selects one of the eight 4x4 matrices, and the other two are like x y coordinates into that matrix and each range from 0 to 3.

Here is an example with only the first 3 matrices:
|000|010|020|030||100|110|120|130||200|210|220|230|
|001|011|021|031||101|111|121|131||201|211|221|231|
|002|012|022|032||102|112|122|132||202|212|222|232|
|003|013|023|033||103|113|123|133||203|213|223|233|
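The naming scheme can be tried out on a PC. The following is only a sketch of the description above: VfpuRegName, vfpu_parse_name and vfpu_print_grid are hypothetical helpers (not part of any SDK header), and names are given without the Q_ prefix used by the macros.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type; not part of any SDK header. */
typedef struct {
    char kind;   /* 'R', 'C', 'M' or 'E', as described above */
    int matrix;  /* 0..7: which of the 4x4 blocks */
    int x;       /* 0..3: horizontal position inside the block */
    int y;       /* 0..3: vertical position inside the block */
} VfpuRegName;

/* Parse a name such as "C310". Returns 1 on success, 0 on malformed input. */
int vfpu_parse_name(const char *name, VfpuRegName *out)
{
    if (name == NULL || out == NULL || strlen(name) != 4)
        return 0;
    if (strchr("RCME", name[0]) == NULL)
        return 0;

    int m = name[1] - '0';
    int x = name[2] - '0';
    int y = name[3] - '0';
    if (m < 0 || m > 7 || x < 0 || x > 3 || y < 0 || y > 3)
        return 0;

    out->kind = name[0];
    out->matrix = m;
    out->x = x;
    out->y = y;
    return 1;
}

/* Print the grid for matrices first..last in the same layout as the table above. */
void vfpu_print_grid(int first, int last)
{
    for (int y = 0; y < 4; y++) {
        for (int m = first; m <= last; m++) {
            putchar('|');
            for (int x = 0; x < 4; x++)
                printf("%d%d%d|", m, x, y);
        }
        putchar('\n');
    }
}
```

Parsing "C310" gives matrix 3, x = 1, y = 0, and vfpu_print_grid(0, 2) prints the four table rows shown above.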

PacManFan · Post by **PacManFan** » Sun Jan 22, 2006 2:43 pm

Does this look about right?

//vfpu_vars[0] - vfpu_vars[3] //are aligned float memory
void MultVec(ScePspFMatrix4 *mat,float *invec,float *outvec){

register ScePspFMatrix4* r __asm("a0") = GUM_ALIGNED_MATRIX();
memcpy(r,mat,sizeof(ScePspFMatrix4));

vfpu_vars[0] = invec[0]; //x
vfpu_vars[1] = invec[1]; //y;
vfpu_vars[2] = invec[2]; //z;
vfpu_vars[3] = 1.0; // not used

//load up the matrix into the coprocessor matrix 3
__asm__ volatile (
cgen_asm(lv_q(Q_C300,0,R_a0,0))
cgen_asm(lv_q(Q_C310,4,R_a0,0))
cgen_asm(lv_q(Q_C320,8,R_a0,0))
//cgen_asm(lv_q(Q_C330,12,R_a0,0)) // don't need this one
: "=r"(r) : "r"(r), "r"(m) : "memory");

register void *ptr __asm ("b0") = vfpu_vars;

//load up the vfpu in-vector regs into Matrix 7, row 0
__asm__ volatile (
cgen_asm(lv_q(Q_R700, 0, R_b0, 0)) //load the vector
cgen_asm(vtfm3_t(Q_R700, Q_M300, Q_R700)) //multiply by the matrix 3
cgen_asm(sv_q(Q_C700,0,R_b0,0)) //store the results
: "=r"(ptr) : "r"(ptr) : "memory");
// store the results

outvec[0] = vfpu_vars[0];
outvec[1] = vfpu_vars[1];
outvec[2] = vfpu_vars[2];
}

I haven't tried to compile this yet, does it look right?
-PMF

jsgf · Post by **jsgf** » Sun Jan 22, 2006 4:18 pm

korskarn wrote:1. The destination VFPU register,
2. Offset from the source address in the CPU register
3. CPU register containing the source address
4. I don't know... ??? the lv.q instruction has only 3 parameters...

The 4th parameter is a cache bypass flag; whether to write through (0) or write back (1).

korskarn · Post by **korskarn** » Sun Jan 22, 2006 11:34 pm

PacManFan wrote:Does this look about right?

Looks ok to me, except that as far as I know, there is no b0 register?

jsgf wrote:The 4th parameter is a cache bypass flag; whether to write through (0) or write back (1)

That's for the store instruction, which indeed has a fourth "wb" parameter.
I don't think it's used for the load instruction.

PacManFan · Post by **PacManFan** » Mon Jan 23, 2006 5:21 am

This is my code so far, but I can't seem to get it to even execute the load matrix. Can someone point out the error?

static float vfpu_vars [4] __attribute__((aligned(16)));
#define ALIGNED_MATRIX() (ScePspFMatrix4*)((((unsigned int)alloca(sizeof(ScePspFMatrix4)+64)) + 63) & ~63)
//#define ALIGNED_VECTOR() (ScePspFVector4*)((((unsigned int)alloca(sizeof(ScePspFVector4)+64)) + 63) & ~63)
#define ALIGNED_VECTOR() (float *)((((unsigned int)alloca(sizeof(16)+64)) + 63) & ~63)

void TransformVec(float *m,float *invec,float *outvec){
register ScePspFMatrix4* r __asm("a0") = ALIGNED_MATRIX();
int c;
DLog("In Vector ");
for (c=0;c<4;c++){
DLog("%d %f \n",c,invec[c]);
}

DLog("Matrix ");
for (c=0;c<16;c++){
DLog("%d %f \n",c,m[c]);
}

memcpy(r,m,64);

vfpu_vars[0] = invec[0]; //x
vfpu_vars[1] = invec[1]; //y;
vfpu_vars[2] = invec[2]; //z;
vfpu_vars[3] = 1.0; // not used

//load up the matrix into the coprocessor matrix 3
__asm__ volatile (
cgen_asm(lv_q(Q_C300,0,R_a0,0))
cgen_asm(lv_q(Q_C310,4,R_a0,0))
cgen_asm(lv_q(Q_C320,8,R_a0,0))
cgen_asm(lv_q(Q_C330,12,R_a0,0))
: "=r"(r) : "r"(r), "r"(m) : "memory");
DLog("Program does not get here at all..., crashes above somewhere");
register void *ptr __asm ("a0") = vfpu_vars;

//load up the vfpu in-vector regs into Matrix 7, row 0
__asm__ volatile (
cgen_asm(lv_q(Q_R700, 0, R_a0,0)) //load the vector
cgen_asm(vtfm3_t(Q_R700, Q_M300, Q_R700)) //multiply by the matrix 3
cgen_asm(sv_q(Q_R700,0,R_a0,0)) //store the results
: "=r"(ptr) : "r"(ptr) : "memory");
// store the results

outvec[0] = vfpu_vars[0];
outvec[1] = vfpu_vars[1];
outvec[2] = vfpu_vars[2];

DLog("Out Vector ");
for (c=0;c<4;c++){
DLog("%d %f \n",c,outvec[c]);
}

}
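The ALIGNED_MATRIX/ALIGNED_VECTOR macros above round an alloca pointer up to a 64-byte boundary. A minimal sketch of the same idea with hypothetical helpers align_up and padded_size, written with uintptr_t instead of unsigned int. One detail worth knowing: sizeof(16) is the size of the int constant 16 (typically 4), not 16 bytes, so the ALIGNED_VECTOR buffer can end up with fewer than 16 usable bytes after rounding.

```c
#include <stddef.h>
#include <stdint.h>

/* Round p up to the next multiple of align (align must be a power of two). */
static void *align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + (align - 1)) & ~(uintptr_t)(align - 1));
}

/* Bytes to reserve so that `size` usable bytes remain after align_up(). */
static size_t padded_size(size_t size, size_t align)
{
    return size + align - 1;
}
```

For a 4-float vector aligned to 64 bytes, padded_size(16, 64) reserves 79 bytes, and align_up on that buffer always leaves the full 16 bytes available.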

korskarn · Post by **korskarn** » Mon Jan 23, 2006 8:12 am

I don't see what's causing the crash. Can you try to use your DLog function just before loading the matrix? That way we can see whether it's the matrix load code that crashes or something before it.
If it crashes in the matrix code, try also to use DLog between each of the 4 lv.q and see at which one it crashes.

PacManFan · Post by **PacManFan** » Mon Jan 23, 2006 9:40 am

Yeah, I put a DLog call right before I go into the inline assembly, and it's definitely crashing somewhere in the assembly.
I forgot to mention I also have the VFPU flag in my PSP_MAIN_THREAD_ATTR. I'm going to break it apart, execute one assembly instruction at a time and see what happens. It's going to have to wait a few hours though, my wife took my PSP to the laundromat.

-PMF

korskarn · Post by **korskarn** » Mon Jan 23, 2006 10:08 am

I see something that might be the cause of the crash, depending on how the compiler interprets the value...
In the lv_q instruction, there is a numerical value that is the offset relative to the address in the register a0. For a 4x4 matrix, these offsets would be 0, 16, 32 and 48 bytes, yet in the code they are 0, 4, 8 and 12.

Now, when the instruction is encoded, only the 14 high bits of this 16-bit offset are taken: the address must always be word aligned, so the two low bits are always 0 and carry no information, and shifting the value frees 2 more bits in the instruction code to store other information.

The question is, does the compiler take the real byte offset and do the shift itself (this would seem more logical), or do we shift the offset ourselves?

If the compiler takes a byte offset and shifts it, then the values (0, 4, 8, 12) in the code would be wrong, and would cause an alignment exception because they are not qword aligned.

Try allocating a matrix 4 times the real size (just to be safe in case my theory is wrong) and change the offsets to 0, 16, 32 and 48 and see if it still crashes.

Then memset your matrix memory to 0 (the whole 4 times bigger) and read it back from the VFPU (without changing it) and verify that it is intact.
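The offset arithmetic from this post can be checked on any machine. A minimal sketch: the helper names are made up for illustration, and encode_offset14 only models the "drop the two low bits" description above rather than any real assembler's behaviour.

```c
#include <stddef.h>
#include <stdint.h>

enum { QWORD_BYTES = 16 };  /* four 32-bit floats */

/* Byte offset of the i-th group of four floats in a 4x4 matrix: 0, 16, 32, 48. */
size_t quad_offset(size_t i)
{
    return i * 4 * sizeof(float);
}

/* A quad-word load/store needs a 16-byte aligned effective address. */
int is_qword_aligned(const void *base, size_t offset)
{
    return (((uintptr_t)base + offset) % QWORD_BYTES) == 0;
}

/* Model of the encoding described above: drop the two low bits of the
 * 16-bit byte offset and keep the remaining 14-bit field. */
unsigned encode_offset14(int16_t byte_offset)
{
    return ((unsigned)(uint16_t)byte_offset >> 2) & 0x3FFFu;
}
```

With a 16-byte aligned base, offsets 0, 16, 32 and 48 pass the alignment check while 4, 8 and 12 do not.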

jsgf · Post by **jsgf** » Thu Jan 26, 2006 11:41 am

PMF:

1. You should definitely convert your code to take advantage of the binutils support for the VFPU, rather than using the macros. There might be a bug in the opcodes they generate; at the very least, they're not very readable, and they don't support the full power of the VFPU.

2. Use [code] [/code] blocks for quoting code; it's hard to read otherwise.

3. Look at using libpspvfpu (which I recently checked into SVN), so that your VFPU code will coexist with other VFPU code (plus it will set the VFPU thread attrib for you).

4. I think korskarn is right about your crash; the offsets in

cgen_asm(lv_q(Q_C300,0,R_a0,0))
cgen_asm(lv_q(Q_C310,4,R_a0,0))
cgen_asm(lv_q(Q_C320,8,R_a0,0))
cgen_asm(lv_q(Q_C330,12,R_a0,0))

are in bytes, not FP array units, so your loads are unaligned. Use 0,16,32,48 as the offsets.

5. There is no 5.
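Points 1 and 4 together might look roughly like this. It is an untested sketch, not code from PSPSOne or the SDK: it assumes the toolchain's assembler accepts the lv.q/sv.q mnemonics with the usual offset(base) addressing, and HAVE_VFPU is a hypothetical macro you would define yourself in a PSP build. Without it the function falls back to memcpy so the surrounding code can still be compiled and tested on a PC.

```c
#include <string.h>

/* Copies a 4x4 float matrix through VFPU matrix 3 and back out, unchanged.
 * On the PSP both pointers must be 16-byte aligned and the calling thread
 * needs the VFPU attribute. HAVE_VFPU is a hypothetical build macro. */
void matrix_roundtrip(const float *in, float *out)
{
#ifdef HAVE_VFPU
    __asm__ volatile (
        "lv.q C300,  0(%0)\n"   /* byte offsets, 16 apart */
        "lv.q C310, 16(%0)\n"
        "lv.q C320, 32(%0)\n"
        "lv.q C330, 48(%0)\n"
        "sv.q C300,  0(%1)\n"
        "sv.q C310, 16(%1)\n"
        "sv.q C320, 32(%1)\n"
        "sv.q C330, 48(%1)\n"
        :
        : "r"(in), "r"(out)
        : "memory");
#else
    /* Host fallback: same observable result, no VFPU involved. */
    memcpy(out, in, 16 * sizeof(float));
#endif
}
```

If the output buffer matches the input after the call on real hardware, the loads and stores are at least hitting the right addresses.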

PacManFan · Post by **PacManFan** » Thu Jan 26, 2006 1:06 pm

Thanks, I figured that out a few days ago, and have already put coprocessor vertex transformation functions into the PSPSOne emulator. It's giving me a slight speed increase. I need to rewrite some more of the GTE in order to avoid reloading the matrix every time I transform a vertex; otherwise, I'm not saving any time.

-PMF

hlide · Post by **hlide** » Sun Oct 29, 2006 12:12 am

Any progress since the last post here?