VFPU help
Hello everyone,
PacManFan here,
I've been following the VFPU discussions, and I'm trying to replace all the messy GTE matrix multiplication code in PSPSOne (pcsx PSP port) with some optimized VFPU code.
Without getting into too much detail, I need a function that takes the form of this:
void MatrixMultiplyVector(float *matrix3x3,float *inVector3,float *outVector3){
    //some vfpu magic here
}
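For comparison while testing, a plain C version of that function could look like the sketch below. It is only a scalar reference to check VFPU results against, not the VFPU code itself. It assumes the matrix is stored row-major as 9 floats (the post does not say which layout is used), and the const qualifiers are an addition.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: out = M * in.
   Assumption: matrix3x3 holds 9 floats in row-major order. */
void MatrixMultiplyVector(const float *matrix3x3, const float *inVector3,
                          float *outVector3)
{
    float tmp[3];
    int row;

    assert(matrix3x3 != NULL && inVector3 != NULL && outVector3 != NULL);

    for (row = 0; row < 3; row++) {
        tmp[row] = matrix3x3[row * 3 + 0] * inVector3[0]
                 + matrix3x3[row * 3 + 1] * inVector3[1]
                 + matrix3x3[row * 3 + 2] * inVector3[2];
    }

    /* Copy through a temporary so the input and output may be the same array. */
    for (row = 0; row < 3; row++) {
        outVector3[row] = tmp[row];
    }
}
```

If the matrix turns out to be column-major, swapping the two index roles (row * 3 + k becomes k * 3 + row) gives the transposed product.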
From browsing through some of the gum vfpu code, I saw how to load up a matrix:
void sceGumLoadMatrix(const ScePspFMatrix4* m)
{
    register ScePspFMatrix4* r __asm("a0") = GUM_ALIGNED_MATRIX();
    memcpy(r,m,sizeof(ScePspFMatrix4));
    __asm__ volatile (
        cgen_asm(lv_q(Q_C300,0,R_a0,0))
        cgen_asm(lv_q(Q_C310,4,R_a0,0))
        cgen_asm(lv_q(Q_C320,8,R_a0,0))
        cgen_asm(lv_q(Q_C330,12,R_a0,0))
    : "=r"(r) : "r"(r), "r"(m) : "memory");
    gum_matrix_update[gum_current_mode] = 1;
}
and I had also seen a transform function, but I wasn't sure how to get the results back out:
void vfpu_transform(float x, float y, float z) {
    vfpu_vars[0] = x;
    vfpu_vars[1] = y;
    vfpu_vars[2] = z;
    vfpu_vars[3] = 1.0;
    register void *ptr __asm ("a0") = vfpu_vars;
    __asm__ volatile (
        cgen_asm(lv_q(Q_R700, 0, R_a0, 0))
        cgen_asm(vtfm4_q(Q_R701, Q_M500, Q_R700))
        cgen_asm(vtfm4_q(Q_R702, Q_M300, Q_R701))
    : "=r"(ptr) : "r"(ptr) : "memory");
}
Can someone help me put these 2 together?
-PacManFan
"I'm a little source code, short and stout
Here is my input, here is my out."
Author of PSPQuake and PSPSOne.
PacManFan wrote:but I wasn't sure how to get the results back out
The lv_q instruction is "Load Quad Word to VFPU", so to get the result out, you just do the opposite: "Save Quad Word from VFPU", which is sv_q.
The parameters of the load instruction are:
1. The destination VFPU register,
2. Offset from the source address in the CPU register
3. CPU register containing the source address
4. I don't know... ??? the lv.q instruction has only 3 parameters...
The sv.q parameters should be the same (except that data goes from VFPU to CPU instead!)
The transform function:
A vector is loaded from CPU memory into Q_R700 in the VFPU and multiplied by matrix Q_M500, with the result stored in vector Q_R701. That result is in turn transformed by matrix Q_M300 and stored in vector Q_R702.
Understanding the register names is a little complicated...
The VFPU has 128 registers that each contain a 32-bit float. You can picture them as eight 4x4 matrices placed side by side. The register name indicates where the upper-left corner of your vector or matrix sits in that layout, and the letter before the number says how to access it.
Registers that are Q_R*** represent a row vector, Q_C*** is a column vector, Q_M*** is a matrix with row vectors, Q_E*** is a matrix with column vectors.
The number always has 3 digits: the leftmost one goes from 0 to 7 and selects one of the eight 4x4 matrices; the other two are like x and y coordinates into that matrix, and each ranges from 0 to 3.
Here is an example with only the first 3 matrices:
|000|010|020|030||100|110|120|130||200|210|220|230|
|001|011|021|031||101|111|121|131||201|211|221|231|
|002|012|022|032||102|112|122|132||202|212|222|232|
|003|013|023|033||103|113|123|133||203|213|223|233|
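To make the naming scheme above concrete, here is a small sketch that splits a name such as C310 into its parts, following only the description in this post (access letter, matrix digit, then x and y). The struct and function names are made up for illustration, and the Q_ prefix used by the macros is assumed to be stripped before calling it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the pieces of a register name like "C310". */
typedef struct {
    char access; /* 'R', 'C', 'M' or 'E', as listed in the post */
    int  matrix; /* which of the eight 4x4 blocks, 0..7 */
    int  x;      /* column inside that block, 0..3 */
    int  y;      /* row inside that block, 0..3 */
} vfpu_name;

/* Returns 1 and fills *out on success, 0 if the name does not match
   the "letter + three digits" scheme described above. */
int parse_vfpu_name(const char *name, vfpu_name *out)
{
    assert(name != NULL && out != NULL);

    /* Each check returns before the next character is read, so a short
       string is never read past its terminator. */
    if (name[0] != 'R' && name[0] != 'C' && name[0] != 'M' && name[0] != 'E')
        return 0;
    if (name[1] < '0' || name[1] > '7')
        return 0;
    if (name[2] < '0' || name[2] > '3')
        return 0;
    if (name[3] < '0' || name[3] > '3')
        return 0;
    if (name[4] != '\0')
        return 0;

    out->access = name[0];
    out->matrix = name[1] - '0';
    out->x = name[2] - '0';
    out->y = name[3] - '0';
    return 1;
}
```

With this reading, C310 is a column vector whose top-left element is at x=1, y=0 of matrix 3, which matches where 110 sits relative to 100 in the grid above.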
Does this look about right?
//vfpu_vars[0] - vfpu_vars[3] //are aligned float memory
void MultVec(ScePspFMatrix4 *mat,float *invec,float *outvec){
    register ScePspFMatrix4* r __asm("a0") = GUM_ALIGNED_MATRIX();
    memcpy(r,mat,sizeof(ScePspFMatrix4));
    vfpu_vars[0] = invec[0]; //x
    vfpu_vars[1] = invec[1]; //y;
    vfpu_vars[2] = invec[2]; //z;
    vfpu_vars[3] = 1.0; // not used
    //load up the matrix into the coprocessor matrix 3
    __asm__ volatile (
        cgen_asm(lv_q(Q_C300,0,R_a0,0))
        cgen_asm(lv_q(Q_C310,4,R_a0,0))
        cgen_asm(lv_q(Q_C320,8,R_a0,0))
        //cgen_asm(lv_q(Q_C330,12,R_a0,0)) // don't need this one
    : "=r"(r) : "r"(r), "r"(m) : "memory");
    register void *ptr __asm ("b0") = vfpu_vars;
    //load up the vfpu in-vector regs into Matrix 7, row 0
    __asm__ volatile (
        cgen_asm(lv_q(Q_R700, 0, R_b0, 0)) //load the vector
        cgen_asm(vtfm3_t(Q_R700, Q_M300, Q_R700)) //multiply by the matrix 3
        cgen_asm(sv_q(Q_C700,0,R_b0,0)) //store the results
    : "=r"(ptr) : "r"(ptr) : "memory");
    // store the results
    outvec[0] = vfpu_vars[0];
    outvec[1] = vfpu_vars[1];
    outvec[2] = vfpu_vars[2];
}
I haven't tried to compile this yet, does it look right?
-PMF
"I'm a little source code, short and stout
Here is my input, here is my out."
Author of PSPQuake and PSPSOne.
korskarn wrote:
1. The destination VFPU register,
2. Offset from the source address in the CPU register
3. CPU register containing the source address
4. I don't know... ??? the lv.q instruction has only 3 parameters...
The 4th parameter is a cache bypass flag; whether to write through (0) or write back (1).
PacManFan wrote:Does this look about right?
Looks ok to me, except that as far as I know, there is no b0 register?
jsgf wrote:The 4th parameter is a cache bypass flag; whether to write through (0) or write back (1)
That's for the store instruction, which indeed has a fourth "wb" parameter.
I don't think it's used for the load instruction.
This is my code so far, but I can't seem to get it to even execute the load matrix. Can someone point out the error?
static float vfpu_vars [4] __attribute__((aligned(16)));
#define ALIGNED_MATRIX() (ScePspFMatrix4*)((((unsigned int)alloca(sizeof(ScePspFMatrix4)+64)) + 63) & ~63)
//#define ALIGNED_VECTOR() (ScePspFVector4*)((((unsigned int)alloca(sizeof(ScePspFVector4)+64)) + 63) & ~63)
#define ALIGNED_VECTOR() (float *)((((unsigned int)alloca(sizeof(16)+64)) + 63) & ~63)
void TransformVec(float *m,float *invec,float *outvec){
    register ScePspFMatrix4* r __asm("a0") = ALIGNED_MATRIX();
    int c;
    DLog("In Vector ");
    for (c=0;c<4;c++){
        DLog("%d %f \n",c,invec[c]);
    }
    DLog("Matrix ");
    for (c=0;c<16;c++){
        DLog("%d %f \n",c,m[c]);
    }
    memcpy(r,m,64);
    vfpu_vars[0] = invec[0]; //x
    vfpu_vars[1] = invec[1]; //y;
    vfpu_vars[2] = invec[2]; //z;
    vfpu_vars[3] = 1.0; // not used
    //load up the matrix into the coprocessor matrix 3
    __asm__ volatile (
        cgen_asm(lv_q(Q_C300,0,R_a0,0))
        cgen_asm(lv_q(Q_C310,4,R_a0,0))
        cgen_asm(lv_q(Q_C320,8,R_a0,0))
        cgen_asm(lv_q(Q_C330,12,R_a0,0))
    : "=r"(r) : "r"(r), "r"(m) : "memory");
    DLog("Program does not get here at all..., crashes above somewhere");
    register void *ptr __asm ("a0") = vfpu_vars;
    //load up the vfpu in-vector regs into Matrix 7, row 0
    __asm__ volatile (
        cgen_asm(lv_q(Q_R700, 0, R_a0,0)) //load the vector
        cgen_asm(vtfm3_t(Q_R700, Q_M300, Q_R700)) //multiply by the matrix 3
        cgen_asm(sv_q(Q_R700,0,R_a0,0)) //store the results
    : "=r"(ptr) : "r"(ptr) : "memory");
    // store the results
    outvec[0] = vfpu_vars[0];
    outvec[1] = vfpu_vars[1];
    outvec[2] = vfpu_vars[2];
    DLog("Out Vector ");
    for (c=0;c<4;c++){
        DLog("%d %f \n",c,outvec[c]);
    }
}
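A side note on the alignment macros in the code above: they over-allocate by 64 bytes and round the address up to a 64-byte boundary. One detail worth knowing is that sizeof(16) is the size of an int, not 16, so ALIGNED_VECTOR reserves fewer bytes than it appears to (that macro is not actually used in the function). The sketch below shows the same round-up idea as plain functions with a size check. The names align_up and align_within are made up for this example, and it uses uintptr_t instead of unsigned int for the pointer cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment.
   alignment must be a non-zero power of two. */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Return an aligned pointer inside buf with room for `need` bytes,
   or NULL if `need` bytes no longer fit after rounding up. */
void *align_within(void *buf, size_t buf_size, size_t need, size_t alignment)
{
    uintptr_t start = (uintptr_t)buf;
    uintptr_t aligned = align_up(start, (uintptr_t)alignment);
    size_t skipped = (size_t)(aligned - start);

    if (skipped > buf_size || buf_size - skipped < need)
        return NULL;
    return (void *)aligned;
}
```

A buffer of need + alignment - 1 bytes is always enough, which is why the matrix macro adds 64 to the size before rounding.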
"I'm a little source code, short and stout
Here is my input, here is my out."
Author of PSPQuake and PSPSOne.
I don't see what's causing the crash. Can you try to use your DLog function just before loading the matrix? That way we can see if it's the matrix load code that crashes or something before.
If it crashes in the matrix code, try also to use DLog between each of the 4 lv.q and see at which one it crashes.
Yeah, I put a DLog call right before I go into the inline assembly, it's definitely crashing somewhere in the assembly.
I forgot to mention I also have the VFPU flag in my PSP_MAIN_THREAD_ATTR. I'm going to break it apart, and execute 1 assembly instruction and see what happens. It's going to have to wait a few hours though, my wife took my PSP to the laundromat.
-PMF
"I'm a little source code, short and stout
Here is my input, here is my out."
Author of PSPQuake and PSPSOne.
I see something that might be the cause of the crash, depending on how the compiler interprets the value...
In the lv_q instruction, there is a numerical value that is the offset relative to the address in the register a0. For a 4x4 matrix, these offsets would be 0, 16, 32 and 48 bytes, yet in the code they are 0, 4, 8 and 12.
Now, when the instruction is encoded, only the 14 high bits of this 16-bit offset are taken: the address must always be word aligned, therefore the two low bits are always 0 and they are useless, and shifting the value gives 2 more bits in the instruction code to store other information.
The question is, does the compiler take the real byte offset and do the shift itself (this would seem more logical), or do we shift the offset ourselves?
If the compiler takes a byte offset and shifts it, then the values (0, 4, 8, 12) in the code would be wrong, and would cause an alignment exception because they are not qword aligned.
Try allocating a matrix 4 times the real size (just to be safe in case my theory is wrong) and change the offsets to 0, 16, 32 and 48 and see if it still crashes.
Then memset your matrix memory to 0 (the whole 4 times bigger) and read it back from the VFPU (without changing it) and verify that it is intact.
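The arithmetic behind this theory can be sketched in a few lines of plain C. It only models what the post describes (quad-word loads need 16-byte alignment, and the instruction keeps the upper 14 bits of a 16-bit offset); the function names are made up, negative offsets are ignored, and this is not a statement about what the assembler macros really emit.

```c
#include <assert.h>

/* Byte offset of column `col` in a 4x4 float matrix: 4 floats per column. */
unsigned column_byte_offset(unsigned col)
{
    assert(col < 4);
    return col * 4u * (unsigned)sizeof(float);
}

/* 1 if adding `offset` to a 16-byte-aligned base keeps it 16-byte aligned. */
int quad_offset_ok(unsigned offset)
{
    return (offset % 16u) == 0;
}

/* Model of the encoding described in the post: the two low bits of a
   word-aligned offset are dropped, leaving a 14-bit field. */
unsigned encode_offset_field(unsigned offset)
{
    assert(offset < 0x10000u && (offset & 3u) == 0);
    return offset >> 2;
}

/* Inverse of encode_offset_field: recover the byte offset. */
unsigned decode_offset_field(unsigned field)
{
    return (field & 0x3FFFu) << 2;
}
```

Note that the byte offset 48 encodes to the field value 12, which is exactly why it matters whether the macro expects the byte offset or the already shifted value.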
PMF:
1. You should definitely convert your code to take advantage of the binutils support for the VFPU, rather than using the macros. There might be a bug in the opcodes they generate; at the very least, they're not very readable, and they don't support the full power of the VFPU.
2. Use [code] [/code] blocks for quoting code; it's hard to read otherwise.
3. Look at using libpspvfpu (which I recently checked into SVN), so that your VFPU code will coexist with other VFPU code (plus it will set the VFPU thread attrib for you).
4. I think korskarn is right about your crash; the offsets in
cgen_asm(lv_q(Q_C300,0,R_a0,0))
cgen_asm(lv_q(Q_C310,4,R_a0,0))
cgen_asm(lv_q(Q_C320,8,R_a0,0))
cgen_asm(lv_q(Q_C330,12,R_a0,0))
are in bytes, not FP array units, so your loads are unaligned. Use 0,16,32,48 as the offsets.
5. There is no 5.
Thanks, I figured that out a few days ago, and have already put coprocessor vertex transformation functions into the PSPSOne emulator. It's giving me a slight speed increase. I need to re-write some more of the GTE in order to avoid reloading the matrix every time I transform a vertex; otherwise, I'm not saving any time.
-PMF
"I'm a little source code, short and stout
Here is my input, here is my out."
Author of PSPQuake and PSPSOne.
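The last point, that reloading the matrix for every vertex eats the gain, suggests splitting the work into a "set matrix" step and a "transform many" step. Below is a portable C sketch of that call structure only. The names xform_state, xform_set_matrix and xform_apply are hypothetical, the matrix is assumed to be row-major, and on the PSP the stored state would be the VFPU matrix registers and the loop body would be VFPU code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "matrix already loaded in the coprocessor". */
typedef struct {
    float m[9]; /* assumption: 3x3, row-major */
} xform_state;

/* Load the matrix once; call again only when the matrix changes. */
void xform_set_matrix(xform_state *s, const float *matrix3x3)
{
    assert(s != NULL && matrix3x3 != NULL);
    memcpy(s->m, matrix3x3, sizeof s->m);
}

/* Transform `count` packed xyz vectors using the matrix set earlier. */
void xform_apply(const xform_state *s, const float *in, float *out,
                 size_t count)
{
    size_t i;
    int row;

    assert(s != NULL && in != NULL && out != NULL);

    for (i = 0; i < count; i++) {
        const float *v = in + 3 * i;
        float tmp[3];

        for (row = 0; row < 3; row++) {
            tmp[row] = s->m[row * 3 + 0] * v[0]
                     + s->m[row * 3 + 1] * v[1]
                     + s->m[row * 3 + 2] * v[2];
        }
        /* Write through a temporary so in and out may be the same buffer. */
        memcpy(out + 3 * i, tmp, sizeof tmp);
    }
}
```

The caller sets the matrix whenever it changes and then pushes whole batches of vertices through xform_apply, so the per-vertex cost no longer includes a matrix load.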