Can I use inline asm/vfpu to speed this matrix multiply up?

Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

Here's a new matrix multiply function I wrote for raptor. It works: the demo supplied with alpha 3 now runs at 30-46fps instead of 20, but it's still not fast enough for my needs.

So I'm wondering what I can do to optimize it further. I've unrolled all the loops, so it's pure math and variable lookups.

I've been advised to use the VFPU, but the guy I was speaking with only had experience with the official SDK and wasn't sure whether pspdev has that ability. Does it?

If so, could you please provide me with a short example, even if it's just adding one integer to another on the VFPU, so I can get started with it.

Thanks

Code: Select all

	void Multiply(Matrix *mat)
	{
		Matrix new_mat;
	
		
		new_mat.grid[0][0]=(grid[0][0]*mat->grid[0][0]) + (grid[1][0]*mat->grid[0][1]) + (grid[2][0]*mat->grid[0][2]) + (grid[3][0]*mat->grid[0][3]); 
		new_mat.grid[0][1]=(grid[0][1]*mat->grid[0][0]) + (grid[1][1]*mat->grid[0][1]) + (grid[2][1]*mat->grid[0][2]) + (grid[3][1]*mat->grid[0][3]); 
		new_mat.grid[0][2]=(grid[0][2]*mat->grid[0][0]) + (grid[1][2]*mat->grid[0][1]) + (grid[2][2]*mat->grid[0][2]) + (grid[3][2]*mat->grid[0][3]);
		new_mat.grid[0][3]=(grid[0][3]*mat->grid[0][0]) + (grid[1][3]*mat->grid[0][1]) + (grid[2][3]*mat->grid[0][2]) + (grid[3][3]*mat->grid[0][3]); 

		new_mat.grid[1][0]=(grid[0][0]*mat->grid[1][0]) + (grid[1][0]*mat->grid[1][1]) + (grid[2][0]*mat->grid[1][2]) + (grid[3][0]*mat->grid[1][3]); 
		new_mat.grid[1][1]=(grid[0][1]*mat->grid[1][0]) + (grid[1][1]*mat->grid[1][1]) + (grid[2][1]*mat->grid[1][2]) + (grid[3][1]*mat->grid[1][3]); 
		new_mat.grid[1][2]=(grid[0][2]*mat->grid[1][0]) + (grid[1][2]*mat->grid[1][1]) + (grid[2][2]*mat->grid[1][2]) + (grid[3][2]*mat->grid[1][3]); 
		new_mat.grid[1][3]=(grid[0][3]*mat->grid[1][0]) + (grid[1][3]*mat->grid[1][1]) + (grid[2][3]*mat->grid[1][2]) + (grid[3][3]*mat->grid[1][3]); 

		new_mat.grid[2][0]=(grid[0][0]*mat->grid[2][0]) + (grid[1][0]*mat->grid[2][1]) + (grid[2][0]*mat->grid[2][2]) + (grid[3][0]*mat->grid[2][3]); 
		new_mat.grid[2][1]=(grid[0][1]*mat->grid[2][0]) + (grid[1][1]*mat->grid[2][1]) + (grid[2][1]*mat->grid[2][2]) + (grid[3][1]*mat->grid[2][3]); 
		new_mat.grid[2][2]=(grid[0][2]*mat->grid[2][0]) + (grid[1][2]*mat->grid[2][1]) + (grid[2][2]*mat->grid[2][2]) + (grid[3][2]*mat->grid[2][3]); 
		new_mat.grid[2][3]=(grid[0][3]*mat->grid[2][0]) + (grid[1][3]*mat->grid[2][1]) + (grid[2][3]*mat->grid[2][2]) + (grid[3][3]*mat->grid[2][3]); 

		new_mat.grid[3][0]=(grid[0][0]*mat->grid[3][0]) + (grid[1][0]*mat->grid[3][1]) + (grid[2][0]*mat->grid[3][2]) + (grid[3][0]*mat->grid[3][3]); 
		new_mat.grid[3][1]=(grid[0][1]*mat->grid[3][0]) + (grid[1][1]*mat->grid[3][1]) + (grid[2][1]*mat->grid[3][2]) + (grid[3][1]*mat->grid[3][3]); 
		new_mat.grid[3][2]=(grid[0][2]*mat->grid[3][0]) + (grid[1][2]*mat->grid[3][1]) + (grid[2][2]*mat->grid[3][2]) + (grid[3][2]*mat->grid[3][3]); 
		new_mat.grid[3][3]=(grid[0][3]*mat->grid[3][0]) + (grid[1][3]*mat->grid[3][1]) + (grid[2][3]*mat->grid[3][2]) + (grid[3][3]*mat->grid[3][3]); 

	
		grid[0][0] = new_mat.grid[0][0];
		grid[0][1] = new_mat.grid[0][1];
		grid[0][2] = new_mat.grid[0][2];
		grid[0][3] = new_mat.grid[0][3];
		grid[1][0] = new_mat.grid[1][0];
		grid[1][1] = new_mat.grid[1][1];
		grid[1][2] = new_mat.grid[1][2];
		grid[1][3] = new_mat.grid[1][3];
		grid[2][0] = new_mat.grid[2][0];
		grid[2][1] = new_mat.grid[2][1];
		grid[2][2] = new_mat.grid[2][2];
		grid[2][3] = new_mat.grid[2][3];
		grid[3][0] = new_mat.grid[3][0];
		grid[3][1] = new_mat.grid[3][1];
		grid[3][2] = new_mat.grid[3][2];
		grid[3][3] = new_mat.grid[3][3]; 
		
	
	}
	
Jim
Posts: 476
Joined: Sat Jul 02, 2005 10:06 pm
Location: Sydney

Post by Jim »

You can use the vfpu with pspsdk. But you can probably get more speed out of what you've got by avoiding all those double array indices using pointers instead.

Jim
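
A minimal sketch of what Jim suggests here, assuming `Matrix` is nothing more than a wrapper around `float grid[4][4]` (the thread never shows its definition). `matrix_multiply_flat` is a hypothetical name; it walks both matrices through flat `float` pointers and keeps the same element order as the unrolled version in the first post. Whether it is actually faster depends on the compiler, so treat it as something to measure rather than a guaranteed win.

```c
#include <string.h>

/* Assumed layout: the thread only ever shows grid[4][4] being indexed. */
typedef struct { float grid[4][4]; } Matrix;

/* Overwrites self with the product, where
 * result[i][j] = sum over k of self[k][j] * other[i][k],
 * matching the unrolled code in the first post. */
void matrix_multiply_flat(Matrix *self, const Matrix *other)
{
    const float *a = &self->grid[0][0];   /* a[k*4 + j] == self->grid[k][j]      */
    const float *b = &other->grid[0][0];  /* b[k] == other->grid[i][k] for row i */
    float out[16];
    float *o = out;

    for (int i = 0; i < 4; i++, b += 4) {
        for (int j = 0; j < 4; j++) {
            *o++ = a[j] * b[0] + a[4 + j] * b[1]
                 + a[8 + j] * b[2] + a[12 + j] * b[3];
        }
    }
    /* Copy back in one go instead of sixteen separate assignments. */
    memcpy(self->grid, out, sizeof out);
}
```

The temporary `out` array plays the role of `new_mat` in the original, since the result cannot be written into `self` while its old values are still being read.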
ReKleSS
Posts: 73
Joined: Sat Jun 18, 2005 12:57 pm
Location: Melbourne, Australia

Post by ReKleSS »

Well... the hardware specifications say the PSP can do a vector * 4x4 matrix multiply in 22 cycles... two 4x4 matrices should then be 88 cycles. Look at the vfpu_gum code in the sdk to see how to do it - you'll need to load both matrices into the vfpu, multiply them, then pull out the one you want. Something tells me this will still be faster than what you're doing there...

-ReK
Tinnus
Posts: 67
Joined: Sat Jul 29, 2006 1:12 am

Post by Tinnus »

Anywhere we can see a list of the VFPU commands?
Let's see what the PSP reserves... well, I'd say anything is better than Palm OS.
siberianstar
Posts: 70
Joined: Thu Jun 22, 2006 9:24 pm

Post by siberianstar »

Sure,

Most of the VFPU commands are documented and listed in pspgl_codegen.h inside PSPGL, made by Jeremy Fitzhardinge.

And also here: http://hitmen.c02.at/files/yapspd/psp_d ... tml#sec4.9
Tinnus
Posts: 67
Joined: Sat Jul 29, 2006 1:12 am

Post by Tinnus »

Thanks!
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany

Post by Raphael »

Some other questions regarding vfpu:

1) How far is it possible to interleave CPU and VFPU code to optimize speed? Are there restrictions (apart from result dependencies)?
2) Where can I find a document with a more or less accurate cycle cost for each instruction?
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

Once again I did it the hard way and hand-optimized everything. It now runs as fast as the VFPU :) (well, they both hit the hardware limit, so maybe just as fast :) )
siberianstar
Posts: 70
Joined: Thu Jun 22, 2006 9:24 pm

Post by siberianstar »

The VFPU is faster, but if you have to load the matrices and vector from memory into VFPU registers every time, it becomes slower.
Jim
Posts: 476
Joined: Sat Jul 02, 2005 10:06 pm
Location: Sydney

Post by Jim »

Is the bottom row/right column of your matrix (0,0,0,1)?

Jim
Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

The bottom row is x,y,z,1 and the rest is just an identity matrix, so if you do the multiply by hand you don't even have to change rows 0-2 at all, as the identity will produce the same matrix. It's only on the x,y,z line, where the input is dynamic, that you need to do the math to get the new matrix. So it's much, much faster now.

Sib, faster perhaps, but it doesn't work the way I need for it to be usable.
That said, I'd be surprised if the VFPU didn't have a transform func of some sort.
Jim
Posts: 476
Joined: Sat Jul 02, 2005 10:06 pm
Location: Sydney

Post by Jim »

Are you really saying one of your matrices is

1 0 0 0
0 1 0 0
0 0 1 0
x y z 1

?

If so, then nearly all the multiplies in your code are multiplying by 0 or 1 - what a waste! Optimise it out.

Jim
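
A rough sketch of the optimisation Jim is pointing at, assuming `Matrix` is a plain `float grid[4][4]` wrapper and that the other operand really is the identity with (x, y, z, 1) in the bottom row. `matrix_apply_translation` is a hypothetical name. With the element order used by `Multiply` in the first post, rows 0 to 2 come out unchanged and only the bottom row needs any arithmetic.

```c
/* Assumed layout: the thread only ever shows grid[4][4] being indexed. */
typedef struct { float grid[4][4]; } Matrix;

/* Same result as Multiply() with a matrix that is the identity except for
 * a bottom row of (x, y, z, 1):
 *   new[3][j] = x*grid[0][j] + y*grid[1][j] + z*grid[2][j] + 1*grid[3][j]
 * Every other element is multiplied by 0 or 1, so it is left alone. */
void matrix_apply_translation(Matrix *m, float x, float y, float z)
{
    for (int j = 0; j < 4; j++) {
        m->grid[3][j] += x * m->grid[0][j]
                       + y * m->grid[1][j]
                       + z * m->grid[2][j];
    }
}
```

That is 12 multiplies instead of 64 and no temporary matrix. The additions are grouped slightly differently from the unrolled original, so the last bit of a result may round differently.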
Tinnus
Posts: 67
Joined: Sat Jul 29, 2006 1:12 am

Post by Tinnus »

That's basically a case of

x0 += x; y0 += y; z0 += z;

Forget about a matrix...
Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

Jim, I challenge you to optimize it any further than I have. :)
It's not even using two matrices now, all the 1,0s I don't even compute so you're right, just a bit late to the party :)

Code: Select all


	float fx,fy,fz;
	fx=(new_mat->grid[0][0]*ovx) + (new_mat->grid[1][0]*ovy) + (new_mat->grid[2][0]*ovz) + new_mat->grid[3][0];
	fy=(new_mat->grid[0][1]*ovx) + (new_mat->grid[1][1]*ovy) + (new_mat->grid[2][1]*ovz) + new_mat->grid[3][1];
	fz=(new_mat->grid[0][2]*ovx) + (new_mat->grid[1][2]*ovy) + (new_mat->grid[2][2]*ovz) + new_mat->grid[3][2];
				

Tinnus, that's not true in this case, my multiply as you can see is a transform not a translation.
Tinnus
Posts: 67
Joined: Sat Jul 29, 2006 1:12 am

Post by Tinnus »

Ah OK, sorry.

Anyway, I was wondering... can I use the VFPU to do a matrix multiplication with signed ints rather than floats? I want to do something like

| a b c |   | x |   | u |
| d e f | * | y | = | v |
| g h i |   | z |   | w |

where all values are signed ints.
Fanjita
Posts: 217
Joined: Wed Sep 28, 2005 9:31 am

Post by Fanjita »

Kojima wrote:Jim, I challenge you to optimize it any further than I have. :)
You will still probably get some further improvement by caching the results of the indirections (as mentioned by someone previously in this thread).
Got a v2.0-v2.80 firmware PSP? Download the eLoader here to run homebrew on it!
The PSP Homebrew Database needs you!
siberianstar
Posts: 70
Joined: Thu Jun 22, 2006 9:24 pm

Post by siberianstar »

Well, you would have to cast your values to floats first. Don't use the VFPU for int values: the CPU is better in that case.
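
For reference, a minimal sketch of the CPU-only version of the 3x3 integer multiply Tinnus asked about. The function name and parameter layout are assumptions made for illustration, not something taken from the thread.

```c
/* out = m * v for a 3x3 matrix and a 3-element column vector of signed ints.
 * out must not alias v, because v is still being read while out is written.
 * Overflow is not checked; keep the inputs small enough for int. */
void mat3_mul_vec3_int(const int m[3][3], const int v[3], int out[3])
{
    for (int row = 0; row < 3; row++) {
        out[row] = m[row][0] * v[0]
                 + m[row][1] * v[1]
                 + m[row][2] * v[2];
    }
}
```

With the matrix from the question, `out[0]` is `a*x + b*y + c*z`, and so on for the other two rows.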
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany

Post by Raphael »

Kojima wrote:Jim, I challenge you to optimize it any further than I have. :)
It's not even using two matrices now, all the 1,0s I don't even compute so you're right, just a bit late to the party :)

Code: Select all


	float fx,fy,fz; 
	fx=(new_mat->grid[0][0]*ovx) + (new_mat->grid[1][0]*ovy) + (new_mat->grid[2][0]*ovz) + new_mat->grid[3][0];
	fy=(new_mat->grid[0][1]*ovx) + (new_mat->grid[1][1]*ovy) + (new_mat->grid[2][1]*ovz) + new_mat->grid[3][1];
	fz=(new_mat->grid[0][2]*ovx) + (new_mat->grid[1][2]*ovy) + (new_mat->grid[2][2]*ovz) + new_mat->grid[3][2];
				

Well, you should still definitely change the double-index array style to a single-index array.
You could still improve this by using the VFPU anyway, since all you do here is three vector dot products.
So load your matrix, then load (ovx,ovy,ovz,1):=v as a vector and do the three dot products.
It should look something like this in asm (not tested, and I'm just starting on the VFPU, so there might be errors ;)):

Code: Select all

lv.q  r000, %0  // r000 = mat->grid[0]
lv.q  r010, %0+16  // r010 = mat->grid[1]
lv.q  r020, %0+32  // r020 = mat->grid[2]
lv.q  r030, %0+48  // r030 = mat->grid[3]
lv.q  c100, %1   // c100 = v = (ovx,ovy,ovz,1)

vdot.q s000, c000, c100  // s000 = fx
vdot.q s001, c010, c100  // s001 = fy
vdot.q s002, c020, c100  // s002 = fz
vzero.s s003

sv.q r000, %2  // r =(fx,fy,fz,0)
You could also do 3 vector scales and 4 vector adds instead on the matrix rows, but I doubt that would be faster.
Last edited by Raphael on Tue Aug 08, 2006 10:48 pm, edited 1 time in total.
Jim
Posts: 476
Joined: Sat Jul 02, 2005 10:06 pm
Location: Sydney

Post by Jim »

:p to Kojima. You have to admit you were barking up the wrong tree without a paddle.

I tried a lot of ways of optimising the C, and without knowing the Allegrex CPU you can't say for certain what will make it faster.

But how many iterations are you doing? The FPU has 32 registers. You can put your whole matrix in there and send loads of vertices through.
A small bit of asm would see all of new_mat stuffed in there, and that would make everything a lot faster. Right now you load your matrix and flush it out per vertex.

Jim
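
A C-level sketch of the idea Jim describes, with no asm involved: read the twelve matrix coefficients once, then push a whole batch of vertices through. `Vec3` and `transform_vertices` are hypothetical names, and `Matrix` is assumed to be a plain `float grid[4][4]` wrapper. Whether the compiler really keeps all twelve values in FPU registers is up to the optimiser, which is why Jim proposes asm for the real thing.

```c
#include <stddef.h>

/* Assumed types: the thread shows grid[4][4] but never a vertex struct. */
typedef struct { float grid[4][4]; } Matrix;
typedef struct { float x, y, z; } Vec3;

/* Transforms count vertices using the same formula as Kojima's snippet.
 * in and out may point at the same array. */
void transform_vertices(const Matrix *mat, const Vec3 *in, Vec3 *out, size_t count)
{
    /* Load the matrix once, outside the loop. */
    const float m00 = mat->grid[0][0], m01 = mat->grid[0][1], m02 = mat->grid[0][2];
    const float m10 = mat->grid[1][0], m11 = mat->grid[1][1], m12 = mat->grid[1][2];
    const float m20 = mat->grid[2][0], m21 = mat->grid[2][1], m22 = mat->grid[2][2];
    const float m30 = mat->grid[3][0], m31 = mat->grid[3][1], m32 = mat->grid[3][2];

    for (size_t i = 0; i < count; i++) {
        /* Copy the input first so in-place transforms stay correct. */
        const float x = in[i].x, y = in[i].y, z = in[i].z;
        out[i].x = m00 * x + m10 * y + m20 * z + m30;
        out[i].y = m01 * x + m11 * y + m21 * z + m31;
        out[i].z = m02 * x + m12 * y + m22 * z + m32;
    }
}
```

The point of the design is that the per-vertex cost drops to three loads, nine multiplies and nine adds, with the matrix loads paid once per batch rather than once per vertex.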
Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

You see Jim, I told you there were still ways to optimize it further. :P

I've toyed with using asm, but I'm just not good enough with it tbh. I need to put some time aside to do some asm tests and get familiar with the PSP CPU. I was making some nice progress on the PS2 asm side, then I got the PSP and haven't touched the PS2 since.
Fanjita wrote: You will still probably get some further improvement by caching the results of the indirections (as mentioned by someone previously in this thread).
I've never heard that terminology before. Could you show me a simple example of how to do it?
--

Ralph, thanks for the code/idea, I never realised they were just dot products before. I mean I do now, now that you've told me, but yeah, that should provide a mean speed-up. I'll try that next.

One question: is there any way to render more than 60 frames per second on the PSP? Cos my current model hits the hardware limit, so I have no way of knowing how big an improvement these changes make.
Simply adding another model doesn't work cos the added load of two glDrawElements calls seems to choke the fps.
-

New version of raptor will be out tonight, now with a killer single-surface particle system and vastly improved anim speed on boned entities (i.e. animated b3ds).
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany

Post by Raphael »

Kojima wrote:
You will still probably get some further improvement by caching the results of the indirections (as mentioned by someone previously in this thread).
I've never heard that terminology before. Could you show me a simple example of how to do it?
I think what Fanjita was aiming at, was removing all those -> pointer indirections by working with a pointer to the grid struct. Every such indirection is pretty expensive, so you should avoid that in time critical parts of your code.
You could just do something like

Code: Select all

float (*m)[4] = new_mat->grid;  /* grid is float[4][4], so m points at rows of 4 floats */
and then replace all that new_mat->grid with just m.
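
As a sketch of how that looks when applied to Kojima's transform: the pointer is declared as `float (*m)[4]` (a pointer to rows of four floats) so that the `[row][col]` indexing carries over unchanged. `transform_point` and its `result` parameter are hypothetical, and `Matrix` is assumed to be a plain `float grid[4][4]` wrapper.

```c
/* Assumed layout: the thread only ever shows grid[4][4] being indexed. */
typedef struct { float grid[4][4]; } Matrix;

/* Kojima's three dot products with new_mat->grid fetched once into m.
 * result receives (fx, fy, fz). */
void transform_point(const Matrix *new_mat,
                     float ovx, float ovy, float ovz,
                     float result[3])
{
    const float (*m)[4] = new_mat->grid;  /* cache the indirection once */

    result[0] = m[0][0] * ovx + m[1][0] * ovy + m[2][0] * ovz + m[3][0];
    result[1] = m[0][1] * ovx + m[1][1] * ovy + m[2][1] * ovz + m[3][1];
    result[2] = m[0][2] * ovx + m[1][2] * ovy + m[2][2] * ovz + m[3][2];
}
```

An optimising compiler will often hoist the `new_mat->grid` address by itself, so this mainly makes the intent explicit; any speed difference needs to be measured.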
Kojima wrote:
Ralph, thanks for the code/idea, I never realised they were just dot products before. I mean I do now, now that you've told me, but yeah, that should provide a mean speed-up. I'll try that next.
The biggest speed-up would probably be Jim's idea, to unroll your matrix updates. As you can probably see from my code, loading the matrix now takes up quite a lot of the time.
Kojima wrote:
One question: is there any way to render more than 60 frames per second on the PSP? Cos my current model hits the hardware limit, so I have no way of knowing how big an improvement these changes make.
Simply adding another model doesn't work cos the added load of two glDrawElements calls seems to choke the fps.
The problem is not that the PSP can't render more than 60fps, but that the LCD can't show more than 60fps. So if you want to bench stuff, remove all VSyncWait calls to let the PSP render as fast as it can, independent of the LCD refresh rate. You should then easily 'see' (as in, the fps counter will tell you so ;) frame rates above 100.
Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

I don't use a vsync though, that's what's weird. I even tried using a single buffer with no backbuffer, but then I just got a blank screen.

I just glClear to cls, and glutSwapBuffers or glSwapbuffers. I suppose the vsync is being done by pspgl then?

As for matrices, I did unroll everything. It's just four lines of dot products now, as you said yourself.
Though the indirection idea is good. I always wondered if C++ resolved indirections at compile time or at runtime. Guess now I know :)
Jim
Posts: 476
Joined: Sat Jul 02, 2005 10:06 pm
Location: Sydney

Post by Jim »

Removing the pointer indirection doesn't make any difference in this case. Here's the variations I tried

Code: Select all

typedef struct
{
	float grid[4][4];
} matrix;

float fx,fy,fz;
void kojima(matrix *new_mat, float ovx, float ovy, float ovz)
{
   fx = (new_mat->grid[0][0]*ovx) +
	(new_mat->grid[1][0]*ovy) +
	(new_mat->grid[2][0]*ovz) +
	 new_mat->grid[3][0];

   fy = (new_mat->grid[0][1]*ovx) +
	(new_mat->grid[1][1]*ovy) +
	(new_mat->grid[2][1]*ovz) +
	 new_mat->grid[3][1];

   fz = (new_mat->grid[0][2]*ovx) +
	(new_mat->grid[1][2]*ovy) +
	(new_mat->grid[2][2]*ovz) +
	 new_mat->grid[3][2];
}

void jim(matrix *new_mat, float ovx, float ovy, float ovz)
{
   float *mat = (float *)new_mat;
   fx = ovx * *mat++;
   fy = ovx * *mat++;
   fz = ovx * *mat++;
   mat++;
   fx += ovy * *mat++;
   fy += ovy * *mat++;
   fz += ovy * *mat++;
   mat++;
   fx += ovz * *mat++;
   fy += ovz * *mat++;
   fz += ovz * *mat++;
   mat++;
   fx += *mat++;
   fy += *mat++;
   fz += *mat++;
}

void jim2(matrix *new_mat, float ovx, float ovy, float ovz)
{
   float *mat = (float *)new_mat;
   fx = ovx * mat[0];
   fy = ovx * mat[1];
   fz = ovx * mat[2];
   fx += ovy * mat[4];
   fy += ovy * mat[5];
   fz += ovy * mat[6];
   fx += ovz * mat[8];
   fy += ovz * mat[9];
   fz += ovz * mat[10];
   fx += mat[12];
   fy += mat[13];
   fz += mat[14];
}

void jim3(matrix *new_mat, float ovx, float ovy, float ovz)
{
   float *mat = (float *)new_mat;
   fx = ovx * mat[0];
   fx += ovy * mat[4];
   fx += ovz * mat[8];
   fx += mat[12];

   fy = ovx * mat[1];
   fy += ovy * mat[5];
   fy += ovz * mat[9];
   fy += mat[13];

   fz = ovx * mat[2];
   fz += ovy * mat[6];
   fz += ovz * mat[10];
   fz += mat[14];
}

void jim4(matrix *new_mat, float ovx, float ovy, float ovz)
{
   float *mat = (float *)new_mat;
   fx = ovx * mat[0]
    + ovy * mat[4]
    + ovz * mat[8]
    + mat[12];

   fy = ovx * mat[1]
    + ovy * mat[5]
    + ovz * mat[9]
    + mat[13];

   fz = ovx * mat[2]
    + ovy * mat[6]
    + ovz * mat[10]
    + mat[14];
}
I compiled with
psp-gcc -O -G0 -S -fno-float-store test.c

Basically the only difference is how the optimiser ends up pipelining the multiplies in the FPU. Does anyone know the Allegrex well enough to say if 4 muls issued back-to-back is faster than interleaving some adds/loads in there?
You need -fno-float-store, otherwise the ones where I use += always store the intermediate results to memory (as they should).
Try different levels of optimisation to see slightly different results.

Anyway pre-loading the FPU with the matrix is going to be about 3x the speed of any of these, for any non-trivial set of vertices.

Jim
siberianstar
Posts: 70
Joined: Thu Jun 22, 2006 9:24 pm

Post by siberianstar »

Vsync is automatically enabled in pspgl. If you want to disable it, just open the file

pspgl_vidmem.c

and remove all sceDisplayWaitVblankStart calls and recompile it. You will get a bad flickering effect.

If you read the buffer from the PSP pad, make sure to use sceCtrlPeekBufferPositive

and not sceCtrlReadBufferPositive, because the latter implicitly waits for vblank.

It's stupid to try to optimize a vector transform like this any further:

Code: Select all

   fx=(new_mat->grid[0][0]*ovx) + (new_mat->grid[1][0]*ovy) + (new_mat->grid[2][0]*ovz) + new_mat->grid[3][0];
   fy=(new_mat->grid[0][1]*ovx) + (new_mat->grid[1][1]*ovy) + (new_mat->grid[2][1]*ovz) + new_mat->grid[3][1];
   fz=(new_mat->grid[0][2]*ovx) + (new_mat->grid[1][2]*ovy) + (new_mat->grid[2][2]*ovz) + new_mat->grid[3][2];
Whether you use double indices [][] or a single index [], the speed won't change.

If you want your program to run faster you need to optimise bigger segments of code.
Kojima
Posts: 275
Joined: Mon Jun 26, 2006 3:49 am

Post by Kojima »

So I was right, Jim. :) (Aside from the VFPU, which is not something I'm going to tackle just yet. Too many other bugs to sniff out.)

Tbh I didn't expect indirections to cause much slowdown... not unless the PSP differs vastly from PCs. Which I guess it does, but still.

Sib, thanks for the tip. I'm gonna leave it for now; I don't want flickering just to gauge true fps. Not worth it.
Jim
Posts: 476
Joined: Sat Jul 02, 2005 10:06 pm
Location: Sydney

Post by Jim »

If this was an SH4 (Dreamcast) not a MIPS, the one using ++ to get at the matrix would be quite a bit quicker since SH4 has instructions which increment the source address register after the load. You really have to know your architecture to micro-optimise bits like this.

Jim