Performance Help from Raphael!

roland
Posts: 13
Joined: Sat Feb 11, 2006 9:11 pm

Post by roland »

Hi...

I thought I'd share this with you, since Raphael here has been a huge help for me.

I didn't really do much except follow his advice, and the performance issues were gone...

This was my initial question to him:
Hi Raphael..

I am currently working on a small thing for Breakpoint, and gkmotu was kind enough to send me a sample scene, which I exported from Max.

Attached is the source code. Can you give me some sort of hint why this is so heavy on the hardware?

I am drawing roughly 170,000 vertices, which I believe should not be a problem for the PSP, right?

I would really appreciate it if you could give me a hint on what's eating up all the CPU time.
Thanx
Roland

PS: Grab the source from:
http://212.41.237.42/sony/Cube_Exp.zip
And this is his answer:
Hi there.
I'm really not the ultimate guru for highly optimized GU programming, yet there are some things I would suggest you change or try to change in your program.

First of all, you declare your textures as arrays of short ints with array size width*4*height*4. Since you load only one texture, that is far more than the amount of data you actually need. If you calculate the size of celshade_texture and celshade_swizzled, you will find that each is 64*4*64*4*2 [width*4*height*4*sizeof(short)] = 131072 bytes the way you do it. However, your raw texture is only 16384 bytes in size, since it's a 64x64 texture at 32 bits per pixel. So what you really should declare your textures as is
unsigned int celshade_texture[64*64];
OR
unsigned char celshade_texture[64*64*4];
The first allocates the texture as an array of ints (each being 32 bits wide and thus exactly the size of one pixel in your texture), the second as an array of chars (each being 8 bits in size, so you need 4 of them per pixel in the texture - namely the R, G, B and A components).
I prefer the first method, since it saves you from calculating the byte size of the texture and you can simply put width*height for the array size. However, if you want to manipulate a single component of the texture or something similar, you have to cast the array to a char* first, but that's no big deal.
This will not make your program any faster yet, but in a real-world program/game you need to watch out for such huge memory waste, since you could store 8 textures in the space you allocated for one and, for example, save on reloading textures from the Memory Stick, which would be slow.
Don't forget to apply the changes correctly to your loadRAW and swizzle parameters though (change one of the two 64*4 parameters to 64 only).
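To make the size arithmetic above concrete, here is a minimal sketch in plain C. It assumes a 64x64 texture at 32 bits per pixel, as in the post; the helper names (oversized_bytes, needed_bytes, set_component, get_component) are made up for illustration and are not taken from roland's source. It uses uint32_t in place of unsigned int so the numbers do not depend on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TEX_WIDTH  64
#define TEX_HEIGHT 64

/* One 32-bit element per RGBA pixel: the array size is simply width * height. */
static uint32_t celshade_texture[TEX_WIDTH * TEX_HEIGHT];

/* Bytes taken by the oversized declaration: short[width*4*height*4]. */
size_t oversized_bytes(void) {
    return (size_t)(TEX_WIDTH * 4) * (size_t)(TEX_HEIGHT * 4) * sizeof(uint16_t);
}

/* Bytes actually needed for the 64x64x32bit texture. */
size_t needed_bytes(void) {
    return sizeof(celshade_texture);
}

/* Write one colour component (0..3) of a pixel by viewing the array as bytes. */
void set_component(size_t pixel, size_t component, unsigned char value) {
    unsigned char *bytes = (unsigned char *)celshade_texture;
    assert(pixel < (size_t)(TEX_WIDTH * TEX_HEIGHT) && component < 4);
    bytes[pixel * 4 + component] = value;
}

/* Read one colour component back through the same byte view. */
unsigned char get_component(size_t pixel, size_t component) {
    const unsigned char *bytes = (const unsigned char *)celshade_texture;
    assert(pixel < (size_t)(TEX_WIDTH * TEX_HEIGHT) && component < 4);
    return bytes[pixel * 4 + component];
}
```

With these definitions, oversized_bytes() / needed_bytes() gives the factor of 8 mentioned above, and the byte view is the char* cast described for per-component access.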

The next thing I notice is that your program simply streams the vertex data from system memory (your structures in tunnel2.h) to the GU each frame and lets it render that. Even though this might seem pretty straightforward, it's not the way to get close to the maximum theoretical rendering throughput of the PSP, mainly for these reasons:

1) Copying such a huge data structure (I estimated it at roughly 4MB for the vertex data alone) from system RAM every frame to get it rendered is bad, as the memory bandwidth will limit your performance, even though the PSP is designed pretty well for this purpose.
First of all, you should cut down the size of this data. This can be done by using 16- or 8-bit ints for your vertex and normal coordinates (this would also greatly improve cache usage and thus speed, if you get the size of a vertex down to 8-12 bytes instead of the 24 bytes it currently uses [6 floats = 6*4 bytes]) and by using indexed face vertices rather than submitting three complete vertices for each face (thus cutting memory size down to 1/3 at best, on average to 1/2 - 2/3, depending on your meshes).
Then, to get the maximum out of it, you would have to try some more sophisticated approaches, like caching the most frequently used data (with higher priority given to textures in general) in VRAM. That is not trivial for the vertex data, but in your case you should just start by copying your textures to VRAM after loading; since you only have one small texture, though, the gain will most likely be unnoticeable. Another option is using trickier ways of not redrawing the whole scene every frame. One related example is not clearing the framebuffer every frame, since most of the screen will be unchanged, so you only clear the zbuffer - or even better, use a z-inversion trick [this is more complicated though] to avoid even the necessity of clearing the zbuffer.
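The vertex-size advice in point 1 can be sketched as follows. This is only an illustration under assumptions: the struct names, the padding byte, the [-1, 1] value range and the 16-bit index type are all hypothetical choices, not the layout used in tunnel2.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as described in the post: 6 floats (normal + position) = 24 bytes. */
typedef struct { float nx, ny, nz; float x, y, z; } VertexF32;

/* Hypothetical compact layout: 8-bit normal, one pad byte, 16-bit position. */
typedef struct { int8_t nx, ny, nz, pad; int16_t x, y, z; } VertexCompact;

/* Map a float in [-1, 1] to a signed 16-bit value, clamping out-of-range input. */
int16_t quantize_s16(float v) {
    if (v > 1.0f)  v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    return (int16_t)(v * 32767.0f);
}

/* Bytes streamed per frame when every face vertex is sent in full. */
size_t streamed_bytes(size_t face_vertices, size_t vertex_size) {
    return face_vertices * vertex_size;
}

/* Bytes for an indexed mesh: unique vertices plus one 16-bit index per face vertex. */
size_t indexed_bytes(size_t unique_vertices, size_t face_vertices, size_t vertex_size) {
    assert(unique_vertices <= 65536); /* limit of a 16-bit index */
    return unique_vertices * vertex_size + face_vertices * sizeof(uint16_t);
}
```

For the 170,000 vertices from the question, streamed_bytes(170000, sizeof(VertexF32)) comes out at about 4MB, which matches the estimate above. How much indexing saves depends entirely on how many vertices are shared between faces, so any unique-vertex count you plug in is a guess until you measure your own mesh.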

2) If you are mainly streaming data, the cache is your worst enemy for performance, so ordering and sending your data correctly is important. One built-in method the PSP offers is swizzled textures, so use them whenever possible (for each and every static texture in your scene). Another method when using multiple textures is ordering and drawing your data by assigned texture, meaning you load one texture, draw everything using that texture, then load the next texture, etc.
For vertex data, as already stated above, it's best to reduce the size to a cache-friendly one, which for the PSP is 8-12 bytes.
In general: the smaller your data is, the more of it fits into the cache at the same time (a cache line of the PSP CPU's data cache is 64 bytes afaik, don't know exactly for the GU though) and the faster your program will get - with maximum speed when ALL data is in cache. At the same time, however, order your data so that data needed in succession is stored contiguously in memory: for textures this means swizzling, for vertex data it means using vertex strips/fans or at least indexed vertex lists.
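As an illustration of what swizzling does to the memory order, here is a slow but readable sketch. It assumes the commonly described PSP layout of blocks that are 16 bytes wide and 8 rows high, stored one block after another; the function name swizzle_sketch is made up, and this is not the swizzle routine from roland's source.

```c
#include <assert.h>
#include <stdint.h>

/* Reorder a linear texture into 16-byte x 8-row blocks stored back to back.
   width_bytes is the row length in bytes (e.g. 64 pixels * 4 bytes = 256). */
void swizzle_sketch(uint8_t *out, const uint8_t *in,
                    unsigned width_bytes, unsigned height) {
    unsigned blocks_per_row = width_bytes / 16;
    unsigned x, y;

    assert(width_bytes % 16 == 0 && height % 8 == 0);

    for (y = 0; y < height; ++y) {
        for (x = 0; x < width_bytes; ++x) {
            unsigned block_x = x / 16;  /* which block column */
            unsigned block_y = y / 8;   /* which block row */
            unsigned in_x = x % 16;     /* byte offset inside the block row */
            unsigned in_y = y % 8;      /* row inside the block */
            unsigned block_index = block_y * blocks_per_row + block_x;

            /* each block occupies 16 * 8 = 128 consecutive bytes */
            out[block_index * 128 + in_y * 16 + in_x] = in[y * width_bytes + x];
        }
    }
}
```

For the 64x64 32-bit texture from the post, the call would be swizzle_sketch(dst, src, 64 * 4, 64), which is the same "one parameter stays 64*4, the other becomes 64" pattern mentioned earlier. A real program would use a faster word-wise loop; this version only shows where each byte ends up.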

3) Don't set up all GU render states from scratch for each frame. The PSP GU, like OpenGL on PC, is a "state machine", meaning that when you set a specific option, it stays active as long as you don't change it.
One good example in your code is the setup of the projection matrix
sceGumMatrixMode(GU_PROJECTION);
sceGumLoadIdentity();
sceGumPerspective(fov,16.0f/9.0f,0.5f,1000.0f);
which you do every frame, yet the perspective never changes in your program, since the fov variable cannot be changed at runtime.
More generally: since the PSP (at least afaik), like the PS2, is designed for high data throughput and low code throughput (as opposed to the PC design), avoid unnecessary calls wherever possible in your inner loop. If the code cache works similarly to the data cache (I don't know though, I'm not too much into code caches yet), also try to favor multiple calls to one function over calls to different functions, and arrange your code so that the same functions get called one after another where possible.
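One way to act on point 3 is to remember the last value you sent and only redo the setup when it changes. The sketch below runs on any machine: upload_projection is a stand-in (an assumption for illustration) for the three sceGum* projection calls quoted above, and it merely counts how often it is invoked. All names here are hypothetical.

```c
#include <assert.h>

static int   projection_uploads = 0;  /* how often the setup really ran */
static float cached_fov = -1.0f;      /* -1 means nothing has been sent yet */

/* Stand-in for sceGumMatrixMode / sceGumLoadIdentity / sceGumPerspective. */
static void upload_projection(float fov) {
    (void)fov;
    ++projection_uploads;
}

/* Redo the projection setup only when the field of view actually changed. */
void set_projection_if_changed(float fov) {
    assert(fov > 0.0f);
    if (fov != cached_fov) {
        upload_projection(fov);
        cached_fov = fov;
    }
}

/* Simulate a render loop; returns the total number of uploads so far. */
int render_frames(int frames, float fov) {
    int i;
    for (i = 0; i < frames; ++i) {
        set_projection_if_changed(fov);
        /* ... per-frame drawing would go here ... */
    }
    return projection_uploads;
}
```

With a constant fov the setup runs once no matter how many frames are rendered. In a program like the one discussed, where fov never changes at all, simply moving the three calls out of the frame loop into the init code has the same effect with less machinery.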

4) Don't abuse sceKernelDcacheWritebackAll() if it's not really needed. Currently you call it twice every frame; try to remove one, then the other, then both, and check whether everything is still displayed correctly - it should be, since your data is mostly static. Each call means the CPU has to check the whole cache for consistency with memory content, and even though that's no big deal, it's work you can skip. The same goes for sceGuTexFlush - you only have static textures, so this shouldn't be necessary.


Well, that's the most basic stuff I could think of off the top of my head for now, and yet it's quite a lot. You should probably also start a new thread on this and post my answer there too, so everyone can gain from it.

Hope I could help you.

Greets,
Raphael
Just to show you that the performance has increased:

Old version:
http://212.41.237.42/sony/Cube_Exp.zip

New version:
http://212.41.237.42/sony/Cube_Exp_opt.zip

I am scaling the objects by 6000,6000,6000 and, as described by starman2049 in http://forums.ps2dev.org/viewtopic.php?t=3506, there is something off with the view in the optimized version, even though I did not change anything there...

Thanx Raphael for your help on this!