I need to copy blocks of 32 bytes (16 pixels of 2 bytes each), and was wondering what the fastest way to do that is.
The best one I've found is to use struct pointers, like this:
typedef struct
{
u16 val00;
u16 val01;
u16 val02;
u16 val03;
u16 val04;
u16 val05;
u16 val06;
u16 val07;
u16 val08;
u16 val09;
u16 val10;
u16 val11;
u16 val12;
u16 val13;
u16 val14;
u16 val15;
}
FAST_COPY_32_BYTES;
Then by using pointers :
FAST_COPY_32_BYTES *fast_trg = TARGET_ADDRESS;
FAST_COPY_32_BYTES *fast_src = SOURCE_ADDRESS;
*fast_trg = *fast_src;
That gives decent performance, but not as much as I'd hoped. I tried a struct with half as many fields, all u32, but got weird results. Can VRAM only be accessed in 16-bit chunks?
Does anyone know of a faster way?
What is the fastest way to copy from VRAM to VRAM?
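To make the comparison concrete, here is a minimal, self-contained sketch of the struct-assignment copy next to a plain memcpy() of the same 32 bytes. The names Block16px, copy_block_struct and copy_block_memcpy are made up for illustration, and the struct holds an array instead of sixteen named fields, which has the same layout. Which of the two is faster depends on the compiler and its flags, so this only shows that they do the same job.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t u16;

/* Hypothetical name: 16 pixels of 2 bytes each = 32 bytes. */
typedef struct { u16 px[16]; } Block16px;

/* Struct assignment: the compiler chooses the load/store sequence.
   Both pointers must be at least 2-byte aligned. */
static void copy_block_struct(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    *(Block16px *)dst = *(const Block16px *)src;
}

/* Same copy through memcpy(), for timing against the version above. */
static void copy_block_memcpy(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(Block16px));
}
```

Timing both on the real target, with the same optimization flags, is the only way to know which one wins there.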
Thanks for the quick reply !
ector wrote: memcpy() should beat that silly struct method of yours.
I don't know about the PSP CPU, but I come from GBA programming, and on the ARM32 the struct method compiles to better binary code and saves some cycles compared with a looping memcpy().
I just don't know whether the binary output is as efficient on the PSP CPU (I would think so).
ector wrote: To copy stuff insanely quickly inside vram, set up a render target at your destination, your source as texture, and draw a quad.
Where can I find sample code for that? I have never actually used 3D quads to draw 2D (as I said, coming from a GBA background, I'm not very experienced with anything 3D).
Is there an easy function I can call, passing a source pointer and a target pointer (both in VRAM), to copy X bytes and let the GPU do the work instead of the CPU?
Sorry for the double post, but I just got it working with a struct of u32s, and it turns out to be twice as fast as either the u16 struct or memcpy(). Presumably that is because a 32-bit CPU moves a full word per load/store either way, so copying 16 bits at a time wastes half of each transfer, while copying 32-bit blocks wastes nothing.
I'm sure I'll be better off using the GPU pipeline, but until I figure that out, this u32 struct should do for now.
typedef struct
{
u32 val00;
u32 val01;
u32 val02;
u32 val03;
u32 val04;
u32 val05;
u32 val06;
u32 val07;
}
FAST_COPY_32;
The only problem, though: using it on a non-aligned address causes the PSP to hang ... ugh.
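One way to keep the fast path without the hang is to check alignment first and fall back to memcpy() otherwise. This is only a sketch: Block32, is_aligned4 and copy32_safe are hypothetical names, and it assumes that 4-byte alignment is what the u32 accesses need. If the buffers are declared as u16 pixels, reading them through u32 can also run into strict-aliasing rules, so building with -fno-strict-aliasing may be needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;

/* Hypothetical name: 8 words of 4 bytes each = 32 bytes. */
typedef struct { u32 w[8]; } Block32;

/* True when the address is a multiple of 4 (assumed u32 requirement). */
static int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Copies 32 bytes. Returns 1 if the word-wise fast path was used,
   0 if it fell back to memcpy() because a pointer was unaligned. */
static int copy32_safe(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    if (is_aligned4(dst) && is_aligned4(src)) {
        *(Block32 *)dst = *(const Block32 *)src;
        return 1;
    }
    memcpy(dst, src, sizeof(Block32));
    return 0;
}
```

The check costs a couple of instructions per block; if every block is known to be aligned, it can be turned into an assert and dropped from release builds.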
Yeah unaligned accesses are to be avoided :)
Strange that your struct method is so much faster. I must be too used to MSVC, whose memcpy implementations (yes, it has several, from just inserting MOVs to various unrolled loops) are hard to beat. Maybe GCC isn't as good at optimizing the memcpy intrinsic, or something is not configured right.
chiwaw wrote: Where can I find sample code for that ? I actually never used 3D quads to draw 2D (as I said, being from a GBA background, I'm not very experienced with anything 3D).
Any easy function I can call by passing a source pointer and target pointer (all from VRAM) to copy X amount of bytes, and let the GPU do the work instead of the central CPU ?
Look at the 'gu/rendertarget' and 'gu/blit' samples. They should give you enough information to figure out how to do your GPU-assisted blit.
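As a reference point while reading those samples, this is roughly what a rectangular blit of 16-bit pixels amounts to when done on the CPU: copy one row at a time, stepping source and destination by their own line widths. It is a plain software sketch, not code from the samples, and blit_rect16 and its parameters are hypothetical names. A GPU-assisted blit does the same job in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t u16;

/* Copies a width x height rectangle of 16-bit pixels.
   Strides are given in pixels (full line width of each buffer),
   so source and destination may have different line widths.
   The two rectangles must not overlap. */
static void blit_rect16(u16 *dst, size_t dst_stride,
                        const u16 *src, size_t src_stride,
                        size_t width, size_t height)
{
    assert(dst != NULL && src != NULL);
    assert(width <= dst_stride && width <= src_stride);
    for (size_t y = 0; y < height; y++) {
        memcpy(dst + y * dst_stride,
               src + y * src_stride,
               width * sizeof(u16));
    }
}
```

With width = 16 and height = 1 this reduces to the 32-byte block copy discussed earlier in the thread.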
GE Dominator