64-byte vb alignment required when using -lpspgum_vfpu?
Hi all,
I'm using the VFPU to help transform my vertices. At the very beginning I found my sprites blinking, and I thought it was because I didn't wait for Vblank. It got worse as more and more sprites were transformed/rendered: they blinked more frequently, sometimes disappeared, or were even mangled by mis-transformed vertex coords and bad UV coords. After that, I realized it has nothing to do with Vblank.
I don't think this is a cache-related topic, 'cause I always update my vertex buffer through an uncached pointer, and I never flush the cache after initialization. I don't think this is a VFPU-alignment topic either :p, 'cause I saw that pspgum_vfpu.c always uses 'ulv/usv' to manipulate the in/out matrices, even though my matrices are allocated on the stack, which means they may not be aligned.
And, most importantly, I get exactly what I want when linking with -lpspgum rather than -lpspgum_vfpu, using 16-byte-aligned vb/ibs. Everything runs perfectly when I stick to the CPU/FPU (well... except for the FPS). When I switch to the VFPU, I have to align the buffers to 64 bytes for them to render (or transform?) correctly.
Well, I finally got this fixed by memalign()'ing my vertex buffers and index buffers to 64 bytes, but I don't know what's happening. Any ideas? Thanks in advance.
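In case it helps, this is roughly the allocation pattern I ended up with (a minimal sketch; the Vertex layout, VERTEX_COUNT, and the names are just placeholders for my real setup):

Code:
#include <malloc.h>
#include <stdint.h>

typedef struct { float u, v; float x, y, z; } Vertex;  /* placeholder format */

#define VERTEX_COUNT 1024  /* placeholder size */

static Vertex *vb;           /* cached address returned by memalign() */
static Vertex *vb_uncached;  /* alias used for all CPU writes */

void init_vb(void)
{
    /* 64 bytes is the dcache line size; aligning to it keeps the buffer
       from sharing a cache line with neighbouring allocations. */
    vb = memalign(64, VERTEX_COUNT * sizeof(Vertex));

    /* Setting bit 30 of the address gives the uncached mirror of the
       same physical memory. */
    vb_uncached = (Vertex *)((uintptr_t)vb | 0x40000000);
}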
Okay, just getting the silly questions out of the way early. :) You never know what someone will forget. Next silly question - you do have PSP_THREAD_ATTR_VFPU in the module header, right? And any threads that also use the VFPU?
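For reference, those flags live in the module header macros; something like this (the module name is just an example):

Code:
#include <pspkernel.h>

PSP_MODULE_INFO("myapp", 0, 1, 1);
/* Give the main thread VFPU access - and remember to OR in
   PSP_THREAD_ATTR_VFPU for any other thread that uses the VFPU too. */
PSP_MAIN_THREAD_ATTR(PSP_THREAD_ATTR_USER | PSP_THREAD_ATTR_VFPU);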
You said you write to uncached memory - did you invalidate the cache for that region (or flush/invalidate the entire cache)? If a region is cached, accessing that region through the uncached portion of the map will STILL use the cache until it's been invalidated.
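In other words, something along these lines before touching the uncached alias (a sketch; the function name is mine):

Code:
#include <psputils.h>

void prepare_uncached_region(void *buf, unsigned int size)
{
    /* Write back any dirty lines and drop them from the dcache, so
       later uncached accesses really hit RAM instead of stale lines. */
    sceKernelDcacheWritebackInvalidateRange(buf, size);
    /* sceKernelDcacheWritebackInvalidateAll() is the sledgehammer
       version if you'd rather not track individual ranges. */
}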
J.F. wrote: Okay, just getting the silly questions out of the way early. :) You never know what someone will forget. Next silly question - you do have PSP_THREAD_ATTR_VFPU in the module header, right? And any threads that also use the VFPU?
Well, I think I will never forget to specify the PSP_THREAD_ATTR_VFPU flag when I use the VFPU. If I do, my PSP will crash-to-power-off to remind me :p The other thread is just sleeping there, waiting to serve the exit callback. And that's all the threads I have.
J.F. wrote: You said you write to uncached memory - did you invalidate the cache for that region (or flush/invalidate the entire cache)?
The only trick I played with the cache is sceKernelDcacheWritebackInvalidateAll(); I called it after all the initialization. After that I forgot about the cache. I never read from my vertex buffers, I only update them through uncached pointers.
J.F. wrote: If a region is cached, accessing that region through the uncached portion of the map will STILL use the cache until it's been invalidated.
This is what I didn't know before. Thanks for the information, J.F.
Yes, I think you get the point, J.F.
I've tried using cached pointers and then flushing the cache explicitly before the sceGumDrawArray() calls, and that works fine even with 16-byte-aligned buffers.
So... according to what you said in the previous post:
J.F. wrote: If a region is cached, accessing that region through the uncached portion of the map will STILL use the cache until it's been invalidated.
...this really means a few bytes at the beginning/end of my vertex buffer can end up cached if I access the memory just a few bytes ahead of/behind the vb block through a cached pointer, because those bytes may share the same cache lines, right? And that would make my subsequent update operations actually update nothing but the cache.
Now I think I should not only align the start address of my vertex buffers to a 64-byte boundary, but also round the buffer sizes up to a multiple of 64.
So in the end this really is a cache-related topic :p
Thanks J.F.
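If anyone wants the concrete version, the allocation would then look something like this (a sketch; the rounding macro is my own):

Code:
#include <malloc.h>

/* Round a size up to a whole number of 64-byte cache lines so the
   buffer never shares a line with a neighbouring allocation. */
#define ALIGN64(n) (((n) + 63u) & ~63u)

void *alloc_vb(unsigned int bytes)
{
    /* 64-byte-aligned start address AND a 64-byte-multiple size. */
    return memalign(64, ALIGN64(bytes));
}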
J.F. wrote: If a region is cached, accessing that region through the uncached portion of the map will STILL use the cache until it's been invalidated.
It shouldn't. What will happen is the cached data will eventually get written back as usual, and the changes you made to uncached memory would be overwritten and lost.
jimparis wrote: It shouldn't. What will happen is the cached data will eventually get written back as usual, and the changes you made to uncached memory would be overwritten and lost.
Do you have a source for that? It's not typical behavior for CPUs. Caches are a shortcut that comes before any external access, and they are physically mapped. Cached and uncached are the same physical address, which then shortcuts to the cache while it still holds valid data. Every CPU I've ever worked on has this problem, which is why they all require you to invalidate the cache before using uncached memory.
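To make the hazard concrete, this is the kind of thing that goes wrong (illustrative only; not tied to either explanation of the mechanism):

Code:
#include <stdint.h>

void alias_hazard(int *buf)
{
    int *cached   = buf;
    int *uncached = (int *)((uintptr_t)buf | 0x40000000);

    cached[0]   = 1;  /* lands in a dcache line, which is now dirty */
    uncached[0] = 2;  /* intended to go straight to RAM */

    /* Whether the uncached access is serviced by the still-valid line,
       or the dirty line gets written back over it later, the value in
       memory is not the one you expect - hence: invalidate first. */
}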
If you access a memory location through a cached segment and then try to access it through an uncached segment, two situations can occur:
1) a writeback occurs before you access it through the uncached segment: you'll read the same value, and if you write a new value, that's okay too.
2) a writeback never occurs before you access it through the uncached segment: you'll read a different (stale) value, and if you write a new value, it will be clobbered by the old value from the cached segment once a writeback finally happens.
The point is: be sure never to mix cached and uncached accesses.
Note that uncached accesses are good only for write-only purposes.
And you probably need a "sync" at the end to be sure the write buffer has updated real memory as well.
EDIT: I wouldn't trust "ulv.q". I would use four "lv.s" or an aligned "lv.q" instead.
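For those last two points, that would look roughly like this (a sketch, assuming pspsdk types):

Code:
#include <psptypes.h>

/* Force 16-byte alignment so VFPU code can use the aligned
   lv.q/sv.q forms instead of ulv.q/usv.q. */
static ScePspFMatrix4 model __attribute__((aligned(16)));

static inline void drain_write_buffer(void)
{
    /* "sync" drains the CPU write buffer, so earlier uncached stores
       have actually reached memory before the GE reads them. */
    __asm__ volatile ("sync");
}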