graphics cards have WAY less texture cache than the spu has local store; you don't need the whole texture in it to rasterize.
my rasterizer always works on 64 (8*8) pixels at a time. one memory transfer of about 32*32 texels takes roughly 600 cycles, so I always have about 10 cycles/pixel (40 cycles/float4) for computations, and that is just about right for a balance. (my profiling shows about 4 cycles of sync time on average, but that is just due to some really big peaks, probably hazards from too many dma transfers at the same time.)
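to make the budget explicit, here is the arithmetic as a tiny C++ sketch. the 600-cycle figure is the rough number from above, and the function names are made up for illustration:

```cpp
#include <cassert>

// Rough figures from the post; treat them as estimates, not measurements.
const int kQuadPixels = 8 * 8;     // pixels rasterized per step
const int kTileDmaCycles = 600;    // approx. cycles to DMA a 32*32 texel tile

// Cycles of compute per pixel that fit under one tile transfer.
double cyclesPerPixel(int dmaCycles, int pixels) {
    assert(pixels > 0);
    return static_cast<double>(dmaCycles) / pixels;
}

// A float4 covers 4 pixels, so the budget per float4 is four times larger.
double cyclesPerFloat4(int dmaCycles, int pixels) {
    return 4.0 * cyclesPerPixel(dmaCycles, pixels);
}
```

with 600 cycles and 64 pixels this gives 9.375 cycles/pixel and 37.5 cycles/float4, which is where the rounded 10 and 40 come from.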
my rough pipe is:
1. coarse check & early out
2. start dma zbuffer
3. calculate UVs for coarse tile and start dma texture tile
4. calculate coverage and depth values
5. sync zbuffer dma.
6. check z-values of pixel-fragments against zbuffer
7. sync texture (needs to be done here as the transfer order is not guaranteed to be the request order)
8. if all fragments are (behind zbuffer || not covered), skip to the next 8x8 pixel quad
9. start dma framebuffer
10. sample texture (quite a lot of math work)
11. sync framebuffer
12. blend or mask framebuffer with color of fragments
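the ordering of the steps above can be sketched roughly like this in C++. this is not my actual spu code: `DmaLog` is a hypothetical stand-in that only records the order of the dma starts and syncs, the coarse check is reduced to "coverage is empty", and the "smaller depth is nearer" test is an assumption:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

const int kPixels = 8 * 8;

// Hypothetical stand-in for asynchronous DMA: it only records the call order.
struct DmaLog {
    std::vector<std::string> events;
    void start(const std::string& what) { events.push_back("start " + what); }
    void sync(const std::string& what)  { events.push_back("sync " + what); }
};

struct QuadInput {
    std::uint64_t coverage;             // bit i set: pixel i is inside the triangle
    std::array<float, kPixels> depth;   // fragment depth per pixel
};

// Returns the mask of fragments that get shaded (0 means the quad was skipped).
std::uint64_t processQuad(const QuadInput& in,
                          const std::array<float, kPixels>& zbuffer,
                          DmaLog& dma) {
    if (in.coverage == 0) return 0;     // 1. coarse check & early out (simplified)
    dma.start("zbuffer");               // 2.
    dma.start("texture");               // 3. UVs for the coarse tile go here
    // 4. coverage and depth would be computed here while the DMAs are in flight
    dma.sync("zbuffer");                // 5.
    std::uint64_t visible = 0;          // 6. depth test (assumes smaller = nearer)
    for (int i = 0; i < kPixels; ++i) {
        bool covered = ((in.coverage >> i) & 1u) != 0;
        if (covered && in.depth[i] < zbuffer[i]) visible |= std::uint64_t(1) << i;
    }
    dma.sync("texture");                // 7. completion order is not request order
    if (visible == 0) return 0;         // 8. nothing visible, next quad
    dma.start("framebuffer");           // 9.
    // 10. texture sampling for the fragments in `visible` would go here
    dma.sync("framebuffer");            // 11.
    // 12. blend or mask the framebuffer with the fragment colors
    return visible;
}
```

the point is that every sync sits as late as possible, so the math in steps 4, 6 and 10 runs while a transfer is still in flight.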
my texture cache management is direct-mapped, so there is really not much logic needed to check whether a 32*32 texture quad is in memory. a texture cache of 4 tiles gives around 60% hit rate; making it 16 tiles gave nearly no speedup or increase in hit rate. making it 4-way associative gave about 0.1% better hit rate but was slower due to instruction dependencies. a bigger or smarter cache barely helps because the memory access pattern is very predictable.
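a minimal sketch of what a direct-mapped tile cache lookup can look like, in C++. the tile id packing and the slot function are assumptions for illustration (my real layout may differ); the slot is picked from the low bits of the tile coordinates so that a 2x2 block of neighbouring tiles never collides:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

const int kSlots = 4;                          // 4 tiles, as in the post
const std::uint32_t kEmpty = 0xFFFFFFFFu;      // marks an unused slot

struct TileCache {
    std::array<std::uint32_t, kSlots> tag;     // tile id stored in each slot
    int hits;
    int misses;

    TileCache() : hits(0), misses(0) { tag.fill(kEmpty); }

    // Hypothetical id: tile coordinates (texel / 32) packed with the mip level.
    static std::uint32_t tileId(std::uint32_t tx, std::uint32_t ty, std::uint32_t mip) {
        assert(tx < 4096 && ty < 4096 && mip < 255);
        return (mip << 24) | (ty << 12) | tx;
    }

    // Direct mapping: one possible slot per tile, no search needed.
    static int slotOf(std::uint32_t tx, std::uint32_t ty) {
        return static_cast<int>((tx & 1u) | ((ty & 1u) << 1));
    }

    // Returns true on a hit. On a miss the slot is retagged; real code would
    // start the DMA of the tile into that slot at this point.
    bool lookup(std::uint32_t tx, std::uint32_t ty, std::uint32_t mip) {
        const std::uint32_t id = tileId(tx, ty, mip);
        const int slot = slotOf(tx, ty);
        if (tag[slot] == id) { ++hits; return true; }
        tag[slot] = id;
        ++misses;
        return false;
    }
};
```

the whole check is one compare per tile, which is why the associative variant loses: it has to compare several tags and pick a victim, and that adds dependent instructions.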
I can even get away without using a cache and just stream in the texture quads from scratch (4* 32x32 texels); that does not slow things down as long as there is no other memory transfer going on.
with 6 spus and the ppu all working on memory, though, the cache does help.
if you had shaders that access random locations in the texture, it would be way more of a headache to hide the latency, but it's fairly simple if you implement a fixed-function pipeline.
with 8x8 pixels, you can estimate that at most 16*16 texels are accessed. they are all nearby, so at worst you will hit 4 different tiles. before you start the fine rasterization, you can make a coarse check at the 4 pixels on the border of an 8x8 pixel quad. this way you can
-check if the triangle will set any pixel of the 8x8 quad (fast skipping)
-calculate the UVs -> calculate the mip level -> start to dma the 4 texture tiles into your cache if they are not there yet
(yes, this is not as accurate a mip level selection as gfx cards do, and it is just bilinear filtering, if any is enabled at all)
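the UV -> mip level -> tiles part of the coarse check could look roughly like this in C++. it assumes the UVs at the 4 checked pixels are already in texel units at mip 0 and non-negative, and the mip selection (floor of log2 of texels per pixel) is a simple stand-in, not necessarily the exact formula I use:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

const int kTileSize = 32;   // texels per tile edge, as in the post

struct UV { float u, v; };  // texel coordinates at mip 0 (assumed non-negative)
struct TileRange { int mip, tx0, ty0, tx1, ty1; };  // inclusive tile range at `mip`

// c: UVs at the 4 checked pixels of the 8x8 quad (7 pixel steps apart).
TileRange coarseTiles(const UV (&c)[4]) {
    const float minU = std::min(std::min(c[0].u, c[1].u), std::min(c[2].u, c[3].u));
    const float maxU = std::max(std::max(c[0].u, c[1].u), std::max(c[2].u, c[3].u));
    const float minV = std::min(std::min(c[0].v, c[1].v), std::min(c[2].v, c[3].v));
    const float maxV = std::max(std::max(c[0].v, c[1].v), std::max(c[2].v, c[3].v));
    assert(minU >= 0.0f && minV >= 0.0f);

    // Coarse footprint: texels stepped per pixel along the larger axis.
    const float texelsPerPixel = std::max(maxU - minU, maxV - minV) / 7.0f;
    const int mip = texelsPerPixel <= 1.0f
        ? 0
        : static_cast<int>(std::floor(std::log2(texelsPerPixel)));

    // Scale the UVs down to the chosen mip and convert texels to tile indices.
    const float scale = 1.0f / static_cast<float>(1 << mip);
    TileRange r;
    r.mip = mip;
    r.tx0 = static_cast<int>(minU * scale) / kTileSize;
    r.ty0 = static_cast<int>(minV * scale) / kTileSize;
    r.tx1 = static_cast<int>(maxU * scale) / kTileSize;
    r.ty1 = static_cast<int>(maxV * scale) / kTileSize;
    return r;
}
```

with this selection the quad steps less than 2 texels per pixel at the chosen mip, so the footprint stays inside the 16*16 texel estimate and the range never spans more than 2 tiles per axis, i.e. 4 tiles at worst.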
a bigger issue for me was the z-/framebuffer, as you need to transfer the old tiles out first and then dma the new data in. yes, you can chain those requests, but that makes the latency nearly twice as high.
my rasterizer is just a forward rasterizer due to memory limitations, mainly because I have to guarantee some memory consumption and can't realloc memory if I run out.
if you have the memory left, a full deferred renderer (deferred rasterization a la Larrabee
http://www.ddj.com/hpc-high-performance ... 602?pgno=3 for gbuffer + deferred shading) could really achieve high quality + speed.
rapso