SPE software rasterizer

mooriel · Post by **mooriel** » Tue Jun 09, 2009 3:55 am

Miya, a Japanese programmer, has released a 2D software rasterizer on Cell.
It's very cool, and source code is available.

http://www.youtube.com/watch?v=sljmSv4mt2Y

From his description:

3067 FPS: 1000 polygons (200 pixels), 720x480 (this video)

60 FPS: 30000 polygons (200 pixels), 1920x1080

Source code available at:
http://miyazhp.hp.infoseek.co.jp/polydemo01.tar.gz

On the newer version of my demo,
polygon count benchmark test:
49.841 MPolygons/sec

Pixel fill rate benchmark test:
6248.684 MPixels/sec

Its source code:
http://miyazhp.hp.infoseek.co.jp/polydemo02.tar.gz

ps2devman · Post by **ps2devman** » Tue Jun 09, 2009 10:48 pm

Thanks

rapso · Post by **rapso** » Mon Jun 15, 2009 9:36 pm

Is this using one or all SPUs?

jimparis · Post by **jimparis** » Tue Jun 16, 2009 7:21 am

From his description on the Youtube video, 6 SPEs.

ouasse · Post by **ouasse** » Tue Jun 16, 2009 7:35 am

I have had a quick look at the code. The rasterizer merely displays single-coloured triangles, and the code is meant to be fast.

That said, as the comments are in Japanese, I may have missed some features.

mooriel · Post by **mooriel** » Tue Jun 16, 2009 4:59 pm

rapso wrote:Is this using one or all SPUs?

Maybe it is scalable.

ppe/main.c

num_of_spe = 6;

J.F. · Post by **J.F.** » Tue Jun 16, 2009 6:09 pm

ouasse wrote:I have had a quick look at the code. The rasterizer merely displays single-coloured triangles, and the code is meant to be fast.

It's a RASTERIZER, not a texture mapper. I guess the next thing would be to replace the solid color raster line draw with affine mapping of textures. That would be pretty simple and nearly as fast. You could go for full perspective correct mapping after that - it's a divide per pixel, but the SPEs should be able to do that fairly quickly.
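To make the difference between the two mappings concrete, here is a minimal sketch of both along one raster line. It is not taken from Miya's code: the `SpanEnd` struct and the function names are made up for illustration, and `w` is assumed to be the depth value used for perspective division.

```c
#include <stddef.h>

/* One end of a raster line: texture coordinates plus depth w.
   (Hypothetical layout, for illustration only.) */
typedef struct { float u, v, w; } SpanEnd;

/* Affine mapping: u and v are interpolated linearly in screen space. */
void span_uv_affine(SpanEnd a, SpanEnd b, size_t n, float *u_out, float *v_out)
{
    for (size_t i = 0; i < n; ++i) {
        float t = (n > 1) ? (float)i / (float)(n - 1) : 0.0f;
        u_out[i] = a.u + t * (b.u - a.u);
        v_out[i] = a.v + t * (b.v - a.v);
    }
}

/* Perspective-correct mapping: u/w, v/w and 1/w are interpolated linearly,
   then u and v are recovered with one divide per pixel. */
void span_uv_perspective(SpanEnd a, SpanEnd b, size_t n, float *u_out, float *v_out)
{
    float a_iw = 1.0f / a.w;
    float b_iw = 1.0f / b.w;
    for (size_t i = 0; i < n; ++i) {
        float t  = (n > 1) ? (float)i / (float)(n - 1) : 0.0f;
        float iw = a_iw + t * (b_iw - a_iw);
        float uw = a.u * a_iw + t * (b.u * b_iw - a.u * a_iw);
        float vw = a.v * a_iw + t * (b.v * b_iw - a.v * a_iw);
        float w  = 1.0f / iw;              /* the per-pixel divide */
        u_out[i] = uw * w;
        v_out[i] = vw * w;
    }
}
```

For a line going from w = 1 to w = 3, the affine version puts the middle pixel at u = 0.5 while the perspective version puts it at u = 0.25, which is the distortion the extra divide removes.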

ouasse · Post by **ouasse** » Wed Jun 17, 2009 4:43 am

J.F., I think texture mapping brings much more trouble than you think. Textures have to be loaded into SPE memory, and when several textures have to be mapped onto different triangles, DMA transfers would certainly become a serious bottleneck. I have also written test rasterizer routines, and I really cannot find an efficient way to deal with several textures.

Of course, some kind of triangle sorting by texture ID would certainly optimize the DMA transfers for the textures, but even then there seems to be no efficient way of doing this in parallel using all SPEs. Maybe a PPE thread could handle it: in a kind of task pipeline model between the PPE and the SPEs, there may be a way not to lose SPE time on this.
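As a rough illustration of the sorting idea, the sketch below sorts triangles by texture ID on one thread (for example the PPE) and counts how often the texture changes in draw order. The `Tri` layout and all function names are assumptions made for this example, and the switch count is only a crude proxy for texture DMA traffic.

```c
#include <stdlib.h>

/* Hypothetical triangle record: only the texture ID matters here. */
typedef struct { int texture_id; int first_vertex; } Tri;

/* qsort comparator ordering triangles by texture ID. */
static int by_texture(const void *a, const void *b)
{
    int ta = ((const Tri *)a)->texture_id;
    int tb = ((const Tri *)b)->texture_id;
    return (ta > tb) - (ta < tb);
}

void sort_by_texture(Tri *tris, size_t n)
{
    qsort(tris, n, sizeof *tris, by_texture);
}

/* Number of times the texture changes when drawing in array order. */
size_t texture_switches(const Tri *tris, size_t n)
{
    size_t switches = 0;
    for (size_t i = 1; i < n; ++i)
        if (tris[i].texture_id != tris[i - 1].texture_id)
            ++switches;
    return switches;
}
```

With texture IDs 2, 1, 2, 1, 3 the draw order has four switches; after sorting it has two.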

I think this topic could be a good place for discussing tricks for realtime SPE 3D rendering ;)

J.F. · Post by **J.F.** » Wed Jun 17, 2009 6:03 am

Yeah, I was talking about just rendering, not texture management. You don't have a lot of local memory, so eventually, you'd need some ability to pre-load textures before they were needed. I think that could be ignored to start with until the code for rendering was done. Just as the current code just does a single color, the next step would render a single texture.

ouasse · Post by **ouasse** » Wed Jun 17, 2009 7:28 pm

Sorry if my English lacks accuracy. For me, a "rasterizer" is everything required to draw triangles on a framebuffer. Of course, you don't need textures to be able to draw triangles, but in my mind, textures are part of the job.

I hope Miya will find good ideas and a lot of motivation to get a fully working renderer/rasterizer/texture mapper/whatever engine.

I also hope I will find the time and motivation for my own work to lead to something releasable one day.

rapso · Post by **rapso** » Fri Jun 19, 2009 8:19 pm

Graphics cards have WAY less texture cache than the SPU has local store; you don't need the whole texture in it to rasterize.

My rasterizer always works on 64 (8*8) pixels at a time. One memory transfer of about 32*32 texels takes roughly 600 cycles, so spread over 64 pixels I always have about 10 cycles/pixel (40 cycles/float4) to do computations, and this is just about right for a balance. (My profiling shows about 4 cycles of sync time on average, but that is just due to some really big peaks, probably hazards from too many DMA transfers at the same time.)
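The budget above is just the transfer time divided by the block size. A small sketch of the arithmetic, using the figures quoted in the post; the function names are made up, and treating one float4 as four pixels is an assumption about how the 40 cycles/float4 figure was derived.

```c
/* Cycles of compute available per pixel while one tile DMA is in flight. */
double cycles_per_pixel(double dma_cycles, int block_w, int block_h)
{
    return dma_cycles / (double)(block_w * block_h);
}

/* Same budget per 4-wide vector, assuming one float4 holds four pixels. */
double cycles_per_float4(double dma_cycles, int block_w, int block_h)
{
    return 4.0 * cycles_per_pixel(dma_cycles, block_w, block_h);
}
```

With 600 cycles and an 8x8 block this gives 9.375 cycles per pixel and 37.5 per float4, which round to the 10 and 40 quoted above.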

My rough pipe is:
1. coarse check & early out
2. start DMA of the zbuffer
3. calculate UVs for the coarse tile and start DMA of the texture tile
4. calculate coverage and depth values
5. sync zbuffer DMA
6. check z-values of the pixel fragments against the zbuffer
7. sync texture (needs to be done here, as the transfer order is not guaranteed to be the request order)
8. if all fragments are (behind zbuffer || not covered), go to the next 8x8 pixel quad
9. start DMA of the framebuffer
10. sample texture (quite a lot of math work)
11. sync framebuffer
12. blend or mask the framebuffer with the color of the fragments
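The non-texture part of these steps can be sketched in plain C. This is only a rough model, not rapso's code: `memcpy` stands in for the asynchronous DMA requests (so "start" and "sync" collapse into one call), the texture steps 3, 7 and 10 are replaced by a flat color, "smaller depth is nearer" is assumed, and every name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TILE 8
#define TILE_PIXELS (TILE * TILE)

/* Stand-ins for DMA between main memory and local store. On a real SPE
   these would be asynchronous requests followed later by a sync. */
static void dma_get(void *local, const void *main_mem, size_t n) { memcpy(local, main_mem, n); }
static void dma_put(void *main_mem, const void *local, size_t n) { memcpy(main_mem, local, n); }

typedef struct {
    float    depth[TILE_PIXELS];  /* step 4: fragment depth per pixel   */
    uint64_t coverage;            /* step 4: bit i set if pixel i is hit */
    uint32_t color;               /* flat color instead of texturing     */
} TileFragments;

/* Processes one 8x8 tile stored contiguously in main memory.
   Returns the number of pixels written. */
int shade_tile(const TileFragments *frag, float *zbuf_main, uint32_t *fb_main)
{
    float    z[TILE_PIXELS];
    uint32_t fb[TILE_PIXELS];
    uint64_t visible = 0;
    int      written = 0;

    if (frag->coverage == 0)                      /* step 1: early out     */
        return 0;

    dma_get(z, zbuf_main, sizeof z);              /* steps 2 and 5         */

    for (int i = 0; i < TILE_PIXELS; ++i)         /* step 6: depth test    */
        if (((frag->coverage >> i) & 1u) && frag->depth[i] < z[i])
            visible |= (uint64_t)1 << i;

    if (visible == 0)                             /* step 8: nothing shows */
        return 0;

    dma_get(fb, fb_main, sizeof fb);              /* steps 9 and 11        */

    for (int i = 0; i < TILE_PIXELS; ++i) {       /* step 12: masked write */
        if ((visible >> i) & 1u) {
            fb[i] = frag->color;
            z[i]  = frag->depth[i];
            ++written;
        }
    }

    dma_put(fb_main, fb, sizeof fb);              /* write results back    */
    dma_put(zbuf_main, z, sizeof z);
    return written;
}
```

In the real pipe the point of the ordering is that each transfer is started well before its sync, so the math in between hides the latency; the synchronous stand-ins here cannot show that overlap.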

My texture cache management is direct-mapped, so there is really not much logic needed to check whether a 32*32 texture quad is in memory. A texture cache of 4 tiles gives about a 60% hit rate; making it 16 tiles gave nearly no speedup or increase in hit rate. Making it 4-way associative gave about 0.1% better hit rates but was slower due to instruction dependencies. That's because of the very predictable memory access pattern.

I can even get away without using a cache and just stream in texture quads from scratch (4 * 32x32 texels); that does not slow things down as long as there is no other memory transfer going on.
With 6 SPUs and the PPU working on memory, the cache does help.
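A direct-mapped tile cache of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not rapso's implementation: the slot is picked from the low bit of the tile's x and y index (so any 2x2 neighbourhood of tiles fits without collisions), `fetch_fill` is a toy replacement for the texture DMA, and all names are invented.

```c
#include <stdint.h>

#define CACHE_SLOTS 4             /* four 32x32 tiles, as in the post */
#define TILE_TEXELS (32 * 32)

typedef struct {
    int32_t  tag[CACHE_SLOTS];    /* tile id held by each slot, -1 if empty */
    uint32_t texels[CACHE_SLOTS][TILE_TEXELS];
    unsigned hits, misses;
} TileCache;

/* Callback that loads one tile; stands in for the DMA from main memory. */
typedef void (*FetchFn)(int tile_x, int tile_y, uint32_t *dst);

/* Toy fetch: fills the tile with a value derived from its coordinates. */
void fetch_fill(int tile_x, int tile_y, uint32_t *dst)
{
    for (int i = 0; i < TILE_TEXELS; ++i)
        dst[i] = ((uint32_t)tile_y << 16) | (uint32_t)tile_x;
}

void cache_init(TileCache *c)
{
    for (int i = 0; i < CACHE_SLOTS; ++i)
        c->tag[i] = -1;
    c->hits = 0;
    c->misses = 0;
}

/* Direct mapping: exactly one candidate slot per tile, no search needed. */
static int slot_for(int tile_x, int tile_y)
{
    return (tile_x & 1) | ((tile_y & 1) << 1);
}

/* Returns the cached texels of a tile, fetching them on a miss. */
const uint32_t *cache_lookup(TileCache *c, int tile_x, int tile_y,
                             int tiles_per_row, FetchFn fetch)
{
    int     slot = slot_for(tile_x, tile_y);
    int32_t id   = tile_y * tiles_per_row + tile_x;

    if (c->tag[slot] == id) {
        c->hits++;
    } else {
        c->misses++;
        fetch(tile_x, tile_y, c->texels[slot]);
        c->tag[slot] = id;
    }
    return c->texels[slot];
}
```

The whole hit test is one compare against one tag, which is why the direct-mapped version has so little logic compared with a 4-way associative one that has to check four tags and pick a victim.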

If you had shaders that access random locations in the texture, hiding latency would be way more of a headache, but it's fairly simple if you implement a fixed-function pipeline.

With 8x8 pixels, you can estimate that at most 16*16 texels are accessed. They are all nearby, so at worst you will hit 4 different tiles. Before you start the fine rasterization, you can do a coarse check at the 4 pixels on the border of an 8x8 pixel quad. This way you can:
- check whether any pixel of the 8x8 quad will be set (fast skipping)
- calculate the UVs -> calculate the mip level -> start the DMA of the 4 texture tiles into your cache if they are not there yet
(yes, this mip level selection is not as accurate as what gfx cards do, and there is just bilinear filtering, if any is enabled at all)
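Both coarse steps can be sketched with a few lines of C. This reads "the 4 pixels on the border" as the four corners of the quad, which is an interpretation rather than something the post states; the edge-function representation, the floor-style mip selection and all names are likewise assumptions.

```c
/* Triangle edge as a*x + b*y + c, with values >= 0 meaning "inside". */
typedef struct { float a, b, c; } Edge;

static float absf(float x) { return x < 0.0f ? -x : x; }
static float maxf(float x, float y) { return x > y ? x : y; }

/* Coarse test on the four quad corners. Returns 0 when one edge has all
   four corners outside (no pixel can be covered), 1 otherwise. */
int tile_maybe_covered(const Edge e[3], float x0, float y0, float size)
{
    float cx[4] = { x0, x0 + size, x0,        x0 + size };
    float cy[4] = { y0, y0,        y0 + size, y0 + size };

    for (int k = 0; k < 3; ++k) {
        int outside = 0;
        for (int i = 0; i < 4; ++i)
            if (e[k].a * cx[i] + e[k].b * cy[i] + e[k].c < 0.0f)
                ++outside;
        if (outside == 4)
            return 0;
    }
    return 1;
}

/* One mip level for the whole quad, from UVs (in texels) at the corners in
   the order top-left, top-right, bottom-left, bottom-right. The level is
   the floor of log2 of the texels stepped per pixel, clamped to max_level. */
int tile_mip_level(const float u[4], const float v[4], float size, int max_level)
{
    float along_x = maxf(absf(u[1] - u[0]), absf(v[1] - v[0]));
    float along_y = maxf(absf(u[2] - u[0]), absf(v[2] - v[0]));
    float texels_per_pixel = maxf(along_x, along_y) / size;
    int   level = 0;

    while (texels_per_pixel >= 2.0f && level < max_level) {
        texels_per_pixel *= 0.5f;
        ++level;
    }
    return level;
}
```

Choosing the level so that fewer than two texels are stepped per pixel is what keeps an 8x8 quad within roughly 16*16 texels, and therefore within at most four neighbouring 32*32 tiles.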

A bigger issue for me was the z-/framebuffer, as you need to transfer the old tiles out first and then DMA the new data in. Yes, you can chain those requests, but that makes the latency nearly twice as high.

My rasterizer is just a forward rasterizer due to memory limitations, mainly because I have to guarantee a certain memory consumption and can't realloc memory if I run out.
If you have the memory left, a fully deferred renderer (deferred rasterization à la Larrabee http://www.ddj.com/hpc-high-performance ... 602?pgno=3 for gbuffer + deferred shading) could really achieve high quality + speed.

rapso

ouasse · Post by **ouasse** » Sun Jun 21, 2009 11:40 pm

rapso, this sounds VERY interesting. Your project seems to be in a very advanced state. Have you got "viewable" results of what your rasterizer can do?

ouasse · Post by **ouasse** » Mon Jun 22, 2009 6:57 pm

rapso wrote:Graphics cards have WAY less texture cache than the SPU has local store; you don't need the whole texture in it to rasterize.

Well, in my personal rasterizer work, I have taken advantage of that additional SPE memory, which allows drawing on much bigger zones than 8x8 squares at once. My test code draws on screen stripes 4 or 8 pixels high, depending on the SPE memory required for storing textures (which I do not handle for the moment).

All triangles are drawn on each stripe, then the fully drawn stripe is sent to the framebuffer. Parallelism among SPEs is trivial, as there is no dependency between different stripes. There is also no need to store a zbuffer in main memory, since everything happens in local store before the fully drawn stripe is sent.
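The stripe scheme can be sketched as follows. This is a simplified model and not ouasse's code: axis-aligned rectangles with a constant depth stand in for triangles, `memcpy` stands in for the DMA to the framebuffer, the round-robin stripe assignment is an assumption, and the screen size and names are invented.

```c
#include <stdint.h>
#include <string.h>

#define SCREEN_W 64
#define STRIPE_H 8

/* Stand-in primitive: half-open rectangle [x0,x1) x [y0,y1), constant depth. */
typedef struct { int x0, y0, x1, y1; float z; uint32_t color; } Rect;

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Stripes are independent, so they can simply be dealt out round-robin. */
int stripe_owner(int stripe, int num_spe) { return stripe % num_spe; }

/* Draws every primitive into one stripe using color and depth buffers that
   live only in local memory, then sends the finished stripe out once. */
void render_stripe(int stripe, const Rect *prims, int count, uint32_t *framebuffer)
{
    uint32_t color[STRIPE_H * SCREEN_W];
    float    depth[STRIPE_H * SCREEN_W];   /* never written to main memory */
    int      y_base = stripe * STRIPE_H;

    for (int i = 0; i < STRIPE_H * SCREEN_W; ++i) {
        color[i] = 0;
        depth[i] = 1.0f;                   /* far plane */
    }

    for (int p = 0; p < count; ++p) {
        int xa = imax(prims[p].x0, 0);
        int xb = imin(prims[p].x1, SCREEN_W);
        int ya = imax(prims[p].y0, y_base);
        int yb = imin(prims[p].y1, y_base + STRIPE_H);

        for (int y = ya; y < yb; ++y) {
            for (int x = xa; x < xb; ++x) {
                int idx = (y - y_base) * SCREEN_W + x;
                if (prims[p].z < depth[idx]) {
                    depth[idx] = prims[p].z;
                    color[idx] = prims[p].color;
                }
            }
        }
    }

    /* One transfer of the fully drawn stripe to the framebuffer. */
    memcpy(framebuffer + (size_t)y_base * SCREEN_W, color, sizeof color);
}
```

Each SPE would call `render_stripe` for the stripes it owns; since no stripe reads another stripe's pixels or depth values, no synchronisation is needed between them.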

For now I haven't handled textures at all, which I'm sure is gonna be a big deal to get efficiently working. I may have time during the summer to get things a little bit further.

rapso · Post by **rapso** » Mon Jun 22, 2009 9:14 pm

ouasse wrote:rapso, this sounds VERY interesting. Your project seems to be in a very advanced state. Have you got "viewable" results of what your rasterizer can do?

I'm not allowed to show anything, as this is company work and all PR stuff is handled by dedicated people. It's also company policy that we try to show both console platforms as being of equal quality for our tech, so I doubt it will ever be shown to the public.

rapso · Post by **rapso** » Mon Jun 22, 2009 9:26 pm

ouasse wrote:
rapso wrote:Graphics cards have WAY less texture cache than the SPU has local store; you don't need the whole texture in it to rasterize.
Well, in my personal rasterizer work, I have taken advantage of that additional SPE memory, which allows drawing on much bigger zones than 8x8 squares at once. My test code draws on screen stripes 4 or 8 pixels high, depending on the SPE memory required for storing textures (which I do not handle for the moment).

My setup for 8x8-pixel quads is just a (compile-time) const that can be easily adjusted; 8x8 seemed to be best. 8x4 pixels, according to my static code analyser, led to a lot of gaps and stalls, and benchmarks showed it to be slower. Bigger quads were less efficient because there were a lot more invisible pixels that were masked out later on (with smaller tiles, those are rejected more effectively by the coarse test).

ouasse wrote:All triangles are drawn on each stripe, then the fully drawn stripe is sent to the framebuffer. Parallelism among SPEs is trivial, as there is no dependency between different stripes. There is also no need to store a zbuffer in main memory, since everything happens in local store before the fully drawn stripe is sent.

That's definitely a good deferred approach. You might be even more efficient if you work on quads rather than stripes, as you'll have to process fewer triangles per quad than per stripe. I think the Larrabee papers describe that in good detail :)
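The quads-versus-stripes point comes down to how many triangle bounding boxes overlap each bin. A small sketch of that count, with invented names and half-open boxes as an assumed convention:

```c
/* Half-open screen-space box [x0,x1) x [y0,y1). */
typedef struct { int x0, y0, x1, y1; } Box;

int boxes_overlap(Box a, Box b)
{
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

/* Number of triangles whose bounding box touches the given bin, i.e. how
   many triangles that bin has to walk through when it is rendered. */
int count_in_bin(const Box *tri_bounds, int n, Box bin)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (boxes_overlap(tri_bounds[i], bin))
            ++count;
    return count;
}
```

Three small triangles spread along the top of the screen all land in the same full-width stripe, while a square bin covering only one of them has a single triangle to process.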

ouasse wrote:For now I haven't handled textures at all, which I'm sure is gonna be a big deal to get efficiently working. I may have time during the summer to get things a little bit further.

I'm looking forward to seeing that :). Threads like this are always motivating for me to continue my homebrew stuff as well. I hope I'll have it in some showable state soon (it's not really related to my "company rasterizer").

I wish Sony had just built in 4 Cells, 32 SPUs each, and no other gfx hardware. It would be an insane amount of fun.

ouasse · Post by **ouasse** » Tue Jun 23, 2009 7:02 pm

rapso wrote:You might be even more efficient if you work on quads rather than stripes, as you'll have to process fewer triangles per quad than per stripe. I think the Larrabee papers describe that in good detail :)

You are certainly right. I just never had any idea other than using stripes. I will try using quads instead someday.

rapso wrote:I'm looking forward to seeing that :).

Well, I still have lots of work to do before showing something viewable ;)

rapso wrote:Threads like this are always motivating for me to continue my homebrew stuff as well.

It is exactly the same for me. Reading such messages makes me think that software 3D on PS3 is coming really soon, and I definitely want to be a part of it ;)

rapso wrote:I hope I'll have it in some showable state soon (it's not really related to my "company rasterizer").

I'm looking forward to seeing that as well! :)

What about your "company rasterizer"? Are you working at a game company, or some company interested in software 3D rendering?

rapso wrote:I wish Sony had just built in 4 Cells, 32 SPUs each, and no other gfx hardware. It would be an insane amount of fun.

Hehe, maybe on Playstation 4. Have you read about the future Cell 3, featuring 2 empowered PPEs and 32 SPEs? ;)