using STL with the SPE's
I'm basically writing a software rasterizer on my PC using SDL, which I will eventually port to the PS3 using the framebuffer, but I'm trying to think ahead when it comes to efficiency. I want to be able to depth sort my render list and so on, and I'm not sure whether templated containers like std::vector work well on this type of CPU. Any thoughts on this?
J.F. wrote: Using C++ for a library is rather bad. It makes it a huge pain in the ass to use from C programs. The converse is not the case - C libraries are easy to use from C++. If you are making something you want to reach the most folk, ditch C++.
I'm not a C++ fan at all, but this is incorrect, it's quite possible to write a library in C++ and expose a normal C interface to applications linking it in. Although in an embedded scenario there may still be issues with ctors and other runtime/linktime related things.
I would highly recommend not using STL on the SPUs....or in any gamedev situation... or at all really :P, google if you need reasons for this. (I beat this horse far too much)
jbit wrote: I'm not a C++ fan at all, but this is incorrect, it's quite possible to write a library in C++ and expose a normal C interface to applications linking it in. Although in an embedded scenario there may still be issues with ctors and other runtime/linktime related things.
I believe I mentioned the "normal C interface", ctors, and other related things: I said "huge pain in the ass", which describes that just as clearly as what you said. :D
I didn't say it was impossible, just a huge pain. And while it's "possible", very few c++ libraries make the effort as it IS such a huge pain.
Re: using STL with the SPE's
Compound wrote: basically writing a software rasterizer on my pc using SDL, which will eventually port to ps3 using the framebuffer, but im trying to think forward when it comes to efficiency, my render list i want to be able to depth sort etc etc, and im not sure if using template'd functions like STD vector work well on this type of cpu? any thoughts on this?
There is no SPU optimised C++ standard library implementation (that I'm aware of), which means that using the standard library will run as poorly on the SPU as any other generically written code, give or take.
http://www.pixelglow.com/macstl/ has some SIMD optimised generic C++ containers which may be of interest, but only in principle as they do not have a SPU-optimised implementation. (edit: also, it hasn't been updated for more than three years...). edit: I had a closer look - I don't like it. Not recommended.
Also, what jbit said.
jbit wrote: I would highly recommend not using STL on the SPUs....or in any gamedev situation... or at all really :P, google if you need reasons for this. (I beat this horse far too much)
Care to beat that horse one more time? I'm curious why you don't like STL and why you think it's better to avoid it. I use STL for most if not all data structures I need, and also for a great deal of repetitive tasks on them, sorting, etc. Together with the Boost libraries they really are the only reason I still prefer C++ over Java, C# and the like. If you're smart enough to write good software I assume you're also smart enough to know how to use the STL efficiently, and at least for me it has always been a real enabler for doing things the right way (tm) instead of re-inventing the wheel all the time and ending up with suboptimal data structures and algorithms, just because it's too much work (re-)doing everything yourself. I'm not aware of any negative facts about runtime performance of the STL. I've always been under the impression STL code was pretty fast; at least when I look at the source code for STL templates they seem pretty close to the metal to me. But maybe I'm missing something.
On the 'don't use C++ for your libraries because it will be a pain in the ass' I can only say: bullshit... It's just as easy to write a C wrapper interface for a C++ library as it is to call C code from C++. There's really no reason to stay away from C++ out of portability constraints or whatever. My personal opinion is that it's exactly the other way around: stay away from C as much as you can because _that_ will give you headaches in the long run in terms of maintainability, nasty ways of shooting yourself in the foot, a limited standard library (which isn't as portable as people might make you believe), obfuscated code that hides the algorithms, being forced to do stuff at runtime that can also be done at compile time (using templates) etc. And it doesn't bring you _anything_ but fast compilation and a non-measurable performance increase in those parts of your code that shouldn't be performance-critical anyway if your code is worth anything. I can only think of 2 reasons to stick with C: 1) you're targeting a platform that doesn't have a (proper) C++ compiler or standard library (yes I'm looking at you, Solaris) and 2) you're writing a driver, part of a kernel, or some other piece of code that is so time-critical, or must be so free of side-effects, that every CPU cycle spent can be traced back directly to some line of code you wrote yourself.
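For anyone who hasn't seen the C wrapper approach being argued about here, a minimal sketch follows. Every name in it (RenderList, render_list_create and friends) is made up for illustration and not taken from any real library. It assumes the usual pattern of an opaque handle plus free functions with C linkage, and it keeps exceptions from crossing the C boundary, which is one of the ctor/runtime issues mentioned earlier in the thread.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// C++ side: an ordinary class that uses the standard library internally.
// (Hypothetical example class, not from any real library.)
class RenderList
{
public:
    void add(float depth) { depths_.push_back(depth); }
    std::size_t size() const { return depths_.size(); }
private:
    std::vector<float> depths_;
};

// C side: an opaque handle plus free functions with C linkage.
// In a real library the declarations would sit in a header guarded by
// #ifdef __cplusplus / extern "C" so that C programs can include it.
extern "C" {

typedef struct render_list render_list;   // never defined; used only as a handle

render_list* render_list_create(void)
{
    // nothrow new returns a null pointer on failure instead of throwing.
    return reinterpret_cast<render_list*>(new (std::nothrow) RenderList);
}

int render_list_add(render_list* list, float depth)
{
    try {
        reinterpret_cast<RenderList*>(list)->add(depth);
        return 0;
    } catch (...) {
        return -1;   // report failure as an error code; never throw into C
    }
}

std::size_t render_list_size(const render_list* list)
{
    return reinterpret_cast<const RenderList*>(list)->size();
}

void render_list_destroy(render_list* list)
{
    delete reinterpret_cast<RenderList*>(list);
}

} // extern "C"
```

A C program would only ever see the four functions and the handle type. Whether that counts as "easy" or "a huge pain" is the disagreement above: the wrapper is mechanical, but every class you expose needs one.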
Disclaimer: I'm not a game programmer (not even close) but I do make a living writing software, and the stuff I write goes into really big, really expensive machines that also really need good performance. Good performance not being defined as 'shaving off every last cycle' as in game programming but as 'scaling with really, really large datasets'
Thanks for the feedback guys. Like d-range, I'm also curious about the reasoning for not using STL. I can understand that it will take up unnecessary memory on the SPEs, but I can't see how writing custom containers / data structures can make a noticeable difference over using STL.
thanks, not trying to start a flame war here before anyone starts getting overly defensive, would just like to establish the facts
As stated above, this is a little religious, but there are serious issues with using STL on embedded systems:
* Excessive use of memory allocation, and while you can use custom allocators, you will usually end up with really bad memory fragmentation.
* Usually not hardware aware. Things might not always be aligned correctly (especially for efficient DMA)
* Increases code size quite a lot (which of course causes I$ issues)
* Not generally optimized for cache performance
* Lots of ABI and compatibility issues
* Can encourage programming with unknown memory consumption. (Not so much memory leaks as unmetered consumption)
* Makes debugging hell (maintainable code should not only be "easy to work with" but easy to debug)
Electronic Arts (and probably several other studios, publishers, etc) have their own, non-standard, versions of STL to try to mitigate some of the above issues.
http://www.open-std.org/jtc1/sc22/wg21/ ... n2271.html
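To make the allocation, alignment and unmetered-memory points in the list above a bit more concrete, here is a minimal sketch of the kind of container people tend to write instead. FixedArray and its members are hypothetical names, not part of STL or of EA's library, and the sketch assumes C++11 (for alignas) and an element type that is default-constructible with an alignment of 16 bytes or less.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity array. Storage lives inside the object, so
// there is no heap allocation, no fragmentation, and sizeof() tells you
// the entire memory footprint up front.
template <typename T, std::size_t Capacity>
struct FixedArray
{
    // 16-byte alignment so the buffer can be a DMA source or target.
    // Assumes alignof(T) <= 16.
    alignas(16) T items[Capacity];
    std::size_t count = 0;

    // Refuses to grow: returns false when full instead of reallocating.
    bool push_back(const T& value)
    {
        if (count == Capacity)
            return false;
        items[count++] = value;
        return true;
    }

    T& operator[](std::size_t i)
    {
        assert(i < count);   // bounds check in debug builds only
        return items[i];
    }

    std::size_t size() const { return count; }
};
```

The trade-off is that the capacity has to be chosen at compile time, which is exactly the kind of decision std::vector lets you avoid making.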
I could rant on for pages about why you should _NOT_ use STL (and especially not on SPUs), but jbit covered most of it really.
But one major issue is that the SPU has only one data type, and that is a qword (a 128-bit quadword).
Handling data on the SPUs as anything other than qwords will reduce performance a lot (especially for a software renderer).
There's also the memory model. SPU is built for streaming. STL is not. SPU deals with data that might appear at different memory addresses on different processors. STL doesn't even pretend it can cope with that.
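A rough illustration of what "built for streaming" means in practice, written as portable C++ so it runs on a PC. This is only a sketch: stream_add and kChunkFloats are made-up names, the chunk size is arbitrary, and memcpy stands in for the DMA transfers that real SPU code would issue asynchronously and overlap with the compute step.

```cpp
#include <cstddef>
#include <cstring>

const std::size_t kChunkFloats = 1024;   // 4 KB per chunk; arbitrary size for illustration

// Adds `delta` to every element of `data`, touching main memory only in
// whole chunks. The work always happens on a small local copy.
void stream_add(float* data, std::size_t count, float delta)
{
    alignas(16) float local[kChunkFloats];   // stands in for SPU local store

    for (std::size_t base = 0; base < count; base += kChunkFloats)
    {
        std::size_t n = count - base;
        if (n > kChunkFloats)
            n = kChunkFloats;

        std::memcpy(local, data + base, n * sizeof(float));   // "get" from main memory
        for (std::size_t i = 0; i < n; ++i)
            local[i] += delta;                                // compute on the local copy
        std::memcpy(data + base, local, n * sizeof(float));   // "put" back to main memory
    }
}
```

Note that the data has a different address while it is being worked on than it has in main memory, which is why anything holding raw pointers into itself (as node-based containers do) doesn't survive the trip.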
STLs usually give you a decent implementation of that particular interface, but if you care a lot about performance, it's the wrong interface. For the best performance, you need complete control over how things are laid out in memory, and when they are accessed. It is possible to write efficient code with STL, but it isn't natural, and you'd be better off without.
As emoon says, SPU likes quadwords. It really really likes them.
Let's say you're writing a game. Your world has objects, which have positions and velocities. To update the position of an object, you add the velocity to it. Most people would write something like this:
Code:
    struct Object
    {
        Vector3 position;
        Vector3 velocity;
    };
    std::vector<Object> objects;
    typedef std::vector<Object>::iterator iterator;
    iterator start = objects.begin();
    iterator end = objects.end();
    for ( iterator i = start; i != end; ++i )
    {
        i->position += i->velocity;
    }
There are a number of problems with this. Firstly, if you have the sense to make Vector3 16 bytes (with 4 unused), you're only using three quarters of your vector unit. If you don't have that much sense, your data won't be aligned and the vector unit will hate you.
Secondly, there's a good chance that loop will be compiled like this:
Code:
    load position
    load velocity
    add velocity to position
    store position
That might look fine, but it isn't really. The loads and add have latency, which means that if you try to use the result too soon, the processor will stall. This code (and most code that I see written by people who aren't very careful about it) is very serial: the results of most instructions are used almost immediately. There will be stalls everywhere.
A better implementation is this:
Code:
    // maxNumObjects must be a multiple of 4
    struct
    {
        Vector4 positionX[maxNumObjects/4];
        Vector4 positionY[maxNumObjects/4];
        Vector4 positionZ[maxNumObjects/4];
        Vector4 velocityX[maxNumObjects/4];
        Vector4 velocityY[maxNumObjects/4];
        Vector4 velocityZ[maxNumObjects/4];
    } objects;
    // round up so a partially filled last batch is processed too
    for ( int i = 0; i < (numObjects + 3)/4; i++ )
    {
        Vector4 positionX = objects.positionX[i];
        Vector4 velocityX = objects.velocityX[i];
        Vector4 positionY = objects.positionY[i];
        Vector4 velocityY = objects.velocityY[i];
        Vector4 positionZ = objects.positionZ[i];
        Vector4 velocityZ = objects.velocityZ[i];
        positionX += velocityX;
        positionY += velocityY;
        positionZ += velocityZ;
        objects.positionX[i] = positionX;
        objects.positionY[i] = positionY;
        objects.positionZ[i] = positionZ;
    }
Every vector instruction here is doing four operations of useful work instead of three, so that's 25% fewer vector instructions for the same work already. Also, the results of the loads and adds aren't used immediately, so the latency is partially hidden. It's worth finding out the latencies of the instructions you use, and unrolling the loop enough to hide them completely.
You might protest "what if the number of objects isn't a multiple of four?" That's where careful data design comes in. With this structure, even if the last batch of four objects isn't full, you know that the memory is there and not used by anything else. You know there are no pointers. It's safe to process a few unused objects. With STL all you have is a black box. You can't make extra assumptions.
Most of this applies to all modern CPUs, not just SPU. Hardware has changed beyond recognition since the 1950s, but it seems the mental model of it that most programmers use hasn't. Learn what the hardware likes, and it will reward you.
I've more than doubled the speed of some parts of our game by doing things like this. It's worth it. If I was writing a software rasteriser, I wouldn't dream of doing it any other way.
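For anyone who wants to try the structure-of-arrays layout from this post on a PC first, here is a self-contained sketch. Vec4, kMaxObjects, lane and update_positions are stand-in names invented for this example. Vec4 is plain C++ rather than an SPU vector type, so this shows the data layout, the object-to-lane mapping and the rounded-up batch count, not the actual SIMD instructions.

```cpp
#include <cstddef>

// Stand-in for a 16-byte hardware vector (assumption: on SPU this would
// be a native vector type and += would be a single instruction).
struct alignas(16) Vec4
{
    float v[4];
    Vec4& operator+=(const Vec4& o)
    {
        for (int k = 0; k < 4; ++k)
            v[k] += o.v[k];
        return *this;
    }
};

const std::size_t kMaxObjects = 64;              // assumed to be a multiple of 4
const std::size_t kMaxBatches = kMaxObjects / 4;

struct Objects
{
    Vec4 positionX[kMaxBatches], positionY[kMaxBatches], positionZ[kMaxBatches];
    Vec4 velocityX[kMaxBatches], velocityY[kMaxBatches], velocityZ[kMaxBatches];
};

// Object n lives in lane n % 4 of batch n / 4.
inline float& lane(Vec4* array, std::size_t n)
{
    return array[n / 4].v[n % 4];
}

void update_positions(Objects& o, std::size_t numObjects)
{
    // Round up: the last batch may be partly unused, which is harmless
    // because the padding lanes belong to this structure and nothing else.
    const std::size_t batches = (numObjects + 3) / 4;
    for (std::size_t i = 0; i < batches; ++i)
    {
        o.positionX[i] += o.velocityX[i];
        o.positionY[i] += o.velocityY[i];
        o.positionZ[i] += o.velocityZ[i];
    }
}
```

The lane() helper is only there for the infrequent code that needs to read or write a single object; the hot loop never looks at individual objects at all.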
Great post! You're completely right - a lot of programmers are too concerned with high-level programming and as a result, the programs are ponderous and bloated. They need to do a little more assembly language and banging directly on the hardware, then go back to the higher level languages with a better understanding of the machine. No one cares about generic templates if the program sucks because of it.
I forgot to add: there is a place for STL and virtual functions and other high level concepts.
Programs can usually be split into two parts. One is complex and executed infrequently. In games, that will be camera control, sophisticated AI on a very small number of the most important opponents, and that sort of thing. The high level stuff fits perfectly here. Life's much simpler that way.
The other part is simpler and executed very frequently. This is animation, rendering the world, simple AI on the hundreds of extras you can't afford to make too clever, and so on. This is where you start coding C++ as if it were assembly.
If you can't find that split, you've got some profiling and refactoring to do. Or you just put up with a program that isn't as fast as it could be.