automatic alignment of stack vars for proper vfpu

yabadabo · Post by **yabadabo** » Mon Apr 09, 2007 11:11 pm

Hi

If you have tried to use the VFPU, you will have noticed that it's essential for all the variables it references to be 16-byte aligned. This works fine for global vars using the gcc __attribute__((aligned(16))), as with ScePspFVector4 in psptypes.h.

But if you create temporary variables on the stack that need to be referenced with VFPU instructions, the app might raise an exception.

I found a solution to this problem, which consists of passing -mpreferred-stack-boundary=4 when compiling the source files. This forces gcc to create the temporary variables aligned to 16 bytes, instead of the default 8 bytes.
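As a minimal sketch of what this looks like in code: a local is declared with the same attribute used for globals and checked in debug builds. The `vec4f` struct and both function names are hypothetical stand-ins, not PSPSDK names. Whether the attribute is actually honoured for a stack variable depends on the stack alignment the compiler assumes, which is what the flag above changes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ScePspFVector4: four packed floats (16 bytes). */
typedef struct { float x, y, z, w; } vec4f;

/* Returns non-zero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Declares an aligned local and verifies the alignment before using it. */
static float sum_local_vector(void)
{
    vec4f v __attribute__((aligned(16))) = { 1.0f, 2.0f, 3.0f, 4.0f };

    /* In a debug build this fires before a quad-word access could fault. */
    assert(is_aligned16(&v));

    return v.x + v.y + v.z + v.w;
}
```

On the PSP this would be built with something like `psp-gcc -mpreferred-stack-boundary=4 file.c`, using the flag from the post above.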

yabadabo

hlide · Post by **hlide** » Tue Apr 10, 2007 4:12 pm

yabadabo wrote:I found a solution to this problem, which consists of passing -mpreferred-stack-boundary=4 when compiling the source files. This forces gcc to create the temporary variables aligned to 16 bytes, instead of the default 8 bytes.

well, this is a recent addition to psp-gcc I added because it was only effective on i386-gcc beforehand.

However, you should be very careful not to mix code compiled with and without this option. If you call an external function compiled with this option from a function that isn't, you may end up with a misaligned stack pointer: psp-gcc's stack alignment doesn't ensure the right alignment at function entry. It assumes the stack pointer is already correctly aligned and only allocates locals in such a way that the alignment is preserved.
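To make the failure mode concrete, here is a simplified model of the arithmetic (a sketch only, not the actual psp-gcc prologue; the function name and the example addresses are made up). The callee subtracts a frame size that is a multiple of 16 from whatever stack pointer it receives, so any misalignment at entry carries straight through to its locals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: address of a local placed `offset` bytes above the
 * new stack pointer, after the callee reserved `frame_size` bytes.
 * Both values are multiples of 16, as a 16-byte-boundary build would pick. */
static uint32_t local_address(uint32_t sp_at_entry,
                              uint32_t frame_size,
                              uint32_t offset)
{
    assert(frame_size % 16 == 0 && offset % 16 == 0);
    return sp_at_entry - frame_size + offset;
}
```

With an entry stack pointer of 0x09FFF000 the local lands on a 16-byte boundary; with 0x09FFF008 (a caller that only kept 8-byte alignment) the same local ends up 8 bytes off.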

rapso · Post by **rapso** » Tue Apr 10, 2007 5:14 pm

it's not that hard to align all vars on the stack yourself, just a macro :)

yabadabo · Post by **yabadabo** » Wed Apr 11, 2007 7:41 am

Do you mean a macro that creates an array of bytes on the stack and a pointer to a Matrix/quaternion var pointing into that array?... Without a clear idea of what that macro would look like, I prefer the compiler option. :)

J.F. · Post by **J.F.** » Wed Apr 11, 2007 2:29 pm

Just use the defines from gumInternal.h:

Code: Select all

// these macros are because GCC cannot handle aligned matrices declared on the stack
#define GUM_ALIGNED_MATRIX() (ScePspFMatrix4*)((((unsigned int)alloca(sizeof(ScePspFMatrix4)+64)) + 63) & ~63)
#define GUM_ALIGNED_VECTOR() (ScePspFVector4*)((((unsigned int)alloca(sizeof(ScePspFVector4)+64)) + 63) & ~63)

yabadabo · Post by **yabadabo** » Thu Apr 12, 2007 8:00 am

Thanks, very interesting, I didn't know they were there.
Anyway, I think aligning to 16 bytes instead of 64 should be enough, at least for the vfpu instructions.
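A 16-byte variant of the same trick might look like the sketch below. `STACK_ALIGNED16`, `vec4f` and `sum_with_alloca` are hypothetical names, not part of gumInternal.h. The macro over-allocates by 15 bytes and rounds the address up to the next multiple of 16. It has to stay a macro so that alloca() runs in the frame of the function that uses the pointer.

```c
#include <alloca.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ScePspFVector4. */
typedef struct { float x, y, z, w; } vec4f;

/* Reserve sizeof(type) + 15 bytes on the caller's stack, then round the
 * returned address up to a 16-byte boundary. */
#define STACK_ALIGNED16(type) \
    ((type *)(((uintptr_t)alloca(sizeof(type) + 15) + 15u) & ~(uintptr_t)15u))

/* Example use: the vector lives until this function returns. */
static float sum_with_alloca(void)
{
    vec4f *v = STACK_ALIGNED16(vec4f);
    assert(((uintptr_t)v & 15u) == 0);

    v->x = 1.0f; v->y = 2.0f; v->z = 3.0f; v->w = 4.0f;
    return v->x + v->y + v->z + v->w;
}
```

Compared with the 64-byte macros, this wastes at most 15 bytes of padding per allocation instead of up to 63.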

hlide · Post by **hlide** » Thu Apr 12, 2007 3:36 pm

J.F. wrote:Just use the defines from gumInternal.h:

Code: Select all

// these macros are because GCC cannot handle aligned matrices declared on the stack
#define GUM_ALIGNED_MATRIX() (ScePspFMatrix4*)((((unsigned int)alloca(sizeof(ScePspFMatrix4)+64)) + 63) & ~63)
#define GUM_ALIGNED_VECTOR() (ScePspFVector4*)((((unsigned int)alloca(sizeof(ScePspFVector4)+64)) + 63) & ~63)

alloca uses allocation in heap (unless gcc sees them as a builtin to replace them as an allocation in stack but i'm unsure about it).

chp · Post by **chp** » Thu Apr 12, 2007 6:04 pm

hlide wrote:alloca uses allocation in heap (unless gcc sees them as a builtin to replace them as an allocation in stack but i'm unsure about it).

Wrong. From the description of alloca():

The alloca function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca returns to its caller.

Since I wrote those macros I think I know what they were supposed to do. :)

Aligning to 16 instead of 64 should be enough though, there was an argument about aligning matrices back when it was written, and I decided to go the safe route when doing the first VFPU stuff. It hasn't been updated since then.

hlide · Post by **hlide** » Fri Apr 13, 2007 4:08 am

chp wrote:
hlide wrote:alloca uses allocation in heap (unless gcc sees them as a builtin to replace them as an allocation in stack but i'm unsure about it).
Wrong. From the description of alloca():
The alloca function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca returns to its caller.
Since I wrote those macros I think I know what they were supposed to do. :)

Well, the very first implementation of alloca I saw allocated space on the heap, in such a way that a successive call to alloca would free the memory blocks that had been allocated in callees. But I believe I saw that gcc handles it through a builtin function to force it onto the stack. Probably any compliant C compiler must now handle alloca so that it allocates directly on the stack. Or it may be a GCC exception.

I must check it.

hlide · Post by **hlide** » Fri Apr 13, 2007 4:20 am

chp wrote:The alloca function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca returns to its caller.

Please could you tell me how a C function can allocate space on the stack without that space being freed when the C function exits? The only way I can see for it to allocate stack space that stays usable is for it to be a builtin function, which gcc handles by inserting code to allocate space on the stack and free it at the exit of the function where alloca() is called.

Too bad: even if I do it that way with a macro, it doesn't work at all, because tmp is discarded before I can use its pointer:

#define alloca(size) (void *)({ char tmp[size]; &tmp; })

So could you explain to me how to write a plain C alloca function that would allocate space on the stack?
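For what it's worth, GCC does expose the stack-allocating primitive as `__builtin_alloca`, which is the kind of compiler support guessed at above. A minimal sketch, where `my_alloca` and `copy_via_stack` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <string.h>

/* Forward to the GCC builtin: the compiler adjusts the stack pointer
 * inline, so the space belongs to the frame of the calling function. */
#define my_alloca(size) __builtin_alloca(size)

/* Copies a string into stack scratch space and returns its length.
 * The scratch buffer is released when this function returns. */
static size_t copy_via_stack(const char *src)
{
    size_t n = strlen(src) + 1;
    char *tmp = my_alloca(n);

    memcpy(tmp, src, n);
    assert(tmp[n - 1] == '\0');
    return strlen(tmp);
}
```

Unlike the statement-expression macro above, nothing here goes out of scope before the pointer is used, because no named array is involved.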

hlide · Post by **hlide** » Fri Apr 13, 2007 5:24 am

chp wrote:Aligning to 16 instead of 64 should be enough though, there was an argument about aligning matrices back when it was written, and I decided to go the safe route when doing the first VFPU stuff. It hasn't been updated since then.

16 is enough, since this is the minimum requirement for the lv.q/sv.q instructions. The only reason I see for 64-byte alignment is that a matrix then fits perfectly in one cache line (which is 64 bytes wide) instead of straddling two if misaligned.
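The cache-line argument can be checked with a little arithmetic. This is a sketch with a hypothetical helper name and made-up addresses; the 64-byte line width is the figure quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines covered by `size` bytes starting at `addr`. */
static unsigned cache_lines_touched(uintptr_t addr, size_t size,
                                    size_t line_size)
{
    assert(size > 0 && line_size > 0);

    uintptr_t first = addr / line_size;              /* line of first byte */
    uintptr_t last  = (addr + size - 1) / line_size; /* line of last byte  */
    return (unsigned)(last - first + 1);
}
```

A 64-byte matrix at 0x1000 (64-byte aligned) occupies one line, while the same matrix at 0x1010 (only 16-byte aligned) spans two.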

chp · Post by **chp** » Fri Apr 13, 2007 6:41 pm

Yes, there are versions of alloca() for platforms that cannot support grabbing memory from the stack that emulate the functionality by grabbing the current stack location and allocating from the heap, but it is not the original intention of the function.

These emulations also break the original definition, because they do not free memory on return but at the next alloca() call, which might not even release that memory if you are at the same or a lower stack depth.

Example:

Code: Select all

#include <alloca.h>
#include <stdio.h>

char* call1(void);
char* call2(void);

int main()
{
char* a;
char* b;

a = call1();
b = call2();
/* a and b are only compared, never dereferenced */
printf("a: %p b: %p diff: %ld\n",(void*)a,(void*)b,(long)(b-a));

return 0;
}

char* call1(void)
{
        char* buf = alloca(2 << 16);
        return buf;
}

char* call2(void)
{
        char* buf = alloca(2 << 16);
        return buf;
}

With a proper implementation actually allocating from the stack, the program result will be:

a: 0xbfddca50 b: 0xbfddca50 diff: 0

But with the C emulation of alloca(), the output will be this: (confirmed)

a: 0xb7e55010 b: 0xb7e34010 diff: -135168

As you can see, if you happen to always allocate from the same level, you will end up leaking memory on each call. They have a "solution" for this, which is calling alloca(0) at a higher level in the program, but it's not documented for the function itself, only in the source of the emulation.

A worse case would be something like this:

Code: Select all

for (i = 0; i < 100; ++i)
{
 char* b;

 b = call2();
 printf("a: %p b: %p diff: %d\n",a,b,b-a);
}

Aaaaanyway, it's not that important. :)

chp · Post by **chp** » Fri Apr 13, 2007 7:28 pm

And you're right: given the chance of cache misses when aligning only to 16 bytes, it's well worth aligning the stack matrices to 64 bytes. That just wasn't the plan when it was written.

forums.ps2dev.org
