automatic alignment of stack vars for proper vfpu

yabadabo
Posts: 10
Joined: Fri Mar 30, 2007 8:49 am

automatic alignment of stack vars for proper vfpu

Post by yabadabo »

Hi

If you have tried to use the VFPU, you will have noticed that every variable it references must be 16-byte aligned. This works fine for global variables using the gcc __attribute__((aligned(16))), as ScePspFVector4 in psptypes.h does.

But if you create temporary variables on the stack that are referenced by VFPU instructions, the app might raise an exception.

I found a solution to this problem: pass -mpreferred-stack-boundary=4 when compiling the source files. This forces gcc to align temporary variables to 16 bytes (2^4) instead of the default 8 bytes.
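To see whether a given local really ends up 16-byte aligned, a run-time check can help. The following is only a minimal sketch: Vec4 is a hypothetical stand-in for ScePspFVector4, and the helper names are made up for illustration, not part of the PSPSDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ScePspFVector4 (four floats, 16 bytes). */
typedef struct { float x, y, z, w; } Vec4;

/* Returns 1 if p is a multiple of 16, which is what lv.q/sv.q need. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Debug guard: aborts (in non-NDEBUG builds) before a misaligned
   pointer reaches VFPU code. */
void require_aligned16(const void *p)
{
    assert(is_aligned16(p));
}

/* Reports whether a plain local happens to be 16-byte aligned.
   The result depends on the compiler, target and flags in use. */
int local_vec_is_aligned16(void)
{
    Vec4 v = { 1.0f, 2.0f, 3.0f, 4.0f };
    return is_aligned16(&v);
}
```

Calling require_aligned16(&v) at the top of a function that passes &v to VFPU code turns a hard-to-trace exception into an immediate assertion failure.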

yabadabo
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Re: automatic alignment of stack vars for proper vfpu

Post by hlide »

yabadabo wrote:I found a solution to this problem: pass -mpreferred-stack-boundary=4 when compiling the source files. This forces gcc to align temporary variables to 16 bytes (2^4) instead of the default 8 bytes.
Well, this is a recent addition to psp-gcc; I added it because beforehand the option was only effective in i386 gcc.

However, be very careful not to mix code compiled with and without this option. If a function compiled with this option is called from a function compiled without it, you may end up with a misaligned stack pointer. psp-gcc's stack alignment does not ensure the right alignment at function entry: it assumes the stack pointer is already correctly aligned and only allocates locals in such a way that the alignment is preserved.
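The arithmetic behind that warning can be sketched with a small model. This is an assumption-laden illustration, not psp-gcc's actual frame layout: the callee is assumed to round its frame size up to a multiple of 16 and to trust that the stack pointer was 16-byte aligned on entry. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models where a local ends up, given the stack pointer at entry,
   the callee's frame size and the local's offset inside the frame.
   The callee only rounds its own frame; it never realigns sp. */
uintptr_t callee_local_address(uintptr_t sp_at_entry,
                               uintptr_t frame_size,
                               uintptr_t offset)
{
    uintptr_t rounded = (frame_size + 15u) & ~(uintptr_t)15u;
    assert(offset < rounded);           /* the local must lie inside the frame */
    return sp_at_entry - rounded + offset;
}
```

With an entry stack pointer of 0x1000 the local at offset 16 lands on a 16-byte boundary, but with 0x1008 (a caller that only keeps 8-byte alignment) the same local is off by 8, which is exactly the case that would trip lv.q/sv.q.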
rapso
Posts: 140
Joined: Mon Mar 28, 2005 6:35 am

Post by rapso »

It's not that hard to align all the vars on the stack yourself, just a macro :)
yabadabo
Posts: 10
Joined: Fri Mar 30, 2007 8:49 am

Post by yabadabo »

Do you mean a macro that creates a 16-byte array and a pointer to a matrix/quaternion variable pointing into that stack space? Without a clear idea of what that macro would look like, I prefer the compiler option. :)
J.F.
Posts: 2906
Joined: Sun Feb 22, 2004 11:41 am

Post by J.F. »

Just use the defines from gumInternal.h:

Code:

// these macros are because GCC cannot handle aligned matrices declared on the stack
#define GUM_ALIGNED_MATRIX() (ScePspFMatrix4*)((((unsigned int)alloca(sizeof(ScePspFMatrix4)+64)) + 63) & ~63)
#define GUM_ALIGNED_VECTOR() (ScePspFVector4*)((((unsigned int)alloca(sizeof(ScePspFVector4)+64)) + 63) & ~63)
yabadabo
Posts: 10
Joined: Fri Mar 30, 2007 8:49 am

Post by yabadabo »

Thanks, very interesting, I didn't know they were there.
Anyway, I think aligning to 16 bytes instead of 64 should be enough, at least for the VFPU instructions.
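A 16-byte variant of the same idea might look like the sketch below. ALLOCA_ALIGNED16 and Vec4 are hypothetical names used only for illustration (Vec4 stands in for ScePspFVector4); the macro over-allocates by 15 bytes with alloca() and rounds the address up to the next multiple of 16, and uses uintptr_t instead of unsigned int so it also builds on 64-bit hosts.

```c
#include <alloca.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ScePspFVector4. */
typedef struct { float x, y, z, w; } Vec4;

/* Over-allocate by 15 bytes, then round the address up to a multiple of 16.
   The memory lives until the function that expands the macro returns. */
#define ALLOCA_ALIGNED16(type) \
    ((type *)(((uintptr_t)alloca(sizeof(type) + 15) + 15u) & ~(uintptr_t)15u))

/* Returns 1 if the vector obtained from the macro is 16-byte aligned. */
int demo_aligned_vector(void)
{
    Vec4 *v = ALLOCA_ALIGNED16(Vec4);
    v->x = 1.0f; v->y = 2.0f; v->z = 3.0f; v->w = 4.0f;
    assert(((uintptr_t)v & 15u) == 0);
    return ((uintptr_t)v & 15u) == 0;
}
```

Compared with the 64-byte macros this wastes at most 15 bytes per allocation instead of 63, at the cost of no longer guaranteeing that a matrix starts on a cache-line boundary.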
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

J.F. wrote:Just use the defines from gumInternal.h:

Code:

// these macros are because GCC cannot handle aligned matrices declared on the stack
#define GUM_ALIGNED_MATRIX() (ScePspFMatrix4*)((((unsigned int)alloca(sizeof(ScePspFMatrix4)+64)) + 63) & ~63)
#define GUM_ALIGNED_VECTOR() (ScePspFVector4*)((((unsigned int)alloca(sizeof(ScePspFVector4)+64)) + 63) & ~63)
alloca allocates on the heap (unless gcc treats it as a builtin and replaces it with an allocation on the stack, but I'm unsure about that).
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

hlide wrote:alloca allocates on the heap (unless gcc treats it as a builtin and replaces it with an allocation on the stack, but I'm unsure about that).
Wrong. From the description of alloca():
The alloca function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca returns to its caller.
Since I wrote those macros I think I know what they were supposed to do. :)

Aligning to 16 instead of 64 should be enough though, there was an argument about aligning matrices back when it was written, and I decided to go the safe route when doing the first VFPU stuff. It hasn't been updated since then.
GE Dominator
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

chp wrote:
hlide wrote:alloca allocates on the heap (unless gcc treats it as a builtin and replaces it with an allocation on the stack, but I'm unsure about that).
Wrong. From the description of alloca():
The alloca function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca returns to its caller.
Since I wrote those macros I think I know what they were supposed to do. :)
Well, the very first implementation of alloca I saw allocated space on the heap, in such a way that a successive call to alloca would free the memory blocks that had been allocated in callees. But I believe I saw gcc rearrange it through a builtin function to force it onto the stack. Probably any compliant C compiler must now handle alloca so that it allocates directly on the stack. Or it may be a GCC exception.

I must check it.
Last edited by hlide on Fri Apr 13, 2007 4:26 am, edited 1 time in total.
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

chp wrote:The alloca function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca returns to its caller.
Could you please tell me how a C function can allocate space on the stack without that space being freed when the C function exits? The only way I can see for it to work as described is that it is a builtin function, which gcc handles by inserting code to allocate space on the stack and free it at the exit of the function where alloca() is called.

Too bad: even if I do it this way with a macro, it doesn't work at all, because tmp is discarded before I can use its pointer:

#define alloca(size) (void *)({ char tmp[size]; &tmp; })

So could you explain to me how to write a plain C alloca function that would allocate space on the stack?
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

chp wrote:Aligning to 16 instead of 64 should be enough though, there was an argument about aligning matrices back when it was written, and I decided to go the safe route when doing the first VFPU stuff. It hasn't been updated since then.
16 is enough, since that is the minimum requirement for the lv.q/sv.q instructions. The only reason I see for 64-byte alignment is that a matrix then fits perfectly in one cache line (which is 64 bytes wide) instead of straddling two when misaligned.
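The cache-line point can be checked with a little arithmetic. This is a minimal sketch with a hypothetical helper name; it assumes only the 64-byte line width stated above and a 64-byte matrix (sixteen floats).

```c
#include <assert.h>
#include <stdint.h>

/* Number of cache lines touched by an object of `size` bytes
   starting at address `addr`, for lines of `line_size` bytes. */
unsigned cache_lines_touched(uintptr_t addr, uintptr_t size, uintptr_t line_size)
{
    assert(size > 0 && line_size > 0);
    uintptr_t first = addr / line_size;               /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line_size;  /* line holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

A 64-byte matrix at 0x1000 (64-byte aligned) touches one line, while the same matrix at 0x1010 (only 16-byte aligned) touches two.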
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

Yes, there are versions of alloca() for platforms that cannot support grabbing memory from the stack that emulate the functionality by grabbing the current stack location and allocating from the heap, but it is not the original intention of the function.

These emulations also break the original definition, because they do not free the memory on return but at the next alloca() call, which might not release that memory either if you are at the same or a lower stack depth.

Example:

Code:

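#include <alloca.h>
#include <stddef.h>
#include <stdio.h>

/* Declared up front so main() can call them. */
char* call1(void);
char* call2(void);

int main(void)
{
        char* a;
        char* b;

        a = call1();
        b = call2();
        /* The pointers are only printed and compared, never dereferenced. */
        printf("a: %p b: %p diff: %td\n", (void*)a, (void*)b, (ptrdiff_t)(b - a));

        return 0;
}

/* Deliberately returns a pointer into memory obtained from alloca(). */
char* call1(void)
{
        char* buf = alloca(2 << 16);
        return buf;
}

char* call2(void)
{
        char* buf = alloca(2 << 16);
        return buf;
}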
With a proper implementation actually allocating from the stack, the program result will be:
a: 0xbfddca50 b: 0xbfddca50 diff: 0
But with the C emulation of alloca(), the output will be this: (confirmed)
a: 0xb7e55010 b: 0xb7e34010 diff: -135168
As you can see, if you happen to always allocate from the same level, you will end up leaking memory on each call. There is a "solution" for this, which is to call alloca(0) at a higher level in the program, but it is not documented for the function itself, only in the source of the emulation.

A worse case would be something like this:

Code:

for (i = 0; i < 100; ++i)
{
 char* b;

 b = call2();
 printf("a: %p b: %p diff: %d\n",a,b,b-a);
}
Aaaaanyway, it's not that important. :)
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

And you're right: given the chance of cache misses when aligning only to 16 bytes, it's well worth aligning the stack matrices to 64 bytes. That just wasn't the plan when it was written.