Lock-free algorithms and atomic primitives
Lock-free algorithms and atomic primitives
I've been doing a lot of work recently trying to get Daedalus's audio processing moved over to the Media Engine.
I've had a fair amount of success so far, but I still have some tricky issues to deal with regarding synchronisation with the ME. Specifically, the inability to use mutexes and semaphores between the two cores makes a number of things difficult to implement.
To start with I've re-written a couple of simple datastructures I was using to be lock-free. With a little work I've rewritten the ringbuffer I was using for audio playback as a single-reader, single-writer FIFO. This now works great, and works without any kind of kernel-level synchronisation between the SC* and ME.
It got me thinking about whether anyone else had spent any time working on lock-free data structures for the PSP. Initially I'm just thinking about execution on multiple threads on the SC, but hopefully this would lead to code that could be easily adapted to run on the ME too.
There are a few lock-free libraries around, so my first question is whether or not anyone has taken a look at porting these to the PSP? I did spend some time searching and couldn't find anything relevant.
Many of the lock-free algorithms require an atomic compare-and-swap primitive, so my second question was whether anyone had already implemented this? Again, I couldn't find anything relevant when searching the forums. There doesn't seem to be anything along these lines in the sdk. The closest I could find was hlide's post on using SYNC to provide some synchronisation between the two cores, but I' expecting the CAS primitive to be implemented using LL/SC). There was also this post from hlide talking about the 'SC/ME mutex'. What is this?
Edit Hmmm, it seems that LL/SC will fail when used with uncached or noncoherent memory locations. I'd anticipated using an uncached pointer to avoid cache coherency issues between the ME/SC, so it seems that this rules out using LL/SC on the ME? Maybe I'm missing the point, and LL/SC are designed to work across multiple cores with separate caches?
-StrmnNrmn
*I've picked up this terminology as shorthand for 'the main CPU', but I have no ideas what it means :) Anyone?
I've had a fair amount of success so far, but I still have some tricky issues to deal with regarding synchronisation with the ME. Specifically, the inability to use mutexes and semaphores between the two cores makes a number of things difficult to implement.
To start with I've re-written a couple of simple datastructures I was using to be lock-free. With a little work I've rewritten the ringbuffer I was using for audio playback as a single-reader, single-writer FIFO. This now works great, and works without any kind of kernel-level synchronisation between the SC* and ME.
It got me thinking about whether anyone else had spent any time working on lock-free data structures for the PSP. Initially I'm just thinking about execution on multiple threads on the SC, but hopefully this would lead to code that could be easily adapted to run on the ME too.
There are a few lock-free libraries around, so my first question is whether or not anyone has taken a look at porting these to the PSP? I did spend some time searching and couldn't find anything relevant.
Many of the lock-free algorithms require an atomic compare-and-swap primitive, so my second question was whether anyone had already implemented this? Again, I couldn't find anything relevant when searching the forums. There doesn't seem to be anything along these lines in the sdk. The closest I could find was hlide's post on using SYNC to provide some synchronisation between the two cores, but I' expecting the CAS primitive to be implemented using LL/SC). There was also this post from hlide talking about the 'SC/ME mutex'. What is this?
Edit Hmmm, it seems that LL/SC will fail when used with uncached or noncoherent memory locations. I'd anticipated using an uncached pointer to avoid cache coherency issues between the ME/SC, so it seems that this rules out using LL/SC on the ME? Maybe I'm missing the point, and LL/SC are designed to work across multiple cores with separate caches?
-StrmnNrmn
*I've picked up this terminology as shorthand for 'the main CPU', but I have no ideas what it means :) Anyone?
Re: Lock-free algorithms and atomic primitives
At address 0xbc100048 on both cpus, there is a word mmio register that can only be modified by one cpu at a time. If it is set to 1 then the sc holds it and it can't be modified by the me and if it's set to 2 visa versa. If it's set to 0 then it is unheld. Of course it's a problem on the slim as it can only be changed in kernel mode and the kernel function sceSysregSemaTryLock and sceSysregSemaUnlock are not exported to user mode.StrmnNrmn wrote:There was also this post from hlide talking about the 'SC/ME mutex'.
I had assumed that LL/SC were unimplemented in allegrex as sony never intended for the ME to be used for anything but completely asynchronous media decoding.Hmmm, it seems that LL/SC will fail when used with uncached or noncoherent memory locations.
Re: Lock-free algorithms and atomic primitives
I see. That actually sounds pretty useful - it's a bit of a shame it's problematic on the Slim.crazyc wrote:At address 0xbc100048 on both cpus, there is a word mmio register that can only be modified by one cpu at a time. If it is set to 1 then the sc holds it and it can't be modified by the me and if it's set to 2 visa versa. If it's set to 0 then it is unheld. Of course it's a problem on the slim as it can only be changed in kernel mode and the kernel function sceSysregSemaTryLock and sceSysregSemaUnlock are not exported to user mode.StrmnNrmn wrote:There was also this post from hlide talking about the 'SC/ME mutex'.
I've not tried them yet, but I assumed they were essential for implementing any atomic operations in user mode. How are sceKernelSignalSema etc implemented in the kernel? Do they rely on the fact that interrupts are disabled whilst in kernel mode?crazyc wrote:I had assumed that LL/SC were unimplemented in allegrex as sony never intended for the ME to be used for anything but completely asynchronous media decoding.Hmmm, it seems that LL/SC will fail when used with uncached or noncoherent memory locations.
Edit Just tried this, and I can confirm LL/SC work (on the SC at least :):
Code: Select all
_AtomicIncrement:
ll $t1, 0($a0)
add $t2, $t1, 1
or $v0, $t2, $0 # Copy for return value
sc $t2, 0($a0)
beq $t2, 0, _AtomicIncrement
nop
jr $ra
nop
Code: Select all
extern "C" { u32 _AtomicIncrement( u32 * ptr ); }
...
u32 val = 0;
printf( "Incrementing %d\n", _AtomicIncrement( &val ) );
printf( "Incrementing %d\n", _AtomicIncrement( &val ) );
printf( "Incrementing %d\n", _AtomicIncrement( &val ) );
printf( "Incrementing %d\n", _AtomicIncrement( &val ) );
printf( "Incrementing %d\n", _AtomicIncrement( &val ) );
printf( "Incremented value is %d\n", val );
Code: Select all
Incrementing 1
Incrementing 2
Incrementing 3
Incrementing 4
Incrementing 5
Incremented value is 5
Last edited by StrmnNrmn on Sat Dec 15, 2007 5:27 am, edited 1 time in total.
Re: Lock-free algorithms and atomic primitives
enjoy !
i wonder if 0xbc00048 is only accessible in kernel mode... long time ago i didn't program on psp...
Code: Select all
int sceSysregSemaTryLock(void)
{
_sync();
v0 = _mfc0(22); // CPUId (0 -> SC / 1 -> ME)
a2 = v0 + 1; // 1 -> SC / 2 -> ME
_writew(a2, 0xbc100048); // try to acquire semaphore
a3 = _readw(0xbc100048) & 3; // who owns semaphore ?
return -((a2 ^ a3) != 0); // 0 if successful or -1 if not
}
void sceSysregSemaUnlock(void)
{
_sync();
_writew(0, 0xbc100048); // owner releases semaphore
_sync();
}
ll/sc pair works on a single processor (but Sony seems to be too lazy and prefers to use mfic/mtic pair). However i wouldn't expect them to work as a charm between two cores, so you still need the SC/ME mutex for this and 'sync' to flush the write buffer which Allegrex uses implicitely when storing data ( VFPU uses it explicitely through "vwb.q vr, ofs(rs)" ).
Re: Lock-free algorithms and atomic primitives
Awesome -thanks.hlide wrote:enjoy !
...
I'm assuming from crazyc's comment that this is only callable from the SC whilst in kernel mode, so I'd need to implement it in a kernel mode prx? I guess it would make a very useful addition to the MediaEngine.prx I've been using (courtesy of J.F.).
Agreed, I think it's going to be pretty messy. According to the MIPS R4k manual 'LL and SC instructions implicitly perform a SYNC', so this should be fine. However, the memory location must be cached for use with LL/SC, so I think the variable will need to live in a cache-aligned chunk of memory, and I'll have to insert some cache invalidation/writeback around the LL/SC call. Something like:hlide wrote:ll/sc pair works on a single processor (but Sony seems to be too lazy and prefers to use mfic/mtic pair). However i wouldn't expect them to work as a charm between two cores, so you still need the SC/ME mutex for this and 'sync' to flush the write buffer which Allegrex uses implicitely when storing data ( VFPU uses it explicitely through "vwb.q vr, ofs(rs)" ).
Code: Select all
struct Blah { int Val; char pad[60]; }
__attribute__(aligned(64)) Blah gBlah;
while( sceSysrecSemaTryLock() );
dcache_inv_range( &gBlah, 64 );
_AtomicIncrement( &gBlah.Val );
dcache_wb_range( &gBlah, 64 );
sceSysrecSemaUnlock();
Re: Lock-free algorithms and atomic primitives
sysreg is using some ranges like 0xbfc006xx, you may as well allocate them to have a fast and atomic counter since they are shared between SC and ME.StrmnNrmn wrote:Awesome -thanks.hlide wrote:enjoy !
...
I'm assuming from crazyc's comment that this is only callable from the SC whilst in kernel mode, so I'd need to implement it in a kernel mode prx? I guess it would make a very useful addition to the MediaEngine.prx I've been using (courtesy of J.F.).
Agreed, I think it's going to be pretty messy. According to the MIPS R4k manual 'LL and SC instructions implicitly perform a SYNC', so this should be fine. However, the memory location must be cached for use with LL/SC, so I think the variable will need to live in a cache-aligned chunk of memory, and I'll have to insert some cache invalidation/writeback around the LL/SC call. Something like:hlide wrote:ll/sc pair works on a single processor (but Sony seems to be too lazy and prefers to use mfic/mtic pair). However i wouldn't expect them to work as a charm between two cores, so you still need the SC/ME mutex for this and 'sync' to flush the write buffer which Allegrex uses implicitely when storing data ( VFPU uses it explicitely through "vwb.q vr, ofs(rs)" ).
It's not looking all that pretty/efficient really :)Code: Select all
struct Blah { int Val; char pad[60]; } __attribute__(aligned(64)) Blah gBlah; while( sceSysrecSemaTryLock() ); dcache_inv_range( &gBlah, 64 ); _AtomicIncrement( &gBlah.Val ); dcache_wb_range( &gBlah, 64 ); sceSysrecSemaUnlock();
Re: Lock-free algorithms and atomic primitives
Well, it's not that problematic as you can write a kernel module to do it for you. I guess I'm just annoyed that my own code has to be changed a lot as I wrote it so the same function can be shared between both cpus as hide's above and that becomes rather difficult in later psp kernels where it's not easy to create kernel threads.StrmnNrmn wrote:it's a bit of a shame it's problematic on the Slim.
I must be missing something. Why deal with LL/SC rather then disabling interrupts then writing an uncached pointer? After all LL/SC don't gain you anything wrt SC/ME synchronization.According to the MIPS R4k manual 'LL and SC instructions implicitly perform a SYNC', so this should be fine. However, the memory location must be cached for use with LL/SC, so I think the variable will need to live in a cache-aligned chunk of memory, and I'll have to insert some cache invalidation/writeback around the LL/SC call.
The only place i've seen sc used to refer to the main cpu is in the function sceSysregScResetEnable.*I've picked up this terminology as shorthand for 'the main CPU', but I have no ideas what it means :) Anyone?
Re: Lock-free algorithms and atomic primitives
Well I'm primarily thinking about lock-free primitives on just the SC for now. LL/SC work fine in user-mode and can easily be inlined, so I'm guessing they're cheaper than calling out to a prx to call mfic/mtic. Also, disabling interrupts just to do an atomic increment (e.g. on a refcounted pointer) seems a bit heavy duty to me!crazyc wrote:I must be missing something. Why deal with LL/SC rather then disabling interrupts then writing an uncached pointer? After all LL/SC don't gain you anything wrt SC/ME synchronization.According to the MIPS R4k manual 'LL and SC instructions implicitly perform a SYNC', so this should be fine. However, the memory location must be cached for use with LL/SC, so I think the variable will need to live in a cache-aligned chunk of memory, and I'll have to insert some cache invalidation/writeback around the LL/SC call.
For synchronisation between the ME/SC I think you're right - mtic/mfic through an uncached pointer is probably the way to go.
I realised over the weekend that very little of my code which needs synchronisation actually needs to run on the ME. I have a 'JobManager' class which arbitrates access to the ME. Each 'job' that I queue up for the ME has an optional initialise/finalise function which executes on the SC, e.g. the worker thread in the JobManager essentially does this:
Code: Select all
while( job = GetJob() )
{
job->Init();
CallME( ...., job->Main, ... );
job->Fini();
}
Hmmm, mysterious :)crazyc wrote:The only place i've seen sc used to refer to the main cpu is in the function sceSysregScResetEnable.*I've picked up this terminology as shorthand for 'the main CPU', but I have no ideas what it means :) Anyone?
Re: Lock-free algorithms and atomic primitives
SC => SystemControlStrmnNrmn wrote:Hmmm, mysterious :)crazyc wrote:The only place i've seen sc used to refer to the main cpu is in the function sceSysregScResetEnable.*I've picked up this terminology as shorthand for 'the main CPU', but I have no ideas what it means :) Anyone?
ME => MediaEngine
Re: Lock-free algorithms and atomic primitives
Aww, you've ruined the mystery now! :)hlide wrote:SC => SystemControl
ME => MediaEngine
Re: Lock-free algorithms and atomic primitives
A little part of me was hoping it'd be something incredibly stupid :DStrmnNrmn wrote:Aww, you've ruined the mystery now! :)hlide wrote:SC => SystemControl
ME => MediaEngine
ME => Media Engine
Yep, yep, we know that...
SC => Smelly Cheese
Wh-... What? Bad Sony!
Re: Lock-free algorithms and atomic primitives
Nah! It comes from early beta testers who would bang the malfunctioning systems on the table yelling "Stupid computer!"brin_vg wrote:A little part of me was hoping it'd be something incredibly stupid :DStrmnNrmn wrote:Aww, you've ruined the mystery now! :)hlide wrote:SC => SystemControl
ME => MediaEngine
ME => Media Engine
Yep, yep, we know that...
SC => Smelly Cheese
Wh-... What? Bad Sony!
:)
Re: Lock-free algorithms and atomic primitives
It could also be Silent Corner :) or Super Clowns!J.F. wrote:Ah well, mysteries are meant to be solved at some point. Hope all is working well :)brin_vg wrote:A little part of me was hoping it'd be something incredibly stupid :DStrmnNrmn wrote: Aww, you've ruined the mystery now! :)
Nah! It comes from early beta testers who would bang the malfunctioning systems on the table yelling "Stupid computer!"ME => Media Engine
Yep, yep, we know that...
SC => Smelly Cheese
Wh-... What? Bad Sony!
:)
Ah well
Re: Lock-free algorithms and atomic primitives
if so, a ll/sc pair should be enough as long as you don't need to access the atomic memory mutex or counter or what else on ME processor, right ?StrmnNrmn wrote:I realised over the weekend that very little of my code which needs synchronisation actually needs to run on the ME.
I also really prefer this ll/sc pair to mfic/mtic pair to synchronize between threads.
But I have a question : I heard PSP firmware uses a cooperative-thread scheduler. How PSP firmware is running timer callback for instance ? in a special thread, in the interrupted thread ? or does it mean we should call a blocking function to have a chance to run this callback ?
Re: Lock-free algorithms and atomic primitives
Reading in and writing out a cache line seems heavy too though, especially since the code will have to block on the read.StrmnNrmn wrote:Well I'm primarily thinking about lock-free primitives on just the SC for now. LL/SC work fine in user-mode and can easily be inlined, so I'm guessing they're cheaper than calling out to a prx to call mfic/mtic. Also, disabling interrupts just to do an atomic increment (e.g. on a refcounted pointer) seems a bit heavy duty to me!
Edit: I should check these things before replying. In fact reading and writing an uncached pointer takes the same number of clocks as invalidating, reloading and writing a cache line, ~65. I guess that isn't surprising. Disabling and enabling interrupts adds ~35 additional clocks. Also, it might be a good idea to figure out how to access the 0xbfc00xxx area from user mode as it only takes ~15 clocks to read and write uncached.
I'm interested to see it when it's done. My code is somewhat different. I just have a queue that the ME moves through one by one doing each entry in order and if the ME has some finalization (or anything else that needs to be done i.e. cache synchronization) afterward it places an entry in another queue that the SC checks occasionally.StrmnNrmn wrote:I realised over the weekend that very little of my code which needs synchronisation actually needs to run on the ME. I have a 'JobManager' class which arbitrates access to the ME. Each 'job' that I queue up for the ME has an optional initialise/finalise function which executes on the SC, e.g. the worker thread in the JobManager essentially does this:
I can set things up so that any synchronization is done in Init()/Fini(), which means that I only need synchronization between threads running on the SC. Typically in Fini() I'm just doing something like 'gSomeFlags |= JOB_DONE', but I've had a couple of bugs because '|=' isn't atomic.Code: Select all
while( job = GetJob() ) { job->Init(); CallME( ...., job->Main, ... ); job->Fini(); }
A linked list of jobs would be more handy in some cases. A priority based queue would also be an idea. The Amiga used priority based linked lists for queues for a number of things, like interrupt handlers.
You could have a function to push a prioritized task onto the ME queue. Tasks of the same priority would naturally be FIFO.
You could have a function to push a prioritized task onto the ME queue. Tasks of the same priority would naturally be FIFO.
this file linux/include/asm-mips/atomic.h provides several atomic functions using LL/SC. It may help you.
However I'm a bit confused on what follows :
a. R4000 implementation when you perform LL :
b. R4000 implementation, when you perform SC :
Something is really wrong... there is a LLaddr in CP0 register (the physical address used in the last executed LL) but it doesn't seem to be updated by LL if we refer to LL implementation. And SC should really use LLaddr to check against the physical address given by the current SC. Besides, it looks as if LLbit is never reset once set to 1.
Also note this is a weak implementation (from Wikipedia) :
if those implementations are right and LLaddr is really updated, it would mean a normal load or store operation should always update this LLaddr and reset LLbit : and so "another load or store operation between the two operations will cause the store-conditional to spuriously fail". But it doesn't work with the R4000 implementations.
Damned, those implementations make no sense :(((.
For me, the right implementations should be :
a. when you perform LL :
b. the right implementation, when you perform SC :
However I'm a bit confused on what follows :
a. R4000 implementation when you perform LL :
Code: Select all
VAddr = sign_extend(offset) + GPR[base];
(PAddr, uncached) = translate_address(VAddr, DATA);
GPR[rt] = load_memory(uncached, WORD, PAddr, VAddr, DATA);
LLbit = 1;
Code: Select all
VAddr = sign_extend(offset) + GPR[base];
(PAddr, uncached) = translate_address(VAddr, DATA);
if (LLbit) store_memory (uncached, WORD, data, PAddr, VAddr, DATA);
GPR[rt] = LLbit;
Also note this is a weak implementation (from Wikipedia) :
EDIT:Any exceptional events between the two operations, such as a context switch, another load-link, or even (on many platforms) another load or store operation, will cause the store-conditional to spuriously fail.
if those implementations are right and LLaddr is really updated, it would mean a normal load or store operation should always update this LLaddr and reset LLbit : and so "another load or store operation between the two operations will cause the store-conditional to spuriously fail". But it doesn't work with the R4000 implementations.
Damned, those implementations make no sense :(((.
For me, the right implementations should be :
a. when you perform LL :
Code: Select all
VAddr = sign_extend(offset) + GPR[base];
(PAddr, uncached) = translate_address(VAddr, DATA);
LLaddr = PAddr;
GPR[rt] = load_memory(uncached, WORD, PAddr, VAddr, DATA);
LLbit = 1;
Code: Select all
VAddr = sign_extend(offset) + GPR[base];
(PAddr, uncached) = translate_address(VAddr, DATA);
if (LLaddr != PAddr) LLbit = 0; <-- the reason why LLbit needs to be reset
if (LLbit) store_memory (uncached, WORD, data, PAddr, VAddr, DATA);
GPR[rt] = LLbit;
LLbit = 0; <-- may be optional to be able to issue several SC at the same physical address
In the MIPS32 databook it says:Something is really wrong... there is a LLaddr in CP0 register (the physical address used in the last executed LL) but it doesn't seem to be updated by LL if we refer to LL implementation.
The LLAddr register contains relevant bits of the physical address read by the most recent Load Linked instruction. This register is implementation dependent and for diagnostic purposes only and serves no function during normal operation.
After some testing it looks like the only time sc will fail is when there is an exception between ll and sc no matter whether the address is cached or not. That works nicely as an alternative to disabling interrupts.Hmmm, it seems that LL/SC will fail when used with uncached or noncoherent memory locations.
If so, it means an exception always resets LLbit. This is necessary because if a context switch occurs and runs SC, it would be incoherent and LL/SC pairs would be unusable.crazyc wrote:In the MIPS32 databook it says:Something is really wrong... there is a LLaddr in CP0 register (the physical address used in the last executed LL) but it doesn't seem to be updated by LL if we refer to LL implementation.ok, so it would mean LLAddr is not really used for checking purpose on SC execution.The LLAddr register contains relevant bits of the physical address read by the most recent Load Linked instruction. This register is implementation dependent and for diagnostic purposes only and serves no function during normal operation.
After some testing it looks like the only time sc will fail is when there is an exception between ll and sc no matter whether the address is cached or not. That works nicely as an alternative to disabling interrupts.Hmmm, it seems that LL/SC will fail when used with uncached or noncoherent memory locations.
So, restrictions are :
- don't nest (LL1/LL2/SC2/SC1) or cross (LL1/LL2/SC1/SC2) in the same thread several LL/SC pairs because SC would be fooled by LLbit.
- don't call blocking functions (which can switch context) between LL and SC for the same reason.
- don't mix normal load and store operations with LL/SC at the same address, because it would disrupt the coherency :
Code: Select all
// create a new top element
...
element_t *old_element_ptr = *list_top;
new_element_ptr->next = old_element_ptr; <-- wrong !!!
...
// critical session : update top_list
do
{
element_t *old_element_ptr = __asm_ll(&list_top);
}
while (!__asm_sc(new_element_ptr, &list_top));
the right thing is :
Code: Select all
// create a new top element
...
// critical session : update top_list
do
{
element_t *old_element_ptr = __asm_ll(&list_top);
// we assume a simple store at a different address wouldn't
// disrupt LL/SC pair. Otherwise, we are in big trouble.
new_element_ptr->next = old_element_ptr;
}
while (!__asm_sc(new_element_ptr, &list_top));
Last edited by hlide on Mon Dec 24, 2007 12:17 am, edited 1 time in total.
Both stores actually succeed here but if an exception is thrown between the first and second ll's, both stores still succeed. Almost like ll sets the ll_bit but sc doesn't clear it. Anyway, your advice stands.hlide wrote:- don't nest (LL1/LL2/SC2/SC1) or cross (LL1/LL2/SC1/SC2) in the same thread several LL/SC pairs because SC would be fooled by LLbit.
Actually, any kernel function will trigger a syscall exception causing the store to fail.hlide wrote:- don't call blocking functions (which can switch context) between LL and SC for the same reason.
Definitely. The ll/sc circuitry does not detect changes to the memory at the load address.hlide wrote:- don't mix normal load and store operations with LL/SC at the same address, because it would disrupt the coherency :
Exactly, it is said from R4000 doc :crazyc wrote:Both stores actually succeed here but if an exception is thrown between the first and second ll's, both stores still succeed. Almost like ll sets the ll_bit but sc doesn't clear it.
Actually, any kernel function will trigger a syscall exception causing the store to fail.
looking for "LLbit" in document shows me only LL, SC and ERET are reading/writing this bit (i don't know what they mean by _Invalidate_, cache operation ?).LLbit : Bit of state to specify synchronization instructions. Set by
_LL_, cleared by _ERET_ and _Invalidate_ and read by _SC_.