Some VFPU clocktick analysis

Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany

Post by Raphael »

Hi, I just wrote a small benchmark program that approximates how many ticks a given operation takes, and ran all the VFPU ops listed at http://hitmen.c02.at/files/yapspd/psp_d ... tml#sec4.9 through it. Since I found the results rather informative, I decided to share them here. Maybe some discussion will come from this, so if you have questions about the results, ask, and if you have other information, share it.
Everything was benchmarked with the PSP at its default speed, i.e. 222 MHz, so the ops/µs will increase by 50% when the PSP is set to 333 MHz. Tick counts won't change though (tested), so they are reliable. The results also include any latencies incurred, so interleaving costly ops with other, independent ops might decrease the real tick cost somewhat.
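As a rough sketch of how the two result columns relate to each other: at 1 MHz the core runs one tick per microsecond, so ops/µs is simply the clock in MHz divided by ticks/op. The helper names below are made up for illustration and are not taken from the benchmark program itself.

```c
/* Hypothetical helpers, not part of the original benchmark. */

/* Throughput in ops per microsecond for an op costing `ticks` cycles
 * on a core clocked at `clock_mhz` (1 MHz == 1 tick per microsecond). */
static double ops_per_us(double clock_mhz, unsigned ticks)
{
    return clock_mhz / (double)ticks;
}

/* Inverse direction: estimate ticks per op from a timed loop that ran
 * `iterations` ops in `elapsed_us` microseconds. */
static double ticks_per_op(double clock_mhz, double elapsed_us,
                           unsigned long iterations)
{
    return clock_mhz * elapsed_us / (double)iterations;
}
```

For a 4-tick op this gives about 55.5 ops/µs at 222 MHz, which matches the ~56 measured for ops like vrcp.q. The tick count stays the same at 333 MHz, only the ops/µs figure scales with the clock.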

UPDATE: I added ulv.q and usv.q ops for unaligned loads/stores
UPDATE: Added exec cost/latencies for interleaving VFPU with MIPS code. The latency measures how many ticks' worth of MIPS code you can 'hide' after the VFPU op.

OP             ops/µs   ticks/op   exec/latency
vadd.q         ~220     1          1/0
vsub.q         ~220     1          1/0
vdot.q         ~220     1          1/0
vmul.q         ~220     1          1/0
vhdp.q         ~220     1          1/0
vdiv.q         ~4       56         14/42
vmmul.q        ~14      16         1/15
vmin.q         ~220     1          1/0
vmax.q         ~220     1          1/0
vabs.q         ~220     1          1/0
vneg.q         ~220     1          1/0
vidt.q         ~77      3          1/2
vzero.q        ~77      3          1/2
vone.q         ~77      3          1/2
vrcp.q         ~56      4          1/3
vrsq.q         ~56      4          1/3
vsin.q         ~56      4          1/3
vcos.q         ~56      4          1/3
vexp2.q        ~56      4          1/3
vlog2.q        ~56      4          1/3
vsqrt.q        ~56      4          1/3
vasin.q        ~56      4          1/3
vnrcp.q        ~56      4          1/3
vnsin.q        ~56      4          1/3
vrexp2.q       ~56      4          1/3
vi2uc.q        ~220     1          1/0
vi2s.q         ~220     1          1/0
vsgn.q         ~220     1          1/0
vcst.q         ~220     1          1/0
vf2in.q        ~220     1          1/0
vi2f.q         ~220     1          1/0
vhtfm4.q       ~56      4          1/3
vtfm4.q        ~56      4          1/3
vmidt.q        ~19      12         1/11
vmzero.q       ~19      12         1/11
lv.q(cache)    ~219     1          1/0
lv.q(mem)      ~4       68
ulv.q(cache)   ~109     2          2/0
ulv.q(mem)     ~4       68
sv.q(cache)    ~32      7          5/2
sv.q(mem)      ~2       111
usv.q(cache)   ~16      14         10/4
usv.q(mem)     ~2       111
Well, what I can say after this is that vector division is, apart from uncached memory reads/writes, the most costly op, so avoid it whenever possible. Doing loads/stores from/to cache is also recommended, so watch your data structures and access patterns.
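One way to "watch your data structures" is to keep quad data 16-byte aligned, so the cheaper lv.q/sv.q can be used instead of ulv.q/usv.q. The following is only a minimal sketch: the Vec4 type and the is_quad_aligned helper are made-up names, and the aligned attribute is a GCC extension (which the PSP toolchain is assumed to support).

```c
#include <stdint.h>

/* Hypothetical vector type: four floats == one VFPU quad (16 bytes).
 * The attribute asks the compiler to place every Vec4 on a 16-byte
 * boundary, as needed for lv.q/sv.q. */
typedef struct {
    float x, y, z, w;
} __attribute__((aligned(16))) Vec4;

/* Returns 1 if `p` is 16-byte aligned (usable with lv.q/sv.q),
 * 0 if the unaligned ulv.q/usv.q variants would be needed. */
static int is_quad_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Since sizeof(Vec4) is 16, every element of a Vec4 array stays aligned. Per the table, an aligned cached store costs 7 ticks versus 14 for the unaligned variant.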

If I find time, I'll maybe also bench the triple, pair and single ops for comparison. Maybe also some comparison to MIPS counterpart ops would be useful (esp for vdiv, vmmul where it's not clear whether vfpu is really faster).

NOTE: If I missed something important, please let me know. I'm basing these results on my current knowledge of op tick costs and latencies, which might not be 100% correct, so they come with no warranty :P
Last edited by Raphael on Fri Aug 11, 2006 8:04 pm, edited 2 times in total.
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
siberianstar
Posts: 70
Joined: Thu Jun 22, 2006 9:24 pm

Post by siberianstar »

nice
noxa
Posts: 39
Joined: Sat Aug 05, 2006 9:03 am

Post by noxa »

This is some really good information - thanks for posting. I'd be interested in any follow up info you find, too! It's nice to be able to make educated decisions instead of shots in the dark about performance ^_^
Raphael

Post by Raphael »

Thanks :)
noxa wrote:This is some really good information - thanks for posting. I'd be interested in any follow up info you find, too! It's nice to be able to make educated decisions instead of shots in the dark about performance ^_^
I tried to measure latency costs, but that seems somewhat harder. Either the results above do NOT include latencies yet, or I did something wrong. Executing one vdiv.q followed by one vadd.q takes 58 ticks. Given the 1 tick for vadd.q and the 56 ticks for vdiv.q, the total should stay at 56 if latencies were included (the vadd hidden within the latency of the vdiv). The additional 1 tick suggests that either vdiv has no (or only a 1-tick) latency, or my first measurement was somewhat off.
I can say, though, that it's possible to hide MIPS code in the execution cost of vdiv.q (I could run 42 addiu's, 1 tick each, before the tick count increased). I'll check how well that works with the other ops too. Maybe you could even interleave 1-tick VFPU and MIPS ops, so that each pair executes in 1 tick.

EDIT: One-tick ops don't seem to interleave well, unfortunately :( I updated the first post with my VFPU/MIPS execution ticks and latency measurements for the ops. The first value is the ticks spent executing the op, the second is the latency during which you can hide MIPS code. No other VFPU code though, so either all VFPU registers are somehow dependent on each other or the instructions are just not pipelineable.
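To make the exec/latency reading concrete, here is a minimal sketch of the cost model those two numbers imply. It is a simplification under one stated assumption: independent MIPS code placed after the VFPU op is free until it exceeds the latency window, and every tick beyond that adds to the total. The type and function names are hypothetical.

```c
/* One table entry: ticks the op itself executes, and the latency
 * window afterwards in which independent MIPS code can be hidden. */
typedef struct {
    unsigned exec;
    unsigned latency;
} VfpuCost;

/* Estimated ticks for one VFPU op followed by `mips_ticks` ticks of
 * independent MIPS code (assumed model, not a measured result). */
static unsigned ticks_with_mips(VfpuCost op, unsigned mips_ticks)
{
    unsigned tail = mips_ticks > op.latency ? mips_ticks : op.latency;
    return op.exec + tail;
}
```

With vdiv.q as {14, 42}, the model stays at 56 ticks for up to 42 one-tick addiu's and goes to 57 with the 43rd, which is the behaviour described above. For a {1, 0} op like vadd.q there is no window, so any MIPS op adds its full cost.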
siberianstar

Post by siberianstar »

measure also this

asm volatile ("mtv %0, s500" : : "r" (input_value));

and this

asm volatile ("mfv %0, s500" : "=r" (input_value));
groepaz
Posts: 305
Joined: Thu Sep 01, 2005 7:44 am

Post by groepaz »

nice.... expect that info to be assimilated in my doc :=P
Raphael

Post by Raphael »

siberianstar wrote:measure also this

asm volatile ("mtv %0, s500" : : "r" (input_value));

and this

asm volatile ("mfv %0, s500" : "=r" (input_value));
You forgot the magic word :P
Nah, I'll do that when I find time again.
groepaz wrote:nice.... expect that info to be assimilated in my doc :=P
Would be nice :P
siberianstar

Post by siberianstar »

Please :)