VFPU yummy goodness: instruction prefixes and rotation

Post by jsgf »

I've been picking through the VFPU binutils patch and correlating it with the stuff hinted at in the BP05 presentation. There's a pile of interesting stuff in there that we haven't even started scratching at yet.

I don't know if anyone has looked into this yet, but since Chp's vfpu gum doesn't make use of this stuff, I assume it isn't common knowledge.

Prefixes

Each VFPU instruction can take up to three prefixes, which specify transformations applied to its two inputs and its output. These are:

Swizzling, which allows you to select which parts of an input vector you want to operate on. For example:

Code: Select all

vmax.q c000, c000, c010[X,X,X,X]   # only select X from the vector
vmax.q c000, c000, c010[X,Y,Z,W]   # no-op swizzle
vmax.q c000, c000, c010[W,Z,Y,X]   # reverse vector
You can also use this to inject constants:

Code: Select all

vmax.q c000, c000, c010[X,0,0,0]   # only X and zero out the rest
vmax.q c000, c000, c010[X,1/2,0,1]   # some other constants
The allowable constants are 0,1,2,3, 1/2, 1/3, 1/4, and their negatives. The syntax for the fractions is literally "1/2".
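
To make the input side concrete, here is a minimal Python sketch of what a swizzle/constant prefix does to a source vector before the instruction sees it. This is only a model of the behaviour described above, not anything taken from binutils; the function name is made up, accepting a leading minus on a lane (as opposed to a constant) is an assumption, and 1/6 is included because it is reported later in this thread.

```python
from fractions import Fraction

# Constants listed in this thread; 1/6 comes from a later reply.
CONSTANTS = {
    "0": 0, "1": 1, "2": 2, "3": 3,
    "1/2": Fraction(1, 2), "1/3": Fraction(1, 3),
    "1/4": Fraction(1, 4), "1/6": Fraction(1, 6),
}
LANES = {"X": 0, "Y": 1, "Z": 2, "W": 3}


def apply_source_prefix(vec, spec):
    """Return the vector the instruction would read for `vec` under `spec`.

    `spec` holds one term per lane, e.g. ["X", "1/2", "0", "1"].
    Hypothetical helper for illustration only.
    """
    out = []
    for term in spec:
        term = term.strip()
        negate = term.startswith("-")
        if negate:
            term = term[1:]
        if term.upper() in LANES:
            value = vec[LANES[term.upper()]]   # swizzle: pick a source lane
        elif term in CONSTANTS:
            value = CONSTANTS[term]            # inject a constant
        else:
            raise ValueError("unsupported prefix term: " + term)
        out.append(-value if negate else value)
    return out


print(apply_source_prefix([10, 20, 30, 40], ["W", "Z", "Y", "X"]))
```

The number of terms in `spec` follows the operation size, so a `.t` operation would pass three terms and a `.p` operation two.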

You can also clamp the outputs to the range 0:1 or -1:1:

Code: Select all

vmax.q c000[0:1, -1:1, 0:1, 0:1], c010, c000
You can also do masking:

Code: Select all

vmax.q c000[0:1, m, 0:1, 0:1], c010, c000
Masking and clamping can only be used on outputs, and swizzling+constants can only be used on inputs.

Code: Select all

vmax.q c000[0:1, 0:1, -1:1, m], c010[X,Y,1/2,1], c020[W,X,Y,Z]
The number of elements in the prefix specifier depends on the size of the operation:

Code: Select all

vmax.t c000, c010[X,X,X], c000
vmax.p c000, c010[X,X], c000
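
For the output side, a rough Python model of clamping and masking might look like the following. It assumes that "m" means the lane is left unwritten (the old register contents survive), which is my reading of "masking" and not something confirmed here; the function name is hypothetical.

```python
def apply_dest_prefix(old, result, spec):
    """Combine the previous register contents `old` with a freshly
    computed `result` under an output prefix `spec`.

    Terms: "0:1" clamps to [0, 1], "-1:1" clamps to [-1, 1],
    "m" masks the lane (assumed: the old value is kept),
    "" writes the result unchanged.
    """
    out = []
    for old_lane, new_lane, term in zip(old, result, spec):
        term = term.strip()
        if term == "m":
            out.append(old_lane)
        elif term == "0:1":
            out.append(min(max(new_lane, 0.0), 1.0))
        elif term == "-1:1":
            out.append(min(max(new_lane, -1.0), 1.0))
        elif term == "":
            out.append(new_lane)
        else:
            raise ValueError("unsupported prefix term: " + term)
    return out


old = [9.0, 9.0, 9.0, 9.0]
result = [2.0, -3.0, 0.25, 5.0]
print(apply_dest_prefix(old, result, ["0:1", "-1:1", "0:1", "m"]))
```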

Rotation

The vrot instruction seems special. It looks like it will compute cos and sin (optionally negative) and load up a vector with the results. This makes it easy to generate a rotation matrix. The syntax is:

Code: Select all

vrot.q c000,s100,[c,-s,0,0]
vrot.q c010,s100,[s,c,0,0]
It looks like the sin/cos vector can only contain one 's' (optionally -s) and one 'c', so "[c,s,c,0]" is not valid. The input must be a scalar (obviously, since it's an angle).

I haven't tested this with running code yet. This is all from assembling+disassembling things with psp-as and psp-objdump.

Post by chp »

Ahh, nice, so that's what the vrot op was for. :) I've tried this on the rotate-functions and it works just fine! The question is if it's faster to use this, or if the previous approach with generating single values is the way to go. I'll do some tests for this.

Post by jsgf »

You could do something like:

Code: Select all

vrot.q c000, s100, [c,s,0,0]
vmul.q c010, c000[-1,1,0,0], c000[Y,X,0,0]
to avoid running vrot twice, if it makes a difference. Computing sin and cos together is free, so running separate sin and cos instructions seems potentially wasteful.
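
As a sanity check on the arithmetic only (not on whether the assembler or hardware accepts it), the vmul with those two prefixes can be modelled in a few lines of Python. The values 0.8 and 0.6 stand in for cos and sin of some angle; the function name is made up.

```python
def second_row_from_first(row):
    """Model vmul.q c010, c000[-1,1,0,0], c000[Y,X,0,0]
    where `row` is c000 = [c, s, 0, 0] as produced by vrot."""
    x, y, _z, _w = row
    lhs = [-1.0, 1.0, 0.0, 0.0]   # first input replaced by constants
    rhs = [y, x, 0.0, 0.0]        # second input swizzled to [Y, X, 0, 0]
    return [a * b for a, b in zip(lhs, rhs)]


row0 = [0.8, 0.6, 0.0, 0.0]          # [c, s, 0, 0]
row1 = second_row_from_first(row0)   # [-s, c, 0, 0]
print(row0, row1)
```

So the pair comes out as [c, s, 0, 0] and [-s, c, 0, 0], which is the same rotation as the two-vrot version with the sign of the sine flipped.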

Post by korskarn »

For input registers, the VFPU also supports absolute values and negative absolute values. I don't know if the homebrew binutils supports the absolute values, but the hardware does.

In the constants that can be used, 1/6 is also available.


About the vrot functions, multiple sines are possible (but only one cosine):
you can have either 1 cosine, 1 sine and 2 zeros, or 1 cosine with 3 sines.

Not all sine and cosine combinations are possible. These are the restrictions:

- There is only 1 cosine
- There is always either 1 or 3 sines (you can't have a cosine only, or 1 cosine and 2 sines)
- Cosine is always positive
- If there are 3 sines, they are always all positive or all negative; you can't have some positive and some negative.

These restrictions lead to the following 32 combinations (the selection is encoded in 5 bits):

[ C, S, S, S]
[ S, C, 0, 0]
[ S, 0, C, 0]
[ S, 0, 0, C]
[ C, S, 0, 0]
[ S, C, S, S]
[ 0, S, C, 0]
[ 0, S, 0, C]
[ C, 0, S, 0]
[ 0, C, S, 0]
[ S, S, C, S]
[ 0, 0, S, C]
[ C, 0, 0, S]
[ 0, C, 0, S]
[ 0, 0, C, S]
[ S, S, S, C]

Then the same 16 combinations repeat, but with all sines negative.

Like you said, the input is scalar, only the X component is used.

The unit used for angle is quarters of a whole turn, meaning that 1.0 is 90 degrees or pi/2 radians.
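
The list and the angle unit above can be put together in a small Python model. The bit layout is an assumption inferred from the order of the list (low two bits pick the cosine slot, next two bits pick the sine slot, and when both pick the same slot every other lane gets the sine); using the top bit for the negative variants is also assumed. Function names are made up.

```python
import math


def vrot_pattern(imm5):
    """Return the lane pattern for a 5-bit selector, e.g. ['C', 'S', '0', '0'].

    Assumed layout: bits 0-1 = cosine slot, bits 2-3 = sine slot,
    bit 4 = negate the sines.
    """
    cos_slot = imm5 & 3
    sin_slot = (imm5 >> 2) & 3
    sine = "-S" if imm5 & 0x10 else "S"
    if cos_slot == sin_slot:
        lanes = [sine] * 4            # one cosine, three sines
    else:
        lanes = ["0"] * 4             # one cosine, one sine, two zeros
        lanes[sin_slot] = sine
    lanes[cos_slot] = "C"
    return lanes


def vrot(angle_quarter_turns, imm5):
    """Evaluate the pattern for an angle given in quarter turns (1.0 = 90 degrees)."""
    radians = angle_quarter_turns * math.pi / 2.0
    values = {
        "C": math.cos(radians),
        "S": math.sin(radians),
        "-S": -math.sin(radians),
        "0": 0.0,
    }
    return [values[lane] for lane in vrot_pattern(imm5)]


for imm in range(16):
    print(imm, vrot_pattern(imm))
```

Under that assumed layout, selectors 0 to 15 reproduce the 16 rows above in the order given.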

In terms of performance, vrot has a latency of 8 cycles and a pitch of 2 cycles, so filling a rotation matrix with 2 vrot instructions should take between 4 and 12 cycles depending on what follows the second vrot (there will be no pipeline stall if the next instruction has a latency of 8 or more cycles and does not use the result of the last vrot).

The vmul instruction has a latency of 5 with a pitch of 1.
Since vmul has a lower latency than vrot, you will have a 3-cycle stall between the vrot and the vmul, and in the end the result of the vmul gets written at exactly the same time as a second vrot's would. But since vmul has a lower latency and pitch than vrot, the next instruction has less chance of being stalled.
EDIT: I overlooked something... since the vmul instruction depends on the result of the vrot, it will be stalled until the result of the vrot gets written and thus take longer than 2 vrots, unless you can interleave other instructions between the vrot and the vmul, in which case the vmul might be faster than a second vrot.
It all depends on whether the instructions are poorly scheduled or well optimized for the pipeline!

The pipeline rules for stalls, latency and pitch are the following:
An instruction has several stages, among them fetching the registers, execution, etc. The last stage is writing the result to the destination register, and regardless of dependencies, an instruction's write stage can occur only AFTER the previous instruction's write stage has completed. So, if an instruction with a lower latency follows one with a higher latency, the second instruction will always finish one cycle after the previous one, no matter how fast it is to execute.
The pitch is the number of cycles between the execution stage of this instruction and the execution stage of the next. So we can say it is the real number of cycles the instruction takes if there are no other stalls.
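
A toy Python model of those rules is sketched below. It is a simplification built on assumptions: an instruction starts once the previous one's pitch has elapsed and all results it reads have been written, and its write lands no earlier than one cycle after the previous write. Where cycle 0 sits and exactly when a stalled write lands are not pinned down above, so treat the numbers as relative only. The function name and tuple layout are made up.

```python
def simulate(instructions):
    """Return the cycle at which each instruction writes its result.

    `instructions` is a list of (latency, pitch, deps) tuples, where
    `deps` lists the indices of earlier instructions whose results
    this one reads.
    """
    issue, write = [], []
    for i, (latency, _pitch, deps) in enumerate(instructions):
        start = 0
        if i > 0:
            # wait for the previous instruction's pitch
            start = issue[i - 1] + instructions[i - 1][1]
        for d in deps:
            # wait for every result this instruction reads
            start = max(start, write[d])
        done = start + latency
        if i > 0:
            # writes complete in program order
            done = max(done, write[i - 1] + 1)
        issue.append(start)
        write.append(done)
    return write


VROT = (8, 2)   # latency, pitch as quoted above
VMUL = (5, 1)

print(simulate([VROT + ([],), VROT + ([],)]))    # two independent vrots
print(simulate([VROT + ([],), VMUL + ([0],)]))   # vmul reading the vrot result
print(simulate([VROT + ([],), VMUL + ([],)]))    # unrelated vmul after a vrot
```

In this model the dependent vmul finishes later than a second independent vrot would, which agrees with the EDIT above, and an unrelated vmul finishes one cycle after the vrot.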

Post by jsgf »

I forgot to mention it, but abs is supported: [|x|, |y|, ...].

korskarn wrote:The unit used for angle is quarters of a whole turn, meaning that 1.0 is 90 degrees or pi/2 radians.
I was wondering why the code didn't seem to be converting the angle into either degrees or radians.

korskarn wrote:since the vmul instruction depends on the result of the vrot, it will be stalled until the result of the vrot gets written and thus take longer than 2 vrot, unless you can interleave other instructions between the vrot and the vmul, in which case the vmul might be faster than a second vrot. All depends if the instructions are poorly scheduled or well optimized for the pipeline!
Yeah, I was proposing that sequence as an experiment, but I didn't think it would be faster than two independent vrots. Is there some documentation about the latency/throughput of the VFPU instructions?

Or even some documentation about what all the instructions do? Some are not obvious from the opcode.

Also, any idea what the cost of using a prefix is, if anything?

Post by korskarn »

jsgf wrote:Yeah, I was proposing that sequence as an experiment, but I didn't think it would be faster than two independent vrots. Is there some documentation about the latency/throughput of the VFPU instructions? Or even some documentation about what all the instructions do? Some are not obvious from the opcode.
Nothing that I know of that is publicly accessible, unfortunately...
But I never searched either, and someone might have written some documentation from guesswork.

The cost of using a prefix: there is a cost of 1 cycle per prefixed operand per instruction. This is because prefixing is done by inserting one, two or all three of these instructions before the actual vfpu instruction:
vpfxs -> prefix source register
vpfxt -> prefix target register
vpfxd -> prefix destination register
These instructions have latency 0 and pitch 1, so they always cost only one cycle and never cause a pipeline stall. The prefixes apply only to the next suitable instruction (prefixes do not apply to all instructions).
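
To estimate that cost for a given source line, a small Python sketch can count which operands carry a bracketed prefix. It assumes the usual three-operand order of destination, source, target, so the first operand maps to vpfxd, the second to vpfxs and the third to vpfxt. It does not know that a no-op swizzle might be dropped by the assembler, and it would miscount vrot, whose bracket is a real operand. Function names are made up.

```python
def split_operands(text):
    """Split an operand string on commas that are outside [...] brackets."""
    operands, depth, current = [], 0, ""
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            operands.append(current.strip())
            current = ""
        else:
            current += ch
    operands.append(current.strip())
    return operands


def prefix_instructions(line):
    """List the prefix instructions a line would need, one per prefixed operand."""
    _mnemonic, _, rest = line.strip().partition(" ")
    names = ["vpfxd", "vpfxs", "vpfxt"]   # assumed operand order: dest, source, target
    return [name for name, operand in zip(names, split_operands(rest))
            if "[" in operand]


line = "vmax.q c000[0:1, 0:1, -1:1, m], c010[X,Y,1/2,1], c020[W,X,Y,Z]"
needed = prefix_instructions(line)
print(needed, "extra cycles:", len(needed))
```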

Post by jsgf »

chp wrote:Ahh, nice, so that's what the vrot op was for. :) I've tried this on the rotate-functions and it works just fine! The question is if it's faster to use this, or if the previous approach with generating single values is the way to go. I'll do some tests for this.
BTW, there are usv and ulv pseudo-instructions which appear to be the unaligned versions of sv and lv, which should allow you to avoid the temporary local matrices, memcpys, etc. in libgum.

Post by chp »

I wasn't sure what they did, but I had my ideas... I was experimenting with lvl and lvr earlier but they locked up, but I guess I was using them incorrectly... I'll take a look at them!

Post by chp »

Ok, I've submitted some new code in GUM that's based on this thread... ulv & usv work just great. Also, lv.s/sv.s do not seem to require more than 4 bytes of alignment (as I suspected before), so I killed those unnecessary aligns too.

The data-functions do not seem to support masking/etc. however (which is quite understandable I guess).

Post by korskarn »

chp wrote:Also, lv.s/sv.s do not require more than 4 bytes of alignment it seems (as I suspected before)
I can confirm lv.s/sv.s are word aligned, and lv.q/sv.q are qword aligned.

What do you mean by "data-functions"?
If you mean all the load/store instructions, prefixing has no effect on them.