VFPU diggins

Raphael · Post by **Raphael** » Sat Jan 27, 2007 9:06 pm

Yeah, that looks a lot more reliable now. Also much nearer to my approximated cycles, so I'm happy :)

hlide · Post by **hlide** » Fri Feb 02, 2007 11:28 pm

!!! NEW INSTRUCTIONS !!!

MrMr[iCE] asked me whether there is an inverse instruction for vi2c.
To our surprise, mips-opc.c doesn't seem to have one. Why? Did they forget to add the inverse instruction, or is it unimplemented?

Digging in mips-opc.c, we have:

vi2uc.q --> 0xd03c8080
vi2c.q --> 0xd03d8080
vi2us.q --> 0xd03e8080
vi2s.q --> 0xd03f8080

and

vus2i.p --> 0xd03a0080
vs2i.p --> 0xd03b0080
vi2us.p --> 0xd03e0080
vi2s.p --> 0xd03f0080

Oh well, we can draw some analogies and find out whether vc2i.q and vuc2i.q really exist:

vi2us.p - vus2i.p = 0x00040000
vi2s.p - vs2i.p = 0x00040000

so let's check whether vc2i.s = vi2c.q - (vi2s.p - vs2i.p) = 0xd0398080

Code:

  asm volatile
          (
          "lv.q C000, %0\n"
          "vi2c.q\tS000, C000\n"
          ".word 0xD0398081\n" // "vc2i.s\tC010, S000\n"
          "sv.q\tC010, %0\n"
          : "+m"(res2) : : "memory"
          );
  printf("%08x %08x %08x %08x\n", res2.x, res2.y, res2.z, res2.w);

BINGO ! C010 == C000 as a result so 0xD0398081 seems to act as vc2i.s C010, S000 !!!

Now, why .s instead of .q? Because the source is a scalar VFPU register: vi2c.q has a vector VFPU register as source and a scalar VFPU register as target, so by analogy it makes sense to use .s here instead of .q.

ok now, let's check whether vuc2i.s = vi2uc.q - (vi2us.p - vus2i.p) = 0xd0388080

Code:

  asm volatile
          (
          "lv.q C000, %0\n"
          "vi2uc.q\tS000, C000\n"
          ".word 0xD0388081\n" // "vuc2i.s\tC010, S000\n"
          "sv.q\tC010, %0\n"
          : "+m"(res2) : : "memory"
          );
  printf("%08x %08x %08x %08x\n", res2.x, res2.y, res2.z, res2.w);

hmmmm... C010 != C000

but we always get the same result, which follows this pattern:

C010.x = (S000[7..0] * 0x01010101) >> 1;
C010.y = (S000[15..8] * 0x01010101) >> 1;
C010.z = (S000[23..16] * 0x01010101) >> 1;
C010.w = (S000[31..24] * 0x01010101) >> 1;

anyway, if you want to do a complex calculation on an RGBA8888 colour, you can do it this way:

vuc2i.s C000, S000
vi2f.q C000, C000, 23
... your calculus here
vf2iz.q C000, C000, 23
vi2uc.q S000, C000

although this instruction isn't the inverse of vi2uc.q.

So now I propose adding both instructions to psp-as, which doesn't recognize them.

As for vuc2i.s, if someone has an idea about what its name should be, I would like to hear it.

EDIT: corrected the fact that "uc2i" halves the result.

hlide · Post by **hlide** » Fri Mar 16, 2007 10:51 pm

hummm, some suggestions about what these instructions do:

- vsbz.s sd, ss : may change the binary logarithmic scale of a floating point value to 0 ?

sd = (-1)^(ss.s) x 2^0 x (1 + ss.m/2^23).

- vsbn.s sd, ss, st : may change the binary logarithmic scale of a floating point value to N ?

sd = (-1)^(ss.s) x 2^N x (1 + ss.m/2^23) where N is given by st.

- vlgb.s sd, ss : may give the binary logarithm of floating point value ?

sd = ss.e - 127.

- vwbn.s sd, ss, imm : may give the modulus of floating point value ?

sd = (-1)^(ss.s) x 2^(N-127) x (1 + (ss.m % 2^N)/2^23) where N is given by imm.

Needless to say, these are just speculations... and I may be wrong...

StrmnNrmn · Post by **StrmnNrmn** » Sun Mar 18, 2007 8:08 am

Hmm, I've been playing around with vuc2i, and I've been getting slightly different results. Am I missing something?

Code:

u32 col = 0x014080c0;
i4			res;

asm volatile
	  (
	  "lv.s S200, %0\n"
	  ".word 0xd0388080 | (8<<8) | (40)\n"	// vuc2i.s	R200, S200
	  "sv.q\tR200, %1\n"
	  : "+m"(col), "+m"(res) : : "memory"
	  );
printf("Col: %08x\n", col);
printf("Res: %08x %08x %08x %08x\n", res.x, res.y, res.z, res.w);

Which results in:

Col: 014080c0
Res: 60606060 40404040 20202020 00808080

So it looks to me like vuc2i.s is actually doing:

Code:

vuc2i.s vd.q, vs.s
{
  vd.q[0] = (vs.s[0]( 0.. 7) * 0x01010101) >> 1;
  vd.q[1] = (vs.s[0]( 8..15) * 0x01010101) >> 1;
  vd.q[2] = (vs.s[0](16..23) * 0x01010101) >> 1;
  vd.q[3] = (vs.s[0](24..31) * 0x01010101) >> 1;
}

I guess the final >>1 prevents the top bit from ever being set, which means that when you use vi2f.q (which is signed) your data correctly comes through as unsigned.

hlide wrote:anyway if you want to make a complex calculus on a RGBA8888, you can do it this way :

vuc2i.s C000, S000
vi2f.q C000, C000, 24
...

I think this actually needs to be:

vuc2i.s C000, S000
vi2f.q C000, C000, 23

(at least that works correctly for me - shifting by 24 gives me half-bright colours.)

StrmnNrmn

(Edit - fix code)

StrmnNrmn · Post by **StrmnNrmn** » Sun Mar 18, 2007 8:36 am

This is a little pedantic, but the pseudo-code on the first page seems to imply that the low-order bits are discarded:

Code:

vi2f.q/t/p/s vd, vs, imm                     1            0
{
  for (i = 0; i < |q/t/p/s|; ++i)
    vd[i] = (float)(vs[i] >> imm);
}

I think this would more accurately be described as:

Code:

vi2f.q/t/p/s vd, vs, imm                     1            0
{
  for (i = 0; i < |q/t/p/s|; ++i)
    vd[i] = (float)(vs[i]) / (float)(1<<imm);
}

i.e. the fractional bits are taken into account:

Code:

u32 col = 0x00800000;
float res;

asm volatile
	  (
	  "lv.s S200, %0\n"
	  "vi2f.s S200, S200, 24\n"
	  "sv.s\tS200, %1\n"
	  : "+m"(col), "+m"(res) : : "memory"
	  );
printf("Col: %08x\n", col);
printf("Res: %f\n", res);

Prints out:

00800000
0.500000

StrmnNrmn

hlide · Post by **hlide** » Sun Mar 18, 2007 10:28 am

StrmnNrmn wrote:Hmm, I've been playing around with vuc2i, and I've been getting slightly different results. Am I missing something?

Nope, this topic is indeed quite old and was never corrected.

If you had looked at all the posts in this topic, especially those by MrMr[iCE] and me, you would have found that vuc2i indeed behaves as you say here.

For cycles, you can look at http://wiki.fx-world.org/doku.php?id=general:cycles. Details on instructions still need to be done.

EDIT: in fact, no. There isn't any post correcting vuc2i. Most corrections were made while discussing on IRC, not in this topic. Well, your post is welcome :P

EDIT2: I corrected my post where I found out about "vuc2i", because it is something I already knew but forgot to modify in that post.

StrmnNrmn · Post by **StrmnNrmn** » Sun Mar 18, 2007 10:28 pm

hlide wrote:For cycles, you can look at http://wiki.fx-world.org/doku.php?id=general:cycles. Details on instructions still need to be done.

That's an excellent resource. Somehow I've managed to overlook the URL while searching around on these forums. Thanks :)

nDEV · Post by **nDEV** » Sun Apr 29, 2007 8:47 pm

Holy crap!
You guys ARE GENIUS...wtf is all this?! i dont understand anything :lol.gif:

hlide · Post by **hlide** » Sun Apr 29, 2007 9:09 pm

nDEV wrote:Holy crap!
You guys ARE GENIUS...wtf is all this?! i dont understand anything :lol.gif:

The main processor of the PSP has two coprocessors, an FPU and a VFPU.

The FPU is quite standard and is used for single-precision floating-point computation. Every homebrew uses it whenever it uses C float. VFPU probably stands for Vector Floating Point Unit, and it is a very powerful SIMD-like FPU. Very few homebrews use it, or only marginally, because gcc has no knowledge of it the way it has for SSE or AltiVec (only gas, the assembler, knows about it).

The VFPU offers 128 single-precision float registers (instead of the FPU's 32), which can be arranged to be accessed as:
- 8 non-overlapping 4x4 matrices
- 8 non-overlapping 3x3 or 16 overlapping 3x3 matrices
- 16 non-overlapping 2x2 matrices

In each matrix, we can also access registers as a column or a row, or simply access a single register as an element of the matrix.

That's very powerful for computing the matrices and vectors used in 2D and 3D. Quaternions are also easier and faster to compute with the VFPU.

SIMD = Single Instruction Multiple Data; see Wikipedia for an explanation.

nDEV · Post by **nDEV** » Sun Apr 29, 2007 10:26 pm

Thanks, that's wayyy too advanced for me.

Anyway , impressive work!!

gauri · Post by **gauri** » Sun Jan 20, 2008 11:22 pm

nice work, guys!
I'm trying to get up to speed with the VFPU diggins and for now am collecting all the info on the VFPU that I can find. I've already dug through this forum.
Any other links you think would be useful? Yep, I've already been to wiki.fx-world.org :-)

P.S. In fact, I'm building (yet another) wiki with PSP info. I'll share the link later, when there is more stuff.

hlide · Post by **hlide** » Mon Jun 23, 2008 8:30 pm

LIST UPDATED

Looking at the POPS binary, I found a special use of the VCMOVF/T instruction:

Code:

vcmovt.t vd, vs, 6

I was wondering what this number 6 meant. As a reminder:

0 : comparison on X component
1 : comparison on Y component
2 : comparison on Z component
3 : comparison on W component
4 : OR comparison on all components
5 : AND comparison on all components

The code using it seems to imply that we can set the components individually according to their own comparisons.

So:

Code:

vcmovt.t vd, vs, 6

would be equivalent to:

Code:

vcmovt.s vd[0], vs[0], 0
vcmovt.s vd[1], vs[1], 1
vcmovt.s vd[2], vs[2], 2

I thought I knew every bit of VFPU :)

MrMr[iCE] · Post by **MrMr[iCE]** » Sun Jul 20, 2008 1:35 pm

still keeping it going eh hlide?

come see me on irc, you know where to go =)

anmabagima · Post by **anmabagima** » Tue Oct 13, 2009 11:53 pm

Hi there,

this might be a stupid question of a noob: but what is the intention of this ? Where and how do I use this usually in my PSP development project?

Regards
AnMaBaGiMa

jojojoris · Post by **jojojoris** » Wed Oct 14, 2009 1:20 am

anmabagima wrote:Hi there,

this might be a stupid question of a noob: but what is the intention of this ? Where and how do I use this usually in my PSP development project?

Regards
AnMaBaGiMa

Do your own research. >:(
Search this forum and use google.

anmabagima · Post by **anmabagima** » Wed Oct 14, 2009 1:30 am

Hi,

thanks for this hint... I've done some research and found this topic. I searched for something like ASM to be used within PSP development to speed things up, and the results pointed me to this thread. But what is this telling me? What I've found out so far is that the PSP has its own dialect of assembler. It doesn't seem comparable with x86 assembler, where you access registers like AX, BX, ESI and all this stuff. So I'm here wondering if someone could give me just a small tip in the right direction...

Thanks...

jojojoris · Post by **jojojoris** » Wed Oct 14, 2009 3:12 am

Search for "VFPU" since it's in the topic title.

I bet you don't have any coding experience since you are not even able to figure out what the topic title means.

dridri · Post by **dridri** » Wed Oct 14, 2009 3:40 am

anmabagima wrote:What I've found out so far is that the PSP do have it's own dialect of assembler. It seem not to be comparable with ix86 assembler where you accessing registers like AX, BX, ESI and all this stuff

The PSP uses a 32-bit MIPS processor, so in an assembler program you can use all the MIPS commands and more: FPU and VFPU.
FPU = Floating Point Unit
VFPU = Video FPU => done by the GPU

So the VFPU commands are "ready to use" for maths: calculating a cosine with the VFPU is roughly 800% faster than a normal cosine (using libMaths).
And it can do a lot of matrix operations (rotate, translate, multiply...)

anmabagima · Post by **anmabagima** » Wed Oct 14, 2009 4:31 pm

jojojoris wrote:Search for "VFPU" since it's in the topic title.

I bet you don't have any coding experience since you are not even able to figure out what the topic title means.

Usually I could start arguing about why I feel I already have some coding experience - take a look at http://www.anmabagima.de/
Look for the project page and you will find my C++ tutorial for DDraw on Windows.

However, my assumption was that we are in this forum to help each other, not to affront someone... Anyway, thanks to everyone else for the help...

dridri · Post by **dridri** » Wed Oct 14, 2009 5:19 pm

I posted ;)

J.F. · Post by **J.F.** » Thu Oct 15, 2009 2:19 am

dridri wrote:
anmabagima wrote:What I've found out so far is that the PSP do have it's own dialect of assembler. It seem not to be comparable with ix86 assembler where you accessing registers like AX, BX, ESI and all this stuff
The PSP uses a MIPS 32bit processor. So in assembler program you can use all MIPS commands and more: FPU and VFPU.
FPU = Floating Point Unit
VFPU = Video FPU => done by the GPU

So the VFPU commands are "ready to use" for maths: calculating a cosinus by using VFPU is +/- 800% faster than a normal cosinus (using libMaths).
And it can do a lot of matrices operations (rotate, translate, multiply...)

Uh... no. VFPU = VECTOR FPU, and it's another CPU coprocessor, just like the FPU. It just works on vectors instead of single values. The VFPU commands are CPU assembler commands specific to the Allegrex CPU inside the PSP.

Here's one place to start with the vfpu: http://forums.ps2dev.org/viewtopic.php?p=67320#67320

dridri · Post by **dridri** » Thu Oct 15, 2009 2:21 am

Ok, thanks ^^