VFPU diggins

Raphael · Post by **Raphael** » Fri Nov 17, 2006 11:54 pm

You are right, I just checked them again, and it seems those were wrong. Actually vone, vzero and vidt all take 1 cycle.
vmone/vmzero/vmidt take 4/3/2 cycles depending on the variant used (.q/.t/.p). Needs update.

hlide · Post by **hlide** » Sat Nov 18, 2006 1:11 am

default values (in hex) of rcx0-7 before any use of vrndf1 :

3f800001
3f800002
3f800004
3f800008
3f800000
3f800000
3f800000
3f800000

functions to save/restore them :

Code:

void vfpu_save_rcx(float context[8])
{
	__asm__ volatile
	(
		".set push;"
		".set noreorder;"
		"vmfvc S000, $136;"
		"vmfvc S001, $137;"
		"vmfvc S002, $138;"
		"vmfvc S003, $139;"
		"vmfvc S010, $140;"
		"vmfvc S011, $141;"
		"vmfvc S012, $142;"
		"vmfvc S013, $143;"
		"usv.q C000,  0(%0);"
		"usv.q C010, 16(%0);"
		".set pop"
		:
		: "r"(context)
		: "memory"
	);
}

void vfpu_load_rcx(float context[8])
{
	__asm__ volatile
	(
		".set push;"
		".set noreorder;"
		"ulv.q C000,  0(%0);"
		"ulv.q C010, 16(%0);"
		"vmtvc $136, S000;"
		"vmtvc $137, S001;"
		"vmtvc $138, S002;"
		"vmtvc $139, S003;"
		"vmtvc $140, S010;"
		"vmtvc $141, S011;"
		"vmtvc $142, S012;"
		"vmtvc $143, S013;"
		".set pop"
		:
		: "r"(context)
		: "memory"	/* the asm reads context[] through the pointer */
	);
}

code to test vrndf1:

Code:

  float min = 0.0, max = 0.0;

  int i;

  for (i = 0; i < 100000; ++i)
  {
    float val;
    asm volatile
      (
      "vrndf1.s S000;"
      "sv.s S000, %0;"
      : "=m"(val)
      );
    if (min > val) min = val;
    if (val > max) max = val;
  }
  pspDebugScreenPrintf("%f\n%f\n", min, max);
  wait();

I got : min = 0.0 max = 1.99999

so i guess vrndf1 gives us : 0.0 <= value < 2.0.

now i must test vrndf2 and vrndi.
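
A caveat on the loop above: since min starts at 0.0, it can never end up above 0.0, so a printed min of 0.0 does not prove the generator ever returned 0.0. Here is a minimal sketch in plain C of the difference the starting values make. `fake_vrndf1` is a hypothetical stand-in that returns values in [1, 2), not the real VFPU generator, and `scan_range` is just an illustrative helper name.

```c
#include <float.h>
#include <stdlib.h>

/* Hypothetical stand-in for vrndf1.s: some value in [1, 2). */
static float fake_vrndf1(void)
{
    return 1.0f + (float)(rand() % 1000) / 1000.0f;
}

/* Track min/max over n samples, starting from the given initial values. */
static void scan_range(int n, float init_min, float init_max,
                       float *out_min, float *out_max)
{
    float min = init_min, max = init_max;
    int i;

    for (i = 0; i < n; ++i)
    {
        float val = fake_vrndf1();
        if (val < min) min = val;
        if (val > max) max = val;
    }
    *out_min = min;
    *out_max = max;
}
```

Calling `scan_range(n, 0.0f, 0.0f, ...)` reports a minimum of 0.0 whatever the generator does, while `scan_range(n, FLT_MAX, -FLT_MAX, ...)` reports the true bounds of the samples.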

hlide · Post by **hlide** » Sat Nov 18, 2006 1:14 am

vrndf2 gives us : 0.0 <= value < 4.0 (max = 3.999979)

hlide · Post by **hlide** » Sat Nov 18, 2006 1:29 am

well it SEEMS that vrndi gives us : -2^31 <= value < 2^31

by the way, its result mustn't be used as a float, otherwise some crashes are expected.

Code:

  int min = 0, max = 0;

  int i;

  for (i = 0; i < 100000; ++i)
  {
    int val;
    asm volatile
      (
      "vrndi.s S000;"
      "mfv %0, S000"
      : "=r"(val)
      );
    if (min > val) min = val;
    if (val > max) max = val;
  }

Raphael · Post by **Raphael** » Tue Nov 21, 2006 7:14 pm

I created a DokuWiki on my webspace to use as a documentation place for the further diggins, since the SVN request doesn't seem to get answered.
Most likely a DokuWiki is even better than a single txt file in SVN.
Unfortunately I didn't have much time to add a lot of ops to the wiki, so it's rather sparse right now. If you feel like it, you're free to register and start adding the information from our documents to it. The same goes for anyone else with knowledge about VFPU.

http://wiki.fx-world.org/

LMK what you think

hlide · Post by **hlide** » Tue Nov 21, 2006 9:30 pm

i will tell it to you when at home, i cannot register here.

hlide · Post by **hlide** » Wed Nov 22, 2006 6:04 am

it looks as if i can edit but not create a page :/

Raphael · Post by **Raphael** » Wed Nov 22, 2006 6:30 am

Should work normally. I made you admin anyway.

Here's a site with links to most pages: http://wiki.fx-world.org/doku.php?id=general:cycles
You can then create them with the button on the bottom left

hlide · Post by **hlide** » Sun Dec 31, 2006 10:22 am

hlide · Post by **hlide** » Tue Jan 02, 2007 10:58 pm

vcmp.q/t/p/s updated

hlide · Post by **hlide** » Sun Jan 07, 2007 3:56 am

vrnds.s added

purpose: to set a new seed for random instructions vrndi/f1/f2.

MrMr[iCE] · Post by **MrMr[iCE]** » Tue Jan 16, 2007 3:28 pm

hlide wrote:

Code:

vasin.q/t/p/s vd, vs                        4/?/?/?         3
{
  for (i = 0; i < |q/t/p/s|; ++i)
    vd[i] = asin(vs[i]) * 2/PI; // not sure about this conversion
}

This function returns values in the range of -1 to 1, where libm's asinf returns values from -pi/2 to pi/2. I wrote a small routine to perform acos (needed for a quaternion function to return axis/angle from a quat):

Code:

float vacos(float x) {
	float result;
	__asm__ volatile (
		"mtv      %1,   S000\n"				// load x (operand %1 is the input)
		"vcst.s   S001, VFPU_PI_2\n"		 // S001 = PI/2
		"vasin.s  S002, S000\n"				// S002 = asin(x) * 2/PI
		"vmul.s   S000, S002, S001\n"		// S000 = asin(x) in radians (vfpu returns -1 to 1, we need -PI/2 to PI/2)
		"vsub.s   S002, S001, S000\n"		// S002 = acos(x) = PI/2 - asin(x)
		"mfv      %0,   S002\n"				// store result (operand %0 is the output)
		: "=r"(result) : "r"(x));
	return result;
}

comparison between this and libm's acosf:

Code:

1.000000000000000: acosf = 0.000000000000000, vacos = 0.000000000000000
0.999998331069946: acosf = 0.001826981431805, vacos = 0.000026941299438
0.999983608722687: acosf = 0.005725612863898, vacos = 0.000263690948486
0.999941170215607: acosf = 0.010847153142095, vacos = 0.000943064689636
0.999869108200073: acosf = 0.016179904341698, vacos = 0.002095341682434
0.999770760536194: acosf = 0.021412530913949, vacos = 0.003669381141663
0.999642193317413: acosf = 0.026751749217510, vacos = 0.005728125572205
0.999485075473785: acosf = 0.032092638313770, vacos = 0.008242845535278
0.999303340911865: acosf = 0.037329345941544, vacos = 0.011150956153870
0.999089717864990: acosf = 0.042671307921410, vacos = 0.014570236206055
0.998847484588623: acosf = 0.048015348613262, vacos = 0.018447875976562
0.998576760292053: acosf = 0.053358737379313, vacos = 0.022781252861023
0.998277544975281: acosf = 0.058701783418655, vacos = 0.027570843696594
0.997949719429016: acosf = 0.064046569168568, vacos = 0.032818794250488
0.997593402862549: acosf = 0.069391109049320, vacos = 0.038522601127625
0.997216641902924: acosf = 0.074627749621868, vacos = 0.044555068016052
0.996803939342499: acosf = 0.079972051084042, vacos = 0.051162123680115
0.996362805366516: acosf = 0.085315905511379, vacos = 0.058223485946655
0.995893180370331: acosf = 0.090660177171230, vacos = 0.065743565559387
0.995395064353943: acosf = 0.096004940569401, vacos = 0.073717951774597
0.994868516921997: acosf = 0.101349666714668, vacos = 0.082148909568787
0.994313597679138: acosf = 0.106693953275681, vacos = 0.091034412384033
0.993730247020721: acosf = 0.112038522958755, vacos = 0.100376486778259
0.993118524551392: acosf = 0.117382980883121, vacos = 0.110171675682068
0.992478370666504: acosf = 0.122727967798710, vacos = 0.120423436164856
0.991823673248291: acosf = 0.127964779734611, vacos = 0.127515912055969
0.991127431392670: acosf = 0.133309572935104, vacos = 0.132169842720032
0.990402877330780: acosf = 0.138654336333275, vacos = 0.137006640434265
0.989650011062622: acosf = 0.143999248743057, vacos = 0.142025470733643
0.988868892192841: acosf = 0.149344027042389, vacos = 0.147226572036743
0.988059520721436: acosf = 0.154688835144043, vacos = 0.152607440948486

I don't have a sci calc handy to verify which is 'correct', but I'm leaning towards the vfpu version :)

NOTE: vacos is almost 10 times faster than libm's acosf :)

hlide · Post by **hlide** » Tue Jan 16, 2007 10:33 pm

I coded it but i didn't test it :

Code:

# typedef struct quaternion_s { float i, j, k, r; } __attribute__((aligned(16))) quaternion_t;
# typedef struct axis_angle_s { float x, y, z, theta; } __attribute__((aligned(16))) axis_angle_t;
# void quaternion_to_axis_angle(quaternion_t *q, axis_angle_t *aa)
# {
#       quaternion_normalise(q);
#
#       aa->theta = acos(q->r) * 2;
#
#       vx = q->i;
#       vy = q->j;
#       vz = q->k;
#
#       norm = sqrt(vx * vx + vy * vy + vz * vz);
#       if (norm > 0.0005)
#       {
#               aa->x = vx / norm;
#               aa->y = vy / norm;
#               aa->z = vz / norm;
#       }
# }
.global quaternion_to_axis_angle
quaternion_to_axis_angle:
        lv.q            C000, 0($a0)            # C000.q = (q.i, q.j, q.k, q.r)
        vcst.s          S102, VFPU_PI           # S102.s = PI
        vdot.q          S100, C000, C000        # S100.s = (q.i^2 + q.j^2 + q.k^2 + q.r^2)
        vrsq.s          S100, S100              # S100.s = 1/sqrt(q.i^2 + q.j^2 + q.k^2 + q.r^2)
        vscl.q          C000, C000, S100        # C000.q = (vx, vy, vz, r) = (q.i, q.j, q.k, q.r)*(1/sqrt(q.i^2 + q.j^2 + q.k^2 + q.r^2))
        vasin.s         S101, S003              # S101.s = vasin(q.r) = 2*asin(q.r)/PI (q.r is the 4th component, S003)
        vfim.s          S100, 0.00005           # S100.s = epsilon
        vdot.t          S103, C000, C000        # S103.s = (vx^2 + vy^2 + vz^2)
        vocp.s          S101, S101              # S101.s = 1 - 2*asin(q.r)/PI
        vrsq.s          S103, S103              # S103.s = norm = 1/sqrt(vx^2 + vy^2 + vz^2)
        vmul.s          S003, S101, S102        # S003.s = 2*acos(q.r) = 2*(PI/2 - asin(q.r)) = PI*(1 - 2*asin(q.r)/PI)
        vcmp.s          LT, S103, S100          # VFPU_CC[0] = norm < epsilon
        bvtl            0, 0f
        vzero.t         C000                    # if (VFPU_CC[0] == true) C000.t = (0, 0, 0);
        vscl.t          C000, C000, S103        # if (VFPU_CC[0] == false) C000.t = (vx, vy, vz)/norm
0:      jr              $ra
        sv.q            C000, 0($a1)            # aa.x = S000.s, aa.y = S001.s, aa.z = S002.s, aa.theta = S003.s

hlide · Post by **hlide** » Tue Jan 16, 2007 11:26 pm

humm, i made a mistake for the case norm < epsilon. I tend to set a default value when I cannot compute the value, but here that was a big mistake, so i changed it to something more appropriate.

Remove "vzero.t" and transform "bvtl" into "bvfl" :

Code:

        bvfl            0, 0f
        vscl.t          C000, C000, S103        # if (VFPU_CC[0] == false) C000.t = (vx, vy, vz)/norm
0:      jr              $ra
        sv.q            C000, 0($a1)            # aa.x = S000.s, aa.y = S001.s, aa.z = S002.s, aa.theta = S003.s

or replace "bvtl" by "vcmovf" :

Code:

        vscl.t          C100, C000, S103
        vcmovf.t        C000, C100, 0           # if (VFPU_CC[0] == false) C000.t = (vx, vy, vz)/norm
0:      jr              $ra
        sv.q            C000, 0($a1)            # aa.x = S000.s, aa.y = S001.s, aa.z = S002.s, aa.theta = S003.s

hlide · Post by **hlide** » Tue Jan 16, 2007 11:36 pm

humm i cannot connect to irc.freenode.org... i'm alone ?

MrMr[iCE] · Post by **MrMr[iCE]** » Tue Jan 16, 2007 11:46 pm

bloody brilliant man...thats some tight code, the normalization code is certainly much shorter than what I had =)

As to freenode, I'm there right now...try connecting to one of the alternatives here: http://freenode.net/irc_servers.shtml

hlide · Post by **hlide** » Wed Jan 17, 2007 9:37 am

very bad news ! after some tests it appears that you cannot use a VFPU instruction in a branch delay slot, that is :

- after a bvt
- after a bvf
- after a bvtl
- after a bvfl
- after a beq
- after a bne
- after a b...z
- after a bal...z
- after a bal
- after a b
- after a jr
- after a jalr
- after a j

:((((

hlide · Post by **hlide** » Wed Jan 17, 2007 11:09 am

okay this version totally rocks and shows how you can efficiently use the VFPU flag 4 (at least one component matches the condition) to detect any NaN or Inf value in VFPU registers :

Code:

# bool vfpuQuaternionToAxisAngle(vfpu_quaternion_t quaternion, vfpu_axis_angle_t axis_angle)
.global vfpuQuaternionToAxisAngle
.p2align 4
vfpuQuaternionToAxisAngle:
        lv.q            C000, ($a0)
        vcst.s          S011, VFPU_PI
        vdot.q          S012, C000, C000
        vrsq.s          S013, S012
        vscl.q          C000, C000, S013
        vdot.t          S013, C000, C000
        vrsq.s          S012, S013
        vscl.t          C000, C000, S012
        vcmp.q          ES, C000
        vasin.s         S012, S003
        vocp.s          S013, S012
        vmul.s          S003, S013, S011
        vcmovt.q        C000, C000[1, 0, 0, 0], 4
        sv.q            C000, ($a1)
        jr              $ra
        nop

NOTE : NaN and Inf don't raise an exception, so we don't need to exit early, and it is preferable to run all the instructions in the normal case without taking any branch. To set a default value, we only test at the end whether our result register has at least one component equal to NaN or Inf. We cannot make it simpler.

Oh ! MrMr[iCE], fear the madness of VFPU prefix here ;P

MrMr[iCE] · Post by **MrMr[iCE]** » Wed Jan 17, 2007 12:11 pm

wow slick man...works great =)
Come on irc, made another routine you might wanna check out =)

MrMr[iCE] · Post by **MrMr[iCE]** » Wed Jan 17, 2007 6:55 pm

Code:

void sceQuatToMatrix(ScePspQuatMatrix *q, ScePspFMatrix4 *m) {
/*	x2	= SQR(x);	y2	= SQR(y);	z2	= SQR(z);	w2	= SQR(w);

	xy	= x * y;
	xz	= x * z;
	yz	= y * z;

	wx	= w * x;
	wy	= w * y;
	wz	= w * z;

	matrix[0] =	float(1 - 2*(y2 + z2));
	matrix[1] =	float(2 * (xy + wz));
	matrix[2] =	float(2 * (xz - wy));

	matrix[4] =	float(2 * (xy - wz));
	matrix[5] =	float(1 - 2*(x2 + z2));
	matrix[6] =	float(2 * (yz + wx));

	matrix[8] =	float(2 * (xz + wy));
	matrix[9] =	float(2 * (yz - wx));
	matrix[10] =float(1 - 2*(x2 + y2));*/

	__asm__ volatile (
       "lv.q      C000, %1\n"                               // C000 = [x,  y,  z,  w ]
       "vmul.q    C010, C000, C000\n"                       // C010 = [x2, y2, z2, w2]
       "vcrs.t    C020, C000, C000\n"                       // C020 = [yz, xz, xy ]
       "vmul.q    C030, C000, C000[w,w,w,0]\n"		        // C030 = [wx, wy, wz ]

       "vadd.q    C100, C020[0,z,y,0], C030[0,z,-y,0]\n"    // C100 = [0,     xy+wz, xz-wy]
       "vadd.s    S100, S011, S012\n"                       // C100 = [y2+z2, xy+wz, xz-wy]

       "vadd.q    C110, C020[z,0,x,0], C030[-z,0,x,0]\n"    // C110 = [xy-wz, 0,     yz+wx]
       "vadd.s    S111, S010, S012\n"                       // C110 = [xy-wz, x2+z2, yz+wx]

       "vadd.q    C120, C020[y,x,0,0], C030[y,-x,0,0]\n"    // C120 = [xz+wy, yz-wx, 0    ]
       "vadd.s    S122, S010, S011\n"                       // C120 = [xz+wy, yz-wx, x2+y2]

       "vmov.s    S033, S033[2]\n"
       "vscl.t    C100, C100, S033\n"                       // C100 = [2*(y2+z2), 2*(xy+wz), 2*(xz-wy)]
       "vscl.t    C110, C110, S033\n"                       // C110 = [2*(xy-wz), 2*(x2+z2), 2*(yz+wx)]
       "vscl.t    C120, C120, S033\n"                       // C120 = [2*(xz+wy), 2*(yz-wx), 2*(x2+y2)]

       "vocp.s    S100, S100\n"                             // C100 = [1-2*(y2+z2), 2*(xy+wz),   2*(xz-wy)  ]
       "vocp.s    S111, S111\n"                             // C110 = [2*(xy-wz),   1-2*(x2+z2), 2*(yz+wx)  ]
       "vocp.s    S122, S122\n"                             // C120 = [2*(xz+wy),   2*(yz-wx),   1-2*(x2+y2)]

       "vidt.q    C130\n"                                   // C130 = [0, 0, 0, 1]

       "sv.q      R100, 0  + %0\n"
       "sv.q      R101, 16 + %0\n"
       "sv.q      R102, 32 + %0\n"
       "sv.q      R103, 48 + %0\n"
       : "=m"(*m) : "m"(*q));
}

Something for you to chew on, hlide =)

MrMr[iCE] · Post by **MrMr[iCE]** » Fri Jan 19, 2007 9:08 am

hlide wrote: VFPU has control registers and some are relative to random seed i guess. They are documented in groepaz's document.

Code:

128 	VFPU_PFXS 	Source prefix stack
129 	VFPU_PFXT 	Target prefix stack
130 	VFPU_PFXD 	Destination prefix stack
131 	VFPU_CC 	Condition information
132 	VFPU_INF4 	VFPU internal information 4
133 	VFPU_RSV5 	Not used (reserved)
134 	VFPU_RSV6 	Not used (reserved)
135 	VFPU_REV 	VFPU revision information
136 	VFPU_RCX0 	Pseudorandom number generator information 0
137 	VFPU_RCX1 	Pseudorandom number generator information 1
138 	VFPU_RCX2 	Pseudorandom number generator information 2
139 	VFPU_RCX3 	Pseudorandom number generator information 3
140 	VFPU_RCX4 	Pseudorandom number generator information 4
141 	VFPU_RCX5 	Pseudorandom number generator information 5
142 	VFPU_RCX6 	Pseudorandom number generator information 6
143 	VFPU_RCX7 	Pseudorandom number generator information 7

hlide and I did a bit of testing with vrnds, vrndf1 and vrndf2.
Turns out the 8 RCX registers are most likely part of a 'shift register' algorithm. They are updated every time a random value is queried with vrndf1/2, and there is a noticeable pattern between RCX2 and RCX3, and between RCX6 and RCX7, across iterations.

before vrnds or any vrndfX instruction is run, the state of the RCX registers is:

Code:

3f800001     3f800002     3f800004     3f800008     3f800000     3f800000     3f800000     3f800000
1.0000001192 1.0000002384 1.0000004768 1.0000009537 1.0000000000 1.0000000000 1.0000000000 1.0000000000

after seeding with 1.4f:

Code:

3f833333     3f833333     3f833333     3f833333     3f833fb3     3f8b3fb3     3f8f3fb3     3f833fb3
1.0249999762 1.0249999762 1.0249999762 1.0249999762 1.0253814459 1.0878814459 1.1191314459 1.0253814459

now 2 calls to vrndf1:

Code:

rand: 1.155536293983
3f8096d8     3f8084f9     3f803333     3f80cccc     3f804f4c     3f80637a     3f803fb3     3f80fecc
1.0046033859 1.0040580034 1.0015624762 1.0062499046 1.0024199486 1.0030357838 1.0019439459 1.0077757835

rand: 1.911504626274
3f80c2f9     3f801c6b     3f80cccc     3f80cccb     3f80fad5     3f804f52     3f80fecc     3f803d4c
1.0059500933 1.0008672476 1.0062499046 1.0062497854 1.0076547861 1.0024206638 1.0077757835 1.0018706322

note RCX3 and RCX7 numbers in the first call, they get moved to RCX2 and RCX6 in the second call.
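
The float line under each hex dump is just the same 32 bits read as an IEEE 754 single. A small sketch in plain C to reproduce that decoding off-device follows. The helper names are made up for the example, and it assumes a host where float is a 32-bit IEEE 754 single.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (memcpy avoids aliasing problems). */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Biased exponent field, bits 23..30. */
static uint32_t exponent_of(uint32_t bits)
{
    return (bits >> 23) & 0xFFu;
}

/* 23-bit mantissa field, bits 0..22. */
static uint32_t mantissa_of(uint32_t bits)
{
    return bits & 0x007FFFFFu;
}
```

Every value in the dumps above starts with 3f8, so `exponent_of` returns 127 for all of them and only the mantissa field changes between calls.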

Also found vrndf1 returns floats in the range of 1.0 to 2.0, and vrndf2 returns in the range of 2.0 to 4.0...anyone have any idea what the 2->4 range would be useful for?

hlide · Post by **hlide** » Sat Jan 20, 2007 3:45 am

MrMr[iCE] wrote:Also found vrndf1 returns floats in the range of 1.0 to 2.0, and vrndf2 returns in the range of 2.0 to 4.0...anyone have any idea what the 2->4 range would be useful for?

note their differences :
- vrndf1 --> [1, 2[ --> range width 2.0 - 1.0 = 1.0
- vrndf2 --> [2, 4[ --> range width 4.0 - 2.0 = 2.0

to have [0, 1[ --> (vrndf1() - 1.0) :

Code:

vrndf1.s S000
vsub.s S000, S000, S000[1]

to have [-1, 1[ --> (vrndf2() - 3.0) :

Code:

vrndf2.s S000
vsub.s S000, S000, S000[3]

there is a possibility that the algorithm only works on the mantissa of the random floats :
- sign bit always 0 : always a positive number,
- exponent fixed to 127,
- significand (1.mantissa) is always in [1, 2[

(-1)^0 x 2^(127-127) x (1.mantissa)

with 00000000000000000000000b <= mantissa <= 11111111111111111111111b

so it would explain the range of [1, 2[ for "vrndf1"

note also that vrndf2() <=> 2.0*vrndf1().
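
The mantissa hypothesis is easy to try out in plain C. The sketch below builds a float from a fixed exponent and 23 arbitrary mantissa bits. It is a model of the idea only, not of the real generator, and all function names are made up. It assumes a host where float is a 32-bit IEEE 754 single.

```c
#include <stdint.h>
#include <string.h>

/* Assemble a positive float from a biased exponent and a 23-bit mantissa. */
static float make_float(uint32_t exponent, uint32_t mantissa)
{
    uint32_t bits = (exponent << 23) | (mantissa & 0x007FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Exponent 127 gives [1, 2[, exponent 128 gives [2, 4[. */
static float rndf1_model(uint32_t m) { return make_float(127u, m); }
static float rndf2_model(uint32_t m) { return make_float(128u, m); }

/* The two remappings from the post. */
static float unit_random(uint32_t m)        { return rndf1_model(m) - 1.0f; }  /* [0, 1[  */
static float signed_unit_random(uint32_t m) { return rndf2_model(m) - 3.0f; }  /* [-1, 1[ */
```

With the same mantissa bits, `rndf2_model` is exactly twice `rndf1_model`, which matches the last remark above.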

MrMr[iCE] · Post by **MrMr[iCE]** » Mon Jan 22, 2007 2:51 pm

Code:

vi2uc.q vd.s, vs.q                           1            0
{
  vd.s[0]( 0.. 7) = vs.q[0] & 0xFF;
  vd.s[0]( 8..15) = vs.q[1] & 0xFF;
  vd.s[0](16..23) = vs.q[2] & 0xFF;
  vd.s[0](24..31) = vs.q[3] & 0xFF;
}

This opcode is giving me a bit of trouble...If i read it correctly, it should be doing a packed unsigned assign from a 4 reg vector....the behavior I'm seeing is not quite that tho:

Code:

M000 before doing vi2uc (note C020, this is the data i want to pack)

      C000     C010     C020     C030
R000: 43200000 42be0000 000000f5 7f800001
R001: 43200000 42be0000 000000e3 7f800001
R002: 43200000 42be0000 000000a7 7f800001
R003: 7f800001 7f800001 000000ff 7f800001

after vi2uc.q S000, C020 i get:

      C000     C010     C020     C030
R000: 00000000 42be0000 000000f5 7f800001
R001: 43200000 42be0000 000000e3 7f800001
R002: 43200000 42be0000 000000a7 7f800001
R003: 7f800001 7f800001 000000ff 7f800001

S000 = 0? It should be FFA7E3F5...what did I miss here?

Edit: I think I see whats happening...the vfpu is actually doing this:

Code:

vi2uc.q vd.s, vs.q
{
  vd.s[0]( 0.. 7) = (vs.q[0] & 0x7F800000) >> 23;
  vd.s[0]( 8..15) = (vs.q[1] & 0x7F800000) >> 23;
  vd.s[0](16..23) = (vs.q[2] & 0x7F800000) >> 23;
  vd.s[0](24..31) = (vs.q[3] & 0x7F800000) >> 23;
}

So you have to left shift your integers by 23 bits before using this instruction, like with vf2iz.q C020, C020, 23. I have no idea why it takes bits 23-30, instead of 24-31 of vs[0]->vs[3], but this is producing the result I needed...any thoughts? Possibly something to do with bit 31 being a sign bit?

Also, I think vf2iX.s/p/t/q does a fixed point conversion. If you provide a non-0 shift immediate, it will fill the bits to the right of the decimal point with fraction bits converted from the original float value. I will test some more.

hlide · Post by **hlide** » Mon Jan 22, 2007 8:06 pm

hummm you're right, the text file is not updated because I was aware of the shift to do. That said, I never tested with negative values.

it looks like :

Code:

vi2uc.q vd.s, vs.q
{
  vd.s[0]( 0.. 7) = max_integer(0, vs.q[0] >> 23);
  vd.s[0]( 8..15) = max_integer(0, vs.q[1] >> 23);
  vd.s[0](16..23) = max_integer(0, vs.q[2] >> 23);
  vd.s[0](24..31) = max_integer(0, vs.q[3] >> 23);
}

it means the integer source is like :

S = [s:1][i:8][f:23]

that is [s:1][i:8] would be a value between -256 and 255, not -128 and 127.
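
The behaviour described in the last two posts can be modelled in plain C to check expected results on a PC. This is a sketch of the thread's working hypothesis (negative inputs clamp to 0, otherwise bits 23 to 30 are taken), not confirmed hardware behaviour, and the function names are invented for the example.

```c
#include <stdint.h>

/* One lane: negative values clamp to 0, otherwise keep bits 23..30. */
static uint32_t vi2uc_lane(int32_t v)
{
    if (v < 0)
        return 0u;
    return ((uint32_t)v >> 23) & 0xFFu;
}

/* Pack four lanes into one word, lane 0 in the low byte. */
static uint32_t vi2uc_model(const int32_t vs[4])
{
    return  vi2uc_lane(vs[0])
         | (vi2uc_lane(vs[1]) << 8)
         | (vi2uc_lane(vs[2]) << 16)
         | (vi2uc_lane(vs[3]) << 24);
}

/* Pre-shift a byte into the position the model reads from. */
static int32_t to_vi2uc_input(uint8_t byte)
{
    return (int32_t)((uint32_t)byte << 23);
}
```

Feeding the raw values f5, e3, a7, ff gives 0, as in the register dump, while pre-shifting each of them with `to_vi2uc_input` gives FFA7E3F5.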

MrMr[iCE] · Post by **MrMr[iCE]** » Wed Jan 24, 2007 1:29 am

Found a bug with prefixes and vbfy1 instruction:

Code:

vadd.s  S010, S000[1]
vsub.s  S011, S000[1]
vbfy1.p C020, C000[x, 1]

should produce:

S010 = C000[x] + 1
S011 = C000[x] - 1

S020 = C000[x] + 1
S021 = C000[x] - 1

but we get a numerical error:

      C000      C010    C020      C030
                        vvvvvvvv
R000: 1.000000 2.000000 2.442695 0.000000
                        ^^^^^^^^
R001: 1.442695 0.000000 0.000000 0.000000
R002: 0.000000 0.000000 0.000000 0.000000
R003: 0.000000 0.000000 0.000000 0.000000

S000 = 1.0, so S020 should be 2.0, but as you can see, not quite...so far only tested with the 1 prefix, will try others

workaround:
do a vone.s S001 and perform vbfy1 without the prefix
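
For reference, here is what the pair butterfly is expected to compute according to the post above, as a plain C sketch. The struct and function names are invented for the example.

```c
typedef struct { float sum, diff; } pair_t;   /* hypothetical result type */

/* Expected vbfy1.p result: d[0] = s[0] + s[1], d[1] = s[0] - s[1]. */
static pair_t bfy1_pair(float s0, float s1)
{
    pair_t d = { s0 + s1, s0 - s1 };
    return d;
}
```

With the intended operands (1.0, 1.0) this gives 2.0 and 0.0. The 2.442695 seen in S020 is what the sum yields with (1.0, 1.442695), which is consistent with the first lane reading the real content of S001 instead of the constant 1. That is only an inference from the dump, not a confirmed explanation.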

hlide · Post by **hlide** » Wed Jan 24, 2007 3:08 am

you should also test if swizzle prefix works :

1) swizzle operation : x, y, z, w
2) absolute operation : ?, |?| where ? is one of 1)
3) negation operation : ?, -? where ? is one of 2)

now that we know that using constant insert operation is not working very well with vbfy1.p, we may also suppose it would be the same for vbfy1/2.q

hlide · Post by **hlide** » Fri Jan 26, 2007 12:34 am

Ok, for those who are interested i wrote several asm macros to count pitch and latency of a VFPU insn, here is the result :

NOTE:
The pitch represents resource-occupying cycles: an instruction using the same resources can only be issued after those cycles have elapsed. We call these cycles the pitch of the instruction.

An instruction has a "Read After Write" hazard with the next instruction if the latter reads a register the former writes; the stall delay is latency(insn1) - 1.

An instruction has a "Write After Write" hazard with the next instruction if the latter has a shorter latency; the stall delay is latency(insn1) - latency(insn2).
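
The two stall rules can be written down as a tiny plain C helper for estimating schedules by hand. This is a sketch of the formulas in the note only. The function names are made up, and clamping the WAW result at zero is an assumption for the case where the second instruction is not the shorter one.

```c
/* Read After Write: the next instruction waits latency(insn1) - 1 cycles. */
static int raw_stall(int latency1)
{
    return latency1 - 1;
}

/* Write After Write: stall only when the second instruction is shorter. */
static int waw_stall(int latency1, int latency2)
{
    int delay = latency1 - latency2;
    return delay > 0 ? delay : 0;
}
```

For example, with the figures from the table that follows, a vdot.q (latency 7) whose result is read by the very next instruction would cost 6 stall cycles, and a vabs.q (latency 3) writing the same register right after it would cost 4.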

Code:

INSTRUCTION     PITCH   LATENCY
--------------- ------- -------
lv.s            1       3
lv.q            1       3
mfv             6       0
mfvc            6       0
mtv             1       3
mtvc            1       3
sv.s            5       0
sv.q            5       0
svl.q           5       0
svr.q           5       0
vabs.s          1       3
vabs.p          1       3
vabs.t          1       3
vabs.q          1       3
vadd.s          1       5
vadd.p          1       5
vadd.t          1       5
vadd.q          1       5
vasin.s         1       7
vasin.p         2       8
vasin.t         3       9
vasin.q         4       10
vavg.p          1       7
vavg.t          1       7
vavg.q          1       7
vbfy1.p         1       5
vbfy1.q         1       5
vbfy2.q         1       5
vcmovf.s        1       5
vcmovf.p        1       5
vcmovf.t        1       5
vcmovf.q        1       5
vcmovt.s        1       5
vcmovt.p        1       5
vcmovt.t        1       5
vcmovt.q        1       5
vcmp.s          1       3
vcmp.p          1       3
vcmp.t          1       3
vcmp.q          1       3
vcos.s          1       7
vcos.p          2       8
vcos.t          3       9
vcos.q          4       10
vcrs.t          1       5
vcrsp.t         3       9
vcst.s          1       3
vcst.p          1       3
vcst.t          1       3
vcst.q          1       3
vdet.p          1       7
vdiv.s          14      17
vdiv.p          28      31
vdiv.t          42      45
vdiv.q          56      59
vdot.t          1       7
vdot.q          1       7
vexp2.s         1       7
vexp2.p         2       8
vexp2.t         3       9
vexp2.q         4       10
vf2h.p          1       5
vf2h.q          1       5
vf2id.s         1       5
vf2id.p         1       5
vf2id.t         1       5
vf2id.q         1       5
vf2in.s         1       5
vf2in.p         1       5
vf2in.t         1       5
vf2in.q         1       5
vf2iu.s         1       5
vf2iu.p         1       5
vf2iu.t         1       5
vf2iu.q         1       5
vf2iz.s         1       5
vf2iz.p         1       5
vf2iz.t         1       5
vf2iz.q         1       5
vfad.p          1       7
vfad.t          1       7
vfad.q          1       7
vh2f.s          1       5
vh2f.p          1       5
vhdp.p          1       7
vhdp.q          1       7
vhtfm2.p        2       8
vhtfm3.t        3       9
vhtfm4.q        4       10
vi2c.q          1       3
vi2f.s          1       5
vi2f.p          1       5
vi2f.t          1       5
vi2f.q          1       5
vi2s.p          1       3
vi2s.q          1       3
vi2uc.q         1       3
vi2us.p         1       3
vi2us.q         1       3
vidt.p          1       3
vidt.q          1       3
viim.s          1       5
vlgb.s          1       5
vlog2.s         1       7
vlog2.p         2       8
vlog2.t         3       9
vlog2.q         4       10
vmax.s          1       5
vmax.p          1       5
vmax.t          1       5
vmax.q          1       5
vmfvc           1       3
vmidt.p         2       4
vmidt.t         3       5
vmidt.q         4       6
vmin.s          1       5
vmin.p          1       5
vmin.t          1       5
vmin.q          1       5
vmmov.p         2       4
vmmov.t         3       5
vmmov.q         4       6
vmmul.p         4       10
vmmul.t         9       15
vmmul.q         16      22
vmone.p         2       4
vmone.t         3       5
vmone.q         4       6
vmscl.p         2       6
vmscl.t         3       7
vmscl.q         4       8
vmtvc           1       3
vmul.s          1       5
vmul.p          1       5
vmul.t          1       5
vmul.q          1       5
vmzero.p        2       4
vmzero.t        3       5
vmzero.q        4       6
vneg.s          1       3
vneg.p          1       3
vneg.t          1       3
vneg.q          1       3
vnrcp.s         1       7
vnrcp.p         2       8
vnrcp.t         3       9
vnrcp.q         4       10
vnsin.s         1       7
vnsin.p         2       8
vnsin.t         3       9
vnsin.q         4       10
vocp.s          1       5
vocp.p          1       5
vocp.t          1       5
vocp.q          1       5
vone.s          1       3
vone.p          1       3
vone.t          1       3
vone.q          1       3
vqmul.q         4       10
vrcp.s          1       7
vrcp.p          2       8
vrcp.t          3       9
vrcp.q          4       10
vrexp2.s        1       7
vrexp2.p        2       8
vrexp2.t        3       9
vrexp2.q        4       10
vrndf1.s        3       5
vrndf1.p        6       8
vrndf1.t        9       11
vrndf1.q        12      14
vrndf2.s        3       5
vrndf2.p        6       8
vrndf2.t        9       11
vrndf2.q        12      14
vrndi.s         3       5
vrndi.p         6       8
vrndi.t         9       11
vrndi.q         12      14
vrnds.s         1       3
vrot.p          2       8
vrot.t          2       8
vrot.q          2       8
vrsq.s          1       7
vrsq.p          2       8
vrsq.t          3       9
vrsq.q          4       10
vs2i.s          1       3
vs2i.p          1       3
vsat0.s         1       3
vsat0.p         1       3
vsat0.t         1       3
vsat0.q         1       3
vsat1.s         1       3
vsat1.p         1       3
vsat1.t         1       3
vsat1.q         1       3
vsbn.s          1       5
vsbz.s          1       5
vscl.p          1       5
vscl.t          1       5
vscl.q          1       5
vscmp.s         1       5
vscmp.p         1       5
vscmp.t         1       5
vscmp.q         1       5
vsge.s          1       5
vsge.p          1       5
vsge.t          1       5
vsge.q          1       5
vsgn.s          1       5
vsgn.p          1       5
vsgn.t          1       5
vsgn.q          1       5
vsin.s          1       7
vsin.p          2       8
vsin.t          3       9
vsin.q          4       10
vslt.s          1       5
vslt.p          1       5
vslt.t          1       5
vslt.q          1       5
vsocp.s         1       5
vsocp.p         1       5
vsqrt.s         1       7
vsqrt.p         2       8
vsqrt.t         3       9
vsqrt.q         4       10
vsrt1.q         1       5
vsrt2.q         1       5
vsrt3.q         1       5
vsrt4.q         1       5
vsub.s          1       5
vsub.p          1       5
vsub.t          1       5
vsub.q          1       5
vt4444.q        1       3
vt5551.q        1       3
vt5650.q        1       3
vtfm2.p         2       8
vtfm3.t         3       9
vtfm4.q         4       10
vus2i.s         1       3
vus2i.p         1       3
vwbn.s          1       5
vzero.p         1       3
vzero.t         1       3
vzero.q         1       3

The stubs.S where I put all the tests:

Code: Select all

.set noreorder
.text
 
// use it for a single-cycle instruction (pitch = 1)
// or for macro-instruction which reiterates the same single-cycle instruction (pitch > 1)
.macro test1 insn
.p2align 6
        
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles1 
        move                    $a1, $v0
        
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles2
        subu                    $a2, $v0, $a1
        sb                      $a2, 0($a0) # pitch = cycles2 - cycles1
       
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
        vsync                   # our latency is somewhere here !
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
        subu                    $a2, $v0, $a1
        addiu                   $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
        sb                      $a2, 1($a0) # latency = cycles3 - cycles1 - 2
        addu                    $a0, $a0, 2
.endm

// use it for a macro-instruction which reiterates the same multi-cycle instruction
// NOTE: we cannot compute the pitch directly, but we can derive it this way:
// pitch(vdiv.s) = latency(vdiv.p) - latency(vdiv.s)
// pitch(vdiv.p) = pitch(vdiv.s) + pitch(vdiv.s)
// pitch(vdiv.t) = pitch(vdiv.s) + pitch(vdiv.p)
// pitch(vdiv.q) = pitch(vdiv.s) + pitch(vdiv.t)
.macro test2a insn
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles1 
        move                    $a1, $v0
        move                    $t9, $0
        
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
        vsync                   # our latency is somewhere here !
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
        subu                    $a2, $v0, $a1
        addiu                   $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
        sb                      $a2, 1($a0) # latency' = cycles3 - cycles1 - 2
        move                    $t8, $a0
        addu                    $a0, $a0, 2
.endm

// use it for a macro-instruction which reiterates the same multi-cycle instruction
// NOTE: we cannot compute the pitch directly, but we can derive it this way:
// pitch(vdiv.s) = latency(vdiv.p) - latency(vdiv.s)
// pitch(vdiv.p) = latency(vdiv.t) - latency(vdiv.p) + pitch(vdiv.s)
// pitch(vdiv.t) = latency(vdiv.q) - latency(vdiv.t) + pitch(vdiv.p)
// pitch(vdiv.q) = pitch(vdiv.t) + pitch(vdiv.s)
.macro test2b insn
        subu                    $t9, $t9, $a2
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1
        move                    $a1, $v0
              
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
        vsync
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
        subu                    $a2, $v0, $a1
        addiu                   $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
        sb                      $a2, 1($a0) # latency = cycles3 - cycles1 - 2
        addu                    $t9, $t9, $a2
        sb                      $t9, -2($a0) # pitch = latency - latency'
        sb                      $a2, 1($a0) # latency
        addu                    $a0, $a0, 2
.endm

// use it for a macro-instruction which reiterates the same multi-cycle instruction
// NOTE: we cannot compute the pitch directly, but we can derive it this way:
// pitch(vdiv.s) = latency(vdiv.p) - latency(vdiv.s)
// pitch(vdiv.p) = latency(vdiv.t) - latency(vdiv.p) + pitch(vdiv.s)
// pitch(vdiv.t) = latency(vdiv.q) - latency(vdiv.t) + pitch(vdiv.p)
// pitch(vdiv.q) = pitch(vdiv.t) + pitch(vdiv.s)
.macro test2c insn
        subu                    $t9, $t9, $a2
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1
        move                    $a1, $v0
              
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
        vsync
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
        subu                    $a2, $v0, $a1
        addiu                   $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
        sb                      $a2, 1($a0) # latency = cycles3 - cycles1 - 2
        addu                    $t9, $t9, $a2
        sb                      $t9, -2($a0) # pitch = latency - latency'
        sb                      $a2, 1($a0) # latency
        lbu                     $v0, ($t8)
        addu                    $t9, $t9, $v0
        sb                      $t9, 0($a0) # pitch
        addu                    $a0, $a0, 2
.endm

// use it for an instruction which needs to deal with writing into CPU register or memory
// since there is no latency
.macro test3 insn
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        vsync
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $a1, $v0, $v1 # cycles1
       
.p2align 6
        mfc0                    $v1, $9
        nop
        vnop                    # stall next VFPU instruction
        \insn
        vsync
        vnop                    # block next CPU instructions so we can count the stall cycles
0:      mfc0                    $v0, $9
        sync
        subu                    $v0, $v0, $v1
        subu                    $a2, $v0, $a1
        sb                      $a2, 0($a0) # pitch = cycles2 - cycles1
        sb                      $0, 1($a0) # latency = 0
        addu                    $a0, $a0, 2
.endm

.global test_vfpu_cycles
test_vfpu_cycles:
        
        test1                    "lv.s $0,($a3)"
        test1                    "lv.q $0,($a3)"
        test3                    "mfv $t0, $0"
        test3                    "mfvc $t0, $131"
        test1                    "mtv $0, $0"
        test1                    "mtvc $0, $131"
        test3                    "sv.s $0,($a3)"
        test3                    "sv.q $0,($a3)"
        test3                    "svl.q $0,($a3)"
        test3                    "svr.q $0,($a3)"
        test1                    "vabs.s $1, $0"
        test1                    "vabs.p $1, $0"
        test1                    "vabs.t $1, $0"
        test1                    "vabs.q $1, $0"
        test1                    "vadd.s $2, $1, $0"
        test1                    "vadd.p $2, $1, $0"
        test1                    "vadd.t $2, $1, $0"
        test1                    "vadd.q $2, $1, $0"
        test1                    "vasin.s $1, $0"
        test1                    "vasin.p $1, $0"
        test1                    "vasin.t $1, $0"
        test1                    "vasin.q $1, $0"
        test1                    "vavg.p $1, $0"
        test1                    "vavg.t $1, $0"
        test1                    "vavg.q $1, $0"
        test1                    "vbfy1.p $1, $0"
        test1                    "vbfy1.q $1, $0"
        test1                    "vbfy2.q $1, $0"
        test1                    "vcmovf.s $1, $0, 0"
        test1                    "vcmovf.p $1, $0, 0"
        test1                    "vcmovf.t $1, $0, 0"
        test1                    "vcmovf.q $1, $0, 0"
        test1                    "vcmovt.s $1, $0, 0"
        test1                    "vcmovt.p $1, $0, 0"
        test1                    "vcmovt.t $1, $0, 0"
        test1                    "vcmovt.q $1, $0, 0"
        test1                    "vcmp.s EQ, $1, $0"
        test1                    "vcmp.p EQ, $1, $0"
        test1                    "vcmp.t EQ, $1, $0"
        test1                    "vcmp.q EQ, $1, $0"
        test1                    "vcos.s $1, $0"
        test1                    "vcos.p $1, $0"
        test1                    "vcos.t $1, $0"
        test1                    "vcos.q $1, $0"
        test1                    "vcrs.t $2, $1, $0"
        test1                    "vcrsp.t $2, $1, $0"
        test1                    "vcst.s $0, VFPU_PI"
        test1                    "vcst.p $0, VFPU_PI"
        test1                    "vcst.t $0, VFPU_PI"
        test1                    "vcst.q $0, VFPU_PI"
        test1                    "vdet.p $2, $1, $0"
        test2a                   "vdiv.s $2, $1, $0"
        test2b                   "vdiv.p $2, $1, $0"
        test2b                   "vdiv.t $2, $1, $0"
        test2c                   "vdiv.q $2, $1, $0"
        test1                    "vdot.t $2, $1, $0"
        test1                    "vdot.q $2, $1, $0"
        test1                    "vexp2.s $1, $0"
        test1                    "vexp2.p $1, $0"
        test1                    "vexp2.t $1, $0"
        test1                    "vexp2.q $1, $0"
        test1                    "vf2h.p $1, $0"
        test1                    "vf2h.q $1, $0"
        test1                    "vf2id.s $1, $0, 0"
        test1                    "vf2id.p $1, $0, 0"
        test1                    "vf2id.t $1, $0, 0"
        test1                    "vf2id.q $1, $0, 0"
        test1                    "vf2in.s $1, $0, 0"
        test1                    "vf2in.p $1, $0, 0"
        test1                    "vf2in.t $1, $0, 0"
        test1                    "vf2in.q $1, $0, 0"
        test1                    "vf2iu.s $1, $0, 0"
        test1                    "vf2iu.p $1, $0, 0"
        test1                    "vf2iu.t $1, $0, 0"
        test1                    "vf2iu.q $1, $0, 0"
        test1                    "vf2iz.s $1, $0, 0"
        test1                    "vf2iz.p $1, $0, 0"
        test1                    "vf2iz.t $1, $0, 0"
        test1                    "vf2iz.q $1, $0, 0"
        test1                    "vfad.p $1, $0"
        test1                    "vfad.t $1, $0"
        test1                    "vfad.q $1, $0"
        test1                    "vh2f.s $1, $0"
        test1                    "vh2f.p $1, $0"
        test1                    "vhdp.p $2, $1, $0"
        test1                    "vhdp.q $2, $1, $0"
        test1                    "vhtfm2.p $8, $4, $0"
        test1                    "vhtfm3.t $8, $4, $0"
        test1                    "vhtfm4.q $8, $4, $0"
        test1                    "vi2c.q $1, $0"
        test1                    "vi2f.s $1, $0, 0"
        test1                    "vi2f.p $1, $0, 0"
        test1                    "vi2f.t $1, $0, 0"
        test1                    "vi2f.q $1, $0, 0"
        test1                    "vi2s.p $1, $0"
        test1                    "vi2s.q $1, $0"
        test1                    "vi2uc.q $1, $0"
        test1                    "vi2us.p $1, $0"
        test1                    "vi2us.q $1, $0"
        test1                    "vidt.p $0"
        test1                    "vidt.q $0"
        test1                    "viim.s $0, 0"
        test1                    "vlgb.s $1, $0"
        test1                    "vlog2.s $1, $0"
        test1                    "vlog2.p $1, $0"
        test1                    "vlog2.t $1, $0"
        test1                    "vlog2.q $1, $0"
        test1                    "vmax.s $2, $1, $0"
        test1                    "vmax.p $2, $1, $0"
        test1                    "vmax.t $2, $1, $0"
        test1                    "vmax.q $2, $1, $0"
        test1                    "vmfvc $0, $131"
        test1                    "vmidt.p $0"
        test1                    "vmidt.t $0"
        test1                    "vmidt.q $0"
        test1                    "vmin.s $2, $1, $0"
        test1                    "vmin.p $2, $1, $0"
        test1                    "vmin.t $2, $1, $0"
        test1                    "vmin.q $2, $1, $0"
        test1                    "vmmov.p $4, $0"
        test1                    "vmmov.t $4, $0"
        test1                    "vmmov.q $4, $0"
        test1                    "vmmul.p $8, $4, $0"
        test1                    "vmmul.t $8, $4, $0"
        test1                    "vmmul.q $8, $4, $0"
        test1                    "vmone.p $0"
        test1                    "vmone.t $0"
        test1                    "vmone.q $0"
        test1                    "vmscl.p $8, $4, $0"
        test1                    "vmscl.t $8, $4, $0"
        test1                    "vmscl.q $8, $4, $0"
        test1                    "vmtvc $131, $0"
        test1                    "vmul.s $2, $1, $0"
        test1                    "vmul.p $2, $1, $0"
        test1                    "vmul.t $2, $1, $0"
        test1                    "vmul.q $2, $1, $0"
        test1                    "vmzero.p $0"
        test1                    "vmzero.t $0"
        test1                    "vmzero.q $0"
        test1                    "vneg.s $1, $0"
        test1                    "vneg.p $1, $0"
        test1                    "vneg.t $1, $0"
        test1                    "vneg.q $1, $0"
        test1                    "vnrcp.s $1, $0"
        test1                    "vnrcp.p $1, $0"
        test1                    "vnrcp.t $1, $0"
        test1                    "vnrcp.q $1, $0"
        test1                    "vnsin.s $1, $0"
        test1                    "vnsin.p $1, $0"
        test1                    "vnsin.t $1, $0"
        test1                    "vnsin.q $1, $0"
        test1                    "vocp.s $1, $0"
        test1                    "vocp.p $1, $0"
        test1                    "vocp.t $1, $0"
        test1                    "vocp.q $1, $0"
        test1                    "vone.s $0"
        test1                    "vone.p $0"
        test1                    "vone.t $0"
        test1                    "vone.q $0"
        test1                    "vqmul.q $2, $1, $0"
        test1                    "vrcp.s $1, $0"
        test1                    "vrcp.p $1, $0"
        test1                    "vrcp.t $1, $0"
        test1                    "vrcp.q $1, $0"
        test1                    "vrexp2.s $1, $0"
        test1                    "vrexp2.p $1, $0"
        test1                    "vrexp2.t $1, $0"
        test1                    "vrexp2.q $1, $0"
        test2a                   "vrndf1.s $0"
        test2b                   "vrndf1.p $0"
        test2b                   "vrndf1.t $0"
        test2c                   "vrndf1.q $0"
        test2a                   "vrndf2.s $0"
        test2b                   "vrndf2.p $0"
        test2b                   "vrndf2.t $0"
        test2c                   "vrndf2.q $0"
        test2a                   "vrndi.s $0"
        test2b                   "vrndi.p $0"
        test2b                   "vrndi.t $0"
        test2c                   "vrndi.q $0"
        test1                    "vrnds.s $0"
        test1                    "vrot.p $1, $0, [c,s]"
        test1                    "vrot.t $1, $0, [c,s,0]"
        test1                    "vrot.q $1, $0, [c,s,0,0]"
        test1                    "vrsq.s $1, $0"
        test1                    "vrsq.p $1, $0"
        test1                    "vrsq.t $1, $0"
        test1                    "vrsq.q $1, $0"
        test1                    "vs2i.s $1, $0"
        test1                    "vs2i.p $1, $0"
        test1                    "vsat0.s $1, $0"
        test1                    "vsat0.p $1, $0"
        test1                    "vsat0.t $1, $0"
        test1                    "vsat0.q $1, $0"
        test1                    "vsat1.s $1, $0"
        test1                    "vsat1.p $1, $0"
        test1                    "vsat1.t $1, $0"
        test1                    "vsat1.q $1, $0"
        test1                    "vsbn.s $2, $1, $0"
        test1                    "vsbz.s $1, $0"
        test1                    "vscl.p $2, $1, $0"
        test1                    "vscl.t $2, $1, $0"
        test1                    "vscl.q $2, $1, $0"
        test1                    "vscmp.s $2, $1, $0"
        test1                    "vscmp.p $2, $1, $0"
        test1                    "vscmp.t $2, $1, $0"
        test1                    "vscmp.q $2, $1, $0"
        test1                    "vsge.s $2, $1, $0"
        test1                    "vsge.p $2, $1, $0"
        test1                    "vsge.t $2, $1, $0"
        test1                    "vsge.q $2, $1, $0"
        test1                    "vsgn.s $1, $0"
        test1                    "vsgn.p $1, $0"
        test1                    "vsgn.t $1, $0"
        test1                    "vsgn.q $1, $0"
        test1                    "vsin.s $1, $0"
        test1                    "vsin.p $1, $0"
        test1                    "vsin.t $1, $0"
        test1                    "vsin.q $1, $0"
        test1                    "vslt.s $2, $1, $0"
        test1                    "vslt.p $2, $1, $0"
        test1                    "vslt.t $2, $1, $0"
        test1                    "vslt.q $2, $1, $0"
        test1                    "vsocp.s $1, $0"
        test1                    "vsocp.p $1, $0"
        test1                    "vsqrt.s $1, $0"
        test1                    "vsqrt.p $1, $0"
        test1                    "vsqrt.t $1, $0"
        test1                    "vsqrt.q $1, $0"
        test1                    "vsrt1.q $1, $0"
        test1                    "vsrt2.q $1, $0"
        test1                    "vsrt3.q $1, $0"
        test1                    "vsrt4.q $1, $0"
        test1                    "vsub.s $2, $1, $0"
        test1                    "vsub.p $2, $1, $0"
        test1                    "vsub.t $2, $1, $0"
        test1                    "vsub.q $2, $1, $0"
        test1                    "vt4444.q $1, $0"
        test1                    "vt5551.q $1, $0"
        test1                    "vt5650.q $1, $0"
        test1                    "vtfm2.p $1, $4, $0"
        test1                    "vtfm3.t $1, $4, $0"
        test1                    "vtfm4.q $1, $4, $0"
        test1                    "vus2i.s $1, $0"
        test1                    "vus2i.p $1, $0"
        test1                    "vwbn.s $1, $0, 0"
        test1                    "vzero.p $0"
        test1                    "vzero.t $0"
        test1                    "vzero.q $0"

        jr                      $ra
        nop
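
For reference, the bookkeeping that test1 does on its three Count readings can be restated in plain C. This is only a host-side sketch, not code from the thread: the struct and function names are made up for illustration, and the constant 2 is the vsync + vnop overhead that the macro comments assume.

```c
#include <stdint.h>

/* Hypothetical container for the two bytes test1 stores per instruction. */
typedef struct {
    uint8_t pitch;
    uint8_t latency;
} vfpu_timing;

/*
 * cycles1: Count delta around the bare instruction
 * cycles2: Count delta with a trailing vnop
 * cycles3: Count delta with a trailing vsync + vnop
 */
static vfpu_timing timing_from_counts(uint32_t cycles1, uint32_t cycles2,
                                      uint32_t cycles3)
{
    vfpu_timing t;
    t.pitch   = (uint8_t)(cycles2 - cycles1);      /* pitch = cycles2 - cycles1 */
    t.latency = (uint8_t)(cycles3 - cycles1 - 2u); /* latency = cycles3 - cycles1 - 2 */
    return t;
}
```

With made-up readings of 10, 11 and 17 cycles, this yields a pitch of 1 and a latency of 5.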

And the main.c:

Code: Select all

/*
 * PSP Software Development Kit - http://www.pspdev.org
 * -----------------------------------------------------------------------
 * Licensed under the BSD license, see LICENSE in PSPSDK root for details.
 *
 * main.c - Basic ELF template
 *
 * Copyright (c) 2005 Marcus R. Brown <mrbrown@ocgnet.org>
 * Copyright (c) 2005 James Forshaw <tyranid@gmail.com>
 * Copyright (c) 2005 John Kelley <ps2dev@kelley.ca>
 *
 * $Id: main.c 1888 2006-05-01 08:47:04Z tyranid $
 * $HeadURL$
 */
#include <pspkernel.h>
#include <pspdebug.h>
#include <pspctrl.h>
#include <psptypes.h>

#include <math.h>
#include <stdio.h>
#include <string.h> /* strlen */

#define printf pspDebugScreenPrintf

/* Define the module info section */
PSP_MODULE_INFO("template", 0x1000, 1, 1);

/* Define the main thread's attribute value (optional) */
PSP_MAIN_THREAD_ATTR(THREAD_ATTR_VFPU);

static void exception_handler(PspDebugRegBlock *regs)
{
  pspDebugScreenInit();

  pspDebugScreenSetBackColor(0x00FF0000);
  pspDebugScreenSetTextColor(0xFFFFFFFF);
  pspDebugScreenClear();

  pspDebugScreenPrintf("\nSC - Exception Details:\n");
  pspDebugDumpException(regs);

  pspDebugScreenPrintf("\n\nPress 'cross' button to exit.");

  wait(); /* author's own helper, not shown in this post */

  sceKernelExitGame();
}

void save_file(const char *data, unsigned int n, const char *name)
{
  int fdout;

  fdout = sceIoOpen(name, PSP_O_WRONLY | PSP_O_CREAT | PSP_O_TRUNC, 0777);
  if (fdout < 0) return; /* could not open the file, nothing to write */

  sceIoWrite(fdout, data, n);

  sceIoClose(fdout);
}

static char *vfpu_insn[] =
{
"lv.s",
"lv.q",
"mfv",
"mfvc",
"mtv",
"mtvc",
"sv.s",
"sv.q",
"svl.q",
"svr.q",
"vabs.s",
"vabs.p",
"vabs.t",
"vabs.q",
"vadd.s",
"vadd.p",
"vadd.t",
"vadd.q",
"vasin.s",
"vasin.p",
"vasin.t",
"vasin.q",
"vavg.p",
"vavg.t",
"vavg.q",
"vbfy1.p",
"vbfy1.q",
"vbfy2.q",
"vcmovf.s",
"vcmovf.p",
"vcmovf.t",
"vcmovf.q",
"vcmovt.s",
"vcmovt.p",
"vcmovt.t",
"vcmovt.q",
"vcmp.s",
"vcmp.p",
"vcmp.t",
"vcmp.q",
"vcos.s",
"vcos.p",
"vcos.t",
"vcos.q",
"vcrs.t",
"vcrsp.t",
"vcst.s",
"vcst.p",
"vcst.t",
"vcst.q",
"vdet.p",
"vdiv.s",
"vdiv.p",
"vdiv.t",
"vdiv.q",
"vdot.t",
"vdot.q",
"vexp2.s",
"vexp2.p",
"vexp2.t",
"vexp2.q",
"vf2h.p",
"vf2h.q",
"vf2id.s",
"vf2id.p",
"vf2id.t",
"vf2id.q",
"vf2in.s",
"vf2in.p",
"vf2in.t",
"vf2in.q",
"vf2iu.s",
"vf2iu.p",
"vf2iu.t",
"vf2iu.q",
"vf2iz.s",
"vf2iz.p",
"vf2iz.t",
"vf2iz.q",
"vfad.p",
"vfad.t",
"vfad.q",
"vh2f.s",
"vh2f.p",
"vhdp.p",
"vhdp.q",
"vhtfm2.p",
"vhtfm3.t",
"vhtfm4.q",
"vi2c.q",
"vi2f.s",
"vi2f.p",
"vi2f.t",
"vi2f.q",
"vi2s.p",
"vi2s.q",
"vi2uc.q",
"vi2us.p",
"vi2us.q",
"vidt.p",
"vidt.q",
"viim.s",
"vlgb.s",
"vlog2.s",
"vlog2.p",
"vlog2.t",
"vlog2.q",
"vmax.s",
"vmax.p",
"vmax.t",
"vmax.q",
"vmfvc",
"vmidt.p",
"vmidt.t",
"vmidt.q",
"vmin.s",
"vmin.p",
"vmin.t",
"vmin.q",
"vmmov.p",
"vmmov.t",
"vmmov.q",
"vmmul.p",
"vmmul.t",
"vmmul.q",
"vmone.p",
"vmone.t",
"vmone.q",
"vmscl.p",
"vmscl.t",
"vmscl.q",
"vmtvc",
"vmul.s",
"vmul.p",
"vmul.t",
"vmul.q",
"vmzero.p",
"vmzero.t",
"vmzero.q",
"vneg.s",
"vneg.p",
"vneg.t",
"vneg.q",
"vnrcp.s",
"vnrcp.p",
"vnrcp.t",
"vnrcp.q",
"vnsin.s",
"vnsin.p",
"vnsin.t",
"vnsin.q",
"vocp.s",
"vocp.p",
"vocp.t",
"vocp.q",
"vone.s",
"vone.p",
"vone.t",
"vone.q",
"vqmul.q",
"vrcp.s",
"vrcp.p",
"vrcp.t",
"vrcp.q",
"vrexp2.s",
"vrexp2.p",
"vrexp2.t",
"vrexp2.q",
"vrndf1.s",
"vrndf1.p",
"vrndf1.t",
"vrndf1.q",
"vrndf2.s",
"vrndf2.p",
"vrndf2.t",
"vrndf2.q",
"vrndi.s",
"vrndi.p",
"vrndi.t",
"vrndi.q",
"vrnds.s",
"vrot.p",
"vrot.t",
"vrot.q",
"vrsq.s",
"vrsq.p",
"vrsq.t",
"vrsq.q",
"vs2i.s",
"vs2i.p",
"vsat0.s",
"vsat0.p",
"vsat0.t",
"vsat0.q",
"vsat1.s",
"vsat1.p",
"vsat1.t",
"vsat1.q",
"vsbn.s",
"vsbz.s",
"vscl.p",
"vscl.t",
"vscl.q",
"vscmp.s",
"vscmp.p",
"vscmp.t",
"vscmp.q",
"vsge.s",
"vsge.p",
"vsge.t",
"vsge.q",
"vsgn.s",
"vsgn.p",
"vsgn.t",
"vsgn.q",
"vsgn.s",
"vsin.p",
"vsin.t",
"vsin.q",
"vslt.s",
"vslt.p",
"vslt.t",
"vslt.p",
"vsocp.s",
"vsocp.p",
"vsqrt.s",
"vsqrt.p",
"vsqrt.t",
"vsqrt.q",
"vsrt1.q",
"vsrt2.q",
"vsrt3.q",
"vsrt4.q",
"vsub.s",
"vsub.p",
"vsub.t",
"vsub.q",
"vt4444.q",
"vt5551.q",
"vt5650.q",
"vtfm2.p",
"vtfm3.t",
"vtfm4.q",
"vus2i.s",
"vus2i.p",
"vwbn.s",
"vzero.p",
"vzero.t",
"vzero.q",
0
};

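ScePspFQuaternion res[4];
char g_data[4*2*512];
char g_text[32*1024];

/* Implemented in stubs.S: $a0 is the result buffer, $a1/$a2 are clobbered, $a3 is scratch memory */
extern void test_vfpu_cycles(char *results, char *unused1, char *unused2, ScePspFQuaternion *scratch);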

int main(int argc, char *argv[])
{
  pspDebugInstallErrorHandler(exception_handler);

  pspDebugScreenInit();

  sceCtrlSetSamplingCycle(0);
  sceCtrlSetSamplingMode(PSP_CTRL_MODE_DIGITAL);

  // touch res first to avoid a cache miss during the timed loads and stores
  res->x = 0;
  res->y = 0;
  res->z = 0;
  res->w = 0;
  test_vfpu_cycles(g_data, g_data, g_data, res);

  int i = 0, len = 0;

  len = len + sprintf(g_text + len, "INSTRUCTION\t\tPITCH\tLATENCY\n");
  len = len + sprintf(g_text + len, "---------------\t-------\t-------\n");
  while (vfpu_insn[i])
  {
    char const *q = vfpu_insn[i];
    len = len + sprintf(g_text + len, "%s%s%d\t%d\n", q, (strlen(q) < 8 ? "\t\t" : "\t"), g_data[i*2], g_data[i*2+1]);
    i++;
  }

  save_file(g_text, len, "ms0:/cycles.txt");

  sceKernelExitGame();

  return 0;
}

Raphael · Post by **Raphael** » Fri Jan 26, 2007 3:21 pm

Some of the cycles look pretty synthetic though.

vdiv.s 1 2
vdiv.p 1 2
vdiv.t 1 3
vdiv.q 1 4

vrcp.s 1 2
vrcp.p 1 3
vrcp.t 1 4
vrcp.q 1 5

Makes me wonder whether those are reliable for real-world usage. I doubt the VFPU does a full quad-vector divide in 5 cycles while the reciprocal takes 6, especially in comparison to my tests.
So I wonder whether that method returns the real cycles the op takes on the VFPU, or just the cycles it takes the CPU to submit the ops to the VFPU. Also, what happens if you insert another instruction instead of the vnop to check for latency?

hlide · Post by **hlide** » Fri Jan 26, 2007 6:45 pm

huh...

hlide · Post by **hlide** » Sat Jan 27, 2007 8:25 am

/!\ UPDATED /!\

OK, I think the pitch and latency look more accurate now, though I still cannot really confirm it.

vdiv and vrndX seem to be special because they are not single-cycle instructions, so I was forced to tweak my macro to get their pitch indirectly.

I call them macro-instructions because some instructions seem to iterate the same instruction a number of times according to the suffix .p, .t or .q.
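
The indirect derivation can be written out as a short C sketch that follows the formulas in the test2b/test2c macro comments. The function name and the array layout are assumptions for illustration; only the arithmetic comes from the macros.

```c
/*
 * latency[0..3]: measured latencies of the .s, .p, .t and .q forms.
 * pitch[0..3]:   derived pitches, in the same order.
 */
static void macro_pitches(const int latency[4], int pitch[4])
{
    pitch[0] = latency[1] - latency[0];            /* pitch(.s) */
    pitch[1] = latency[2] - latency[1] + pitch[0]; /* pitch(.p) */
    pitch[2] = latency[3] - latency[2] + pitch[1]; /* pitch(.t) */
    pitch[3] = pitch[2] + pitch[0];                /* pitch(.q) */
}
```

Feeding it the vrndf1 latencies from the table (5, 8, 11, 14) gives pitches of 3, 6, 9 and 12, which matches the reported values.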
