VFPU playground, code generation for gas-unsupported opcodes

holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

MrMr[iCE] wrote: holger: more opcodes for ya
cool, added, except vidt:

Code: Select all

/*
+-------------------------------------------------------------+--------------+
|31                                   16 | 15 | 14     8  | 7 | 6         0  |
+-------------------------------------------------------------+--------------+
| opcode 0xd003 (p)                      |  0 |      0    | 1 | vfpu_rd[6-0] |
| opcode 0xd003 (t)                      |  1 |      0    | 0 | vfpu_rd[6-0] |
| opcode 0xd003 (q)                      |  1 |      0    | 1 | vfpu_rd[6-0] |
+-------------------------------------------------------------+--------------+
	
	VectorLoadIdentity.Pair/Triple/Quad

    vidt.p %vfpu_rd	; Set 2x1 Vector to Identity
    vidt.t %vfpu_rd	; Set 3x1 Vector to Identity
    vidt.q %vfpu_rd	; Set 4x1 Vector to Identity

        %vfpu_rd:	VFPU Vector Destination Register ([s|p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- identity vector
*/
#define vidt_p(vfpu_rd)  (0xd0030080 | (vfpu_rd))
#define vidt_t(vfpu_rd)  (0xd0038000 | (vfpu_rd))
#define vidt_q(vfpu_rd)  (0xd0038080 | (vfpu_rd))

what is an identity vector? Is this (0, 0, 0, 1) or (1, 1, 1, 1)/vone?
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

what is an identity vector? Is this (0, 0, 0, 1) or (1, 1, 1, 1)/vone?
My bad, I didn't test this function well enough to notice what it really does, but I did a few more tests and here's what I found:

This function initializes a vector to identity, but it does so according to the layout of the appropriate matrix. In the register display, I see this:

vidt_q Q_C000
vidt_q Q_C010

1 0 0 0
0 1 0 0
x x x x
x x x x


Now with Q_R000 and Q_R001, I get:

1 0 x x
0 1 x x
0 0 x x
0 0 x x

Think of this as setting a row or column of a matrix to identity. The same behavior applies to triple and pair as well.
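
One way to picture the behaviour described above is a plain C model. This is only a sketch that mirrors the two register dumps in this post: `vidt_model` and its parameters are hypothetical names, and the assumption that the lane receiving 1.0 equals the vector's position inside its matrix has not been checked beyond those dumps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of vidt: write row/column number `index`
 * of an n x n identity matrix into out[0..n-1].
 * Assumption: `index` is the position of the destination vector inside
 * its matrix (0 for Q_C000, 1 for Q_C010, ...). */
static void vidt_model(float *out, size_t n, size_t index)
{
    assert(n >= 2 && n <= 4);   /* pair, triple or quad */
    assert(index < n);
    for (size_t i = 0; i < n; i++)
        out[i] = (i == index) ? 1.0f : 0.0f;
}
```

Calling `vidt_model(v, 4, 1)` fills `v` with 0 1 0 0, which matches the second line of the first dump.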

And here's another batch of ops:

Code: Select all

/*
+-------------------------------------+----+--------------+---+--------------+
|31                                16 | 15 | 14         8 | 7 | 6          0 |
+-------------------------------------+----+--------------+---+--------------+
| opcode 0xd0010000 (s)               |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xd0010080 (p)               |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0xd0018000 (t)               |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xd0018080 (q)               |  1 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
+-------------------------------------+----+--------------+---+--------------+

  AbsoluteValue.Single/Pair/Triple/Quad

    vabs.s %vfpu_rd, %vfpu_rs    ; Absolute Value Single
    vabs.p %vfpu_rd, %vfpu_rs    ; Absolute Value Pair
    vabs.t %vfpu_rd, %vfpu_rs    ; Absolute Value Triple
    vabs.q %vfpu_rd, %vfpu_rs    ; Absolute Value Quad

        %vfpu_rd: VFPU Vector Destination Register (m[p|t|q]reg 0..127)
        %vfpu_rs: VFPU Vector Source Register (m[p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- abs(vfpu_regs[%vfpu_rs])
*/

#define vabs_s(vfpu_rd,vfpu_rs)  (0xd0010000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vabs_p(vfpu_rd,vfpu_rs)  (0xd0010080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vabs_t(vfpu_rd,vfpu_rs)  (0xd0018000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vabs_q(vfpu_rd,vfpu_rs)  (0xd0018080 | ((vfpu_rs) << 8) | (vfpu_rd))

/*
+-------------------------------------+----+--------------+---+--------------+
|31                                16 | 15 | 14         8 | 7 | 6          0 |
+-------------------------------------+----+--------------+---+--------------+
| opcode 0xd002 (s)                   |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xd002 (p)                   |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0xd002 (t)                   |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xd002 (q)                   |  1 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
+-------------------------------------+----+--------------+---+--------------+

  Negate.Single/Pair/Triple/Quad

    vneg.s %vfpu_rd, %vfpu_rs    ; Negate Single
    vneg.p %vfpu_rd, %vfpu_rs    ; Negate Pair
    vneg.t %vfpu_rd, %vfpu_rs    ; Negate Triple
    vneg.q %vfpu_rd, %vfpu_rs    ; Negate Quad

        %vfpu_rd: VFPU Vector Destination Register (m[p|t|q]reg 0..127)
        %vfpu_rs: VFPU Vector Source Register (m[p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- -vfpu_regs[%vfpu_rs]
*/

#define vneg_s(vfpu_rd,vfpu_rs)  (0xd0020000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vneg_p(vfpu_rd,vfpu_rs)  (0xd0020080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vneg_t(vfpu_rd,vfpu_rs)  (0xd0028000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vneg_q(vfpu_rd,vfpu_rs)  (0xd0028080 | ((vfpu_rs) << 8) | (vfpu_rd))


/*
+-------------------------------------+----+--------------+---+--------------+
|31                                16 | 15 | 14         8 | 7 | 6          0 |
+-------------------------------------+----+--------------+---+--------------+
| opcode 0xd04a (s)                   |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xd04a (p)                   |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0xd04a (t)                   |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xd04a (q)                   |  1 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
+-------------------------------------+----+--------------+---+--------------+

  Sign.Single/Pair/Triple/Quad

    vsgn.s %vfpu_rd, %vfpu_rs    ; Get Sign Single
    vsgn.p %vfpu_rd, %vfpu_rs    ; Get Sign Pair
    vsgn.t %vfpu_rd, %vfpu_rs    ; Get Sign Triple
    vsgn.q %vfpu_rd, %vfpu_rs    ; Get Sign Quad

        %vfpu_rd: VFPU Vector Destination Register (m[p|t|q]reg 0..127)
        %vfpu_rs: VFPU Vector Source Register (m[p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- sign(vfpu_regs[%vfpu_rs])

    this will set rd values to 1 or -1, depending on sign of input values
*/

#define vsgn_s(vfpu_rd,vfpu_rs)  (0xd04a0000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vsgn_p(vfpu_rd,vfpu_rs)  (0xd04a0080 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vsgn_t(vfpu_rd,vfpu_rs)  (0xd04a8000 | ((vfpu_rs) << 8) | (vfpu_rd))
#define vsgn_q(vfpu_rd,vfpu_rs)  (0xd04a8080 | ((vfpu_rs) << 8) | (vfpu_rd))

/*
+----------------------+--------------+----+--------------+---+--------------+
|31                 23 | 22        16 | 15 | 14         8 | 7 | 6         0  |
+----------------------+--------------+----+--------------+---+--------------+
| opcode 0x6d0 (s)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0x6d0 (p)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0x6d0 (t)     | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0x6d0 (q)     | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
+----------------------+--------------+----+--------------+---+--------------+

  VectorMin.Single/Pair/Triple/Quad

    vmin.s %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Minimum Value Single
    vmin.p %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Minimum Value Pair
    vmin.t %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Minimum Value Triple
    vmin.q %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Minimum Value Quad

        %vfpu_rt: VFPU Vector Source Register (sreg 0..127)
        %vfpu_rs: VFPU Vector Source Register ([p|t|q]reg 0..127)
        %vfpu_rd: VFPU Vector Destination Register ([s|p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- min(vfpu_regs[%vfpu_rs], vfpu_regs[%vfpu_rt])
*/

#define vmin_s(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D000000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vmin_p(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D000080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vmin_t(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D008000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vmin_q(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D008080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))


/*
+----------------------+--------------+----+--------------+---+--------------+
|31                 23 | 22        16 | 15 | 14         8 | 7 | 6         0  |
+----------------------+--------------+----+--------------+---+--------------+
| opcode 0x6d8 (s)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0x6d8 (p)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0x6d8 (t)     | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0x6d8 (q)     | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
+----------------------+--------------+----+--------------+---+--------------+

  VectorMax.Single/Pair/Triple/Quad

    vmax.s %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Maximum Value Single
    vmax.p %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Maximum Value Pair
    vmax.t %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Maximum Value Triple
    vmax.q %vfpu_rd, %vfpu_rs, %vfpu_rt ; Get Maximum Value Quad

        %vfpu_rt: VFPU Vector Source Register (sreg 0..127)
        %vfpu_rs: VFPU Vector Source Register ([p|t|q]reg 0..127)
        %vfpu_rd: VFPU Vector Destination Register ([s|p|t|q]reg 0..127)

    vfpu_regs[%vfpu_rd] <- max(vfpu_regs[%vfpu_rs], vfpu_regs[%vfpu_rt])
*/

#define vmax_s(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D800000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vmax_p(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D800080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vmax_t(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D808000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vmax_q(vfpu_rd,vfpu_rs,vfpu_rt)  (0x6D808080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
I'm working on v(h)tfm4/3/2; these are the vector*matrix ops. I almost have a working set of functions to set up x/y/z rotation, a projection matrix, and matrix concatenation.

I need very specific info on vmmul. According to mips_dis.c, the vmmul instruction receives special treatment and requires bit 13 (the RXC bit) to be inverted. I'm not sure how to do that, or what purpose it serves. Can someone enlighten us on this?
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

MrMr[iCE] wrote:
what is an identity vector? Is this (0, 0, 0, 1) or (1, 1, 1, 1)/vone?
My bad, I didn't test this function well enough to notice what it really does, but I did a few more tests and here's what I found:

This function initializes a vector to identity, but it does so according to the layout of the appropriate matrix. In the register display, I see this:

vidt_q Q_C000
vidt_q Q_C010

1 0 0 0
0 1 0 0
x x x x
x x x x


Now with Q_R000 and Q_R001, I get:

1 0 x x
0 1 x x
0 0 x x
0 0 x x

Think of this as setting a row or column of a matrix to identity. The same behavior applies to triple and pair as well.
mmmh... sounds logical, but seems a little hard to explain in the documentation... how could we elaborate this in a few short sentences?
MrMr[iCE] wrote: and heres another batch of ops:
I'll add these ones right now...
Seems as if we have now almost everything together to implement a VFPU-based psplibc/libmath. Maybe a good field test for our findings ;)

Nevertheless, this requires setting MALLOC_ALIGNMENT in newlib to 16. Are there any objections?
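
The alignment request presumably comes from quad loads and stores such as lv_q, which move 16 bytes at a time. The sketch below assumes that such accesses need a 16-byte boundary and shows a check that could guard a pointer before it is handed to one. `is_quad_aligned` and `checked_quad_ptr` are hypothetical helpers, and the real declaration of `vfpu_vars` is not shown anywhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* GCC attribute: place the scratch buffer on a 16-byte boundary so that a
 * quad load can read it. Assumption: quad loads need this alignment. */
static float vfpu_vars[4] __attribute__((aligned(16)));

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_quad_aligned(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}

/* Guard to call before handing a pointer to a quad load. */
static const float *checked_quad_ptr(const float *p)
{
    assert(is_quad_aligned(p));
    return p;
}
```

With MALLOC_ALIGNMENT set to 16, heap buffers would pass the same check without needing the attribute.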
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

And they just keep on coming:

Code: Select all

/*
+----------------------+--------------+----+--------------+---+--------------+
|31                 23 | 22        16 | 15 | 14         8 | 7 | 6         0  |
+----------------------+--------------+----+--------------+---+--------------+
| opcode 0xf08  (p)    | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0xf10  (t)    | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xf18  (q)    | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
+----------------------+--------------+----+--------------+---+--------------+

  VectorTransform.Pair/Triple/Quad

    vtfm2.p %vfpu_rd, %vfpu_rs, %vfpu_rt ; Transform pair vector by pair matrix
    vtfm3.t %vfpu_rd, %vfpu_rs, %vfpu_rt ; Transform triple vector by triple matrix
    vtfm4.q %vfpu_rd, %vfpu_rs, %vfpu_rt ; Transform quad vector by quad matrix

        %vfpu_rt: VFPU Vector Source Register (qreg 0..127)
        %vfpu_rs: VFPU Matrix Source Register (qmatrix 0..127)
        %vfpu_rd: VFPU Vector Destination Register (qreg 0..127)

    vfpu_regs[%vfpu_rd] <- transform(vfpu_matrix[%vfpu_rs], vfpu_vector[%vfpu_rt])
*/

#define vtfm2_p(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF0800080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vtfm3_t(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF1008000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vtfm4_q(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF1808080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))

/*
+----------------------+--------------+----+--------------+---+--------------+
|31                 23 | 22        16 | 15 | 14         8 | 7 | 6         0  |
+----------------------+--------------+----+--------------+---+--------------+
| opcode 0xf08 (p)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xf10 (t)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0xf18 (q)     | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
+----------------------+--------------+----+--------------+---+--------------+

  VectorHomogeneousTransform.Pair/Triple/Quad

    vhtfm2.p %vfpu_rd, %vfpu_rs, %vfpu_rt ; Homogeneous transform pair vector by pair matrix
    vhtfm3.t %vfpu_rd, %vfpu_rs, %vfpu_rt ; Homogeneous transform triple vector by triple matrix
    vhtfm4.q %vfpu_rd, %vfpu_rs, %vfpu_rt ; Homogeneous transform quad vector by quad matrix

        %vfpu_rt: VFPU Vector Source Register (qreg 0..127)
        %vfpu_rs: VFPU Matrix Source Register (qmatrix 0..127)
        %vfpu_rd: VFPU Vector Destination Register (qreg 0..127)

    vfpu_regs[%vfpu_rd] <- homogeneoustransform(vfpu_matrix[%vfpu_rs], vfpu_vector[%vfpu_rt])
*/

#define vhtfm2_p(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF0800000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vhtfm3_t(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF1000080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vhtfm4_q(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF1808000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
Also, my implementation of vmmul was wrong; it should be:

Code: Select all

#define vmmul_p(vfpu_rd, vfpu_rs, vfpu_rt) (0xf0000080 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))
#define vmmul_t(vfpu_rd, vfpu_rs, vfpu_rt) (0xf0008000 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))
#define vmmul_q(vfpu_rd, vfpu_rs, vfpu_rt) (0xf0008080 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))
This implements the inverted-bit special case I saw in mips_dis.c.
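
A quick way to convince yourself that XORing rs with 0x20 before the shift by 8 is the same as flipping bit 13 of the finished instruction word. `vmmul_q` is copied from the post; `vmmul_q_plain` and `vmmul_diff` are hypothetical helpers added only for the comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the post above: rs has bit 5 inverted before it is shifted
 * into bits 14..8, so bit 13 of the instruction word ends up flipped. */
#define vmmul_q(vfpu_rd, vfpu_rs, vfpu_rt) \
    (0xf0008080 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))

/* Hypothetical variant without the inversion, for comparison only. */
#define vmmul_q_plain(vfpu_rd, vfpu_rs, vfpu_rt) \
    (0xf0008080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))

/* Returns the bits in which the two encodings differ. */
static uint32_t vmmul_diff(uint32_t rd, uint32_t rs, uint32_t rt)
{
    assert(rd < 128 && rs < 128 && rt < 128);   /* 7-bit register fields */
    return (uint32_t)vmmul_q(rd, rs, rt) ^ (uint32_t)vmmul_q_plain(rd, rs, rt);
}
```

For any valid register numbers the difference is the single bit `1 << 13`, which is the bit the post calls RXC.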

and for shits n giggles:

Code: Select all

/*
	1   0    0   0
	0   cx   sx  0
	0   -sx  cx  0
	0   0    0   1
*/
void vfpu_rotateX(float degrees) {
	register void *ptr __asm ("a0") = vfpu_vars;
	vfpu_vars[0] = degrees / 90.0f;	/* angle in units of 90 degrees, as fed to vsin/vcos */
	__asm__ volatile (
		cgen_asm(lv_s(0, 0, R_a0, 0))		/* load the angle into register 0 */
		cgen_asm(vsin_s(125, 0))		/* register 125 = sx */
		cgen_asm(vcos_s(126, 0))		/* register 126 = cx */
		cgen_asm(vmidt_q(Q_M000))		/* start from identity */
		cgen_asm(vmov_s(S_S011, 126))
		cgen_asm(vmov_s(S_S012, 125))
		cgen_asm(vneg_s(S_S021, 125))
		cgen_asm(vmov_s(S_S022, 126))
	: "=r"(ptr) : "r"(ptr) : "memory");
}

/*
	cy  0    -sy 0
	0   1    0   0
	sy  0    cy  0
	0   0    0   1
*/

void vfpu_rotateY(float degrees) {
	register void *ptr __asm ("a0") = vfpu_vars;
	vfpu_vars[0] = degrees / 90.0f;	/* angle in units of 90 degrees, as fed to vsin/vcos */
	__asm__ volatile (
		cgen_asm(lv_s(4, 0*4, R_a0, 0))		/* load the angle into register 4 */
		cgen_asm(vsin_s(125, 4))		/* register 125 = sy */
		cgen_asm(vcos_s(126, 4))		/* register 126 = cy */
		cgen_asm(vmidt_q(Q_M100))		/* start from identity */
		cgen_asm(vmov_s(S_S100, 126))
		cgen_asm(vneg_s(S_S102, 125))
		cgen_asm(vmov_s(S_S120, 125))
		cgen_asm(vmov_s(S_S122, 126))
	: "=r"(ptr) : "r"(ptr) : "memory");
}

/*
	cz  sz   0   0
	-sz cz   0   0
	0   0    1   0
	0   0    0   1
*/

void vfpu_rotateZ(float degrees) {
	register void *ptr __asm ("a0") = vfpu_vars;
	vfpu_vars[0] = degrees / 90.0f;	/* angle in units of 90 degrees, as fed to vsin/vcos */
	__asm__ volatile (
		cgen_asm(lv_s(8, 0*4, R_a0, 0))		/* load the angle into register 8 */
		cgen_asm(vsin_s(125, 8))		/* register 125 = sz */
		cgen_asm(vcos_s(126, 8))		/* register 126 = cz */
		cgen_asm(vmidt_q(Q_M200))		/* start from identity */
		cgen_asm(vmov_s(S_S200, 126))
		cgen_asm(vmov_s(S_S201, 125))
		cgen_asm(vneg_s(S_S210, 125))
		cgen_asm(vmov_s(S_S211, 126))
	: "=r"(ptr) : "r"(ptr) : "memory");
}

/*void matrix_projection(float* matrix, float fovy, float aspect, float near, float far)
{
	matrix_identity(matrix);

	float angle = (fovy / 2.0f) * (M_PI/180.0f);
	float cotangent = cosf(angle) / sinf(angle);

	matrix[(0<<2)+0] = cotangent / aspect;
	matrix[(1<<2)+1] = cotangent;
	matrix[(2<<2)+2] = (far + near) / (near - far);
	matrix[(3<<2)+2] = 2.0f * (far * near) / (near - far);
	matrix[(2<<2)+3] = -1;
	matrix[(3<<2)+3] = 0.0f;
}*/

void vfpu_projection(float fov, float aspect, float near, float far) {

	vfpu_vars[0] = (fov / 2.0f) / 90.0f;	/* half angle in units of 90 degrees */
	vfpu_vars[1] = (far + near) / (near - far);
	vfpu_vars[2] = 2.0f * (far * near) / (near - far);
	vfpu_vars[3] = aspect;
	register void *ptr __asm ("a0") = vfpu_vars;
	__asm__ volatile (
		cgen_asm(vmidt_q(Q_M300))			/* start from identity */
		cgen_asm(lv_q(Q_R703, 0, R_a0, 0))		/* load all four vfpu_vars */
		cgen_asm(vsin_s(S_S702, S_S703))		/* sin(fov/2) */
		cgen_asm(vcos_s(S_S712, S_S703))		/* cos(fov/2) */
		cgen_asm(vdiv_s(S_S311, S_S712, S_S702))	/* cotangent = cos / sin */
		cgen_asm(vdiv_s(S_S300, S_S311, S_S733))	/* cotangent / aspect */
		cgen_asm(vmov_s(S_S322, S_S713))		/* (far + near) / (near - far) */
		cgen_asm(vmov_s(S_S323, S_S723))		/* 2 * far * near / (near - far) */
		cgen_asm(vone_s(S_S332))
		cgen_asm(vneg_s(S_S332, S_S332))		/* -1 */
		cgen_asm(vzero_s(S_S333))			/* 0 */
		: "=r"(ptr) : "r"(ptr) : "memory");
}


void vfpu_concatXYZ(void) {
	__asm__ volatile (
		cgen_asm(vmmul_q(Q_M400, Q_M000, Q_M100))
		cgen_asm(vmmul_q(Q_E500, Q_M400, Q_M200))
	);
}

void vfpu_transform(float x, float y, float z) {
	vfpu_vars[0] = x;
	vfpu_vars[1] = y;
	vfpu_vars[2] = z;
	vfpu_vars[3] = 1.0f;
	register void *ptr __asm ("a0") = vfpu_vars;
	__asm__ volatile (
		cgen_asm(lv_q(Q_R700, 0, R_a0, 0))		/* load (x, y, z, 1) */
		cgen_asm(vtfm4_q(Q_R701, Q_M500, Q_R700))	/* apply the concatenated rotation */
		cgen_asm(vtfm4_q(Q_R702, Q_M300, Q_R701))	/* apply the projection */
		: "=r"(ptr) : "r"(ptr) : "memory");
}
That's a little matrix library I've been implementing as I was figuring out the VFPU opcodes. The transform code simply performs the rotation/projection transform; I'm not storing the result anywhere. I'll polish this up later after I get some sleep =)
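
Since the VFPU routines above do not store their results yet, a plain C reference for the concatenation step can be handy for comparing values once they do. This is a minimal sketch, not part of the library: `ref_mat4_mul` is a hypothetical helper, it assumes row-major storage, and the operand order and row/column convention of vmmul are not established in this thread, so the arguments may need swapping or a transpose to match.

```c
#include <assert.h>

/* Hypothetical CPU reference for the vmmul_q concatenation:
 * out = a * b for row-major 4x4 matrices. out must not alias a or b. */
static void ref_mat4_mul(float out[16], const float a[16], const float b[16])
{
    assert(out != a && out != b);
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a[r * 4 + k] * b[k * 4 + c];
            out[r * 4 + c] = sum;
        }
    }
}
```

Multiplying by an identity matrix on either side should return the other operand unchanged, which makes a simple first sanity check.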
dot_blank
Posts: 498
Joined: Wed Sep 28, 2005 8:47 am
Location: Brasil

Post by dot_blank »

8) looks good to me
10011011 00101010 11010111 10001001 10111010
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

MrMr[iCE] wrote:And they just keep on coming:

Code: Select all

/*
+----------------------+--------------+----+--------------+---+--------------+
|31                 23 | 22        16 | 15 | 14         8 | 7 | 6         0  |
+----------------------+--------------+----+--------------+---+--------------+
| opcode 0xf08 (p)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
| opcode 0xf10 (t)     | vfpu_rt[6-0] |  0 | vfpu_rs[6-0] | 1 | vfpu_rd[6-0] |
| opcode 0xf18 (q)     | vfpu_rt[6-0] |  1 | vfpu_rs[6-0] | 0 | vfpu_rd[6-0] |
+----------------------+--------------+----+--------------+---+--------------+

  VectorHomogeneousTransform.Pair/Triple/Quad

    vhtfm2.p %vfpu_rd, %vfpu_rs, %vfpu_rt ; Homogeneous transform pair vector by pair matrix
    vhtfm3.t %vfpu_rd, %vfpu_rs, %vfpu_rt ; Homogeneous transform triple vector by triple matrix
    vhtfm4.q %vfpu_rd, %vfpu_rs, %vfpu_rt ; Homogeneous transform quad vector by quad matrix

        %vfpu_rt: VFPU Vector Source Register (qreg 0..127)
        %vfpu_rs: VFPU Matrix Source Register (qmatrix 0..127)
        %vfpu_rd: VFPU Vector Destination Register (qreg 0..127)

    vfpu_regs[%vfpu_rd] <- homogeneoustransform(vfpu_matrix[%vfpu_rs], vfpu_vector[%vfpu_rt])
*/

#define vhtfm2_p(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF0800000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vhtfm3_t(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF1000080 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
#define vhtfm4_q(vfpu_rd,vfpu_rs,vfpu_rt)  (0xF1808000 | ((vfpu_rt) << 16) | ((vfpu_rs) << 8) | (vfpu_rd))
mmmh... I fear we need to explain what a "homogeneous transform" is and exactly how it differs from the "normal" transform...
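
In the usual graphics sense, a homogeneous transform treats a point with one component fewer than the matrix dimension as if its last coordinate were 1, so the last matrix column acts as a translation. The sketch below contrasts the two for the 4x4 case. Whether the vhtfm opcodes behave exactly like this is an assumption that has not been verified here, and `ref_transform4` / `ref_htransform4` are hypothetical names using row-major matrices.

```c
#include <assert.h>

/* Plain transform: out = m * v, with m a row-major 4x4 matrix and
 * v a full 4-component vector. */
static void ref_transform4(float out[4], const float m[16], const float v[4])
{
    assert(out != v);
    for (int r = 0; r < 4; r++)
        out[r] = m[r * 4 + 0] * v[0] + m[r * 4 + 1] * v[1]
               + m[r * 4 + 2] * v[2] + m[r * 4 + 3] * v[3];
}

/* Homogeneous transform: v has only x, y, z; w is taken to be 1. */
static void ref_htransform4(float out[4], const float m[16], const float v[3])
{
    assert(out != v);
    for (int r = 0; r < 4; r++)
        out[r] = m[r * 4 + 0] * v[0] + m[r * 4 + 1] * v[1]
               + m[r * 4 + 2] * v[2] + m[r * 4 + 3];
}
```

With a matrix that translates x by 5, the homogeneous version moves the point (1, 2, 3) to (6, 2, 3), while the plain version applied to (1, 2, 3, 0) leaves x at 1.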

Also, my implementation of vmmul was wrong; it should be:

Code: Select all

#define vmmul_p(vfpu_rd, vfpu_rs, vfpu_rt) (0xf0000080 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))
#define vmmul_t(vfpu_rd, vfpu_rs, vfpu_rt) (0xf0008000 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))
#define vmmul_q(vfpu_rd, vfpu_rs, vfpu_rt) (0xf0008080 | ((vfpu_rt) << 16) | (((vfpu_rs) ^ 0x20) << 8) | (vfpu_rd))
This implements the inverted-bit special case I saw in mips_dis.c.
fixed+added, thanks!
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

This could be done a lot faster if you were on IRC, man. There are some very smart cookies on #pspdev who have been a tremendous help with some of these ops, especially with the difference between vdot and vhdp, etc. Trust me when I say that having someone to talk to in real time is WAY better than waiting for the next forum post =)
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

Then it would be cool to write this down for the public in the spec; knowledge that's not distributed is lost knowledge ;)
Well, I've installed an IRC client now, but since I'm connecting over call-by-call dial-up, I'm not online that much...
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

Well, don't be a stranger; join us on #pspdev on irc.freenode.org

I almost have a full implementation of pspgum that works with the Gu commands. I have a vector demo working with vfpu handling the matrix math. But you gotta get on irc to see it =)
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

I did some profiling; here are some results to give you an idea of how well the VFPU performs:

test case:
set up gu view matrix to identity
set up gu projection matrix
set up gu model matrix with x rotation, z rotation, and translate


with pspgum * 1000 runs, CPU at 222 MHz
35155 us (microseconds)

with pspvgum (my VFPU version of pspgum) * 1000 runs, CPU at 222 MHz
3079 us

That's over a 10x performance increase over pspgum (35155 / 3079 is about 11.4). Any questions? =)
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

:) yes: when does this get moved to SVN? - ;)

Nicely done!
mrbrown
Site Admin
Posts: 1537
Joined: Sat Jan 17, 2004 11:24 am

Post by mrbrown »

holger wrote::) yes: when does this get moved to SVN? - ;)
I would hold off before forcing the VFPU on everyone. It may be more work to make it optional, but it will pay off.
mrbrown
Site Admin
Posts: 1537
Joined: Sat Jan 17, 2004 11:24 am

Post by mrbrown »

MrMr[iCE] wrote:over 10x performance increase over pspgum. Any questions? =)
Have any real benchmarks? </devilsadvocate>
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

Well, I'm still waiting for SVN access from oobles, but here's something you can play with in the meantime.

http://bradburn.net/mr.mr/files/libpspvgum.zip

source included, also contains a small demo to try out the vfpu routines.
Have any real benchmarks?
Sorry man, I only did a simple gettimeofday difference. I don't know how to do a real benchmark. Perhaps someone has code that does this?
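
For reference, a gettimeofday difference of the kind described here might look like the sketch below. The actual measurement code was not posted, so `elapsed_us`, `time_runs` and `dummy_workload` are hypothetical names, and the workload is only a stand-in for the routine being timed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples. */
static long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L
         + (long)(end->tv_usec - start->tv_usec);
}

/* Stand-in workload; replace with the matrix setup being measured. */
static void dummy_workload(void)
{
    volatile int sink = 0;
    for (int i = 0; i < 100; i++)
        sink += i;
}

/* Runs fn() `runs` times and returns the total wall-clock time in
 * microseconds. */
static long time_runs(void (*fn)(void), int runs)
{
    struct timeval start, end;

    assert(fn != NULL && runs > 0);
    gettimeofday(&start, NULL);
    for (int i = 0; i < runs; i++)
        fn();
    gettimeofday(&end, NULL);
    return elapsed_us(&start, &end);
}
```

`time_runs(dummy_workload, 1000)` returns the total for 1000 runs. This measures wall-clock time only, so it says nothing about cache misses or stalls.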
mrbrown
Site Admin
Posts: 1537
Joined: Sat Jan 17, 2004 11:24 am

Post by mrbrown »

I meant something along the lines of profiling a real app with and without VFPU support, but is there anything out there actively using libpspgum or any homebrew that could immediately benefit from the VFPU?
MrMr[iCE]
Posts: 43
Joined: Mon Oct 03, 2005 4:55 pm

Post by MrMr[iCE] »

I guess we'll need someone to adapt the VFPU stuff into an existing app... I don't have anything to do a benchmark with besides my little 3D tests.
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

mrbrown wrote:
holger wrote::) yes: when does this get moved to SVN? - ;)
I would hold off before forcing the VFPU on everyone. It may be more work to make it optional, but it will pay off.
mmh... this would prevent e.g. an inline libm, where most functions can be implemented in a single asm instruction (plus load/store)...
How would you make this optional? Do you want to provide two versions of every library to link against?
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

mrbrown wrote:I meant something along the lines of profiling a real app with and without VFPU support, but is there anything out there actively using libpspgum or any homebrew that could immediately benefit from the VFPU?
a VFPU-based libm would be of benefit for all.
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

MrMr[iCE] wrote:
Have any real benchmarks?
Sorry man, I only did a simple gettimeofday difference. I don't know how to do a real benchmark. Perhaps someone has code that does this?
See e.g. sdk/debug/profiler.c and ./sdk/samples/debug/profiler/main.c; the profiler will also show how far fine-tuned asm reduces cache misses and CPU stalls.

I don't know, though, whether the PSP has a cycle-exact counter register for exact timers. Anybody else?
TyRaNiD
Posts: 907
Joined: Sun Jan 18, 2004 12:23 am

Post by TyRaNiD »

The CPU's cop0 has a cycle counter which you can access using the mfc0 $v0, $9 instruction. You must be in kernel mode to use it, though.
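
As a rough sketch, the counter read could be wrapped like this. `read_cycle_counter` and `elapsed_cycles` are hypothetical names; the inline asm is only compiled on MIPS targets, other targets get a stub that returns 0, and the kernel-mode requirement mentioned above still applies.

```c
#include <assert.h>
#include <stdint.h>

/* Reads cop0 register $9 on MIPS, as described above.
 * Needs kernel mode on the PSP. On other targets this sketch returns 0
 * so that it still compiles. */
static uint32_t read_cycle_counter(void)
{
#if defined(__mips__)
    uint32_t count;
    __asm__ volatile ("mfc0 %0, $9" : "=r"(count));
    return count;
#else
    return 0;
#endif
}

/* Counter ticks between two samples; unsigned subtraction stays correct
 * across a single 32-bit wraparound of the counter. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Sampling before and after the code under test and passing both values to `elapsed_cycles` gives the tick count, as long as the counter wraps at most once in between.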
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

Our experimental VFPU test code runs its setup in kernel mode anyway, since we are installing an exception handler. Thanks for the hint!

Is there a list of the mfc0 registers somewhere, or are you referring to the MIPS manuals?
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am

Post by jsgf »

holger wrote:a VFPU-based libm would be of benefit for all.
Maybe. Any performance gain might be eaten by all the shuffling things around between FP and VFPU registers for simple scalar stuff. A vector version of libm would be a better match.

I would like to see a very simple, thin libvfpu which provides two things:
  1. a set of macros to make inline assembler access to the VFPU easy (like gcc/icc's xmmintrin.h for SSE)
  2. a simple lightweight context switching mechanism to allow multiple libraries to share the VFPU without stomping on each other
I envisage 2 as having calls something like:

Code: Select all

VFPUcontext *vfpuNewContext();
void vfpuSwitchContext(VFPUcontext *ctxt, MatrixSet used);
void vfpuFreeContext(VFPUcontext *);
where MatrixSet is a simple bitmask of which sets of matrix registers you want to use. If you're the only user of the VFPU, or you're using a disjoint set of registers from the other users, then vfpuSwitchContext would be a fairly cheap no-op; otherwise it would shuffle things around for you.

Obviously this could get expensive if you thrash contexts, but if you can be careful to work in relatively large batches, then you'll still get good performance. Certainly much better performance than letting unrelated VFPU users stomp on each other.
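
To make the "cheap no-op when the register sets are disjoint" idea concrete, here is a small bookkeeping model of the three calls. It is a sketch under stated assumptions, not a proposed implementation: it assumes 8 matrices, moves no real register data, and the `owner` table, the `saved` field and `spill_count` are hypothetical details that are not part of the proposal above.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned MatrixSet;              /* bit i set = matrix i is used */

typedef struct VFPUcontext {
    MatrixSet saved;                     /* matrices currently spilled to memory */
    /* a real version would also hold the backing store for the registers */
} VFPUcontext;

static VFPUcontext *owner[8];            /* context whose data is live in each matrix */
static unsigned     spill_count;         /* matrices that had to be moved so far */

VFPUcontext *vfpuNewContext(void)
{
    return calloc(1, sizeof(VFPUcontext));
}

void vfpuSwitchContext(VFPUcontext *ctxt, MatrixSet used)
{
    assert(ctxt != NULL);
    for (int m = 0; m < 8; m++) {
        if (!(used & (1u << m)) || owner[m] == ctxt)
            continue;                    /* not needed or already ours: no work */
        if (owner[m] != NULL) {
            owner[m]->saved |= 1u << m;  /* real code would store matrix m here */
            spill_count++;
        }
        ctxt->saved &= ~(1u << m);       /* real code would reload matrix m here */
        owner[m] = ctxt;
    }
}

void vfpuFreeContext(VFPUcontext *ctxt)
{
    for (int m = 0; m < 8; m++)
        if (owner[m] == ctxt)
            owner[m] = NULL;
    free(ctxt);
}
```

Two contexts that ask for disjoint sets (say 0x03 and 0x0C) can switch back and forth without `spill_count` ever increasing; it only goes up once their sets overlap.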
mrbrown
Site Admin
Posts: 1537
Joined: Sat Jan 17, 2004 11:24 am

Post by mrbrown »

jsgf wrote:Maybe. Any performance gain might be eaten by all the shuffling things around between FP and VFPU registers for simple scalar stuff. A vector version of libm would be a better match.
That was my point, that benchmarking a few matrix ops doesn't show whether or not everything should be replaced with VFPU code. I'm with jsgf in that there should be a set of VFPU compiler intrinsics, as well as a specialized VFPU library (it can overlap with libm if you want, but the decision to use the VFPU functions should be left up to the user).

Besides that, creating a VFPU-based thread incurs considerable overhead during a context switch. If you force the VFPU everywhere, then all threads would be required to maintain a VFPU context.
ector
Posts: 195
Joined: Thu May 12, 2005 10:22 pm

Post by ector »

mrbrown wrote: Besides that, creating a VFPU-based thread incurs considerable overhead during a context switch. If you force the VFPU everywhere, then all threads would be required to maintain a VFPU context.
Almost correct, unless I'm misunderstanding you and you're 100% right but not clear enough :)

Most platforms with heavy additional register sets (such as Gekko in the Nintendo Gamecube) perform "lazy" context switching of "extra" (such as VFPU) registers, usually managed through a per-thread enable flag (as on PSP?) or by disabling the additional register sets and enabling and context switching in the illegal instruction exception handler (such as on Gekko). (Emulating this in an efficient manner is a PAIN!!! ;)

The consequence is that having one single thread with VFPU (or whatever extra register set your processor has) is ABSOLUTELY free in terms of context switching, because the vfpu regs will just be left there, but as soon as you add another vfpu thread, you will risk incurring the heavy costs.
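A rough sketch of that lazy scheme, assuming a per-thread flag and a single "current owner" pointer. The names `Thread`, `schedule_in`, `vfpu_file` and `vfpu_swaps` are hypothetical, and the register file is simulated; this is not how the PSP kernel is known to do it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VFPU_REGS 128

typedef struct Thread {
    int   uses_vfpu;            /* per-thread "VFPU enabled" attribute */
    float vfpu_save[VFPU_REGS]; /* only touched when ownership changes */
} Thread;

static float    vfpu_file[VFPU_REGS]; /* stand-in for the hardware registers */
static Thread  *vfpu_owner;           /* thread whose values sit in vfpu_file */
static unsigned vfpu_swaps;           /* count of expensive save/restore pairs */

/* Called by a (hypothetical) scheduler whenever `next` is about to run. */
void schedule_in(Thread *next)
{
    assert(next != NULL);
    if (!next->uses_vfpu || vfpu_owner == next)
        return;                 /* registers are simply left where they are */
    if (vfpu_owner != NULL) {   /* a second VFPU thread: pay for save + restore */
        memcpy(vfpu_owner->vfpu_save, vfpu_file, sizeof vfpu_file);
        memcpy(vfpu_file, next->vfpu_save, sizeof vfpu_file);
        vfpu_swaps++;
    }
    vfpu_owner = next;
}
```

Switching between one VFPU thread and any number of non-VFPU threads never reaches the `memcpy` calls; the counter only moves once a second thread with `uses_vfpu` set is scheduled.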

Thus, I agree with mrbrown that vfpu should not be forced on everyone, while attempting to clear up some things that were not completely clear in mrbrown's post :)
http://www.dtek.chalmers.se/~tronic/PSPTexTool.zip Free texture converter for PSP with source. More to come.
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

"heavy cost" is relative: saving and restoring the VFPU matrix registers involves 32 read/write cycles (which can be tuned to use write-through via the cache policy bits of the VFPU insns). That is not much compared to a single cache miss, which is very likely to happen on a context switch anyway.
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

jsgf wrote:I would like to see a very simple, thin libvfpu which provides two things:
  1. a set of macros to make inline assembler access to the VFPU easy (like gcc/icc's xmmintrin.h for SSE)
yes, this would be nice, but it requires some work in the toolchain, so that gcc knows how to schedule the VFPU registers.
jsgf wrote:  2. a simple lightweight context switching mechanism to allow multiple libraries to share the VFPU without stomping on each other
that becomes obsolete with the above...
mrbrown
Site Admin
Posts: 1537
Joined: Sat Jan 17, 2004 11:24 am

Post by mrbrown »

holger wrote:"heavy cost" is relative: saving and restoring the VFPU matrix registers involves 32 read/write cycles (which can be tuned to use write-through via the cache policy bits of the VFPU insns). That is not much compared to a single cache miss, which is very likely to happen on a context switch anyway.
There are more than 32 VFPU registers. You've missed the control registers.

The "heavy cost" is relative to what exactly? Do you know how painful a normal thread context switch is on the PSP, without saving and restoring the VFPU context?
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

mrbrown wrote:There are more than 32 VFPU registers. You've missed the control registers.
we still don't know much about them... are they accessible and modifiable from userspace?
mrbrown wrote:The "heavy cost" is relative to what exactly? Do you know how painful a normal thread context switch is on the PSP, without saving and restoring the VFPU context?
well, the usual rules for multithreaded OSes apply. You're accessing at least the interrupt vector code area, the old context's register save area, the new context's register save area, the code segment of the new thread, and the data area of the new thread. That's at least 5 opportunities for a cache miss; more are not unlikely, depending on what the new thread is doing...
mrbrown
Site Admin
Posts: 1537
Joined: Sat Jan 17, 2004 11:24 am

Post by mrbrown »

holger wrote:we still don't know much about them... are they accessible and modifiable from userspace?
Haven't you disasm'd a game that uses the VFPU (such as Wipeout)? The control registers are accessible with the current VFPU assembler (they don't need the wacky register syntax).
holger
Posts: 204
Joined: Thu Aug 18, 2005 10:57 am

Post by holger »

don't know whether it's running in user- or kernelspace, but that's easy to check.

btw, what's so wacky about the register syntax? The opcode bitfields look quite consistent; only the prefix codes are somewhat unusual, but they can be added later. In a first-shot implementation one could write them as separate instructions (or let a preprocessor generate matching vpfx instructions).