[WIP] A very special (and weird) template library for VFPU

hlide · Post by **hlide** » Mon Apr 07, 2008 4:58 am

I'm working on a special VFPU lib for C++ that relies heavily on template classes.

The main idea is to let g++ allocate the VFPU registers and do the computation without having to struggle with assembly code. The drawback is the way you have to write a VFPU algorithm, which will look very weird to most people.

This code:

Code:

#include "vfpu.h"

extern "C" void vector2d_sample_linear(float *vector_result, float *vector_source, float *vector_target, float &alpha)
{
	// load scalar alpha in a new register
	typedef vfpu_scalar_load< vfpu_scalar_new_reg < > >
		load_alpha, step1;

	// load 2d vector source in a new register
	typedef vfpu_vector_2d_load< vfpu_vector_new_reg < step1 > >
		load_vector_source, step2;

	// load 2d vector target in a new register
	typedef vfpu_vector_2d_load< vfpu_vector_new_reg < step2 > >
		load_vector_target, step3;

	// saturated one's complement : { 1-alpha, alpha } = { (1.0 - alpha)[0..1], alpha[0..1] }
	typedef vfpu_vector_2d_socp_result< load_alpha, step3 >
		compute_alpha_and_one_minus_alpha, step4;

	typedef vfpu_vector_2d_scl_result< load_vector_source, vfpu_vector_2d_component< compute_alpha_and_one_minus_alpha, 0 >, step4 >
		scale_source_vector_with_one_minus_alpha, step5;

	typedef vfpu_vector_2d_scl_result< load_vector_target, vfpu_vector_2d_component< compute_alpha_and_one_minus_alpha, 1 >, step5 >
		scale_target_vector_with_alpha, step6;

	// Vs * (1.0 - alpha)[0..1] + Vd * alpha[0..1]
	typedef vfpu_vector_2d_add_result< scale_source_vector_with_one_minus_alpha, scale_target_vector_with_alpha, step6 >
		add_vectors_source_and_target, step7;

	// and store to vector result
	typedef vfpu_vector_2d_store< add_vectors_source_and_target, step7 >
		store_vector_result;

	// execute
	load_alpha         _1(alpha);
	load_vector_source _2(*vector_source);
	load_vector_target _3(*vector_target);

	compute_alpha_and_one_minus_alpha();
	scale_source_vector_with_one_minus_alpha();
	scale_target_vector_with_alpha();
	add_vectors_source_and_target();

	store_vector_result _(*vector_result);
}

gives me:

Code:

00000018 <vector2d_sample_linear>:

// load scalar alpha in a new register
  18:   c8e00000        lv.s    S000.s,0(a3)

// load 2d vector source in a new register
  1c:   c8a10000        lv.s    S010.s,0(a1)
  20:   c8a10005        lv.s    S011.s,4(a1)

// load 2d vector target in a new register
  24:   c8c20000        lv.s    S020.s,0(a2)
  28:   c8c20005        lv.s    S021.s,4(a2)

// saturated one's complement : { 1-alpha, alpha } = { (1.0 - alpha)[0..1], alpha[0..1] }
  2c:   d0450103        vsocp.s C030.p,S000.s

// scale vector source with 1-alpha
  30:   65030184        vscl.p  C100.p,C010.p,S030.s

// scale vector target with alpha
  34:   65230285        vscl.p  C110.p,C020.p,S031.s

// add vectors source and target to vector result
  38:   60050486        vadd.p  C120.p,C100.p,C110.p

// store vector result
  3c:   e8860000        sv.s    S120.s,0(a0)
  40:   e8860005        sv.s    S121.s,4(a0)

// exit function
  44:   03e00008        jr      ra
  48:   00000000        nop

which is pretty good compared to what I was expecting.

I don't know if anybody would be interested in this library.

Sure, there is still a lot of work to do.

NOTE:
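For comparison, here is a plain C++ sketch of what the VFPU sequence is meant to compute, going only by the comments in the listing: both weights are clamped to [0..1] as vsocp does, then the two scaled vectors are added. The names `saturate` and `vector2d_sample_linear_ref` are made up for this illustration and are not part of the library.

```cpp
#include <algorithm>

// Clamp a value to the [0, 1] range (hypothetical helper).
inline float saturate(float x)
{
	return std::min(1.0f, std::max(0.0f, x));
}

// Scalar reference for the 2D linear sample (hypothetical name):
// result = source * sat(1 - alpha) + target * sat(alpha)
inline void vector2d_sample_linear_ref(float *vector_result,
                                       const float *vector_source,
                                       const float *vector_target,
                                       float alpha)
{
	const float one_minus_alpha = saturate(1.0f - alpha);
	const float clamped_alpha   = saturate(alpha);

	vector_result[0] = vector_source[0] * one_minus_alpha + vector_target[0] * clamped_alpha;
	vector_result[1] = vector_source[1] * one_minus_alpha + vector_target[1] * clamped_alpha;
}
```

Something like this is handy for checking the VFPU output on a few test vectors, including alpha values outside [0..1].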

Every vfpu_vector_2d_XXX template class has an optional template parameter named "clobbered" (the ones named "stepN" in my example), which lets g++ remember which VFPU registers have already been "allocated". This is the main reason why we need so many typedefs and this weird syntax.
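To give an idea of how such a "clobbered" parameter can carry the allocation state from one typedef to the next, here is a minimal host-side sketch. It is not the library's actual code: `nothing_allocated`, `lowest_free_bit` and `new_reg` are names invented for this illustration, and the state is assumed to be a simple bitmask of used registers.

```cpp
#include <cstdint>

// Hypothetical starting state: no register is in use yet.
struct nothing_allocated
{
	static constexpr std::uint32_t mask = 0;
};

// Index of the lowest clear bit in the mask (compile-time recursion).
// A full mask makes this fail to compile instead of wrapping around.
constexpr int lowest_free_bit(std::uint32_t mask, int bit = 0)
{
	return (mask & (1u << bit)) == 0 ? bit : lowest_free_bit(mask, bit + 1);
}

// Hypothetical allocator step: takes the previous step as "Clobbered",
// picks the first free register and publishes the updated mask.
template< typename Clobbered = nothing_allocated >
struct new_reg
{
	static constexpr int           index = lowest_free_bit(Clobbered::mask);
	static constexpr std::uint32_t mask  = Clobbered::mask | (1u << index);
};

// Each typedef names both the register and the state after allocating it.
typedef new_reg< >       step1;
typedef new_reg< step1 > step2;
typedef new_reg< step2 > step3;

static_assert(step1::index == 0, "first allocation takes register 0");
static_assert(step2::index == 1, "second allocation skips register 0");
static_assert(step3::mask == 0x7, "three registers are marked as used");
```

Since everything is resolved at compile time, the compiler ends up with constant register numbers, which is presumably what allows them to be pasted straight into the emitted instructions.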

hlide · Post by **hlide** » Thu Apr 10, 2008 8:21 am

In a loop, it is not bad:

Code:

extern "C" void vector_2d_sample_linear(float *vector_result, float *vector_source, float *vector_target, float &alpha, int n)
{
	typedef vfpu_vector_new_register<                   > vector_source_reg;
	typedef vfpu_vector_new_register< vector_source_reg > vector_target_reg;
	typedef vfpu_scalar_new_register< vector_target_reg > alpha_reg;

	// saturated one's complement : { 1-alpha, alpha } = { (1.0 - K)[0..1], K[0..1] }
	typedef vfpu_vector_2d_socp_result< alpha_reg >
		alpha_and_one_minus_alpha;

	typedef vfpu_vector_2d_scl_result< vector_source_reg, vfpu_vector_2d_component< alpha_and_one_minus_alpha, 0 > >
		scaled_vector_source_with_one_minus_alpha;

	typedef vfpu_vector_2d_scl_result< vector_target_reg, vfpu_vector_2d_component< alpha_and_one_minus_alpha, 1 >, scaled_vector_source_with_one_minus_alpha >
		scaled_vector_target_with_alpha;

	// Vs * (1.0 - K)[0..1] + Vd * K[0..1]
	typedef vfpu_vector_2d_add_result< scaled_vector_source_with_one_minus_alpha, scaled_vector_target_with_alpha >
		result;

	// load alpha
	vfpu_scalar_load(alpha_reg(), alpha);

	// saturate alpha between 0 and 1 then compute one minus alpha
	alpha_and_one_minus_alpha();

	// for n vectors
	for (int i = 0; i < n; vector_result+=2)
	{
		// load source and target vectors
		vfpu_vector_2d_load(vector_source_reg(), vector_source[0], vector_source[1]);
		vfpu_vector_2d_load(vector_target_reg(), vector_target[0], vector_target[1]);

		vector_source+=2;

		// compute new source and target vectors
		scaled_vector_source_with_one_minus_alpha();
		scaled_vector_target_with_alpha();

		vector_target+=2;

		++i;

		// and add them to get the result vector
		vfpu_vector_2d_store(result(), vector_result[0], vector_result[1]);
	}
}

gives me:

Code:

00000030 <vector_2d_sample_linear>:
  30:   c8e20000        lv.s    S020.s,0(a3)
  34:   d0450203        vsocp.s C030.p,S020.s
  38:   1900000f        blez    t0,78 <vector_2d_sample_linear+0x48>
  3c:   00001021        move    v0,zero
  40:   c8a00000        lv.s    S000.s,0(a1)
  44:   c8a00005        lv.s    S001.s,4(a1)
  48:   c8c10000        lv.s    S010.s,0(a2)
  4c:   c8c10005        lv.s    S011.s,4(a2)
  50:   24a50008        addiu   a1,a1,8
  54:   65030084        vscl.p  C100.p,C000.p,S030.s
  58:   65230185        vscl.p  C110.p,C010.p,S031.s
  5c:   24c60008        addiu   a2,a2,8
  60:   24420001        addiu   v0,v0,1
  64:   60050486        vadd.p  C120.p,C100.p,C110.p
  68:   e8860000        sv.s    S120.s,0(a0)
  6c:   e8860005        sv.s    S121.s,4(a0)
  70:   1502fff3        bne     t0,v0,40 <vector_2d_sample_linear+0x10>
  74:   24840008        addiu   a0,a0,8
  78:   03e00008        jr      ra
  7c:   00000000        nop

Some integer increments get inserted between the VFPU instructions, which is very good for hiding the latency between dependent VFPU instructions.

gauri · Post by **gauri** » Thu Apr 10, 2008 3:42 pm

So implementing intrinsics didn't really prove feasible?

Can you share the whole library? I'm really interested in that.

hlide · Post by **hlide** » Fri Apr 11, 2008 5:56 am

gauri wrote: So implementing intrinsics didn't really prove feasible?

Can you share the whole library? I'm really interested in that.

Adding intrinsics is a HUGE amount of work.

I plan to share this library as soon as possible. But it is too early to release it, as I want to experiment with it and find the most efficient way to implement it. When I started it last weekend, I was just playing around and I didn't really expect such a result without having to tweak psp-gcc for VFPU.

By the way, it seems psp-as is buggy for VFPU:

"vmov.p C010, C002" works fine but "vmov.p $1, $64" does not, although they are strictly the same thing. I get an error for any VFPU register $64 or above. It is such a pity, because I really need the '$' register notation for my library. :/

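For reference, this is how the '$' numbers seem to line up with the register names, going by the encodings in the listings above and the C010/C002 pair in this post. It is only a sketch of my reading of those encodings: `vfpu_reg_number` is a made-up helper, not something psp-as or the library provides, and I have only checked it against single registers and column pairs.

```cpp
// Hypothetical helper: maps the three digits of a name such as S121 or C002
// (matrix, column, row) to the '$' register number.
// Assumed layout, inferred from the disassembly in this thread:
//   bits 0-1 = column, bits 2-4 = matrix, bits 5-6 = row
constexpr int vfpu_reg_number(int matrix, int column, int row)
{
	return (row << 5) | (matrix << 2) | column;
}

static_assert(vfpu_reg_number(0, 1, 0) == 1,  "C010 -> $1");
static_assert(vfpu_reg_number(0, 0, 2) == 64, "C002 -> $64");
static_assert(vfpu_reg_number(0, 1, 1) == 33, "S011 -> $33");
static_assert(vfpu_reg_number(1, 2, 1) == 38, "S121 -> $38");
```

With this layout, every name whose row digit is 2 or 3 lands at $64 or above, which is exactly the range where the assembler error shows up.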
hlide · Post by **hlide** » Mon Apr 21, 2008 11:09 am

gauri wrote: So implementing intrinsics didn't really prove feasible?

Can you share the whole library? I'm really interested in that.

OK, it is still not complete. I was mainly trying to make a 2D lib using VFPU, but there is no reason not to extend it to a 3D or quaternion lib. I'm not totally satisfied with it, but I see it as a good start.

svn://svn2.assembla.com/svn/pspvftl/trunk/pspvftl/