//The first method we will attempt is: (from gtest.Optimiser)

/*
Method int stupidMethod(int, int)
   0 iload_1
   1 iload_2
   2 iadd
   3 ireturn   
*/



0000006c <do_asm_stupidMethod>:
  6c:   55                      pushl  %ebp
  6d:   89 e5                   movl   %esp,%ebp
  6f:   83 ec 04                subl   $0x4,%esp
  72:   8b 45 08                movl   0x8(%ebp),%eax
  75:   8b 10                   movl   (%eax),%edx
  77:   89 55 fc                movl   %edx,0xfffffffc(%ebp)
  7a:   c9                      leave
  7b:   c3                      ret

This is:

    int a = pi32Vars[0];
    a++;
    pi32Vars[0] = a;

0000006c <do_asm_stupidMethod>:
  6c:   55                      pushl  %ebp
  6d:   89 e5                   movl   %esp,%ebp
  6f:   83 ec 04                subl   $0x4,%esp
  72:   8b 45 08                movl   0x8(%ebp),%eax
  75:   8b 10                   movl   (%eax),%edx
  77:   89 55 fc                movl   %edx,0xfffffffc(%ebp)
  7a:   ff 45 fc                incl   0xfffffffc(%ebp)
  7d:   8b 45 08                movl   0x8(%ebp),%eax
  80:   8b 55 fc                movl   0xfffffffc(%ebp),%edx
  83:   89 10                   movl   %edx,(%eax)
  85:   c9                      leave
  86:   c3                      ret



So I guess a = pi32Vars[0] is the same as

72:   8b 45 08                movl   0x8(%ebp),%eax
75:   8b 10                   movl   (%eax),%edx
77:   89 55 fc                movl   %edx,0xfffffffc(%ebp)
        
which loads the word at basepointer + 8 (the first argument, pi32Vars) into eax
Then let edx = *eax
Then let ebp[-1] = edx (0xfffffffc is -4, so this is the local a at -4(%ebp))

So prologue: First push the old base pointer onto the stack
	     Copy the stack pointer into the base pointer
	     Subtract 4 from the stack pointer (space for one local var)
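
For reference, a minimal C sketch of the function that the second listing appears to come from. Only the three statements and the name do_asm_stupidMethod are taken from these notes; the `void (int *)` signature is an assumption, based on the single pointer argument being read from 0x8(%ebp). The comments map each statement to the instructions discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signature: one pointer argument, read from 0x8(%ebp).
   The local 'a' is the 4 bytes reserved by subl $0x4,%esp. */
void do_asm_stupidMethod(int *pi32Vars)
{
    assert(pi32Vars != NULL);
    int a = pi32Vars[0];  /* movl 0x8(%ebp),%eax; movl (%eax),%edx; movl %edx,-4(%ebp) */
    a++;                  /* incl -4(%ebp) */
    pi32Vars[0] = a;      /* movl 0x8(%ebp),%eax; movl -4(%ebp),%edx; movl %edx,(%eax) */
}
```

Compiling this without optimisation and running objdump -d on the object file should give something close to the listing, though the exact output depends on the compiler version.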

Now we see that pi32Vars = NULL is:

  6f:   c7 45 08 00 00 00 00    movl   $0x0,0x8(%ebp) 

Note the sp isn't moved down by 4 since there are no local vars. Let's see if it moves by 8 when we have two local vars. Sure enough:

6f:   83 ec 08                subl   $0x8,%esp 

Now we do:

    int c = piOptop[1] + piOptop[2]; //Remember to skip the aref!
    *(piOptop) = c;

//This is code generated from the two C lines
//Prologue
 6c:   55                      pushl  %ebp
 6d:   89 e5                   movl   %esp,%ebp
//Code     
 6f:   83 ec 04                subl   $0x4,%esp		 //Create space for c
 72:   8b 45 08                movl   0x8(%ebp),%eax	 //Copy args  to eax
 75:   83 c0 04                addl   $0x4,%eax		 //Move eax to optop + 1
 78:   8b 55 08                movl   0x8(%ebp),%edx	 //Copy args to edx
 7b:   83 c2 08                addl   $0x8,%edx		 //Move edx to optop + 2
 7e:   8b 00                   movl   (%eax),%eax	 //Let eax = *eax
 80:   8b 12                   movl   (%edx),%edx	 //Let edx = *edx
 82:   8d 0c 02                leal   (%edx,%eax,1),%ecx //Sum edx and eax and put result in ecx
 85:   89 4d fc                movl   %ecx,0xfffffffc(%ebp) //Copy ecx to c at -4(%ebp), the stack slot created above
 88:   8b 45 08                movl   0x8(%ebp),%eax	    //Copy args (optop) to eax
 8b:   8b 55 fc                movl   0xfffffffc(%ebp),%edx //Copy c back into edx
 8e:   89 10                   movl   %edx,(%eax)	    //Let *optop = edx
//Epilogue
 90:   c9                      leave
 91:   c3                      ret    




Optimised version (-O2), 8 instructions instead of the 16 above:

00000044 <do_asm_stupidMethod>:
  44:   55                      pushl  %ebp
  45:   89 e5                   movl   %esp,%ebp
  47:   8b 45 08                movl   0x8(%ebp),%eax    
  4a:   8b 50 08                movl   0x8(%eax),%edx
  4d:   03 50 04                addl   0x4(%eax),%edx 
  50:   89 10                   movl   %edx,(%eax)
  52:   c9                      leave
  53:   c3                      ret 
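
A self-contained version of the two C lines, to make the -O2 mapping explicit. The function name comes from the listing; the `void (int *)` signature and the slot layout (slot 0 is the aref and receives the result, slots 1 and 2 are the two ints) are assumptions read off the offsets 0x4 and 0x8 in the disassembly.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: piOptop[0] = aref slot (overwritten with the result),
   piOptop[1] and piOptop[2] = the two int arguments. */
void do_asm_stupidMethod(int *piOptop)
{
    assert(piOptop != NULL);
    int c = piOptop[1] + piOptop[2];  /* -O2: movl 0x8(%eax),%edx; addl 0x4(%eax),%edx */
    *piOptop = c;                     /* -O2: movl %edx,(%eax) */
}
```

At -O2 the local c never touches the stack, which is why the subl and the two moves through -4(%ebp) disappear.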

Timing: With the optimised call (300000 calls), we have 0.74 0.73 0.73 user time. 
	Without we have .91 .93 .92 .91

Hmm, and remember the unoptimised version doesn't have to run the check to see if the method can be optimised. So this is impressive. 

Now what if we do a QUICK style replacement? This should remove the overhead of that check.

Well, the results of that are 0.57 0.57 0.59. So we get roughly a 1.6x speedup over the 0.91-0.93 we had without optimisation. But remember this doesn't include the time to generate the native code for the method.
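
A minimal sketch of the QUICK-style idea, under stated assumptions: the opcode values, the Method struct and the function names below are all hypothetical and only illustrate the mechanism. The first execution of a call site runs the check and rewrites the opcode in place, so later executions never run the check again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode values, for illustration only. */
enum { OP_INVOKE = 1, OP_INVOKE_QUICK = 2, OP_INVOKE_QUICK_OPT = 3 };

typedef struct {
    int can_optimise;  /* what the (expensive) analysis would conclude */
    int checks_done;   /* instrumentation: how often the check ran */
} Method;

/* Stand-in for the "can this method be optimised?" check. */
static int check_optimisable(Method *m)
{
    m->checks_done++;
    return m->can_optimise;
}

/* Execute the call site at code[pc]. An unresolved OP_INVOKE is checked once
   and rewritten; afterwards dispatch happens on the rewritten opcode alone.
   Returns the opcode that was actually executed. */
int execute_call_site(unsigned char *code, size_t pc, Method *m)
{
    assert(code != NULL && m != NULL);
    if (code[pc] == OP_INVOKE) {
        code[pc] = check_optimisable(m) ? OP_INVOKE_QUICK_OPT : OP_INVOKE_QUICK;
    }
    return code[pc];
}
```

Calling execute_call_site 300000 times on the same site leaves checks_done at 1, which is the overhead the replacement is meant to remove.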

Ok next step is to code the asm code directly instead of a c version:

Ok we did this, seems to work and now our times are:

.55 .60 .58 .59 .61 .62

Much the same, but now there is real overhead in invokevirtual_quick because it has to fetch the native pointer from the pstMethodTemp->pstCode structure. When a native method calls a native method, do we still need to look up the method like this? I.e. does it depend on the reference's type?

The next step is to generate asm code for each bytecode, see gen_bytecode_intruction gen_c_prologue, epilogue etc:

Here is reference ASM code:

//This is code generated from my own NASM source (stupid.S)
00000000 <_runme>:
   0:   55                      pushl  %ebp
   1:   89 e5                   movl   %esp,%ebp
   3:   8b 45 08                movl   0x8(%ebp),%eax ; iload_1
   6:   05 04 00 00 00          addl   $0x4,%eax
   b:   ff 30                   pushl  (%eax)
   d:   8b 45 08                movl   0x8(%ebp),%eax ; iload_2
  10:   05 08 00 00 00          addl   $0x8,%eax
  15:   ff 30                   pushl  (%eax)
  17:   59                      popl   %ecx ; iadd
  18:   5a                      popl   %edx
  19:   01 ca                   addl   %ecx,%edx
  1b:   8b 45 08                movl   0x8(%ebp),%eax
  1e:   89 10                   movl   %edx,(%eax)
  20:   c9                      leave
  21:   c3                      ret
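
A sketch of per-bytecode template generation that reproduces the byte sequence of the reference listing above. The byte values are copied from that listing; the CodeBuf type and the function names (gen_prologue, gen_iload and so on) are hypothetical and are not the functions in analyse.c. The sketch only fills a buffer, it does not execute the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char bytes[64];
    size_t len;
} CodeBuf;

/* Append n template bytes to the buffer. */
static void emit(CodeBuf *cb, const unsigned char *src, size_t n)
{
    assert(cb->len + n <= sizeof cb->bytes);
    memcpy(cb->bytes + cb->len, src, n);
    cb->len += n;
}

/* pushl %ebp; movl %esp,%ebp */
void gen_prologue(CodeBuf *cb)
{
    static const unsigned char t[] = { 0x55, 0x89, 0xe5 };
    emit(cb, t, sizeof t);
}

/* iload_<slot>: movl 0x8(%ebp),%eax; addl $4*slot,%eax; pushl (%eax) */
void gen_iload(CodeBuf *cb, unsigned slot)
{
    unsigned off = 4u * slot;
    unsigned char t[] = {
        0x8b, 0x45, 0x08,
        0x05,                                   /* addl imm32,%eax (little-endian) */
        (unsigned char)(off & 0xff), (unsigned char)((off >> 8) & 0xff),
        (unsigned char)((off >> 16) & 0xff), (unsigned char)((off >> 24) & 0xff),
        0xff, 0x30
    };
    emit(cb, t, sizeof t);
}

/* iadd as in the listing: popl %ecx; popl %edx; addl %ecx,%edx */
void gen_iadd(CodeBuf *cb)
{
    static const unsigned char t[] = { 0x59, 0x5a, 0x01, 0xca };
    emit(cb, t, sizeof t);
}

/* movl 0x8(%ebp),%eax; movl %edx,(%eax); leave; ret */
void gen_store_and_epilogue(CodeBuf *cb)
{
    static const unsigned char t[] = { 0x8b, 0x45, 0x08, 0x89, 0x10, 0xc9, 0xc3 };
    emit(cb, t, sizeof t);
}

/* stupidMethod: iload_1, iload_2, iadd, ireturn. Returns the code size. */
size_t gen_stupid_method(CodeBuf *cb)
{
    cb->len = 0;
    gen_prologue(cb);
    gen_iload(cb, 1);
    gen_iload(cb, 2);
    gen_iadd(cb);
    gen_store_and_epilogue(cb);
    return cb->len;
}
```

gen_stupid_method produces 0x22 bytes, matching the listing, which runs from offset 0 to 0x21.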

//When we generate the assembly for each of these bytecodes, we get times like:

0.60 0.61 .60 .63 .61

//Ok next step is to see if it works for a different method, so we have created:

Method int stupiderMethod(int, int)
   0 iload_1
   1 iload_2
   2 iadd
   3 iload_2
   4 iadd
   5 ireturn 

It doesn't work!!! This is because when we do iadd, we don't do a pushl onto the C stack. Let's do that in stupid.S

This gives us one more line:

1b:   52                      pushl  %edx
 
let's add that to analyse.c

And -- damn it works!!!!! Yeehah.
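
The failure and the fix can be modelled with a small stack simulation. This is only a sketch: the opcode encoding and the function name run are made up for the example, and the edx variable stands in for the %edx register. With iadd_pushes set to 0 the result of iadd stays in edx only, so the second iadd of stupiderMethod finds one value on the stack instead of two.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative encoding for this sketch, not real JVM opcode values. */
enum { ILOAD_1, ILOAD_2, IADD, IRETURN };

/* Runs the code against locals[1] and locals[2]. Returns 0 and stores the
   result in *out, or returns -1 if the stack underflows or code runs out. */
int run(const int *code, size_t n, const int *locals, int iadd_pushes, int *out)
{
    int stack[16];
    size_t sp = 0;
    int edx = 0;                      /* models the %edx register */

    assert(code != NULL && locals != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (code[i]) {
        case ILOAD_1:
        case ILOAD_2:
            assert(sp < 16);
            stack[sp++] = locals[code[i] == ILOAD_1 ? 1 : 2];
            break;
        case IADD:
            if (sp < 2) return -1;    /* nothing left to pop: the original bug */
            {
                int ecx = stack[--sp];
                edx = stack[--sp] + ecx;
            }
            if (iadd_pushes) {        /* the added pushl %edx */
                assert(sp < 16);
                stack[sp++] = edx;
            }
            break;
        case IRETURN:
            *out = edx;               /* movl %edx,(%eax) */
            return 0;
        default:
            return -1;
        }
    }
    return -1;
}
```

stupidMethod works either way because it has a single iadd followed directly by the store from edx; stupiderMethod only works once iadd pushes its result.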

Running the latest gtest.Optimiser we get times of
1.79 1.82 1.82 1.83

Without the ops:

1.50 1.56 1.51 1.48 1.54

So it seems that we should remove the overhead of checking whether a function can be optimised, hmm, no this only happens 4 times. 

Ok, checking with all 3 methods: (no ops)
2.30 2.30 2.25 2.25 2.27 

(ops)

2.92 2.87 2.94 2.92 2.92 2.90

So why is the optimised code now slower than interpreted?

Just the addsub (opt)
1.68 1.65 1.67 1.72

(no opts)
1.59 1.55 1.51 1.57 1.59 

Damn!

Let's check if we've simply added too much overhead

So with not_a_stupid , 1 call no opts:

0.13 .14 .13 .15 .20

Opts:

0.14 .13 .18 .13 .13

With opts, 300000 times:

1.55 1.55 1.56 1.54

without
1.54 1.58 1.55 1.55 


NO opts, just stupidMethod 300000 times

1.12 1.07 1.11 1.09 1.07 1.10

With opts:

0.61 0.58 0.59 0.60 0.60

Damn, this is so much faster, so why is it slower with the others? I suppose this example shows how we've removed much of the function call overhead (this is a short method). Let's create a long method and see how that goes.

With opts 300000 long methods:

Ok I found out i forgot to put isub in analyse.c

DOH!!!!

With opts:

0.53 0.57 0.53 

Without:

1.72 1.70 1.72 1.73 1.70

So we're seeing the removal of method overhead and instruction overhead
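
To put a number on that, a small helper that averages the runs and takes the ratio. The function names are made up for this sketch; the timings fed to it are the long-method numbers above.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timings. */
double mean(const double *xs, size_t n)
{
    assert(xs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += xs[i];
    }
    return sum / (double)n;
}

/* Speedup = mean baseline time / mean optimised time. */
double speedup(const double *base, size_t nbase, const double *opt, size_t nopt)
{
    return mean(base, nbase) / mean(opt, nopt);
}
```

For the long method this gives a mean of 1.714 without opts against 0.543 with opts, a ratio of about 3.15. With only three optimised runs the figure is rough.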




Performance of Huffman with getfield_quick

With opts
1.08
1.08
1.08
1.09
1.06

Again

1.95
1.88
Without:

1.74
1.76
1.97
2.02
2.05
1.37
0.94
0.87
0.87
0.97
0.90
0.87

Again

2.06
2.06
1.88
2.05
2.04
2.06
1.90
2.02
2.02
1.95
1.91
1.95
2.01
2.09
2.01

Weird! Maybe I should wait till the machine isn't doing anything


With opts 

1.25
1.24
1.25

With opts but no more redundant checking whether a method can be optimised (ACC_CANNOT_OPTIMISE)

0.88
0.88
0.84
0.94
0.84
0.89
0.84
0.89
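
A sketch of how a flag like ACC_CANNOT_OPTIMISE can cache a negative answer so the bytecode is scanned at most once per rejected method. Everything here except the flag name is an assumption: the bit value, the Method struct and the list of supported opcodes are placeholders. The opcode numbers in the switch are the standard JVM values for those instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit value; the real constant is not shown in these notes. */
#define ACC_CANNOT_OPTIMISE 0x8000u

typedef struct {
    unsigned access_flags;
    const unsigned char *bytecode;
    size_t length;
    unsigned scans;               /* instrumentation: number of full scans */
} Method;

/* Stand-in analysis: stops at the first opcode outside a small supported set. */
static int scan_supported(Method *m)
{
    m->scans++;
    for (size_t i = 0; i < m->length; i++) {
        switch (m->bytecode[i]) {
        case 0x1b: /* iload_1 */
        case 0x1c: /* iload_2 */
        case 0x60: /* iadd */
        case 0x64: /* isub */
        case 0xac: /* ireturn */
            break;
        default:
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if the method can be optimised. A rejected method gets the flag
   set, so every later call returns 0 without rescanning. */
int can_optimise(Method *m)
{
    assert(m != NULL);
    if (m->access_flags & ACC_CANNOT_OPTIMISE) {
        return 0;
    }
    if (!scan_supported(m)) {
        m->access_flags |= ACC_CANNOT_OPTIMISE;
        return 0;
    }
    return 1;
}
```

Only the negative answer is cached here, on the assumption that an optimisable method is compiled once and then reached through its rewritten call site.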

with invokenonvirtual_quick 

0.80
0.79
0.75
0.77
0.81
0.84
0.78
0.83

with getstatic_quick 
(ok, so no getstatic_quick!)
enabled getfield_quick
enabled putfield_quick
enabled invoke_virtual_quick (and created invokevirtual_quick_optimised)
enabled invokenonvirtual_quick (or invokestatic_quick)

New times:

55
55
48
55
48
55
51
54
53
54
47
49

With IBM 1.18

1.42
1.39
1.40
1.37
1.38
1.40
1.42
1.45
1.38

with no jit
0.41
0.38
0.37
0.39
0.43
0.45
0.43
0.40
0.42

New times

nojit

.28
.30
.31
.29
.26
.34
.32
.29
.30

withjit

1.02
.93
.97
.94
1.00
.97

Kissme, opts

.46
.50
.49
.49
.48
.45
.49
.49
.52

kiss, no opts but with _quick

.53
.55
.48
.53
.49
.50
.53
.52
.55
.49
.55
.49

//I've added a counter to see what percentage of methods we can optimise so far

It seems we don't optimise any in Huffman, and only two in gtest.Optimiser.

I get the following errors from kjc:

INTERP failed to load class kjc
** Current frame 80a7848 java/lang/DefaultClassLoader.loadClass 0
** Current frame 80a7818 java/lang/ClassLoader.loadClass 1
** Current frame 80a77e8 java/util/ResourceBundle.tryBundle 2
Created optimised method for at/dms/kjc/CType.isArrayType ()Z
Generating code for method isArrayType
Returning method at 81bf1c0 2, used 16 asm bytes
Throwing ClassCastException in checkcast. to at/dms/kjc/CNumericType from at/dms/kjc/CArrayType
** Current frame 80a77b8 at/dms/kjc/CArrayType.<init> 0
** Current frame 80a7788 at/dms/kjc/CStdType.init 1
** Current frame 80a7758 at/dms/kjc/Main.initialize 2
** Current frame 80a7728 at/dms/kjc/Main.run 3
** Current frame 80a76f8 at/dms/kjc/Main.compile 4
** Current frame 80a76c8 at/dms/kjc/Main.main 5
---------------Top level exception handler--------------------       

I'm going to disable SVETLANA and see if it runs

This is the method:

Method boolean isArrayType()
   0 iconst_0
   1 ireturn        

Of course the problem was that we don't put the value back onto the java stack with this kind of method.
I changed ireturn / areturn to put what's on the C stack on *optop.
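
A small model of that change, as a sketch only. The opcode encoding and the function name are invented for the example, and the array stack stands in for the C stack: iconst_x pushes a constant, and ireturn pops the top of the stack and stores it at *optop, so a method with no iadd still returns its value to the Java stack.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative encoding for this sketch, not real JVM opcode values. */
enum { ICONST_0, ICONST_1, IRETURN };

/* Runs a constant-returning method. Returns 0 on success, -1 on a stack
   error or if the code ends without an ireturn. */
int run_const_method(const int *code, size_t n, int *optop)
{
    int stack[4];
    size_t sp = 0;

    assert(code != NULL && optop != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (code[i]) {
        case ICONST_0:
        case ICONST_1:
            if (sp == 4) return -1;
            stack[sp++] = (code[i] == ICONST_1) ? 1 : 0;
            break;
        case IRETURN:
            if (sp == 0) return -1;
            *optop = stack[--sp];   /* top of the C stack goes to *optop */
            return 0;
        default:
            return -1;
        }
    }
    return -1;
}
```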

And it presents another bug in the iconst_x instructions.

Method int getTag()
   0 iconst_1
   1 ireturn


Method boolean isClassType()
   0 iconst_1
   1 ireturn 

CType:
Method boolean isClassType()
   0 iconst_0
   1 ireturn 

but these type of methods work in Optimiser .... hmmm

Method int getColumn()
   0 iconst_0
   1 ireturn

Method int getLine()
   0 iconst_0
   1 ireturn

I hacked it to not optimise any methods, let's see...

Sure enough it runs. Now to optimise all methods except iconst_0 and iconst_1

Method int getTag()
   0 iconst_3
   1 ireturn
Method int getSize()
   0 iconst_3
   1 ireturn

So we remove 3 too ...

And it runs. 

So it seems we are losing our VTIndex and Argsize somewhere in invoke_virtual_quick_optimised.
I fixed that. It's because I didn't understand virtual dispatch!
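
A minimal sketch of why the vtable index has to survive: with virtual dispatch the code to run is chosen by the receiver's runtime class, so a call site needs the index and cannot bind to one method's native pointer. The class and method names are taken from the kjc trace above; the structs, the single-slot vtable and the assumption that CArrayType.isArrayType returns true are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*NativeFn)(void);

#define VT_SIZE 1                 /* slot 0: isArrayType */

typedef struct {
    const char *name;
    NativeFn vtable[VT_SIZE];
} Class;

typedef struct {
    const Class *cls;             /* runtime class of the object */
} Object;

static int ctype_isArrayType(void)      { return 0; }  /* iconst_0; ireturn */
static int carraytype_isArrayType(void) { return 1; }  /* assumed override */

static const Class CType      = { "at/dms/kjc/CType",      { ctype_isArrayType } };
static const Class CArrayType = { "at/dms/kjc/CArrayType", { carraytype_isArrayType } };

/* Look the method up through the receiver's class on every call. */
int invoke_virtual(const Object *receiver, size_t vt_index)
{
    assert(receiver != NULL && vt_index < VT_SIZE);
    return receiver->cls->vtable[vt_index]();
}
```

If the call site had cached the native code for CType.isArrayType, a CArrayType receiver would get 0 back, which would be consistent with the bogus ClassCastException in the trace, though that link is a guess.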

Timing for kjc with simple.java

4.95
5.15
5.05
5.01
4.92
5.01
5.09
5.02

JDK nojit

1.26
1.16
1.23
1.20
1.20
1.21
1.24

withjit

7.02
7.30
7.15
7.07
7.19
7.26

Interesting hey?


