Equation Solution High Performance by Design 




Parallel Performance of 8-Byte Matrix Multiplication
[Posted by JennChing Luo on June 14, 2016 ]
There are two related posts. One reported the performance of multiplying two 16-byte (quad precision) matrices, and the other showed the performance of a 10-byte (extended precision) matrix product. The efficiency was inconsistent between the two: sixty-four cores made the product of two 16-byte matrices up to 60 times faster, while the multiplication of two 10-byte matrices did not run any faster when more than 20 cores were used.
This post shows the parallel performance of an 8-byte (double precision) matrix product. Forty-eight cores sped up the 8-byte matrix product by a factor of up to 40.5, which is again inconsistent with the 10-byte matrix product. There is an explanation for why different variable types lead to different efficiency, but this post does not address those technical issues; it only reports the measurements. The parallel performance is as follows.
TESTING EXAMPLE
Perform [C]=[A][B], where matrices [A], [B], and [C] are 8-byte real matrices. Matrix [A] is of order (15000-by-11000), matrix [B] is of order (11000-by-12000), and matrix [C] is of order (15000-by-12000).
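To give a sense of the problem size, the following minimal sketch works out the memory footprint and the operation count implied by the dimensions above. It assumes the classical algorithm, with one multiplication and one addition per inner-product term; the post does not state which algorithm laipe$matmul_8 or matmul actually uses, and the variable names are chosen here for illustration only.

```python
# Dimensions from the testing example: [A] is M-by-K, [B] is K-by-N, [C] is M-by-N.
M, K, N = 15000, 11000, 12000
BYTES_PER_ELEMENT = 8  # 8-byte (double precision) reals


def matrix_bytes(rows, cols):
    """Storage needed for a dense rows-by-cols matrix of 8-byte reals."""
    return rows * cols * BYTES_PER_ELEMENT


memory = {
    "A": matrix_bytes(M, K),
    "B": matrix_bytes(K, N),
    "C": matrix_bytes(M, N),
}
total_bytes = sum(memory.values())

# Assumption: classical algorithm, one multiply and one add per (i, j, k) triple.
flops = 2 * M * K * N

for name, size in memory.items():
    print(f"[{name}]: {size / 1e9:.3f} GB")
print(f"total: {total_bytes / 1e9:.3f} GB")
print(f"floating-point operations: {flops:.3e}")
```

Under these assumptions the three matrices occupy roughly 3.8 GB in total, and the product takes about 3.96e12 floating-point operations.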
COMPUTING ENVIRONMENT
Computer: a Dell PowerEdge R815 with quad Opteron 6168, a total of 48 cores.
Operating System: Windows Server 2008 R2
Compiler: gfortran with optimization -O3. The application links against neuloop4 for parallel processing.
Subroutine: laipe$matmul_8, which performs matrix multiplication in parallel
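The post does not describe how laipe$matmul_8 divides the work among cores. Purely as an illustration of one common decomposition, the sketch below splits the rows of [C] into contiguous blocks and hands each block to a worker. The function names multiply_rows and parallel_matmul are hypothetical and are not part of LAIPE or neuloop4. In CPython the threads do not run this arithmetic concurrently, so the sketch shows only the partitioning idea, not a performance technique.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, row_start, row_stop):
    """Compute rows row_start..row_stop-1 of the product a*b (lists of lists)."""
    n_inner = len(b)
    n_cols = len(b[0])
    block = []
    for i in range(row_start, row_stop):
        row = [0.0] * n_cols
        for k in range(n_inner):
            a_ik = a[i][k]
            b_k = b[k]
            for j in range(n_cols):
                row[j] += a_ik * b_k[j]
        block.append(row)
    return block


def parallel_matmul(a, b, workers):
    """Split the rows of the result into `workers` contiguous blocks."""
    n_rows = len(a)
    # Block boundaries: nearly equal chunks, some possibly empty.
    bounds = [n_rows * w // workers for w in range(workers + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(multiply_rows, a, b, bounds[w], bounds[w + 1])
            for w in range(workers)
        ]
        c = []
        for future in futures:  # collect blocks in row order
            c.extend(future.result())
    return c


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0]]
print(parallel_matmul(a, b, workers=2))
```

Each block of rows depends only on the matching rows of [A] and on all of [B], so the workers never write to the same part of the result.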
COMPARISON WITH GFORTRAN INTRINSIC FUNCTION MATMUL
gfortran has the intrinsic function matmul for matrix multiplication. The intrinsic function matmul is sequential and cannot take advantage of multiple cores. Before showing the parallel performance of laipe$matmul_8, we compare laipe$matmul_8, running on one core, with the intrinsic function matmul. First, let us see the performance of the intrinsic function matmul. The timing result is as follows:
Elapsed Time (Seconds): 7265.82
CPU Time in User Mode (Seconds): 7264.33
CPU Time in Kernel Mode (Seconds): 1.50
Total CPU Time (Seconds): 7265.82
The intrinsic function matmul took 7265.82 seconds to compute the matrix multiplication. Next, let us see the performance of the parallel subroutine laipe$matmul_8 on one core. We have the following timing result:
Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58
With only one core enabled, the parallel subroutine laipe$matmul_8 took 7086.91 seconds to perform the matrix multiplication, about 2.5% less than matmul. laipe$matmul_8 is a parallel subroutine and carries extra code for parallel processing, so one might expect it to be slower than matmul. Nevertheless, it ran faster.
TIMING RESULT
Timing results include "elapsed time," "CPU time in user mode," "CPU time in kernel mode," and "total CPU time."
The timing results for 1 to 48 cores are as follows (all times in seconds):
Cores   Elapsed (s)   User CPU (s)   Kernel CPU (s)   Total CPU (s)
    1       7086.91        7084.63             1.95         7086.58
    2       3538.93        7076.46             0.67         7077.13
    3       2360.84        7080.57             0.89         7081.46
    4       1776.20        7085.67             0.80         7086.47
    5       1419.08        7092.35             0.70         7093.05
    6       1183.78        7099.17             0.83         7100.00
    7       1021.89        7103.77             0.72         7104.49
    8        890.70        7105.56             0.87         7106.44
    9        796.12        7111.38             0.98         7112.37
   10        716.06        7123.57             0.86         7124.43
   11        654.52        7125.08             0.92         7126.00
   12        597.61        7123.61             0.81         7124.43
   13        559.47        7201.68             0.97         7202.64
   14        522.09        7224.69             0.80         7225.48
   15        486.16        7213.00             0.66         7213.66
   16        458.07        7242.91             0.92         7243.83
   17        436.55        7353.48             0.94         7354.42
   18        414.09        7353.29             0.81         7354.11
   19        394.50        7373.03             0.75         7373.78
   20        368.94        7259.41             0.95         7260.36
   21        350.57        7265.81             0.76         7266.57
   22        336.23        7298.26             1.05         7299.30
   23        320.69        7262.85             1.05         7263.89
   24        308.52        7318.74             1.00         7319.74
   25        297.68        7321.63             0.81         7322.44
   26        286.96        7322.41             1.00         7323.40
   27        276.98        7329.86             0.69         7330.55
   28        267.32        7333.12             0.80         7333.92
   29        258.70        7339.24             0.94         7340.17
   30        249.40        7342.81             0.78         7343.59
   31        240.38        7350.33             1.06         7351.39
   32        233.81        7365.81             0.67         7366.48
   33        229.56        7393.50             0.56         7394.06
   34        221.99        7401.73             0.83         7402.56
   35        216.19        7422.51             1.05         7423.56
   36        211.94        7456.69             0.80         7457.49
   37        206.97        7478.70             0.90         7479.61
   38        203.18        7513.54             0.75         7514.29
   39        198.26        7555.64             0.87         7556.52
   40        195.20        7592.27             0.84         7593.11
   41        190.12        7647.62             0.98         7648.60
   42        187.48        7686.03             1.03         7687.06
   43        185.58        7776.79             0.95         7777.74
   44        183.57        7835.40             0.89         7836.29
   45        178.98        7895.26             0.95         7896.21
   46        178.65        8006.25             1.25         8007.50
   47        178.11        8093.89             0.86         8094.75
   48        174.81        8228.79             1.05         8229.83
Forty-eight cores completed the computation in 174.81 seconds, while one core took 7086.91 seconds, i.e., about 1 hour and 58 minutes. In other words, forty-eight cores could complete a 2-hour job in about 3 minutes. In the following, we look at parallel speedup and efficiency.
SPEEDUP AND EFFICIENCY
The table that follows summarizes the timing results together with speedup and efficiency. The number of cores is in the first column, elapsed time is in the second column, the third column is parallel speedup, and the fourth column is parallel efficiency. The table shows an almost linear speedup: forty-eight cores improved the speed by a factor of 40.5. It also shows that 42 cores could still achieve 90% efficiency.
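The speedup and efficiency figures quoted above can be recomputed from the elapsed times. The sketch below assumes the usual definitions, speedup = T(1)/T(n) and efficiency = speedup/n; the post does not spell these out, but they reproduce the quoted 40.5x at 48 cores and 90% at 42 cores. Only a subset of the measured core counts is included to keep the example short.

```python
# Elapsed times in seconds, copied from the timing results, keyed by core count.
elapsed = {
    1: 7086.91,
    2: 3538.93,
    4: 1776.20,
    8: 890.70,
    16: 458.07,
    24: 308.52,
    32: 233.81,
    42: 187.48,
    48: 174.81,
}


def speedup(cores):
    """Elapsed time on one core divided by elapsed time on `cores` cores."""
    return elapsed[1] / elapsed[cores]


def efficiency(cores):
    """Speedup per core; 1.0 would be perfectly linear scaling."""
    return speedup(cores) / cores


print(f"{'cores':>5} {'elapsed (s)':>12} {'speedup':>8} {'efficiency':>10}")
for cores in sorted(elapsed):
    print(
        f"{cores:>5} {elapsed[cores]:>12.2f} "
        f"{speedup(cores):>8.2f} {efficiency(cores):>10.1%}"
    )
```

The same two functions can be applied to any other row of the timing results by adding its elapsed time to the dictionary.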



