Loading...

Summer Research Fellowship Programme of India's Science Academies

Detecting floating point exceptions in RISC-V using GCC

Faisal Riyaz

ZHCET, Aligarh Muslim University, Aligarh 202002

V. Kamakoti

Department of Computer Science and Engineering, Indian Institute of Technology, Madras, Chennai 600036

Abstract

Floating point operations are an important part of many scientific and everyday computations. Almost every programming language provides the ability to use floating point numbers for calculation. Errors during floating point exceptions are difficult to detect. They often are critical and cannot be ignored. Most instruction set architectures generate traps to help detect floating point exceptions. RISC-V is an open ISA getting widely adopted by industry and academia. Unlike many other Instruction Set Architectures, RISC-V does not support traps on floating point exceptions. It instead requires explicit checks by software. Detecting exceptions in software can be computationally expensive and often not feasible. We present a modified GCC backend for RISC-V capable to detect floating point exceptions that occur when performing elementary floating point operations: addition, multiplication, addition, subtraction, fused multiply add and fused multiply subtract. Machine description files of RISC-V GCC are modified to add instructions to check floating point status flags. Detection can be enabled or disabled using a command-line option. We compared the performance of modified RISC-V GCC with traps enabled to RISC-V GCC with traps disabled and found 20% decrease in performance for LINPACK benchmark.

Keywords: compilers, reduced Instrcution Set Architecture, benchmarks

Abbreviations

Abbreviations
 RISCReduced Instruction Set Computer 
 ISAInstruction Set Architecture 
 GCCGNU C Compiler 
 FCSRFloating point Control and Status Register 
MWIPS Million Whetstone Instructions Per Second 
 MFLOPSMillion Flaoting Point Operarions Per Second 

INTRODUCTION

RISC-V is a free and open ISA, enabling a new era of processor innovation. It was developed by the University of California to support computer architecture research and education (​Waterman, et al ,2014​). It is calling attention worldwide by its fast growth and adoption in industry. RISC-V has a modular design. It has 32-, 64- and 128-bit integer base with added optional extensions.

Processors with RISC-V ISA do not support traps for floating point exception (​Waterman, et al ,2014​). It instead requires explicit checks by software. When any exception occurs, it gets noted in the floating point control and status register. The program continues as if nothing happened. The operation produces a default value, specific to the exception.

To detect floating point exceptions, we modified GCC to insert a check after every floating point operation using GCC. A command line option fp-trap was provided to enable or disable traps for floating point exceptions.

To observe the difference in execution speed between gcc and modified riscv-gcc with traps enabled, we used Whetsone and LINPACK benchmark with different optimization levels. We observed 20% decrease in MFLOPS for LINPACK when no optimization was used. There was 8.8% decrease in MIPS for Whetstone when no optimization was used.

We also compared the size of executable generated from LINPACK benchmark. There was almost no difference between riscv-gcc and modified riscv-gcc with traps enabled.

Thus, Contribution of this paper are the following:

  • We modified the GCC backend to generate traps when any floating point exception occurs during elementary floating point operations.
  • We compared the performance of our modified gcc with traps enabled for common exceptions with riscv-gcc using Whetstone and LINPACK benchmark on different optimization levels.
  • We compared the size of executable generated from the LINPACK benchmark when modified gcc, riscv-gcc with trap enabled for common exception and riscv gcc with trap enabled for all floating point exceptions.

We modified machine description files of GCC responsible for generating RISC-V assembly from RTL. We added a check to fcsr after every common floating point operation. If any exception occurs, a trap gets generated and program gets terminated.

Floating Point Exceptions

Floating point operations are an important part of many scientific and everyday computations. Floating point instructions are part of almost all popular instructions sets like ARM, MIPS, etc. RISC-V has extensions for single-precision and double-precision floating point.

Exception is an unusual condition associated with an instruction. Exceptions are synchronous i.e. they occur at runtime.

To quote W. Kahan, "An arithmetic exception arises when an attempted atomic arithmetic operation has no result that would be acceptable universally. The meanings of atomic and acceptable vary with time and place".

IEEE 754 defines five types of floating point exceptions that must be signaled when detected. They are:

  • Invalid operation
  • Division by zero
  • Overflow
  • Underflow
  • Inexact calculation

GCC

GCC is the most popular compiler for RISC-V.

GCC-Architecture.png
    GCC Architecture (​Wicht, 2013)​

    The following section is based on

    GCC includes three components: a front end, a middle end and a backend. A source file goes through each component one after another. There is one front end for each programming language supported by GCC. Front end reads the source file, parse it and convert it into the standard abstract syntax tree. AST is converted into a unified form called generic. Then, it is converted into GIMPLE. After GIMPLE, the source code is converted into SSA representation. Many optimizations are performed on this stage. After the SSA optimization pass, the tree is converted back to the GIMPLE form which is then used to generate RTL form of tree.

    RTL is a hardware-based representation that corresponds to an abstract target architecture with an infinite number of registers. In this language, the instructions to be output are described, pretty much one by one, in an algebraic form that describes what the instruction does. Assembly instructions are generated from RTL using machine description files. Each ISA has separate machine description files in gcc/config/machine.

    The phase of importance for our purpose is when RTL is translated to assembly instructions. RTL is used for specifying machine description.

    The.md file for a target machine contains a pattern for each instruction that the target machine supports.

    We modified the assembly instructions generated for following standard pattern names:

    • fadd
    • fmul
    • fsub
    • fdiv
    • fsqrt
    • fma
    • fms
    • fnma
    • fnms

    SCOPE

    Many operations can raise floating point exceptions. We modified GCC to detect floating point operations only when performing following basic operations:

    • floating point addition
    • floating point substraction
    • floating point multilication
    • floating point division
    • floating point fused multiply add
    • floating point fused multiply substract

    Modified GCC can detect floating point exceptions for both single and double precision when performing any of the above operations. Modified GCC was not tested on any actual implementation of RISC-V. We used Spike, the RISC-V ISA simulator to test our changes and run the benchmarks. Only C front end of GCC, gcc was used to test and compile the benchmarks. Other front ends were not tested.

    The version of GCC used as a base for our modified GCC was riscv64-unknown-elf-gcc (GCC) 8.2.0.

    To test the performance of our modified GCC, we used the whetstone and LINPACK benchmark. We have reported only double precision floating point results.

    METHODOLOGY

    We used GCC over LLVM because the support for LLVM is not mature.

    To assure the correctness of our compiler, many sample program were run on Spike, a RISC-V Simulator. We did not test our modified GCC on any actual RISC-V processor.

    We compared the performance of our modified riscv gcc when using fp-trap=common with modified gcc. It was not possible to compare the performace of our modified riscv gcc with fp-trap=all because both benchmarks perform divison which in most cases raises inexact floating point excetpion.

    It is interesting to note that the performance of our modified riscv gcc with traps enabled will be same as original riscv-gcc. This is because both compilers generate the same assembly thus will have same execution time. Although, It is expected that generating executable with modified riscv gcc will be slower because after every floating point operation, check will be performed to determine whether trap should be generated or not.

    To examine how our modification has affected GCC performance, we use two benchmarks:

    • Whetstone
    • LINPACK

    We used two becnhmarks because both of these benchmarks has limatiations in different domains. Both benchmarks measure performance based on number of floating point operations per second.

    According to ​Null and Lobur 2015​, "Whetsone is floating point intensitive, with many calls to library routines for computation of trignometric and exponential functions". LINPACK is a collection of subroutines which solves systems of linear equation.

    We used a simple version of LINPACK. LINPACK has gained wide acceptance as a general method of evaluating computer and supercomputer performance.

    Programs for both benchmarks were in C language. Only double floating point exceptions were tested. For source code of both benchmarks, see Appendix.

    Both benchmarks were executed on Spike. We executed both benchmarks with no optisation and different optimization levels.

    For whetstone benchmark, we set the loop count to 100000. For LINPACK benchmark, we used an array of size 1000 x 1000.​

    RESULTS AND DISCUSSION

    Modified RISC-V GCC

    By default, traps for all floating point exceptions are disabled.
    Trap is generated using `ebreak` instruction.

    We can enable traps using fp-trap command line arguments. Valid arguments to fp-trap are:

    • all
    • common

    When all is passed to fp-trap, all five IEEE floating point exceptions generate trap.

    When common is passed to fp-trap, only Invalid, Overflow, and Division by Zero exceptions generate a trap. We provided this option because in most cases, a programmer is not concerned with Underflow and Inexact exceptionas they are very common.

    The floating point control and status register, fcsr, is a RISC-V control and status register. It is a read/write register that selects the dynamic rounding mode for floating point arithmetic operations and holds the accrued exception flags. When any floating point exception occurs, the corresponding flag in fcsr gets set.

    frflags instruction reads the accured exception flags field and copies it into the least-significant five bits of integer register rd, with zero in all other bits. ebreak instructionis used to generete trap.

    After operations which possibly can cause floating point exceptions, frflags instruction is used.

    • When all is passed to fp-trap option:

    The content of this register is compared with zero. If value inside register is zero, a trap gets generated. Otherwise ebreak instruction is skipped.

    • When common is passed to fp-trap option:

    The content of the register is compared with three. To compare with three, compiler selects another register and stores three in it. If the value inside register is greater than three, a trap gets generated. Otherwise ebreak instruction is skipped.

    Consider this C program (Code 1).

    1. #include<stdio.h>
    2. int main()
    3. {
    4. float a, b, c;
    5. a = 4.5;
    6. b = 0;
    7. c = a / b;
    8. printf("%f\n", c);
    9. return 0;
    10. }
    Example program to demonstrate division by zero

    When we compile the program with fp-trap=all, some assembly generated is in Code 3.

    1. main:
    2. addi sp,sp,-32
    3. sd ra,24(sp)
    4. sd s0,16(sp)
    5. addi s0,sp,32
    6. lui a5,%hi(.LC0)
    7. flw fa5,%lo(.LC0)(a5)
    8. fsw fa5,-20(s0)
    9. sw zero,-24(s0)
    10. flw fa4,-20(s0)
    11. flw fa5,-24(s0)
    12. fdiv.s fa5,fa4,fa5
    13. frflags a5
    14. beqz a5, 1f
    15. ebreak
    16. 1:
    17. fsw fa5,-28(s0)
    18. flw fa5,-28(s0)
    19. fcvt.d.s fa5,fa5
    20. fmv.x.d a1,fa5
    21. lui a5,%hi(.LC1)
    22. addi a0,a5,%lo(.LC1)
    23. call printf
    24. li a5,0
    25. mv a0,a5
    26. ld ra,24(sp)
    27. ld s0,16(sp)
    28. addi sp,sp,32
    29. jr ra
    Some Assembly instructions when Code 1 is compiled using fp-trap=all

    When we compile the program with fp-trap=common, some assembly generated is in Code 4.

    1. main:
    2. addi sp,sp,-32
    3. sd ra,24(sp)
    4. sd s0,16(sp)
    5. addi s0,sp,32
    6. lui a5,%hi(.LC0)
    7. flw fa5,%lo(.LC0)(a5)
    8. fsw fa5,-20(s0)
    9. sw zero,-24(s0)
    10. flw fa4,-20(s0)
    11. flw fa5,-24(s0)
    12. fdiv.s fa5,fa4,fa5
    13. frflags a5
    14. li a4, 3
    15. ble a5, a4, 1f
    16. ebreak
    17. 1:
    18. fsw fa5,-28(s0)
    19. flw fa5,-28(s0)
    20. fcvt.d.s fa5,fa5
    21. fmv.x.d a1,fa5
    22. lui a5,%hi(.LC1)
    23. addi a0,a5,%lo(.LC1)
    24. call printf
    25. li a5,0
    26. mv a0,a5
    27. ld ra,24(sp)
    28. ld s0,16(sp)
    29. addi sp,sp,32
    30. jr ra
    Some Assembly instructions when Code 1 is compiled using fp-trap=common

    When we run the program on Spike, ebreak instruction gets executed and Code 5 is displayed in terminal window.

    1. bbl loader
    2. z 0000000000000000 ra 0000000000010108 sp 000000007f7e9b30 gp 000000000001e700
    3. tp 0000000000000000 t0 0000000000000000 t1 000000000000000f t2 0000000000000000
    4. s0 000000007f7e9b50 s1 0000000000000000 a0 0000000000000001 a1 000000007f7e9b58
    5. a2 0000000000000000 a3 0000000000000010 a4 0000000000000001 a5 0000000000000008
    6. a6 000000000000001f a7 0000000000000000 s2 0000000000000000 s3 0000000000000000
    7. s4 0000000000000000 s5 0000000000000000 s6 0000000000000000 s7 0000000000000000
    8. s8 0000000000000000 s9 0000000000000000 sA 0000000000000000 sB 0000000000000000
    9. t3 0000000000000000 t4 0000000000000000 t5 0000000000000000 t6 0000000000000000
    10. pc 00000000000101de va 00000000000101de insn ffffffff sr 8000000200046020
    11. Breakpoint!
    12. z 0000000000000000 ra 0000000000010108 sp 000000007f7e9b30 gp 000000000001e700
    13. tp 0000000000000000 t0 0000000000000000 t1 000000000000000f t2 0000000000000000
    14. s0 000000007f7e9b50 s1 0000000000000000 a0 0000000000000001 a1 000000007f7e9b58
    15. a2 0000000000000000 a3 0000000000000010 a4 0000000000000001 a5 0000000000000008
    16. a6 000000000000001f a7 0000000000000000 s2 0000000000000000 s3 0000000000000000
    17. s4 0000000000000000 s5 0000000000000000 s6 0000000000000000 s7 0000000000000000
    18. s8 0000000000000000 s9 0000000000000000 sA 0000000000000000 sB 0000000000000000
    19. t3 0000000000000000 t4 0000000000000000 t5 0000000000000000 t6 0000000000000000
    20. pc 00000000000101e2 va 0000000000000108 insn ffffffff sr 8000000200046020
    21. User store segfault @ 0x0000000000000108
    Error message when ebreak instruction is encountered

    Whetstone Benchmark

    We compared the performance of our modified riscv gcc with modified riscv gcc when fp-trap=common option is used.

    We used the C version of the program. See Appendix A for the code of whetstone benchmark used. The program was executed on Spike simulator.

    The value of n, which denotes number of loops used was set as 100000. The results are expressed in Million Whetstone Instruction Per Second.

    We observed that the difference of performance between riscv-gcc with fp-trap=common and riscv-gcc with traps disabled increased with optimization level. However, for O3 optimzation level, the performance of riscv-gcc with fp-trap=common was equal to ricv-gcc with trap disabled.

    No optimization

    There was 8.8% decrease in MWIPS when fp-trap=common was used.

    mwno.jpeg
      No Optimization

      O1 optimization

      There was 33% decrease in MWIPS when fp-trap=commom was used.

      mwo1.jpeg
        O1 Optimization

        O2 optimization

        There was 50% decrease in MWIPS when fp-trap=common was used. This is a significant decrease in performance.

        mwo2.jpeg
          O2 Optimization

          O3 optimization

          Interestingly, the performance does not depend upon whether fp-trap option is used.

          mwo3.jpeg
            O3 Optimization

            Os optimization

            When Os optimization is used, GCC tries to optimize for size.

            There was 13.2% decrease in MWIPS when fp-trap=commom was used.

            LINPACK Benchmark

            We used a simple version of LINPACK benchmark. The program was written in C. See Appendix B.

            The size of matrix used was 1000 x 1000. The results are expressed in Million Floating Point Operations Per Second.

            We observed that the difference of performace between riscv-gcc with fp-trap=common and riscv-gcc with traps disabled in most cases increased with optimization level.

            No optimization

            There was 20% decrease in MFLOPS when fp-trap=common was used.

            linpack-no-optimization.jpeg
              No Optimization

              O1 optimization

              There was 45.7% in MFLOPS when fp-trap=common was used.

              linpack-O1-optimization.jpeg
                O1 Optimization

                O2 optimization

                There was 31.5% in MFLOPS when fp-trap=common was used.

                linpack-O2-optimization-.jpeg
                  O2 Optimization

                  O3 optimization

                  There was 33% in MFLOPS when fp-trap=common was used.

                  linpack-O3-optimization.jpeg
                    O3 Optimization

                    Os optimization

                    There was 26% in MFLOPS when fp-trap=common was used.

                    linpack-Os-optimization.jpeg
                      Os Optimization

                      Executable Size Comparision

                      We compared the size of executable generated from compiling Whetstone benchmark program. Comparision was performed between riscv-gcc with trap disabled, riscv-gcc with fp-trap=common and riscv-gcc with fp-trap=all.

                      The executable size was almost same for all three cases.

                      Executable obtained when using fp-trap=common was 1.6% larger than executable obtained when traps were disabled. Executable obtained when using fp-trap=all was 1.1% larger than executable obtained when traps were disabled.

                      executable-size-comparision.jpeg
                        Executable size comparision

                        CONCLUSION

                        RISC-V is the ISA of the future. RISC does not support traps on floating point exceptions. In this work, we demonstrated that GCC backend can be modified to detect floating point exceptions after certain operations. Machine description files were modified. We compared how enabling traps affect the execution speed and executable size using whetstone and LINPACK floating point benchmarks with different optimizations. We observed an 8.8% decrease in MWIPS when no optimization was used. 20% in MFLOPS was observed using LINPACK benchmark with no optimization. The executable size was not affected much by enabling traps.

                        ACKNOWLEDGEMENTS

                        I am grateful to Almighty God for every blessing in my life.

                        I express my sincere thanks and gratitude to Professor V. Kamakoti. It was a privilege to have him as a guide. 

                        I am grateful to my mentor, Sathya Narayan. He was supportive during the whole period. Working under him has been a rewarding experience.

                        Special thanks to Shankar Raman sir for his comments and Ideas. I also extend my gratitude to all members of RISE lab.

                        I would like to thank the Indian Academy of Sciences for providing me the Summer Research Fellowship, which was a great opportunity for me to get involved in research and learn many new things.

                         I am grateful to my family for helping me through all.

                        APPENDICES

                        Appendex A

                        C program for Whetstone benchmark

                        1. /*
                        2. * C Converted Whetstone Double Precision Benchmark
                        3. * Version 1.2 22 March 1998
                        4. *
                        5. * (c) Copyright 1998 Painter Engineering, Inc.
                        6. * All Rights Reserved.
                        7. *
                        8. * Permission is granted to use, duplicate, and
                        9. * publish this text and program as long as it
                        10. * includes this entire comment block and limited
                        11. * rights reference.
                        12. *
                        13. * Converted by Rich Painter, Painter Engineering, Inc. based on the
                        14. * www.netlib.org benchmark/whetstoned version obtained 16 March 1998.
                        15. *
                        16. * A novel approach was used here to keep the look and feel of the
                        17. * FORTRAN version. Altering the FORTRAN-based array indices,
                        18. * starting at element 1, to start at element 0 for C, would require
                        19. * numerous changes, including decrementing the variable indices by 1.
                        20. * Instead, the array E1[] was declared 1 element larger in C. This
                        21. * allows the FORTRAN index range to function without any literal or
                        22. * variable indices changes. The array element E1[0] is simply never
                        23. * used and does not alter the benchmark results.
                        24. *
                        25. * The major FORTRAN comment blocks were retained to minimize
                        26. * differences between versions. Modules N5 and N12, like in the
                        27. * FORTRAN version, have been eliminated here.
                        28. *
                        29. * An optional command-line argument has been provided [-c] to
                        30. * offer continuous repetition of the entire benchmark.
                        31. * An optional argument for setting an alternate LOOP count is also
                        32. * provided. Define PRINTOUT to cause the POUT() function to print
                        33. * outputs at various stages. Final timing measurements should be
                        34. * made with the PRINTOUT undefined.
                        35. *
                        36. * Questions and comments may be directed to the author at
                        37. * r.painter@ieee.org
                        38. */
                        39. /*
                        40. C**********************************************************************
                        41. C Benchmark #2 -- Double Precision Whetstone (A001)
                        42. C
                        43. C o This is a REAL*8 version of
                        44. C the Whetstone benchmark program.
                        45. C
                        46. C o DO-loop semantics are ANSI-66 compatible.
                        47. C
                        48. C o Final measurements are to be made with all
                        49. C WRITE statements and FORMAT sttements removed.
                        50. C
                        51. C**********************************************************************
                        52. */
                        53. /* standard C library headers required */
                        54. #include <stdlib.h>
                        55. #include <stdio.h>
                        56. #include <string.h>
                        57. #include <math.h>
                        58. /* the following is optional depending on the timing function used */
                        59. #include <time.h>
                        60. /* map the FORTRAN math functions, etc. to the C versions */
                        61. #define DSIN sin
                        62. #define DCOS cos
                        63. #define DATAN atan
                        64. #define DLOG log
                        65. #define DEXP exp
                        66. #define DSQRT sqrt
                        67. #define IF if
                        68. /* function prototypes */
                        69. void POUT(long N, long J, long K, double X1, double X2, double X3, double X4);
                        70. void PA(double E[]);
                        71. void P0(void);
                        72. void P3(double X, double Y, double *Z);
                        73. #define USAGE "usage: whetdc [-c] [loops]\n"
                        74. /*
                        75. COMMON T,T1,T2,E1(4),J,K,L
                        76. */
                        77. double T,T1,T2,E1[5];
                        78. int J,K,L;
                        79. int
                        80. main(int argc, char *argv[])
                        81. {
                        82. /* used in the FORTRAN version */
                        83. long I;
                        84. long N1, N2, N3, N4, N6, N7, N8, N9, N10, N11;
                        85. double X1,X2,X3,X4,X,Y,Z;
                        86. long LOOP;
                        87. int II, JJ;
                        88. /* added for this version */
                        89. long loopstart;
                        90. long startsec, finisec;
                        91. float KIPS;
                        92. int continuous;
                        93. loopstart = 1000; /* see the note about LOOP below */
                        94. continuous = 0;
                        95. II = 1; /* start at the first arg (temp use of II here) */
                        96. while (II < argc) {
                        97. if (strncmp(argv[II], "-c", 2) == 0 || argv[II][0] == 'c') {
                        98. continuous = 1;
                        99. } else if (atol(argv[II]) > 0) {
                        100. loopstart = atol(argv[II]);
                        101. } else {
                        102. fprintf(stderr, USAGE);
                        103. return(1);
                        104. }
                        105. II++;
                        106. }
                        107. LCONT:
                        108. /*
                        109. C
                        110. C Start benchmark timing at this point.
                        111. C
                        112. */
                        113. startsec = time(0);
                        114. /*
                        115. C
                        116. C The actual benchmark starts here.
                        117. C
                        118. */
                        119. T = .499975;
                        120. T1 = 0.50025;
                        121. T2 = 2.0;
                        122. /*
                        123. C
                        124. C With loopcount LOOP=10, one million Whetstone instructions
                        125. C will be executed in EACH MAJOR LOOP..A MAJOR LOOP IS EXECUTED
                        126. C 'II' TIMES TO INCREASE WALL-CLOCK TIMING ACCURACY.
                        127. C
                        128. LOOP = 1000;
                        129. */
                        130. LOOP = loopstart;
                        131. II = 1;
                        132. JJ = 1;
                        133. IILOOP:
                        134. N1 = 0;
                        135. N2 = 12 * LOOP;
                        136. N3 = 14 * LOOP;
                        137. N4 = 345 * LOOP;
                        138. N6 = 210 * LOOP;
                        139. N7 = 32 * LOOP;
                        140. N8 = 899 * LOOP;
                        141. N9 = 616 * LOOP;
                        142. N10 = 0;
                        143. N11 = 93 * LOOP;
                        144. /*
                        145. C
                        146. C Module 1: Simple identifiers
                        147. C
                        148. */
                        149. X1 = 1.0;
                        150. X2 = -1.0;
                        151. X3 = -1.0;
                        152. X4 = -1.0;
                        153. for (I = 1; I <= N1; I++) {
                        154. X1 = (X1 + X2 + X3 - X4) * T;
                        155. X2 = (X1 + X2 - X3 + X4) * T;
                        156. X3 = (X1 - X2 + X3 + X4) * T;
                        157. X4 = (-X1+ X2 + X3 + X4) * T;
                        158. }
                        159. #ifdef PRINTOUT
                        160. IF (JJ==II)POUT(N1,N1,N1,X1,X2,X3,X4);
                        161. #endif
                        162. /*
                        163. C
                        164. C Module 2: Array elements
                        165. C
                        166. */
                        167. E1[1] = 1.0;
                        168. E1[2] = -1.0;
                        169. E1[3] = -1.0;
                        170. E1[4] = -1.0;
                        171. for (I = 1; I <= N2; I++) {
                        172. E1[1] = ( E1[1] + E1[2] + E1[3] - E1[4]) * T;
                        173. E1[2] = ( E1[1] + E1[2] - E1[3] + E1[4]) * T;
                        174. E1[3] = ( E1[1] - E1[2] + E1[3] + E1[4]) * T;
                        175. E1[4] = (-E1[1] + E1[2] + E1[3] + E1[4]) * T;
                        176. }
                        177. #ifdef PRINTOUT
                        178. IF (JJ==II)POUT(N2,N3,N2,E1[1],E1[2],E1[3],E1[4]);
                        179. #endif
                        180. /*
                        181. C
                        182. C Module 3: Array as parameter
                        183. C
                        184. */
                        185. for (I = 1; I <= N3; I++)
                        186. PA(E1);
                        187. #ifdef PRINTOUT
                        188. IF (JJ==II)POUT(N3,N2,N2,E1[1],E1[2],E1[3],E1[4]);
                        189. #endif
                        190. /*
                        191. C
                        192. C Module 4: Conditional jumps
                        193. C
                        194. */
                        195. J = 1;
                        196. for (I = 1; I <= N4; I++) {
                        197. if (J == 1)
                        198. J = 2;
                        199. else
                        200. J = 3;
                        201. if (J > 2)
                        202. J = 0;
                        203. else
                        204. J = 1;
                        205. if (J < 1)
                        206. J = 1;
                        207. else
                        208. J = 0;
                        209. }
                        210. #ifdef PRINTOUT
                        211. IF (JJ==II)POUT(N4,J,J,X1,X2,X3,X4);
                        212. #endif
                        213. /*
                        214. C
                        215. C Module 5: Omitted
                        216. C Module 6: Integer arithmetic
                        217. C
                        218. */
                        219. J = 1;
                        220. K = 2;
                        221. L = 3;
                        222. for (I = 1; I <= N6; I++) {
                        223. J = J * (K-J) * (L-K);
                        224. K = L * K - (L-J) * K;
                        225. L = (L-K) * (K+J);
                        226. E1[L-1] = J + K + L;
                        227. E1[K-1] = J * K * L;
                        228. }
                        229. #ifdef PRINTOUT
                        230. IF (JJ==II)POUT(N6,J,K,E1[1],E1[2],E1[3],E1[4]);
                        231. #endif
                        232. /*
                        233. C
                        234. C Module 7: Trigonometric functions
                        235. C
                        236. */
                        237. X = 0.5;
                        238. Y = 0.5;
                        239. for (I = 1; I <= N7; I++) {
                        240. X = T * DATAN(T2*DSIN(X)*DCOS(X)/(DCOS(X+Y)+DCOS(X-Y)-1.0));
                        241. Y = T * DATAN(T2*DSIN(Y)*DCOS(Y)/(DCOS(X+Y)+DCOS(X-Y)-1.0));
                        242. }
                        243. #ifdef PRINTOUT
                        244. IF (JJ==II)POUT(N7,J,K,X,X,Y,Y);
                        245. #endif
                        246. /*
                        247. C
                        248. C Module 8: Procedure calls
                        249. C
                        250. */
                        251. X = 1.0;
                        252. Y = 1.0;
                        253. Z = 1.0;
                        254. for (I = 1; I <= N8; I++)
                        255. P3(X,Y,&Z);
                        256. #ifdef PRINTOUT
                        257. IF (JJ==II)POUT(N8,J,K,X,Y,Z,Z);
                        258. #endif
                        259. /*
                        260. C
                        261. C Module 9: Array references
                        262. C
                        263. */
                        264. J = 1;
                        265. K = 2;
                        266. L = 3;
                        267. E1[1] = 1.0;
                        268. E1[2] = 2.0;
                        269. E1[3] = 3.0;
                        270. for (I = 1; I <= N9; I++)
                        271. P0();
                        272. #ifdef PRINTOUT
                        273. IF (JJ==II)POUT(N9,J,K,E1[1],E1[2],E1[3],E1[4]);
                        274. #endif
                        275. /*
                        276. C
                        277. C Module 10: Integer arithmetic
                        278. C
                        279. */
                        280. J = 2;
                        281. K = 3;
                        282. for (I = 1; I <= N10; I++) {
                        283. J = J + K;
                        284. K = J + K;
                        285. J = K - J;
                        286. K = K - J - J;
                        287. }
                        288. #ifdef PRINTOUT
                        289. IF (JJ==II)POUT(N10,J,K,X1,X2,X3,X4);
                        290. #endif
                        291. /*
                        292. C
                        293. C Module 11: Standard functions
                        294. C
                        295. */
                        296. X = 0.75;
                        297. for (I = 1; I <= N11; I++)
                        298. X = DSQRT(DEXP(DLOG(X)/T1));
                        299. #ifdef PRINTOUT
                        300. IF (JJ==II)POUT(N11,J,K,X,X,X,X);
                        301. #endif
                        302. /*
                        303. C
                        304. C THIS IS THE END OF THE MAJOR LOOP.
                        305. C
                        306. */
                        307. if (++JJ <= II)
                        308. goto IILOOP;
                        309. /*
                        310. C
                        311. C Stop benchmark timing at this point.
                        312. C
                        313. */
                        314. finisec = time(0);
                        315. /*
                        316. C----------------------------------------------------------------
                        317. C Performance in Whetstone KIP's per second is given by
                        318. C
                        319. C (100*LOOP*II)/TIME
                        320. C
                        321. C where TIME is in seconds.
                        322. C--------------------------------------------------------------------
                        323. */
                        324. printf("\n");
                        325. if (finisec-startsec <= 0) {
                        326. printf("Insufficient duration- Increase the LOOP count\n");
                        327. return(1);
                        328. }
                        329. printf("Loops: %ld, Iterations: %d, Duration: %ld sec.\n",
                        330. LOOP, II, finisec-startsec);
                        331. KIPS = (100.0*LOOP*II)/(float)(finisec-startsec);
                        332. if (KIPS >= 1000.0)
                        333. printf("C Converted Double Precision Whetstones: %.1f MIPS\n", KIPS/1000.0);
                        334. else
                        335. printf("C Converted Double Precision Whetstones: %.1f KIPS\n", KIPS);
                        336. if (continuous)
                        337. goto LCONT;
                        338. return(0);
                        339. }
                        340. void
                        341. PA(double E[])
                        342. {
                        343. J = 0;
                        344. L10:
                        345. E[1] = ( E[1] + E[2] + E[3] - E[4]) * T;
                        346. E[2] = ( E[1] + E[2] - E[3] + E[4]) * T;
                        347. E[3] = ( E[1] - E[2] + E[3] + E[4]) * T;
                        348. E[4] = (-E[1] + E[2] + E[3] + E[4]) / T2;
                        349. J += 1;
                        350. if (J < 6)
                        351. goto L10;
                        352. }
                        353. void
                        354. P0(void)
                        355. {
                        356. E1[J] = E1[K];
                        357. E1[K] = E1[L];
                        358. E1[L] = E1[J];
                        359. }
                        360. void
                        361. P3(double X, double Y, double *Z)
                        362. {
                        363. double X1, Y1;
                        364. X1 = X;
                        365. Y1 = Y;
                        366. X1 = T * (X1 + Y1);
                        367. Y1 = T * (X1 + Y1);
                        368. *Z = (X1 + Y1) / T2;
                        369. }
                        370. #ifdef PRINTOUT
                        371. void
                        372. POUT(long N, long J, long K, double X1, double X2, double X3, double X4)
                        373. {
                        374. printf("%7ld %7ld %7ld %12.4e %12.4e %12.4e %12.4e\n",
                        375. N, J, K, X1, X2, X3, X4);
                        376. }
                        377. #endif
                        whetstone.c

                        Appendex B

                        C program for LINPACK Benchmark

                        1. /*
                        2. **
                        3. ** LINPACK.C Linpack benchmark, calculates FLOPS.
                        4. ** (FLoating Point Operations Per Second)
                        5. **
                        6. ** Translated to C by Bonnie Toy 5/88
                        7. **
                        8. ** Modified by Will Menninger, 10/93, with these features:
                        9. ** (modified on 2/25/94 to fix a problem with daxpy for
                        10. ** unequal increments or equal increments not equal to 1.
                        11. ** Jack Dongarra)
                        12. **
                        13. ** - Defaults to double precision.
                        14. ** - Averages ROLLed and UNROLLed performance.
                        15. ** - User selectable array sizes.
                        16. ** - Automatically does enough repetitions to take at least 10 CPU seconds.
                        17. ** - Prints machine precision.
                        18. ** - ANSI prototyping.
                        19. **
                        20. ** To compile: cc -O -o linpack linpack.c -lm
                        21. **
                        22. **
                        23. */
                        24. #include <stdio.h>
                        25. #include <stdlib.h>
                        26. #include <math.h>
                        27. #include <time.h>
                        28. #include <float.h>
                        29. #define DP
                        30. #ifdef SP
                        31. #define ZERO 0.0
                        32. #define ONE 1.0
                        33. #define PREC "Single"
                        34. #define BASE10DIG FLT_DIG
                        35. typedef float REAL;
                        36. #endif
                        37. #ifdef DP
                        38. #define ZERO 0.0e0
                        39. #define ONE 1.0e0
                        40. #define PREC "Double"
                        41. #define BASE10DIG DBL_DIG
                        42. typedef double REAL;
                        43. #endif
                        44. static REAL linpack (long nreps,int arsize);
                        45. static void matgen (REAL *a,int lda,int n,REAL *b,REAL *norma);
                        46. static void dgefa (REAL *a,int lda,int n,int *ipvt,int *info,int roll);
                        47. static void dgesl (REAL *a,int lda,int n,int *ipvt,REAL *b,int job,int roll);
                        48. static void daxpy_r (int n,REAL da,REAL *dx,int incx,REAL *dy,int incy);
                        49. static REAL ddot_r (int n,REAL *dx,int incx,REAL *dy,int incy);
                        50. static void dscal_r (int n,REAL da,REAL *dx,int incx);
                        51. static void daxpy_ur (int n,REAL da,REAL *dx,int incx,REAL *dy,int incy);
                        52. static REAL ddot_ur (int n,REAL *dx,int incx,REAL *dy,int incy);
                        53. static void dscal_ur (int n,REAL da,REAL *dx,int incx);
                        54. static int idamax (int n,REAL *dx,int incx);
                        55. static REAL second (void);
                        56. static void *mempool;
                        57. void main(void)
                        58. {
                        59. char buf[80];
                        60. int arsize;
                        61. long arsize2d,memreq,nreps;
                        62. size_t malloc_arg;
                        63. while (1)
                        64. {
                        65. printf("Enter array size (q to quit) [200]: ");
                        66. fgets(buf,79,stdin);
                        67. if (buf[0]=='q' || buf[0]=='Q')
                        68. break;
                        69. if (buf[0]=='\0' || buf[0]=='\n')
                        70. arsize=200;
                        71. else
                        72. arsize=atoi(buf);
                        73. arsize/=2;
                        74. arsize*=2;
                        75. if (arsize<10)
                        76. {
                        77. printf("Too small.\n");
                        78. continue;
                        79. }
                        80. arsize2d = (long)arsize*(long)arsize;
                        81. memreq=arsize2d*sizeof(REAL)+(long)arsize*sizeof(REAL)+(long)arsize*sizeof(int);
                        82. printf("Memory required: %ldK.\n",(memreq+512L)>>10);
                        83. malloc_arg=(size_t)memreq;
                        84. if (malloc_arg!=memreq || (mempool=malloc(malloc_arg))==NULL)
                        85. {
                        86. printf("Not enough memory available for given array size.\n\n");
                        87. continue;
                        88. }
                        89. printf("\n\nLINPACK benchmark, %s precision.\n",PREC);
                        90. printf("Machine precision: %d digits.\n",BASE10DIG);
                        91. printf("Array size %d X %d.\n",arsize,arsize);
                        92. printf("Average rolled and unrolled performance:\n\n");
                        93. printf(" Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS\n");
                        94. printf("----------------------------------------------------\n");
                        95. nreps=1;
                        96. while (linpack(nreps,arsize)<10.)
                        97. nreps*=2;
                        98. free(mempool);
                        99. printf("\n");
                        100. }
                        101. }
                        102. static REAL linpack(long nreps,int arsize)
                        103. {
                        104. REAL *a,*b;
                        105. REAL norma,t1,kflops,tdgesl,tdgefa,totalt,toverhead,ops;
                        106. int *ipvt,n,info,lda;
                        107. long i,arsize2d;
                        108. lda = arsize;
                        109. n = arsize/2;
                        110. arsize2d = (long)arsize*(long)arsize;
                        111. ops=((2.0*n*n*n)/3.0+2.0*n*n);
                        112. a=(REAL *)mempool;
                        113. b=a+arsize2d;
                        114. ipvt=(int *)&b[arsize];
                        115. tdgesl=0;
                        116. tdgefa=0;
                        117. totalt=second();
                        118. for (i=0;i<nreps;i++)
                        119. {
                        120. matgen(a,lda,n,b,&norma);
                        121. t1 = second();
                        122. dgefa(a,lda,n,ipvt,&info,1);
                        123. tdgefa += second()-t1;
                        124. t1 = second();
                        125. dgesl(a,lda,n,ipvt,b,0,1);
                        126. tdgesl += second()-t1;
                        127. }
                        128. for (i=0;i<nreps;i++)
                        129. {
                        130. matgen(a,lda,n,b,&norma);
                        131. t1 = second();
                        132. dgefa(a,lda,n,ipvt,&info,0);
                        133. tdgefa += second()-t1;
                        134. t1 = second();
                        135. dgesl(a,lda,n,ipvt,b,0,0);
                        136. tdgesl += second()-t1;
                        137. }
                        138. totalt=second()-totalt;
                        139. if (totalt<0.5 || tdgefa+tdgesl<0.2)
                        140. return(0.);
                        141. kflops=2.*nreps*ops/(1000.*(tdgefa+tdgesl));
                        142. toverhead=totalt-tdgefa-tdgesl;
                        143. if (tdgefa<0.)
                        144. tdgefa=0.;
                        145. if (tdgesl<0.)
                        146. tdgesl=0.;
                        147. if (toverhead<0.)
                        148. toverhead=0.;
                        149. printf("%8ld %6.2f %6.2f%% %6.2f%% %6.2f%% %9.3f\n",
                        150. nreps,totalt,100.*tdgefa/totalt,
                        151. 100.*tdgesl/totalt,100.*toverhead/totalt,
                        152. kflops);
                        153. return(totalt);
                        154. }
                        155. /*
                        156. ** For matgen,
                        157. ** We would like to declare a[][lda], but c does not allow it. In this
                        158. ** function, references to a[i][j] are written a[lda*i+j].
                        159. */
                        160. static void matgen(REAL *a,int lda,int n,REAL *b,REAL *norma)
                        161. {
                        162. int init,i,j;
                        163. init = 1325;
                        164. *norma = 0.0;
                        165. for (j = 0; j < n; j++)
                        166. for (i = 0; i < n; i++)
                        167. {
                        168. init = (int)((long)3125*(long)init % 65536L);
                        169. a[lda*j+i] = (init - 32768.0)/16384.0;
                        170. *norma = (a[lda*j+i] > *norma) ? a[lda*j+i] : *norma;
                        171. }
                        172. for (i = 0; i < n; i++)
                        173. b[i] = 0.0;
                        174. for (j = 0; j < n; j++)
                        175. for (i = 0; i < n; i++)
                        176. b[i] = b[i] + a[lda*j+i];
                        177. }
                        178. /*
                        179. **
                        180. ** DGEFA benchmark
                        181. **
                        182. ** We would like to declare a[][lda], but c does not allow it. In this
                        183. ** function, references to a[i][j] are written a[lda*i+j].
                        184. **
                        185. ** dgefa factors a double precision matrix by gaussian elimination.
                        186. **
                        187. ** dgefa is usually called by dgeco, but it can be called
                        188. ** directly with a saving in time if rcond is not needed.
                        189. ** (time for dgeco) = (1 + 9/n)*(time for dgefa) .
                        190. **
                        191. ** on entry
                        192. **
                        193. ** a REAL precision[n][lda]
                        194. ** the matrix to be factored.
                        195. **
                        196. ** lda integer
                        197. ** the leading dimension of the array a .
                        198. **
                        199. ** n integer
                        200. ** the order of the matrix a .
                        201. **
                        202. ** on return
                        203. **
                        204. ** a an upper triangular matrix and the multipliers
                        205. ** which were used to obtain it.
                        206. ** the factorization can be written a = l*u where
                        207. ** l is a product of permutation and unit lower
                        208. ** triangular matrices and u is upper triangular.
                        209. **
                        210. ** ipvt integer[n]
                        211. ** an integer vector of pivot indices.
                        212. **
                        213. ** info integer
                        214. ** = 0 normal value.
                        215. ** = k if u[k][k] .eq. 0.0 . this is not an error
                        216. ** condition for this subroutine, but it does
                        217. ** indicate that dgesl or dgedi will divide by zero
                        218. ** if called. use rcond in dgeco for a reliable
                        219. ** indication of singularity.
                        220. **
                        221. ** linpack. this version dated 08/14/78 .
                        222. ** cleve moler, university of New Mexico, argonne national lab.
                        223. **
                        224. ** functions
                        225. **
                        226. ** blas daxpy,dscal,idamax
                        227. **
                        228. */
                        229. static void dgefa(REAL *a,int lda,int n,int *ipvt,int *info,int roll)
                        230. {
                        231. REAL t;
                        232. int idamax(),j,k,kp1,l,nm1;
                        233. /* gaussian elimination with partial pivoting */
                        234. if (roll)
                        235. {
                        236. *info = 0;
                        237. nm1 = n - 1;
                        238. if (nm1 >= 0)
                        239. for (k = 0; k < nm1; k++)
                        240. {
                        241. kp1 = k + 1;
                        242. /* find l = pivot index */
                        243. l = idamax(n-k,&a[lda*k+k],1) + k;
                        244. ipvt[k] = l;
                        245. /* zero pivot implies this column already
                        246. triangularized */
                        247. if (a[lda*k+l] != ZERO)
                        248. {
                        249. /* interchange if necessary */
                        250. if (l != k)
                        251. {
                        252. t = a[lda*k+l];
                        253. a[lda*k+l] = a[lda*k+k];
                        254. a[lda*k+k] = t;
                        255. }
                        256. /* compute multipliers */
                        257. t = -ONE/a[lda*k+k];
                        258. dscal_r(n-(k+1),t,&a[lda*k+k+1],1);
                        259. /* row elimination with column indexing */
                        260. for (j = kp1; j < n; j++)
                        261. {
                        262. t = a[lda*j+l];
                        263. if (l != k)
                        264. {
                        265. a[lda*j+l] = a[lda*j+k];
                        266. a[lda*j+k] = t;
                        267. }
                        268. daxpy_r(n-(k+1),t,&a[lda*k+k+1],1,&a[lda*j+k+1],1);
                        269. }
                        270. }
                        271. else
                        272. (*info) = k;
                        273. }
                        274. ipvt[n-1] = n-1;
                        275. if (a[lda*(n-1)+(n-1)] == ZERO)
                        276. (*info) = n-1;
                        277. }
                        278. else
                        279. {
                        280. *info = 0;
                        281. nm1 = n - 1;
                        282. if (nm1 >= 0)
                        283. for (k = 0; k < nm1; k++)
                        284. {
                        285. kp1 = k + 1;
                        286. /* find l = pivot index */
                        287. l = idamax(n-k,&a[lda*k+k],1) + k;
                        288. ipvt[k] = l;
                        289. /* zero pivot implies this column already
                        290. triangularized */
                        291. if (a[lda*k+l] != ZERO)
                        292. {
                        293. /* interchange if necessary */
                        294. if (l != k)
                        295. {
                        296. t = a[lda*k+l];
                        297. a[lda*k+l] = a[lda*k+k];
                        298. a[lda*k+k] = t;
                        299. }
                        300. /* compute multipliers */
                        301. t = -ONE/a[lda*k+k];
                        302. dscal_ur(n-(k+1),t,&a[lda*k+k+1],1);
                        303. /* row elimination with column indexing */
                        304. for (j = kp1; j < n; j++)
                        305. {
                        306. t = a[lda*j+l];
                        307. if (l != k)
                        308. {
                        309. a[lda*j+l] = a[lda*j+k];
                        310. a[lda*j+k] = t;
                        311. }
                        312. daxpy_ur(n-(k+1),t,&a[lda*k+k+1],1,&a[lda*j+k+1],1);
                        313. }
                        314. }
                        315. else
                        316. (*info) = k;
                        317. }
                        318. ipvt[n-1] = n-1;
                        319. if (a[lda*(n-1)+(n-1)] == ZERO)
                        320. (*info) = n-1;
                        321. }
                        322. }
                        323. /*
                        324. **
                        325. ** DGESL benchmark
                        326. **
                        327. ** We would like to declare a[][lda], but c does not allow it. In this
                        328. ** function, references to a[i][j] are written a[lda*i+j].
                        329. **
                        330. ** dgesl solves the double precision system
                        331. ** a * x = b or trans(a) * x = b
                        332. ** using the factors computed by dgeco or dgefa.
                        333. **
                        334. ** on entry
                        335. **
                        336. ** a double precision[n][lda]
                        337. ** the output from dgeco or dgefa.
                        338. **
                        339. ** lda integer
                        340. ** the leading dimension of the array a .
                        341. **
                        342. ** n integer
                        343. ** the order of the matrix a .
                        344. **
                        345. ** ipvt integer[n]
                        346. ** the pivot vector from dgeco or dgefa.
                        347. **
                        348. ** b double precision[n]
                        349. ** the right hand side vector.
                        350. **
                        351. ** job integer
                        352. ** = 0 to solve a*x = b ,
                        353. ** = nonzero to solve trans(a)*x = b where
                        354. ** trans(a) is the transpose.
                        355. **
                        356. ** on return
                        357. **
                        358. ** b the solution vector x .
                        359. **
                        360. ** error condition
                        361. **
                        362. ** a division by zero will occur if the input factor contains a
                        363. ** zero on the diagonal. technically this indicates singularity
                        364. ** but it is often caused by improper arguments or improper
                        365. ** setting of lda . it will not occur if the subroutines are
                        366. ** called correctly and if dgeco has set rcond .gt. 0.0
                        367. ** or dgefa has set info .eq. 0 .
                        368. **
                        369. ** to compute inverse(a) * c where c is a matrix
                        370. ** with p columns
                        371. ** dgeco(a,lda,n,ipvt,rcond,z)
                        372. ** if (!rcond is too small){
                        373. ** for (j=0,j<p,j++)
                        374. ** dgesl(a,lda,n,ipvt,c[j][0],0);
                        375. ** }
                        376. **
                        377. ** linpack. this version dated 08/14/78 .
                        378. ** cleve moler, university of new mexico, argonne national lab.
                        379. **
                        380. ** functions
                        381. **
                        382. ** blas daxpy,ddot
                        383. */
                        384. static void dgesl(REAL *a,int lda,int n,int *ipvt,REAL *b,int job,int roll)
                        385. {
                        386. REAL t;
                        387. int k,kb,l,nm1;
                        388. if (roll)
                        389. {
                        390. nm1 = n - 1;
                        391. if (job == 0)
                        392. {
                        393. /* job = 0 , solve a * x = b */
                        394. /* first solve l*y = b */
                        395. if (nm1 >= 1)
                        396. for (k = 0; k < nm1; k++)
                        397. {
                        398. l = ipvt[k];
                        399. t = b[l];
                        400. if (l != k)
                        401. {
                        402. b[l] = b[k];
                        403. b[k] = t;
                        404. }
                        405. daxpy_r(n-(k+1),t,&a[lda*k+k+1],1,&b[k+1],1);
                        406. }
                        407. /* now solve u*x = y */
                        408. for (kb = 0; kb < n; kb++)
                        409. {
                        410. k = n - (kb + 1);
                        411. b[k] = b[k]/a[lda*k+k];
                        412. t = -b[k];
                        413. daxpy_r(k,t,&a[lda*k+0],1,&b[0],1);
                        414. }
                        415. }
                        416. else
                        417. {
                        418. /* job = nonzero, solve trans(a) * x = b */
                        419. /* first solve trans(u)*y = b */
                        420. for (k = 0; k < n; k++)
                        421. {
                        422. t = ddot_r(k,&a[lda*k+0],1,&b[0],1);
                        423. b[k] = (b[k] - t)/a[lda*k+k];
                        424. }
                        425. /* now solve trans(l)*x = y */
                        426. if (nm1 >= 1)
                        427. for (kb = 1; kb < nm1; kb++)
                        428. {
                        429. k = n - (kb+1);
                        430. b[k] = b[k] + ddot_r(n-(k+1),&a[lda*k+k+1],1,&b[k+1],1);
                        431. l = ipvt[k];
                        432. if (l != k)
                        433. {
                        434. t = b[l];
                        435. b[l] = b[k];
                        436. b[k] = t;
                        437. }
                        438. }
                        439. }
                        440. }
                        441. else
                        442. {
                        443. nm1 = n - 1;
                        444. if (job == 0)
                        445. {
                        446. /* job = 0 , solve a * x = b */
                        447. /* first solve l*y = b */
                        448. if (nm1 >= 1)
                        449. for (k = 0; k < nm1; k++)
                        450. {
                        451. l = ipvt[k];
                        452. t = b[l];
                        453. if (l != k)
                        454. {
                        455. b[l] = b[k];
                        456. b[k] = t;
                        457. }
                        458. daxpy_ur(n-(k+1),t,&a[lda*k+k+1],1,&b[k+1],1);
                        459. }
                        460. /* now solve u*x = y */
                        461. for (kb = 0; kb < n; kb++)
                        462. {
                        463. k = n - (kb + 1);
                        464. b[k] = b[k]/a[lda*k+k];
                        465. t = -b[k];
                        466. daxpy_ur(k,t,&a[lda*k+0],1,&b[0],1);
                        467. }
                        468. }
                        469. else
                        470. {
                        471. /* job = nonzero, solve trans(a) * x = b */
                        472. /* first solve trans(u)*y = b */
                        473. for (k = 0; k < n; k++)
                        474. {
                        475. t = ddot_ur(k,&a[lda*k+0],1,&b[0],1);
                        476. b[k] = (b[k] - t)/a[lda*k+k];
                        477. }
                        478. /* now solve trans(l)*x = y */
                        479. if (nm1 >= 1)
                        480. for (kb = 1; kb < nm1; kb++)
                        481. {
                        482. k = n - (kb+1);
                        483. b[k] = b[k] + ddot_ur(n-(k+1),&a[lda*k+k+1],1,&b[k+1],1);
                        484. l = ipvt[k];
                        485. if (l != k)
                        486. {
                        487. t = b[l];
                        488. b[l] = b[k];
                        489. b[k] = t;
                        490. }
                        491. }
                        492. }
                        493. }
                        494. }
                        495. /*
                        496. ** Constant times a vector plus a vector.
                        497. ** Jack Dongarra, linpack, 3/11/78.
                        498. ** ROLLED version
                        499. */
                        500. static void daxpy_r(int n,REAL da,REAL *dx,int incx,REAL *dy,int incy)
                        501. {
                        502. int i,ix,iy;
                        503. if (n <= 0)
                        504. return;
                        505. if (da == ZERO)
                        506. return;
                        507. if (incx != 1 || incy != 1)
                        508. {
                        509. /* code for unequal increments or equal increments != 1 */
                        510. ix = 1;
                        511. iy = 1;
                        512. if(incx < 0) ix = (-n+1)*incx + 1;
                        513. if(incy < 0)iy = (-n+1)*incy + 1;
                        514. for (i = 0;i < n; i++)
                        515. {
                        516. dy[iy] = dy[iy] + da*dx[ix];
                        517. ix = ix + incx;
                        518. iy = iy + incy;
                        519. }
                        520. return;
                        521. }
                        522. /* code for both increments equal to 1 */
                        523. for (i = 0;i < n; i++)
                        524. dy[i] = dy[i] + da*dx[i];
                        525. }
                        526. /*
                        527. ** Forms the dot product of two vectors.
                        528. ** Jack Dongarra, linpack, 3/11/78.
                        529. ** ROLLED version
                        530. */
                        531. static REAL ddot_r(int n,REAL *dx,int incx,REAL *dy,int incy)
                        532. {
                        533. REAL dtemp;
                        534. int i,ix,iy;
                        535. dtemp = ZERO;
                        536. if (n <= 0)
                        537. return(ZERO);
                        538. if (incx != 1 || incy != 1)
                        539. {
                        540. /* code for unequal increments or equal increments != 1 */
                        541. ix = 0;
                        542. iy = 0;
                        543. if (incx < 0) ix = (-n+1)*incx;
                        544. if (incy < 0) iy = (-n+1)*incy;
                        545. for (i = 0;i < n; i++)
                        546. {
                        547. dtemp = dtemp + dx[ix]*dy[iy];
                        548. ix = ix + incx;
                        549. iy = iy + incy;
                        550. }
                        551. return(dtemp);
                        552. }
                        553. /* code for both increments equal to 1 */
                        554. for (i=0;i < n; i++)
                        555. dtemp = dtemp + dx[i]*dy[i];
                        556. return(dtemp);
                        557. }
                        558. /*
                        559. ** Scales a vector by a constant.
                        560. ** Jack Dongarra, linpack, 3/11/78.
                        561. ** ROLLED version
                        562. */
                        563. static void dscal_r(int n,REAL da,REAL *dx,int incx)
                        564. {
                        565. int i,nincx;
                        566. if (n <= 0)
                        567. return;
                        568. if (incx != 1)
                        569. {
                        570. /* code for increment not equal to 1 */
                        571. nincx = n*incx;
                        572. for (i = 0; i < nincx; i = i + incx)
                        573. dx[i] = da*dx[i];
                        574. return;
                        575. }
                        576. /* code for increment equal to 1 */
                        577. for (i = 0; i < n; i++)
                        578. dx[i] = da*dx[i];
                        579. }
                        580. /*
                        581. ** constant times a vector plus a vector.
                        582. ** Jack Dongarra, linpack, 3/11/78.
                        583. ** UNROLLED version
                        584. */
                        585. static void daxpy_ur(int n,REAL da,REAL *dx,int incx,REAL *dy,int incy)
                        586. {
                        587. int i,ix,iy,m;
                        588. if (n <= 0)
                        589. return;
                        590. if (da == ZERO)
                        591. return;
                        592. if (incx != 1 || incy != 1)
                        593. {
                        594. /* code for unequal increments or equal increments != 1 */
                        595. ix = 1;
                        596. iy = 1;
                        597. if(incx < 0) ix = (-n+1)*incx + 1;
                        598. if(incy < 0)iy = (-n+1)*incy + 1;
                        599. for (i = 0;i < n; i++)
                        600. {
                        601. dy[iy] = dy[iy] + da*dx[ix];
                        602. ix = ix + incx;
                        603. iy = iy + incy;
                        604. }
                        605. return;
                        606. }
                        607. /* code for both increments equal to 1 */
                        608. m = n % 4;
                        609. if ( m != 0)
                        610. {
                        611. for (i = 0; i < m; i++)
                        612. dy[i] = dy[i] + da*dx[i];
                        613. if (n < 4)
                        614. return;
                        615. }
                        616. for (i = m; i < n; i = i + 4)
                        617. {
                        618. dy[i] = dy[i] + da*dx[i];
                        619. dy[i+1] = dy[i+1] + da*dx[i+1];
                        620. dy[i+2] = dy[i+2] + da*dx[i+2];
                        621. dy[i+3] = dy[i+3] + da*dx[i+3];
                        622. }
                        623. }
                        624. /*
                        625. ** Forms the dot product of two vectors.
                        626. ** Jack Dongarra, linpack, 3/11/78.
                        627. ** UNROLLED version
                        628. */
                        629. static REAL ddot_ur(int n,REAL *dx,int incx,REAL *dy,int incy)
                        630. {
                        631. REAL dtemp;
                        632. int i,ix,iy,m;
                        633. dtemp = ZERO;
                        634. if (n <= 0)
                        635. return(ZERO);
                        636. if (incx != 1 || incy != 1)
                        637. {
                        638. /* code for unequal increments or equal increments != 1 */
                        639. ix = 0;
                        640. iy = 0;
                        641. if (incx < 0) ix = (-n+1)*incx;
                        642. if (incy < 0) iy = (-n+1)*incy;
                        643. for (i = 0;i < n; i++)
                        644. {
                        645. dtemp = dtemp + dx[ix]*dy[iy];
                        646. ix = ix + incx;
                        647. iy = iy + incy;
                        648. }
                        649. return(dtemp);
                        650. }
                        651. /* code for both increments equal to 1 */
                        652. m = n % 5;
                        653. if (m != 0)
                        654. {
                        655. for (i = 0; i < m; i++)
                        656. dtemp = dtemp + dx[i]*dy[i];
                        657. if (n < 5)
                        658. return(dtemp);
                        659. }
                        660. for (i = m; i < n; i = i + 5)
                        661. {
                        662. dtemp = dtemp + dx[i]*dy[i] +
                        663. dx[i+1]*dy[i+1] + dx[i+2]*dy[i+2] +
                        664. dx[i+3]*dy[i+3] + dx[i+4]*dy[i+4];
                        665. }
                        666. return(dtemp);
                        667. }
                        668. /*
                        669. ** Scales a vector by a constant.
                        670. ** Jack Dongarra, linpack, 3/11/78.
                        671. ** UNROLLED version
                        672. */
                        673. static void dscal_ur(int n,REAL da,REAL *dx,int incx)
                        674. {
                        675. int i,m,nincx;
                        676. if (n <= 0)
                        677. return;
                        678. if (incx != 1)
                        679. {
                        680. /* code for increment not equal to 1 */
                        681. nincx = n*incx;
                        682. for (i = 0; i < nincx; i = i + incx)
                        683. dx[i] = da*dx[i];
                        684. return;
                        685. }
                        686. /* code for increment equal to 1 */
                        687. m = n % 5;
                        688. if (m != 0)
                        689. {
                        690. for (i = 0; i < m; i++)
                        691. dx[i] = da*dx[i];
                        692. if (n < 5)
                        693. return;
                        694. }
                        695. for (i = m; i < n; i = i + 5)
                        696. {
                        697. dx[i] = da*dx[i];
                        698. dx[i+1] = da*dx[i+1];
                        699. dx[i+2] = da*dx[i+2];
                        700. dx[i+3] = da*dx[i+3];
                        701. dx[i+4] = da*dx[i+4];
                        702. }
                        703. }
                        704. /*
                        705. ** Finds the index of element having max. absolute value.
                        706. ** Jack Dongarra, linpack, 3/11/78.
                        707. */
                        708. static int idamax(int n,REAL *dx,int incx)
                        709. {
                        710. REAL dmax;
                        711. int i, ix, itemp;
                        712. if (n < 1)
                        713. return(-1);
                        714. if (n ==1 )
                        715. return(0);
                        716. if(incx != 1)
                        717. {
                        718. /* code for increment not equal to 1 */
                        719. ix = 1;
                        720. dmax = fabs((double)dx[0]);
                        721. ix = ix + incx;
                        722. for (i = 1; i < n; i++)
                        723. {
                        724. if(fabs((double)dx[ix]) > dmax)
                        725. {
                        726. itemp = i;
                        727. dmax = fabs((double)dx[ix]);
                        728. }
                        729. ix = ix + incx;
                        730. }
                        731. }
                        732. else
                        733. {
                        734. /* code for increment equal to 1 */
                        735. itemp = 0;
                        736. dmax = fabs((double)dx[0]);
                        737. for (i = 1; i < n; i++)
                        738. if(fabs((double)dx[i]) > dmax)
                        739. {
                        740. itemp = i;
                        741. dmax = fabs((double)dx[i]);
                        742. }
                        743. }
                        744. return (itemp);
                        745. }
                        746. static REAL second(void)
                        747. {
                        748. return ((REAL)((REAL)clock()/(REAL)CLOCKS_PER_SEC));
                        749. }
                        linpack.c

                        BIBLIOGRAPHY

                        • Walfridsson, Krister (2017, August 13). Getting started with a GCC back end [Blog post]. Retrieved from https://kristerw.blogspot.com/2017/08/getting-started-with-gcc-back-end.html

                        References

                        • Waterman, A., Lee, Y., Patterson, D. A., & Asanovic, K. (2014). The RISC-V instruction set manual, volume I: User-level ISA. CS Division, EECE Department, University of California, Berkeley

                        • Wicht, Baptiste. (2013). Cache-Friendly Profile Guided Optimization. 10.13140/RG.2.2.23693.74723.

                        • Null, L. and Lobur, J. (2015). The essentials of computer organization and architecture, fourth edition. Burlington, MA: Jones & Bartlett Learning, p.554.

                        More
                        Written, reviewed, revised, proofed and published with