# Detecting floating point exceptions in RISC-V using GCC

Faisal Riyaz

ZHCET, Aligarh Muslim University, Aligarh 202002

V. Kamakoti

Department of Computer Science and Engineering, Indian Institute of Technology, Madras, Chennai 600036

## Abstract

Floating point operations are an important part of many scientific and everyday computations. Almost every programming language provides floating point numbers for calculation. Exceptions during floating point operations are difficult to detect; they are often critical and cannot be ignored. Most instruction set architectures generate traps to help detect floating point exceptions. RISC-V is an open ISA that is being widely adopted by industry and academia. Unlike many other instruction set architectures, RISC-V does not support traps on floating point exceptions; it instead requires explicit checks by software. Detecting exceptions in software can be computationally expensive and often not feasible. We present a modified GCC backend for RISC-V capable of detecting floating point exceptions that occur when performing elementary floating point operations: addition, subtraction, multiplication, division, fused multiply-add and fused multiply-subtract. The machine description files of RISC-V GCC are modified to add instructions that check the floating point status flags. Detection can be enabled or disabled using a command-line option. We compared the performance of the modified RISC-V GCC with traps enabled to RISC-V GCC with traps disabled and found a 20% decrease in performance for the LINPACK benchmark.

Keywords: compilers, reduced instruction set architecture, benchmarks

## Abbreviations

• RISC: Reduced Instruction Set Computer
• ISA: Instruction Set Architecture
• GCC: GNU C Compiler
• FCSR: Floating point Control and Status Register
• MWIPS: Million Whetstone Instructions Per Second
• MFLOPS: Million Floating Point Operations Per Second

## INTRODUCTION

RISC-V is a free and open ISA, enabling a new era of processor innovation. It was developed by the University of California to support computer architecture research and education (​Waterman, et al ,2014​). It is attracting attention worldwide through its rapid growth and adoption in industry. RISC-V has a modular design: a 32-, 64- or 128-bit integer base with optional extensions.

Processors with the RISC-V ISA do not support traps for floating point exceptions (​Waterman, et al ,2014​). The ISA instead requires explicit checks by software. When an exception occurs, it is recorded in the floating point control and status register, and the program continues as if nothing happened. The operation produces a default value, specific to the exception.

To detect floating point exceptions, we modified GCC to insert a check after every floating point operation. A command-line option, fp-trap, was provided to enable or disable traps for floating point exceptions.

To observe the difference in execution speed between riscv-gcc and the modified riscv-gcc with traps enabled, we used the Whetstone and LINPACK benchmarks at different optimization levels. We observed a 20% decrease in MFLOPS for LINPACK when no optimization was used, and an 8.8% decrease in MWIPS for Whetstone when no optimization was used.

We also compared the size of executable generated from LINPACK benchmark. There was almost no difference between riscv-gcc and modified riscv-gcc with traps enabled.

Thus, the contributions of this paper are the following:

• We modified the GCC backend to generate traps when any floating point exception occurs during elementary floating point operations.
• We compared the performance of our modified gcc with traps enabled for common exceptions against riscv-gcc, using the Whetstone and LINPACK benchmarks at different optimization levels.
• We compared the sizes of the executables generated from the LINPACK benchmark by riscv-gcc, by modified gcc with traps enabled for common exceptions, and by modified gcc with traps enabled for all floating point exceptions.

We modified the machine description files of GCC responsible for generating RISC-V assembly from RTL. We added a check of fcsr after every common floating point operation. If an exception occurs, a trap is generated and the program terminates.

## Floating Point Exceptions

Floating point operations are an important part of many scientific and everyday computations. Floating point instructions are part of almost all popular instruction sets such as ARM, MIPS, etc. RISC-V has extensions for single-precision and double-precision floating point.

An exception is an unusual condition associated with an instruction. Exceptions are synchronous, i.e. they occur at runtime as a result of executing a particular instruction.

To quote W. Kahan, "An arithmetic exception arises when an attempted atomic arithmetic operation has no result that would be acceptable universally. The meanings of atomic and acceptable vary with time and place".

IEEE 754 defines five types of floating point exceptions that must be signaled when detected. They are:

• Invalid operation
• Division by zero
• Overflow
• Underflow
• Inexact calculation

## GCC

GCC is the most popular compiler for RISC-V.

The following overview of the GCC architecture is based on (Wicht, 2013).

GCC includes three components: a front end, a middle end and a back end. A source file goes through each component in turn. There is one front end for each programming language supported by GCC. The front end reads the source file, parses it and converts it into an abstract syntax tree (AST). The AST is converted into a unified form called GENERIC, which is then lowered to GIMPLE. The GIMPLE representation is converted into SSA form, on which many optimizations are performed. After the SSA optimization passes, the tree is converted back to GIMPLE, which is then used to generate the RTL form of the tree.

RTL is a hardware-based representation that corresponds to an abstract target architecture with an infinite number of registers. In this language, the instructions to be output are described, pretty much one by one, in an algebraic form that describes what the instruction does. Assembly instructions are generated from RTL using machine description files. Each ISA has separate machine description files in gcc/config/machine.

The phase of importance for our purpose is the one in which RTL is translated into assembly instructions. RTL is also the language in which machine descriptions are written.

The .md file for a target machine contains a pattern for each instruction that the target machine supports.

We modified the assembly instructions generated for the following standard pattern names:

• fmul
• fsub
• fdiv
• fsqrt
• fma
• fms
• fnma
• fnms
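
To make this concrete, the following is a hypothetical sketch of how one such pattern could be extended; it is our illustration, not the paper's actual diff. The pattern shape follows gcc/config/riscv/riscv.md, and a real implementation would allocate the scratch register through a clobbered operand rather than hard-coding t6 as done here:

```
;; Hypothetical sketch of a modified double-precision divide pattern:
;; after fdiv.d, read the accrued exception flags with frflags and
;; trap with ebreak if any flag is set (fp-trap=all behaviour).
(define_insn "divdf3"
  [(set (match_operand:DF 0 "register_operand" "=f")
        (div:DF (match_operand:DF 1 "register_operand" " f")
                (match_operand:DF 2 "register_operand" " f")))]
  "TARGET_DOUBLE_FLOAT && TARGET_FDIV"
  "fdiv.d\t%0,%1,%2\;frflags\tt6\;beqz\tt6,1f\;ebreak\;1:"
  [(set_attr "type" "fdiv")
   (set_attr "mode" "DF")])
```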

## SCOPE

Many operations can raise floating point exceptions. We modified GCC to detect floating point exceptions only when performing the following basic operations:

• floating point subtraction
• floating point multiplication
• floating point division
• floating point fused multiply-add
• floating point fused multiply-subtract

The modified GCC can detect floating point exceptions for both single and double precision versions of the above operations. It was not tested on any actual implementation of RISC-V; we used Spike, the RISC-V ISA simulator, to test our changes and run the benchmarks. Only the C front end of GCC was used to compile and test the benchmarks; other front ends were not tested.

The version of GCC used as a base for our modified GCC was riscv64-unknown-elf-gcc (GCC) 8.2.0.

To test the performance of our modified GCC, we used the Whetstone and LINPACK benchmarks. We report only double precision floating point results.

## METHODOLOGY

We used GCC over LLVM because RISC-V support in LLVM was not yet mature.

To verify the correctness of our compiler, many sample programs were run on Spike, a RISC-V simulator. We did not test our modified GCC on any actual RISC-V processor.

We compared the performance of our modified riscv-gcc using fp-trap=common with that of riscv-gcc with traps disabled. It was not possible to benchmark with fp-trap=all because both benchmarks perform divisions, which in most cases raise the inexact floating point exception.

It is worth noting that the performance of our modified riscv-gcc with traps disabled is the same as that of the original riscv-gcc: both compilers generate identical assembly and therefore have identical execution times. With traps enabled, however, executables are expected to be slower, because after every floating point operation a check is performed to determine whether a trap should be generated.

To examine how our modification has affected GCC performance, we use two benchmarks:

• Whetstone
• LINPACK

We used two benchmarks because each has limitations in different domains. Both benchmarks measure performance as the number of floating point operations per second.

According to ​Null and Lobur 2015​, "Whetstone is floating point intensive, with many calls to library routines for computation of trigonometric and exponential functions". LINPACK is a collection of subroutines that solve systems of linear equations.

We used a simple version of LINPACK. LINPACK has gained wide acceptance as a general method of evaluating computer and supercomputer performance.

Programs for both benchmarks were written in C. Only double precision floating point was tested. For the source code of both benchmarks, see the Appendices.

Both benchmarks were executed on Spike, with no optimization and at different optimization levels.

For the Whetstone benchmark, we set the loop count to 100000. For the LINPACK benchmark, we used an array of size 1000 x 1000.

## RESULTS AND DISCUSSION

## Modified RISC-V GCC

By default, traps for all floating point exceptions are disabled. A trap is generated using the ebreak instruction.

Traps can be enabled using the fp-trap command-line option. Valid arguments to fp-trap are:

• all
• common

When all is passed to fp-trap, all five IEEE floating point exceptions generate a trap.

When common is passed to fp-trap, only the Invalid, Overflow, and Division by Zero exceptions generate a trap. We provide this option because in most cases a programmer is not concerned with the Underflow and Inexact exceptions, as they are very common.

The floating point control and status register, fcsr, is a RISC-V control and status register. It is a read/write register that selects the dynamic rounding mode for floating point arithmetic operations and holds the accrued exception flags. When any floating point exception occurs, the corresponding flag in fcsr gets set.

The frflags instruction reads the accrued exception flags field and copies it into the least-significant five bits of integer register rd, with zeros in all other bits. The ebreak instruction is used to generate a trap.

After every operation that can possibly cause a floating point exception, an frflags instruction is inserted.

• When all is passed to fp-trap option:

The content of this register is compared with zero. If the value in the register is nonzero, a trap is generated; otherwise the ebreak instruction is skipped.

• When common is passed to fp-trap option:

The content of the register is compared with three (the compiler loads the constant three into another register to perform the comparison). If the value in the register is greater than three, a trap is generated; otherwise the ebreak instruction is skipped.
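
The arithmetic behind the constant three can be modelled in a few lines of C (our sketch; the fflags bit assignments are from the RISC-V specification):

```c
/* Model of the two trap conditions.  fflags bit layout (RISC-V spec):
 * NX (inexact) = 1, UF (underflow) = 2, OF (overflow) = 4,
 * DZ (divide by zero) = 8, NV (invalid) = 16.  Since UF|NX == 3, the
 * value read by frflags exceeds 3 exactly when OF, DZ or NV is set. */
enum { NX = 1, UF = 2, OF = 4, DZ = 8, NV = 16 };

/* fp-trap=all: ebreak executes whenever any flag is set. */
static int trap_all(int fflags) { return fflags != 0; }

/* fp-trap=common: ebreak executes only for overflow, divide by zero
 * or invalid, i.e. whenever the flag value is greater than 3. */
static int trap_common(int fflags) { return fflags > 3; }
```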

Consider this C program (Code 1).

```c
#include <stdio.h>

int main()
{
	float a, b, c;
	a = 4.5;
	b = 0;
	c = a / b;
	printf("%f\n", c);
	return 0;
}
```
Example program to demonstrate division by zero

When we compile the program with fp-trap=all, part of the generated assembly is shown in Code 3.

```
main:
	addi	sp,sp,-32
	sd	ra,24(sp)
	sd	s0,16(sp)
	addi	s0,sp,32
	lui	a5,%hi(.LC0)
	flw	fa5,%lo(.LC0)(a5)
	fsw	fa5,-20(s0)
	sw	zero,-24(s0)
	flw	fa4,-20(s0)
	flw	fa5,-24(s0)
	fdiv.s	fa5,fa4,fa5
	frflags	a5
	beqz	a5, 1f
	ebreak
1:
	fsw	fa5,-28(s0)
	flw	fa5,-28(s0)
	fcvt.d.s	fa5,fa5
	fmv.x.d	a1,fa5
	lui	a5,%hi(.LC1)
	addi	a0,a5,%lo(.LC1)
	call	printf
	li	a5,0
	mv	a0,a5
	ld	ra,24(sp)
	ld	s0,16(sp)
	addi	sp,sp,32
	jr	ra
```
Some Assembly instructions when Code 1 is compiled using fp-trap=all

When we compile the program with fp-trap=common, part of the generated assembly is shown in Code 4.

```
main:
	addi	sp,sp,-32
	sd	ra,24(sp)
	sd	s0,16(sp)
	addi	s0,sp,32
	lui	a5,%hi(.LC0)
	flw	fa5,%lo(.LC0)(a5)
	fsw	fa5,-20(s0)
	sw	zero,-24(s0)
	flw	fa4,-20(s0)
	flw	fa5,-24(s0)
	fdiv.s	fa5,fa4,fa5
	frflags	a5
	li	a4, 3
	ble	a5, a4, 1f
	ebreak
1:
	fsw	fa5,-28(s0)
	flw	fa5,-28(s0)
	fcvt.d.s	fa5,fa5
	fmv.x.d	a1,fa5
	lui	a5,%hi(.LC1)
	addi	a0,a5,%lo(.LC1)
	call	printf
	li	a5,0
	mv	a0,a5
	ld	ra,24(sp)
	ld	s0,16(sp)
	addi	sp,sp,32
	jr	ra
```
Some Assembly instructions when Code 1 is compiled using fp-trap=common

When we run the program on Spike, the ebreak instruction is executed and the output in Code 5 is displayed in the terminal window.

```
bbl loader
z  0000000000000000 ra 0000000000010108 sp 000000007f7e9b30 gp 000000000001e700
tp 0000000000000000 t0 0000000000000000 t1 000000000000000f t2 0000000000000000
s0 000000007f7e9b50 s1 0000000000000000 a0 0000000000000001 a1 000000007f7e9b58
a2 0000000000000000 a3 0000000000000010 a4 0000000000000001 a5 0000000000000008
a6 000000000000001f a7 0000000000000000 s2 0000000000000000 s3 0000000000000000
s4 0000000000000000 s5 0000000000000000 s6 0000000000000000 s7 0000000000000000
s8 0000000000000000 s9 0000000000000000 sA 0000000000000000 sB 0000000000000000
t3 0000000000000000 t4 0000000000000000 t5 0000000000000000 t6 0000000000000000
pc 00000000000101de va 00000000000101de insn       ffffffff sr 8000000200046020
Breakpoint!
z  0000000000000000 ra 0000000000010108 sp 000000007f7e9b30 gp 000000000001e700
tp 0000000000000000 t0 0000000000000000 t1 000000000000000f t2 0000000000000000
s0 000000007f7e9b50 s1 0000000000000000 a0 0000000000000001 a1 000000007f7e9b58
a2 0000000000000000 a3 0000000000000010 a4 0000000000000001 a5 0000000000000008
a6 000000000000001f a7 0000000000000000 s2 0000000000000000 s3 0000000000000000
s4 0000000000000000 s5 0000000000000000 s6 0000000000000000 s7 0000000000000000
s8 0000000000000000 s9 0000000000000000 sA 0000000000000000 sB 0000000000000000
t3 0000000000000000 t4 0000000000000000 t5 0000000000000000 t6 0000000000000000
pc 00000000000101e2 va 0000000000000108 insn       ffffffff sr 8000000200046020
User store segfault @ 0x0000000000000108
```
Error message when ebreak instruction is encountered

## Whetstone Benchmark

We compared the performance of our modified riscv-gcc with traps disabled against the same compiler with the fp-trap=common option.

We used the C version of the program; see Appendix A for the code of the Whetstone benchmark used. The program was executed on the Spike simulator.

The value of n, which denotes the number of loops, was set to 100000. The results are expressed in Million Whetstone Instructions Per Second (MWIPS).

We observed that the performance gap between riscv-gcc with fp-trap=common and riscv-gcc with traps disabled increased with the optimization level. However, at the O3 optimization level, the performance with fp-trap=common was equal to that with traps disabled.

## No optimization

There was an 8.8% decrease in MWIPS when fp-trap=common was used.

No Optimization

## O1 optimization

There was a 33% decrease in MWIPS when fp-trap=common was used.

O1 Optimization

## O2 optimization

There was a 50% decrease in MWIPS when fp-trap=common was used. This is a significant decrease in performance.

O2 Optimization

## O3 optimization

Interestingly, at O3 the performance did not depend on whether the fp-trap option was used.

O3 Optimization

## Os optimization

When Os optimization is used, GCC tries to optimize for size.

There was a 13.2% decrease in MWIPS when fp-trap=common was used.

## LINPACK Benchmark

We used a simple version of LINPACK benchmark. The program was written in C. See Appendix B.

The size of matrix used was 1000 x 1000. The results are expressed in Million Floating Point Operations Per Second.

We observed that the performance gap between riscv-gcc with fp-trap=common and riscv-gcc with traps disabled in most cases increased with the optimization level.

## No optimization

There was a 20% decrease in MFLOPS when fp-trap=common was used.

No Optimization

## O1 optimization

There was a 45.7% decrease in MFLOPS when fp-trap=common was used.

O1 Optimization

## O2 optimization

There was a 31.5% decrease in MFLOPS when fp-trap=common was used.

O2 Optimization

## O3 optimization

There was a 33% decrease in MFLOPS when fp-trap=common was used.

O3 Optimization

## Os optimization

There was a 26% decrease in MFLOPS when fp-trap=common was used.

Os Optimization

## Executable Size Comparison

We compared the sizes of the executables generated by compiling the Whetstone benchmark program. The comparison was performed between riscv-gcc with traps disabled, riscv-gcc with fp-trap=common, and riscv-gcc with fp-trap=all.

The executable size was almost the same in all three cases.

The executable obtained with fp-trap=common was 1.6% larger than the one obtained with traps disabled. The executable obtained with fp-trap=all was 1.1% larger than the one obtained with traps disabled.

Executable size comparison

## CONCLUSION

RISC-V is the ISA of the future. RISC-V does not support traps on floating point exceptions. In this work, we demonstrated that the GCC backend can be modified to detect floating point exceptions after certain operations by modifying its machine description files. We compared how enabling traps affects execution speed and executable size using the Whetstone and LINPACK floating point benchmarks at different optimization levels. With no optimization, we observed an 8.8% decrease in MWIPS for Whetstone and a 20% decrease in MFLOPS for LINPACK. The executable size was not affected much by enabling traps.

## ACKNOWLEDGEMENTS

I am grateful to Almighty God for every blessing in my life.

I express my sincere thanks and gratitude to Professor V. Kamakoti. It was a privilege to have him as a guide.

I am grateful to my mentor, Sathya Narayan. He was supportive during the whole period. Working under him has been a rewarding experience.

Special thanks to Shankar Raman sir for his comments and Ideas. I also extend my gratitude to all members of RISE lab.

I would like to thank the Indian Academy of Sciences for providing me the Summer Research Fellowship, which was a great opportunity for me to get involved in research and learn many new things.

I am grateful to my family for helping me through all.

## APPENDICES

Appendix A

C program for Whetstone benchmark

/* * C Converted Whetstone Double Precision Benchmark *		Version 1.2	22 March 1998 * *	(c) Copyright 1998 Painter Engineering, Inc. *		All Rights Reserved. * *		Permission is granted to use, duplicate, and *		publish this text and program as long as it *		includes this entire comment block and limited *		rights reference. * * Converted by Rich Painter, Painter Engineering, Inc. based on the * www.netlib.org benchmark/whetstoned version obtained 16 March 1998. * * A novel approach was used here to keep the look and feel of the * FORTRAN version.  Altering the FORTRAN-based array indices, * starting at element 1, to start at element 0 for C, would require * numerous changes, including decrementing the variable indices by 1. * Instead, the array E1[] was declared 1 element larger in C.  This * allows the FORTRAN index range to function without any literal or * variable indices changes.  The array element E1[0] is simply never * used and does not alter the benchmark results. * * The major FORTRAN comment blocks were retained to minimize * differences between versions.  Modules N5 and N12, like in the * FORTRAN version, have been eliminated here. * * An optional command-line argument has been provided [-c] to * offer continuous repetition of the entire benchmark. * An optional argument for setting an alternate LOOP count is also * provided.  Define PRINTOUT to cause the POUT() function to print * outputs at various stages.  Final timing measurements should be * made with the PRINTOUT undefined. 
* * Questions and comments may be directed to the author at *			r.painter@ieee.org *//*C**********************************************************************C     Benchmark #2 -- Double  Precision Whetstone (A001)CC     o	This is a REAL*8 version ofC	the Whetstone benchmark program.CC     o	DO-loop semantics are ANSI-66 compatible.CC     o	Final measurements are to be made with allC	WRITE statements and FORMAT sttements removed.CC**********************************************************************   *//* standard C library headers required */#include <stdlib.h>#include <stdio.h>#include <string.h>#include <math.h>/* the following is optional depending on the timing function used */#include <time.h>/* map the FORTRAN math functions, etc. to the C versions */#define DSIN	sin#define DCOS	cos#define DATAN	atan#define DLOG	log#define DEXP	exp#define DSQRT	sqrt#define IF		if/* function prototypes */void POUT(long N, long J, long K, double X1, double X2, double X3, double X4);void PA(double E[]);void P0(void);void P3(double X, double Y, double *Z);#define USAGE	"usage: whetdc [-c] [loops]\n"/*	COMMON T,T1,T2,E1(4),J,K,L*/double T,T1,T2,E1[5];int J,K,L;intmain(int argc, char *argv[]){	/* used in the FORTRAN version */	long I;	long N1, N2, N3, N4, N6, N7, N8, N9, N10, N11;	double X1,X2,X3,X4,X,Y,Z;	long LOOP;	int II, JJ;	/* added for this version */	long loopstart;	long startsec, finisec;	float KIPS;	int continuous;	loopstart = 1000;		/* see the note about LOOP below */	continuous = 0;	II = 1;		/* start at the first arg (temp use of II here) */	while (II < argc) {		if (strncmp(argv[II], "-c", 2) == 0 || argv[II][0] == 'c') {			continuous = 1;		} else if (atol(argv[II]) > 0) {			loopstart = atol(argv[II]);		} else {			fprintf(stderr, USAGE);			return(1);		}		II++;	}LCONT:/*CC	Start benchmark timing at this point.C*/	startsec = time(0);/*CC	The actual benchmark starts here.C*/	T  = .499975;	T1 = 0.50025;	T2 = 2.0;/*CC	With loopcount LOOP=10, one million Whetstone 
instructionsC	will be executed in EACH MAJOR LOOP..A MAJOR LOOP IS EXECUTEDC	'II' TIMES TO INCREASE WALL-CLOCK TIMING ACCURACY.C	LOOP = 1000;*/	LOOP = loopstart;	II   = 1;	JJ = 1;IILOOP:	N1  = 0;	N2  = 12 * LOOP;	N3  = 14 * LOOP;	N4  = 345 * LOOP;	N6  = 210 * LOOP;	N7  = 32 * LOOP;	N8  = 899 * LOOP;	N9  = 616 * LOOP;	N10 = 0;	N11 = 93 * LOOP;/*CC	Module 1: Simple identifiersC*/	X1  =  1.0;	X2  = -1.0;	X3  = -1.0;	X4  = -1.0;	for (I = 1; I <= N1; I++) {	    X1 = (X1 + X2 + X3 - X4) * T;	    X2 = (X1 + X2 - X3 + X4) * T;	    X3 = (X1 - X2 + X3 + X4) * T;	    X4 = (-X1+ X2 + X3 + X4) * T;	}#ifdef PRINTOUT	IF (JJ==II)POUT(N1,N1,N1,X1,X2,X3,X4);#endif/*CC	Module 2: Array elementsC*/	E1[1] =  1.0;	E1[2] = -1.0;	E1[3] = -1.0;	E1[4] = -1.0;	for (I = 1; I <= N2; I++) {	    E1[1] = ( E1[1] + E1[2] + E1[3] - E1[4]) * T;	    E1[2] = ( E1[1] + E1[2] - E1[3] + E1[4]) * T;	    E1[3] = ( E1[1] - E1[2] + E1[3] + E1[4]) * T;	    E1[4] = (-E1[1] + E1[2] + E1[3] + E1[4]) * T;	}#ifdef PRINTOUT	IF (JJ==II)POUT(N2,N3,N2,E1[1],E1[2],E1[3],E1[4]);#endif/*CC	Module 3: Array as parameterC*/	for (I = 1; I <= N3; I++)		PA(E1);#ifdef PRINTOUT	IF (JJ==II)POUT(N3,N2,N2,E1[1],E1[2],E1[3],E1[4]);#endif/*CC	Module 4: Conditional jumpsC*/	J = 1;	for (I = 1; I <= N4; I++) {		if (J == 1)			J = 2;		else			J = 3;		if (J > 2)			J = 0;		else			J = 1;		if (J < 1)			J = 1;		else			J = 0;	}#ifdef PRINTOUT	IF (JJ==II)POUT(N4,J,J,X1,X2,X3,X4);#endif/*CC	Module 5: OmittedC 	Module 6: Integer arithmeticC*/	J = 1;	K = 2;	L = 3;	for (I = 1; I <= N6; I++) {	    J = J * (K-J) * (L-K);	    K = L * K - (L-J) * K;	    L = (L-K) * (K+J);	    E1[L-1] = J + K + L;	    E1[K-1] = J * K * L;	}#ifdef PRINTOUT	IF (JJ==II)POUT(N6,J,K,E1[1],E1[2],E1[3],E1[4]);#endif/*CC	Module 7: Trigonometric functionsC*/	X = 0.5;	Y = 0.5;	for (I = 1; I <= N7; I++) {		X = T * DATAN(T2*DSIN(X)*DCOS(X)/(DCOS(X+Y)+DCOS(X-Y)-1.0));		Y = T * DATAN(T2*DSIN(Y)*DCOS(Y)/(DCOS(X+Y)+DCOS(X-Y)-1.0));	}#ifdef PRINTOUT	IF 
(JJ==II)POUT(N7,J,K,X,X,Y,Y);#endif/*CC	Module 8: Procedure callsC*/	X = 1.0;	Y = 1.0;	Z = 1.0;	for (I = 1; I <= N8; I++)		P3(X,Y,&Z);#ifdef PRINTOUT	IF (JJ==II)POUT(N8,J,K,X,Y,Z,Z);#endif/*CC	Module 9: Array referencesC*/	J = 1;	K = 2;	L = 3;	E1[1] = 1.0;	E1[2] = 2.0;	E1[3] = 3.0;	for (I = 1; I <= N9; I++)		P0();#ifdef PRINTOUT	IF (JJ==II)POUT(N9,J,K,E1[1],E1[2],E1[3],E1[4]);#endif/*CC	Module 10: Integer arithmeticC*/	J = 2;	K = 3;	for (I = 1; I <= N10; I++) {	    J = J + K;	    K = J + K;	    J = K - J;	    K = K - J - J;	}#ifdef PRINTOUT	IF (JJ==II)POUT(N10,J,K,X1,X2,X3,X4);#endif/*CC	Module 11: Standard functionsC*/	X = 0.75;	for (I = 1; I <= N11; I++)		X = DSQRT(DEXP(DLOG(X)/T1));#ifdef PRINTOUT	IF (JJ==II)POUT(N11,J,K,X,X,X,X);#endif/*CC      THIS IS THE END OF THE MAJOR LOOP.C*/	if (++JJ <= II)		goto IILOOP;/*CC      Stop benchmark timing at this point.C*/	finisec = time(0);/*C----------------------------------------------------------------C      Performance in Whetstone KIP's per second is given byCC	(100*LOOP*II)/TIMECC      where TIME is in seconds.C--------------------------------------------------------------------*/	printf("\n");	if (finisec-startsec <= 0) {		printf("Insufficient duration- Increase the LOOP count\n");		return(1);	}	printf("Loops: %ld, Iterations: %d, Duration: %ld sec.\n",			LOOP, II, finisec-startsec);	KIPS = (100.0*LOOP*II)/(float)(finisec-startsec);	if (KIPS >= 1000.0)		printf("C Converted Double Precision Whetstones: %.1f MIPS\n", KIPS/1000.0);	else		printf("C Converted Double Precision Whetstones: %.1f KIPS\n", KIPS);	if (continuous)		goto LCONT;	return(0);}voidPA(double E[]){	J = 0;L10:	E[1] = ( E[1] + E[2] + E[3] - E[4]) * T;	E[2] = ( E[1] + E[2] - E[3] + E[4]) * T;	E[3] = ( E[1] - E[2] + E[3] + E[4]) * T;	E[4] = (-E[1] + E[2] + E[3] + E[4]) / T2;	J += 1;	if (J < 6)		goto L10;}voidP0(void){	E1[J] = E1[K];	E1[K] = E1[L];	E1[L] = E1[J];}voidP3(double X, double Y, double *Z){	double X1, Y1;	X1 = X;	Y1 = Y;	X1 = T * (X1 + Y1);	Y1 = 
T * (X1 + Y1);	*Z  = (X1 + Y1) / T2;}#ifdef PRINTOUTvoidPOUT(long N, long J, long K, double X1, double X2, double X3, double X4){	printf("%7ld %7ld %7ld %12.4e %12.4e %12.4e %12.4e\n",						N, J, K, X1, X2, X3, X4);}#endif
whetstone.c

Appendix B

C program for LINPACK Benchmark

/***** LINPACK.C        Linpack benchmark, calculates FLOPS.**                  (FLoating Point Operations Per Second)**** Translated to C by Bonnie Toy 5/88**** Modified by Will Menninger, 10/93, with these features:**  (modified on 2/25/94  to fix a problem with daxpy  for**   unequal increments or equal increments not equal to 1.**     Jack Dongarra)**** - Defaults to double precision.** - Averages ROLLed and UNROLLed performance.** - User selectable array sizes.** - Automatically does enough repetitions to take at least 10 CPU seconds.** - Prints machine precision.** - ANSI prototyping.**** To compile:  cc -O -o linpack linpack.c -lm*****/#include <stdio.h>#include <stdlib.h>#include <math.h>#include <time.h>#include <float.h>#define DP#ifdef SP#define ZERO        0.0#define ONE         1.0#define PREC        "Single"#define BASE10DIG   FLT_DIGtypedef float   REAL;#endif#ifdef DP#define ZERO        0.0e0#define ONE         1.0e0#define PREC        "Double"#define BASE10DIG   DBL_DIGtypedef double  REAL;#endifstatic REAL linpack  (long nreps,int arsize);static void matgen   (REAL *a,int lda,int n,REAL *b,REAL *norma);static void dgefa    (REAL *a,int lda,int n,int *ipvt,int *info,int roll);static void dgesl    (REAL *a,int lda,int n,int *ipvt,REAL *b,int job,int roll);static void daxpy_r  (int n,REAL da,REAL *dx,int incx,REAL *dy,int incy);static REAL ddot_r   (int n,REAL *dx,int incx,REAL *dy,int incy);static void dscal_r  (int n,REAL da,REAL *dx,int incx);static void daxpy_ur (int n,REAL da,REAL *dx,int incx,REAL *dy,int incy);static REAL ddot_ur  (int n,REAL *dx,int incx,REAL *dy,int incy);static void dscal_ur (int n,REAL da,REAL *dx,int incx);static int  idamax   (int n,REAL *dx,int incx);static REAL second   (void);static void *mempool;void main(void)    {    char    buf[80];    int     arsize;    long    arsize2d,memreq,nreps;    size_t  malloc_arg;    while (1)        {        printf("Enter array size (q to quit) [200]:  ");        fgets(buf,79,stdin);    
    if (buf[0]=='q' || buf[0]=='Q')            break;        if (buf[0]=='\0' || buf[0]=='\n')            arsize=200;        else            arsize=atoi(buf);        arsize/=2;        arsize*=2;        if (arsize<10)            {            printf("Too small.\n");            continue;            }        arsize2d = (long)arsize*(long)arsize;        memreq=arsize2d*sizeof(REAL)+(long)arsize*sizeof(REAL)+(long)arsize*sizeof(int);        printf("Memory required:  %ldK.\n",(memreq+512L)>>10);        malloc_arg=(size_t)memreq;        if (malloc_arg!=memreq || (mempool=malloc(malloc_arg))==NULL)            {            printf("Not enough memory available for given array size.\n\n");            continue;            }        printf("\n\nLINPACK benchmark, %s precision.\n",PREC);        printf("Machine precision:  %d digits.\n",BASE10DIG);        printf("Array size %d X %d.\n",arsize,arsize);        printf("Average rolled and unrolled performance:\n\n");        printf("    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS\n");        printf("----------------------------------------------------\n");        nreps=1;        while (linpack(nreps,arsize)<10.)            
```c
        nreps*=2;
        free(mempool);
        printf("\n");
        }
    }

static REAL linpack(long nreps,int arsize)
    {
    REAL  *a,*b;
    REAL   norma,t1,kflops,tdgesl,tdgefa,totalt,toverhead,ops;
    int   *ipvt,n,info,lda;
    long   i,arsize2d;

    lda = arsize;
    n = arsize/2;
    arsize2d = (long)arsize*(long)arsize;
    ops=((2.0*n*n*n)/3.0+2.0*n*n);
    a=(REAL *)mempool;
    b=a+arsize2d;
    ipvt=(int *)&b[arsize];
    tdgesl=0;
    tdgefa=0;
    totalt=second();
    for (i=0;i<nreps;i++)
        {
        matgen(a,lda,n,b,&norma);
        t1 = second();
        dgefa(a,lda,n,ipvt,&info,1);
        tdgefa += second()-t1;
        t1 = second();
        dgesl(a,lda,n,ipvt,b,0,1);
        tdgesl += second()-t1;
        }
    for (i=0;i<nreps;i++)
        {
        matgen(a,lda,n,b,&norma);
        t1 = second();
        dgefa(a,lda,n,ipvt,&info,0);
        tdgefa += second()-t1;
        t1 = second();
        dgesl(a,lda,n,ipvt,b,0,0);
        tdgesl += second()-t1;
        }
    totalt=second()-totalt;
    if (totalt<0.5 || tdgefa+tdgesl<0.2)
        return(0.);
    kflops=2.*nreps*ops/(1000.*(tdgefa+tdgesl));
    toverhead=totalt-tdgefa-tdgesl;
    if (tdgefa<0.)
        tdgefa=0.;
    if (tdgesl<0.)
        tdgesl=0.;
    if (toverhead<0.)
        toverhead=0.;
    printf("%8ld %6.2f %6.2f%% %6.2f%% %6.2f%%  %9.3f\n",
            nreps,totalt,100.*tdgefa/totalt,
            100.*tdgesl/totalt,100.*toverhead/totalt,
            kflops);
    return(totalt);
    }

/*
** For matgen,
** We would like to declare a[][lda], but c does not allow it.  In this
** function, references to a[i][j] are written a[lda*i+j].
*/
static void matgen(REAL *a,int lda,int n,REAL *b,REAL *norma)
    {
    int init,i,j;

    init = 1325;
    *norma = 0.0;
    for (j = 0; j < n; j++)
        for (i = 0; i < n; i++)
            {
            init = (int)((long)3125*(long)init % 65536L);
            a[lda*j+i] = (init - 32768.0)/16384.0;
            *norma = (a[lda*j+i] > *norma) ? a[lda*j+i] : *norma;
            }
    for (i = 0; i < n; i++)
        b[i] = 0.0;
    for (j = 0; j < n; j++)
        for (i = 0; i < n; i++)
            b[i] = b[i] + a[lda*j+i];
    }

/*
**
** DGEFA benchmark
**
** We would like to declare a[][lda], but c does not allow it.  In this
** function, references to a[i][j] are written a[lda*i+j].
**
**   dgefa factors a double precision matrix by gaussian elimination.
**
**   dgefa is usually called by dgeco, but it can be called
**   directly with a saving in time if  rcond  is not needed.
**   (time for dgeco) = (1 + 9/n)*(time for dgefa) .
**
**   on entry
**
**      a       REAL precision[n][lda]
**              the matrix to be factored.
**
**      lda     integer
**              the leading dimension of the array  a .
**
**      n       integer
**              the order of the matrix  a .
**
**   on return
**
**      a       an upper triangular matrix and the multipliers
**              which were used to obtain it.
**              the factorization can be written  a = l*u  where
**              l  is a product of permutation and unit lower
**              triangular matrices and  u  is upper triangular.
**
**      ipvt    integer[n]
**              an integer vector of pivot indices.
**
**      info    integer
**              = 0  normal value.
**              = k  if  u[k][k] .eq. 0.0 .  this is not an error
**                   condition for this subroutine, but it does
**                   indicate that dgesl or dgedi will divide by zero
**                   if called.  use  rcond  in dgeco for a reliable
**                   indication of singularity.
**
**   linpack. this version dated 08/14/78 .
**   cleve moler, university of new mexico, argonne national lab.
**
**   functions
**
**   blas daxpy,dscal,idamax
**
*/
static void dgefa(REAL *a,int lda,int n,int *ipvt,int *info,int roll)
    {
    REAL t;
    int idamax(),j,k,kp1,l,nm1;

    /* gaussian elimination with partial pivoting */

    if (roll)
        {
        *info = 0;
        nm1 = n - 1;
        if (nm1 >=  0)
            for (k = 0; k < nm1; k++)
                {
                kp1 = k + 1;

                /* find l = pivot index */

                l = idamax(n-k,&a[lda*k+k],1) + k;
                ipvt[k] = l;

                /* zero pivot implies this column already
                   triangularized */

                if (a[lda*k+l] != ZERO)
                    {
                    /* interchange if necessary */

                    if (l != k)
                        {
                        t = a[lda*k+l];
                        a[lda*k+l] = a[lda*k+k];
                        a[lda*k+k] = t;
                        }

                    /* compute multipliers */

                    t = -ONE/a[lda*k+k];
                    dscal_r(n-(k+1),t,&a[lda*k+k+1],1);

                    /* row elimination with column indexing */

                    for (j = kp1; j < n; j++)
                        {
                        t = a[lda*j+l];
                        if (l != k)
                            {
                            a[lda*j+l] = a[lda*j+k];
                            a[lda*j+k] = t;
                            }
                        daxpy_r(n-(k+1),t,&a[lda*k+k+1],1,&a[lda*j+k+1],1);
                        }
                    }
                else
                    (*info) = k;
                }
        ipvt[n-1] = n-1;
        if (a[lda*(n-1)+(n-1)] == ZERO)
            (*info) = n-1;
        }
    else
        {
        *info = 0;
        nm1 = n - 1;
        if (nm1 >=  0)
            for (k = 0; k < nm1; k++)
                {
                kp1 = k + 1;

                /* find l = pivot index */

                l = idamax(n-k,&a[lda*k+k],1) + k;
                ipvt[k] = l;

                /* zero pivot implies this column already
                   triangularized */

                if (a[lda*k+l] != ZERO)
                    {
                    /* interchange if necessary */

                    if (l != k)
                        {
                        t = a[lda*k+l];
                        a[lda*k+l] = a[lda*k+k];
                        a[lda*k+k] = t;
                        }

                    /* compute multipliers */

                    t = -ONE/a[lda*k+k];
                    dscal_ur(n-(k+1),t,&a[lda*k+k+1],1);

                    /* row elimination with column indexing */

                    for (j = kp1; j < n; j++)
                        {
                        t = a[lda*j+l];
                        if (l != k)
                            {
                            a[lda*j+l] = a[lda*j+k];
                            a[lda*j+k] = t;
                            }
                        daxpy_ur(n-(k+1),t,&a[lda*k+k+1],1,&a[lda*j+k+1],1);
                        }
                    }
                else
                    (*info) = k;
                }
        ipvt[n-1] = n-1;
        if (a[lda*(n-1)+(n-1)] == ZERO)
            (*info) = n-1;
        }
    }

/*
**
** DGESL benchmark
**
** We would like to declare a[][lda], but c does not allow it.  In this
** function, references to a[i][j] are written a[lda*i+j].
**
**   dgesl solves the double precision system
**   a * x = b  or  trans(a) * x = b
**   using the factors computed by dgeco or dgefa.
**
**   on entry
**
**      a       double precision[n][lda]
**              the output from dgeco or dgefa.
**
**      lda     integer
**              the leading dimension of the array  a .
**
**      n       integer
**              the order of the matrix  a .
**
**      ipvt    integer[n]
**              the pivot vector from dgeco or dgefa.
**
**      b       double precision[n]
**              the right hand side vector.
**
**      job     integer
**              = 0         to solve  a*x = b ,
**              = nonzero   to solve  trans(a)*x = b  where
**                          trans(a)  is the transpose.
**
**  on return
**
**      b       the solution vector  x .
**
**   error condition
**
**      a division by zero will occur if the input factor contains a
**      zero on the diagonal.  technically this indicates singularity
**      but it is often caused by improper arguments or improper
**      setting of lda .  it will not occur if the subroutines are
**      called correctly and if dgeco has set rcond .gt. 0.0
**      or dgefa has set info .eq. 0 .
**
**   to compute  inverse(a) * c  where  c  is a matrix
**   with  p  columns
**         dgeco(a,lda,n,ipvt,rcond,z)
**         if (!rcond is too small){
**              for (j=0,j<p,j++)
**                      dgesl(a,lda,n,ipvt,c[j][0],0);
**         }
**
**   linpack. this version dated 08/14/78 .
**   cleve moler, university of new mexico, argonne national lab.
**
**   functions
**
**   blas daxpy,ddot
*/
static void dgesl(REAL *a,int lda,int n,int *ipvt,REAL *b,int job,int roll)
    {
    REAL    t;
    int     k,kb,l,nm1;

    if (roll)
        {
        nm1 = n - 1;
        if (job == 0)
            {
            /* job = 0 , solve  a * x = b   */
            /* first solve  l*y = b         */

            if (nm1 >= 1)
                for (k = 0; k < nm1; k++)
                    {
                    l = ipvt[k];
                    t = b[l];
                    if (l != k)
                        {
                        b[l] = b[k];
                        b[k] = t;
                        }
                    daxpy_r(n-(k+1),t,&a[lda*k+k+1],1,&b[k+1],1);
                    }

            /* now solve  u*x = y */

            for (kb = 0; kb < n; kb++)
                {
                k = n - (kb + 1);
                b[k] = b[k]/a[lda*k+k];
                t = -b[k];
                daxpy_r(k,t,&a[lda*k+0],1,&b[0],1);
                }
            }
        else
            {
            /* job = nonzero, solve  trans(a) * x = b  */
            /* first solve  trans(u)*y = b             */

            for (k = 0; k < n; k++)
                {
                t = ddot_r(k,&a[lda*k+0],1,&b[0],1);
                b[k] = (b[k] - t)/a[lda*k+k];
                }

            /* now solve trans(l)*x = y     */

            if (nm1 >= 1)
                for (kb = 1; kb < nm1; kb++)
                    {
                    k = n - (kb+1);
                    b[k] = b[k] + ddot_r(n-(k+1),&a[lda*k+k+1],1,&b[k+1],1);
                    l = ipvt[k];
                    if (l != k)
                        {
                        t = b[l];
                        b[l] = b[k];
                        b[k] = t;
                        }
                    }
            }
        }
    else
        {
        nm1 = n - 1;
        if (job == 0)
            {
            /* job = 0 , solve  a * x = b   */
            /* first solve  l*y = b         */

            if (nm1 >= 1)
                for (k = 0; k < nm1; k++)
                    {
                    l = ipvt[k];
                    t = b[l];
                    if (l != k)
                        {
                        b[l] = b[k];
                        b[k] = t;
                        }
                    daxpy_ur(n-(k+1),t,&a[lda*k+k+1],1,&b[k+1],1);
                    }

            /* now solve  u*x = y */

            for (kb = 0; kb < n; kb++)
                {
                k = n - (kb + 1);
                b[k] = b[k]/a[lda*k+k];
                t = -b[k];
                daxpy_ur(k,t,&a[lda*k+0],1,&b[0],1);
                }
            }
        else
            {
            /* job = nonzero, solve  trans(a) * x = b  */
            /* first solve  trans(u)*y = b             */

            for (k = 0; k < n; k++)
                {
                t = ddot_ur(k,&a[lda*k+0],1,&b[0],1);
                b[k] = (b[k] - t)/a[lda*k+k];
                }

            /* now solve trans(l)*x = y     */

            if (nm1 >= 1)
                for (kb = 1; kb < nm1; kb++)
                    {
                    k = n - (kb+1);
                    b[k] = b[k] + ddot_ur(n-(k+1),&a[lda*k+k+1],1,&b[k+1],1);
                    l = ipvt[k];
                    if (l != k)
                        {
                        t = b[l];
                        b[l] = b[k];
                        b[k] = t;
                        }
                    }
            }
        }
    }

/*
** Constant times a vector plus a vector.
** Jack Dongarra, linpack, 3/11/78.
** ROLLED version
*/
static void daxpy_r(int n,REAL da,REAL *dx,int incx,REAL *dy,int incy)
    {
    int i,ix,iy;

    if (n <= 0)
        return;
    if (da == ZERO)
        return;
    if (incx != 1 || incy != 1)
        {
        /* code for unequal increments or equal increments != 1 */

        ix = 1;
        iy = 1;
        if (incx < 0) ix = (-n+1)*incx + 1;
        if (incy < 0) iy = (-n+1)*incy + 1;
        for (i = 0;i < n; i++)
            {
            dy[iy] = dy[iy] + da*dx[ix];
            ix = ix + incx;
            iy = iy + incy;
            }
        return;
        }

    /* code for both increments equal to 1 */

    for (i = 0;i < n; i++)
        dy[i] = dy[i] + da*dx[i];
    }

/*
** Forms the dot product of two vectors.
** Jack Dongarra, linpack, 3/11/78.
** ROLLED version
*/
static REAL ddot_r(int n,REAL *dx,int incx,REAL *dy,int incy)
    {
    REAL dtemp;
    int i,ix,iy;

    dtemp = ZERO;
    if (n <= 0)
        return(ZERO);
    if (incx != 1 || incy != 1)
        {
        /* code for unequal increments or equal increments != 1 */

        ix = 0;
        iy = 0;
        if (incx < 0) ix = (-n+1)*incx;
        if (incy < 0) iy = (-n+1)*incy;
        for (i = 0;i < n; i++)
            {
            dtemp = dtemp + dx[ix]*dy[iy];
            ix = ix + incx;
            iy = iy + incy;
            }
        return(dtemp);
        }

    /* code for both increments equal to 1 */

    for (i=0;i < n; i++)
        dtemp = dtemp + dx[i]*dy[i];
    return(dtemp);
    }

/*
** Scales a vector by a constant.
** Jack Dongarra, linpack, 3/11/78.
** ROLLED version
*/
static void dscal_r(int n,REAL da,REAL *dx,int incx)
    {
    int i,nincx;

    if (n <= 0)
        return;
    if (incx != 1)
        {
        /* code for increment not equal to 1 */

        nincx = n*incx;
        for (i = 0; i < nincx; i = i + incx)
            dx[i] = da*dx[i];
        return;
        }

    /* code for increment equal to 1 */

    for (i = 0; i < n; i++)
        dx[i] = da*dx[i];
    }

/*
** Constant times a vector plus a vector.
** Jack Dongarra, linpack, 3/11/78.
** UNROLLED version
*/
static void daxpy_ur(int n,REAL da,REAL *dx,int incx,REAL *dy,int incy)
    {
    int i,ix,iy,m;

    if (n <= 0)
        return;
    if (da == ZERO)
        return;
    if (incx != 1 || incy != 1)
        {
        /* code for unequal increments or equal increments != 1 */

        ix = 1;
        iy = 1;
        if (incx < 0) ix = (-n+1)*incx + 1;
        if (incy < 0) iy = (-n+1)*incy + 1;
        for (i = 0;i < n; i++)
            {
            dy[iy] = dy[iy] + da*dx[ix];
            ix = ix + incx;
            iy = iy + incy;
            }
        return;
        }

    /* code for both increments equal to 1 */

    m = n % 4;
    if ( m != 0)
        {
        for (i = 0; i < m; i++)
            dy[i] = dy[i] + da*dx[i];
        if (n < 4)
            return;
        }
    for (i = m; i < n; i = i + 4)
        {
        dy[i] = dy[i] + da*dx[i];
        dy[i+1] = dy[i+1] + da*dx[i+1];
        dy[i+2] = dy[i+2] + da*dx[i+2];
        dy[i+3] = dy[i+3] + da*dx[i+3];
        }
    }

/*
** Forms the dot product of two vectors.
** Jack Dongarra, linpack, 3/11/78.
** UNROLLED version
*/
static REAL ddot_ur(int n,REAL *dx,int incx,REAL *dy,int incy)
    {
    REAL dtemp;
    int i,ix,iy,m;

    dtemp = ZERO;
    if (n <= 0)
        return(ZERO);
    if (incx != 1 || incy != 1)
        {
        /* code for unequal increments or equal increments != 1 */

        ix = 0;
        iy = 0;
        if (incx < 0) ix = (-n+1)*incx;
        if (incy < 0) iy = (-n+1)*incy;
        for (i = 0;i < n; i++)
            {
            dtemp = dtemp + dx[ix]*dy[iy];
            ix = ix + incx;
            iy = iy + incy;
            }
        return(dtemp);
        }

    /* code for both increments equal to 1 */

    m = n % 5;
    if (m != 0)
        {
        for (i = 0; i < m; i++)
            dtemp = dtemp + dx[i]*dy[i];
        if (n < 5)
            return(dtemp);
        }
    for (i = m; i < n; i = i + 5)
        {
        dtemp = dtemp + dx[i]*dy[i] +
        dx[i+1]*dy[i+1] + dx[i+2]*dy[i+2] +
        dx[i+3]*dy[i+3] + dx[i+4]*dy[i+4];
        }
    return(dtemp);
    }

/*
** Scales a vector by a constant.
** Jack Dongarra, linpack, 3/11/78.
** UNROLLED version
*/
static void dscal_ur(int n,REAL da,REAL *dx,int incx)
    {
    int i,m,nincx;

    if (n <= 0)
        return;
    if (incx != 1)
        {
        /* code for increment not equal to 1 */

        nincx = n*incx;
        for (i = 0; i < nincx; i = i + incx)
            dx[i] = da*dx[i];
        return;
        }

    /* code for increment equal to 1 */

    m = n % 5;
    if (m != 0)
        {
        for (i = 0; i < m; i++)
            dx[i] = da*dx[i];
        if (n < 5)
            return;
        }
    for (i = m; i < n; i = i + 5)
        {
        dx[i] = da*dx[i];
        dx[i+1] = da*dx[i+1];
        dx[i+2] = da*dx[i+2];
        dx[i+3] = da*dx[i+3];
        dx[i+4] = da*dx[i+4];
        }
    }

/*
** Finds the index of element having max. absolute value.
** Jack Dongarra, linpack, 3/11/78.
*/
static int idamax(int n,REAL *dx,int incx)
    {
    REAL dmax;
    int i, ix, itemp;

    if (n < 1)
        return(-1);
    if (n == 1)
        return(0);
    if (incx != 1)
        {
        /* code for increment not equal to 1 */

        ix = 1;
        dmax = fabs((double)dx[0]);
        ix = ix + incx;
        for (i = 1; i < n; i++)
            {
            if (fabs((double)dx[ix]) > dmax)
                {
                itemp = i;
                dmax = fabs((double)dx[ix]);
                }
            ix = ix + incx;
            }
        }
    else
        {
        /* code for increment equal to 1 */

        itemp = 0;
        dmax = fabs((double)dx[0]);
        for (i = 1; i < n; i++)
            if (fabs((double)dx[i]) > dmax)
                {
                itemp = i;
                dmax = fabs((double)dx[i]);
                }
        }
    return (itemp);
    }

static REAL second(void)
    {
    return ((REAL)((REAL)clock()/(REAL)CLOCKS_PER_SEC));
    }
```

linpack.c

## BIBLIOGRAPHY

• Null, L., & Lobur, J. (2015). The essentials of computer organization and architecture (4th ed.). Burlington, MA: Jones & Bartlett Learning, p. 554.

• Walfridsson, K. (2017, August 13). Getting started with a GCC back end [Blog post]. Retrieved from https://kristerw.blogspot.com/2017/08/getting-started-with-gcc-back-end.html

• Waterman, A., Lee, Y., Patterson, D. A., & Asanovic, K. (2014). The RISC-V instruction set manual, volume I: User-level ISA. CS Division, EECS Department, University of California, Berkeley.

• Wicht, B. (2013). Cache-friendly profile guided optimization. doi:10.13140/RG.2.2.23693.74723
