Crusoe Update: Linux Benchmarks
15 Sep 2021 - faintshadows
The final entry in 2021’s Crusoe saga.
Debian 8 is the last Debian release that works with i586 CPUs, and thankfully it works on the Crusoe with only minor stability issues. Namely, if you have X running and try to run apt, it freezes. I don’t know why, but I’ll just install things before starting X, no big deal.
I was about to put the Crusoe back into hibernation, but I wanted to check some benchmarks under Linux first, since they likely won’t need SSE, or at least could be compiled to not use it.
I grabbed all benchmarks from https://linux-sunxi.org/Benchmarks.
LINPACK
foxpro@crusoe:~$ ./linpack
Enter array size (q to quit) [200]:
Memory required: 315K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.66 83.32% 2.78% 13.90% 155170.141
128 1.32 83.33% 2.77% 13.90% 155020.165
256 2.63 83.32% 2.77% 13.90% 155095.938
512 5.27 83.34% 2.77% 13.88% 155007.179
1024 10.53 83.32% 2.77% 13.91% 155128.325
I compiled it with cc -Ofast -o linpack linpack.c -lm -march=i586 -fomit-frame-pointer -mpreferred-stack-boundary=2 -falign-functions=0 -falign-jumps=0 -falign-loops=0, because of course I would. This is what Gentoo users recommended for the Crusoe back in the day, and there was actually a ~3% decrease in performance without all those flags, so what the hell, sure.
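“But faint, those flags are kinda pointless.” OK, but when you look at how the Crusoe works under the hood, those last three flags arguably help, because the Crusoe does its own re-aligning of code, so why have the compiler do it? One caveat: GCC documents an alignment value of 0 as “use the machine-dependent default”, so it is -falign-functions=1 (or -fno-align-functions) and friends that actually switch alignment off.
It’s not like I’m doing -funroll-all-loops.
Fine, I’ll run LINPACK without the flags and you can see the difference.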
foxpro@crusoe:~/bench$ cc -Ofast -o linpack linpack.c -lm
foxpro@crusoe:~/bench$ ./linpack
Enter array size (q to quit) [200]:
Memory required: 315K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.70 79.80% 2.70% 17.50% 152346.352
128 1.41 79.69% 2.77% 17.54% 151261.439
256 2.78 79.90% 2.58% 17.53% 153236.933
512 5.56 79.92% 2.58% 17.50% 153195.100
1024 11.14 79.73% 2.58% 17.69% 153330.947
See!! It’s slower!
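To put a number on that, the KFLOPS columns of the two runs above can be averaged and compared. A minimal Python sketch follows; the variable names are mine and the figures are copied straight from the output above, so this only describes this one pair of runs.

```python
# KFLOPS figures copied from the two LINPACK runs above.
tuned = [155170.141, 155020.165, 155095.938, 155007.179, 155128.325]
plain = [152346.352, 151261.439, 153236.933, 153195.100, 153330.947]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


tuned_avg = mean(tuned)
plain_avg = mean(plain)

# Relative drop in throughput when the extra flags are left out.
kflops_drop_pct = (tuned_avg - plain_avg) / tuned_avg * 100

# Wall time of the 1024-rep pass, which also includes benchmark overhead.
time_increase_pct = (11.14 - 10.53) / 10.53 * 100

print(f"tuned avg: {tuned_avg:.0f} KFLOPS, plain avg: {plain_avg:.0f} KFLOPS")
print(f"KFLOPS drop without flags: {kflops_drop_pct:.1f}%")
print(f"1024-rep time increase:    {time_increase_pct:.1f}%")
```

For these particular runs the KFLOPS gap comes out at around 1.5% and the wall-time gap at a little under 6%, so the ~3% quoted earlier is best read as a ballpark figure.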
Anyways, all benchmarks will be compiled with those flags to keep it even. If people don’t like that I guess I can re-run them later without the fun flags.
DHRYSTONE
Compiled with gcc dhry1.c cpuidc.o cpuida.o -lrt -lc -lm -march=i586 -fomit-frame-pointer -mpreferred-stack-boundary=2 -falign-functions=0 -falign-jumps=0 -falign-loops=0 -o dhry1
OK, I said I’d use all the same flags, but the instructions for this one call for a build with no optimizations, then -O2, then -O3, so I did that. The rest of the flags are the same.
Outputs are truncated after the first run for brevity.
1.1 -O0
####################################################
getDetails and MHz
Assembler CPUID and RDTSC
CPU GenuineTMx86, Features Code 0080893F, Model Code 00000543
Transmeta(tm) Crusoe(tm) Processor TM5800
Measured - Minimum 793 MHz, Maximum 793 MHz
Linux Functions
get_nprocs() - CPUs 1, Configured CPUs 1
get_phys_pages() and size - RAM Size 0.23 GB, Page Size 4096 Bytes
uname() - Linux, crusoe, 3.16.0-6-586
#1 Debian 3.16.56-1+deb8u1 (2018-05-08), i586
##########################################
Dhrystone Benchmark, Version 1.1 (Language: C or C++)
Optimisation No Opt
10000 runs 0.02 seconds
100000 runs 0.10 seconds
200000 runs 0.20 seconds
400000 runs 0.40 seconds
800000 runs 0.80 seconds
1600000 runs 1.61 seconds
3200000 runs 3.24 seconds
Array2Glob8/7: O.K. 3200010
Microseconds for one run through Dhrystone: 1.01
Dhrystones per Second: 986560
VAX MIPS rating = 561.50
1.1 -O2
Microseconds for one run through Dhrystone: 0.55
Dhrystones per Second: 1815156
VAX MIPS rating = 1033.10
1.1 -O3
Microseconds for one run through Dhrystone: 0.44
Dhrystones per Second: 2279718
VAX MIPS rating = 1297.51
2.1 -O0
There were two versions of the Dhrystone benchmark; here’s version 2.1.
Dhrystone Benchmark, Version 2.1 (Language: C or C++)
Optimisation No Opt
Register option not selected
40000 runs 0.05 seconds
400000 runs 0.41 seconds
800000 runs 0.82 seconds
1600000 runs 1.63 seconds
3200000 runs 3.27 seconds
Final values (* implementation-dependent):
Int_Glob: O.K. 5 Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 3200010
Ptr_Glob-> Ptr_Comp: * 134545776
Discr: O.K. 0 Enum_Comp: O.K. 2
Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 134545776 same as above
Discr: O.K. 0 Enum_Comp: O.K. 1
Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING
Microseconds for one run through Dhrystone: 1.02
Dhrystones per Second: 979200
VAX MIPS rating = 557.31
2.1 -O2
Microseconds for one run through Dhrystone: 0.74
Dhrystones per Second: 1358044
VAX MIPS rating = 772.93
2.1 -O3
Microseconds for one run through Dhrystone: 0.71
Dhrystones per Second: 1409295
VAX MIPS rating = 802.10
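For reference, the VAX MIPS rating in these outputs is the Dhrystones-per-second figure divided by 1757, the score of the VAX 11/780 that defines 1 MIPS. Here is a small Python sketch that reproduces the ratings above and normalises them by the measured 793 MHz clock. The per-MHz figure is my own addition for comparing against other CPUs, not something the benchmark prints.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # reference machine, defined as 1 MIPS
CLOCK_MHZ = 793                       # "Measured" clock from the output above

# Dhrystones per second, copied from the results above.
results = {
    "1.1 -O0": 986560,
    "1.1 -O2": 1815156,
    "1.1 -O3": 2279718,
    "2.1 -O0": 979200,
    "2.1 -O2": 1358044,
    "2.1 -O3": 1409295,
}


def vax_mips(dhrystones_per_sec):
    """Convert a Dhrystones/second score into a VAX MIPS rating."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


ratings = {name: vax_mips(dps) for name, dps in results.items()}

for name, mips in ratings.items():
    print(f"{name}: {mips:8.2f} VAX MIPS, {mips / CLOCK_MHZ:.2f} per MHz")
```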
WHETSTONE
Same flags as the Dhrystone benchmarks.
-O0
Single Precision C/C++ Whetstone Benchmark
Loop content                        Result    MFLOPS      MOPS   Seconds
N1 floating point     -1.12475025653839111   132.613               0.031
N2 floating point     -1.12274754047393799    81.167               0.353
N3 if then else        1.00000000000000000             142.886     0.154
N4 fixed point        12.00000000000000000             121.395     0.553
N5 sin,cos etc.        0.49904659390449524               6.884     2.574
N6 floating point      0.99999988079071045    36.821               3.120
N7 assignments         3.00000000000000000              67.113     0.587
N8 exp,sqrt etc.       0.75110864639282227               3.031     2.614
MWIPS                                        213.303               9.986
-O2
Single Precision C/C++ Whetstone Benchmark
Loop content                        Result    MFLOPS      MOPS   Seconds
N1 floating point     -1.12441420555114746   199.450               0.033
N2 floating point     -1.12241148948669434   118.126               0.389
N3 if then else        1.00000000000000000             470.852     0.075
N4 fixed point        12.00000000000000000             687.611     0.157
N5 sin,cos etc.        0.49904659390449524               6.721     4.233
N6 floating point      0.99999988079071045   188.546               0.978
N7 assignments         3.00000000000000000             671.471     0.094
N8 exp,sqrt etc.       0.75110864639282227               3.276     3.884
MWIPS                                        347.433               9.844
-O3
Single Precision C/C++ Whetstone Benchmark
Loop content                        Result    MFLOPS      MOPS   Seconds
N1 floating point     -1.12441420555114746   199.070               0.034
N2 floating point     -1.12239956855773926   234.379               0.201
N3 if then else        1.00000000000000000             590.353     0.061
N4 fixed point        12.00000000000000000             688.087     0.160
N5 sin,cos etc.        0.49904659390449524               6.682     4.358
N6 floating point      0.99999988079071045   188.449               1.002
N7 assignments         3.00000000000000000             717.757     0.090
N8 exp,sqrt etc.       0.75110864639282227               3.277     3.973
MWIPS                                        354.306               9.878
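To see which loops actually benefit from optimisation, the per-loop rates above can be turned into -O0 to -O3 speedups. A rough Python sketch, with the figures copied from the -O0 and -O3 tables (N1, N2 and N6 are MFLOPS, the other loops are MOPS); the dictionary names are mine.

```python
# Rates (MFLOPS or MOPS, plus the MWIPS total) copied from the tables above.
o0 = {"N1": 132.613, "N2": 81.167, "N3": 142.886, "N4": 121.395,
      "N5": 6.884, "N6": 36.821, "N7": 67.113, "N8": 3.031,
      "MWIPS": 213.303}
o3 = {"N1": 199.070, "N2": 234.379, "N3": 590.353, "N4": 688.087,
      "N5": 6.682, "N6": 188.449, "N7": 717.757, "N8": 3.277,
      "MWIPS": 354.306}

# Speedup of each loop when going from -O0 to -O3.
speedup = {loop: o3[loop] / o0[loop] for loop in o0}

# Print the loops from biggest winner to smallest.
for loop, ratio in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{loop:<6}{ratio:6.2f}x")
```

The two loops that spend their time in math library calls (N5 and N8) barely move, which is presumably why the overall MWIPS figure improves far less than the fixed-point and assignment loops do.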
NBENCH
Used some extra CFLAGS, as requested by the Makefile:
-s -static -Wall -O3 -march=i586 -fomit-frame-pointer \
-mpreferred-stack-boundary=2 -falign-functions=0 -falign-jumps=0 \
-falign-loops=0 -funroll-loops
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index
: : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT : 353.92 : 9.08 : 2.98
STRING SORT : 22.378 : 10.00 : 1.55
BITFIELD : 1.7021e+08 : 29.20 : 6.10
FP EMULATION : 68.53 : 32.88 : 7.59
FOURIER : 2952.1 : 3.36 : 1.89
ASSIGNMENT : 10.309 : 39.23 : 10.18
IDEA : 924.23 : 14.14 : 4.20
HUFFMAN : 613.45 : 17.01 : 5.43
NEURAL NET : 4.998 : 8.03 : 3.38
LU DECOMPOSITION : 246.81 : 12.79 : 9.23
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 18.774
FLOATING-POINT INDEX: 7.011
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU : GenuineTMx86 Transmeta(tm) Crusoe(tm) Processor TM5800 800MHz
L2 Cache : 512 KB
OS : Linux 3.16.0-6-586
C compiler : gcc version 4.9.2 (Debian 4.9.2-10+deb8u1)
libc : libc-2.19.so
MEMORY INDEX : 4.580
INTEGER INDEX : 4.765
FLOATING-POINT INDEX: 3.889
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
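The index lines are geometric means of the per-test scores in the table: the original BYTEmark indexes use the Pentium 90 column, the Linux indexes use the K6/233 column. Below is a short Python sketch that recomputes them from the rounded values above, so small rounding differences are expected. The grouping of tests into memory, integer, and floating-point sets is how I understand nbench to split them, and the names in the code are mine.

```python
import math

# (old index vs Pentium 90, new index vs AMD K6/233), copied from the table.
scores = {
    "NUMERIC SORT": (9.08, 2.98),
    "STRING SORT": (10.00, 1.55),
    "BITFIELD": (29.20, 6.10),
    "FP EMULATION": (32.88, 7.59),
    "FOURIER": (3.36, 1.89),
    "ASSIGNMENT": (39.23, 10.18),
    "IDEA": (14.14, 4.20),
    "HUFFMAN": (17.01, 5.43),
    "NEURAL NET": (8.03, 3.38),
    "LU DECOMPOSITION": (12.79, 9.23),
}

# Assumed test groupings for the split indexes.
MEMORY = ["STRING SORT", "BITFIELD", "ASSIGNMENT"]
INTEGER = ["NUMERIC SORT", "FP EMULATION", "IDEA", "HUFFMAN"]
FLOAT = ["FOURIER", "NEURAL NET", "LU DECOMPOSITION"]

OLD, NEW = 0, 1  # column positions inside each score tuple


def geomean(tests, column):
    """Geometric mean of one index column over the given tests."""
    logs = [math.log(scores[t][column]) for t in tests]
    return math.exp(sum(logs) / len(logs))


indexes = {
    # The original integer index covers every non-floating-point test.
    "original integer": geomean(MEMORY + INTEGER, OLD),
    "original floating-point": geomean(FLOAT, OLD),
    "linux memory": geomean(MEMORY, NEW),
    "linux integer": geomean(INTEGER, NEW),
    "linux floating-point": geomean(FLOAT, NEW),
}

for name, value in indexes.items():
    print(f"{name:<24}{value:.3f}")
```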
Well, that’s it for my benchmarks under Linux. I have no frame of reference, so I’ll leave it up to you, the reader, to compare these against your own hardware of this vintage.