SPEC HPG Benchmarks for HPC Systems

Matthijs van Waveren (SPEC HPG Secretary, Fujitsu Systems Europe Ltd, waveren@fujitsu.fr),
Kumaran Kalyanasundaram (SPEC HPG Chair, Silicon Graphics, Inc., kumaran@sgi.com),
Greg Gaertner, Wesley Jones (National Renewable Energy Lab., Wesley Jones@nrel.gov),
Rudolf Eigenmann (Purdue University, eigenman@ecn.purdue.edu),
Ron Lieberman (SPEC HPG Vice Chair, Hewlett-Packard Corporation, lieb@rsn.hp.com),
Matthias S. Müller (TU Dresden, matthias.mueller@tu-dresden.de),
Brian Whitney (Sun Microsystems, Brian.Whitney@sun.com),
and Hideki Saito (Intel Corporation, hideki.saito@intel.com)
Abstract— In this paper, we discuss results and characteristics
of the benchmark suites maintained by the Standard Performance Evaluation Corporation’s (SPEC) High-Performance
Group (HPG). Currently, SPEC HPG has two lines of benchmark
suites for measuring performance of large-scale systems: SPEC
OMP and SPEC HPC2002. SPEC OMP uses the OpenMP
API and includes benchmark suites intended for measuring
performance of modern shared memory parallel systems. SPEC
HPC2002 uses both OpenMP and MPI, and thus it is suitable
for distributed memory systems, shared memory systems, and
hybrid systems. SPEC HPC2002 contains benchmarks from three
popular application areas: chemistry, seismic processing, and weather
forecasting. Each of the three benchmarks in HPC2002 has a
small and a medium data set, in order to satisfy the need for
benchmarking a wide range of high-performance systems. We
analyze published results of these benchmark suites regarding
scalability. We also discuss current efforts of SPEC HPG to develop a SPEC MPI benchmark suite based on MPI applications.
I. INTRODUCTION
SPEC (The Standard Performance Evaluation Corporation)
is an organization for creating industry-standard benchmarks
to measure various aspects of modern computer system performance. SPEC’s High-Performance Group (SPEC HPG) is
a workgroup aimed at benchmarking high-performance computer systems. In June of 2001, SPEC HPG released the first of
the SPEC OMP benchmark suites, SPEC OMPM2001. This
suite consists of a set of OpenMP-based[1], [2] application
programs. The data sets of the SPEC OMPM2001 suite (also
referred to as the medium suite) are derived from state-of-the-art computation on modern medium-scale (4- to 16-way)
shared memory parallel systems. Aslot et al.[3] have presented
the benchmark suite. Aslot et al.[4] and Iwashita et al.[5] have
described performance characteristics of the benchmark suite.
The second, large suite, SPEC OMPL2001, focusing on 32-way and larger systems, was released in May 2002. SPEC
OMPL2001 shares most of the application code base with
SPEC OMPM2001, but the code and the data sets have been
improved and made larger to achieve better scaling and also to
reflect the class of computation regularly performed on such
large systems[6]. So far, SPEC OMP performance has been
reported for systems up to 128 processors.
SPEC HPC2002 is the latest release of the HPC benchmark
suite. It is suitable for shared and distributed memory machines
or clusters of shared memory nodes. SPEC HPC applications
have been collected from among the largest, most realistic
computational applications that are available for distribution
by SPEC. In contrast to SPEC OMP, they are not restricted
to any particular programming model or system architecture.
Both shared-memory and message passing methods are supported. All codes of the current SPEC HPC2002 suite are
available in an MPI and an OpenMP programming model and
they include two data set sizes.
The SPEC MPI benchmark suite is currently under development. It will be based on real MPI application codes. It was
felt that there is a need for a standard MPI benchmark suite
based on real applications in the marketplace, which uses the
SPEC model of result submission and publication, and SPEC
HPG aims to fill this need.
Performance characteristics of application programs on
large-scale systems are often significantly different from those
on smaller systems. In our previous paper[6] we have discussed the scaling of SPEC OMP benchmarks. In this paper, we characterize the performance behavior of large-scale
systems (32-way and larger) using the SPEC OMPL and
HPC2002 benchmark suites. In Section 2, we provide a short
description of the applications contained in the benchmarks.
Section 3 analyzes the published results of SPEC OMPL2001
and SPEC HPC2002 on large systems, based on application
program behavior and systems’ architectural features. Section
4 concludes the paper.
II. DESCRIPTION OF THE BENCHMARKS
A. Overview of the SPEC OMPL2001 Benchmark
The SPEC OMPL2001 benchmark suite consists of 9 application programs, which represent the type of software used in
scientific technical computing. The applications include modeling and simulation programs from the fields of chemistry,
mechanical engineering, climate modeling, and physics. Of
the 9 application programs, 7 are written in Fortran, and
2 (ART and EQUAKE) are written in C. The benchmarks
require a virtual address space of about 6.4 GB in a 16-processor run. The rationale for this size was to provide
data sets significantly larger than those of the SPEC OMPM
benchmarks, with a requirement for a 64-bit address space.
Descriptions of the 9 application codes are provided in
Tab. I.
TABLE I
DESCRIPTION AND NUMBER OF LINES OF THE OMPL BENCHMARK APPLICATIONS.

APPLU (4000 lines): Solves 5 coupled non-linear PDEs on a 3-dimensional logically structured grid, using the Symmetric Successive Over-Relaxation implicit time-marching scheme [7].
APSI (7500 lines): Lake environmental model, which predicts the concentration of pollutants. It solves the model for the mesoscale and synoptic variations of potential temperature, wind components, and for the mesoscale vertical velocity, pressure, and distribution of pollutants.
MGRID (500 lines): Simple multigrid solver, which computes a 3-dimensional potential field.
SWIM (400 lines): Weather prediction model, which solves the shallow water equations using a finite difference method.
FMA3D (60000 lines): Crash simulation program. It simulates the inelastic, transient dynamic response of 3-dimensional solids and structures subjected to impulsively or suddenly applied loads. It uses an explicit finite element method [8].
ART (1300 lines): Adaptive Resonance Theory neural network, which is used to recognize objects in a thermal image [9]. The objects in the benchmark are a helicopter and an airplane.
GAFORT (1500 lines): Computes the global maximum fitness using a genetic algorithm. It starts with an initial population and then generates children who go through crossover, jump mutation, and creep mutation with certain probabilities.
EQUAKE (1500 lines): Earthquake-modeling program. It simulates the propagation of elastic seismic waves in large, heterogeneous valleys in order to recover the time history of the ground motion everywhere in the valley due to a specific seismic event. It uses a finite element method on an unstructured mesh [10].
WUPWISE (2200 lines): Wuppertal Wilson Fermion Solver, a program in the field of lattice gauge theory. Lattice gauge theory is a discretization of quantum chromodynamics. Quark propagators are computed within a chromodynamic background field. The inhomogeneous lattice-Dirac equation is solved.

TABLE II
DESCRIPTION AND NUMBER OF LINES OF THE HPC2002 BENCHMARK APPLICATIONS.

SPECenv (25000 lines C, 145000 lines F90): Based on the WRF weather model, a state-of-the-art, non-hydrostatic mesoscale weather model; see http://www.wrf-model.org.
SPECseis (25000 lines, F77 and C): Developed by ARCO beginning in 1995 to gain an accurate measure of the performance of computing systems as it relates to the seismic processing industry, for the procurement of new computing resources.
SPECchem (120000 lines, F77 and C): Used to simulate molecules ab initio, at the quantum level, and to optimize atomic positions. It is a research interest under the name of GAMESS at the Gordon Research Group of Iowa State University and is of interest to the pharmaceutical industry.

B. Overview of the SPEC HPC2002 Benchmark Suite

SPEC HPC2002 is a benchmark suite based on high-performance computing (HPC) applications and the MPI and
OpenMP standards for parallel processing. It is targeted at
those who evaluate performance for HPC systems, including
users, system vendors, software vendors, and researchers. It
uses a set of realistic applications to measure the performance of the computing system’s processors, memory architecture, and operating system. SPEC HPC2002 improves
upon and replaces the SPEC HPC96 benchmark suite. The
SPEC HPC2002 suite comprises three benchmarks, each with
a small- and medium-sized data set. A short description of the
benchmark applications is provided in Tab. II.
The SPECenv application is developed within the WRF
(Weather Research and Forecasting) Modeling System development project. This is a multi-year project being undertaken
by several agencies. Members of the WRF Scientific Board include representatives from EPA, FAA, NASA, NCAR, NOAA,
NRL, USAF and several universities. SPEC HPG integrated
version 1.2.1 of the WRF weather model into the SPEC tools
for building, running and verifying results. This means that
the benchmark runs on more systems than WRF has officially
been ported to. It can run in OpenMP, MPI or mixed MPI-OpenMP mode (hybrid). The benchmark runs use restart files
that are created after the model has run for several simulated
hours. This ensures that cumulus and microphysics schemes
are fully developed during the benchmark runs. Fig. 1 shows
the scalability for the different data sets and the OpenMP and
MPI mode on a Sun Fire 6800. The medium data set shows
better scalability. The best programming model will depend
on the platform and data set. Here, OpenMP is better for the
small data set and MPI for the medium data set.

Fig. 1. Speedup of SPECenv for different data sets and programming models on a Sun Fire 6800 platform.
SPECseis consists of a modeling phase, which generates synthetic seismic traces for any size of data set, with flexibility in the geometry of shots and receivers, ground structures, varying lateral velocity, and many other options. A subsequent phase stacks the traces into common midpoint stacks. There are two imaging phases, which produce the valuable output seismologists use to locate oil resources.
The first of the two imaging phases is a Fourier method, which is very efficient but does not take into account
variations in the velocity profile. Yet, it is widely used and
remains the basis of many methods for acoustic imaging. The
second imaging technique is a much slower finite-difference
method, which can handle variations in the lateral velocity.
This technique is used in many seismic migration codes today.
SPECseis can run in OpenMP or MPI mode. Fig. 2 shows the
scalability for the different data sets and the OpenMP and
MPI mode on a Sun Fire 6800. The medium data set shows
better scalability. The best programming model will depend
on the platform and data set. In this case OpenMP has better
scalability.

Fig. 2. Speedup of SPECseis for different data sets and programming models on a Sun Fire 6800 platform.
Like SPECseis, SPECchem is often used by computer vendors to demonstrate the performance of high-performance systems. Portions of the SPECchem code date back to 1984. It comes with many built-in functionalities, such as various field molecular wave-functions, certain energy corrections for some of the wave-functions, and simulation of several different phenomena. Depending on the chosen wave-functions, SPECchem has the option to output energy gradients of these functions, find saddle points of the potential energy, compute the vibrational frequencies and IR intensities, and more. SPECchem can run in OpenMP, MPI or mixed MPI-OpenMP mode (hybrid). Fig. 3 shows the scalability for the
different data sets and the OpenMP and MPI mode on a
Sun Fire 6800. The medium data set shows better scalability,
despite the fact that this data set was measured on a machine
with faster processors. In this case MPI has better scalability
for the small data set.

Fig. 3. Speedup of SPECchem for different data sets and programming models on a Sun Fire 6800 platform, normalized by the 2-way speedup.
C. Development of the SPEC MPI Benchmark Suite

In May 2004, SPEC HPG decided to start development of a SPEC MPI benchmark suite. It was felt that there is a need in the marketplace for a standard MPI benchmark suite that is based on real applications and uses the SPEC model of result submission and publication, and SPEC HPG aims to fill this need.

The MPI benchmark will provide performance metrics that can be used to compare different hardware architectures (SMP, NUMA, clusters) and interconnects, processors, memory hierarchies, compilers, and MPI implementations.

The draft benchmark suite currently consists of 16 candidate application codes from the fields of CFD, weather forecasting, Molecular Dynamics, Quantum Chemistry, Biochemistry, and Finite Element Methods. The contents of the suite are not fixed yet: it is not expected that all of the candidate codes will be accepted, and more codes will be added to the list of candidates. SPEC HPG is still open to MPI application code submissions. MPI applications from any application area can be submitted, but SPEC prefers programs that are considered state-of-the-art in a given field. The application should also be portable across different hardware and software architectures.

The SPEC tool set used is shared with the tool set of the upcoming update of the SPEC CPU benchmark suite. The MPI benchmark suite is being tested by HPG group members, ensuring portability across a wide selection of computing resources.
III. LARGE SYSTEM PERFORMANCE OF SPEC BENCHMARK SUITES
Performance characteristics of application programs on
large-scale systems are often significantly different from those
on smaller systems. Figure 4 shows the scaling of Amdahl's speedup for 32 to 128 threads, normalized by the Amdahl speedup at 16 threads. Amdahl's formula predicts the speedup on P processors to be S(P) = 1 / ((1 - f) + f/P), where f is the parallel coverage. The parallel coverage f is the fraction of sequential execution time that is enclosed by a parallel construct.
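As a concrete illustration of this formula (a minimal sketch; the coverage values below are arbitrary examples, not measurements from the suites), the following C program prints Amdahl's speedup at 32 to 128 threads normalized by the 16-thread value, which is how the curves in Figure 4 are constructed:

    #include <stdio.h>

    /* Amdahl's law: predicted speedup on p processors for parallel coverage f. */
    static double amdahl(double f, int p)
    {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void)
    {
        const double coverage[] = { 0.99, 0.999, 0.9999 };  /* illustrative values only */
        const int threads[] = { 32, 64, 128 };

        for (int i = 0; i < 3; i++) {
            double base = amdahl(coverage[i], 16);   /* normalize by the 16-thread speedup */
            printf("f = %.4f:", coverage[i]);
            for (int j = 0; j < 3; j++)
                printf("  S(%3d)/S(16) = %.2f",
                       threads[j], amdahl(coverage[i], threads[j]) / base);
            printf("\n");
        }
        return 0;
    }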
Amdahl's speedup assumes perfect scaling of the parallel portion of the program. Actual programs and actual hardware have additional sources of overhead, which degrade the performance obtained on a real system relative to the prediction given by Amdahl's law. On the other hand, Amdahl's speedup does not take into account that the performance can depend strongly on the data size per processor. Figures 5, 6, and 7 show the scaling data for published benchmark results of SPEC OMPL2001. The numbers listed in the
following figures have been obtained from the results published by SPEC as of November 2003. For the latest results published by SPEC, see http://www.spec.org/omp/results and http://www.spec.org/hpc2002/results. All results shown conform to Base Metrics reporting rules. Base Metrics are produced by building all applications in the suite with a common set of optimizations and without any modifications to the source or directives. For better presentation of the graphs, we have normalized all results to the 32-processor results of the same type of system configuration. If the same system has faster and slower processor configurations, we used the scores with the faster processors. In order to make the graphs readable, we have selected the systems that provided at least 32- and 64-processor results.

TABLE III
PLATFORMS USED FOR THE OMPL BENCHMARKS.

Vendor    Clock [MHz]   L1 Inst.   L1 Data   L2 Cache   L3 Cache
HP PA     875           0.75MB     1.5MB     -          -
HP IA64   1500          16KB       16KB      256KB      6144KB
Sun       1200          32KB       64KB      8MB        -
SGI       400           32KB       32KB      8MB        -
Fujitsu   1299          128KB      128KB     2MB        -

Fig. 4. Scaling of Amdahl's speedup of OMPM2001 through 128 threads, normalized by the 16-thread speedup.

Fig. 5. Scalability of the OMPL benchmarks that scale well to 128 processors (WUPWISE, SWIM, FMA3D, ART) on HP-Superdome, Sun Fire 15K and SGI O3800, normalized by the 32-thread speedup.
A. Scalability of OMPL Benchmarks
As of November 2003, 28 results had been published for OMPL and 76 for OMPM. These results were used for the figures. Note that as of January 2006, 39 results have been published for OMPL and 141 for OMPM. In this section we focus on the results with the large data set. Figures 5, 6, and 7 show the scalability of the SPEC OMPL benchmark applications. The results of five different architectures described in Tab. III are shown: an HP Superdome with PA-8700+ CPUs, an HP Superdome with Itanium2, a Sun Fire 15K with UltraSPARC III, an SGI O3800 with R12000, and a Fujitsu PRIMEPOWER system with SPARC64 V CPUs.
The benchmarks WUPWISE, SWIM, FMA3D, and ART show good scalability up to 128 processors. In order for SWIM to scale well, the bandwidth to main memory needs to scale with the number of processors. To increase scalability, OMPL2001 SWIM has more parallel loops than the
OMPM2001 version. In addition, some scalar computation is
performed in parallel, in favor of improved locality. This is of
special importance for machines with non-uniform memory access that use the first-touch algorithm to place pages. OMPL2001 ART calls malloc() more efficiently than the OMPM2001 version. This change reduces contention on malloc(), and thus improves the scalability of ART.
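The first-touch effect mentioned above can be illustrated with a generic OpenMP sketch (this is not code from SWIM; the array names and sizes are invented for the example). On a NUMA system that places a page in the memory of the processor that first touches it, initializing the arrays with the same parallel loop schedule that the compute loop uses keeps each thread's data in local memory:

    #include <stdlib.h>

    #define N 10000000L

    /* Generic first-touch sketch: initialization uses the same static schedule
     * as the compute loop, so each thread's pages are placed in local memory. */
    static void compute(double *a, double *b)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {   /* first touch: pages placed locally */
            a[i] = 0.0;
            b[i] = 1.0;
        }

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)     /* compute loop reuses the same placement */
            a[i] += 2.0 * b[i];
    }

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        if (a && b)
            compute(a, b);
        free(a);
        free(b);
        return 0;
    }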
Fig. 6. Superlinear scaling of APPLU on HP-Superdome and Fujitsu PRIMEPOWER, normalized by the 32-thread speedup.
The benchmark APPLU shows superlinear scaling on HP
Superdome and Fujitsu PRIMEPOWER. In our previous paper,
we also presented superlinear scaling for the SGI Origin 3800
and the Fujitsu PRIMEPOWER 2000[6]. This is due to a
more efficient usage of the cache as more processors are used.
The same effect is visible on the Sun Fire 15K. According
to the cache sizes of these systems, the sweet spot of the
aggregate cache amount is between 64MB and 96MB. In the
OMPL2001 version, false sharing was reduced by moving one
of the OpenMP DO directives from the outermost loop to the
second-level loop.
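The APPLU change is in Fortran, but the idea can be sketched generically in C (an illustration of the kind of transformation described, not the actual benchmark code). When the outermost loop is short and indexes the stride-1 dimension, work-sharing it makes different threads write neighbouring elements of the same cache line; moving the directive to the second-level loop gives each thread a contiguous block instead:

    #include <stdio.h>

    #define NI 4096
    #define NK 8              /* short, stride-1 innermost dimension */

    static double a[NI][NK], c[NI][NK];

    /* Work-sharing the short outer k loop: each thread writes one column of c,
     * and adjacent columns of a row share a cache line (false sharing). It also
     * limits the parallelism to NK threads. */
    static void update_outer(void)
    {
        #pragma omp parallel for
        for (int k = 0; k < NK; k++)
            for (int i = 0; i < NI; i++)
                c[i][k] = 2.0 * a[i][k];
    }

    /* Directive moved to the second-level loop: each thread now owns a
     * contiguous block of rows, so concurrent writes to the same cache line
     * largely disappear. */
    static void update_second_level(void)
    {
        for (int k = 0; k < NK; k++) {
            #pragma omp parallel for
            for (int i = 0; i < NI; i++)
                c[i][k] = 2.0 * a[i][k];
        }
    }

    int main(void)
    {
        for (int i = 0; i < NI; i++)
            for (int k = 0; k < NK; k++)
                a[i][k] = i + k;
        update_outer();
        update_second_level();
        printf("%f\n", c[NI - 1][NK - 1]);
        return 0;
    }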
The benchmarks EQUAKE, MGRID, APSI and GAFORT
show good scaling up to 64 processors, but poor scaling
for larger numbers of processors. The HP Superdome and SGI Origin 3800 scaled less well on EQUAKE. MGRID and EQUAKE are sparse matrix calculations, which do not scale well to large numbers of processors. In order to gain more scalability in the OMPL2001 version, we exploited more parallelism in EQUAKE, resulting in better scaling on the HP Superdome and SGI Origin 3800. The larger data set in OMPL2001 helped the scaling of MGRID. Compared to OMPM2001 APSI,
OMPL2001 APSI has a larger trip count of 240 for the
corresponding loop. OMPL2001 APSI also has an improved
work array distribution scheme as well as improved handling
of parallel reduction operations.
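As a generic illustration of what improved reduction handling means (a sketch, not the APSI source), compare a reduction that serializes every update through a critical section with OpenMP's reduction clause, which gives each thread a private partial sum and combines the partial sums only once at the end:

    #include <stdio.h>

    #define N 1000000L

    static double x[N];

    /* Naive version: every update is serialized through a critical section. */
    static double sum_critical(void)
    {
        double s = 0.0;
        #pragma omp parallel for
        for (long i = 0; i < N; i++) {
            #pragma omp critical
            s += x[i];
        }
        return s;
    }

    /* Reduction clause: private partial sums, combined once per thread. */
    static double sum_reduction(void)
    {
        double s = 0.0;
        #pragma omp parallel for reduction(+ : s)
        for (long i = 0; i < N; i++)
            s += x[i];
        return s;
    }

    int main(void)
    {
        for (long i = 0; i < N; i++)
            x[i] = 1.0;
        printf("%.0f %.0f\n", sum_critical(), sum_reduction());
        return 0;
    }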
Fig. 7. Scalability of the OMPL benchmarks that scale well to 64 processors (EQUAKE, MGRID, APSI, GAFORT) on HP-Superdome, Sun Fire 15K and SGI O3800, normalized by the 32-thread speedup.
B. Scalability of HPC Benchmarks
As of November 2003, 38 results had been published for SPEC HPC2002: 18 for the small and 20 for the medium data set. Results for up to 128 processes or threads have been
published. These results were used for the figures. Note that
as of January 2006, 104 results have been published for SPEC
HPC2002. In this section we focus on the medium data set
results on an IBM SP with 128 Power-3 CPUs running at
375MHz and an SGI 3800 with 128 R14000A CPUs at 600
MHz. In addition we show the results of a Sun Fire 6800
and 15K with UltraSparc III CPUs at 900MHz or 1200MHz
(indicated in the graphs), and a 64 CPU Sun Fire 880 cluster
with 8-way SMPs connected by Myrinet.
Fig. 8 shows the scalability of SPECenv for the medium
data set. All results use the MPI model of execution. The
benchmark shows good scalability up to 128 processors.
Fig. 8. Speedup of SPECenv for the medium data set, normalized by the 16-way speedup (IBM SP-375, SGI 3800, Sun Fire 6800, Sun Fire 15K).

Fig. 9. Speedup of SPECseis for the medium data set, normalized by the 4-way speedup (Sun Fire 6800 and Sun Fire 15K, MPI and OpenMP).

Fig. 10. Speedup of SPECchem for the medium data set, normalized by the 16-way speedup (IBM SP-375, Sun Fire 6800 OpenMP, Sun Fire 880 hybrid, Sun Fire 15K OpenMP, Sun Fire 15K hybrid).
Fig. 9 shows the scalability of SPECseis for the medium
data set. The scaling behavior depends strongly on the programming model and platform. While it shows almost perfect
scaling (81% efficiency) on 16 processors, the efficiency on a
Sun Fire 15K is much less.
For the SPECchem benchmark, only results from IBM and Sun are available for the medium data set. For both published IBM
results the MPI model of execution was used. The efficiency
of the 32 processor run is 82% compared to the 16 processor
run on the IBM. Using OpenMP the Sun Fire 15K shows a
perfect, almost superlinear scaling from 16 to 32 processors,
but for 64 processors the efficiency is only 57% compared to
the 16 processor run. Better efficiency is achieved with the
hybrid execution model, where 81% is reached with 64 CPUs
on the Sun Fire 880 cluster.
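For reference, the efficiencies quoted in this section are relative to a base run rather than to a single processor; our reading of the reporting convention (an interpretation, not a formula stated explicitly in the results) is

    E(p) = ( S(p) / S(p_base) ) / ( p / p_base ),

so the reported 82% for the 32-processor IBM run corresponds to a speedup of roughly 0.82 x 2 = 1.64 over the 16-processor run, and the 57% for 64 processors on the Sun Fire 15K to roughly 0.57 x 4 = 2.28.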
Although two of the HPC2002 benchmarks can be used in hybrid mode, all of the published results are limited to either pure OpenMP or pure MPI mode. Tab. IV shows the potential benefit of a hybrid execution of SPECenv. A similar benefit is visible for SPECchem in Fig. 11, since the MPI and OpenMP parallelism is exploited at different levels of granularity.

TABLE IV
SCALING OF DIFFERENT EXECUTION MODELS OF SPECENV FOR THE SMALL AND MEDIUM DATA SET. PLEASE NOTE THE DIFFERENT BASE POINTS.

4-way Itanium, small data set      Sun Fire 6800, medium data set
MPI x OMP     Scaling              MPI x OMP     Scaling
1x1           1                    8x1           1
4x1           1.72                 16x1          1.76
1x4           1.39                 1x16          1.56
2x2           1.98                 8x2           1.83

Fig. 11. Speedup of SPECchem for MPI, OpenMP and hybrid execution on a Sun Fire 15K. The lines connecting hybrid runs denote runs with the same number of processors*threads.
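To make the MPI x OMP configurations of Table IV concrete, the following minimal hybrid C sketch (a generic illustration, not SPECenv code) gives each MPI rank a contiguous block of a one-dimensional domain and lets OpenMP threads share the work inside the rank; launched with 8 ranks and OMP_NUM_THREADS=2 it would correspond to an 8x2 decomposition:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000L          /* global problem size (illustrative) */

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Coarse-grain decomposition: each MPI rank owns a block of the domain. */
        long chunk = N / nranks;
        long lo = rank * chunk;
        long hi = (rank == nranks - 1) ? N : lo + chunk;

        /* Fine-grain parallelism: OpenMP threads share the loop inside the rank. */
        double local = 0.0;
        #pragma omp parallel for reduction(+ : local)
        for (long i = lo; i < hi; i++)
            local += (double)i;

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d sum=%.0f\n",
                   nranks, omp_get_max_threads(), global);

        MPI_Finalize();
        return 0;
    }

This mirrors the observation above that MPI and OpenMP operate at different levels of granularity: MPI distributes coarse blocks across nodes, while OpenMP parallelizes the loops within each block.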
IV. CONCLUSION AND FUTURE WORK
In this paper we have analyzed the performance characteristics of published results of the SPEC OMPL2001 and
HPC2002 benchmark suites. We have found that the OMPL
benchmark programs scale well up to 64 processors. For the
HPC2002 benchmarks we also have demonstrated the impact
of the choice of the execution model. The results show that the best choice of MPI, OpenMP or hybrid depends on the hardware architecture used as well as on the program and the data sets. Although HPC2002 is not limited to shared memory platforms, no results for larger machines are available so far. We attribute this to the relatively recent release of HPC2002 and expect it to change in the near future.
The trends of the SPEC HPC2002 codes indicate clear limits
of scalability. We conclude that, even given sizeable data sets,
large-scale, realistic applications do not exhibit the near-ideal
speedups that some of the smaller benchmarks suggest. While
this is an expected finding for many readers, demonstrating
the evidence is an important result of the SPEC HPC2002
development. The fact that SPEC benchmarks and reports
are fully disclosed will allow both scientists and engineers
to identify the causes that limit performance and develop
remedies.
We combined OMPM and OMPL into one benchmark
providing different data set sizes. It also includes alternative
sources that have been submitted to SPEC according to the
run rules of the benchmark.
SPEC HPG is currently developing a new benchmark suite that will be based on MPI application codes. The MPI
benchmark will provide performance metrics that can be used
to compare different hardware architectures (SMP, NUMA,
clusters) and interconnects, processors, memory hierarchy,
compilers, and MPI implementations.
SPEC HPG is open for MPI application submissions. A
good candidate program would represent a type of computation that is regularly performed on high-performance computers.
ACKNOWLEDGMENTS
The authors would like to thank all of those who participated
in the benchmark development, especially the authors of the
application programs and the contributors of the data sets.
Benchmark development would have been impossible without
their hard work.
REFERENCES

[1] OpenMP Architecture Review Board, OpenMP C and C++ Application Programming Interface, Version 2.0, March 2002, http://www.openmp.org/specs/mpdocuments/cspec20.pdf.
[2] OpenMP Architecture Review Board, OpenMP Fortran Application Programming Interface, Version 2.0, November 2000, http://www.openmp.org/specs/mpdocuments/fspec20.pdf.
[3] V. Aslot, M. Domeika, R. Eigenmann, G. Gaertner, W. B. Jones, and B. Parady, "SPEComp: A new benchmark suite for measuring parallel computer performance," in WOMPAT'01: Workshop on OpenMP Applications and Tools, ser. LNCS, vol. 2104. Springer, July 2001, pp. 1–10.
[4] V. Aslot and R. Eigenmann, "Performance characteristics of the SPEC OMP2001 benchmarks," in 3rd European Workshop on OpenMP (EWOMP'01), Barcelona, Spain, September 2001.
[5] H. Iwashita, E. Yamanaka, N. Sueyasu, M. van Waveren, and K. Miura, "The SPEC OMP 2001 benchmark on the Fujitsu PRIMEPOWER system," in 3rd European Workshop on OpenMP (EWOMP'01), Barcelona, Spain, September 2001.
[6] H. Saito, G. Gaertner, W. Jones, R. Eigenmann, H. Iwashita, R. Lieberman, M. van Waveren, and B. Whitney, "Large system performance of SPEC OMP benchmark suites," Int'l Journal of Parallel Programming, vol. 31, no. 3, June 2003.
[7] E. Barszcz, R. Fatoohi, V. Venkatkrishnan, and S. Weeratunga, "Solution of regular sparse triangular systems on vector and distributed-memory multiprocessors," NASA Ames Research Center, Tech. Rep. RNR-93-007, 1993.
[8] S. Key and C. C. Hoff, "An improved constant membrane and bending stress shell element for explicit transient dynamics," Computer Methods in Applied Mechanics and Engineering, vol. 124, pp. 33–47, 1995.
[9] M. Domeika, C. Roberson, E. Page, and G. Tagliarini, "Adaptive resonance theory 2 neural network approach to star field recognition," in Applications and Science of Artificial Neural Networks II, ser. Proc. SPIE, vol. 2760, S. K. Rogers and D. W. Ruck, Eds. International Society for Optical Engineering, 1996, pp. 589–596.
[10] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu, "Large-scale simulation of elastic wave propagation in heterogeneous media on parallel computers," Computer Methods in Applied Mechanics and Engineering, vol. 152, no. 1–2, pp. 85–102, January 1998.
Matthijs van Waveren has contributed to the development of OpenMP Fortran and C/C++ 2.0, Fortran
95 and Fortran 2003. He is the representative of
Fujitsu Ltd in the SPEC High-Performance Group,
the OpenMP Architecture Review Board, and the
INCITS J3 committee on Fortran 2008. He received the Ph.D. degree from the University of
Amsterdam, the Netherlands, in 1989. He worked
at the SARA Computer Centre in Amsterdam, at
Schlumberger/Geco-Prakla R&D in Delft and the
University of Groningen before joining Fujitsu in
1996. He has published over 20 technical papers. He has served on the
programme committee of OpenMP workshops in Europe and Japan and as
Secretary of the SPEC High-Performance Group. Current address: Fujitsu
Systems Europe Ltd, 8, rue Maryse Hilsz, 31500 Toulouse, France. email:
waveren@fujitsu.fr
Kumaran Kalyanasundaram manages the Performance Engineering group at SGI. He is the primary
representative for SGI on SPEC’s High Performance
Group and Open Systems Group. He also serves as
the Chair of HPG. Kumaran received his PhD in
Aerospace Engineering & Engineering Mechanics
from Iowa State University in May 1995. His dissertation involved developing a fast, parallel, O(NlogN)
algorithm for generating 2-manifold surfaces in 3-D.
Greg Gaertner was a senior engineer with the
High Performance Technical Computing Division
of the Hewlett-Packard company. Greg worked in
the High Performance Computing Expertise Center.
Greg also was actively involved in performance
benchmarking for Technical Clusters. Greg received
his M.S. in Physics from the University of Minnesota. Greg has forty plus years of experience
in high performance technical computing having
worked for Control Data, Cray Research, SSI, Digital, Compaq and Hewlett Packard. Greg currently
resides in Bedford, New Hampshire, USA.
Wesley Jones , Ph.D., is a Senior Scientist in
the Computational Sciences Center at the National
Renewable Energy Laboratory in Golden Colorado.
He is an expert in scientific and technical computing
applications, algorithms, languages, software development, and application programming interfaces.
Wesley currently collaborates with NREL scientists
and engineers to address challenges of application
performance, portability, and scalability. He is also
co-investigator on a DOE grant ”Predicting the
Electronic Properties of 3D, Million-Atom Semiconductor Nanostructure Architectures.” He spent the 7 years prior to joining
NREL at the computer manufacturer, SGI, where he was technical lead of
the Performance Engineering Team and chair of the Standard Performance
Evaluation Corporation High Performance Group. At SGI Wesley worked
on techniques and algorithms for improving the performance of scientific
and technical applications. He also worked with the computer design and
production engineers on evaluating and modifying hardware, software and
networking for application of SGI computers to scientific and technical
computing workloads.
Rudolf Eigenmann is a Professor at the School
of Electrical and Computer Engineering at Purdue
University. His research interests include optimizing
compilers, programming methodologies and tools,
and performance evaluation for high-performance
computers. He has published over 100 technical
papers in these areas. Eigenmann received his Ph.D.
in Electrical Engineering/Computer Science in 1988
from ETH Zurich, Switzerland. He is the recipient
of a 1997 NSF CAREER award and serves on
the editorial boards of the International Journal of
Parallel Programming and the IEEE Computing in Science and Engineering Magazine. He has also served as the chairman/vice-chairman of the
High-Performance Group of the Standard Performance Evaluation Corporation (SPEC). He is a member of the IEEE, ACM, and ASEE. Address:
Purdue University, School of Electrical and Computer Engineering, 465
Northwestern Avenue, West Lafayette, IN 47907, eigenman@purdue.edu,
http://www.ece.purdue.edu/~eigenman/
Ron Lieberman is a senior engineer with the High
Performance Technical Computing Division of the
Hewlett-Packard company. Ron works primarily on
the Fortran90 compiler for HP-UX. Ron also is
actively involved in performance benchmarking for
SMP-based technical servers. Ron received his B.S. and M.S. from Central Michigan University in 1985. Ron is Hewlett-Packard's alternate representative on the High-Performance Group of the Standard Performance Evaluation Corporation (SPEC). Address:
Hewlett-Packard Company, 3000 Waterview Pkwy,
MS: B2W25 Richardson Texas 75080, lieb@rsn.hp.com
Matthias S. Müller is Deputy Director and CTO
of the Center for Information Services and High
Performance Computing (ZIH) of the Dresden University of Technology. His research interests include programming methodologies and tools, computational
science on high performance computers and Grid
computing. He received his Ph.D. in Physics in 2001
from University of Stuttgart, Germany. In his dissertation he studied fast algorithms for the simulation
of granular particles. He is a member of the German
Physical Society (DPG) and SPEC HPG.
Brian Whitney is a Senior Staff Engineer with
Sun Microsystems, Inc. His interests include parallelization, performance optimization and machine
characterization of high-performance computers. He
has over 20 years experience working with a vast
array of computer architectures, including hypercubes, array processors, mini-supercomputers and
vector processors. He graduated from the University of Washington in 1982, where he received his B.S. in Computer Science and in Mathematics. He is
currently a member of the SPEC HPG group and
serves as the Release Manager.