Speaker: Sergei-VinogradovSergey Vinogradov Organisation: Intel (in Co-operation with the University of Bristol) Authors: Julia Fedorova and Dan Curran, University of Bristol. Sergey Vinogradov and James Cownie, Intel Corporation.
Abstract: There are many discussions about whether different programming models can achieve high performance and how easy that is for the programmer. For HPC applications, where performance is critical, this question is especially interesting in the context of OpenMP 4.0 and OpenCL which offer different constructs for describing data parallelism.
We performed a case study to get a practical view of this matter. We experimented on five of the benchmarks from the Rodinia suite and on an industrial application: ROTORSIM, developed by the University of Bristol, and compared their OpenCL and OpenMP 4.0 implementations. While the individual impact of different factors such as memory access, vectorization, etc., is easier to study in the benchmarks, the primary interest is, of course, the performance of real applications. At the beginning we found the existing OpenMP implementations were either very naïve or not fully functional – so we applied changes to make OpenMP codes functional and have them implement similar algorithms to the OpenCL ones. We used the 4.0 version of the OpenMP API because it adds new forms of parallelism via device and SIMD constructs. These make the expressivity of OpenMP and OpenCL more similar.
Our study demonstrates that the critical factors are memory accesses, exploiting the vectorization capabilities of modern CPUs, minimizing unnecessary work and the overhead for organizing parallel calculations. While the OpenCL programming model encourages users to write code that can be vectorized by the OpenCL compiler out of the box, in many cases OpenMP 4.0 relies on the explicit usage of #pragma simd to enable vectorization. Both programming models provide enough control over these performance critical factors. Using different constructs but following the same principles we were able to achieve equivalent performance for the OpenMP and OpenCL implementations of the benchmarks as well as for the ROTORSIM application.
Our talk will present this study. We will go over the steps in the transformations of the benchmark codes and annotate each step with the performance numbers. We will demonstrate the usefulness of Intel® VTune™ Amplifier XE in understanding the performance limiting factors. We will present how the knowledge gained on the experiments with benchmarks is applied to the ROTORSIM code. Here we include the table illustrating the performance comparison for the five Rodinia benchmarks. We will show the ROTORSIM performance data in our presentation. General recommendations for developing high performing code will conclude the talk.
In performance tuning there is really no magic bullet. You need to use efficient algorithms, and understand the hardware characteristics of the target machine. The main impact of the programming model is in its ability to either enhance the optimization process or get in the way of the optimizations that are required for efficient execution. To the extent the programming model is not limiting, the limiting characteristic of the application performance, is the potential performance of the underlying hardware.