We've collected some of the more interesting items from this event for you:
The interaction among advanced compilation techniques, modern processor and computing architectures, and associated tools continues to present new challenges and opportunities. Traditional demands to increase performance, reduce power consumption, and reduce time to market now apply to heterogeneous, virtualized and diverse user-experience environments. Extensive data and task parallelism are being exposed by new programming environments such as OpenCL and Renderscript, relying on innovative architectures, compilers, binary translation and runtime tools. This conference will focus on these exciting new directions and how they are influencing the architecture and compilation domain. This year, the conference will span two days with a format similar to last year's, but with more time for networking and presentations, including a poster session.
The main focus of this conference is the interaction of compiler technologies, processor and computing architectures, and tools to address the latest programming environments and demands. The topics of interest for this conference include, but are not limited to:
Ralph Potter, Paul Keir, Jan Lucas, Mauricio Alvarez-Mesa, Ben Juurlink, Andrew Richards, Codeplay, UK and TU Berlin, Germany
GPU kernel fusion is often described in research papers in terms of a standalone command-line tool. Such a tool adopts a usage pattern in which a user isolates, or annotates, an ordered set of kernels. Given these OpenCL C kernels as input, the tool outputs a single kernel that performs the same calculations, thereby minimizing costly intermediate load and store operations at runtime. This mode of operation is, however, a departure from normality for many developers, and is mainly of academic interest.
Automatic compiler-based kernel fusion could provide a vast improvement to the end-user's development experience. The OpenCL Host API, however, does not provide a means to specify opportunities for kernel fusion to the compiler. Ongoing and rapidly maturing compiler and API research by Codeplay aims to provide a higher-level, single-source, industry-focused C++-based interface to OpenCL. Opportunities for kernel fusion have now also been investigated here, utilizing C++11 features including lambda functions, variadic templates, and lazy evaluation using std::bind expressions.
While pixel-to-pixel transformations are interesting in this context, insofar as they demonstrate the expressivity of this new single-source C++ API, we also consider fusing transformations which utilize synchronization within workgroups. Hence convolutions utilizing halos, and the use of the GPU's local shared memory, are also explored.
A perennial problem has therefore been restructured to accommodate a modern C++-based expression of kernel fusion. Kernel fusion thus becomes an integrated component of an extended C++ compiler and API.
Juan A. Gonzalez-Lugo, R. Cammarota, A. Avila-Ortega, N. Dutt, Tecnológico de Monterrey, Mexico and U. of California at Irvine, USA
The popularity of General-Purpose computing on Graphic Processing Units (GPGPUs) continues to increase due to the power efficiency and high computational throughput that can be attained for data-parallel applications. This has led to the creation of new programming models, such as Nvidia CUDA, AMD Brook+ and OpenCL, that provide simple application programming interfaces to ease the software development effort by hiding the complexity of the underlying architecture. However, developing high-performance, portable GPGPU applications still remains a challenge, mainly because of architecture idiosyncrasies that require the re-optimization of a GPGPU application for each target architecture. This paper presents a characterization approach to model GPGPU applications based on their performance bottlenecks on a number of GPU architectures. The proposed approach leverages GPU application signatures and clustering algorithms. The signatures are derived from program/architectural features (i.e. hardware counters), whereas clustering algorithms group applications together based on the similarity patterns induced by the signatures. The proposed characterization approach is evaluated using several benchmark suites such as Nvidia CUDA SDK, Parboil and Rodinia. Experiments show that applications that have been subject to a certain optimization process are clustered together, i.e., similarity patterns are effectively captured by the proposed characterization approach. Results encourage the idea of providing hints to the software developer (or embedding such hints in a GPGPU compiler) on how to optimize a given application for a target architecture by similarity.
Dani Voitshechov, Yoav Etsion, Technion, Israel
Traditional von-Neumann architectures are tuned to execute a sequential stream of dynamic instructions that communicate through explicit storage. These two characteristics of the model also manifest its fundamental inefficiencies. Processors must fetch and decode each dynamic instruction instance, even though programs typically iterate over small static portions of the code. Moreover, using explicit storage as the only channel for transferring data between instructions implies that all intermediate values of the computation must be transferred back-and-forth between the functional units and the register file, rather than communicating them directly between the functional units. These (and other) inefficiencies dramatically affect the energy efficiency of modern processors.
Dataflow machines, in contrast, do not suffer from these inefficiencies. Dataflow machines represent a program as a graph that, given a fabric of interconnected functional units, can be pre-loaded and executed multiple times, thus eliminating redundant instruction fetches and decodes. In addition, these machines allow operand values to be transferred directly between computational units. Therefore, dataflow architectures reduce both instruction and data memory accesses, as well as minimize register-file accesses.
In this research, we present the single-graph multiple-flows (SGMF) architecture, which targets efficient execution of emerging task-based programming models by mapping multiple task instances onto a tagged-token dataflow engine composed of a fabric of interconnected functional units. The SGMF architecture facilitates efficient execution of tasks by 1) eliminating recurrent instruction fetching and decoding, and 2) minimizing data traffic and register-file accesses.
The architecture targets single-instruction multiple-threads (SIMT) programming models (CUDA/OpenCL). SIMT tasks can be efficiently translated to dataflow graphs, and executing many task instances on the same fabric eliminates frequent reconfigurations. The dataflow-based SGMF engine enables SIMT programming models to enjoy better performance and power consumption than prevalent von-Neumann-based SIMT processors, namely GPGPUs.
Alejandro Acosta, Francisco Almeida, La Laguna University, Spain
The advent of emergent SoCs and MPSoCs opens a new era for small mobile devices (smartphones, tablets, ...) in terms of computing capabilities and applications to be addressed. The efficient use of such devices, including their parallel power, is still a challenge for general-purpose programmers due to a very steep learning curve demanding very specific knowledge of the devices. While some efforts are currently being made, mainly in the scientific scope, the situation is still far from desirable for non-scientific applications, very few of which take advantage of the parallel capabilities of these devices.
We propose Paralldroid (Framework for Parallelism in Android), a parallel development framework oriented to general purpose programmers for standard mobile devices. Paralldroid presents a programming model that unifies the different programming models of Android. The user just implements a Java application and introduces a set of Paralldroid annotations in the sections of code to be optimized. The Paralldroid system automatically generates the native C, OpenCL or Renderscript code for the annotated section. The Paralldroid transformation model involves source-to-source transformations and skeletal programming.
Radosław Drabiński, Paweł Majewski, Krystian Matusiewicz, Konrad Trifunović, Marek Targowski, Intel, Poland
Open and community-driven OpenGL* and OpenCL* APIs have been rising in prominence in the last couple of years. OpenGL* APIs lack any low-level, assembly-like shader program representation, though. Textual source code is still used as the sole representation of a program. This enables almost perfect portability but comes with a number of disadvantages: exposure of ISVs' IP, longer on-device compilation time, lack of offline compilers and optimizers, etc.
On the other hand, the OpenCL* community has addressed these problems by defining an extension to the API and proposing SPIR*, an LLVM* IR derivative, as a GPGPU low-level intermediate representation. Additionally, one can observe a slow but steady convergence in GPU chip architectures among major HW vendors, with AOS (Array-of-Structs) data organization being abandoned in favor of a common move to SOA (Structure-of-Arrays) processing. We believe that the time has come to propose a common low-level intermediate representation for 3D graphics.
In this paper, we discuss the general properties such a representation should possess, abstracting from implementation details. We assume the abstract GPU processor we are targeting naturally implements SIMT (Single Instruction Multiple Threads) parallelism and SOA data organization, and is capable of implementing all stages of a 3D pipeline as defined in the OpenGL* 4.4 spec. The representation is intended as a portable intermediate layer between the graphics API and a vendor-specific hardware backend (finalizer). In particular, we discuss how to represent uniform (same for all threads within a SIMT group) and non-uniform (varying across a SIMT group) data types; give general insights on how to map shader inputs and outputs, external resources and synchronization primitives; and define a set of built-in functions necessary to implement shading in a modern 3D graphics pipeline.
Boaz Ouriel, Intel, Haifa
OpenCL is one of the most promising programming environments today for heterogeneous systems, supporting a wide range of CPUs, GPUs and other devices. An OpenCL environment allows programmers to distribute their portable code either in source format or in pre-compiled executable format. The former suffers from lack of IP protection and runtime parsing overhead, among other issues, while the latter is evidently not portable. We have been working together with other Khronos colleagues on a new intermediate representation standard called “SPIR” (Standard Portable Intermediate Representation) to address this challenge. SPIR is officially a Khronos OpenCL 1.2 extension, and is based on the widely used LLVM IR language, thereby also facilitating non-OpenCL tool-chains to efficiently target OpenCL environments. For example, a converter from C++ AMP to OpenCL could generate SPIR more efficiently than OpenCL source code. In this talk we will describe SPIR, present the current status of OpenCL tools supporting SPIR and highlight potential industrial and academic usages of SPIR.
Anton Malakhov, Evgeny Fiksman, Intel, Russia and Israel
In this presentation we share our experience of improving the performance of the Intel TBB library when handling workloads with fine-grained parallelism on the Intel Xeon Phi coprocessor. This research was driven by the requirements of OpenCL™ programs provided by customers.