SC19

Event is now over

Find all of the presentations and other assets from this event in our Video and Presentations Library. Check out some of our upcoming events here.

November 17-21, 2019
Colorado Convention Center, Denver, CO

Join Khronos in Denver, Colorado for SC19, the International Conference on High Performance Computing. Talks and presentations this year will cover Khronos Standards SYCL, SPIR-V and OpenCL.

Khronos BOF

Khronos SYCL Heterogeneous C++ Status and Directions

Speakers: Simon McIntosh-Smith (University of Bristol), Michael Wong (Codeplay)
Date and Time: Thursday, 21 November 2019 | 12:15pm - 1:15pm
Location: 501-502
Website: Link

Many HPC programmers have not heard of SYCL. However, with the increasing importance of modern C++ in HPC, and with developers seeking alternatives to proprietary languages, SYCL is becoming critical as a vendor-neutral way to write C++ code that embraces heterogeneous parallelism. SYCL is an open standard with multiple implementations available, both open source and commercial.

In this BoF, experts and SYCL implementers will explain its advantages, how the language is governed, where it is going, and why you need to be aware of it if you are intending to write C++ code which targets HPC machines.

Khronos Related Sessions

H2RC 2019: Fifth International Workshop on Heterogeneous High-Performance Reconfigurable Computing

Workshop Organizers: Jason Bakos, Christian Plessl, Franck Cappello, Torsten Hoefler, Michaela Blott (Xilinx)
Date and Time: Sunday, 17 November 2019 | 9am - 5:30pm
Location: 607
Website: Link

As in the previous four years, this workshop will bring together application experts, software developers, and hardware engineers, both from industry and academia, to share experiences and best practices to leverage the practical application of reconfigurable logic to Scientific Computing, Machine/Deep Learning, and “Big Data” applications. In particular, the workshop will focus on sharing experiences and techniques for accelerating applications and/or improving energy efficiency with FPGAs using OpenCL, OpenMP, OpenACC, SYCL, C, C++, and other high-level design flows, which enable and improve cross-platform functional and performance portability while also improving productivity. Particular emphasis is given to cross-platform comparisons and combinations that foster a better understanding within the industry and research community on what are the best mappings of applications to a diverse range of hardware architectures that are available today (e.g., FPGA, GPU, Many-cores and hybrid devices, ASICs), and on how to most effectively achieve cross-platform compatibility.

SYCL: A Single-Source C++ Standard for Heterogeneous Computing

Speakers: Ronan Keryell (Xilinx)
Date and Time: Sunday, 17 November 2019 | 9:05am - 9:45am
Location: 607
Website: Link

Outlining the benefits of using SYCL for FPGA programming

hlslib: Software Engineering for Hardware Design

Speakers: Johannes de Fine Licht, Torsten Hoefler
Date and Time: Sunday, 17 November 2019 | 9:45am - 10am
Location: 607
Website: Link

High-level synthesis (HLS) tools have brought FPGA development into the mainstream, by allowing programmers to design architectures using familiar languages such as C, C++, and OpenCL. While the move to these languages has brought significant benefits, many aspects of traditional software engineering are still unsupported, or not exploited by developers in practice. Furthermore, designing reconfigurable architectures requires support for hardware constructs and workflows that are not covered by CPU-oriented tools and languages. To address this gap, we have developed hlslib, a collection of software tools, plug-in hardware modules, and code samples, designed to enhance the productivity of HLS developers. The goal of hlslib is two-fold: first, create a community-driven arena of bleeding-edge development, which can move quicker and provide more powerful abstractions than what is provided by vendors; and second, collect a wide range of example codes, both minimal proofs of concept and larger, real-world applications, that can be reused directly or inspire other work. hlslib is offered as an open source library, containing CMake files, C++ headers, convenience scripts, and example codes, and is receptive to any contribution that can benefit HLS developers, through general functionality or examples.

The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Speakers: Hamid Reza Zohouri, Satoshi Matsuoka
Date and Time: Sunday, 17 November 2019 | 11:45am - 12:10pm
Location: 607
Website: Link

Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in the literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer’s ability to maximize memory performance in their design. For some of these cases, we will provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.

2GRVI Phalanx: A 1332-Core RISC-V RV64I Processor Cluster Array with an HBM2 High Bandwidth Memory System, and an OpenCL-like Programming Model, in a Xilinx VU37P FPGA [WIP Report]

Speakers: Jan Gray
Date and Time: Sunday, 17 November 2019 | 10:30am - 10:45am
Location: 607
Website: Link

2GRVI (and its predecessor, GRVI) are FPGA-efficient 64b (resp. 32b) RISC-V processing element cores. Phalanx is a parallel processor and accelerator array overlay framework. Groups of PEs and accelerator cores form shared memory compute clusters. Clusters, DRAM, NICs and other I/O controllers communicate by message passing on an FPGA-optimal Hoplite torus soft NoC. This extended abstract summarizes work-in-progress to redesign the 2017 GRVI Phalanx to take advantage of new Xilinx FPGAs with 460 GB/s dual stack HBM2 DRAM-in-package, and to provide a familiar parallel programming experience via an OpenCL-like programming model and tools. The new system is the first kilocore RV64I SoC and the first RISC-V multiprocessor with an HBM2 memory system.

Data Flow Pipes: A SYCL Extension for Spatial Architectures

Speakers: Michael Kinsner, John Freeman (Intel)
Date and Time: Sunday, 17 November 2019 | 11am - 11:15am
Location: 607
Website: Link

FIFOs are a common construct in design for spatial and data flow architectures. OpenCL 2.0 defined a “pipe” feature to expose the FIFO construct, but the design didn’t meet all needs of spatial architectures. This talk describes a pipes extension to the Khronos SYCL single-source, C++-based programming framework, that exposes a pipe abstraction which closes the gaps in the OpenCL design, while also offering a more usable interface. The C++ type system is leveraged to provide static connectivity guarantees without extensive compiler implementation effort, and to provide well-defined interaction with C++ features. The described extension provides a usable interface that can also act as a substrate for additional abstractions to be built on top. This talk will motivate the utility of FIFOs/pipes in high level language FPGA design, describe the SYCL pipes extension and its mapping to SPIR-V and OpenCL, and provide examples of use in common spatial design patterns.

Productive Parallel Programming for FPGA with High-Level Synthesis

Speakers: Johannes de Fine Licht, Torsten Hoefler
Date and Time: Sunday, 17 November 2019 | 1:30pm - 5pm
Location: 407
Website: Link

Energy efficiency has become a first class citizen in the design of large computing systems. While GPUs and custom processors have shown merit in this regard, reconfigurable architectures, such as FPGAs, promise another major step in energy efficiency, constituting a middle ground between fixed hardware architectures and custom-built ASICs. Programming FPGAs has traditionally been done in hardware description languages, requiring extensive hardware knowledge and significant engineering effort. This tutorial shows how high-level synthesis (HLS) can be harnessed to productively achieve scalable pipeline parallelism on FPGAs. Attendees will learn how to target FPGA resources from high-level C++ or OpenCL code, guiding the mapping from imperative code to hardware, enabling them to develop massively parallel designs with real performance benefits. We treat well-known examples from the software world, relating traditional code optimizations to both corresponding and new transformations for hardware, building on existing knowledge when introducing new topics. By bridging the gap between software and hardware optimization, our tutorial aims to enable developers from a larger set of backgrounds to start tapping into the potential of FPGAs with real high performance codes.

Agent-Based Simulation of Fire Extinguishing: an Assignment for OpenMP, MPI, and CUDA/OpenCL

Speakers: Arturo Gonzalez-Escribano, Jorge Fernández-Fabeiro
Date and Time: Sunday, 17 November 2019 | 4:35pm - 4:40pm
Location: 702
Website: Link

We present a new assignment used in a parallel computing course to teach the approaches to the same problem in different parallel programming models. It targets concepts of shared-memory programming with OpenMP, distributed-memory programming with MPI, and/or GPU programming with CUDA or OpenCL. This assignment is based on a simplified agent-based simulation where teams of firefighters aim to extinguish a set of fire focal points in a dynamically evolving scenario. The program is designed to be simple, easy to understand by students, and to include specific parallelization and optimization opportunities. Although there is a quite direct parallel solution in the three programming models, the program has plenty of opportunities for further improvements. It extends the ideas of a previously presented assignment, in order to use more interesting data structures, load balancing techniques, and code optimizations. It has been successfully used in parallel programming contests during a real course, using the performance obtained by the students’ code as a measure of success.

DPC++ - A Technical Overview

Speaker: Max Domeika
Date and Time: Monday, 18 November 2019 | 9:15 - 9:45
Location: Ellie Caulkins Theater – Intel® HPC Developer Conference
Website: Link

DPC++ is a new language to enact parallelism upon accelerators like Intel's GPUs and FPGAs. DPC++ is based upon Khronos SYCL. This talk focuses on DPC++ as a language and how developers can employ the language to obtain high performance.

An Open Ecosystem for HPC Developers

Speaker: Andrew Richards, CEO, Codeplay Software Ltd
Date and Time: Monday, 18 November 2019 | 10:30 - 11:00
Location: 3rd Floor Loge – Intel® HPC Developer Conference
Website: Link

Developers can currently use SYCL to target a wide range of processors, including those from Intel, Arm, NVIDIA (experimental support), and Renesas, with support for Imagination Technologies GPUs coming soon. This is currently provided by ComputeCpp, the world's first conformant SYCL 1.2.1 implementation. SYCL offers performance, programmability and portability, providing a C++ programming environment similar to CUDA, but with extensive hardware support. Alongside this support, however, is a growing ecosystem of projects developed using SYCL that provide developers with optimized libraries for common operations and algorithms used in AI. During this presentation we will examine some of these projects, including TensorFlow, the most popular machine learning framework; SYCL-DNN, which offers neural network operations similar to cuDNN; SYCL-BLAS, which offers a set of BLAS operations; and Eigen, one of the most popular C++ linear algebra libraries available. Benchmark data will show how these libraries can be used to offer the portability and performance required, and how they can be used to accelerate complex HPC applications.

Invited Talk: The SPEC ACCEL Benchmark – Results and Lessons Learned

Speaker: Robert Henschel
Date and Time: Monday, 18 November 2019 | 2pm - 2:30pm
Location: 702
Website: Link

The High-Performance Group (HPG) of the Standard Performance Evaluation Corporation (SPEC) is a forum for discussing and developing benchmark methodologies for High-Performance Computing (HPC) systems. The group released the SPEC ACCEL benchmark in 2014, containing OpenCL and OpenACC components. In 2017, an OpenMP 4.5 target offload component was added by porting the OpenACC applications to OpenMP 4.5. This talk will introduce the benchmark, show results, and talk about the lessons learned from developing and maintaining this directive-based benchmark. In addition, current challenges of creating a follow-on suite are discussed.

Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries

Speakers: JaeHyuk Kwack (Argonne National Laboratory), Colleen Bertoni (Argonne National Laboratory), Buu Pham, Jeff Larkin (NVIDIA)
Date and Time: Monday, 18 November 2019 | 2:30pm - 3pm
Location: 702
Website: Link

The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale system in 2020, and three exascale supercomputers in 2021/2022. All of the systems are GPU-enabled systems, and they plan to provide optimized vendor-promoted programming models for their GPUs such as CUDA, HIP and SYCL. However, due to their limited functional portability, it is challenging for HPC application developers to maintain their applications in an efficient and effective way with good productivity across all US DOE pre-exascale/exascale systems. Directive-based programming models for accelerators can be one of the solutions for HPC applications on the DOE supercomputers. In this study, we employ OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare and evaluate the performance of the re-structured offloading kernels with the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the NVIDIA V100 GPU in the RI-MP2 kernel. Using the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU becomes more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI + directive-based offloading implementations of the RI-MP2 kernel perform more than 40 times faster than an MPI + OpenMP threading implementation on the same number of Summit nodes. This study demonstrates how directive-based offloading implementations can perform near what we expect based on machine peak ratios.

Convergence, Divergence, or New Approaches? - The Future of Software-Based Abstractions for Heterogeneous Supercomputing

Presenter: Fernanda Foertter (NVIDIA)
Panelists: Jeff R. Hammond, Jack Deslippe, Christian Robert Trott, Michael Wolfe, Johannes Doerfert
Date and Time: Monday, 18 November 2019 | 4:35pm - 5:25pm
Location: 702
Website: Link

This panel hopes to work towards this resolution by bringing together standards committee members, supporters, and users of these abstractions, e.g., from OpenACC, OpenMP, Kokkos or SYCL. The panelists will share their insights on the future of these abstractions: will they converge, diverge, or will new approaches be needed? What makes a good accelerator programming model? Is there a measure for this "goodness"? The audience is also encouraged to challenge the panelists with their questions or share their insights.

Breakthrough AI+HPC

Presenter: Intel
Date and Time: Monday, 18 November 2019
Location: Booth #1301
Website: Link

SYCL will play a role here following Intel's announcement of its oneAPI initiative and the DPC++ interface, which implements the SYCL standard. You can learn about DPC++ in a technical overview presentation, and if you want to do some SYCL coding, there is a hands-on lab. Codeplay CEO Andrew Richards will also present at the Intel event; his talk, An Open Ecosystem for HPC Developers, will cover the current SYCL ecosystem from top to bottom.

Poster 34: Analysis of Automata Processing Acceleration on Disparate Hardware Technologies

Author: Marziyeh Nourian
Date and Time: Tuesday, 19 November 2019 | 8:30am - 5pm
Location: E Concourse
Website: Link

Pattern matching is a computation that maps naturally onto finite automata (FA) abstractions. There has been a substantial amount of work on accelerating FA processing on various parallel platforms. However, the advantages and disadvantages of different automata processing accelerators and the innovation space in this area are still unclear. We target this problem and propose a compiler toolchain that automates the deployment of non-deterministic finite automata (NFAs) onto different target platforms. Using this toolchain, we perform an apples-to-apples comparison between AP-, GPU-, and FPGA-based NFA accelerator designs on large-scale datasets. Specifically, we observe that memory-based designs are limited by memory size and bandwidth. To address this issue, we target fixed-topology NFAs and propose a memory-efficient design that embeds the automata topology in code and stores only the transition symbols in memory. Our solution is suitable for SIMD architectures and is called SIMD_NFA. We design a compiler that automates the deployment of this design on SIMD platforms. We showcase our compiler framework on GPU and Intel platforms. Additionally, we observe that for NFAs with a grid-like fixed topology (e.g., NFAs for Levenshtein and Hamming distance-based matching), transitions do not need to be encoded within the traversal code but can be inferred from the reference string to be matched and the knowledge of the NFA topology. Lastly, SIMD_NFA is a good fit for FPGA deployment using OpenCL-to-FPGA toolchains. We investigate the deployment of the OpenCL version of SIMD_NFA on FPGA and explore a set of optimization techniques to retarget SIMD_NFA to FPGA.
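The point about grid-like NFAs can be illustrated with a small simulation. The following is only a sketch in plain Python, not the SIMD_NFA design or its generated code: the function name and the state encoding are assumptions made for illustration. Each state is a pair (symbols consumed, mismatches so far), and the next state is derived from the reference string itself rather than from a stored transition table.

```python
def hamming_nfa_matches(pattern, text, max_mismatches):
    """Return the end indices in `text` where `pattern` matches with at most
    `max_mismatches` substitutions, by simulating a grid-shaped NFA."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    active = set()  # states (i, e): i pattern symbols consumed, e mismatches
    ends = []
    for pos, symbol in enumerate(text):
        active.add((0, 0))  # a match may start at any input position
        next_active = set()
        for i, e in active:
            # The transition is inferred from the reference string:
            # stay in the same row on a match, move down one row on a mismatch.
            e2 = e if symbol == pattern[i] else e + 1
            if e2 > max_mismatches:
                continue  # this path leaves the grid and dies
            if i + 1 == m:
                ends.append(pos)  # reached an accepting column
            else:
                next_active.add((i + 1, e2))
        active = next_active
    return ends


print(hamming_nfa_matches("abc", "xabcabd", 1))  # prints [3, 6]
```

Only the pattern and the mismatch budget are stored; the grid topology lives entirely in the two lines that compute the next state.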

Poster 40: Performance, Portability, and Productivity for Data-Parallel Computations on Multi- and Many-Core Architectures

Author: Ari Rasch
Date and Time: Tuesday, 19 November 2019 | 8:30am - 5pm
Location: E Concourse
Website: Link

This thesis presents an approach to performance, portability, and productivity for data-parallel computations on multi- and many-core architectures, e.g., Intel CPU and NVIDIA GPU. We introduce the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs) – a class of functions that cover important data-parallel computations, e.g., linear algebra routines (BLAS) and stencil computations. For our MDHs, we propose a Domain-Specific Language (DSL), based on patterns of parallelism (a.k.a. algorithmic skeletons), to enable conveniently expressing MDH functions. We introduce a code generation approach for our DSL to automatically generate for MDHs optimized program code targeting multi- and many-core architectures. Our code generation approach relies on OpenCL – an emerging de-facto standard for uniformly programming parallel architectures, such as CPU and GPU. A major feature of our generated code is that it is targeted to OpenCL’s abstract device models (rather than a particular architecture) by being parameterized in performance-critical parameters of these abstract models (e.g., the number of threads and size of tiles). With our code generation approach, we enable both high performance and performance portability: we fully automatically optimize our generated code -- for any given combination of an MDH function, architecture, and input size -- by automatically choosing (auto-tuning) optimized values of our code’s performance-critical parameters using our own Auto-Tuning Framework (ATF). Our experimental results on CPU and GPU demonstrate competitive and often significantly better performance of our MDH+ATF approach as compared to the currently best-performing competitors, e.g., Intel MKL/MKL-DNN, NVIDIA cuBLAS/cuDNN, and Facebook’s Tensor Comprehensions framework.
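The idea of code parameterized in performance-critical values, with a tuner choosing those values, can be sketched as follows. This is a toy exhaustive search in Python and not the ATF interface: the function names, the parameter names, and the cost model are hypothetical. A real tuner would time the generated kernel on the target device and would typically use a smarter search strategy.

```python
import itertools


def auto_tune(cost, search_space):
    """Evaluate every parameter combination and return the cheapest one."""
    names = list(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        c = cost(config)
        if c < best_cost:
            best_config, best_cost = config, c
    return best_config, best_cost


def toy_cost(config):
    # Hypothetical stand-in for a measured kernel runtime.
    return abs(config["tile_size"] - 32) + abs(config["num_threads"] - 128) / 64


space = {"num_threads": [64, 128, 256], "tile_size": [8, 16, 32, 64]}
best, cost = auto_tune(toy_cost, space)
print(best, cost)
```

Because the cost function is passed in, the same search loop can be reused for a different kernel, device, or input size, which mirrors the per-combination tuning the abstract describes.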

Poster 124: Porting Finite State Automata Traversal from GPU to FPGA: Exploring the Implementation Space

Author: Marziyeh Nourian
Date and Time: Thursday, 21 November 2019 | 8:30am - 5pm
Location: E Concourse
Website: Link

While FPGAs are traditionally considered hard to program, there have recently been efforts to allow high-level programming models intended for multi-core CPUs and GPUs to be used to program FPGAs. For example, both Intel and Xilinx are now providing OpenCL-to-FPGA toolchains. However, since GPU and FPGA devices offer different parallelism models, OpenCL code optimized for GPU can prove inefficient on FPGA, in terms of both performance and hardware resource utilization.

In this poster, we explore this problem on an emerging workload: finite state automata traversal. Specifically, we explore a set of structural code changes and custom and best-practice optimizations to retarget an OpenCL NFA engine designed for GPU to FPGA. Our evaluation, which covers traversal throughput and resource utilization, shows that our optimizations lead, on a single execution pipeline, to speedups of up to 4x over an already optimized baseline that uses one of the proposed code changes to fit the original code on FPGA.

Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL

Presenters: Bálint Joó, Thorsten Kurth, M. A. Clark, Jeongnim Kim, Christian Robert Trott, Dan Ibanez, Daniel Sunderland, Jack Deslippe
Date and Time: Friday, 22 November 2019 | 8:50am - 9:05am
Location: 401-402-403-404

We describe our experiences in creating mini-apps for the Wilson-Dslash stencil operator for Lattice Quantum Chromodynamics using the Kokkos and SYCL programming models. In particular, we comment on the performance achieved on a variety of hardware architectures, limitations we have reached in both programming models and how these have been resolved by us, or may be resolved by the developers of these models.

Performance Portability of Multi-Material Kernels

Speakers: Istvan Z. Reguly
Date and Time: Friday, 22 November 2019 | 9:05am - 9:20am
Location: 401-402-403-404
Website: Link

Trying to improve performance, portability, and productivity of an application presents non-trivial trade-offs, which are often difficult to quantify. Recent work has developed metrics for performance portability, as well as some aspects of productivity. In this case study, we present a set of challenging computational kernels and their implementations from the domain of multi-material simulations, and evaluate them using these metrics.

Three key kernels are implemented using OpenMP, OpenMP offload, OpenACC, CUDA, SYCL, and Kokkos, and tested on ARM ThunderX2, IBM Power 9, Intel KNL, Broadwell, and Skylake CPUs, as well as NVIDIA P100 and V100 GPUs. We also consider the choice of compilers, evaluating LLVM/Clang, GCC, PGI, Intel, IBM XL, and Cray compilers, where available. We present a detailed performance analysis and calculate performance portability and code divergence metrics, contrasting performance, portability, and productivity.
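The abstract does not say which performance portability metric is used. As a minimal sketch, assuming the commonly cited definition by Pennycook, Sewall, and Lee (the harmonic mean of an application's efficiency across a set of platforms, and zero if any platform is unsupported), the calculation looks like this. The platform names and efficiency values are invented for illustration and are not results from the talk.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies, or 0 if any is unsupported.

    `efficiencies` maps a platform name to an efficiency in (0, 1], or to
    None when the application does not run on that platform.
    """
    if not efficiencies:
        raise ValueError("need at least one platform")
    values = list(efficiencies.values())
    if any(e is None or e <= 0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# Hypothetical efficiencies, invented purely for illustration.
example = {"cpu_a": 0.5, "cpu_b": 0.5, "gpu_a": 0.25}
print(performance_portability(example))  # prints 0.375
```

The harmonic mean penalizes a single poorly performing platform more strongly than an arithmetic mean would, which is why the one 0.25 entry pulls the result below 0.4.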

Khronos Members Exhibiting

Find a complete list of all SC19 Exhibitors here.