SC20

Event is now over

November 9-19, 2020
Online

The International Conference for High Performance Computing, Networking, Storage, and Analysis
Everywhere We Are #MoreThanHPC

Khronos Sessions

HPC Application Development Using C++ and SYCL

Date and Time: November 9, 2020 | 10AM - 2:00PM EST
Presenters: Tim Mattson (Intel), Michael Wong (Codeplay), Ronan Keryell (Xilinx), Rod Burns (Codeplay), Aksel Alpay (Heidelberg University)

SYCL is a programming model that targets a wide variety of devices (CPUs, GPUs, FPGAs and more) from a single code base. SYCL supports a single-source style of programming from completely standard C++. With increasingly heterogeneous processor roadmaps, a platform-independent model such as SYCL is essential for software developers.

In this tutorial, we introduce SYCL. We start by building a solid foundation to help programmers gain mastery of this language. We then explore how SYCL can be used to write serious applications, covering intermediate to advanced features as well as some of the tools and libraries that support SYCL application development. The tutorial is constructed around mini-applications that represent key design patterns encountered by people who program heterogeneous systems. This helps keep the tutorial grounded on practical knowledge students can immediately apply to their own programming problems.

C++ for Heterogeneous Programming: oneAPI (DPC++ and oneTBB): Part 1

Date and Time: November 9, 2020 | 10AM - 2:00PM EST
Presenters: James Reinders (James Reinders Consulting), Michael Voss (Intel), Pablo Reble (Intel), Rafael Asenjo (University of Malaga), Denisa-Andreea Constantinescu (University of Malaga), Andrey Federov (Intel)

This tutorial provides hands-on experience programming CPUs, GPUs and FPGAs using a unified, standards-based programming model: oneAPI. oneAPI includes a cross-architecture language: Data Parallel C++ (DPC++). DPC++ is an evolution of C++ that incorporates the SYCL language with extensions for Unified Shared Memory (USM), ordered queues and reductions, among other features. oneAPI also includes libraries for API-based programming, such as domain-specific libraries, math kernel libraries and Threading Building Blocks (TBB). The main benefit of using oneAPI over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have cleaner, more portable, and more readable code. This tutorial’s main goal is not just to teach oneAPI as an easier approach to targeting heterogeneous platforms, but also to convey techniques for mapping applications to heterogeneous hardware, paying attention to the scheduling and mapping problems.

C++ for Heterogeneous Programming: oneAPI (DPC++ and oneTBB): Part 2

Date and Time: November 10, 2020 | 10AM - 2:00PM EST
Presenters: James Reinders (James Reinders Consulting), Michael Voss (Intel), Pablo Reble (Intel), Rafael Asenjo (University of Malaga)

This tutorial provides hands-on experience programming CPUs, GPUs and FPGAs using a unified, standards-based programming model: oneAPI. oneAPI includes a cross-architecture language: Data Parallel C++ (DPC++). DPC++ is an evolution of C++ that incorporates the SYCL language with extensions for Unified Shared Memory (USM), ordered queues and reductions, among other features. oneAPI also includes libraries for API-based programming, such as domain-specific libraries, math kernel libraries and Threading Building Blocks (TBB). The main benefit of using oneAPI over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have cleaner, more portable, and more readable code. This tutorial’s main goal is not just to teach oneAPI as an easier approach to targeting heterogeneous platforms, but also to convey techniques for mapping applications to heterogeneous hardware, paying attention to the scheduling and mapping problems.

Performance Evaluation of the Vectorizable Binary Search Algorithms on an FPGA Platform

Date and Time: November 11, 2020 | 3:30PM - 3:40PM EST
Presenters: Zheming Jin (Argonne National Laboratory), Hal Finkel (Argonne National Laboratory)

Field-programmable gate arrays (FPGAs) are becoming promising heterogeneous computing components. In the meantime, high-level synthesis (HLS) tools are pushing the FPGA-based development from the register-transfer level to high-level-language design flow using Open Computing Language (OpenCL), C, and C++. The performance of binary search applications is often associated with irregular memory access patterns to off-chip memory. In this paper, we implement the binary search algorithms using OpenCL, and evaluate their performance on an Intel Arria-10 based FPGA platform. Based on the evaluation results, we implement the grid search in XSBench by vectorizing and replicating the binary search kernel. In addition, we overcome the overhead of kernel vectorization by grouping work-items into work-groups. Our optimizations improve the performance of the grid search using the classic binary search by a factor of 1.75 on the FPGA.
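As a rough illustration of what makes a binary search "vectorizable", the sketch below replaces the usual divergent if/else with a conditional select, so every search runs the same number of iterations with the same control flow. It is plain C++ rather than OpenCL, it is not the kernel from the paper, and the function name `lower_bound_branchless` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): index of the first element
// that is >= key in a sorted array, or a.size() if there is none.
// The loop body has no data-dependent branch, only a conditional select,
// so the iteration count depends only on the array length.
std::size_t lower_bound_branchless(const std::vector<double>& a, double key) {
    assert(std::is_sorted(a.begin(), a.end()));  // precondition: sorted input
    std::size_t n = a.size();
    if (n == 0) return 0;

    const double* base = a.data();
    while (n > 1) {
        const std::size_t half = n / 2;
        // Keep the upper part only if the last element of the lower part
        // is still smaller than the key.
        base += (base[half - 1] < key) ? half : 0;
        n -= half;
    }
    // One element remains; step past it if it is smaller than the key.
    return static_cast<std::size_t>(base - a.data()) + ((*base < key) ? 1 : 0);
}
```

For the sorted array {1, 3, 5, 7}, a key of 5 yields index 2 and a key of 8 yields 4 (one past the end). Because every lookup performs the same sequence of steps, several keys can in principle be processed in lockstep, which is the property kernel vectorization relies on.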

H2RC – Introduction: Sixth International Workshop on Heterogeneous High-Performance Reconfigurable Computing

Date and Time: November 13, 2020 | 10:00AM - 10:05AM EST
Presenters: Kenneth O'Brien (Xilinx), Jason Bakos (University of South Carolina), Christian Plessl (Paderborn University), Torsten Hoefler (ETH Zurich), Franck Cappello (Argonne National Laboratory)

As in the previous five years, this workshop will bring together application experts, software developers and hardware engineers, both from industry and academia, to share experiences and best practices to leverage the practical application of reconfigurable logic to scientific computing, machine/deep learning, and “Big Data” applications. In particular, the workshop will focus on sharing experiences and techniques for accelerating applications and/or improving energy efficiency with FPGAs using OpenCL, OpenMP, OpenACC, SYCL, DPC++, C, C++ and other high-level design flows, which enable and improve cross-platform functional and performance portability while also improving productivity. Particular emphasis is given to cross-platform comparisons and combinations that foster a better understanding within the industry and research community on what are the best mappings of applications to a diverse range of hardware architectures available today (e.g., FPGA, GPU, many-cores and hybrid devices, ASICs), and on how to most effectively achieve cross-platform compatibility.

Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite

Date and Time: November 13, 2020 | 10:35AM - 11:05AM EST
Speakers: Marius Meyer, Tobias Kenter, Christian Plessl

We have developed an OpenCL-based open-source implementation of the HPCC benchmark suite for Xilinx and Intel FPGAs. This benchmark can serve to analyze the current capabilities of FPGA devices, cards, and development tool flows, track progress over time, and point out specific difficulties for FPGA acceleration in the HPC domain. Additionally, the benchmark documents proven performance optimization patterns. We will continue optimizing and porting the benchmark for new generations of FPGAs and design tools and encourage active participation to create a valuable tool for the community.

Tracking Performance Portability on the Yellow Brick Road to Exascale

Date and Time: November 13, 2020 | 11:10AM - 11:40AM EST
Presenters: Tom Deakin (University of Bristol), Andrei Poenaru (University of Bristol), Tom Lin (University of Bristol), Simon McIntosh-Smith (University of Bristol)

With exascale machines on our immediate horizon, there is a pressing need for applications to be made ready to best exploit these systems. However, there will be multiple paths to exascale, with each system relying on processor and accelerator technologies from different vendors. As such, applications will be required to be portable between these different architectures, but it is also critical that they are efficient. These dual requirements for portability and efficiency beget the need for performance portability. In this study we survey the performance portability of different programming models, including the open standards OpenMP and SYCL, across the diverse landscape of exascale and pre-exascale processors from Intel, AMD, NVIDIA, Fujitsu, Marvell, and Amazon, together encompassing GPUs and CPUs based on both x86 and Arm architectures. We also take a historical view and analyze how performance portability has changed over the last year.
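The abstract does not say which metric the study uses. One widely cited definition, due to Pennycook, Sewall and Lee, scores an application by the harmonic mean of its efficiency on each platform in a set, and gives zero if any platform in the set is unsupported. The sketch below assumes that definition; the function name and the convention of passing 0 for an unsupported platform are illustrative choices.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of a harmonic-mean performance portability score.
// `eff` holds one efficiency per platform, each in (0, 1], for example
// achieved performance divided by the best known performance there.
// An entry <= 0 is taken to mean "does not run on that platform".
double performance_portability(const std::vector<double>& eff) {
    assert(!eff.empty());
    double sum_of_inverses = 0.0;
    for (double e : eff) {
        if (e <= 0.0) return 0.0;  // unsupported platform: score is zero
        sum_of_inverses += 1.0 / e;
    }
    return static_cast<double>(eff.size()) / sum_of_inverses;
}
```

With efficiencies of 1.0 and 0.5 on two platforms, the score is 2 / (1 + 2), about 0.67, which is lower than the arithmetic mean of 0.75. The harmonic mean penalizes a single poorly performing platform more heavily, which is why it is often preferred for this purpose.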

Performance and Portability of a Linear Solver Across Emerging Architectures

Date and Time: November 13, 2020 | 11:50AM - 12:15PM EST
Presenters: Aaron Walden (NASA), Mohammad Zubair (Old Dominion University), Eric Nielsen (NASA)

A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult in the face of varying vector lengths offered by each. A similar challenge occurs on GPU architectures, where it is essential to have coalesced memory accesses to utilize memory bandwidth effectively. In this work, we demonstrate that restructuring a computation, and possibly data layout, with regard to architecture is essential to achieve optimal performance by establishing a performance benchmark for each target architecture in a low-level language such as vector intrinsics or CUDA. In doing so, we demonstrate how a linear solver kernel can be mapped to Intel Xeon and Xeon Phi, Marvell ThunderX2, NEC SX-Aurora TSUBASA Vector Engine, and NVIDIA and AMD GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA and Intel oneAPI/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.

Cross-Platform Performance Portability of DNN Models Using SYCL

Date and Time: November 13, 2020 | 5:25PM - 5:55PM EST
Author/Presenters: Mehdi Goli (Codeplay), Kumudha Narasimhan (Codeplay), Ruyman Reyes (Codeplay), Ben Tracy (Codeplay), Daniel Soutar (Codeplay), Svetlozar Georgiev (Codeplay), Evarist Fomenko (Intel), Eugene Chereshnev (Intel)

This talk is given as part of P3HPC: 3rd International Workshop on Performance Portability and Productivity.

The upcoming deployment of exascale platforms with a myriad of different architectures and co-processors has prompted the need for a software ecosystem based on open standards that can simplify maintaining HPC applications on different hardware. Applications written for a particular platform should be portable to a different one, ensuring performance is as close to the peak as possible. However, it is not expected that key performance routines in relevant HPC applications will be performance portable as is, especially for common building blocks such as BLAS or DNN. The oneAPI initiative aims to tackle this problem by combining a programming model, SYCL, with a set of interfaces for common building blocks that can be optimized for different hardware vendors. In particular, oneAPI includes the oneDNN performance library, which contains building blocks for deep learning applications and frameworks.

Evaluating the Performance and Portability of Contemporary SYCL Implementations

Date and Time: November 13, 2020 | 5:55PM - 6:25PM EST
Author/Presenters: Beau Johnston (Oak Ridge National Laboratory), Jeffrey S. Vetter (Oak Ridge National Laboratory), Josh Milthorpe (Australian National University)

This talk is given as part of P3HPC: 3rd International Workshop on Performance Portability and Productivity.

In this paper, we evaluate the existing SYCL implementations for important SYCL features across a range of hardware in order to understand SYCL's performance and portability.

OpenCL-enabled Parallel Raytracing for Astrophysical Application on Multiple FPGAs with Optical Links

Date and Time: November 13, 2020 | 6PM - 6:25PM EST
Speakers: Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Taisuke Boku, Kohji Yoshikawa, Makito Abe, Masayuki Umemura

In earlier work, we optimized the Authentic Radiative Transfer (ART) method, which solves space radiative transfer problems in early-universe astrophysical simulations, for Intel Arria 10 FPGAs. In this paper, we optimize it for the latest FPGA, the Intel Stratix 10, and evaluate its performance against a GPU implementation on multiple nodes. For the multi-FPGA computing and communication framework, we apply our original system, Communication Integrated Reconfigurable CompUting System (CIRCUS), to enable OpenCL-based programming that uses multiple optical links on the FPGAs for parallel FPGA processing. This is the first implementation of a real application over CIRCUS.

The oneAPI Software Abstraction for Heterogeneous Computing

Date and Time: November 17, 2020 | 10AM - 11:30AM EST
Moderator: Sujata Tibrewala (Intel)
Panelists: Rafael Asenjo (University of Malaga, Spain), Erik Lindahl (Stockholm University), Xiaozhu Meng (Rice University), Michael Wong (Codeplay), David Hardy (University of Illinois), Maria Garzaran (Intel)

OneAPI is a cross-industry, open, standards-based unified programming model. The oneAPI specification extends existing developer programming models to enable a diverse set of hardware through language, a set of library APIs and a low-level hardware interface to support cross-architecture programming. It builds upon industry standards and provides an open, cross-platform developer stack to improve productivity and innovation. At the core of oneAPI is the DPC++ programming language, which builds on the ISO C++ and Khronos SYCL standards. DPC++ provides explicit parallel constructs and offload interfaces to support a broad range of accelerators. In addition to DPC++, oneAPI also provides libraries for compute- and data-intensive domains; e.g., deep learning, scientific computing, video analytics and media processing. Finally, a low-level hardware interface defines a set of capabilities and services to allow a language runtime system to effectively utilize a hardware accelerator.

Khronos SYCL 2020 Release and ISO C++ 20 status and future directions

Date and Time: November 19, 2020 | 10AM - 11:15AM EST
Panelists: Michael Wong (Codeplay) and Simon McIntosh-Smith (University of Bristol)

SYCL is an open standard with a new release planned, and C++ is also releasing C++20 in 2020. After the successful ISO C++ for HPC BoF and SYCL BoF at SC17, SC18, and SC19, and with the increasing use of C++ in HPC, there was popular demand for updates on the new SYCL 2020 and C++20 features. SYCL is a vendor-neutral way to write ISO C++ that embraces heterogeneous parallelism, especially in ECP's Aurora exascale supercomputer. In this BoF, we have combined the SYCL and C++ BoFs so that C++ and SYCL experts can explain the new features in SYCL 2020 and C++20 relevant to HPC.