LENS2015 International Workshop

Program

Thursday, October 29th
10:00 - 10:30 Opening Session
Chair: Taisuke Boku
10:00 - 10:05 Welcome Address
Mitsuhisa Sato
10:05 - 10:30 "FLAGSHIP 2020 Project and XcalableMP 2.0"
Mitsuhisa Sato
10:30 - 13:00 Session 1
Chair: Takeshi Nanri
10:30 - 11:30 "Taming Heterogeneity by Segregation - The DEEP and DEEP-ER take on Heterogeneous Cluster Architectures"
Norbert Eicker
Abstract

On the way to Exascale, the DEEP/-ER projects take a radically different approach to heterogeneity. Instead of combining different computing elements within single nodes, DEEP/-ER's Cluster-Booster concept integrates multi-core processors in a standard Cluster and combines many-core processors in a separate cluster of accelerators, the so-called Booster. To this end, DEEP's Booster consists solely of Intel Xeon Phi processors interconnected by the EXTOLL network.

The talk will not only share insights into the challenges of integrating the hardware in an energy-efficient way but also discuss the strong requirements the architecture places on the corresponding programming model. While it turns out that MPI provides all the low-level semantics required to utilize the Cluster-Booster system, the project uses an OmpSs abstraction layer in order to support software developers in adapting their applications to the heterogeneous hardware. The ultimate goal is to reduce the burden on application developers. To this end, DEEP/-ER provides a familiar programming environment that saves application developers from some of the tedious and often costly code modernisation work. Confining this work to code annotation, as proposed by DEEP/-ER, is a major advancement.

The presentation concludes with final results of the DEEP project, which was completed at the end of August 2015.
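
As a rough illustration of the annotation-based approach, offloading a scalable kernel to the many-core side can be reduced to marking it with directives. The project itself uses OmpSs; the sketch below uses standard OpenMP 4.x target directives instead, and all names in it are illustrative.

/* Sketch: directive-based offload of a scalable kernel, in the spirit of the
 * annotation-based approach; standard OpenMP 4.x is used here, not OmpSs. */
#include <stdio.h>

#define N 1024

/* A highly scalable kernel: a natural candidate for the Booster side. */
void scale(double *x, double *y, double a, int n)
{
    /* The directives mark the work for offload; data movement is expressed
     * with map clauses rather than explicit transfers in application code. */
    #pragma omp target map(to: x[0:n]) map(from: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++)
        x[i] = (double)i;

    scale(x, y, 2.0, N);
    printf("y[%d] = %f\n", N - 1, y[N - 1]);
    return 0;
}

Without an offload device the directives fall back to host execution, so the same annotated code runs on both sides of the machine.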

11:30 - 12:00 "TCA: Tightly Coupled Accelerator - Concept and Implementation"
Taisuke Boku
12:00 - 12:30 "TCA System Implementation on PEACH2 and Its Performance"
Toshihiro Hanawa
12:30 - 13:00 "Computation / Communication Unification on FPGA Solution"
Yohei Miki
13:00 - 14:30 Lunch Break
14:30 - 17:40 Session 2
Chair: Taisuke Boku
14:30 - 15:15 "Experiences in Supporting Fortran Coarrays for HPC"
Deepak Eachempati
Abstract

In the most recent version of the standard, Fortran 2008, new parallel processing features based on coarrays were incorporated into the specification. The speaker will describe techniques developed and implemented for supporting these features. He will also discuss upcoming features that are expected to be adopted in the next revision of the standard, and provide results based on an early implementation.
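
As a rough sketch of the kind of lowering such an implementation performs, a remote coarray assignment like a(3)[2] = x can be mapped onto a one-sided put. The abstract does not name the runtime layer actually used; MPI-3 RMA is used below purely as a stand-in.

/* Conceptual lowering of the coarray assignment  a(3)[2] = x  onto MPI-3 RMA.
 * Image numbers are 1-based in Fortran, so image 2 corresponds to rank 1. */
#include <mpi.h>
#include <stdio.h>

#define N 8

int main(int argc, char **argv)
{
    int rank, nprocs;
    double a[N];                 /* local piece of the coarray */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++) a[i] = 0.0;

    /* Expose the coarray as an RMA window on every image. */
    MPI_Win_create(a, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && nprocs > 1) {
        double x = 42.0;
        /* a(3)[2] = x  -->  one-sided put into image 2's window, element 3 */
        MPI_Put(&x, 1, MPI_DOUBLE, 1, 2, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* plays the role of an image-control statement */

    if (rank == 1) printf("image %d: a(3) = %f\n", rank + 1, a[2]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}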

15:15 - 15:45 "XcalableACC: A Directive-Based Language Extension for Accelerator Clusters" [PDF]
Hitoshi Murai
"Evaluation of Productivity and Performance of the XcalableACC programming language" [PDF]
Masahiro Nakao
15:45 - 16:00 Coffee Break
16:00 - 16:25 "Implementation and Evaluation of Coarray Fortran Translator Based on OMNI XcalableMP" [PDF]
Hidetoshi Iwashita
16:25 - 16:50 "Dynamic Task Parallelism in PGAS Language XcalableMP" [PDF]
Keisuke Tsugane
16:50 - 17:15 "Performance evaluation of the SOR method with various orderings by XMP and MPI implementations" [PDF]
Shu Ogawara
17:15 - 17:40 "Performance Comparison between two programming models of XcalableMP" [PDF]
Hitoshi Sakagami
18:30 - Reception
Venue: PRONTO IL BAR UDX AKIBA ICHI
Friday, October 30th
09:30 - 12:30 Session 3
Chair: Masahiro Nakao
09:30 - 09:45 Introduction [PDF]
Atsushi Hori
09:45 - 10:30 "OpenSHMEM: Introduction, Version 1.3, and Beyond"
Manjunath Gorentla Venkata
Abstract

OpenSHMEM is a predominant PGAS library interface specification. It is a community effort to standardize the SHMEM programming model, driven by Oak Ridge National Laboratory (ORNL), the Department of Defense (DoD), and the University of Houston (UH). The community has released three versions of the OpenSHMEM specification and will release the latest version, version 1.3, at SC15. In this talk, I will first introduce OpenSHMEM, present its history, and discuss the upcoming features. Then, I will discuss the efforts preparing OpenSHMEM for the exascale era and provide an overview of OpenSHMEM activities, including specification development, the reference implementation, and research. Lastly, I will provide an overview of the OpenSHMEM reference implementation and its network layer, UCX.
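
For readers unfamiliar with the interface, a minimal OpenSHMEM program looks roughly as follows. It is written against the 1.2-style API (shmem_init, shmem_malloc); the 1.3 additions discussed in the talk are not shown.

/* Minimal OpenSHMEM sketch: each PE writes its ID into its right neighbour's
 * symmetric array element with a one-sided put. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same remotely accessible buffer on every PE. */
    int *dest = (int *)shmem_malloc(sizeof(int));
    *dest = -1;
    shmem_barrier_all();

    int right = (me + 1) % npes;
    shmem_int_put(dest, &me, 1, right);   /* one-sided put, no matching receive */
    shmem_barrier_all();

    printf("PE %d received %d from its left neighbour\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}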

10:30 - 11:00 "Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems" [PDF]
Akio Shimada
Abstract

Recently, the number of cores in an HPC system node has been growing rapidly. Partitioned Virtual Address Space (PVAS) is a new task model that enables efficient parallel processing on such many-core systems. The PVAS task model allows multiple processes to run in the same address space, which means that processes running on the PVAS task model can conduct intra-node communication without the overhead of crossing address-space boundaries. In this presentation, we show that the PVAS task model can optimize MPI intra-node communication performance. With PVAS, a message can be copied directly from the sender's buffer to the receiver's buffer. Moreover, the PVAS task model enables MPI processes to access the MPI objects of other processes, which makes it possible to implement efficient intra-node communication. We optimized the intra-node communication of Open MPI, and the benchmark results show that the optimized MPI intra-node communication improves MPI application performance.
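
The single-copy idea can be illustrated with an ordinary threaded program, since threads share an address space just as PVAS processes do. The sketch below is only an analogy and does not use the PVAS interfaces or the Open MPI integration described in the talk.

/* Analogy only: two POSIX threads share one address space, much as PVAS
 * processes do, so a "message" moves with a single memcpy from the sender's
 * buffer straight into the receiver's buffer, with no intermediate
 * shared-memory bounce buffer. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define MSG_LEN 64

static char recv_buf[MSG_LEN];            /* the receiver's user buffer   */
static atomic_int posted    = 0;          /* receive has been posted      */
static atomic_int delivered = 0;          /* message has been delivered   */

static void *sender(void *arg)
{
    char send_buf[MSG_LEN] = "hello via a single copy";
    while (!atomic_load(&posted))         /* wait for the matching receive */
        ;
    memcpy(recv_buf, send_buf, MSG_LEN);  /* sender buffer -> receiver buffer */
    atomic_store(&delivered, 1);
    return NULL;
}

static void *receiver(void *arg)
{
    atomic_store(&posted, 1);             /* publish the destination buffer */
    while (!atomic_load(&delivered))
        ;
    printf("received: %s\n", recv_buf);
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}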

11:00 - 11:30 "MapReduce Frameworks on Multiple PVAS for Heterogeneous Computing Systems" [PDF]
Mikiko Sato
Abstract

This study presents a MapReduce framework for building applications flexibly with a task model that exploits parallelism across host CPUs and Xeon Phi coprocessors. In a heterogeneous computing system consisting of many-core and multi-core CPUs, the CPUs should collaborate tightly with each other in order to fully utilize their performance. However, current MapReduce frameworks targeting many-core and multi-core CPUs are limited to execution on shared-memory systems, and some kind of communication method is needed to realize mutual collaboration between CPUs. In this study, the task execution model "Multiple Partitioned Virtual Address Space (M-PVAS)" is applied to the MapReduce framework to realize application execution on a global virtual address space for the heterogeneous system. M-PVAS, implemented on a Xeon and Xeon Phi system, enables communication between tasks on different CPUs with a minimum overhead of 2.0 usec. The MapReduce framework is implemented on the M-PVAS system, and the effect of the M-PVAS model is evaluated with MapReduce applications.
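
As a purely illustrative sketch of the user-facing shape of such a framework, the application supplies map and reduce callbacks and leaves task placement to the framework. None of the names below come from the M-PVAS implementation.

/* Illustrative MapReduce skeleton: map tasks fill per-task partial tables,
 * a reduce step merges them.  In the framework described in the talk the map
 * tasks would run on different CPUs and share their tables via the global
 * virtual address space; here they simply run in one process. */
#include <stdio.h>
#include <string.h>

#define NKEYS  4
#define NTASKS 2                         /* pretend: one host, one coprocessor */

static const char *keys[NKEYS] = { "pvas", "mpi", "xeon", "phi" };

/* map: count key occurrences in one chunk of input, into a partial table */
static void map_chunk(const char **words, int nwords, int partial[NKEYS])
{
    for (int i = 0; i < nwords; i++)
        for (int k = 0; k < NKEYS; k++)
            if (strcmp(words[i], keys[k]) == 0)
                partial[k]++;
}

/* reduce: merge the per-task partial tables into the final result */
static void reduce_merge(int partials[NTASKS][NKEYS], int result[NKEYS])
{
    for (int k = 0; k < NKEYS; k++) {
        result[k] = 0;
        for (int t = 0; t < NTASKS; t++)
            result[k] += partials[t][k];
    }
}

int main(void)
{
    const char *chunk0[] = { "pvas", "mpi", "pvas" };
    const char *chunk1[] = { "xeon", "phi", "pvas" };
    int partials[NTASKS][NKEYS] = { { 0 } };
    int result[NKEYS];

    map_chunk(chunk0, 3, partials[0]);
    map_chunk(chunk1, 3, partials[1]);
    reduce_merge(partials, result);

    for (int k = 0; k < NKEYS; k++)
        printf("%s: %d\n", keys[k], result[k]);
    return 0;
}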

11:30 - 12:00 "Spare Node Substitution for Failure Nodes" [PDF]
Kazumi Yoshinaga
Abstract

In the upcoming Exa-scale era, faults could happen more frequently than ever, and thus many mechanisms to survive failures have been proposed and investigated. One such mechanism is user-level fault mitigation, in which the user program handles failures so that it can survive them and continue its execution. A simple method is for the program to continue its execution with only the healthy nodes after a failure. However, this is not suitable for some applications (e.g., stencil applications): since the application must then be executed with an unusual number of nodes, it is difficult to balance the load and keep the communication pattern. To deal with this problem, using spare nodes to substitute for failed nodes is a solution. An important issue is how the failed nodes should be substituted with spare nodes. In this talk, we will show the possibility of communication performance degradation due to the substitutions. Moreover, we will present and discuss several substitution methods.
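
One way to picture a substitution is as a re-mapping of logical ranks onto surviving and spare processes. The sketch below builds such a mapping with plain MPI group operations; failure detection and the actual substitution methods compared in the talk are outside its scope, and the failed and spare ranks are hard-coded for illustration.

/* Sketch: rebuild a communicator after a failure by substituting a spare rank
 * for a failed rank while keeping the logical rank order, so neighbours stay
 * neighbours and the communication pattern is preserved. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (world_size < 3) {                /* need a working set plus a spare */
        if (world_rank == 0)
            fprintf(stderr, "run with at least 3 processes\n");
        MPI_Finalize();
        return 0;
    }

    int nwork  = world_size - 1;         /* last rank is held back as a spare */
    int failed = 1;                      /* pretend this logical rank failed  */
    int spare  = world_size - 1;

    /* New rank list: identical to the old one except that the failed slot
     * now holds the spare process. */
    int ranks[nwork];
    for (int i = 0; i < nwork; i++)
        ranks[i] = (i == failed) ? spare : i;

    MPI_Group world_group, new_group;
    MPI_Comm  new_comm;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, nwork, ranks, &new_group);
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);

    if (new_comm != MPI_COMM_NULL) {     /* the "failed" rank gets MPI_COMM_NULL */
        int new_rank;
        MPI_Comm_rank(new_comm, &new_rank);
        printf("world rank %d -> logical rank %d\n", world_rank, new_rank);
        MPI_Comm_free(&new_comm);
    }

    MPI_Group_free(&new_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}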

12:00 - 12:30 "Throttling Approach Towards High Performance Collective MPI-IO" [PDF]
Yuichi Tsujita
Abstract

Nowadays, MPI-IO plays an important role in high-performance parallel I/O; in particular, collective MPI-IO is frequently used in the underlying parallel I/O layer of HDF5 and PnetCDF. Therefore, performance improvements in collective MPI-IO lead to performance improvements in HDF5 and PnetCDF. A well-known MPI-IO implementation named ROMIO includes a two-phase I/O optimization for collective MPI-IO. We have been focusing on a throttling approach named EARTH (Effective Aggregation Rounds using THrottling) for further performance improvements of two-phase I/O on the K computer. The EARTH optimization tunes the number of I/O requests generated at the same time in order to relieve I/O contention on the parallel file system and in data aggregation on the K computer. As a result, the EARTH optimization improves collective MPI-IO performance by up to twice the original I/O performance.
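
For orientation, a collective write of the kind two-phase I/O operates on looks roughly as follows. The hints shown are standard ROMIO hints for collective buffering; the EARTH throttling parameters themselves live inside the modified ROMIO on the K computer and are not reproduced here.

/* Sketch of a collective MPI-IO write with ROMIO-style hints that steer
 * two-phase I/O aggregation. */
#include <mpi.h>
#include <stdio.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[COUNT];
    for (int i = 0; i < COUNT; i++) buf[i] = rank + i * 1e-6;

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB aggregation buffer  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes a contiguous block at its own offset; write_at_all lets
     * ROMIO aggregate the requests in the two-phase I/O rounds. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}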

12:30 - 14:00 Lunch Break
14:00 - 17:00 Session 4
Chair: Atsushi Hori
14:00 - 14:45 "Designing Hybrid MPI+PGAS Library for Exascale Systems: MVAPICH2-X Experience"
Dhabaleswar K. Panda
Abstract

This talk will focus on challenges in designing a hybrid MPI+PGAS library for exascale systems. Motivations, features, and design guidelines for supporting hybrid MPI and PGAS (OpenSHMEM, UPC, and CAF) programming models with the MVAPICH2-X library will be presented. The role of a unified communication runtime in supporting hybrid programming models on InfiniBand, accelerators, and co-processors will be outlined. Unique capabilities of the hybrid MPI+PGAS model to re-design HPC applications to harness performance and scalability will also be presented through a set of case studies.
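
A hybrid MPI+OpenSHMEM program has roughly the following shape. This is a sketch only: whether and how the two initializations may be combined in one binary is implementation-specific, which is exactly what a unified runtime such as MVAPICH2-X is meant to provide.

/* Hybrid sketch: OpenSHMEM supplies one-sided atomics, MPI handles a
 * collective phase over the same set of processes. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int rank = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric counter updated with one-sided PGAS operations. */
    long *counter = (long *)shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* Every PE atomically increments the counter on PE 0. */
    shmem_long_inc(counter, 0);
    shmem_barrier_all();

    /* MPI side: a collective reduction over the same processes. */
    long local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("counter on PE 0 = %ld, MPI sum of ranks = %ld (npes = %d)\n",
               *counter, sum, npes);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}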

14:45 - 15:30 "Towards Exascale with Global Arrays using Communication Runtime at Extreme Scale (ComEx)"
Abhinav Vishnu
Abstract

Global Arrays is a Partitioned Global Address Space programming model that uses the Communication Runtime at Extreme Scale (ComEx) as its communication backend on large-scale systems. In this talk, Dr. Vishnu will present the research conducted by the group on the performance and fault-tolerance aspects of Global Arrays and ComEx. He will present approaches for designing ComEx on upcoming systems using MPI as the backend, with both two-sided and one-sided semantics. A performance evaluation of this design using NWChem and several other kernels shows the effectiveness of this approach and performance similar to the native ports.
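
For orientation, a minimal Global Arrays program looks roughly as follows; ComEx (or another runtime) works underneath the GA calls, so nothing ComEx-specific appears at this level. This is a sketch, not a complete application.

/* Minimal Global Arrays sketch: create a distributed 2-D array, fill it, and
 * read a patch with a one-sided get. */
#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

#define N 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);    /* local memory allocator used by GA */

    int dims[2] = { N, N };
    int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);   /* default distribution */

    double one = 1.0;
    GA_Fill(g_a, &one);                  /* set every element to 1.0 */
    GA_Sync();

    if (GA_Nodeid() == 0) {              /* one-sided get of the first row */
        double row[N];
        int lo[2] = { 0, 0 }, hi[2] = { 0, N - 1 }, ld[1] = { N };
        NGA_Get(g_a, lo, hi, row, ld);
        printf("A(0,0) = %f\n", row[0]);
    }
    GA_Sync();

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}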

15:30 - 15:40 Coffee Break
15:40 - 16:00 "Introduction of ACE (Advanced Communication library for Exa) Project" [PDF]
Takeshi Nanri
16:00 - 16:20 "Basic Layer and Data Library of ACP (Advanced Communication Primitives) Library" [PDF]
Takafumi Nose
16:20 - 16:40 "Simulation of RDMA Communication with NSIM-ACE" [PDF]
Hidetomo Shibamura
16:40 - 17:00 "Development of Applications on ACP Library" [PDF]
Hiroaki Honda
17:00 - 17:15 Closing Remarks
Taisuke Boku