Poster Session

The ISP2S2 poster session will be held from 17:30 to 20:00 on December 2, 2014, in the Seminar Room, AICS 1F,
in conjunction with the reception, which starts at 18:30.


  • Directors / Adopted FY2010
  • Directors / Adopted FY2011
  • Directors / Adopted FY2012
Highly Productive, High Performance Application Frameworks for Post Petascale Computing
Title / Speaker
P26 Fault-Tolerant Mechanism Using GPU Virtualization Software: DS-CUDA
Minoru Oikawa, Keio University
Abstract Recent high-performance computing systems consist of a large number of components that are distributed over a network and interconnected in complex ways. Graphics Processing Units (GPUs) are widely used in recent supercomputers to accelerate calculations because they can run far more parallel threads than general-purpose CPUs. However, developing and executing applications on such large systems with distributed GPUs raises some difficulties: (1) Achieving good performance with the Message-Passing Interface (MPI) becomes harder with distributed GPUs, since data in GPU memory cannot be transferred directly and its latency is larger than that of CPU memory. (2) The Mean Time Between Failures (MTBF) of supercomputers with GPUs tends to be shorter than that of ordinary clusters because of the instability of GPUs. To reduce these difficulties, we are developing a software development framework named "Distributed-Shared CUDA (DS-CUDA)". This framework enables programmers to write code that controls a large number of GPUs on a distributed system using only the CUDA language, without MPI. It also provides a fault-tolerant mechanism with automatic redundant calculation and migration of GPU processes from a faulty GPU to a healthy one.
P27 MassiveThreads/DM: a Global-View Task-Parallel Library for Distributed Memory Machines
Shigeki Akiyama, Graduate School of Information Science and Technology, The University of Tokyo
Abstract We propose MassiveThreads/DM, a global-view task-parallel library for implementing parallel divide-and-conquer algorithms on distributed memory supercomputers. MassiveThreads/DM provides fine-grained tasks, distributed load balancing, and partitioned global address space (PGAS) features suited to fine-grained task parallelism. To adapt to the dynamic nature of task parallelism and exploit the data locality of parallel divide-and-conquer programs, the PGAS features support object migration among compute nodes and manual caching of globally shared data, in addition to basic PGAS features such as globally shared objects and distributed arrays. MassiveThreads/DM is implemented as a C/C++ library, so it can easily be integrated with existing applications, libraries, and the runtime systems of parallel programming languages. This presentation will discuss the MassiveThreads/DM features in detail and show preliminary results for benchmark programs such as Unbalanced Tree Search and cache-oblivious matrix multiplication.
P28 Advanced GPU Optimizations for Stencil-Based Real-World Applications: Adaptive Mesh Refinement for CFD and Optimizing Multi-Kernel Transformations for Data Locality
Mohamed Wahib, RIKEN Advanced Institute for Computational Science, Kobe, Japan
Abstract Stencil-based operations constitute a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. This poster presents two different projects for optimizing stencil-based operations for next-generation applications: a) a high-performance CFD code based on the adaptive mesh refinement (AMR) method for stencil-based computation on GPUs, and b) a method for GPU multi-kernel transformations that exploits inter-kernel data locality through kernel fusion. AMR is an effective technique for reducing computational time and memory usage. In our approach, an octree data structure is adopted, and GPU memory is managed by a space-filling curve algorithm. By introducing the AMR method, the required memory is reduced to 1/30 in the advection simulation. The introduced method is shown to be more efficient than our previous approaches without AMR for the tsunami and compressible fluid simulations. Multi-kernel transformation can improve performance by reducing data traffic to off-chip memory; we fuse kernels that share data arrays to generate new kernels in which hidden localities are exposed. The main challenges are a) a scalable search method for the optimal kernel fusions under the constraints of data dependencies and kernel precedences, b) a codeless performance upper-bound projection model, and c) effectively applying kernel fusion to achieve speedup. The proposed method improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
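As a minimal sketch of the kernel-fusion idea described in this abstract (not the authors' GPU implementation), the following Python example fuses two hypothetical element-wise kernels that share the array `a`. The function names and the kernels themselves are assumptions chosen only for illustration; on a GPU, the benefit would come from not writing the intermediate array to off-chip memory and reading it back.

```python
def scale_kernel(a):
    """Hypothetical kernel 1: writes an intermediate array b = 2 * a."""
    return [2.0 * x for x in a]


def add_kernel(a, b):
    """Hypothetical kernel 2: reads a and b, writes c = a + b."""
    return [x + y for x, y in zip(a, b)]


def unfused(a):
    # Two passes: the intermediate array b is materialized between kernels.
    b = scale_kernel(a)
    return add_kernel(a, b)


def fused(a):
    # One pass: each element of a is read once and b is never stored.
    return [x + 2.0 * x for x in a]
```

Both versions compute the same result; only the amount of intermediate data that is stored and reloaded differs.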
P25 High Performance and Highly Productive Stencil Framework for GPU Accelerators
Naoya Maruyama, RIKEN Advanced Institute for Computational Science, Kobe, Japan
Abstract Physis is an application framework for stencil computations that is focused on achieving both performance and productivity in large-scale parallel computing systems, in particular heterogeneous GPU-accelerated systems. The framework consists of an implicitly parallel domain-specific language (DSL) for stencil computations and its translators to parallel platforms. The DSL extends the standard C language with a small number of custom programming constructs that allow the user to express typical stencil operations on regular multi-dimensional grids in a very simple fashion. The Physis code is then translated to standard source code for target devices such as GPU accelerators, where our CUDA code generator automatically applies various loop optimizations, resulting in performance comparable to that of hand-tuned CUDA. In this talk, we present the design and implementation of the Physis framework with GPU performance results. We also present some of our ongoing efforts to develop more aggressive optimizations that exploit the data localities available in stencil computations on GPUs, which standard general-purpose compilers usually cannot utilize. We show preliminary results of the fusion optimization with climate model GPU codes.
System Software for Post Petascale Data Intensive Science
Title / Speaker
P1 System software for Post-Petascale Data-Intensive Science
Osamu Tatebe, University of Tsukuba
Abstract Data-intensive science requires more I/O performance than computational science. To improve I/O performance further, the system architecture and system software of supercomputers should be investigated. Installing compute nodes and storage arrays separately cannot always scale up I/O performance. A storage architecture that federates the local storage of compute nodes may solve this issue, much as a distributed memory architecture does. Our research covers such a scaled-out distributed file system with non-uniform access performance, a compute-node operating system with cooperative caching and reduced OS noise, and runtime systems such as a workflow system, MPI-IO, MapReduce, and a batch queuing system that take the non-uniform access performance into account. This talk covers our system software efforts to improve I/O bandwidth and IOPS for data-intensive science applications.
P2 An Efficient Caching Mechanism for Post-Petascale Distributed Storage
Yoshihiro Oyama, The University of Electro-Communications
Abstract We will talk about a cooperative caching mechanism implemented in the Gfarm distributed file system. The mechanism enables Gfarm client nodes to read the memory cache of other client nodes, so that the nodes as a whole hold a large amount of memory cache. We assume that users execute data-intensive scientific applications on the file system and that the client nodes are connected with InfiniBand. We achieve high-speed data transmission between client nodes by InfiniBand remote direct memory access (RDMA). In this talk, we explain the design and implementation of the mechanism and report evaluation results. A major challenge in realizing the mechanism is the management of cache location information among client nodes and storage servers. Although the mechanism tracks cache locations only loosely, we confirm in our experiments that it predicts the locations with high accuracy. The results also show that the mechanism significantly reduces the execution time of real-world applications. For example, the execution time of the Montage workflow is decreased by 15.4%. We also briefly mention other Gfarm extensions being implemented for extremely high performance, including a deduplication mechanism based on content-defined chunking and a zero-copy kernel driver.
P3 Pwrake - a workflow system for data-intensive science
Masahiro Tanaka
P4 Object Storage for OpenNVM flash primitives
Fuyumasa Takatsu, University of Tsukuba
Abstract Recently, several flash and non-volatile memory devices have become available. The Fusion IO ioDrive is a flash device connected via PCI Express. It provides not only the standard block device interface but also virtual storage layer (VSL) functionality such as atomic writes and sparse addressing, through OpenNVM. We have designed an object storage system using OpenNVM, aiming to achieve maximum IOPS and bandwidth. In our approach, each object is stored in a contiguous fixed-size region, which is identified by the Object ID. Using the sparse address space, on-demand block assignment is possible, so that only the written blocks in a region are physically assigned. This design is enabled by sparse addressing and avoids complicated management of indirect block references. By using atomic writes, the object storage supports the ACID properties for each write. The object storage supports two types of object layout for storing object data: direct layout and log-structured layout. Object creation reaches more than 740K object creations per second, which is over 10 times better than the Fusion IO DirectFS. Sequential and random access also perform better than with DirectFS, achieving 687 MB/s for read and 734 MB/s for write.
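The addressing scheme described in this abstract can be illustrated with a small Python sketch. It is an assumption-laden toy, not the authors' implementation and not the OpenNVM API: `REGION_SIZE`, `BLOCK_SIZE`, `region_base`, and `SparseStore` are hypothetical names, and a dictionary stands in for the device's sparse address space. Atomic writes are not modeled.

```python
REGION_SIZE = 1 << 30  # assumed fixed region size per object (illustrative)
BLOCK_SIZE = 4096      # assumed block size (illustrative)


def region_base(object_id):
    """Start address of the fixed-size region that belongs to object_id."""
    return object_id * REGION_SIZE


class SparseStore:
    """Toy stand-in for a sparse address space: blocks exist only once written."""

    def __init__(self):
        self.blocks = {}  # block-aligned address -> bytes

    def write(self, object_id, offset, data):
        if offset % BLOCK_SIZE or len(data) != BLOCK_SIZE:
            raise ValueError("this sketch only handles whole, aligned blocks")
        if not 0 <= offset <= REGION_SIZE - BLOCK_SIZE:
            raise ValueError("write falls outside the object's region")
        self.blocks[region_base(object_id) + offset] = data

    def read(self, object_id, offset):
        # Unwritten blocks read back as zeros, as in a sparse file.
        return self.blocks.get(region_base(object_id) + offset, bytes(BLOCK_SIZE))
```

Because the region address is computed directly from the Object ID, no per-object index of indirect blocks is needed, and only written blocks consume space in the toy store.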
ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT)
Title / Speaker
P49 ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT)
Kengo Nakajima, University of Tokyo
Abstract ppOpen-HPC is an open source infrastructure for the development and execution of optimized and reliable simulation code on post-peta-scale (pp) parallel computers based on many-core architectures. It consists of various types of libraries that cover general procedures for scientific computation. Source code developed on a PC with a single processor is linked with these libraries, and the parallel code generated is optimized for post-peta-scale systems. In ppOpen-HPC, we focus on five types of discretization methods for scientific computing: FEM, FDM, FVM, BEM, and DEM. ppOpen-HPC provides a set of libraries covering various types of procedures for these five methods, such as parallel I/O of data sets, assembly of coefficient matrices, linear solvers with robust and scalable preconditioners, adaptive mesh refinement (AMR), and dynamic load balancing. The target post-peta-scale system is the Post T2K System of the University of Tokyo and the University of Tsukuba, based on many-core architectures such as Intel MIC/Xeon Phi. It will be installed in FY2015-2016, and its peak performance is expected to be 20-30 PFLOPS. The source code of ppOpen-HPC and its documentation are available at http://ppopenhpc.cc.u-tokyo.ac.jp/ .
P50 Multi-scale and multi-physics simulation of seismic waves-building vibration coupling by using the ppOpen-HPC Libraries
Takashi Arakawa, JAMSTEC/RIST
Abstract Recent progress in high-end supercomputers has enabled larger-scale, more complex simulations. Owing to this progress, the phenomena that can be simulated have become more complex, and the software applications that model actual natural phenomena have become larger and larger. Against this background, we have been developing a coupling library called ppOpen-MATH/MP. ppOpen-MATH/MP is designed to be applicable to models employing the various discretization methods supported by the ppOpen-HPC software suite, such as FDM, FVM, and FEM. This wide applicability is achieved by making the library independent of the grid structure and by allowing users to implement their own interpolation code. To demonstrate the applicability of ppOpen-MATH/MP, we used it to couple a seismic model and a structure model. The seismic model selected for this purpose is Seism3D, which employs the FDM discretization method. FrontISTR++, whose discretization method is FEM, is used as the structure model coupled with Seism3D. In the poster, we will present the features of ppOpen-MATH/MP and discuss the results of the coupled simulation.
P51 ppOpen-APPL/DEM: Integrated libraries for simulation of many particles with short-range interactions
Miki Y. Matsuo, JAMSTEC
P52 Performance Evaluations of Applications using the ppOpen-APPL/FDM Library
Masaharu Matsumoto, Information Technology Center, The University of Tokyo
Abstract ppOpen-APPL/FDM is a standard FDM library designed for execution on future architectures as part of the ppOpen-HPC libraries. We have optimized the parallel performance of an application for 3D seismic wave propagation analysis using the library on the Intel Xeon Phi co-processor. In addition, innovative algorithms are needed to run simulations on future architectures at a reasonable cost in computer resources. We have therefore also implemented and evaluated an adaptive mesh refinement (AMR) framework for the ppOpen-APPL/FDM library. To overcome the problem of load imbalance in parallelized AMR simulations, our framework implements a dynamic domain decomposition (DDD) technique, with which the whole computational domain is dynamically re-decomposed into new sub-domains so that the computational load on each process becomes nearly the same. The DDD procedure succeeds in significantly reducing the average execution time.
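The re-decomposition step described in this abstract can be illustrated with a deliberately simplified one-dimensional Python sketch. This is an assumption for illustration only, not the DDD algorithm of ppOpen-APPL/FDM: the hypothetical function `decompose` cuts a row of cells with non-negative per-cell costs into contiguous sub-domains whose total costs are as even as a prefix-sum split allows.

```python
from bisect import bisect_left
from itertools import accumulate


def decompose(costs, nprocs):
    """Split cells 0..n-1 into nprocs contiguous sub-domains of similar cost.

    costs must be a non-empty list of non-negative per-cell costs.
    Returns a list of (start, end) index pairs with end exclusive.
    """
    prefix = list(accumulate(costs))  # prefix[k] = cost of cells 0..k
    total = prefix[-1]
    cuts = [0]
    for p in range(1, nprocs):
        target = total * p / nprocs  # cumulative cost the first p parts should carry
        # First cell index at which the running cost reaches the target.
        cuts.append(min(bisect_left(prefix, target) + 1, len(costs)))
    cuts.append(len(costs))
    return list(zip(cuts[:-1], cuts[1:]))


def loads(costs, domains):
    """Total cost assigned to each sub-domain."""
    return [sum(costs[start:end]) for start, end in domains]
```

For example, if refinement makes two cells in the middle four times as expensive as the others, calling `decompose` again after refinement moves the cut points so that the refined cells get sub-domains of their own.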
P53 Exploring Solver Space for Stokes Flow with Highly Heterogeneous Viscosity Structure
Patrick Sanan, Università della Svizzera italiana (USI) (University of Italian-speaking Switzerland)
Abstract Stokes flow is an important model of fluid flow in several disciplines including microfluidics and geophysical modeling. Intense research in recent years has provided a vast landscape of algorithmic options for solving large scale Stokes problems within the framework of a Newton-Krylov method. Concurrently, the diversity of competitive high performance computing architectures has expanded, providing parallel systems with specialized processing units, complex memory space hierarchies, and vastly different communication bandwidths and latencies between memory spaces. Several related preconditioning approaches have proven to be algorithmically scalable. These methods necessarily involve hierarchical and nested solves, and choosing an effective solver involves navigating a combinatorial explosion of options. The parameters of an effective method are sensitive to both the viscosity structure of the problem being solved and the characteristics of the parallel system available. Here, we survey a space of scalable preconditioned Krylov methods as applied to a challenging problem from mantle convection, and attempt to quantify how highly heterogeneous viscosity structure and communication bottlenecks influence the results of performance tests.
Parallel System Software for Multi-core and Many-core
Title / Speaker
P5 System Software for Many-core & Multi-core Architecture
Atsushi Hori, RIKEN AICS
Abstract Ongoing system software research for post-petaflops computing on a hybrid architecture consisting of many-core and multi-core CPUs will be reported. 1) A new task model for many-core architectures, named "Partitioned Virtual Address Space (PVAS)," will be introduced, and we will show that it makes MPI intra-node communication faster and more memory-efficient. 2) "Multiple-PVAS" for hybrid architectures consisting of many-core and multi-core CPUs will also be introduced. Multiple-PVAS can be thought of as a software implementation of hUMA, proposed by AMD. Finally, 3) fault resilience research will be introduced. This is collaborative research with the Innovative Computing Laboratory (ICL) at the University of Tennessee. User-Level Fault Mitigation (ULFM) is an MPI implementation that enables survival from process failures. Various techniques for continuing execution after surviving a failure are also being surveyed.
P6 New Task Model for Efficient Intra-node Communication on a Many-core Architecture
Akio Shimada, RIKEN AICS
Abstract We propose "Partitioned Virtual Address Space (PVAS)", a new task model that enables efficient intra-node communication on a many-core architecture. Because a large number of parallel processes run on a many-core architecture, intra-node communication takes place more frequently. Therefore, the performance of intra-node communication between parallel processes is an important issue. Not only the performance but also the memory footprint of intra-node communication is important, because the amount of per-core memory in many-core architectures is strictly limited. The PVAS task model allows parallel processes to run in the same address space. Thus, they can conduct intra-node communication without incurring the extra costs of crossing address space boundaries, which result in high latency and a large memory footprint. The PVAS task model is applied to the intra-node communication of the Message Passing Interface (MPI). The benchmark results show that our MPI intra-node communication improves MPI application performance by up to 18%. Moreover, the memory footprint of MPI intra-node communication is decreased by up to 264 MB.
P7 An Evaluation of Spare Node Method for Fault Tolerance in Exascale Systems
Kazumi Yoshinaga, RIKEN AICS
Abstract In the exascale era, faults will happen much more frequently. Thus, many mechanisms for surviving node failures have been proposed. However, there has been no discussion of how a job should survive node failures. In this poster, we discuss various methods of surviving node failures. First, we present guidelines for choosing a survival method, and we show that recovery by replacing a failed node with a spare node is better than recovery that simply avoids the failed node. Second, we address the characteristics of the spare-node method. We focus on a 5-point stencil communication pattern on a 2-D mesh network and compare the performance degradation caused by different process replacement patterns. Finally, based on the above discussions, we propose an effective method of surviving a node failure.
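To make the 5-point stencil setting in this abstract concrete, the following Python sketch computes how far apart neighboring processes end up after a failed node's process is moved to a spare node. It is an illustrative assumption, not the evaluation method of the poster: the function names are hypothetical, and the Manhattan hop count on the mesh is used here only as a rough proxy for communication cost.

```python
def stencil_neighbors(rank, nx, ny):
    """Ranks of the left/right/up/down neighbors on an nx-by-ny process grid.

    Ranks are numbered row by row; the grid has no wrap-around.
    """
    x, y = rank % nx, rank // nx
    neighbors = []
    if x > 0:
        neighbors.append(rank - 1)
    if x < nx - 1:
        neighbors.append(rank + 1)
    if y > 0:
        neighbors.append(rank - nx)
    if y < ny - 1:
        neighbors.append(rank + nx)
    return neighbors


def max_hops(placement, nx, ny):
    """Largest Manhattan distance between nodes hosting neighboring ranks.

    placement maps each rank to the (x, y) mesh coordinates of its node.
    """
    worst = 0
    for rank, (px, py) in placement.items():
        for other in stencil_neighbors(rank, nx, ny):
            qx, qy = placement[other]
            worst = max(worst, abs(px - qx) + abs(py - qy))
    return worst
```

A usage example under the same assumptions: a 3-by-3 process grid is placed on the first three columns of a mesh whose fourth column holds spare nodes, and the process on the failed center node is moved to the spare node in its row.

```python
placement = {r: (r % 3, r // 3) for r in range(9)}
before = max_hops(placement, 3, 3)   # every neighbor is one hop away
placement[4] = (3, 1)                # rank 4 moves to the spare node
after = max_hops(placement, 3, 3)    # rank 4 is now up to three hops from its neighbors
```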
P8 Multiple-PVAS: Parallel Task Execution Model on a Global Address Space for Heterogeneous Systems
Mikiko Sato, Tokyo University of Agriculture and Technology
Abstract This poster proposes "Multiple-PVAS" (Multiple Partitioned Virtual Address Space, M-PVAS), a program execution environment that enables parallel use of a multi-core CPU and a many-core CPU on a hybrid-architecture system. Prior to this study, we proposed a novel task model called PVAS (Partitioned Virtual Address Space) that places and executes a number of PVAS task address spaces in one virtual address space. M-PVAS extends PVAS so that PVAS tasks running on a multi-core CPU and PVAS tasks running on a many-core CPU can coexist in one virtual address space. The idea is to divide roles between the many-core CPU and the multi-core CPU: the many-core CPU handles highly parallel computation, and the multi-core CPU handles hard-to-parallelize computation. M-PVAS couples them tightly so that they can communicate very efficiently and obtain a high level of computational performance. M-PVAS is implemented in the Linux kernel on Xeon and Xeon Phi. Basic performance measurements of M-PVAS show that cooperation between the CPUs can be implemented with a minimum overhead on the order of 2.4 usec, and that M-PVAS is effective for communication between tasks on the hybrid system.
Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
Title / Speaker
P33 Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
Tetsuya Sakurai, University of Tsukuba
P34 A hierarchical parallel eigen-computing engine for large sparse eigenvalue problems
Yasunori Futamura, University of Tsukuba
Abstract We have developed a parallel eigen-computing engine named z-Pares for solving large sparse eigenvalue problems. z-Pares implements the Sakurai-Sugiura method, which computes an eigen-subspace using complex moments obtained by contour integration. z-Pares can compute the eigenvalues located inside a specified contour path and the corresponding eigenvectors. Matrix symmetry and definiteness are exploited where appropriate to reduce the computational complexity. z-Pares provides two-level MPI distributed parallelism. The first level is the parallelism of solving independent linear systems at the quadrature points, and the second level is the parallelism of matrix and vector operations. This feature enables users to utilize post-petascale computational resources. We present performance evaluations of z-Pares using large-scale matrices from several practical applications.
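As a hedged illustration of the complex moments mentioned in this abstract, a common textbook form for a generalized eigenvalue problem $A x = \lambda B x$ is given below. The symbols $\Gamma$ (contour), $V$ (a block of source vectors), $z_j$ and $w_j$ (quadrature points and weights), $N$, and $M$ are notation assumed here for illustration; the exact formulation used inside z-Pares may differ, for example in scaling.

```latex
% Complex moments obtained by contour integration, and their quadrature approximation
S_k \;=\; \frac{1}{2\pi i}\oint_{\Gamma} z^{k}\,(zB - A)^{-1} B V \,\mathrm{d}z
\;\approx\; \sum_{j=1}^{N} w_j\, z_j^{k}\, Y_j ,
\qquad k = 0, 1, \dots, M-1,

% where each Y_j comes from a linear system that is independent of the others
(z_j B - A)\, Y_j \;=\; B V , \qquad j = 1, \dots, N .
```

The $N$ linear systems are independent of one another, which is what makes the first level of parallelism (over quadrature points) possible; the second level applies within each solve.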
P35 Modeling the performance of parallel dense eigensolvers on peta/post-petascale systems
Takeshi Fukaya, RIKEN AICS
Abstract Performance modeling is often helpful for the efficient development of programs. Currently, we are developing a dense eigensolver for upcoming post-petascale systems. Our eigensolver employs a new approach via a pentadiagonal (or wider banded) matrix, and its suitability for such systems needs to be validated. Anticipating emerging bottlenecks (e.g., communication cost) in advance is also vital. In addition, accurately estimating the runtime of a program is beneficial for users in application fields. For these reasons, we have studied performance modeling of eigensolvers on peta/post-petascale systems alongside the development of the eigensolver itself. In this poster, we present our recent results on the performance modeling of our dense eigensolver. We have constructed a prototype model based on the results of past performance evaluations. We have also verified the model by comparing its predictions with actual results on the K computer. Assuming the specifications of a post-petascale system, we then predict the performance of our eigensolver using our model. The prediction shows the potential of the new approach employed in our eigensolver and indicates vital issues to be solved to achieve high performance.
P36 Optimally combined eigenvalue problem solvers and their benchmark on the K computer
Hiroto Imachi, Tottori University, JST-CREST
Abstract Optimally combined numerical solvers were constructed for large-scale generalized eigenvalue problems that arise in quantum nanomaterial simulation or electronic structure calculation. Dense matrix algorithms for (symmetric positive definite) generalized eigenvalue problems consist of several subprocedures, namely, (i) reduction to a symmetric standard eigenvalue problem, (ii) tri- or penta-diagonalization, (iii) the divide-and-conquer algorithm, and (iv) inverse transformation of eigenvectors. We chose an optimal routine from ScaLAPACK, EigenExa (T. Imamura, in this symposium), and ELPA for each subprocedure and combined them to construct optimal generalized eigenvalue problem solvers. In the pure ScaLAPACK solver, (i) the reduction procedure and (ii) the tridiagonalization procedure are bottlenecks. We measured the performance of various combined solvers for large matrices (dimension up to 430,080) on the K computer and found that several of them remedy the bottlenecks and achieve high scalability. The authors are developers of the large-scale quantum nanomaterial simulator ELSES (http://www.elses.jp/) and will connect the present solver to the simulator for future nanomaterial research.
Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
Title / Speaker
P37 Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
Ryuji SHIOYA
Abstract We have been developing ADVENTURE, an open-source, general-purpose parallel finite element analysis system that can simulate large-scale analysis models on supercomputers such as the K computer. Within this system, the hierarchical domain decomposition method (HDDM), a very effective technique for large-scale analysis, was developed. The aim of this project is to develop a numerical library based on HDDM, extended to pre- and post-processing including mesh generation and visualization of large-scale data, for post-petascale simulation, in order to (1) easily convert existing FEM code to high-performance DDM code, (2) obtain a higher fraction of peak Flop/s, higher parallel efficiency, and a higher convergence rate than existing libraries of iterative solvers for sparse linear systems, and (3) supply the domain decomposition techniques of DDM as a large-scale data manipulation framework for post-petascale computing.
P38 Development of libraries of a DDM-based sparse linear solver and a versatile scientific computer graphics for the post-petascale FEM simulation
Masao Ogino, Nagoya University
Abstract For post-petascale finite element method (FEM) simulation, we have been developing a library of iterative solvers based on the domain decomposition method, named LexADV_IsDDM, and a library for versatile scientific computer graphics, named LexADV_VSCG. First, the objective of LexADV_IsDDM is to solve ultra-large-scale linear systems arising in continuum mechanics. To this end, the library solves the Schur complement equation of the linear system by iterative methods. Moreover, with a coarse grid correction preconditioner, fast and stable convergence can be expected. Second, the objective of LexADV_VSCG is to generate a first-detail-image for post-processing efficiently. To this end, the library supports the rendering of triangles and particles and generates images at 4K or higher resolution. In this poster, as an example of a post-petascale scientific simulation, a 100 billion FEM simulation performed with these libraries is presented.
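For readers unfamiliar with the Schur complement equation mentioned in this abstract, the standard derivation is sketched below. The notation is assumed here for illustration and is not taken from LexADV_IsDDM: subscript $I$ denotes unknowns interior to the subdomains and $B$ denotes unknowns on the interfaces between them.

```latex
% Linear system partitioned into interior (I) and interface (B) unknowns
\begin{pmatrix} K_{II} & K_{IB} \\ K_{BI} & K_{BB} \end{pmatrix}
\begin{pmatrix} u_I \\ u_B \end{pmatrix}
=
\begin{pmatrix} f_I \\ f_B \end{pmatrix}

% Eliminating u_I = K_{II}^{-1} (f_I - K_{IB} u_B) gives the Schur complement equation
S\, u_B = g, \qquad
S = K_{BB} - K_{BI} K_{II}^{-1} K_{IB}, \qquad
g = f_B - K_{BI} K_{II}^{-1} f_I .
```

An iterative method applied to $S u_B = g$ only needs products of $S$ with vectors, and each such product reduces to independent subdomain solves with $K_{II}$, which is why this form suits domain decomposition.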
P39 Development of distributed parallel explicit Moving Particle Simulation (MPS) method and large scale tsunami analysis on urban areas
Kohei Murotani, The University of Tokyo
Abstract In this research, we describe a distributed-memory parallel algorithm for the explicit MPS (Moving Particle Simulation) method. The MPS method is one of the popular particle methods with collision. ParMETIS is adopted for domain decomposition. In our poster, we show the algorithm and parallel scalability results obtained on the FX10 at the University of Tokyo. As an application, we present a large-scale tsunami run-up analysis in which the tsunami inundates the Ishinomaki urban area and carries two tanks 10 m in diameter.
P40 A Development of Domain Specific Language (DSL) for Continuum Mechanics
Hirotaka Tanimura, Toyo University
Abstract We have developed a Domain Specific Language (DSL) for continuum mechanics named AutoMT (read "otemoto"). AutoMT translates LaTeX source code into C/C++/Fortran source code. It is a library for tensor, matrix, and vector operations, and it can be applied to large-scale parallel calculations. We mainly have in mind applications to continuum mechanics simulation, in particular HDDMPPS (hierarchical domain decomposition method for post petascale simulation). With AutoMT, the programming effort in continuum mechanics simulation would be greatly reduced. Source code converted by AutoMT has achieved two to three times higher calculation performance than existing conventional code. AutoMT is scheduled to be released as open source as part of the ADVENTURE project (whose official name is "development of computational mechanics system for large scale analysis and design").
Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
Title / Speaker
P41 Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale
Katsuki Fujisawa, Kyushu University
Abstract In this talk, we present our ongoing research project. The objective of this project is to develop an advanced computing and optimization infrastructure for extremely large-scale graphs on post peta-scale supercomputers. We describe our work on the Graph500 and Green Graph500 benchmarks, which are designed to measure the performance of a computer system for applications that require irregular memory and network access patterns. The Graph500 list was first released in November 2010. The Graph500 benchmark measures the performance of a supercomputer performing a BFS (Breadth-First Search) in terms of traversed edges per second (TEPS). We implemented the world's first GPU-based BFS on the TSUBAME 2.0 supercomputer at Tokyo Institute of Technology in 2012. The Green Graph500 list collects TEPS-per-watt metrics. At ISC14, our project team was a winner of the 8th Graph500 benchmark and the 3rd Green Graph500 benchmark. We also present our parallel implementation for large-scale SDP (SemiDefinite Programming) problems. We solved the largest SDP problem (which has over 2.33 million constraints), thereby creating a new world record. Our implementation also achieved 1.713 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs on the TSUBAME 2.5 supercomputer.
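The TEPS metric described in this abstract can be illustrated with a small single-node Python sketch. This is not the Graph500 reference code: the function names are hypothetical, and counting every adjacency entry scanned is a simplifying assumption, since the official benchmark defines its own rules for which edges count and how timing is done.

```python
from collections import deque
import time


def bfs(adj, root):
    """Breadth-first search over an adjacency-list graph.

    Returns (parent, edges_scanned): parent maps each reached vertex to its
    BFS-tree parent (the root maps to itself), and edges_scanned counts every
    adjacency entry examined during the search.
    """
    parent = {root: root}
    frontier = deque([root])
    edges_scanned = 0
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            edges_scanned += 1
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    return parent, edges_scanned


def teps(adj, root):
    """Traversed edges per second for one BFS, using wall-clock time."""
    start = time.perf_counter()
    _, edges = bfs(adj, root)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the timer on very small graphs.
    return edges / elapsed if elapsed > 0 else float("inf")
```

Dividing the same edge count by the energy-related quantity instead of time alone gives the TEPS-per-watt style of metric used for the Green list; power measurement is outside the scope of this sketch.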
P42 Extreme-scale Graph Processing
Hitoshi Sato, Tokyo Institute of Technology
P43 ScaleGraph: A Billion-Scale Graph Processing Library
Toyotaro Suzumura
P44 Petascale General Solver for Semidefinite Programming Problems with over Two Million Constraints
Katsuki Fujisawa, Kyushu University
Abstract The semidefinite programming (SDP) problem is one of the central problems in mathematical optimization. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems. However, two well-known major bottlenecks exist in the algorithmic framework of the PDIPM: the generation of the Schur complement matrix (SCM) and its Cholesky factorization. We have developed a new version of the semidefinite programming algorithm parallel version (SDPARA), which is a parallel implementation on multiple CPUs and GPUs for solving extremely large-scale SDP problems with over a million constraints. When the generation of the SCM becomes a bottleneck, SDPARA can attain high scalability using a large number of CPU cores together with processor affinity and memory interleaving techniques. When an SDP problem has over two million constraints and Cholesky factorization constitutes a bottleneck, SDPARA can also perform parallel Cholesky factorization using thousands of GPUs and techniques for overlapping computation and communication. Through numerical experiments conducted on the TSUBAME 2.5 supercomputer, we demonstrate that SDPARA is a high-performance general solver for SDPs in various application fields, and we solved the largest SDP problem (which has over 2.33 million constraints), thereby creating a new world record. Our implementation also achieved 1.713 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.
An evolutionary approach to construction of a software development environment for massively-parallel heterogeneous systems
Title / Speaker
P29 Enhancing Performance Portability of Real Applications Using Xevolver
Shoichi Hirasawa, Tohoku Univ./JST CREST
abstract
Abstract HPC computing platforms have diversified to exploit different levels of parallelism and memory locality. Meanwhile, existing HPC applications tend to be optimized for a few target computing platforms to achieve high performance and high efficiency, even though code optimizations for one computing platform may degrade performance on other computing platforms. This degrades performance portability across different platforms, and makes the applications difficult to evolve whenever a new computing platform becomes available. Xevolver, which is a code transformation framework using only standard XML techniques, helps users enhance the performance portability of an existing application by separating those code optimizations from the application code. The application code itself does not have to be changed to adapt to a target platform if code optimizations are defined separately for each computing platform. In this presentation, case studies of large-scale HPC applications show that those code optimizations can be maintained as translation rules independently of application codes to enhance their performance portability.
P30 Parallel Numerical Libraries with Xevolver towards Exa-Scale Systems
Takahashi Daisuke
P31 Designing an HPC Refactoring Catalog toward Post Peta-scale Computing Era
Ken'ichi Itakura
P32 Tools for Exa-Scale Computational Science Codes based on Xevolver
Reiji Suda, the University of Tokyo
abstract
Abstract Our project pursues a framework to help application code developers with the code optimizations that are required for high-performance scientific computing. Our framework, called Xevolver, provides such functionality by exposing the abstract syntax tree (AST) in XML format, and utilizing standard XML transformation systems such as XSLT. In this poster presentation, we will describe Xevolver Tools, a tool set which provides easy access to Xevolver for developers who are not familiar with AST, XML or XSLT, and also provides a framework to collect information from the source code, to modify the transformation rules according to the collected information, and to apply repeated or recursive transformations. The major components of Xevolver Tools are tgen, an automatic XSLT template generator; xevdrs, a script language to control the transformations; and xevtu, a set of utility routines for extensible code transformations. The programming model that Xevolver provides can be seen as a form of idiomatic metaprogramming: the target code is generated by code transformations that are selected and activated based on user-defined code patterns which appear in the source code. This feature enables high-performance code optimizations, which are most likely to be platform-dependent, to be described separately from the source code.
Development of Scalable Communication Library with Technologies for Memory Saving and Runtime Optimization
Title / Speaker
P9 ACE (Advanced Communication for Exa) Project
Takeshi Nanri, Kyushu University
abstract
Abstract According to recent discussions about next generation supercomputers, due to the issue of the energy efficiency of memory bandwidth, the available amount of total memory is expected to be almost the same as, or even smaller than, that of current systems. Therefore, memory-efficient communication is becoming a new issue for achieving sustained scalability towards such extreme scale computing systems. As an approach to the issue, the project ACE (Advanced Communication for Exa) is working on a memory-efficient communication library, ACP (Advanced Communication Primitives). This poster introduces the following activities of the project: - Design programming interfaces to support developing scalable applications. - Study techniques to optimize communications according to the runtime information. - Examine the advantage of this library on applications. - Develop a network simulator to estimate communication cost with this library.
P10 ACP (Advanced Communication Primitives) Basic Layer
Shinji Sumimoto
P11 ACP (Advanced Communication Primitives) Middle Layer
Toshiya Takami, Research Institute for Information Technology, Kyushu University
abstract
Abstract The new communication library, Advanced Communication Primitives (ACP), is being developed in order to achieve low-overhead data transfer with just enough memory in exa-scale systems. The middle layer is devoted to constructing programmer-friendly interfaces for data structures and channel-based communications. In this poster presentation, we give a detailed explanation of the basic concept used to configure this layer, as well as examples, typical usage and the current implementation of these functions. The channel interface, for example, provides functions for allocating and de-allocating buffers for each pair of processes that requires message passing communication, which enables us to reduce excess memory consumption. Moreover, the functionality of this interface is limited to single-direction, in-order communication. This minimizes the memory consumption and the overhead of the implementation. Other types of pattern communications included in this layer will also be shown with simple examples.
P12 NSIM-ACE: Network Simulator for Global Memory Access
Hidetomo Shibamura, Institute of Systems, Information Technologies and Nanotechnologies
abstract
Abstract NSIM-ACE is an interconnect simulator that supports various performance evaluations of extreme-scale interconnects. This simulator extends the conventional NSIM with simulation of global memory access. RDMA (Remote Direct Memory Access) style communication is newly supported in addition to message-passing style communication. Simulation events representing the communication pattern are generated internally in response to the real execution of an MPI-compatible C program (MGEN program). Accurate simulation is performed by using fine-grained interconnect configurations and various MPI overheads given by dedicated files. Topologies of mesh, torus (up to six dimensions), Tofu (K computer, FX10), and Fat Tree are supported. NSIM-ACE is implemented with MPI based on PDES (Parallel Discrete Event Simulation), and very fast simulation is achieved. For example, a random ring communication (in HPC Challenge) on a 128K-node 3D torus is simulated in 18 minutes using a real 128-core system.
Software development for post petascale supercomputing --- Modularity for Supercomputing
Title / Speaker
P54 Modularity for Supercomputing
Shigeru Chiba
P55 Formal Verification of the Correctness of CUDA Programs Using Hoare Logic
Kensuke Kojima, Kyoto University, JST CREST
abstract
Abstract We study a verification method for CUDA kernels. Our main aim is to check the functional correctness of a kernel, that is, that the output of the kernel satisfies its specification, which is expressed by a mathematical formula involving inputs and outputs. To this end, we adapt Hoare Logic, a traditional technique to reason about sequential programs, to (a subset of) CUDA kernels. Although our method relies on the assumption that all of the threads are executed simultaneously, which is not necessarily true for actual GPUs, this assumption is sound for race-free programs. Since many GPU kernels are written so that they are race-free, we believe this assumption is reasonable. Based on this logic, we are planning to develop a verifier that receives a CUDA program and its specification together with a hint for the verifier (e.g., a loop invariant), and automatically checks whether the specification is satisfied or not.
P56 Effect of Applying HPC to MSR - A Case Study of Code Clone Detection -
Naoyasu Ubayashi, Kyushu University
abstract
Abstract Mining Software Repositories (MSR) researchers integrate and analyze data in software repositories such as source control and bug tracking systems to help practitioners make strategic decisions about their projects. One of the current challenges in the MSR field is how to deal with large-scale data, since most repositories continue to grow rapidly in size. Therefore, researchers in the MSR field are interested in applying high performance computing (HPC), such as supercomputers, to their MSR analyses. In our poster, we empirically evaluate the impact of HPC on MSR analyses. As a case study, we apply HPC (FX10) to code clone detection, which is the automated process of identifying duplications in source code and requires a large amount of computation, using the dataset collected from the Apache CXF project. From the case study, we find that we only need to write a small number of additional lines of code to use HPC, and that FX10 is up to a factor of 1,197 faster than a desktop
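To make the notion of code clone detection concrete, here is a minimal sketch that reports windows of consecutive lines appearing in more than one place after whitespace normalization. The poster does not say which detection algorithm or tool was used, so the line-window approach, the function name find_clones, and the window parameter are all assumptions for illustration.

```python
from collections import defaultdict


def find_clones(files, window=3):
    """Find duplicated runs of `window` non-blank lines.

    files maps a file name to its source text. Returns a list of groups,
    each group being a list of (file name, 1-based start line) locations.
    """
    index = defaultdict(list)
    for name, text in files.items():
        # Normalize whitespace and remember original line numbers.
        lines = [(i + 1, " ".join(raw.split()))
                 for i, raw in enumerate(text.splitlines())]
        lines = [(no, line) for no, line in lines if line]
        for k in range(len(lines) - window + 1):
            chunk = tuple(line for _, line in lines[k:k + window])
            index[chunk].append((name, lines[k][0]))
    return [locs for locs in index.values() if len(locs) > 1]


sources = {
    "a": "x = 1\ny = 2\nz = x + y\nprint(z)\n",
    "b": "\nx = 1\ny  =  2\nz = x + y\n",
}
clones = find_clones(sources)
```

Each file is indexed independently before the groups are merged, which is the kind of structure that lends itself to distributing files across many nodes.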
P57 A Dynamically-typed Language for Prototyping High-Performance Data Parallel Programs
Hidehiko Masuhara, Tokyo Institute of Technology
abstract
Abstract We present an overview of the design and implementation of Ikra, a Ruby extension with data-parallel abstractions. Ikra extends Ruby's built-in array class so as to execute its map and reduce operations on parallel computers, including GPU-accelerated computers. Our compiler uses a type inference algorithm to identify array operations in dynamically typed Ruby programs, and compiles the kernels of those operations into a program in C-based parallel languages (such as CUDA and C with an MPI library). This will allow programmers to seamlessly use Ruby's existing constructs and libraries along with those data-parallel operations.
P58 Distributed Gene Sequencing Without the Pain
Yves Vandriessche, Software Languages Lab, Vrije Universiteit Brussel
abstract
Abstract Scientists have been using scripting languages for decades to glue together several domain-specific applications into computational experiments. The explosion of data and compute power needs has outgrown these traditional scripting approaches, which break down on the large-scale coordination of parallel tasks. We bring in Intel Concurrent Collections as a way to deal with this coordination issue, while ensuring that this is transparent to the user. We present the initial results of an experiment which aims to demonstrate that this approach can lead to a more painless way to script the parallel execution of a whole-genome sequencing pipeline, a domain where such a 'parallel scripting crisis' is very pronounced.
Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
Title / Speaker
P13 Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
Toshio Endo, Tokyo Tech
P14 The efficient utilization of memory hierarchy on GPU Clusters
Guanghao Jin, Tokyo Institute of Technology
abstract
Abstract The domain size of scientific simulations that can be computed by a GPU cluster is, in the common case, limited by the memory capacity of the GPUs. In this poster, we introduce our optimization methods for efficiently using the memory hierarchy of a GPU cluster to enable stencil computation on big domains. We use the temporal blocking method to reduce the communication cost between different memories and propose optimization methods to solve the redundancy problem of temporal blocking. We evaluate our optimization methods on different GPU systems, and the results show that they enable computation on domains that are bigger than the memory capacity of the GPUs while maintaining high performance.
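The idea of temporal blocking, and the redundancy it introduces, can be sketched on a one-dimensional 3-point stencil. This is a plain Python illustration of the data flow only, not the authors' GPU code: each block is copied out together with a halo as wide as the number of time steps, advanced several steps locally, and only its interior is written back. The stencil, the function names, and the block size are assumptions chosen for the example.

```python
def step(u):
    """One 3-point averaging step; the two end values are held fixed."""
    if len(u) < 3:
        return list(u)
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def naive(u, steps):
    """Advance the whole domain one time step at a time."""
    for _ in range(steps):
        u = step(u)
    return u


def temporal_blocked(u, steps, block):
    """Advance each block by `steps` time steps before moving to the next.

    A halo of `steps` cells is loaded on each side of a block. Halo cells are
    recomputed by neighbouring blocks as well, which is the redundant work
    that temporal blocking trades for fewer transfers between memories.
    """
    n = len(u)
    out = list(u)
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        a = max(0, lo - steps)   # left edge of block plus halo
        b = min(n, hi + steps)   # right edge of block plus halo
        local = u[a:b]           # stands in for one transfer to fast memory
        for _ in range(steps):
            local = step(local)
        out[lo:hi] = local[lo - a:hi - a]  # keep only the block interior
    return out


domain = [float((i * 7) % 11) for i in range(32)]
blocked = temporal_blocked(domain, steps=3, block=8)
reference = naive(domain, steps=3)
```

With a halo of width equal to the number of fused steps, errors from the artificial block edges never reach the block interior, so both versions produce the same values; the cost is that each block updates up to 2 x steps extra cells per step.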
P15 A Profiling Tool set for measuring B/F Ratios and Cache Behaviors from Actual Applications
Shimpei Sato, JAIST
abstract
Abstract When optimizing applications, improving cache hit rates with techniques that exploit the locality of processing is important for performance. One such technique is temporal blocking, which exploits locality across time steps and improves cache utilization. In this poster, we present methodologies to measure B/F ratios and cache behaviors from actual applications based on the Exana tool set, which is an application profiling and optimization infrastructure we have developed in our project. We evaluate both the obtained B/F ratios and cache behaviors and discuss the relationship between them. Furthermore, we compare the execution of the original un-optimized code with that of code optimized using temporal blocking.
P16 DLM: Remote Memory Paging for Efficient Use of Memory Resource on Clusters
Hiroko Midorikawa, Seikei University
abstract
Abstract DLM (Distributed Large Memory) provides a user-level remote memory paging mechanism to application processes in clusters to realize an out-of-core solution for existing OpenMP and pthread programs. It offers seamless programming interfaces to existing programs: replacing malloc with dlm_alloc, and/or adding dlm, an extended C storage class specifier defined in the DLM C compiler, to existing array declarations. It also realizes efficient utilization of memory resources among several application processes sharing a cluster. DLM has several implementations: DLM-LAN, which consists of several multi-client memory servers and an admin process monitoring the status of memory servers in a cluster; DLM-WAN, which is designed to utilize memory resources in multiple WAN-connected clusters efficiently; and DLM-MPI, which is designed for open clusters managed by MPI batch systems. This poster introduces DLM implementations and related research topics, and discusses future extensions and possibilities for post peta-scale computing.
Power Management Framework for Post-Petascale Supercomputers
Title / Speaker
P17 Power Management Framework for Post-Petascale Supercomputers
Masaaki Kondo, The University of Tokyo
abstract
Abstract Power consumption is expected to be a first class design constraint for developing future post-petascale supercomputers. To make effective use of a limited power budget, one paradigm shift that awaits us is that we need to allow peak power to exceed the power constraint (over-subscription) and adaptively set power caps for each application or each hardware component within an application. To effectively utilize such systems, we are developing an automatic power-performance optimization framework, power management runtime systems, a power-performance simulator of large scale HPC systems, and power management APIs. In this poster, we will introduce the concept and the overview of our power management framework.
P18 Power-Performance Optimization for Power-Constrained Supercomputer Systems
Yuichi Inadomi, Kyushu University
abstract
Abstract Needless to say, electric power is one of the most important resources for post-petascale supercomputers, and we believe that supercomputers will be operated under power constraints in the near future. It is known that power consumption without a power cap, and performance under the same power cap, differ between processors with the same catalogue specification. Therefore, the performance of an MPI program using static load balancing becomes worse when the same power cap is applied to all processors. To improve the performance, the power cap for each processor is determined depending on the power-consumption characteristics of the processors, without changing the total power budget. This talk introduces how the power cap for each processor is determined and how much the performance improves.
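The reasoning above can be made concrete with a toy model. The sketch below assumes, purely for illustration, that each processor needs power a * f + b to sustain frequency f, with different coefficients per processor; this linear model, the coefficient values, and the function names are not taken from the poster. Under that assumption, choosing caps so that every processor reaches the same frequency keeps a statically balanced program from waiting on the least efficient processor.

```python
def frequency_under_cap(model, cap):
    """Frequency a processor sustains under a power cap (hypothetical model)."""
    a, b = model
    return (cap - b) / a


def uniform_caps(models, budget):
    """Give every processor the same share of the total budget."""
    return [budget / len(models)] * len(models)


def balanced_caps(models, budget):
    """Choose caps so all processors run at one common frequency.

    Solves sum(a_i * f + b_i) == budget for f, then derives each cap.
    Returns (common frequency, list of caps).
    """
    sum_a = sum(a for a, _ in models)
    sum_b = sum(b for _, b in models)
    freq = (budget - sum_b) / sum_a
    return freq, [a * freq + b for a, b in models]


# Three nominally identical processors with different power characteristics.
models = [(20.0, 10.0), (25.0, 10.0), (30.0, 20.0)]
budget = 190.0

freq, caps = balanced_caps(models, budget)
slowest_uniform = min(frequency_under_cap(m, c)
                      for m, c in zip(models, uniform_caps(models, budget)))
```

In this toy setting the slowest processor under uniform caps runs below the common frequency reached with per-processor caps, while the total budget is unchanged.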
P19 NsimPower: Large Scale Interconnection Network Simulator for Power/Performance Analysis
Koji Inoue, Kyushu University
abstract
Abstract This talk introduces NsimPower, which can be used for power-performance analysis of large scale interconnection networks. NsimPower is built on the parallel discrete event simulation approach, and provides an MPI-compatible programming interface. It supports low-power idle behavior that attempts to turn active links off if no traffic appears. Users can therefore analyze and understand in depth the impact of such runtime power management on interconnect power-performance characteristics. The power-performance profile generated by NsimPower can be fed to a visualization tool called Boxfish. This tool chain makes it possible for users to better understand the static (or steady-state) and dynamic (or transient) power behavior.
P20 Job Scheduling and Resource Management for Power-Constrained Supercomputers
Thang Cao, Univ. of Tokyo
abstract
Abstract A limited power budget has become a crucial problem in designing and implementing a supercomputer system. Maintaining the required performance of high priority jobs while improving the total throughput of the system makes it more difficult to schedule multiple jobs under a strict power limitation. This work studies a dynamic resource manager and job scheduler for power-constrained supercomputers. The scheduler selects a new job to submit based on available hardware resources and power budget together with profiles of running jobs. The resource manager periodically monitors resource usage, optimizes the power cap for each job dynamically, and ensures that operating power does not exceed the predetermined power limit. Evaluation on an HPC system shows that the dynamic resource manager and job scheduler can successfully control the power consumption of executing jobs with negligible overhead, while satisfying required performance and improving the total throughput of the system.
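The selection step described above, choosing jobs that fit both the free hardware and the remaining power budget, might be sketched as follows. This is a simple greedy pass over a priority-ordered queue; the job fields, the function name select_jobs, and the greedy policy are assumptions for illustration and do not describe the scheduler evaluated in the poster.

```python
def select_jobs(queue, free_nodes, free_power):
    """Start queued jobs that fit the free nodes and the free power budget.

    queue is a list of dicts with "name", "nodes", and "power" (estimated
    watts), ordered from highest to lowest priority. Returns the names of
    the started jobs and the nodes and power left afterwards.
    """
    started = []
    for job in queue:
        fits_nodes = job["nodes"] <= free_nodes
        fits_power = job["power"] <= free_power
        if fits_nodes and fits_power:
            started.append(job["name"])
            free_nodes -= job["nodes"]
            free_power -= job["power"]
    return started, free_nodes, free_power


queue = [
    {"name": "A", "nodes": 4, "power": 400.0},
    {"name": "B", "nodes": 4, "power": 500.0},
    {"name": "C", "nodes": 2, "power": 150.0},
]
started, nodes_left, power_left = select_jobs(queue, free_nodes=8, free_power=600.0)
```

In this example job B fits the free nodes but not the remaining power, so the lower-priority job C is started instead. A dynamic resource manager as described in the abstract would additionally adjust per-job power caps while jobs run.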
Framework for Administration of Social Simulations on Massively Parallel Computers
Title / Speaker
P45 CASSIA Project: Comprehensive Architecture of Social Simulation for Inclusive Analysis
Itsuki Noda, National Institute of Advanced Industrial Science and Technology
abstract
Abstract Project CASSIA (Comprehensive Architecture of Social Simulation for Inclusive Analysis) aims to develop a framework for administering the exhaustive execution of large-scale multiagent simulations to analyze socially interactive systems. The framework consists of a manager module and distributed execution middleware; the manager module builds effective execution plans for simulations over a massive number of possible conditions according to available computer resources, while the execution middleware provides functionality to realize distributed multi-agent simulation on many-core computers. The framework will provide a flexible engineering environment to analyze, design and synthesize social systems such as traffic, the economy and politics. Currently, we are applying the framework to investigate evacuation guidance in disasters, crowd control for large events, transportation design for urban areas, and market mechanism design for stock markets.
P46 Whole Tokyo Stock Exchange Market Simulation Project: Design of Financial Market Regulations using Agent-based Simulation
Takuma Torii, Department of Systems Innovation, School of Engineering, The University of Tokyo
abstract
Abstract Agent-based simulations can be a strong tool for designing market systems and rules. We apply computer simulation to support the design of market regulations in the Tokyo Stock Exchange. In our simulation, various types of trader agents sell or buy multiple stocks and/or index futures in the market. Some agents, called arbitrageurs, make profits from the difference between stock and index futures prices. To find the conditions under which a sudden price drop of one stock spreads over the markets, we conducted exhaustive simulations for combinations of various agent types. Agent-based simulations typically have a large parameter space, which is hard to explore exhaustively in general. For this, we used software that exploits modern high-performance computing technologies for agent-based simulations: OACIS, job management software for large-scale simulations, and X10, a programming language for multi-core, parallel computing environments. Since modern markets show market-wide co-movement of multiple assets, which is considered to be correlated with financial crises, a market-wide regulation for multiple assets, called a market-wide circuit breaker, has been explored. Using computer simulation, we evaluated the stabilization effectiveness of the two market regulations, single-stock and market-wide circuit breakers.
P47 City Traffic Simulation on the Large-scale Agent Simulation Framework
Hideyuki Mizuta, IBM Research, JST CREST
abstract
Abstract In this poster, we introduce a highly scalable distributed agent-based simulation framework and a microscopic vehicle simulation for metropolitan macro traffic flow. X10-based Agents Executive Infrastructure for Simulation is a multi-agent simulation platform built on top of the X10 language. X10 is a state-of-the-art PGAS (Partitioned Global Address Space) language that brings high productivity when implementing highly parallel and distributed applications on post-peta or exascale machines. The X10-based Agents Executive Infrastructure for Simulation works on the latest X10 (version 2.5). On this platform, we developed the IBM Mega traffic simulator, which can simulate millions of vehicles in an entire city to evaluate city planning. We have applied the city traffic simulation to real road networks, including Hiroshima and Tokyo, for demonstration experiments and validation. We will also show the changes in the average travel times of vehicles under an estimation-based signal control strategy using the simulator.
P48 OACIS: a mass-execution manager
Yohsuke Murase, RIKEN Advanced Institute for Computational Science
abstract
Abstract We present job management software for large-scale simulations, “OACIS” (Organizing Assistant for Comprehensive and Interactive Simulations). As a simulation model becomes complex and the number of its parameters increases, the parameter space we need to explore grows exponentially. This issue can be critical especially for agent-based social simulations, where the models are often more complex and have more parameters than those in the natural sciences. Thus, an exploration of a huge parameter space is needed to obtain meaningful conclusions. Using OACIS, users can control a large number of simulation jobs executed on various remote servers, and keep the results in an organized way together with execution logs such as execution date, host, and elapsed time. The software has a web browser front end, and users can easily submit various jobs to appropriate remote hosts from a web browser. In the presentation, simulations of evacuation and urban traffic are demonstrated as examples.
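The exponential growth of the parameter space mentioned above is easy to see in a small sketch that enumerates every combination of parameter values. It uses only the Python standard library and does not call any OACIS interface; the parameter names and values are made up for illustration.

```python
from itertools import product


def parameter_sets(space):
    """Yield one dict per combination of values in a parameter space.

    space maps a parameter name to the list of values to try.
    """
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical parameters of an agent-based simulation.
space = {
    "agents": [100, 1000],
    "speed": [1.0, 1.5, 2.0],
    "seed": [0, 1],
}
jobs = list(parameter_sets(space))
```

The number of jobs is the product of the number of values per parameter, so adding one more parameter with k values multiplies the job count by k. This is the bookkeeping burden that a mass-execution manager is meant to take over.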
Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era
Title / Speaker
P21 Overview of Tightly Coupled Accelerated Computing
Taisuke Boku
P22 PEACH2/PEACH3: Switching Hubs for TCA
Hideharu Amano
P23 XcalableACC: A PGAS Language for Accelerated Parallel Computers
Hitoshi Murai, RIKEN AICS
abstract
Abstract Accelerated parallel computers (APC) such as GPU clusters are emerging as an HPC platform. This paper proposes XcalableACC, a directive-based language extension to program APCs. XcalableACC is a combination of two existing directive-based languages, XcalableMP and OpenACC, and has two additional functions: data/work mapping among multiple accelerators and direct communication between accelerators. The result of the preliminary evaluation shows that XcalableACC is a promising means to program APCs.
P24 Development of Applications with GPU/TCA
Masayuki Umemura