Issue 2 - 2021-08-09

Download the complete issue as PDF from Zenodo.

The Journal of High-Performance Storage (JHPS) is a new open-access journal, edited by storage experts, that unites key features fostering openness and trust in storage research. In particular, JHPS offers open reviews, living papers, digital replicability, and free open access. The editing team is proud to announce the publication of the second JHPS issue. It contains two articles on the important topic of performance analysis. Since our last publication, we have hardened the manuscript central that we use to manage submissions. With the ISSN 2748-7814 assigned, the journal is now also recognized as a serial publication. Now that we are confident in the effectiveness of the established workflows and tools, our goal is to foster the adoption of the journal and to refine the workflows for digital replicability.

Articles

Holistic I/O Activity Characterization Through Log Data Analysis of Parallel File Systems and Interconnects

Yuichi Tsujita, Yoshitaka Furutani, Hajime Hida, Keiji Yamamoto, Atsuya Uno
Keywords
  • Holistic log data analysis
  • K computer
  • FEFS
  • Lustre
  • Tofu
  • MPI-IO
Date: 2021-06-02
Version: 1.0

BibTeX

@article{JHPS-2021-1,
  author   = {Yuichi Tsujita and Yoshitaka Furutani and Hajime Hida and Keiji Yamamoto and Atsuya Uno},
  title    = {{Holistic I/O Activity Characterization Through Log Data Analysis of Parallel File Systems and Interconnects}},
  year     = {2021},
  month    = {06},
  journal  = {Journal of High-Performance Storage},
  series   = {Issue 2},
  issn     = {2748-7814},
  doi      = {10.5281/zenodo.5120840},
  url      = {https://jhps.vi4io.org/issues/#-1},
  abstract = {The computing power of high-performance computing (HPC) systems is increasing with rapid growth in the number of compute nodes and CPU cores. Meanwhile, I/O performance remains one of the bottlenecks in improving HPC system performance. Current HPC systems are equipped with parallel file systems such as GPFS and Lustre to cope with the huge demands of data-intensive applications. Although most HPC systems provide performance tuning tools on compute nodes, there is little opportunity to tune I/O operations on parallel file systems, including the high-speed interconnects between compute nodes and file systems. We propose an I/O performance optimization framework that utilizes log data of parallel file systems and interconnects in a holistic way to improve the performance of HPC systems, including the effective use of I/O nodes and parallel file systems. We demonstrated our framework on the K computer with two I/O benchmarks for the original and the enhanced MPI-IO implementations. The analysis using the framework revealed the effective utilization of parallel file systems and interconnects among I/O nodes in the enhanced MPI-IO implementation, thus paving the way towards a holistic I/O performance tuning framework for current HPC systems.}
}

The computing power of high-performance computing (HPC) systems is increasing with rapid growth in the number of compute nodes and CPU cores. Meanwhile, I/O performance remains one of the bottlenecks in improving HPC system performance. Current HPC systems are equipped with parallel file systems such as GPFS and Lustre to cope with the huge demands of data-intensive applications. Although most HPC systems provide performance tuning tools on compute nodes, there is little opportunity to tune I/O operations on parallel file systems, including the high-speed interconnects between compute nodes and file systems. We propose an I/O performance optimization framework that utilizes log data of parallel file systems and interconnects in a holistic way to improve the performance of HPC systems, including the effective use of I/O nodes and parallel file systems. We demonstrated our framework on the K computer with two I/O benchmarks for the original and the enhanced MPI-IO implementations. The analysis using the framework revealed the effective utilization of parallel file systems and interconnects among I/O nodes in the enhanced MPI-IO implementation, thus paving the way towards a holistic I/O performance tuning framework for current HPC systems.
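As a rough illustration of the holistic idea behind this article, the following Python sketch aligns per-job log data from a parallel file system with interconnect counters on a common time axis so that both can be inspected together. The column names, sampling intervals, and synthetic values are illustrative assumptions only; the paper's actual framework analyzes FEFS/Lustre and Tofu logs from the K computer.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic server-side file system write throughput, sampled every 10 s.
fs_idx = pd.date_range("2021-06-02 12:00", periods=60, freq="10s")
fs_write = pd.Series(rng.gamma(2.0, 50.0, size=60), index=fs_idx,
                     name="fs_write_mib_s")

# Synthetic interconnect (e.g., Tofu) transmit counters, sampled every 5 s.
ic_idx = pd.date_range("2021-06-02 12:00", periods=120, freq="5s")
tofu_tx = pd.Series(rng.gamma(2.0, 55.0, size=120), index=ic_idx,
                    name="tofu_tx_mib_s")

# Resample both onto a common 10 s grid and join them; correlated bursts in
# the two columns hint at I/O traffic flowing through the interconnect.
grid = "10s"
combined = pd.concat(
    [fs_write.resample(grid).mean(), tofu_tx.resample(grid).mean()], axis=1
)
print(combined.head())
print(combined.corr())  # correlation of file system and interconnect activity

In a real deployment, the two series would come from the file system's and interconnect's monitoring logs rather than synthetic data, and the joined view would be broken down per job before analysis.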

A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis

Julian Kunkel, Eugen Betke
Keywords
  • performance analysis
  • monitoring
  • time series
  • job analysis
Date: 2021-08-09
Version: 1.0

BibTeX

@article{JHPS-2021-2,
  author   = {Julian Kunkel and Eugen Betke},
  title    = {{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis}},
  year     = {2021},
  month    = {08},
  journal  = {Journal of High-Performance Storage},
  series   = {Issue 2},
  issn     = {2748-7814},
  doi      = {10.5281/zenodo.5172281},
  url      = {https://jhps.vi4io.org/issues/#-2},
  abstract = {One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, i.e., to determine whether there is a class of jobs that is not performing well. Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is useful to identify jobs related to such a blueprint. This allows staff to better understand the usage of the exhibited behavior and to assess the optimization potential. In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we describe a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the metrics. A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs. The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs executed over more than six months of operation. Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and reveal interesting patterns in the data. It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.}
}

One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, i.e., to determine whether there is a class of jobs that is not performing well. Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is useful to identify jobs related to such a blueprint. This allows staff to better understand the usage of the exhibited behavior and to assess the optimization potential. In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we describe a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the metrics. A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs. The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs executed over more than six months of operation. Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and reveal interesting patterns in the data. It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
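To illustrate the distribution comparison the abstract mentions, here is a minimal Python sketch using SciPy's two-sample Kolmogorov-Smirnov test to score how similar one I/O metric looks between a reference job and a candidate job. The synthetic job data, the chosen metric, and the 1 - D similarity score are illustrative assumptions, not the authors' exact pipeline.

import numpy as np
from scipy.stats import ks_2samp

def ks_similarity(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Return 1 - D, where D is the KS statistic (1 = identical distributions)."""
    statistic, _pvalue = ks_2samp(reference, candidate)
    return 1.0 - statistic

# Example: per-timestep read throughput (MiB/s) sampled for two jobs.
rng = np.random.default_rng(42)
reference_job = rng.gamma(shape=2.0, scale=50.0, size=600)  # reference blueprint
candidate_job = rng.gamma(shape=2.1, scale=48.0, size=480)  # a similar candidate

print(f"KS-based similarity: {ks_similarity(reference_job, candidate_job):.3f}")

Because the KS statistic compares distributions rather than time-aligned values, such a score deliberately ignores when the I/O happened; the time series algorithms mentioned in the abstract complement it by capturing temporal structure.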