Issue

Issue 1 - 2021-01-29

The Journal of High-Performance Storage (JHPS) is a new open-access journal edited by storage experts that unites key features of journals fostering openness and trust in storage research. In particular, JHPS offers open reviews, living papers, digital replicability, and free open access.

The editing team is proud to announce the publication of the first JHPS issue today, which represents an important milestone. The first issue contains only one publication, as the difficult situation in 2020 impaired the submission numbers. We use this chance to look back at the developments during this turbulent year. While the webpage was officially launched about one year ago in 2020, we knew that the processes and toolchains needed further development and testing. As it turned out, the year was even more challenging than we anticipated, not only for HPC and storage experts but for society as a whole, which faced the COVID-19 pandemic. For researchers, the pandemic affected their research focus, administrative tasks, and productivity, which in turn affected their publication behavior.

In 2020, JHPS reviewed the processes revolving around publication; we improved their quality and extended the capabilities of the tools based on the feedback of authors and reviewers. In particular, we thank the HPC-IODC workshop for the fruitful collaboration with JHPS to test the open review process on the research papers submitted to HPC-IODC. Initially, we explored the Google Docs format for the public review process, as its suggestion mode is powerful and allows reviewers to add comments and minor suggestions effectively. However, it turned out that the typesetting features provided by Google Docs do not meet our aspirations for high-quality camera-ready publications. Therefore, we developed a LaTeX template and a Google Docs plugin that allows annotating LaTeX files hosted on GitHub. This tooling yields high productivity while remaining inclusive for public reviewers. Additionally, we introduced the JHPS Manuscript Central, a lightweight web-based system that manages the relevant publication workflows for authors and reviewers.

Now that we are confident in the effectiveness of the established workflows and tools, our goal is to foster the adoption of the journal and to refine the workflows for digital replicability.

We thank all authors, reviewers, and readers.

Cordially,
Julian Kunkel, Jean-Thomas Acquaviva, Suren Byna, Adrian Jackson, Ivo Jimenez, Anthony Kougkas, Jay Lofstead, Glenn K. Lockwood, Carlos Maltzahn, George S. Markomanolis, Lingfang Zeng
JHPS Editors

Articles

Classifying Temporal Characteristics of Job I/O Using Machine Learning Techniques

Eugen Betke, Julian Kunkel

Keywords

IO fingerprinting
performance analysis
monitoring

Date: 2021-01-29

Version: 1.0

BibTeX

@article{ JHPS-2021-1-1,
author = {Eugen Betke and Julian Kunkel},
title = {{Classifying Temporal Characteristics of Job I/O Using Machine Learning Techniques}},
year = {2021},
month = {01},
journal = {Journal of High-Performance Storage},
series = {Issue 1},
isbn = {},
doi = {10.5281/zenodo.4478960},
url = {https://jhps.vi4io.org/issues/#1-1},
abstract = {{Every day, supercomputers execute 1000s of jobs with different characteristics. Data centers monitor the behavior of jobs to support the users and improve the infrastructure, for instance, by optimizing jobs or by determining guidelines for the next procurement. The classification of jobs into groups that express similar run-time behavior aids this analysis as it reduces the number of representative jobs to look into. This work utilizes machine learning techniques to cluster and classify parallel jobs based on the similarity in their temporal I/O behavior. Our contribution is the qualitative and quantitative evaluation of different I/O characterizations and similarity measurements and the development of a suitable clustering algorithm. In the evaluation, we explore I/O characteristics from monitoring data of one million parallel jobs and cluster them into groups of similar jobs. Therefore, the time series of various I/O statistics is converted into features using different similarity metrics that customize the classification. When using general-purpose clustering techniques, suboptimal results are obtained. Additionally, we extract phases of I/O activity from jobs. Finally, we simplify the grouping algorithm in favor of performance. We discuss the impact of these changes on the clustering quality.}}
}

Every day, supercomputers execute 1000s of jobs with different characteristics. Data centers monitor the behavior of jobs to support the users and improve the infrastructure, for instance, by optimizing jobs or by determining guidelines for the next procurement. The classification of jobs into groups that express similar run-time behavior aids this analysis as it reduces the number of representative jobs to look into. This work utilizes machine learning techniques to cluster and classify parallel jobs based on the similarity in their temporal I/O behavior. Our contribution is the qualitative and quantitative evaluation of different I/O characterizations and similarity measurements and the development of a suitable clustering algorithm.

In the evaluation, we explore I/O characteristics from monitoring data of one million parallel jobs and cluster them into groups of similar jobs. Therefore, the time series of various I/O statistics is converted into features using different similarity metrics that customize the classification.

When using general-purpose clustering techniques, suboptimal results are obtained. Additionally, we extract phases of I/O activity from jobs. Finally, we simplify the grouping algorithm in favor of performance. We discuss the impact of these changes on the clustering quality.
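
To illustrate the idea of grouping jobs by the similarity of their temporal I/O behavior, the following is a minimal Python sketch. It is not the clustering algorithm developed in the article: the similarity measure, the greedy grouping strategy, the function names `similarity` and `group_jobs`, and the threshold value are all assumptions chosen for illustration. Each job is assumed to be summarized as an equal-length series of activity levels between 0 and 1.

```python
def similarity(a, b):
    """Hypothetical similarity in [0, 1]: one minus the mean absolute
    difference of two equal-length series with values in [0, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def group_jobs(jobs, threshold):
    """Greedy grouping (an assumption, not the article's algorithm).

    Each job joins the first group whose representative (the group's
    first member) is at least `threshold` similar; otherwise the job
    starts a new group.
    """
    groups = []           # list of lists of job ids
    representatives = []  # one reference series per group
    for job_id, series in jobs.items():
        for index, rep in enumerate(representatives):
            if similarity(series, rep) >= threshold:
                groups[index].append(job_id)
                break
        else:
            groups.append([job_id])
            representatives.append(series)
    return groups


# Hypothetical per-segment I/O activity levels of three jobs.
example_jobs = {
    "job-a": [0.0, 0.0, 1.0, 1.0],
    "job-b": [0.0, 0.0, 1.0, 0.9],
    "job-c": [1.0, 1.0, 0.0, 0.0],
}
example_groups = group_jobs(example_jobs, threshold=0.9)
```

In this sketch, the first two jobs end up in one group because their series differ only slightly, while the third job forms its own group. Comparing each job only against one representative per group keeps the cost low, at the price of results that depend on the order in which jobs are processed.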

Issue 2 - 2021-08-09

Articles

Holistic I/O Activity Characterization Through Log Data Analysis of Parallel File Systems and Interconnects

Yuichi Tsujita, Yoshitaka Furutani, Hajime Hida, Keiji Yamamoto, Atsuya Uno

Keywords

Holistic log data analysis
K computer
FEFS
Lustre
Tofu
MPI-IO

Date: 2021-06-02

Version: 1.0

BibTeX

@article{ JHPS-2021-2-1,
author = {Yuichi Tsujita and Yoshitaka Furutani and Hajime Hida and Keiji Yamamoto and Atsuya Uno},
title = {{Holistic I/O Activity Characterization Through Log Data Analysis of Parallel File Systems and Interconnects}},
year = {2021},
month = {06},
journal = {Journal of High-Performance Storage},
series = {Issue 2},
isbn = {},
doi = {10.5281/zenodo.5120840},
url = {https://jhps.vi4io.org/issues/#2-1},
abstract = {{The computing power of high-performance computing (HPC) systems is increasing with a rapid growth in the number of compute nodes and CPU cores. Meanwhile, I/O performance is one of the bottlenecks in improving HPC system performance. Current HPC systems are equipped with parallel file systems such as GPFS and Lustre to cope with the huge demand of data-intensive applications. Although most of the HPC systems provide performance tuning tools on compute nodes, there is not enough chance to tune I/O operations on parallel file systems, including high speed interconnects among compute nodes and file systems. We propose an I/O performance optimization framework that utilizes log data of parallel file systems and interconnects in a holistic way for improving the performance of HPC system, including effective use of I/O nodes and parallel file systems. We demonstrated our framework at the K computer with two I/O benchmarks for the original and the enhanced MPI-IO implementations. The analysis by using the framework revealed the effective utilization of parallel file systems and interconnects among I/O nodes in the enhanced MPI-IO implementation, thus paving the way towards holistic I/O performance tuning framework in the current HPC systems.}}
}

The computing power of high-performance computing (HPC) systems is increasing with a rapid growth in the number of compute nodes and CPU cores. Meanwhile, I/O performance is one of the bottlenecks in improving HPC system performance. Current HPC systems are equipped with parallel file systems such as GPFS and Lustre to cope with the huge demand of data-intensive applications. Although most of the HPC systems provide performance tuning tools on compute nodes, there is not enough chance to tune I/O operations on parallel file systems, including high speed interconnects among compute nodes and file systems. We propose an I/O performance optimization framework that utilizes log data of parallel file systems and interconnects in a holistic way for improving the performance of HPC system, including effective use of I/O nodes and parallel file systems. We demonstrated our framework at the K computer with two I/O benchmarks for the original and the enhanced MPI-IO implementations. The analysis by using the framework revealed the effective utilization of parallel file systems and interconnects among I/O nodes in the enhanced MPI-IO implementation, thus paving the way towards holistic I/O performance tuning framework in the current HPC systems.

A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis

Julian Kunkel, Eugen Betke

Keywords

performance analysis
monitoring
time series
job analysis

Date: 2021-08-09

Version: 1.0

BibTeX

@article{ JHPS-2021-2-2,
author = {Julian Kunkel and Eugen Betke},
title = {{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis}},
year = {2021},
month = {08},
journal = {Journal of High-Performance Storage},
series = {Issue 2},
isbn = {},
doi = {10.5281/zenodo.5172281},
url = {https://jhps.vi4io.org/issues/#2-2},
abstract = {{One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, i.e., is there a class of jobs that aren't performing well. Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify related jobs to such a blueprint. This allows staff to understand the usage of the exhibited behavior better and to assess the optimization potential. In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we describe a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov-Test to compare the distribution of the metrics. A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs. The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs that have been executed for more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective to identify similar jobs and reveal interesting patterns in the data. It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.}}
}

One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, i.e., is there a class of jobs that aren’t performing well. Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify related jobs to such a blueprint. This allows staff to understand the usage of the exhibited behavior better and to assess the optimization potential. In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we describe a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov-Test to compare the distribution of the metrics. A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs. The data stem from DKRZ’s supercomputer Mistral and include more than 500,000 jobs that have been executed for more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective to identify similar jobs and reveal interesting patterns in the data. It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
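
As a rough illustration of how the Kolmogorov-Smirnov test compares the distribution of a metric between two jobs, the following Python sketch computes the two-sample KS statistic, i.e., the largest gap between the two empirical cumulative distribution functions. It is a sketch under stated assumptions and not the article's implementation: the function name `ks_statistic` and the sample values are hypothetical, and the code returns only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of the two samples (0.0 = identical distributions, 1.0 = disjoint).
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    sorted_a, sorted_b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is
    # sufficient to evaluate the gap at every sample point.
    for x in sorted_a + sorted_b:
        cdf_a = bisect_right(sorted_a, x) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, x) / len(sorted_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical per-interval write volumes (MiB) of a reference job
# and a candidate job; the values are made up for illustration.
reference_job = [1, 2, 3, 4]
candidate_job = [3, 4, 5, 6]
distance = ks_statistic(reference_job, candidate_job)
```

A small statistic suggests that the two jobs exhibit a similar distribution of the metric, so candidates could be ranked by it relative to the reference job. In practice, a library routine such as `scipy.stats.ks_2samp` additionally reports a p-value.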