The incubator contains papers that are in the review phase, prior to acceptance by the journal.

Anyone can access the article and the Git repository containing the reproducible workflow and provide feedback.
Please see the guidelines for reviewers for details on the process.

Articles in the Incubator

The following articles are currently in the incubator:

Characterizing I/O Optimization Effect Through Holistic Log Data Analysis of Parallel File Systems and Interconnects

Yuichi Tsujita, Yoshitaka Furutani, Hajime Hida, Keiji Yamamoto, Atsuya Uno
  • Holistic log data analysis
  • K computer
  • FEFS
  • Tofu
  • MPI-IO
Date: 2021-04-12
Version: 0.8
PDF Draft

The performance of HPC systems is increasing with a rapid growth in the number of compute nodes and CPU cores. Meanwhile, I/O performance is one of the bottlenecks in improving HPC system performance. Recent HPC systems utilize parallel file systems such as GPFS and Lustre to cope with the huge demand of data-intensive applications. Although most HPC systems provide performance tuning tools on compute nodes, there are few opportunities to tune I/O activities on parallel file systems, including the high-speed interconnects among compute nodes and file systems. We propose an I/O performance optimization framework that uses log data of parallel file systems and interconnects in a holistic way to improve the performance of HPC systems, including I/O nodes and parallel file systems. We demonstrate our framework on the K computer with two I/O benchmarks for the original and the enhanced MPI-IO implementations. The analysis by the framework reveals the effective utilization of parallel file systems and interconnects among I/O nodes in the enhanced MPI-IO implementation.

A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis

Julian Kunkel, Eugen Betke
  • performance analysis
  • monitoring
  • time series
  • job analysis
Date: 2021-03-16
Version: 0.5
PDF Draft Workflow

One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, for example, to determine whether there is a class of jobs that are not performing well. Similarly, when support staff investigate a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify jobs related to such a blueprint. This allows staff to better understand the usage of the exhibited behavior and to assess the optimization potential.

In this paper, we describe a methodology to identify jobs related to a reference job based on their temporal I/O similarity. Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distribution of the metrics. A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs. The data stems from DKRZ’s supercomputer Mistral and includes more than 500,000 jobs that were executed over more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective at identifying similar jobs and reveals interesting patterns in the data. It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
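
As a rough illustration of the distribution comparison mentioned in the abstract above, the following sketch computes a two-sample Kolmogorov-Smirnov statistic between the metric values of a reference job and those of candidate jobs, then ranks the candidates by that distance. It is not the authors' implementation: the function names `ks_statistic` and `rank_by_similarity`, the job identifiers, and the sample values are all hypothetical, and the paper's time series algorithms are not reproduced here.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every value that occurs in either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def rank_by_similarity(reference, candidates):
    """Return (job_id, distance) pairs, most similar job first (hypothetical helper)."""
    scored = [
        (job_id, ks_statistic(reference, values))
        for job_id, values in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Invented values standing in for one I/O metric sampled per job.
    reference_job = [1, 2, 3, 4]
    other_jobs = {
        "job_a": [3, 4, 5, 6],
        "job_b": [1, 2, 3, 5],
        "job_c": [10, 11, 12, 13],
    }
    for job_id, distance in rank_by_similarity(reference_job, other_jobs):
        print(job_id, distance)
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all. Note that this comparison ignores the order of the samples in time, so it only complements a time series comparison and does not replace one.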