Incubator

The incubator contains papers that are in the review phase prior to acceptance into the journal.

Anyone can access an article and the Git repository containing its reproducible workflow and provide feedback.
Please see the guidelines for reviewers for details on the process.

Articles in the Incubator

The following articles are currently in the incubator:

Classifying Temporal Characteristics of Job I/O Using Machine Learning Techniques

Eugen Betke, Julian Kunkel
Keywords
  • IO fingerprinting
  • performance analysis
  • monitoring
Date: 2020-07-10
Version: 0.8
PDF · Draft · LaTeX · Workflow

Every day, supercomputers execute thousands of jobs with different characteristics. Data centers monitor the behavior of jobs to support the users and improve the infrastructure, for instance, by optimizing jobs or by determining guidelines for the next procurement. The classification of jobs into groups that express similar run-time behavior aids this analysis, as it reduces the number of representative jobs to look into. This work utilizes machine learning techniques to cluster and classify parallel jobs based on the similarity of their temporal I/O behavior. Our contribution is the qualitative and quantitative evaluation of different I/O characterizations and similarity measurements, and the development of a suitable clustering algorithm.

In the evaluation, we explore I/O characteristics from monitoring data of one million parallel jobs and cluster them into groups of similar jobs. To this end, the time series of various I/O statistics are converted into features using different similarity metrics that customize the classification.

General-purpose clustering techniques yield suboptimal results. Additionally, we extract phases of I/O activity from jobs. Finally, we simplify the grouping algorithm in favor of performance and discuss the impact of these changes on the clustering quality.
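To make the pipeline concrete, the following is a minimal sketch of clustering jobs by temporal I/O similarity. The feature coding (mean values over fixed temporal segments) and the clustering algorithm (k-means) are illustrative stand-ins chosen for brevity, not the characterizations or the algorithm developed in the paper, and the monitoring data is synthetic.

```python
# Sketch: group jobs by the shape of their I/O time series.
# Feature coding and algorithm are placeholders, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans

def to_features(series, n_bins=10):
    """Reduce a job's I/O time series to a fixed-length vector by
    averaging within n_bins equally sized temporal segments."""
    segments = np.array_split(np.asarray(series, dtype=float), n_bins)
    return np.array([s.mean() for s in segments])

# Hypothetical monitoring data: one read-bandwidth series per job.
jobs = {
    "job-a": np.random.rand(600),                             # steady I/O
    "job-b": np.concatenate([np.zeros(300), np.ones(300)]),   # late burst
    "job-c": np.concatenate([np.ones(300), np.zeros(300)]),   # early burst
}

X = np.stack([to_features(ts) for ts in jobs.values()])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(dict(zip(jobs, labels)))  # job-b and job-c land in distinct groups
```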

Characterizing I/O Optimization Effect Through Holistic Log Data Analysis of Parallel File Systems and Interconnects

Yuichi Tsujita, Yoshitaka Furutani, Hajime Hida, Keiji Yamamoto, Atsuya Uno
Keywords
  • holistic log data analysis
  • K computer
  • FEFS
  • Tofu
  • MPI-IO
Date: 2020-04-01
Version: 0.5
Draft

The performance of HPC systems is increasing with the rapid growth in the number of compute nodes and CPU cores. Meanwhile, I/O performance remains one of the major bottlenecks of HPC systems. Recent HPC systems utilize parallel file systems such as GPFS and Lustre to cope with the huge demands of data-intensive applications. Although most HPC systems provide performance tuning tools on compute nodes, there is little opportunity to tune I/O activities on parallel file systems, including the high-speed interconnects between compute nodes and file systems.

We propose an I/O performance optimization framework that uses log data of parallel file systems and interconnects in a holistic way for the effective use of HPC systems, including I/O nodes and parallel file systems. We demonstrate our framework on the K computer with two I/O benchmarks for the original and an enhanced MPI-IO implementation. The I/O analysis reveals that the performance increase achieved by the enhanced MPI-IO implementation is due to more effective utilization of the parallel file systems and the interconnects among I/O nodes compared with the original implementation.
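The core of such a holistic analysis is correlating per-job records from independent log sources. The sketch below joins hypothetical file-system and interconnect logs by job ID; the column names and CSV layout are invented for illustration and do not reflect the actual FEFS or Tofu log formats used on the K computer.

```python
# Sketch: join file-system and interconnect logs per job (made-up schema).
import pandas as pd

fefs = pd.read_csv("fefs_io.csv")       # columns: job_id, read_mib, write_mib
tofu = pd.read_csv("tofu_traffic.csv")  # columns: job_id, tx_mib, rx_mib

merged = fefs.merge(tofu, on="job_id", how="inner")
# Relate interconnect traffic to effective file I/O per job, e.g. to spot
# jobs whose network transfer volume dwarfs the data they actually store.
merged["net_to_io_ratio"] = (merged.tx_mib + merged.rx_mib) / \
                            (merged.read_mib + merged.write_mib)
print(merged.sort_values("net_to_io_ratio", ascending=False).head())
```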

Investigating the Overhead of the REST Protocol to Reveal the Potential for Using Cloud Services for HPC Storage

Frank Gadban, Julian Kunkel, Thomas Ludwig
Keywords
  • HPC-Cloud-Convergence
  • RESTful APIs
  • HTTP2
  • HTTP3
Date: 2020-04-01
Version: 0.5
Draft · Workflow

With the significant advances in Cloud Computing, it is inevitable to explore the usage of Cloud technology in HPC workflows. While many Cloud vendors offer to move complete HPC workloads into the Cloud, this is in fact limited by the massive demand for computing power alongside the storage resources typically required by HPC applications. Ultimately, the cost of storing and managing the data produced by those applications often determines where workloads should run. It is widely believed that HPC hardware and software protocols like MPI yield superior performance and lower resource consumption, such as CPU load on client and server, compared to RESTful Web Services over TCP/IP. With the advent of enhanced versions of HTTP, the most commonly used protocol for Cloud services and in particular Cloud storage, it is time to reevaluate the effective usage of cloud-based storage in HPC and its ability to cope with various types of data-intensive workloads.

In this paper, we investigate the overhead of different versions of the HTTP protocol compared to the HPC-native communication protocol MPI when storing and retrieving objects. We are particularly interested in the impact of data transfer on measurable performance metrics. Our contribution is the creation of a performance model based on hardware counters that provide an analytical representation of data transfer over different protocols. We validate this model by comparing the results obtained for REST and MPI on the two different systems, one equipped with Infiniband and one with Gigabit Ethernet. Thus, we evaluate its accuracy and show that REST can be a viable and resource-efficient solution, in particular for accessing large files.