TACC Stats: A Comprehensive and Transparent Resource Usage Monitoring Tool for HPC Systems

Date and Time: 
2015 April 14 @ 1:30pm
FL2-1022 Large Auditorium
Todd Evans

We have developed and deployed TACC Stats, a transparent and comprehensive resource usage monitoring and analysis tool, at the Texas Advanced Computing Center (TACC). The tool currently helps TACC’s system administrators and HPC consultants diagnose and resolve application and system issues and identify jobs with poor performance characteristics or inefficient resource utilization. TACC Stats automatically collects resource usage data at regular time intervals and computes performance metrics for every job run on an HPC system. The collected data is intended to be comprehensive, ranging from shared filesystem statistics to individual cores’ hardware counters. The collected data and computed metrics can be explored through a web interface that supports searches based on combinations of job metadata and metric thresholds.

Jobs with poor performance or inefficient resource utilization are automatically identified by TACC Stats and can be associated with specific users, applications or projects, in addition to a variety of other metadata. Consultants and system administrators can then apply additional scrutiny to the indicated jobs, users, applications or projects. Conversely, TACC Stats also enables the identification of jobs that are performing exceptionally well, which informs best practices for application configuration and system management. In this report we introduce TACC Stats and its capabilities, and demonstrate them using case studies.
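
As a rough illustration of the workflow described above (periodic samples reduced to a per-job metric, then a search combining job metadata with a metric threshold), the following Python sketch may help. It is not TACC Stats code: the `Job` record, the `avg_cpu_usage` metric, the `search` function, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Job:
    # Hypothetical job metadata; real monitoring tools track far more fields.
    job_id: str
    user: str
    application: str
    # Hypothetical samples taken at regular intervals:
    # fraction of CPU time spent in user code during each interval.
    cpu_user_samples: List[float]


def avg_cpu_usage(job: Job) -> float:
    """Reduce a job's interval samples to one per-job metric (assumed metric)."""
    return mean(job.cpu_user_samples)


def search(jobs, *, max_avg_cpu=None, **metadata):
    """Return jobs matching all metadata filters and the optional metric threshold."""
    hits = []
    for job in jobs:
        # Metadata filter: every requested field must match exactly.
        if any(getattr(job, key) != value for key, value in metadata.items()):
            continue
        # Metric threshold: keep only jobs at or below the given average.
        if max_avg_cpu is not None and avg_cpu_usage(job) > max_avg_cpu:
            continue
        hits.append(job)
    return hits


# Made-up example data, for illustration only.
jobs = [
    Job("1001", "alice", "namd", [0.95, 0.97, 0.96]),
    Job("1002", "bob", "wrf", [0.10, 0.12, 0.08]),
    Job("1003", "bob", "wrf", [0.90, 0.92, 0.91]),
]

# Find bob's jobs whose average CPU usage is at most 50%.
low = search(jobs, user="bob", max_avg_cpu=0.5)
print([job.job_id for job in low])  # prints ['1002']
```

In this sketch the threshold flags job 1002 as a candidate for closer inspection, while the same user's job 1003 passes; inverting the comparison would surface well-performing jobs instead.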

Speaker Description: 

Dr. Todd Evans is an HPC Research Associate at the Texas Advanced Computing Center and Research Scientist Lecturer in the Department of Statistics & Data Science at UT Austin. Dr. Evans received his Ph.D. in Physics from the University of Illinois at Urbana-Champaign in 2008 and has been on staff at UT Austin since 2013. His current research interests include the development of tools for transparent job-level monitoring and performance analysis of HPC systems.

SEA15_ToddEvans_TACC_Stats.pdf (1.5 MB)