MACHINE LEARNING-BASED MODELING FOR PERFORMANCE IMPROVEMENT IN AN EXASCALE SYSTEMS
Volume 3 (2), December 2020, Pages 223-233
Etibar V. Vazirov
The combination of heterogeneous resources in exascale architectures promises revolutionary computing capability for scientific applications. Such systems will produce data on the current progress of jobs, the state of hardware and software, and memory and network resource usage. This provisional information is invaluable for learning to predict where applications may exhibit dynamic and interactive behavior when resource failures occur. This paper proposes building a scalable framework that uses special performance information collected from all such sources. The framework will analyze HPC applications in order to develop new statistical footprints of resource usage. In addition, it should predict the causes of failures and provide new capabilities for recovering from application failures. Applying HPC capabilities at exascale raises the possibility of substantial scientific unproductiveness in computational procedures. In that sense, integrating machine learning into exascale computations is a promising way to obtain large performance gains and offers an opportunity to leap ahead a generation in simulation improvements.
High Performance Computing, Machine Learning, Exascale Computing System, Artificial Intelligence.