Smart City Gnosys

Smart city article details

Title Job-Aware Communication Scheduling For DML Training In Shared Cluster
ID_Doc 34323
Authors Chen B.; Yang Y.; Xu M.
Year 2020
Published Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020
DOI http://dx.doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00058
Abstract Distributed machine learning (DML) systems equipped with multiple computing nodes have been widely adopted in industry to accelerate large model training. To maximize resource utilization, a critical problem is how to schedule the communication of DML jobs efficiently. However, previous approaches work well only when a job has exclusive use of the network resources. Training multiple jobs in a shared cluster without scheduling causes significant performance degradation because of network contention. In this paper, we propose JCS, a job-aware communication scheduler, to overcome these problems. JCS profiles the priority among jobs with a novel metric and schedules the communication of jobs according to both computation and communication information. To demonstrate the effectiveness of our algorithm, we perform extensive simulations with DML job traces. The simulation results show that our algorithm can reduce average job completion time by 19%, 39% and 46% over RRSP, SCF and LCoF, respectively. © 2020 IEEE.
Author Keywords Communication Scheduling; Distributed Machine Learning; Network Contention; Shared Cluster


Similar Articles


Id | Similarity | Authors | Title | Published
42904 | 0.861 | Wang N.; Zhou R.; Jiao L.; Zhang R.; Li B.; Li Z. | Preemptive Scheduling For Distributed Machine Learning Jobs In Edge-Cloud Networks | IEEE Journal on Selected Areas in Communications, 40, 8 (2022)
21112 | 0.859 | Zhang R.; Shen G.; Gong L.; Guo C. | Dsana: A Distributed Machine Learning Acceleration Solution Based On Dynamic Scheduling And Network Acceleration | Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020 (2020)