Smart City Gnosys

Smart city article details

Title Dsana: A Distributed Machine Learning Acceleration Solution Based On Dynamic Scheduling And Network Acceleration
ID_Doc 21112
Authors Zhang R.; Shen G.; Gong L.; Guo C.
Year 2020
Published Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020
DOI http://dx.doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00037
Abstract Distributed machine learning (DML) has become a feasible solution for handling growing training data and models. Among existing DML architectures, the Parameter Server (PS) architecture stands out for iterative convergence algorithms and is widely deployed in practice, thanks to its flexible scalability. Under this architecture, parameter synchronization based on Bulk Synchronous Parallel (BSP) has become a research hotspot. In BSP mode, the efficiency of each iteration is determined by the slowest node in the cluster, so the straggler problem becomes the main cause of reduced DML training efficiency, and it is even more prominent in heterogeneous cloud services. Existing works mainly focus on the straggler problem, while the importance of communication is usually ignored; however, inefficient communication is also one of the reasons for inefficient DML iterations. In this paper, we propose DSANA, which first alleviates certain straggler problems by dynamically scheduling computation tasks. Secondly, DSANA improves computation/communication overlap by dividing larger transmission parameters, thereby further improving the iteration efficiency of DML training. We conduct comparison experiments with the classic iterative algorithm PageRank on four data sets of different scales in two cloud service scenarios. The experimental results show that DSANA can improve training efficiency by 36.6% to 56.4% compared with the baseline solution. © 2020 IEEE.
Author Keywords Bulk Synchronous Parallel; Distributed Machine Learning; Dynamic Scheduling; Network Acceleration; Parameter Server; Straggler
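The abstract describes two mechanisms: under BSP each iteration is bounded by the slowest worker (the straggler problem), and splitting large parameter transfers lets communication be hidden behind computation. The sketch below is a minimal, hypothetical simulation of those two effects plus speed-proportional task rebalancing; it is not the paper's implementation, and all function names (bsp_iteration_time, overlapped_iteration_time, rebalance_shards), the timing model, and the numbers are illustrative assumptions only.

```python
# Hypothetical sketch: toy timing model for one PS/BSP iteration,
# assuming fixed per-worker compute times and a single serialized uplink.

def bsp_iteration_time(compute_times, comm_time):
    """Under BSP, every worker must finish computing and then push its update
    before the next iteration starts, so the slowest (straggler) worker and
    the full transfer cost add up on the critical path."""
    return max(compute_times) + comm_time

def overlapped_iteration_time(slice_compute_times, slice_comm_times):
    """If a large parameter block is divided into slices, each slice can be
    transmitted while later slices are still being computed. Iteration time
    is then the compute time plus only the communication tail that could not
    be hidden behind computation."""
    t_compute = 0.0
    t_comm_done = 0.0
    for c, m in zip(slice_compute_times, slice_comm_times):
        t_compute += c                                   # slice finishes computing
        t_comm_done = max(t_comm_done, t_compute) + m    # slice queues on the link
    return t_comm_done

def rebalance_shards(total_shards, measured_speeds):
    """Dynamic-scheduling idea: assign computation shards in proportion to each
    worker's measured speed so that per-iteration compute times equalize."""
    total_speed = sum(measured_speeds)
    return [round(total_shards * v / total_speed) for v in measured_speeds]

if __name__ == "__main__":
    # Four workers; one straggler is roughly 2x slower (heterogeneous cloud nodes).
    compute = [1.0, 1.1, 0.9, 2.1]
    print("BSP, no overlap:", bsp_iteration_time(compute, comm_time=1.0))

    # Same work on the straggler split into four slices, each pushed as it is ready.
    slice_c = [0.55, 0.50, 0.55, 0.50]
    slice_m = [0.25, 0.25, 0.25, 0.25]
    print("with slicing/overlap:", overlapped_iteration_time(slice_c, slice_m))

    # Rebalance 40 shards toward faster workers for the next iteration.
    print("new shard assignment:", rebalance_shards(40, [1.0, 0.9, 1.1, 0.5]))
```

Running the sketch shows the non-overlapped BSP iteration paying the full communication cost after the straggler finishes, while the sliced version hides most of the transfer inside compute time; this overlap, combined with rebalancing work away from slow nodes, is the kind of iteration-time reduction the abstract claims.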


Similar Articles


Id | Similarity | Authors | Title | Published
42904 | 0.876 | Wang N.; Zhou R.; Jiao L.; Zhang R.; Li B.; Li Z. | Preemptive Scheduling For Distributed Machine Learning Jobs In Edge-Cloud Networks | IEEE Journal on Selected Areas in Communications, 40, 8 (2022)
5924 | 0.876 | Yu H.; Zhu Z.; Chen X.; Cheng Y.; Hu Y.; Li X. | Accelerating Distributed Training In Heterogeneous Clusters Via A Straggler-Aware Parameter Server | Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 (2019)
34323 | 0.859 | Chen B.; Yang Y.; Xu M. | Job-Aware Communication Scheduling For Dml Training In Shared Cluster | Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020 (2020)
20985 | 0.852 | Zhou R.; Wang N.; Huang Y.; Pang J.; Chen H. | Dps: Dynamic Pricing And Scheduling For Distributed Machine Learning Jobs In Edge-Cloud Networks | IEEE Transactions on Mobile Computing, 22, 11 (2023)