Smart City Gnosys

Smart city article details

Title Dsana: A Distributed Machine Learning Acceleration Solution Based On Dynamic Scheduling And Network Acceleration
ID_Doc 21112
Authors Zhang R.; Shen G.; Gong L.; Guo C.
Year 2020
Published Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020
DOI http://dx.doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00037
Abstract Distributed machine learning (DML) has become a feasible solution for handling growing training data and models. Among existing DML architectures, the Parameter Server (PS) architecture stands out for iterative convergence algorithms and is widely deployed in practice, thanks to its flexible scalability. Under this architecture, parameter synchronization based on Bulk Synchronous Parallel (BSP) has become a research hotspot. In BSP mode, the efficiency of each iteration is determined by the slowest node in the cluster, so the straggler problem becomes the main cause of reduced DML training efficiency, and it is even more prominent in heterogeneous cloud services. Existing works mainly focus on the straggler problem, while the importance of communication is usually ignored; however, inefficient communication is also one of the reasons for inefficient DML iterations. In this paper, we propose DSANA, which first alleviates certain straggler problems by dynamically scheduling computation tasks. Secondly, DSANA improves computation/communication overlap by dividing larger transmission parameters, thereby further improving the iteration efficiency of DML training. We conduct comparison experiments with the classic iterative algorithm PageRank on four data sets of different scales in two cloud service scenarios. The experimental results show that DSANA can improve training efficiency by 36.6% to 56.4% compared with the baseline solution. © 2020 IEEE.
Author Keywords Bulk Synchronous Parallel; Distributed Machine Learning; Dynamic Scheduling; Network Acceleration; Parameter Server; Straggler
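The abstract describes two mechanisms: under BSP each iteration is bounded by the slowest worker (the straggler problem), and splitting large parameter transfers lets communication be hidden behind computation. The sketch below is a minimal, hypothetical simulation of those two effects plus speed-proportional task rebalancing; it is not the paper's implementation, and all function names (bsp_iteration_time, overlapped_iteration_time, rebalance_shards), the timing model, and the numbers are illustrative assumptions only.

```python
# Hypothetical sketch: toy timing model for one PS/BSP iteration,
# assuming fixed per-worker compute times and a single serialized uplink.

def bsp_iteration_time(compute_times, comm_time):
    """Under BSP, every worker must finish computing and then push its update
    before the next iteration starts, so the slowest (straggler) worker and
    the full transfer cost add up on the critical path."""
    return max(compute_times) + comm_time

def overlapped_iteration_time(slice_compute_times, slice_comm_times):
    """If a large parameter block is divided into slices, each slice can be
    transmitted while later slices are still being computed. Iteration time
    is then the compute time plus only the communication tail that could not
    be hidden behind computation."""
    t_compute = 0.0
    t_comm_done = 0.0
    for c, m in zip(slice_compute_times, slice_comm_times):
        t_compute += c                                   # slice finishes computing
        t_comm_done = max(t_comm_done, t_compute) + m    # slice queues on the link
    return t_comm_done

def rebalance_shards(total_shards, measured_speeds):
    """Dynamic-scheduling idea: assign computation shards in proportion to each
    worker's measured speed so that per-iteration compute times equalize."""
    total_speed = sum(measured_speeds)
    return [round(total_shards * v / total_speed) for v in measured_speeds]

if __name__ == "__main__":
    # Four workers; one straggler is roughly 2x slower (heterogeneous cloud nodes).
    compute = [1.0, 1.1, 0.9, 2.1]
    print("BSP, no overlap:", bsp_iteration_time(compute, comm_time=1.0))

    # Same work on the straggler split into four slices, each pushed as it is ready.
    slice_c = [0.55, 0.50, 0.55, 0.50]
    slice_m = [0.25, 0.25, 0.25, 0.25]
    print("with slicing/overlap:", overlapped_iteration_time(slice_c, slice_m))

    # Rebalance 40 shards toward faster workers for the next iteration.
    print("new shard assignment:", rebalance_shards(40, [1.0, 0.9, 1.1, 0.5]))
```

Running the sketch shows the non-overlapped BSP iteration paying the full communication cost after the straggler finishes, while the sliced version hides most of the transfer inside compute time; this overlap, combined with rebalancing work away from slow nodes, is the kind of iteration-time reduction the abstract claims.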


Similar Articles


Id | Similarity | Authors | Title | Published
42904 | 0.876 | Wang N.; Zhou R.; Jiao L.; Zhang R.; Li B.; Li Z. | Preemptive Scheduling For Distributed Machine Learning Jobs In Edge-Cloud Networks | IEEE Journal on Selected Areas in Communications, 40, 8 (2022)
5924 | 0.876 | Yu H.; Zhu Z.; Chen X.; Cheng Y.; Hu Y.; Li X. | Accelerating Distributed Training In Heterogeneous Clusters Via A Straggler-Aware Parameter Server | Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 (2019)
34323 | 0.859 | Chen B.; Yang Y.; Xu M. | Job-Aware Communication Scheduling For Dml Training In Shared Cluster | Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020 (2020)
20985 | 0.852 | Zhou R.; Wang N.; Huang Y.; Pang J.; Chen H. | Dps: Dynamic Pricing And Scheduling For Distributed Machine Learning Jobs In Edge-Cloud Networks | IEEE Transactions on Mobile Computing, 22, 11 (2023)