Smart City Gnosys

Smart city article details

Title Accelerating Distributed Training In Heterogeneous Clusters Via A Straggler-Aware Parameter Server
ID_Doc 5924
Authors Yu H.; Zhu Z.; Chen X.; Cheng Y.; Hu Y.; Li X.
Year 2019
Published Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019
DOI http://dx.doi.org/10.1109/HPCC/SmartCity/DSS.2019.00042
Abstract Unlike in homogeneous clusters, distributed training in heterogeneous clusters suffers significant performance degradation due to the effect of stragglers. Instead of the synchronous stochastic optimization commonly used in homogeneous clusters, we choose an asynchronous approach, which does not require waiting for stragglers but has the problem of using stale parameters. To solve this problem, we design a straggler-aware parameter server (SaPS), which can detect stragglers through the version of parameters and mitigate their effect with a coordinator that limits the staleness of parameters without waiting for stragglers. Experimental results show that SaPS converges faster than fully synchronous, fully asynchronous, and several variant SGD approaches. © 2019 IEEE.
Author Keywords distributed training; heterogeneous cluster; parameter server; stragglers
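The abstract describes detecting stragglers via parameter versions and bounding staleness without blocking on slow workers. A minimal sketch of that general idea is below; it is not the paper's actual SaPS design, and the class name, damping rule, and `max_staleness` threshold are all illustrative assumptions.

```python
import numpy as np


class StalenessBoundedPS:
    """Toy staleness-bounded asynchronous parameter server.

    Loosely inspired by the abstract's description (version-based straggler
    detection, a coordinator that limits staleness); all details here are
    illustrative, not the paper's implementation.
    """

    def __init__(self, dim, max_staleness=3, lr=0.1):
        self.params = np.zeros(dim)
        self.version = 0  # global parameter version, bumped on each accepted update
        self.max_staleness = max_staleness
        self.lr = lr

    def pull(self):
        # A worker fetches the current parameters along with their version.
        return self.params.copy(), self.version

    def push(self, grad, worker_version):
        # Staleness = how many updates happened since this worker last pulled.
        staleness = self.version - worker_version
        if staleness > self.max_staleness:
            # Gradient too stale: reject it so the straggler re-pulls fresh
            # parameters instead of polluting the model, without making
            # anyone wait for it.
            return False
        # Damp stale (but acceptable) gradients instead of applying full weight.
        scale = 1.0 / (1.0 + staleness)
        self.params -= self.lr * scale * grad
        self.version += 1
        return True
```

In use, each worker loops over `pull`, local gradient computation, and `push`; only a gradient whose pulled version has fallen more than `max_staleness` updates behind is discarded, so fast workers never block on slow ones.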


Similar Articles


Id 21112
Similarity 0.876
Authors Zhang R.; Shen G.; Gong L.; Guo C.
Title Dsana: A Distributed Machine Learning Acceleration Solution Based On Dynamic Scheduling And Network Acceleration
Published Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020 (2020)