| Title |
Accelerating Distributed Training In Heterogeneous Clusters Via A Straggler-Aware Parameter Server |
| ID_Doc |
5924 |
| Authors |
Yu H.; Zhu Z.; Chen X.; Cheng Y.; Hu Y.; Li X. |
| Year |
2019 |
| Published |
Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 |
| DOI |
http://dx.doi.org/10.1109/HPCC/SmartCity/DSS.2019.00042 |
| Abstract |
Unlike in homogeneous clusters, distributed training in heterogeneous clusters suffers severe performance degradation because of stragglers. Instead of the synchronous stochastic optimization commonly used in homogeneous clusters, we choose an asynchronous approach, which does not wait for stragglers but may compute updates from stale parameters. To address this problem, we design a straggler-aware parameter server (SaPS), which detects stragglers through the version of their parameters and mitigates their effect through a coordinator that limits parameter staleness without waiting for stragglers. Experimental results show that SaPS can converge faster than fully synchronous training, fully asynchronous training, and some SGD variants. © 2019 IEEE. |
| Author Keywords |
distributed training; heterogeneous cluster; parameter server; stragglers |
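
The abstract describes two mechanisms: detecting stragglers from the version of the parameters they hold, and bounding staleness without blocking on slow workers. The following is a minimal single-process sketch of that general idea, not the SaPS implementation. The class name `ToyParameterServer`, the `max_staleness` threshold, and the policy of rejecting over-stale gradients are all assumptions made for illustration; the abstract does not say how the coordinator enforces the bound.

```python
from dataclasses import dataclass, field


@dataclass
class ToyParameterServer:
    """Toy parameter server with version-based staleness checks (illustrative only)."""
    params: list[float]
    lr: float = 0.5
    max_staleness: int = 2  # hypothetical bound, not taken from the paper
    version: int = 0
    stragglers: set[str] = field(default_factory=set)

    def pull(self):
        """Return a copy of the parameters and the version they belong to."""
        return list(self.params), self.version

    def push(self, worker_id, grad, pulled_version):
        """Apply grad unless it was computed on parameters that are too stale.

        Returns True if the update was applied, False if it was rejected.
        """
        staleness = self.version - pulled_version
        if staleness > self.max_staleness:
            # Assumed policy: flag the worker and drop its update; the
            # server never blocks waiting for it.
            self.stragglers.add(worker_id)
            return False
        self.stragglers.discard(worker_id)
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]
        self.version += 1
        return True


# Demo: a slow worker pulls at version 0, then a fast worker finishes
# three updates before the slow worker pushes its gradient.
server = ToyParameterServer(params=[1.0, 1.0])
_, slow_version = server.pull()
for _ in range(3):
    _, v = server.pull()
    server.push("fast", [0.5, 0.5], v)
accepted = server.push("slow", [0.5, 0.5], slow_version)
print(accepted, server.version, sorted(server.stragglers))
```

In this sketch the slow worker's gradient is three versions old, which exceeds the assumed bound of two, so it is rejected and the worker is flagged. A real system could instead down-weight stale updates or force a fresh pull; those are design choices the abstract leaves open.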