Smart City Gnosys

Smart city article details

Title Job Placement Strategy With Opportunistic Resource Sharing For Distributed Deep Learning Clusters
ID_Doc 34322
Authors Li H.; Sun T.; Li X.; Xu H.
Year 2020
Published Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020
DOI http://dx.doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00079
Abstract Distributed deep learning frameworks train large deep learning workloads with multiple training jobs on shared distributed GPU servers. Scheduling resources for these systems raises new challenges. Modern deep learning training jobs tend to consume large amounts of GPU memory. A training job is iterative, which causes its memory usage to fluctuate over time. Jobs sharing a host may suffer significant performance degradation caused by memory overload at runtime. Moreover, even without memory overloads, deep learning training jobs still experience different levels of performance interference when sharing a GPU device. This paper studies these two issues. We introduce an opportunistic memory sharing model to allocate resources for training jobs with time-varying memory requirements. Based on this model, we introduce an Opportunistic Job Placement Problem (OJPP) for shared GPU clusters that seeks job placement configurations using the minimum number of GPU devices while guaranteeing user-defined performance requirements. We propose a greedy algorithm and a heuristic algorithm, with computational complexities of O(n log n) and O(n^2 log n), respectively, to solve the problem. Extensive experiments are conducted on a GPU cluster to verify the correctness, effectiveness, and scalability of our approaches. The proposed approach achieved over 80% of the standalone performance, in terms of average job completion time, with less than 30% extra resource consumption. © 2020 IEEE.
Author Keywords Bin Packing; Deep Learning Cluster; Job Placement; Opportunistic Sharing
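
The abstract frames job placement as a bin-packing problem: fit jobs onto as few GPU devices as possible. As a rough illustration only, the sketch below uses the generic first-fit-decreasing bin-packing heuristic on peak memory. It is not the paper's greedy or heuristic algorithm, and it ignores time-varying memory and performance interference. The function name, job names, memory figures, and the 16 GB capacity are all hypothetical.

```python
from typing import Dict, List


def first_fit_decreasing(job_mem_gb: Dict[str, float],
                         gpu_capacity_gb: float) -> List[List[str]]:
    """Generic first-fit-decreasing bin packing (not the paper's algorithm).

    Each job is described only by a hypothetical peak memory demand in GB.
    Returns one list of job names per GPU used.
    """
    free: List[float] = []            # remaining memory on each opened GPU
    placement: List[List[str]] = []   # jobs assigned to each opened GPU

    # Consider the largest jobs first.
    ordered = sorted(job_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    for job, mem in ordered:
        if mem > gpu_capacity_gb:
            raise ValueError(f"{job} does not fit on a single GPU")
        # Put the job on the first GPU that still has room for it.
        for i, remaining in enumerate(free):
            if mem <= remaining:
                free[i] -= mem
                placement[i].append(job)
                break
        else:
            # No opened GPU can hold the job, so open a new one.
            free.append(gpu_capacity_gb - mem)
            placement.append([job])
    return placement


# Hypothetical peak memory demands, in GB.
jobs = {"job_a": 9.0, "job_b": 6.0, "job_c": 5.0, "job_d": 4.0, "job_e": 3.0}
placement = first_fit_decreasing(jobs, gpu_capacity_gb=16.0)
for gpu_index, gpu_jobs in enumerate(placement):
    print(f"GPU {gpu_index}: {gpu_jobs}")
```

Packing by peak memory is the conservative baseline. An opportunistic scheme in the spirit of the abstract would reserve less than the peak for each job and accept some risk of overload, which this sketch does not model.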


Similar Articles


Id | Similarity | Authors | Title | Published
34376 | 0.88 | Wang H.; Chen X.; Xu H.; Liu J.; Huang L. | Joint Job Offloading And Resource Allocation For Distributed Deep Learning In Edge Computing | Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 (2019)
38608 | 0.863 | Chen Z.; Luo L.; Quan W.; Shi Y.; Yu J.; Wen M.; Zhang C. | Multiple Cnn-Based Tasks Scheduling Across Shared Gpu Platform In Research And Development Scenarios | Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018 (2019)