Research Project Title:
Studying Communication Patterns in Distributed Machine Learning Training
Abstract: As deep learning grows more popular and its models become larger, training these models becomes increasingly difficult. Even with dedicated hardware accelerators such as GPUs, a single machine can still be a bottleneck for fast, efficient training. The community has therefore increasingly turned to distributed training algorithms (e.g., ring-allreduce) and distributed training frameworks (e.g., Horovod). The goal of this project is to study the communication patterns between servers during training, examining how the training performance of different models is affected by different configurations, with special emphasis on inter-server communication over the network and intra-server communication between GPUs.
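The ring-allreduce algorithm mentioned above can be illustrated with a toy single-process simulation (a sketch for intuition only; real frameworks such as Horovod dispatch to NCCL or MPI collectives, and the function below is a hypothetical helper, not part of any library):

```python
def ring_allreduce(workers):
    """Sum-allreduce across `workers`, a list of equal-length gradient vectors.

    Each vector is split into N chunks (N = number of workers). In N-1
    scatter-reduce steps, each chunk travels around the ring accumulating
    partial sums; in N-1 allgather steps, the fully reduced chunks are
    circulated so every worker ends up with the complete sum. Each worker
    sends and receives roughly 2*(N-1)/N of its data regardless of N,
    which is what makes ring-allreduce bandwidth-efficient.
    """
    n = len(workers)
    size = len(workers[0])
    chunk = size // n  # assume size divisible by n for simplicity

    def seg(c):  # indices of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Scatter-reduce: worker r sends chunk (r - s) mod n to its ring neighbor,
    # which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n,
                  [workers[r][i] for i in seg((r - s) % n)]) for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            for j, i in enumerate(seg(c)):
                workers[dst][i] += data[j]

    # Allgather: the fully reduced chunks are passed around the ring and
    # overwrite each neighbor's stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n,
                  [workers[r][i] for i in seg((r + 1 - s) % n)]) for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            for j, i in enumerate(seg(c)):
                workers[dst][i] = data[j]
    return workers
```

For example, with two workers holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, both end up with the elementwise sum `[11, 22, 33, 44]` after one scatter-reduce and one allgather step.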
"Taking 6.033 (Computer Systems Engineering), I became fascinated by systems in general and by the internet in particular. Our project aims to explore new designs for data centers, a backbone of scalable modern internet technologies. Through this, I hope to learn about different architectures and schools of thought, while also learning to propose, communicate, explore, and validate ideas. I am most excited about working within a team."