Amir Farhat

amirf@mit.edu

Scholar Title

MIT EECS | Hewlett Foundation Undergraduate Research and Innovation Scholar

Research Title

Studying Communication Patterns in Distributed Machine Learning Training

Cohort

2019–2020

Department

Electrical Engineering and Computer Science

Research Areas

Systems and Networking

Supervisor

Manya Ghobadi

ghobadi@csail.mit.edu

Abstract

As deep learning grows more popular and its models become larger, training them becomes increasingly difficult. Even with dedicated hardware accelerators such as GPUs, a single machine can still be a bottleneck for fast, efficient training. The community has therefore increasingly turned to distributed training algorithms (e.g., ring-allreduce) and distributed training frameworks (e.g., Horovod). The goal of this project is to study the communication patterns between servers during distributed training and how different configurations affect the training performance of different models, with special emphasis on inter-server communication over the network and intra-server communication between GPUs.

Quote

Taking 6.033 (Computer Systems Engineering), I became fascinated with systems in general but with the internet in particular. Our project aims to explore new designs of data centers, a backbone of scalable modern internet technologies. Through this, I hope to learn about different architectures and schools of thought, while also learning to propose, communicate, explore, and validate ideas. I am most excited about working within a team.
