Keywords: Machine Learning, Operating Systems, Scalability, TensorFlow
Abstract:
Machine learning is widely used in academia and industry to study patterns, generate recommendations, and build statistical models. Training these models requires scalable infrastructure, both to reduce training time and to fit workloads within the working memory of the training device. Because vertical scaling is expensive, it is often more cost-effective to use many small machines than a single large machine with ample memory. In our research, we characterize the resource utilization of machine learning workloads during the training and inference phases. This characterization helps us understand how to improve a system's capability to run such workloads efficiently and how to optimize the system for them. In addition, we determine the limiting factors in the performance of training and inference by running these workloads on a standard system and using profiling tools to measure the load on memory, CPU, and power. We then analyze how different methods of handling system processes (such as page replacement and threading) affect workload performance. Afterward, we determine whether core scaling and memory binding play an important role in the scalability of machine learning model training. Finally, we investigate whether hyperthreading improves performance for model training, and whether the same holds for inter-operator parallelism.
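As a minimal sketch of the kind of experiment the abstract outlines (assuming TensorFlow 2.x on Linux, a synthetic workload, and illustrative core and thread counts rather than the study's actual configurations), the following times one training run under a fixed inter-/intra-op thread setting and a restricted core set:

    import os
    import time

    import numpy as np
    import tensorflow as tf

    # Restrict the process to a subset of cores to emulate core scaling
    # (Linux-only; the 4-core set below is an assumed example).
    os.sched_setaffinity(0, {0, 1, 2, 3})

    # Thread-pool sizes must be set before TensorFlow executes any op.
    # Inter-op parallelism runs independent operators concurrently;
    # intra-op parallelism splits work inside a single operator.
    tf.config.threading.set_inter_op_parallelism_threads(2)  # assumed value
    tf.config.threading.set_intra_op_parallelism_threads(4)  # assumed value

    # Synthetic data stands in for a real training set.
    x = np.random.rand(10_000, 32).astype("float32")
    y = np.random.randint(0, 2, size=(10_000, 1)).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # Time the training run; repeating this across core sets and thread
    # counts produces the scaling measurements described above.
    start = time.perf_counter()
    model.fit(x, y, batch_size=128, epochs=3, verbose=0)
    print(f"training time: {time.perf_counter() - start:.2f}s")

Sweeping the core set and thread counts across runs yields the scaling behavior the study measures; memory binding, by contrast, is typically applied from outside the process, for example with numactl --membind, and per-run CPU, memory, and power load can be collected with standard profilers such as perf.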