Almost every data scientist new to the field starts with the resources already at hand: a Personal Computer. However, they quickly run into very poor performance, which can be frustrating. It’s not the Personal Computer’s fault; Personal Computers are generally intended for light usage and mobility.
The next logical step is to move to the cloud. And why not indeed? The cloud is fast, cheap and very flexible. A typical Amazon EC2 GPU Instance (p2.xlarge) costs only $0.90 per hour at the time of writing. What made Public Cloud computing so cheap was the introduction of the hypervisor, a software layer that sits between the hardware and the operating systems of the individual instances, or Virtual Machines.
This blogpost sheds some light on the impact of the hypervisor on Machine Learning applications.
The Test Environment
For the study to be meaningful, I set up two (almost) identical environments and ran exactly the same tests with the same datasets. One environment was a virtualized private cloud based on VMware vSphere 6. The other was a Bare Metal cloud with no virtualization. The Metal cloud was provided by Bigstep.
Each environment consisted of three servers running 64-bit CentOS 7. Each server contained 40 processors and 32GB of RAM. (The physical RAM was higher on the Metal Cloud nodes, but H2O was configured to use 30GB to match the virtualized environment.) The virtualized cloud had traditional mechanical disks spinning at 10k rpm, while the Metal Cloud had Solid State Disks. This difference did not affect the computing time, as will be clarified shortly.
For the actual Machine Learning computing, H2O was installed on all servers in both environments (H2O is available as a free download). An H2O cluster was formed to distribute the computing workload. In total, each environment had 120 processors and 96GB of RAM (to be precise, 2GB of RAM was reserved for the Operating System on each node, making the actual amount available to H2O 30GB per node).
H2O is a very efficient memory-based Machine Learning platform. It loads the entire dataset into memory and compresses it. Thus, once the dataset is loaded, disk performance becomes irrelevant. This is crucial for Big Data applications because disks can’t keep up with the high performance of today’s processors, especially in clustered environments.
To ensure consistency between the two environments, I used the “Airline Delay” example that is available in H2O. Two datasets were used: a large dataset consisting of 152 Million observations (about 15GB in size) and a small subset of it, consisting of 2,000 observations (about 4MB in size). The small dataset was used to see how the two environments behave when the dataset can fit into the cache memory (spoiler alert: this turned out to be very interesting and totally unexpected!).
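The cluster totals above follow directly from the per-node figures. The short sketch below only reproduces that arithmetic; the variable names are illustrative and are not H2O settings.

```python
# Per-node figures taken from the text; names are illustrative only.
NODES = 3
PROCESSORS_PER_NODE = 40
RAM_PER_NODE_GB = 32
OS_RESERVED_GB = 2  # left to the operating system on each node

total_processors = NODES * PROCESSORS_PER_NODE
total_ram_gb = NODES * RAM_PER_NODE_GB
h2o_ram_per_node_gb = RAM_PER_NODE_GB - OS_RESERVED_GB
h2o_ram_total_gb = NODES * h2o_ram_per_node_gb

print(f"processors in cluster: {total_processors}")
print(f"RAM in cluster: {total_ram_gb} GB")
print(f"RAM available to H2O: {h2o_ram_per_node_gb} GB per node, "
      f"{h2o_ram_total_gb} GB in total")
```

Note that the memory actually available to H2O across the cluster is therefore 90GB, which still leaves plenty of room for the 15GB dataset.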
The test itself consisted of three parts:
- Parsing the Data
- Training GLM Model
- Training Deep Learning Model
Parsing loads the dataset from disk into memory, converts it into H2O’s native format and then performs in-memory compression. The first step is very IO intensive, while the rest is both processor and memory intensive. In clustered environments, the data is parsed in parallel. Each node loads only a portion of the data (typically 1/number of nodes).
H2O’s Generalized Linear Models (GLM) estimate regression models for outcomes following distributions in the exponential family. GLM training is largely single-threaded, meaning that only one processor out of the 120 can be used at a time.
H2O’s Deep Learning is based on a multi-layer feed-forward Artificial Neural Network that is trained with stochastic gradient descent using back-propagation. As is typical of ANNs, it’s embarrassingly parallel and indeed fully hammered all available processors.
It is important to note that I had to disable the “early stopping” option to force H2O to perform the same amount of computation while training the network. This was a necessary measure due to the stochastic nature of the Deep Learning implementation.
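To get a feel for what "1/number of nodes" means for the large dataset, here is a minimal sketch that splits a row count into near-equal contiguous ranges. It is an illustration only: the helper `split_rows` is hypothetical, and H2O's real chunk placement is more involved than an even split.

```python
from typing import List

def split_rows(total_rows: int, nodes: int) -> List[range]:
    """Split row indices into `nodes` contiguous, near-equal ranges."""
    base, extra = divmod(total_rows, nodes)
    ranges = []
    start = 0
    for i in range(nodes):
        # The first `extra` nodes take one additional row each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

# Approximate row count of the large dataset, three nodes per cluster.
parts = split_rows(152_000_000, 3)
for node, rows in enumerate(parts):
    print(f"node {node}: {len(rows):,} rows")
```

With three nodes, each one ends up parsing roughly 50 million rows, or about 5GB of the 15GB file.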
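For readers who want to set up comparable runs, the sketch below collects the Deep Learning settings used in the tests as keyword arguments. It assumes the H2O Python client, which this post does not confirm was the interface used. `fast_mode`, `epochs` and `stopping_rounds` are parameters of H2O's Deep Learning estimator, and to my understanding setting `stopping_rounds` to 0 is what turns early stopping off. The helper `deep_learning_params` is hypothetical, and the network layout and other settings are left at whatever the example defines.

```python
def deep_learning_params(fast_mode: bool, epochs: float) -> dict:
    """Build keyword arguments for one Deep Learning run (illustrative helper)."""
    return {
        "fast_mode": fast_mode,
        "epochs": epochs,
        "stopping_rounds": 0,  # assumption: 0 disables early stopping
    }

test1 = deep_learning_params(fast_mode=True, epochs=10)
test2 = deep_learning_params(fast_mode=False, epochs=0.1)

# With H2O installed and a cluster running, these could be passed on as
# H2ODeepLearningEstimator(**test1); that call is not executed here.
print(test1)
print(test2)
```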
Each test was repeated three times and the best time was recorded.
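A best-of-three measurement like this is easy to script. The following is a minimal sketch using only the Python standard library; `best_of` is a hypothetical helper, and the lambda is a stand-in workload rather than an actual H2O call.

```python
import time
from typing import Callable

def best_of(run: Callable[[], object], repeats: int = 3) -> float:
    """Call `run` `repeats` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; in the real tests this would be a parse or training call.
best = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {best:.4f} s")
```

Taking the minimum rather than the mean filters out runs that were slowed down by unrelated background activity.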
Test 1: 15GB file, DL Fast Mode: True, Number of Epochs: 10
Parsing large files is a very IO-intensive task. In this respect, even a relatively fast Hard Disk (spinning at 10k rpm compared to 7.2k rpm on desktops and 5.4k rpm on laptops) is showing its age.
The Fast Mode in H2O’s Deep Learning enables a minor approximation in back-propagation. This basically means that less computation is performed on each observation while training the Neural Network. Effectively, this makes the test memory bound. The Bare Metal cluster managed to crunch 260 thousand samples per second against 198 thousand for the virtualized cluster, a gap of about 24%, or roughly a quarter (you can think of about 30 of the 120 processors being wasted by the hypervisor!).
The performance gap was much more visible in the GLM test, which, as indicated previously, is largely single-threaded.
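The size of the gap can be checked from the two throughput figures. This short calculation measures the gap relative to the Bare Metal result and converts it into an equivalent number of processors, assuming throughput scales linearly with processor count (a simplification).

```python
bare_metal = 260_000   # samples per second, from the text
virtualized = 198_000  # samples per second, from the text
total_processors = 120

# Throughput lost on the virtualized cluster, relative to bare metal.
gap = (bare_metal - virtualized) / bare_metal

# Assumption: throughput scales linearly with the number of processors.
equivalent_processors = gap * total_processors

print(f"gap: {gap:.1%}")
print(f"equivalent processors lost: {equivalent_processors:.1f}")
```

This prints a gap of 23.8% and about 28.6 processors, which is where the "roughly a quarter" and "about 30 processors" round numbers come from.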
Test 2: 15GB file, DL Fast Mode: False, Number of Epochs: 0.1
Note: For this test, the number of epochs was reduced to 0.1 only.
Here, the Deep Learning test was repeated but with Fast Mode disabled to force H2O to perform more computation on every observation. Thus, this test is largely processor bound. The performance gap was reduced to about 15% thanks to the hardware-assisted virtualization features of the Intel Xeon processors. The Intel Virtualization Technology bundle (VT-x, VT-d, VT-x with Extended Page Tables) significantly reduces virtualization overhead by offloading certain workloads from the hypervisor to dedicated functional units in the processor itself.
Test 3: 4MB file, GLM
The outcome of this test was very interesting (and rather controversial!). The Bare Metal cluster performed exactly as one would expect, taking about 2 seconds to train the model. The virtualized cluster had enormous trouble converging the model and took phenomenally longer. During the test, I observed that the busy thread kept jumping from one node to another. Since this behavior was not observed in the Bare Metal cluster, it can most likely be attributed to the scheduler of the hypervisor. Interestingly, H2O took noticeably less time to train the same GLM model using the larger dataset!
Most cloud computing providers rely on virtualization to deliver cheap virtual machines to consumers, data scientists included. Originally, virtualization was developed to increase the efficiency and utilization of computing resources in typical business environments. The virtualization overhead can be justified in these environments because computing resources are normally underutilized. However, for Machine Learning tasks, where computing resources can be pushed to the extreme, the impact of virtualization can be overwhelming and unpredictable. Data scientists who are considering a virtualized cloud platform such as Amazon EC2, Microsoft Azure or Google for Deep Learning or similar workloads should consider buying extra resources to make up for the overhead of the hypervisor.
An alternative to the hypervisor is to use Containers such as Docker. While this may eliminate the overhead of the hypervisor, performance can still suffer due to oversubscription of hardware resources. Containers are also considered less secure than virtual machines.
The ultimate option for Machine Learning tasks remains dedicated hardware, either on-premises or off-premises, such as Bigstep’s Bare Metal cloud.
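How much extra capacity is "extra"? As a rough sketch, if virtualization costs a fraction f of the throughput, matching Bare Metal takes about 1/(1 - f) times the resources. The calculation below applies this to the approximate Deep Learning gaps measured above. It assumes performance scales linearly with added resources, which real clusters rarely achieve, so treat the results as a lower bound; `scale_factor` is a hypothetical helper.

```python
def scale_factor(overhead: float) -> float:
    """Resources needed, relative to bare metal, to offset a fractional throughput loss."""
    if not 0 <= overhead < 1:
        raise ValueError("overhead must be in [0, 1)")
    return 1 / (1 - overhead)

# Approximate gaps from the Deep Learning tests in this post.
cases = [
    ("memory-bound Deep Learning (Test 1)", 0.25),
    ("processor-bound Deep Learning (Test 2)", 0.15),
]
for label, overhead in cases:
    print(f"{label}: {scale_factor(overhead):.2f}x the resources")
```

Under these assumptions, the memory-bound case calls for about a third more capacity and the processor-bound case for a little under a fifth more.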