Scaling Performance

Obtaining data has never been easier. Individuals share their daily activities on social networks: the places they visit, the things they like (or dislike), the items they purchase, the movies they watch, and the photos and videos they take anywhere, at any time, with their smartphones’ high-res cameras. Organisations aren’t much different in principle: they share data about events they organise or attend, data on their products and services, and data on their financial performance. Researchers have a lot to share too; data on scientific discoveries, breakthroughs, inventions and the like can be found in publicly and freely accessible databases.

There is simply more data at our disposal than we could ever hope to use. The problem, however, remains: how do we turn this massive amount of data into meaningful information, and ultimately into insights that can make a difference?

To make matters a little more inconvenient, not all data are relevant – at least to the problem at hand – and not all relevant data are useful. Relevant data often contain “noise” and “outliers” that are, at best, not useful (mind you, at times we are more interested in the outliers than in the normal data). After all: garbage in, garbage out.
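To make the noise/outlier point concrete, here is a minimal sketch of outlier filtering using a z-score cut-off. The function name, threshold and sample data are illustrative, not from any particular library; note that with small samples a single extreme value inflates the standard deviation, so a tighter threshold is used here than the textbook value of 3.

```python
import statistics

def filter_outliers(values, z_threshold=3.0):
    """Keep only points within z_threshold standard deviations of the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return list(values)  # all values identical; nothing to filter
    return [v for v in values if abs(v - mean) / stdev <= z_threshold]

# 300 is an obvious outlier; a threshold of 2.0 is used because the
# outlier itself inflates the standard deviation of this tiny sample.
clean = filter_outliers([10, 12, 11, 13, 9, 300], z_threshold=2.0)
```

Whether a point is “noise” or the most interesting part of the dataset depends entirely on the question being asked, which is why this kind of filtering must be deliberate rather than automatic.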

The first step in the long journey towards insight is to sort the data into “relevant” and “irrelevant”, which requires some computation (in many cases, a lot of computation).

With a massive data size that keeps growing, fitting all of it in a single computer is becoming increasingly complicated.

This lengthy introduction should have made one point clear: the importance of scalability cannot be overstated.


Current Approaches to Scalability

Generally speaking, scalability approaches fall into two broad categories: vertical and horizontal.

Scaling up vertically refers to adding more hardware resources to the same computer – in other words, making an existing computer more capable. Horizontal scalability, on the other hand, adds additional “nodes” to an existing “cluster”, making the cluster as a whole more capable.

Both approaches allow more data to be crunched in the same amount of time, and this is where their similarity ends. Each approach will be discussed in turn.


Vertical Scalability

Upgrading a single computer depends heavily on the architecture of that particular computer. For instance, computers built for desktop use often include a single CPU socket, a few memory slots and limited PCI-E lanes (though they do incorporate plenty of external ports). Desktop platforms aren’t built with reliability in mind: they lack basic error-recovery features such as ECC (Error Correcting Code) memory. As of the time of this blog, the most capable desktop CPU from Intel is the i7-6950X Extreme Edition, with 10 cores and Hyper-Threading support. It can access up to 128GB of RAM in quad-channel configuration.

Storage capacity is limited by the number of available ports (typically SATA3) or by the available disk bays in the chassis, whichever is smaller. While it is possible to attach fast Solid State Drives (SSDs) to a desktop computer, performance is likely to be capped by the SATA3 controller itself, which can handle a data rate of 6Gbps (approximately 600MBps) at most. Indeed, a single SSD nowadays can saturate the disk controller; adding more disks simply splits the bandwidth, so each disk runs at a lower speed.
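The bandwidth-splitting effect is simple arithmetic, sketched below. The 600MBps figure is the approximate SATA3 ceiling mentioned above; the function name is illustrative, and real-world throughput will be lower still due to protocol overhead.

```python
# Approximate SATA3 controller ceiling from the discussion above (MB/s).
SATA3_BANDWIDTH_MBPS = 600

def per_disk_bandwidth(num_disks, controller_bw=SATA3_BANDWIDTH_MBPS):
    """Best-case sequential bandwidth per disk when all disks are
    active simultaneously and share one controller."""
    return controller_bw / num_disks
```

With four SSDs active at once, each gets at best a quarter of the controller’s bandwidth – well below what a single modern SSD can deliver on its own.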

It must be noted that a seemingly equivalent mobile processor is considerably less capable than its desktop counterpart. Mobile processors are designed to prioritise efficiency over performance. In addition, laptops have fewer memory slots and work in dual-channel configuration.


The Server as a Better Option

The server architecture is far more robust and reliable than the desktop architecture in every single aspect (apart from the price, of course). A typical server contains 2 CPU sockets (4 is becoming more popular), plenty of memory slots and many more PCI-E lanes. For example, the Intel Xeon E5-2699v4 has 22 cores and supports two-way Hyper-Threading. It can access as much as 1.5TB of RAM in quad-channel configuration, with 76GBps of total memory bandwidth per socket. Servers also don’t rely on the limited SATA controller; instead, SAS (Serial Attached SCSI) is used to control the SAS disks. SAS provides double the bandwidth of SATA and supports more sophisticated features that increase both performance and reliability. With more PCI-E lanes at their disposal, multiple SAS controllers can be added to a single server without degrading I/O performance. Ultimately, servers are designed with mission-critical tasks in mind; as such, they are more resilient in handling long-running tasks without crashing.


The table below compares the main features of the Desktop and Server platforms:

Feature                      | Desktop          | Server
Number of sockets            | 1                | 2 or more
Max cores per socket         | 10               | 22
Max RAM per socket           | 128GB            | 1.5TB
ECC support                  | No               | Yes
PCI-E lanes per socket       | 40               | 40
Disk controller / bandwidth  | SATA3 / 600MBps  | SAS3 / 1200MBps
Disk bays (typical)          | 3 ~ 6            | 4 ~ 24


Horizontal Scalability

44 cores (88 threads), 3TB of RAM and 24 high-performance SAS disks sounds behemothic, and indeed it is. However, there are situations where even this much is insufficient. Crunching tens of terabytes of data can be so demanding that a single server takes too long, or even runs out of memory. Another scenario is when the data mining platform is shared among many data scientists and researchers, which is the common (and more realistic) case in academic and corporate environments. In such situations, multiple servers are joined together to form a “cluster”. From an end-user perspective, a cluster should look just like a single computer with ample computing resources. Such flexibility, however, adds an additional layer of complexity, as will be explained in a moment.

In a solo server (or desktop) scenario, a single instance of the data mining software accesses and processes the entire dataset. In a cluster, on the other hand, multiple instances of the data mining software must run (one instance per cluster node). This introduces two new problems:

  1. The dataset must be split evenly and distributed to all cluster nodes over the network. This is known as the Mapping phase.
  2. The outputs from all cluster nodes must be accumulated at a single cluster node. This is known as the Reduction phase.
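The two phases above can be sketched with Python’s multiprocessing standing in for cluster nodes. This is a toy model under stated assumptions: the function names are illustrative, the “work” is a simple local sum, and a real cluster would move the chunks over the network rather than between local processes.

```python
from functools import reduce
from multiprocessing import Pool

def split_evenly(dataset, num_nodes):
    """Mapping phase, step 1: partition the dataset across nodes.
    Round-robin slicing keeps the chunk sizes balanced."""
    return [dataset[i::num_nodes] for i in range(num_nodes)]

def node_work(chunk):
    """Each node processes its own chunk independently
    (here: a local sum, as a stand-in for real processing)."""
    return sum(chunk)

def run_cluster(dataset, num_nodes=4):
    chunks = split_evenly(dataset, num_nodes)
    with Pool(num_nodes) as pool:
        partials = pool.map(node_work, chunks)   # Mapping phase
    # Reduction phase: cannot start until *every* node has finished,
    # which is exactly why one slow node stalls the whole cluster.
    return reduce(lambda a, b: a + b, partials)

if __name__ == "__main__":
    print(run_cluster(list(range(1000))))  # 499500
```

Even in this toy version the structural problem is visible: `pool.map` blocks until the slowest worker returns, so the reduction inherits the latency of the worst node.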

These two problems in turn create many hard-to-crack challenges. For instance, splitting the dataset can lead to significant variance between the outputs of individual nodes when the data is unevenly distributed. The network also adds an additional layer of latency: an inefficient network topology can substantially degrade overall performance. The most challenging problem, however, is that the Reduction phase can only start once all nodes have finished processing – a single slow node can stall the entire cluster.



A high-end desktop computer is a great way to start: it’s affordable and can handle small datasets with acceptable performance. The mobile platform (laptops and tablets) is inefficient and very limited, and should be avoided unless no alternative is available. When the dataset is too large for a desktop computer, or when the algorithm is very CPU-intensive (as is typical of Deep Learning algorithms), the first alternative to consider should be a server with at least 2 CPU sockets and plenty of RAM slots.

Because there is no one-size-fits-all solution to cluster performance and scalability challenges, the cluster option should only be considered when the task at hand exceeds the capability of a single server. In that case, special attention must be paid to every detail: the hardware, the data mining software and the dataset to be crunched.


So, What about GPUs?

The introduction of GP-GPUs (General-Purpose Graphics Processing Units) has caused a major shift (or hype, if you like) in the data mining arena. Vendors like Nvidia have been making bold marketing claims around their GPU-powered Deep Learning appliances.

GP-GPUs promise performance boosts of up to three orders of magnitude! However, the notion that no one solution fits all still applies. GP-GPUs excel in scenarios where the “active” dataset (which can be the entire dataset) fits in the GPU’s local memory, which is very limited at best, and where the algorithm is parallel enough to utilise the vast number of small GPU cores. A proper assessment should be carried out before investing in GPU-enabled data mining solutions.
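Part of that assessment is a back-of-the-envelope check of whether the active working set fits in GPU memory at all. The sketch below is illustrative (the function name, the dense-float32 assumption and the 16GB figure are all assumptions, not vendor specs); if the data doesn’t fit, host-to-device transfers will erode much of the advertised speed-up.

```python
def fits_in_gpu(num_rows, num_features, bytes_per_value=4, gpu_mem_gb=16):
    """Rough check: does a dense matrix of float32 values
    (4 bytes each) fit in the GPU's local memory?"""
    needed_bytes = num_rows * num_features * bytes_per_value
    return needed_bytes <= gpu_mem_gb * 1024**3
```

A million rows of 100 float32 features is only about 0.4GB and fits comfortably; a billion rows of the same width needs roughly 400GB and would have to be streamed in batches, with the transfer cost counted against the GPU’s raw advantage.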

Comments

Jaafar, 2 years ago:
Intel has announced a new lineup of the Xeon Phi co-processor. I will cover it in a future blog.

  Oliver, 2 years ago:
  Good on you, Jaafar. We very much appreciate your contributions and cannot wait to read your next blog post.
