0:15

Specifically, we will look at scaling challenges, data and model parallelism, AllReduce approaches, and synchronous and asynchronous multi-node training and their trade-offs.

Before we do, let's review gradient descent and stochastic gradient descent.

When using gradient descent to find a local minimum, we take steps proportional to the negative of the gradient of the cost with respect to the weights.

However, there are three main issues with gradient descent.

First, in gradient descent, you have to run through all of the samples in your training set to do a single update for a parameter in a particular iteration.

0:54

Second, gradient descent gets stuck at saddle points, which are very common in high-dimensional spaces.

AlexNet, for example, has 60 million dimensions, making saddle points a significant issue.

Finally, gradient descent is thought to converge to sharp minima more frequently.

This is problematic, as even slight variations between the training and test objective functions can cause dramatic shifts in their values when in a sharp minimum.

On the other hand, stochastic gradient descent solves many of these issues by breaking the training data set into minibatches, each of which is then used to compute a parameter update.

This algorithm is more exploratory than gradient descent, converges to flatter minima, does not get caught in saddle points as easily, and is much more computationally efficient.

As seen here, in gradient descent, to make a weight update, one must go through the entire training data set, which can be very slow.

On the other hand, stochastic gradient descent makes an update after each minibatch, which has a dramatic effect on efficiency.

Nearly all, if not all, deep learning models are trained with some variant of stochastic gradient descent for these reasons.
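The contrast can be sketched in a few lines of Python. The one-weight least-squares model, learning rate, and batch size below are illustrative choices, not values from the lecture.

```python
import random

# Toy data for a one-weight linear model y = w * x; the true w is 3.0.
data = [(float(x), 3.0 * x) for x in range(1, 11)]

def grad(w, x, y):
    # Gradient of the per-example squared error (w*x - y)^2 w.r.t. w.
    return 2.0 * (w * x - y) * x

def gradient_descent_step(w, lr=1e-3):
    # One update requires a pass over the ENTIRE training set.
    g = sum(grad(w, x, y) for x, y in data) / len(data)
    return w - lr * g

def sgd_epoch(w, batch_size=5, lr=1e-3):
    # One pass makes len(data)/batch_size updates, one per minibatch.
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        g = sum(grad(w, x, y) for x, y in batch) / len(batch)
        w -= lr * g
    return w
```

Starting from w = 0, repeated calls to `sgd_epoch` settle near the true weight of 3.0 while making twice as many updates per pass over the data as full-batch gradient descent.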

We'll now transition to discussing scaling challenges.

When using SGD, choosing the right batch size is very important.

2:20

For smaller batch sizes, one does not efficiently utilize the computational resources.

Yet for larger batch sizes, one can run into issues similar to those we explored with gradient descent: very slow progress near saddle-like points, and getting stuck in sharp minima.

With very deep networks trained on large data sets, efficiently parallelizing networks across multiple servers becomes essential in order to minimize the time to train.

To this end, we will explore two algorithms, data parallelism and model parallelism.

In model parallelism, we split the model weights among N nodes, with each node working on the same minibatch.

With data parallelism, we use the same model on every node, but feed it different parcels of data.

Data parallelism is better for networks with few weights, like GoogLeNet, and is currently used in Intel Optimized Caffe.

Visually, we can see that with data parallelism, the algorithm distributes the data between various nodes.

Each node independently tries to estimate the same parameters, and the nodes then exchange their estimates for each layer, using a parameter server or an AllReduce method, as we will discuss, to come up with the right estimate for the step.

While the minibatch is distributed among many nodes, three in this example, one cannot simply increase the batch size by three times, as the time to train increases with larger minibatches, due to issues similar to those with gradient descent.

With model parallelism, the algorithm sends the same data to all nodes, but each node is responsible for estimating different parameters.

These nodes then exchange their estimates with each other to come up with the right estimates for all the parameters.
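A minimal sketch of the idea in Python, assuming one fully connected layer whose output neurons are split across nodes; the sharding scheme and the values are illustrative, not the lecture's implementation.

```python
def model_parallel_forward(x, weight_shards):
    # Model parallelism: every node receives the SAME input x, but
    # holds only its own shard of the layer's weight rows.
    # Each node computes the outputs for its shard of neurons ...
    partials = [[sum(w * xi for w, xi in zip(row, x)) for row in shard]
                for shard in weight_shards]
    # ... and the nodes then exchange (here: concatenate) their
    # partial results to recover the full layer output.
    return [v for part in partials for v in part]

# Two nodes sharing a 3-neuron layer over a 2-feature input.
out = model_parallel_forward([1.0, 2.0],
                             [[[1.0, 0.0]],                 # node 1
                              [[0.0, 1.0], [1.0, 1.0]]])   # node 2
# out == [1.0, 2.0, 3.0]
```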

Data parallelism is preferred when the number of weights is small, which is true for linear topologies.

When updating weights on a single node using stochastic gradient descent, we pass in x training examples, where x is the batch size, and forward-propagate them through the network.

After computing the cost over all examples, we then compute the data gradients from layer to layer, and update the model weights using the weight gradients.

When we have multiple nodes, 32 in this example, and are using data parallelism, we partition the minibatch into 32 subsections and distribute them to 32 workers.

When back-propagating, each worker computes a sum of the weight gradients for its subset of the batch.

The weight gradients are then summed across workers, producing an identical numerical result to what one would find with a single node training with a large batch size.
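That numerical equivalence is easy to check in Python; the toy per-example gradient below stands in for full backpropagation, and the 4-worker split is illustrative.

```python
def example_grad(w, x, y):
    # Per-example gradient of the squared error (w*x - y)^2 w.r.t. w.
    return 2.0 * (w * x - y) * x

w = 0.5
minibatch = [(float(x), 2.0 * x + 1.0) for x in range(32)]

# Single node: sum the weight gradient over the whole minibatch.
single_node = sum(example_grad(w, x, y) for x, y in minibatch)

# Data parallelism: partition the minibatch across 4 workers, let each
# worker sum its local gradients, then sum across workers (this
# cross-worker sum is what the gradient aggregation step performs).
shards = [minibatch[i::4] for i in range(4)]
local_sums = [sum(example_grad(w, x, y) for x, y in s) for s in shards]
distributed = sum(local_sums)
# single_node and distributed agree (up to floating-point reordering)
```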

5:16

Deep learning practitioners have demonstrated scaling across various nodes.

Baidu distributed training to 40 GPU nodes; later that year, UC Berkeley scaled training to 120 GPU nodes.

Their paper provided sufficient details for other practitioners to build upon their work.

A few months later, Intel demonstrated scaling to 128 CPUs, Google to 120 GPUs, and Amazon to 120 GPUs.

Most recently, and not shown in this slide, Facebook demonstrated near-linear scaling to 256 GPUs, reducing the time to train from several days to just one hour.

With very large batch sizes, however, training slows and may not reach the same accuracy.

Therefore, let's assume that we have a batch size of 1024; how can we distribute the data across nodes?

One option is to have 1024 nodes, each with a batch size of 1.

However, with this arrangement, the communication between the nodes becomes a bottleneck, and the computation on each node is too little.

On the other hand, using 16 nodes, each with a batch size of 64, is more reasonable, as most of the communication is hidden behind the computation.

Multi-node training on Intel Caffe, which uses data parallelism, works in the following manner.

First, the data on a given node is forward-propagated through the network, which in this case is composed of two layers, L1 and L2.

Then the L2 gradients are sent to the parameter server after that layer has been back-propagated through.

Similarly, the L1 gradients are subsequently sent to the server after L1 has been back-propagated through.

When the server receives the L2 gradients from all nodes, it applies an update and broadcasts it to all the nodes, and likewise with the L1 gradients.

Nodes wait for these updates before forward-propagating through the updated network.

Now that we have discussed how data and model parallelism work, we will consider strategies for implementing gradient aggregation: parameter server, reduction trees, rings, and butterfly.

One strategy for communicating gradients is to appoint one node as the parameter server, which computes the sum of the communicated gradients and sends the updates to each of the workers.

However, there is a bottleneck in sending and receiving all of the gradients with just one parameter server.
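A single-process sketch of that pattern in Python; the worker gradients, learning rate, and two-weight model are made-up values for illustration.

```python
def parameter_server_step(worker_grads, weights, lr=0.1):
    # The parameter server sums the gradients communicated by all
    # workers, applies a single update to the weights, and broadcasts
    # the new weights back to every worker.
    summed = [sum(per_weight) for per_weight in zip(*worker_grads)]
    return [w - lr * g for w, g in zip(weights, summed)]

# Three workers each report a gradient for a two-weight model; the
# summed gradient is [6.0, 0.0].
new_w = parameter_server_step([[1.0, 0.0], [2.0, 1.0], [3.0, -1.0]],
                              [0.5, 0.5])
```

The bottleneck the lecture mentions is visible here: all gradient traffic flows through this single routine, so one node's bandwidth bounds the whole system.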

Another strategy is an AllReduce tree.

AllReduce is a communication method where each worker produces one or more data values that must be globally reduced, generally with a commutative, binary, element-wise operator, to produce a single result value.

This single value must then be broadcast to all workers before they can continue.

In an AllReduce tree, the local gradient information is reduced across the entire network using a tree-based algorithm, and the result is then broadcast to each individual node.

8:35

In this example, there are seven nodes, each with a gradient value between one and seven, and the goal of the AllReduce tree is to sum all of the values.

In the first step, 1, 2, and 5 are summed to 8, and 3, 4, and 6 are summed to 13.

Then these results, 8 and 13, are summed with 7 to make 28 in the next step.

Finally, 28 is broadcast to the rest of the network in just two steps.

In total, if N is the number of nodes, the time is a function of log N, which makes it ideal when the number of nodes is a power of two minus 1.
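A sketch of a reduce-then-broadcast tree in Python; it pairs adjacent values level by level rather than using the exact groupings in the lecture's figure, but it reaches the same sum of 28 for the seven values.

```python
def tree_allreduce(values):
    # Reduce phase: sum adjacent pairs level by level; each level is
    # one communication step, so the reduce takes about log2(N) steps.
    level = list(values)
    reduce_steps = 0
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        reduce_steps += 1
    # Broadcast phase: the single result is sent back down the tree so
    # that every node ends up holding it.
    return [level[0]] * len(values), reduce_steps

result, steps = tree_allreduce([1, 2, 3, 4, 5, 6, 7])
# every node holds 28, after 3 reduce steps for 7 nodes
```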

9:19

AllReduce butterfly requires the least number of steps, as it does not require separate reduce and broadcast phases.

In this example, there are four nodes, and the diagram, read from the top, shows how the communication happens at each step.

As shown here, in step one, nodes 1 and 2 are summed concurrently with nodes 3 and 4, resulting in the first two nodes holding the value 3 and the next two nodes holding the value 7.

In step two, the nodes with 3 and 7 communicate, resulting in all the nodes holding a value of 10.

The complexity of this algorithm is also log2(N), but it requires half the steps of an AllReduce tree.
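The four-node example can be sketched with bitwise partner exchange in Python (a hypercube-style butterfly, assuming the node count is a power of two):

```python
def butterfly_allreduce(values):
    # At each step, node i exchanges its partial sum with the node
    # whose index differs in exactly one bit, so after log2(N) steps
    # every node holds the full sum -- no separate broadcast is needed.
    n = len(values)  # must be a power of two
    vals = list(values)
    dist = 1
    while dist < n:
        vals = [vals[i] + vals[i ^ dist] for i in range(n)]
        dist *= 2
    return vals

# Four nodes holding 1, 2, 3, 4: after step one the partial sums are
# [3, 3, 7, 7]; after step two every node holds 10.
out = butterfly_allreduce([1, 2, 3, 4])
```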

A ring AllReduce algorithm is useful as the communication cost per node is constant and independent of the number of nodes in the system, determined solely by the slowest connection between neighbors.

10:16

Each node will only ever send data to its right neighbor, and receive data from its left neighbor.

The algorithm proceeds in two phases: first, a scatter-reduce to exchange and sum the data, and then an all-gather to communicate the final result.
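The two phases can be sketched in Python. The toy below simulates the ring on plain lists, staging each step's sends before applying them so that all nodes "communicate" simultaneously; the chunk indexing follows one common convention, not necessarily the lecture's figure.

```python
def ring_allreduce(node_chunks):
    # node_chunks[i][j] is node i's local value for chunk j; with N
    # nodes, the gradient is split into N chunks. Every node only ever
    # sends to its right neighbor and receives from its left neighbor.
    n = len(node_chunks)
    chunks = [list(c) for c in node_chunks]

    # Phase 1, scatter-reduce: in N-1 steps, partial sums travel around
    # the ring; afterwards node i holds the fully reduced chunk (i+1)%n.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, chunks[i][(i - step) % n])
                 for i in range(n)]          # stage all sends first,
        for dest, j, val in sends:           # then apply them, so the
            chunks[dest][j] += val           # exchanges are concurrent

    # Phase 2, all-gather: N-1 more steps circulate the reduced chunks
    # so that every node ends up with every fully reduced chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n,
                  chunks[i][(i + 1 - step) % n]) for i in range(n)]
        for dest, j, val in sends:
            chunks[dest][j] = val
    return chunks

# Three nodes, each holding a 3-chunk gradient of all 1s, 2s, or 3s:
result = ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
# every node ends with the elementwise sum [6, 6, 6]
```

Each step moves one chunk per node, which is why the per-step cost is independent of the node count.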

While we have looked at synchronous multi-node training, we will now briefly discuss asynchronous multi-node training and its trade-offs.

In order to accelerate the convergence of SGD and improve the training speed, asynchronous parallelization of stochastic gradient descent has been investigated.

While asynchronous SGD overcomes communication delays, it comes with a myriad of problems.

The algorithm requires more tuning of hyperparameters, such as momentum and learning rate, and requires more epochs to train.

Furthermore, it does not match single-node performance, it is quite hard to debug, and it has not been shown to scale and retain accuracy on large models.

Therefore, while there do exist benefits to asynchronous SGD, many issues have not yet been fully addressed.
