Summary

Stochastic Gradient Descent (SGD) is widely used in machine learning (it first appeared in the 1950s and keeps reappearing as "different" algorithms), but it is sequential by nature. There had been many proposals for parallelizing it before Hogwild, but they all require locking, and even the schemes with linear speedup in theory perform poorly in practice:

* Master/worker
* Round robin
* Average/batch gradient

Bottom line: communication is too expensive. Hogwild is an algorithm that eliminates the overhead of synchronization by allowing all processors to access shared memory and update it at will. Counterintuitively, this scheme actually works: when data access is sparse, memory overwrites are rare and the error they introduce is minimal. The paper formalizes measures for when the algorithm works well and provides canonical examples. In the evaluation, the authors found that performance is often better than the theoretical guarantees, and that the algorithm does well on a variety of machine learning applications.
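Here is a minimal sketch of the lock-free update scheme, assuming a simple squared-loss objective over sparse examples. The function name, data layout, and thread setup are illustrative assumptions, not the paper's implementation, and Python's GIL means this sketch does not reproduce the parallel performance the paper benchmarks; it only shows the shape of the scheme.

```python
import numpy as np
from threading import Thread

def hogwild_sgd(examples, n_features, n_threads=4, step=0.01, epochs=5):
    """Every worker reads and writes the shared weight vector w with no locks.

    Because each example touches only a few coordinates, concurrent overwrites
    of the same component are rare.
    """
    w = np.zeros(n_features)  # shared state, updated without locking

    def worker(chunk):
        for _ in range(epochs):
            for idx, x_vals, y in chunk:  # sparse example: (indices, values, label)
                pred = sum(w[i] * v for i, v in zip(idx, x_vals))
                err = pred - y            # squared-loss gradient on the touched coords
                for i, v in zip(idx, x_vals):
                    w[i] -= step * err * v  # unsynchronized component-wise write

    chunks = [examples[t::n_threads] for t in range(n_threads)]
    threads = [Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return w

# Tiny usage example: 10 features, each example touching only two of them.
if __name__ == "__main__":
    data = [([0, 3], [1.0, 2.0], 1.0), ([1, 7], [0.5, 1.5], -1.0)] * 100
    print(hogwild_sgd(data, n_features=10)[:5])
```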

To quantify when the algorithm is applicable, the authors devise the following measures in Section 2 and show that for typical sparse problems they are all quite small:

* Maximum edge size
* Maximum degree
* Maximum edge degree
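A small sketch of how these measures could be computed on the hypergraph where each hyperedge is the set of coordinates one training example touches. The function name and the normalization by the number of edges (chosen to match the "fraction of edges" phrasing used in the evaluation below) are my assumptions; the paper's exact definitions may be stated slightly differently.

```python
from collections import defaultdict

def sparsity_measures(edges):
    """edges: list of sets of variable indices (one set per example)."""
    n_edges = len(edges)

    # Maximum edge size: most coordinates any single example touches.
    omega = max(len(e) for e in edges)

    # For each variable, the set of edges that contain it.
    var_to_edges = defaultdict(set)
    for j, e in enumerate(edges):
        for v in e:
            var_to_edges[v].add(j)

    # Maximum degree: largest fraction of edges intersecting any one variable.
    delta = max(len(js) for js in var_to_edges.values()) / n_edges

    # Maximum edge degree: largest fraction of edges intersecting a given edge.
    rho = max(
        len(set().union(*(var_to_edges[v] for v in e))) / n_edges
        for e in edges
    )
    return omega, delta, rho

# Example: 4 examples over 5 variables, each touching at most 2 coordinates.
print(sparsity_measures([{0, 1}, {1, 2}, {3}, {3, 4}]))  # -> (2, 0.5, 0.5)
```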

Assume the longest delay between when an update is computed and when it is read back from memory is \tau (measured on the specific system), and choose the step size accordingly.
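The following is a purely illustrative heuristic, not the bound derived in the paper: it only captures the qualitative point that the step size should shrink as the measured delay \tau and the density \rho grow. The function name and the specific scaling are assumptions.

```python
# Illustrative heuristic only (NOT the paper's constant): shrink a serial
# step size as the measured delay tau and the edge density rho grow.
def hogwild_step_size(serial_step, tau, rho):
    return serial_step / (1.0 + rho * tau)

# Example: with tau = 8 stale steps and rho = 0.05, the step shrinks by ~29%.
print(hogwild_step_size(0.1, tau=8, rho=0.05))  # ~0.0714
```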

They also tried this on matrix factorization with biased sampling to increase locality (which breaks the assumptions needed for convergence) and still found it to work well. The algorithm shuffles the data (biased for locality), has helper threads asynchronously process the resulting independent subsets, and then merges the subsets back together (see the sketch below). They found that this makes it 100x faster than standard solvers and gives a 3x speedup over Hogwild on a 40-core machine. This variant has no theory: breaking the i.i.d. assumption should hurt performance, but it did not. They did some interesting math involving an arithmetic-geometric mean inequality for matrices and found an inequality (not proven) that seems to justify what they were seeing.
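A minimal sketch of the block-partition idea under stated assumptions: shuffle the rows and columns, split the ratings into a p x p grid of blocks, and process the p blocks on each "diagonal" in parallel, since those blocks touch disjoint rows and columns. The plain squared-error updates, fixed step size, function names, and Python threads are illustrative, not the paper's implementation.

```python
import numpy as np
from threading import Thread

def blocked_mf(ratings, n_rows, n_cols, rank=8, p=4, step=0.01, passes=5, seed=0):
    rng = np.random.default_rng(seed)
    L = 0.1 * rng.standard_normal((n_rows, rank))   # row factors (shared)
    R = 0.1 * rng.standard_normal((n_cols, rank))   # column factors (shared)

    # Random permutations bias which entries land in which block ("locality").
    row_perm = rng.permutation(n_rows)
    col_perm = rng.permutation(n_cols)

    # Bucket each (i, j, value) rating into its block of the p x p grid.
    blocks = [[[] for _ in range(p)] for _ in range(p)]
    for i, j, v in ratings:
        blocks[row_perm[i] * p // n_rows][col_perm[j] * p // n_cols].append((i, j, v))

    def sgd_on_block(entries):
        for i, j, v in entries:
            err = L[i] @ R[j] - v
            gi = err * R[j] + 0.01 * L[i]            # small L2 regularization
            gj = err * L[i] + 0.01 * R[j]
            L[i] -= step * gi
            R[j] -= step * gj

    for _ in range(passes):
        for d in range(p):                           # one diagonal at a time
            # Blocks (b, (b + d) % p) share no rows or columns, so the threads
            # never write to the same factor vectors.
            threads = [Thread(target=sgd_on_block, args=(blocks[b][(b + d) % p],))
                       for b in range(p)]
            for t in threads: t.start()
            for t in threads: t.join()
    return L, R

# Usage: factor a tiny synthetic 20 x 20 ratings sample.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    obs = [(rng.integers(20), rng.integers(20), rng.standard_normal()) for _ in range(200)]
    L, R = blocked_mf(obs, n_rows=20, n_cols=20)
    print(L.shape, R.shape)
```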

Evaluation

(Figure: performance/speedup results across the benchmark datasets.)

We can see that the denser the data set, as measured by \rho (the maximum fraction of edges that intersect any given edge) and by \delta, the node regularity shown in the 4th column (the maximum fraction of edges that intersect any variable), and the larger its size, the smaller the performance gain, though the gain stays positive. Even with large values of these measures the speedup is still around 5x. The takeaway is that they have some theory about when this should work, and even when those requirements are not satisfied the algorithm still works.

Ben Recht has a great talk.