Ever wondered why Amazon’s recommendations feel sub-optimal despite the company having so much of our data? Here’s some insight into why that’s the case. In the simplest terms, identifying good one-to-one recommendations is HARD.
Amazon open-sourced DSSTNE, the library that drives its search and recommendation algorithms. The Deep Scalable Sparse Tensor Network Engine (DSSTNE), pronounced “Destiny”, is an Amazon-developed library for building deep learning (DL) models. Amazon claims that DSSTNE significantly outperforms other open-source DL libraries on tasks where the training data is sparse (where almost all the values are zero).
So what is sparse data? It’s data where almost all the values are zero: most users have interacted with only a tiny fraction of the catalog, so there isn’t a whole lot of valuable information per user. Recommendations usually operate on sparse data; not everything is connected, but you can still find some valuable links between people and items. Even with sparse data, making good recommendations requires neural networks at enormous scale. A simple 3-layer autoencoder with an input layer of hundreds of millions of nodes (one for every product), a hidden layer of 1,000 nodes, and an output layer mirroring the input layer has upwards of a trillion parameters to learn. That’s the scale of the problem they are dealing with. Now imagine if there were more useful data: it would compound this complexity!
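To put some rough numbers on that, here’s a minimal Python sketch (not DSSTNE code; the product count, hidden size, and user history are all hypothetical) showing just how sparse one shopper’s input vector is, and how quickly the parameter count explodes:

```python
# Rough illustration (not DSSTNE code) of why recommendation data is sparse
# and why the resulting autoencoder is enormous. All numbers are hypothetical.

num_products = 500_000_000   # one input/output node per product
hidden_units = 1_000

# A single shopper has bought only a handful of items, so their input
# vector is almost entirely zeros -- that is the sparsity.
user_history = {17, 93_402, 8_204_551}            # indices of purchased products
print(f"non-zero inputs: {len(user_history) / num_products:.8%}")

# Parameters of the 3-layer autoencoder: input->hidden plus hidden->output.
params = num_products * hidden_units * 2
print(f"parameters to learn: {params:,}")         # one trillion at this scale
```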
Most ML libraries implement data-parallel training: they split the training data across multiple GPUs, with each GPU holding a full copy of the model. This works, but there’s a tradeoff between speed and accuracy. DSSTNE instead uses model-parallel training: rather than splitting the data, it splits the model itself across multiple GPUs, so the layers are spread out across all the GPUs in the same server automatically.
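Here’s a minimal numpy sketch of the model-parallel idea. This is not DSSTNE’s actual implementation; the names and sizes are illustrative, and real systems exchange the partial activations over the GPU interconnect rather than concatenating in-process:

```python
# Model-parallel sketch: instead of giving every GPU a full copy of the
# weights, split the weight matrix column-wise so each device owns one
# slice of the hidden layer. Toy sizes, purely illustrative.

import numpy as np

num_gpus = 4
input_dim, hidden_dim = 100_000, 256

rng = np.random.default_rng(0)
# Each "GPU" holds only hidden_dim / num_gpus columns of the weight matrix.
shards = [rng.standard_normal((input_dim, hidden_dim // num_gpus)).astype(np.float32)
          for _ in range(num_gpus)]

x = np.zeros(input_dim, dtype=np.float32)         # one user's sparse input
x[[17, 9_402, 64_001]] = 1.0                      # the few items they bought

# Each device computes its slice of the hidden activation independently...
partials = [x @ w for w in shards]
# ...and the slices are combined to form the full hidden layer.
hidden = np.concatenate(partials)
assert hidden.shape == (hidden_dim,)
```

Data-parallel training would instead replicate the entire weight matrix on every GPU and average gradients afterwards; model parallelism trades that replication for extra communication of activations, which is what makes huge layers fit at all.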
Amazon had to do this because the weight matrices it needs for recommendations, that is, all the mappings of users and attributes, simply didn’t fit in the memory of a single GPU. For example, in a 3-layer autoencoder with 8 million nodes in the input and output layers and 256 nodes in the hidden layer, each weight matrix consumes about 8GB in single-precision arithmetic. Even training such a network on shopping data from tens of millions of users with existing open-source software would take weeks on the fastest GPUs on the market.
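The arithmetic behind that figure is straightforward; here’s a quick back-of-the-envelope check, assuming 4 bytes per single-precision weight:

```python
# Back-of-the-envelope check of the memory figure quoted above,
# assuming single-precision (4-byte) weights.

input_nodes = 8_000_000
hidden_nodes = 256
bytes_per_weight = 4

one_matrix_bytes = input_nodes * hidden_nodes * bytes_per_weight
print(one_matrix_bytes / 1e9, "GB per weight matrix")   # ~8.2 GB
```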
You will find more information in Build a Movie Recommender – Machine Learning for Hackers and in the DSSTNE repository.