1 Kimikazu Kato Nihon Unisys, Ltd. You Might Also Like: A Multi-GPU Recommendation System

2 up compared to a latest up compared to a latest - - 300 speed 300 speed – – We have implemented a GPUs “recommendation system” on multiple CPU (single core implementation) x20 x20 • • • Summary

3 Principle of recommendation system Known serial algorithm k-nearest neighbor problem Singular value decomposition – – – – Conclusion Benchmark Introduction Algorithm on CUDA Algorithm on CUDA • Table of Contents • • • •

4 users’ preference Netflix Prize: competition about the predict of Sends DVDs you are likely to prefer “Customers who bought this item also bought” • • • Netflix Amazon – – Commonly used by web-based shops customers’ known preferences Predicts unknown preferences from Famous examples: Famous examples: • • • • Recommendation System

5 Example

6 Sometimes it takes days or weeks – Computationally heavy when data size is large Directly useful for business Directly useful for business Why recommendation system? • • •

7 0 3 3 Z 4 4 5 Y 1 4 3 X The answer is 5 4 W movie E B A C D D person Find a nearest vector to regarding a blank as Each person gives score to movies How it works A person who like movie “X” might also like... Answer: “W”

8 dimension of millions Millions of vectors and vectors of Imagine how many customs and items Amazon has Amazon has – – Some kind of data compression is necessary Direct computation is too heavy large Usually the size of the matrix is very • • However...

9 hundreds hundreds millions Approximates the given matrix with a product of smaller size matrices preserving “which is near” relationship Only a rough approximation is sufficient It is under the assumption that the phenomenon is explained by small number of features It is under the assumption that the phenomenon Singular Value Decomposition • • • • millions

10 most likely movie th Who cares the “10000 you might like?” – For each vector, find k nearest vectors to it k is small enough compared to the size of matrix of matrix • • k-Nearest Neighbor Problem

11 Neighbor K-Nearest Singular Value Decomposition Outline of the whole process

12 n isfied Coalesced memory access gives a big performance gai Each thread belong to a thread block Works on a GPU Memory access is coalesced if a certain rule is sat Optimal for thousands of threads Register, Shared memory, Global memory – – – – – – Hierarchical memory system Locality of memory access is important Programming environment with massive parallelism Hierarchical management of very light weighted threads threads CUDA • • • •

13 are desirable for each such that Minimize Error for each is estimated by Sparse (expressed by index-value pair) Computation of SVD To diminish freedom of the variables and attain computational stability, smaller and

14 for each such that For each where by sequential steepest descent method Fill and with random values Repeat until convergence Algorithm by B.Webb (a.k.a. Simon Funk) Known serial algorithm Minimize Solve

15 gets smaller might make another bigger gets smaller might make another bigger The steepest path for can make bigger but is not steepest for The direction where But anyway, practically Webb’s algorithm works, and is time efficient It might look strange because...

16 For each where For each where Fill and with random values Repeat until convergence can be separated Update of and and can be computed and can be computed parallelly parallelly Because only depends on and For each where Fill and with random values Analysis toward parallelization Repeat until convergence Original algorithm by Webb

17 Each raw is assigned to a thread block Each thread block computes Each column is assigned to a thread block Each thread block computes Step1: Step2: SVD in CUDA

18 coordinate value of coordinate value of th th - - Same process for Step2 (Reduction framework, see SDK sample of CUDA) Compute in parallel thread calculates the k thread calculates the k th th - - Detail of step 1 Each thread block calculates k k

19 Block 0 Block 1 Block 2 Outline of SVD Block 0 Block 1 Block 2

20 I don’t know why For example, some items are very popular and some are rarely bought. some are rarely bought. – – elements in each raw (column) differs unbalanced because the number of non-zero The workload among thread blocks are very Basically thread blocks are executed in pipeline, so the unbalanced thread blocks don’t affect the performance However, in fact, the order of thread blocks affect the performance • Problem of load balancing • •

21 , list up nearest k vectors to each For a given k and set of vectors For each , Find such that k-Nearest Neighbor Problem In other words, In other words, Description of problem:

22 Our algorithm is bruteforce Bruteforce is better if it works fast For a performance reason, clustering algorithms such as k-means method are often used. Practically, it works well especially the data all the pair of vectors is more accurate But a bruteforce algorithm which calculates actually has some clusters actually has some clusters Bruteforce vs Clustering • • •

23 and Computation of distances Computation of distances k-partial sort The algorithm consists of two parts: k-Nearest Neighbor Algorithm

24 Basically same as n-body algorithm But need another trick to deal with relatively high dimension Computation of distances • •

25 GPU Gems III, pp 677—695, 2008 Because of limited size of shared memory Block 1 Block 2 Block 0 N-Body Algorithm [Nyland et al. 2008] Nyland et al. “Fast N-Body Simulation with CUDA”, in

26 d in a shared memory Block 0 Block 1 Block 2 “Slice” the dimension so that the data can be loade High dimensional case

27 If the distance function is symmetric, it is enough to compute only the half of the distances. Note •

28 g GPU”, CVPRW ’08, 2008 V.Garcia et al. “Fast k-Nearest Neighbor Search Usin For k-Nearest neighbor problem, Garcia et al. proposed an algorithm which uses texture memory. So far, we can not tell our algorithm is better or not. So far, we can not tell our algorithm is • • • Related research

29 number can be found quickly Sort multiple arrays in parallel Use a heap so that the k-th smallest Since k<

30 steps st number heap data to the Push the local Read data and necessary discard it if not Store it to a local array if it is smaller Discard it if it is larger Atomicly push to the heap after a certain number of Fetch a data and compare it to a current k-th smalle Each thread prepare a local array k-partial sort in parallel Block 0 Block 2 Block 1 Block 2

31 Useful for a general application Full sort is well studied (see [Cederman—Tsigas 2008] for example), but there might be a better but there might be a better implementation for a partial sort Research wanted: k-partial sort • •

32 GPU0 GPU1 GPU2 GPU2 GPU2 GPU1 GPU0 For multiple GPUs

33 heaps for Array of GPU2 Array of heaps for GPU1 heaps for GPU0 Array of computing assigned grids These are merged after all of the GPUs finish pushes to their own heaps Each GPU For multiple GPUs (cont.)

34 s 4.1 24.3 24.3 100.0 40000 1.0 25.0 25.0 25.0 20000 6.2 0.2 31.0 31.0 10000 n matrix reduced dimesion=256 1% non-zero element n x Core i7 (b) GTX280 (a) (b)/(a) Elapse time per iteration (sec) (b)/(a) CPU implementation is only on a single core n Bench mark (SVD)

35 65.2 68.6 64.4 331.7 331.7 124.8 131.8 123.3 172.6 80000 80000 8041.4 22756.9 17.7 16.9 34.1 32.1 62.6 320.9 320.9 118.9 166.5 40000 40000 2010.0 5680.7 5.5 5.7 8.2 8.6 91,4 61.3 248.9 248.9 503.0 173.3 20000 20000 1419.0 1.7 1.8 2.6 2.7 73.4 48.0 196.7 196.7 354.2 124.8 131.1 10000 10000 d=256, k=100 n n 1x GTX280 (a) (c)/(a) Core i7 (c) 1x GTX280 (a) 2x GTX280 (b) (c)/(b) (c)/(a) Core i7 (c) 2x GTX280 (b) (c)/(b) (c)/(b) Euclidian distance Hellinger distance Bench mark (kNN)

36 CUDA is not only for scientists! – A recommendation system can be faster on a GPU The k-nearest neighbor problem, which is the heaviest part of a is the heaviest part of a GPU is also useful for marketing recommendation system, has a scalable implementation on multiple GPUs • • Conclusion •