Group Meeting – 3/16

Group Meeting – 3/16 Brian Ackerman

OrdRec • An Ordinal Model for Predicting Personalized Rating Distributions • Yehuda Koren (Yahoo! Research) and Joe Sill • Yehuda Koren won the Netflix Prize during his time at AT&T • Voted best paper at RecSys’11

Road Map • Background Information • Recommender Systems • Learning to Rank • Stochastic Gradient Descent • OrdRec

Recommender Systems – Setting • Input • Set of items (I) and a set of users (U) • Set of ratings (R) which take the form rui in R where rui is user u’s rating on item i • Output • Predictions for new user/item pairs

Recommender Systems – SVD • Each user and each item is represented by a feature vector of length k • E.g. item A may have the vector φA = [a1 a2 a3 … ak] • Imagine the features for items were fixed • E.g. items are movies and each feature is a genre such as comedy, drama, etc. • Each entry of the user vector is how much the user likes that feature

Recommender Systems – SVD • Consider a movie (e.g. Die Hard) • Its feature vector may be φDH = [1 0 0] if the features are action, comedy, and drama • Maybe the user has the feature vector φU = [3.87 2.64 1.32] • We can try to predict a user’s rating using the dot product of these two vectors • r’U,DH(φU,φDH) = φU∙φDH = [1 0 0] ∙ [3.87 2.64 1.32] = 3.87
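The prediction on this slide is a plain dot product of the two feature vectors. A minimal sketch using the slide's numbers; the function name `predict_rating` is hypothetical and not from the talk:

```python
def predict_rating(user_vec, item_vec):
    """Predicted rating as the dot product of user and item feature vectors."""
    if len(user_vec) != len(item_vec):
        raise ValueError("feature vectors must have the same length k")
    return sum(u * v for u, v in zip(user_vec, item_vec))


# Features: action, comedy, drama (values taken from the slide).
phi_die_hard = [1, 0, 0]
phi_user = [3.87, 2.64, 1.32]

prediction = predict_rating(phi_user, phi_die_hard)
print(prediction)
```

Only the action entry of the user vector contributes here, because the other two entries of the Die Hard vector are zero.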

Recommender Systems – SVD • The optimization goal for this approach is
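The formula for this slide was not preserved. As an assumption, and not necessarily the slide's own equation, a commonly used objective for this kind of factor model is regularized squared error over the observed ratings, where γ is a hypothetical regularization weight:

```latex
\min_{\varphi_u,\,\varphi_i} \sum_{r_{ui} \in R} \left[ \left( r_{ui} - \varphi_u \cdot \varphi_i \right)^2 + \gamma \left( \lVert \varphi_u \rVert^2 + \lVert \varphi_i \rVert^2 \right) \right]
```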

Learning to Rank – Architecture • [Diagram with boxes: Training Data, Learning System, Model, Test Data, Ranking System, Predictions]

Learning to Rank – Methods • Pointwise (OrdRec) • Each element of training data has a numerical or ordinal score (e.g. 1, 2, 3 or A, B, C) • Always a fixed scale (e.g. [1, 3] or A > B > C) • Pairwise • Pairs of elements have pairwise relationships (e.g. i > j, j > i, or i ~ j)

Stochastic Gradient Descent • Learning method used for big data • Estimates the parameters of a model incrementally over a number of passes • Model is f(Θ) where Θ are the parameters • The SVD model has only two sets of parameters, φu and φi, and is r’u,i(Θ) = r’u,i(φu,φi)

Stochastic Gradient Descent • Uses training data in the form of (x, r) pairs where x is an instance and r is the desired output • E.g. for instance x = (u, i), r = 3 • Goal is to learn the Θ that best fits the model, minimizing the error between the r’ values and the r values

Stochastic Gradient Descent • Update each parameter in Θ (θ in Θ) based on the error (ε) with respect to the model • E.g. updating all values in φu and φi • Use a small learning rate (λ) for gradual change • E.g. λ = .001

Stochastic Gradient Descent • Update each parameter (θ) on each training pass over all instances • Δθ = λ * ε * (∂f(Θ) / ∂θ) • E.g. Δφu = λ * ε * (∂r’u,i(φu,φi) / ∂φu) = λ * ε * φi • Example • ru,i = 3, r’u,i = 2.1, φu = .3, φi = .7 • ε = 3 – 2.1 = .9, λ = .001 • Δφu = .001 * .9 * .7 = .00063
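The update rule can be sketched for the dot-product model as follows. This is an illustration under stated assumptions, not the presenter's code: the function name `sgd_step` is hypothetical, and it handles one training instance at a time with no regularization.

```python
def sgd_step(phi_u, phi_i, rating, learning_rate=0.001):
    """One SGD update of both feature vectors for a single (u, i, r) instance."""
    prediction = sum(a * b for a, b in zip(phi_u, phi_i))
    error = rating - prediction  # epsilon = r - r'
    # d r' / d phi_u = phi_i and d r' / d phi_i = phi_u; both updates use the old values.
    new_phi_u = [a + learning_rate * error * b for a, b in zip(phi_u, phi_i)]
    new_phi_i = [b + learning_rate * error * a for a, b in zip(phi_u, phi_i)]
    return new_phi_u, new_phi_i, error


# Single-update arithmetic with the slide's values: lambda = .001, epsilon = .9, phi_i = .7
delta_phi_u = 0.001 * 0.9 * 0.7
print(round(delta_phi_u, 6))

# A separate toy instance (made-up vectors) run through one step.
new_u, new_i, err = sgd_step([1.0, 0.0], [2.0, 1.0], rating=3)
```

A full training run would repeat `sgd_step` over every (u, i, r) instance for several passes.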

Motivation 1 Star = 1.0 2 Stars = 2.0 …

Motivation A+ = ? B- = ? … A+ > B-

Problem Setting • Input • Set of items (I) and a set of users (U) • Set of ratings (R) which take the form rui in R where rui is user u’s rating on item i, and every rating falls into one of S ordered relevance classes (e.g. 1, 2, 3, …, S where S is the highest class)

Problem Setting • Output • Rating predictions • Predicted distribution of ratings • Leads to recommendation confidence scores

OrdRec Framework • Works with most existing recommender system methods • Framework assumes the internal scoring mechanism of an existing system is denoted as yui for user u and item i

OrdRec Model • Likelihood function • Update based on …

Thresholds • There are S – 1 ordered thresholds • t1 ≤ t2 ≤ … ≤ tS-1 • t1 is the only threshold that is itself a parameter • The remaining thresholds are defined through a set of parameters β1, β2, …, βS-2 • tr+1 = tr + exp(βr)

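Thresholds • Assume we have ratings A, B, C • There are 2 thresholds (t1 and t2) as S = 3 • There is 1 threshold parameter (β1) • [Figure: number line with thresholds t1 and t2 separating the ratings C, B, and A; the gap between t1 and t2 is exp(β1)]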
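The threshold construction can be sketched as below. The helper name `build_thresholds` and the numeric values are illustrative assumptions; only the recurrence tr+1 = tr + exp(βr) comes from the slides.

```python
import math


def build_thresholds(t1, betas):
    """Return [t_1, ..., t_{S-1}] with t_{r+1} = t_r + exp(beta_r)."""
    thresholds = [t1]
    for beta in betas:
        # exp(beta) > 0, so each threshold lies above the previous one.
        thresholds.append(thresholds[-1] + math.exp(beta))
    return thresholds


# S = 3 ratings: one free threshold t1 and one gap parameter beta_1.
t1, beta_1 = -0.5, 0.0  # illustrative values, not from the slides
thresholds = build_thresholds(t1, [beta_1])
print(thresholds)
```

Parameterizing the gaps with exp(β) keeps the thresholds ordered for any real-valued β, so gradient updates never need an explicit ordering constraint.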

Methods • Given the parameters, the model generates an observed rating using a random process • zui ~ N(yui, 1) • The observed rating corresponds to the interval between thresholds in which zui falls • P(rui = r|Θ) = P(tr-1 < zui ≤ tr)
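The random process above can be simulated directly. A minimal sketch, assuming rating classes are numbered 1..S and the thresholds are given as a sorted list; `sample_rating` is a hypothetical helper name:

```python
import bisect
import random


def sample_rating(y_ui, thresholds, rng=random):
    """Draw z ~ N(y_ui, 1) and return the rating class 1..S whose interval contains z."""
    z = rng.gauss(y_ui, 1.0)
    # bisect_left counts thresholds strictly below z, so t_{r-1} < z <= t_r maps to r.
    return bisect.bisect_left(thresholds, z) + 1


rng = random.Random(0)  # fixed seed so the simulation is repeatable
samples = [sample_rating(0.0, [-0.5, 0.5], rng) for _ in range(1000)]
```

A larger internal score yui shifts the draws upward, so higher rating classes become more likely.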

Methods • Assuming t0 = -∞ and tS = ∞ • P(rui ≤ r|Θ) = P(zui ≤ tr) = Φ(tr – yui) • Φ is the standard normal CDF (cumulative distribution function) • New probability definition • P(rui = r|Θ) = P(rui ≤ r|Θ) – P(rui ≤ r – 1|Θ)
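The full predicted rating distribution follows from differences of CDF values. A minimal sketch, assuming Φ is the standard normal CDF (computed here with `math.erf`); the function names and numeric inputs are illustrative:

```python
import math


def normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rating_distribution(y_ui, thresholds):
    """Return [P(r_ui = 1), ..., P(r_ui = S)] for ordered thresholds t_1..t_{S-1}."""
    # Cumulative values P(r_ui <= r) for r = 0..S, using t_0 = -inf and t_S = +inf.
    cumulative = [0.0] + [normal_cdf(t - y_ui) for t in thresholds] + [1.0]
    return [cumulative[r] - cumulative[r - 1] for r in range(1, len(cumulative))]


probs = rating_distribution(0.0, [-0.5, 0.5])
print(probs)
```

With yui midway between the two thresholds, the lowest and highest classes get equal probability and the middle class gets the most.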

OrdRec Model • Update Rule

Predicting Ranking • Each combination of a user (u) and an item pair (i > j) gets a vector (ΔPu(i,j)) • Each index represents an ordinal score (1, 2, …, S) • ΔPu(i,j)[r] = P(rui = r|Θ) - P(ruj = r|Θ) • Global weight vector (w) • ΔPu(i,j) and w together yield a score
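The slide does not say how ΔPu(i,j) and w are combined. The sketch below assumes a plain dot product, and both the function name `pairwise_score` and all numbers are made up for illustration:

```python
def pairwise_score(probs_i, probs_j, weights):
    """Score for preferring item i over item j, assuming score = w . dP_u(i, j)."""
    if not (len(probs_i) == len(probs_j) == len(weights)):
        raise ValueError("all vectors must have length S")
    delta_p = [p_i - p_j for p_i, p_j in zip(probs_i, probs_j)]
    return sum(w * d for w, d in zip(weights, delta_p))


# Illustrative rating distributions over S = 3 classes (made-up numbers).
probs_i = [0.1, 0.3, 0.6]
probs_j = [0.5, 0.3, 0.2]
weights = [1.0, 2.0, 3.0]  # hypothetical w, not a learned weight vector
score = pairwise_score(probs_i, probs_j, weights)
print(score)
```

With the hypothetical weights w = [1, 2, 3], the score reduces to the difference between the two items' expected ratings, so a positive value favors item i.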
