RecTree: An Efficient Collaborative Filtering Method 2004.5.4
Outline • 1. Introduction • 2. Group Average and Memory-Based Algorithms • 3. RecTree Algorithm • 4. Experiments and Comparisons • 5. Future Work • 6. Questions & Answers
1. Introduction • Collaborative Filtering (CF) • Recommends items to an active user • Based on other users' ratings • Applications • E-commerce sites such as Amazon and CDNow integrate CF • Recommend books, music, and movies • An example
2. Group Average Algorithm & Memory-Based Algorithm • Group average algorithm • Assumes all advisors are equally trusted and therefore equally weighted • Example: a common set of friends of Sam and Baz • Predicts a movie's rating by averaging the friends' ratings • Shortcoming: gives the same recommendation to every active user • Memory-based algorithm • Uses each user's rating history • Identifies similar advisors • Combines the advisors' weighted ratings into a recommendation
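The group-average scheme can be sketched in a few lines of Python; the ratings dictionary is hypothetical, purely for illustration:

```python
# Group-average prediction: every active user receives the same estimate,
# namely the mean rating of the item over all friends who rated it.
ratings = {
    "Sam": {"Matrix": 5, "Titanic": 2},
    "Baz": {"Matrix": 4, "Titanic": 3},
    "Eve": {"Matrix": 3},
}

def group_average(item):
    votes = [r[item] for r in ratings.values() if item in r]
    return sum(votes) / len(votes)

print(group_average("Matrix"))  # 4.0
```

Because the prediction ignores who is asking, every active user sees the same score, which is the shortcoming the slide points out.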
Correlation-based CF (corrCF) • Pair-wise similarity from the Pearson correlation: w(u,a) = Σ_{y ∈ Y(u,a)} (r(u,y) − r̄(u)) (r(a,y) − r̄(a)) / (|Y(u,a)| σ(u) σ(a)) • σ(u), σ(a): standard deviations of the two users' ratings • Y(u,a): set of items rated by both user u and advisor a • Prediction: p(a,y) = r̄(a) + α Σ_u w(u,a) (r(u,y) − r̄(u)), where α is a normalizing constant • Requires an n × n similarity matrix over n users, thus O(n²) training time
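A minimal Python sketch of correlation-based CF, assuming the usual Pearson weight over co-rated items and a weighted-deviation prediction (function and variable names are my own, not the paper's):

```python
import math

def pearson(ra, ru):
    """Pearson weight w(u,a) over Y(u,a), the items both users rated."""
    common = set(ra) & set(ru)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mu = sum(ru[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (ru[i] - mu) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common) *
                    sum((ru[i] - mu) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, advisors, item):
    """Predict as: active user's mean rating plus the normalized,
    similarity-weighted deviation of each advisor from their own mean."""
    ra_mean = sum(active.values()) / len(active)
    num = den = 0.0
    for ru in advisors:
        if item not in ru:
            continue
        w = pearson(active, ru)
        ru_mean = sum(ru.values()) / len(ru)
        num += w * (ru[item] - ru_mean)
        den += abs(w)
    return (ra_mean + num / den) if den else ra_mean
```

Computing `pearson` for every pair of n users is exactly what gives corrCF its O(n²) training cost.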
An open issue • Sparsity problem • When rating density is low, it is hard to produce accurate recommendations • Two approaches: • Default voting: force every user to cast a vote for every item in the dataset • OLAP used as a simple way of pre-filtering the data
3. RecTree • Solves the scalability problem with a divide-and-conquer approach • Dynamically creates a hierarchy of cliques of users; users within a clique are more similar to each other • Seeks advisors from the clique the active user belongs to • Yields a higher overall accuracy
RecTree Algorithm • Partitions the data into cliques of similar users • Recursively splits the dataset • Maximizes intra-partition similarity • Minimizes inter-partition similarity • Resembles a binary tree • Within each leaf node, computes a similarity matrix • Predicts by taking a weighted deviation from the clique's advisor ratings
Lemma 1. The training time for a partitioned memory-based collaborative filter is O(nb), where n is the dataset size, if the partitions are approximately b in size and there are n/b partitions. • Growing the Tree • Owing to rating sparsity and the low cost of RAM, a fast in-memory clustering algorithm, KMeans, is used • KMeans • Select k initial seeds • Assign each user to the nearest cluster • Recompute the cluster centroids • Reassign users to the nearest new centroid, repeating until convergence • Not an optimal clustering algorithm, but guarantees fast training • ConstructRecTree() • Three parameters • Maximum partition size • Maximum branch depth • Current branch depth
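The KMeans steps above can be sketched as follows; users are represented as plain rating vectors, an assumption made for illustration:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two rating vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Centroid of a non-empty cluster of vectors."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def kmeans(points, k=2, iters=20, seed=0):
    """Plain k-means: fast, in-memory, but only a local optimum."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # select k initial seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign to nearest centroid
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        centroids = [mean(c) if c else centroids[j]   # recompute centroids
                     for j, c in enumerate(clusters)]
    return clusters
```

With k fixed at 2, as RecTree uses it, each call simply bisects the current partition.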
Algorithm 1. ConstructRecTree(parent, dataSet, partitionMaxSize, curDepth, maxDepth) • Input: parent is the parent node from which dataSet originates; partitionMaxSize is the maximum partition size; curDepth is the depth of the current branch; maxDepth is the maximum tree depth. • Output: The RecTree. • Method: • 1. Create a node and link it to parent. • 2. Assign dataSet to node. • 3. If SizeOf(dataSet) ≤ partitionMaxSize OR curDepth > maxDepth then ComputeCorrelationMatrix(dataSet); RETURN. • 4. curDepth++. • 5. Call KMeans(dataSet, numberOfClusters = 2). • 6. For each child cluster resulting from KMeans: Call ConstructRecTree(node, childClusterDataSet, partitionMaxSize, curDepth, maxDepth).
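Algorithm 1 translates almost line for line into Python. In this sketch a simple median split stands in for the real KMeans(k=2) call, and the correlation matrix is a placeholder, so both helpers are assumptions rather than the paper's code:

```python
class Node:
    """One RecTree node; leaves hold the similarity matrix for their clique."""
    def __init__(self, users):
        self.users = users      # list of (user_id, rating_vector) pairs
        self.children = []
        self.corr = None        # filled in only at leaf nodes

def kmeans_split(data):
    # Stand-in for KMeans with k=2: split at the median of the first rating.
    data = sorted(data, key=lambda u: u[1][0])
    mid = len(data) // 2
    return [data[:mid], data[mid:]]

def compute_correlation_matrix(data):
    # Placeholder: pairwise similarity computed only within the leaf clique.
    return {(a[0], b[0]): 0.0 for a in data for b in data}

def construct_rectree(parent, data_set, partition_max_size, cur_depth, max_depth):
    node = Node(data_set)                                 # steps 1-2
    if parent is not None:
        parent.children.append(node)
    if len(data_set) <= partition_max_size or cur_depth > max_depth:
        node.corr = compute_correlation_matrix(data_set)  # step 3
        return node
    for cluster in kmeans_split(data_set):                # steps 4-6
        construct_rectree(node, cluster, partition_max_size,
                          cur_depth + 1, max_depth)
    return node
```

Leaves stay at or below partition_max_size users, so each leaf's similarity matrix costs at most O(b²), which is how the O(nb) total of Lemma 1 arises.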
Lemma 2. The RecTree data structure is constructed in O(g·n·log₂(n/b)) time if the maximum depth is g·log₂(n), where n is the dataset size, b is the maximum partition size, and g is a constant. • Generating Recommendations • Without default voting, Bea and Dan are indistinguishable as advisors • With default voting, Dan is more similar to Sam than Bea is • ConstructRecTree divides users into two groups, capturing users' tastes
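The Bea/Dan example can be reproduced with hypothetical ratings (the names come from the slide; the numbers are my own). Each of Bea and Dan co-rates only one item with Sam, so Pearson correlation over co-rated items alone cannot rank them; filling in default votes makes them comparable:

```python
import math

# Hypothetical ratings: Bea and Dan each share only item "A" with Sam.
ITEMS = ["A", "B", "C", "D"]
sam = {"A": 5, "B": 5}
bea = {"A": 1, "C": 5}
dan = {"A": 5, "D": 4}

def default_votes(user, default=0.0):
    """Default voting: give every item a vote so rating vectors align."""
    return [user.get(i, default) for i in ITEMS]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

s = default_votes(sam)
# Dan, who agrees with Sam on "A", now correlates higher than Bea.
print(pearson(s, default_votes(dan)) > pearson(s, default_votes(bea)))  # True
```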
4. Experiments • The Off-line/On-line Execution Time • The off-line and on-line execution times for RecTree as functions of the number of users and the maximum partition size, b • RecTree's on-line performance is independent of the number of users already in the system.
Normalized Mean Absolute Error (NMAE) • The accuracy of RecTree as a function of the number of users and the maximum partition size, b (see Figure 5) • RecTree's accuracy is due in part to the success of the partitioning phase in localizing highly correlated users in the same partition.
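For reference, NMAE is just the mean absolute error scaled by the width of the rating scale; a minimal sketch, assuming a 1-5 scale:

```python
def nmae(predicted, actual, r_min=1, r_max=5):
    """Normalized MAE: mean |prediction - truth|, divided by the rating range."""
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
    return mae / (r_max - r_min)

print(round(nmae([4.0, 2.5, 5.0], [5, 2, 4]), 4))  # 0.2083
```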
5. Future Work • Investigate how the internal nodes of the RecTree data structure can contribute to even more accurate predictions • Parallelize the computation