Collaborative filtering
Definition
Collaborative Filtering(CF) refers to the use of software algorithms for narrowing down a large set of choices by using collaboration among multiple agents, viewpoints, and data sources.
Overview
The term Collaborative Filtering was first coined by the makers of one of the earliest recommendation systems, Tapestry. Collaborative filtering is used in several e-commerce websites that provide personalized recommendations to users based on their past ratings for a product. The basic assumption in CF is that user A and user B's personal tastes are correlated if both users rate n items similarly. Collaborative filtering requires ratings for an item to make a prediction for it. Rating is an association of a user and an item by means of a value. Rating can be either explicit or implicit. Explicit rating requires the user to rate an item. Implicit rating infers a user's preference from the his or her actions. If a user visits a product page, then he might have an interest in buying the product but if he ends up buying the product, then it it inferred that the user has a very strong interest in similar products.
Collaborative Filtering systems generally follow this approach to produce recommendations:
1. Gather ratings from users and maintain user's ratings in a database.
2. Compute the correlations between pairs of users to determine a user’s neighbors in taste space
3. Compute the ratings of these neighbors to make recommendations.
Collaborative Filtering Systems
Collaborative Filtering systems can be separated into 3 classes:
1. Memory-based(Heuristic) Recommendation Systems
Memory-based algorithms make predictions by operating on data (users, items and ratings) stored in memory. They can be classified into:
(i ) Nearest Neighbor algorithms
(ii) Top-N recommendation algorithms
(i) Nearest neighbor algorithms are the most commonly used Memory based CF algorithms. Users similar to the current user with respect to preferences are called as neighbors. Nearest neighbor algorithms can be further classified into:
- User-based nearest neighbor
- Item-based nearest neighbor algorithms
User-based nearest neighbor algorithms generate predictions for a given user based on ratings provided by users in the neighborhood.
A common approach to user-based nearest neighbor algorithm is:
- Weight all users who are similar to the current user. Similarity weighting is done either by using:
- Pearson correlation coefficient between ratings for current user c, and another user, u.
- Cosine-based correlation where in the two users c and u are considered to be two vectors in an m-dimensional space and the similarity between the two is measured by computing the cosine of the angle between them.
- Pearson correlation coefficient between ratings for current user c, and another user, u.
- Select a subset of the users (neighbors) to use as predictors.
- Neighbor selection is done by finding similar users or users having a similarity weighting above a certain threshold.
- Neighbor selection is done by finding similar users or users having a similarity weighting above a certain threshold.
- Normalize ratings and compute a prediction from a weighted combination of the selected neighbors’ ratings.
- Present items with highest predicted ratings as recommendations.
Item-based algorithms generate predictions based on similarities between items.
- The correlation of each item ik with other items is computed.
- For each user ui , the ratings of the items highly correlated with ik are aggregated.
The table below represents a User-Item Rating matrix for 4 users and 4 movies. Movies 'A-Team' and 'Salt' have more or less similar ratings. Based on this, we can predict the rating X for User C by building a weighted average of User C's other ratings. Since 'Inception' has a higher rating than 'A-Team', we can guess that the rating for 'Inception' is more important. Hence, we predict rating X = 0.25*5 + 0.75*3 = 3.5
The weights are assumed in this calculation. [1]
Inception | A-Team | Salt | The Last Airbender | |
---|---|---|---|---|
User A | 4 | 3 | 3 | 2 |
User B | 5 | 2 | 3 | 1 |
User C | 5 | 3 | X | |
User D | 4 | 4 | 4 | 2 |
(ii) Top-N recommendations
Top-N recommendation is to recommend a set of N top-ranked items that will be of interest to a certain user. Top-N recommendation techniques analyze the user-item matrix to correlate different users or items and use them to compute the recommendations. They are further classified into:
- User-based Top-N algorithms
- Item-based Top-N algorithms
User-based Top-N Collaborative Filtering Algorithms.
- Identify the k most similar users (nearest neighbors) to the active user using the Pearson correlation or vector-space model.
- The corresponding rows of the k most similar users in the user-item matrix R are aggregated to identify a set of items, C, purchased by the group.
- With the set C, user-based CF techniques recommend the top-N most frequent items in C that the current user has not purchased.
User-based top-N recommendation algorithms have limitations related to scalability and real-time performance.
Item-Based Top-N Collaborative Filtering Algorithms.
- Compute the k most similar items for each item according to the similarities.
- Identify the set, C, as candidates of recommended items by taking the union of the k most similar items and removing each of the items in the set, U, that the user has already purchased.
- Calculate the similarities between each item of the set C and the set U.
- Resulting set of the items in C, sorted in decreasing order of the similarity, will be the recommended item-based Top-N list
Item-based top-N recommendation algorithms address the scalability problem of user-based top-N recommendation algorithms.
Advantages of Memory-based algorithms:
1. It is easy to implement.
2. It scales well with correlated items.
3. It does not require the content of items, only the ratings are sufficient.
4. New data can be added easily.
Disadvantages of Memory-based algorithms:
1. It depends on human ratings.
2. Correlations are skewed when data is sparse.
3. Time and memory requirements scale with the number of users and ratings.
4. It cannot recommend for new users and items(Cold start Problem).
2. Model-based Collaborative Filtering Systems
Model-based algorithms use the collection of ratings to learn a model, which is then used to make rating predictions. Model-based CF algorithms include Bayesian models (probabilistic) and clustering models.
Advantages:
- Model-based CF technique addresses the shortcomings of memory-based CF algorithms such as scalability and sparsity.
- It also improves prediction performance.
Disadvantages:
- Model-based CF technique improve scalability at the cost of prediction performance.
- Model building is expensive.
3. Hybrid Collaborative Filtering Systems
Hybrid Collaborative Filtering Recommendation Systems are another class of Recommendation Systems that combine Content-based and Collaborative-filtering techniques so as to overcome the limitations of either approach. See also: Recommendation system
Challenges of Collaborative Filtering
Data sparsity
Initially, the User-Item matrix for a CF recommendation system is sparse as it lacks ratings. This cold start problem occurs when a new user, new item or new community of users enters the system. For a new user, there is a lack of rating history which means the user can't be correlated to other similar users.This makes it difficult for the system to recommend any items. A new item lacks users who rate it, hence it won't be recommended unless someone enters an initial rating. The new user problem can be solved by doing any or all of the following: asking the user for some initial ratings,his general taste, his demographic or other users who belong to the same demographic as the current user and producing non-personalized recommendations for the current user. Another way of solving the cold start problem is by using dimensionality reduction techniques that removes unrepresentative users and items so as to reduce the user-item matrix's dimensions. However, this may corrupt information related to this data.
Scalability
Generally, a CF algorithm has a worst case complexity of O(MN) where M is the number of users and N is the number of items. The CF system also has to react to ratings in real-time. As the number of users and items increase, the time and memory requirements also increase and the CF system suffers from scalability issues. This can be solved by using non-probabilistic algorithms that transform or cluster the ratings space to reduce the ratings space dimensionality.
Synonymy
CF algorithms find it very hard to distinguish between two items that are named differently but essentially mean the same. For example, items such as "letter pad" and "memo" are similar products but are treated differently by the CF algorithm which cannot find a correlation between these two items. This increases sparsity and results in poor quality of recommendations.
Gray Sheep
Gray sheep are users whose ratings don't match any group consistently and hence have no use for collaborative filtering. Black sheep are users whose ratings are so ridiculous that recommendations are unachievable. The Gray Sheep problem can be solved by using Hybrid Recommendation techniques.
Shilling Attacks
Shilling attacks refers to the entering of favorable ratings to inflate the popularity of products or services a user owns or wants to promote. It is also used to describe negative ratings that are entered to do harm to another individual or business.
Privacy and Security
A CF system can provide better recommendations if a user provides more personal information. However, people may not want to release their personal information to e-commerce websites as they worry about the confidentiality of their data. Other privacy challenges in CF systems are unsolicited marketing, malicious users or software figuring out a user's personal information, price discrimination and legal issues. Although recent surveys indicate that about 80% of Internet users are interested in personalization, as many as 94% in another survey concur that they should have legal rights to know all the information that a website collects from them.[2] Obviously, there is a need to maintain a balance between privacy and personalization.
Security is also a concern when a user provides his personal data. Users who correlate to clusters of other users are the backbone of a CF system but they are also 'weak ties' as they are more vulnerable to security attacks.
Trust
A recent survey indicated that 63% of consumers do not reveal their data to a website if they do not trust it. Many users reveal their personal data only to websites that they feel are safe, reliable and reputable, ask them to agree on a privacy statement and have a privacy seal. Users also return to websites that incorporate CF recommendations when they have a had a positive experience in the past. A user should be able to trust that his CF provider does not sell his ratings and preferences to a third-party and a CF provider has to make sure that it detects malicious users deliberately lowering or increasing an item's ratings to his advantage.
Collaborative Filtering in e-commerce
Collaborative Filtering algorithms are employed in websites of several e-commerce businesses such as:
Amazon.com
Amazon.com is an online retailer that uses recommendation algorithms to personalize each customer's online store. Amazon uses an item-to-item collaborative filtering algorithm which is iterative;it correlates each customer's purchased and rated products to similar products,and aggregates these similar products to recommend the most popular products.
For each item in product catalog I1 For each customer C who purchased I1 For each item I2 purchased by customer C Record that a customer purchased I1 and I2 For each item I2 Compute the similarity between I1 and I2[3]
If the number of customers is M and the number of products is N, the computation's worst case is O(N2M). However, Amazon computes this similar-items table offline unlike traditional CF algorithms whose online computation scales with the number of products and consumers. This offline computation makes the Amazon algorithm more scalable even for large data sets, hence improving the quality of recommendations.
Netflix.com
Netflix.com is an e-commerce site that offers online video-streaming and BluRay and DVD rentals by mail. Netflix uses a similar recommendation system as Amazon's to generate recommendations to a user based on the videos or movies watched and rated by him or her. In 2006, Netflix announced an open competition for a collaborative filtering algorithm which would improve Netflix's own algorithm, Cinematch by 10% in predicting recommendations for users based on the user's past preferences and ratings. The $1 Million grand prize was won by the team 'BellKor's Pragmatic Chaos' in 2009 for coming up with an algorithm that beat Cinematch by 10.06%
References
- Analysis of Recommendation Algorithms for E-Commerce
- A Survey of Collaborative Filtering Techniques
- Empirical Analysis of Predictive Algorithms for Collaborative Filtering (1998)
- Towards an Introduction to Collaborative Filtering
- Item-based collaborative filtering recommendation algorithms
- Recommender systems in e-commerce
- Netflix Prize