w9b Netflix Prize
In 2006 Netflix was just a mail-based DVD rental company (they weren’t streaming videos
yet). Customers with a subscription could rent as many DVDs as they liked, and Netflix
wanted to keep posting DVDs to their customers. People who weren’t using the service
would realize, and cancel their subscription. Netflix kept up demand by recommending
movies their customers would like, and they had data showing that better personalized
recommendations led to higher customer retention. To improve recommendations further,
Netflix launched a challenge — with a million dollar prize — to improve the root mean
square error of their recommendation engine by 10%.
The rules of the competition and FAQ are still online if you want more detail. Much of
how the competition was set up was well thought out. The leaderboard, and the limits on
the number of submissions, have since been adopted in much the same form by Kaggle's
competitions. As we'll see though, Netflix didn't give enough thought to privacy.
Wikipedia has a good, short summary: https://en.wikipedia.org/wiki/Netflix_Prize
1 SVD approach
One of the most significant approaches to the competition was referred to as “SVD”. Simon
Funk, one of the competitors, beat Netflix’s existing system early on with a short and simple
C program, which performed stochastic gradient descent on a simple model.
The model stated that the C × M matrix of movie ratings for the C customers and M
movies can be approximately decomposed into the product of a tall thin C × K matrix and
a short wide K × M matrix. This low-rank approximation is like a standard truncated SVD
approximation, but without an intermediate diagonal matrix, which can be absorbed into
the other matrices. A conventional SVD routine finds an approximation with the minimum
possible square error, summed over every element of the matrix. However, the Netflix ratings
matrix isn't fully observed (no customer rates every movie), so instead we minimize the sum
of the square differences between the ratings matrix and its approximation, only at the
observed elements. A conventional SVD routine can't minimize this cost function. However,
we can apply stochastic gradient descent to it, where the thin rectangular customer and
movie matrices contain the parameters of the model. The resulting approximate matrix
can then be evaluated at any cell, giving predictions for ratings that haven't been observed.
One of the fitted matrices contains K learned features about each customer. The other
contains K learned features about each movie. The inner product of these features is used
to predict the customer’s rating of a movie. One of the K indexes might correspond to
“romance”: the corresponding customer feature will take on large values if the customer likes
romantic movies, and the movie feature will take on large values if it is a romantic movie. In
this model customers can like multiple genres of movie, and no other customer has to have
the same combination of tastes for the system to work. No genre labels are required to fit
the model though: the fitting procedure learns the features for itself.
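The fitting procedure described above can be sketched in a few lines of NumPy. Everything below is a toy illustration: the ratings, matrix sizes, learning rate, and regularization constant are invented for demonstration, not the settings used in the competition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed ratings as (customer, movie, rating) triples; invented data.
C, M, K = 4, 5, 2                        # customers, movies, latent features
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 2, 2.0), (2, 3, 5.0), (3, 3, 4.0)]

U = 0.1 * rng.standard_normal((C, K))    # customer features, C x K
V = 0.1 * rng.standard_normal((K, M))    # movie features, K x M

lr, lam = 0.05, 0.01                     # step size, L2 regularization
for epoch in range(500):
    for c, m, r in ratings:
        err = r - U[c] @ V[:, m]         # residual at one observed cell
        uc = U[c].copy()                 # use old values for both gradients
        U[c]    += lr * (err * V[:, m] - lam * uc)
        V[:, m] += lr * (err * uc      - lam * V[:, m])

# The approximate matrix U @ V can be evaluated at any cell,
# including cells that were never observed:
print(U[0] @ V[:, 2])                    # predicted rating: customer 0, movie 2
```

Each update only touches one customer's and one movie's feature vectors, which is why this scales to the full dataset of around 100 million ratings.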
Rather than expanding further here, I'll point to an accessible article written by some of
the successful Netflix competitors, which fleshes out the details: Matrix factorization
techniques for recommender systems, Koren et al., IEEE Computer 42(8):30–37, 2009.
2 Privacy
"Is there any customer information in the dataset that should be kept private?
No, all customer identifying information has been removed; all that remains are ratings
and dates. This follows our privacy policy, which you can review here. Even if, for
example, you knew all your own ratings and their dates you probably couldn’t identify
them reliably in the data because only a small sample was included (less than one-tenth
of our complete dataset) and that data was subject to perturbation. Of course, since you
know all your own ratings that really isn’t a privacy problem is it?"
— https://www.netflixprize.com/faq.html
I’ll also admit that (in 2006) I didn’t understand why they were bothering to “perturb the
dataset”, which meant they randomly changed some of the ratings from their true values.
How could this data possibly be deemed sensitive?
Firstly, anonymizing data doesn’t work if an adversary might have access to other data that
they can correlate with your data. In this case the publicly-available Internet Movie Database
(IMDB) contains movie ratings. It was possible to identify some people in the Netflix prize
data, by comparing patterns of ratings to those found in IMDB.
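To make the linkage idea concrete, here is a minimal sketch with invented data and a hypothetical scoring function. The actual attack (by Narayanan and Shmatikov) also used rating dates and a more careful statistical score that tolerates the perturbation in the released data.

```python
# Anonymized ratings: id -> {movie: stars}. All data here is invented.
netflix = {
    "user_17": {"A": 5, "B": 1, "C": 4, "D": 2},
    "user_42": {"A": 2, "B": 5, "E": 4},
}
# Public IMDB ratings for a known, named individual:
imdb_profile = {"A": 5, "C": 4, "D": 2}

def overlap_score(anon, public, tol=1):
    """Count shared movies whose ratings agree to within `tol` stars."""
    shared = set(anon) & set(public)
    return sum(abs(anon[m] - public[m]) <= tol for m in shared)

# The anonymous user whose rating pattern best matches the public profile:
best = max(netflix, key=lambda u: overlap_score(netflix[u], imdb_profile))
print(best)  # -> "user_17"
```

Because most people rate only a tiny fraction of all movies, even a handful of matching (movie, rating) pairs can be enough to single out one user in the whole dataset.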
The IMDB ratings are public already though, so how has any harm been done? Well,
users of IMDB know that it is public, and so may choose not to rate movies with certain
political or sexual leanings. However, the Netflix dataset did contain ratings of political and
pornographic movies. By matching users to IMDB, some of these sensitive ratings were
attached to identifiable individuals. Not surprisingly, the relevant people were upset, and
some of them took legal action. The Wikipedia article on the Netflix Prize has more details
and references.
1. Starting points: do you use the web, a credit/debit or loyalty card, a smart-phone, or any other services?