The Netflix Prize

The Netflix Prize was a competition announced by Netflix in 2006 to improve its movie recommendation algorithm, Cinematch. The company offered a US$1 million prize to the first team that could improve the algorithm’s accuracy by at least 10%. The winning entry ultimately achieved an improvement of 10.06%.

Background

Netflix’s recommendation engine, Cinematch, used collaborative filtering to predict user ratings for movies based on previous ratings and patterns across similar users. By 2006, Cinematch had been refined internally for several years, but Netflix sought a significant improvement.
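The internals of Cinematch are not described here, so the following is only a generic sketch of user-based collaborative filtering, not Netflix's method. The function names, the use of cosine similarity on raw ratings, and the toy data are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[movie]
        weight_total += sim
    if weight_total == 0:
        return None  # nobody with overlapping ratings has rated this movie
    return weighted_sum / weight_total

# Toy data invented for illustration (not taken from the Netflix dataset).
ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 5, "B": 4, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
}
prediction = predict_rating(ratings, "u1", "C")
```

Here u1 agrees more closely with u2 than with u3, so the prediction for movie C is pulled toward u2's rating. Practical systems usually normalize for each user's average rating, which this sketch omits for brevity.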

To enable the competition, Netflix released an anonymized dataset on October 2, 2006, which included:

480,189 users
17,770 movies
100,480,507 ratings
Ratings on an integer scale from 1 to 5 stars
Ratings dated between October 1998 and December 2005

Netflix stated that all personally identifiable information had been removed, replacing user names with numeric IDs.

The data was divided into a training set and a held-out test set. Submissions were scored against test ratings that were hidden from participants, which discouraged overfitting to the test data.

Competition Structure

Start Date: October 2, 2006
Target: Reduce RMSE by at least 10% relative to the Cinematch baseline
Evaluation Metric: RMSE (Root Mean Squared Error)
Prize: US$1,000,000 for the first qualifying team
Duration: October 2, 2006 to September 21, 2009, when the grand prize was awarded
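The evaluation metric and the 10% target can be illustrated with a short sketch. RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower is better. The function names and all numbers below are hypothetical and are not actual contest scores.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def improvement_percent(baseline_rmse, candidate_rmse):
    """Relative reduction in RMSE versus a baseline, as a percentage."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse

# Hypothetical predictions and true ratings for four (user, movie) pairs.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 5]
score = rmse(predicted, actual)

# Hypothetical RMSE values: a candidate at 0.90 against a baseline of 1.00
# is a 10% improvement, which is the size of the contest's target.
gain = improvement_percent(1.00, 0.90)
qualifies = gain >= 10.0 - 1e-9  # small tolerance for floating-point rounding
```

Because the target is relative, the required absolute reduction in RMSE depends on the baseline score.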

Over the course of the contest, thousands of teams worldwide participated, including academic groups, independent researchers, and corporate teams.

Re-identification Concerns

In December 2007, researchers from the University of Texas at Austin demonstrated that Netflix’s anonymization was insufficient. By comparing Netflix ratings with publicly available ratings from IMDb, they re-identified some users in the dataset. This process, known as a linkage attack, used overlapping movie ratings and timestamps to match identities.

The researchers noted that rating histories are highly distinctive: knowing only a small number of a subscriber’s ratings, together with approximate dates, could be enough to single out that subscriber’s record. This raised concerns about the privacy of Netflix subscribers and the risks of releasing large-scale datasets, even when anonymized.
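The idea behind a linkage attack can be sketched in a few lines. This is a simplified illustration, not the researchers' published algorithm. The function names, the 14-day date tolerance, the minimum score, and the toy records are all assumptions.

```python
from datetime import date

def match_score(anon_record, public_record, max_days=14):
    """Count movies with the same rating and a rating date within max_days."""
    score = 0
    for movie, (rating, when) in public_record.items():
        if movie not in anon_record:
            continue
        anon_rating, anon_when = anon_record[movie]
        if anon_rating == rating and abs((anon_when - when).days) <= max_days:
            score += 1
    return score

def best_match(anon_records, public_record, min_score=2):
    """Return the anonymized ID with a uniquely highest score, else None."""
    scores = {uid: match_score(rec, public_record)
              for uid, rec in anon_records.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return None  # not enough overlapping evidence
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie between candidates, so no confident match
    return ranked[0][0]

# Toy "anonymized" records keyed by numeric ID: movie -> (rating, date).
anon_records = {
    101: {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 4)),
          "C": (4, date(2005, 4, 10))},
    102: {"A": (5, date(2004, 1, 1)), "D": (3, date(2005, 3, 5))},
}
# Toy public profile, standing in for ratings posted under a real name.
public_profile = {"A": (5, date(2005, 3, 2)), "B": (2, date(2005, 3, 6))}

matched_id = best_match(anon_records, public_profile)
```

In this toy example the public profile lines up with record 101 on two movies and with record 102 on none, so 101 is returned. Once a record is linked, every other rating in it (movie C here) is attributed to the named person.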

Regulatory Action and Lawsuit

In 2009, Netflix announced plans for a second Netflix Prize, which would use an even larger and more detailed dataset, incorporating demographic and behavioral data. However, before its release:

The Federal Trade Commission (FTC) launched an inquiry into privacy implications.
In December 2009, a class-action lawsuit was filed in U.S. District Court, alleging that Netflix had violated the Video Privacy Protection Act (VPPA) by releasing data that could potentially identify subscribers.

As a result:

Netflix canceled the second competition in March 2010.
In March 2010, Netflix settled the lawsuit with the plaintiffs on undisclosed terms, and the case was dismissed.