Writing a Digg-Style Popularity Algorithm
Theory March 26th, 2007Not so long ago I was tasked with creating a web site similar to Digg - users voted on records in the database, and those deemed “popular” were promoted to the front page. But how exactly does such an algorithm work? Let’s take a look under the hood…
Before we begin, a quick disclaimer: These are my own thoughts on such an algorithm, from scratch. I don’t know how Digg, Reddit and other such sites rank popular items, so this may be a quite different.
Records and Votes
We have a database record, whether it be a news story, picture, video, podcast, whatever. And we have voting, a method for users to place a single vote on a particular database record. We could order the “Popular” category by number of votes and be done with it.
Popularity = Votes
However records very quickly become stale, as new submissions are entered and the voting process begins again. If records from last year have 300 votes each, but popular records this year only have 100 or so, they won’t see the front page. So we need to look at time as a factor as well.
Records and Time
Let’s introduce the age of the record as another variable. If the record is newer it should have a higher prominence on the front page, yes?
Popularity = Votes / Record Age
The older the record, the more votes it requires to achieve popularity. But that’s not really fair - if a record takes a little while to receive votes, it doesn’t get the credit it deserves. This is especially a problem if your site doesn’t yet have enough traffic.
So what’s the solution? Let’s take a look at the age of each vote for the record.
Records and Vote Time
To keep the front page fresh, we can give more weight to votes placed recently. That way if a story is hasn’t yet received the credit it deserves, it’s still in for a chance if several users notice its value and vote accordingly. So let’s iterate through votes and calculate a popularity score:
Popularity = (V1/A1) + (V2/A2) + … + (Vn/An)
Vn is a vote, and An is the age of that vote (for example, in minutes). If a vote is 60 minutes old, it is worth 1/60th of a vote placed 1 minute ago. All the values of all the votes are added to achieve a popularity score.
This seems to solve the previous problem, but introduces a new one. If a single person votes on a record a year old, his vote will be worth more than 200 votes on a different record posted yesterday. Old material comes back to haunt the front page. So we’re close, but no cigar yet.
Let’s take a look again at the age of the record …
Records and Time and Vote Time
If we put together everything we’ve discussed so far, we get something that looks like this:
Popularity = [ (V1/A1) + (V2/A2) + … + (Vn/An) ] / Record Age
It’s a bit of a mouthful, but basically it adds together the weighted votes based on age, then divides that total by the age of the record. It doesn’t impose too much of a time limit on becoming popular, it dampens votes based on age, and prevents old stories from leaping back to the front page. It solves all our problems.
Dampening Popularity
Admission: In writing this article, I’ve discovered a more advanced algorithm than used previously on my project. Guess what I’m doing this evening?
But already I can see a problem - I think a dampening effect will need to be introduced to prevent wild jumps of increased popularity and back down again. I will update this post when I’ve had a chance to implement the new algorithm in the wild.
Other Variables
What other variables could we introduce to the algorithm?
- Number of page views - a form of popularity, but far less useful than voting
- Page views versus people who don’t vote - If 60% of readers vote, should it be more popular than if 10% vote? Remember this isn’t the number of votes, it’s the percentage of readers who vote
- Iterative algorithm of amount of time between individual votes
- Voting down as well as up
- Trustworthiness of user who originally submitted the record, maybe based on votes of previous submissions
- If a web URL is involved, maybe use metadata such as Google PageRank, inbound links (Google/Yahoo API), Blogosphere activity, and so forth
Further Reading