Counting Retweets - Or Why Is This Hard
One of the primary goals of Measured Voice is to provide excellent statistics on each tweet sent by our users, including an accurate count of retweets. Counting retweets is a supremely difficult challenge. Here's why…
The first challenge to counting retweets is defining what a retweet really is. Is it Twitter's official retweet? Is it any tweet that contains the exact same text as the original tweet? Or should it include the same link or only some part of the text? Do retweets need to include RT @username or /via @username or other attributions? It gets pretty confusing pretty fast.
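To make one of those definitions concrete, here is a minimal sketch of detecting a "traditional" retweet by its attribution. The pattern and the function names are our own illustration, not a standard, and they only cover the `RT @username` and `via @username` forms mentioned above.

```python
import re

# Illustrative pattern (an assumption, not an official definition):
# matches "RT @user", "via @user" and "/via @user", case-insensitively.
ATTRIBUTION = re.compile(r"(?:\bRT|\bvia)\s+@(\w+)", re.IGNORECASE)


def attributed_users(text):
    """Return the lowercase usernames credited with RT/via in a tweet."""
    return {name.lower() for name in ATTRIBUTION.findall(text)}


def is_traditional_retweet(text, username):
    """True if the tweet text credits `username` with an RT or via attribution."""
    return username.lstrip("@").lower() in attributed_users(text)
```

A plain mention such as "Thanks @USAgov!" is not counted, since it carries no attribution. Retweets that reword the text and drop the attribution are missed entirely, which is part of why the definition is slippery.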
Our plan for Measured Voice is to track both "official" and "traditional" retweets. We'd ultimately like to keep track of them separately, but limitations of Twitter's API make that impossible for now. For a long time, we'd been counting retweets using this algorithm: search Twitter for the username, then look through the results for the short URL from each original message.
We liked this approach because, in theory, it allowed us to count a lot of retweets from a lot of messages by performing only one simple search query for the username. That saved us from expending API calls to count retweets by searching for short URLs or text from each individual message.
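A minimal sketch of that counting step, assuming the username search results have already been fetched and reduced to a list of tweet texts. The function name and the URL pattern are hypothetical, and no Twitter API call is shown.

```python
import re
from collections import Counter

# Rough URL matcher for illustration; real tweets may need something stricter.
URL_PATTERN = re.compile(r"https?://\S+")


def count_retweets_by_url(original_urls, mention_texts):
    """Count how many mention texts contain each tracked short URL.

    original_urls: short URLs taken from our own original tweets.
    mention_texts: tweet texts returned by ONE search for the username.
    """
    tracked = set(original_urls)
    counts = Counter({url: 0 for url in tracked})
    for text in mention_texts:
        # Strip trailing punctuation and count each URL once per tweet.
        found = {url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)}
        for url in found & tracked:
            counts[url] += 1
    return counts
```

One search result set serves every tracked message at once, which is the saving described above. The count is only as complete as the search results passed in.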
So yeah, in theory this should work fine, but it doesn't, because Twitter search is a slippery beast. We've learned that when you search for a username, the results do not include all mentions of that username. We know that Twitter search only goes back about 7 days, but that isn't the problem: the search results appear to randomly omit mentions of the username even within that 7-day window.
For example, take the following tweet:
There are 19 million new sexually transmitted infections each year. Most have no symptoms. Get yourself tested: http://j.mp/9ikwGM
Using the method described above, we'd search Twitter for @USAgov and look for instances of http://j.mp/9ikwGM within the results.
Here is what we'd get:
We get 6 messages that we would count (the two in red use a different link than the original, so we'd miss them). However, if we search for the link itself, we get this:
Here we see the 6 results from the first query and 4 more! Those 4 in yellow never show up in the @USAgov search. Where did they go? No one knows.
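The comparison between the two searches can be sketched as a set difference on tweet IDs. This assumes each result is a dict with an `"id"` key, which is our own simplification rather than the exact shape of Twitter's search response.

```python
def missing_from_username_search(username_results, link_results):
    """Return tweets found by the link search but absent from the username search.

    Both arguments are lists of dicts with at least an "id" key
    (an assumed, simplified shape for a search result).
    """
    seen_ids = {tweet["id"] for tweet in username_results}
    return [tweet for tweet in link_results if tweet["id"] not in seen_ids]
```

In the example above, the username search yields 6 matching tweets and the link search yields 10, so this would return the 4 that went missing. The catch is that it costs a second search per link, which is exactly the expense the single username search was meant to avoid.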
In fact, when we survey the various services that track retweets, this is what we find when looking for retweets of the above tweet:
Why is this so hard? Why can't we go "http://api.twitter.com/1/statuses/id/retweet_count" and just get the number? Twitter makes it pretty clear that they have a hard time keeping up with API demand, so why are they asking us to pull each and every result over and over and manually count them? There are messages on the Google Group as far back as 2008 saying they are working on this, but we're curious how others are solving this problem.
Are we missing something obvious?