Counting Retweets - Or Why Is This Hard


One of the primary goals of Measured Voice is to provide excellent statistics on each tweet sent by our users, including an accurate count of retweets. Counting retweets is a supremely difficult challenge. Here's why…

The first challenge to counting retweets is defining what a retweet really is. Is it Twitter's official retweet? Is it any tweet that contains the exact same text as the original tweet? Or should it include the same link or only some part of the text? Do retweets need to include RT @username or /via @username or other attributions? It gets pretty confusing pretty fast.
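To make the attribution question concrete, here is a minimal Python sketch that picks out "RT @username" and "via @username" (including "/via") markers from a tweet's text. The pattern and the function name `find_attributions` are illustrative assumptions, not anything Measured Voice is known to use, and real-world tweets use many more attribution styles than these two.

```python
import re

# Assumed pattern: "RT" or "via" as a whole word, optional whitespace,
# then an @username made of letters, digits, and underscores.
ATTRIBUTION = re.compile(r"\b(RT|via)\s*@(\w+)", re.IGNORECASE)


def find_attributions(text):
    """Return (style, username) pairs, lowercased, for each marker found."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in ATTRIBUTION.finditer(text)]
```

A plain mention such as "thanks @USAgov" yields an empty list, which is exactly the ambiguity described above: it mentions the account but carries no attribution marker.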

Our plan for Measured Voice is to track both "official" and "traditional" retweets. We'd ultimately like to keep track of them separately, but limitations of Twitter's API make that impossible for now. For a long time, we'd been counting retweets using this algorithm:

  1. Search for the username of the Twitter account whose retweets we're trying to count (we'll just call this "the account" from here on)
  2. Go through each result and check if we've seen it before
  3. If we haven't seen it before, we check to see if it contains a matching short URL or the first six words of a message sent from the account
  4. If it matches, we increment the retweet count
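
The four steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the actual Measured Voice code: the dictionary shapes (`id`, `text`, `short_url`), the function names, and the choice to match on lowercased, whitespace-normalized text are all hypothetical. It also assumes the search results passed in do not include the account's own original tweets.

```python
def first_words(text, n=6):
    """Lowercase the text and return its first n whitespace-separated words."""
    return " ".join(text.lower().split()[:n])


def count_retweets(search_results, sent_messages, seen_ids, counts=None):
    """Tally retweets found in one username search (step 1 happens elsewhere).

    search_results: iterable of dicts with 'id' and 'text' (assumed shape).
    sent_messages:  list of dicts with 'id', 'text', and 'short_url'.
    seen_ids:       set of result ids already processed; updated in place.
    counts:         optional dict of message id -> running retweet count.
    """
    counts = {} if counts is None else counts
    for result in search_results:
        # Step 2: skip results we have already processed.
        if result["id"] in seen_ids:
            continue
        seen_ids.add(result["id"])

        raw = result["text"]
        normalized = " ".join(raw.lower().split())
        for message in sent_messages:
            # Step 3: match on the short URL (case-sensitive) or on the
            # first six words of the original message (case-insensitive).
            url = message.get("short_url")
            prefix = first_words(message["text"])
            if (url and url in raw) or (prefix and prefix in normalized):
                # Step 4: increment that message's retweet count.
                counts[message["id"]] = counts.get(message["id"], 0) + 1
    return counts
```

Passing the same `seen_ids` set and `counts` dict into each run is what keeps a result from being counted twice across repeated searches.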

We liked this approach because, in theory, it let us count a lot of retweets from a lot of messages with just one simple search query for the username, saving us from spending API calls on separate searches for the short URL or text of each individual message.

So yeah, in theory this should work fine, but it doesn't, because Twitter search is a slippery beast. We've learned that when you search for a username, the results do not include all mentions of that username. We know that Twitter search only goes back about 7 days; the problem is that the results appear to randomly omit mentions of the username even within that 7-day window.

For example, take the following tweet:

There are 19 million new sexually transmitted infections each year. Most have no symptoms. Get yourself tested: http://j.mp/9ikwGM
11:02 AM Apr 7th via Measured Voice

Using the method described above, we'd search Twitter for @USAgov and look for instances of http://j.mp/9ikwGM within the results.

Here is what we'd get:
[Screenshot: Twitter search results for @USAgov]

We get 6 messages that we would count (the two in red use a different link than the original, so we'd miss them). However, if we search for the link itself, we get this:


[Screenshot: Twitter search results for http://j.mp/9ikwGM]

Here we see the 6 results from the first query and 4 more! Those 4 in yellow never show up in the @USAgov search. Where did they go? No one knows.
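
One way to spot this kind of gap programmatically is to diff the two result sets. The sketch below is a rough illustration in Python, assuming each search result is a dict with `id` and `text` keys; the function name and that shape are hypothetical. It returns results from the link search that mention the account but never appeared in the username search.

```python
def missing_from_username_search(username_results, link_results, username):
    """Return link-search results that mention @username but are absent
    from the username-search results (compared by result id)."""
    username_ids = {r["id"] for r in username_results}
    mention = "@" + username.lower()
    return [
        r for r in link_results
        if r["id"] not in username_ids and mention in r["text"].lower()
    ]
```

The mention check is a plain substring test, so it is only a rough filter; it would also match a longer username that starts with the same characters.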

In fact, when we survey the various services that track retweets, this is what we find when looking for retweets of the above tweet:

copyright © 2010 Notch8 and licensed under the creative commons attribution-share alike license