  Detecting a Hacked Tweet with Machine Learning
  10/9/2013 (Modified 10/15/2013)
by Primary Objects

Introduction

This article is part of a presentation for The Associated Press, 2013 Technology Summit.

On April 23, 2013, the stock market experienced one of its biggest flash-crash drops of the year, with the Dow Jones Industrial Average falling 143 points (roughly 1%) in a matter of minutes. Unlike the 2012 stock market blip, this one wasn't caused by an individual trade, but by a single tweet from the Associated Press (AP) account on the social network Twitter. The tweet, of course, wasn't written by AP, but by an impostor who had temporarily gained control of the account. Considering the impact of real-time messaging services such as Twitter, could it be possible to detect that the tweet was hacked?

In this article, we'll discuss how to use machine learning and so-called "big data" analysis to mine large amounts of information and classify meaningful relationships from them. In particular, we'll walk through a prototype machine learning example that attempts to classify tweets as having been authored by AP or not. We'll examine learning curves to see how they help validate machine learning algorithms and models. As a final test, we'll run the program on the hacked tweet and see if it's able to successfully classify the tweet as being authentic or hacked.

The Suspect

1:07 PM - 23 Apr 13: "Breaking: Two Explosions in the White House and Barack Obama is injured."

Can It Be Done?

Detecting whether a tweet is authentic or hacked is really a question of authorship. Before building a model, the first question to ask is: does the hacked tweet contain enough data to indicate authenticity?

A human looking at the original tweet, shown above, might easily assume this is legitimate. The tweet appears to use language that is common among AP's history of tweets. It's typical for a headline to begin with the phrase "Breaking:", followed by a description.

However, on closer inspection, those familiar with AP's language and terminology may be able to see anomalies within the tweet. The first issue is the casing of the term "Breaking". Traditionally, AP uses the capitalized version "BREAKING" to announce timely news. (Our machine learning prototype actually ignores case, so it will need other artifacts from the tweet to indicate authorship.)

In addition to the other casing anomalies within the text, there is also an unusual combination of the phrases "Two Explosions in the White House" + "and" + "Barack Obama is injured". Specifically, the subject phrases and the use of the term "and" seem out of place.

It seems possible for some determination to be made that the target tweet may indeed not be from the original author. Putting aside human analysis, let's give the computer a try. We'll attempt to use a machine learning algorithm to see if it can correctly classify AP's tweets.

Why, Hello There, Twitter

The most important part of a machine learning solution is the amount and quality of the data on which learning is based. To classify AP's tweets by authorship, we'll need to extract tweets from AP's Twitter account history to serve as the positive cases. We'll also need a collection of non-AP tweets to serve as the negative cases.

To aid in the collection of tweets, the C# .NET library TweetSharp was used. Queries were initially prepared to extract AP tweets, using the search term "from:AP", and later refined to include date ranges.

Extracting Tweets

The following is an example of C# .NET code, using TweetSharp, to extract recent AP tweets. The first step is to log in to the Twitter service (here, service is a TweetSharp TwitterService instance created with the application's consumer key and secret):


// Step 1 - Retrieve an OAuth Request Token
OAuthRequestToken requestToken = service.GetRequestToken();

// Step 2 - Redirect to the OAuth Authorization URL
Uri uri = service.GetAuthorizationUri(requestToken);
Process.Start(uri.ToString());

Console.Write("Enter Twitter auth key: ");
string key = Console.ReadLine();

// Step 3 - Exchange the Request Token for an Access Token
access = service.GetAccessToken(requestToken, key);                

Once logged in, tweet extraction can begin, as follows:


public static IEnumerable<TwitterStatus> Search(string keyword, int count,
                                                long? resumeId = null)
{
    List<TwitterStatus> result = new List<TwitterStatus>();
    int subCount = count;

    if (_service == null)
    {
        // Login to the Twitter service.
        _service = LoginTwitter();
    }

    // Continue loading tweets until we reach the desired count.
    while (result.Count < count)
    {
        // MaxId pages backward through older tweets (ids at or below resumeId).
        var status = _service.Search(new SearchOptions() {
            Q = keyword, Lang = "en", IncludeEntities = false,
            Count = subCount, MaxId = resumeId });
        if (_service.Response.StatusCode == HttpStatusCode.OK)
        {
            if (status.Statuses.Count() > 0)
            {
                var existing = result.Select(t => 
                Regex.Replace(t.Text.ToLower(), @"(http|https)://[^\s]*", "")).ToList();

                // Remove duplicates.
                foreach (var tweet in status.Statuses)
                {
                    // Ignore retweets.
                    if (!tweet.Text.StartsWith("RT"))
                    {
                        // Ignore duplicates.
                        string tweetCleaned = 
                        Regex.Replace(tweet.Text.ToLower(), @"(http|https)://[^\s]*", "");
                        if (!existing.Contains(tweetCleaned))
                        {
                            result.Add(tweet);
                            existing.Add(tweetCleaned);
                        }
                    }
                }

                // Get the last item's id, so we know where to continue searching from.
                resumeId = status.Statuses.Last().Id - 1;

                // Continue loading until we reach our fill.
                subCount = count - result.Count() + 1;

                Console.Write(result.Count + "/" + count + "..");
            }
            else
            {
                // No more tweets.
                break;
            }
        }
        else
        {
            // Check the rate limit.
            TwitterRateLimitStatus rateSearch = _service.Response.RateLimitStatus;
            if (rateSearch.RemainingHits < 1)
            {
                DateTime resetTime = rateSearch.ResetTime + TimeSpan.FromMinutes(1);
                Console.WriteLine("Rate limit exceeded. Sleeping until " +
                resetTime + ".\n\n" + _service.Response.Response);

                Thread.Sleep(resetTime - DateTime.Now);
            }
            else
            {
                // Some other error.
                throw new Exception("Twitter error. " + _service.Response.Response);
            }
        }
    }

    return result;
}

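The above code attempts to retrieve a set number of tweets. We ignore retweets (indicated by a tweet starting with "RT"). Duplicates are detected by comparing the lowercased text of each tweet with its URLs stripped out. We also cleanse tweets to remove newlines and tabs, although that step is not part of the listing above.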
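As a rough illustration of the cleanup and duplicate check described above, the following minimal sketch normalizes tweet text and keeps only the first copy of each tweet. The class and method names (TweetCleaner, Normalize, Deduplicate) are hypothetical and not part of the original program. It also swaps the list lookup from the listing for a HashSet, which is an assumption made here for faster lookups rather than the article's approach.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TweetCleaner
{
    // Same URL pattern as the Search listing above.
    private static readonly Regex Url = new Regex(@"(http|https)://[^\s]*");
    // Matches runs of whitespace, including newlines and tabs.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Hypothetical helper: builds the key used to decide whether two tweets match.
    public static string Normalize(string text)
    {
        string noUrls = Url.Replace(text.ToLowerInvariant(), "");
        return Whitespace.Replace(noUrls, " ").Trim();
    }

    // Skips retweets and keeps the first tweet seen for each normalized key.
    public static List<string> Deduplicate(IEnumerable<string> tweets)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string tweet in tweets)
        {
            if (tweet.StartsWith("RT", StringComparison.Ordinal)) continue;
            if (seen.Add(Normalize(tweet))) result.Add(tweet);
        }
        return result;
    }

    public static void Main()
    {
        var tweets = new[]
        {
            "BREAKING: Market falls\thttp://t.co/abc",
            "breaking: market falls http://t.co/xyz",
            "RT @AP: BREAKING: Market falls",
            "Unrelated tweet"
        };
        foreach (string tweet in Deduplicate(tweets))
        {
            Console.WriteLine(Normalize(tweet));
        }
    }
}
```

With the sample input, the second tweet differs from the first only in casing and URL, and the third is a retweet, so two tweets remain.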

Since TweetSharp has a limit of 200 results per query, we need to keep looping until the count has been filled. Twitter also applies rate limiting, which is why the code checks the response status code to decide whether to pause querying until the limit resets.

The results from TweetSharp are then saved to a CSV format file, using the C# .NET library CsvHelper.

While TweetSharp worked quite well for extracting a limited history of tweets, the API apparently limits how far back in time tweets can be retrieved. This would leave us with about 1,100 data examples to train on; ideally, we would use a lot more data. Note that initial training on this minimal data set actually achieved 94% accuracy, although the learning curves indicated that a higher accuracy could be achieved with more data.

It's Not Who Has The Best Algorithm That Wins

As the saying in machine learning goes: "it's not who has the best algorithm that wins; it's who has the most data". Therefore, more data was obtained from various data sources, allowing a more complete history of AP's tweet content. Keywords used for extracting data followed the format "from:AP since:2012-01-01 until:2012-12-31", etc.

Keywords used for extracting non-AP data included the format: "-from:AP". Additional targeted non-AP data was extracted, including 100 tweets from "-from:AP obama", "-from:AP breaking", and "-from:AP explosions". Since our target tweet shares these topics, this allows the algorithm to have knowledge about the domain.
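The query formats above are simple to generate programmatically. The sketch below builds one positive query per calendar year and the targeted negative queries. The class and method names (QueryBuilder, YearlyQueries, NegativeQueries) and the year range are illustrative assumptions, not taken from the original program.

```csharp
using System;
using System.Collections.Generic;

public static class QueryBuilder
{
    // Builds one query per year, e.g. "from:AP since:2012-01-01 until:2012-12-31".
    public static List<string> YearlyQueries(string account, int firstYear, int lastYear)
    {
        var queries = new List<string>();
        for (int year = firstYear; year <= lastYear; year++)
        {
            queries.Add(string.Format(
                "from:{0} since:{1}-01-01 until:{1}-12-31", account, year));
        }
        return queries;
    }

    // Builds the general negative query plus one per topic, e.g. "-from:AP obama".
    public static List<string> NegativeQueries(string account, IEnumerable<string> topics)
    {
        var queries = new List<string> { "-from:" + account };
        foreach (string topic in topics)
        {
            queries.Add("-from:" + account + " " + topic);
        }
        return queries;
    }

    public static void Main()
    {
        foreach (string query in YearlyQueries("AP", 2012, 2013))
        {
            Console.WriteLine(query);
        }
        foreach (string query in NegativeQueries("AP", new[] { "obama", "breaking", "explosions" }))
        {
            Console.WriteLine(query);
        }
    }
}
```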

Digitizing Tweets

To allow the machine learning algorithm to process the tweets, each tweet needs to be converted into a numerical format. There are several methods for doing this, such as TF*IDF, but the method that worked best here was word indexing.

First, the collection of tweets was separated into two portions: the training set, and the cross validation (CV) set. The training set would be used for all learning-based examples, while the CV set would be used for calculating accuracy scores.
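A split like this is usually done after shuffling, so that both portions contain a similar mix of examples. The following is a minimal sketch, assuming a 70/30 split and a fixed random seed; the article does not state the ratio actually used, and the DataSplitter name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DataSplitter
{
    // Shuffles a copy of the items (Fisher-Yates, fixed seed) and splits it
    // into a training portion and a cross validation (CV) portion.
    public static void Split<T>(IList<T> items, double trainFraction, int seed,
                                out List<T> train, out List<T> cv)
    {
        var shuffled = items.ToList();
        var random = new Random(seed);
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            T tmp = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = tmp;
        }

        int trainCount = (int)Math.Round(shuffled.Count * trainFraction);
        train = shuffled.Take(trainCount).ToList();
        cv = shuffled.Skip(trainCount).ToList();
    }

    public static void Main()
    {
        var tweets = Enumerable.Range(1, 10).Select(i => "tweet " + i).ToList();
        List<string> train, cv;
        Split(tweets, 0.7, 42, out train, out cv); // 0.7 is an assumed ratio
        Console.WriteLine(train.Count + " training, " + cv.Count + " CV");
    }
}
```

Fixing the seed keeps the split reproducible, so accuracy numbers from different runs can be compared on the same CV set.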

A vocabulary was built from the training set by tokenizing the text of the tweets and then using the Porter stemmer algorithm (Centivus.EnglishStemmer.dll) to obtain the collection of distinct base words.

We then digitize each tweet in the training set into an array of ints, one per vocabulary word. For each tweet, we check each word in the vocabulary to see if it exists in the current tweet. If the vocabulary word exists, we place a 1 at that index in the array. If it does not, we place a 0 at that index. The end result is a vector of size n, where n equals the number of terms in the vocabulary. This ensures that each training set item has the same length n and consists of a series of 0s and 1s. For example, if the vocabulary consists of 250 stemmed terms, then each tweet will be converted into an array of 250 integers (giving us a matrix of m data rows, each of length 250).
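The indexing steps above can be sketched as follows. This is an illustration only: the Stem method is a stand-in that merely strips a trailing "s" (the article uses a Porter stemmer), and the TweetVectorizer, Tokenize, BuildVocabulary, and Digitize names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TweetVectorizer
{
    // Stand-in for a real stemmer; it only strips a trailing "s".
    public static string Stem(string word)
    {
        bool plural = word.Length > 3 && word.EndsWith("s", StringComparison.Ordinal);
        return plural ? word.Substring(0, word.Length - 1) : word;
    }

    // Lowercases, splits on non-alphanumeric characters, and stems each token.
    public static IEnumerable<string> Tokenize(string text)
    {
        return Regex.Split(text.ToLowerInvariant(), @"[^a-z0-9]+")
                    .Where(token => token.Length > 0)
                    .Select(token => Stem(token));
    }

    // The vocabulary is the sorted set of distinct stemmed terms in the training set.
    public static List<string> BuildVocabulary(IEnumerable<string> trainingTweets)
    {
        return trainingTweets.SelectMany(tweet => Tokenize(tweet))
                             .Distinct()
                             .OrderBy(word => word, StringComparer.Ordinal)
                             .ToList();
    }

    // Produces a vector of length n with a 1 where the vocabulary word occurs in the tweet.
    public static int[] Digitize(string tweet, IList<string> vocabulary)
    {
        var present = new HashSet<string>(Tokenize(tweet));
        return vocabulary.Select(word => present.Contains(word) ? 1 : 0).ToArray();
    }

    public static void Main()
    {
        var training = new[] { "BREAKING: Explosions reported", "Markets fall after report" };
        List<string> vocabulary = BuildVocabulary(training);
        int[] vector = Digitize("Breaking: markets fall", vocabulary);

        Console.WriteLine(string.Join(", ", vocabulary));
        Console.WriteLine(string.Join(" ", vector));
    }
}
```

Words in a new tweet that are not in the vocabulary are simply ignored, so every vector has the same length regardless of the tweet's content.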

Note that if TF*IDF (term frequency, inverse document frequency) were used, the values in the array would instead be doubles. However, since tweets are limited to 140 characters, a term rarely appears more than once in the same tweet, so term frequency carries little information. Indexing was therefore used instead.
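For comparison, here is a minimal sketch of a TF*IDF weighting over already tokenized tweets. It assumes one common formulation (term count divided by document length, multiplied by the natural log of N over document frequency); other variants exist, and the TfIdfSketch name and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    // idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t.
    public static Dictionary<string, double> InverseDocumentFrequency(IList<string[]> documents)
    {
        var idf = new Dictionary<string, double>();
        foreach (string term in documents.SelectMany(d => d).Distinct())
        {
            int df = documents.Count(d => d.Contains(term));
            idf[term] = Math.Log((double)documents.Count / df);
        }
        return idf;
    }

    // weight(t, d) = (count of t in d / length of d) * idf(t)
    public static double[] Vectorize(string[] document, IList<string> vocabulary,
                                     IDictionary<string, double> idf)
    {
        if (document.Length == 0) return new double[vocabulary.Count];

        return vocabulary.Select(term =>
        {
            double tf = (double)document.Count(word => word == term) / document.Length;
            double weight;
            return idf.TryGetValue(term, out weight) ? tf * weight : 0.0;
        }).ToArray();
    }

    public static void Main()
    {
        var documents = new List<string[]>
        {
            new[] { "breaking", "news" },
            new[] { "breaking", "market" }
        };
        var vocabulary = new List<string> { "breaking", "market", "news" };
        var idf = InverseDocumentFrequency(documents);

        double[] vector = Vectorize(documents[0], vocabulary, idf);
        Console.WriteLine(string.Join(" ", vector));
    }
}
```

In this toy corpus, "breaking" appears in every document and so receives a weight of zero, while each remaining term appears at most once per tweet. That is the situation described above: with such short texts, the term frequency part adds little beyond presence or absence.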

Proof That We're Learning Something

Learning curves are an excellent way of telling whether a machine learning algorithm is actually learning. By plotting the accuracy against the number of training set items, it becomes apparent whether the algorithm is learning as data examples grow, and whether adding more data will actually help or hinder accuracy.

For machine learning algorithms in C# .NET, the Accord.NET library was used.
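The points on a learning curve can be computed by training on growing subsets of the training set and scoring each model on both its training subset and the CV set. The sketch below shows the idea with a deliberately trivial majority-class classifier standing in for the real model; the LearningCurve class, its train delegate, and TrainMajority are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LearningCurve
{
    public struct Point
    {
        public int TrainingSize;
        public double TrainAccuracy;
        public double CvAccuracy;
    }

    // Fraction of examples for which the classifier returns the expected label.
    public static double Accuracy(Func<int[], int> classify, IList<int[]> inputs, IList<int> labels)
    {
        if (inputs.Count == 0) return 0.0;
        int correct = 0;
        for (int i = 0; i < inputs.Count; i++)
        {
            if (classify(inputs[i]) == labels[i]) correct++;
        }
        return (double)correct / inputs.Count;
    }

    // Trains on the first m examples for m = step, 2 * step, ... and records both accuracies.
    public static List<Point> Compute(
        Func<IList<int[]>, IList<int>, Func<int[], int>> train,
        IList<int[]> trainX, IList<int> trainY,
        IList<int[]> cvX, IList<int> cvY, int step)
    {
        var points = new List<Point>();
        for (int m = step; m <= trainX.Count; m += step)
        {
            var subsetX = trainX.Take(m).ToList();
            var subsetY = trainY.Take(m).ToList();
            Func<int[], int> model = train(subsetX, subsetY);
            points.Add(new Point
            {
                TrainingSize = m,
                TrainAccuracy = Accuracy(model, subsetX, subsetY),
                CvAccuracy = Accuracy(model, cvX, cvY)
            });
        }
        return points;
    }

    // Toy stand-in for a real learner: always predicts the most common training label.
    public static Func<int[], int> TrainMajority(IList<int[]> inputs, IList<int> labels)
    {
        int ones = labels.Count(label => label == 1);
        int majority = ones * 2 >= labels.Count ? 1 : 0;
        return input => majority;
    }

    public static void Main()
    {
        var trainX = Enumerable.Range(0, 6).Select(i => new[] { i % 2 }).ToList();
        var trainY = new List<int> { 1, 1, 0, 0, 0, 0 };
        var cvX = new List<int[]> { new[] { 0 }, new[] { 1 } };
        var cvY = new List<int> { 0, 0 };

        foreach (Point p in Compute(TrainMajority, trainX, trainY, cvX, cvY, 2))
        {
            Console.WriteLine(p.TrainingSize + "\t" + p.TrainAccuracy + "\t" + p.CvAccuracy);
        }
    }
}
```

Plotting TrainAccuracy and CvAccuracy against TrainingSize gives the two lines of the curve. A large, persistent gap between them suggests high variance, while two low lines that have converged suggest high bias.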

An SVM can be initialized and trained with the following code:


// Create a two-class SVM with a linear kernel; inputs[0].Length is the vocabulary size.
MulticlassSupportVectorMachine machine = new MulticlassSupportVectorMachine(
  inputs[0].Length, new Accord.Statistics.Kernels.Linear(), 2);

// Train each underlying binary machine with sequential minimal optimization (SMO).
var teacher = new MulticlassSupportVectorLearning(machine, inputs, outputs);
teacher.Algorithm = (svm, classInputs, classOutputs, i, j)
  => new SequentialMinimalOptimization(svm, classInputs, classOutputs);
double error = teacher.Run();

// Classify an example with the trained machine.
int result = machine.Compute(input); // 0,1

A First (Pretty Good) Attempt

For the first attempt, a support vector machine (SVM) with a Gaussian kernel (sigma 2) was used. It achieved an accuracy of 99.74% on the training set and 96.22% on the CV set, with a training set of 1,140 items.

Learning curve for an SVM gaussian 2 on AP tweets 96.22% accuracy

This is pretty good, especially considering we're only using 1,140 training set items. The learning curve is also promising. Bias is virtually non-existent, and variance is kept to a minimum. Looking at the slope of the curve, it certainly appears that more data will only improve the accuracy. Still, we can do better.

The Second (Even Better) Attempt

For the second attempt, the SVM was changed to use a linear kernel, achieving an accuracy of 100% Training and 97.21% CV.

Learning curve for an SVM linear on AP tweets 97.21% accuracy

Now we're cooking! This bumps our CV accuracy up by about one percentage point (from 96.22% to 97.21%), and the slope appears just as sharp, meaning that more data could push the accuracy up even further.

It's time to feed our C# .NET machine learning algorithm more data and see what it can do. The data was increased to a training set size of 6,054 tweets.

Results?

The best algorithm was trained on 6,054 tweets. Roughly half were authored by AP, and the rest were authored by other users.

Learning curve for an SVM linear on AP tweets 97.38% accuracy

The program achieved a final accuracy of 100% Training, 97.38% CV, 96.23% Test. Judging by the learning curve, it looks like there is still some room to go even further, by providing more training examples.

Here is a view of the resulting program running on real live data (test set). The program never saw these tweets before in its whole life. Honest!

Tweets
Correct: 930/965 (96.37%)
 
  1. 7 Oct | Positive | AP | AP VIDEO: Shelling rocks Syria's capital of Damascus: http://t.co/8Rgj9xEjCi -SS
  2. 7 Oct | Positive | AP | Human rights expert urges U.S. to end prisoner's four decades in solitary calling it "torture": http://t.co/qtMrCbFYcR -SS
  3. 7 Oct | Positive | AP | Egg-sized diamond fetches record $30.6 million at auction in Hong Kong (with photo): http://t.co/4XoSUszXAJ -SS
  4. 7 Oct | Positive | AP | Sisters of woman killed in D.C. chase say she wasn't delusional and may have been fleeing danger: http://t.co/Fflb7J2zpv -SS
  5. 7 Oct | Positive | AP | Japanese court fines anti-Korean activists for shouting racist abuse outside Korean school in Kyoto: http://t.co/Vu2zmVBiN2 -SS
  6. 7 Oct | Positive | AP | AP VIDEO: Frugality is king for consumers still scarred by the financial crisis: http://t.co/QsJnrTWO9c #TheGreatReset -SS
  7. 7 Oct | Positive | AP | Divers recover more bodies from wreck of migrant boat bringing death toll to 211: http://t.co/1dJ6xSrsZ0 -SS
  8. 7 Oct | Positive | AP | MORE: U.S. official says Somalia raid targeted Abdulkadir Mohamed Abdulkadir an al-Shabab operative: http://t.co/PqGuZgmyBQ -SS
  9. 7 Oct | Positive | AP | BREAKING: US official: target in Somalia counterterror raid was Abdulkadir Mohamed Abdulkadir.
  10. 7 Oct | Positive | AP | Just what is "vesicle traffic"? Find the answer in our look at the winners of the Nobel Prize in medicine: http://t.co/hKz0JUibYO -SS
  11. 7 Oct | Positive | AP | Japan Airlines signs first ever purchase from Airbus in blow to Boeing for 31 A350 jets: http://t.co/WHiuwWMO9C
  12. 7 Oct | Positive | AP | AP PHOTOS: A single family scratching out a life amid the fighting is all that remains of this Syrian town: http://t.co/kSXLs61GFG -SS
  13. 7 Oct | Positive | AP | Woman killed three others injured after being gored by bull during festival in central Spain: http://t.co/ZAhkG2qP5Y -SS
  14. 7 Oct | Positive | AP | Babies pioneer research into benefits of gene mapping ethical questions abound: http://t.co/MNMpqT0xS3 - VW
  15. 7 Oct | Positive | AP | Wave of deadly attacks in Egypt kills at least 8 day after street clashes left 51 dead: http://t.co/URFIiHpDlN - VW
  16. 7 Oct | Positive | AP | Japan's Abe seeks to reassure fellow Asia-Pacific leaders on economy military as Obama stays away: http://t.co/hgV5Lgj0Ln - VW
  17. 7 Oct | Positive | AP | MORE: Rabbi Ovadia Yosef Israeli religious scholar and political kingmaker dies at 93: http://t.co/io0lfTwz3c
  18. 7 Oct | Positive | AP | Among #AP10Things to Know: 3 researchers win Nobel Prize in medicine & Elizabeth smart recounts kidnapping. http://t.co/mYmvdezzYA
  19. 7 Oct | Positive | AP | BREAKING: Israeli officials announce death of Rabbi Ovadia Yosef the spiritual leader of Sephardic Jews
  20. 7 Oct | Positive | AP | Syrian government troops reopen key road to northern city of Aleppo after heavy clashes: http://t.co/64Id4Vp6zM - VW
  21. 7 Oct | Positive | AP | BREAKING: 2 Americans German win Nobel medicine prize for discovery of cell transport system. http://t.co/9zvycqG3Bp
  22. 7 Oct | Positive | AP | Kerry assures execs of robust US role in Asia-Pacific says shutdown will end and be forgotten: http://t.co/ipsNfxquWB - VW
  23. 1 Oct | Negative | EllaKorsand | @sally_hasselby thank you hun☺️🙈 missguided hooked me up aha
  24. 1 Oct | Negative | NBCHannibal | October means furiously plotting our #Hannibal themed Halloween costume! Who are you going as?
  25. 1 Oct | Negative | AmQurious | Nobody panic! #Facebook’s #artificial #intelligence just wants to know how you’re feeling? http://t.co/wooI6sDhyR
  26. 1 Oct | Negative | iiM_JusJerricka | I hate when mfs think they can change me!
  27. 1 Oct | Negative | KayleighAC | I need to find out the name of this boy in my film studies 👌❤️
  28. 1 Oct | Negative | naruseibaras | @_aleatorie OH HEY THERE
  29. 1 Oct | Negative | tori87hains | @BritishBakeOff im struggling to come to terms with missing #GBBO tonight! Its going to be difficult to avoid the spoilers!!
  30. 1 Oct | Negative | Love1Cimorelli | @Cimorelliband OMG!!!!YOU'RE IN THE OCTOMBER ISSUE♡♡♡♡♡ http://t.co/vDn387VLiD
  31. 1 Oct | Negative | Lejla_1Dlover | Let's go crazy crazy crazy till we see the sun... :D
  32. 1 Oct | Negative | tylerr_wilsonn | Missing this little nugget today... My favorite second grader in the whole world 😘 I hate that I can't… http://t.co/hnG2aDUzYM
  33. 1 Oct | Negative | MakeItWorkMolly | @verilymag shoot- I work from home now!!! Lol 😉 and in school full time with two kids #singlemomlife
  34. 1 Oct | Negative | GOATRYDER252 | @xoxoarb u need to put that address in your GPS to help ma get there
  35. 1 Oct | Negative | ArnoModd | Where is Huguian???
  36. 1 Oct | Negative | ToriGMitchell | “@YoKidAintMine: In the end everyone becomes what they said they'd never be.” *who
  37. 1 Oct | Negative | Jordan_Comstock | @chrisschmidt24 no an not ur an fagit
  38. 1 Oct | Negative | sensuhaz | niallzmuffin 8
  39. 1 Oct | Negative | RhettRiott | The construction from robinwood to hcc drives me insane hahaha
  40. 1 Oct | Negative | destiny_shevon | Am I out yet?! 🔐
  41. 1 Oct | Negative | becmoreland10 | Ello fellas. Here I am. Put your American sausage in my English mcmuffin
  42. 1 Oct | Negative | CassandraKhan | our safety guys are funny as hell
  43. 1 Oct | Negative | bhullihouse | @centerofright @PMOIndia Want to get rid of immoral Government whose head does not has the magic wand to command confidence of cabinet.
  44. 1 Oct | Negative | Breakingviews | $GS and $JPM are creating independent legal and compliance departments. At least they one job split right http://t.co/ObvwAATyVR @holdingren
  45. 1 Oct | Negative | _GunzNRoses_ | Calling a car dealership is the most aggravating thing ever. You call about one car and they try to sell you everything else on the lot smh.
  46. 30 Aug | Negative | Hacked | Breaking: Two Explosions in the White House and Barack Obama is injured.

You can download the full test results, in all their glory, to view all 965 tweets.

Notice in the test results that the majority of the tweets by AP are correctly marked positive and colored in green, while the tweets by other users are correctly marked negative and colored in red. Some errors slip through (35 of the 965 tweets, or about 3.6%), but the results by the computer are impressive.

If you scroll to the very bottom of the test set, you'll come to our suspect tweet, correctly colored in red:

Correct classification of hacked AP tweet

Conclusion

This year has seen some significant advances in big-data analysis and, in particular, machine learning. With the increasingly massive amounts of data being passed through computers every day, it's becoming more and more difficult for humans to keep pace. Luckily, faster processors and smarter algorithms are allowing us to make sense of it all, and may in fact end up taking over much of what we do today.

Machine learning and artificial intelligence are exciting parts of computer science that are growing in importance as more data is collected. This article has provided a short introduction to the power of machine learning and, possibly, a hint of what's to come in the future.

Interested in more? You can read about the other things I've done to help make computers smarter.

About the Author

This article was written by a Microsoft certified software developer and architect, providing C# ASP.NET and JavaScript web application development, database design, and mobile software development across a variety of domains for clients in both the business and consumer sectors.


   
Copyright © Primary Objects 2013