13 March 2012 38 Comments

Harvesting Twitter Search Results for Analysis using Python & SQL Server

Twitter. It's informative. It's time-wasting.
It's useful. It's entertaining. It's important.
It's drinking-from-the-fire-hose. It's everywhere.

It's Data. Realtime, Social, Contextual Data.

I Repeat; Real-time. Data. Sh*tloads of it.

A service unlike any we've ever had before, and yes, very different and much faster than the traditional internet. We've seen it used as a tool for social activism, a platform for change, as well as a real-time news stream in a way that old-school media just can't play.

The thing about drinking from the firehose is, well, just that. Overwhelming data and no real easy way to capture it all in a format that you can easily use. There is simply too much shit happening every second to utilize properly.

So, if you want to do any real trending, data mining, post-processing, or simple analysis on it - you really need to be making a historical record of what you're looking for and store it in a usable way. The Twitter search function only goes back so far (at least from what I've found), so for the vast majority of data (aka anything you'd want to do some long-term trending on) - when its gone, its gone (for easy public consumption, at least).

Anyone who has been reading this blog over the past month or so, has seen my Metallica stuff - and yes, honestly, I wrote this script in order to add some data to that project. I wanted to chart mentions of the bands name, and member names - analyze them at a later date and try to draw some conclusions (as well as pretty visualizations).

But none of that is possible without the actual DATA first.
So, let's go get some "tweet harvesting" started.

Disclaimer: This is pretty quick and dirty - but for 99% of cases it's perfect (possibly even overkill). Want to make it better? Go ahead, man! You are the master of your own destiny.

So the question is, "How do we grab stuff from the timeline in the first place?" My answer, as it always is - Python, the world's best Swiss army knife for data.

Enter my trusty Python 2.7 (32-bit to not break all sorts of modules). If you don't already have it, you can download it here. They have flavors for whatever you favorite OS is, but in this case I'm going to be using Windows and SQL Server, if you use OS X or Linux, you'd have to rewrite the SQL statements and connect syntax to use MySQL (which isn't a big deal, the rest would work fine as-is).

Once python is installed you need 2 more modules for this to work:

Now that our Python installation is all sexed up like a fresh bowl of Kraft Macaroni and Cheese, we can worry about the database.

Did I say "worry"? Nah, no worries - keep your pants on and keep you wallet in them, muchacho. The free SQL Server Express Edition works wonders for this type of shiz - and you can download it here. The 32-bit or 64-bit version should work fine for this.

Obviously, regular licensed versions of SQL Server work fine too, as well as SQL Server 2012 (Express or not) - but for the sake of actually getting it to work the first time (and not being on a RC release), I recommend 2008 R2.

When you DO install it, just make sure that you use "mixed-mode" authentication - since we're going to use a hardcoded user - none of that pesky "Windows Autentication" shit - AND install it as the "Default Instance" NOT a "Named Instance". You see what I mean as it walks you through the process.

Setting up your SQL Server table structure

now we've got a fresh and spanky SQL Server installation - we need to set up some database tables to hold all our Tweets. "A table to hold my tweets", Hmm, sounds like a sad new-wave children's story... Anyways...

First we need to make a new database. Log into your installation by using the SQL Server Management Studio app (as the 'sa' user, who you created a password for earlier), right-click on the "Databases" folder and go to "New Database". Name whatever you want, just remember that you'll have change my script to match. The defaults are fine in the creation.

Once its created, click "New Query" in the upper left bar.

Go to the query window and type "user [TweetHarvester]" (or [PinkyWinky], or whatever the hell you called your database)

Now we are ready to make some tables, here is the structure that I used for my script.

---main tweet table
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[TweetBank](
    [tweet_id] [BIGINT] NOT NULL,
    [tweet_datetime] [datetime] NULL,
    [tweet_keyword] [VARCHAR](50) NULL,
    [tweet] [VARCHAR](200) NULL,
    [tweeter] [VARCHAR](MAX) NULL,
    [lang] [VARCHAR](50) NULL,
    [geo] [VARCHAR](50) NULL,
 CONSTRAINT [PK_TweetBank] PRIMARY KEY CLUSTERED
(
    [tweet_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
        IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
        ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO



--temp tweet table (see below)
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[TweetBankTemp](
    [tweet_id] [BIGINT] NOT NULL,
    [tweet_datetime] [datetime] NULL,
    [tweet_keyword] [VARCHAR](50) NULL,
    [tweet] [VARCHAR](200) NULL,
    [tweeter] [VARCHAR](MAX) NULL,
    [lang] [VARCHAR](50) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO



--a tweet run logging table (for fun)
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[TweetLog](
    [RunId] [INT] IDENTITY(1,1) NOT NULL,
    [BatchId] [INT] NULL,
    [RunDate] [datetime] NULL,
    [Keyword] [VARCHAR](50) NULL,
    [HarvestedThisRun] [INT] NULL,
    [TotalHarvested] [INT] NULL,
 CONSTRAINT [PK_TweetLog] PRIMARY KEY CLUSTERED
(
    [RunId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
       IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
       ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

Go ahead and Download this if you aren't the Cut-n-Paste type.

We are making 3 tables here:

  1. The actual table to hold the historical tweets going forward
  2. A "Temp Tweet" table exactly like the "real" table, in order to easily lazily filter out dupes
  3. A Log table in order to capture what has happened on each run. Because, lets face it - stats are fun

Before we go any further, we need to figure out what we're going to track...

In my case, as I said earlier, I'm tracking Metallica tweets (which includes the names of all past and present members), so my script has an array of all of those keywords and will iterate over them one at a time.

harvest_list = ['metallica', 'james hetfield', 'lars ulrich', 'kirk hammett', 'rob trujillo', 'jason newsted', 'cliff burton']

Ok, enough foreplay - here is the actual script that does all the work. Notice that I commented the shit out of it, hope that helps.

from twython import Twython
import string, json, pprint
import urllib
from datetime import timedelta
from datetime import date
from time import *
import string, os, sys, subprocess, time
# probably wasteful imports, but in this case, who cares?
import pymssql

conn = pymssql.connect(host='localhost',  user='notsa', password='notsayly', database='TallicaTweets')
cur = conn.cursor()
# connect to our database and create a cursor to do some work

harvest_list = ['metallica', 'james hetfield', 'lars ulrich', 'kirk hammett', 'rob trujillo', 'jason newsted', 'cliff burton']
# my aforementioned harvest list, use as many as you want,
# they will all be separated in the database by keyword


cur.execute("select max(isnull(batchid,0)) from tweetlog")
batch_id_cur = cur.fetchall()
# updated 3-8-2012
if batch_id_cur[0][0] is None:
    batch_id = 0
else:
    batch_id = batch_id_cur[0][0]+1
# grabbing the last "batch id", if it exists so we
# can make log entries that make SOME sense



for tweet_keyword in harvest_list: # for each keyword, do some shit

        cur.execute("""delete from tweetbanktemp where tweet_keyword = '"""+str(tweet_keyword)+"""'""")
        conn.commit()
        # whack the temp table in case we didn't exit cleanly

        twitter = Twython()
        search_results = twitter.searchTwitter(q=tweet_keyword, rpp="100")
        # our search for the current keyword

        #pp = pprint.PrettyPrinter(indent=3)
        # uncomment for debugging and displaying pretty output

        for tweet in search_results["results"]:
                # some me the tweet, jerry!
                print "        Tweet from @%s Date: %s" % (tweet['from_user'].encode('utf-8'),tweet['created_at'])
                print "        ",tweet['text'].encode('utf-8'),"\n"

                try:
                        # lets try to to put each tweet in our temp table for now
                        cur.execute("""insert into TweetBankTemp (tweet_id, tweet_datetime, tweet_keyword, tweet, tweeter, lang)
                                        values ('"""
+str(tweet['id_str'].encode('utf-8').replace("'","''").replace(';',''))+"""',
                                                cast(substring('"""
+str(tweet['created_at'].encode('utf-8'))+"""',5,21) as datetime),
                                                '"""
+str(tweet_keyword)+"""',
                                                '"""
+str(tweet['text'].encode('utf-8').replace("'","''").replace(';',''))+"""',
                                                '"""
+str(tweet['from_user'].encode('utf-8').replace("'","''").replace(';',''))+"""',
                                                '"""
+str(tweet['iso_language_code'].encode('utf-8').replace("'","''").replace(';',''))+"""'
                                        ) """
)
                except:
                        print "############### Unexpected error:", sys.exc_info()[0], "##################################"
                        # just in case something heinous happens we don't lose
                        # ALL the tweets in this run, just the error one


        cur.execute("""insert into tweetbank (tweet_id, tweet_datetime, tweet_keyword, tweet, tweeter, lang)
        select * from tweetbanktemp where tweet_id NOT in
        (select distinct tweet_id from tweetbank)"""
)
        # take all the tweets that we DIDNT already have
        # and put them in the REAL tweet table

        cur.execute("""delete from tweetbanktemp where tweet_keyword = '"""+str(tweet_keyword)+"""'""")
        # take all THESE out of the temp table to not
        # interfere with the next keyword

        cur.execute("""insert into tweetlog (BatchId, keyword, RunDate, HarvestedThisRun, TotalHarvested) values
        (
        '"""
+str(batch_id)+"""',
        '"""
+str(tweet_keyword)+"""',
        getdate(),
        ((select count(*) from tweetbank where tweet_keyword = '"""
+str(tweet_keyword)+"""')-(select top 1 isnull(TotalHarvested,0) from tweetlog where keyword = '"""+str(tweet_keyword)+"""' order by RunDate desc)),
        (select count(*) from tweetbank where tweet_keyword = '"""
+str(tweet_keyword)+"""')
        )"""
)
        # add a record to the log table saying what we did!

        conn.commit()
        # hot soup!

Go ahead and Download this if you aren't the Cut-n-Paste type.

Disclaimer #2: I take no responsibility if this code ends up eating your grandmother's birthday photos, makes your Shih Tzu sick, or causes you to forget about those Jethro Tull tickets you've been sitting on. It probably won't do most of those things, but still!

Probably could have been done MUCH more efficiently - but I'm all about results, and for a project like this. Quick wins. Set it up to run every 5 minutes in the background using Task Scheduler and suddenly we're off to the races gathering tweets.


Pic of my "TweetLog" Table.


Pic of my "TweetBank" table.

What do we DO with this data? Well, that'll have to wait a bit while I gather it - but I sense some Tableau in the future...

Would love feedback, especially if you got it working (or if you didn't, maybe I can help).

  • the1percentblog

    looks great – will try it in the next few days, problem is i am a beginner coder and using OS X. Can you elaborate on the changes needed. Also can you get the followed and follower stats using this method?

    • http://ryrobes.com/ Ryan Robitaille

      Heya, so sorry I missed this comment last week.

      You should have no problem using the OSX version of Python, one of the only changes you’ll have to make will be to use a local installation of MySQL to save the data to.

      I’m not sure, offhand if the follower and followed stats are in the data object that I pull all the fields from, but I’m pretty sure that this twitter module has ways of getting that too.

      Let me know how it goes!

  • http://ryrobes.com/ Ryan Robitaille

    Anyone having any issues with this, please let me know. I’m sure we can figure them out. Its way too awesome of a datasource to get forgotten (after x days).

    • http://40limon.es dariomartinezb

      Hello Ryan,

      I’m wondering if you could update your script to use the streaming API instead of the search API. The rpp limitation is driving me nuts, as I can’t get pass 100 tweets per keyword on any given batch run. I would like to grab, say, 1500 tweets per keyword/run, but I haven’t been unable to do so.

      Any ideas?

  • jeff

    Yes I am having issues.  I love what you did and want to do something similar for a project.  I used your code as a test run.  Did everything exactly the same.  The code created tables in SQL Server.  When I tried running the python code,  I get an error that I cant connect to interface.  I am running windows 7,  64 but installed python2.7.3 32 bit and modules in 32 bit.  Tried uninstalling antivirus and disabling windows firewall but get same error.  Any suggestions?

    • http://ryrobes.com/ Ryan Robitaille

      hey Jeff, can you post your error? Sounds like you might have some funkiness with the pymssql modules (they can be weird and there are seemingly 10 diff versions floating around out there).

      • Jeff

        Ryan sorry for the delay, your response did not come through to my email.  If I use “server= ‘localhostSQLEXPRESS’, I get this:

        Traceback (most recent call last):  File “C:UsersNiemannDesktopMetTest.py”, line 24, in     conn = pymssql.connect(server=’localhostSQLEXPRESS’, user=’sa’, password=’twitter’, database=’Metallica’)  File “pymssql.pyx”, line 552, in pymssql.connect (pymssql.c:7162)InterfaceError: Connection to the database failed for an unknown reason.

        If I use server= ‘localhost’ I get:

        Traceback (most recent call last):  File “C:UsersNiemannDesktopMetTest.py”, line 24, in     conn = pymssql.connect(server=’localhost’, user=’sa’, password=’twitter’, database=’Metallica’)  File “pymssql.pyx”, line 549, in pymssql.connect (pymssql.c:7110)OperationalError: (20009, ‘Net-Lib error during Unknown error’)

        Ive messed with firewall configs and cant figure it out.  I can connect with odbc so I think it is definitely within pymssql.  Ant suggestions?

        • http://ryrobes.com/ Ryan Robitaille

          Give it a try without the database part. most newer version of pymssql shouldn’t have a problem with named instances like the localhostSQLEXPRESS one above.

          • Jeff

            Ryan, how would you get the lat long (location) from the tweets?  Would you mind showing a tweak in your code to do so?

  • Dirk

    I was wondering what the line is for adding the geo information to the table, we create a table with geo as a field but in the insert statement there is no geo information.

    For the rest I got it working on a mac os x, needed to change some parts. If you are interested I can send the changes and you can add them to this page.

    • http://40limon.es dariomartinezb

      I also am wondering about the geo field not being used. I’m plain stupid when it comes to python (this is actually my first practical use of it). I’m trying to put the geo thing to work but to no avail yet. Also, I’m trying to capture some other fields… I guess I’ll need to grab a Python for Dummies or something.

      But still, this is a great script and solves a lot of things for me :) Kudos to Ryan! You got me interested in Python, and being myself a marketing guy, that’s quite something. :)

  • CAM77

    I’m brand new to python and trying to get this example to work. I’ve got the database set up with the appropriate tables, and the python modules installed, but when I run the python script, the shell opens and displays a flashing cursor for about 15 seconds then closes without any error messages, but the tables are not populated with any data. Any ideas what I am doing wrong?

    • CAM77

      Update: Tried running in PythonWin and this is the error message displayed in the interactive window:

      Traceback (most recent call last): File “C:Python27LibSITE-P~1PYTHON~2pywinframeworkscriptutils.py”, line 325, in RunScript exec codeObject in __main__.__dict__ File “C:UsersCraigDesktopPython Scriptsharvester_of_tweets.py”, line 11, in conn = pymssql.connect(host=’localhost’, user=’sa’, password=’password1.’, database=’Twitter_Python’) File “pymssql.pyx”, line 549, in pymssql.connect (pymssql.c:7110)OperationalError: (20009, ‘Net-Lib error during Unknown error’)

      • http://ryrobes.com/ Ryan Robitaille

        Heya, sorry I missed your earlier post. What ver of SQL Server are you using?

        • CAM77

          Hey, Ryan, thanks for responding, I’m running SQL Server Express 2008 R2.

          • http://40limon.es dariomartinezb

            I had the same problem. I solved it using pyodbc instead of pymssql. Not that I’m happy (I don’t like ODBC at all) but for the sake of having it running…

            After having pyodbc set up, I change the connection string to:

            import pyodbc

            conn = pyodbc.connect(‘DRIVER={SQL Server};SERVER=MYPCNAMESQLEXPRESS;DATABASE=MyDB;UID=sa;PWD=MyPass’)

            That way I got it working.

            Hope this helps. :)

            PS: I’m running SQL Express 2012.

        • CAM77

          I’m running 2008 R2 (Express)

  • Tyler

    Instead of using harvest keywords, can Twitter’s streaming API (
    https://dev.twitter.com/docs/api/1.1/get/statuses/sample) be used with this instead? For my purposes, I’m not so much after specific keywords, but more after trending topics.

  • Steve

    Had to change i don’t know maybe library changed

    search_results = twitter.searchTwitter(q=tweet_keyword, rpp=”100″)
    to

    search_results = twitter.search(q=tweet_keyword, rpp=”100″) but now works great. Thanks for the example

  • http://twitter.com/SamirSalimBg Samir Salim

    I am new to python or twython. I tried replicating the twython search given here, but I get “Bad request” errors. Is it necessary to authenticate to do a twitter search? Is this something new?

  • http://twitter.com/Kristina_Ra Kristina Rabarison

    hello, i have never used python before but will learn because i’d like to harvest some twitter data for a project. I am using Mountain Lion and I have trouble with the easy install instruction – when I try to run the setuptools shell as indicated I get a SyntaxError: invalid syntax error message with setuptools highlighted in red. I am so sorry my question is elementary, I was wondering if you could advise me on how to properly install all necessary parts of tolls needed. Thank you, Kristina

    • http://ryrobes.com/ Ryan Robitaille

      No worries! I’m going to run one on OSX since many people seem to be using it. It should MOSTLY be the same, but the install will likely be different.

  • allan

    hi thanks for the post, was wondering how i can download the tweet coordinates together with the other information to a postgresql server

  • http://twitter.com/OllyOsborne Olly Osborne

    Will this be able to run on Mountain Lion on Mac?

    • http://ryrobes.com/ Ryan Robitaille

      Should work the same – I know it works on Linux but I haven’t tried it on OSX. I know the modules are available. Give it a shot and let me know what happens, maybe I can help!

  • Raymond Lou

    Hi Ryan, thanks for the interesting writeup. I’m a beginner learning Python and was wondering how difficult it would be to implement a similar process for Weibo, the Chinese version of Twitter. As far as I can tell, it doesn’t have an API like Twython.

    Thanks!

  • joefer

    Hi Ryan,

    I am getting this error.

    Traceback (most recent call last):
    File “C:/Python27/Harvest_Twit.py”, line 48, in
    for tweet in search_results["results"]:
    KeyError: ‘results’

    I just started using python and I don’t know what caused this error.

    Hope that you can give some insights on how to solve this.

    Thanks,

    Joefer

  • Daria

    Hi Ryan – what function does “results” in the line (for tweet in search_results["results"]: ) exactly serve/do? It’s giving me a keyword error on it and I’m not sure why it’s even there in the first place.

    Thanks!

    • Kris Erickson

      Okay twitter changed their API so several things changed.

      1: Change twitter.searchTwitter(q=tweet_keyword, rpp=”100″)
      to twitter.search(q=tweet_keyword, count = 100)

      2:Change search_results["results"]
      to search_results['statuses']

      3: Adjust these two variables:
      ['from_user'] to ['user']['screen_name'] (in two places)
      ['iso_language_code'] to ['metadata']['iso_language_code']

      4: Time, you must re-sql that time section, either change the table to accept varchar and just throw in created_at, or parse it into a new form and convert it properly

      I think that is a list of all the things I changed once you get to this error. I had to use pyodbc too.

      • Jim

        Hi Kris, The python code runs, and stores some empty(ish) values in the tweetlog. However in the python output The tweets are successfully returned but ” Unexpected error: ” is displayed above each one. Did you have this issue?

  • paceman61

    Due to changes in twitter auth model, I included the following before the first for loop (being very verbose with parameters):

    APP_KEY = ‘MY APP KEY’
    APP_SECRET = ‘MY APP SECRET’
    twitter = Twython(app_key=APP_KEY, app_secret=APP_SECRET, oauth_version=2)
    ACCESS_TOKEN = twitter.obtain_access_token()
    twitter = Twython(app_key=APP_KEY, app_secret=APP_SECRET, access_token=ACCESS_TOKEN, oauth_version=2)

    for tweet_keyword ….

    But am getting the same error as joefer
    Regards
    paceman61

  • Murphy

    Hey Ryan, I get the error: “Something went wrong during the installation.” When I try to run the script for the ez_setup.py.

    Any idea why this is?
    Thanks.

  • Murphy

    Also, to specify the error more clearly. All I am doing is opening the ez_setup.py file (which opens in a new online window). Then I copy the code that is in it and run it in Python. That’s when it gives me the error. All of the text it gives me is actually this:

    Downloading https://pypi.python.org/packages/source/s/setuptools/setuptools-1.1.6.tar.gz
    Extracting in c:usersmurphy~1appdatalocaltemptmpnvxjay
    Now working in c:usersmurphy~1appdatalocaltemptmpnvxjaysetuptools-1.1.6
    Installing Setuptools
    Something went wrong during the installation.
    See the error message above.
    >>>

    I can’t figure out what the problem is.

  • Danny

    I added the changes from Kris Ericssson and paceman61 , but now I get the error:

    twython.exceptions.TwythonAuthError: Unable to obtain OAuth 2 access token.

    Does anybody know how I can fix this? And telling me something like “Store the token” won’t help me unfortunately since I’m not that familiar with Python

  • SQIAR BI

    thanks for the post, was wondering how i can download the tweet coordinates together with the other information to a postgresql server.http://www.sqiar.com/solutions/technology/tableau/

  • http://sidsavenue.blogspot.com/ Sid

    Hi Ryan, i am looking to find twitter accounts based on their names. And then i want to find out accounts that meet a specific criteria. Any help is highly appreciated. thanks a ton.