How to Harvest Facebook updates with Python, MySQL or SQL Server
While most people focus on search.twitter.com (or automated pulls from the Twitter API) when collecting data on the "pulse" of a particular topic or brand, they often overlook the fact that the public Facebook timeline can be a useful source of data as well.
That's how I was able to harvest both Twitter and Facebook streams during the Stanley Cup games and look for similarities (there were tons).
The BOOK OF FACE!
Here I'm going to show you two quick and dirty scripts that will gather data from the Facebook API and insert it into a database (using MySQL or Microsoft SQL Server).
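If you just want a feel for the raw data before you build any tables, here's a minimal, self-contained sketch of the one Graph API call both scripts are built around (same public search URL, no access token, just like the full scripts use it):

import urllib
import simplejson as json

# Pull one page of public posts matching a keyword (same endpoint the full scripts hit)
keyword = 'metallica'
url = 'https://graph.facebook.com/search?q="' + keyword + '"&type=post&limit=100'

f = urllib.urlopen(url)
results = json.loads(f.read())
f.close()

for post in results['data']:
    # Every post has an id; 'from' and 'message' only show up sometimes
    name = post.get('from', {}).get('name', '').encode('ascii', 'ignore')
    msg = post.get('message', '').encode('ascii', 'ignore')
    print post['id'], '-', name, '-', msg[:80]

# 'paging' -> 'next' holds the URL for the next page of results (when there is one)
print results.get('paging', {}).get('next', 'no more pages')

The full scripts below do essentially this in a loop per keyword, following 'next' until nothing new comes back, and stuffing each post into the database.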
First, the SQL tables...
(Don't fret if your cut-and-paste skills are shitty; I've got zip files at the bottom of the post.)
MySQL
SET time_zone = "+00:00";
--
-- Database: `facebank`
--
-- --------------------------------------------------------
--
-- Table structure for table `fbsearchlog`
--
CREATE TABLE IF NOT EXISTS `fbsearchlog` (
`RunId` INT(11) NOT NULL AUTO_INCREMENT,
`BatchId` INT(11) DEFAULT NULL,
`RunDate` datetime DEFAULT NULL,
`Keyword` VARCHAR(50) DEFAULT NULL,
`HarvestedThisRun` INT(11) DEFAULT NULL,
`TotalHarvested` INT(11) DEFAULT NULL,
`RunTime` FLOAT DEFAULT NULL,
PRIMARY KEY (`RunId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=338 ;
-- --------------------------------------------------------
--
-- Table structure for table `searchbank`
--
CREATE TABLE IF NOT EXISTS `searchbank` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`run_id` INT(11) DEFAULT NULL,
`fid` VARCHAR(250) DEFAULT NULL,
`from_name` VARCHAR(250) DEFAULT NULL,
`from_id` BIGINT(20) DEFAULT NULL,
`keyword` VARCHAR(50) DEFAULT NULL,
`type` VARCHAR(50) DEFAULT NULL,
`posted` datetime DEFAULT NULL,
`message` VARCHAR(5000) DEFAULT NULL,
`story` VARCHAR(5000) DEFAULT NULL,
`link` VARCHAR(1500) DEFAULT NULL,
`likes` INT(11) DEFAULT NULL,
`comments` INT(11) DEFAULT NULL,
`shares` INT(11) DEFAULT NULL,
`harvested` datetime DEFAULT NULL,
`page_likes` INT(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=12255 ;
-- --------------------------------------------------------
--
-- Table structure for table `searches`
--
CREATE TABLE IF NOT EXISTS `searches` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`run_id` INT(11) DEFAULT NULL,
`fid` VARCHAR(250) DEFAULT NULL,
`from_name` VARCHAR(250) DEFAULT NULL,
`from_id` BIGINT(20) DEFAULT NULL,
`keyword` VARCHAR(50) DEFAULT NULL,
`type` VARCHAR(50) DEFAULT NULL,
`posted` datetime DEFAULT NULL,
`message` VARCHAR(5000) DEFAULT NULL,
`story` VARCHAR(5000) DEFAULT NULL,
`link` VARCHAR(1500) DEFAULT NULL,
`likes` INT(11) DEFAULT NULL,
`comments` INT(11) DEFAULT NULL,
`shares` INT(11) DEFAULT NULL,
`harvested` datetime DEFAULT NULL,
`page_likes` INT(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MEMORY DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
Note that one table, "searches", is a MEMORY table; that's because it's just used as a temp table to avoid dupes. We don't need to hold on to that data, and in-memory tables are hella fast.
SQL Server
In this case 2008 R2, but any semi-recent version will work (besides, SQL Server 2008 Express is a great database, and it's freaking free).
USE [fbOpenGraph]
GO
/****** Object: Table [dbo].[fbsearchLog] Script Date: 07/05/2012 14:22:27 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[fbsearchLog](
[RunId] [INT] IDENTITY(1,1) NOT NULL,
[BatchId] [INT] NULL,
[RunDate] [datetime] NULL,
[Keyword] [VARCHAR](50) NULL,
[HarvestedThisRun] [INT] NULL,
[TotalHarvested] [INT] NULL,
[RunTime] [FLOAT] NULL,
CONSTRAINT [PK_fbsearchLog] PRIMARY KEY CLUSTERED
(
[RunId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [fbOpenGraph]
GO
/****** Object: Table [dbo].[SearchBank] Script Date: 07/05/2012 14:22:39 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[SearchBank](
[id] [INT] IDENTITY(1,1) NOT NULL,
[run_id] [INT] NULL,
[fid] [VARCHAR](250) NULL,
[from_name] [VARCHAR](250) NULL,
[from_id] [BIGINT] NULL,
[keyword] [VARCHAR](50) NULL,
[TYPE] [VARCHAR](50) NULL,
[posted] [datetime] NULL,
[message] [VARCHAR](MAX) NULL,
[story] [VARCHAR](MAX) NULL,
[link] [VARCHAR](1500) NULL,
[likes] [INT] NULL,
[comments] [INT] NULL,
[shares] [INT] NULL,
[harvested] [datetime] NULL,
[page_likes] [INT] NULL,
CONSTRAINT [PK_SearchBank] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [fbOpenGraph]
GO
/****** Object: Table [dbo].[Searches] Script Date: 07/05/2012 14:22:48 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Searches](
[id] [INT] IDENTITY(1,1) NOT NULL,
[run_id] [INT] NULL,
[fid] [VARCHAR](250) NULL,
[from_name] [VARCHAR](250) NULL,
[from_id] [BIGINT] NULL,
[keyword] [VARCHAR](50) NULL,
[TYPE] [VARCHAR](50) NULL,
[posted] [datetime] NULL,
[message] [VARCHAR](MAX) NULL,
[story] [VARCHAR](MAX) NULL,
[link] [VARCHAR](1500) NULL,
[likes] [INT] NULL,
[comments] [INT] NULL,
[shares] [INT] NULL,
[harvested] [datetime] NULL,
[page_likes] [INT] NULL,
CONSTRAINT [PK_Searches] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Okay, now that our data structure is in place, let's get some Pythonin' goin' onin'!
First, the MySQL version
(will work on Windows and Linux)
Note: you're going to need the following Python (2.7 32-bit recommended) modules to use this bad boy as-is:
- MySQLdb
- colorama (colors are FUN!)
- simplejson
import urllib
from datetime import timedelta
from datetime import date
from time import *
import string, os, sys, subprocess, time
import simplejson as json
from time import time, sleep
startTime = time()
from colorama import init
init() #init colorama (needed on win32, not in *nix)
from colorama import Fore, Back, Style
import MySQLdb
import pprint
pp = pprint.PrettyPrinter(indent=3)
conn = MySQLdb.connect(host='localhost', user='root', db='facebank', passwd='harvie')
conn.autocommit(True)
cur = conn.cursor()
cur.execute("select ifnull((select ifnull(run_id,0) from searchbank order by run_id desc limit 0,1),0)")
run_id = cur.fetchall()[0][0]+1
print 'BatchRun #' + str(run_id)
newones = 100
keywords = ['metallica','james hetfield', 'lars ulrich', 'kirk hammett', 'rob trujillo', 'jason newsted','cliff burton','tsoukalos','ancient aliens','bigfoot','vaynermedia']
# put yer keywords here!
for keyword in keywords:
sleep(1)
nextpage = 'https://graph.facebook.com/search?q="'+str(keyword)+'"&type=post&limit=100'
newones = 100
while newones > 0 and nextpage != 'NO':
sleep(1)
f = urllib.urlopen(nextpage)
s = f.read()
f.close()
ss = json.loads(s)
#pp.pprint(ss)
try:
for num in range(0,len(ss['data'])):
#print '*************************'
#for majorkey, subdict in ss['data'][num].iteritems(): #'paging' or 'data'?
# print majorkey
# print subdict
# for subkey, value in subdict.iteritems():
# print subkey, value
#print ' '+str(ss['data'][num]['created_time'].encode('ascii', 'ignore'))
createdtime = str(ss['data'][num]['created_time'].encode('ascii', 'ignore'))
#print ' '+str(ss['data'][num]['type'].encode('ascii', 'ignore'))+'!'
fbtype = str(ss['data'][num]['type'].encode('ascii', 'ignore'))
try:
#print ' msg '+str(ss['data'][num]['message'].replace("'","''").replace(';','').encode('ascii', 'ignore'))+'!'
fbmessage = str(ss['data'][num]['message'].replace("'","''").replace(';','').encode('ascii', 'ignore'))
except:
fbmessage = ''
try:
#print ' sty '+str(ss['data'][num]['story'].replace("'","''").replace(';','').encode('ascii', 'ignore'))+'!'
fbstory = str(ss['data'][num]['story'].replace("'","''").replace(';','').encode('ascii', 'ignore'))
except:
fbstory = ''
try:
#print ' '+str(ss['data'][num]['comments']['count'])+'.'
fbcomments = str(ss['data'][num]['comments']['count'])
except:
fbcomments = 0
try:
#print ' '+str(ss['data'][num]['likes']['count'])+'.'
fblikes = str(ss['data'][num]['likes']['count'])
except:
fblikes = 0
try:
#print ' '+str(ss['data'][num]['shares']['count'])+'.'
fbshares = str(ss['data'][num]['shares']['count'])
except:
fbshares = 0
try:
namee = str(ss['data'][num]['from']['name'])
except:
namee = ''
try:
#print ' '+str(ss['data'][num]['link'])+'.'
link = str(ss['data'][num]['link']).replace("'","''")
except:
link = ''
#print '*************************'
#print str(ss['data'][num]['id'])
fid = str(ss['data'][num]['id'])
print str(ss['data'][num]['id']) + ' - ' + namee + ' (' + fbtype + ')'
from_name = namee.replace("'","''").encode('ascii', 'ignore')
#print str(ss['data'][num]['from']['id'])
from_id = str(ss['data'][num]['from']['id'])
query = """insert into searches (fid, from_id, from_name, keyword, type, link, posted, message, story, likes, comments, shares, harvested, run_id)
values ('"""+str(fid)+"""', '"""+str(from_id)+"""', '"""+str(from_name)+"""', '"""+str(keyword)+"""', '"""+str(fbtype)+"""', '"""+str(link)+"""',cast('"""+str(createdtime).replace('+0000','')+"""' as datetime), '"""+str(fbmessage)+"""', '"""+str(fbstory)+"""', '"""+str(fblikes)+"""', '"""+str(fbcomments)+"""', '"""+str(fbshares)+"""', now(), '"""+str(run_id)+"""')"""
#print query
try:
cur.execute(query)
except:
print '#################### SQL ISSUE ###########################'
print query
print '#################### SQL ISSUE ###########################'
#print ' '+str(ss['data'][num]['message'].encode('ascii', 'ignore'))+' '
conn.commit()
#print ss['paging']['next']
print str(len(ss['data'])) + ' = element "data" length'
#sleep(1)
cur.execute("""insert into searchbank (fid, from_id, from_name, keyword, type, link, posted, message, story, likes, comments, shares, harvested, run_id)
select distinct fid, from_id, from_name, keyword, type, link, posted, message, story, likes, comments, shares, harvested, run_id from searches where fid NOT in
(select distinct fid from searchbank)""")
cur.execute("""delete from searches where keyword = '"""+str(keyword)+"""'""")
cur.execute("select ifnull((select ifnull(TotalHarvested,0) from fbsearchlog where keyword = '"+str(keyword)+"' order by rundate desc limit 0,1),0)")
total_harvested = cur.fetchone()[0]
cur.execute("""insert into fbsearchlog (BatchId, keyword, RunDate, HarvestedThisRun, TotalHarvested) values
(
'"""+str(run_id)+"""',
'"""+str(keyword)+"""',
now(),
((select count(*) from searchbank where keyword = '"""+str(keyword)+"""')-"""+str(total_harvested)+"""),
(select count(*) from searchbank where keyword = '"""+str(keyword)+"""')
)""")
conn.commit()
cur.execute("""select ifnull(HarvestedThisRun,TotalHarvested) from fbsearchlog where BatchId = '"""+str(run_id)+"""' and keyword = '"""+str(keyword)+"""' order by RUnDate desc limit 0,1""")
harvienum = cur.fetchone()
newones = harvienum[0]
print Fore.GREEN + Style.BRIGHT + str(keyword) + Style.RESET_ALL
print str(newones) + ' actually kept'
try:
nextpage = ss['paging']['next']
print nextpage
except:
nextpage = 'NO'
print 'No next page avail'
except:
print 'Some kind of API errors - invalid data JSON returned maybe?'
nextpage = 'NO'
newones = 0
print 'ERROR'
print sys.exc_info()[0]
print sys.exc_info()[1]
sleep(2)
#sleep(2)
elapsed = time() - startTime
print "Finished. Total time: " + strftime('%H:%M:%S', gmtime(elapsed))
done_time = strftime('%H:%M:%S', gmtime(elapsed))
Note: the only stuff you SHOULD have to change is the MySQL connection info and your list of keywords (unless you really want to harvest all the Facebook mentions of 'lars ulrich', in which case, knock yourself out!)
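One optional tweak if all the manual quote-doubling makes you nervous: MySQLdb can do the escaping for you with a parameterized query. Here's a rough sketch of what that one insert could look like (same columns and values as the query string built in the loop above, so it assumes the same cursor and per-post variables):

# Sketch only: the same insert into the temp table, but MySQLdb handles the quoting.
insert_sql = """insert into searches
                (fid, from_id, from_name, keyword, type, link, posted, message, story,
                 likes, comments, shares, harvested, run_id)
                values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now(), %s)"""
cur.execute(insert_sql, (fid, from_id, from_name, keyword, fbtype, link,
                         createdtime.replace('+0000', ''), fbmessage, fbstory,
                         fblikes, fbcomments, fbshares, run_id))

Go that route and you could also drop the .replace("'", "''") calls earlier in the loop, since the driver takes care of quoting.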
Now the SQL Server version
Note: you're going to need the following Python (2.7 32-bit recommended) modules to use this one as-is:
- pymssql
- simplejson
import urllib
from datetime import timedelta
from datetime import date
from time import *
import string, os, sys, subprocess, time
import simplejson as json
#import winsound
from time import clock
import time
clock()
import pymssql
import pprint
pp = pprint.PrettyPrinter(indent=3)
conn = pymssql.connect(host='localhost:1450', user='sa', password='harvie',database='fbopengraph')
conn.autocommit(True)
cur = conn.cursor()
cur.execute("select isnull((select top 1 isnull(run_id,0) from fbopengraph.dbo.searchbank order by run_id desc),0)")
run_id = cur.fetchall()[0][0]+1
print run_id
newones = 100
keywords = ['bic','metallica','james hetfield', 'lars ulrich', 'kirk hammett', 'rob trujillo', 'jason newsted','cliff burton','garyvee','vaynerchuk','tsoukalos','ancient aliens','bigfoot','vaynermedia','orion']
for keyword in keywords:
time.sleep(1)
nextpage = 'https://graph.facebook.com/search?q="'+str(keyword)+'"&type=post&limit=100'
#homelink = 'https://graph.facebook.com/'+str(fid)
newones = 100
while newones > 0 and nextpage != 'NO':
time.sleep(2)
f = urllib.urlopen(nextpage)
s = f.read()
f.close()
ss = json.loads(s)
pp.pprint(ss)
try:
for num in range(0,len(ss['data'])):
print '*************************'
for majorkey, subdict in ss['data'][num].iteritems(): #'paging' or 'data'?
print majorkey
#print subdict
#for subkey, value in subdict.iteritems():
# print subkey, value
print ' '+str(ss['data'][num]['created_time'].encode('ascii', 'ignore'))
createdtime = str(ss['data'][num]['created_time'].encode('ascii', 'ignore'))
print ' '+str(ss['data'][num]['type'].encode('ascii', 'ignore'))+'!'
fbtype = str(ss['data'][num]['type'].encode('ascii', 'ignore'))
try:
print ' msg '+str(ss['data'][num]['message'].replace("'","''").replace(';','').encode('ascii', 'ignore'))+'!'
fbmessage = str(ss['data'][num]['message'].replace("'","''").replace(';','').encode('ascii', 'ignore'))
except:
fbmessage = ''
try:
print ' sty '+str(ss['data'][num]['story'].replace("'","''").replace(';','').encode('ascii', 'ignore'))+'!'
fbstory = str(ss['data'][num]['story'].replace("'","''").replace(';','').encode('ascii', 'ignore'))
except:
fbstory = ''
try:
print ' '+str(ss['data'][num]['comments']['count'])+'.'
fbcomments = str(ss['data'][num]['comments']['count'])
except:
fbcomments = 0
try:
print ' '+str(ss['data'][num]['likes']['count'])+'.'
fblikes = str(ss['data'][num]['likes']['count'])
except:
fblikes = 0
try:
print ' '+str(ss['data'][num]['shares']['count'])+'.'
fbshares = str(ss['data'][num]['shares']['count'])
except:
fbshares = 0
try:
print ' '+str(ss['data'][num]['link'])+'.'
link = str(ss['data'][num]['link']).replace("'","''")
except:
link = ''
print '*************************'
print str(ss['data'][num]['id'])
fid = str(ss['data'][num]['id'])
print str(ss['data'][num]['from']['name'].replace("'","''").encode('ascii', 'ignore'))
from_name = str(ss['data'][num]['from']['name'].replace("'","''").encode('ascii', 'ignore'))
print str(ss['data'][num]['from']['id'])
from_id = str(ss['data'][num]['from']['id'])
query = """insert into fbopengraph.dbo.searches (fid, from_id, from_name, keyword, type, link, posted, message, story, likes, comments, shares, harvested, run_id)
values ('"""+str(fid)+"""', '"""+str(from_id)+"""', '"""+str(from_name)+"""', '"""+str(keyword)+"""', '"""+str(fbtype)+"""', '"""+str(link)+"""',convert(datetime,cast(left('"""+str(createdtime)+"""',19) as nvarchar(50)),126), '"""+str(fbmessage)+"""', '"""+str(fbstory)+"""', '"""+str(fblikes)+"""', '"""+str(fbcomments)+"""', '"""+str(fbshares)+"""', getdate(), '"""+str(run_id)+"""')"""
print query
cur.execute(query)
#print ' '+str(ss['data'][num]['message'].encode('ascii', 'ignore'))+' '
conn.commit()
#print ss['paging']['next']
print str(len(ss['data'])) + ' = element "data" length'
time.sleep(1)
cur.execute("""insert into searchbank (fid, from_id, from_name, keyword, type, link, posted, message, story, likes, comments, shares, harvested, run_id)
select distinct fid, from_id, from_name, keyword, type, link, posted, message, story, likes, comments, shares, harvested, run_id from searches where fid NOT in
(select distinct fid from searchbank)""")
cur.execute("""delete from searches where keyword = '"""+str(keyword)+"""'""")
cur.execute("""insert into fbsearchlog (BatchId, keyword, RunDate, HarvestedThisRun, TotalHarvested) values
(
'"""+str(run_id)+"""',
'"""+str(keyword)+"""',
getdate(),
((select count(*) from searchbank where keyword = '"""+str(keyword)+"""')-(select top 1 isnull(TotalHarvested,0) from fbsearchlog where keyword = '"""+str(keyword)+"""' order by RunDate desc)),
(select count(*) from searchbank where keyword = '"""+str(keyword)+"""')
)""")
conn.commit()
cur.execute("""select top 1 isnull(HarvestedThisRun,TotalHarvested) from fbopengraph.dbo.fbsearchlog where BatchId = '"""+str(run_id)+"""' and keyword = '"""+str(keyword)+"""' order by RUnDate desc""")
harvienum = cur.fetchone()
newones = harvienum[0]
print str(keyword)
print str(newones) + ' actually kept'
try:
nextpage = ss['paging']['next']
print nextpage
except:
nextpage = 'NO'
print 'No next page avail'
except:
print 'Some kind of API errors - invalid data JSON returned maybe?'
nextpage = 'NO'
newones = 0
print 'ERROR'
print sys.exc_info()[0]
print sys.exc_info()[1]
#winsound.PlaySound("*", winsound.SND_ALIAS)
end_script = clock()
print "Finished: %s minutes" % str(end_script / 60)
Again, you only need to change the connection info and the keyword list.
Like I said, it's quick and dirty, but hopefully it's useful to someone!
The first time you run this script, it will attempt to search as far back as possible until it runs out of results or hits an error. On subsequent runs, it will look for new results and stop digging once it encounters too many dupes (stuff you already have).
All the results will be in the 'searchbank' table, and counts and stats about the data you pulled will be in the 'fbsearchlog' table.
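Once you've got a few runs banked, a quick sanity check could look something like this (same MySQL connection details as the script above; adjust to taste):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', db='facebank', passwd='harvie')
cur = conn.cursor()

# What have we collected so far, per keyword?
cur.execute("""select keyword, count(*), max(posted)
               from searchbank
               group by keyword
               order by count(*) desc""")
for keyword, total, newest in cur.fetchall():
    print keyword, '-', total, 'posts, newest from', newest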