Finding Social Media Accounts Automatically for a List of Organizations
In a recent post, I introduced using R to grab the links to over 8,000 organizations’ websites and social media accounts. The logic behind the web scraping is simple: we use each organization’s name as the search term and extract the link of the top Google search result, assuming that the top link is the organization’s official website. We do the same to get their social media sites by limiting the search to specific domains (i.e., Facebook.com and Twitter.com).
In this post, I will introduce a way to cross-check whether the social media sites returned from the search are indeed the ones operated by the listed organizations. In doing so, we calculate the match ratio — the proportion of words in an organization’s name that co-occur in the account page name returned from Twitter/Facebook API.
Step 1: Extract Twitter/Facebook handles from search result links.
For instance, we extract AARP from the link https://www.facebook.com/AARP/. Here is the code that does the job.
First, let’s load the necessary libraries and the dataset. In my example, the data frame is named d, imported from a CSV file. We also create two new columns to store the profile information from the Twitter and Facebook APIs (named NameFromTwitterAPI and NameFromFacebookAPI, respectively).
library(Rfacebook)
library(twitteR)
library(stringdist)
library(stringr)

d <- read.csv("Orgs with social media accounts (web scripting).csv")

d$NameFromTwitterAPI <- NA
d$NameFromFacebookAPI <- NA
Now, let’s parse out Twitter/Facebook handles from the search result links. The extracted handles will be put in the columns: TwitterHandle and FBHandle.
d$TwitterHandle <- tolower(d$Twitter)
d$TwitterHandle <- gsub("https://twitter.com/", "", d$TwitterHandle)
d$TwitterHandle <- gsub("/$", "", d$TwitterHandle)
d$TwitterHandle <- gsub("…/", "", d$TwitterHandle)
d$TwitterHandle <- gsub("-", "", d$TwitterHandle)
d$TwitterHandle <- gsub(" ", "", d$TwitterHandle)

d$FBHandle <- tolower(d$Facebook)
d$FBHandle <- gsub("https://www.facebook.com/", "", d$FBHandle)
d$FBHandle <- gsub("/$", "", d$FBHandle)
d$FBHandle <- gsub("…/", "", d$FBHandle)
d$FBHandle <- gsub("-", "", d$FBHandle)
d$FBHandle <- gsub(" ", "", d$FBHandle)
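As a quick sanity check on this cleaning chain, here is a self-contained sketch that applies the same substitutions to the AARP link from the earlier example (the helper function name is my own, not from the original code):

```r
# Minimal sketch of the handle-extraction chain above, wrapped in a helper.
extract_handle <- function(url, prefix) {
  handle <- tolower(url)              # normalize case first
  handle <- gsub(prefix, "", handle)  # strip the domain prefix
  handle <- gsub("/$", "", handle)    # strip a trailing slash
  handle <- gsub("-", "", handle)     # strip hyphens
  handle <- gsub(" ", "", handle)     # strip spaces
  handle
}

extract_handle("https://www.facebook.com/AARP/", "https://www.facebook.com/")
# returns "aarp"
```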
To prevent errors in the API requests, we will only work on cases with both a complete organization name and a Twitter handle.
d <- d[complete.cases(d$OrganizationName), ]
d <- d[complete.cases(d$TwitterHandle), ]
Step 2: Grab Twitter/Facebook profile information from API.
We can now start grabbing Twitter bio information. I let the API connection pause for 3 seconds after each request. Notice that in this step, we are getting the name of a Twitter account (e.g., Alliance for Justice for @afjustice) and storing it in the column NameFromTwitterAPI.
#authorize your Twitter API connection.
setup_twitter_oauth("xxxx", "xxxxx", "xxxx", "xxxx")

for (account in d$TwitterHandle) {
  print(c("finding info for:", account))
  try(d[d$TwitterHandle == account, ]$NameFromTwitterAPI <- getUser(account)$name)
  Sys.sleep(3)
}
We can also collect the profile information from the Facebook API. The name of a Facebook page is stored in the column NameFromFacebookAPI.
#authorize the Facebook API connection.
token <- fbOAuth("xxxx", "xxx", extended_permissions = TRUE, legacy_permissions = FALSE)

for (account in d$FBHandle) {
  print(c("finding info for:", account))
  try(d[d$FBHandle == account, ]$NameFromFacebookAPI <-
        getPage(page = account, token, n = 1, feed = FALSE, reactions = FALSE)$from_name)
  Sys.sleep(3)
}
Step 3: Calculate the match ratio.
The match ratio is defined as the proportion of words in an organization’s name that overlap with the name of a Twitter/Facebook account. Here is a side-by-side comparison of the organization names with the names obtained from the Twitter API.
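To make the definition concrete, here is a minimal, self-contained sketch using base R (both names below are hypothetical examples of mine): when only two of the three words in an organization’s name reappear in the account name, the ratio is 2/3.

```r
org_name <- "Sierra Club Foundation"  # hypothetical organization name
api_name <- "Sierra Club"             # hypothetical name returned by the API

# Split both names into lowercase words, then count the overlap.
string1 <- unlist(strsplit(tolower(org_name), " "))
string2 <- unlist(strsplit(tolower(api_name), " "))

length(intersect(string1, string2)) / length(string1)
# returns 0.6666667 (2 of 3 words match)
```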
Let’s first calculate the match ratio for Twitter handles and store the match ratio in the new column MatchRatioTwitter.
d$MatchRatioTwitter <- NA

for (name in d$OrganizationName) {
  print(c("matching text for:", name))
  string1 <- str_replace_all(iconv(d[d$OrganizationName == name, ]$OrganizationName), "[[:punct:]]", " ")
  string1 <- unlist(str_split(tolower(string1), " "))
  string2 <- str_replace_all(iconv(d[d$OrganizationName == name, ]$NameFromTwitterAPI), "[[:punct:]]", " ")
  string2 <- unlist(str_split(tolower(string2), " "))
  d[d$OrganizationName == name, ]$MatchRatioTwitter <- length(intersect(string1, string2)) / length(string1)
}
For Facebook:
d$MatchRatioFacebook <- NA

for (name in d$OrganizationName) {
  print(c("matching text for:", name))
  string1 <- str_replace_all(iconv(d[d$OrganizationName == name, ]$OrganizationName), "[[:punct:]]", " ")
  string1 <- unlist(str_split(tolower(string1), " "))
  string2 <- str_replace_all(iconv(d[d$OrganizationName == name, ]$NameFromFacebookAPI), "[[:punct:]]", " ")
  string2 <- unlist(str_split(tolower(string2), " "))
  d[d$OrganizationName == name, ]$MatchRatioFacebook <- length(intersect(string1, string2)) / length(string1)
}
We still need human intelligence to check the remaining incomplete cases, but the process of cross-checking is made much easier with the R code.
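As one way to prioritize that manual check, a sketch like the following could flag rows whose ratio is missing or falls below a cutoff (the 0.5 threshold and the function name are my own illustrative choices, not from this post):

```r
# Hypothetical helper: flag an account for human review when fewer than half of
# the words in the organization name reappear in the API-returned name.
flag_for_review <- function(match_ratio, cutoff = 0.5) {
  is.na(match_ratio) | match_ratio < cutoff
}

flag_for_review(c(1, 0.25, NA, 0.75))
# returns FALSE TRUE TRUE FALSE
```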
Questions? Follow and contact me on Twitter @cosmopolitanvan or via my website.