Automating the Google search for the web presence of 8000+ organizations
I recently worked on a spreadsheet containing a list of over 8,500 nonprofits. For each nonprofit, I wanted to get the URLs of its website, Facebook page and Twitter account. Done by hand, the task is arduous. Below, I will show you how to use a few lines of simple R script to get it done.
1. Install R, RStudio and libraries
R is an open-source data analytics engine and, along with Python, a lingua franca of data mining. RStudio is an integrated development environment (IDE) for running R scripts and tracking outputs. You can install it from here.
After you create a project in RStudio, you can use the following lines to install the R libraries needed for the task. For example, rvest is a library for simple web scraping.
install.packages("rvest")
install.packages("urltools")
require("rvest")
require("urltools")
2. Load spreadsheet into R
My spreadsheet is called data.csv and is located in the same folder where the R project for this task is saved. I can use the following line to import the spreadsheet; in R, the imported data are stored in a data frame named d.
d <- read.csv("data.csv")
View(d)
View(d) will open the spreadsheet and you will see how the data are structured.
To access a particular column, for example the first column, which contains the names of the organizations, simply run
d$OrganizationName
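Before scraping, it helps to confirm that the data loaded as expected. Here is a quick sanity check (OrganizationName is the column name in my spreadsheet; yours may differ):

# How many organizations were imported?
nrow(d)

# Peek at the first few organization names
head(d$OrganizationName)

# Optional: make sure the names are character strings rather than factors
# (read.csv converted text columns to factors by default before R 4.0)
d$OrganizationName <- as.character(d$OrganizationName)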
3. Start web scraping: the simple version
We first create three new columns, named Website, Twitter and Facebook respectively. We fill the columns with "NA" as a placeholder for the output from web scraping.
d$Website <- "NA"
d$Twitter <- "NA"
d$Facebook <- "NA"
We then use the library rvest to send Google Search requests, with each organization name as the search term. Here is a simple version of the web scraping loop.
for (name in d$OrganizationName[1:2000]){
  print(paste0("finding the url for:", name))
  Sys.sleep(3)
  url1 <- URLencode(paste0("https://www.google.com/search?q=", name))
  page1 <- read_html(url1)
  results1 <- page1 %>% html_nodes("cite") %>% html_text()
  result1 <- as.character(results1[1])
  d[d$OrganizationName == name, ]$Website <- result1
}
In the first line, you will notice that we tell R to work through the 1st through the 2000th cells of the column called OrganizationName (you can of course change the numbers to suit your needs). We also have R print (i.e., display in the R console) each organization name.
Sys.sleep(3) is put in so that R rests for 3 seconds after each scraping request.
url1 is the unique Google search URL for a given search term. It is https://www.google.com/search?q=, followed by the search term. So, to search for red cross, the URL is: https://www.google.com/search?q=red+cross (note: the space between the two words is replaced by + or %20).
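URLencode takes care of characters that are not allowed in a URL. For instance (red cross is just an illustrative search term):

# URLencode replaces the space with %20
URLencode("https://www.google.com/search?q=red cross")
# [1] "https://www.google.com/search?q=red%20cross"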
page1 <- read_html(url1) uses the read_html function in rvest to download the HTML returned for a search URL. results1 contains the text of all the cite elements on that page (the URLs Google displays under each result), and result1 is the first of them, which we write into the Website column for that organization.
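One thing the simple loop does not handle: if a request fails (a timeout, or Google serving a captcha page), read_html throws an error and the whole loop stops. A small optional guard, not part of the original loop, is to replace the read_html line inside the loop with a tryCatch and skip that organization:

# Sketch: return NULL instead of erroring out, then skip to the next name
page1 <- tryCatch(read_html(url1), error = function(e) NULL)
if (is.null(page1)) next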
4. Web scraping: full features
The simple web scraping loop we created in the step above only returns website URLs. What about Facebook pages and Twitter accounts? We can add a few lines to the loop to do more advanced scraping. The idea is simple: we add two more Google searches, each limited to a specific domain. For example, by adding "site:twitter.com" to the search URL, we get search results strictly from twitter.com.
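To see what such a URL looks like, here is the Twitter-restricted search URL for the same illustrative search term (the loop below builds these automatically):

name <- "red cross"
URLencode(paste0("https://www.google.com/search?q=", gsub(" ", "+", name), "+site:twitter.com"))
# [1] "https://www.google.com/search?q=red+cross+site:twitter.com"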
The complete web scraping loop looks like this:
for (name in d$OrganizationName[1531:2000]){
  print(paste0("finding the url for:", name))
  Sys.sleep(3)
  url1 <- URLencode(paste0("https://www.google.com/search?q=", name))
  page1 <- read_html(url1)
  results1 <- page1 %>% html_nodes("cite") %>% html_text()
  result1 <- as.character(results1[1])
  d[d$OrganizationName == name, ]$Website <- result1

  print(paste0("finding the Twitter url for:", name))
  url2 <- URLencode(paste0("https://www.google.com/search?q=", gsub(" ", "+", name), "+site:twitter.com"))
  page2 <- read_html(url2)
  results2 <- page2 %>% html_nodes("cite") %>% html_text()
  result2 <- as.character(results2[1])
  d[d$OrganizationName == name, ]$Twitter <- result2

  print(paste0("finding the Facebook url for:", gsub(" ", "+", name)))
  url3 <- URLencode(paste0("https://www.google.com/search?q=", gsub(" ", "+", name), "+site:facebook.com"))
  page3 <- read_html(url3)
  results3 <- page3 %>% html_nodes("cite") %>% html_text()
  result3 <- as.character(results3[1])
  d[d$OrganizationName == name, ]$Facebook <- result3
}
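Notice that the loop above starts at row 1531, picking up where an earlier batch left off. If you prefer not to track indices by hand, one option (a sketch, not the exact code I used) is to loop over only the organizations whose Website column still holds the "NA" placeholder:

# Organizations that have not been scraped yet
remaining <- d$OrganizationName[d$Website == "NA"]
length(remaining)  # how many are left

# then run the same loop with: for (name in remaining){ ... }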
Caveats
You may not always get an accurate search result: the first website returned may have nothing to do with your target organization. That is when you need data cleaning, and part of it can be automated as well!
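For instance, the urltools library installed at the beginning (and not used so far) can reduce each scraped URL to its domain, which makes obviously wrong matches easier to spot. A rough sketch, assuming the Website column has been filled in:

# Extract just the domain from each scraped URL, e.g. "www.redcross.org"
d$WebsiteDomain <- domain(d$Website)

# Flag rows whose "website" is actually a social network or a Wikipedia page
suspects <- d[grepl("facebook|twitter|wikipedia", d$WebsiteDomain), ]
View(suspects)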
You can create a new column to store the second website in a returned search result. For example, you can add the following two lines to the loop.
result4 <- as.character(results1[2])
d[d$OrganizationName==name,]$Alternative_result <- result4
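As with Website, Twitter and Facebook, the new column needs to exist before the loop runs (Alternative_result is simply the column name used above):

# Create the placeholder column before running the loop
d$Alternative_result <- "NA"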
Since all the organizations on my list are nonprofits, their website URLs should end with .org. Using the following, I can create a new data frame (i.e., spreadsheet) that includes only organizations whose URLs contain ".org".
c <- d[grep(".org", d$Website, fixed = TRUE), ]  # fixed = TRUE treats the dot literally
In the end, we probably cannot get away without having human eyes check the data. But now the task is far less daunting.
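When the checking is done, the results can be written back out to a spreadsheet (results.csv is just an example file name):

# Save the scraped results alongside the original data.csv
write.csv(d, "results.csv", row.names = FALSE)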