Retrieving Wikipedia Data for Natural Language Processing

The internet is not just for cat videos anymore; there is too much useful, freely available data to ignore. In my opinion, being able to easily pull data from the internet using APIs is a core skill for any data scientist or analyst. I've already walked through how easy it is to use an API to get Strava exercise data and Twitter data; today we focus on Wikipedia to gather text data for our Natural Language Processing projects.

We're going to use R for this project and let some great packages do the heavy lifting: WikipediR for working with the MediaWiki API and rvest for scraping the Wikipedia pages themselves. As usual, there are a number of packages and ways to approach this task, but I found this combination the easiest and most straightforward. The output of this code is a text file containing the text from all the Wikipedia pages related to a given topic, with the goal of generating a domain-specific corpus for training word vectors with Word2Vec. That said, this code is easily modified to extract a variety of other data from Wikipedia.

In addition to the two packages mentioned above, we will load tidyverse for any data manipulation tasks. Tidyverse is a must-have for making data manipulation easy, like in this example working with customer survey data. We'll also initialize some parameters, including our overall topic (Ice hockey) and the output filename. For your particular topic, just search Wikipedia for a category page and use all the text after 'Category:'. The script will then extract all the pages and sub-categories associated with that topic.

library(tidyverse)            # Data Manipulation
library(WikipediR)            # Wikipedia Queries
library(rvest)                # Web Scraping

category_list <- "Ice hockey" # Set to starting category name
filename <- "data.txt"        # Output filename
total_pages <- c()            # Master list of page titles
categories <- c()             # Master list of sub-category names
text_data <- NULL             # Combined text from every page

As a first step, we want to use the MediaWiki API to retrieve a list of all the sub-categories and page titles associated with those categories.  The WikipediR package provides some nice wrapper functions to access the API.

  # retrieve pages and sub-categories for the current category
  # ('category' is the category being processed on this pass of the loop; see below for the looping structure)
  pages <- pages_in_category("en", "wikipedia", categories = category, properties = c("title"), type = c("page"))
  sub_cats <- pages_in_category("en", "wikipedia", categories = category, properties = c("title"), type = c("subcat"))

The 'pages_in_category' function does the work here, returning Wikipedia sub-category and/or page data depending on its parameters. The first two arguments specify English-language Wikipedia, and we pass in the current category name. The function can return a number of properties, but we only need the title. Finally, we specify the type of element we are looking for (page, sub-category or file) and assign the results to pages and sub_cats respectively. A couple of loops then extract each page and sub-category title and add them to our master lists.

  # add pages to list  
  if (length(pages$query$categorymembers) > 0 ) {
    for (i in 1:length(pages$query$categorymembers)) {
      total_pages <- c(total_pages, pages$query$categorymembers[[i]]$title)
    }
  }

  # add sub categories to list
  if (length(sub_cats$query$categorymembers) > 0 ) {
    for (i in 1:length(sub_cats$query$categorymembers)) {
      sub_cat = gsub("Category:", "", sub_cats$query$categorymembers[[i]]$title)
      categories <- c(categories, sub_cat)
      next_category_list <- c(next_category_list, sub_cat)
    }
  }

Next, we repeat the process for each of the returned sub-categories, retrieving all of the pages and sub-categories associated with it and adding them to the master lists. We keep 'drilling down' until there are no more sub-categories or pages associated with the overall topic. A sketch of this looping structure is shown below; to view the complete code, please check it out on my GitHub. The final results follow, and that's a lot of categories and individual pages about hockey!
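
Something like the following would do it. This is an illustrative reconstruction rather than the exact script from GitHub, and it assumes the page- and sub-category-handling snippets shown above sit inside the inner loop.

# sketch of the drill-down loop (illustrative; see GitHub for the full script)
while (length(category_list) > 0) {
  next_category_list <- c()               # sub-categories discovered on this pass
  for (category in category_list) {
    # retrieve pages and sub-categories for the current category
    pages <- pages_in_category("en", "wikipedia", categories = category, properties = c("title"), type = c("page"))
    sub_cats <- pages_in_category("en", "wikipedia", categories = category, properties = c("title"), type = c("subcat"))
    # ... add page titles to total_pages, and sub-category names to both
    #     categories and next_category_list, exactly as in the snippets above ...
  }
  # Wikipedia's category graph contains loops, so the full version should also
  # skip sub-categories that have already been visited before the next pass
  category_list <- next_category_list
}

print(paste0("Number of Categories: ", length(categories)))
print(paste0("Number of Pages: ", length(total_pages)))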

[1] "Number of Categories: 7322"
[1] "Number of Pages: 12912"

Next we will extract the text from each page using the rvest web scraping package. We simply loop through the page list we just created and extract all the paragraphs (denoted by the <p> node). We are leaving some of each web page behind, such as lists and tables, but for this project I figured grabbing nice clean paragraph and sentence data would be most useful. All of the paragraphs for a given page are pasted into a single 'page_text' string, which is then appended to the overall 'text_data' variable.

# read all page paragraph data
for (i in 1:length(total_pages)) {
  page = gsub(" ", "_", total_pages[i])
  print(paste0("Loading Page: ", page))
  web_address <- paste0("https://en.wikipedia.org/wiki/",page)
  page_html <- read_html(web_address)
  page_paragraphs <- html_nodes(page_html,"p")
  page_text <- paste(html_text(page_paragraphs), sep = '', collapse = '')
  # append this page's text to the running corpus
  if (is.null(text_data)) {
    text_data <- page_text
  } else {
    text_data <- paste(text_data, page_text, sep = '', collapse = '')
  }
}
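
One practical note: with roughly 13,000 pages, a few requests are likely to fail (renamed pages, unusual characters in titles, network hiccups), and a single failure will stop the loop above. If you want the loop to keep going, one option, not part of the original script, is to wrap the read_html call in tryCatch and skip the offending page. The snippet below would replace the read_html line inside the loop.

  # hypothetical hardening: skip any page that fails to load instead of
  # stopping the whole run
  page_html <- tryCatch(read_html(web_address), error = function(e) NULL)
  if (is.null(page_html)) {
    print(paste0("Skipping page (failed to load): ", page))
    next
  }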

Before we write all this juicy text data to a .txt file for future NLP tasks, we should do a little preprocessing. Let's remove the existing line breaks and instead insert a line break after each sentence, so that downstream NLP tools treat each sentence as its own entity. We also remove citations (e.g. [2]) and all remaining punctuation, and convert everything to lower case. You may want to modify these preprocessing steps for your project. Perhaps you want to distinguish between the 'doors' of a house and the band 'Doors', in which case you wouldn't want to convert everything to lower case. I also kept all numerical data as-is, to preserve dates and jersey numbers, but you may want to consider removing numbers altogether.

# text pre-processing
text_data <- gsub("\n", " ", text_data)       # remove the existing line breaks
text_data <- gsub("\\.", "\r\n", text_data)   # add line breaks
text_data <- gsub("\\[\\d\\]"," ", text_data) # remove citations
text_data <- gsub("[[:punct:]]"," ", text_data) # remove punctuation
text_data <- text_data %>% tolower()
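
# optional (not in the original script): uncomment to also drop all digits,
# e.g. if dates and jersey numbers aren't useful for your downstream task
# text_data <- gsub("[0-9]+", " ", text_data)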

# save text file
write(text_data, filename)

The final step is to write this 6-million-word corpus to a text file, which we will use in the next post to train word vectors with Word2Vec.

All the code for this task can be found on GitHub. You can use it to generate text for any topic simply by changing the starting category. And getting more familiar with the WikipediR and rvest packages will let you modify the script to extract any number of other Wikipedia pages and data.
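
For example, if you were after tabular data (standings, season statistics) rather than paragraph text, the scraping step could target tables instead of <p> nodes. The snippet below is a hypothetical variation, assuming page_html has already been read as in the loop above.

# hypothetical variation: pull a page's tables instead of its paragraphs
page_tables <- page_html %>%
  html_nodes("table.wikitable") %>%  # most Wikipedia data tables use the 'wikitable' class
  html_table()                       # returns a list of data frames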

Thanks for reading!