How to Use Google Refine to Fill Gaps in a Dataset
This post also appears on Extract-A-Fact
In our first video training session, we presented a walkthrough of how to organize
USEITI data for use in the open source mapping software QGIS. Fortunately, that dataset included geographic identifiers called Federal Information
Processing Standard (FIPS) county codes, five-digit codes identifying counties and county
equivalents throughout the United States. However, not every dataset will include a geographic identifier alongside data attributed to a location. Google
Refine is a powerful and versatile tool that lets users clean, manipulate, and transform their data. In this post we will walk through the process
of using Google Refine to add geographic coordinates to a dataset.
Google Refine has been renamed OpenRefine. Instructions for downloading and installing OpenRefine are available here. The older Google Refine software was used because it was the only
stable version available at the time this post was written. You may encounter problems accessing the most recent file, “OpenRefine v.2.6 release
candidate 2,” and we recommend that you use Google Refine 2.5. The instructions detailed in Step 1 will also apply for OpenRefine v.2.6, if it becomes
accessible at a later date.
Step 1 – Download and Install Google Refine
Navigate to the OpenRefine download page, and download Google Refine 2.5 for your operating system.
Google Refine operates as a hybrid desktop and web application. When you run Google Refine, a browser window should open automatically and present you with
the Google Refine web interface. Despite operating within a web browser window, Google Refine does not require an active internet connection to work. As
long as the Google Refine application is running, you can navigate to http://127.0.0.1:3333/ to access the web
Before we move to the next step, take a moment to download the following .csv file. This dataset was downloaded from ResourceProjects.org and was reduced to include only 2015 projects carried out by Tullow Oil.
Google Refine is a powerful piece of software; however, it can quickly get bogged down with very large sets of data. This file was limited to one company
for the purposes of this tutorial.
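If you want to trim a larger export in the same way before loading it into Google Refine, the filtering can be sketched in a few lines of Python. This is only an illustration: the column names `Company` and `Year` are assumptions, not the actual headers of the ResourceProjects.org file, so adjust them to match your data.

```python
import csv
import io


def filter_rows(csv_text, company, year, company_col="Company", year_col="Year"):
    """Return only the rows for one company and one year.

    company_col and year_col are hypothetical header names; replace them
    with the headers that actually appear in your file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row[company_col] == company and row[year_col] == str(year)
    ]
```

To use it on a file, read the file's text first, for example `filter_rows(open("projects.csv", encoding="utf-8").read(), "Tullow Oil", 2015)`, where `projects.csv` is a placeholder file name.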
Step 2 – Upload your dataset to Google Refine
To get started, click ‘Create Project’. You will be presented with a number of options for data inputs. We will create a new project using data from ‘this
computer.’ Select the file downloaded in the step above, and click next to start the process of uploading the dataset.
Step 3 – Add a new column to fetch location information
With the dataset uploaded, Google Refine will present a preview of the entries. Review the data and headers to make sure everything appears as it should.
At the bottom of the window check that the ‘Parse next’ box is ticked so that the first row entries are parsed as column headers.
Click the ‘create project’ button in the upper right corner to proceed to the main working space of Google Refine. As noted above, we will be adding in
additional geographic information to this dataset. To do so, click the triangle in the ‘Paid to’ column and navigate to ‘Edit column’ > ‘Add column by
A window will pop up as shown below. Name the column and enter the following text in the ‘Expression’ box. (Click here to learn more about General Refine Expression Language.)
"http://maps.google.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")
Click ‘OK’ and the expression will produce a column containing what is essentially the output of a search of the Google Maps application programming
interface (API) on the basis of each term in the ‘Paid to’ column. This operation will typically take several minutes to complete depending on the size of
the dataset. While you wait for the process to complete you can experiment to get a better sense of how this function works. Enter the expression we
just used, leaving off the last portion, into the address bar of another browser window:
Fill in the name of any location around the world after the “=” and you will see a page with all the relevant location information for that location. This
should give you a better sense of what is happening under the hood with the fetching URLs function in Google Refine.
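For readers who prefer to see the same idea outside of Google Refine, here is a minimal Python sketch of what the Step 3 expression does for each cell: it appends a URL-escaped place name to a fixed base URL. The base URL is copied from the expression above; the function name `build_geocode_url` is made up for this example, and the code only builds the string without sending any request.

```python
from urllib.parse import quote

# Base URL copied from the expression used in Step 3.
GEOCODE_BASE = "http://maps.google.com/maps/api/geocode/json?sensor=false&address="


def build_geocode_url(place: str) -> str:
    """Append a URL-escaped place name to the base URL.

    build_geocode_url is a hypothetical helper, not part of Google Refine.
    """
    # safe="" escapes every reserved character, including "/" and spaces.
    return GEOCODE_BASE + quote(place, safe="")


if __name__ == "__main__":
    for name in ["Ghana", "Sierra Leone"]:
        print(build_geocode_url(name))
```

Python's `quote` writes a space as `%20`, which may differ from the escaping Google Refine applies, but both forms are valid ways to encode a space in a query string.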
Step 4 – Add another column to parse the information from the previous step
Once the process has completed, you will see a column filled with a long string of text and numbers.
To clean this up we will add another column that parses that data. Click on the triangle in the new column you created in Step 3 containing all the
Google Maps information, and select ‘Edit column’ > ‘Add column based on this column…’ Write in a title for this new column and enter the following
text in the ‘Expression’ box:
with(value.parseJson().results[0].geometry.location, pair, pair.lat + ", " + pair.lng)
Click ‘OK’ and the new column will populate with a neat set of latitude and longitude coordinates separated by a comma derived from the data in the column
we produced in Step 3.
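The parsing in this step can also be sketched in Python, which may make the structure of the response easier to follow: the text is read as JSON, the first entry of `results` is taken, and its `geometry.location` pair is joined with a comma. The sample response below is a heavily trimmed, hypothetical stand-in for what the column from Step 3 contains, and `extract_lat_lng` is a name invented for this example.

```python
import json


def extract_lat_lng(response_text: str) -> str:
    """Return "lat, lng" from the first result, or "" if there are no results."""
    data = json.loads(response_text)
    results = data.get("results", [])
    if not results:
        # No match was found for this place name.
        return ""
    location = results[0]["geometry"]["location"]
    return f"{location['lat']}, {location['lng']}"


# Hypothetical, trimmed response used only for illustration.
sample = (
    '{"results": [{"geometry": {"location": '
    '{"lat": 7.946527, "lng": -1.023194}}}], "status": "OK"}'
)

if __name__ == "__main__":
    print(extract_lat_lng(sample))
```

Returning an empty string for an empty result list is a design choice made here so that unmatched place names are easy to spot in the output.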
Step 5 – Export your project
The final step is to click ‘Export’ in the upper right corner of the Google Refine window. Select ‘Comma-separated value’ or ‘Excel’ from the dropdown list
of file types.
You can then open the exported file in a desktop application to delete the column containing the unparsed location information while leaving the second
column we created that includes the latitude and longitude coordinates. Google Refine is ideal for refining, cleaning and adding to a dataset, but
operations like deleting rows and columns should be done in programs like Excel.
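If you would rather script this clean-up than open a spreadsheet, a minimal Python sketch using the standard `csv` module is shown here. It assumes the exported file is comma-separated; the function name `drop_column` and the column title passed to it are placeholders for whatever you named the column in Step 3.

```python
import csv
import io


def drop_column(csv_text: str, column: str) -> str:
    """Return the CSV text with one named column removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep every header except the one being dropped.
    kept = [name for name in (reader.fieldnames or []) if name != column]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

To apply it to an exported file, read the file's text, call `drop_column(text, "Your Step 3 column title")`, and write the returned string to a new file so the original export is left untouched.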
While this post demonstrates how latitude and longitude coordinates can be derived from a country name, the exact same process can be carried out for any
other location. If instead of country names the dataset contains the names of cities or provinces, the same steps can be used to obtain the latitude and
longitude coordinates. Location information can help you to create persuasive maps and other visualizations of your data. To learn more about what can be
done with extractives data and mapping, navigate to the training section of