How to combat government censorship and protect data: automate downloading documents from a website
This post also appears on www.ExtractAFact.org
A change in government often brings significant shifts in policy. Major initiatives taken up by a previous administration can be slowed or reversed, and information that was once publicly available may be taken down or censored. The White House webpage provides some clear examples of this phenomenon. Following the inauguration this past January, the press reported that the Trump Administration's White House homepage underwent some changes, such as striking references to climate change and removing a Spanish-language option. Fortunately, a user who wants to view the content from the White House homepage of President Barack Obama can still do so by navigating to https://obamawhitehouse.archives.gov.
Citizens can also use the Internet Archive “Wayback Machine” to access www.WhiteHouse.gov and see content for any given day going back several years. These archive solutions are helpful for viewing web content, but files hosted on these pages can still get lost: documents may become inaccessible as the surrounding content is changed.
Since 2010, the Publish What You Pay coalition, academics, industry, investors and other actors have submitted hundreds of comment letters to the Securities and Exchange Commission (SEC) to influence the agency’s Section 1504 rulemaking. Every single comment that has been submitted to the SEC is available on the regulatory agency’s website. The comments are available as pdf files on four separate comment records: 2010, 2010-2012, 2013-2015, and 2015-2016. Because of the current wave of government self-censorship, we wanted to make sure we could preserve the evidence in the Section 1504 record.
This post provides the steps to download all linked documents, such as pdf files, from a website. The SEC comment record is used as an example, but the same steps can be used to download and preserve files hosted on any site. As with other data scraping and organizing processes, the steps described in this post could be carried out manually. For example, a user can scrape data from a company pdf report by entering it line by line into a spreadsheet, but that is a time-consuming process. As we described previously on Extract-A-Fact, there are tools to help speed up data scraping. To automate the downloading of all linked files on a website, we will use the Google Chrome extension Chrono Download Manager; see the tutorial below.
Step 1 – Install the Chrome extension
Navigate to the Chrome web store page for the Chrono Download Manager and click the ‘Add to Chrome’ button in the upper right. A notice will pop up and you can safely click ‘Add extension’ to confirm installation. When the installation completes you should find a new icon in the upper right corner of your Chrome browser.
Step 2 – Download linked files
Before proceeding, we recommend you set a dedicated folder for downloads. Navigate to chrome://settings in your Chrome browser and set a specific downloads folder. See the image below for an example.
Next, navigate to the page with the files you intend to download. In this case we will use the most recent 1504 comment record. Once on the page, click the Chrono Download Manager icon in the upper right. Select the ‘Document’ tab in the window that pops up.
The ‘Document’ window presents a list of all the links on the page that are interpreted as documents. In this case, we are only concerned with downloading the pdf files. To narrow the selection, click the ‘pdf’ check box as shown below.
Once you’ve selected all the relevant documents you can click ‘Start all’ in the lower right of the window to download the files into the folder you selected in the Chrome browser settings.
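For readers who prefer a script to a browser extension, the same idea (list every link on a page, keep only the pdf links, then download them into one folder) can be sketched in Python using only the standard library. This is a rough illustration and not how Chrono Download Manager works internally. The names `LinkCollector`, `find_document_links` and `download_all`, and the `User-Agent` string, are made up for this example; the header is included on the assumption that some servers refuse requests that do not identify themselves.

```python
import os
import shutil
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs for every <a href> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_document_links(html, base_url, extension=".pdf"):
    """Return the (url, text) pairs whose URL path ends with `extension`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [(url, text) for url, text in collector.links
            if urlparse(url).path.lower().endswith(extension)]


def download_all(links, folder, user_agent="document-archiver-sketch"):
    """Save every linked file into `folder` under its original filename."""
    os.makedirs(folder, exist_ok=True)
    for url, _text in links:
        filename = os.path.basename(urlparse(url).path)
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as response, \
                open(os.path.join(folder, filename), "wb") as out:
            shutil.copyfileobj(response, out)
```

To use it, fetch the page's HTML, pass it with the page address to `find_document_links`, and hand the result to `download_all` along with the folder you want the files saved in. The link text is kept alongside each URL so it can be reused later for more descriptive filenames.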
*Optional Step 3 – Categorize the downloaded files
If you follow the steps above you will be able to download all of the files from a webpage, but they will simply be listed by their filename (e.g. s72515-1.pdf). To help organize the files, you can have Chrono Download Manager automatically attach the descriptive text corresponding to each file. Click the first document highlighted in green (see image above), scroll down to the last pdf, then hold Shift and left-click the last highlighted pdf. With all of the pdf files checked and selected, click the ‘Task Properties’ tab as shown below.
Click the text box next to ‘Naming Mask’ and select ‘*text*.*ext*’, then click ‘Start All’ to download all of the files. The downloaded files will now appear in the folder with a descriptive title (e.g. Jana L. Morgan, Director, Publish What You Pay – United States) rather than the numbered file name.
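The effect of the ‘*text*.*ext*’ naming mask, which combines the link's descriptive text with the original file extension, can be approximated in a few lines of Python. This is a sketch and not the extension's actual logic. The function name `descriptive_filename`, the set of characters replaced with underscores, and the 150-character limit are all assumptions made for this example.

```python
import os
import re
from urllib.parse import urlparse


def descriptive_filename(url, link_text, max_length=150):
    """Build '<link text>.<ext>' from a link, similar to a text-plus-extension mask."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1]  # e.g. ".pdf"

    # Replace characters that are not allowed in filenames on common systems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", link_text)
    # Collapse whitespace and trim leading/trailing spaces and dots.
    cleaned = " ".join(cleaned.split()).strip(" .")

    # Fall back to the original filename when the link has no usable text.
    if not cleaned:
        cleaned = os.path.splitext(os.path.basename(path))[0]

    return cleaned[:max_length] + ext
```

For a link to s72515-1.pdf whose text is the commenter's name and title, this returns that name and title followed by ‘.pdf’. Two links with identical text would produce the same filename, so a real script would also need to handle duplicates, for example by appending a counter.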