Using OutWit Hub to scrape data

Data tools and advice

Last year I embarked on a project to map how much the US’s National Institutes of Health (NIH) spends on research in Africa, and where. The aim was to write a special issue on the NIH, one of the major funders of health research in Africa. Its clout means that many researchers are interested in where and how it spends its money on the continent, which makes the topic relevant for Research Africa, a science policy publication that I write for. I spent months on the project but only managed to transfer a few entries from the website the organisation uses to store data about recipients of its funding. The work turned out to be cumbersome and time-consuming, so I put it aside for a while (I don’t believe in giving up).

That was until this year, when I learnt how to use OutWit Hub. This tool allowed me to scrape the NIH website and extract the data I needed. It then exported the data to Excel so that it would be easier to work with.

Now I will show you how OutWit Hub helped me do months’ worth of work in 20 minutes.

The page

First, download OutWit Hub to your computer. The download link will bring you to this page.

Select the Mozilla Firefox extension and download it.

 

An OutWit Hub icon will appear in the corner of your browser window. Click on it to open OutWit Hub.

 

When OutWit Hub has opened, copy the URL of the web page you want to collect data from. In this example, it is the NIH URL.

Paste it into OutWit Hub.

On the left there is a list of options. Select scrapers.

 

It will open to this page. Click new to create a folder that you can use to scrape the data. Give the folder a name.

Open the folder by clicking the top panel on the scrapers page.

Once opened, it will show blank fields. Use those to list the categories you want to scrape. The list can follow the one used on the website: in this case Acts, project title, project leader, organisation, funder and costs.

Once you have listed the categories, return to the left side of OutWit Hub and click page source. This is what will show:

 

The page source shows the code behind each category you want to list. We will use the example of Acts on the NIH webpage. I initially made the mistake of selecting the code around the category heading itself.

And this was the result:

 

Instead, copy the code that appears immediately before one of the actual entries in the category, which is U01 in this case.

Paste it next to the category you have listed in OutWit Hub.

 

Then copy the code that appears immediately after the same entry and paste it into the scraper, as you did earlier.
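For readers who want to see what this step does behind the scenes, here is a minimal sketch in Python. It is not OutWit Hub's own code, only an illustration of the idea of capturing whatever sits between the code before an entry and the code after it. The function name `extract_between` and the sample page source are made up for this example, and the real NIH markup will look different.

```python
import re

def extract_between(source, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    # re.escape makes the markers match literally, even if they contain < > " etc.
    pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
    return [match.strip() for match in re.findall(pattern, source, flags=re.DOTALL)]

# Hypothetical page source, invented for illustration only.
sample_source = (
    '<tr><td class="act">U01</td><td class="title">Example project A</td></tr>'
    '<tr><td class="act">R01</td><td class="title">Example project B</td></tr>'
)

# The code before an entry and the code after it, as copied from the page source.
acts = extract_between(sample_source, '<td class="act">', '</td>')
print(acts)
```

This also shows why copying the code around the category heading goes wrong: the code before and after must surround an actual entry, because every match of that pattern on the page becomes one row in the result.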

 

Click save and execute. This will be the result:


I also want the project titles, so go back to the page source again. Copy the code before one of the actual project titles, which is University of KwaZulu-Natal CAPRISA HIV Clinical Trials Unit in this case. Paste it next to the project title category listed in your scraper. Then copy the code after the title and paste that too.

Repeat the process for all other categories. Save and execute.
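Repeating the process for each category amounts to giving every column its own pair of before and after codes. The sketch below illustrates that idea in Python. It is an assumption about the general technique, not a description of how OutWit Hub is implemented, and the function `scrape_rows`, the markup and the sample values are all hypothetical.

```python
import re

def scrape_rows(source, fields):
    """fields maps a column name to its (code_before, code_after) pair."""
    columns = {}
    for name, (before, after) in fields.items():
        pattern = re.escape(before) + r"(.*?)" + re.escape(after)
        columns[name] = [v.strip() for v in re.findall(pattern, source, flags=re.DOTALL)]
    # Pair up the n-th value of every column to form one row per project.
    return [dict(zip(columns, values)) for values in zip(*columns.values())]

# Hypothetical page source, invented for illustration only.
sample_source = (
    '<tr><td class="act">U01</td><td class="title">Example project A</td></tr>'
    '<tr><td class="act">R01</td><td class="title">Example project B</td></tr>'
)

rows = scrape_rows(sample_source, {
    "Act": ('<td class="act">', '</td>'),
    "Project title": ('<td class="title">', '</td>'),
})
for row in rows:
    print(row)
```

Note that `zip` stops at the shortest column, so a category whose codes match fewer entries than the others will silently shorten the table. That is worth checking for in the scraper's results too.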

This will be the result:

 

Click export at the bottom of the page.

And voila! In 20 minutes you will have an Excel spreadsheet of your data that you can analyse.
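If you ever need to produce a spreadsheet-friendly file from scraped rows yourself, the export step can be approximated with Python's standard `csv` module, whose output Excel can open. This is a minimal sketch under that assumption. The helper `write_rows` and the sample rows are invented for illustration and are not real NIH data.

```python
import csv
import io

def write_rows(rows, handle):
    """Write a list of dicts to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical rows, standing in for the scraped results.
rows = [
    {"Act": "U01", "Project title": "Example project A"},
    {"Act": "R01", "Project title": "Example project B"},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To save to disk instead, pass a file opened with `open("projects.csv", "w", newline="")` (the file name is just an example). The `newline=""` argument is what the `csv` documentation recommends to avoid blank lines between rows on Windows.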