Benoit Perigaud, Analyst

A quick guide to choosing the best scraping tool for the job

As data analysts and data scientists, we often want to analyse data that is available online but scattered across different pages, or even different websites, rather than consolidated in a format that is easy to ingest into our favourite tool.

With today’s Machine Learning algorithms, the more data we can test our algorithms on, the more accurate our predictions will usually be. We are talking about hundreds of thousands of records, so collecting them manually is not realistic unless you have plenty of time on your hands.

Here is a quick guide to three different tools and when to use each of them.

Enough talking! Let’s scrape the web.

1) Web Scraper Chrome extension

This Chrome extension is my go-to tool when I need to do a once-off scrape of a website. It lets you deal with paginated websites, extract URLs and follow links, or extract full HTML tables in a few clicks.

Despite being a relatively simple tool, there are a number of parsing capabilities out of the box. If you can create a regular expression to extract the data you are looking for, the regex capabilities of the extension will allow you to get exactly what you are looking for.
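As a rough illustration of the kind of pattern involved, here is a minimal Python sketch of a regular expression that pulls a price out of scraped text. The sample string and the pattern are assumptions made up for this example. The extension runs in the browser, so its regex flavour is JavaScript's and may differ slightly from Python's `re` module.

```python
import re

# Hypothetical text, as it might come out of a scraped product listing.
raw_text = "Special offer: AUD 1,299.95 (was AUD 1,499.00)"

# Match the first amount: digits with optional thousands separators
# and an optional two-digit decimal part.
PRICE_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

match = PRICE_PATTERN.search(raw_text)
price = float(match.group().replace(",", "")) if match else None
print(price)
```

Testing a pattern like this on a handful of sample strings first tends to be quicker than re-running a scrape to find out it missed a case.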

It will take some time to understand how to use the tool. However, the documentation is well written, and the developer has created video tutorials and even set up test sites to practise on. What if there’s a specific thing you are trying to do but can’t figure out? Just jump to the forum, where people are actually answering others’ questions!

One last thing worth mentioning is that all the config can be exported/imported with JSON, making it easy to share with colleagues.

After this quick intro, here is my even quicker review of this extension:

Advantages: 

  • Doesn’t require any programming skills
  • Super quick to build a basic scraper (we’re talking minutes)
  • Deals well with sites requiring authentication: you just need to log in to the website first and the extension will reuse that session
  • Handles JavaScript-rendered pages without trouble, as the pages are loaded in Chrome directly

Limitations: 

  • Quite slow compared to other solutions (minimum of 2 seconds between each page load)
  • Not possible to run on a periodic basis unless going with the paid cloud scraper option
  • Requires manual steps to download the files, so not the best for being part of a bigger workflow

2) The Python library BeautifulSoup (aka BS4)

BeautifulSoup (BS4 for its current major version, 4) is a Python library that focuses on pulling data out of HTML and XML files.

It is quite often used in conjunction with other libraries like requests to retrieve HTML source code from different addresses, or with pandas to collect all the results in a dataframe and use the existing capabilities of pandas to export dataframes to CSV files or to a SQL table.
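A minimal sketch of how these three libraries can fit together is shown below. The function names `fetch_html` and `parse_links`, the sample HTML and the column names are assumptions for illustration, not part of any of the libraries. The inline sample lets the sketch run without network access.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML source."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_links(html: str) -> pd.DataFrame:
    """Collect the text and target of every link into a dataframe."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]
    return pd.DataFrame(rows, columns=["text", "href"])


# Inline sample so the sketch runs without network access.
sample_html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
links = parse_links(sample_html)
csv_text = links.to_csv(index=False)  # no path given, so the CSV comes back as a string
print(csv_text)
```

Against a live site, the call would look like `parse_links(fetch_html(url))`. Passing a file path to `to_csv` writes a file, and `DataFrame.to_sql` covers the SQL table case.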

The easiest way to use BS4 is usually to inspect the source code of a web page and look at how the information we are looking for is stored.

As a quick example, here is the code to find the first <div> element in the page with the CSS class “the_class”:

stories = page_soup.find("div", {"class": "the_class"})

And how we can then extract all the list items (<li> tags) from the element we found:

items = stories.find_all("li")
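For completeness, here is a self-contained sketch that wires the two calls together. The HTML snippet is a hypothetical page source and the variable names are my own choice. Only the class name “the_class” comes from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; "the_class" matches the class used above.
html = """
<html><body>
  <div class="other"><ul><li>Ignored</li></ul></div>
  <div class="the_class">
    <ul><li>First story</li><li>Second story</li></ul>
  </div>
</body></html>
"""

page_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
stories = page_soup.find("div", {"class": "the_class"})

# find_all() returns every matching descendant of that element.
items = stories.find_all("li") if stories is not None else []
titles = [li.get_text(strip=True) for li in items]
print(titles)
```

The `None` check matters in practice: if a site changes its markup, `find` quietly returns `None` and the next call would otherwise fail with an `AttributeError`.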

This is just a brief introduction to BS4. The library is capable of much more, such as navigating to the parents or children of a given HTML tag, or selecting elements with CSS selectors.
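A short sketch of those extra capabilities, assuming a made-up HTML fragment: `select` and `select_one` take CSS selectors, `.parent` walks up the tree, and `find_all(..., recursive=False)` restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration.
html = """
<div class="the_class">
  <ul id="stories">
    <li><a href="/one">One</a></li>
    <li><a href="/two">Two</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: links directly inside list items of the div with class "the_class".
links = soup.select("div.the_class li > a")
hrefs = [a["href"] for a in links]

# Walk up the tree: the parent of the first link is its <li> tag.
first_parent = links[0].parent.name

# Walk down the tree: direct <li> children of the list only.
stories_list = soup.select_one("ul#stories")
children = stories_list.find_all("li", recursive=False)

print(hrefs, first_parent, len(children))
```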

Without going into too much detail, here are a few advantages and limitations I found when using BeautifulSoup.

Advantages:

  • Flexible, you can use it in addition to other Python libraries
  • Quick to set up when scraping just a few pages
  • Can be called from a Jupyter Notebook
  • Learning curve not too steep if you have previous knowledge of Python
  • Quite fast (500 pages scraped in 6 minutes and 25 seconds)

Limitations:

  • Need to write the entire scraping logic by hand
  • Not as easy to integrate into a productionised workflow as Scrapy
  • No easy way to set up Autothrottle or to change the output format
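To make the first and last limitations concrete, here is a minimal sketch of the kind of loop you end up writing by hand to pace requests. The `crawl` function and the injected `fetch` callable are hypothetical names. The stand-in fetcher keeps the sketch runnable without network access. A fixed pause like this is much cruder than Scrapy's Autothrottle.

```python
import time
from typing import Callable, Dict, Iterable


def crawl(urls: Iterable[str], fetch: Callable[[str], str],
          delay_seconds: float = 1.0) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # fixed pause, no adaptation to the server
        pages[url] = fetch(url)
    return pages


# Stand-in fetcher; in practice this could wrap requests.get(url).text.
def fake_fetch(url: str) -> str:
    return f"<html><title>{url}</title></html>"


pages = crawl(["https://example.com/1", "https://example.com/2"],
              fake_fetch, delay_seconds=0.1)
print(len(pages))
```

Retries, error handling and output formats would all need the same hand-written treatment.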

3) The Python framework Scrapy

The last tool we are reviewing is Scrapy. More than just a library, Scrapy is a framework built for scraping. Its architecture is built around “spiders”, which are a set of instructions on how to extract specific data from a website.

 

Out of the box, Scrapy provides many functionalities, including the ability to write functions to validate your data and save it to a database. Once spiders have been coded locally, it’s easy to push them to a cloud solution like Scrapy Cloud, which takes care of crawling all required websites without using your local machine.

Scrapy might look a bit intimidating at the beginning, but there are many online tutorials to learn the basic concepts and functionalities. Once those are understood, very little code is required to create a spider able to scrape an entire news website.

After having waited way too long to learn how to use Scrapy, I really see the advantages of using a full framework instead of individual libraries. As with the two previous tools, here are the pros and cons of using Scrapy:

Advantages: 

  • Easy to configure otherwise complicated parameters like Autothrottle or DNS cache
  • Once you understand the concept, Scrapy lets you configure spiders with a minimal amount of code
  • Fast scraping (as fast as BeautifulSoup at least, 500 pages scraped in 6 minutes and 10 seconds)
  • The supplied Scrapy shell lets you debug your code very quickly, without having to run the spider
  • The output format can be changed extremely easily, just by changing the parameter used when running “scrapy crawl”
  • Follows the “robots.txt” rules in projects generated with “scrapy startproject” (the generated settings enable ROBOTSTXT_OBEY) and won’t crawl restricted URLs
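Several of the points above come down to a few lines in a project's settings file. The fragment below is a sketch with illustrative values, not recommended defaults. The setting names are Scrapy's own.

```python
# settings.py (fragment): values are illustrative assumptions.

# Respect robots.txt rules.
ROBOTSTXT_OBEY = True

# AutoThrottle adjusts the delay between requests based on server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per site

# In-memory DNS cache.
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
```

The output format is chosen on the command line: `scrapy crawl <spider> -o items.csv` writes CSV, while `-o items.json` writes JSON, with the format inferred from the file extension.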

Limitations: 

  • Learning curve steeper compared to BeautifulSoup. Need to understand the concept of spiders and how the different parts of the framework interact
  • Requires running CLI scripts and does not integrate easily into a Jupyter Notebook context
  • May not be the best tool for scraping a single page

You are now set up to crawl the web!

Your next step might be to integrate this data with other sources you already have, or combine it with more information from your existing systems. At Aginic, we embrace Cloud technologies, so if you want to discuss data integration or data warehousing with our teams of friendly Engineers and Data Analysts, feel free to contact us or me directly.

We also recently released a suite of remote workshops tackling multiple subjects around Design, Cloud technologies, Data Analytics and Agile Delivery. So, if you are interested in a 1-day workshop on Modern Data Architecture or a half-day Cloud Kickstarter Workshop, jump here.