March 24, 2022

Three Approaches to Modern Web Scraping

7 min read

Nimble's Expert

Web scraping is quickly becoming a transformational force for businesses seeking a competitive edge. By empowering companies to gain unique insights into their competitors, protect their online reputation, better understand their users, and much more, web scraping has opened the door to a new world of possibilities. Understanding how this technology works is key for any business or individual looking to implement the right solution for their goals.


What is web scraping?

Web scraping (also called web data collection) is a multi-step process used by individuals and organizations to gather data from external sources, most commonly other websites. The first step is to access the desired data, followed by retrieving, parsing, and finally using the collected data. Web scraping is typically executed by a web crawler (commonly called a bot) - a program designed to access and browse webpages.

The bot or web data scraper attempts to reach the desired web page and return it in its raw form: HTML. Once retrieved and stored, the next step is to analyze, or parse, the HTML in order to extract useful information. Parsing is generally a very specialized task, crafted for a particular source, because webpages tend to be structured differently; extracting a given piece of information from one webpage can look very different from extracting it from another.

With parsing complete, the desired data has been extracted from the raw HTML and can be used for its intended purpose. This outlines a very basic web scraping process, and although it may seem daunting and difficult to set up, there are many tools that help simplify it.
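
To make these steps concrete, here is a minimal sketch in Python using only the standard library. The URL is a placeholder, and the parser simply pulls out the page title; a real project would target specific pages and fields.

    # A minimal sketch of the four steps (access, retrieve, parse, use),
    # using only Python's standard library. The URL is a placeholder.
    from urllib.request import Request, urlopen
    from html.parser import HTMLParser

    class TitleParser(HTMLParser):
        # Collects the text inside the page's <title> tag.
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""
        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True
        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False
        def handle_data(self, data):
            if self.in_title:
                self.title += data

    # Steps 1-2: access the page and retrieve its raw HTML
    request = Request("https://example.com/", headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(request).read().decode("utf-8", errors="replace")

    # Step 3: parse the HTML to extract the piece of data we care about
    parser = TitleParser()
    parser.feed(html)

    # Step 4: use the data (here, simply print it)
    print(parser.title)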

Approach #1: HTTP-Client

The first (and simplest) approach is to use an HTTP-client with a parsing library such as Beautiful Soup (a minimal sketch follows the list below). HTTP-clients are very lightweight programs, and they have several advantages, including:

  • Speed - Requests can be executed very rapidly.
  • Simplicity - The code needed to create an HTTP-client is short, basic, and well documented.
  • Low-cost - HTTP-clients consume very little CPU and memory resources, and thus are very cheap to run.
  • Precision - HTTP-clients execute only the request they are given, and will not load additional resources or links (such as images, stylesheets, JavaScript, etc.). This makes them faster and avoids the unintended network cost of loading numerous additional resources.
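
As a concrete illustration, a minimal HTTP-client scraper built with the requests library and Beautiful Soup might look like the following; the URL, headers, and CSS selector are placeholders rather than references to any real site.

    # Hypothetical example: collecting product names from a listing page.
    # The URL, headers, and CSS selector are placeholders for illustration.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(
        "https://example.com/products",
        headers={"User-Agent": "Mozilla/5.0"},  # a browser-like user agent
        timeout=10,
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    for item in soup.select("div.product h2"):  # selector depends on the target page
        print(item.get_text(strip=True))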

Although HTTP-clients have great advantages, they also bring a number of disadvantages which often disqualify them from many use cases. These include:

  • Manual work - HTTP-clients execute only the exact requests they are given. If the target webpage loads additional content through XHR requests, that content will not be visible to the HTTP-client. To reach it, the programmer must work out how the website operates - which data endpoints it uses, what parameters they require, and how they return data - and then call those endpoints directly, as illustrated after this list.
  • Low success rate - Many websites block data collection bots, and HTTP-clients are comparatively easy to detect, so they get blocked more frequently than the other approaches.
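
For example, if inspecting the target site's network traffic reveals a JSON endpoint behind its XHR requests, the HTTP-client can call that endpoint directly. Everything about the endpoint below (path, parameters, response fields) is hypothetical and would have to be discovered in the browser's developer tools.

    # Hypothetical XHR endpoint example: the path, parameters, and response
    # fields are placeholders discovered by inspecting the site's network traffic.
    import requests

    response = requests.get(
        "https://example.com/api/products",
        params={"page": 1, "per_page": 50},
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()

    for product in response.json().get("items", []):
        print(product.get("name"), product.get("price"))

Hitting the underlying endpoint directly is usually faster than parsing HTML, but this discovery work has to be repeated for every new site.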

Approach #2: Lightweight JS Engines

There are many lightweight JS engines to choose from. Three popular examples include PhantomJS, jsdom, and Splash (a Splash-based sketch follows the list below). The advantages of using JS engines include:

  • JavaScript rendering - JS engines execute JavaScript, which means the target webpage is rendered much as it would be in a user's browser, making it far easier to access content loaded through XHR requests.
  • Human Interaction - JS engines can click buttons and links, move a virtual cursor, and more. This allows the programmer to interact with the target webpage as needed in order to reveal or otherwise access the desired content.
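
As an illustration of this approach, Splash exposes an HTTP API that returns the rendered HTML of a page, which can then be parsed like any other response. The sketch below assumes a Splash instance running locally on its default port; the target URL and selector are placeholders.

    # A sketch using Splash's HTTP API. It assumes a Splash instance is running
    # locally on its default port (for example via the Splash Docker image);
    # the target URL and CSS selector are placeholders.
    import requests
    from bs4 import BeautifulSoup

    rendered = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "https://example.com/products", "wait": 2},  # wait for XHR content
        timeout=60,
    )
    rendered.raise_for_status()

    soup = BeautifulSoup(rendered.text, "html.parser")
    for item in soup.select("div.product h2"):  # selector depends on the target page
        print(item.get_text(strip=True))

Because Splash runs as a separate service, the scraping code stays nearly identical to the plain HTTP-client version; the request is simply routed through the rendering engine first.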

These advantages make JS engines faster to set up, more useful, and easier to adapt to new webpages than HTTP-clients. However, they also have the following disadvantages:

  • Blocking and Detection - Many JS engines are well-established and easily recognized by anti-bot measures. Additionally, evading anti-bot blockers requires fairly extensive fingerprinting work, such as modifying the user agent and header order.
  • Resource Consumption - JS engines consume more resources than HTTP-clients, making them more expensive to run. Despite being lighter than full browsers (the third approach!), they can still consume hundreds of megabytes of RAM along with significant CPU time, making them more intensive to run and especially to scale.

Approach #3: Real Browsers

Before we dive into the advantages and disadvantages of using real browsers, it’s important to distinguish them from headless browsers. Both headless and real browsers provide programmers with a method of interacting with an actual browser in much the same way that ordinary users interact with their browsers.

The difference is that headless browsers have no visual interface, do not render graphics on screen, and are controlled purely through code or the command line, whereas real browsers render visual graphics (typically with the help of a GPU) and aim to give programmers a tool that is as close as possible to an actual user's browser.

From a web scraping perspective, headless browsers offer the same advantages and disadvantages as lightweight JS engines. On the other hand, real browsers are very different:

  • High Success Rate - Real browsers are the hardest to distinguish from ordinary user traffic and thus enjoy the highest success rate (lowest block rate).
  • Flexible Human Interaction - Thanks to their full visual and JavaScript rendering, real browsers give programmers the greatest freedom to interact with the target webpage and are far more capable of performing convoluted, multi-step browsing flows (see the sketch after this list).
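
As a concrete sketch of this kind of interaction, a real browser can be driven with an automation framework such as Selenium. The example below assumes Chrome and Selenium's Python bindings are available; the URL and selectors are placeholders.

    # A sketch driving a real Chrome browser with Selenium. It assumes Chrome
    # and the Selenium Python bindings are installed; the URL and selectors
    # are placeholders.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/products")

        # Interact like a user, e.g. click a "Load more" button if one exists
        buttons = driver.find_elements(By.CSS_SELECTOR, "button.load-more")
        if buttons:
            buttons[0].click()

        for item in driver.find_elements(By.CSS_SELECTOR, "div.product h2"):
            print(item.text)
    finally:
        driver.quit()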

These advantages have made real browsers some of the most popular tools for web scraping. However, to implement them successfully, the following disadvantages must be factored in:

  • Resource Intensive - Fully rendering webpages requires a lot of resources. This makes real browsers expensive to run, especially at scale.
  • Unstable - The tools used to drive real browsers were generally not designed for web scraping; they were built for browser automation and website testing. As a result, they tend to be unstable and crash frequently.
  • Unintentional Requests - Because real browsers fully render the target webpage, they may make dozens or even hundreds of additional requests to load scripts and images, complete XHR requests, and more. This can quickly eat up resources and increase network traffic expenses.

Conclusion

Web scraping allows companies and individuals to access the largest database in the world - the internet. Once established, a web scraping pipeline can provide incredible value to an organization, and with virtually unlimited use cases, this technology is set to become a bigger and bigger part of every data stack. This guide was intended to give beginners a glimpse at the inner workings of web scraping and to provide some insight into the process, costs, and benefits. If you’re interested in learning more about web scraping, check out our blog.
