Web scraping strategies for Selenium (Python)

A beginner’s guide to scraping sites using Selenium

Introduction

There are multiple technologies available to any aspiring web scraper, and depending on the task at hand, different tools have their advantages. In this guide, we will focus on Selenium, a commonly used Python library that automates browsers, originally built for browser testing.

Since Selenium runs a full browser instance, it is particularly suited to scraping sites that rely on JavaScript and other interactive features such as buttons. For simpler, static sites, I highly recommend using a library like BeautifulSoup to simplify the process.

Below, I shall walk through a recent project where a Selenium web scraper was built to scrape job vacancies from a variety of sources. The GitHub repo can be found here.

The building blocks

The functions below, found in the scrapers module of the linked GitHub repository, form the foundation of the scraping functionality.

The first is a wait function that forces the scraper to pause until a specific element is present.
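In the repository this is a small helper; a minimal sketch of such a function, built on Selenium's WebDriverWait (the name wait_for_element here is illustrative, not necessarily the repo's actual name), might look like:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_element(driver, xpath, timeout=10):
    # Block until the element exists in the DOM, or raise TimeoutException.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, xpath))
    )
```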

The second is Selenium's find_element method, which extracts the first element that matches the given xpath.

The find_elements method is similar to the singular version, but returns a list of all elements that match the xpath.
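A quick sketch of how these calls are spelled with Selenium 4's locator API (older releases used find_element_by_xpath); the URL and xpath here are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

# First element matching the xpath; raises NoSuchElementException if absent.
first_match = driver.find_element(By.XPATH, "//summary")

# All matching elements; returns an empty list if nothing matches.
all_matches = driver.find_elements(By.XPATH, "//summary")
```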

Xpaths offer a way to refer to a specific element on a page, and are key to informing the scraper which element to extract. Often, multiple xpaths can refer to the same element. Not all of them, however, are equal: some are more robust than others, particularly when it comes to being resilient to sudden changes in site structure.

A brief primer on xpaths

To view the HTML code from a browser, right-click the element of interest and click Inspect. On Chrome, we might see something like the markup below.
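The original screenshot is not reproduced here; as a stand-in, a heavily simplified and hypothetical version of the markup (the class names and nesting are illustrative only) might be:

```html
<body>
  <div class="application-main">
    <main>
      <div class="container">
        <summary class="Link--secondary">
          Contributors <span class="Counter">42</span>
        </summary>
      </div>
    </main>
  </div>
</body>
```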

In the example, we want to locate the element that contains the number of contributors. From the HTML source, we observe that the element is nested several layers beneath the body element. The complete level-by-level xpath would look something like this (using the simplified markup above; on a real page the chain is much longer):
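```
/html/body/div/main/div/summary
```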

If this element is the first “summary” element on the page, we can refer to it by:
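```
(//summary)[1]
```

(Note that //summary[1] would instead select the first summary child of each parent, so the parenthesised form is the safer way to say “first on the page”.)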

We can also use the class to refer to it, though be wary if there are multiple elements of the same class:
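```
//summary[@class="Link--secondary"]
```

(With @class, the attribute must match exactly; contains(@class, "Link--secondary") is more forgiving when an element carries several classes.)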

Finally, we can use the text content to search for it as well:
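```
//summary[contains(text(), "Contributors")]
```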

Choice of xpaths

With so many possible ways to refer to an element, it is not always obvious which to use. Below, I list several considerations when deciding on a suitable xpath.

How likely is the site to change in the future?

The primary challenge lies in choosing an xpath that is resistant to future changes. Hence, we really only need to be concerned when the site is likely to experience changes in structure, whether intentionally (to deter scrapers) or unintentionally (as part of updates). If changes are unlikely to occur, then the most convenient xpath can simply be chosen.

If a site is likely to change, will it be the text or the structure that changes?

Some sites undergo changes in their text or structure due to updates or revamps. Unless it is a complete overhaul where the underlying structure of the site changes, xpaths built on specific element attributes such as an id or class will likely withstand the changes.

Other sites actively modify their element structure by changing element attributes such as id (possibly to hinder scrapers). These are a little trickier to deal with, but with careful selection of anchor points (specific elements in a page to use as reference markers), a fairly robust xpath can still be constructed. Using text as an identifier, as opposed to element attributes, also helps in circumventing such changes.

Are there any good anchor points?

One way to generate robust xpaths is to base them on good anchor points. Anchor points are elements with unique characteristics that make them easily identifiable, and that are unlikely to change significantly. For example, section headers, navigation buttons or field titles make good anchor points because they are part of the core layout of the page.
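For instance, on a hypothetical page with a stable “Job details” header, the block that follows the header can be reached with an XPath axis, so the xpath keeps working even if that block's own id or class churns:

```
//h2[text()="Job details"]/following-sibling::div[1]
```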

Dealing with JavaScript

One of the main reasons to use Selenium is to deal with JavaScript elements such as buttons, so here I detail some common interactions.

Loading page elements

Some sites require JavaScript to load their content, and sending a plain HTTP request returns a blank page. Such sites make using Selenium absolutely necessary, but on the bright side, they do not require any further measures.

‘Load more’ button

Present on sites such as LinkedIn, these buttons require a click to display more items: simply find a suitable xpath for the button and call Selenium's click method on it. For a more robust solution, use the wait function from the building blocks section to ensure that the button is present before attempting to click it, as in the sketch below.
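A minimal sketch, assuming the button's visible label contains “Load more” (the exact xpath and URL will vary by site); it waits for the button to become clickable, a stricter variant of the presence check used earlier:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/jobs")  # placeholder URL

# Hypothetical xpath; adjust the label to the target site.
button_xpath = '//button[contains(text(), "Load more")]'

# Wait until the button is present and clickable, then click it.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, button_xpath))
)
button.click()
```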

Scroll to end

This feature is similar to the ‘Load more’ button, but activates when the user scrolls to the bottom of the page. To trigger it, execute a scroll script in Selenium that looks like this
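The original snippet is not reproduced here; a common pattern (a sketch, with the sleep and exit condition tuned per site) repeatedly jumps to the bottom until the page height stops growing:

```python
import time

# Assumes `driver` is an initialised webdriver, as in the earlier sketches.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Jump to the bottom of the page to trigger the next batch of items.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude wait for new content to load; tune per site
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content loaded; we have reached the end
    last_height = new_height
```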

Example

Let’s walk through an example where we scrape the job description from a LinkedIn jobs page.

A few observations:

The complete description is hidden behind a “Show more” button (it is actually available in the HTML code, but for this example let’s pretend otherwise).

The “Show more” button acts as a good anchor point, as it sits right below the job description and is likely to remain there barring any UI overhauls.

As such, I would scrape the description as follows. First, get and click the button element:
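A sketch, assuming the button's visible text contains “Show more” (LinkedIn's actual markup may differ):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumes `driver` already has the LinkedIn jobs page loaded.
show_more_xpath = '//button[contains(text(), "Show more")]'
show_more = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, show_more_xpath))
)
show_more.click()
```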

I specifically use a text search here since LinkedIn might modify the element attributes, but the words will likely remain unchanged. Next, we navigate to and extract the text from the element containing the description:
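Continuing the sketch, and assuming (hypothetically) that the description sits in a div just before the button in the DOM; the exact relative step must be read off the live markup:

```python
# Use the "Show more" button as an anchor and step to the description.
description_xpath = (
    '//button[contains(text(), "Show more")]/preceding-sibling::div'
)
description = driver.find_element(By.XPATH, description_xpath).text
```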

This gives you the raw text containing the description. At this point, some text processing might be necessary depending on the format of the text.

Summary

This guide should give you a good starting point for designing your own web scraper. Building a robust scraper can be challenging because of the need to choose good xpaths. By outlining some of the thought processes behind deciding how to scrape, I hope to help anyone building their own scraper start thinking along these lines.
