Web scraping strategies for Selenium (Python)
There are multiple technologies available to any aspiring web scraper. Depending on the nature of the task at hand, different tools have their advantages. In this guide, we will focus on Selenium, a widely used Python library for browser automation, originally built for automated browser testing.
Below, I shall walk through a recent project where a Selenium web scraper was built to scrape job vacancies from a variety of sources. The GitHub repo can be found here.
The building blocks
Present in the scrapers module of the linked GitHub repository, these functions are the foundations for the scraping functionality.
The first is a wait function that forces the scraper to pause until a specific element is present on the page.
find_element_by_xpath is a method of the Selenium driver that extracts the first element matching a given xpath.
find_elements_by_xpath is similar to the singular version, but returns a list of all elements that match the xpath.
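As a rough sketch of the wait building block (the repo's actual implementation may differ, and the function name and defaults here are assumptions; Selenium also ships WebDriverWait with expected_conditions for exactly this purpose), a polling version might look like:

```python
import time

# Sketch of the wait-for-element building block. `driver` is any Selenium
# WebDriver; the name, timeout, and poll interval are illustrative.
def wait_for_element(driver, xpath, timeout=10, poll=0.5):
    """Pause until an element matching `xpath` is present, or time out."""
    deadline = time.time() + timeout
    while True:
        try:
            return driver.find_element_by_xpath(xpath)  # first match wins
        except Exception:  # not present yet (NoSuchElementException in practice)
            if time.time() > deadline:
                raise TimeoutError(f"no element matching {xpath!r}")
            time.sleep(poll)
```

In production code, Selenium's own WebDriverWait is usually preferable, since it handles stale elements and polling for you.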
Xpaths offer a way to refer to a specific element on a page, and are key to informing the scraper which element to extract. Often, multiple xpaths can be used to refer to the same element. Not all of them, however, are equal: some are more robust than others, particularly in their resilience to sudden changes in site structure.
A brief primer on xpaths
To view the HTML code from a browser, right-click the element of interest and click Inspect. In Chrome, this opens the developer tools with the element highlighted in the HTML source.
In the example, we want to locate the element that contains the number of contributors. From the HTML source, we observe that the element is nested many layers beneath the body element. The complete level-by-level xpath spells out every one of those ancestors, which makes it long and brittle.
If this element is the first “summary” element on the page, we can instead refer to it by position.
We can also use the class attribute to refer to it, though be wary if multiple elements share the same class.
Finally, we can search for it by its text content as well.
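To make these four approaches concrete, here is a small runnable illustration. The HTML is a made-up, stripped-down stand-in for the real page (the class name is invented), and Python's xml.etree supports only a subset of XPath, so the text-based contains() search is shown as a string that would work in Selenium but not here:

```python
import xml.etree.ElementTree as ET

# A made-up, stripped-down stand-in for the real GitHub page.
html = ("<body><div><details>"
        "<summary class='num-contributors'>42 contributors</summary>"
        "</details></div></body>")
root = ET.fromstring(html)

# 1. Level-by-level: spell out every ancestor (long and brittle).
by_levels = root.find("./div/details/summary")

# 2. Positional: the first <summary> in the document.
by_position = root.findall(".//summary")[0]

# 3. By class attribute (class name invented for this example).
by_class = root.find(".//summary[@class='num-contributors']")

# 4. By text content -- valid in Selenium, unsupported by xml.etree:
by_text_xpath = '//summary[contains(text(), "contributors")]'

print(by_levels.text)  # 42 contributors
```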
Choice of xpaths
With so many possible ways to refer to an element, it is not always obvious which to use. Below, I shall list several considerations for deciding on a suitable xpath.
How likely is the site to change in the future?
The primary challenge when deciding on a suitable xpath to use lies in choosing one that is resistant to future changes. Hence, we really only need to be concerned when the site is likely to experience changes in structure, whether intentionally to prevent scrapers or unintentionally as part of updates. If changes are unlikely to occur, then the most convenient xpath can simply be chosen.
If a site is likely to change, will it be the text or structure to change?
Some sites undergo changes in their text or structure due to updates or revamps. Unless it is a complete overhaul where the underlying structure of the site changes, specific element attributes such as an id or class will likely withstand such changes.
Other sites actively modify their element structure by changing element attributes such as id (possibly to hinder scrapers). These are a little trickier to deal with, but with careful selection of anchor points (specific elements on a page used as reference markers), a fairly robust xpath can still be generated. Using text as an identifier, rather than element attributes, also helps circumvent such changes.
Are there any good anchor points?
One method of generating robust xpaths is to base them on good anchor points. Anchor points are elements with unique characteristics that make them easily identifiable, and that are unlikely to change significantly. For example, section headers, navigation buttons or field titles make good anchor points because they are part of the core layout of the page.
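As a hypothetical illustration of the idea, the xpaths below start from a stable, human-visible section header and step to the volatile element relative to it; the header text and the sibling structure are assumptions, not taken from any specific site:

```python
# Anchor: a section header whose text rarely changes.
anchor = "//h2[contains(text(), 'Job description')]"

# The volatile element, located relative to the anchor: here, the first
# <div> sibling after the header (the structure is assumed).
description_xpath = anchor + "/following-sibling::div[1]"
```

Even if the target div's id or class is scrambled in a site update, this xpath keeps working as long as the header survives.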
Loading page elements
‘Load more’ button
Present on sites such as LinkedIn, these buttons require a click to display more items. Simply find a suitable xpath for the button and execute Selenium's click method on it. For a more robust solution, use the wait function from the building blocks section to ensure that the button is present before attempting to click it.
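A minimal sketch of the pattern, assuming a Selenium driver; the button xpath, the function name and the click cap are all illustrative (the real button text varies by site):

```python
# Sketch: click 'Load more' until the button disappears or a cap is hit.
# The default xpath and max_clicks are assumptions for illustration.
def click_load_more(driver,
                    xpath="//button[contains(text(), 'Load more')]",
                    max_clicks=10):
    clicks = 0
    while clicks < max_clicks:
        try:
            button = driver.find_element_by_xpath(xpath)
        except Exception:  # button gone -- everything is loaded
            break
        button.click()
        clicks += 1
    return clicks
```

Capping the number of clicks guards against pages that keep loading items indefinitely.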
Scroll to end
This feature is similar to the ‘Load more’ button, but activates when the user scrolls to the bottom of the page. To trigger it, execute a JavaScript scroll command through the driver's execute_script method.
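The core of the trigger is a one-line JavaScript call; wrapping it in a loop with a pause gives a usable sketch (the number of scrolls and the delay are assumptions to tune per site):

```python
import time

SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"

# Sketch: scroll to the bottom a few times so lazy-loaded items appear.
# `driver` is a Selenium WebDriver; scrolls and delay are illustrative.
def scroll_to_end(driver, scrolls=3, delay=1.0):
    for _ in range(scrolls):
        driver.execute_script(SCROLL_JS)
        time.sleep(delay)  # give new items time to load
```

A more thorough version would compare the page height before and after each scroll and stop once it no longer grows.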
Let’s walk through an example where we scrape the job description in a jobs page from LinkedIn.
A few observations:
The complete description is hidden behind a “Show more” button (the complete description is actually available in the HTML code, but for this example let’s pretend otherwise).
The “Show more” button acts as a good anchor point as it is right below the job description, and likely to remain there barring any UI overhauls.
As such, I would scrape the description as follows. First, get and click the button element:
from selenium import webdriver
driver = webdriver.Chrome()
button = driver.find_element_by_xpath('//button[contains(text(), "Show more")]')
button.click()
I specifically use the text search function since LinkedIn might modify the element attributes, but the words will likely remain unchanged. Next, we navigate to and extract the text from the element containing the description
description_element = button.find_element_by_xpath('../div')
description = description_element.text
This gives you the raw text containing the description. At this point, some text processing might be necessary depending on the format of the text.
This guide should give you a good starting point for designing your own web scraper. Building a robust scraper can be challenging because of the need to choose good xpaths. By outlining some of the thought processes involved in deciding how to scrape, I hope to help anyone building their own scraper start thinking along these lines.