JavaScript and Dynamic Web Crawling
Web pages are made up of three components: HTML, CSS, and JavaScript.
HTML defines the structure of the web page, while CSS defines the style of the web page.
JavaScript is a language that makes web pages dynamic.
The term dynamic means the content of the web page can change in response to interactions with the user or certain events.
For instance, using JavaScript, you can display new content when a user clicks a button, or load additional content as the user scrolls.
Such content is not part of the HTML that arrives when the web page is first loaded; it is generated afterwards, as JavaScript is executed in the web browser.
The Limits of BeautifulSoup
BeautifulSoup parses HTML to extract data.
However, it only reads the HTML text it is given and does not execute JavaScript, so content generated dynamically with JavaScript cannot be fetched with BeautifulSoup.
Example of Crawling Code that Doesn't Work with BeautifulSoup
Let's look at code that attempts to fetch the current temperature and perceived temperature from a weather website using BeautifulSoup.
import requests
from bs4 import BeautifulSoup
# Weather website URL
url = 'https://www.weather.example.com/current'
# Sending request to the page
response = requests.get(url)
# Parsing HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = soup.find('span', class_='tmp')
feels_like_element = soup.find('span', class_='feels')
# Extracting text
temperature = temperature_element.text.strip() if temperature_element else 'N/A'
feels_like = feels_like_element.text.strip() if feels_like_element else 'N/A'
# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")
This code tries to fetch weather information from a weather website using BeautifulSoup, but soup.find() returns None for both temperature_element and feels_like_element, so the script prints N/A for each value.
This is because requests only fetches the HTML as it exists before JavaScript is executed, and the elements the code looks for are not in that HTML yet.
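The failure can be reproduced offline. The sketch below assumes a made-up response body (RAW_HTML, the weather container, and the /static/weather.js path are all hypothetical) that resembles what a JavaScript-driven page might send before any script has run. BeautifulSoup can see the script tag, but not the spans that the script would create later in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: roughly what a server might send
# before any JavaScript has run. Not taken from a real website.
RAW_HTML = """
<html>
  <body>
    <div id="weather"></div>
    <script src="/static/weather.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The spans would only be created by weather.js inside a browser,
# so they are missing from the raw HTML.
temperature_element = soup.find("span", class_="tmp")
container = soup.find("div", id="weather")
script = soup.find("script")

print(temperature_element)        # None: the span does not exist yet
print(repr(container.get_text())) # '': the container is still empty
print(script["src"])              # the script is referenced but never executed
```

Nothing in this pipeline runs the referenced script, which is why the lookup falls through to 'N/A' in the example above.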
Dynamic Web Crawling with Selenium
The weather website uses JavaScript to dynamically display weather information, so BeautifulSoup alone can't successfully fetch that data.
But by using Selenium, which allows you to interact with browsers, you can capture the web page after JavaScript has run and extract the required data.
Note: To run this code on your computer, you need to install the Selenium library with the command pip install selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
# Open Chrome browser
driver = webdriver.Chrome()
# Open the weather forecast page
url = "https://www.weather.example.com/current"
driver.get(url)
# Finding temperature and perceived temperature
# 'tmp' class indicates the current temperature, 'feels' class indicates perceived temperature
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')
feels_like_element = driver.find_element(By.CLASS_NAME, 'feels')
# Extracting text
temperature = temperature_element.text
feels_like = feels_like_element.text
# Outputting results
print(f"Today's temperature: {temperature}")
print(f"Perceived temperature: {feels_like}")
# Closing WebDriver
driver.quit()
Detailed Code Explanation
- driver = webdriver.Chrome(): Opens Chrome browser and creates a driver object.
- driver.get(url): Navigates to the specified URL (weather website).
- temperature_element = driver.find_element(By.CLASS_NAME, 'tmp'): Finds the element with the tmp class and stores it in temperature_element.
- feels_like_element = driver.find_element(By.CLASS_NAME, 'feels'): Finds the element with the feels class and stores it in feels_like_element.
- temperature = temperature_element.text: Extracts the text from temperature_element and stores it in temperature.
- feels_like = feels_like_element.text: Extracts the text from feels_like_element and stores it in feels_like.
- driver.quit(): Closes the WebDriver.
Example output:
Today's temperature: 30.4℃
Perceived temperature: Feels like(30.6℃)
Using Selenium in this way allows you to crawl content that is dynamically generated with JavaScript.
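If you would rather keep the BeautifulSoup extraction logic from the first example, one option is to let Selenium load the page and then hand driver.page_source (the source of the page currently open in the browser) to BeautifulSoup. The following is a sketch under that assumption; the function names extract_weather and crawl_weather are hypothetical, and the span tags and class names are the ones used earlier in this lesson.

```python
from bs4 import BeautifulSoup


def extract_weather(rendered_html):
    """Pull the two temperature texts out of already-rendered HTML.

    Returns a (temperature, feels_like) tuple, with 'N/A' for anything missing.
    """
    soup = BeautifulSoup(rendered_html, "html.parser")
    temperature_element = soup.find("span", class_="tmp")
    feels_like_element = soup.find("span", class_="feels")
    temperature = temperature_element.text.strip() if temperature_element else "N/A"
    feels_like = feels_like_element.text.strip() if feels_like_element else "N/A"
    return temperature, feels_like


def crawl_weather(driver, url):
    """Load `url` in an existing Selenium driver and parse what the browser holds."""
    driver.get(url)
    # page_source is read from the browser, so it can include
    # elements that JavaScript has added since the page loaded.
    return extract_weather(driver.page_source)
```

For example, with driver = webdriver.Chrome(), calling crawl_weather(driver, url) returns the two values as a tuple; remember to call driver.quit() afterwards. As with find_element, the spans must already exist at the moment page_source is read.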