Limitations of BeautifulSoup in Web Crawling
The `requests` and `BeautifulSoup` libraries are well suited to crawling static websites, that is, sites whose HTML does not change after it is served. Modern websites, however, load data dynamically in response to user interactions: the browser requests additional data from the server, and JavaScript processes the response and renders it on the screen.
The `requests` library fetches only the static HTML, and `BeautifulSoup` parses that fetched HTML. Data generated or modified by JavaScript never appears in the HTML that `requests` returns, so this traditional approach cannot crawl JavaScript-rendered content.
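A minimal illustration of this limitation (the HTML snippet below is a made-up example, not from any real site): the server's static HTML contains an empty `<div>` that a script would fill in the browser, so `BeautifulSoup` sees nothing.

```python
from bs4 import BeautifulSoup

# Static HTML as a server might return it: the <div> is empty, and the
# <script> would only fill it after the page loads in a real browser.
html = """
<html><body>
  <div id="temperature"></div>
  <script>
    document.getElementById("temperature").textContent = "86°F";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="temperature")
print(repr(div.get_text()))  # → '' — the script never ran, so the div is empty
```

The parser sees the `<script>` tag only as text; it never executes it, which is exactly why a real browser is needed for dynamic pages.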
How to Crawl Dynamic Data?
With `Selenium`, however, you can launch a real web browser and crawl the DOM after JavaScript has executed. Weather sites, for example, render their readings dynamically with JavaScript, so fetching them with `BeautifulSoup` alone is not enough. Selenium drives an actual browser, lets the JavaScript run, and gives you access to the rendered page, which solves this problem.
Note: To run the practice code on your computer, install the Selenium library with `pip install selenium`.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open Chrome browser
driver = webdriver.Chrome()

# Open weather forecast page
url = "https://www.weather.gov/"
driver.get(url)

# Find temperature and feels-like temperature:
# the 'tmp' class holds the current temperature, the 'chill' class the feels-like temperature
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')
feels_like_element = driver.find_element(By.CLASS_NAME, 'chill')

# Extract text
temperature = temperature_element.text
feels_like = feels_like_element.text

# Print results
print(f"Today's temperature: {temperature}")
print(f"Feels-like temperature: {feels_like}")

# Close WebDriver
driver.quit()
```
Detailed Explanation of the Code
- `driver = webdriver.Chrome()`: Opens a Chrome browser and creates a `driver` object
- `driver.get(url)`: Navigates to the specified URL (the weather website)
- `temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')`: Finds an element with the `tmp` class and stores it in `temperature_element`
- `feels_like_element = driver.find_element(By.CLASS_NAME, 'chill')`: Finds an element with the `chill` class and stores it in `feels_like_element`
- `temperature = temperature_element.text`: Extracts the text from `temperature_element` and stores it in `temperature`
- `feels_like = feels_like_element.text`: Extracts the text from `feels_like_element` and stores it in `feels_like`
- `driver.quit()`: Closes the WebDriver
Example output:

```
Today's temperature: 86°F
Feels-like temperature: Feels like (87°F)
```
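On real pages, a dynamically rendered element may not exist yet at the moment `find_element` runs. A sketch using Selenium's explicit waits (the `tmp` and `chill` class names are the same assumptions as in the example above, and `read_weather` is a hypothetical helper):

```python
def read_weather(url, timeout=10):
    """Open the page in Chrome and wait for the temperature element to render."""
    # Selenium is an optional dependency, so it is imported inside the function;
    # this lets the sketch be loaded even where no browser is installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until JavaScript has inserted the element,
        # or raise TimeoutException after `timeout` seconds
        temperature_element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "tmp"))
        )
        feels_like_element = driver.find_element(By.CLASS_NAME, "chill")
        return temperature_element.text, feels_like_element.text
    finally:
        # Close the browser even if the wait times out
        driver.quit()
```

`WebDriverWait` polls the DOM instead of failing immediately, which makes crawls of JavaScript-heavy pages far more reliable than a bare `find_element` call.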
By using Selenium this way, you can crawl content dynamically generated by JavaScript.
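The two libraries can also be combined: let Selenium render the page, then hand `driver.page_source` to `BeautifulSoup` for parsing. A sketch under the same `tmp`/`chill` class-name assumptions as above (`extract_weather` is a hypothetical helper):

```python
from bs4 import BeautifulSoup

def extract_weather(page_html):
    """Parse already-rendered HTML (e.g. driver.page_source) with BeautifulSoup."""
    soup = BeautifulSoup(page_html, "html.parser")
    temp = soup.find(class_="tmp")
    chill = soup.find(class_="chill")
    # Return None for any element missing from the page
    return (temp.get_text(strip=True) if temp else None,
            chill.get_text(strip=True) if chill else None)

# With Selenium, you would call it on the rendered DOM:
#   driver.get(url)
#   print(extract_weather(driver.page_source))

# Demo on a hand-written fragment:
sample = '<p class="tmp">86°F</p><p class="chill">Feels like (87°F)</p>'
print(extract_weather(sample))  # → ('86°F', 'Feels like (87°F)')
```

This split keeps Selenium's job small (rendering) and reuses BeautifulSoup's familiar search API for the actual extraction.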
For more detailed information about the Selenium library, you can learn from the Automation with Python and AI Curriculum!