
Limitations of BeautifulSoup in Web Crawling

The requests and BeautifulSoup libraries are suited to crawling static websites, that is, sites whose content does not change after the server delivers the page.

Many modern websites, however, handle data dynamically based on user interactions: the browser requests additional data from the server, and JavaScript processes the response and renders it on the screen.

The requests library fetches only static HTML, and BeautifulSoup parses the HTML that was fetched.

Since data generated or modified by JavaScript is never part of the HTML that requests downloads, this traditional approach cannot crawl JavaScript-rendered data.
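To see the limitation concretely, here is a minimal sketch. The HTML string below is a made-up stand-in for what requests would download: BeautifulSoup can only parse what is already in that HTML, so a placeholder element that JavaScript would later fill in stays empty.

```python
from bs4 import BeautifulSoup

# A stand-in for the HTML that requests would download:
# the temperature div is empty because JavaScript fills it in the browser.
static_html = """
<html>
  <body>
    <h1>Weather</h1>
    <div id="temperature"></div>  <!-- populated later by JavaScript -->
  </body>
</html>
"""

soup = BeautifulSoup(static_html, "html.parser")
temperature = soup.find("div", id="temperature")
print(repr(temperature.text))  # '' — the JS-rendered value is not in the HTML
```

No matter how the page looks in a browser, BeautifulSoup only ever sees this empty placeholder.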


How to Crawl Dynamic Data?

With Selenium, however, you can launch a real web browser and crawl the DOM after JavaScript has executed.

Sites such as weather services render their data dynamically with JavaScript, so requests and BeautifulSoup alone cannot fetch it.

Selenium solves this problem by reading the page only after JavaScript has executed in a real browser.

Note: To run the practice code on your computer, first install the Selenium library with pip install selenium.

Dynamic Web Crawling with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open Chrome browser
driver = webdriver.Chrome()

# Wait up to 5 seconds for elements that JavaScript may still be rendering
driver.implicitly_wait(5)

# Open weather forecast page
url = "https://www.weather.gov/"
driver.get(url)

# Find temperature and feels-like temperature
# 'tmp' class represents current temperature, 'chill' class represents feels-like temperature
temperature_element = driver.find_element(By.CLASS_NAME, 'tmp')
feels_like_element = driver.find_element(By.CLASS_NAME, 'chill')

# Extract text
temperature = temperature_element.text
feels_like = feels_like_element.text

# Print results
print(f"Today's temperature: {temperature}")
print(f"Feels-like temperature: {feels_like}")

# Close WebDriver
driver.quit()

Detailed Explanation of the Code

  • driver = webdriver.Chrome(): Opens Chrome browser and creates a driver object

  • driver.get(url): Navigates to the specified URL (weather website)

  • temperature_element = driver.find_element(By.CLASS_NAME, 'tmp'): Finds an element with the tmp class and stores it in temperature_element

  • feels_like_element = driver.find_element(By.CLASS_NAME, 'chill'): Finds an element with the chill class and stores it in feels_like_element

  • temperature = temperature_element.text: Extracts text from temperature_element and stores it in temperature

  • feels_like = feels_like_element.text: Extracts text from feels_like_element and stores it in feels_like

  • driver.quit(): Closes the WebDriver


Example Output
Today's temperature: 86°F
Feels-like temperature: Feels like (87°F)
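The raw .text values often include labels and units, as in the sample output above. A small sketch for pulling out just the numeric value with a regular expression (the input strings here simply mirror the sample output, not live data from the site):

```python
import re

def extract_degrees(text: str) -> int:
    """Pull the first Fahrenheit temperature (e.g. 86 from '86°F') out of a string."""
    match = re.search(r"(-?\d+)\s*°?F", text)
    if match is None:
        raise ValueError(f"No Fahrenheit temperature found in: {text!r}")
    return int(match.group(1))

# Strings mirroring the sample output above
print(extract_degrees("86°F"))               # 86
print(extract_degrees("Feels like (87°F)"))  # 87
```

Converting to int makes the values usable for comparisons or logging, rather than keeping them as display strings.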

By using Selenium this way, you can crawl content dynamically generated by JavaScript.

For more detailed information about the Selenium library, you can learn from the Automation with Python and AI Curriculum!
