Parsing HTML with BeautifulSoup
Collecting HTML is only the first step of web crawling. To get the information you want, you need to extract it from the HTML you collected.
BeautifulSoup is a Python package that makes this task easy: it parses (analyzes and extracts data from) HTML fetched with requests.
Parsing HTML Data and Extracting Necessary Information
Using BeautifulSoup, you can convert an HTML document into a Python object, making it easy to navigate and manipulate each element of the document with Python code.
Let's go over how to parse HTML data and extract the necessary information using BeautifulSoup.
Parsing HTML with BeautifulSoup
First, you need to convert the HTML data fetched from a web page into a BeautifulSoup object.
The following code uses the requests package to fetch HTML data and then creates a BeautifulSoup object to parse it:
import requests
from bs4 import BeautifulSoup
# URL to request
url = 'https://www.codefriends.net'
# Fetch HTML data with a GET request
response = requests.get(url)
# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the title tag of the HTML
title = soup.title.text
# Print the page title
print(f"Page Title: {title}")
The above code stores a BeautifulSoup object containing the parsed HTML in the soup variable, and extracts the title of the HTML document using soup.title.text.
soup.title returns the <title> tag of the HTML document, and .text extracts the text inside that tag.
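To see the difference between a tag and its text without making a network request, here is a minimal sketch that parses a small inline HTML string. The html string and its contents are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document (illustrative only)
html = """
<html>
  <head><title>Sample Page</title></head>
  <body><p>Hello, world!</p></body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# soup.title is a Tag object representing the whole <title> element
title_tag = soup.title
print(title_tag)        # <title>Sample Page</title>

# .text returns only the text inside the tag
title_text = title_tag.text
print(title_text)       # Sample Page

# A tag that does not exist comes back as None,
# so check for None before reading .text on real pages
h1_tag = soup.h1
print(h1_tag is None)   # True
```

Because a missing tag is returned as None, reading .text on it directly would raise an AttributeError. Checking for None first is a reasonable habit when the page structure is not under your control.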
Extracting Required Information
Various methods can be used to extract information, as shown below.
- Finding elements by tag name: Locate specific tags in the HTML document.
# Find all <a> tags
links = soup.find_all('a')
# Print all links
for link in links:
    print(link.get('href'))
- Finding elements by class name: Locate elements by a specific class name.
# Find all <div> tags with class="example"
divs = soup.find_all('div', class_='example')
# Print the text of all <div> tags
for div in divs:
    print(div.text)
- Finding elements by ID: Locate an element by a specific ID.
# Find element with id="main-content"
main_content = soup.find(id='main-content')
# Print the text of the selected element
print(main_content.text)
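The three lookups above can be tried together on a small inline document. This is a sketch under stated assumptions: the HTML string, the sidebar ID, and the example.com URL are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to demonstrate the lookups
html = """
<div id="main-content">
  <div class="example">First block</div>
  <div class="example">Second block</div>
  <a href="/about">About</a>
  <a href="https://example.com">External</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# By tag name: collect the href attribute of every <a> tag
hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)

# By class name: class_ has a trailing underscore because
# "class" is a reserved word in Python
example_texts = [div.text for div in soup.find_all('div', class_='example')]
print(example_texts)

# By ID: find() returns the first match, or None if nothing matches
main_content = soup.find(id='main-content')
main_text = None
if main_content is not None:
    # get_text() with strip=True trims whitespace around each text piece
    main_text = main_content.get_text(separator=' ', strip=True)
    print(main_text)

# "sidebar" is a hypothetical ID that is not in the document
missing = soup.find(id='sidebar')
print(missing)
```

find_all() always returns a list (possibly empty), while find() returns a single element or None, so only the find() result needs a None check before use.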
Extracting Article Titles and Links from a Web Page
Below is an example of extracting article titles and links from an actual web page:
import requests
from bs4 import BeautifulSoup
# URL of the web page to be scraped
url = 'https://news.ycombinator.com/'
# Fetch HTML data with a GET request
response = requests.get(url)
# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all article titles and links
articles = soup.find_all('a')
# Print article titles and links
for article in articles:
    # Extract article title
    title = article.text
    # Link URL
    link = article.get('href')
    # Print title and link
    print(f"Title: {title}, Link: {link}")
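The above code searches for <a> tags on the Y Combinator news page (Hacker News) to extract article titles and links.
Note that find_all('a') returns every link on the page, so the output also includes navigation and other non-article links.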
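Since find_all('a') matches every link on a page, one way to narrow the results is to look only inside the element that wraps each article title. The sketch below runs on made-up markup. The class name titleline is an assumption for illustration, so inspect the real page in your browser's developer tools and substitute whatever wrapper it actually uses.

```python
from bs4 import BeautifulSoup

# Made-up markup: the "titleline" class is an assumption,
# not a guaranteed description of any real site's HTML
html = """
<a href="/login">login</a>
<table>
  <tr><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/b">Second story</a></span></td></tr>
</table>
<a href="/newest">More</a>
"""

soup = BeautifulSoup(html, 'html.parser')

articles = []
# Look only at the wrappers that hold article titles
for span in soup.find_all('span', class_='titleline'):
    link = span.find('a')
    # Skip wrappers that contain no link
    if link is None:
        continue
    articles.append((link.text, link.get('href')))

# The login and "More" links are not included
for title, href in articles:
    print(f"Title: {title}, Link: {href}")
```

The same loop works on response.text from a live request, provided the wrapper class matches the page's current markup.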
As seen, using BeautifulSoup allows you to easily analyze the structure of a web page to extract the desired data.