Parsing HTML with BeautifulSoup

Web crawling collects raw HTML, but to get the information you actually want, you still need to extract it from that HTML.

BeautifulSoup is a Python package that makes this easy: it parses (analyzes and extracts data from) HTML, such as the HTML fetched with requests.

Parsing HTML Data and Extracting Necessary Information

Using BeautifulSoup, you can convert an HTML document into a Python object, making it easy to navigate and manipulate each element of the document with Python code.

Let's go over how to parse HTML data and extract the necessary information using BeautifulSoup.

Parsing HTML with BeautifulSoup

First, you need to convert the HTML data fetched from a web page into a BeautifulSoup object.

You can fetch the HTML with the requests package and then create a BeautifulSoup object to parse it, as follows:

Parsing HTML with BeautifulSoup
import requests
from bs4 import BeautifulSoup

# URL to request
url = 'https://www.codefriends.net'

# Fetch HTML data with a GET request (the timeout prevents hanging indefinitely)
response = requests.get(url, timeout=10)

# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title tag of the HTML
title = soup.title.text

# Print the page title
print(f"Page Title: {title}")

The above code stores the parsed HTML as a BeautifulSoup object in the soup variable and extracts the document title with soup.title.text.

soup.title returns the <title> tag of the HTML document (or None if the document has no <title> tag), and .text extracts the text inside that tag.
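
Because soup.title can be None, calling .text on it directly may raise an AttributeError on pages without a <title> tag. The sketch below shows one way to guard against that. The helper name get_page_title and the two HTML strings are hypothetical, standing in for response.text so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippets standing in for response.text
html_with_title = "<html><head><title>Example Page</title></head><body></body></html>"
html_without_title = "<html><head></head><body><p>No title here</p></body></html>"


def get_page_title(html):
    """Return the stripped <title> text, or None if the page has no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is a Tag object, or None when no <title> tag exists
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


print(get_page_title(html_with_title))     # Example Page
print(get_page_title(html_without_title))  # None
```

Returning None here is only one option; a default string such as "(untitled)" would work equally well, depending on how the result is used.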

Extracting Required Information

Various methods can be used to extract information, as shown below.

Finding elements by tag name: Locate specific tags in the HTML document.

Finding elements by tag name
# Find all <a> tags
links = soup.find_all('a')

# Print all links
for link in links:
    print(link.get('href'))
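
Links found this way are often relative paths, and some <a> tags have no href at all, in which case link.get('href') returns None. As a minimal sketch, assuming you want absolute URLs, the standard library's urljoin can resolve each href against the page URL. The HTML snippet below is hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL of the page the HTML came from (taken from the earlier example)
base_url = "https://www.codefriends.net"

# Hypothetical markup for illustration
html = """
<a href="/courses">Courses</a>
<a href="https://example.com/docs">Docs</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute
absolute_links = [
    urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```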

Finding elements by class name: Locate elements by a specific class name.

Finding elements by class name
# Find all <div> tags with class="example"
divs = soup.find_all('div', class_='example')

# Print the text of all <div> tags
for div in divs:
    print(div.text)
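
Note that class_='example' matches any element whose class list contains example, even when other classes are present. The same search can also be written as a CSS selector with soup.select. The sketch below compares the two on a hypothetical HTML snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div class="example">First</div>
<div class="example highlighted">Second</div>
<div class="other">Third</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword-argument style: matches divs whose class list includes "example"
by_find_all = [
    div.get_text(strip=True) for div in soup.find_all("div", class_="example")
]

# CSS-selector style: the equivalent query
by_select = [div.get_text(strip=True) for div in soup.select("div.example")]

print(by_find_all)
print(by_select)
```

Which style to use is largely a matter of taste; selectors tend to be more compact when several conditions are combined.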

Finding elements by ID: Locate an element by a specific ID.

Finding elements by ID
# Find element with id="main-content"
main_content = soup.find(id='main-content')

# Print the text of the selected element
print(main_content.text)
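
soup.find returns None when nothing matches, so calling .text on the result raises an AttributeError if the page has no element with that ID. One possible way to handle this is a small helper that falls back to a default value. The name text_by_id and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div id="main-content"><p>Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")


def text_by_id(soup, element_id, default=""):
    """Return the text of the element with the given id, or a default."""
    element = soup.find(id=element_id)
    # find() returns None when no element has this id
    if element is None:
        return default
    return element.get_text(strip=True)


print(text_by_id(soup, "main-content"))
print(text_by_id(soup, "sidebar", default="(not found)"))
```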

Extracting Article Titles and Links from a Web Page

Below is an example of extracting article titles and links from an actual web page:

Extracting Article Titles and Links
import requests
from bs4 import BeautifulSoup

# URL of the web page to be scraped
url = 'https://news.ycombinator.com/'

# Fetch HTML data with a GET request (the timeout prevents hanging indefinitely)
response = requests.get(url, timeout=10)

# Create BeautifulSoup object and parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find every <a> tag on the page
articles = soup.find_all('a')

# Print the text and URL of each link
for article in articles:
    # Link text (the article title, for article links)
    title = article.text

    # Link URL
    link = article.get('href')

    print(f"Title: {title}, Link: {link}")

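The above code collects every <a> tag on the Hacker News front page (news.ycombinator.com) and prints its text and URL. Because it matches all links, the output includes navigation and comment links as well as article titles.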
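
To keep only the article links, you can narrow the search with a CSS selector. The sketch below assumes each article title is an <a> tag directly inside a span with class titleline. That class name is an assumption about the page's markup, which can change at any time, so check the real HTML in your browser's developer tools first. The snippet runs on hypothetical sample HTML rather than the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the "titleline" class is an assumption about the page layout
html = """
<a href="newest">new</a>
<span class="titleline"><a href="https://example.com/a">First story</a></span>
<span class="titleline"><a href="https://example.com/b">Second story</a></span>
<a href="login">login</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only <a> tags that are direct children of span.titleline
stories = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("span.titleline > a")
]

for title, link in stories:
    print(f"Title: {title}, Link: {link}")
```

With the live page, you would pass response.text to BeautifulSoup instead of the sample string.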

As this shows, BeautifulSoup makes it easy to walk the structure of a web page and extract the data you want.

