
What is BeautifulSoup?

BeautifulSoup is a Python library for web scraping: it parses HTML and XML documents and lets you extract data from them.

Parsing is the process of analyzing a webpage's HTML document and turning it into a structure from which the desired data can be extracted. BeautifulSoup makes this task straightforward.
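
To make the idea of parsing concrete, here is a minimal sketch. The HTML string and the variable names are made up for illustration; the only assumption is that the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = "<html><body><h1>Hello</h1><p>First paragraph.</p></body></html>"

# Parse the raw string into a tree of tags using Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Once parsed, tags can be reached by name and their text read out
heading = soup.h1.get_text()
paragraph = soup.p.get_text()

print(heading)    # Hello
print(paragraph)  # First paragraph.
```

The raw string is just text; after parsing, each tag becomes an object you can navigate and query.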


Features and Characteristics of BeautifulSoup

  1. Support for Various Parsers

    • BeautifulSoup supports several types of parsers for parsing HTML/XML documents.

    • The most commonly used parsers are html.parser (included in Python's standard library) and lxml (a third-party package that must be installed separately).

  2. Easy Data Extraction

    • It lets you search for elements by tag name, ID, class, and other attributes.

    • It can effectively extract various elements of a webpage, such as text and attribute values.

  3. Handling Complex HTML Structures

    • It can easily navigate and extract required data from nested tags or complex HTML structures.

    • It uses the parent, child, and sibling relationships between tags to pinpoint where the data is located.

  4. Flexible Search Methods

    • It allows data searching using various methods, such as CSS selectors and regular expressions.

    • It can also find data with specific patterns by combining multiple conditions.
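
The features listed above can be sketched in one short script. The HTML snippet, URLs, and variable names below are hypothetical and exist only for illustration; the script uses html.parser so that nothing beyond bs4 needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<div id="main">
  <ul class="menu">
    <li class="item"><a href="/home">Home</a></li>
    <li class="item active"><a href="/docs">Docs</a></li>
    <li class="item"><a href="https://example.com/blog">Blog</a></li>
  </ul>
</div>
"""

# Parser choice is the second argument; "lxml" could be passed instead
# if that package is installed
soup = BeautifulSoup(html, "html.parser")

# Search by id and by class (class_ avoids the Python keyword `class`)
main_div = soup.find(id="main")
items = soup.find_all("li", class_="item")

# Navigate the hierarchy: from a link up to its parent <li>
docs_link = soup.find("a", href="/docs")
parent_classes = docs_link.parent["class"]

# CSS selector: links inside the active menu item
active_links = [a.get_text() for a in soup.select("ul.menu li.active a")]

# Combined conditions: tag name plus a regular expression on an attribute
external = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))]

print(main_div.name)    # div
print(len(items))       # 3
print(parent_classes)   # ['item', 'active']
print(active_links)     # ['Docs']
print(external)         # ['https://example.com/blog']
```

Note that class_="item" also matches the tag whose class attribute is "item active", because BeautifulSoup treats class as a multi-valued attribute.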


Usage

Example of Using BeautifulSoup Library
from bs4 import BeautifulSoup

# Example of an HTML document
html_doc = """
<html>
<head>
<title>The Codefriends' story</title>
</head>
<body>
<p class="title"><b>The Codefriends' story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Creating a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Extracting contents of the HTML title tag
title = soup.title.text
print('Title:', title)  # Output: Title: The Codefriends' story

print('-' * 10)

# Extracting href attribute values of 'a' tags
for link in soup.find_all('a'):
    # Print the URL stored in each link's href attribute
    print(link.get('href'))
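
As a possible next step, the sketch below collects several values per link at once. It uses a shortened copy of the example document so that it runs on its own; the `links` and `second` names are illustrative choices, not part of the original lesson.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the id, text, and href of every <a class="sister"> tag
links = [
    {"id": a["id"], "text": a.get_text(), "href": a["href"]}
    for a in soup.select("a.sister")
]

for entry in links:
    print(entry["id"], entry["text"], entry["href"])

# Look up a single tag by its id
second = soup.find(id="link2")
print(second.get_text())  # Lacie
```

Indexing with a["href"] raises a KeyError when the attribute is missing, whereas a.get("href") returns None, so get is the safer choice for pages you do not control.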

