Scraping Wikipedia Homepage Information with Python
Wikipedia is an online encyclopedia created by people worldwide.
In this lesson, we will learn how to collect specific information from a Wikipedia page using Python code.
Using the BeautifulSoup and requests libraries, you can extract the title and description from the Wikipedia homepage as shown below.
Step 1: Import Necessary Libraries
import requests
from bs4 import BeautifulSoup
This code performs the following:
- Uses the import keyword to load the requests library for HTTP communication
- Uses the from ... import form to load the BeautifulSoup class from the bs4 package, which is used for web scraping
Step 2: Retrieve and Store HTML from the URL
Use requests to retrieve the HTML of a webpage, then parse it with BeautifulSoup and store the result in a variable as follows.
# Wikipedia homepage URL
url = "https://www.wikipedia.org"
# Fetch HTML from the URL using the requests library
response = requests.get(url)
# Set the encoding of the fetched HTML to UTF-8
response.encoding = 'utf-8'
# Store the fetched HTML in the soup variable
soup = BeautifulSoup(response.text, 'html.parser')
This code performs the following:
- Stores the Wikipedia homepage URL in the url variable
- Fetches HTML from the URL using requests.get(url)
- Sets the encoding of the response to UTF-8 using response.encoding = 'utf-8'
- Parses the fetched HTML with BeautifulSoup(response.text, 'html.parser') and stores the parsed result in the soup variable
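The fetch step above assumes the request always succeeds. As a minimal sketch, assuming you want failures to surface before parsing, the same step can be wrapped in a helper that sets a timeout and checks the HTTP status. The function name fetch_soup and the 10-second default are illustrative choices, not part of the lesson.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed, raising on network or HTTP errors.

    fetch_soup and its timeout default are hypothetical names for this sketch.
    """
    # Give up instead of waiting indefinitely if the server does not answer
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Same encoding and parser choices as the lesson code
    response.encoding = "utf-8"
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://www.wikipedia.org") would return the same kind of soup object as the lesson code, while a malformed URL or an error status raises an exception derived from requests.RequestException.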
Step 3: Extract Title and Description Information
Extract desired information from the soup variable as shown below.
# Extract the text of the first h1 (heading 1, title) tag on the webpage
h1_title = soup.find('h1').text
# Extract the text of the first p (paragraph) tag on the webpage
p_description = soup.find('p').text
This code performs the following:
- Finds the first h1 tag in the soup variable using soup.find('h1').text to extract the title and stores it in the h1_title variable
- Finds the first p tag in the soup variable using soup.find('p').text to extract the description and stores it in the p_description variable
Finally, use the print function to display the extracted title and description from the URL.
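The extraction and print steps can be sketched as follows. To keep the example runnable without a network connection, it parses a small hand-written HTML string (sample_html, a hypothetical stand-in for response.text) rather than the live Wikipedia page. The helper first_text is also an illustrative addition: soup.find returns None when a tag is missing, so calling .text directly would raise an AttributeError in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text so the sketch runs offline
sample_html = """
<html><body>
<h1>Wikipedia</h1>
<p>The Free Encyclopedia</p>
<p>Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def first_text(soup, tag_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else None


h1_title = first_text(soup, "h1")
p_description = first_text(soup, "p")  # only the first p tag is used
missing = first_text(soup, "h2")       # no h2 in the sample, so this is None

# Display the extracted title and description
print("Title:", h1_title)
print("Description:", p_description)
```

With the sample HTML above, this prints "Title: Wikipedia" and "Description: The Free Encyclopedia". The text on the real homepage may differ.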
Practice
Press the Run Code button on the right to see the scraping results. The first execution may take some time.
You can also change the url address (e.g., https://www.codefriends.net) to fetch information from other web pages.
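If you plan to try several addresses, one option is to wrap the three steps in a function that takes the URL as a parameter. The sketch below is one possible arrangement, not the lesson's own code: the name scrape_page, the 10-second timeout, and the choice to return None values on failure are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def scrape_page(url):
    """Return (title, description) for url; either value may be None.

    scrape_page is a hypothetical helper that combines the lesson's three steps.
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Invalid URL, network failure, or HTTP error status
        return None, None

    # Follows the lesson in forcing UTF-8; other pages may use a different encoding
    response.encoding = "utf-8"
    soup = BeautifulSoup(response.text, "html.parser")

    # Other pages may not contain an h1 or p tag, so check before reading text
    h1 = soup.find("h1")
    p = soup.find("p")
    title = h1.get_text(strip=True) if h1 is not None else None
    description = p.get_text(strip=True) if p is not None else None
    return title, description
```

For example, title, description = scrape_page("https://www.codefriends.net") would fetch that page. What comes back depends on how the page is structured, so either value can be None.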