Scraping Wikipedia Homepage Information with Python
Wikipedia is an online encyclopedia created by people worldwide. 📘
In this lesson, we will learn how to collect specific information from a Wikipedia page using Python code.
Using the BeautifulSoup and requests libraries, you can extract the title and description from the Wikipedia homepage as shown below.
Step 1: Import Necessary Libraries
import requests
from bs4 import BeautifulSoup
This code performs the following:
- Uses the import keyword to load the requests library for HTTP communication
- Uses the from keyword to import the BeautifulSoup class from the bs4 package, which is used for web scraping
Step 2: Retrieve and Store HTML from the URL
Use requests to retrieve the HTML of a webpage, then parse it with BeautifulSoup and store the result in a variable as follows.
# Wikipedia homepage URL
url = "https://www.wikipedia.org"
# Fetch HTML from the URL using the requests library
response = requests.get(url)
# Set the encoding of the fetched HTML to UTF-8
response.encoding = 'utf-8'
# Store the fetched HTML in the soup variable
soup = BeautifulSoup(response.text, 'html.parser')
This code performs the following:
- Stores the Wikipedia homepage URL in the url variable
- Fetches HTML from the URL using requests.get(url)
- Sets the encoding of the response to UTF-8 using response.encoding = 'utf-8'
- Parses the fetched HTML with BeautifulSoup(response.text, 'html.parser') and stores the parsed result in the soup variable
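The fetch-and-parse step above assumes the request succeeds. As a minimal sketch of a more defensive variant, the hypothetical helper fetch_soup below adds a timeout and a status check. The function name, the 10-second timeout, and the http_get parameter (included only so the function can be tried without a network connection) are assumptions for illustration, not part of the lesson code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, http_get=requests.get):
    """Fetch a page and return it parsed with BeautifulSoup.

    fetch_soup, timeout, and http_get are illustrative names, not lesson code.
    """
    # Give up if the server does not respond within `timeout` seconds
    response = http_get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Same encoding step as in the lesson
    response.encoding = "utf-8"
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://www.wikipedia.org") would then return the same kind of object the lesson stores in the soup variable.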
Step 3: Extract Title and Description Information
Extract desired information from the soup variable as shown below.
# Extract the text of the first h1 (heading 1, title) tag on the webpage
h1_title = soup.find('h1').text
# Extract the text of the first p (paragraph) tag on the webpage
p_description = soup.find('p').text
This code performs the following:
- Finds the h1 tag in the soup variable using soup.find('h1').text to extract the title and stores it in the h1_title variable
- Finds the p tag in the soup variable using soup.find('p').text to extract the description and stores it in the p_description variable
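Note that soup.find returns None when no matching tag exists, so accessing .text on the result would raise an AttributeError on a page without an h1 or p tag. A minimal sketch of a safer extraction is shown here; the helper name extract_title_and_description and the sample HTML string are made up for illustration, and the sample is not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup


def extract_title_and_description(html):
    """Return (title, description), using None for any tag that is missing."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")  # first h1 tag, or None
    p = soup.find("p")    # first p tag, or None
    # get_text(strip=True) also trims surrounding whitespace
    title = h1.get_text(strip=True) if h1 else None
    description = p.get_text(strip=True) if p else None
    return title, description


# Hypothetical sample HTML, used so the example runs without a network request
sample_html = "<html><body><h1> Wikipedia </h1><p>The Free Encyclopedia</p></body></html>"
print(extract_title_and_description(sample_html))
# → ('Wikipedia', 'The Free Encyclopedia')
```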
Finally, use the print function to display the extracted title and description from the URL.
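Putting the three steps together, the whole flow might look like the sketch below. The function names, the "Title:" and "Description:" labels, and the timeout value are assumptions added for illustration; the scraping calls themselves are the ones from Steps 1 to 3.

```python
import requests
from bs4 import BeautifulSoup


def scrape_title_and_description(url):
    """Run Steps 2 and 3 for one URL and return (title, description)."""
    response = requests.get(url, timeout=10)  # the timeout is an added assumption
    response.encoding = "utf-8"
    soup = BeautifulSoup(response.text, "html.parser")
    # As in the lesson code, this raises AttributeError if a tag is missing
    return soup.find("h1").text, soup.find("p").text


def format_result(h1_title, p_description):
    """Build the text to print; the labels are illustrative."""
    return f"Title: {h1_title.strip()}\nDescription: {p_description.strip()}"


def main():
    h1_title, p_description = scrape_title_and_description("https://www.wikipedia.org")
    print(format_result(h1_title, p_description))
```

Calling main() performs the network request and prints the two lines. The exact text depends on the page's current HTML.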
Practice
Press the Run Code button on the right to see the scraping results. The first execution may take some time.
You can also change the url address (e.g., https://www.codefriends.net) to fetch information from other web pages.