Scraping Wikipedia Homepage Information with Python
Wikipedia is an online encyclopedia created by people worldwide. 📘
In this lesson, we will learn how to collect specific information from a Wikipedia page using Python code.
Using the BeautifulSoup and requests libraries, you can extract the title and description from the Wikipedia homepage as shown below.
Step 1: Import Necessary Libraries
import requests
from bs4 import BeautifulSoup
This code performs the following:
- Uses the import keyword to load the requests library for HTTP communication
- Uses the from keyword to import the BeautifulSoup class from the bs4 package, which is used for web scraping
Step 2: Retrieve and Store HTML from the URL
Use requests to fetch the HTML of a webpage, then parse it with BeautifulSoup and store the result in a variable as follows.
# Wikipedia homepage URL
url = "https://www.wikipedia.org"
# Fetch HTML from the URL using the requests library
response = requests.get(url)
# Set the encoding of the fetched HTML to UTF-8
response.encoding = 'utf-8'
# Parse the fetched HTML and store the result in the soup variable
soup = BeautifulSoup(response.text, 'html.parser')
This code performs the following:
- Stores the Wikipedia homepage URL in the url variable
- Fetches HTML from the URL using requests.get(url)
- Parses the fetched HTML with BeautifulSoup(response.text, 'html.parser') and stores the parsed result in the soup variable
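The fetch-and-parse step above can fail if the network is down or the server returns an error status. The following is a minimal sketch of a more defensive version, not part of the original lesson. The function name fetch_soup and the 10-second timeout are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed (hypothetical helper for this lesson)."""
    # timeout stops the request from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    # Same encoding choice as the lesson code
    response.encoding = "utf-8"
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://www.wikipedia.org") should give the same kind of soup object as the lesson code. Network and HTTP failures surface as requests.exceptions.RequestException, which the caller can catch.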
Step 3: Extract Title and Description Information
Extract desired information from the soup variable as shown below.
# Extract the text of the first h1 (heading 1, title) tag on the webpage
h1_title = soup.find('h1').text
# Extract the text of the first p (paragraph) tag on the webpage
p_description = soup.find('p').text
This code performs the following:
- Finds the h1 tag in the soup variable using soup.find('h1').text to extract the title and stores it in the h1_title variable
- Finds the p tag in the soup variable using soup.find('p').text to extract the description and stores it in the p_description variable
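Note that soup.find returns None when a page has no matching tag, so calling .text on the result raises an AttributeError. This matters when you point the code at other pages. The sketch below shows one way to guard against that. The helper name first_text and the inline HTML are assumptions made for illustration, so the example runs without a network connection.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return default
    # get_text(strip=True) removes leading and trailing whitespace
    return tag.get_text(strip=True)


# Hypothetical page that has a paragraph but no h1 tag
html = "<html><body><p>  Only a paragraph here.  </p></body></html>"
soup = BeautifulSoup(html, "html.parser")

h1_title = first_text(soup, "h1", default="(no h1 found)")
p_description = first_text(soup, "p", default="(no p found)")
```

Here h1_title falls back to the default string instead of raising an error, while p_description holds the paragraph text.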
Finally, use the print function to display the extracted title and description from the URL.
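As a rough illustration of the print step, the sketch below labels each value with an f-string. It assumes a small inline HTML snippet in place of response.text so that it runs offline. The sample markup is made up and is not the real Wikipedia homepage.

```python
from bs4 import BeautifulSoup

# Assumption: this snippet stands in for the HTML fetched in Step 2
sample_html = "<html><body><h1>Sample Title</h1><p>Sample description.</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

h1_title = soup.find("h1").text
p_description = soup.find("p").text

# Display the extracted values with labels
print(f"Title: {h1_title}")
print(f"Description: {p_description}")
```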
Practice
Press the Run Code button on the right to see the scraping results. The first execution may take some time.
You can also change the value of the url variable (e.g., to https://www.codefriends.net) to fetch information from other web pages.