Scraping Wikipedia Homepage Information with Python

Wikipedia is an online encyclopedia created by people worldwide. 📘

In this lesson, we will learn how to collect specific information from a Wikipedia page using Python code.

Using the BeautifulSoup and requests libraries, you can extract the title and description from the Wikipedia homepage as shown below.


Step 1: Import Necessary Libraries

Importing requests and BeautifulSoup libraries
import requests
from bs4 import BeautifulSoup

This code performs the following:

  • Uses the import keyword to load the requests library for HTTP communication

  • Uses the from ... import statement to import the BeautifulSoup class, which parses HTML for web scraping, from the bs4 package
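
Both libraries are third-party packages, so these imports fail with ModuleNotFoundError until they are installed (on PyPI they are published as requests and beautifulsoup4). Below is a minimal sketch for checking this before running the lesson code. The helper name missing_modules is hypothetical and not part of either library.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


# "bs4" is the import name of the beautifulsoup4 package
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("requests and bs4 are ready to import")
```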


Step 2: Retrieve and Store HTML from the URL

Use the requests library to retrieve the HTML of a webpage, then parse it with BeautifulSoup and store the result in a variable as follows.

Fetching HTML from Wikipedia homepage
# Wikipedia homepage URL
url = "https://www.wikipedia.org"

# Fetch HTML from the URL using the requests library
response = requests.get(url)

# Decode the fetched HTML as UTF-8 when reading response.text
response.encoding = 'utf-8'

# Store the fetched HTML in the soup variable
soup = BeautifulSoup(response.text, 'html.parser')

This code performs the following:

  • Stores the Wikipedia homepage URL in the url variable

  • Fetches HTML from the URL using requests.get(url)

  • Sets response.encoding to 'utf-8' so that response.text is decoded as UTF-8

  • Parses the fetched HTML with BeautifulSoup(response.text, 'html.parser') and stores the parsed result in the soup variable
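
As written, requests.get(url) has no timeout, so it can wait indefinitely if the server never answers, and it does not raise an error when the server responds with a status such as 403 or 404. A hedged sketch of a more defensive fetch follows. The function names fetch_soup and parse_html, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the lesson's code; some sites reject requests that lack a descriptive User-Agent.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical identifier for this script; replace it with your own details
HEADERS = {"User-Agent": "wikipedia-lesson-example/0.1"}


def parse_html(html_text):
    """Parse an HTML string with Python's built-in html.parser."""
    return BeautifulSoup(html_text, "html.parser")


def fetch_soup(url, timeout=10):
    """Download url and return its parsed HTML.

    Raises requests.RequestException on network failures or HTTP error statuses.
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page
    response.raise_for_status()
    response.encoding = "utf-8"
    return parse_html(response.text)
```

Calling fetch_soup("https://www.wikipedia.org") would return the same kind of object as the soup variable in this step. Wrapping the call in try/except requests.RequestException lets the script report a failed download instead of crashing.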


Step 3: Extract Title and Description Information

Extract desired information from the soup variable as shown below.

Extracting title and description from Wikipedia homepage
# Extract the text of the first h1 (heading 1) tag as the title
h1_title = soup.find('h1').text

# Extract the text of the first p (paragraph) tag as the description
p_description = soup.find('p').text

This code performs the following:

  • Finds the first h1 tag in the soup variable with soup.find('h1'), reads its text with .text, and stores it as the title in the h1_title variable

  • Finds the first p tag in the soup variable with soup.find('p'), reads its text with .text, and stores it as the description in the p_description variable
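
soup.find() returns None when the page has no matching tag, so soup.find('h1').text raises AttributeError on a page without an h1. Below is a minimal sketch of a guard, run against a made-up HTML string rather than the live page. The helper name first_text and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page with an h1 but no p tag
sample_html = "<html><body><h1>Sample Title</h1></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")


def first_text(soup, tag_name, default=""):
    """Return the text of the first tag_name element, or default if absent."""
    tag = soup.find(tag_name)
    return tag.text if tag is not None else default


h1_title = first_text(soup, "h1")
p_description = first_text(soup, "p", default="(no paragraph found)")
```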

Finally, use the print function to display the extracted title and description, for example print(h1_title) and print(p_description).
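
The print step is described but not shown, so here is one way it might look. To stay runnable without a network connection, this sketch parses a made-up HTML string in place of the downloaded page; with the lesson's code, soup would come from Step 2 instead. It uses get_text(" ", strip=True) rather than .text, because .text keeps the line breaks and indentation found inside a tag.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML fetched in Step 2
sample_html = """
<h1>
  <span>Wikipedia</span>
  <strong>The Free Encyclopedia</strong>
</h1>
<p>  An example description paragraph.  </p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Join the text pieces with single spaces and trim surrounding whitespace
h1_title = soup.find("h1").get_text(" ", strip=True)
p_description = soup.find("p").get_text(" ", strip=True)

print(f"Title: {h1_title}")
print(f"Description: {p_description}")
```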


Practice

Press the Run Code button on the right to see the scraping results. The first execution may take some time.

You can also change the value of the url variable (e.g., https://www.codefriends.net) to fetch information from other web pages.
