Extracting Desired Information from Wikipedia
In this lesson, we will learn how to use Python to crawl data from the Internet page on Wikipedia. Specifically, we will extract the title and certain sections from the content of the page, and learn how to properly handle UTF-8 data.
Fetching the Web Page
First, we will use the requests package to fetch the Internet page on Wikipedia. The requests.get() method retrieves the HTML source of the page.
import requests
# Set the URL
url = 'https://en.wikipedia.org/wiki/Internet'
# Fetch the web page
response = requests.get(url)
# Check the response status code
print("status_code:", response.status_code)
- The url variable stores the address of the page to be crawled.
- requests.get(url) fetches the HTML source of the given URL.
- response.status_code is used to check whether the request succeeded; a status code of 200 means the request was successful (a stricter variant is sketched after this list).
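If you prefer the request to fail loudly on a bad response instead of checking the status code by hand, requests also provides raise_for_status(). A minimal sketch, assuming the same url as above; the 10-second timeout is an illustrative choice, not part of the original lesson:

# Fetch the page, giving up if the server takes too long to respond
response = requests.get(url, timeout=10)
# Raise requests.HTTPError for 4xx/5xx responses instead of continuing silently
response.raise_for_status()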
Parsing HTML and Extracting the Title
Next, we will parse the HTML and extract the title of the page, using BeautifulSoup to analyze the HTML structure.
from bs4 import BeautifulSoup
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the title of the page
title = soup.find('h1', id='firstHeading').text
print("title:", title)
- BeautifulSoup(response.text, 'html.parser') parses the HTML source.
- soup.find('h1', id='firstHeading').text extracts the title of the page (a more defensive version is sketched after this list).
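Note that soup.find() returns None when nothing matches, so chaining .text directly will raise an AttributeError if the page layout ever changes. A defensive variant of the same lookup, as a sketch; the fallback string is an arbitrary placeholder:

# Guard against a missing heading before reading .text
heading = soup.find('h1', id='firstHeading')
title = heading.text.strip() if heading else "(title not found)"
print("title:", title)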
Extracting the Content
Next, we will retrieve all <p> tags from the content and then extract the first 5 paragraphs.
# Retrieve all <p> tags from the content
all_paragraphs = soup.find('div', class_='mw-parser-output').find_all('p')
# Select only the first 5 <p> tags
paragraphs = all_paragraphs[:5]
# Combine the extracted paragraphs into a single text
content = "\n".join([p.text for p in paragraphs])
- soup.find('div', class_='mw-parser-output').find_all('p') retrieves all <p> tags from the content.
- paragraphs = all_paragraphs[:5] selects the first 5 <p> tags.
- "\n".join([p.text for p in paragraphs]) combines the selected paragraphs into a single text (a sketch that skips blank paragraphs follows this list).
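Wikipedia articles sometimes contain empty or whitespace-only <p> tags near the top, so slicing the raw list can pick up blank paragraphs. A sketch that filters them out first, assuming the all_paragraphs list from the code above:

# Keep only paragraphs that contain visible text
non_empty = [p for p in all_paragraphs if p.text.strip()]
paragraphs = non_empty[:5]
content = "\n".join(p.text.strip() for p in paragraphs)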
Handling UTF-8 Encoding Issues and Output
To properly output the crawled UTF-8 data, we will address encoding issues.
# Handle UTF-8 encoding issues
print("content:", content.encode('utf-8').decode('utf-8'))
content.encode('utf-8').decode('utf-8') ensures that the output UTF-8 data is displayed properly.
Encoding converts a string into bytes (data composed of 0s and 1s), and decoding converts bytes back into a string.
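A quick round trip makes the relationship between strings and bytes concrete; the sample string here is just an illustration:

text = "café"
data = text.encode('utf-8')  # str -> bytes
print(data)  # b'caf\xc3\xa9'
print(data.decode('utf-8'))  # bytes -> str: café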
Once you run the code, the title and content of the Wikipedia Internet page will be displayed.