Extracting Desired Information from Wikipedia
In this lesson, we will learn how to use Python to crawl data from the Internet page on Wikipedia. Specifically, we will extract the title and certain sections from the content of the page, and learn how to properly handle UTF-8 data.
Fetching the Web Page
First, we will use the requests package to fetch the Internet page on Wikipedia. The requests.get() method retrieves the HTML source of the page.
import requests
# Set the URL
url = 'https://en.wikipedia.org/wiki/Internet'
# Fetch the web page
response = requests.get(url)
# Check the response status code
print("status_code:", response.status_code)
- The url variable stores the address of the page to be crawled.
- requests.get(url) fetches the HTML source of the given URL.
- response.status_code is used to check whether the request succeeded; a status code of 200 means the request was successful (a stricter variant is sketched after this list).
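If you prefer the request to fail loudly on a bad response instead of checking the status code by hand, requests also provides raise_for_status(). A minimal sketch, assuming the same url as above; the 10-second timeout is an illustrative choice, not part of the original lesson:

# Fetch the page, giving up if the server takes too long to respond
response = requests.get(url, timeout=10)
# Raise requests.HTTPError for 4xx/5xx responses instead of continuing silently
response.raise_for_status()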
Parsing HTML and Extracting the Title
Next, we will parse the HTML and extract the title of the page, using BeautifulSoup to analyze the HTML structure.
from bs4 import BeautifulSoup
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the title of the page
title = soup.find('h1', id='firstHeading').text
print("title:", title)
- BeautifulSoup(response.text, 'html.parser') parses the HTML source.
- soup.find('h1', id='firstHeading').text extracts the title of the page (a more defensive version is sketched after this list).
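Note that soup.find() returns None when nothing matches, so chaining .text directly will raise an AttributeError if the page layout ever changes. A defensive variant of the same lookup, as a sketch; the fallback string is an arbitrary placeholder:

# Guard against a missing heading before reading .text
heading = soup.find('h1', id='firstHeading')
title = heading.text.strip() if heading else "(title not found)"
print("title:", title)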
Extracting the Content
Next, we will retrieve all <p> tags from the content and then extract the first 5 paragraphs.
# Retrieve all <p> tags from the content
all_paragraphs = soup.find('div', class_='mw-parser-output').find_all('p')
# Select only the first 5 <p> tags
paragraphs = all_paragraphs[:5]
# Combine the extracted paragraphs into a single text
content = "\n".join([p.text for p in paragraphs])
- soup.find('div', class_='mw-parser-output').find_all('p') retrieves all <p> tags from the content.
- paragraphs = all_paragraphs[:5] selects the first 5 <p> tags.
- "\n".join([p.text for p in paragraphs]) combines the selected paragraphs into a single text (a sketch that skips blank paragraphs follows this list).
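Wikipedia articles sometimes contain empty or whitespace-only <p> tags near the top, so slicing the raw list can pick up blank paragraphs. A sketch that filters them out first, assuming the all_paragraphs list from the code above:

# Keep only paragraphs that contain visible text
non_empty = [p for p in all_paragraphs if p.text.strip()]
paragraphs = non_empty[:5]
content = "\n".join(p.text.strip() for p in paragraphs)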
Handling UTF-8 Encoding Issues and Output
To properly output the crawled UTF-8 data, we will address encoding issues.
# Handle UTF-8 encoding issues
print("content:", content.encode('utf-8').decode('utf-8'))
content.encode('utf-8').decode('utf-8') ensures that the output UTF-8 data is displayed properly.
Encoding converts a string into bytes (data composed of 0s and 1s), and decoding converts bytes back into a string.
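A quick round trip makes the relationship between strings and bytes concrete; the sample string here is just an illustration:

text = "café"
data = text.encode('utf-8')  # str -> bytes
print(data)  # b'caf\xc3\xa9'
print(data.decode('utf-8'))  # bytes -> str: café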
Once you run the code, the title and content of the Wikipedia Internet page will be displayed.