Wikipedia Article Crawling
This document shows how to crawl the title and the first paragraph of a Wikipedia article using Python's requests and BeautifulSoup libraries.
Step 1
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
In this step, the requests library retrieves the HTML content from the given URL (url holds the address of the Wikipedia article). The BeautifulSoup library then parses that HTML with Python's built-in html.parser, and the result is stored in the soup object, which gives easy access to the page's HTML elements.
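The fetch-and-parse step can be sketched a little more defensively, as below. This is a minimal sketch, not the lesson's own code: the helper names parse_html and fetch_soup, the User-Agent string, and the 10-second timeout are all assumptions made for illustration. The sample at the end parses a static HTML string so it runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html: str) -> BeautifulSoup:
    """Parse an HTML string with Python's built-in 'html.parser' backend."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download a page and return it parsed (hypothetical helper)."""
    # The User-Agent value is an assumption chosen for this example.
    headers = {"User-Agent": "wiki-crawl-tutorial/0.1 (educational example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_html(response.text)


# Offline demonstration on a static document.
sample = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = parse_html(sample)
print(soup.p.text)  # prints "Hello"
```

With network access, fetch_soup(url) could stand in for the two lines of Step 1; the timeout keeps a stalled connection from hanging the script.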
Step 2
page_title = soup.find('title').text
soup.find('title') locates the <title> tag of the HTML document, and the .text attribute extracts the text content of that tag. This step retrieves the page's title.
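One thing to watch for: soup.find('title') returns None when a document has no <title> tag, so calling .text on the result would raise an AttributeError. The sketch below guards against that case. The helper name get_page_title is hypothetical, and stripping a trailing " - Wikipedia" suffix is an assumption about how the site formats its titles, not something the lesson requires.

```python
from bs4 import BeautifulSoup


def get_page_title(soup, suffix=" - Wikipedia"):
    """Return the <title> text, or None if the document has no <title> tag."""
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    title = title_tag.get_text(strip=True)
    # Optionally drop the site suffix (assumed format) to keep only the article name.
    if suffix and title.endswith(suffix):
        title = title[: -len(suffix)]
    return title


with_title = BeautifulSoup("<title>Web crawler - Wikipedia</title>", "html.parser")
without_title = BeautifulSoup("<p>No title here</p>", "html.parser")

print(get_page_title(with_title))     # prints "Web crawler"
print(get_page_title(without_title))  # prints "None"
```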
Step 3
first_valid_paragraph = None
for paragraph in soup.find_all('p'):
    if 'mw-empty-elt' not in paragraph.get('class', []):
        first_valid_paragraph = paragraph.text.strip()
        break
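The loop iterates over all <p> tags and stops at the first one that does not carry the 'mw-empty-elt' class. That class marks an empty paragraph, so skipping it yields the first paragraph with actual content. paragraph.get('class', []) returns an empty list for paragraphs that have no class attribute, so the membership test is safe for them as well.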
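To see the filter at work without fetching a page, the same loop can be run on a small hand-written HTML fragment. This is a sketch under assumptions: the sample markup only imitates the shape of a Wikipedia page, the function name first_content_paragraph is hypothetical, and the extra check that skips paragraphs with blank text goes beyond the original loop.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described above.
sample_html = """
<div class="mw-parser-output">
  <p class="mw-empty-elt"></p>
  <p>A web crawler is a program that browses pages.</p>
  <p>Second paragraph.</p>
</div>
"""


def first_content_paragraph(soup):
    """Return the text of the first non-empty <p>, or None if there is none."""
    for paragraph in soup.find_all("p"):
        # Skip paragraphs explicitly marked as empty.
        if "mw-empty-elt" in paragraph.get("class", []):
            continue
        text = paragraph.get_text().strip()
        # Extra safeguard (not in the original loop): skip blank paragraphs too.
        if text:
            return text
    return None


soup = BeautifulSoup(sample_html, "html.parser")
print(first_content_paragraph(soup))
# prints "A web crawler is a program that browses pages."
```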
Step 4
print(f"Page Title: {page_title}\n")
if first_valid_paragraph:
    print(f"First Paragraph: {first_valid_paragraph}\n")
else:
    print("No valid first paragraph found.\n")
Finally, the code prints the extracted page title and the first valid paragraph. If a valid first paragraph was found, its content is displayed; otherwise the message "No valid first paragraph found." is shown.
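Putting the four steps together, the extraction and the reporting can be separated into two small functions so that each can be checked on its own. This is one possible arrangement, not the lesson's official solution: extract_summary and format_report are hypothetical names, and the input is a static HTML string rather than a live download.

```python
from bs4 import BeautifulSoup


def extract_summary(html):
    """Return (page_title, first_valid_paragraph) for an HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("title")
    page_title = title_tag.text if title_tag is not None else None

    first_valid_paragraph = None
    for paragraph in soup.find_all("p"):
        if "mw-empty-elt" not in paragraph.get("class", []):
            first_valid_paragraph = paragraph.text.strip()
            break
    return page_title, first_valid_paragraph


def format_report(page_title, first_valid_paragraph):
    """Build the same messages that Step 4 prints, as a single string."""
    lines = [f"Page Title: {page_title}"]
    if first_valid_paragraph:
        lines.append(f"First Paragraph: {first_valid_paragraph}")
    else:
        lines.append("No valid first paragraph found.")
    return "\n\n".join(lines)


page = (
    "<html><head><title>Example - Wikipedia</title></head>"
    '<body><p class="mw-empty-elt"></p><p>Example text.</p></body></html>'
)
print(format_report(*extract_summary(page)))
```

Returning the values instead of printing them inside the loop makes the fallback message easy to verify with an input that contains only empty paragraphs.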