
Wikipedia Article Crawling

This guide shows how to extract the title and the first paragraph of a Wikipedia article using Python's requests and BeautifulSoup libraries.

Step 1

Retrieving and Parsing HTML
import requests
from bs4 import BeautifulSoup
response = requests.get(url)  # url: address of the Wikipedia article to crawl
soup = BeautifulSoup(response.text, 'html.parser')

In this step, the requests library retrieves the HTML content from the given URL. BeautifulSoup then parses that HTML with Python's built-in 'html.parser', and the result is stored in the soup object, which gives convenient access to the page's HTML elements.
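A slightly more defensive version of this step can be sketched as follows. The function name fetch_soup, the 10-second timeout, and the User-Agent string are illustrative assumptions and not part of the original snippet; the idea is only to show how a failed request can be caught before any parsing happens.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object.

    fetch_soup is a hypothetical helper name; the timeout and User-Agent
    values below are placeholders chosen for illustration.
    """
    headers = {"User-Agent": "wiki-crawl-example/0.1 (learning script)"}
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup with the address of an article returns the same kind of soup object as the two lines above, but a network failure or an HTTP error surfaces as an exception instead of going unnoticed.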

Step 2

Extracting Page Title
page_title = soup.find('title').text

soup.find('title') locates the <title> tag of the HTML document, and its .text attribute returns the text inside the tag. This gives the page's title.
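If a page has no <title> tag, soup.find('title') returns None and accessing .text on it raises an AttributeError. The sketch below guards against that case. The helper name extract_title is hypothetical, and the " - Wikipedia" suffix it strips is an assumption about how English Wikipedia currently formats its titles.

```python
from bs4 import BeautifulSoup


def extract_title(soup, suffix=" - Wikipedia"):
    """Return the page title without the site suffix, or None if absent.

    The default suffix is an assumption about English Wikipedia's
    <title> format; pass suffix="" to keep the title unchanged.
    """
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    title = title_tag.get_text(strip=True)
    if suffix and title.endswith(suffix):
        title = title[: -len(suffix)]
    return title


sample = BeautifulSoup(
    "<html><head><title>Web scraping - Wikipedia</title></head></html>",
    "html.parser",
)
print(extract_title(sample))  # prints: Web scraping
```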

Step 3

Extracting First Valid Paragraph
first_valid_paragraph = None
for paragraph in soup.find_all('p'):
    if 'mw-empty-elt' not in paragraph.get('class', []):
        first_valid_paragraph = paragraph.text.strip()
        break  # stop at the first match; otherwise the last paragraph would win

The loop iterates over all <p> tags to find the first one without the 'mw-empty-elt' class. That class marks an empty paragraph, so such paragraphs are skipped in order to reach the first paragraph with actual content.
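The same search can be tried on a small inline HTML string, without any network access. This is a minimal sketch: the function name first_content_paragraph and the sample HTML are made up for illustration, and the extra check for whitespace-only text is an addition that goes beyond the class test described above.

```python
from bs4 import BeautifulSoup


def first_content_paragraph(soup):
    """Return the text of the first non-empty <p>, or None if there is none."""
    for paragraph in soup.find_all("p"):
        # html.parser returns the class attribute as a list of class names.
        classes = paragraph.get("class", [])
        text = paragraph.get_text().strip()
        # Extra assumption: also skip paragraphs that contain only whitespace.
        if "mw-empty-elt" not in classes and text:
            return text
    return None


html = """
<p class="mw-empty-elt"></p>
<p>   </p>
<p>Web scraping is data extraction from websites.</p>
<p>Second paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")
print(first_content_paragraph(soup))
# prints: Web scraping is data extraction from websites.
```

Returning from inside the loop is what makes this the first match; a loop that keeps running after a match ends up with the last paragraph instead.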

Step 4

Outputting Results
print(f"Page Title: {page_title}\n")
if first_valid_paragraph:
    print(f"First Paragraph: {first_valid_paragraph}\n")
else:
    print("No valid first paragraph found.\n")

Finally, the extracted page title and the first valid paragraph are printed. If a valid first paragraph is present, its content is displayed; if not, a "No valid first paragraph found." message is shown.
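Put together, the four steps might look like the sketch below. Splitting the work into summarize_article and crawl is a design choice made here so the parsing logic can be exercised on a plain HTML string; both function names and the 10-second timeout are assumptions, not part of the original snippets.

```python
import requests
from bs4 import BeautifulSoup


def summarize_article(html):
    """Return (page_title, first_valid_paragraph) for one page's HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 2: page title, or None when the page has no <title> tag.
    title_tag = soup.find("title")
    page_title = title_tag.text if title_tag is not None else None

    # Step 3: first paragraph not marked as empty.
    first_valid_paragraph = None
    for paragraph in soup.find_all("p"):
        if "mw-empty-elt" not in paragraph.get("class", []):
            first_valid_paragraph = paragraph.text.strip()
            break

    return page_title, first_valid_paragraph


def crawl(url):
    """Fetch a page (Step 1) and print its title and first paragraph (Step 4)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    page_title, first_valid_paragraph = summarize_article(response.text)

    print(f"Page Title: {page_title}\n")
    if first_valid_paragraph:
        print(f"First Paragraph: {first_valid_paragraph}\n")
    else:
        print("No valid first paragraph found.\n")
```

Calling crawl with the URL of a Wikipedia article prints the two results; summarize_article can be called directly on saved HTML when no network connection is available.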


