
Crawling Latest Trending Articles from Wikipedia

Use BeautifulSoup's find_all method to scrape list items describing significant events from Wikipedia's Current Events portal.


Example Code Explanation

Extracting the First 10 Trending Article Titles
import requests
from bs4 import BeautifulSoup

def crawl_wikipedia_current_events_first_10_titles():
    url = "https://en.wikipedia.org/wiki/Portal:Current_events"

    response = requests.get(url)
    if response.status_code != 200:
        print("Response failed", response.status_code)
        return None

    soup = BeautifulSoup(response.content, "html.parser")

    # Locate the div tag containing the contents of the Current Events section
    current_events_section = soup.find("div", {"id": "mw-content-text"})

    # Find all li tags within the div tag
    list_items = current_events_section.find_all("li") if current_events_section else []

    # Extract text inside li tags and store them in a list
    titles = [item.get_text(strip=True) for item in list_items[:10]]

    return titles

  1. Requesting a Web Page: Use requests.get(url) to request the content of a specific URL.

  2. Checking Response Status: Verify whether the request was successful by inspecting response.status_code.

  3. Creating a BeautifulSoup Object and Parsing Data: Use BeautifulSoup(response.content, "html.parser") to parse the HTML content.

  4. Extracting Data from a Specific Section: Use find to locate the div whose id is mw-content-text, call find_all("li") on it to collect its list items, and keep the text of the first 10 entries with get_text(strip=True).
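
Steps 3 and 4 can be tried without sending a network request. The sketch below runs the same find and find_all calls against a small inline HTML string. The markup, the event texts, and the helper name first_list_items are made-up stand-ins for illustration and do not reflect Wikipedia's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; NOT Wikipedia's real markup.
SAMPLE_HTML = """
<html><body>
  <div id="sidebar"><ul><li>Navigation link</li></ul></div>
  <div id="mw-content-text">
    <ul>
      <li>Event A happens</li>
      <li>Event B happens</li>
      <li>Event C happens</li>
    </ul>
  </div>
</body></html>
"""

def first_list_items(html, limit=10):
    """Return the text of the first `limit` li tags inside div#mw-content-text."""
    soup = BeautifulSoup(html, "html.parser")

    # Same lookup as in the lesson code: find the content div by its id
    section = soup.find("div", {"id": "mw-content-text"})
    if section is None:
        return []

    # Collect every li inside that div, then keep only the first `limit`
    return [li.get_text(strip=True) for li in section.find_all("li")[:limit]]

print(first_list_items(SAMPLE_HTML, limit=2))
```

Because the search starts from the content div rather than from the whole document, the li inside the sidebar div is never considered.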


Practice Exercises

  • Use the above code to extract the latest event titles from Wikipedia's 'Current Events' section.

  • Experiment with targeting different webpages and sections to practice data extraction techniques.
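
One possible way to approach the second exercise is to turn the container id and the tag name into parameters. The following is a minimal sketch under that assumption; extract_texts, the "news" id, and the sample markup are hypothetical names chosen for illustration. It also uses find_all's limit argument instead of slicing the result list.

```python
from bs4 import BeautifulSoup

def extract_texts(html, container_id, tag, limit=10):
    """Return the text of up to `limit` `tag` elements inside the element with id `container_id`."""
    soup = BeautifulSoup(html, "html.parser")

    # Look up the container by id, whatever kind of tag it is
    container = soup.find(id=container_id)
    if container is None:
        return []

    # limit stops the search once enough matches have been found
    return [el.get_text(strip=True) for el in container.find_all(tag, limit=limit)]

# Made-up markup standing in for a different page and section
html = """
<div id="news"><h2>First headline</h2><p>Body text</p><h2>Second headline</h2></div>
<div id="other"><h2>Unrelated headline</h2></div>
"""

print(extract_texts(html, "news", "h2"))
```

To target a real page, pass response.content from requests.get as the html argument and replace the id and tag with values found by inspecting that page's HTML.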
