Crawling Latest Trending Articles from Wikipedia
Utilize the find_all method of BeautifulSoup to crawl significant events from Wikipedia's Current Events section.
Example Code Explanation
import requests
from bs4 import BeautifulSoup


def crawl_wikipedia_current_events_first_10_titles():
    url = "https://en.wikipedia.org/wiki/Portal:Current_events"
    response = requests.get(url)
    if response.status_code != 200:
        print("Response failed", response.status_code)
        return None
    soup = BeautifulSoup(response.content, "html.parser")
    # Locate the div tag containing the contents of the Current Events section
    current_events_section = soup.find("div", {"id": "mw-content-text"})
    # Find all li tags within the div tag
    list_items = current_events_section.find_all("li") if current_events_section else []
    # Extract text inside li tags and store them in a list
    titles = [item.get_text(strip=True) for item in list_items[:10]]
    return titles
- Requesting a Web Page: Use requests.get(url) to request the content of a specific URL.
- Checking Response Status: Verify whether the request was successful by inspecting response.status_code.
- Creating a BeautifulSoup Object and Parsing Data: Use BeautifulSoup(response.content, "html.parser") to parse the HTML content.
- Extracting Data from a Specific Section: Locate all li tags within a particular section of the webpage (e.g., 'Current Events'), and extract the first 10 entries.
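
The parsing and extraction steps can be tried without sending a network request. The following is a minimal sketch that runs the same find and find_all calls against a small hand-written HTML string. The sample markup and the function name extract_first_items are made up for illustration and do not reflect Wikipedia's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <ul><li>Navigation link</li></ul>
  <div id="mw-content-text">
    <ul>
      <li>Event A</li>
      <li>Event B</li>
      <li>Event C</li>
    </ul>
  </div>
</body></html>
"""


def extract_first_items(html, limit=10):
    """Return the text of the first `limit` li tags inside div#mw-content-text."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to one container so unrelated li tags are ignored
    section = soup.find("div", {"id": "mw-content-text"})
    if section is None:
        return []
    return [li.get_text(strip=True) for li in section.find_all("li")[:limit]]


print(extract_first_items(SAMPLE_HTML, limit=2))  # ['Event A', 'Event B']
```

Because the search starts from the located div rather than from the whole document, the "Navigation link" item outside that container is not included in the result.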
Practice Exercises
- Use the above code to extract the latest event titles from Wikipedia's 'Current Events' section.
- Experiment with targeting different webpages and sections to practice data extraction techniques.
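
One possible starting point for the second exercise is a helper that takes a CSS selector, so that a different section or tag can be targeted by changing a single argument. This is only a sketch: the helper name extract_text, the sample HTML, and the selectors are hypothetical, and it uses BeautifulSoup's select method in place of the find and find_all combination shown earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two differently structured sections.
SAMPLE_HTML = """
<div id="headlines"><h2>First headline</h2><h2>Second headline</h2></div>
<div id="links"><a href="/a">Link A</a><a href="/b">Link B</a></div>
"""


def extract_text(html, css_selector, limit=10):
    """Return the text of up to `limit` elements matching `css_selector`."""
    soup = BeautifulSoup(html, "html.parser")
    # select() accepts a CSS selector and returns matches in document order
    return [tag.get_text(strip=True) for tag in soup.select(css_selector)[:limit]]


print(extract_text(SAMPLE_HTML, "#headlines h2"))     # ['First headline', 'Second headline']
print(extract_text(SAMPLE_HTML, "#links a", limit=1))  # ['Link A']
```

For a real page, the html argument would be the response content from requests.get, and the selector would need to be chosen by inspecting that page's markup.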