Extracting Article Titles from BBC News
Wouldn't it be convenient to automatically collect the latest world news every morning and receive it in your email?
This is made possible by web scraping.
In this lesson, we will work on a practical project to extract the latest article titles from the BBC News homepage.
Review of Requests and BeautifulSoup
When retrieving static data (data that is already present in the page's HTML rather than loaded later by JavaScript) from a web page, we commonly use the requests and BeautifulSoup libraries.
requests is a library that fetches the HTML code from a web page, and BeautifulSoup parses the HTML code to extract the necessary information.
Accessing BBC News with requests
To extract titles from a website, we first need to fetch the HTML data using the requests library.
Below is the code to send a request to the BBC News homepage and check if the request was successful.
import requests
# Sending a request to the BBC News homepage
url = "https://www.bbc.com/news"
response = requests.get(url)
# Checking if the request was successful
print("status_code:", response.status_code)
If response.status_code is 200, the request was successful.
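A 200 is only one of many status codes a server can return. As a minimal sketch, the helper below turns a numeric code into a readable label using Python's standard http.HTTPStatus. The function name describe_status and the outcome labels are assumptions made for illustration; they are not part of the requests library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    # Look up the standard reason phrase; unknown codes raise ValueError.
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"

    # Classify the code by its hundreds range.
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"

    return f"{code} {phrase} ({outcome})"


# Hypothetical sample codes, printed for illustration only.
for sample_code in (200, 404, 503):
    print(describe_status(sample_code))
```

In the lesson's code, this could be called as describe_status(response.status_code). The requests library also provides response.raise_for_status(), which raises an exception for 4xx and 5xx responses, so a script can stop early instead of parsing an error page.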
Analyzing HTML Data with BeautifulSoup
The HTML data received from the server is merely a string of text on its own.
To meaningfully analyze this data, you need to use BeautifulSoup.
BeautifulSoup helps parse the HTML structure and makes it easier to extract data contained within specific tags.
from bs4 import BeautifulSoup
import requests
# Sending a request to the BBC News homepage
url = "https://www.bbc.com/news"
response = requests.get(url)
# Checking if the request was successful
print("status_code:", response.status_code)
# Parsing the HTML data
soup = BeautifulSoup(response.text, "html.parser")
# Extracting 10 article titles enclosed in h2 tags from the page
titles = soup.find_all('h2', limit=10)
# Printing the index and article titles
# enumerate(titles, start=1) numbers the titles starting from 1
for idx, title in enumerate(titles, start=1):
    print(f"{idx}. {title.get_text(strip=True)}")
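The above code finds and prints the text contained in h2 tags.
On the BBC News homepage, article titles are mostly written in h2 tags, so we can extract the titles using find_all('h2'). Because the page's markup can change over time, check the current HTML structure if the output looks wrong.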
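Since not every h2 tag on a live page is guaranteed to hold a clean, unique title, it can help to wrap the extraction in a small function and try it on a known HTML string first. The sketch below is one possible way to do that. The function name extract_titles, the sample_html string, and the choice to skip empty and repeated headings are all assumptions for illustration, not part of the lesson's original code.

```python
from bs4 import BeautifulSoup


def extract_titles(html, limit=10):
    """Return up to `limit` unique, non-empty h2 texts from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    seen = set()
    for tag in soup.find_all("h2"):
        # Join nested text pieces with a space and trim surrounding whitespace.
        text = tag.get_text(" ", strip=True)
        # Skip empty headings and headings already collected.
        if not text or text in seen:
            continue
        seen.add(text)
        titles.append(text)
        if len(titles) == limit:
            break
    return titles


# A small, made-up HTML sample so the function can be tried without network access.
sample_html = """
<html><body>
  <h2>First headline</h2>
  <h2>  </h2>
  <h2>Second <span>headline</span></h2>
  <h2>First headline</h2>
  <h3>Not a title</h3>
</body></html>
"""

for idx, title in enumerate(extract_titles(sample_html), start=1):
    print(f"{idx}. {title}")
```

With a real page, the same function could be called as extract_titles(response.text). Keeping the parsing separate from the network request makes it easier to test the extraction logic on its own.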
In this lesson, we learned how to send a request to a web page with requests, parse the returned HTML with BeautifulSoup, and extract article titles from BBC News.
In the next lesson, we will learn how to save the extracted data into a CSV file.