Extracting Article Titles from BBC News

Wouldn't it be convenient to automatically collect the latest world news every morning and receive it in your email?

This is made possible by web scraping.

In this lesson, we will work on a practical project to extract the latest article titles from the BBC News homepage.

Review of Requests and BeautifulSoup

When retrieving static data (content that is already present in the HTML returned by the server, rather than generated later by JavaScript in the browser), we commonly use the requests and BeautifulSoup libraries.

requests sends an HTTP request to a web page and returns the server's response, including the HTML code, and BeautifulSoup parses that HTML to extract the necessary information.

Accessing BBC News with requests

To extract titles from a website, we first need to fetch the HTML data using the requests library.

Below is the code to send a request to the BBC News homepage and check if the request was successful.

Code to request the BBC News page
import requests

# Sending a request to the BBC News homepage
url = "https://www.bbc.com/news"
response = requests.get(url, timeout=10)  # give up if the server does not respond within 10 seconds

# Checking if the request was successful
print("status_code:", response.status_code)

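If response.status_code is 200, the request was successful. Other codes signal a problem: for example, 404 means the page was not found and 403 means access was refused.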
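
Checking the status code does not cover failures that happen before any response arrives, such as a timeout or a DNS error. As a minimal sketch, the request could be wrapped in a small helper that handles both cases. The function name fetch_html and the 10-second timeout are assumptions chosen for illustration; they are not part of the lesson's code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error statuses
        print("request failed:", exc)
        return None
    return response.text


if __name__ == "__main__":
    html = fetch_html("https://www.bbc.com/news")
    if html is not None:
        print("received", len(html), "characters")
```

Returning None on failure keeps the calling code simple: it only has to check one value before passing the HTML on to the parser.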

Analyzing HTML Data with BeautifulSoup

The HTML data received from the server is merely a string of text on its own.

To meaningfully analyze this data, you need to use BeautifulSoup.

BeautifulSoup helps parse the HTML structure and makes it easier to extract data contained within specific tags.
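
To see what parsing adds without depending on a live page, the sketch below runs BeautifulSoup on a small hand-written HTML string. The markup is made up for illustration and does not reflect the actual structure of the BBC News page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used instead of a downloaded page
html = """
<html><body>
  <h1>Sample News</h1>
  <h2>First headline</h2>
  <p>Some summary text.</p>
  <h2>Second <span>headline</span></h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag in document order
# get_text(" ", strip=True) joins nested text pieces with a space and trims whitespace
headlines = [h2.get_text(" ", strip=True) for h2 in soup.find_all("h2")]

for text in headlines:
    print(text)
```

Passing a separator to get_text() matters when a tag contains nested tags: without it, the stripped pieces would be joined with no space between them.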

Analyzing HTML with BeautifulSoup
from bs4 import BeautifulSoup
import requests

# Sending a request to the BBC News homepage
url = "https://www.bbc.com/news"
response = requests.get(url, timeout=10)  # give up if the server does not respond within 10 seconds

# Checking if the request was successful
print("status_code:", response.status_code)

# Parsing the HTML data
soup = BeautifulSoup(response.text, "html.parser")

# Extracting up to 10 h2 elements, which hold the article titles on this page
titles = soup.find_all("h2", limit=10)

# Printing the index and article titles
# Using enumerate() with start=1 so the numbering begins at 1
for idx, title in enumerate(titles, start=1):
    print(f"{idx}. {title.get_text(' ', strip=True)}")

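The above code finds the first 10 h2 tags on the page and prints the text they contain.

At the time of writing, article titles on the BBC News homepage are mostly written in h2 tags, so we can extract them using find_all('h2'). If the site changes its markup, the tag passed to find_all() needs to be updated to match.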
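
Because titles are only mostly in h2 tags, a raw find_all('h2') result may include empty headings or the same headline more than once. The following is a minimal sketch of one way to clean the list; the helper name extract_titles and the sample markup are hypothetical, and the real page may need different filtering.

```python
from bs4 import BeautifulSoup


def extract_titles(html, limit=10):
    """Return up to `limit` unique, non-empty h2 texts in page order."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    seen = set()
    for h2 in soup.find_all("h2"):
        if len(titles) >= limit:
            break
        text = h2.get_text(" ", strip=True)
        # Skip headings with no text and headlines already collected
        if not text or text in seen:
            continue
        seen.add(text)
        titles.append(text)
    return titles


# Hypothetical markup with an empty heading and a repeated headline
sample = """
<h2>Markets rally</h2>
<h2></h2>
<h2>Markets rally</h2>
<h2>Election <b>results</b></h2>
"""

for idx, title in enumerate(extract_titles(sample), start=1):
    print(f"{idx}. {title}")
```

The limit is applied after filtering, so the function returns up to 10 usable titles rather than whatever the first 10 h2 tags happen to contain.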

In this lesson, we learned how to send a request to a web page with requests, analyze the HTML data with BeautifulSoup, and extract article titles from BBC News.

In the next lesson, we will learn how to save the extracted data into a CSV file.
