Sending Data Scraped from Wikipedia via Email
In this assignment, you will scrape date information for significant historical events from Wikipedia, then send a CSV file containing the historical events and their dates as an email attachment.
Understanding how this program works will help you apply the same pattern elsewhere: crawl a large dataset, process it into a suitable format, and deliver it by email.
Converting Crawling Results to CSV
First, let's look at how to convert the crawled results into a CSV file.
1. Importing Necessary Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
- pandas: A library for reading and processing data, often used with Excel and CSV files.
- requests: A library for sending requests to web pages and receiving their responses.
- BeautifulSoup: A library for parsing the HTML code of web pages to extract the desired information.
2. Reading the Excel File
df = pd.read_excel('input_file.xlsx')
- 'input_file.xlsx': The path to the Excel file. This file contains the numbers and names of historical events.
- pd.read_excel: The read_excel function from pandas reads the Excel file into a DataFrame.
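The rest of the lesson relies on the DataFrame having a 'HistoricalEvent' column. The sketch below uses an in-memory DataFrame as a stand-in for the contents of input_file.xlsx (the 'Number' column name and the sample rows are assumptions for illustration) and checks for that column before any scraping starts.

```python
import pandas as pd

# Hypothetical stand-in for what pd.read_excel('input_file.xlsx') might return.
# The 'Number' column name and the rows are assumptions, not the real file.
df = pd.DataFrame({
    'Number': [1, 2],
    'HistoricalEvent': ['French Revolution', 'Apollo 11'],
})

# Fail early with a clear message if the expected column is missing.
required_column = 'HistoricalEvent'
if required_column not in df.columns:
    raise KeyError(f"Expected a '{required_column}' column in the input file")

# The event names that will later be looked up on Wikipedia.
event_names = df[required_column].tolist()
```

Checking the column up front gives a clearer error than a failure partway through the scraping step.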
3. Creating a Function to Extract Date Information from Wikipedia
def extract_date(event_name):
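    # Wikipedia page URL (the base URL is assumed to be the English Wikipedia article prefix)
    base_url = 'https://en.wikipedia.org/wiki/'
    url = base_url + event_name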
    # Sending a web request
    response = requests.get(url)
    # If the request is successful
    if response.status_code == 200:
        # Parsing the HTML
        soup = BeautifulSoup(response.content, 'html.parser')
        # Assuming date information is typically in an infobox
        infobox = soup.find('table', {'class': 'infobox'})
        # Finding 'Date' entry in the infobox
        if infobox:
            # Checking if 'Date' entry exists
            date_tag = infobox.find('th', string='Date')
            # If 'Date' entry exists
            if date_tag:
                # Extracting date information from the next sibling tag
                date_value = date_tag.find_next_sibling('td')
                # If date information exists
                if date_value:
                    # Returning the date information
                    return date_value.text.strip()
        # If date information is not found
        return 'No date information'
    # If web request fails
    else:
        return 'Page error'
- requests.get: Sends a web request to the given URL and receives a response.
- BeautifulSoup: Parses the HTML code of the response.
- infobox: Finds the table (infobox) containing event information and returns the value of the 'Date' entry.
- return: Returns 'No date information' if the date is not found, and 'Page error' if the page cannot be loaded.
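To see the infobox lookup in isolation, the following sketch runs the same parsing steps on a small hand-written HTML snippet instead of a live Wikipedia page. The snippet and the helper name date_from_html are made up for illustration; real infoboxes vary in structure, so this lookup will not succeed on every article.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics a simple infobox (an assumption, not real Wikipedia markup).
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th>Date</th><td>14 July 1789</td></tr>
  <tr><th>Location</th><td>Paris, France</td></tr>
</table>
"""

def date_from_html(html):
    """Return the text of the 'Date' row of the first infobox, if there is one."""
    soup = BeautifulSoup(html, 'html.parser')
    # Matches any table whose class list includes 'infobox'.
    infobox = soup.find('table', {'class': 'infobox'})
    if infobox is None:
        return 'No date information'
    # Only matches a header cell whose text is exactly 'Date'.
    date_tag = infobox.find('th', string='Date')
    if date_tag is None:
        return 'No date information'
    date_value = date_tag.find_next_sibling('td')
    if date_value is None:
        return 'No date information'
    return date_value.get_text(strip=True)

found = date_from_html(SAMPLE_HTML)
missing = date_from_html('<p>No infobox here</p>')
```

Separating the parsing from the network request makes it possible to test the lookup logic without depending on Wikipedia being reachable.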
4. Applying the Function to the DataFrame
df['Date'] = df['HistoricalEvent'].apply(extract_date)
- df['HistoricalEvent']: The 'HistoricalEvent' column in the Excel file. This contains the name of each event.
- apply(extract_date): Applies the extract_date function to each event name to extract the date, and stores the result in a new 'Date' column.
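The behavior of apply can be tried without any network access by swapping in a stub. In this sketch, fake_extract_date is a hypothetical replacement for extract_date that looks dates up in a small dictionary; the sample rows are also assumptions.

```python
import pandas as pd

def fake_extract_date(event_name):
    """Hypothetical offline stand-in for extract_date."""
    known_dates = {'Storming of the Bastille': '14 July 1789'}
    return known_dates.get(event_name, 'No date information')

df = pd.DataFrame({
    'HistoricalEvent': ['Storming of the Bastille', 'Unknown Event'],
})

# apply calls the function once per value in the column
# and collects the return values into a new Series.
df['Date'] = df['HistoricalEvent'].apply(fake_extract_date)
```

With the real extract_date, each call makes one web request, so the run time grows with the number of rows.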
5. Outputting the Results
print(df[['HistoricalEvent', 'Date']].to_csv(index=False))
- df[['HistoricalEvent', 'Date']]: Selects only the 'HistoricalEvent' column and the extracted 'Date' column.
- to_csv(index=False): Converts the selected data to CSV format, which print then displays. index=False excludes the index (which indicates the position of each row in the DataFrame) from the output.
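The effect of index=False is easiest to see by comparing both outputs on the same data. The sketch below uses a one-row sample DataFrame (an assumption for illustration) and captures the CSV text as a string rather than printing it.

```python
import pandas as pd

df = pd.DataFrame({
    'HistoricalEvent': ['Storming of the Bastille'],
    'Date': ['14 July 1789'],
})

selected = df[['HistoricalEvent', 'Date']]

# Called without a file path, to_csv returns the CSV text as a string.
without_index = selected.to_csv(index=False).splitlines()
with_index = selected.to_csv().splitlines()

# without_index has the header 'HistoricalEvent,Date';
# with_index has an extra leading column holding the row positions.
```

Passing a file path as the first argument to to_csv writes the same content to disk instead of returning it, which is one way to produce a file for attaching to an email.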
In the next lesson, we will learn how to send the crawled CSV data via email.