Automating Data Research with Web Scraping
Imagine you have a task or research project where you need to find information on thousands of companies!
Manually searching the internet for each company is nearly impossible.
Using web scraping for repetitive research tasks can turn a day's work into a few minutes.
Let's implement a scenario where we use code to find the founding year and founder of five companies.
By understanding how the code works, you can easily extend this Python program to collect information on hundreds or thousands of companies at once.
1. Prepare CSV Data
Store the list of companies in a string called csv_data. This string will act like a CSV file when used in the program: instead of saving a physical file, we'll wrap it with io.StringIO and work with the data in memory. The first line is the header row (Number,Company Name), which csv.DictReader uses as dictionary keys in the next step.
Number,Company Name
1,Apple Inc.
2,Microsoft
3,NVIDIA
4,Berkshire Hathaway
5,Google
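As a minimal sketch of this step (only csv_data and csv_file appear in the lesson's snippets; the rest is illustrative), the string can be wrapped like this:
import csv
import io

# The company list as a plain string; the first line is the CSV header
csv_data = """Number,Company Name
1,Apple Inc.
2,Microsoft
3,NVIDIA
4,Berkshire Hathaway
5,Google"""

# io.StringIO makes the string behave like an open file, without touching the disk
csv_file = io.StringIO(csv_data)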
2. Read CSV Data
Use csv.DictReader
to read the CSV data into a dictionary format.
Each row in the CSV will be converted into a dictionary with keys 'Number' and 'Company Name'.
reader = csv.DictReader(csv_file)
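To see what the reader produces, here is a quick illustrative sketch. Note that iterating the reader consumes it, so re-create csv_file and the reader before running the next step:
# Each row comes back as a dictionary keyed by the header line
for row in reader:
    print(row['Number'], row['Company Name'])
# 1 Apple Inc.
# 2 Microsoft
# ... and so on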
3. Map Company Names to Wikipedia URLs
Automatically generate Wikipedia page URLs from the company names. For example, 'Apple Inc.' is converted to https://en.wikipedia.org/wiki/Apple_Inc.
base_url = 'https://en.wikipedia.org/wiki/'
companies = {row['Company Name']: base_url + row['Company Name'].replace(' ', '_') for row in reader}
This process saves each company's Wikipedia page URL in the companies dictionary:
{
'Apple Inc.': 'https://en.wikipedia.org/wiki/Apple_Inc.',
'Microsoft': 'https://en.wikipedia.org/wiki/Microsoft',
'NVIDIA': 'https://en.wikipedia.org/wiki/NVIDIA',
'Berkshire Hathaway': 'https://en.wikipedia.org/wiki/Berkshire_Hathaway',
'Google': 'https://en.wikipedia.org/wiki/Google'
}
4. Fetch Data from the Webpage
Access each company's Wikipedia page to retrieve founder and founding year information.
Use requests.get() to send a request to the URL and BeautifulSoup to parse the HTML it returns. Parsing means extracting the necessary information from the HTML structure.
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
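As a small sketch of this step for a single page (the status-code check is an extra safeguard that is not part of the snippet above, and the printed title is only for illustration):
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Apple_Inc.'
response = requests.get(url)

# A 200 status code means the page was downloaded successfully
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup.title.text)  # e.g. "Apple Inc. - Wikipedia"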
5. Find Founder and Founding Year
The information box on the Wikipedia page (usually called an infobox table) contains founder and founding year details.
The code below looks for the 'Founder' and 'Founded' headers in that table's rows to extract the required information; a fuller, runnable sketch follows the snippet.
# Extract the header information of the row
header = row.find('th')
# If header exists and contains 'Founder' text
if header and 'Founder' in header.text:
    founder = row.find('td').text.strip()
# If header exists and contains 'Founded' text
if header and 'Founded' in header.text:
    founded = row.find('td').text.strip()
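Putting this step together, here is a minimal end-to-end sketch. The function name get_company_info, the lookup of the table by its infobox class, and the 'Not found' fallback values are assumptions made for illustration; infobox layouts vary between articles, so treat this as a starting point rather than a definitive implementation.
import requests
from bs4 import BeautifulSoup

def get_company_info(url):
    # Download and parse the article (get_company_info is an assumed helper name)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    founder, founded = 'Not found', 'Not found'

    # Most company articles keep these details in a table with class "infobox"
    infobox = soup.find('table', class_='infobox')
    if infobox:
        for row in infobox.find_all('tr'):
            header = row.find('th')
            cell = row.find('td')
            if header and cell and 'Founder' in header.text:
                founder = cell.text.strip()
            if header and cell and 'Founded' in header.text:
                founded = cell.text.strip()

    return founder, founded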
6. Output the Results
Store the collected founder and founding year information in a company_info list and print the list at the end.
company_info.append({'Company Name': company, 'Founder': founder, 'Founded': founded})
print(company_info)
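To tie the steps together, here is a short sketch of the collection loop (it assumes the companies dictionary from step 3 and the get_company_info sketch from step 5):
company_info = []

for company, url in companies.items():
    founder, founded = get_company_info(url)
    company_info.append({'Company Name': company, 'Founder': founder, 'Founded': founded})

print(company_info)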