Automating Data Research with Web Scraping
Imagine you have a task or research project where you need to find information on thousands of companies!
Manually searching the internet for each company is nearly impossible.
Using web scraping for repetitive research tasks can turn a day's work into a few minutes.
Let's implement a scenario where we use code to find the founding year and founder of five companies.
By understanding how the code works, you can easily extend this Python program to collect information on hundreds or thousands of companies at once.
1. Prepare CSV Data
Store the list of companies in a string called csv_data. This string will act like a CSV file when used in the program: instead of saving a physical file, we'll wrap it with io.StringIO and work with the data in memory. The first line is the header row (Number,Company Name), which csv.DictReader uses as dictionary keys in the next step.
Number,Company Name
1,Apple Inc.
2,Microsoft
3,NVIDIA
4,Berkshire Hathaway
5,Google
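As a minimal sketch of this step (only csv_data and csv_file appear in the lesson's snippets; the rest is illustrative), the string can be wrapped like this:
import csv
import io

# The company list as a plain string; the first line is the CSV header
csv_data = """Number,Company Name
1,Apple Inc.
2,Microsoft
3,NVIDIA
4,Berkshire Hathaway
5,Google"""

# io.StringIO makes the string behave like an open file, without touching the disk
csv_file = io.StringIO(csv_data)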
2. Read CSV Data
Use csv.DictReader
to read the CSV data into a dictionary format.
Each row in the CSV will be converted into a dictionary with keys 'Number' and 'Company Name'.
reader = csv.DictReader(csv_file)
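To see what the reader produces, here is a quick illustrative sketch. Note that iterating the reader consumes it, so re-create csv_file and the reader before running the next step:
# Each row comes back as a dictionary keyed by the header line
for row in reader:
    print(row['Number'], row['Company Name'])
# 1 Apple Inc.
# 2 Microsoft
# ... and so on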
3. Map Company Names to Wikipedia URLs
Automatically generate Wikipedia page URLs from the company names. For example, 'Apple Inc.' is converted to https://en.wikipedia.org/wiki/Apple_Inc.
base_url = 'https://en.wikipedia.org/wiki/'
companies = {row['Company Name']: base_url + row['Company Name'].replace(' ', '_') for row in reader}
This process saves each company's Wikipedia page URL in the companies dictionary:
{
'Apple Inc.': 'https://en.wikipedia.org/wiki/Apple_Inc.',
'Microsoft': 'https://en.wikipedia.org/wiki/Microsoft',
'NVIDIA': 'https://en.wikipedia.org/wiki/NVIDIA',
'Berkshire Hathaway': 'https://en.wikipedia.org/wiki/Berkshire_Hathaway',
'Google': 'https://en.wikipedia.org/wiki/Google'
}
4. Fetch Data from the Webpage
Access each company's Wikipedia page to retrieve founder and founding year information.
Use requests.get() to send a request to the URL and BeautifulSoup to parse the HTML it returns. Parsing means extracting the necessary information from the HTML structure.
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
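As a small sketch of this step for a single page (the status-code check is an extra safeguard that is not part of the snippet above, and the printed title is only for illustration):
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Apple_Inc.'
response = requests.get(url)

# A 200 status code means the page was downloaded successfully
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup.title.text)  # e.g. "Apple Inc. - Wikipedia"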
5. Find Founder and Founding Year
The information box on the Wikipedia page (usually called an infobox table) contains founder and founding year details.
The code below looks for the 'Founder' and 'Founded' headers in that table's rows to extract the required information; a fuller, runnable sketch follows the snippet.
# Extract the header information of the row
header = row.find('th')
# If header exists and contains 'Founder' text
if header and 'Founder' in header.text:
    founder = row.find('td').text.strip()
# If header exists and contains 'Founded' text
if header and 'Founded' in header.text:
    founded = row.find('td').text.strip()
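Putting this step together, here is a minimal end-to-end sketch. The function name get_company_info, the lookup of the table by its infobox class, and the 'Not found' fallback values are assumptions made for illustration; infobox layouts vary between articles, so treat this as a starting point rather than a definitive implementation.
import requests
from bs4 import BeautifulSoup

def get_company_info(url):
    # Download and parse the article (get_company_info is an assumed helper name)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    founder, founded = 'Not found', 'Not found'

    # Most company articles keep these details in a table with class "infobox"
    infobox = soup.find('table', class_='infobox')
    if infobox:
        for row in infobox.find_all('tr'):
            header = row.find('th')
            cell = row.find('td')
            if header and cell and 'Founder' in header.text:
                founder = cell.text.strip()
            if header and cell and 'Founded' in header.text:
                founded = cell.text.strip()

    return founder, founded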
6. Output the Results
Store the collected founder and founding year information in a company_info list and print the list at the end.
company_info.append({'Company Name': company, 'Founder': founder, 'Founded': founded})
print(company_info)
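To tie the steps together, here is a short sketch of the collection loop (it assumes the companies dictionary from step 3 and the get_company_info sketch from step 5):
company_info = []

for company, url in companies.items():
    founder, founded = get_company_info(url)
    company_info.append({'Company Name': company, 'Founder': founder, 'Founded': founded})

print(company_info)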