Crawling US Stock Index with Selenium
In this lesson, we will put the Selenium knowledge we've learned so far into practice with a realistic web crawling example.
The practice code on the screen uses Selenium to extract table data from the Americas section of the Yahoo Finance website and organizes it for output using the pandas library.
Note: Web crawling may fail if the HTML and CSS structure of the website changes. If the structure changes, you will need to modify the code accordingly.
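For example, if a selector no longer matches anything, Selenium raises a NoSuchElementException. A minimal, hypothetical sketch of catching it (this assumes the driver, imports, and XPath introduced in the steps below; the message and handling are only illustrative):
from selenium.common.exceptions import NoSuchElementException

try:
    americas_heading = driver.find_element(By.XPATH, "//h3[text()='Americas']")
except NoSuchElementException:
    # The page structure has likely changed; update the XPath before retrying.
    print("Could not find the 'Americas' heading - check the page structure.")
    driver.quit()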
Let's break down the code step by step.
1. Import Required Libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
- selenium: A library for web automation and scraping. It allows you to find and interact with elements on a web page.
- pandas: A library for handling data in tabular form, useful for data analysis in a manner similar to Excel.
- time: A built-in Python module that provides various time-related functions.
2. Launch WebDriver and Navigate to the Website
driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/markets/')
- webdriver.Chrome(): Launches the Chrome web driver to control the Chrome browser automatically. It opens a visible browser window (a headless variation is sketched after this list).
- driver.get(URL): Navigates to the given URL. Here, we navigate to the 'Markets' page of Yahoo Finance.
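If you prefer to run the browser without opening a window, Chrome supports a headless mode; a minimal sketch of that optional variation (not part of the lesson code):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://finance.yahoo.com/markets/')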
3. Wait for Page to Load
wait = WebDriverWait(driver, 10)
- WebDriverWait(driver, 10): Waits up to 10 seconds for an element to appear. This prevents errors by ensuring the target element has loaded before the code tries to use it.
4. Find the 'Americas' Section
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))
- wait.until(): Waits until the h3 tag with the text 'Americas' appears on the page (handling the timeout case is sketched after this list).
- EC.presence_of_element_located(): Checks whether the specified element is present on the page.
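If the heading never appears within the 10-second limit, wait.until() raises a TimeoutException. A minimal sketch of handling that case (the error handling is an illustration, not part of the lesson code):
from selenium.common.exceptions import TimeoutException

try:
    americas_section = wait.until(
        EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']"))
    )
except TimeoutException:
    # The element did not appear in time; the page may have changed or loaded slowly.
    print("Timed out waiting for the 'Americas' section.")
    driver.quit()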
5. Scroll to the Section and Find the Table Within
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")
table = parent_section.find_element(By.XPATH, ".//table")
- ActionChains(driver): Used for automating mouse movement or click actions on the page. Here, it scrolls to the 'Americas' section (an alternative scrolling approach is sketched after this list).
- find_element(By.XPATH): Finds the table within the parent element (the section tag) of the 'Americas' section.
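Scrolling can also be done without ActionChains by asking the browser to run a small piece of JavaScript; a sketch of that optional alternative:
# Scroll the 'Americas' heading into view using JavaScript instead of ActionChains.
driver.execute_script("arguments[0].scrollIntoView();", americas_section)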
6. Extract Table Data
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]
rows = table.find_elements(By.XPATH, ".//tbody/tr")
- table.find_elements(): Extracts the header and row data from the table.
- th: Represents the header information of the table.
- tr: Represents the rows of the table.
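To verify that the extraction worked, you can print the headers and the first row before building the final data structure (a quick check, assuming the variables above and that the table has at least one row):
# Quick sanity check of the extracted table parts.
print(headers)       # column names found in the table header
print(rows[0].text)  # text content of the first data row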
7. Save Data to List
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    row_data = {}
    for i in range(len(headers)):
        header = headers[i]
        column_value = columns[i].text
        row_data[header] = column_value
    table_data.append(row_data)
- Extracts the column data (td) for each row (tr) and stores it in a dictionary matching the headers. Each dictionary is then added to a list.
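The same row-building loop can be written more compactly with Python's zip(), which pairs each header with the cell in the same position; a sketch of that equivalent version:
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    # Pair each header with the text of the cell in the same position.
    row_data = dict(zip(headers, (column.text for column in columns)))
    table_data.append(row_data)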
8. Convert Data to Pandas DataFrame and Output
df = pd.DataFrame(table_data)
df_filtered = df[['Symbol', 'Price']]
print(df_filtered)
- pd.DataFrame(): Converts the extracted data into a pandas DataFrame.
- df[['Symbol', 'Price']]: Filters and outputs the data, showing only the Symbol and Price columns.
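If you want to keep the result instead of only printing it, pandas can write the DataFrame to a CSV file; a minimal sketch (the filename is just an example):
# Save the filtered data to a CSV file for later use.
df_filtered.to_csv('americas_indices.csv', index=False)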
9. Close Browser
driver.quit()
- driver.quit(): Closes the browser after all tasks are completed to free up resources.
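Putting the steps together, the full practice script looks roughly like this (a sketch assembled from the snippets above; it assumes the Yahoo Finance page structure described in this lesson and may need adjusting if the site changes):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Launch Chrome and open the Yahoo Finance 'Markets' page.
driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/markets/')

# Wait up to 10 seconds for the 'Americas' heading to appear.
wait = WebDriverWait(driver, 10)
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))

# Scroll to the section and locate the table inside it.
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")
table = parent_section.find_element(By.XPATH, ".//table")

# Extract the headers and rows of the table.
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]
rows = table.find_elements(By.XPATH, ".//tbody/tr")

# Build a list of dictionaries, one per row, keyed by the headers.
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    row_data = {}
    for i in range(len(headers)):
        row_data[headers[i]] = columns[i].text
    table_data.append(row_data)

# Convert to a DataFrame and print only the Symbol and Price columns.
df = pd.DataFrame(table_data)
print(df[['Symbol', 'Price']])

# Close the browser to free resources.
driver.quit()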