Crawling US Stock Index with Selenium
In this lesson, we will put the Selenium knowledge we've learned so far into practice with a realistic web crawling example.
The practice code on the screen uses Selenium to extract table data from the Americas section of the Yahoo Finance website and organizes it for output using the pandas library.
Note: Web crawling may fail if the HTML and CSS structure of the website changes. If the structure changes, you will need to modify the code accordingly.
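For example, if a selector no longer matches anything, Selenium raises a NoSuchElementException. A minimal, hypothetical sketch of catching it (this assumes the driver, imports, and XPath introduced in the steps below; the message and handling are only illustrative):
from selenium.common.exceptions import NoSuchElementException

try:
    americas_heading = driver.find_element(By.XPATH, "//h3[text()='Americas']")
except NoSuchElementException:
    # The page structure has likely changed; update the XPath before retrying.
    print("Could not find the 'Americas' heading - check the page structure.")
    driver.quit()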
Let's break down the code step by step.
1. Import Required Libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
- selenium: A library for web automation and scraping. It allows you to find and interact with elements on a web page.
- pandas: A library for handling data in tabular form, useful for data analysis in a manner similar to Excel.
- time: A built-in Python module that provides various time-related functions.
2. Launch WebDriver and Navigate to the Website
driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/markets/')
- webdriver.Chrome(): Launches the Chrome web driver to control the Chrome browser automatically. It opens a visible browser window (a headless variation is sketched after this list).
- driver.get(URL): Navigates to the given URL. Here, we navigate to the 'Markets' page of Yahoo Finance.
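If you prefer to run the browser without opening a window, Chrome supports a headless mode; a minimal sketch of that optional variation (not part of the lesson code):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://finance.yahoo.com/markets/')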
3. Wait for Page to Load
wait = WebDriverWait(driver, 10)
- WebDriverWait(driver, 10): Waits up to 10 seconds for an element to appear. This prevents errors by ensuring the target element has loaded before the code tries to use it.
4. Find the 'Americas' Section
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))
- wait.until(): Waits until the h3 tag with the text 'Americas' appears on the page (handling the timeout case is sketched after this list).
- EC.presence_of_element_located(): Checks whether the specified element is present on the page.
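If the heading never appears within the 10-second limit, wait.until() raises a TimeoutException. A minimal sketch of handling that case (the error handling is an illustration, not part of the lesson code):
from selenium.common.exceptions import TimeoutException

try:
    americas_section = wait.until(
        EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']"))
    )
except TimeoutException:
    # The element did not appear in time; the page may have changed or loaded slowly.
    print("Timed out waiting for the 'Americas' section.")
    driver.quit()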
5. Scroll to the Section and Find the Table Within
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")
table = parent_section.find_element(By.XPATH, ".//table")
- ActionChains(driver): Used for automating mouse movement or click actions on the page. Here, it scrolls to the 'Americas' section (an alternative scrolling approach is sketched after this list).
- find_element(By.XPATH): Finds the table within the parent element (the section tag) of the 'Americas' section.
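Scrolling can also be done without ActionChains by asking the browser to run a small piece of JavaScript; a sketch of that optional alternative:
# Scroll the 'Americas' heading into view using JavaScript instead of ActionChains.
driver.execute_script("arguments[0].scrollIntoView();", americas_section)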
6. Extract Table Data
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]
rows = table.find_elements(By.XPATH, ".//tbody/tr")
- table.find_elements(): Extracts the header and row data from the table.
- th: Represents the header information of the table.
- tr: Represents the rows of the table.
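To verify that the extraction worked, you can print the headers and the first row before building the final data structure (a quick check, assuming the variables above and that the table has at least one row):
# Quick sanity check of the extracted table parts.
print(headers)       # column names found in the table header
print(rows[0].text)  # text content of the first data row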
7. Save Data to List
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    row_data = {}
    for i in range(len(headers)):
        header = headers[i]
        column_value = columns[i].text
        row_data[header] = column_value
    table_data.append(row_data)
- Extracts the column data (td) for each row (tr) and stores it in a dictionary matching the headers. Each dictionary is then added to a list.
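The same row-building loop can be written more compactly with Python's zip(), which pairs each header with the cell in the same position; a sketch of that equivalent version:
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    # Pair each header with the text of the cell in the same position.
    row_data = dict(zip(headers, (column.text for column in columns)))
    table_data.append(row_data)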
8. Convert Data to Pandas DataFrame and Output
df = pd.DataFrame(table_data)
df_filtered = df[['Symbol', 'Price']]
print(df_filtered)
- pd.DataFrame(): Converts the extracted data into a pandas DataFrame.
- df[['Symbol', 'Price']]: Filters and outputs the data, showing only the Symbol and Price columns.
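If you want to keep the result instead of only printing it, pandas can write the DataFrame to a CSV file; a minimal sketch (the filename is just an example):
# Save the filtered data to a CSV file for later use.
df_filtered.to_csv('americas_indices.csv', index=False)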
9. Close Browser
driver.quit()
- driver.quit(): Closes the browser after all tasks are completed to free up resources.
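Putting the steps together, the full practice script looks roughly like this (a sketch assembled from the snippets above; it assumes the Yahoo Finance page structure described in this lesson and may need adjusting if the site changes):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Launch Chrome and open the Yahoo Finance 'Markets' page.
driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/markets/')

# Wait up to 10 seconds for the 'Americas' heading to appear.
wait = WebDriverWait(driver, 10)
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))

# Scroll to the section and locate the table inside it.
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")
table = parent_section.find_element(By.XPATH, ".//table")

# Extract the headers and rows of the table.
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]
rows = table.find_elements(By.XPATH, ".//tbody/tr")

# Build a list of dictionaries, one per row, keyed by the headers.
table_data = []
for row in rows:
    columns = row.find_elements(By.XPATH, ".//td")
    row_data = {}
    for i in range(len(headers)):
        row_data[headers[i]] = columns[i].text
    table_data.append(row_data)

# Convert to a DataFrame and print only the Symbol and Price columns.
df = pd.DataFrame(table_data)
print(df[['Symbol', 'Price']])

# Close the browser to free resources.
driver.quit()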