Real-Time Stock Data Collection from Yahoo Finance
The ever-changing stock market! If you want to collect real-time stock data and store it periodically, how can you do it?
In this lesson, we will explore how to dynamically extract stock data from Yahoo Finance using Selenium.
Extracting Dynamic Data
Dynamic data is generated by JavaScript, which acts as the brain of a web page. This data can be generated after the user visits the site or can change based on specific user actions.
Such dynamic data cannot be fetched using BeautifulSoup and Requests alone.
However, Selenium allows us to execute JavaScript on web pages and retrieve dynamic data.
Let's go through the code step by step.
If any part of the code is difficult to understand, feel free to ask our AI Tutor for help.
1. Importing Necessary Packages
- `selenium`: Fetches dynamic data from web pages.
- `pandas`: Organizes and processes data in tabular form.
- `webdriver`: Controls web browsers using Selenium.
- `By`: Specifies how to locate elements on the web page.
- `ActionChains`: Performs mouse and keyboard actions on the web page.
- `WebDriverWait`: Waits up to a timeout for a condition to be met.
- `EC` (`expected_conditions`): Provides conditions to wait for, such as an element appearing on the page.
2. Open a Web Browser
```python
# Launch Chrome WebDriver to open a browser window
driver = webdriver.Chrome()

# Navigate to the 'Markets' page on Yahoo Finance
driver.get('https://finance.yahoo.com/markets/')
```
Launch the Chrome browser and navigate to the 'Markets' page on Yahoo Finance.
For reference, Selenium supports various browsers like Chrome and Firefox.
3. Prepare to Wait for the Page to Load
```python
# Create an explicit wait with a maximum wait time of 10 seconds
wait = WebDriverWait(driver, 10)
```
Create a `WebDriverWait` object that waits up to 10 seconds for a condition to be met. Note that creating the object does not wait by itself; the waiting happens when we call `wait.until()`. It is important to allow time for the page to load, as elements may take a moment to be ready.
4. Find the 'Americas' Section and Scroll to It
```python
# Find the h3 tag with the text 'Americas'
americas_section = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[text()='Americas']")))

# Scroll to the 'Americas' section by moving the mouse to it
actions = ActionChains(driver)
actions.move_to_element(americas_section).perform()
```
XPath is an expression language used to locate particular elements in XML and HTML documents, and it is one of the methods Selenium provides for locating elements on a web page.
For example, the `h3` element with the text 'Americas' can be expressed as `//h3[text()='Americas']`.
`move_to_element` moves the mouse to the 'Americas' section, scrolling the screen to it.
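To see how an XPath-style query picks out an element, here is a small self-contained sketch using Python's built-in `xml.etree.ElementTree`. Note that ElementTree only supports a limited XPath subset, so `[.='Americas']` stands in for the full-XPath `[text()='Americas']` used with Selenium; the markup is hypothetical and merely mimics the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for part of the Yahoo Finance page
page = """
<body>
  <h3>Europe</h3>
  <h3>Americas</h3>
</body>
"""

root = ET.fromstring(page)
# [.='Americas'] matches an element whose text content is 'Americas'
heading = root.find(".//h3[.='Americas']")
print(heading.text)  # Americas
```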
5. Find the Table in the 'Americas' Section
```python
# Find the parent section of the 'Americas' section containing the table
parent_section = americas_section.find_element(By.XPATH, "./ancestor::section[contains(@data-testid, 'world-indices')]")

# Find the table
table = parent_section.find_element(By.XPATH, ".//table")
```
Find the `table` element within the parent `section` tag containing the 'Americas' section.
`"./ancestor::section[contains(@data-testid, 'world-indices')]"` is the XPath that locates the ancestor `section` element of the 'Americas' heading.
This table contains the data we need (e.g., index names, prices, etc.).
6. Collect Table Headers and Data
```python
# Extract headers from the table
headers = [header.text for header in table.find_elements(By.XPATH, ".//th")]

# Extract rows from the table
rows = table.find_elements(By.XPATH, ".//tbody/tr")
```
`table.find_elements(By.XPATH, ".//th")` locates the `th` tags within the table to extract the headers.
`th` (table header) tags represent the column names (e.g., "Name", "Price") in the table; store these headers in a list.
Then use `.//tbody/tr` to extract each data row (`tr`, table row) from the table body.
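The header-and-row extraction can be tried outside the browser on a small hypothetical table; ElementTree's `findall` accepts the same `.//th` and `.//tbody/tr` path expressions that we pass to Selenium here:

```python
import xml.etree.ElementTree as ET

# A made-up table that mimics the structure of the Yahoo Finance markup
table_html = """
<table>
  <thead><tr><th>Symbol</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>^GSPC</td><td>5,000.00</td></tr>
    <tr><td>^DJI</td><td>40,000.00</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(table_html)
headers = [th.text for th in table.findall(".//th")]  # column names from th tags
rows = table.findall(".//tbody/tr")                   # one element per data row

print(headers)    # ['Symbol', 'Price']
print(len(rows))  # 2
```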
7. Extract 'Name' and 'Price' values from Each Row and Save
```python
# Initialize a list to store the table data
table_data = []

# Extract column data for each row and add it to the list
for row in rows:
    # Extract the cell (td) elements in this row
    columns = row.find_elements(By.XPATH, ".//td")
    row_data = {}  # Initialize an empty dictionary

    # Assume that the 'headers' and 'columns' lists are of the same length
    for i in range(len(headers)):
        header = headers[i]             # i-th header
        column_value = columns[i].text  # text of the i-th cell
        # Add to the dictionary with the header as key and column_value as value
        row_data[header] = column_value

    # Add the row's data to the list
    table_data.append(row_data)
```
Iterate over each row and extract the cell values inside the `td` tags.
Save this data in the `row_data` dictionary, where the key is the `header` and the value is `column_value`.
Finally, append the dictionary to the `table_data` list.
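The same pairing logic can be exercised with plain lists standing in for Selenium elements (the values here are made up for illustration):

```python
headers = ["Symbol", "Price"]
rows = [["^GSPC", "5,000.00"], ["^DJI", "40,000.00"]]

table_data = []
for row in rows:
    row_data = {}
    # Pair the i-th header with the i-th cell value
    for i in range(len(headers)):
        row_data[headers[i]] = row[i]
    table_data.append(row_data)

print(table_data)
# [{'Symbol': '^GSPC', 'Price': '5,000.00'}, {'Symbol': '^DJI', 'Price': '40,000.00'}]
```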
8. Convert to DataFrame Using pandas
```python
# Convert the extracted data to a pandas DataFrame
df = pd.DataFrame(table_data)

# Keep only the 'Symbol' and 'Price' columns
df_filtered = df[['Symbol', 'Price']]

# Print the filtered data
print(df_filtered)
```
Convert the extracted data to a `pandas.DataFrame`.
The data will be stored in the DataFrame as shown below.
| Symbol | Price |
| --- | --- |
| ... | ... |
| ... | ... |
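A minimal, self-contained sketch of this conversion, using hypothetical values in place of the scraped rows:

```python
import pandas as pd

# Stand-in for the table_data list built in the previous step
table_data = [
    {"Symbol": "^GSPC", "Price": "5,000.00"},
    {"Symbol": "^DJI", "Price": "40,000.00"},
]

df = pd.DataFrame(table_data)          # one row per dictionary
df_filtered = df[["Symbol", "Price"]]  # keep only the two columns
print(df_filtered)
```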
9. Select and Sort 'Symbol' and 'Price' Columns
```python
# Select the 'Symbol' and 'Price' columns and sort the rows by 'Symbol'
df_filtered = df[['Symbol', 'Price']].sort_values(by='Symbol')
```
Select the 'Symbol' and 'Price' columns from the data frame and sort the rows based on the 'Symbol' column.
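A quick standalone sketch of selecting and sorting, again with made-up values:

```python
import pandas as pd

# Hypothetical index data (values for illustration only)
df = pd.DataFrame({
    "Symbol": ["^IXIC", "^DJI", "^GSPC"],
    "Price": ["16,000.00", "40,000.00", "5,000.00"],
})

# Select the two columns, then sort the rows alphabetically by 'Symbol'
df_sorted = df[["Symbol", "Price"]].sort_values(by="Symbol")
print(df_sorted["Symbol"].tolist())  # ['^DJI', '^GSPC', '^IXIC']
```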