Repeated Data Collection with Web Crawling
How would you monitor news articles periodically or track the price changes of specific products in an online store? Web Crawling is a technology that can automate these tasks by collecting data repeatedly.
Web crawling refers to the process of visiting web pages through automated scripts and extracting specific data from those pages.
Note: Collecting data from a single page is called Web Scraping. Strictly speaking, Web Crawling refers to systematically exploring multiple websites and collecting data, while Web Scraping refers to extracting specific data from a single page. In this lesson, we will use the term Web Crawling, since we cover the overall process of data collection.
How Does Web Crawling Work?
Web crawling refers to the process by which a program called a Web Crawler (or Spider) visits websites and automatically collects the contents of web pages.
A web crawler operates through the following steps:
- Web Page Request: The crawler requests the URL of the web page to fetch that page's HTML source.
- HTML Parsing: Parsing means analyzing the HTML source to understand the structure of the web page. The crawler analyzes the HTML tags and extracts the content of the page.
- Data Extraction: The crawler extracts the necessary data from the web page and either stores it or processes it to provide to the user.
A web crawler typically starts from one page and sequentially visits other pages by following the links included on that page.
In this process, the crawler downloads and saves the HTML content or indexes it.
The result of crawling is usually stored in a database or file that reflects the structure and content of the website.
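To make the link-following idea more concrete, here is a minimal sketch of such a crawler, assuming the same requests and BeautifulSoup libraries used later in this lesson. The starting URL and the 5-page limit are only illustrative, and a real crawler would also need error handling and polite request delays.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Start from one page and follow the links found on each visited page.
# 'https://example.com' is a placeholder starting point for this sketch.
start_url = 'https://example.com'
visited = set()
to_visit = [start_url]

# Visit at most 5 pages so the example stays small
while to_visit and len(visited) < 5:
    url = to_visit.pop(0)
    if url in visited:
        continue

    # 1. Request the page and parse its HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    visited.add(url)
    print('Visited:', url)

    # 2. Collect the links (<a> tags) on this page and queue them for later visits
    for link in soup.find_all('a', href=True):
        next_url = urljoin(url, link['href'])
        if next_url not in visited:
            to_visit.append(next_url)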
Simple Web Crawling Code Example
Now let's collect some data in practice with a simple piece of web crawling code. In Python, we can use the requests and BeautifulSoup libraries to fetch and parse data from web pages.
Note: To run the practice code on your own computer, you need to install the requests and BeautifulSoup libraries with the command pip install requests beautifulsoup4.
import requests
from bs4 import BeautifulSoup
# 1. Fetching the HTML source of the example.com web page
url = 'https://example.com'
response = requests.get(url)
# 2. Parsing the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# 3. Extracting paragraph (p) tags from the website
paragraphs = soup.find_all('p')

# 4. Printing the text of the paragraph (p) tags
for paragraph in paragraphs:
    print(paragraph.text)
- Step 1: Use requests.get(url) to request the web page at the specified URL and fetch that page's HTML source.
- Step 2: Parse the HTML into a BeautifulSoup object, creating a structure that can be navigated.
- Step 3: Extract all paragraphs (<p> tags) from the page using soup.find_all('p').
- Step 4: Loop over the extracted paragraphs and print their text.
The above code example performs a simple web crawling task (strictly speaking, web scraping) that extracts the text of all paragraphs (<p> tags) from the 'example.com' web page.
In practice, extracting the titles of news articles from a news platform or the prices of specific products from an online store requires considerably more complex code.
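As a rough illustration only, the sketch below shows how targeting specific elements with CSS selectors might look in such a case. The URL 'https://news.example.com' and the 'article-title' class name are hypothetical; for a real site, you would first inspect its HTML to find the actual tags and class names to target.

import requests
from bs4 import BeautifulSoup

# Hypothetical example: collecting article titles from a news listing page.
# The URL and the 'article-title' class below are placeholders, not a real site.
url = 'https://news.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# select() accepts a CSS selector, so elements can be targeted by tag and class
for heading in soup.select('h2.article-title'):
    print(heading.get_text(strip=True))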
In this lesson, we have covered the basic concept of web crawling, why it is useful, and how to write simple crawling code.
In the next lesson, we will learn about the legal and ethical considerations of web crawling.