---
id: crawling-basic
title: How to Collect Data with Code, Crawling
description: Definition and Principles of Web Crawling
tags:
- Python
- Web Crawling
- Use Cases
sidebar_position: 1
isPublic: true
---

# How to Collect Data with Code, Crawling

Have you heard of the term `Crawling`?

Crawling refers to the process of **automatically collecting data** from websites. This task is performed by an automated software (bot) known as a Crawler (or Spider), which visits various web pages to extract desired data.

Crawling is used in many fields: search engines use it for **data collection** and **indexing** (building a reference system that makes specific information quick and easy to find), and online price comparison sites use it to gather product information.

When using crawling for personal or commercial purposes beyond learning or non-profit use, it's important to respect the website's `Terms of Service` and pay close attention to `legal issues` such as privacy and copyright.

<br />

## Crawling Process

1. `Requesting and Collecting Web Pages`: The crawler sends an HTTP request to the target URL and receives the page's content from the server as HTML.

2. `Data Parsing`: The HTML of the received page is parsed to extract the needed data, such as text, links, and images.

3. `Data Storage`: The extracted data is saved in a database or file.

4. `Repetition`: Steps 1-3 are repeated for new web pages until a stopping condition is met (see the sketch below).
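
As a concrete illustration of steps 1-4, here is a minimal sketch using the `requests` and Beautiful Soup libraries. The URL and the data being extracted are hypothetical placeholders; replace them with a site you are allowed to crawl.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical starting URL; replace with a page you are allowed to crawl.
BASE_URL = "https://example.com/articles?page={}"

rows = []
for page in range(1, 4):  # 4. Repetition: crawl pages 1-3
    # 1. Requesting and collecting: fetch the page as HTML
    response = requests.get(BASE_URL.format(page), timeout=10)
    response.raise_for_status()

    # 2. Data parsing: extract link text and URLs from the HTML
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a"):
        rows.append({"text": link.get_text(strip=True), "href": link.get("href")})

# 3. Data storage: save the extracted data to a CSV file
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)
```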

<br />

## Technologies Used

- `HTML`: The language that defines the structure and content of web pages.

- `HTTP Request`: A request sent to a web server to retrieve web page data. In Python, the `requests` library is commonly used to send HTTP requests.

- `Parsing`: Parsing refers to **syntax analysis**: extracting the desired data from a given target. In Python, libraries such as Beautiful Soup and lxml are often used for HTML parsing (see the example below).
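
To see parsing on its own, here is a small sketch that runs Beautiful Soup on an in-memory HTML string, so no HTTP request is needed. The HTML and tag contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Example HTML; in real crawling this would come from an HTTP response
html = """
<html>
  <body>
    <h1>News</h1>
    <ul>
      <li><a href="/a">First article</a></li>
      <li><a href="/b">Second article</a></li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())           # prints "News"
for a in soup.find_all("a"):        # every <a> tag in the document
    print(a.get_text(), a["href"])  # link text and URL
```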

<br />

## Use Cases of Web Crawling

1. `Search Engine Optimization (SEO) and Indexing`

- Search engines (Google, Bing, etc.) use web crawlers to collect web pages, then index and rank these pages on search engine result pages based on the collected data.

2. `Data Analysis and Market Research`

- Analyzing data from commercial websites to study market trends, price changes, product reviews, etc.

3. `Social Media Analysis`

- Collecting data from social media platforms to analyze user opinions, trends, and social reactions.

4. `Academic Research`

- Researchers use web crawling to collect academic materials, open datasets, and news articles for use in their research.

5. `Automated Monitoring`

- Continuously monitoring real-time data such as stock prices, exchange rates, and weather information to track changes.

<br />

## Practice

Press the _`Run Code`_ button on the right side of the screen to review the crawling results or edit the code!

Want to learn more?

Join CodeFriends Plus membership or enroll in a course to start your journey.