Essential HTML Knowledge for Web Crawling
To perform web crawling, you must first understand the structure of web pages.
One of the most crucial concepts in crawling is HTML.
Let's briefly explore HTML, the 'skeleton' of web pages.
Reference: For more detailed information about HTML, check out the Introduction to Web + Build Your Own Website course.
What is HTML?
HTML stands for HyperText Markup Language and is the basic language used to create web pages.
Simply put, it defines the structure and content of a web page.
Web browsers interpret this HTML to display the web page on the screen.
Essential HTML Knowledge for Web Crawling
When crawling the web, the most important thing is to accurately locate the data you want.
Let's look at the key elements of HTML that you need to know.
1. Tags: Building Blocks of Web Pages
HTML consists of tags. Tags are written in the format <tag-name> and define each element of a web page.
For example, a headline is expressed with an <h1> tag and a paragraph with a <p> tag.
There are many types of tags, each with its own meaning and function.
<h1>This is a headline</h1>
<p>This is a paragraph.</p>
As shown in the code above, content displayed on the screen typically starts with an <opening-tag> and ends with a </closing-tag>.
The content of the tag sits between the <opening-tag> and the </closing-tag>.
The opening tag, its content, and the closing tag together form a unit called an element.
In crawling, it is essential to identify which element contains the information you want.
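To make the idea of an element concrete, here is a minimal sketch that pulls the text out of the <h1> and <p> elements shown earlier. It uses only Python's standard-library html.parser module; the class name ElementTextCollector and the variable names are made up for this illustration, and the sketch assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class ElementTextCollector(HTMLParser):
    """Collect (tag, text) pairs for a chosen set of tag names.

    Hypothetical helper for illustration; assumes wanted tags are not nested.
    """

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.current_tag = None  # wanted tag we are currently inside, if any
        self.buffer = []         # text pieces seen inside that tag
        self.elements = []       # finished (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, e.g. <h1>
        if tag in self.wanted_tags:
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        # Called for the text between an opening and a closing tag
        if self.current_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Called for every closing tag, e.g. </h1>
        if tag == self.current_tag:
            self.elements.append((tag, "".join(self.buffer).strip()))
            self.current_tag = None


sample_html = "<h1>This is a headline</h1><p>This is a paragraph.</p>"

collector = ElementTextCollector(["h1", "p"])
collector.feed(sample_html)
elements = collector.elements
print(elements)
# [('h1', 'This is a headline'), ('p', 'This is a paragraph.')]
```

In practice, crawlers often rely on a dedicated parsing library instead of writing a parser class by hand, but the steps are the same: find the opening tag, read the content, and stop at the closing tag.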
2. Attributes: Defining the Nature of Tags
Tags can have attributes. Attributes define the nature of the tag or provide additional information.
For example, the <a> tag used to create links has an href attribute to specify the destination of the link and a target attribute to specify whether to open it in a new window.
<a href="https://www.codefriends.net/" target="_blank">CodeFriends</a>
The HTML code above specifies that the CodeFriends link opens in a new window or tab.
In web crawling, attributes are frequently used to locate specific elements based on their values.
In particular, the class attribute (shared by elements that use the same style) and the id attribute (a unique identifier for a single element) are the most common ways to select elements.
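As a rough illustration of locating elements by attribute value, the sketch below collects the href of every <a> tag that carries a given class. It again uses only the standard-library html.parser module. The LinkFinder class, the class names "external" and "internal", and the /about link are assumptions invented for this example; they do not come from any real page.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect href values of <a> tags, optionally filtered by class name.

    Hypothetical helper for illustration only.
    """

    def __init__(self, class_name=None):
        super().__init__()
        self.class_name = class_name
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        # A class attribute may hold several space-separated names
        classes = (attr_map.get("class") or "").split()
        if self.class_name is None or self.class_name in classes:
            href = attr_map.get("href")
            if href:
                self.links.append(href)


# Made-up sample markup: the class names and the second link are assumptions
sample_html = """
<a class="external" href="https://www.codefriends.net/" target="_blank">CodeFriends</a>
<a class="internal" href="/about">About</a>
"""

finder = LinkFinder(class_name="external")
finder.feed(sample_html)
external_links = finder.links
print(external_links)
# ['https://www.codefriends.net/']
```

Filtering on class keeps only the links you care about; matching on id works the same way, except that an id should identify a single element on the page.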
3. DOM Structure: The Map of a Web Page
Browsers represent a web page as a DOM (Document Object Model), a tree structure in which HTML elements are nested hierarchically.
Typically, a web page has elements such as <body> nested within the <html> tag, and within the <body> tag there may be <div> elements, and so on.
Understanding the DOM structure helps you easily identify where specific elements are located on the page.
<html>
  <body>
    <div>
      <p>Paragraph content</p>
    </div>
  </body>
</html>
When crawling, you can use this hierarchy to describe exactly where data lives (for example, the <p> inside the <div> inside <body>) and extract only that element.
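One way to see the hierarchy in action is to record the chain of ancestor tags for each piece of text. The sketch below does this with a simple stack on top of the standard-library html.parser module. The PathTracker class is a hypothetical helper, and it assumes well-formed HTML in which every opening tag has a matching closing tag; tags without a closing tag, such as <br>, would need extra handling.

```python
from html.parser import HTMLParser


class PathTracker(HTMLParser):
    """Record the nesting path (e.g. 'html > body > div > p') of each text node.

    Hypothetical helper; assumes every opening tag has a matching closing tag.
    """

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.text_paths = []  # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Leave the element only if it is the innermost open tag
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the whitespace between tags
            self.text_paths.append((" > ".join(self.stack), text))


sample_html = """
<html>
  <body>
    <div>
      <p>Paragraph content</p>
    </div>
  </body>
</html>
"""

tracker = PathTracker()
tracker.feed(sample_html)
text_paths = tracker.text_paths
print(text_paths)
# [('html > body > div > p', 'Paragraph content')]
```

The recorded path reads like directions through the page: start at <html>, go into <body>, then <div>, then <p>.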