Regular Expressions for Web Crawling

A regular expression (regex) is a pattern that describes a set of strings; it is used to pick out the parts of string data that match that pattern.

They are commonly used to find, replace, or validate specific patterns within strings.

In this lesson, we will cover the basic concepts of regular expressions and show how to filter the information you need out of crawled data.

Basic Syntax of Regular Expressions

Regular expressions define specific patterns by combining various symbols and characters.

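For example, the regular expression "^\d{3}-\d{3}-\d{4}$" matches a string that consists entirely of a phone number in the form "123-456-7890". Without the ^ and $ anchors, the same pattern would also find such a number inside a longer string.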
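As a small illustration of what the anchors do, the sketch below checks the anchored pattern against a few sample strings. The sample strings are made up for this example; only a string that is nothing but the phone number is expected to match.

```python
import re

# The anchored pattern from the text: the whole string must be a phone number.
anchored = r'^\d{3}-\d{3}-\d{4}$'

# Sample inputs (made up for this illustration).
samples = ["123-456-7890", "Call 123-456-7890 now", "123-45-67890"]

# Map each sample to True/False depending on whether the pattern matches.
results = {s: bool(re.search(anchored, s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```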

Common symbols and characters used in regular expressions include:

. : Matches any single character except a newline.
^ : Matches the start of the string.
$ : Matches the end of the string.
* : Matches zero or more repetitions of the preceding element.
+ : Matches one or more repetitions of the preceding element.
[] : Matches any one of the characters inside the brackets.
\d : Matches any digit.
\w : Matches any word character (a letter, a digit, or an underscore).
\s : Matches any whitespace character.
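The symbols above can be tried out one at a time. The following is a minimal sketch using Python's re module; the sample strings and variable names are invented for illustration.

```python
import re

text = "Order 66 shipped on 2024-01-15 to user_42"

# \d+ : one or more digits in a row
numbers = re.findall(r'\d+', text)

# \w+ : one or more word characters (letters, digits, underscore)
words = re.findall(r'\w+', text)

# [aeiou] : any one of the characters listed in the brackets
vowels = re.findall(r'[aeiou]', "regex")

# b* : zero or more repetitions of the preceding 'b'
star = re.findall(r'ab*', "a ab abbb")

# . : any single character (except a newline)
dot = re.findall(r'c.t', "cat cot c-t ct")

# \s+ : one or more whitespace characters, used here as a separator
parts = re.split(r'\s+', "a  b\tc")

print(numbers, words, vowels, star, dot, parts, sep="\n")
```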

Using Regular Expressions in Python

In Python, you can handle regular expressions using the re module.

The re module provides functions for searching, matching, and substituting text. It is part of Python's standard library, so no separate installation is needed.

Using Regular Expressions in Python
import re

# Regular expression pattern
pattern = r'\d{3}-\d{3}-\d{4}'

# Text to search within
text = "Customer service contact: Please reach out to 123-456-7890."

# Search the text; re.search returns a Match object, or None if nothing matches
match = re.search(pattern, text)

# Check if the pattern matches
if match:
    # Output the found number: 123-456-7890
    print(f"Found number: {match.group()}")
else:
    print("No number found.")

This code searches for a phone number pattern within the string and prints it out when found.
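Besides searching, the re module also covers substitution and validation. The sketch below reuses the same phone number pattern with re.sub() and re.fullmatch(); the sample text and the replacement label are assumptions made for this example.

```python
import re

phone_pattern = r'\d{3}-\d{3}-\d{4}'
text = "Call 123-456-7890 or 987-654-3210 for support."

# Substitution: replace every phone number in the text with a placeholder.
masked = re.sub(phone_pattern, "[phone removed]", text)

# Validation: fullmatch succeeds only if the entire string fits the pattern.
is_valid = re.fullmatch(phone_pattern, "123-456-7890") is not None
is_invalid = re.fullmatch(phone_pattern, "123-456-7890 ext. 5") is not None

print(masked)
print(is_valid, is_invalid)
```

Using re.fullmatch() for validation avoids having to add ^ and $ to the pattern by hand.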

Extracting Email Addresses from HTML Data Using Regular Expressions

When you want to extract email addresses from a specific web page, you can use the following regular expression.

Extracting Email Addresses Using Regular Expressions
import re
import requests
from bs4 import BeautifulSoup

# URL to crawl
url = 'https://www.example.com/'

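# Fetch HTML content (the timeout avoids hanging on an unresponsive server)
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors such as 404 or 500
soup = BeautifulSoup(response.text, 'html.parser')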

# Extract text from HTML
text = soup.get_text()

# Regular expression pattern: find email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find email addresses
emails = re.findall(email_pattern, text)

# Output the extracted email addresses
for email in emails:
    print(f"Found email address: {email}")

The code above fetches the page, extracts its visible text, and uses re.findall() to return a list of every substring that matches the email pattern, then prints each address. The pattern is a common simplification and does not cover every valid email address format.
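Because the example above depends on a live page, here is an offline sketch that runs the same pattern over a made-up HTML snippet. As an assumption for illustration, it searches the raw HTML rather than the extracted text, so addresses that appear only inside mailto: links are also found, and it removes duplicates while keeping the original order.

```python
import re

# Made-up HTML standing in for a downloaded page.
sample_html = """
<p>Contact: <a href="mailto:support@example.com">support@example.com</a></p>
<p>Sales: sales@example.org</p>
<a href="mailto:jobs@example.net">Careers</a>
"""

# Same simplified email pattern as in the lesson.
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# findall returns every match, including repeated addresses.
found = re.findall(email_pattern, sample_html)

# dict.fromkeys drops duplicates while preserving first-seen order.
unique_emails = list(dict.fromkeys(found))

for email in unique_emails:
    print(f"Found email address: {email}")
```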
