Character Encoding and Data Processing
Encoding is the process of converting data into a specific format.
Encoding is essential for storing, processing, and transmitting data in computer systems, transforming various types of information (e.g., text, images, audio) into a format that computers can understand.
Character Encoding refers to the process of converting characters or symbols into numbers that a computer can store and process.
Since computers fundamentally understand only numbers, they must convert diverse character systems used by humans (e.g., alphabets, Chinese characters, Arabic numerals) into numbers for storage, processing, and transmission.
Character encoding defines these conversion rules, and there are different character encoding methods like ASCII, UTF-8, and ISO-8859-1.
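As a rough illustration of these conversion rules, the sketch below encodes the same text with the three encodings named above. The sample string and variable names are chosen here for illustration only.

```python
# Every character has a numeric code point; an encoding maps it to bytes.
for char in ["A", "é", "한"]:
    print(char, ord(char), list(char.encode("utf-8")))

text = "café"
latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # "é" takes two bytes in UTF-8

# ASCII covers only 128 characters, so "é" cannot be encoded with it.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'
print(ascii_ok)      # False
```

The same four characters produce different byte sequences depending on the encoding, which is why the reader of the data must know which encoding the writer used.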
Checking and Setting Encoding
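The Python requests library infers the encoding from the response's HTTP headers, and the guess is usually, but not always, correct.
If the text comes out garbled, set response.encoding explicitly before passing response.text to BeautifulSoup. Alternatively, when you pass raw bytes to BeautifulSoup, you can specify the encoding with its from_encoding argument.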
import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com')
response.encoding = 'utf-8'  # Override the encoding inferred by requests
soup = BeautifulSoup(response.text, 'html.parser')
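To see why the encoding setting matters, the following sketch decodes simulated response bytes without touching the network. With requests, such bytes would come from response.content. The helper decode_body and the sample bodies are hypothetical, and the "try UTF-8 first" strategy is one possible heuristic, not what requests itself does.

```python
def decode_body(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, else use the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


# Simulated page bodies, as two different servers might send them.
utf8_body = "<p>안녕하세요</p>".encode("utf-8")
latin1_body = "<p>café</p>".encode("iso-8859-1")

# Decoding UTF-8 bytes with the wrong encoding raises no error,
# it just yields garbled text.
garbled = utf8_body.decode("iso-8859-1")
print(garbled)

print(decode_body(utf8_body))    # <p>안녕하세요</p>
print(decode_body(latin1_body))  # <p>café</p>
```

Trying UTF-8 first works reasonably well in practice because text in other encodings rarely happens to be valid UTF-8, so a strict decode failure is a useful signal.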
Methods for Data Cleansing and Storage
To efficiently store and utilize data collected through crawling, data cleansing and storage processes are necessary.
Data Cleansing
Use Python's built-in string methods (e.g., strip(), replace()) to remove unnecessary whitespace and unwanted characters. Leftover HTML tags are more reliably removed with BeautifulSoup's get_text().
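A minimal cleansing sketch using only the standard library might look like this. The function clean_text and the sample string are hypothetical, and the regular expression is a naive tag stripper that assumes simple, well-formed markup; an HTML parser is the safer choice for real pages.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Remove tags, decode HTML entities, and normalize whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)          # naive tag removal
    unescaped = html.unescape(no_tags)               # e.g. &amp; becomes &
    unescaped = unescaped.replace("\xa0", " ")       # non-breaking space to space
    return re.sub(r"\s+", " ", unescaped).strip()    # collapse runs, trim ends


sample = "  <p>Hello,&nbsp;<b>world</b> &amp; friends!</p>\n"
print(clean_text(sample))  # Hello, world & friends!
```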
Data Storage
Crawled data can be stored as text files, CSV, JSON, or in a separate database.
import json

data = {'name': 'Alice', 'link': 'http://example.com'}

# Save as JSON format in data.json file
with open('data.json', 'w', encoding='utf-8') as file:
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    json.dump(data, file, ensure_ascii=False)
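CSV storage follows the same pattern. The sketch below uses the standard csv module and writes to a temporary directory so that it leaves no files behind; the second row and the file name data.csv are made-up sample values.

```python
import csv
import os
import tempfile

rows = [
    {"name": "Alice", "link": "http://example.com"},
    {"name": "Bob", "link": "http://example.org"},  # made-up sample row
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.csv")

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=["name", "link"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to confirm the round trip.
    with open(path, encoding="utf-8", newline="") as file:
        loaded = list(csv.DictReader(file))

print(loaded)
```

Passing encoding="utf-8" on both write and read keeps non-ASCII text intact regardless of the platform's default encoding.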