
Character Encoding and Data Processing

Encoding is the process of converting data into a specific format.

Encoding is essential for storing, processing, and transmitting data in computer systems, transforming various types of information (e.g., text, images, audio) into a format that computers can understand.

Character encoding is the set of rules for converting characters or symbols into numbers, and ultimately bytes, that a computer can store and process.

Since computers fundamentally understand only numbers, they must convert diverse character systems used by humans (e.g., alphabets, Chinese characters, Arabic numerals) into numbers for storage, processing, and transmission.

Character encoding defines these conversion rules. Common encodings include ASCII, UTF-8, and ISO-8859-1.
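As a small illustration of how the same character maps to different bytes under different encodings, the sketch below encodes three sample characters with Python's built-in str.encode(). The helper name byte_lengths and the sample characters are made up for this example.

```python
samples = ["A", "é", "한"]

def byte_lengths(ch):
    """Return how many bytes ch needs in each encoding, or None if it cannot be encoded."""
    result = {}
    for name in ("ascii", "iso-8859-1", "utf-8"):
        try:
            result[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            # This encoding has no number assigned to the character
            result[name] = None
    return result

for ch in samples:
    print(ch, byte_lengths(ch))
```

ASCII covers only basic English characters, ISO-8859-1 adds Western European letters, and UTF-8 can represent every Unicode character using one to four bytes.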


Checking and Setting Encoding

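The Python requests library guesses the encoding from the HTTP response headers and exposes it as response.encoding. If a text response does not declare a charset, requests falls back to ISO-8859-1, which can garble non-English pages. In that case, response.apparent_encoding gives a guess based on the response body, and you can assign the correct value to response.encoding yourself.

When creating a BeautifulSoup object from raw bytes (such as response.content), you can explicitly specify the encoding with the from_encoding argument.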

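Checking and Setting Encoding
import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com')
print(response.encoding)  # Check the encoding guessed from the HTTP headers
response.encoding = 'utf-8'  # Set the encoding used to decode response.text

soup = BeautifulSoup(response.text, 'html.parser')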
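To see why setting the right encoding matters, the following sketch decodes the same bytes twice without making a network request. The variable raw is a stand-in for response.content, and the sample word is an assumption chosen for illustration.

```python
# Bytes as they might arrive from a server that sends UTF-8 text
raw = "café".encode("utf-8")

wrong = raw.decode("iso-8859-1")  # decoded with the wrong encoding
right = raw.decode("utf-8")       # decoded with the correct encoding

print(wrong)  # cafÃ©
print(right)  # café
```

The garbled result in the first case is what appears in response.text when the assumed encoding does not match the one the page was actually written in.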

Methods for Data Cleansing and Storage

To efficiently store and utilize data collected through crawling, data cleansing and storage processes are necessary.


Data Cleansing

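Use Python's string methods (e.g., strip(), replace()) to remove unnecessary whitespace and unwanted characters. To remove HTML tags, extract the text with BeautifulSoup's get_text() method rather than relying on string replacement.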
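A rough sketch of this kind of cleanup using only the standard library is shown below. The function name clean_text and the sample string are hypothetical, and the regular expression is a crude way to drop tags; an HTML parser is more robust for real pages.

```python
import html
import re

def clean_text(raw):
    """Remove tags, decode HTML entities, and normalize whitespace (rough sketch)."""
    text = re.sub(r"<[^>]+>", " ", raw)   # crude tag removal, not a full HTML parser
    text = html.unescape(text)            # e.g. &amp; becomes &, &nbsp; becomes U+00A0
    text = text.replace("\xa0", " ")      # non-breaking space becomes a plain space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()                   # trim leading and trailing whitespace

print(clean_text("  <p>Price:&nbsp;<b>1,000</b> won</p>\n"))  # Price: 1,000 won
```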


Data Storage

Crawled data can be stored as text files, CSV, JSON, or in a separate database.

Saving as JSON File
import json

data = {'name': 'Alice', 'link': 'http://example.com'}

# Save as JSON format in data.json file
with open('data.json', 'w', encoding='utf-8') as file:
    # ensure_ascii=False writes non-ASCII characters as-is instead of \u escapes
    json.dump(data, file, ensure_ascii=False)
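For comparison, the same kind of record can be written as CSV with the standard csv module. This is a minimal sketch: the helper save_csv, the file name data.csv, and the second row are made up for illustration.

```python
import csv

rows = [
    {"name": "Alice", "link": "http://example.com"},
    {"name": "Bob", "link": "http://example.org"},  # made-up sample row
]

def save_csv(path, rows):
    """Write a list of dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=["name", "link"])
        writer.writeheader()
        writer.writerows(rows)

save_csv("data.csv", rows)
```

CSV suits flat, table-like data that will be opened in a spreadsheet, while JSON is a better fit when records contain nested fields.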
