Data Formats Used for Training AI
To train AI models, data must be transformed into formats that AI systems can understand.
In this lesson, we'll explore the main data file formats used for training AI: CSV, JSON, and XML.
CSV
CSV stands for Comma-Separated Values and is used to store and transfer table-like data.
Each row represents a single data entry, while each column represents a specific attribute of the data.
Within each row, the values are separated by commas (,).
For example, a CSV file storing math and English scores of students by name can be represented as follows:
Name,Math,English
John Doe,85,90
Jane Smith,88,80
CSV files are stored as text files with the .csv file extension and can be easily opened and edited in various data management programs like Microsoft Excel, Google Sheets, and database programs.
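As a minimal sketch of how such a file might be loaded before training, the example below reads the sample scores with Python's standard csv module. The variable names are illustrative assumptions, and io.StringIO stands in for a real .csv file on disk.

```python
import csv
import io

# Sample data copied from the lesson; io.StringIO stands in for an open .csv file.
csv_text = "Name,Math,English\nJohn Doe,85,90\nJane Smith,88,80\n"

# DictReader uses the first row as column names and yields one dict per data row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values are always read as strings, so convert the scores to integers.
for row in rows:
    row["Math"] = int(row["Math"])
    row["English"] = int(row["English"])

print(rows[0])
```

With a real file, open("scores.csv", newline="") would replace io.StringIO(csv_text); the file name here is hypothetical.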
JSON
JSON (JavaScript Object Notation) is commonly used for data storage and exchange in web and mobile applications.
JSON consists of objects and arrays, with objects wrapped in curly braces { } and arrays in square brackets [ ].
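In the example below, the outer square brackets hold an array, and each pair of curly braces holds one object. Standard JSON does not allow comments, so the file contains only the data itself.
[
  {
    "Name": "John Doe",
    "Math": 85,
    "English": 90
  },
  {
    "Name": "Jane Smith",
    "Math": 88,
    "English": 80
  }
]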
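To illustrate how this structure maps onto program data, here is a small sketch that parses the same records with Python's standard json module. The variable names are assumptions chosen for this example.

```python
import json

# The student records from the lesson, written as a JSON array of objects.
json_text = """
[
  {"Name": "John Doe", "Math": 85, "English": 90},
  {"Name": "Jane Smith", "Math": 88, "English": 80}
]
"""

# json.loads turns a JSON array into a Python list and each object into a dict.
students = json.loads(json_text)

# Numbers keep their numeric type, so no manual conversion is needed.
math_scores = [student["Math"] for student in students]
print(math_scores)
```

Unlike CSV, the parsed values already carry their types: the names are strings and the scores are integers.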
A data file format where multiple JSON objects are listed, one per line, is called JSONL (JSON Lines).
{"Name": "John Doe", "Math": 85, "English": 90}
{"Name": "Jane Smith", "Math": 88, "English": 80}
JSONL is often used for training data, for example when fine-tuning OpenAI's models or preparing datasets for other machine learning models.
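Because each line of a JSONL file is a complete JSON object, it can be parsed one line at a time. The sketch below shows one way to do this with the standard json module; the variable names are illustrative, and the string stands in for the contents of a .jsonl file.

```python
import json

# Two records from the lesson, one JSON object per line.
jsonl_text = (
    '{"Name": "John Doe", "Math": 85, "English": 90}\n'
    '{"Name": "Jane Smith", "Math": 88, "English": 80}\n'
)

# Parse each non-empty line independently; no enclosing array is required.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(len(records))
```

Reading line by line means a large dataset can be processed as a stream instead of being loaded into memory all at once.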
XML
XML (eXtensible Markup Language) is primarily used to represent hierarchical data structures.
The key elements of XML include:
- Tags: Names enclosed in angle brackets < > that mark the data's structure.
  - Tags consist of opening and closing tags.
  - An opening tag is <tagname>, and a closing tag is </tagname>.
- Attributes: Used to provide additional information within a tag.
  - To add attributes to a tag, use <tagname attributename="attributevalue">.
  - Example: <Student gender="Male"> adds a gender attribute to the Student tag.
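The tag and attribute parts described above can be inspected with Python's standard xml.etree.ElementTree module. This is a small sketch; the element content "John Doe" is added here only so the tag has something to enclose.

```python
import xml.etree.ElementTree as ET

# One element with an opening tag, an attribute, text content, and a closing tag.
element = ET.fromstring('<Student gender="Male">John Doe</Student>')

tag = element.tag               # the tag name between the angle brackets
gender = element.get("gender")  # the value of the gender attribute
text = element.text             # the text between the opening and closing tags

print(tag, gender, text)
```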
Below is how the JSON example is represented in XML.
<StudentList>
  <Student>
    <Name>John Doe</Name>
    <Math>85</Math>
    <English>90</English>
  </Student>
  <Student>
    <Name>Jane Smith</Name>
    <Math>88</Math>
    <English>80</English>
  </Student>
</StudentList>
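As a rough sketch of how this hierarchy could be turned into records for training, the example below walks the StudentList with the standard xml.etree.ElementTree module. The variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The StudentList document from the lesson.
xml_text = """<StudentList>
  <Student>
    <Name>John Doe</Name>
    <Math>85</Math>
    <English>90</English>
  </Student>
  <Student>
    <Name>Jane Smith</Name>
    <Math>88</Math>
    <English>80</English>
  </Student>
</StudentList>"""

root = ET.fromstring(xml_text)

# Each <Student> child becomes one dict; element text is a string, so cast scores.
students = []
for student in root.findall("Student"):
    students.append({
        "Name": student.findtext("Name"),
        "Math": int(student.findtext("Math")),
        "English": int(student.findtext("English")),
    })

print(students)
```

The result has the same shape as the parsed JSON example: a list with one dictionary per student.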
When training image-related AI models, image file formats like .jpg and .png are used.
Image files consist of pixel values, and AI models interpret these pixel values to recognize and classify images.
The data file formats used for training AI models vary, and the correct format should be chosen according to the model's design strategy.