Essentials for Taming AI, the Dataset
A dataset is a collection of data gathered and organized for a specific purpose, such as training and validating an AI model.
The JSONL file created for fine-tuning in the previous lesson is also one form of a dataset.
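To make the JSONL connection concrete, here is a minimal sketch of reading such a file as a dataset. It assumes each line holds one JSON object. The `prompt` and `completion` field names and the `parse_jsonl` helper are illustrative assumptions, not the exact format from the previous lesson.

```python
import json

# Hypothetical JSONL content: one JSON object per line (field names are assumed).
jsonl_text = """\
{"prompt": "What is a dataset?", "completion": "A collection of data organized for a purpose."}
{"prompt": "What is JSONL?", "completion": "A format with one JSON object per line."}
"""


def parse_jsonl(text):
    """Parse JSONL text into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_jsonl(jsonl_text)
print(len(records))          # number of examples in the dataset
print(records[0]["prompt"])  # the first example's input text
```

Each parsed dictionary is one example, so the whole file can be treated as a list of records.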
What kind of data is included in a dataset?
A dataset can contain a variety of data forms, including tables, images, text, and time-series data.
- Tabular Data: Table-formatted data made up of rows and columns, such as Excel files (.xlsx) and CSV files (.csv).
- Image Data: Image files such as PNG and JPG, mainly used in computer vision.
- Text Data: Data in the form of documents, sentences, and words, widely used in Natural Language Processing (NLP).
- Time Series Data: Data collected over time, such as stock prices and temperature readings.
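As a rough illustration of these forms, the sketch below represents tabular, text, and time series data with Python's standard library only. All values are made up for the example, and image data is left out because loading images usually requires a third-party library.

```python
import csv
import io
from datetime import date

# Tabular data: rows and columns, parsed here from CSV text (made-up values).
csv_text = "name,age\nAlice,30\nBob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Text data: plain strings such as sentences or documents.
sentences = ["The weather is nice today.", "I enjoyed this movie."]

# Time series data: each value is paired with the date it was recorded.
temperatures = [
    (date(2021, 1, 1), 3.5),
    (date(2021, 1, 2), 4.1),
    (date(2021, 1, 3), 2.8),
]

# Find the date with the highest recorded temperature.
warmest_day = max(temperatures, key=lambda pair: pair[1])[0]

print(rows[0])                    # first table row as a dictionary
print(len(sentences[0].split()))  # word count of the first sentence
print(warmest_day)
```

Note that `csv.DictReader` returns every cell as a string, so numeric columns such as `age` need converting before they are used as model input.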
What is the general structure of a dataset?
Most datasets are made up of the following three parts:
- Feature: The data that is fed into the AI model and that the model learns from. In a chatbot model, the user's 'question' could be a feature, while in an image classification model, the 'photo' could be a feature.
- Label: The answer or result that corresponds to a feature. If a photo contains a cat, the label of that photo would be 'cat'.
- Metadata: Like a manual for the dataset, it provides additional information such as the source of the data and when it was created.
| Feature | Label | Metadata |
|---|---|---|
| Image file path: /images/cat.jpg | Cat | File size: 3MB, Capture date: 2021-01-15, Source: User Upload |
| Text: "How are you feeling today?" | Feeling inquiry | Length: 26 characters, Author: Admin, Creation date: 2021-02-01 |
| Numeric data: [2, 14, 15, 23] | Sum of sequence: 54 | Data type: Integer array, Input date: 2021-03-22 |
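The feature, label, and metadata structure can be sketched in code as follows. This is one possible representation, not a standard one: the `Example` class and its field names are hypothetical, and the records reuse values from the table above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Example:
    """One dataset record: the model input, the expected answer, and extra context."""
    feature: Any
    label: str
    metadata: dict = field(default_factory=dict)


dataset = [
    Example("/images/cat.jpg", "Cat",
            {"source": "User Upload", "capture_date": "2021-01-15"}),
    Example("How are you feeling today?", "Feeling inquiry",
            {"author": "Admin", "creation_date": "2021-02-01"}),
    Example([2, 14, 15, 23], "Sum of sequence: 54",
            {"data_type": "Integer array", "input_date": "2021-03-22"}),
]

# A model typically trains on (feature, label) pairs;
# metadata is kept alongside for bookkeeping and filtering.
features = [ex.feature for ex in dataset]
labels = [ex.label for ex in dataset]

print(labels)
print(sum(features[2]))  # the numeric feature adds up to the value in its label
```

Keeping metadata in its own field makes it easy to filter records, for example by source or date, without touching what the model actually sees.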
Commonly Used Datasets
- MNIST Dataset: A dataset composed of handwritten digit images, frequently used in the field of computer vision.
- Iris Dataset: A tabular dataset used for predicting Iris flower species.
- IMDB Review Dataset: A dataset of movie review texts used for sentiment analysis and other applications.