Preparing to Train AI: Dataset Essentials
A dataset refers to a collection of data gathered and organized for a specific purpose, such as training and validating AI models.
The JSONL file we created for fine-tuning in previous lessons is one form of a dataset.
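As a reminder of what that looks like in practice, here is a minimal sketch of reading a JSONL dataset, where each line is one JSON object. The `prompt` and `completion` field names and the two sample records are assumptions made up for illustration; the file from the earlier lessons may use a different schema.

```python
import json

# Two hypothetical records. The "prompt"/"completion" field names are
# assumptions for illustration, not a required schema.
jsonl_text = (
    '{"prompt": "What is a dataset?", "completion": "A collection of data."}\n'
    '{"prompt": "What is a label?", "completion": "The expected answer."}\n'
)


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_jsonl(jsonl_text)
print(len(records))          # number of examples in the dataset
print(records[0]["prompt"])  # the input side of the first example
```

Each parsed record is one example in the dataset, so the length of `records` is the dataset size.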
What kind of data should be included in a dataset?
Datasets can encompass a broad range of data types such as tables, images, text, and time series data.
- Tabular Data: Data in table format consisting of rows and columns, like CSV, Excel, or SQL tables.
- Image Data: Collections of image files such as PNG or JPG, typically used in computer vision.
- Text Data: Textual data in the form of documents, sentences, or words, extensively used in Natural Language Processing (NLP).
- Time Series Data: Data collected in chronological order, such as stock market data or temperature readings over time.
What is the typical structure of a dataset?
Most datasets are composed of three key components:
- Features: The input data used to train the AI model. In a chatbot model, the user's 'question' is the feature, while in an image classification model, the 'photo' is the feature.
- Label: The answer or result the model is expected to produce for a given input. If there's a cat in a photo, the label for that photo is 'cat'.
- Metadata: Acts like a dataset's guidebook, providing additional information such as the data's source or creation date.
Features | Label | Metadata |
---|---|---|
Image Path: /images/cat.jpg | Cat | File Size: 3MB, Date Taken: 2021-01-15, Source: User Upload |
Text: "How are you feeling today?" | Sentiment Inquiry | Length: 26 characters, Author: Admin, Created Date: 2021-02-01 |
Numeric Data: [2, 14, 15, 23] | Sequence Sum: 54 | Data Type: Integer Array, Entry Date: 2021-03-22 |
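The three components can be sketched in code as one record type per example. This is only an illustrative structure: the `Example` class and the metadata key names are hypothetical, and real datasets often store the same information in columns or separate files instead.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Example:
    """One dataset entry: model input, expected answer, and bookkeeping info."""
    features: Any
    label: Any
    metadata: dict = field(default_factory=dict)


# Entries mirroring the rows of the table above (metadata keys are assumed names).
dataset = [
    Example("/images/cat.jpg", "Cat",
            {"source": "User Upload", "date_taken": "2021-01-15"}),
    Example("How are you feeling today?", "Sentiment Inquiry",
            {"author": "Admin", "created_date": "2021-02-01"}),
    Example([2, 14, 15, 23], 54,
            {"data_type": "Integer Array", "entry_date": "2021-03-22"}),
]

# Training code usually consumes features and labels as parallel lists;
# metadata stays alongside for filtering and auditing.
X = [example.features for example in dataset]
y = [example.label for example in dataset]
```

Keeping metadata separate from features makes it harder to feed bookkeeping fields, such as the upload source, into the model by accident.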
Commonly Used Datasets
- MNIST Dataset: Consists of handwritten digit images, widely utilized in the field of computer vision.
- Iris Dataset: A tabular dataset used to predict the species of iris flowers.
- IMDB Review Dataset: Composed of movie review texts, often used in sentiment analysis.
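To show how a tabular dataset like Iris separates into features and labels, here is a small sketch using only the standard library. The rows are a few illustrative entries in the Iris layout (four flower measurements plus a species name), and the column names are assumptions; the actual files distributed by different libraries may name or order columns differently.

```python
import csv
import io

# A handful of illustrative rows in the Iris layout. Column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def split_features_labels(text, label_column):
    """Split CSV text into numeric feature rows and a list of labels."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        labels.append(row.pop(label_column))           # the answer for this row
        features.append([float(v) for v in row.values()])  # remaining columns
    return features, labels


features, labels = split_features_labels(csv_text, "species")
print(features[0], labels[0])
```

Here every column except `species` is treated as a feature, which matches how the dataset is described above: measurements go in, a species prediction comes out.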