Skip to main content
Practice

Preparing to Train AI: Dataset Essentials

A Dataset refers to a collection of data gathered and organized for specific purposes such as training and validating AI models.

The JSONL file we created for fine-tuning in previous lessons is one form of a dataset.


What kind of data should be included in a dataset?

Datasets can encompass a broad range of data types such as tables, images, text, and time series data.

  • Tabular Data: Data in table format consisting of rows and columns, like CSV, Excel, or SQL tables.

  • Image Data: Collections of image files such as PNG or JPG, typically used in computer vision.

  • Text Data: Textual data in the form of documents, sentences, or words, extensively used in Natural Language Processing (NLP).

  • Time Series Data: Chronologically collected data, examples include stock market data and temperature readings over time.


What is the typical structure of a dataset?

Most datasets are composed of three key components:

  • Features: The data input used for training the AI model. In a chatbot model, the user's 'question' is the feature, while in an image classification model, the 'photo' is the feature.

  • Label: Represents the answer or result within the dataset. If there's a cat in a photo, the label for that photo is 'cat'.

  • Metadata: Acts like a dataset's guidebook, providing additional information such as the data’s source or creation date.


FeaturesLabelMetadata
Image Path: /images/cat.jpgCatFile Size: 3MB, Date Taken: 2021-01-15, Source: User Upload
Text: "How are you feeling today?"Sentiment InquiryLength: 24 characters, Author: Admin, Created Date: 2021-02-01
Numeric Data: [2, 14, 15, 23]Sequence Sum: 54Data Type: Integer Array, Entry Date: 2021-03-22

Commonly Used Datasets

  • MNIST Dataset: Consists of handwritten digit images, widely utilized in the field of computer vision.

  • Iris Dataset: A tabular dataset used to predict the species of iris flowers.

  • IMDB Review Dataset: Composed of movie review texts, often used in sentiment analysis.

Want to learn more?

Join CodeFriends Plus membership or enroll in a course to start your journey.