Essentials for Taming AI, the Dataset

A dataset is a collection of data gathered and organized for a specific purpose, such as training and validating an AI model.

The JSONL file created for fine-tuning in the previous lesson is itself one form of dataset.
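
As a rough illustration, a JSONL dataset stores one JSON object per line, so it can be read with nothing more than Python's standard `json` module. The `prompt` and `completion` keys, the sample records, and the `load_jsonl` helper below are assumptions made for this sketch, not necessarily the format used in the previous lesson.

```python
import io
import json

# Hypothetical JSONL content: one JSON object per line.
# The "prompt"/"completion" keys are assumed for illustration only.
jsonl_text = (
    '{"prompt": "What is a dataset?", "completion": "A collection of data organized for a purpose."}\n'
    '{"prompt": "What is a label?", "completion": "The expected answer for an example."}\n'
)


def load_jsonl(stream):
    """Parse each non-empty line of a text stream as one JSON record."""
    return [json.loads(line) for line in stream if line.strip()]


# io.StringIO stands in for an open file so the sketch needs no file on disk.
records = load_jsonl(io.StringIO(jsonl_text))
print(len(records), "records loaded")
print(records[0]["prompt"])
```

With a real file, the same helper could be called as `load_jsonl(open(path, encoding="utf-8"))`, where `path` points at the JSONL file.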


What kind of data is included in a dataset?

A dataset can contain a variety of data forms, including tables, images, text, and time-series data.

  • Tabular Data: Data arranged in rows and columns, such as Excel files (.xlsx) and CSV files (.csv).

  • Image Data: Image files such as PNG and JPG, used mainly in computer vision.

  • Text Data: Documents, sentences, and words, widely used in Natural Language Processing (NLP).

  • Time Series Data: Observations recorded in time order, such as stock prices or temperature readings.
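
To make two of these forms concrete, the sketch below parses a small CSV table and puts a short time series in order, using only the standard library. All column names and values are made up for the example.

```python
import csv
import io
from datetime import date

# Tabular data: rows and columns, here as made-up CSV text.
csv_text = "name,age\nAlice,30\nBob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(row["age"]) for row in rows]  # CSV values are read as strings

# Time-series data: (timestamp, value) pairs, where order matters.
temperatures = [
    (date(2021, 1, 3), 2.5),
    (date(2021, 1, 1), 1.0),
    (date(2021, 1, 2), 3.5),
]
temperatures.sort(key=lambda pair: pair[0])  # put observations in time order

# Change between each pair of consecutive observations.
changes = [
    later[1] - earlier[1]
    for earlier, later in zip(temperatures, temperatures[1:])
]

print(rows[0]["name"], ages)
print(changes)
```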


What is the general structure of a dataset?

Most datasets are organized into the following three parts:

  • Feature: The input given to the AI model, which the model learns from. In a chatbot model, the user's 'question' could be a feature, while in an image classification model, the 'photo' could be a feature.

  • Label: The answer the model is expected to produce for a given feature. If a photo contains a cat, the label of that photo would be 'cat'.

  • Metadata: Works like a manual for the dataset, providing additional information such as the source of the data and when it was created.


| Feature | Label | Metadata |
|---|---|---|
| Image file path: /images/cat.jpg | Cat | File size: 3 MB, Capture date: 2021-01-15, Source: User Upload |
| Text: "How are you feeling today?" | Feeling inquiry | Length: 26 characters, Author: Admin, Creation date: 2021-02-01 |
| Numeric data: [2, 14, 15, 23] | Sum of sequence: 54 | Data type: Integer array, Input date: 2021-03-22 |
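
One possible way to hold this feature / label / metadata structure in Python is a small dataclass. This is only a sketch: the `Example` class and its field names are hypothetical, and the metadata keys are abbreviated from the table above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Example:
    """One dataset entry: the model input, the expected answer, and extra info."""
    feature: Any
    label: Any
    metadata: dict = field(default_factory=dict)


# The three rows from the table, with abbreviated metadata.
dataset = [
    Example("/images/cat.jpg", "Cat", {"source": "User Upload"}),
    Example("How are you feeling today?", "Feeling inquiry", {"author": "Admin"}),
    Example([2, 14, 15, 23], 54, {"data_type": "Integer array"}),
]

# Separate model inputs from expected answers, as a training step would.
features = [example.feature for example in dataset]
labels = [example.label for example in dataset]

print(labels)
# The label of the numeric row is the sum of its feature values.
print(sum(dataset[2].feature) == dataset[2].label)
```

Keeping metadata in its own field means it can describe each entry without being fed to the model by accident.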

Commonly Used Datasets

  • MNIST Dataset: A dataset of handwritten digit images (0 through 9), frequently used in the field of computer vision.

  • Iris Dataset: A tabular dataset used for predicting the species of an iris flower from its sepal and petal measurements.

  • IMDB Review Dataset: A dataset of movie review texts, used for sentiment analysis and other applications.
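
As a hedged sketch of how a tabular dataset like Iris is typically prepared, the code below separates features from labels and sets aside part of the data for validation. The rows are hard-coded in the style of the Iris dataset for illustration, and `train_validation_split` is a hypothetical helper, not a library function.

```python
import random
from collections import Counter

# Illustrative rows in the style of the Iris dataset:
# (sepal length, sepal width, petal length, petal width, species), lengths in cm.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]

# Features are the four measurements; the label is the species.
features = [row[:4] for row in rows]
labels = [row[4] for row in rows]


def train_validation_split(items, validation_fraction=0.33, seed=0):
    """Shuffle a copy of items and split it into (training, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


train, validation = train_validation_split(rows)
print(Counter(labels))
print(len(train), "training rows,", len(validation), "validation rows")
```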
