
AI-Powered Speech Recognition

How does AI comprehend human speech and convert it into text?

Speech recognition is a technology that analyzes spoken language and converts it into written text.

This technology is applied in a range of services, including voice assistants such as Siri and Google Assistant, transcription systems, and smart device control.

In this session, we will explore the concept of speech recognition, examine the technical process behind how AI recognizes speech, and participate in a practice exercise.

Speech recognition may involve a brief delay of 1–2 seconds. It is recommended to practice with short, clear phrases such as “Hello” or “Nice to meet you.”


Technical Process of Speech Recognition

Speech recognition involves converting spoken language into digital signals and then using AI models to turn those signals into text.

In essence, it converts analog sound into digital data that a computer can process and then analyzes the data to interpret its meaning.

The process of speech recognition involves the following steps.


1. Collecting Audio Signals

The user's speech is captured through a microphone. Minimizing background noise at this stage is important for obtaining clean audio data.
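
As a simple illustration, the following sketch records a few seconds of microphone input with the third-party sounddevice package; the sample rate and duration are example values, not requirements.

```python
# A minimal recording sketch using the sounddevice package (pip install sounddevice)
import sounddevice as sd

SAMPLE_RATE = 16_000   # 16 kHz is a common sample rate for speech
DURATION = 3           # record for 3 seconds

print("Speak now...")
recording = sd.rec(int(DURATION * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=1,        # mono microphone input
                   dtype="float32")
sd.wait()              # block until the recording is finished
print("Captured", recording.shape[0], "samples")
```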


2. Digital Conversion

Since speech consists of analog signals (continuous sound), it must be converted into digital signals (discrete values represented by 0s and 1s) that a computer can process.

This process involves two key steps.


2-1. Sampling

This step measures the analog signal at fixed time intervals, turning the continuous waveform into a sequence of discrete data points.

The higher the sampling frequency (i.e., the more frequently it is measured), the more accurately the original voice can be represented.

For example, CD-quality audio is sampled at 44.1 kHz (i.e., measured 44,100 times per second).
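
To make sampling concrete, the sketch below "measures" a 440 Hz tone at the CD sample rate using NumPy; the tone simply stands in for an analog voice signal.

```python
import numpy as np

SAMPLE_RATE = 44_100   # CD quality: 44,100 measurements per second
DURATION = 1.0         # one second of audio
FREQUENCY = 440.0      # a 440 Hz tone stands in for the analog signal

# Time points at which the continuous signal is measured (the sampling step)
t = np.arange(0, DURATION, 1 / SAMPLE_RATE)
samples = np.sin(2 * np.pi * FREQUENCY * t)

print(len(samples))    # 44100 samples for one second of audio
```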


2-2. Quantization

This is the process of mapping each sampled value to one of a fixed set of numeric levels within a set range.

Analog signals can take on continuous values, but because computers process data using a finite number of bits, these values must be approximated and stored numerically.

For instance, 16-bit quantization represents each sample using one of 2¹⁶ (65,536) possible values.
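
Continuing the NumPy example, the sketch below quantizes floating-point samples in the range [-1.0, 1.0] into 16-bit integers; the scale factor 32767 is the largest value a signed 16-bit integer can hold.

```python
import numpy as np

# Floating-point samples in the range [-1.0, 1.0] (output of the sampling step)
samples = np.sin(2 * np.pi * 440.0 * np.arange(0, 1.0, 1 / 44_100))

# 16-bit quantization: round each value to one of 65,536 integer levels
quantized = np.round(samples * 32767).astype(np.int16)

print(quantized.dtype)   # int16: each sample now fits in 16 bits
```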


3. Feature Extraction

This is the process of extracting meaningful patterns, such as the onset and offset of speech sounds, pitch, and intensity, from the digitized speech data.

Common feature extraction techniques include MFCCs (Mel-Frequency Cepstral Coefficients), spectrograms, and wavelet transforms.
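
As a hedged example, the sketch below extracts MFCCs with the third-party librosa library; "speech.wav" is a placeholder file name, and 13 coefficients per frame is just a common default.

```python
# MFCC extraction sketch using librosa (pip install librosa);
# "speech.wav" is a placeholder audio file.
import librosa

y, sr = librosa.load("speech.wav", sr=16_000)       # load the audio at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame

print(mfcc.shape)  # (13, number_of_frames)
```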


4. Applying Acoustic AI Models

The extracted features are input into an AI model to analyze phonemes and words.

A phoneme is the smallest unit of sound in a language that distinguishes one word from another.

Recently, deep learning-based models such as CNNs, RNNs, and Transformers have been adopted to improve speech recognition accuracy.
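
As a rough illustration of how extracted features can feed a neural acoustic model, the following PyTorch sketch maps MFCC frames to per-phoneme scores; the layer sizes, the 40 phoneme classes, and the random input are illustrative assumptions, and the network is untrained.

```python
import torch
import torch.nn as nn

N_MFCC = 13       # features per frame, from the feature extraction step
N_PHONEMES = 40   # assumed number of phoneme classes (illustrative)

# A tiny, untrained network standing in for a real acoustic model
acoustic_model = nn.Sequential(
    nn.Linear(N_MFCC, 128),
    nn.ReLU(),
    nn.Linear(128, N_PHONEMES),   # one score per phoneme
)

frames = torch.randn(100, N_MFCC)                      # 100 dummy MFCC frames
phoneme_probs = acoustic_model(frames).softmax(dim=-1)
print(phoneme_probs.shape)                             # torch.Size([100, 40])
```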


5. Applying Language AI Models

The AI considers context to assemble the recognized words into the most plausible sentence.

Models based on N-grams, RNNs, and Transformers (such as GPT) are used for sentence generation.
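
To show the idea behind N-gram-based sentence selection, the toy sketch below scores two candidate word sequences with made-up bigram probabilities and keeps the more natural one; all numbers are invented purely for illustration.

```python
from math import prod

# Made-up bigram probabilities: P(next word | previous word)
bigram_prob = {
    ("nice", "to"): 0.8, ("to", "meet"): 0.7, ("meet", "you"): 0.9,
    ("nice", "two"): 0.01, ("two", "meat"): 0.01, ("meat", "you"): 0.01,
}

def sentence_score(words):
    # Multiply the probabilities of consecutive word pairs; unseen pairs get a tiny value
    return prod(bigram_prob.get(pair, 1e-6) for pair in zip(words, words[1:]))

candidates = [["nice", "to", "meet", "you"], ["nice", "two", "meat", "you"]]
best = max(candidates, key=sentence_score)
print(" ".join(best))   # -> nice to meet you
```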


6. Generating Output

The final recognized text is presented to the user. The accuracy of the recognized text may vary depending on factors such as the user's pronunciation, intonation, and background noise.
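
For a practice-style end-to-end run, the sketch below uses the third-party SpeechRecognition package, which records from the microphone and sends the audio to Google's free web recognizer (an internet connection and the pyaudio package are required); it is one possible setup, not the only way.

```python
# End-to-end sketch with SpeechRecognition (pip install SpeechRecognition pyaudio)
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # compensate for background noise
    print("Say a short phrase, e.g. 'Hello'...")
    audio = recognizer.listen(source)

try:
    # The acoustic and language models run on Google's servers
    text = recognizer.recognize_google(audio)
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Could not understand the audio.")
```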



In the next session, we will learn about Speech Synthesis, the opposite of speech recognition, in which AI generates speech rather than transcribing it.