Advanced Natural Language Processing Techniques with NLTK

In this lesson, we will explore advanced NLTK features such as POS tagging, named entity recognition, and syntax parsing.


1. Part-of-Speech Tagging

A part of speech (POS) refers to the grammatical role of a word in a sentence.

For example, in "I am a student.", I is a pronoun, am is a verb, a is an article, and student is a noun.

POS tagging involves analyzing each word in a sentence to determine its part of speech.

POS Tagging Example
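import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download the tokenizer and tagger models (only needed once).
# Newer NLTK releases may ask for differently named resources,
# for example 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "NLTK provides powerful NLP tools."
tokens = word_tokenize(text)  # split the sentence into word tokens
tagged = pos_tag(tokens)      # list of (word, tag) tuples

print(tagged)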

In the output of the above code, each word is paired with a tag indicating its part of speech, such as NNP (proper noun), VBZ (verb, 3rd person singular present), or JJ (adjective).
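Because pos_tag returns a list of (word, tag) tuples, the result can be filtered or grouped with ordinary Python. The following is a minimal sketch under stated assumptions: sample_tagged is a hand-written list in that shape, not captured NLTK output (real tags may differ by model version), and the helper name group_by_tag is hypothetical.

```python
from collections import defaultdict

# Hand-written sample in the (word, tag) shape that pos_tag returns.
# This is an assumption for illustration, not captured NLTK output.
sample_tagged = [
    ("NLTK", "NNP"),
    ("provides", "VBZ"),
    ("powerful", "JJ"),
    ("NLP", "NNP"),
    ("tools", "NNS"),
    (".", "."),
]


def group_by_tag(tagged):
    """Map each POS tag to the list of words carrying it, in order."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(sample_tagged)
# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in sample_tagged if tag.startswith("NN")]

print(groups)
print(nouns)
```

Matching on the tag prefix rather than on one exact tag is a design choice here: it collects singular, plural, and proper nouns in a single pass.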


2. Named Entity Recognition (NER)

Named Entity Recognition (NER) is the process of identifying specific entities such as people, organizations, and locations in a text.

Named Entity Recognition Example
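import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.chunk import ne_chunk

# Download the models and data used below (only needed once).
# Newer NLTK releases may ask for differently named resources,
# for example 'punkt_tab'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "I live in California."
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)
ner_tree = ne_chunk(tagged)  # NER runs on POS-tagged tokens

print(ner_tree)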

The output appears as follows:

(S I/PRP live/VBP in/IN (GPE California/NNP) ./.)

Here, GPE indicates a geopolitical entity, and NNP signifies a proper noun.
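The result of ne_chunk is a tree, so the entities can be pulled out by walking its children. The following is a minimal sketch, assuming NLTK is installed: it rebuilds the tree from the printed output above with Tree.fromstring and str2tuple instead of running the chunker, so no model download is needed. The helper name extract_entities is hypothetical.

```python
from nltk.tag import str2tuple
from nltk.tree import Tree

# Rebuild the tree from the printed output shown above, so this sketch
# runs without downloading any models. str2tuple turns "California/NNP"
# into ("California", "NNP"), matching the leaves that ne_chunk produces.
printed = "(S I/PRP live/VBP in/IN (GPE California/NNP) ./.)"
ner_tree = Tree.fromstring(printed, read_leaf=str2tuple)


def extract_entities(tree):
    """Return (entity text, entity label) pairs from a chunked sentence."""
    entities = []
    for node in tree:
        # Named entities are nested Tree objects; plain tokens are tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


entities = extract_entities(ner_tree)
print(entities)
```

The same helper should work on the tree returned by ne_chunk directly, since that tree has the same structure of tagged-token leaves and labeled entity subtrees.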


How About Other Languages?

NLTK is primarily an English-based natural language processing library, so its support for languages like Korean is limited.

For processing languages like Korean, it's common to use libraries such as spaCy or KoNLPy alongside NLTK.

KoNLPy Example
from konlpy.tag import Okt

okt = Okt()
text = "파이썬은 자연어 처리를 쉽게 해줍니다."  # "Python makes natural language processing easy."

print(okt.morphs(text)) # Morphological analysis
print(okt.nouns(text)) # Extracting nouns
print(okt.pos(text)) # POS tagging

This code allows you to extract morphemes, identify nouns, and tag parts of speech in a Korean sentence.

While NLTK is excellent for English natural language processing, using other libraries is advisable for handling languages like Korean.

