NLP Series - Essentials



Natural Language Processing Basics - Core Terminologies and Techniques

If there’s one form of media we are exposed to every single day, it’s text. Huge amounts of data are created by people and devices every single day. This data includes, but is not limited to, comments on social media, product reviews, tweets and Reddit messages. Generally speaking, the data from these sources is useful for the purposes of research and commerce.

This article serves as a detailed glossary of all the NLP core terminologies discussed in the NLP series and more. This page is updated frequently, and the topics discussed have been categorized for easier understanding. I request you to follow the references and learn more about the topics listed below.

Data Collection

Synonym Replacement & Antonym replacement
Synonym replacement is a technique in which we replace a word with one of its synonyms. It works by collapsing related terms, reducing the vocabulary of the corpus without losing meaning and thus saving a lot of memory. In lexical resources such as WordNet, synonyms are grouped into synsets: sets of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

Antonym replacement is the complementary technique: since an antonym is a word with the opposite meaning of another word, we replace a word with one of its antonyms instead.
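Here is a minimal sketch of both replacements using NLTK’s WordNet interface (assuming the wordnet corpus has been downloaded); the example word is arbitrary:

```python
import random

from nltk.corpus import wordnet  # assumes: nltk.download('wordnet')

def synonym_replace(word):
    """Pick a random WordNet synonym of `word`, or return `word` unchanged."""
    synonyms = {
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
        if lemma.name().lower() != word.lower()
    }
    return random.choice(sorted(synonyms)) if synonyms else word

def antonym_replace(word):
    """Pick a WordNet antonym of `word`, or return `word` unchanged."""
    antonyms = {
        antonym.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
        for antonym in lemma.antonyms()
    }
    return sorted(antonyms)[0] if antonyms else word

print(synonym_replace("happy"))  # e.g. "felicitous"
print(antonym_replace("happy"))  # "unhappy"
```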

Back Translation

TF-IDF based word replacement

Bigram Flipping

Replacing entities

Adding noise to data

Snorkel

Easy Data Augmentation

Active learning

Data Cleaning

Unicode Normalization

Text Encoding

Bag-of-words

Optical character recognition

Text Pre-processing

Sentence Segmentation
Sentence segmentation, also known as sentence boundary disambiguation, sentence breaking or sentence boundary detection, is one of the foremost problems of NLP: dividing a text into its component sentences.
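A minimal sketch using NLTK’s pre-trained Punkt sentence tokenizer (assuming the punkt model has been downloaded):

```python
from nltk.tokenize import sent_tokenize  # assumes: nltk.download('punkt')

text = "Dr. Smith arrived at 10 a.m. sharp. He was early. The meeting started anyway."
# Punkt learns abbreviation statistics from data, so periods in "Dr."
# and "a.m." are typically not treated as sentence boundaries.
for sentence in sent_tokenize(text):
    print(sentence)
```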

Tokenization
The process of breaking a piece of text into its constituent parts is called tokenization. Usually, the first job in an NLP project is to divide the text into a list of tokens. The granularity of the resulting tokens will differ depending on the objective of our NLP task; in other words, tokens can be individual words, phrases or even whole sentences. Almost every NLP task uses some sort of tokenization technique.

Some of the techniques used for tokenization include white space tokenization, dictionary-based tokenization and subword tokenization. White space tokenization is the simplest form of tokenization, where the sentence or paragraph is broken into terms whenever a whitespace is encountered. Dictionary-based tokenization is a more advanced method, in which tokens are found based on tokens already existing in a dictionary. Subword tokenization is a recent strategy that uses unsupervised machine learning techniques to help us combat problems of misspellings, rare words and multilingual sources by breaking unknown words into “subword units”.
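As a quick sketch, compare plain whitespace tokenization with NLTK’s word tokenizer (assuming the punkt model has been downloaded); a subword tokenizer would split rare words even further:

```python
from nltk.tokenize import word_tokenize  # assumes: nltk.download('punkt')

text = "Don't overthink tokenization."
# Whitespace tokenization: split on spaces only; punctuation sticks to words.
print(text.split())         # ["Don't", 'overthink', 'tokenization.']
# NLTK's word tokenizer separates punctuation and contractions.
print(word_tokenize(text))  # ['Do', "n't", 'overthink', 'tokenization', '.']
```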

Stemming
Stemming is the process of reducing a word to its base or root form. It usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. The first stemming algorithm was created by Julie Beth Lovins in 1968, although earlier work had been done on the subject. In 1980, Martin Porter created the Porter stemmer, certainly the most well-known stemming algorithm, which has repeatedly been shown to be empirically very effective. Stemming requires no memory and can be easily tuned, since it is based on an algorithm. However, since it may not return an actual word, its output is not always interpretable, and hence not useful if the result is shown to end-users.
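A minimal sketch with NLTK’s implementation of the Porter stemmer; note how the stems are not always dictionary words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "studies", "university"]:
    # Porter's algorithm strips suffixes rule by rule; the result
    # (e.g. "studi", "univers") need not be a real word.
    print(word, "->", stemmer.stem(word))
```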

Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item. The base form is called the lemma, or head word. For example, “eating”, “eat” and “ate” can be counted as one word, as variations of the word “eat”. Lemmatization returns the base or dictionary form of a word, and hence the result of this process is interpretable. Lemmatization is usually slower than stemming.
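A minimal sketch with NLTK’s WordNet lemmatizer (assuming the wordnet corpus has been downloaded), reproducing the “eat” example; passing the part of speech matters, since the default is noun:

```python
from nltk.stem import WordNetLemmatizer  # assumes: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# With pos="v" (verb), inflected forms collapse to the dictionary form "eat".
for word in ["eating", "eats", "ate"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
```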

Text Normalization

Language Detection

Code Mixing and Transliteration

Chunking

POS Tagging
Part-of-speech (POS) tagging refers to the technique of identifying whether the words in a sentence are nouns, verbs, adjectives, and so on.
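A minimal sketch with NLTK’s pre-trained perceptron tagger (assuming the punkt and averaged_perceptron_tagger resources have been downloaded):

```python
import nltk  # assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
# Each token gets a Penn Treebank tag: DT (determiner), JJ (adjective),
# NN (noun), VBZ (verb, 3rd person singular), IN (preposition), ...
print(nltk.pos_tag(tokens))
```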

Parse Tree

Coreference Resolution

Lowercasing

Collocation Extraction for Phrase Detection

Hypothesis Testing for Collocation Extraction

Feature Engineering of Text Data

Bag-of-Words

Bag-of-n-Grams

Removing Stopwords

Frequency-Based Filtering

Rare words

NLP Tasks

Information Extraction

Topic Modeling

Word Embeddings

Text Classification

Sentiment Analysis

Sequence Modeling

Chatbots

Text Summarization

Document Classification

Grammatical Error Correction/Autocorrect

Text-to-Speech

Speech-to-Text

Dialogue Understanding

Fake News Detection/Hate Speech Detection

Image captioning

Question and answering

Semantic textual similarity

Word Sense Disambiguation

Keyword Extraction

Annotation

Document Ranking

Relation extraction

Named entity recognition

References

1. Alice Zheng and Amanda Casari. 2018. Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly Media, Inc.

2. Alex Thomas. 2020. Natural Language Processing with Spark NLP. O’Reilly Media, Inc.

3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana. 2020. Practical Natural Language Processing. O’Reilly Media, Inc.

4. Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media, Inc.

Thanks for reading! I hope you found this article helpful. Read more data science articles here, including tutorials from beginner to advanced levels!
