NLP Series - Essentials
Natural Language Processing Basics - Core Terminologies and Techniques
If there’s one form of media we are exposed to every single day, it’s text. Huge amounts of text data are created by people and devices every day, including, but not limited to, comments on social media, product reviews, tweets and Reddit posts. Generally speaking, data from these sources is useful for both research and commerce.
This article serves as a detailed glossary of the NLP core terminologies discussed in the NLP series, and more. This page is updated frequently, and the topics have been categorized for easier understanding. I encourage you to follow the references and learn more about the topics listed below.
Data Collection
Synonym Replacement & Antonym replacement
Synonym replacement is a technique in which we replace a word with one of its synonyms. By collapsing related terms, it reduces the vocabulary of the corpus without losing meaning, which saves a lot of memory. The groups used for this are called synsets: sets of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.
An antonym is a word with the opposite meaning of another word, and the counterpart of synonym replacement, replacing a word with one of its antonyms, is called antonym replacement.
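As a quick illustration, here is a minimal sketch of synonym replacement built on NLTK’s WordNet synsets. It assumes the `wordnet` corpus has already been downloaded (e.g. via `nltk.download("wordnet")`), and the helper name `replace_with_synonym` is purely illustrative; antonym replacement can be sketched the same way by reading `lemma.antonyms()` instead of the synset’s own lemmas.

```python
# Illustrative sketch of synonym replacement with NLTK's WordNet synsets.
# Assumes the WordNet corpus is available: nltk.download("wordnet")
import random
from nltk.corpus import wordnet

def replace_with_synonym(word):
    """Return a random synonym of `word`, or the word itself if none is found."""
    synonyms = set()
    for synset in wordnet.synsets(word):      # all synsets that contain the word
        for lemma in synset.lemmas():         # each synonym within the synset
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                synonyms.add(candidate)
    return random.choice(sorted(synonyms)) if synonyms else word

print(replace_with_synonym("happy"))  # e.g. "glad"
```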
Back Translation
TF-IDF based word replacement
Bigram Flipping
Replacing entities
Adding noise to data
Snorkel
Easy Data Augmentation
Active learning
Data Cleaning
Unicode Normalization
Text Encoding
Bag-of-words
Optical character recognition
Text Pre-processing
Sentence Segmentation
Sentence segmentation, also known as sentence boundary disambiguation, sentence breaking or sentence boundary detection, is one of the foremost problems of NLP. It refers to dividing text into its component sentences.
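For example, a common off-the-shelf approach is NLTK’s Punkt tokenizer; the short sketch below assumes the `punkt` model has been downloaded via `nltk.download("punkt")`.

```python
# Sentence segmentation with NLTK's Punkt tokenizer (nltk.download("punkt") required).
from nltk.tokenize import sent_tokenize

text = "Dr. Smith teaches NLP. His students love the course."
for sentence in sent_tokenize(text):
    print(sentence)
# The period in the abbreviation "Dr." should not be treated as a sentence boundary.
```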
Tokenization
The process of breaking a piece of text into its constituent parts is called tokenization. Usually, the first job in an NLP project is to divide the text into a list of tokens. The granularity of the resulting tokens will differ depending on the objective of our NLP task; in other words, tokens can be individual words, phrases or even whole sentences. Almost every NLP task uses some sort of tokenization technique. Some of the techniques used for tokenization include whitespace tokenization, dictionary-based tokenization and subword tokenization.

Whitespace tokenization is the simplest form of tokenization, where the sentence or paragraph is broken into terms whenever whitespace is encountered. Dictionary-based tokenization is a more advanced method, in which tokens are found based on tokens already existing in a dictionary. Subword tokenization is a recent strategy that uses unsupervised machine learning techniques to combat problems of misspellings, rare words and multilingual sources by breaking unknown words into “subword units”.
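A rough comparison of whitespace tokenization with a trained word tokenizer (here NLTK’s, which relies on the Punkt models) might look like this; the example sentence is arbitrary.

```python
# Contrast whitespace tokenization with NLTK's word tokenizer.
# Assumes the Punkt models are available: nltk.download("punkt")
from nltk.tokenize import word_tokenize

text = "Don't hesitate to ask questions!"
print(text.split())          # whitespace: ["Don't", 'hesitate', 'to', 'ask', 'questions!']
print(word_tokenize(text))   # e.g. ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '!']
```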
Stemming
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and often includes the removal of derivational affixes. The first stemming algorithm was created by Julie Beth Lovins in 1968, although there had been earlier work on the subject. In 1980, Martin Porter created the Porter stemmer, which is the most well-known stemming algorithm and has repeatedly been shown to be empirically very effective. Stemming requires no memory and can be easily tuned, since it is based on an algorithm. But because it may not return an actual word, its output is not always interpretable, and hence not useful if the results face end users.
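A minimal sketch with NLTK’s Porter stemmer illustrates both points: the process is just rule-based string chopping, and the stems it produces need not be dictionary words.

```python
# Stemming with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "connecting", "studies"]:
    print(word, "->", stemmer.stem(word))
# "connection", "connected" and "connecting" all collapse to "connect",
# but "studies" becomes "studi", which is not a real word -- a typical
# drawback when results are shown to end users.
```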
Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or head word. For example, “eating”, “eat” and “ate” can be counted as one word: variations of the word “eat”. Lemmatization returns the base or dictionary form of a word, and hence the result of this process is interpretable. Lemmatization is usually slower than stemming.
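A short NLTK sketch (assuming the `wordnet` corpus has been downloaded) illustrates the “eat”/“ate” example above; note that the part-of-speech hint matters for irregular forms.

```python
# Lemmatization with NLTK's WordNetLemmatizer (nltk.download("wordnet") required).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("eating", pos="v"))  # -> "eat"
print(lemmatizer.lemmatize("ate", pos="v"))     # -> "eat"
print(lemmatizer.lemmatize("ate"))              # default noun POS leaves "ate" unchanged
```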
Text Normalization
Language Detection
Code Mixing and Transliteration
Chunking
POS Tagging
Part-of-speech tagging refers to a technique of identifying whether words in a sentence are nouns, verbs, adjectives, and so on.
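For instance, NLTK’s default tagger can be applied in a couple of lines; this sketch assumes the `punkt` and `averaged_perceptron_tagger` resources have been downloaded.

```python
# Part-of-speech tagging with NLTK's default tagger.
# Assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
from nltk import pos_tag, word_tokenize

tags = pos_tag(word_tokenize("The striped bats are hanging on their feet"))
print(tags)  # a list of (word, tag) pairs, e.g. [('The', 'DT'), ...]
```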
Parse Tree
Coreference Resolution
Lowercasing
Collocation Extraction for Phrase Detection
Hypothesis Testing for Collocation Extraction
Feature Engineering of Text Data
Bag-of-Words
Bag-of-n-Grams
Removing Stopwords
Frequency-Based Filtering
Rare words
NLP Tasks
Information Extraction
Topic Modeling
Word Embeddings
Text Classification
Sentiment Analysis
Sequence Modeling
Chatbots
Text Summarization
Document Classification
Grammatical Error Correction/Autocorrect
Text-to-Speech
Speech-to-Text
Dialogue Understanding
Fake News Detection/Hate Speech Detection
Image captioning
Question and answering
Semantic textual similarity
Word Sense Disambiguation
Keyword Extraction
Annotation
Document Ranking
Relation extraction
Named entity recognition
References
1. Alice Zheng and Amanda Casari. 2018. Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly Media, Inc.
2. Alex Thomas. 2020. Natural Language Processing with Spark NLP. O’Reilly Media, Inc.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana. 2020. Practical Natural Language Processing. O’Reilly Media, Inc.
4. Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media, Inc.
Thanks for reading! I hope you found this article helpful. Read more data science articles here including tutorials from beginner to advanced levels!