Natural Language Processing

02/ Why is everybody talking about NLP?

AI for natural language

NLP stands for Natural Language Processing. It’s the field of data science concerned with teaching machines to understand and work with language produced by humans.
NLP is mostly about written text, but it covers spoken language too: once speech is transcribed, the models treat spoken and written input the same way.

In short, NLP techniques aim to automatically process, analyze and manipulate language data like speech and text.

03/ Why should you use NLP?

NLP models now match or even beat human performance on a variety of natural language tasks. How is that possible? Model architectures have improved, and the models are trained on ever larger datasets.

2018 was a turning point for NLP: models such as BERT surpassed human-level performance on text-understanding benchmarks for the first time, and the models have only become better since.

04/ How to use NLP?

Text classification

NLP models can be trained to automatically assign a set of predefined categories to any kind of text, whether that’s entire documents, parts of documents, messages, summaries, etc.

Use cases: Classify tweets
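To make the idea concrete, here is a deliberately tiny sketch of text classification. The categories, keyword lists, and the `classify` function are all made up for illustration; a real classifier learns these associations from labelled training data instead of using a hand-made word list.

```python
# Toy keyword-based text classifier (illustration only; real NLP models
# learn category associations from labelled examples).
KEYWORDS = {
    "sports": {"match", "goal", "team", "season"},
    "tech": {"phone", "app", "software", "launch"},
}

def classify(text):
    """Assign the category whose keywords overlap the text the most."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(classify("The team scored a late goal to win the match"))  # sports
print(classify("The new phone launch shipped with updated software"))  # tech
```

The same pattern scales up: replace the keyword sets with a trained model, and the interface (text in, category out) stays identical.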

Sentiment analysis

Sentiment analysis focuses on identifying subjective information in text and classifying it as positive, negative, or neutral.

Use cases: Reviews sentiment
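A minimal sketch of the idea, assuming a hand-made sentiment lexicon: count positive and negative words and compare. The word lists and the `sentiment` function are hypothetical; production systems use trained models rather than fixed word lists.

```python
# Tiny lexicon-based sentiment scorer (a sketch, not a real model).
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment(review):
    """Score a review by counting positive vs. negative words."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this place it is excellent"))  # positive
print(sentiment("terrible service and bad food"))      # negative
```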

Text similarity

Text similarity is the task of automatically determining how ‘close’ two pieces of text are, both in structure and in meaning (e.g. understanding that 'what year were you born?' and 'how old are you?' are similar questions).

Use cases: Find similar questions
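The simplest notion of similarity is word overlap, sketched below with the Jaccard measure. Note how it fails on the example above: 'what year were you born?' and 'how old are you?' share almost no words, which is exactly why real systems compare meaning (via embeddings, covered below) rather than surface words.

```python
def jaccard(a, b):
    """Word-overlap similarity: |A ∩ B| / |A ∪ B|, between 0 and 1."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

print(jaccard("the cat sat", "the cat ran"))                    # 0.5
print(jaccard("what year were you born", "how old are you"))   # low, despite similar meaning
```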

05/ How does NLP work?

If you see the word bank, you might think about a financial institution or the edge of a lake or river. If you’re given more context, as in ‘a walk by the river bank’, you realize that bank must mean the land next to some water.

Self-attention, the core mechanism of modern NLP models, does the very same thing: it uses the surrounding words as context to sharpen the representation of each word.
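The disambiguation above can be sketched numerically. Below is a minimal single-head self-attention over hand-made 2-dimensional word vectors (real models use learned, high-dimensional query/key/value projections; the numbers here are purely illustrative): each word's output becomes a weighted mix of all the words around it, so the context word pulls the ambiguous word toward the right meaning.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Each output vector is a weighted mix of all input vectors,
    weighted by how strongly the words 'attend' to each other."""
    out = []
    for q in vectors:
        weights = softmax([dot(q, k) for k in vectors])
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(q))]
        out.append(mixed)
    return out

river = [1.0, 0.0]  # toy vector for "river"
bank = [0.5, 0.5]   # toy vector for the ambiguous word "bank"
# After attention, "bank" has moved toward "river": context resolved it.
print(self_attention([river, bank]))
```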

Tokens and embedding vectors

A word like bank is called a token when it is treated as a fundamental unit of text. Text embedding converts tokens into vectors with a remarkable property: mathematical operations on the vectors reflect relationships in natural language.

Determining the values inside the embedding vector of a token is a large part of the heavy lifting in text processing. Thankfully, with hundreds of dimensions available to organize the vocabulary of known tokens, embeddings can be pretrained to relate numerically in ways that reflect how their tokens relate in natural language.
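A sketch of that property, using hand-made 2-dimensional embeddings (real embeddings have hundreds of dimensions and are learned from data; these numbers and token names are invented for illustration): cosine similarity between the vectors mirrors how related the tokens are in meaning.

```python
import math

# Invented toy embeddings; real ones are pretrained on large corpora.
EMB = {
    "bank_money": [0.9, 0.1],
    "finance":    [0.8, 0.2],
    "river":      [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1 for same direction, near 0 for unrelated."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

print(cosine(EMB["bank_money"], EMB["finance"]))  # high: related meanings
print(cosine(EMB["bank_money"], EMB["river"]))    # low: unrelated meanings
```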

An NLP model can use the embeddings to solve a task, like identifying the mood of the text, its genre, or its similarity with another piece of text.