Text classification / cheat sheet

Target audience: Data scientists and developers

Problem formulation

Use this cheat sheet:
If you want to kick-start your use of text classification.

What is text classification?

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or more categories. Such categories can be review scores or positive vs. negative. To build such a classifier, we need labeled data, which consists of texts and their corresponding labels.

Example use cases

Note
Disclaimer
Please note that data sets, models and other content, including open source software, (collectively referred to as "Content") provided and/or suggested by Peltarion for use in the Platform, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content and specifically disclaims all responsibility for any liability, loss, or risk, which is incurred as a consequence, directly or indirectly, of the use or application of any of the Content.

Positive vs. negative

(👍 vs. 👎)

On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions. This Quora dataset consists of questions labeled as sincere or insincere.
Note that the accuracy and loss metrics become very good almost immediately, even when the model gives wrong predictions. This typically happens when one label dominates the dataset, so treat these metrics with caution.
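
One common reason for misleadingly good accuracy is class imbalance. The sketch below uses made-up label counts (not the real dataset statistics) and a hypothetical helper, majority_baseline_accuracy, to show how a model that always predicts the most common label can still score a high accuracy.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical label distribution, for illustration only:
# 94 sincere questions (0) and 6 insincere questions (1).
labels = [0] * 94 + [1] * 6
print(majority_baseline_accuracy(labels))  # 0.94
```

If your own model's accuracy is close to this baseline, it may not have learned to find the rare category at all.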

Sentiment analysis

(😃 or 😱 or 😡 or 😍 or …​)

HappyDB is a corpus of more than 100,000 happy moments crowd-sourced via Amazon’s Mechanical Turk.
The goal of the corpus is to advance the understanding of the causes of happiness through text-based reflection.
HappyDB - happy moments

Data preparation

Structure of imported csv

Datasets with text for text classification need to include:

  • 1 column with text to be used as input

  • 1 column as label to be used as predicted category in the Target block.
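
Before uploading, it can help to check that a file has this structure. The following is a minimal sketch using only the Python standard library. The helper name check_columns and the column names review and sentiment are illustrative assumptions, not names required by the Platform.

```python
import csv
import io


def check_columns(csv_text, text_column, label_column):
    """Return the set of distinct labels, or raise if a column is missing or a text is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {text_column, label_column} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = set()
    for record_number, row in enumerate(reader, start=1):
        # A short row yields None for missing fields, so fall back to "".
        if not (row[text_column] or "").strip():
            raise ValueError(f"empty text in record {record_number}")
        labels.add(row[label_column])
    return labels


# Hypothetical two-column dataset: one text column, one label column.
sample = "review,sentiment\nGreat product,positive\nBroke after a day,negative\n"
print(sorted(check_columns(sample, "review", "sentiment")))
```

For a file on disk, read its contents first and pass the resulting string to the function.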

Dataset view changes

In the Datasets view, set the encoding of the text feature to Text and the label feature to Categorical.
Select a Language model that matches your input data language.

Example: If you use the Quora questions dataset (Quora train.csv), unzip it and upload it to the Platform. Make sure to set the encoding of the question_text feature to Text and the encoding of the target feature to Categorical.

Modeling

Build this straightforward model to get started. It’s not cutting edge and it will not solve all your problems, but it will show you which basic parts need to go into a text classification model.

[Figure: text classification example model]

Block changes

Input

Select your text encoded feature as input feature.

Text embedding (beta)

Make sure that the Language model you select matches the Language model selected when you set the encoding. Select Prebuilt embedding and then fastText. Alternatively, you can choose Randomly initialized and see if that fits your model better.

Dense

Select Softmax as Activation and set the number of Nodes to 2, that is, one node per category in a two-class problem such as the Quora dataset.

Target

Select your Categorical encoded feature as target feature and set the Loss function to Categorical classification.
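
To make the role of the blocks more concrete, here is a rough sketch in plain Python of what happens after the text has been embedded: a dense layer with 2 nodes, a softmax activation, and a categorical cross-entropy loss. This is not the Platform's implementation. The 3-dimensional embedding, the weights, and the function names are made-up assumptions for illustration.

```python
import math


def dense(features, weights, biases):
    """One dense layer: a weighted sum of the inputs plus a bias, per output node."""
    return [sum(w * f for w, f in zip(node_weights, features)) + b
            for node_weights, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probabilities, true_index):
    """Negative log of the probability assigned to the true category."""
    return -math.log(probabilities[true_index])


# Hypothetical 3-dimensional text embedding and made-up weights for 2 output nodes.
embedding = [0.2, -0.1, 0.4]
weights = [[0.5, -0.3, 0.8], [-0.5, 0.3, -0.8]]
biases = [0.0, 0.0]

probabilities = softmax(dense(embedding, weights, biases))
loss = categorical_crossentropy(probabilities, true_index=0)
print(probabilities, loss)
```

The loss shrinks toward 0 as the probability assigned to the true category approaches 1, which is why a lower loss indicates a better fit.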

Evaluation

In the Evaluation view, you can see in real time how the AI model performs as it learns from the data. You want the loss scores to be as low as possible, and you want the training error to be slightly lower than the test error.
Read more on how to evaluate your model in Classification loss metrics.
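
If you copy per-epoch loss values out of an experiment, a small helper can summarize them. The sketch below is an illustration only: the function name summarize_losses, the loss values, and the max_gap threshold of 0.1 are arbitrary assumptions, not Platform defaults.

```python
def summarize_losses(train_losses, eval_losses, max_gap=0.1):
    """Find the epoch with the lowest evaluation loss and flag a large train/eval gap."""
    best_epoch = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    gap = eval_losses[best_epoch] - train_losses[best_epoch]
    return {"best_epoch": best_epoch, "gap": gap, "large_gap": gap > max_gap}


# Made-up loss curves: training loss keeps falling while evaluation loss turns upward.
train_losses = [0.70, 0.45, 0.30, 0.20, 0.12]
eval_losses = [0.72, 0.50, 0.41, 0.43, 0.52]
print(summarize_losses(train_losses, eval_losses))
```

A gap that keeps growing after the best epoch is a common sign that the model is memorizing the training data rather than generalizing.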

Deployment

To make it easier to test your model on new sentences quickly, we have provided a Text Classifier web app.
Open the Text classifier at bit.ly/Text_classifier, enter your deployment URL, token, and input parameter, then enter a new text and press ▶️ to predict.

Example:

[Figure: Text classifier web app]

cURL test

Testing with cURL works the same way as for any other deployment. Read more about how to use cURL here.
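
The same kind of request can also be prepared from Python. The sketch below only builds the request and does not send it. The bearer-token header and the {"rows": [...]} payload shape are assumptions about the deployment API, and the URL, token, and build_request helper are placeholders. Check the request details shown for your own deployment before relying on them.

```python
import json
import urllib.request


def build_request(url, token, input_parameter, text):
    """Build (but do not send) a POST request carrying one text to classify."""
    # Assumed payload shape: a list of rows keyed by the input parameter name.
    payload = {"rows": [{input_parameter: text}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme: bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_request(
    "https://example.com/deployment",  # placeholder deployment URL
    "YOUR_TOKEN",                      # placeholder token
    "question_text",                   # input parameter name
    "Why is the sky blue?",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read()
```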
