BigQuery is a powerful tool for querying large amounts of data, available on Google Cloud Platform. Not only do many companies have their own data accessible via BigQuery, often for analytics purposes, there are also a large number of publicly available datasets through the tool.
Using BigQuery’s powerful SQL dialect it's easy to build a specialized dataset for deep learning with just a few lines. In this blog post, we'll show how you can shape one of these public datasets, the Stack Overflow posts dataset, to train a question-tagging BERT model.
Building a Stack Overflow question tagging model with public BigQuery data
02/ Public datasets freely available for BigQuery
At the time of writing, there are 190 public datasets freely available for BigQuery. These cover a wide range of domains, including
- A collection of COVID-19 datasets
- Bitcoin transactions
- Patents databases
- HackerNews stories
- Stack Overflow posts
- The Met artworks and metadata
- A massive general computer vision dataset, OpenImages
- Earth observation data from the Sentinel-2 satellites
We'll look at how to work with a balanced random sample of the dataset, and how to make sure that the sample is reproducible and can be extended with new data. We'll also demonstrate how to create a Chrome plugin that can add the suggested tag as you type a question. Hopefully, this can serve as a good starting point for anyone modeling data accessible through BigQuery!
03/ Building a question tagger for Stack Overflow
Have you noticed that Stack Overflow sometimes suggests tags? It doesn’t happen very often, and we suspect it’s based on finding similar entries. We thought it would be fun to show how one could build a content based tagger instead, which could suggest tags much more often. The model we train will be used in this chrome extension to tag Stack Overflow questions while you're writing them. You'll even be able to replace our model with one of your own. Check out how, below!
Tagging Stack Overflow posts
When posing questions on Stack Overflow it is important to set appropriate tags to allow people familiar with the area to find it. Good tags on Stack Overflow typically follow a certain set of best practices and it is not uncommon for more experienced users to edit the tags of a post. To improve things, automated tagging has been suggested on multipleoccasions, and seems to have been implemented a while back. You'd be forgiven for not having seen the feature before. In our experience it's very uncommon to get suggestions, and it might be related to the no longer current methods used to produce them.
Unfortunately the Stack Overflow dataset isn't very well documented on the dataset page. Thankfully, this post on Meta goes through the available fields in detail. The fields relevant for us are all in the Posts table
- Title, self-explanatory
- Body, the post itself including HTML
- Tags, a string of all tags for the post separated by |
where the former two will be the input and the tags are to be predicted. There are many ways of handling two inputs, for this problem we will simply concatenate the fields to one in BigQuery. But let's first take a quick look at how to access the dataset.
Accessing the dataset in BigQuery
(To follow along these instructions you will need a Google Cloud Platform account. For our purposes you should hopefully not need more than what the free tier includes, but this is for you to decide)
Follow the link to the Stack Overflow posts dataset page to BigQuery or go to BigQuery in the GCP console. With the posts_questions table selected, press QUERY TABLE to get the query field populated with a first query. Select "*" to get a preview of all columns in the table.
We'll run a few queries and visualize their results to understand the data we're working with.
How many tags are there per post?
First, let's look at how many tags there are per post.
SELECT IF(Posts.tags = '', 0, ARRAY_LENGTH(SPLIT(Posts.tags, '|'))) as number_of_tags, COUNT(*) as count FROM`bigquery-public-data.stackoverflow.posts_questions`as Posts GROUP BY number_of_tags ORDER BY number_of_tags
Since the Tags field is separated by | we'll simply split it and look at the length of that array. Notice that we have to treat the empty field separately, as these would otherwise be treated like a single tag. Plotting the data returned by the query we can see that while the majority of questions have 3 tags, there is a large number of posts with 1 tag.
When are the posts from?
Grouping by creation year instead we get a histogram across years.
SELECT EXTRACT(YEAR FROM Posts.creation_date) as year, COUNT(*) as count FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts GROUP BY year ORDER BY year
While the number of posts grew rapidly until around 2013, the number of questions posed per year doesn't seem to change much. We won't draw any conclusions based on this, but it's good to know that the vast majority of the data is from the past decade.
What are the most common tags?
Finally, let's take a look at which tags are the most popular ones. Here I'll only do this for posts with a single tag, but I'll link to a notebook which does this for all.
SELECT tags as tag, COUNT(*) as count FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1 GROUP BY tags ORDER BY count DESC LIMIT 1000
Even though there is a large concentration of posts in the top 20 tags, the tail of this distribution is fat. In fact the majority of all tags come from the tail, so it's important that our classifier be able to handle many tags. Furthermore, since the dataset is unbalanced with respect to tags it will be important to somehow handle this to avoid getting a very biased model.
We're after building a model that tries to predict which of a fixed number k of Tags is relevant given the content. To support the choice of k we can study how many of all examples we cover depending on the size of k. Normalizing the previous query by the number of Tags we get
So to cover over 70% of all used tags we need to include the top 150 tags in the model.
Building a training dataset
Based on the earlier exploration I've introduced the following design choices
- Since there is a sizable number of posts with only one tag I will start with formulating this as a single-label classification problem and filtering to only samples with a single tag.
- As most posts are from within the past 6 years and the most common tags are older we will ignore any temporal effects.
- The model should classify content into one of the 150 top tags to get a decent (70%) coverage of the used tags.
To write a query that creates the desired dataset, we'll first use the WITH statement to set up the two tables that we will use in the query. Posts is the relevant features (content and tag, with id for traceability) of all entries with a single tag. TopTagCounts is the top 700 tags, which we need to sample a constant number of samples with each tag.
WITH Posts as ( SELECT Posts.id, CONCAT(title, "\n", body) as content, Posts.tags as tag FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1 ), TopTagCounts as ( SELECT tag, COUNT(*) as count FROM Posts GROUP BY tag ORDER BY count DESC LIMIT 700 )
We will also need a function to sample from each class. This function returns True for approximately a selected number of samples for all classes. To make sure the data is balanced, it will also discard all classes with fewer than the selected number of samples. The strategy is similar to the ones explained in this Stack Overflow post, but with a hash function of the text (FARM_FINGERPRINT) instead of a random number, to achieve a deterministic selection based on the content.
CREATE TEMP FUNCTION sample_class(text STRING, samples_per_class INT64, class_size INT64) AS ( (class_size > samples_per_class) and (0 = MOD( FARM_FINGERPRINT(text), CAST(class_size / samples_per_class as INT64) )) );
From these we may now construct the final query, selecting a constant number of samples (500) from all classes and cropping the content at 51200 characters, the maximum allowed for any cell on the Peltarion platform.
SELECT Posts.id, Posts.tag, SUBSTR(Posts.content, 0, 51200) as content FROM Posts INNER JOIN TopTagCounts ON Posts.tag = TopTagCounts.tag WHERE sample_class(Posts.content, 500, TopTagCounts.count)
The full query to run is
CREATE TEMP FUNCTION sample_class(text STRING, samples_per_class INT64, class_size INT64) AS ( (class_size > samples_per_class) and (0 = MOD( FARM_FINGERPRINT(text), CAST(class_size / samples_per_class as INT64) )) ); WITH Posts as ( SELECT Posts.id, CONCAT(title, "\n", body) as content, Posts.tags as tag FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1 ), TopTagCounts as ( SELECT tag, COUNT(*) as count FROM Posts GROUP BY tag ORDER BY count DESC LIMIT 150 ) SELECT Posts.id, Posts.tag, SUBSTR(Posts.content, 0, 51200) as content FROM Posts INNER JOIN TopTagCounts ON Posts.tag = TopTagCounts.tag WHERE sample_class(Posts.content, 500, TopTagCounts.count)
Run the query in the window to get the results in a temporary table. This will yield approximately 150 * 500 = 75 000 samples. Finally, export the results to a BigQuery table which you'll import to the Peltarion platform using the SAVE RESULTS button.
That's about it! The hardest part is now done. From here on, training the model and deploying it to build an app is a breeze.
Training a BERT classifier to predict the tag of a post
Thanks to the BigQuery and Google Cloud Storage / S3 integrations it's easy to import and use the previously listed datasets to train deep learning models on the Peltarion Platform. From a new project use the BigQuery import option.
Follow the wizard to select the previously created table, and import it as a dataset.
In the feature view select the content field and set it to 512 tokens English uncased BERT tokens.
From here all you need to do is to save the dataset, select "Use in new experiment", and follow the dialogs to create a model that takes content and predicts the tag. I'm going with all parameters set to the suggested values, but feel free to play around with them. In particular, you might want to increase the number of epochs and play around with the learning rate. The model should start training and you'll be able to follow the progress in the Evaluation view.
How well does this perform?
With all the suggested settings, after 2 epochs I got a validation cross-entropy of 0.5, which is pretty hard to get an intuitive understanding of. Fortunately, the available metrics give us a more complete picture. Accuracy, recall and precision are all around 87%, meaning 87% of the time the model guesses on the correct label. That's pretty fantastic for just going with the suggested settings, especially when working with 150 classes!
We can get an even better understanding by digging through examples in the confusion matrix. I get the following confusion matrix.
The vast majority of examples fall on the diagonal, corresponding to correct classification. To understand the mistakes the model does we can zoom into some of the particularly strong cells outside of the diagonal. These indicate misclassifications, i.e. where the model predicted something other than the true label.
Though this requires more thorough analysis, it seems like some of the most common mistakes the model does is trying to separate different versions of the same tag. So python-3.x gets mixed up with python, ruby-on-rails-3 with ruby-on-rails and sql-server with sql-server-2008. That seems pretty reasonable, and a good indication that the model is performing well! If we want the model to suggest more generic tags, we could remove the numerical parts of the labels (see stemming in the list of suggestions).
Serving the model through a REST API
Deploying the model simply means creating a deployment and enabling it. This sets up a REST API that we can use from the Chrome extension. All we need to do is to select the experiment and checkpoint we want to be served. The input data will be processed just like the data uploaded.
04/ Building a Chrome extension that suggests tags when writing a Stack Overflow question
To test this out we built a Chrome extension to use the deployed model to suggest tags when writing a question on Stack Overflow. The Chrome extension is open-sourced and available with source code at: http://github.com/Peltarion/stackoverflow-tags-extension/. You can add it to chrome via the Chrome Web Store.
Once the button is pressed to suggest tags, the extension sends the title and body to the REST API of the model. The response is a dictionary with the estimated likelihood of the question being about each tag which occurred in the dataset. We then select the three tags with the highest score and show them as suggestions.
Try it out and let us know what you think! This is just the simplest model we could build so we expect it to have plenty of margin for improvement. We'll list a few ideas below, but if you manage to train a better model you can point the extension to your own deployed model!
05/ Ideas for improving the model
We made a lot of simplifications to make this post easy to follow. Lifting either of these should allow you to build a better model.
- We've limited ourselves to single-label classification. While that makes a lot of things easier, it might also not be the best model of the data. Since the tags are not mutually exclusive, it may be better to treat them as individually occurring tags by doing multi-label classification.
- You could also use more of the available data by playing around with the number of samples per tag and how many tags to use.
- Stemming on the tags. E.g. considering the tags python, python-2.x and python-3.x the same tag.
- While fields like CreationDate would probably be helpful to account for temporal affect and the User table might help in understanding the posting user, we will keep things simple for now. E.g. a model only trained on posts from the past year might perform better for predictions of new posts.
- Set up a test set and perform some directed experiments to pick the best hyperparameters.
Let us know what you think! And please feel free to get in touch with us at firstname.lastname@example.org or through the in-platform chat if you have any questions while you’re trying it out.
02/ More on Data science
Search text by semantic similarity
Connecting the dots in Neural Networks
Image similarity with deep learning explained
How can a neural network deal with a vague concept like similarity? Why do we measure angles with the cosine similarity? Is it that far from classification?
Peltarion on Microsoft's AI Show
Create a text dataset from voice recordings