
Building a Stack Overflow question tagging model with public BigQuery data

February 23 / 10 min read

    BigQuery, available on Google Cloud Platform, is a powerful tool for querying large amounts of data. Not only do many companies make their own data accessible via BigQuery, often for analytics purposes, but a large number of public datasets are also available through the tool.

    Using BigQuery's powerful SQL dialect, it's easy to build a specialized dataset for deep learning with just a few lines of code. In this blog post, we'll show how you can shape one of these public datasets, the Stack Overflow posts dataset, to train a question-tagging BERT model.

    02/ Public datasets freely available for BigQuery

    At the time of writing, there are 190 public datasets freely available for BigQuery, covering a wide range of domains.

    We'll look at how to work with a balanced random sample of the dataset, and how to make sure that the sample is reproducible and can be extended with new data. We'll also demonstrate how to create a Chrome plugin that can add the suggested tag as you type a question. Hopefully, this can serve as a good starting point for anyone modeling data accessible through BigQuery!

    03/ Building a question tagger for Stack Overflow

    Have you noticed that Stack Overflow sometimes suggests tags? It doesn’t happen very often, and we suspect it’s based on finding similar entries. We thought it would be fun to show how one could build a content-based tagger instead, which could suggest tags much more often. The model we train will be used in this Chrome extension to tag Stack Overflow questions while you're writing them. You'll even be able to replace our model with one of your own. Check out how, below!

    Tagging Stack Overflow posts

    When posing questions on Stack Overflow, it is important to set appropriate tags so that people familiar with the area can find them. Good tags on Stack Overflow typically follow a certain set of best practices, and it is not uncommon for more experienced users to edit the tags of a post. To improve things, automated tagging has been suggested on multiple occasions, and seems to have been implemented a while back. You'd be forgiven for not having seen the feature before. In our experience it's very uncommon to get suggestions, which might be related to the dated methods used to produce them.

    This is how Stack Overflow currently surfaces tag suggestions.

    In this post, we'll illustrate how one could replace the tagging system with a modern, content-based tagging model built from a pre-trained BERT model. We'll even integrate it into Stack Overflow using a custom Chrome plugin. We hope to show how quickly a combination of data in BigQuery, the Peltarion Platform and some simple JavaScript code can let you build powerful AI-enabled apps.

    Unfortunately, the Stack Overflow dataset isn't very well documented on the dataset page. Thankfully, this post on Meta goes through the available fields in detail. The fields relevant for us are all in the Posts table:

    • Title, self-explanatory
    • Body, the post itself including HTML
    • Tags, a string of all tags for the post separated by |

    where the former two will be the input and the tags are to be predicted. There are many ways of handling two inputs; for this problem we will simply concatenate the fields into one in BigQuery. But let's first take a quick look at how to access the dataset.

    Accessing the dataset in BigQuery

    (To follow along with these instructions you will need a Google Cloud Platform account. For our purposes you should hopefully not need more than what the free tier includes, but this is for you to decide.)

    Follow the link from the Stack Overflow posts dataset page to BigQuery, or go to BigQuery in the GCP console. With the posts_questions table selected, press QUERY TABLE to get the query field populated with a starter query. Select "*" to get a preview of all columns in the table.
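
    A minimal preview query along these lines is enough to get a feel for the available columns (the LIMIT value is just an arbitrary small number to keep the result cheap):

    -- Preview a few rows of the questions table
    SELECT *
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    LIMIT 10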

    The BigQuery interface with the Stack Overflow dataset open.

    We'll run a few queries and visualize their results to understand the data we're working with.

    How many tags are there per post?

    First, let's look at how many tags there are per post.

    SELECT
        IF(Posts.tags = '', 0, ARRAY_LENGTH(SPLIT(Posts.tags, '|'))) as number_of_tags,
        COUNT(*) as count
    FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
    GROUP BY number_of_tags
    ORDER BY number_of_tags

    Since the Tags field is separated by |, we'll simply split it and look at the length of the resulting array. Notice that we have to treat the empty field separately, as it would otherwise be counted as a single tag. Plotting the data returned by the query, we can see that while the majority of questions have 3 tags, there is also a large number of posts with a single tag.

    The number of questions by the number of tags assigned.

    When are the posts from?

    Grouping by creation year instead, we get a histogram across years.

    SELECT
        EXTRACT(YEAR FROM Posts.creation_date) as year,
        COUNT(*) as count
    FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
    GROUP BY year
    ORDER BY year

    While the number of posts grew rapidly until around 2013, the number of questions posed per year hasn't changed much since then. We won't draw any conclusions based on this, but it's good to know that the vast majority of the data is from the past decade.

    The number of questions by the year posted.

    What are the most common tags?

    Finally, let's take a look at which tags are the most popular. Here I'll only do this for posts with a single tag, but I'll link to a notebook which does it for all posts.

    SELECT tags as tag, COUNT(*) as count
    FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
    WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1
    GROUP BY tags
    ORDER BY count DESC
    LIMIT 1000

    Even though a large share of posts is concentrated in the top 20 tags, the tail of this distribution is fat. In fact, the majority of all tag usages come from the tail, so it's important that our classifier can handle many tags. Furthermore, since the dataset is unbalanced with respect to tags, we will need to handle this somehow to avoid getting a very biased model.

    The most frequent tags and how many times they occur in the dataset.

    We want to build a model that predicts which of a fixed number k of tags is relevant given the content. To support the choice of k, we can study what fraction of all examples we cover as k grows. Normalizing the previous query by the total number of single-tag posts, we get

    The fraction of all tag usages covered as a function of how many of the most frequent tags are included (x-axis).

    So to cover over 70% of all used tags we need to include the top 150 tags in the model.
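
    One way to compute this coverage curve directly in BigQuery (a sketch, restricted to single-tag posts like the query above) is to divide a running total of the per-tag counts by the overall total using window functions:

    SELECT
        tags as tag,
        COUNT(*) as count,
        -- running share of posts covered by the tags seen so far
        SUM(COUNT(*)) OVER (ORDER BY COUNT(*) DESC
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          / SUM(COUNT(*)) OVER () as coverage
    FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
    WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1
    GROUP BY tags
    ORDER BY count DESC
    LIMIT 1000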

    Building a training dataset

    Based on the earlier exploration, I've made the following design choices:

    • Since there is a sizable number of posts with only one tag, I will start by formulating this as a single-label classification problem and filtering to samples with a single tag.
    • As most posts are from within the past 6 years and the most common tags are older, we will ignore any temporal effects.
    • The model should classify content into one of the 150 top tags to get a decent (70%) coverage of the used tags.

    To write a query that creates the desired dataset, we'll first use a WITH statement to set up the two tables that we will use in the query. Posts contains the relevant features (content and tag, with id for traceability) of all entries with a single tag. TopTagCounts contains the top 150 tags, which we need in order to sample a constant number of examples for each tag.

    WITH
      Posts as (
        SELECT
          Posts.id,
          CONCAT(title, "\n", body) as content,
          Posts.tags as tag
        FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
        WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1
      ),
      TopTagCounts as (
        SELECT tag, COUNT(*) as count
        FROM Posts
        GROUP BY tag
        ORDER BY count DESC
        LIMIT 150
      )

    We will also need a function to sample from each class. This function returns true for approximately the selected number of samples in each class. To make sure the data is balanced, it also discards all classes with fewer than the selected number of samples. The strategy is similar to the ones explained in this Stack Overflow post, but with a hash of the text (FARM_FINGERPRINT) instead of a random number, to make the selection deterministic based on the content.

    CREATE
      TEMP FUNCTION sample_class(text STRING,
                                 samples_per_class INT64,
                                 class_size INT64) AS (
        (class_size > samples_per_class) and
        (0 = MOD(
          FARM_FINGERPRINT(text),
          CAST(class_size / samples_per_class as INT64)
         ))
       );

    From these we can now construct the final query, selecting a constant number of samples (500) from each class and cropping the content at 51200 characters, the maximum allowed for any cell on the Peltarion Platform.

    SELECT Posts.id,
           Posts.tag,
           SUBSTR(Posts.content, 0, 51200) as content
    FROM Posts
    INNER JOIN TopTagCounts ON Posts.tag = TopTagCounts.tag
    WHERE sample_class(Posts.content, 500, TopTagCounts.count)

    The full query to run is

    CREATE
      TEMP FUNCTION sample_class(text STRING,
                                 samples_per_class INT64,
                                 class_size INT64) AS (
        (class_size > samples_per_class) and
        (0 = MOD(
          FARM_FINGERPRINT(text),
          CAST(class_size / samples_per_class as INT64)
         ))
       );
    
    WITH
      Posts as (
        SELECT
          Posts.id,
          CONCAT(title, "\n", body) as content,
          Posts.tags as tag
        FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
        WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1
      ),
      TopTagCounts as (
        SELECT tag, COUNT(*) as count
        FROM Posts
        GROUP BY tag
        ORDER BY count DESC
        LIMIT 150
      )
    
    SELECT Posts.id,
           Posts.tag,
           SUBSTR(Posts.content, 0, 51200) as content
    FROM Posts
    INNER JOIN TopTagCounts ON Posts.tag = TopTagCounts.tag
    WHERE sample_class(Posts.content, 500, TopTagCounts.count)

    Run the query in the window to get the results in a temporary table. This will yield approximately 150 * 500 = 75,000 samples. Finally, use the SAVE RESULTS button to export the results to a BigQuery table, which you'll then import to the Peltarion Platform.

    Saving the query result into a new BigQuery table, to be imported into the Peltarion Platform.

    That's about it! The hardest part is now done. From here on, training the model and deploying it to build an app is a breeze.

    Training a BERT classifier to predict the tag of a post

    Thanks to the BigQuery and Google Cloud Storage / S3 integrations, it's easy to import the dataset we just built and use it to train deep learning models on the Peltarion Platform. From a new project, use the BigQuery import option.

    The dataset view on the Peltarion Platform in a new project.

    Follow the wizard to select the previously created table, and import it as a dataset.

    Previewing BigQuery data during import to the Peltarion Platform.

    In the feature view, select the content field and set its encoding to English uncased BERT with a sequence length of 512 tokens.

    Setting up the imported dataset and adjusting the sequence length for the content.

    From here, all you need to do is save the dataset, select "Use in new experiment", and follow the dialogs to create a model that takes content and predicts the tag. I'm going with all parameters set to the suggested values, but feel free to play around with them. In particular, you might want to increase the number of epochs and experiment with the learning rate. The model should start training and you'll be able to follow the progress in the Evaluation view.

    How well does this perform?

    With all the suggested settings, after 2 epochs I got a validation cross-entropy of 0.5, which is hard to interpret intuitively on its own. Fortunately, the available metrics give us a more complete picture. Accuracy, recall and precision are all around 87%, meaning the model predicts the correct label 87% of the time. That's pretty fantastic for just going with the suggested settings, especially when working with 150 classes!

    We can get an even better understanding by digging through the examples in the confusion matrix, shown below.

    The confusion matrix of our trained classification model. Notice how clearly defined the diagonal is, a great sign!

    The vast majority of examples fall on the diagonal, corresponding to correct classifications. To understand the mistakes the model makes, we can zoom in on some of the particularly strong cells outside of the diagonal. These indicate misclassifications, i.e. where the model predicted something other than the true label.

    When the actual tag is python, the biggest "errors" occur when the model predicts a specific version of python. A very reasonable error!

    Similarly, when the actual tag is ruby-on-rails, the biggest "errors" occur when the model predicts ruby-on-rails-3.

    Finally, sql-server-2008 is sometimes mixed up with the non-specific tag for sql-server.

    Though this would require more thorough analysis, it seems like some of the most common mistakes the model makes involve separating different versions of the same tag: python-3.x gets mixed up with python, ruby-on-rails-3 with ruby-on-rails, and sql-server with sql-server-2008. That seems pretty reasonable, and is a good indication that the model is performing well! If we want the model to suggest more generic tags, we could remove the version parts of the labels (see stemming in the list of suggestions below).

    Serving the model through a REST API

    Deploying the model simply means creating a deployment and enabling it. This sets up a REST API that we can call from the Chrome extension. All we need to do is select the experiment and checkpoint we want to serve. The input data will be processed just like the uploaded training data.

    Setting up a model deployment on the Peltarion Platform, to allow querying it over a REST API.

    04/ Building a Chrome extension that suggests tags when writing a Stack Overflow question

    To test this out, we built a Chrome extension that uses the deployed model to suggest tags while you write a question on Stack Overflow. The extension is open source and available at: http://github.com/Peltarion/stackoverflow-tags-extension/. You can add it to Chrome via the Chrome Web Store.

    Once the button is pressed to suggest tags, the extension sends the title and body to the model's REST API. The response is a dictionary with the estimated likelihood of the question being about each tag that occurred in the dataset. We then select the three tags with the highest scores and show them as suggestions.

    An animation of the auto tag plugin, powered by the model we just trained!

    Try it out and let us know what you think! This is just the simplest model we could build, so we expect it to have plenty of room for improvement. We'll list a few ideas below, but if you manage to train a better model you can point the extension to your own deployment!

    The extension we've created allows you to choose a different deployment from the one we provide! See if you can improve on our model, and use your own instead.

    05/ Ideas for improving the model

    We made a lot of simplifications to make this post easy to follow. Lifting any of these should allow you to build a better model.

    1. We've limited ourselves to single-label classification. While that makes a lot of things easier, it might also not be the best model of the data. Since the tags are not mutually exclusive, it may be better to treat them as individually occurring tags by doing multi-label classification.
    2. You could also use more of the available data by playing around with the number of samples per tag and how many tags to use.
    3. Stem the tags, e.g. treating python, python-2.x and python-3.x as the same tag (see the sketch after this list).
    4. While fields like CreationDate would probably help account for temporal effects, and the User table might help in understanding the posting user, we will keep things simple for now. E.g. a model trained only on posts from the past year might perform better at predicting tags for new posts.
    5. Set up a test set and perform some directed experiments to pick the best hyperparameters.
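
    As a rough sketch of idea 3, one could strip version suffixes from the tags directly in BigQuery with a regular expression. The pattern below is an assumption and should be checked against the actual tag list before use:

    SELECT
        -- strip trailing version suffixes such as -3, -2.x or -2008 (the pattern is a guess)
        REGEXP_REPLACE(Posts.tags, r'-[0-9][0-9.x]*$', '') as stemmed_tag,
        COUNT(*) as count
    FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
    WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1
    GROUP BY stemmed_tag
    ORDER BY count DESC

    The same expression could be applied to the tag column in the training query above to merge versioned tags before sampling.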

    Let us know what you think! And please feel free to get in touch with us at contact@peltarion.com or through the in-platform chat if you have any questions while you’re trying it out. 
