If you can find your mouse pad and click on it a total of five times, you can do this.
Ok, let’s get started!
Go to this page and click into the ‘Data’ tab. Scroll down to the data files and click ‘Download all’. This is your dataset. No. Don’t do that thing where you read the article first and then decide if you’re going to do it or not. Just do it. It’s five clicks - less time than you’ll spend on social media this evening.
Now the next thing you need to do is to create a free account on Peltarion's platform, here. Note that due to high volumes of learners like yourselves, it may take up to five minutes for you to receive the email to confirm your account (feel free to throw in some social media time here if you’re worried about missing out).
Ping! And there’s that email.
Five clicks, starting now.
When you’ve created your account, click on 'Projects' in the top left corner, and you'll get to this view.
Click ‘New project’ in the top left corner.
Type a name (or use the default one if you think typing counts as clicking). Click ‘Create’. Open the zip file you downloaded with the data, and drag the file called ‘train’ to the platform (dragging definitely doesn’t count as clicking, we clearly need some rules here).
Click ‘Done’ when the data has been uploaded. Damn, that was a really unnecessary click. I’m going to have to talk to the frontend team about that.
I’m going to take a quick moment to brief you on what’s going on here. Your screen should now look like the screenshot attached here.
On the left-hand side, we can see that the dataset has automatically been split in two. There’s a ‘Training’ set, with 80% of the data, and a ‘Validation’ set, with 20% of the data. The model learns from the training set only. The validation set is held back so that, while the model trains, we can check how good its predictions are on tweets it hasn’t learned from.
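If you’re curious what an 80/20 split looks like outside the platform, here is a minimal sketch in plain Python. It assumes a simple shuffle-and-slice approach; I’m not claiming this is how the platform does it internally, and the function name and the toy rows are made up for illustration.

```python
import random


def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation.

    Hypothetical helper: the platform's own splitting logic may differ.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    # First slice is held out for validation, the rest is used for training.
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy stand-in for the uploaded file: ten made-up (text, target) rows.
rows = [(f"tweet {i}", i % 2) for i in range(10)]
training, validation = split_train_validation(rows)
print(len(training), len(validation))
```

Every row ends up in exactly one of the two sets, which is the whole point: the validation tweets stay unseen during training.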
We can also see a column called ‘target’, which has either a 1 or a 0 in it. The number 1 means that the tweet is about a real disaster, and the number 0, well, it means it’s not. If we hover over the chart above the ‘target’ column, we can see that we have a little over 4,000 tweets that aren’t about a real disaster, and just over 3,000 tweets that are about a real disaster.
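The same kind of count is easy to reproduce by hand. The sketch below tallies the ‘target’ values with Python’s `collections.Counter`. The five rows are invented examples in the shape of the ‘text’ and ‘target’ columns, not real data from the dataset.

```python
from collections import Counter

# Made-up rows that mimic the 'text' and 'target' columns.
rows = [
    {"text": "Forest fire near the highway, evacuations under way", "target": 1},
    {"text": "This new album is fire", "target": 0},
    {"text": "Flooding has closed the main road into town", "target": 1},
    {"text": "My inbox is a disaster this morning", "target": 0},
    {"text": "Traffic was a total wreck today", "target": 0},
]

# Count how many rows carry each label.
counts = Counter(row["target"] for row in rows)

for label in sorted(counts):
    share = counts[label] / len(rows)
    print(f"target={label}: {counts[label]} tweets ({share:.0%})")
```

Checking the balance between the two labels is worth the ten seconds: if one label hugely outnumbered the other, a model could look accurate just by always guessing the common one.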
We also have a ‘text’ column with the tweet itself. We’re now going to spend a very valuable click on changing the sequence length in this column. The reason we’re doing this is so that the model takes the whole tweet into account when it trains. Click the wrench symbol over the ‘text’ column and set the sequence length to 150.
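To get a feel for what a sequence length does, here is a rough sketch. It assumes the usual approach of cutting long inputs and padding short ones to a fixed length, and it splits on spaces as a stand-in for a real tokenizer. The helper name and the `[PAD]` filler are illustrative assumptions, not the platform’s actual behaviour.

```python
def fit_to_length(tokens, sequence_length, pad_token="[PAD]"):
    """Cut or pad a token list so it is exactly sequence_length long.

    Hypothetical helper for illustration only.
    """
    truncated = tokens[:sequence_length]
    padding = [pad_token] * (sequence_length - len(truncated))
    return truncated + padding


tweet = "Heavy rain has flooded several streets downtown"
tokens = tweet.split()  # crude whitespace split, not a real tokenizer

short = fit_to_length(tokens, 4)   # too short: the end of the tweet is lost
full = fit_to_length(tokens, 150)  # long enough: the whole tweet survives

print(short)
print(len(full), full[:8])
```

With a length that is too small, the end of the tweet never reaches the model, which is why a roomier setting is the safer choice here.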
Click ‘Save version’ in the top right corner.
Ok, by now we know that I lied, but please trust that I did it for your own good. Would you really have started this awesome project if you thought you had to go through the arduous task of clicking ten times? Also, it could be argued (and hereby will be argued) that the model-building is really what counts for the clicking-part. So here we go.
Click ‘Use in new experiment’ in the top right corner.
This brings us to the experiment wizard! The first part just asks us to confirm that we’re aware of which dataset version we’re using and that we’re happy with the split between the training and validation data.
On the next page, change the input feature to ‘text’, since that’s the column that holds the tweet. Then change the target feature to ‘target’, since that’s the column that says whether the tweet is about a real disaster.
The wizard suggests a model called ‘BERT uncased’, which is a good idea (just trust me on this one, or, if you don’t, you can read about it here). It also suggests that this is a single-label classification problem, which is correct, as we want just one prediction (either a 1 or a 0) for each tweet. Click ‘Next’ again.
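For the curious, "single-label" roughly means this: the model produces one score per class, the scores are turned into probabilities, and the class with the highest probability wins. The sketch below shows that idea with a softmax over two made-up scores. It’s a generic illustration, not a look inside the platform’s BERT model, and both function names are my own.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores):
    """Return the index of the most probable class (0 or 1 here)."""
    probabilities = softmax(scores)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Made-up scores for one tweet: index 0 = not a disaster, index 1 = disaster.
scores = [0.3, 1.9]
print(softmax(scores))
print(predict_label(scores))
```

Exactly one label comes out per tweet, which is what separates this from multi-label problems where several labels can apply at once.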
Click ‘Create’ on the next page.
You’ll now see the basic architecture of the model displayed. If we’re happy with the way it looks (no error messages or other nasty surprises) just click ‘Run’ in the top right corner.
And that’s it!
Your model is now training. It will take up to two hours for the model to train (training a model requires a lot of computing resources, after all).
If you want, you can check back in a few hours and click into the modeling view to see how it’s doing. It’s likely to get around 79% accuracy (that’s what mine got), which is in line with what other people on Kaggle got with the same dataset.
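Accuracy itself is nothing fancy: it’s the share of tweets where the prediction matches the ‘target’ value. Here is a small sketch of that calculation. The ten predictions and targets are invented so the arithmetic is easy to follow, and they have nothing to do with my actual results.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Invented labels for ten tweets (1 = real disaster, 0 = not).
targets = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predictions = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

print(f"{accuracy(predictions, targets):.0%}")
```

Two of the ten invented predictions are wrong, so this toy example lands at 80%.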
Pretty awesome, don’t you think?
You can also check out the deployment view, where you can click on ‘test deployment’ and write in a few different tweets to see what it predicts (the first prediction will take a bit of time every time you first load the page, since it will first need to call the model, but after that it should be good for a friends-and-family-demo at any given time!).
So, what was the point of this?
First of all, don’t forget the click-to-streetcred-ratio. And even if it was actually ten clicks, it’s pretty good I think.
But in all seriousness, what I’m trying to do in this article is to get you over that initial hurdle: the belief that data science is an inaccessible thing that only people with years of training can get into. If you have a problem that you want to solve, you can start exploring. In fact, more than anything else, having a problem that you want to solve is how most people end up getting into the field. And now that you’ve built your first AI model, you hopefully also have the confidence you need to start learning!
Hungry for more? Start a new project and browse through our dataset library to try something out on your own.
Want to keep reading?
Here’s another article I wrote a while back that lets you use your own Twitter data.