Look deep into DNA
How to solve a regression problem
This tutorial will show you how to build a model that will solve a regression problem. That is a problem where you want to predict a quantity.
You will create a solution on the platform to a complex problem using state of the art machine learning models.
- Target audience: Intermediate users
Author: Mikael Huss, Data Scientist at Peltarion
Motivation - Replicate results of a research paper
The past few years have seen a number of interesting applications of deep learning to the field of biology, including applications to DNA research.
We were curious what would happen if we attempted to replicate the results of a research paper on yeast DNA that was published in Genome Research in 2017, using the Peltarion Platform. The paper, Deep learning of the regulatory grammar of yeast 5′ untranslated regions from 500,000 random sequences, is concerned with the problem of predicting, from DNA sequence alone, how much protein a gene will produce.
DNA constitutes the hereditary template which specifies how cells should produce their proteins. DNA is, to a first approximation, identical in all cells in the body. However, the static DNA template gets used in a dynamically varying way in different cell types and organs. It is read by molecular machines that transcribe it into a temporary intermediate, RNA, which is then translated to proteins, molecules that do most of the actual work in the body.
The genetic code (which determines how so-called coding regions in DNA get translated into the protein building blocks or amino acids) is known, but the finer points of how this process is controlled and fine-tuned are much less well understood. For example, there are parts of DNA that are transcribed to the RNA intermediate but not translated further into proteins. These are called untranslated regions (UTR) and are thought to be involved in controlling the rate at which a gene is made into protein (rather than being part of the code for making the actual amino acids).
Formulating the problem in terms of deep learning
The authors of the paper wanted to construct a predictive model relating UTR sequence to protein production by generating a large number (almost 500,000) of random, 50-base-pair DNA sequences attached to a gene coding for a protein.
The production rate of this protein was measured indirectly through a growth rate assay (details are in the paper).
In other words, they created 500,000 training examples consisting of strings with length 50, where each character was either A, C, G or T. For each training example, there was a corresponding target value to predict, obtained from the growth rate measurement.
The authors describe their next steps:
"We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5′ UTRs as well as native S. cerevisiae 5′ UTRs. The model additionally was used to computationally evolve highly active 5′ UTRs. 'We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model."
Luckily, the paper’s authors provide both the data they used and Jupyter Notebooks with Keras code. This makes it easier for us to re-implement the study on the Peltarion Platform.
Platform ready data
If you just want the input data files with minimal hassle:
Optional: Deeper dive into the data
If you want to see how the data was prepared, you can take a look at this script that will both download the authors’ original data as a CSV file and preprocess it. When you download the script, make sure that the dependencies (pandas, NumPy, Fire, tqdm) are installed, and you should be able to run it from the command line. The only thing you need to specify is where the files should be saved. Finally, if you want to see the authors’ original analyses, they are available as Jupyter Notebooks here together with the raw data in CSV format.
Work with the data on the platform
Now it’s time to start using the platform! Let’s start by creating a new project, which we will call Yeast Blog, and opening that project.
The next thing to do is to create a dataset from the input data files we have prepared according to what was written above.
Click Upload file and upload the CSV and NumPy files you downloaded in the previous step.
Name the dataset Yeast and click Done.
The column that lists integer values will not be used.
Target feature - growth rate
The column that shows floating-point numbers, for example, -0.6084566 or 0.4901557 is the column of target values, that is, fluorescence levels that describe the growth rate.
Change the Label to growth_rate.
The string column, containing a T or a V, indicates if the row should be used for training or validation. The split between training and validation data is approximately 95% and 5%.
NOTE: An additional 5% of the original data is set aside for testing during preparation and is not included in the dataset that you have uploaded to the platform.
Input feature - One-hot-encoded NumPy arrays
The column that shows black squares contains the DNA sequences in one-hot-encoded NumPy arrays with the dimensions 70x4x1:
The size of the first dimension (70) corresponds to the original number of base pairs, 50 flanked by 10 dummy positions on each side.
The size of the second dimension (4) corresponds to the number of different nucleotides (A, C, G, T).
The third dimension (1) is a “dummy dimension” that we need in order to be able to perform two-dimensional convolutions on the sequences.
Click the column and change the Label to seq.
The squares are black because the default assumption in the Platform is that multidimensional arrays will be images, and the preview will show a thumbnail representation of the image. The values in our matrices are always zero or one, because we are representing DNA sequences as vectors like [1, 0, 0, 0] (for “A”) or [0, 1, 0, 0] (for “C”). Since the pixel intensity of images, generally, falls in the range between 0 and 255, the thumbnail representations will be black.
Now, we need to split the dataset into subsets for training and validation. To do this you need to make sure you see the advanced settings.
You can choose to usa a random split or according to a logical condition. In this case, you can create a condition based on the subset column in the dataset.
Click New subset and name the training subset Training 95%
Then click Add conditional filter and set Feature to subset, Operator to is equal to, and Value to T.
Click Create subset.
Repeat the procedure. Name the new subset Validation 5%.
Set Feature to subset, Operator to is equal to, and Value to V.
Finally, click Use in new experiment to complete the construction of a new, version-controlled dataset.
Replicate the authors' model
Now it’s time to create the predictive model. We will replicate the paper authors’ winning model, which they found by hyperparameter optimization using the Hyperopt package in Python.
Here is a summary of their model as printed out by Keras (obtained by following the Jupyter Notebook linked from their paper). This shows the various layers and operations of their final model.
Create an experiment
Let’s see how we can build an equivalent model on the platform. Name the experiment in the Experiment wizard, make sure the correct dataset, dataset version, training and validation subsets are selected. Since we only have one version of the dataset, this seems a bit redundant now, but in many projects that have been going on for a while, it has been useful to be able to switch between different dataset versions and subsets here.
Click Use in new experiment. Now we get an empty canvas on which to design the model.
Add blocks to the model
Now let’s start defining the structure of the model. Click the Build tab in the Inspector and then Input to add an input block on the canvas. Click the new Input block, and then in the Inspector set Feature to seq. This means that the DNA sequences will be the input data to the network.
In the authors’ model, the first layer is a convolutional layer with width 13, height 4 (corresponding to the number of DNA bases) and 128 filters. To mimic this, select the 2D Convolution block (this should be automatically connected to the Input block), and fill in 128, 13 and 4 respectively in the three first editable fields (Filters, Width, Height). Leave the rest of the parameters at their default values. Underneath the block, the dimensions of the data after passing through these convolutions are shown (1 x 58 x 128).
Next, according to the paper, we need a dropout operation with dropout probability 0.15. Select the Dropout block and change the value in the Rate field to 0.15. The dropout operation will randomly set some activations to zero, but it will not affect the dimensions of the data, so the shape is still the same.
Now add another 2D Convolution block with 128 filters, width 13 and height 1, followed by another Dropout block with dropout probability 0.15.
Copy and paste the last pair of 2D Convolution and Dropout blocks. Now the shape of the data after passing through all the layers we added should be 1 x 34 x 128.
Next, these data will be passed through a series of fully connected (dense) layers to produce the final prediction. In order to do that, though, we need to “squeeze” the data into a single vector of numbers, which is done with the Flatten block.
Click Flatten. There should be a single number (4352) underneath the newly appeared node, indicating that the three-dimensional 1 x 34 x 128 tensor has been unfolded into a vector of 4352 values.
Next, click on Dense and set the value of Nodes to 64.
Click Dense again and set the value of Nodes to 1. This is the layer that will produce our prediction, a single number, which is why Nodes is equal to 1. For this last Dense layer, though, we also change the Activation to Linear.
Finally, add a target block and set Feature to growth_rate. At this point, the model should look something like this:
Train the model
Click Settings. In the paper, the authors did not specify a Batch size value, so let’s stick with the default, 32, and the Adam optimizer function, which they did specify that they used. Then click Run and go get a cup of coffee while the training gets underway!
Monitoring training progress
We can watch the progress of the training by going to the Evaluation view.
Three different plots are shown on the page, and they are dynamically updated as the training goes on. The upper graph, under Training overview, shows how the loss (the sum of prediction errors made by the model) varies for the training and validation datasets. The lower the loss, the better – but in particular we should pay attention to how the validation loss behaves. While the training error typically keeps on dropping, the validation error can start increasing again after a minimum due to the model overfitting to the training data.
The plots on the lower part of Evaluation view reflect the model’s performance at the best observed epoch (or last), in terms of validation error. So for this particular instantiation of the training process, it will show results for epoch 7.
The prediction scatterplot appears because we are dealing with a regression problem (as opposed to a classification problem) – we are trying to predict a single numeric value. This plot shows the true value on the x-axis and the predicted value on the y-axis. We see a clear, though imperfect, correlation between the true values and the values predicted by our model.
The error distribution graph can be useful for diagnosing what kinds of errors a model makes. Sometimes, a model will tend to over- or underpredict systematically, and in those cases, that will show in the error distribution plot.
Using the Peltarion Platform, we were able to replicate the paper successfully, achieving similar results. Compared to a Jupyter Notebook, the form in which the original model was published, the Peltarion Platform offers the following benefits:
No installation required (no TensorFlow/Keras/other dependencies)
Transparent access to a GPU
Multi-user projects with easy sharing within an organization
At-a-glance comparisons of results of different experiments
Quick iteration when experimenting; for instance, you can “clone” a trained model and continue building on top of it
Model deployment to a REST API.