Target audience: Data scientists and Developers
Author: Mikael Huss, Data Scientist at Peltarion
The past few years have seen a number of interesting applications of deep learning to the field of biology, including applications to DNA research.
We were curious what would happen if we attempted to replicate the results of a research paper on yeast DNA that was published in Genome Research in 2017, using the Peltarion Platform. The paper, Deep learning of the regulatory grammar of yeast 5′ untranslated regions from 500,000 random sequences, is concerned with the problem of predicting, from DNA sequence alone, how much protein a gene will produce.
DNA constitutes the hereditary template which specifies how cells should produce their proteins. DNA is, to a first approximation, identical in all cells in the body. However, the static DNA template gets used in a dynamically varying way in different cell types and organs. It is read by molecular machines that transcribe it into a temporary intermediate, RNA, which is then translated to proteins, molecules that do most of the actual work in the body.
The genetic code (which determines how so-called coding regions in DNA get translated into the protein building blocks or amino acids) is known, but the finer points of how this process is controlled and fine-tuned are much less well understood. For example, there are parts of DNA that are transcribed to the RNA intermediate but not translated further into proteins. These are called untranslated regions (UTR) and are thought to be involved in controlling the rate at which a gene is made into protein (rather than being part of the code for making the actual amino acids).
The authors of the paper wanted to construct a predictive model relating UTR sequence to protein production by generating a large number (almost 500,000) of random, 50-base-pair DNA sequences attached to a gene coding for a protein. The production rate of this protein was measured indirectly through a growth rate assay (details are in the paper). In other words, they created 500,000 training examples consisting of strings with length 50, where each character was either A, C, G or T. For each training example, there was a corresponding target value to predict, obtained from the growth rate measurement.
The authors describe their next steps:
"We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5′ UTRs as well as native S. cerevisiae 5′ UTRs. The model additionally was used to computationally evolve highly active 5′ UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model."
Luckily, the paper’s authors provide both the data they used and Jupyter Notebooks with Keras code. This makes it easier for us to reimplement the study on the Peltarion Platform.
If you want to see how the data was prepared, you can take a look at this script that will both download the authors’ original data as a CSV file and preprocess it. When you download the script, make sure that the dependencies (pandas, NumPy, Fire, tqdm) are installed, and you should be able to run it from the command line. The only thing you need to specify is where the files should be saved. Finally, if you want to see the authors’ original analyses, they are available as Jupyter Notebooks here together with the raw data in CSV format.
Now it's time to start using the Platform! Let’s start by creating a new project on the Peltarion Platform, which we will call Yeast Blog, and opening that project.
The next thing to do is to create a dataset from the input data files we have prepared according to what was written above. Click New dataset and upload the CSV and NumPy files (index, target values and one-hot-encoded sequences). You will get a preview of the data when the uploads have been completed. Click Next.
Name the dataset Yeast and click Done.
The column that shows floating-point numbers is the column of target values, i.e., fluorescence levels that describe the growth rate. Click the column and change the Label to growth_rate in the Inspector to the right.
The column that lists integer values will not be used.
The string column, containing a T or an V, indicates if the row should be used for training or validation. The split between training and validation data is approximately 95% and 5%.
The column that shows black squares contains the DNA sequences in one-hot-encoded NumPy arrays with the dimensions 70x4x1:
Click the column and change the Label to seq.
The squares are black because the default assumption in the Platform is that multidimensional arrays will be images, and the preview will show a thumbnail representation of the image. The values in our matrices are always zero or one, because we are representing DNA sequences as vectors like [1, 0, 0, 0] (for “A”) or [0, 1, 0, 0] (for “C”). Since the pixel intensity of images, generally, falls in the range between 0 and 255, the thumbnail representations will be black.
Now, we need to split the dataset into subsets for training and validation. You can choose to do this with a random split or according to a logical condition. In this case, you can create a condition based on the subset column in the dataset.
Delete the default the subsets by clicking the elipsis(...) then Delete.
Click New subset and fill it out appropriately for each selection:
Select the training subset that you have created in Normalize on subset.
Finally, click Save to complete the construction of a new, version-controlled dataset.
Now it’s time to create the predictive model. We will replicate the paper authors’ winning model, which they found by hyperparameter optimization using the Hyperopt package in Python.
Here is a summary of their model as printed out by Keras (obtained by following the Jupyter Notebook linked from their paper). This shows the various layers and operations of their final model.
Let’s see how we can build an equivalent model in the Platform. Go to the Modeling view, click New experiment and give the experiment a name.
Now we get an empty canvas on which to design the model. Start by selecting the dataset, dataset version and training/validation in the Settings tab in the Inspector. Since we only have one version of the dataset, this seems a bit redundant now, but in many projects that have been going on for a while, it has been useful to be able to switch between different dataset versions and subsets here.
Now let’s start defining the structure of the model. Click the Blocks tab in the Inspector and then Input to put an Input block on the canvas. Click the new Input block, and then in the Inspector set Feature to seq. This means that the DNA sequences will be the input data to the network.
In the authors’ model, the first layer is a convolutional layer with width 13, height 4 (corresponding to the number of DNA bases) and 128 filters. To mimic this, select the 2D Convolution block (this should be automatically connected to the Input block), and fill in 128, 13 and 4 respectively in the three first editable fields (Filters, Width, Height). Leave the rest of the parameters at their default values. Underneath the block, the dimensions of the data after passing through these convolutions are shown (1 x 58 x 128).
Next, according to the paper, we need a dropout operation with dropout probability 0.15. Select the Dropout block and change the value in the Rate field to 0.15. The dropout operation will randomly set some activations to zero, but it will not affect the dimensions of the data, so the shape is still the same.
Now add another 2D Convolution block with 128 filters, width 13 and height 1, followed by another Dropout block with dropout probability 0.15.
Copy and paste the last pair of 2D Convolution and Dropout blocks. Now the shape of the data after passing through all the layers we added should be 1 x 34 x 128.
Next, these data will be passed through a series of fully connected (dense) layers to produce the final prediction. In order to do that, though, we need to “squeeze” the data into a single vector of numbers, which is done with the Flatten block.
Click Flatten. There should be a single number (4352) underneath the newly appeared node, indicating that the three-dimensional 1 x 34 x 128 tensor has been unfolded into a vector of 4352 values.
Next, click on Dense and set the value of Nodes to 64.
Click Dense again and set the value of Nodes to 1. This is the layer that will produce our prediction, a single number, which is why Nodes is equal to 1. For this last Dense layer, though, we also change the Activation to Linear.
Finally, add a target block and set Feature to growth_rate. At this point, the model should look something like this:
Click Settings. In the paper, the authors did not specify a Batch size value, so let’s stick with the default, 32, and select the Adam optimizer function, which they did specify that they used. Then click Run and go get a cup of coffee while the training gets underway!
We can watch the progress of the training by going to the Evaluation view.
Three different plots are shown on the page, and they are dynamically updated as the training goes on. The upper graph, the Loss & metrics graph, shows how the loss (the sum of prediction errors made by the model) varies for the training and validation datasets. The lower the loss, the better – but in particular we should pay attention to how the validation loss behaves. While the training error typically keeps on dropping, the validation error can start increasing again after a minimum due to the model overfitting to the training data.
The plots on the lower part of Evaluation view reflect the model’s performance at the best observed epoch (or last), in terms of validation error. So for this particular instantiation of the training process, it will show results for epoch 7.
The prediction scatterplot appears because we are dealing with a regression problem (as opposed to a classification problem) – we are trying to predict a single numeric value. This plot shows the true value on the x-axis and the predicted value on the y-axis. We see a clear, though imperfect, correlation between the true values and the values predicted by our model.
The error distribution graph can be useful for diagnosing what kinds of errors a model makes. Sometimes, a model will tend to over- or underpredict systematically, and in those cases, that will show in the error distribution plot.
Using the Peltarion Platform, we were able to replicate the paper successfully, achieving similar results. Compared to a Jupyter Notebook, the form in which the original model was published, the Peltarion Platform offers the following benefits: