Look deep into DNA

How to solve a regression problem

This tutorial will show you how to build a model that solves a regression problem, that is, a problem where you want to predict a quantity.

You will create a solution on the platform to a complex problem using state-of-the-art machine learning models.

Target audience: Intermediate users

Author: Mikael Huss, Data Scientist at Peltarion

Motivation - Replicate results of a research paper

The past few years have seen a number of interesting applications of deep learning to the field of biology, including applications to DNA research.

We were curious what would happen if we attempted to replicate the results of a research paper on yeast DNA that was published in Genome Research in 2017, using the Peltarion Platform. The paper, Deep learning of the regulatory grammar of yeast 5′ untranslated regions from 500,000 random sequences, is concerned with the problem of predicting, from DNA sequence alone, how much protein a gene will produce.


Biological background

DNA constitutes the hereditary template that specifies how cells should produce their proteins. DNA is, to a first approximation, identical in all cells in the body. However, the static DNA template gets used in a dynamically varying way in different cell types and organs. It is read by molecular machines that transcribe it into a temporary intermediate, RNA, which is then translated to proteins, molecules that do most of the actual work in the body.

The genetic code (which determines how so-called coding regions in DNA get translated into the protein building blocks or amino acids) is known, but the finer points of how this process is controlled and fine-tuned are much less well understood. For example, there are parts of DNA that are transcribed to the RNA intermediate but not translated further into proteins. These are called untranslated regions (UTR) and are thought to be involved in controlling the rate at which a gene is made into protein (rather than being part of the code for making the actual amino acids).

Formulating the problem in terms of deep learning

The authors of the paper wanted to construct a predictive model relating UTR sequence to protein production by generating a large number (almost 500,000) of random, 50-base-pair DNA sequences attached to a gene coding for a protein. The production rate of this protein was measured indirectly through a growth rate assay (details are in the paper).
In other words, they created 500,000 training examples consisting of strings with length 50, where each character was either A, C, G or T. For each training example, there was a corresponding target value to predict, obtained from the growth rate measurement.

The authors describe their next steps:

"We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5′ UTRs as well as native S. cerevisiae 5′ UTRs. The model additionally was used to computationally evolve highly active 5′ UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model."

Luckily, the paper’s authors provide both the data they used and Jupyter Notebooks with Keras code. This makes it easier for us to re-implement the study on the Peltarion Platform.

Platform ready data

If you just want the input data files with minimal hassle, use the prepared, platform-ready files.

Optional: Deeper dive into the data

If you want to see how the data was prepared, you can take a look at this script that will both download the authors’ original data as a CSV file and preprocess it. When you download the script, make sure that the dependencies (pandas, NumPy, Fire, tqdm) are installed, and you should be able to run it from the command line. The only thing you need to specify is where the files should be saved. Finally, if you want to see the authors’ original analyses, they are available as Jupyter Notebooks here, together with the raw data in CSV format.

Work with the data on the platform

Now it’s time to start using the platform! Let’s start by creating a new project, which we will call Yeast Blog, and opening that project.

The next thing to do is to create a dataset from the input data files prepared as described above.
Click Upload file and upload the CSV and NumPy files you downloaded in the previous step.

Name the dataset Yeast and click Done.

The column that lists integer values will not be used.

Target feature - growth rate

The column that shows floating-point numbers (for example, -0.6084566 or 0.4901557) contains the target values, that is, the fluorescence-derived levels that describe the growth rate.

Change the Label to growth_rate.

Training/Validation split

The string column, containing a T or a V, indicates if the row should be used for training or validation. The split between training and validation data is approximately 95% and 5%.

NOTE: An additional 5% of the original data is set aside for testing during preparation and is not included in the dataset that you have uploaded to the platform.

Input feature - One-hot-encoded NumPy arrays

The column that shows black squares contains the DNA sequences in one-hot-encoded NumPy arrays with the dimensions 70x4x1:

  • The size of the first dimension (70) corresponds to the original number of base pairs, 50 flanked by 10 dummy positions on each side.

  • The size of the second dimension (4) corresponds to the number of different nucleotides (A, C, G, T).

  • The third dimension (1) is a “dummy dimension” that we need in order to be able to perform two-dimensional convolutions on the sequences.
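A minimal sketch of this encoding (our own illustrative helper, not the authors' code) might look like this:

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot_utr(seq, pad=10):
    """Encode a DNA string as a (len + 2*pad) x 4 x 1 one-hot array,
    with all-zero dummy rows padded on each side."""
    arr = np.zeros((len(seq) + 2 * pad, 4, 1), dtype=np.float32)
    for i, base in enumerate(seq):
        arr[pad + i, NUCLEOTIDES.index(base), 0] = 1.0
    return arr

encoded = one_hot_utr("ACGT" * 12 + "AC")  # any 50-base-pair sequence
print(encoded.shape)  # (70, 4, 1)
```

Each of the 50 real positions contributes exactly one 1, so the whole array sums to 50, while the 10 dummy rows on each side stay all-zero.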

Click the column and change the Label to seq.

The squares are black because the default assumption on the platform is that multidimensional arrays are images, so the preview shows a thumbnail representation of each array. The values in our matrices are always zero or one, because we represent DNA sequences as vectors like [1, 0, 0, 0] (for “A”) or [0, 1, 0, 0] (for “C”). Since pixel intensities in images generally range from 0 to 255, values this close to zero render as black thumbnails.

Create subsets

Now, we need to split the dataset into subsets for training and validation. To do this, you need to make sure you see the advanced settings.

You can split the data randomly or according to a logical condition. In this case, create a condition based on the subset column in the dataset.

  • Click New subset and name the training subset Training 95%
    Then click Add conditional filter and set Feature to subset, Operator to is equal to, and Value to T.
    Click Create subset.

  • Repeat the procedure. Name the new subset Validation 5%.
    Set Feature to subset, Operator to is equal to, and Value to V.
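For intuition, the two conditional filters correspond to simple boolean selections; here is a toy pandas equivalent (only the subset column and its T/V values mirror the real data, the rows are made up):

```python
import pandas as pd

# Toy stand-in for the uploaded dataset; only the "subset" column and
# its T/V values mirror the real data.
df = pd.DataFrame({
    "subset": ["T", "T", "V", "T"],
    "growth_rate": [-0.61, 0.49, 0.12, -0.05],
})

training = df[df["subset"] == "T"]    # the "Training 95%" condition
validation = df[df["subset"] == "V"]  # the "Validation 5%" condition
```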

Finally, click Use in new experiment to complete the construction of a new, version-controlled dataset.

Use in new experiment button

Create custom experiment in the Wizard

In the Experiment wizard, click Create custom experiment. Now we get an empty canvas on which to design the model.

Create custom experiment button

Replicate the authors' model

Now it’s time to create the predictive model. We will replicate the paper authors’ winning model, which they found by hyperparameter optimization using the Hyperopt package in Python.

Here is a summary of their model as printed out by Keras (obtained by following the Jupyter Notebook linked from their paper). This shows the various layers and operations of their final model.

Keras model summary
Figure 1. Keras model summary
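As a rough Keras sketch of the same architecture (our reconstruction from the printed summary, not the authors' exact notebook code; the ReLU activations on the convolution layers are an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Our reconstruction of the architecture from the Keras summary; the
# ReLU activations on the convolution layers are an assumption.
model = keras.Sequential([
    keras.Input(shape=(70, 4, 1)),        # 50 bp + 10 dummy positions per side
    layers.Conv2D(128, (13, 4), padding="valid", activation="relu"),
    layers.Dropout(0.15),
    layers.Conv2D(128, (13, 1), padding="valid", activation="relu"),
    layers.Dropout(0.15),
    layers.Conv2D(128, (13, 1), padding="valid", activation="relu"),
    layers.Dropout(0.15),
    layers.Flatten(),                     # 34 * 1 * 128 = 4352 values
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="linear"), # single regression output
])
model.compile(optimizer="adam", loss="mse")
```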

Add blocks to the model

Now let’s start defining the structure of the model. Click the Build tab in the Inspector and then Input to add an input block on the canvas. Click the new Input block, and then in the Inspector set Feature to seq. This means that the DNA sequences will be the input data to the network.

In the authors’ model, the first layer is a convolutional layer with width 13, height 4 (corresponding to the number of DNA bases) and 128 filters. To mimic this, select the 2D Convolution block (this should be automatically connected to the Input block), and fill in 128, 13 and 4 respectively in the three first editable fields (Filters, Width, Height). Change Padding to Valid.
Underneath the block, the dimensions of the data after passing through these convolutions are shown (1 x 58 x 128).

Model start

Next, according to the paper, we need a dropout operation with dropout probability 0.15. Select the Dropout block and change the value in the Rate field to 0.15. The dropout operation will randomly set some activations to zero, but it will not affect the dimensions of the data, so the shape is still the same.


Now add another 2D Convolution block with 128 filters, width 13 and height 1, followed by another Dropout block with dropout probability 0.15.

Copy and paste the last pair of 2D Convolution and Dropout blocks. Now the shape of the data after passing through all the layers we added should be 1 x 34 x 128.

Next, these data will be passed through a series of fully connected (dense) layers to produce the final prediction. In order to do that, though, we need to “squeeze” the data into a single vector of numbers, which is done with the Flatten block.

Click Flatten. There should be a single number (4352) underneath the new block, indicating that the three-dimensional 1 x 34 x 128 tensor has been unfolded into a vector of 4352 values.
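The numbers shown under the blocks follow from the "valid"-padding rule (output length = input length - kernel length + 1); a quick check:

```python
# "Valid" padding: output length = input length - kernel length + 1.
def valid_out(length, kernel):
    return length - kernel + 1

rows = valid_out(70, 13)    # first 13x4 convolution  -> 58
rows = valid_out(rows, 13)  # second 13x1 convolution -> 46
rows = valid_out(rows, 13)  # third 13x1 convolution  -> 34
flat = rows * 1 * 128       # Flatten: 34 * 1 * 128 = 4352
print(flat)  # 4352
```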

Next, click on Dense and set the value of Nodes to 64.

Click Dense again and set the value of Nodes to 1. This is the layer that will produce our prediction, a single number, which is why Nodes is equal to 1. For this last Dense layer, though, we also change the Activation to Linear.

Finally, add a target block and set Feature to growth_rate. At this point, the model should look something like this:

Final model
Figure 2. Final model

Train the model

Click the Settings tab. The paper does not specify a Batch size, so let’s stick with the default, 32, and use the Adam optimizer, which the authors did specify.

Then click Run and go get a cup of coffee while the training gets underway!

Run button

Monitoring training progress

You can watch the progress of the training in the Evaluation view.

Loss and metrics tab

The loss (a measure of the prediction errors made by the model) varies for the training and validation datasets. The lower the loss, the better – but in particular, we should pay attention to how the validation loss behaves (the bold line).

While the training error typically keeps on dropping, the validation error can start increasing again after a minimum due to the model overfitting to the training data.

Predictions inspection tab

The plots in the Predictions inspection tab reflect the model’s performance, in terms of validation error, at the best epoch observed so far (or the last one). For this particular run of the training process, that means it will show results for epoch 7.

Evaluation view - Scatterplot
Figure 3. Evaluation view - Scatterplot

The prediction scatterplot appears because we are dealing with a regression problem (as opposed to a classification problem) – we are trying to predict a single numeric value. This plot shows the true value on the x-axis and the predicted value on the y-axis.

We see a clear, though imperfect, correlation between the true values and the values predicted by our model.
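If you export predictions, one simple way to quantify that correlation is Pearson’s r between true and predicted values (the arrays here are purely illustrative):

```python
import numpy as np

# Illustrative values only; in practice these come from the trained model.
y_true = np.array([0.10, -0.40, 0.80, 0.30, -0.20])
y_pred = np.array([0.15, -0.30, 0.70, 0.25, -0.10])

r = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation coefficient
```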


Conclusion

Using the Peltarion Platform, we were able to replicate the paper successfully, achieving similar results. Compared to a Jupyter Notebook, the form in which the original model was published, the Peltarion Platform offers the following benefits:

  • No installation required (no TensorFlow/Keras/other dependencies)

  • Transparent access to a GPU

  • Multi-user projects with easy sharing within an organization

  • At-a-glance comparisons of results of different experiments

  • Quick iteration when experimenting; for instance, you can “clone” a trained model and continue building on top of it

  • Model deployment to a REST API
