Estimated time: 30 min

Predict California house prices

How to solve a regression problem using table data and images

In this tutorial you will build an experiment and train an AI model on real data, both numbers and images, and try to make it reliable for house price prediction.

If you deploy the final trained AI model in real life, someone could enter the location, size, and other details of their house via an online portal and get a valuation. Nice!

Target audience: Data scientists and developers

Preread: Before following this tutorial, it is strongly recommended that you complete the previous tutorial if you have not done so already.

The problem

According to Real Capital Analytics (RCA), global volumes for completed sales of commercial properties totaled $873 billion in 2017. Most of that is backed by bank mortgages. As was shown by the 2008 financial crisis, it is essential for the economy that the amount of a bank's real estate lending matches the actual value of the real estate. Getting a good estimate of the price of a house is hard even for the most seasoned real estate agents.

There is an obvious benefit to building a data-driven decision support tool for both banks and real estate agents, and such tools have been around for decades. They have typically used historical sales data to track prices in individual neighborhoods and from that get average prices. With the advent of deep learning it is now possible to get a much more sophisticated valuation as we can now use other data types such as images.

You will learn to

  • Solve a regression problem
    This tutorial will show you how to build a model that will solve a regression problem. That is a problem where you want to predict a quantity, in this case, the price of a house.
  • Use multiple datasets
    To solve the regression problem, we will build a model that predicts the price of a house from multiple sets of input data, both tabular data and images. Combining tabular and image inputs in a single model is exactly the kind of problem deep learning is well suited for.
  • Use snippets
  • Run multiple experiments and compare them

The data

For this tutorial, we have created the Calihouse dataset. You can read more about it here. This dataset consists of map images of each house location from OpenStreetMap and tabular demographic data collected from the 1990 California census.

Each sample in the dataset gives the following information about one block of houses:

  • Housing median age
  • Total number of rooms
  • Total number of bedrooms
  • Population
  • Number of households
  • Median income
  • Median house value

We wish to make an AI model that learns to predict the price of a house, here called median house value, given the other available data (i.e., median house age, population, etc.). Hence, median house value is our output feature, while the others are our input features. It wouldn’t be a useful system if we had to input median house value to get median house value in the output, right?

Create a project for the Predict California house prices tutorial

Start by creating a project on the Projects view by clicking on New project. A project combines all of the steps in solving a problem, from pre-processing of datasets to model building, evaluation and deployment. Name the project and add a description in the Create project pop-up and click Submit.

Add the Calihouse dataset to the platform

Now you can add the Calihouse dataset to the platform.

  1. Download the Calihouse dataset here.
  2. Navigate to the Datasets view and click New dataset.
  3. Locate the dataset file on your computer and drag and drop it to the Upload files tab.
  4. Click Next.
  5. Name the dataset House pricing and click Done.

Datasets view

Feature distributions

Above each feature, you'll see a graph of that feature's distribution. In our case, all the variables are natural measurements, so most of them have roughly bell-shaped distributions.


AI learns from data we supply. But how can we be sure that it will work with other data – data that it hasn’t seen?

The answer to that is validation. Instead of using all of the data available for training the system, we leave some aside to test the system later. This validation subset lets us measure how well the system generalizes, i.e., how well it works on data it hasn't been trained on. As you can see in the Inspector on the right, the dataset is by default split into two subsets: 80% is included in the training subset, and 20% is put aside for a validation subset. Leave it at the default value, but you can change this later if you want to.
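The platform handles this split for you, but conceptually it amounts to something like the following sketch (the function name and the 80/20 default are illustrative, not the platform's API):

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle the rows and set a fraction aside for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

rows = list(range(100))  # stand-in for 100 dataset rows
train, validation = train_validation_split(rows)
print(len(train), len(validation))  # 80 20
```

The seed makes the shuffle reproducible, so the same rows land in the validation subset every time you rerun the split.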

Normalize image input data

Change the preprocessing of the feature image_path from None to Standardization. You normalize a dataset to make it easier and faster to train a model.

Standardization converts a set of raw input data to have zero mean and unit standard deviation. Values above the feature's mean will get positive scores, and those below the mean will get negative scores. The reason we normalize or scale input data is that neural networks train better when the inputs fall roughly in the interval -1 to 1.
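In code, standardization is just subtracting the mean and dividing by the standard deviation, as this small NumPy sketch with made-up income values shows:

```python
import numpy as np

def standardize(x):
    """Scale values to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

incomes = np.array([2.5, 3.0, 4.5, 8.0, 1.8])  # made-up medianIncome values
z = standardize(incomes)
print(round(float(z.mean()), 6), round(float(z.std()), 6))  # ~0.0 and 1.0
```

Note that the mean and standard deviation are computed from the training data; the same values are then used to scale any new data fed to the model.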

Create a feature set

A feature set is two or more features that you want to treat in the same way during modeling. Create a feature set for the tabular data on the houses, for example, number of bedrooms and median income. Click on New feature set, name the feature set tabular and select the information on the houses:

  • housingMedianAge (1)
  • totalRooms (1)
  • totalBedrooms (1)
  • population (1)
  • households (1)
  • medianIncome (1)

Click Create.

The new feature set will be displayed above the columns. Click on it to view the features that are included.

Datasets view - Feature columns and tabular feature set

Save the dataset

Click on Save version on the upper right corner of the Datasets view.

Design a model for image data

Now that we have the data, let’s create the AI model. We’ll start with just trying to predict the prices from the map images. It most likely won’t give us good predictions, but let’s try it anyway just to get a baseline.

On the Peltarion Platform, an experiment is the basic unit you’ll be working with. It’s the basic hypothesis that you want to try, i.e., “I think I might get good accuracy if I train this model, on this data, in this way.”

An experiment contains all the information needed to reproduce the experiment:

  • The dataset
  • The AI model
  • The settings or parameters used to run the experiment

The result is a trained AI model that can be evaluated and deployed.

In the Modeling view, click New experiment. Name the model and click Create.

In the Inspector, click on the Settings tab and make sure that the House pricing dataset is selected in the Dataset field found in the Dataset settings section.

Create the model

Ok, what do you need? For the image input data, you'll use a convolutional neural network, a CNN, which is a neural net that is often used when the input is images. This network looks for low-level features such as edges and curves and then builds up to more abstract concepts through a series of convolutional layers. For the tabular demographic data, you'll use a single dense layer, which will probably be enough.
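To make "looking for edges" concrete, here is a toy 2D convolution, the core operation of a CNN layer, applied with a vertical-edge kernel (plain NumPy, not the platform's CNN implementation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A 4x4 image that is dark on the left and bright on the right.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
# Sobel-like kernel that responds strongly to vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
print(conv2d(image, kernel))  # large values where the edge is
```

A CNN learns many such kernels from the data rather than using hand-crafted ones, and stacks layers so that later kernels respond to combinations of the edges and curves found by earlier ones.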

Create the input part for the image data:

  1. In the Inspector, click on the Build tab.
  2. Click on Snippets to expand the section and click on CNN. This will add a complete CNN snippet net to the Modeling canvas.

    After this step, the Information pop-up should have appeared showing you what needs to be adjusted before you can run the model. Let's do just that.
  3. Select the Input block in the Modeling canvas. In the Inspector set Feature to image_path. These are the map images.

    Make sure that Image augmentation is set to None.
  4. Next, select the last Dense block and set Nodes to 1 (we only want one prediction) and Activation to Linear.
  5. Finally, select the Target block and set Feature to medianHouseValue and Loss to Mean Squared Error.

You will end up with this experiment:

The model

Run model

Finally, it’s time to train the model and see if we've come up with a good model.

Navigate to the Settings tab in the Inspector. In the Run settings section keep the default values, but for your info:

  • Batch size is the number of rows (examples) that are processed at the same time.
  • One epoch is one complete pass of the dataset through the model. That means that if you set Epochs to 10, the complete dataset runs through the model 10 times.
  • Data access seed is a random seed that makes the shuffling of the data reproducible.
  • The Optimizer is the algorithm that adjusts the weights of the network to minimize the loss.
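The terms above map onto a training loop roughly like this schematic sketch (the commented-out optimizer step stands in for the actual weight update):

```python
import random

def train(rows, epochs=10, batch_size=32, seed=0):
    """One epoch = one full pass over the dataset, taken in batches."""
    rng = random.Random(seed)  # the "data access seed"
    steps = 0
    for epoch in range(epochs):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # optimizer_step(model, batch)  # update weights to reduce the loss
            steps += 1
    return steps

print(train(list(range(100)), epochs=10, batch_size=32))  # 40 steps: 4 batches x 10 epochs
```

So with 100 rows and a batch size of 32, each epoch takes 4 optimizer steps (the last batch holds the remaining 4 rows).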

Done! Click Run in the top right corner.

Analyze experiment

The Evaluation view shows in several ways how the training of the model has progressed.

There are a number of ways to visualize how your experiment is performing. 

The Inspector now shows Experiment info where you can, among many other things, see the loss of your experiment.

Loss graph 

The lower the loss, the better the model (unless the model has over-fitted to the training data). The loss is calculated on both the training and the validation subset, indicating how well the model performs on each. It is an aggregate of the errors made on every example in the respective subset.
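Mean squared error, the loss used in this tutorial, is simply the average of the squared differences between predicted and actual values:

```python
def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Made-up house values (in units of $100k) and model predictions.
print(mean_squared_error([2.1, 3.0, 1.4], [2.0, 3.5, 1.0]))  # ~0.14
```

Squaring makes all errors positive and penalizes large mistakes disproportionately, which is why a few badly mispriced houses can dominate the loss.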

Loss graph

Prediction scatterplot

In a perfect scatterplot, every point would lie on the diagonal going from bottom left to top right, i.e., every prediction would equal the actual value.

Prediction scatterplot

Error distribution graph

The Error distribution graph shows the number of examples as a function of the error value, so you can see how the errors are distributed. Naturally, we want the curve to be as narrow as possible and centered on zero: small errors!

Error distribution graph

Improve experiment

Ok, so after a minute of training we can confirm our suspicion that just using images is probably not the way to go. Now it's time to find out if you can improve the experiment.

As long as you keep the same loss function, you can compare the results of the experiments and see which one is the best in the Evaluation view.

Extend experiment with a second input for tabular data

Another way to improve an experiment is to add another dataset and see if that will improve the experiment's predictions. So we want to connect two input nets and see if they can work together. To make it really easy for you, we've created a snippet with two input nets, one CNN and one fully connected (FC).
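Joining the two input nets boils down to concatenating the feature vectors each branch produces and feeding the joint vector to the final dense layer. Here is a NumPy sketch with made-up branch sizes (64 and 16 are illustrative, not the snippet's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend outputs of the two branches for a single example:
cnn_features = rng.standard_normal(64)      # from the image (CNN) branch
tabular_features = rng.standard_normal(16)  # from the tabular (FC) branch

# Concatenate, then apply a final dense layer with a single node
# and linear activation, giving one predicted house value.
combined = np.concatenate([cnn_features, tabular_features])
weights = rng.standard_normal(combined.shape[0])
bias = 0.0
prediction = float(combined @ weights + bias)

print(combined.shape)  # (80,)
```

During training, the gradients flow back through the concatenation into both branches, so the CNN and the FC net learn jointly how to contribute to the price estimate.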

  1. Create a new experiment and add a CNN + FC to the Modeling canvas. Make sure that the House pricing dataset is selected in the field Dataset found in the Dataset settings section.
  2. Select the Input block for the FC net and set Feature to tabular; this is the feature set you created in the Datasets view.
  3. Select the Input block for the CNN net and set Feature to image_path.
  4. Select the Target block and set Feature to medianHouseValue and Loss to Mean squared error.

You'll end up with this experiment:

The extended model

Run experiment and evaluate

Run the experiment again and see if we've come up with a better experiment. As stated before, if you keep the same loss function you can compare subsequent experiments.

Did the second input help? As you can see after only a minute or two, the results are drastically improved; it's actually learning something. If you let it run for a while, you'll get a decent predictive tool.

Deploy trained experiment

While our model may be great, it is little more than an academic exercise as long as it is locked up inside the platform. If we want people to be able to use the model, we have to get it out in some usable form. Check out the tutorial, Deploy an operational AI model, to learn how to put things into production and make AI models operational.

Tutorial recap

Congratulations, you've completed the California house pricing tutorial. In this tutorial you've learned how to solve a regression problem, first by using a CNN snippet and then by extending the experiment using multiple datasets. Then you analyzed the experiments to find out which one was the best. Good job!
