Tutorial - Predict California house prices

How to solve a regression problem using tabular data and images

In this tutorial, you will build an experiment and train an AI model on real data - both numbers and images - and try to make its house price predictions reliable.

If you deploy the final trained AI model in real life, someone could enter the location, size of their house, and so on via an online portal and get a valuation. Nice!

Target audience: Data scientists and developers.

Preread: Before following this tutorial, it is strongly recommended that you complete the previous tutorial if you have not done that already.

The problem

According to Real Capital Analytics (RCA), global volumes for completed sales of commercial properties totaled $873 billion in 2017. Most of that is backed by bank mortgages. As the 2008 financial crisis showed, it is essential for the economy that the amount of a bank's real estate lending matches the actual value of the real estate. Getting a good estimate of the price of a house is hard even for the most seasoned real estate agents.

There is an obvious benefit to building a data-driven decision support tool for both banks and real estate agents, and such tools have been around for decades. They have typically used historical sales data to track prices in individual neighborhoods and derive average prices from them. With the advent of deep learning, it is now possible to get a much more sophisticated valuation, as we can now use other data types - such as images.

You will learn

  • Solve a regression problem

    This tutorial will show you how to build a model that will solve a regression problem. That is a problem where you want to predict a quantity, in this case, the price of a house.
  • Use multiple datasets

    To solve the regression problem, we will build a model that will help us predict the price of a house given multiple sets of input data, both tabular data and images. Combining tabular and image inputs in a single model is the kind of problem that deep learning is particularly well suited for.
  • Use snippets
  • Run multiple experiments and compare them

The data

For this tutorial, we have created the Calihouse dataset. You can read more about it here. This dataset consists of map images of the house location from OpenStreetMap and tabular demographic data collected from the 1990 California census.

Each sample in the dataset gives the following information about one block of houses:

  • Housing median age
  • Total number of rooms
  • Total number of bedrooms
  • Population
  • Number of households
  • Median income
  • Median house value

We wish to make an AI model that learns to predict the price of a house, here called median house value, given the other available data (i.e., housing median age, population, etc.). Hence, median house value is our output feature, while the others are our input features. It wouldn't be a useful system if we had to input the median house value to get the median house value in the output, right?
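If you prefer to think of this split in code, here is a minimal pandas sketch of the same idea (the file name is hypothetical, and the feature names are the ones used later in this tutorial; on the platform you do this with feature sets instead):

```python
import pandas as pd

# Hypothetical file name for the tabular part of the Calihouse data
df = pd.read_csv("cali_house.csv")

# Input features: everything we know about a block of houses
input_features = ["HousingMedianAge", "totalRooms", "totalBedrooms",
                  "population", "households", "medianIncome"]
X = df[input_features]

# Output feature: the quantity we want the model to predict
y = df["medianHouseValue"]
```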

Create a project for the Predict California house prices tutorial

Start by creating a project on the Projects view by clicking on New project. A project combines all of the steps in solving a problem, from pre-processing of datasets to model building, evaluation and deployment. Name the project and add a description in the Create project pop-up and click Submit.

Add the Calihouse dataset to the platform

Now you can add the Calihouse dataset to the platform:

  1. Download the Calihouse dataset here.
  2. Navigate to the Datasets view and click New dataset.
  3. Locate the cali_house_v2.zip on your computer and drag and drop it to the Upload files tab. Click Next.
  4. Name the dataset House pricing and click Done.
The Dataset view

Feature distributions

Above each feature, you'll see a histogram showing how that feature's values are distributed. In our case, all the variables are natural measurements, so you can expect most of them to have something similar to a normal distribution.
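If you want to reproduce these plots outside the platform, a quick pandas sketch could look like this (again assuming a hypothetical CSV file with the tabular features):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cali_house.csv")   # hypothetical file name
df.hist(bins=50, figsize=(12, 8))    # one histogram per numeric feature
plt.tight_layout()
plt.show()
```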

Subsets

An AI model learns from the data we supply it. But how can we be sure that it will work with other data – data that it hasn't seen?

The answer to that is validation. Instead of using all of the available data for training the system, we set some aside to test the system with later. This validation subset tells us how well the system generalizes, i.e., how well it works on data it hasn't been trained on. As you can see in the Inspector on the right, the dataset is by default split into two subsets: 80% is included in the training subset, and 20% is put aside for a validation subset. Leave it at the default value; you can change this later if you want to.
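The platform handles this split for you, but as an illustration, an 80/20 split done by hand could look like this (a scikit-learn sketch, reusing the X and y from the earlier snippet):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the examples as a validation subset; the model never trains
# on these, so they tell us how well it generalizes to unseen data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```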

Normalize image input data

Change the preprocessing of the feature image_path from No preprocessing to Standardization. You normalize a dataset to make it easier and faster to train a model.

Standardization converts a set of raw input data to have zero mean and unit standard deviation. Values above the feature's mean get positive scores, and those below the mean get negative scores. The reason we normalize or scale input data is simply that neural networks train better when the data falls roughly in the interval -1 to 1.
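In other words, each value is transformed as (value - mean) / standard deviation. A minimal NumPy sketch of the arithmetic (illustrative only, not what the platform does internally for images):

```python
import numpy as np

def standardize(x):
    """Scale values to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

pixel_values = np.array([0.0, 64.0, 128.0, 255.0])
print(standardize(pixel_values))  # zero mean, unit standard deviation
```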

Create feature sets

Time to define the input and output. We do this by creating feature sets in the Inspector. They are features bundled together. Since we wish to make an AI system that learns to predict the price of a house given the other available data (i.e., housing median age, population, map image, etc.), our output feature set includes only the Median house value. Our first input feature set will include the map images, and our second input feature set will include all other tabular input features.

  1. Create an input feature set for the map images that shows the location of the house.
  2. Create a feature set for the tabular data on the houses, for example, number of bedrooms and median income.

    Click on New feature set again, name the feature set Input - Tabular and select the information on the houses:

    HousingMedianAge (1)
    totalRooms (1)
    totalBedrooms (1)
    population (1)
    households (1)
    medianIncome (1)
  3. Click Create.
  4. Create a target feature set for the predicted house price. Click on New feature set a third time, name the feature set Target - House price and select medianHouseValue (1), then click Create.
  5. Click on Save version on the upper right corner of the Datasets view.
Note!
Do not forget to save this version of the dataset. You won’t be able to use it in the Modeling view otherwise.

Design a model for image data

Now that we have the data, let’s create the AI model. We’ll start with just trying to predict the prices from the map images. It most likely won’t give us good predictions, but let’s try it anyway just to get a baseline.

On the Peltarion Platform, an experiment is the basic unit you'll be working with. It's the basic hypothesis that you want to try, i.e., "I think I might get good accuracy if I train this model, on this data, in this way".

An experiment contains all the information needed to reproduce the experiment:

  • The dataset 
  • The AI model
  • The settings or parameters used to run the experiment.

The result is a trained AI model that can be evaluated and deployed.

In the Modeling view, click New experiment. Name the model and click Create.

In the Inspector, click on the Settings tab and make sure that the House pricing dataset is selected in the Dataset wrapper field found in the Dataset settings section.

Create model

Ok, what do you need? For the image input data, you'll use a convolutional neural network (CNN), a type of neural network often used when the input is images. A CNN looks for low-level features such as edges and curves and builds up to more abstract concepts through a series of convolutional layers. For the tabular demographic data, you'll use a single dense layer - that will probably be enough.
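To make this concrete, here is a rough Keras sketch of an image-only regression model similar to what the CNN snippet builds (the input shape and layer sizes are illustrative, not the snippet's exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolutional layers extract increasingly abstract features from the map
# image; a final single-node dense layer with linear activation outputs the
# predicted price.
image_input = keras.Input(shape=(64, 64, 3), name="map_image")
x = layers.Conv2D(32, 3, activation="relu")(image_input)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
price = layers.Dense(1, activation="linear")(x)

model = keras.Model(image_input, price)
model.compile(optimizer="adam", loss="mse")  # mean squared error, as in the steps below
```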

Create the input part for the image data

  1. In the Inspector, click on the Blocks tab.
  2. Click on Snippets to expand the section and click on CNN. This will add a complete CNN snippet to the Modeling canvas.

    After this step, the Information-center popup should have appeared showing you what needs to be adjusted before you can run the model. Let's do just that.
  3. Select the Input block in the Modeling canvas. In the Inspector set the input Feature set to Input - Images. These are the map images.

    Make sure that Image augmentation is set to None.
  4. Next select the last Dense block and set Nodes to 1 (we only want one prediction) and Activation to Linear.
  5. Finally select the Target block, set the Feature set to Target - House price, and keep Mean squared error as the Loss.

You will end up with this experiment:

The model

Run model

Finally, time to train the model and see if we've come up with something good.

Navigate to the Settings tab in the Inspector. In the Run settings section keep the default values, but for your info:

  • Batch size is how many rows (examples) are computed at the same time.
  • One Epoch is when the complete dataset has run through the model one time. That means that if you set Epochs to 100, the complete dataset will run through the model 100 times (see the short arithmetic sketch below).
  • Data access seed is a seed for the random number generation, so that runs can be reproduced.
  • The Optimizer is the algorithm used to minimize the loss with respect to the weights of the network.
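To illustrate how batch size and epochs relate, here is a tiny arithmetic sketch (the numbers are made up for illustration):

```python
# With 16,512 training examples and a batch size of 32, one epoch means the
# model sees every example once, spread over 16,512 / 32 = 516 weight updates.
n_examples = 16_512   # illustrative training subset size
batch_size = 32
epochs = 100

steps_per_epoch = n_examples // batch_size   # 516 batches per epoch
total_updates = steps_per_epoch * epochs     # 51,600 updates over the whole run
print(steps_per_epoch, total_updates)
```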

Done! Click Run in the top right corner.

Analyze experiment

The Evaluation view shows, in several ways, how the training of the model has progressed and how your experiment is performing.

The Inspector now shows Experiment info, where you can, among many other things, see the loss of your experiment.

Loss graph. The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on both the training and the validation subsets and shows how well the model is doing on each. It aggregates the errors made on every example in the respective subset.

Training overview
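For reference, mean squared error simply averages the squared difference between the predicted and actual prices. A tiny sketch with made-up numbers:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Two made-up predictions, each off by 10,000
print(mean_squared_error([200_000, 150_000], [190_000, 160_000]))  # 100000000.0
```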

Prediction scatterplot. Each point plots a predicted value against the actual value. In a perfect model, all points would lie on the diagonal going from bottom left to top right.

Prediction scatterplot
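Outside the platform, a comparable plot can be drawn like this (the values are made up just to show the mechanics):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up actual and predicted median house values
actual = np.array([150_000, 220_000, 310_000, 180_000, 260_000])
predicted = np.array([160_000, 200_000, 330_000, 170_000, 250_000])

plt.scatter(actual, predicted, s=20)
plt.plot([actual.min(), actual.max()], [actual.min(), actual.max()])  # the ideal diagonal
plt.xlabel("Actual median house value")
plt.ylabel("Predicted median house value")
plt.show()
```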

The Error distribution shows the number of examples as a function of the error value, i.e., how the errors are distributed. Naturally, we want the curve to be as narrow as possible - no errors!

Error distribution graph
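The same idea as a small matplotlib sketch, using randomly generated errors just for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative prediction errors (predicted minus actual); a narrow peak
# around zero means the model is rarely far off.
errors = np.random.normal(loc=0.0, scale=25_000, size=1_000)
plt.hist(errors, bins=50)
plt.xlabel("Prediction error")
plt.ylabel("Number of examples")
plt.show()
```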

Improve experiment

Ok, so after a minute of training we can confirm our suspicion that just using images is probably not the way to go. Now it's time to find out if you can improve the experiment.

As long as you keep the same loss function, you can compare the results of the experiments in the Evaluation view and see which one is the best.

Extend experiment with a second input for tabular data

Another way to improve an experiment is to add another dataset and see if that improves the experiment's predictions. So we want to connect two input nets and see if they can work together. To make it really easy for you, we've created a snippet with two input nets, one CNN and one fully connected (FC).
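As a rough illustration of what such a two-input model looks like, here is a Keras sketch that concatenates a CNN branch for the images with a fully connected branch for the tabular data (shapes and layer sizes are illustrative, not the snippet's exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

# CNN branch for the map images
image_input = keras.Input(shape=(64, 64, 3), name="map_image")
x = layers.Conv2D(32, 3, activation="relu")(image_input)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.Flatten()(x)

# Fully connected (FC) branch for the six tabular features
tabular_input = keras.Input(shape=(6,), name="tabular")
t = layers.Dense(32, activation="relu")(tabular_input)

# Concatenate the two branches and predict a single price
merged = layers.Concatenate()([x, t])
merged = layers.Dense(64, activation="relu")(merged)
price = layers.Dense(1, activation="linear")(merged)

model = keras.Model([image_input, tabular_input], price)
model.compile(optimizer="adam", loss="mse")
```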

  1. Create a new experiment and add a CNN + FC snippet to the Modeling canvas. Make sure that the House pricing dataset is selected in the Dataset wrapper field found in the Dataset settings section.
  2. Select the Input block for the FC net and set the Feature set to Input - Tabular.
  3. Select the Input block for the CNN net and set the Feature set to Input - Images.
  4. Select the Target block and set the Feature set to Target - House price. Set Mean squared error as the Loss.

You'll end up with this experiment:

The extended model

Run experiment and evaluate

Run the experiment and see if we've come up with something better. As stated before, if you keep the same loss function, you can compare subsequent experiments.

Did the second input help? As you can see after only a minute or two, the results are drastically improved - it's actually learning something. If you let it run for a while, you'll get a decent predictive tool.

Deploy trained experiment

While our model may be great, it is little more than an academic exercise as long as it is locked up inside the Platform. If we want people to be able to use the model, we have to get it out in some usable form. Check out the tutorial, Deploy an operational AI model, to learn how to put things into production and make AI models operational.

Tutorial recap

Congratulations, you've completed the California house pricing tutorial. In this tutorial you've learned how to solve a regression problem, first by using a CNN snippet and then by extending the experiment using multiple datasets. Then you analyzed the experiments to find out which one was the best. Good job!
