Estimated time: 30 min

Predict California house prices

How to solve a regression problem using tabular data and images.

In this tutorial, you will build an experiment and train an AI model on real data, both numbers and images, and work to make it reliable for house price prediction.

If you deploy the final trained AI model in real life, someone could enter their house's location, size, etc., via an online portal and get a valuation. Nice!

Target audience: Data scientists and developers

Preread: Before following this tutorial, it is strongly recommended that you complete the Deploy an operational AI model tutorial if you have not done so already.

The problem

According to Real Capital Analytics (RCA), global volumes for completed sales of commercial properties totaled $873 billion in 2017. Most of that is backed by bank mortgages. As was shown by the 2008 financial crisis, it is essential for the economy that the amount of a bank's real estate lending matches the actual value of the real estate. Getting a good estimate of the price of a house is hard even for the most seasoned real estate agents.

There is an obvious benefit to building a data-driven decision support tool for both banks and real estate agents, and such tools have been around for decades. They have typically used historical sales data to track prices in individual neighborhoods and from that get average prices. With the advent of deep learning it is now possible to get a much more sophisticated valuation as we can now use other data types such as images.

You will learn to

  • Solve a regression problem
    This tutorial will show you how to build a model that will solve a regression problem. That is a problem where you want to predict a quantity, in this case, the price of a house.
  • Use multiple datasets
    To solve the regression problem, we will build a model that predicts the price of a house from multiple sets of input data, both tabular data and images. Combining tabular and image inputs in a single model is hard to do with anything other than deep learning.
  • Use snippets
  • Run multiple experiments and compare them

The data

For this tutorial, we have created the Calihouse dataset. You can read more about it here. This dataset consists of map images of each house location from OpenStreetMap and tabular demographic data collected in the 1990 California census.

Each sample in the dataset gives the following information about one block of houses:

  • Housing median age
  • Total number of rooms
  • Total number of bedrooms
  • Population
  • Number of households
  • Median income
  • Median house value

We wish to make an AI model that learns to predict the price of a house, here called median house value, given the other available data (i.e., median house age, population, etc.). Hence, median house value is our output feature, while the others are our input features. It wouldn’t be a useful system if we had to input median house value to get median house value in the output, right?
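
If you think in code, the split between input and output features amounts to the following. This is purely illustrative; the platform does it for you when you pick a target feature, and the rows below are just example values:

    import pandas as pd

    # Two example rows of the tabular part of Calihouse (illustrative values only)
    df = pd.DataFrame({
        "housingMedianAge": [41, 21], "totalRooms": [880, 7099],
        "totalBedrooms": [129, 1106], "population": [322, 2401],
        "households": [126, 1138], "medianIncome": [8.3252, 8.3014],
        "medianHouseValue": [452600, 358500],
    })

    y = df["medianHouseValue"]                 # output feature: what we want to predict
    X = df.drop(columns=["medianHouseValue"])  # input features: what we predict from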

Create a project for the Predict California house prices tutorial

Start by creating a project on the Projects view by clicking on New project. A project combines all of the steps in solving a problem, from pre-processing of datasets to model building, evaluation and deployment. Name the project and add a description in the Create project pop-up and click Submit.

Add the Calihouse dataset to the platform

After creating the project, you will be taken to the Datasets view, where you can import data.

Click the Data library button and look for the Cali House - tutorial data dataset in the list. Click on it to get more information.

If you agree with the license, click Accept and import. This will import the dataset into your project, and you will be taken to the dataset's details, where you can edit features and subsets.

House pricing dataset

Feature distributions

Above each feature, you'll see a plot of that feature's distribution. In our case, all variables are natural measurements, so they tend to have something close to a normal distribution.

Subsets

AI learns from data we supply. But how can we be sure that it will work with other data – data that it hasn’t seen?

The answer to that is validation. Instead of using all of the data available for training the system, we leave some aside to test the system later. This validation subset makes sure that we know how well the system is capable of generalization, i.e., how well it works on data it hasn’t been trained on. As you can see in the Inspector, the dataset is by default split into two subsets, 80% is included in the training subset, and 20% is put aside for a validation subset. Leave it at the default value, but you can change this later if you want to.
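
As a rough illustration of what the platform's default 80/20 split does behind the scenes, here is a small sketch using scikit-learn on dummy data (the platform handles this for you):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 6)  # 1,000 samples, 6 tabular input features
    y = np.random.rand(1000)     # target values (median house value)

    # 80% for training, 20% held out for validation, matching the platform default
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42)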

Normalize image input data

Locate the feature image_path and click the spanner. Change the Normalization from None to Standardization. You normalize a dataset to make it easier and faster to train a model.

Standardization converts a set of raw input data to have a zero mean and unit standard deviation. Values above the feature's mean value will get positive scores, and those below the mean will get negative scores. The reason we normalize or scale input data is that neural networks train better when the data lies roughly in the interval -1 to 1.
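
In code, standardizing a single feature looks like this (just an illustration; the platform applies the transformation for you once you select Standardization):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])  # raw values of one feature
    x_std = (x - x.mean()) / x.std()    # standardized: zero mean, unit standard deviation
    print(x_std)                        # values below the mean come out negative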

Create a feature set

Create a feature set for the second experiment. A feature set is two or more features that you want to treat in the same way during modeling.
This feature set consists of the tabular data on the houses, for example, number of bedrooms and median income. Click on New feature set, name the feature set tabular and select the information on the houses:

  • housingMedianAge (1)
  • totalRooms (1)
  • totalBedrooms (1)
  • population (1)
  • households (1)
  • medianIncome (1)

Click Create.

The new feature set will be displayed above the columns. Click on it to view the features that are included.

Datasets view - Feature columns and tabular feature set

Save the dataset

Click Save version in the upper right corner of the Datasets view.

Then click Use in new model and the Experiment wizard will pop up.

Design a model for image data

Now that we have the data, let’s create the AI model. We’ll start by just trying to predict the prices from the map images. It most likely won’t give us good predictions, but let’s try it anyway just to get a baseline.

On the Peltarion Platform, an experiment is the basic unit you’ll be working with. It’s the basic hypothesis that you want to try, i.e., “I think I might get good accuracy if I train this model, on this data, in this way.”

An experiment contains all the information needed to reproduce the experiment:

  • The dataset
  • The AI model
  • The settings or parameters used to run the experiment

The result is a trained AI model that can be evaluated and deployed.

Create experiment

You should be in the Experiment wizard. If not (you never know when things don't go as planned), navigate to the Modeling view and click New experiment.

Define dataset tab

Make sure that the House pricing dataset is selected in the Experiment wizard.

Choose snippet tab

Click on the Choose snippet tab.

Ok, what do you need? For the image input data, select EfficientNet B0. EfficientNets are a family of neural network architectures released by Google in 2019, designed by an optimization procedure that maximizes accuracy for a given computational cost.

Make sure the Input feature is image_path and the Target feature is medianHouseValue.

Initialize weights

Keep all default settings in the Initialize weights tab. The nice part about pretrained snippets, such as EfficientNet B0, is that they have already learned useful representations from the dataset they were trained on. This stored knowledge can be reused in new experiments.

Click Create. This will add a complete EfficientNet B0 snippet to the Modeling canvas.

Modeling canvas

The Experiment wizard has pre-populated all the settings needed:

  • The Loss in the Target block is set to Mean Squared Error (MSE). MSE is often used when doing regression, when the target, conditioned on the input, is normally distributed.
  • The last Dense block has Nodes set to 1 because we want only one predicted value.

Settings tab in the Inspector

Navigate to the Settings tab in the Inspector and change the Learning rate to 0.0005.
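
If it helps to see the experiment expressed in code, here is a rough Keras equivalent of this setup. The platform builds and trains all of this for you; the input size and dropout rate below are assumptions, not settings taken from the snippet:

    import tensorflow as tf

    # Image input -> pretrained EfficientNet B0 backbone -> one-node regression head
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", pooling="avg")

    image_in = tf.keras.Input(shape=(224, 224, 3), name="image_path")
    x = tf.keras.layers.Dropout(0.2)(backbone(image_in))
    price = tf.keras.layers.Dense(1, name="medianHouseValue")(x)  # one node: one predicted value

    model = tf.keras.Model(image_in, price)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),  # the learning rate set above
                  loss="mse")                                                # Mean Squared Error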

Run model

Now, it’s time to train the model and see if we've come up with a good model.

Done! Click Run in the top right corner.

Analyze experiment

The Evaluation view shows in several ways how the training of the model has progressed.

There are a number of ways to visualize how your experiment is performing. 

The Inspector now shows Experiment info where you can, among many other things, see the loss of your experiment.

Loss graph 

The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on both the training and the validation subset, and it indicates how well the model is doing on each of these two sets. It aggregates the errors made on every example in the training or validation set.
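
For a regression experiment like this one, the loss in the graph is the Mean Squared Error. If it helps to see it spelled out, this is how such a value is computed:

    import numpy as np

    def mse(y_true, y_pred):
        # Mean Squared Error: the loss plotted in the Loss graph
        return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

    print(mse([3.0, 2.5, 4.0], [2.8, 2.9, 4.1]))  # small errors give a small loss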

Loss graph

Prediction scatterplot

In a perfect prediction scatterplot, all points lie on the diagonal going from the bottom left to the top right, that is, every predicted value matches the actual value.

Prediction scatterplot

Error distribution graph

The Error distribution graph shows the number of instances as a function of the error value, so you can see how the errors are distributed. Naturally, we want the curve to be as narrow as possible and centered on zero, that is, no errors!
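
The same kind of plot is easy to reproduce outside the platform. The sketch below uses simulated errors purely for illustration; on the platform the graph is computed from your validation predictions:

    import numpy as np
    import matplotlib.pyplot as plt

    errors = np.random.normal(loc=0.0, scale=1.0, size=1000)  # simulated prediction errors

    plt.hist(errors, bins=50)  # a narrow peak centered on 0 means small errors
    plt.xlabel("prediction error")
    plt.ylabel("number of instances")
    plt.show()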

Error distribution graph

Improve experiment

Now it's time to find out if you can improve the experiment.

As long as you keep the same loss function, you can compare the results of the experiments and see which one is the best in the Evaluation view.

Extend experiment with a second input for tabular data

One way to improve this experiment is to add a second input for tabular data and see if that will improve the experiment's predictions. So we want to combine two nets with different inputs and see if they can work together (a rough code sketch of the resulting architecture follows the steps below).

  1. Click on the 3 dots next to your first experiment and select Duplicate. Do not copy the weights.
  2. Remove the Dense block and the Target block.
  3. Expand the EfficientNet B0 block and then expand the Top block.
    You'll see a red Dropout block at the bottom of the snippet.
  4. Open the Blocks section in the Build tab in the Inspector and add a Concatenate block with 2 inputs.
  5. Connect the Dropout block with the Concatenate block.
  6. After the Concatenate block add:
    A Dense block with 512 nodes and ReLU activation function.
    A Dense block with 1 node and Linear activation function.
    A Target block with the target feature set to medianHouseValue and the loss set to Mean squared error.
  7. Let's build the network for the tabular data. Add:
    An Input block with input feature tabular. This is the feature set you created in the Datasets view.
    A Dense block with 1000 nodes and ReLU activation function.
    A Batch normalization block.
    A Dense block with 1000 nodes and ReLU activation function.
  8. Connect the last Dense block you added with the Concatenate block.
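
For reference, here is roughly what the combined two-input model corresponds to in Keras code. The platform assembles this for you on the Modeling canvas; the image input size and dropout rate are the same assumptions as in the earlier sketch:

    import tensorflow as tf

    # Image branch: pretrained EfficientNet B0 ending in the Dropout block
    image_in = tf.keras.Input(shape=(224, 224, 3), name="image_path")
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", pooling="avg")
    x_img = tf.keras.layers.Dropout(0.2)(backbone(image_in))

    # Tabular branch: the "tabular" feature set (6 input features)
    tab_in = tf.keras.Input(shape=(6,), name="tabular")
    x_tab = tf.keras.layers.Dense(1000, activation="relu")(tab_in)
    x_tab = tf.keras.layers.BatchNormalization()(x_tab)
    x_tab = tf.keras.layers.Dense(1000, activation="relu")(x_tab)

    # Concatenate the two branches and predict a single value
    x = tf.keras.layers.Concatenate()([x_img, x_tab])
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    price = tf.keras.layers.Dense(1, activation="linear", name="medianHouseValue")(x)

    model = tf.keras.Model([image_in, tab_in], price)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss="mse")
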
Run experiment and evaluate

Run the experiment again and see if we've come up with a better experiment. As stated before, if you keep the same loss function you can compare subsequent experiments.

Did the second input help? As you can see after a few epochs, the results are drastically improved; the model is actually learning something. If you let it run for a while, you'll get a decent predictive tool.

Deploy trained experiment

While our model may be great, it is little more than an academic exercise as long as it is locked up inside the platform. If we want people to be able to use the model, we have to get it out in some usable form. Check out the tutorial, Deploy an operational AI model, to learn how to put things into production and make AI models operational.

Tutorial recap

Congratulations, you've completed the California house pricing tutorial. In this tutorial you've learned how to solve a regression problem, first by using a CNN snippet and then by extending the experiment using multiple datasets. Then you analyzed the experiments to find out which one was the best. Good job!