Target audience: Data scientists and developers.
Preread: Before following this tutorial, we strongly recommend that you complete the previous tutorial if you have not already done so.
According to Real Capital Analytics (RCA), global volumes for completed sales of commercial properties totaled $873 billion in 2017. Most of that is backed by bank mortgages. As the 2008 financial crisis showed, it is essential for the economy that the amount of a bank's real estate lending matches the actual value of the real estate. Getting a good estimate of the price of a house is hard even for the most seasoned real estate agents.
There is an obvious benefit in building a data-driven decision support tool for both banks and real estate agents, and such tools have been around for decades. They have typically used historical sales data to track prices in individual neighborhoods and derive average prices from that. With the advent of deep learning, it is now possible to get a much more sophisticated valuation, because we can now use other data types, such as images.
For this tutorial, we have created the Calihouse dataset. You can read more about it here. The dataset consists of map images of the house locations from OpenStreetMap and tabular demographic data collected from the 1990 California census.
Each sample in the dataset gives the following information about one block of houses:
We wish to build an AI model that learns to predict the price of a house, here called Median house value, given the other available data (e.g., median house age, population, etc.). Hence, Median house value is our output feature, while the others are our input features. It wouldn't be a useful system if we had to input Median house value just to get Median house value back out, right?
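If you're curious what this split between input and output features looks like in plain code, here is a minimal pandas sketch. The file name and column names are illustrative assumptions, not part of the platform:

```python
# Minimal sketch of separating input and output features with pandas.
# The file path and column names are illustrative; adjust to the actual Calihouse schema.
import pandas as pd

df = pd.read_csv("calihouse.csv")                       # hypothetical path to the tabular data
target = df["median_house_value"]                       # output feature: what we want to predict
inputs = df.drop(columns=["median_house_value"])        # everything else is an input feature
print(inputs.columns.tolist())
```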
Start by creating a project on the Projects view by clicking on New project. A project combines all of the steps in solving a problem, from pre-processing of datasets to model building, evaluation and deployment. Name the project and add a description in the Create project pop-up and click Submit.
Now you can add the Calihouse dataset to the platform.
Above each feature, you'll see a graph showing the distribution of that feature. In our case, all variables are natural measurements, so they are likely to follow something close to a normal distribution.
An AI model learns from the data we supply it. But how can we be sure that it will work with other data, data that it hasn't seen?
The answer to that is validation. Instead of using all of the available data for training the system, we set some aside and later test the system with it. This validation subset tells us how well the system generalizes, i.e., how well it works on data it hasn't been trained on. As you can see in the Inspector on the right, the dataset is by default split into two subsets: 80% goes into the training subset, and 20% is set aside as a validation subset. Leave it at the default value; you can change this later if you want to.
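The platform handles this split for you, but conceptually it corresponds to something like this scikit-learn sketch (assuming the tabular data is loaded into a pandas DataFrame `df`, as in the earlier sketch):

```python
# Minimal sketch of an 80/20 train/validation split with scikit-learn.
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), "training samples,", len(val_df), "validation samples")
```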
Change the preprocessing of the feature image_path from No preprocessing to Standardization. You normalize a dataset to make the model easier and faster to train.
Standardization rescales a set of raw input values to have zero mean and unit standard deviation. Values above the feature's mean get positive scores, and values below the mean get negative scores. We normalize or scale input data because neural networks train better when the data falls roughly within the interval -1 to 1.
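In code, standardization boils down to z = (x - mean) / std, computed per feature. A minimal numpy sketch with made-up values:

```python
# Standardization: subtract the mean and divide by the standard deviation.
import numpy as np

x = np.array([12.0, 30.0, 45.0, 18.0, 25.0])    # illustrative raw feature values
z = (x - x.mean()) / x.std()

print(round(z.mean(), 6), round(z.std(), 6))     # ~0.0 and ~1.0 after standardization
```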
Time to define the input and output. We do this by creating feature sets in the Inspector; a feature set is simply a group of features bundled together. Since we wish to build an AI system that learns to predict the price of a house given the other available data (i.e., housing median age, population, map image, etc.), our output feature set includes only Median house value. Our first input feature set will include the map images, and our second input feature set will include all other tabular input features.
Now that we have the data, let’s create the AI model. We’ll start with just trying to predict the prices from the map images. It most likely won’t give us good predictions, but let’s try it anyway just to get a baseline.
On the Peltarion Platform, an experiment is the basic unit you'll be working with. It's the basic hypothesis that you want to try, i.e., “I think I might get good accuracy if I train this model, on this data, in this way”.
An experiment contains all the information needed to reproduce the experiment:
The result is a trained AI model that can be evaluated and deployed.
In the Modeling view, click New experiment. Name the model and click Create.
In the Inspector, click on the Settings tab and make sure that the House pricing dataset is selected in the Dataset wrapper field found in the Dataset settings section.
Ok, what do you need? For the image input data, you'll use a convolutional neural network (CNN), a type of neural net that is often used when the input is images. It looks for low-level features such as edges and curves, and then builds up to more abstract concepts through a series of convolutional layers. For the tabular demographic data, you'll use a single dense layer; that layer will probably be enough.
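On the platform you build this graphically, but as a rough analogy, an image-only price regressor could look like the Keras sketch below. The image size and layer sizes are assumptions, not the exact snippet the platform uses:

```python
# Rough Keras sketch of a small CNN that predicts a price from a map image.
# The 32x32 input size and layer sizes are assumptions for illustration only.
from tensorflow import keras
from tensorflow.keras import layers

image_input = keras.Input(shape=(32, 32, 3), name="map_image")
x = layers.Conv2D(16, 3, activation="relu")(image_input)   # low-level features: edges, curves
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)              # more abstract features
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
price = layers.Dense(1)(x)                                   # single output: the predicted price

model = keras.Model(image_input, price)
model.compile(optimizer="adam", loss="mse")
```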
Create the input part for the image data
You will end up with this experiment:
Finally, time to train the model and see if it performs well.
Navigate to the Settings tab in the Inspector. In the Run settings section keep the default values, but for your info:
Done! Click Run in the top right corner.
The Evaluation view shows in several ways how the training of the model has progressed.
There are a number of ways to visualize how your experiment is performing.
The Inspector now shows Experiment info where you can, among many other things, see the loss of your experiment.
Loss graph. The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on both the training and validation subsets, and it indicates how well the model is doing on each. It is the sum of the errors made for each example in the training or validation set.
Prediction scatterplot. In a perfect prediction scatterplot, all points lie on the diagonal going from bottom left to top right.
The Error distribution shows the number of instances as a function of the error value, i.e., how the errors are distributed. Naturally, we want the curve to be as narrow as possible: no errors!
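If you want to recreate these views offline, a minimal sketch could look like this, assuming a mean squared error loss (a common default for regression) and arrays of true and predicted house values:

```python
# Minimal sketch of the loss, prediction scatterplot, and error distribution.
import numpy as np
import matplotlib.pyplot as plt

y_true = np.array([120000.0, 250000.0, 310000.0, 98000.0])   # illustrative true values
y_pred = np.array([135000.0, 240000.0, 295000.0, 110000.0])  # illustrative predictions

errors = y_pred - y_true
print("MSE loss:", np.mean(errors ** 2))       # lower is better

plt.scatter(y_true, y_pred)                    # perfect predictions lie on the diagonal
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.show()

plt.hist(errors, bins=20)                      # error distribution: the narrower, the better
plt.xlabel("Error")
plt.ylabel("Number of instances")
plt.show()
```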
Ok, so after a minute of training we can confirm our suspicion that just using images is probably not the way to go. Now it's time to find out if you can improve the experiment.
As long as you keep the same loss function, you can compare the results of the experiments in the Evaluation view and see which one is the best.
Another way to improve an experiment is to add another dataset and see if that improves the experiment's predictions. So we want to connect two input nets and see if they can work together. To make it really easy for you, we've created a snippet with two input nets, one CNN and one fully connected (FC). A rough code analogy of this two-input setup is sketched below.
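As a code analogy, joining a CNN branch for the images with a fully connected branch for the tabular features could look like the Keras sketch below. The shapes, layer sizes, and the number of tabular features are assumptions, not the platform's exact snippet:

```python
# Rough sketch of a two-input model: a CNN branch for map images and a
# fully connected (FC) branch for tabular data, joined by concatenation.
from tensorflow import keras
from tensorflow.keras import layers

image_input = keras.Input(shape=(32, 32, 3), name="map_image")      # assumed image size
x = layers.Conv2D(16, 3, activation="relu")(image_input)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

tabular_input = keras.Input(shape=(8,), name="tabular")              # 8 demographic features assumed
t = layers.Dense(32, activation="relu")(tabular_input)

merged = layers.concatenate([x, t])                                   # the two branches meet here
price = layers.Dense(1)(merged)                                       # predicted median house value

model = keras.Model([image_input, tabular_input], price)
model.compile(optimizer="adam", loss="mse")
```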
You'll end up with this experiment:
Run the experiment again and see if we've come up with a better experiment. As stated before, if you keep the same loss function you can compare subsequent experiments.
Did the second input help? As you can see after only a minute or two, the results are drastically improved; the model is actually learning something. If you let it run for a while, you'll get a decent predictive tool.
While our model may be great, it is little more than an academic exercise as long as it is locked up inside the Platform. If we want people to be able to use the model, we have to get it out in some usable form. Check out the tutorial Deploy an operational AI model to learn how to put things into production and make AI models operational.
Congratulations, you've completed the California house pricing tutorial. In this tutorial you've learned how to solve a regression problem, first by using a CNN snippet and then by extending the experiment using multiple datasets. Then you analyzed the experiments to find out which one was the best. Good job!