Bayesian Linear Regression

Introduction

Like Regression, Bayesian Linear Regression analysis predicts a continuous value based on other variables, and it also provides an interpretation of the chosen variables.

Actable AI uses the entire table as the source data and automatically splits the table into three parts:

  • Train data: rows in the table where predictors and target are filled.
  • Prediction data: rows where the target value is missing.
  • Validation data: Actable AI samples a part of the data to verify the reliability of the trained model. This part of the data is also used in the performance tuning stage if performance optimisation is selected.
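The three-way split described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Actable AI's implementation: the function name `split_rows`, its parameters, and the use of `None` to mark a missing target are hypothetical, and only the target column (not the predictors) is checked for missing values.

```python
import random

def split_rows(rows, target, validation_fraction=0.2, seed=0):
    """Split rows into train, validation and prediction sets.

    Rows whose target is None go to the prediction set. The remaining
    rows are shuffled and a fraction of them is held out for validation.
    """
    prediction = [r for r in rows if r[target] is None]
    labelled = [r for r in rows if r[target] is not None]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    n_val = int(round(len(labelled) * validation_fraction))
    validation = labelled[:n_val]
    train = labelled[n_val:]
    return train, validation, prediction

# A reduced version of a rental table: two rows have no target yet.
rows = [
    {"sqft": 4848, "rental_price": 2271},
    {"sqft": 674, "rental_price": 2167},
    {"sqft": 554, "rental_price": 1883},
    {"sqft": 529, "rental_price": 2431},
    {"sqft": 1190, "rental_price": 4123.812},
    {"sqft": 509, "rental_price": None},
    {"sqft": 481, "rental_price": None},
]
train, validation, prediction = split_rows(rows, "rental_price", 0.2)
```

With five labelled rows and a validation fraction of 0.2, one row is held out for validation and four are used for training.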

Parameters

  • Predicted target: Choose one column whose missing values would be predicted.
  • Predictors: Columns that are used to predict the predicted target.
  • Validation percentage: The model is evaluated at the end on the validation dataset. Sliding this value controls the percentage of rows with a non-empty predicted target that are used for validation.
  • Polynomial degree: Calculates exponent and cross-intersection (product) terms of the numeric variables up to this degree; these values are used as additional inputs.
  • Quantile low: Quantiles divide the range of the prediction result into contiguous intervals with equal probabilities. If set, a lower bound with the set confidence is returned.
  • Quantile high: If set, a higher bound with the set confidence is returned.
  • Filters (optional): Set conditions on columns to filter the original dataset. If selected, only a subset of the original data is used in the analytics.
  • Number of trials: Number of trials for hyper-parameter optimisation. Increasing the number of trials usually results in better predictions but a longer training time.
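To make the Polynomial degree parameter more concrete, the sketch below expands a row of numeric variables into exponent and cross-intersection terms. It is a minimal illustration, assuming that the expansion includes every product of up to `degree` variables; the function `polynomial_features` and the "*" naming of terms are hypothetical and may differ from what Actable AI generates.

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree):
    """Return exponent and cross-intersection terms for a row.

    `row` maps a variable name to a number. The result maps a term name
    (variable names joined by "*") to the product of those variables,
    for every combination of 1..degree variables.
    """
    names = sorted(row)
    features = {}
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            value = 1.0
            for name in combo:
                value *= row[name]
            features["*".join(combo)] = value
    return features

expanded = polynomial_features({"number_of_rooms": 3, "sqft": 1190}, degree=2)
```

For two variables and degree 2, this yields five terms: the two original variables, their two squares, and the product `number_of_rooms*sqft`.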

Case Study

Imagine we are a real estate company and would like to forecast rental prices for properties that remain on the market. An example of the dataset could be:

days_on_market  initial_price  location  neighborhood  number_of_bathrooms  number_of_rooms  sqft  rental_price
10              2271           great     south_side    1                    0                4848  2271
1               2167           good      downtown      1                    1                674   2167
19              1883           poor      westbrae      1                    1                554   1883
3               2431           great     south_side    1                    0                529   2431
58              4463           poor      westbrae      2                    3                1190  4123.812

We have now added some new properties and would like to estimate their rental prices given their characteristics. The new properties are:

days_on_market  initial_price  location  neighborhood  number_of_bathrooms  number_of_rooms  sqft  rental_price
18              1725           poor      westbrae      1                    0                509
49              1388           poor      westbrae      1                    0                481
1               4677           good      downtown      2                    3                808
30              1713           poor      westbrae      1                    1                522
10              1903           good      downtown      1                    1                533

We set our parameters as follows:

_images/setup1.png

Review Result

The result view contains a Prediction tab, a Performance tab, a Multivariate tab, a Univariate tab and a Table tab.

Prediction

The Prediction tab shows the prediction result for the rows whose target value is missing. The table has two new columns, <target>_low and <target>_high, which contain the lower and upper bounds of the prediction.

_images/perdiction.png
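One way to picture how the <target>_low and <target>_high columns relate to the Quantile low and Quantile high parameters is the sketch below. It assumes that the prediction for a row follows a normal distribution with a known mean and standard deviation; this is an assumption made for illustration, and the function `prediction_bounds` and the numbers used are hypothetical rather than taken from Actable AI.

```python
from statistics import NormalDist

def prediction_bounds(mean, std, quantile_low=0.05, quantile_high=0.95):
    """Return (low, high) bounds of a normally distributed prediction.

    Each bound is the value below which the prediction falls with the
    given probability (the inverse cumulative distribution function).
    """
    dist = NormalDist(mu=mean, sigma=std)
    return dist.inv_cdf(quantile_low), dist.inv_cdf(quantile_high)

# Hypothetical predicted rental price of 2000 with a standard deviation of 100.
low, high = prediction_bounds(mean=2000.0, std=100.0)
```

Because the normal distribution is symmetric, quantiles of 0.05 and 0.95 give bounds at equal distances below and above the predicted mean.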

Performance

The Performance tab shows the performance of our model with the Root Mean Square Error (RMSE) metric (14.292) and the R-squared (R2) metric (1.0).

  • RMSE: The Root Mean Square Error is the square root of the average of the squared differences between predicted values and observed values.
  • R2: R-squared is the coefficient of determination. It indicates how much of the target is predictable from the predictors: 0 means the predictors have no predictive power for the target, while 1 means the target is fully predictable from the predictors.
_images/performance.png
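Both metrics can be computed directly from their definitions. The sketch below uses the standard formulas; the `predicted` values are hypothetical and chosen only to illustrate the calculation, so the results do not correspond to the figures reported in the Performance tab.

```python
import math

def rmse(observed, predicted):
    """Square root of the mean squared difference between the two lists."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [2271.0, 2167.0, 1883.0, 2431.0]
predicted = [2281.0, 2157.0, 1893.0, 2421.0]  # hypothetical model output

error = rmse(observed, predicted)
score = r_squared(observed, predicted)
```

Here every prediction is off by 10, so the RMSE is 10, expressed in the same unit as the target. The R-squared is close to 1 because these errors are small compared with the spread of the observed prices.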

Multivariate

The Multivariate tab contains a table showing the variables used or generated by the model.
  • Variable name: The name of the variable used. Multiple variable names correspond to a cross-intersection of variables. In simple terms, these variables are multiplied by each other.
  • Coefficient Value: Coefficients of the regression model (mean of the distribution).
  • Standard Deviation: Standard deviation of the coefficients.
_images/multivariate.png
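In a Bayesian linear regression each coefficient has a distribution rather than a single value, which is why the table reports a mean and a standard deviation. The sketch below shows where these two numbers can come from in the simplest textbook setting: one predictor plus an intercept, a zero-mean Gaussian prior on the coefficients, and Gaussian noise with known precision. These modelling choices, the function `bayesian_line_fit`, and the toy data are assumptions made for illustration and are not necessarily what Actable AI uses.

```python
import math

def bayesian_line_fit(xs, ys, prior_precision=1e-6, noise_precision=1.0):
    """Posterior over (intercept, slope) for y = intercept + slope * x.

    Assumes a zero-mean Gaussian prior on both coefficients and Gaussian
    noise with known precision (illustrative assumptions).
    Returns ((mean_intercept, mean_slope), (std_intercept, std_slope)).
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision matrix A = prior_precision * I + noise_precision * X^T X
    a11 = prior_precision + noise_precision * n
    a12 = noise_precision * sx
    a22 = prior_precision + noise_precision * sxx
    det = a11 * a22 - a12 * a12
    # Posterior covariance is the inverse of A (closed form for a 2x2 matrix)
    c11, c12, c22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean = noise_precision * covariance * X^T y
    mean_intercept = noise_precision * (c11 * sy + c12 * sxy)
    mean_slope = noise_precision * (c12 * sy + c22 * sxy)
    return (mean_intercept, mean_slope), (math.sqrt(c11), math.sqrt(c22))

# Toy data generated from y = 1 + 2 * x (hypothetical values).
means, stds = bayesian_line_fit([0, 1, 2, 3], [1, 3, 5, 7])
```

With a very weak prior the posterior means are close to the intercept and slope that generated the toy data, while the standard deviations express how uncertain each coefficient remains after seeing only four points.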

Univariate

The Univariate tab is more focused on the relationship between each predictor and the target.
  • The first graph below shows a regression of our target, the rental price, using only the number of rooms as a feature. We can see that the number of rooms predicts the rental price well, with an R-squared of 0.912. We can also see that the more rooms a property has, the higher the price.
  • The second graph shows the probability density function of the rental price. We can see that the number of rooms has a positive influence on the price.

We generate a univariate analysis for every original variable in the dataset, that is, every variable that is not a cross-intersection or exponent term.

_images/univariate.png
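The idea of a univariate analysis can be sketched with an ordinary least-squares line fitted on a single predictor. Note that this is a plain (non-Bayesian) fit used only as an illustration, and that it uses just the five sample rows shown in the case study, so its R-squared differs from the value reported above for the full dataset. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, with its R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predictions = [intercept + slope * x for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# number_of_rooms and rental_price from the five sample rows of the case study.
number_of_rooms = [0, 1, 1, 0, 3]
rental_price = [2271, 2167, 1883, 2431, 4123.812]
slope, intercept, r2 = fit_line(number_of_rooms, rental_price)
```

A positive slope corresponds to the observation that properties with more rooms tend to have a higher rental price.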

Table

The Table tab displays the original dataset.