.. _bayesian:

Bayesian Linear Regression
==========================

Introduction
------------

Like :ref:`regression`, *Bayesian Linear Regression* analysis predicts a continuous value based on other variables, and also provides information that helps to interpret the results.

Actable AI uses the entire table as the source data and automatically splits the table into three parts:

* **Training data**: Rows in the table where both the predictors and the ground-truth target are known and available. More information is available :ref:`here<gls_training_set>`.
* **Prediction data**: Rows where the target value is missing and thus needs to be predicted.
* **Validation data**: Actable AI samples a part of the data to verify the reliability of the trained model. This part of the data is also used in the performance tuning stage if performance optimization is selected. More information is available :ref:`here<gls_validation_set>`.
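
The three-way split described above can be sketched in a few lines of Python. This is only an illustration and not Actable AI's implementation: the function name ``split_rows``, the ``validation_fraction`` argument, and the fixed random seed are all assumptions made for this example.

.. code-block:: python

   import random

   def split_rows(rows, target, validation_fraction=0.2, seed=0):
       """Split rows into training, validation, and prediction sets."""
       # Rows with a missing target are the ones that need a prediction.
       prediction = [r for r in rows if r.get(target) is None]
       labelled = [r for r in rows if r.get(target) is not None]

       # Hold out a random sample of the labelled rows for validation.
       shuffled = labelled[:]
       random.Random(seed).shuffle(shuffled)
       n_validation = round(len(shuffled) * validation_fraction)
       validation = shuffled[:n_validation]
       training = shuffled[n_validation:]
       return training, validation, prediction

   rows = [
       {"sqft": 674, "rental_price": 2167},
       {"sqft": 554, "rental_price": 1883},
       {"sqft": 529, "rental_price": 2431},
       {"sqft": 1190, "rental_price": 4123.812},
       {"sqft": 4848, "rental_price": 2271},
       {"sqft": 509, "rental_price": None},
       {"sqft": 481, "rental_price": None},
   ]
   training, validation, prediction = split_rows(rows, "rental_price")
   print(len(training), len(validation), len(prediction))  # 4 1 2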

Parameters
----------

There are two tabs containing options that can be set: the **Data** tab, and the **Priors** tab, as follows:

* **Data** tab:

  * **Predicted target**: Choose one column for which any missing values should be predicted.
  * **Predictors**: Columns that are used to predict the **predicted target**.
  * **Validation percentage**: The model is trained with the :ref:`training set<gls_training_set>` and is validated (tested whilst performing training) using the :ref:`validation set<gls_validation_set>`. By sliding this value, one can control the percentage of rows with a non-empty predicted target that is used for validation.
  * **Polynomial degree**: Computes powers (exponents) and cross-intersection terms of the numeric variables up to this degree. These values are used as additional inputs to the model.
  * **Quantile low**: Used to calculate the lower bound of the prediction interval. More information is available :ref:`here<gls_quantile>`.
  * **Quantile high**: Used to calculate the upper bound of the prediction interval. More information is available :ref:`here<gls_quantile>`.
  * **Filters** (optional): Set conditions on columns (features), in order to remove any samples in the original dataset which are not required. If selected, only a subset of the original data is used in the analytics.
  * **Number of trials**: Number of trials for hyper-parameter optimization. Increasing the number of trials usually results in better predictions but a longer training time.

  Furthermore, if any features are related to time, they can also be used for visualization:

  * **Time Column**: The time-based feature to be used for visualization. An arbitrary expression which returns a DATETIME column in the table can also be defined, using format codes as defined `here <https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes>`_. 
  * **Time Range**: The time range used for visualization, where the time is based on the **Time Column**. A number of pre-set values are provided, such as *Last day*, *Last week*, *Last month*, etc. Custom time ranges can also be provided. In any case, all relative times are evaluated on the server using the server's local time, and all tooltips and placeholder times are expressed in UTC. If a start time and/or an end time is specified, the time zone can also be set using the ISO 8601 format.


* **Priors** tab:

  Optionally, the *priors* can be set. The *prior* of a certain quantity is the probability distribution that reflects one's beliefs about the likely values of this quantity before any data is seen or taken into account. 
  
  By clicking on the green *Add Prior* button, a new prior can be added:

  .. image:: images/analytics/bayesian/priors_1.png
      :align: center
      :scale: 50%

  The following fields then need to be filled:

  * **Column Name**: The name of the column to be used as a prior. If the column is of type *double*, the **Prior value** can be set. Otherwise, the **Column Value** should be set.

    .. note::
       Fields which cannot be set are grayed out to avoid mistakenly filling in any incorrect values.

  * **Column Value**: The value of the column to be used (can only be set if the column selected in **Column Name** is not of type *double*).
  * **Polynomial value**: Maximum polynomial degree to be used. Highest value that can be used is 3.
  * **Prior value**: The value of the prior to be used.
  
  .. image:: images/analytics/bayesian/priors_detail.png
      :align: center
      :scale: 50%
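
  To give an intuition for how a prior interacts with the data, the sketch below updates a single regression coefficient under a normal prior with known noise. It is a textbook conjugate update and not necessarily how Actable AI applies the **Prior value**. The function name, the toy data, and the prior and noise standard deviations are all hypothetical.

  .. code-block:: python

     def posterior_coefficient(xs, ys, prior_mean, prior_std, noise_std):
         """Posterior mean and std of w in y = w * x + noise (normal prior, known noise)."""
         prior_precision = 1.0 / prior_std ** 2
         data_precision = sum(x * x for x in xs) / noise_std ** 2
         posterior_precision = prior_precision + data_precision
         posterior_mean = (
             prior_mean * prior_precision
             + sum(x * y for x, y in zip(xs, ys)) / noise_std ** 2
         ) / posterior_precision
         return posterior_mean, posterior_precision ** -0.5

     # Hypothetical toy data in which y is roughly 2 * x.
     xs = [1.0, 2.0, 3.0]
     ys = [2.1, 3.9, 6.2]

     # A vague prior centred on 0 lets the data dominate the estimate.
     weak_mean, weak_std = posterior_coefficient(xs, ys, 0.0, 10.0, 1.0)
     # A confident prior centred on 0 pulls the estimate towards 0.
     strong_mean, strong_std = posterior_coefficient(xs, ys, 0.0, 0.1, 1.0)
     print(weak_mean, strong_mean)

  With only three data points, the confident prior keeps the coefficient close to the prior value, whereas the vague prior yields an estimate close to the ordinary least-squares slope.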


Case Study
----------

.. note::
   This example is available in the Actable AI web app and may be viewed `here <https://app.actable.ai/r/qeNlUHhYZvqoqemmXeV6CQ>`_.


Suppose we work at a real estate company and would like to forecast rental prices for properties that remain on the market. The table below shows a sample of the dataset:

+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| days_on_market  | initial_price  | location  | neighborhood  | number_of_bathrooms  | number_of_rooms  | sqft  | rental_price  |
+=================+================+===========+===============+======================+==================+=======+===============+
| 10              | 2271           | great     | south_side    | 1                    | 0                | 4848  | 2271          |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 1               | 2167           | good      | downtown      | 1                    | 1                | 674   | 2167          |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 19              | 1883           | poor      | westbrae      | 1                    | 1                | 554   | 1883          |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 3               | 2431           | great     | south_side    | 1                    | 0                | 529   | 2431          |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 58              | 4463           | poor      | westbrae      | 2                    | 3                | 1190  | 4123.812      |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+

Now, we have added some new properties and would like to predict their rental prices from their characteristics:

+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| days_on_market  | initial_price  | location  | neighborhood  | number_of_bathrooms  | number_of_rooms  | sqft  | rental_price  |
+=================+================+===========+===============+======================+==================+=======+===============+
| 18              | 1725           | poor      | westbrae      | 1                    | 0                | 509   |               |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 49              | 1388           | poor      | westbrae      | 1                    | 0                | 481   |               |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 1               | 4677           | good      | downtown      | 2                    | 3                | 808   |               |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 30              | 1713           | poor      | westbrae      | 1                    | 1                | 522   |               |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+
| 10              | 1903           | good      | downtown      | 1                    | 1                | 533   |               |
+-----------------+----------------+-----------+---------------+----------------------+------------------+-------+---------------+

We set our parameters as follows:

.. image:: images/analytics/bayesian/setup.png
    :align: center
    :scale: 50%

Review Result
-------------

The result view contains a **Prediction** tab, a **Performance** tab, a **Multivariate** tab, a **Univariate** tab and a **Table** tab.

Prediction
""""""""""

The **Prediction** tab shows the prediction result for the rows where the target value was missing. The table has several new columns, where ``<target>`` is the name of the target variable:

* ``<target>``: the predicted values of the target variable.
* ``<target>_STD``: the standard deviation of the predicted values.
* ``<target>_low`` and ``<target>_high``: the lower and upper :ref:`quantile<gls_quantile>` bounds of the predicted value, computed from the values set for **Quantile low** and **Quantile high**, respectively.

.. image:: images/analytics/bayesian/prediction.png
    :align: center
    :scale: 50%
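
As a rough illustration of how such columns can be derived, the sketch below summarises a set of simulated predictive samples for one row into a mean, a standard deviation, and lower and upper quantile bounds. The sample values, the quantile levels of 0.05 and 0.95, and the function names are assumptions made for this example and do not come from the app.

.. code-block:: python

   import random
   import statistics

   def quantile(sorted_values, q):
       """Linearly interpolated quantile of an ascending list, with 0 <= q <= 1."""
       position = q * (len(sorted_values) - 1)
       lower = int(position)
       upper = min(lower + 1, len(sorted_values) - 1)
       fraction = position - lower
       return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

   def summarise(samples, quantile_low=0.05, quantile_high=0.95):
       """Reduce predictive samples to the summary values shown per row."""
       ordered = sorted(samples)
       return {
           "mean": statistics.fmean(ordered),
           "std": statistics.stdev(ordered),
           "low": quantile(ordered, quantile_low),
           "high": quantile(ordered, quantile_high),
       }

   # Hypothetical predictive samples for a single property.
   rng = random.Random(0)
   samples = [rng.gauss(1700, 50) for _ in range(2000)]
   summary = summarise(samples)
   print(summary)

Widening the gap between the two quantile levels widens the reported interval, since more of the predictive distribution must fall between the bounds.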

Performance
"""""""""""

The **Performance** tab shows the performance of the model with the *Root Mean Square Error (RMSE)* metric (19.299) and the *R-squared (R2)* metric (1.0).

* **Root Mean Square Error (RMSE)**: The square root of the mean of the squared differences between the predicted and observed values. Lower values indicate a better fit.
* **R-squared (R2)**: Also known as the *coefficient of determination*, it indicates the extent to which the target variable is predictable from the predictor variables. A value of 0 means that the predictors do not help to predict the target, while a value of 1 means that the target is fully predictable from the predictors.

.. image:: images/analytics/bayesian/performance.png
    :align: center
    :scale: 50%
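
Both metrics can be computed directly from their definitions. In the sketch below, the observed prices are taken from the sample table in the case study, while the predicted prices are made up for illustration and are not output from the app.

.. code-block:: python

   import math

   def rmse(observed, predicted):
       """Square root of the mean squared difference."""
       squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
       return math.sqrt(sum(squared_errors) / len(squared_errors))

   def r_squared(observed, predicted):
       """1 minus the ratio of residual variation to total variation."""
       mean_observed = sum(observed) / len(observed)
       residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
       total = sum((o - mean_observed) ** 2 for o in observed)
       return 1.0 - residual / total

   observed = [2271, 2167, 1883, 2431]
   predicted = [2250, 2180, 1900, 2400]  # hypothetical predictions
   print(rmse(observed, predicted), r_squared(observed, predicted))

Note that R2 computed this way can be negative on validation data, which happens when the model predicts worse than simply using the mean of the observed values.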

Multivariate
""""""""""""

The **Multivariate** tab contains a table showing information related to the behavior of the predictions when multiple variables are considered.

* Variable name: The name(s) of the variable(s) used. Multiple variable names correspond to a cross-intersection of variables. In simple terms, these variables are essentially multiplied by each other.
* Coefficient Value: The coefficients of the regression model (the mean of each coefficient's distribution), which indicate how important each variable is. Coefficients close to 0 contribute little to the prediction of the target, while coefficients with a large magnitude (positive or negative) indicate that the variable(s) strongly influence the predictions.
* Standard Deviation: Standard deviation values of the coefficients.

.. image:: images/analytics/bayesian/multivariate.png
    :align: center
    :scale: 75%
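
To make the cross-intersection terms concrete, the sketch below expands one row of numeric variables into all products of up to ``degree`` variables. The function name and the ``a * b`` naming of the generated terms are assumptions made for this example, and the names shown in the app may differ.

.. code-block:: python

   from itertools import combinations_with_replacement

   def polynomial_terms(row, degree):
       """Return every product of up to `degree` variables of a numeric row."""
       names = sorted(row)
       terms = {}
       for d in range(1, degree + 1):
           for combo in combinations_with_replacement(names, d):
               value = 1.0
               for name in combo:
                   value *= row[name]
               terms[" * ".join(combo)] = value
       return terms

   row = {"number_of_rooms": 3, "sqft": 1190}
   terms = polynomial_terms(row, degree=2)
   for name, value in terms.items():
       print(name, value)

For two variables and a degree of 2, this yields five terms: the two original variables, their two squares, and one cross-intersection term (``number_of_rooms * sqft``).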

Univariate
""""""""""

The information in this tab focuses on the relationship between the target and each individual predictor. A univariate analysis is generated for every original variable in the dataset (that is, every variable that is not a cross-intersection or an exponent term). For categorical variables, one analysis is provided per category value. For example, ``location`` can take the values *great*, *good*, and *poor*, so separate analyses are provided for ``location_great``, ``location_good``, and ``location_poor``.
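
The per-category variables correspond to what is commonly called one-hot encoding. The sketch below shows one plausible way to derive such columns. The helper ``one_hot`` is hypothetical, and only the resulting column names follow the ones mentioned above.

.. code-block:: python

   def one_hot(rows, column):
       """Replace a categorical column with one 0/1 column per category."""
       categories = sorted({row[column] for row in rows})
       encoded = []
       for row in rows:
           new_row = {k: v for k, v in row.items() if k != column}
           for category in categories:
               new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
           encoded.append(new_row)
       return encoded

   rows = [
       {"location": "great", "sqft": 529},
       {"location": "good", "sqft": 674},
       {"location": "poor", "sqft": 554},
   ]
   encoded = one_hot(rows, "location")
   print(encoded[0])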

Two graphs are generated for each variable considered:

* The first graph shows a regression of the target (``rental_price`` in our example) using only the variable in question as a predictor. The image below shows the graph for ``number_of_rooms``: this variable predicts ``rental_price`` well on its own, with an R-squared value of 0.912, and the price increases as the number of rooms increases.

  .. image:: images/analytics/bayesian/univariate.png
      :align: center
      :scale: 80%

* The second graph shows the Probability Density Function (PDF) of the target (``rental_price``). This graph also shows that the number of rooms has a positive influence on the price.

  .. image:: images/analytics/bayesian/univariate_pdf2.png
      :align: center
      :scale: 90%
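
A single-predictor fit of this kind can be approximated with an ordinary least-squares line. The sketch below uses the five sample rows from the case study. It is a plain (non-Bayesian) fit intended only to show the direction of the relationship, so it will not reproduce the figures shown in the app.

.. code-block:: python

   def fit_line(xs, ys):
       """Ordinary least-squares slope and intercept for one predictor."""
       mean_x = sum(xs) / len(xs)
       mean_y = sum(ys) / len(ys)
       s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
       s_xx = sum((x - mean_x) ** 2 for x in xs)
       slope = s_xy / s_xx
       return slope, mean_y - slope * mean_x

   number_of_rooms = [0, 1, 1, 0, 3]
   rental_price = [2271, 2167, 1883, 2431, 4123.812]
   slope, intercept = fit_line(number_of_rooms, rental_price)
   print(slope, intercept)

A positive slope means that, in this sample, the rental price tends to rise with the number of rooms.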

Table
"""""

The **Table** tab displays the first 1,000 rows in the original dataset and the corresponding values of the columns used in the analysis:

.. image:: images/analytics/bayesian/table.png
    :align: center
    :scale: 50%