.. _variance:

Analysis of Variance
====================

Introduction
------------

*Analysis of variance (ANOVA)* is a statistical tool that compares the means of different groups by analyzing the variance within and between those groups.

Parameters
----------

* **N-Way ANOVA**: Whether to perform One-Way ANOVA or Two-Way ANOVA (using either one or two independent variables, respectively).
* **Outcome**: The variable corresponding to the outcome, typically the *dependent* variable (the variable that is affected by the behavior of *independent* variables).
* **Treatment(s)**: Categorical variable(s) affecting the outcome (the *independent* variables).
* **Filters** (optional): Set conditions on columns (features) to remove any samples in the original dataset that are not required. If selected, only a subset of the original data is used in the analysis.

Case Study 1: Tumors (One-way ANOVA)
------------------------------------

.. note::
   This example is available in the Actable AI web app and may be viewed `here <https://app.actable.ai/r/Iesd5Zpm_XcIwmQp55Y_2A>`_.


Suppose we are studying tumors and would like to understand how tumor measurements differ between *benign* and *malignant* diagnoses. A sample of the dataset is as follows:

+-----------+-----------+--------+---------+-----------+
| ID        | Diagnosis | radius | texture | perimeter |
+===========+===========+========+=========+===========+
| 842302    | M         | 17.99  | 10.38   | 122.8     |
+-----------+-----------+--------+---------+-----------+
| 842517    | M         | 20.57  | 17.77   | 132.9     |
+-----------+-----------+--------+---------+-----------+
| 84300903  | M         | 19.69  | 21.25   | 130.0     |
+-----------+-----------+--------+---------+-----------+
| 8510426   | B         | 13.54  | 14.36   | 87.46     |
+-----------+-----------+--------+---------+-----------+
| 8510653   | B         | 13.08  | 15.71   | 85.63     |
+-----------+-----------+--------+---------+-----------+
| 8510824   | B         | 9.504  | 12.44   | 60.34     |
+-----------+-----------+--------+---------+-----------+

In this dataset, *M* represents *malignant* and *B* represents *benign*.

To analyze how the tumor radius (``radius``) varies across the diagnosis groups (``Diagnosis``), the parameters can be set as follows:

.. image:: images/analytics/anova/setup.png
    :align: center
    :scale: 50%


The result view contains the **Box Plot**, **ANOVA Table**, **Tukey's Test**, **Diagnostics**, and **Table** tabs, each of which is described below.

Boxplot
"""""""

The distribution of the outcome (``radius``) grouped by the treatment variable (``Diagnosis``), namely *M* (malignant) or *B* (benign), is shown in a box plot. Hovering the mouse cursor over a box plot reveals further details.

.. image:: images/analytics/anova/boxplot.png
    :align: center
    :scale: 50%

In this image, it is clear that malignant tumors tend to have larger radii, and a greater spread of radii, than benign tumors.

.. _anova_one_way:

ANOVA Table
"""""""""""

The **ANOVA Table** tab shows the results of the ANOVA, an example of which is shown in the following image:

.. image:: images/analytics/anova/anova-table.png
    :align: center
    :scale: 90%

In the table, *DF* stands for :ref:`gls_degrees_of_freedom`, *SUM_SQ* refers to the *Sum of Squares*, *MEAN_SQ* refers to the *Mean Squares* (the sum of squares divided by the degrees of freedom), *F* refers to the *F test* statistic (the mean square of the treatment divided by the mean square of the residuals), and *PR(>F)* refers to the probability (:ref:`gls_pvalue`) of obtaining an *F value* at least as large as the observed value if the null hypothesis were true. The :ref:`gls_null_hypothesis` (which states that the group means are equal, so that any observed differences have occurred only due to chance) is rejected if this probability is less than or equal to a chosen significance level known as the *alpha value*. More information can be viewed `here <https://www.scribbr.com/statistics/one-way-anova/>`_.

In this example, the ``PR(>F)`` value of 0 (in practice, a value too small to be displayed) indicates that the mean ``radius`` differs significantly between the classes of ``Diagnosis``.
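
The columns of the ANOVA table can be reproduced by hand for a small example. The following is a minimal sketch in plain Python, assuming only the six ``radius`` values from the sample rows shown earlier (the web app uses the full dataset, so its numbers will differ). The variable names are illustrative and are not part of Actable AI.

```python
from statistics import mean

# Radius values from the six sample rows above (illustrative subset only).
groups = {
    "M": [17.99, 20.57, 19.69],
    "B": [13.54, 13.08, 9.504],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# SUM_SQ for the treatment: spread of the group means around the grand mean.
ss_treatment = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)
# SUM_SQ for the residuals: spread of each value around its own group mean.
ss_residual = sum(
    (v - mean(values)) ** 2 for values in groups.values() for v in values
)

# DF: (number of groups - 1) and (number of samples - number of groups).
df_treatment = len(groups) - 1
df_residual = len(all_values) - len(groups)

# MEAN_SQ = SUM_SQ / DF, and F is the ratio of the two mean squares.
ms_treatment = ss_treatment / df_treatment
ms_residual = ss_residual / df_residual
f_statistic = ms_treatment / ms_residual

print(f"DF: treatment={df_treatment}, residual={df_residual}")
print(f"SUM_SQ: treatment={ss_treatment:.3f}, residual={ss_residual:.3f}")
print(f"F: {f_statistic:.3f}")
```

A large F value means that the variation between the group means is large relative to the variation within the groups. Converting F into ``PR(>F)`` requires the F distribution, which is not part of the Python standard library and is therefore omitted from this sketch.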

Tukey's Test
""""""""""""

The **Tukey's Test** tab shows the results of Tukey's range test, which compares the means of every pair of groups. More information on this test can be viewed `here <https://en.wikipedia.org/wiki/Tukey%27s_range_test>`_.

.. image:: images/analytics/anova/tukey-test.png
    :align: center
    :scale: 50%

.. _anova_diagnostics:

Diagnostics
"""""""""""

The **Diagnostics** tab is composed of several sub-tabs, as follows:
	
* **Q-Q Plot** 

  This tab shows the *quantile-quantile plot* (*Q-Q plot*) of the ANOVA. A *Q-Q plot* compares two probability distributions graphically by plotting their quantiles against each other. The :math:`x`-axis represents the quantiles of the normal distribution and the :math:`y`-axis represents the quantiles of the data. Hence, the *Q-Q plot* indicates whether the data is approximately normally distributed.
  
  If the two distributions are similar, the points lie approximately on the identity line (:math:`y = x`), labeled *'trace 1'* (orange line) in the graph. If the distributions are linearly related, the points also lie on a line, but not necessarily on the identity line.

  More information on the Q-Q plot is available `here <https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot>`_.

  .. image:: images/analytics/anova/qq-plot.png
     :align: center
     :scale: 60%


* **Residual Plot**

  The **Residual Plot** tab shows the *residuals* of the ANOVA. *Residuals* correspond to the differences between the observed data and the fitted values. If the residuals are randomly scattered around zero with no clear pattern, the linear model underlying ANOVA is suitable for the data; a systematic pattern suggests that it is not.

  .. image:: images/analytics/anova/residual-plot.png
     :align: center
     :scale: 60%


* **Shapiro Wilk**

  This tab shows the results of the *Shapiro-Wilk test*, which is a statistical test for normality. Specifically, the Shapiro-Wilk test is used to determine whether to reject the :ref:`gls_null_hypothesis` that the data is normally distributed. If the :ref:`gls_pvalue` is less than the chosen *alpha value*, the :ref:`gls_null_hypothesis` is rejected, suggesting that the data is not normally distributed. Otherwise, there is insufficient evidence to conclude that the data departs from normality.
  
  .. image:: images/analytics/anova/shapiro-wilk.png
     :align: center

  More information on the Shapiro-Wilk test can be viewed `here <https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test>`_.


* **Bartlett**

  This tab shows the results of *Bartlett's test*, which tests for *homoscedasticity* (i.e. whether variances are equal across groups or samples, an assumption made by some statistical tests such as ANOVA). If the *p*-value is less than the *alpha value*, the :ref:`gls_null_hypothesis` (postulating that there is no difference in variances) is rejected, and it is concluded that there are differences between the variances that are unlikely to have occurred by chance.
  
  More information can be viewed `here <https://en.wikipedia.org/wiki/Bartlett%27s_test>`_.

  .. image:: images/analytics/anova/bartlett.png
     :align: center

  Bartlett's test is sensitive to departures from normality (i.e. cases where samples are drawn from non-normal distributions). *Levene's test* is an alternative that is less sensitive to such departures.


* **Levene**

  *Levene's test* is similar to Bartlett's test described above in testing for equality of variances across multiple samples or groups, but is less sensitive to departures from normality. 
  
  .. image:: images/analytics/anova/levene.png
     :align: center

  More information on Levene's test can be viewed `here <https://en.wikipedia.org/wiki/Levene%27s_test>`_.
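
Some of the quantities behind the diagnostics above can be sketched in plain Python. The example below is a rough illustration only, assuming the six ``radius`` values from the sample rows shown earlier. It computes the residuals (each value minus its group mean), the points of a Q-Q plot of those residuals (using the plotting positions :math:`(i + 0.5) / n`, which is one common convention and an assumption here), and Bartlett's test statistic. It is not a description of how Actable AI computes these values.

```python
from math import log
from statistics import NormalDist, mean, variance

# Radius values from the six sample rows (illustrative subset only).
groups = {
    "M": [17.99, 20.57, 19.69],
    "B": [13.54, 13.08, 9.504],
}

# Residuals: each observation minus its fitted value (the group mean).
residuals = [v - mean(values) for values in groups.values() for v in values]

# Q-Q plot points: theoretical normal quantiles (x) against the sorted
# residuals (y). The plotting positions (i + 0.5) / n are an assumption.
n = len(residuals)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
qq_points = list(zip(theoretical, sorted(residuals)))

# Bartlett's statistic: compares the pooled variance with the group variances.
k = len(groups)
sizes = [len(values) for values in groups.values()]
variances = [variance(values) for values in groups.values()]  # sample variances
total = sum(sizes)
pooled = sum((n_i - 1) * s2 for n_i, s2 in zip(sizes, variances)) / (total - k)
numerator = (total - k) * log(pooled) - sum(
    (n_i - 1) * log(s2) for n_i, s2 in zip(sizes, variances)
)
correction = 1 + (sum(1 / (n_i - 1) for n_i in sizes) - 1 / (total - k)) / (
    3 * (k - 1)
)
bartlett_statistic = numerator / correction

print("Residuals:", [round(r, 3) for r in residuals])
print("Q-Q points:", [(round(x, 3), round(y, 3)) for x, y in qq_points])
print(f"Bartlett statistic: {bartlett_statistic:.3f}")
```

The Bartlett statistic is zero when all group variances are identical and grows as they diverge. Its *p*-value comes from a chi-squared distribution with one fewer degrees of freedom than the number of groups, which is outside the standard library and is left out of this sketch.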


Table
"""""

The **Table** tab displays the first 1,000 rows of the original dataset:

.. image:: images/analytics/anova/table.png
    :align: center
    :scale: 50%

Case Study 2: Apartment Rental Prices (Two-way ANOVA)
-----------------------------------------------------

.. note::
   This example is available in the Actable AI web app and may be viewed `here <https://app.actable.ai/r/14L7PSdGN6sXHRdQS-K7pA>`_.


Suppose we are a real estate agent and would like to understand how the rental price of an apartment depends on features such as its size and location. A sample of a dataset containing this information is as follows:

+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| number_of_rooms  | number_of_bathrooms  | sqft  | location  | days_on_market  | initial_price  | neighborhood  | rental_price  |
+==================+======================+=======+===========+=================+================+===============+===============+
| 0                | 1                    | 484.8 | great     | 10              | 2271           | south_side    | 2271          |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| 1                | 1                    | 674   | good      | 1               | 2167           | downtown      | 2167          |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| 1                | 1                    | 554   | poor      | 19              | 1883           | westbrae      | 1883          |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| 0                | 1                    | 529   | great     | 3               | 2431           | south_side    | 2431          |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| 3                | 2                    | 1219  | great     | 3               | 5510           | south_side    | 5510          |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| 1                | 1                    | 398   | great     | 11              | 2272           | south_side    | 2272          |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+
| 3                | 2                    | 1190  | poor      | 58              | 4463           | westbrae      | 4123.812      |
+------------------+----------------------+-------+-----------+-----------------+----------------+---------------+---------------+

To analyze how the rental price varies across different types of locations and across different numbers of rooms, Two-way ANOVA can be employed with the following parameters:

.. image:: images/analytics/anova/setup_2way.png
    :align: center
    :scale: 50%


The result view is similar to that of the first case study (it again contains the **Box Plot**, **ANOVA Table**, **Tukey's Test**, **Diagnostics**, and **Table** tabs), but now relates the outcome to two treatment variables.

Boxplot
"""""""

The distribution of the outcome (``rental_price``) grouped by both treatment variables (``location`` and ``number_of_rooms``) is shown in a box plot. Hovering the mouse cursor over the box plots also reveals further details.

.. image:: images/analytics/anova/boxplot_2way_1.png
    :align: center
    :scale: 50%

In this image, the :math:`x`-axis represents the ``location``, with one box plot for each value of ``number_of_rooms`` present in the dataset (i.e. between 0 and 3 rooms).

The box plot can be flipped by clicking on the *Flip Box Plot* button, to represent the ``number_of_rooms`` on the :math:`x`-axis with each value containing one box plot for each ``location``:

.. image:: images/analytics/anova/boxplot_2way_2.png
    :align: center
    :scale: 50%

It is clear from both box plots that the rental price increases with the number of rooms, and that *poor* locations generally command lower rents than *good* locations, which in turn command lower rents than *great* locations. However, it can also be observed that there are large variances for *poor* properties, so that some of these may have a similar rental price to *good* properties.

Two-way ANOVA can be helpful to separate the effects of the two treatment variables. For example, if One-way ANOVA is performed using ``location`` as the treatment, the following is obtained:

.. image:: images/analytics/anova/boxplot_apt.png
    :align: center
    :scale: 50%

As can be observed, while there is again a general trend that *poor* apartments are worth less than *good* apartments which are in turn worth less than *great* apartments, it is much less clear than before. This is due to the substantial variances and overlaps between groups. The inclusion of the number of rooms thus enabled better separation and understanding of how the rental price varies across both treatment variables.

ANOVA Table
"""""""""""

The **ANOVA Table** tab shows the results of the ANOVA using both treatment variables, an example of which is shown in the following image:

.. image:: images/analytics/anova/anova-table_2way.png
    :align: center
    :scale: 90%

The interpretation of the table is similar to the One-way ANOVA case: :ref:`anova_one_way`. In this example, both the ``location`` and ``number_of_rooms`` have statistically significant effects on the ``rental_price``.


Tukey's Test
""""""""""""

The **Tukey's Test** tab shows the results of Tukey's range test, which compares the means of every pair of groups. More information on this test can be viewed `here <https://en.wikipedia.org/wiki/Tukey%27s_range_test>`_.

In the case of Two-way ANOVA, there is one tab for each treatment, containing a table with one row for each pair of possible groups. For example, in the case of ``location`` spanning values *poor*, *good*, and *great*, the following (order-invariant) pairs are possible:

#. Poor, Good
#. Poor, Great
#. Good, Great

This results in three rows (one for each pair), while there are six rows in the case of ``number_of_rooms``, which takes four values (0 to 3):

.. figure:: images/analytics/anova/tukey-test_2way_location.png
    :align: center
    :scale: 55%
	
    Tukey's Test for ``location`` (and ``rental_price``)

.. figure:: images/analytics/anova/tukey-test_2way_number_of_rooms.png
    :align: center
    :scale: 55%
	
    Tukey's Test for ``number_of_rooms`` (and ``rental_price``)

From the two tables above, it is evident that there are statistically significant differences between all groups in this example.
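
The number of rows in each Tukey's Test table follows from counting unordered pairs of groups. As a minimal sketch, assuming the three ``location`` values and the four ``number_of_rooms`` values (0 to 3) mentioned above, the pairs can be enumerated with the Python standard library:

```python
from itertools import combinations

# Group labels for each treatment, as described in the text above.
location_levels = ["poor", "good", "great"]
room_levels = [0, 1, 2, 3]

# Each unordered pair of groups corresponds to one row of the Tukey table.
location_pairs = list(combinations(location_levels, 2))
room_pairs = list(combinations(room_levels, 2))

print(len(location_pairs), location_pairs)
print(len(room_pairs), room_pairs)
```

In general, a treatment with :math:`k` groups yields :math:`k(k-1)/2` pairs, so the tables grow quickly as the number of groups increases.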


Diagnostics
"""""""""""

The **Diagnostics** tab is composed of several sub-tabs that are largely identical to the One-way ANOVA case. See :ref:`anova_diagnostics` for more information.

Table
"""""

The **Table** tab displays the first 1,000 rows of the original dataset:

.. image:: images/analytics/anova/table.png
    :align: center
    :scale: 50%