.. _correlation:

Correlation Analysis
====================

Introduction
------------

*Correlation analysis* is a statistical method used to study the relationship between two variables. It indicates whether variables in the same dataset are related, and how strong and in which direction that relationship is.

As an example, in the advertising industry, there is normally a correlation between *advertising spend* and the advertisements' *impression rate*, such that a higher advertising spend tends to result in higher impression rates.

It should be noted that `correlation is not always causal `_: two variables can be correlated without one causing the other, and some observed relationships are spurious. To discover and understand the true causal relationships, you may use Actable AI's :ref:`causal_inference` analytic.

Parameters
----------

* **Compared factors**: Features whose correlation with the target is computed.
* **Correlation target**: The feature to be studied against the selected *Compared factors*.
* **Number of displayed factors**: Number of best-correlated factors to be displayed.
* **Correlation target value** (optional): If set, the target is treated as a binary variable. Samples with the chosen value are labeled *positive*, while samples of all other classes are labeled *negative*.
* **Filters** (optional): Set conditions on columns (features) to remove samples from the original dataset that are not required. If selected, only a subset of the original data is used.
* **Bar Values**: If selected, values are shown on the bars. Changing this control takes effect instantly in the visualizations.

If any features are time-based, they can also be used for visualization:

* **Time Column**: The time-based feature to be used for visualization. An arbitrary expression that returns a DATETIME column in the table can also be defined, using format codes as defined `here `_.
* **Time Range**: The time range used for visualization, where the time is based on the **Time Column**. A number of pre-set values are provided, such as *Last day*, *Last week*, *Last month*, etc. Custom time ranges can also be provided. In any case, all relative times are evaluated on the server using the server's local time, and all tooltips and placeholder times are expressed in UTC. If a start time, an end time, or both are specified, the time zone can also be set using the ISO 8601 format.

Case Study
----------

.. note::
    This example is available in the Actable AI web app and may be viewed `here `_.

Suppose that we own a bike rental shop. We have our bike demand data for the past two years and we would like to understand the effect of changes in the weather on bike rental demand. All weather values in the dataset are normalized. The first few rows are shown below:

+-----------+-----------+------------+---------+-------------+------+-------------+
| temp      | hum       | windspeed  | casual  | registered  | cnt  | dteday      |
+===========+===========+============+=========+=============+======+=============+
| 0.344167  | 0.805833  | 0.160446   | 331     | 654         | 985  | 2011-01-01  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.363478  | 0.696087  | 0.248539   | 131     | 670         | 801  | 2011-01-02  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.196364  | 0.437273  | 0.248309   | 120     | 1229        | 1349 | 2011-01-03  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.2       | 0.590435  | 0.160296   | 108     | 1454        | 1562 | 2011-01-04  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.226957  | 0.436957  | 0.1869     | 82      | 1518        | 1600 | 2011-01-05  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.204348  | 0.518261  | 0.0895652  | 88      | 1518        | 1606 | 2011-01-06  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.196522  | 0.498696  | 0.168726   | 148     | 1362        | 1510 | 2011-01-07  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.165     | 0.535833  | 0.266804   | 68      | 891         | 959  | 2011-01-08  |
+-----------+-----------+------------+---------+-------------+------+-------------+
| 0.138333  | 0.434167  | 0.36195    | 54      | 768         | 822  | 2011-01-09  |
+-----------+-----------+------------+---------+-------------+------+-------------+

We set our parameters as follows, where ``cnt`` stands for the *rental demand count*.

.. image:: images/analytics/correlation/setup.png
    :align: center
    :scale: 75%

Review Result
-------------

The result view contains a **Chart** tab, a **Correlation Result** tab and a **Data** tab:

Chart
"""""

The **Chart** tab provides an overview chart showing the correlation between the *Compared factors* and the *Correlation target*. As can be observed, ``temp`` has a positive correlation with ``cnt``, while ``windspeed`` and ``hum`` (humidity) have a negative correlation with ``cnt``.

.. image:: images/analytics/correlation/overview_chart.png
    :align: center
    :scale: 60%

Actable AI also gives breakdown views for all **Compared factors**. For example, the following graph describes the correlation between ``temp`` and ``cnt``. Most of the data points are well explained by the fitted regression line, with a correlation coefficient of 0.622.

.. image:: images/analytics/correlation/cnt_vs_temp.png
    :align: center
    :scale: 60%

Actable AI uses Spearman's Rank-Order Correlation Coefficient (SROCC) to indicate the strength and direction of the correlation. More information on SROCC can be found :ref:`here`.

In the image above, the SROCC has a positive value and it can be clearly observed that ``cnt`` tends to increase with increasing values of ``temp``.
On the other hand, the image below shows that ``cnt`` tends to decrease with increasing values of ``windspeed``, although this relationship is quite weak, with a magnitude of 0.2172. These results make sense, since people might feel uncomfortable riding a bike when it is cold or windy.

.. image:: images/analytics/correlation/cnt_vs_windspeed.png
    :align: center
    :scale: 60%

Correlation Result
""""""""""""""""""

This tab provides the Spearman's Rank-Order Correlation Coefficient values and the :ref:`p-values` for the correlation of each factor with the target.

The :ref:`p-value` is the probability of observing a correlation at least as strong as the one measured if the :ref:`null hypothesis` were true, where the :ref:`null hypothesis` states that there is no correlation between a factor and the target. The range of :ref:`p-values` lies between 0 (0%) and 1 (100%):

* A value close to 1 indicates that the data provide no evidence against the :ref:`null hypothesis`. The observed correlation could easily have occurred by chance, although this does not prove that the variables are uncorrelated.
* A value close to 0 suggests that the :ref:`null hypothesis` should be rejected. A correlation as strong as the one observed would be unlikely to occur by chance if the factor and the target were truly uncorrelated.
* The threshold below which the :ref:`null hypothesis` is rejected is known as the *alpha value* and is typically set to 0.05. The :ref:`null hypothesis` is rejected if the :ref:`p-value` is less than the alpha value, that is, if the probability of obtaining the same or more extreme results under the :ref:`null hypothesis` is below 5%.

.. image:: images/analytics/correlation/corr_result_tab.png
    :align: center
    :scale: 75%

Data
""""

The **Data** tab displays the first 1,000 rows in the original dataset and the corresponding values of the columns used in the analysis:

.. image:: images/analytics/correlation/data_tab.png
    :align: center
    :scale: 75%
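
The quantities reported in the **Correlation Result** tab can be illustrated outside the app. The following is a minimal sketch and not Actable AI's implementation. It computes the SROCC as the Pearson correlation of the ranked values, and estimates a two-sided p-value with a Monte Carlo permutation test, a technique chosen here only for illustration. The function names ``average_ranks``, ``pearson``, ``spearman`` and ``permutation_p_value`` are hypothetical. The input is only the nine-row excerpt of ``temp`` and ``cnt`` from the Case Study table, so the coefficient differs from the 0.622 obtained on the full dataset.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling one variable's ranks.

    Shuffling breaks any real association, so the shuffled coefficients
    approximate what the null hypothesis (no correlation) would produce.
    """
    rng = random.Random(seed)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rank_x, rank_y))
    shuffled = list(rank_y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(rank_x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Nine-row excerpt from the Case Study table (not the full dataset).
temp = [0.344167, 0.363478, 0.196364, 0.2, 0.226957,
        0.204348, 0.196522, 0.165, 0.138333]
cnt = [985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822]

rho = spearman(temp, cnt)
p_value = permutation_p_value(temp, cnt)
alpha = 0.05  # typical threshold, as described in the Correlation Result section

print(f"SROCC on the excerpt: {rho:.3f}")
print(f"Estimated p-value: {p_value:.3f}")
print("Reject null hypothesis" if p_value < alpha else "Cannot reject null hypothesis")
```

With so few rows the estimate is noisy, which is why the analysis in the app is run on the full dataset rather than on a short excerpt.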