Correlation Analysis

Introduction

Correlation analysis is a statistical method used to study the relationship between two variables. It indicates whether variables in the same dataset are related, and how strong that relationship is and in which direction it runs.

As an example, in the advertising industry, there is normally a correlation between advertising spend and the advertisements’ impression rate, such that a higher advertising spend tends to result in higher impression rates.

It should be noted that correlation does not imply causation: an observed correlation can be spurious, for example when both variables are driven by a third factor. To discover and understand the true causal relationships, you may use Actable AI’s Causal Inference analytic.
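As a minimal sketch of how a correlation can arise without any causal link, the following uses purely synthetic data in which a hypothetical third factor (`confounder`) drives two series that never influence each other. All names and numbers here are illustrative assumptions, Pearson correlation is used only for brevity, and this is not a description of how the Causal Inference analytic works.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000

# Hypothetical confounder: it drives both series below.
confounder = [rng.gauss(0, 1) for _ in range(n)]
# Neither series depends on the other; each is the confounder plus its own noise.
series_a = [z + 0.5 * rng.gauss(0, 1) for z in confounder]
series_b = [z + 0.5 * rng.gauss(0, 1) for z in confounder]

raw_corr = pearson(series_a, series_b)

# Remove the confounder's contribution and correlate what is left.
resid_a = [a - z for a, z in zip(series_a, confounder)]
resid_b = [b - z for b, z in zip(series_b, confounder)]
adjusted_corr = pearson(resid_a, resid_b)

print(f"correlation before adjusting: {raw_corr:.2f}")
print(f"correlation after adjusting:  {adjusted_corr:.2f}")
```

The two series are strongly correlated until the shared factor is removed, at which point the correlation largely disappears.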

Parameters

  • Compared factors: Features whose correlation with the target will be computed.

  • Correlation target: The feature whose relationship with the selected Compared factors is studied.

  • Number of displayed factors: Number of best-correlated factors to be displayed.

  • Correlation target value (optional): If set, the target is treated as a binary variable. Samples with the chosen value are labeled positive, and all other samples are labeled negative.

  • Filters (optional): Set conditions on columns (features) so as to remove any samples in the original dataset which are not required. If selected, only a subset of the original data is used.

  • Bar Values: If selected, values are shown on the bar. Changing this control takes effect instantly in the visualizations.

Furthermore, if any features are related to time, they can also be used for visualization:

  • Time Column: The time-based feature to be used for visualization. An arbitrary expression which returns a DATETIME column in the table can also be defined, using format codes as defined here.

  • Time Range: The time range used for visualization, where the time is based on the Time Column. A number of pre-set values are provided, such as Last day, Last week, Last month, etc. Custom time ranges can also be provided. In any case, all relative times are evaluated on the server using the server’s local time, and all tooltips and placeholder times are expressed in UTC. If the start time and/or end time is specified, the time zone can also be set using the ISO 8601 format.
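To make the optional Correlation target value and Filters parameters more concrete, here is a small sketch of the two transformations they describe. The helper names (`binarize_target`, `apply_filter`) and the sample records are hypothetical and only illustrate the idea; they are not part of Actable AI.

```python
def binarize_target(values, positive_value):
    """Map each target value to 1 if it equals positive_value, else 0."""
    return [1 if v == positive_value else 0 for v in values]


def apply_filter(rows, condition):
    """Keep only the rows (dicts) for which condition(row) is true."""
    return [row for row in rows if condition(row)]


# Hypothetical records; "plan" and "usage" are made-up column names.
rows = [
    {"plan": "basic", "usage": 12},
    {"plan": "premium", "usage": 40},
    {"plan": "trial", "usage": 3},
    {"plan": "premium", "usage": 55},
]

# Filter: remove samples that are not required (here, trial accounts).
kept = apply_filter(rows, lambda row: row["plan"] != "trial")

# Correlation target value = "premium": premium rows become positive (1),
# and every other class becomes negative (0).
target = binarize_target([row["plan"] for row in kept], "premium")
print(target)  # [0, 1, 1]
```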

Case Study

Note

This example is available in the Actable AI web app and may be viewed here.

Suppose that we own a bike rental shop. We have bike demand data for the past two years and would like to understand how the weather affects rental demand. All weather values in the dataset are normalized; a sample of the data is given below:

temp      hum       windspeed  casual  registered  cnt   dteday
0.344167  0.805833  0.160446   331     654         985   2011-01-01
0.363478  0.696087  0.248539   131     670         801   2011-01-02
0.196364  0.437273  0.248309   120     1229        1349  2011-01-03
0.2       0.590435  0.160296   108     1454        1562  2011-01-04
0.226957  0.436957  0.1869     82      1518        1600  2011-01-05
0.204348  0.518261  0.0895652  88      1518        1606  2011-01-06
0.196522  0.498696  0.168726   148     1362        1510  2011-01-07
0.165     0.535833  0.266804   68      891         959   2011-01-08
0.138333  0.434167  0.36195    54      768         822   2011-01-09

We set our parameters as follows, where cnt stands for the rental demand count.

_images/setup2.png

Review Result

The result view contains a Chart tab, a Correlation Result tab and a Data tab:

Chart

The Chart tab provides an overview chart showing the correlation between each of the Compared factors and the Correlation target. As can be observed, temp has a positive correlation with cnt, while windspeed and hum (humidity) have a negative correlation with cnt.

_images/overview_chart.png

Actable AI also gives a breakdown view for each of the Compared factors. For example, the following graph shows the relationship between temp and cnt. Most of the data points are well explained by the fitted regression, with a correlation coefficient of 0.622.

_images/cnt_vs_temp.png

Actable AI uses Spearman’s Rank-Order Correlation Coefficient (SROCC) to indicate the strength and direction of the correlation. More information on SROCC can be found here.

In the image above, the SROCC has a positive value and it can be clearly observed that cnt tends to increase with increasing values of temp. On the other hand, the image below shows that cnt tends to decrease with increasing values of windspeed, although this relationship is quite weak with a magnitude of 0.2172. These results make sense, since people might feel uncomfortable riding a bike when it is cold or when it is windy.

_images/cnt_vs_windspeed.png
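For readers who want to see how a Spearman coefficient is obtained, the following is a minimal pure-Python sketch: each column is converted to ranks (tied values share their average rank) and the Pearson correlation of the ranks is taken. The function names are illustrative assumptions, and ordering factors by absolute coefficient is only a guess at what "best-correlated" means. The data are the nine sample rows from the case study, so the values will differ from those reported for the full dataset.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# The nine sample rows shown in the case study (an excerpt, not the full data).
factors = {
    "temp": [0.344167, 0.363478, 0.196364, 0.2, 0.226957,
             0.204348, 0.196522, 0.165, 0.138333],
    "hum": [0.805833, 0.696087, 0.437273, 0.590435, 0.436957,
            0.518261, 0.498696, 0.535833, 0.434167],
    "windspeed": [0.160446, 0.248539, 0.248309, 0.160296, 0.1869,
                  0.0895652, 0.168726, 0.266804, 0.36195],
}
cnt = [985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822]

coefficients = {name: spearman(values, cnt) for name, values in factors.items()}

# Assumption: "best-correlated" means largest absolute coefficient.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
for name, rho in ranked:
    print(f"{name}: {rho:+.3f}")
```

Because the coefficient depends only on ranks, any monotonic rescaling of a column (such as the normalization applied to the weather values) leaves it unchanged.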

Correlation Result

This tab provides the Spearman’s Rank-Order Correlation Coefficient values and the p-values for the correlation of each factor with the target. The p-value is the probability of observing a correlation at least as strong as the one measured if the null hypothesis, which states that there is no correlation between a factor and the target, were true. P-values range from 0 (0%) to 1 (100%):

  • A value close to 1 indicates that the data are consistent with the null hypothesis. Hence, there is no evidence of a correlation between the compared variables, although this does not prove that none exists.

  • A value close to 0 suggests that the null hypothesis should be rejected. Hence, the observed correlation would be very unlikely to arise by chance alone if the feature and the target were truly uncorrelated.

  • The threshold at which the null hypothesis is rejected is known as the alpha value and is typically set to 0.05. The null hypothesis is rejected if the p-value is less than the alpha value, that is, if results at least as extreme as those observed would occur less than 5% of the time under the assumption that the null hypothesis is true.

_images/corr_result_tab.png
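One way to build intuition for the p-value is a permutation test: shuffle the target many times, which breaks any real link with the factor, and count how often the shuffled data produce a coefficient at least as strong as the observed one. The sketch below is an assumption-laden illustration in pure Python and is not necessarily how Actable AI computes its p-values; the function names, the number of permutations and the seed are all arbitrary choices.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Spearman coefficient of x and y."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # simulate the null hypothesis of no correlation
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


ALPHA = 0.05  # the typical threshold mentioned above

# temp and cnt from the nine sample rows in the case study (an excerpt only).
temp = [0.344167, 0.363478, 0.196364, 0.2, 0.226957,
        0.204348, 0.196522, 0.165, 0.138333]
cnt = [985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822]

p_value = permutation_p_value(temp, cnt)
verdict = "reject" if p_value < ALPHA else "do not reject"
print(f"p-value on the nine-row excerpt: {p_value:.3f} ({verdict} the null hypothesis)")
```

With only nine rows there is too little data to reject the null hypothesis, which illustrates why the p-values in the Correlation Result tab depend on the size of the dataset as well as on the strength of the correlation.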

Data

The Data tab displays the first 1,000 rows in the original dataset and the corresponding values of the columns used in the analysis:

_images/data_tab.png