Glossary of Terms

A number of terms related to machine learning and data science are mentioned in Actable AI’s documentation. This glossary serves as a quick reference to the definitions of these terms.

A/B Test

A statistical way of comparing multiple techniques, where A is typically an existing technique and B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.
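
As a minimal sketch (with hypothetical metric values), a two-sample \(t\)-test from SciPy can be used to check whether the difference between the two techniques is statistically significant:

```python
from scipy import stats

# Hypothetical per-user metric values collected for each variant
variant_a = [0.42, 0.51, 0.47, 0.39, 0.44, 0.50, 0.46, 0.41]
variant_b = [0.48, 0.55, 0.52, 0.49, 0.53, 0.57, 0.50, 0.51]

# Two-sample t-test: the p-value indicates whether the observed
# difference between A and B is statistically significant
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```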

More information available here.

Accuracy

A commonly used metric, where the number of correct classification predictions is divided by the total number of predictions.
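
For example (a minimal sketch with hypothetical labels, using scikit-learn):

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

# 5 of the 6 predictions are correct, so accuracy = 5/6 ≈ 0.83
print(accuracy_score(y_true, y_pred))
```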

Related to precision, recall, and the F-score.

More information available here.

Analytics

Trends and conclusions derived from the information contained in data sets. Actable AI defines a number of different Analytics (such as Regression, Classification, and so on). Whenever an analytic is computed, it can be found under the Analytics tab in the web app for future reference, as described in the section Load & Explore Data.

Area Under the ROC Curve (AUC)

The Area Under the ROC Curve can be used to summarize the trade-off between the True Positive Rate (TPR, i.e. how many samples were correctly predicted to be ‘positive’) and False Positive Rate (FPR, i.e. how many samples were incorrectly predicted to be ‘positive’). The highest score that can be achieved is 1.
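
A minimal sketch with hypothetical predictions, using scikit-learn:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]           # actual classes
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the 'positive' class

print(roc_auc_score(y_true, y_score))  # 0.75
```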

More information available here.

Confounding variable

Also known as common causes or controls, confounders are variables that have a causal effect on both the Treatment variable and the Outcome variable.

More information available here.

Confusion Matrix

An \(N \times N\) table summarizing the number of correct and incorrect predictions made by a classification model, where \(N\) is the number of classes. Hence, the table can be used to determine whether a model has a tendency to predict the wrong class for a particular class, thereby confusing these classes with one another.
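
For example (a minimal sketch with hypothetical labels, using scikit-learn):

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "dog", "cat", "cat", "dog", "bird"]  # actual classes
y_pred = ["cat", "dog", "dog", "cat", "dog", "bird"]  # model predictions

# Rows correspond to the actual class, columns to the predicted class
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))
```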

More information available here.

Cross-Validation

A method that repeatedly re-samples portions of the data for training and testing, and can be used to estimate a model’s generalization capability.

Specifically, the data is split into a number of folds, where one fold is used for validation and the rest of the folds are used for training. This process is repeated such that each fold is used for validation exactly once, and the validation results are then aggregated over all folds.
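
A minimal sketch using scikit-learn (the dataset and model are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is used once for validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```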

More information available here.

Degrees Of Freedom

The maximum number of logically independent variables that are free to vary.

More information available here.

Double Machine Learning

Double machine learning is a process that first generates two predictive models: one to predict the outcome from the confounding variables (also known as common causes), and one to predict the treatment from the confounders. These two models are then combined to create a model of the treatment effects.

This is done to avoid biases due to spurious correlations: the common causes are first disassociated from both the treatment variables and the outcome variable, and the association between the disassociated treatment and outcome variables is then determined.
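
The following is a minimal sketch of the general idea (residual-on-residual regression with synthetic data and scikit-learn models); it illustrates the concept only and is not Actable AI’s implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical data: X = confounders (common causes), t = treatment, y = outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
t = X[:, 0] + rng.normal(size=1000)            # treatment depends on the confounders
y = 2.0 * t + X[:, 0] + rng.normal(size=1000)  # outcome depends on treatment and confounders

# Step 1: predict the outcome and the treatment from the confounders
# (out-of-fold predictions avoid overfitting bias)
y_hat = cross_val_predict(RandomForestRegressor(), X, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(), X, t, cv=5)

# Step 2: regress the outcome residuals on the treatment residuals;
# the slope estimates the treatment effect (about 2.0 here)
effect_model = LinearRegression().fit((t - t_hat).reshape(-1, 1), y - y_hat)
print(effect_model.coef_[0])
```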

More information available at:

F-score

A measure of accuracy based on the Precision and Recall metrics.

The F1-score is the harmonic mean of the precision and recall. The more generic F-score applies additional weights, valuing one of precision or recall more than the other.

The maximal F-score is 1.0, corresponding to perfect precision and recall, while the lowest value is 0, attained when either precision or recall is zero.
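
A minimal sketch with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

p = precision_score(y_true, y_pred)  # 4 / 5 = 0.8
r = recall_score(y_true, y_pred)     # 4 / 5 = 0.8
print(2 * p * r / (p + r))           # harmonic mean (F1) = 0.8
print(f1_score(y_true, y_pred))      # same value

# Multi-class averaging is selected via the `average` parameter,
# e.g. f1_score(y_true, y_pred, average="macro")
```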

More information available here.

Various types of computation also exist in the case of multi-class classification (when there are more than two classes, as opposed to binary classification), namely macro, micro, and weighted averaging scores. More information available here.

Histogram

A graphical display of data using bars of differing heights, where taller bars represent larger values and each bar groups a range of numbers called a bin.
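
For example, the counts and bin boundaries can be computed with NumPy (hypothetical values):

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3, 4, 7, 8, 9]

# Group the values into 3 equal-width bins and count how many fall in each
counts, bin_edges = np.histogram(values, bins=3)
print(counts)     # number of values in each bin
print(bin_edges)  # the bin boundaries
```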

More information available here.

Hyperparameters

Parameters used to control the learning process when training a model. Examples include the learning rate and the batch size. Actable AI automatically tunes model hyperparameters and returns the model that maximizes performance according to the chosen evaluation metric.
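
As a generic illustration of hyperparameter tuning (a minimal sketch using scikit-learn’s grid search; this is not how Actable AI tunes models internally):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try several hyperparameter combinations and keep the best-performing model
search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 5, None]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```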

More information available here.

Individual Conditional Expectation (ICE)

Individual Conditional Expectation (ICE) plots complement the insights gained from PDP. While PDP gives an average view of the relationship between a feature and the predicted outcome, ICE plots enable the selection of a specific data point to observe how changing one feature impacts the prediction for that unique instance.

For example, consider a patient’s medical diagnosis model, where factors like age, weight, and blood pressure contribute to the prediction. With ICE, you can select a particular patient and see how their diagnosis changes as their weight varies, while keeping the other factors unchanged.

By exploring multiple ICE plots, you get a nuanced understanding of how the model’s predictions vary for individual cases. This level of detail is particularly helpful when dealing with diverse datasets, as ICE plots highlight the heterogeneity in relationships between features and predictions.
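
A minimal sketch using recent versions of scikit-learn, which can draw ICE curves directly (the dataset and feature are chosen only for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor().fit(X, y)

# One curve per sample, showing how that individual prediction changes
# as the 'bmi' feature is varied while the other features are kept fixed
PartialDependenceDisplay.from_estimator(model, X, features=["bmi"], kind="individual")
```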

More information can be found in the following resources:

Matthews Correlation Coefficient (MCC)

A measure of statistical accuracy that can be said to summarize the Confusion Matrix, by measuring the difference between the predicted and actual values.

The MCC score is high only if the predictions are good across all classes.
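
A minimal sketch with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# 1 is a perfect prediction, 0 is no better than random guessing, -1 is total disagreement
print(matthews_corrcoef(y_true, y_pred))
```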

More information available here and here.

Mean Absolute Error (MAE)

The average of the absolute differences between two sets of values. The absolute value of a number is its positive value (e.g. \(|-2| = 2\)).

For example, the MAE between a list of values \([8, 10, 5]\) and another list of values \([12, 13, 0]\) is \([|8-12| + |10-13| + |5-0|]/3 = [4 + 3 + 5]/3 = 4\).

The best value is 0.0.
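
The same computation in code (using NumPy and scikit-learn):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual    = np.array([8, 10, 5])
predicted = np.array([12, 13, 0])

print(np.mean(np.abs(actual - predicted)))     # 4.0
print(mean_absolute_error(actual, predicted))  # 4.0
```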

More information available here.

Median Absolute Deviation (MAD)

Also known as Median Absolute Error, this metric measures the median of the absolute differences between two sets of values, thereby capturing the ‘middle’ degree of variation; it is thus less sensitive to outliers than other metrics such as the Mean Absolute Error (MAE). The absolute value of a number is its positive value (e.g. \(|-2| = 2\)).

For example, the MAD between a list of values \([8, 10, 5]\) and another list of values \([12, 13, 0]\) is \(median(|8-12|, |10-13|, |5-0|) = median(4, 3, 5) = 4\).

The best value is 0.0.
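
The same computation in code (using NumPy and scikit-learn):

```python
import numpy as np
from sklearn.metrics import median_absolute_error

actual    = np.array([8, 10, 5])
predicted = np.array([12, 13, 0])

print(np.median(np.abs(actual - predicted)))     # 4.0
print(median_absolute_error(actual, predicted))  # 4.0
```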

More information available here.

Mean Squared Error (MSE)

The average of the squared differences between two sets of values.

For example, the MSE between a list of values \([8, 10, 5]\) and another list of values \([12, 13, 0]\) is \(\frac{(8-12)^2 + (10-13)^2 + (5-0)^2}{3} = \frac{[16 + 9 + 25]}{3} = 16.67\).

The best value is 0.0.
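
The same computation in code (using NumPy and scikit-learn):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual    = np.array([8, 10, 5])
predicted = np.array([12, 13, 0])

print(np.mean((actual - predicted) ** 2))     # 16.67
print(mean_squared_error(actual, predicted))  # 16.67
```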

More information available here.

Null Hypothesis

The null hypothesis represents the case in which there is no statistically significant difference between two possibilities, such that any observed differences have occurred only due to chance. It is rejected if the P-Value is less than or equal to a significance level known as the alpha value (\(\alpha\)).

For example, if \(\alpha\) is set to 0.05 (representing a 95% confidence level), then a \(p\)-value smaller than 0.05 indicates that the null hypothesis should be rejected, and that the observed differences are unlikely to have occurred due to chance alone.

More information available here.

Outcome variable

A variable representing the outcome of a model, as influenced by Treatment variables.

Also known as a dependent variable (a variable that depends on the values of other variables).

More information available here.

Partial Dependence Plot (PDP)

Partial Dependence Plots (PDP) are powerful visualization tools that help you understand how specific features in a model influence its predictions. With PDP, you can easily explore how changing one feature impacts the outcome while keeping all other factors constant.

Let’s say you have a predictive model that considers various factors to predict an outcome, such as housing prices based on features like location, size, and age. By using PDP, you can choose one feature, like ‘size’, and see how it affects the predicted price while keeping other factors fixed. PDP provides an average view, showing the trend in housing prices as the ‘size’ changes, helping you grasp the general relationship between size and price.

PDPs allow you to gain deeper insights into your model, understand feature importance, and make more informed decisions based on its behavior. It’s a handy tool for understanding complex models and extracting valuable knowledge from them.

PDPs can be thought of as the average of ICE values, which are computed for each data sample rather than across all samples (as done in a PDP).
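
A minimal sketch using recent versions of scikit-learn (the dataset and feature are chosen only for illustration); setting kind to "both" overlays the averaged PDP on the individual ICE curves that it averages:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor().fit(X, y)

# The averaged partial-dependence curve for 'bmi', overlaid on the
# per-sample ICE curves that it averages
PartialDependenceDisplay.from_estimator(model, X, features=["bmi"], kind="both")
```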

More information can be found in the following resources:

P-Value

The \(p\)-value is defined as the probability of obtaining results at least as extreme as the observed results, under the assumption that the Null Hypothesis is true. Hence, the \(p\)-value is used to determine if the Null Hypothesis should be rejected, with smaller values indicating that such extreme results would be very unlikely to occur under the null hypothesis and thereby increasing the likelihood that the null hypothesis is rejected.

The range of \(p\)-values lies between 0 (0%) and 1 (100%):

  • A value close to 1 indicates that the observed results are consistent with the null hypothesis, so there is insufficient evidence that the relationship being tested between the compared variables is valid.

  • A value close to 0 suggests that the null hypothesis should be rejected, indicating that the feature is likely to truly exhibit the observed relationship (and thus that the result did not occur by chance).

  • The threshold at which the null hypothesis is rejected or retained is known as the alpha value (\(\alpha\)) and is typically set to 0.05, meaning that the probability of achieving the same or more extreme results is 5% if the null hypothesis is assumed to be true. The null hypothesis is rejected if the \(p\)-value is less than or equal to the \(\alpha\) value.
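
A minimal sketch with hypothetical measurements, using SciPy’s two-sample \(t\)-test:

```python
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_b = [5.8, 6.1, 5.9, 6.0, 5.7, 6.2]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level
if p_value <= alpha:
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```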

More information available here.

Precision

A commonly used metric, answering the following question:

“Out of the class predictions (e.g. the ‘positive’ class), how many were actually of that class (positive)?”

Hence, precision determines how reliable a model is in detecting the class. A high precision means that, of all samples predicted with the label considered (e.g. positive), most truly belong to that class.
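
A minimal sketch with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1]

# 4 samples were predicted 'positive', of which 2 are truly positive:
# precision = TP / (TP + FP) = 2 / 4 = 0.5
print(precision_score(y_true, y_pred))
```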

Values lie between 0 and 1.0, where a value of 1.0 corresponds to a perfect result.

Related to accuracy, recall, and the F-score.

More information available here.

Various types of computation also exist in the case of multi-class classification (when there are more than two classes, as opposed to binary classification), namely macro, micro, and weighted scores. More information available here.

Quantile

Quantiles are values that split sorted data or a probability distribution into equal parts. In general, a \(q\)-quantile divides sorted data into \(q\) parts.

One of the most common types of quantile is the quartile (4-quantile), consisting of three values that split the data into four parts. The three quartile values can be described as follows:

  • Q1/first quartile/lower quartile: the number halfway between the lowest number and the middle number.

  • Q2/second quartile/median: the middle number halfway between the lowest number and the highest number.

  • Q3/third quartile/upper quartile: the number halfway between the middle number and the highest number.
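
For example, the quartiles of a small set of values can be computed with NumPy (hypothetical data):

```python
import numpy as np

data = [3, 5, 7, 8, 12, 13, 14, 18, 21]

# The three quartiles split the sorted data into four equal parts
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])
print(q1, q2, q3)
```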

More information available here.

R-Squared (R2)

Also known as the coefficient of determination, R-Squared indicates the extent to which a target variable is predictable from the predictor variables.

A value of 0 means that the predictors provide no predictability of the target, while a value of 1 means the target is fully predictable from the predictors.
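
A minimal sketch with hypothetical values, using scikit-learn:

```python
from sklearn.metrics import r2_score

actual    = [3.0, 2.5, 4.0, 7.0]
predicted = [2.8, 2.6, 4.2, 6.9]

# 1 means the target is fully predictable, 0 means no predictability
print(r2_score(actual, predicted))
```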

More information available here.

Recall

Also known as sensitivity, recall is a commonly used metric that answers the following question:

“How often is the class (e.g. the ‘positive’ class) being detected?”

Hence, recall measures the percentage of samples of the class considered (e.g. the positive class) that are correctly identified, and is used to select the best model when there is a high cost associated with false negatives (when a model mistakenly predicts the class to be ‘negative’).
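
A minimal sketch with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1]

# 3 samples are truly 'positive', of which 2 were detected:
# recall = TP / (TP + FN) = 2 / 3 ≈ 0.67
print(recall_score(y_true, y_pred))
```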

Values lie between 0 and 1.0, where a value of 1.0 corresponds to a perfect result.

Related to accuracy, precision, and the F-score.

More information available here.

Various types of computation also exist in the case of multi-class classification (when there are more than two classes, as opposed to binary classification), namely macro, micro, and weighted scores. More information available here.

Receiver Operating Characteristics (ROC) Curve

A graph of the True Positive Rate (TPR, i.e. how many samples were correctly predicted to be ‘positive’) versus the False Positive Rate (FPR, i.e. how many samples were incorrectly predicted to be ‘positive’), for different classification thresholds.
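
A minimal sketch with hypothetical predictions, using scikit-learn:

```python
from sklearn.metrics import roc_curve

y_true  = [0, 0, 1, 1]           # actual classes
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the 'positive' class

# FPR and TPR at each threshold; plotting TPR against FPR gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr, thresholds)
```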

More information available here.

Root Mean Squared Error (RMSE)

The square root of the Mean Squared Error (MSE).

For example, the RMSE between a list of values \([8, 10, 5]\) and another list of values \([12, 13, 0]\) is

\[\sqrt{\frac{(8-12)^2 + (10-13)^2 + (5-0)^2}{3}} = \sqrt{\frac{16 + 9 + 25}{3}} = \sqrt{16.67} = 4.08\]
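
The same computation in code (using NumPy and scikit-learn):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual    = np.array([8, 10, 5])
predicted = np.array([12, 13, 0])

print(np.sqrt(np.mean((actual - predicted) ** 2)))     # 4.08
print(np.sqrt(mean_squared_error(actual, predicted)))  # 4.08
```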

More information available here.

Shapley Values

Shapley values have their roots in game theory, where they are used to estimate the contribution of each player to a final result. The Shapley value of a player is the average of the differences in the results with and without that player, across different coalitions with other players.

In the case of machine learning, in order to estimate the contribution of each feature to a model’s decision, two models can be built for every combination of the other features: one with the feature under consideration, and one without it. The Shapley value of a feature is then the average of the differences in the two models’ predictions, thereby estimating the contribution of the feature to the result.
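
A minimal sketch assuming the third-party shap package is available (the dataset and model are chosen only for illustration, and this is not necessarily how Actable AI computes its values):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor().fit(X, y)

# Each value estimates how much a feature contributed to a single prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:10])
print(shap_values.shape)  # (10 samples, number of features)
```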

More information available here.

Spearman’s Rank-Order Correlation Coefficient (SROCC)

A nonparametric measure of the strength and direction of association between the rankings of two variables.

Another type of correlation measure that is often used, Pearson’s correlation, assesses linear relationships. SROCC assesses monotonic relationships (whether linear or not), with the following properties:

  • Values lie between -1 and 1.

  • A value of 0 indicates no correlation, while higher magnitudes correspond to stronger correlations (the maximal magnitude of 1 indicates that the variables are perfectly correlated). The following guide can be used to determine the strength of the relationship:

    • Between 0.00 and 0.19: “Very Weak”

    • Between 0.20 and 0.39: “Weak”

    • Between 0.40 and 0.59: “Moderate”

    • Between 0.60 and 0.79: “Strong”

    • Between 0.80 and 1.00: “Very Strong”

  • The sign (positive or negative) indicates the direction of the relationship.

  • Example: if the SROCC value of a feature (with the target) is equal to -0.8 and the SROCC of another feature (with the target) is 0.8, then the two features are equally correlated with the target (with a “very strong” magnitude of 0.8). However, in the case of the first feature, the negative sign indicates that its values tend to increase when the values of the target decrease (and vice versa). On the other hand, the values of the target and the second feature tend to vary in the same direction (i.e. the values either both increase or both decrease).
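
A minimal sketch with hypothetical values, using SciPy:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7]
y = [3, 5, 4, 8, 7, 9, 12]   # tends to increase with x, but not perfectly

# rho close to +1 or -1 indicates a strong monotonic relationship
rho, p_value = stats.spearmanr(x, y)
print(rho, p_value)
```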

More information available here.

Testing set

The part of the dataset that is used to test a trained model’s performance.

Normally, a dataset is divided into three distinct subsets: the Training set, the Validation set, and the Testing set.

Each example in the dataset should belong to only one of these subsets.

More information available here.

Training set

The part of the dataset that is used to train a model to do the required task. Usually, a Validation set is used while training to evaluate the performance on a set of data that is unseen by the model. This ensures that the model is generalizing well to unseen data, and is not over-fitting to the training dataset.

Normally, a dataset is divided into three distinct subsets: the Training set, the Validation set, and the Testing set.

Each example in the dataset should belong to only one of these subsets.
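
A common way of creating the three subsets (a minimal sketch using scikit-learn; the split fractions are only an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the testing set, then split the remainder into training and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30 samples
```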

More information available here.

Treatment variable

A variable that affects the outcome variable. There may be multiple treatment variables that influence the outcome.

Also known as an independent variable.

More information available here.

Validation set

The part of the dataset that is used to verify the reliability of a trained model. It is also used in the performance tuning stage if performance optimization is selected. Once performance is satisfactory on the validation set, a model is then evaluated on the Testing set.

Normally, a dataset is divided into three distinct subsets: the Training set, the Validation set, and the Testing set.

Each example in the dataset should belong to only one of these subsets.

More information available here.