Data Cleanse & Imputation

Introduction

Data incompleteness is one of the most common issues in the big data world. This analytics detects empty and invalid fields, then suggest fixes for those fields.

Parameters

There are three types of fixes:

  • Misplaced: This rule applies to each row of the data set to detect rows with invalid values. For example, for a data table with the information of apartments, a realistic constraint is that apartments have at least 2 bedrooms. One can define a rule to detect invalid rows as in the below figure. Rows with invalid number_of_rooms will be detected and fixed with predictions based on other values in the row. Besides concrete values, one can use regex match to identify the wrong values. The code will automatically surround your input regex with ^ and $ to avoid incorrect matching cells.
_images/misplaced_rule.png
  • Validation: This rule can apply to multiple rows by comparing fields from different rows. For example, for a data table with the information of hospitalisations, hospitals with the same names must have the same phone numbers. One can define a rule to detect invalid rows as below, which will be interrupted as the following statement by the algorithm: when HospitalName equals, if PhoneName not equal, either HospitalName or PhoneName (or both) is wrong. The algorithm would try to fix PhoneName first, but it failed to use a value to match the criteria, it will try to modify HospitalName too. You can put as many Statement Part as possible, but the algorithm might not work well because the data matched in your dataset is too little.
_images/validation_rule.png
  • Null: By checking this rule, the algorithm will try to generate suggestions for missing values.
_images/null_rule.png

Case Study

Imagine we have a collection of patient treatment records. However, several typo errors caused the final data to be incorrect. For example, the city of the hospital is supposed to be birmingham but typed as birmingxam, the phone number for the hospital should be 2565938310 but recorded as 2x6x938310.

To fix this problem, we can use the validation rule and set the parameter like the following based on the knowledge that when the hospital name is the same, their city and phone number should be the same.

_images/case_study.png

Review Result

After running the fix, the fixed data table will be shown with corrected values highlighted in yellow (as in bellow figure). To show the values before correction, one can hover over the cell and the incorrect values will be shown inside brackets in a tooltip. The following example shows that after configuring the validation rule, the algorithm is able to detect the wrong PhoneNumber and fix the value from 334793870x to 3347938701.

_images/imputation_result.png