Unveiling Data Preprocessing: Tackling NaN Values and Outliers



Data preprocessing is a critical phase of data analysis, because every later result depends on the quality of the input data. This article addresses two common preprocessing challenges: NaN values and outliers. By the end, you should be able to identify both, choose a strategy for handling them, and understand how those choices affect the accuracy of your analysis.

Introduction

Data preprocessing refines raw data for analysis. It encompasses cleaning, transforming, and organizing data so that it can be interpreted accurately. Two obstacles come up in almost every data set: NaN values and outliers.

NaN Values

NaN (Not a Number) is the marker commonly used to represent missing values, that is, data points with no meaningful entry. NaN values can arise from measurement or entry errors, incomplete data collection, or merging data sets whose records do not fully match.

Strategies to Handle NaN Values

Removal

Removing rows or columns with NaN values is an option, but this approach can lead to loss of valuable data.

Imputation

Imputation involves filling NaN values with estimated or calculated values. Techniques include mean, median, mode imputation, and predictive modeling.
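
As a minimal sketch of mean and median imputation in MATLAB, the snippet below fills the single NaN in a small made-up vector. The data and the variable names (x, x_mean, x_median) are illustrative assumptions, and the 'omitnan' option assumes MATLAB R2015a or later.

```matlab
% Illustrative data (hypothetical): one missing value marked as NaN
x = [2 4 NaN 8 30];

% Statistics computed from the observed (non-NaN) values only
m_mean   = mean(x, 'omitnan');     % (2 + 4 + 8 + 30) / 4 = 11
m_median = median(x, 'omitnan');   % median of [2 4 8 30] = 6

% Mean imputation: replace each NaN with the mean
x_mean = x;
x_mean(isnan(x)) = m_mean;         % [2 4 11 8 30]

% Median imputation: replace each NaN with the median
x_median = x;
x_median(isnan(x)) = m_median;     % [2 4 6 8 30]
```

In this example the mean is pulled upward by the large value 30, while the median is not, which is one reason median imputation is often preferred for skewed data.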

Outliers are the second challenge. They can hold valuable insights or be anomalies that skew results. The following sections explain what outliers are, why they matter, and how to detect and remove them.

Outliers

An outlier is an observation that deviates significantly from the rest of the data set, usually because its value is exceptionally high or low. Imagine a classroom where most students score around 70, but one scores 95: that score is an outlier. Such points can reveal something genuinely unusual or distort results, so it is important to understand their implications and have a strategy for handling them.

Significance of Outliers

Outliers can indicate data variability, errors, or rare occurrences. They can impact statistical measures like mean and standard deviation, potentially misleading interpretations.
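
To make the effect on summary statistics concrete, here is a small sketch using made-up exam scores. The numbers and variable names are assumptions chosen for illustration only.

```matlab
% Illustrative exam scores (hypothetical data)
scores       = [68 70 72 69 71];
with_outlier = [scores 95];          % same scores plus one extreme value

mean_before   = mean(scores);        % 70
mean_after    = mean(with_outlier);  % 445 / 6, about 74.17
median_before = median(scores);      % 70
median_after  = median(with_outlier);% 70.5
std_before    = std(scores);
std_after     = std(with_outlier);   % much larger than std_before
```

A single extreme score shifts the mean by more than four points and inflates the standard deviation, while the median moves by only half a point.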

Methods to Detect Outliers

Visual Inspection

One of the simplest methods involves plotting data visually and identifying points that appear distant from the majority.

Z-Score Method

A Z-score measures how many standard deviations a data point lies from the mean: z = (x - mean) / standard deviation. A large absolute Z-score, commonly above 3, indicates a potential outlier.
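
The Z-score check can be sketched in a few lines of MATLAB. The data below is made up, and the threshold of 3 is a common convention rather than a fixed rule.

```matlab
% Illustrative data (hypothetical): 15 typical readings and one extreme value
x = [10 12 11 13 12 11 10 12 13 11 12 11 10 13 12 50];

z = (x - mean(x)) / std(x);      % Z-score of every element
threshold = 3;                   % assumed cut-off; adjust to your data
is_outlier = abs(z) > threshold; % logical mask of flagged points

idx = find(is_outlier);          % 16, the position of the value 50
flagged = x(is_outlier);         % 50
```

Keep in mind that the extreme value itself inflates the standard deviation. With ten or fewer points and the sample standard deviation, no point can reach an absolute Z-score above 3, so this method only makes sense for reasonably large samples.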

Interquartile Range (IQR)

The IQR is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). Points more than 1.5 times the IQR below Q1 or above Q3 are considered outliers.
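
A minimal sketch of the IQR rule follows, again on made-up data. It assumes the quantile function is available; on older MATLAB releases this may require the Statistics and Machine Learning Toolbox.

```matlab
% Illustrative data (hypothetical): 15 typical readings and one extreme value
x = [10 12 11 13 12 11 10 12 13 11 12 11 10 13 12 50];

q = quantile(x, [0.25 0.75]);    % q(1) = Q1, q(2) = Q3
iqr_x = q(2) - q(1);             % interquartile range

lower = q(1) - 1.5 * iqr_x;      % lower limit
upper = q(2) + 1.5 * iqr_x;      % upper limit

is_outlier = x < lower | x > upper;
flagged = x(is_outlier);         % 50
```

Because quartiles are barely affected by a single extreme value, this rule still works on small samples where the Z-score method struggles.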

Box Plots

Box plots graphically display data's spread and outliers by showing the distribution quartiles and individual data points.

Standard Deviation

A common rule of thumb uses the standard deviation directly: a value that lies more than three standard deviations from the mean is treated as an outlier. This is the same criterion as a Z-score threshold of 3.

Methods to Remove Outliers

Z-Score Method

By removing data points with Z-scores beyond a certain threshold, outliers can be effectively eliminated.

IQR Method

Outliers are identified using the IQR and then removed from the dataset.

Tukey's Fences

Tukey's fences set boundaries at Q1 - k × IQR and Q3 + k × IQR, most commonly with k = 1.5. Data points beyond these fences are considered outliers and can be removed.
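
One way to apply Tukey's fences and drop the flagged points is the built-in isoutlier function with the 'quartiles' method. This is a sketch that assumes MATLAB R2017a or later; the data and the multiplier k are illustrative.

```matlab
% Illustrative data (hypothetical): 15 typical readings and one extreme value
x = [10 12 11 13 12 11 10 12 13 11 12 11 10 13 12 50];

k = 1.5;   % fence multiplier; 1.5 is the usual choice, 3 flags only extreme points

% Flag elements more than k interquartile ranges beyond the quartiles
tf = isoutlier(x, 'quartiles', 'ThresholdFactor', k);

x_clean = x(~tf);                        % keep only the points inside the fences
n_removed = numel(x) - numel(x_clean);   % 1
```

Raising k makes the rule more conservative, which can help when rare but legitimate values should be kept.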

Impact of Outliers on Data Analysis

Outliers can skew statistical measures and affect the interpretation of results. Identifying and handling outliers is crucial for accurate analysis.

Balancing Outliers and Context

While removing outliers can enhance data accuracy, it's essential to consider the context. Outliers might be indicative of rare but significant occurrences.


Programming Example

How to Treat NaN Values:

Consider a 3-by-3 matrix with its center element set to NaN.

>> a = magic(3); a(2,2) = NaN

This returns:

a =
     8     1     6
     3   NaN     7
     4     9     2


Now compute the sum of each column of the matrix.

>> sum(a)

ans =
    15   NaN    15

Any arithmetic operation that involves a NaN value propagates that NaN to the result, which is why the sum of the second column is NaN.

Before performing statistical calculations, it is recommended to remove NaN values from the data. One way is to locate them with the isnan function.

The following command deletes every row that contains at least one NaN:

>> a(any(isnan(a), 2), :) = [];

Run the sum command again:

>> sum(a)

ans =
    12    10     8

Only the rows without a NaN value contribute to the result.
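
Deleting a whole row also discards the valid values 3 and 7 that shared a row with the NaN. As an alternative sketch, assuming MATLAB R2015a or later, the 'omitnan' option skips only the NaN entries themselves:

```matlab
% Rebuild the example matrix with a NaN in the center
a = magic(3);
a(2,2) = NaN;

% Column sums and means that ignore NaN entries instead of deleting rows
col_sums  = sum(a, 'omitnan');    % [15 10 15]
col_means = mean(a, 'omitnan');   % [5 5 5]
```

Here the second column is summed over its two remaining values, so no valid data is lost.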


How to Treat Outliers:

Similar to NaNs, outliers or misplaced data points can also be removed from a data set. Let's load count.dat in MATLAB. This file contains vehicle traffic counts at 3 locations, one per column. First, we compute the mean and standard deviation of each column:

>> load count.dat
>> mu = mean(count)
>> sigma = std(count)

mu =
   32.0000   46.5417   65.5833

sigma =
   25.3703   41.4057   68.0281

The number of values in each column that lie more than three standard deviations from the column mean is obtained with:

>> [n,p] = size(count);
>> outliers = abs(count - mu(ones(n, 1),:)) > 3*sigma(ones(n, 1),:);
>> nout = sum(outliers)

nout =
     1     0     0

There is one outlier in the first column. Remove the entire observation (row) with:

>> count(any(outliers, 2), :) = [];

If we now repeat the check on the reduced data, still using the mu and sigma computed earlier:

>> [n,p] = size(count);
>> outliers = abs(count - mu(ones(n, 1),:)) > 3*sigma(ones(n, 1),:);
>> nout = sum(outliers)

nout =
     0     0     0

As you can see, we have succeeded in removing the outliers from the data.
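
On recent MATLAB releases the same three-standard-deviation check can be written more compactly. The sketch below assumes R2017a or later for isoutlier and that the count.dat sample file is on the path. It should agree with the manual computation in this example, but that is worth verifying on your own installation.

```matlab
load count.dat                        % creates the variable count

% Flag values more than three standard deviations from their column mean
tf = isoutlier(count, 'mean');
nout = sum(tf);                       % outliers per column

% Drop every row that contains at least one flagged value
count_clean = count(~any(tf, 2), :);
```

Note that recomputing mu and sigma on the cleaned data and repeating the test can flag new points, so it is usually best to decide in advance whether to apply the rule once or iteratively.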




-myresearchxpress


Hi, I'm Asep Sandra, a researcher at BRIN Indonesia. I want to share all about data analysis and tools with you. Hopefully this blog will fulfill your needs.
