iKono Telecommunications

I recently obtained certification as a Big Data Scientist from Arcitura as part of the third call for transversal skills offered by the Colombian Ministry of ICT. One of the many topics that data science can address is the detection of atypical data, or outliers, which applies, for example, to investigating fraudulent banking transactions or finding defective parts on a production line.

There are different types of outliers:

Global outliers:

A data point that deviates from the rest of the data set regardless of any condition or context, for example, a fraudulent bank transaction.

Contextual outliers:

A data point that is inconsistent only within a specific context or condition. For example, a person of average weight is atypical in a sumo wrestling competition.

Collective outliers:

A group of data points that is inconsistent only when the points are considered together, even though each point may look normal on its own. For example, multiple small-value deposits from multiple bank accounts may indicate money laundering when analyzed collectively.

The model for detecting outliers can be based on statistics or machine learning algorithms.

Statistical Techniques

Statistical techniques work by fitting a distribution to the data. Once the distribution is known, values that fall in its low-probability regions can be identified as outliers.

Parametric approach

It is assumed that the data-generating process produces data that fit a particular probability distribution, such as the normal distribution. Using the corresponding probability density function, the probability of a value can be determined.

This process requires the estimation of the distribution parameters such as the mean and standard deviation.

A common test for univariate analysis of continuous data is the z-score, z = (x - mean) / standard deviation. For normally distributed data, any value whose z-score is greater than 3 in absolute value, that is, any value more than three standard deviations from the mean in either direction, is considered an outlier.
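As a minimal sketch of this test, the Python snippet below computes a z-score for each value and flags those beyond three standard deviations. The function name `zscore_outliers`, the sample readings, and the choice of the sample standard deviation are illustrative assumptions, and the test is only meaningful when the data is roughly normal.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sample: 19 readings near 10 and one extreme value
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
print(zscore_outliers(readings))  # [50]
```

Note that the extreme value itself inflates the mean and the standard deviation, so very small samples may never produce a z-score above 3. The test needs a reasonable number of observations to be useful.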

Nonparametric approach

In this approach, probability distributions are not assumed. The data-generating process is modeled solely based on the data produced.

A common nonparametric technique for identifying outliers in discrete data is the histogram. If a data point does not fall into any of the populated intervals (bins) of the plot, or falls into one with very few observations, it is considered an outlier.
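The histogram idea could be sketched as follows: count how many values land in each equal-width bin and flag the values that sit in sparsely populated bins. The bin width, the `min_count` threshold, and the function name are assumptions chosen for illustration, and in practice both parameters would need tuning for the data at hand.

```python
from collections import Counter

def histogram_outliers(values, bin_width, min_count=2):
    """Flag values that fall into bins holding fewer than min_count values."""
    def bin_of(v):
        # Index of the equal-width bin that contains v
        return int(v // bin_width)

    counts = Counter(bin_of(v) for v in values)
    return [v for v in values if counts[bin_of(v)] < min_count]

# Hypothetical discrete data: most values are small, one is isolated
data = [1, 2, 2, 3, 3, 3, 4, 4, 6, 7, 19]
print(histogram_outliers(data, bin_width=5))  # [19]
```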

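The interquartile range (IQR) can also be considered a nonparametric statistical technique for detecting outliers. A common rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.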
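A short sketch of an IQR-based check is shown below. It uses the widely used 1.5 × IQR fences, which is a convention assumed here and not something prescribed above. The function name and sample values are hypothetical, and `statistics.quantiles` uses its default "exclusive" method.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one value far above the rest
samples = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 40]
print(iqr_outliers(samples))  # [40]
```

Unlike the z-score, this check does not assume a normal distribution, and the quartiles are barely affected by the extreme value itself.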

Machine Learning Algorithms

Distance-based techniques

These techniques are based on the assumption that in a multidimensional space, data points can be considered normal when they are close to each other. Outliers are those that are far from these normal points.

With clustering

An outlier is a point that is not part of a cluster or is part of a cluster that is small and far from the other clusters.

  • k-means: Effective only for finding global outliers.
  • CBLOF (Cluster Based Local Outlier Factor): This technique can be used to detect groups of outliers.
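
To illustrate the k-means idea (not CBLOF), the sketch below runs a bare-bones k-means and then flags points that lie far from their nearest centroid. The starting centroids, the distance threshold, and the sample points are all assumptions for the example. A real project would more likely rely on a library implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Very small k-means (Lloyd's algorithm) with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if the cluster is empty)
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def centroid_distance_outliers(points, centroids, threshold):
    """Flag points whose distance to the nearest centroid exceeds the threshold."""
    return [p for p in points
            if min(math.dist(p, c) for c in centroids) > threshold]

# Hypothetical 2-D data: two tight groups and one stray point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
centroids = kmeans(points, [(0, 0), (10, 10)])
print(centroid_distance_outliers(points, centroids, threshold=5.0))  # [(5, 20)]
```

The stray point still pulls its centroid toward itself, which is one reason this simple approach only catches points that are far from every cluster.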

Without clustering

With this technique, it is not necessary to create clusters, since each point is evaluated individually based on its distance from its nearest neighbors.

  • k-NN (k-Nearest Neighbors): A score is generated for each point using a number (k) of nearest neighbors.
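One way to sketch such a score is the average distance from each point to its k nearest neighbors, where a higher score means a more isolated point. The choice of the average (as opposed to, say, the distance to the k-th neighbor), the value of k, and the sample points are assumptions for illustration.

```python
import math

def knn_scores(points, k):
    """Score each point by the mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical 2-D data: two tight groups and one stray point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
scores = knn_scores(points, k=3)
most_isolated = points[max(range(len(points)), key=scores.__getitem__)]
print(most_isolated)  # (5, 20)
```

This brute-force version compares every pair of points, so it is only practical for small data sets.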

Supervised technique

This technique is based on a learning approach in which some known examples of atypical data are provided to the algorithm to develop an outlier detection model.

A one-class classification model is built that models only the normal examples, so any instance that does not belong to the normal class is considered an outlier.
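The one-class idea can be sketched with a deliberately simple model: learn the center of the normal examples and a radius around it, then treat anything outside that radius as an outlier. The class `CentroidOneClassModel` and its `slack` parameter are hypothetical, and this is a crude stand-in for real one-class classifiers, not a description of any particular library.

```python
import math

class CentroidOneClassModel:
    """Toy one-class model: a sphere fitted around the normal examples only."""

    def __init__(self, slack=1.5):
        # Multiplier that widens the learned radius (an illustrative assumption)
        self.slack = slack

    def fit(self, normal_points):
        dims = list(zip(*normal_points))
        self.center = tuple(sum(d) / len(d) for d in dims)
        self.radius = self.slack * max(
            math.dist(p, self.center) for p in normal_points
        )
        return self

    def is_outlier(self, point):
        # Anything outside the sphere does not belong to the normal class
        return math.dist(point, self.center) > self.radius

# Hypothetical training set containing normal examples only
model = CentroidOneClassModel().fit([(0, 0), (0, 1), (1, 0), (1, 1)])
print(model.is_outlier((0.6, 0.4)))  # False
print(model.is_outlier((8, 8)))      # True
```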

Semi-supervised technique

Clustering is first used to create natural clusters, and a one-class classification algorithm is then applied. The one-class algorithm labels the unlabeled instances in each cluster based on the instances already classified within that cluster.

Any instance or cluster that does not belong to any normal class is considered an outlier.
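A heavily simplified sketch of this flow is shown below. It assumes the clustering step has already produced a cluster id for every instance and that a handful of instances are already labeled as normal. The function name, the cluster ids, and the labeled indices are all hypothetical.

```python
def flag_unlabeled_outliers(cluster_of, labeled_normal):
    """Spread the 'normal' label within clusters and flag everything else.

    cluster_of     -- cluster id for each instance (from a prior clustering step)
    labeled_normal -- indices of instances already known to be normal
    """
    # A cluster counts as normal if it contains at least one labeled normal instance
    normal_clusters = {cluster_of[i] for i in labeled_normal}
    # True marks an instance whose cluster belongs to no normal class
    return [cid not in normal_clusters for cid in cluster_of]

# Hypothetical result of clustering six instances into clusters 0, 1 and 2
cluster_of = [0, 0, 0, 1, 1, 2]
flags = flag_unlabeled_outliers(cluster_of, labeled_normal=[0, 3])
print(flags)  # [False, False, False, False, False, True]
```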

As we have seen, there are many techniques in data science for detecting outliers. The technique used in each case will depend on various factors, such as the type of variable (discrete or continuous), the number of variables involved (univariate or multivariate analysis), and the nature of the data-generating process.
