ACML2016 Tutorial on Mass Estimation: Enabling density-based or distance-based algorithms to do what they cannot do

by Kai Ming Ting

 

Download tutorial slides

 

Abstract:

This tutorial provides an overview of mass estimation, an alternative data modelling mechanism to density estimation; and details how it can overcome fundamental weaknesses of density-based or distance-based algorithms to enable them to do what they cannot do previously.

Mass estimation is attractive because the basic measure, mass, is not only more fundamental than density, but also more versatile―mass can be used to do density estimation, as a means for subspace selection and to find multi-dimensional median, and can be extended to measure dissimilarity of any two points. Example advantages of mass over density or distance are given as follows:

·        DEMass―Density estimator based on mass―runs orders of magnitude faster than kernel and kNN density estimators

·        Mass has been used, in place of density, as an effective means for subspace selection.

·        Half-space mass is the maximally robust and efficient method to find multi-dimensional median. Existing methods such as data depth are either less robust or computationally more expensive.

·        Simply replacing mass-based dissimilarity (a data dependent measure) with distance measure (a data independent measure) overcomes key weaknesses of density-based and distance-based methods in clustering, anomaly detection, information retrieval and classification.

This tutorial draws upon recent work on mass estimation and previous work which was also mass-based but was incorrectly categorised as density-based.

 

Overview:

Machine learning traditionally investigates techniques to extract knowledge automatically from data of moderate size. Most of the techniques cannot handle big data because the dominant existing paradigm for data mining is based on density estimation. Most distance-based, similarity-based and probabilistic-based algorithms ultimately rest upon density estimation as their basic building block.

This paradigm has two key constraining weaknesses: (1) the density estimation as their core modelling mechanism is computationally expensive in terms of runtime and memory space requirement. As such, density-based algorithms cannot be applied to (a) big data, including data streams which potentially have infinite data, and (b) high dimensional problems. Existing research has focused on reducing the runtime and memory space requirement from quadratic in the input data size to near linear; and that is the limit it can achieve within the existing paradigm. (2) The sole use of geometric positions in the feature space to compute dissimilarity of two points; and data distribution has no influence in the process. This is seldom recognised as the root cause of the density-based methods’ inability to accomplish certain important tasks, despite the revelation by psychologists in the 70’s. As a result, complex mitigation methods have been suggested to overcome the manifested algorithmic weaknesses on the surface.

Mass offers to overcome these fundamental weaknesses from the root cause. This tutorial introduces the basic idea of mass and its characteristics in contrast to density and distance; and provides specific examples of its advantages, reported in the recent works, which enable existing density-based and distance-based to perform tasks that otherwise impossible. 

Assumed prior knowledge of potential audience: Basic knowledge of machine learning.

 

Brief Bio of the instructor:

After receiving his PhD from the University of Sydney, Kai Ming Ting had worked at the University of Waikato, Deakin University and Monash University. He joins Federation University Australia since 2014. He had previously held visiting positions at Osaka University, Nanjing University, and Chinese University of Hong Kong. His current research interests are in the areas of mass estimation, anomaly detection, ensemble approaches, data streams, data mining and machine learning in general. He has served as a program committee co-chair for the Twelfth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2008). He was a member of the program committee for a number of international conferences including ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, and International Conference on Machine Learning. He has received research funding from Australian Research Council, US Air Force of Scientific Research (AFOSR/AOARD), Toyota InfoTechnology Center, and Australian Institute of Sports. Awards received include the Runner-up Best Paper Award in 2008 IEEE ICDM (for Isolation Forest), and the Best Paper Award in 2006 PAKDD. He is the creator of isolation techniques, mass estimation and mass-based dissimilarity.

 

Back to ACML homepage