ACML2016 Tutorial on Mass Estimation: Enabling density-based or distance-based algorithms to
do what they cannot do
by
Kai Ming Ting
Abstract:
This tutorial provides an overview of mass estimation, an alternative data modelling mechanism to density estimation, and details how it overcomes fundamental weaknesses of density-based and distance-based algorithms, enabling them to do what they could not do before.
Mass estimation is attractive because the basic measure, mass, is not only more fundamental than density but also more versatile: mass can be used for density estimation, as a means of subspace selection and to find the multi-dimensional median, and it can be extended to measure the dissimilarity of any two points. Example advantages of mass over density or distance include:
· DEMass, a density estimator based on mass, runs orders of magnitude faster than kernel and kNN density estimators.
· Mass has been used, in place of density, as an effective means of subspace selection.
· Half-space mass is a maximally robust and computationally efficient method for finding the multi-dimensional median. Existing methods such as data depth are either less robust or computationally more expensive.
· Simply replacing a distance measure (a data-independent measure) with mass-based dissimilarity (a data-dependent measure) overcomes key weaknesses of density-based and distance-based methods in clustering, anomaly detection, information retrieval and classification.
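The last point can be made concrete with a small sketch. The code below is my own illustrative simplification in the spirit of mass-based dissimilarity, not the published algorithm (all function names and parameters are assumptions): random axis-parallel partitioning trees are built over the data, and the dissimilarity of two points is the average relative mass (fraction of points) of the smallest tree region that contains both points. Unlike distance, the result depends on the data distribution around the pair.

```python
import random

def build_tree(idx, data, depth, rng, max_depth=8):
    """Recursively partition the data with random axis-parallel splits
    (isolation-tree style). Each node records how many points fall in it."""
    node = {"size": len(idx)}
    if depth < max_depth and len(idx) > 1:
        dim = rng.randrange(len(data[0]))
        vals = [data[i][dim] for i in idx]
        lo, hi = min(vals), max(vals)
        if lo < hi:
            split = rng.uniform(lo, hi)
            left = [i for i in idx if data[i][dim] < split]
            right = [i for i in idx if data[i][dim] >= split]
            node.update(dim=dim, split=split,
                        left=build_tree(left, data, depth + 1, rng, max_depth),
                        right=build_tree(right, data, depth + 1, rng, max_depth))
    return node

def region_mass(tree, x, y):
    """Mass of the smallest tree region that still contains both x and y."""
    node = tree
    while "split" in node:
        gx = x[node["dim"]] < node["split"]
        gy = y[node["dim"]] < node["split"]
        if gx != gy:          # the next split would separate x and y
            break
        node = node["left"] if gx else node["right"]
    return node["size"]

def mass_dissimilarity(x, y, trees, n):
    """Average relative mass of the smallest region covering x and y,
    taken over an ensemble of random partitioning trees."""
    return sum(region_mass(t, x, y) for t in trees) / (len(trees) * n)
```

A pair of identical points ends up in a small leaf (low dissimilarity), while a pair spanning the data is only covered by the root region (dissimilarity close to 1), and the measure is symmetric by construction.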
This tutorial draws upon recent work on mass estimation, as well as earlier work that was also mass-based but was incorrectly categorised as density-based.
Overview:
Machine learning has traditionally investigated techniques to extract knowledge automatically from data of moderate size. Most of these techniques cannot handle big data because the dominant existing paradigm for data mining is based on density estimation: most distance-based, similarity-based and probabilistic algorithms ultimately rest upon density estimation as their basic building block.
This paradigm has two key constraining weaknesses. (1) Density estimation, the core modelling mechanism, is computationally expensive in both runtime and memory. As a result, density-based algorithms cannot be applied to (a) big data, including data streams with potentially unbounded data, or (b) high-dimensional problems. Existing research has focused on reducing the runtime and memory requirements from quadratic in the input data size to near-linear, and that is the limit achievable within the existing paradigm. (2) The dissimilarity of two points is computed solely from their geometric positions in the feature space; the data distribution has no influence on the process. This is seldom recognised as the root cause of density-based methods' inability to accomplish certain important tasks, despite the revelation by psychologists in the 1970s. As a result, complex mitigation methods have been proposed that address only the surface manifestations of these algorithmic weaknesses.
Mass overcomes these fundamental weaknesses at their root. This tutorial introduces the basic idea of mass and its characteristics in contrast to density and distance, and provides specific examples of its advantages, reported in recent work, which enable existing density-based and distance-based algorithms to perform tasks that would otherwise be impossible.
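The basic idea can be previewed with a minimal one-dimensional sketch (my own simplification, not the tutorial's estimator; the function name is an assumption): for a random split s drawn over the data range, the mass of a point x is the number of sample points lying on the same side of s as x, averaged over many splits. Unlike a density estimate, which peaks at the mode, this mass profile peaks at the median, which hints at why mass can find the multi-dimensional median.

```python
import random

def mass_profile(data, x, n_splits=2000, rng=None):
    """Estimate the mass of point x in a 1-D sample.

    For each random split s between the sample min and max, the mass of x
    is the number of sample points on the same side of s as x; the
    function returns the average over n_splits random splits.
    """
    rng = rng or random.Random(0)
    lo, hi = min(data), max(data)
    total = 0
    for _ in range(n_splits):
        s = rng.uniform(lo, hi)
        # count the points that fall on the same side of the split as x
        total += sum(1 for v in data if (v <= s) == (x <= s))
    return total / n_splits
```

On a uniform sample, every split leaves at least half of the points on the median's side, so the profile is maximal at the median and falls off towards the extremes.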
Assumed prior knowledge of potential audience: basic knowledge of machine learning.
Brief bio of the instructor:
After receiving his PhD from the University of Sydney, Kai Ming Ting worked at the University of Waikato, Deakin University and Monash University. He joined Federation University Australia in 2014. He has previously held visiting positions at Osaka University, Nanjing University, and the Chinese University of Hong Kong. His current research interests are in the areas of mass estimation, anomaly detection, ensemble approaches, data streams, and data mining and machine learning in general. He served as a program committee co-chair of the Twelfth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2008). He has been a member of the program committee for a number of international conferences, including the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining and the International Conference on Machine Learning. He has received research funding from the Australian Research Council, the US Air Force Office of Scientific Research (AFOSR/AOARD), Toyota InfoTechnology Center, and the Australian Institute of Sport. Awards received include the Runner-up Best Paper Award at IEEE ICDM 2008 (for Isolation Forest) and the Best Paper Award at PAKDD 2006. He is the creator of isolation techniques, mass estimation and mass-based dissimilarity.