Mass Estimation

This site contains the source code for mass estimation and its applications in different data mining tasks.

Mass estimation is a core data modelling method that models data in terms of mass distribution, rather than density distribution, to solve various data mining problems. Like density estimation, mass estimation has been applied to a range of data mining tasks, e.g., anomaly detection, classification, clustering, information retrieval and regression. One advantage of mass-based algorithms is that they can run orders of magnitude faster than their existing density-based counterparts. Papers on mass estimation are listed in the References section below.

The table below lists the software currently available for the various data mining tasks. The software is written in Java and integrated into WEKA.

Task                    DEMass                                        Mass
Anomaly Detection       DEMass-LOF (JAR), LiNearN (JAR)               iForest, SCiForest, ReMass-iForest, HS-Tree
Classification          DEMass-Bayes (JAR)                            MassBayes (JAR)
Clustering              DEMass-DBSCAN (JAR), LiNearN-Cluster (JAR)    MassTER (JAR)
Information Retrieval   -                                             ReFeat, ReMass-ReFeat

Example Usage

All software can be run within the WEKA GUI framework unless stated otherwise.

Alternatively, the software can be run from the command line. The example below runs MassTER on a data set with 4 attributes (3 numeric attributes plus the class attribute). It constructs 1,000 trees with a maximum tree height of 6 (i.e., 3 attributes x 2 levels), where each tree is built from a random subset of 256 instances. The example assumes that all of the relevant JAR files are in the current directory. Note: '-c last' specifies that the last attribute is the class attribute.

java -classpath "*" weka.clusterers.MassTER -c last -t data_set.arff -A 3 -D -E 5 -H 2 -N 1000 -W 256
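To try the command above without an existing data set, a small ARFF file of the described shape (3 numeric attributes plus a class attribute in the last position) can be generated. The following is a minimal sketch using only the Java standard library. The class name ArffSketch, the attribute names x1 to x3, the class labels a and b, and the two synthetic groups of points are all assumptions made for illustration; they are not part of MassTER or WEKA.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

final class ArffSketch {
    /** Builds the lines of an ARFF file: 3 numeric attributes plus a nominal class. */
    static List<String> buildArff(int rows, long seed) {
        Random rng = new Random(seed);
        List<String> lines = new ArrayList<>();
        // Header: relation name, attribute declarations, then the data marker.
        lines.add("@relation data_set");
        lines.add("@attribute x1 numeric");
        lines.add("@attribute x2 numeric");
        lines.add("@attribute x3 numeric");
        lines.add("@attribute class {a,b}"); // class attribute is last ('-c last')
        lines.add("@data");
        for (int r = 0; r < rows; r++) {
            // Two synthetic groups: label a is centred at 0, label b at 5.
            boolean groupA = rng.nextBoolean();
            double shift = groupA ? 0.0 : 5.0;
            lines.add(String.format(Locale.ROOT, "%.4f,%.4f,%.4f,%s",
                    shift + rng.nextGaussian(),
                    shift + rng.nextGaussian(),
                    shift + rng.nextGaussian(),
                    groupA ? "a" : "b"));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Writes data_set.arff into the current directory.
        Files.write(Paths.get("data_set.arff"), buildArff(1000, 42L));
    }
}
```

Running the sketch writes data_set.arff with 1,000 instances, which is more than the subset size of 256 used in the command.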

Estimating mass distribution

The latest method for (level 1) multi-dimensional mass estimation is now available (see reference [Half-Space Mass] below).

The first (one-dimensional) mass estimation method is implemented in MATLAB.
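To give a feel for what one-dimensional mass estimation computes, here is a minimal Java sketch. It assumes the level-1 definition from [Mass]: for sorted points x_1 <= ... <= x_n, each binary split s_i between x_i and x_(i+1) is chosen with probability p(s_i) = (x_(i+1) - x_i) / (x_n - x_1), and mass(x_a) is the sum over all splits of p(s_i) times the number of points on the same side of s_i as x_a. This is not the MATLAB implementation mentioned above, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class Level1MassSketch {
    /**
     * Returns the level-1 mass of every point, in sorted order.
     * Requires at least two points that are not all equal.
     */
    static double[] level1Mass(double[] points) {
        int n = points.length;
        if (n < 2) throw new IllegalArgumentException("need at least two points");
        double[] x = points.clone();
        Arrays.sort(x);
        double range = x[n - 1] - x[0];
        if (range <= 0.0) throw new IllegalArgumentException("points must not all be equal");

        double[] mass = new double[n];
        for (int a = 0; a < n; a++) {
            double sum = 0.0;
            for (int i = 1; i <= n - 1; i++) {
                // Split s_i lies between x[i-1] and x[i] (0-based array indices).
                double p = (x[i] - x[i - 1]) / range;
                // The point with 1-based rank a+1 is left of s_i when a+1 <= i:
                // the left region holds i points, the right region n - i points.
                int sideCount = (a + 1 <= i) ? i : n - i;
                sum += sideCount * p;
            }
            mass[a] = sum;
        }
        return mass;
    }

    public static void main(String[] args) {
        double[] mass = level1Mass(new double[] {0, 1, 2, 3, 4});
        System.out.println(Arrays.toString(mass)); // [2.5, 3.25, 3.5, 3.25, 2.5]
    }
}
```

For five evenly spaced points the mass peaks at the central point and falls off towards both ends. The double loop is quadratic in n and is chosen here for readability rather than speed.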

References

[DEMass-Bayes, DEMass-DBSCAN, DEMass-LOF] Ting, Kai Ming; Takashi Washio; Jonathan Wells; F. T. Liu; Sunil Aryal (2013). "DEMass: a new density estimator for big data". Knowledge and Information Systems 35 (3): 1–32.

[Half-Space Mass] Chen, Bo; Kai Ming Ting; Takashi Washio; Gholamreza Haffari (2015). "Half-space mass: a maximally robust and efficient data depth method". Machine Learning 100 (2): 677–699.

[iForest] Liu, Tony F.; Kai Ming Ting; Z. H. Zhou (2008). "Isolation forest". Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM'08): 413–422.

[LiNearN, LiNearN-Cluster] Wells, Jonathan R.; Kai Ming Ting; Takashi Washio (2014). "LiNearN: A New Approach to Nearest Neighbour Density Estimator". Pattern Recognition 47: 2702–2720.

[Mass] Ting, Kai Ming; G. T. Zhou; F. T. Liu; J. S. C. Tan (2013). "Mass estimation". Machine Learning 90 (1): 127–160.

[MassTER] Ting, Kai Ming; Jonathan R. Wells (2010). "Multi-dimensional mass estimation and mass-based clustering". Proceedings of the 10th IEEE International Conference on Data Mining (ICDM'10), IEEE Computer Society Press: 511–520.

[ReFeat] Zhou, G. T.; Kai Ming Ting; T. F. Liu; Y. Yin (2012). "Relevance feature mapping for content based multimedia information retrieval". Pattern Recognition 45 (4): 1707–1720.

[ReMass-iForest, ReMass-ReFeat] Aryal, Sunil; Kai Ming Ting; Jonathan R. Wells; Takashi Washio (2014). "Improving iForest with relative mass". Advances in Knowledge Discovery and Data Mining: 510–521.

[SCiForest] Liu, Tony F.; Kai Ming Ting; Z. H. Zhou (2010). "On detecting clustered anomalies using SCiForest". Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD'10): 274–290.