Entropy Based k Nearest Neighbor Pattern Classification (EbkNN): En-route to Achieving a High Accuracy in Breast Cancer Diagnosis

Authors

  • Pushpam Sinha, Department of Mechanical Engineering, Netaji Subhas Institute of Technology, Amhara, Bihta, Patna, India
  • Ankita Sinha, Software Engineer 2, Intuit IDC, Bangalore, India

DOI:

https://doi.org/10.24203/ajas.v8i6.6386

Keywords:

Binary classification, Euclidean distance, Training dataset, Information

Abstract

Entropy-based k-Nearest Neighbor pattern classification (EbkNN) is a variation of the conventional k-Nearest Neighbor (kNN) rule of pattern classification that optimizes the value of k for each test data point using entropy calculations. The entropy formula used in EbkNN is the one popularly defined in information theory for a set of n different types of information (classes) attached to a total of m objects (data points), with each object described by f features. In EbkNN, the value of k chosen to discriminate a given test data point is the one for which the entropy is the least non-zero value. The other rules of conventional kNN are retained in EbkNN. We conclude that EbkNN works best for binary classification; using it to discriminate test data points into more than two classes is computationally prohibitive. The biggest advantage of EbkNN over conventional kNN is that a single run of the EbkNN algorithm yields the optimum classification of the test data, whereas the conventional kNN algorithm has to be run separately for each value of k in a selected range, with the optimum k then chosen from among the results. We tested the EbkNN method on the WDBC (Wisconsin Diagnostic Breast Cancer) dataset, which contains 569 instances; we took the first 290 instances as the training dataset and the remaining 279 instances as the test dataset. EbkNN gave a remarkable result: accuracy close to 100%, better than that reported by most other researchers who have worked on the WDBC dataset.
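
To make the selection rule concrete, below is a minimal Python sketch of the k-selection step as described in this abstract, assuming Euclidean distance (per the keywords) and the Shannon entropy H = -Σ pᵢ log₂ pᵢ over the class labels of the k nearest neighbors. The function names, the search cap k_max, and the fallback when every candidate entropy is zero are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def shannon_entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def ebknn_predict(X_train, y_train, x_test, k_max=20):
    """Classify one test point: choose the k whose neighbor-label entropy is
    the least non-zero value (per the abstract), then apply the usual kNN
    majority vote among those k neighbors."""
    # Euclidean distances from the test point to every training point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dists)
    best_k, best_h = None, np.inf
    for k in range(1, min(k_max, len(y_train)) + 1):
        h = shannon_entropy(y_train[order[:k]])
        if 0.0 < h < best_h:  # keep the k with the least non-zero entropy
            best_k, best_h = k, h
    if best_k is None:
        # Assumption: if every candidate k gives zero entropy (neighbors
        # unanimous throughout), any k yields the same vote; use k = 1.
        best_k = 1
    labels, counts = np.unique(y_train[order[:best_k]], return_counts=True)
    return labels[np.argmax(counts)]
```

On a split like the WDBC one described above, ebknn_predict would be called once per test instance, so the optimum k is found for each test point in a single pass rather than by re-running the classifier over a grid of k values.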


References

S. Dutt, S. Chandramouli, and A.K. Das, Machine Learning. Noida: Pearson India Education Services Pvt Ltd, 2020.

Y. Hamamoto, S. Uchimura, and S. Tomita, "A Bootstrap Technique for Nearest Neighbor Classifier Design," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 1, pp. 73-79, 1997.

E. Alpaydin, "Voting Over Multiple Condensed Nearest Neighbors," Artificial Intelligence Review, vol. 11, pp. 115-132, 1997.

E. Fix and J. Hodges, Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties. USAF School of Aviation Medicine, Report No. 4, 1951.

M. V. Johns, "An empirical Bayes approach to non-parametric two-way classification," in Studies in Item Analysis and Prediction, H. Solomon, Ed. Stanford, Calif.: Stanford University Press, 1961.

L. N. Kanal, "Statistical methods for pattern classification," Philco Report, 1963; originally appeared in T. Harley et al., "Semi-automatic imagery screening research study and experimental investigation," Philco Reports, VO43-2 and VO43-3, vol. I, sec. 6, and Appendix H, prepared for U.S. Army Electronics Research and Development Laboratory under Contract DA-36-039-SC-90742, March 29, 1963.

G. Sebestyen, Decision Making Processes in Pattern Recognition. New York: Macmillan, 1962, pp. 90-92.

N. Nilsson, Learning Machines. New York: McGraw-Hill, 1965, pp. 120-121.

D. O. Loftsgaarden and C. P. Quesenberry, "A nonparametric estimate of a multivariate density function," Annals of Mathematical Statistics, vol. 36, pp. 1049-1051, June 1965.

T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. IT-13, pp. 21-27, 1967.

K. Q. Weinberger and L. K. Saul, "Distance Metric Learning for Large Margin Nearest Neighbor Classification," Journal of Machine Learning Research, vol. 10, pp. 207-244, 2009.

S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-05), San Diego, CA, pp. 349-356, 2005.

J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in L. K. Saul, Y. Weiss, and L. Bottou, Eds., Advances in Neural Information Processing Systems 17, Cambridge, MA: MIT Press, pp. 513-520, 2005.

S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, "Online and batch learning of pseudo-metrics," in Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), Banff, Canada, pp. 94-101, 2004.

J. K. Uhlmann, "Satisfying general proximity/similarity queries with metric trees," Information Processing Letters, vol. 40, pp. 175-179, 1991.

T. Liu, A. W. Moore, and A. Gray, "New Algorithms for Efficient High-Dimensional Nonparametric Classification," Journal of Machine Learning Research, vol. 7, pp. 1135-1158, 2006.

R. F. Sproull, "Refinements to Nearest Neighbor Searching in k-Dimensional Trees," Algorithmica, vol. 6, pp. 579-589, 1991.

S. Z. Li, K. L. Chan, and C. Wang, "Performance Evaluation of the NFL Method in Image Classification and Retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000.

Y. Zhou and C. Zhang, "Tunable Nearest Neighbor Classifier," Pattern Recognition, pp. 346-349, 2007.

Y. C. Liaw, C. M. Wu, and M. L. Leou, "Fast Exact k Nearest Neighbors Search using an Orthogonal Search Tree," Pattern Recognition, vol. 43, no. 6, pp. 2351-2358, 2010.

J. McNames, "Fast Nearest Neighbor Algorithm based on Principal Axis Search Tree," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 964-976, 2001.

P. Hart, "The Condensed Nearest Neighbour Rule," IEEE Transactions on Information Theory, vol. 14, pp. 515-516, 1968.

G. Gates, "The Reduced Nearest Neighbour Rule," IEEE Transactions on Information Theory, vol. 18, pp. 431-433, 1972.

M. Kubat and M. Cooperson, Jr., "Voting Nearest-Neighbour Subclassifiers," in Proceedings of the 17th International Conference on Machine Learning (ICML-2000), Stanford, CA, pp. 503-510, 2000.

D. R. Wilson and T. R. Martinez, "Reduction Techniques for Instance-Based Learning Algorithms," Machine learning, vol. 38, no. 3, pp. 257-286, 2000.

P. K. Sinha, "Modifying one of the machine learning algorithms kNN to make it independent of the parameter k by re-defining neighbor," I.J. Mathematical Sciences and Computing, vol. 4, pp. 12-25, 2020.

Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, "IKNN: Informative k-nearest neighbor pattern classification," in Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, pp. 248-264, 2007.

Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, pp. 69-90, 1999.

Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, pp. 42-49, 1999.

M. Jirina and M. J. Jirina, "Classifier Based on Inverted Indexes of Neighbors," Institute of Computer Science, Technical Report No. V-1034, 2008.

M. Jirina and M. J. Jirina, "Using Singularity Exponent in Distance Based Classifier," in Proceedings of the 10th International Conference on Intelligent Systems Design and Applications (ISDA2010), Cairo, pp. 220-224, 2010.

M. Jirina and M. J. Jirina, "Classifiers Based on Inverted Distances," in New Fundamental Technologies in Data Mining, K. Funatsu, Ed. InTech, vol. 1, ch. 19, pp. 369-387, 2011.

A. B. Hassanat, M. A. Abbadi, and G. A. Altarawneh, “Solving the problem of the K parameter in KNN classifier using an Ensemble Learning Approach,” International Journal of Computer Science and Information Security, Vol. 12, No. 8, pp. 33-39, 2014.

C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.

S. Srivastava, N. Sharma, S. K. Singh, and R. Srivastava, "Quantitative analysis of a general framework of a CAD tool for breast cancer detection from mammograms," J Med Imaging Health Inform, vol. 4, pp. 654-674, 2014.

S. Saini and R. Vijay, "Mammogram analysis using feedforward back propagation and cascade-forward back propagation artificial neural network," in 2015 Fifth International Conference on Communication Systems and Network Technologies, IEEE, pp. 1177-1180, 2015.

M. M. Pawar and S. N. Talbar, "Genetic fuzzy system (GFS) based wavelet co-occurrence feature selection in mammogram classification for breast cancer diagnosis," Perspect Sci, vol. 8, pp. 247-250, 2016.

S. J. S. Gardezi, I. Faye, F. Adjed, N. Kamel, and M. M. Eltoukhy, "Mammogram classification using curvelet GLCM texture features and GIST features," in International Conference on Advanced Intelligent Systems and Informatics, Springer, Cham, pp. 705-713, 2016.

K. Vaidehi and T. S. Subashini, "Automatic characterization of benign and malignant masses in mammography," Procedia Comput Sci, vol. 46, pp. 1762-1769, 2015.

J. Harefa, A. Alexander, and M. Pratiwi, "Comparison classifier: support vector machine (SVM) and K-nearest neighbor (K-NN) in digital mammogram images," Jurnal Informatika dan Sistem Informasi, vol. 2, pp. 35-40, 2017.

M. Pratiwi, J. Harefa, and S. Nanda, "Mammograms classification using gray-level co-occurrence matrix and radial basis function neural network," Procedia Comput Sci, vol. 59, pp. 83-91, 2015.

H. Rajaguru and S. R. Sannasi Chakravarthy, "Analysis of decision tree and k-nearest neighbor algorithm in the classification of breast cancer," Asian Pac J Cancer Prev, vol. 20, no. 12, pp. 3777-3781, 2019.

Z. Mushtaq, A. Yaqub, S. Sani, and A. Khalid, "Effective K-nearest neighbor classifications for Wisconsin breast cancer data sets," Journal of the Chinese Institute of Engineers, pp. 1-13, DOI: 10.1080/02533839.2019.1676658, 2019.

Published

2020-12-31

How to Cite

Sinha, P., & Sinha, A. (2020). Entropy Based k Nearest Neighbor Pattern Classification (EbkNN): En-route to Achieving a High Accuracy in Breast Cancer Diagnosis. Asian Journal of Applied Sciences, 8(6). https://doi.org/10.24203/ajas.v8i6.6386