An Effective XML Documents Clustering Method Using Word Embeddings for Heterogeneous Collections

B.A. Bodinga, A. Roko, A.B. Muhammad, I. Saidu

Abstract

As XML repositories grow, managing XML data becomes challenging, particularly in how documents are stored and retrieved. One way to address this is to group the documents into clusters so that documents within the same cluster are more closely related than documents in different clusters, which aids the indexing and retrieval of XML documents. Traditional document clustering methods represent documents with models that fail to consider the semantic relations between words. In this paper, WEClusterX is proposed to cluster XML documents semantically. The idea behind WEClusterX is to identify which concept a particular context represents. First, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is used to extract embeddings, which are then clustered. Next, a Context-Document matrix is generated from the clusters of embeddings. Finally, document clusters are formed using the k-means algorithm. The method combines the statistical importance of words with their contextualized representations in documents in order to form meaningful clusters. WEClusterX is evaluated through extensive experiments, and the results show that it achieves better performance in terms of purity and entropy.

Keywords:

XML document, Documents clustering, BERT, Embeddings, Heterogeneous
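
The abstract reports results in terms of purity and entropy. As a minimal sketch of how these two measures are commonly computed for a clustering, the Python below uses only the standard library. It assumes the usual textbook definitions (size-weighted, entropy in bits and not normalized by the number of classes); the paper's exact formulation may differ. The function names and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log2


def _class_counts_per_cluster(labels_true, labels_pred):
    """Map each predicted cluster id to a Counter of the true classes it holds."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    return clusters


def purity(labels_true, labels_pred):
    """Fraction of documents that fall in the majority class of their cluster."""
    clusters = _class_counts_per_cluster(labels_true, labels_pred)
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted average of the class entropy inside each cluster (bits)."""
    clusters = _class_counts_per_cluster(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        # Entropy of the class distribution within this cluster.
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


# Toy example (hypothetical labels): six documents, three classes, two clusters.
true_classes = ["a", "a", "a", "b", "b", "c"]
cluster_ids = [0, 0, 1, 1, 1, 1]
print(purity(true_classes, cluster_ids))
print(entropy(true_classes, cluster_ids))
```

Under these definitions, higher purity and lower entropy both indicate clusters that align more closely with the reference classes.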

