Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model

Razaq, Abdur; Halim, Zahid; Ur Rahman, Atta; Sikandar, Kholla

doi:10.1007/s11042-024-18359-w

Published: 15 February 2024

Volume 83, pages 74205–74232, (2024)

Multimedia Tools and Applications

Abdur Razaq¹,
Zahid Halim ORCID: orcid.org/0000-0003-3094-3483¹,
Atta Ur Rahman¹ &
…
Kholla Sikandar²

Abstract

With the emergence of new Artificial Intelligence (AI) technologies for generating new text and paraphrasing existing text, the identification of genuinely written text has become an important research undertaking. Past approaches to this problem need a significant volume of human-labeled data. Most approaches in the literature target either noisy text or clean text. Chat conversations, blog posts, text messages on cell phones, and exchanges on messenger applications are examples of noisy text, which may contain misspelled or incomplete words. Clean text is free from these characteristics. As research articles do not contain noisy data, we propose a model that focuses on clean text for the identification of paraphrases in research articles. To address the problem of paraphrase detection, this work presents a novel Bidirectional Encoder Representations from Transformers (BERT) based model with fine-tuning. For word representation, Global Vectors (GloVe) embeddings and contextualized Embeddings from Language Models (ELMo) are employed. The model is first evaluated without preprocessing and then evaluated again after a preprocessing step. Extensive experiments are performed to evaluate the proposed model on two benchmark datasets, namely Microsoft Research Paraphrase (MSRP) and Quora Question Pairs (Quora). The proposed model is compared with four closely related state-of-the-art works. The results show that fine-tuned BERT using ELMo embeddings with preprocessing produces promising outcomes. The paraphrase identification rates achieved on the MSRP and Quora datasets are 86.51% and 94.32%, respectively, which are better than those of the competing methods. The proposed solution enables the identification of paraphrased text with higher accuracy and has applications in multiple domains that require genuinely written documents.
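
As a rough illustration of the pipeline the abstract outlines (optional preprocessing, sentence-pair input to a BERT-style classifier, and an identification rate on labeled pairs), the following Python sketch may help. It is not the authors' implementation. The abstract does not list the preprocessing steps, so the cleaning in `preprocess` (lowercasing and punctuation removal) is an assumption. Whitespace splitting stands in for BERT's WordPiece tokenizer, and reading "identification rate" as plain accuracy is also an assumption. All function names are hypothetical.

```python
import string
from typing import List, Tuple


def preprocess(text: str) -> str:
    """Assumed cleaning step: lowercase, drop ASCII punctuation, collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def encode_pair(sent_a: str, sent_b: str) -> Tuple[List[str], List[int]]:
    """Lay out a sentence pair in the usual BERT pair format:
    [CLS] A [SEP] B [SEP], with segment id 0 for the first part and 1 for the second.
    Whitespace splitting is a stand-in for WordPiece tokenization."""
    tokens_a = sent_a.split()
    tokens_b = sent_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of pairs whose predicted label (1 = paraphrase) matches the gold label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    a = preprocess("The model is fine-tuned on MSRP.")
    b = preprocess("We fine-tune the model on MSRP!")
    tokens, segments = encode_pair(a, b)
    print(tokens)
    print(segments)
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

In a real setup the token and segment lists would be produced by the tokenizer that ships with the chosen BERT checkpoint, and the predictions would come from a classification head fine-tuned on MSRP or Quora pairs.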

Data availability

Enquiries about data availability should be directed to the authors.

Acknowledgements

The authors wish to thank the GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under the GAF scheme.

Funding

This work was sponsored by the GIK Institute graduate research fund under the GA1 scheme, grant number GCS1635.

Author information

Authors and Affiliations

The Machine Intelligence Research Group (MInG), Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, 23460, Pakistan
Abdur Razaq, Zahid Halim & Atta Ur Rahman
Department of Computing Sciences, University of Aberdeen, Aberdeen, UK
Kholla Sikandar

Contributions

Abdur Razaq: Conceptualization, Methodology, Software, Supervision. Zahid Halim: Visualization, Investigation, Software. Atta Ur Rahman: Software, Validation. Kholla Sikandar: Resources, Reviewing and Editing.

Corresponding author

Correspondence to Zahid Halim.

Ethics declarations

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

N/A.

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Razaq, A., Halim, Z., Ur Rahman, A. et al. Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model. Multimed Tools Appl 83, 74205–74232 (2024). https://doi.org/10.1007/s11042-024-18359-w

Received: 10 February 2023
Revised: 21 October 2023
Accepted: 19 January 2024
Published: 15 February 2024
Version of record: 15 February 2024
Issue date: September 2024
DOI: https://doi.org/10.1007/s11042-024-18359-w
