Abstract
With the emerging new technologies based on Artificial Intelligence (AI) for the generation of new and paraphrasing of existing text, the identification of genuinely written text has become an important research undertaking. Past approaches to address this issue, need a significant volume of human-labeled data. Most of the approaches used in literature are either for noisy text or for clean text. Conversations in chats, text in blogs, text messages on cell phones, text exchange on Messengers, etc., are examples of noisy text that may contain misspelled words or incomplete words. The second approach focuses on clean text, which is free from the mentioned characteristics in the noisy text. As research articles do not contain noisy data, we propose a model that focuses on clean text for the identification of paraphrases in research articles. To address the problem of paraphrase detection, this work presents a novel Bidirectional Encoder Representation from Transformers (BERT) based model with fine-tuning. For word representation, Global Vectors (Glove) embeddings and contextualized Embeddings From Language Models (ELMo) are employed in this work. Initially, the model is evaluated without performing preprocessing. Later, the preprocessing step is performed before evaluating the model. Extensive experimentations are performed to evaluate the proposed model utilizing two benchmark datasets, namely, Microsoft Research Paraphrase (MSRP) and Quora Question Pairs (Quora). A comparison of the proposed model is done with four closely related state-of-the-art works. The obtained results show that Fine-tuned BERT using ELMo embeddings with preprocessing produces promising outcomes. Paraphrase identification rates achieved on MSRP and Quora datasets are 86.51% and 94.32%, respectively, which are better than the other competing methods. The proposed solution enables the identification of paraphrased text with a higher accuracy having its application in multiple domains requiring genuinely written documents.



Similar content being viewed by others
Data availability
Enquiries about data availability should be directed to the authors.
References
Agarwal B, Ramampiaro H, Langseth H, Ruocco M (2018) A deep network model for paraphrase detection in short text messages. Inf Process Manage 54(6):922–937
Mahmoud A, Zrigui M (2021) Semantic similarity analysis for corpus development and paraphrase detection in Arabic. Int Arab J Inf Technol 18(1):1–7
Aravinda Reddy D, Anand Kumar M, Soman KP (2019) LSTM based paraphrase identification using combined word embedding features. In Soft computing and signal processing. Springer, Singapore, pp 385-394
Bunk S, Krestel R (2018) WELDA: enhancing topic models by incorporating local word context. In: Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pp 293–302
Chen Z, Zhang H, Zhang X, Zhao L (2018) Quora question pairs. University of Waterloo, pp 1–7
Dabiri S, Heaslip K (2019) Developing a Twitter-based traffic event detection model using deep learning architectures. Expert Syst Appl 118:425–439
Das D, Smith NA (2009) Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP. Association for Computational Linguistics, Suntec, pp 468–476
Dey K, Shrivastava R, Kaushik S (2016) A paraphrase and semantic similarity detection system for user generated short-text content on microblogs. In: Proceedings of COLING 2016, the 26th international conference on computational linguistics: technical papers, pp 2880–2890
Dogra V (2021) Banking news-events representation and classification with a novel hybrid model using DistilBERT and rule-based features. Turk J Comput Math Education (TURCOMAT) 12(10):3039–3054
Dolan B, Brockett C (2005) Automatically constructing a corpus of sentential paraphrases. In: Third international workshop on paraphrasing (IWP2005)
Dolan W, Quirk C, Brockett C, Dolan B (2004) Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In: COLING 2004: Proceedings of the 20th international conference on computational linguistics, Geneva, pp 350–356
Eyecioglu A, Keller B (2015) Twitter paraphrase identification with simple overlap features and SVMs. In: Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp 64–69
Felbo B, Mislove A, Søgaard A, Rahwan I, Lehmann S (2017) Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion, and sarcasm. arXiv preprint arXiv:1708.00524
Ferreira R, Cavalcanti GD, Freitas F, Lins RD, Simske SJ, Riss M (2018) Combining sentence similarities measures to identify paraphrases. Comput Speech Lang 47:59–73
Heilman M, Smith NA (2010) Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In: Human language technologies: The 2010 annual conference of the north American chapter of the association for computational linguistics, pp 1011–1019
Hu B, Lu Z, Li H, Chen Q (2014) Convolutional neural network architectures for matching natural language sentences. Adv Neural Inform Process Syst p 27
Ji Y, Eisenstein J (2013) Discriminative improvements to distributional sentence similarity. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 891–896
Jinesh YI, Gawade S, Palivela H (2022) "Feature Extraction from Radiology Images for Visual Question Answering System Using CNN and BiLSTM Model." Recent Innovations in Computing. Springer, Singapore, pp 317–331
Karan M, Glavaš G, Šnajder J, Dalbelo Bašić B, Vulic I, Moens MF (2015) Tklbliir: Detecting Twitter paraphrases with tweeting Jay. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (pp. 70–74). ACL; East Stroudsburg, PA
Yalcin K, Cicekli I, Ercan G (2022) An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding. Expert Syst Appl 197:116677
Lan W, Xu W (2018) Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In: Proceedings of the 27th international conference on computational linguistics, pp 3890–3902
Lian W, Nie G, Jia B, Shi D, Fan Q, Liang Y (2020) An intrusion detection method based on decision tree-recursive feature elimination in ensemble learning. Math Probl Eng 2020:2835023
Liang H, Sun X, Sun Y, Gao Y (2017) Text feature extraction based on deep learning: a review. EURASIP J Wirel Commun Netw 2017(1):1–12
Madnani N, Tetreault J, Chodorow M (2012) Re-examining machine translation metrics for paraphrase identification. In: Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, pp 182–190
Mohammad AS, Jaradat Z, Mahmoud AA, Jararweh Y (2017) Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features. Inf Process Manage 53(3):640–652
Ngoc Phuoc An V, Magnolini S, Popescu O (2015) Paraphrase identification and semantic similarity in twitter with simple features. In: The 3rd international workshop on natural language processing for social media, pp 10–19
Nighojkar A, Licato J (2021) Improving paraphrase detection with the adversarial paraphrasing task. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, (Volume 1: Long papers), pp 7106–7116
Oliva J, Serrano JI, Del Castillo MD, Iglesias Á (2011) SyMSS: A syntax-based measure for short-text semantic similarity. Data Knowl Eng 70(4):390–405
Pang B, Knight K, Marcu D (2003) Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences. Cornell University Ithaca NY, Department Computer Science
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Peng Q, Weir D, Weeds J, Chai Y (2022) Predicate-argument based bi-encoder for paraphrase identification. In: Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pp 5579–5589
Rashid J, Shah SMA, Irtaza A (2019) Fuzzy topic modeling approach for text mining over short text. Inf Process Manage 56(6):102060
Jain R, Kathuria A, Singh A, Saxena A, Khandelwal A (2022) ParaCap: paraphrase detection model using capsule network. Multimed Syst pp 1–19
Chawla S, Aggarwal P, Kaur R (2022) Comparative analysis of semantic similarity word embedding techniques for paraphrase detection. In: Emerging technologies for computing, communication, and smart cities: Proceedings of ETCCS 2021, Springer, pp 15–29
Reimers N, Gurevych I (2017) Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging. arXiv preprint arXiv:1707.09861
Rus V, McCarthy PM, Lintean MC, McNamara DS, Graesser AC (2008). Paraphrase Identification with Lexico-Syntactic Graph Subsumption. In FLAIRS Conference, pp 201–206
Shahmohammadi H, Dezfoulian M, Mansoorizadeh M (2021) Paraphrase detection using LSTM networks and handcrafted features. Multimed Tools Appl 80(4):6479–6492
Shakeel MH, Karim A, Khan I (2020) A multi-cascaded model with data augmentation for enhanced paraphrase detection in short texts. Inf Process Manage 57(3):102204
Socher R, Huang E, Pennin J, Manning CD, Ng A (2011) Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Adv Neural Inform Process Syst p 24
Wang Z, Hamza W, Florian R (2017) Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814
Wang Z, Mi H, Ittycheriah A (2016) Sentence similarity learning by lexical decomposition and composition. In: Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers, pp 1340–1349
Xie X, Li Z, Tang Z, Yao D, Ma H (2023) Unifying knowledge iterative dissemination and relational reconstruction network for image–text matching. Inf Process Manage 60(1):103154
Xu W, Callison-Burch C, Dolan WB (2015) Semeval-2015 task 1: Paraphrase and semantic similarity in Twitter (pit). In: Proceedings of the 9th International Workshop on semantic evaluation (SemEval 2015), pp 1–11
Xu W, Ritter A, Callison-Burch C, Dolan WB, Ji Y (2014) Extracting lexically divergent paraphrases from Twitter. Trans Assoc Comput Linguis 2:435–448
Yang M, Chen X, Tan L, Lan X, Luo Y (2023) Listen carefully to experts when you classify data: A generic data classification ontology encoded from regulations. Inf Process Manage 60(2):103186
Yin W, Schütze H (2015) Convolutional neural network for paraphrase identification. In: Proceedings of the 2015 conference of the north American chapter of the association for computational linguistics: human language technologies, pp 901–911
Zarrella G, Henderson J, Merkhofer E, Strickhart L (2015) Mitre: Seven systems for semantic similarity in tweets. In: Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp 12–17
Zhao J, Lan M (2015) Ecnu: Leveraging word embeddings to boost performance for paraphrase on Twitter. In: Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp 34–39
Palivela H (2021) Optimization of paraphrase generation and identification using language models in natural language processing. Int J Inf Manag Data Insights 1(2):100025
Acknowledgements
The authors wish to thank the GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under the GAF scheme.
Funding
This work was sponsored by the GIK Institute graduate research fund under the GA1 scheme. Grand number GCS1635.
Author information
Authors and Affiliations
Contributions
Abdur Razaq: Conceptualization, Methodology, Software, Supervision. Zahid Halim: Visualization, Investigation, Software. Atta Ur Rahman: Software, Validation. Kholla Sikandar: Resources, Reviewing and Editing.
Corresponding author
Ethics declarations
Ethical approval
All procedures performed in studies involving human participants were per the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Informed consent
N/A.
Conflict of interest
All authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Razaq, A., Halim, Z., Ur Rahman, A. et al. Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model. Multimed Tools Appl 83, 74205–74232 (2024). https://doi.org/10.1007/s11042-024-18359-w
Received:
Revised:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11042-024-18359-w