Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model

Multimedia Tools and Applications

Abstract

With the emergence of Artificial Intelligence (AI) technologies that generate new text and paraphrase existing text, identifying genuinely written text has become an important research problem. Past approaches to this problem need a significant volume of human-labeled data. Most approaches in the literature target either noisy text or clean text. Chat conversations, blog posts, mobile text messages, and exchanges on messaging applications are examples of noisy text, which may contain misspelled or incomplete words. Clean text is free of these characteristics. Because research articles do not contain noisy data, we propose a model that focuses on clean text for identifying paraphrases in research articles. To address paraphrase detection, this work presents a novel fine-tuned model based on Bidirectional Encoder Representations from Transformers (BERT). For word representation, Global Vectors (GloVe) embeddings and contextualized Embeddings from Language Models (ELMo) are employed. The model is first evaluated without preprocessing and then evaluated again after a preprocessing step. Extensive experiments are performed to evaluate the proposed model on two benchmark datasets, namely Microsoft Research Paraphrase (MSRP) and Quora Question Pairs (Quora). The proposed model is compared with four closely related state-of-the-art works. The results show that fine-tuned BERT using ELMo embeddings with preprocessing produces promising results. The paraphrase identification rates achieved on the MSRP and Quora datasets are 86.51% and 94.32%, respectively, which are better than those of the competing methods. The proposed solution identifies paraphrased text with higher accuracy and has applications in multiple domains that require genuinely written documents.
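
The evaluation protocol described in the abstract (score sentence pairs once without preprocessing and once with it, then report an identification rate) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the preprocessing steps, the 0.5 threshold, and the toy sentence pairs are hypothetical, and a simple token-overlap score stands in for the fine-tuned BERT classifier, which this page does not specify.

```python
import re


def preprocess(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Hypothetical steps; the preprocessing actually used is not given here.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def jaccard(a, b):
    """Token-set overlap, a toy stand-in for a learned paraphrase classifier."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def predict(pair, use_preprocessing, threshold=0.5):
    """Return 1 (paraphrase) or 0 (not a paraphrase) for one sentence pair."""
    s1, s2 = pair
    if use_preprocessing:
        s1, s2 = preprocess(s1), preprocess(s2)
    return int(jaccard(s1, s2) >= threshold)


def accuracy(pairs, labels, use_preprocessing):
    """Percentage of pairs whose predicted label matches the gold label."""
    preds = [predict(p, use_preprocessing) for p in pairs]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


# Made-up toy pairs with gold labels (1 = paraphrase), not taken from MSRP or Quora.
PAIRS = [
    ("The model is fine-tuned on MSRP.", "the model is fine tuned on msrp"),
    ("How do I learn Python?", "What is the capital of France?"),
    ("Results improve with preprocessing.", "Results improve with preprocessing!"),
]
LABELS = [1, 0, 1]

if __name__ == "__main__":
    print("without preprocessing:", round(accuracy(PAIRS, LABELS, False), 2))
    print("with preprocessing:", round(accuracy(PAIRS, LABELS, True), 2))
```

On these toy pairs the overlap score misses the first paraphrase until case and punctuation are normalized, which is the kind of effect the two-stage evaluation is meant to expose. The numbers it prints say nothing about the reported MSRP or Quora results.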

Data availability

Enquiries about data availability should be directed to the authors.

Acknowledgements

The authors wish to thank the GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under the GAF scheme.

Funding

This work was sponsored by the GIK Institute graduate research fund under the GA1 scheme, grant number GCS1635.

Author information

Contributions

Abdur Razaq: Conceptualization, Methodology, Software, Supervision. Zahid Halim: Visualization, Investigation, Software. Atta Ur Rahman: Software, Validation. Kholla Sikandar: Resources, Reviewing and Editing.

Corresponding author

Correspondence to Zahid Halim.

Ethics declarations

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

N/A.

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Razaq, A., Halim, Z., Ur Rahman, A. et al. Identification of paraphrased text in research articles through improved embeddings and fine-tuned BERT model. Multimed Tools Appl 83, 74205–74232 (2024). https://doi.org/10.1007/s11042-024-18359-w

  • DOI: https://doi.org/10.1007/s11042-024-18359-w
