Deduplication of the media-based event databases

  • Research Article
Journal of Computational Social Science

Abstract

Event databases play a crucial role in documenting and examining spatiotemporally distributed events, ranging from protests and disasters to public health emergencies. The Global Database of Events, Language, and Tone (GDELT) is notable for its comprehensive, automated cataloging of events from many news sources. However, this extensive scope increases the risk of counting a single event repeatedly because it is reported by multiple sources, which leads to an inaccurate assessment of event-level dynamics. This article presents a new, automated deduplication technique specifically designed to improve the event identification and counting accuracy of GDELT data. By consolidating redundant event entries into a unified, comprehensive record, our approach mitigates overcounting while enhancing the integrity and usefulness of the data. We assess the effectiveness of our method through thorough algorithmic testing and by comparing the results with other established datasets, such as the Armed Conflict Location and Event Dataset (ACLED) and the Integrated Crisis Early Warning System (ICEWS). The comparative analysis employs the deduplicated GDELT data to predict local protest levels, demonstrating that our deduplication procedure not only decreases the number of overcounted events but also better aligns GDELT with other event databases, thereby validating our methodology. Our findings inform empirical research that depends on media-reported event databases like GDELT, with broader implications for other fields that rely on data collection and sources affected by directional measurement error.
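
The abstract does not spell out the matching rule, so the following is only a minimal sketch of what consolidating redundant entries into one record could look like. It assumes, purely for illustration, that two reports describe the same event when they share a date, a location, and an event code; the field names (`date`, `location`, `event_code`, `source`), the sample values, and the function `deduplicate` are hypothetical and are not taken from the article or from the GDELT schema.

```python
from collections import defaultdict


def deduplicate(records, key_fields=("date", "location", "event_code")):
    """Merge records that share the same key into one consolidated record.

    Assumption (not from the article): an exact match on key_fields
    identifies reports of the same underlying event.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        groups[key].append(rec)

    merged = []
    for key, recs in groups.items():
        consolidated = dict(zip(key_fields, key))
        # Keep provenance: which outlets reported the event, and how often.
        consolidated["sources"] = sorted({r["source"] for r in recs})
        consolidated["n_reports"] = len(recs)
        merged.append(consolidated)
    return merged


# Hypothetical sample: two outlets report the same event, a third reports another.
reports = [
    {"date": "2020-06-01", "location": "City A", "event_code": "14", "source": "outlet-a"},
    {"date": "2020-06-01", "location": "City A", "event_code": "14", "source": "outlet-b"},
    {"date": "2020-06-02", "location": "City B", "event_code": "14", "source": "outlet-a"},
]
events = deduplicate(reports)
```

Counting `events` rather than `reports` gives the number of distinct events under this simplified rule, while `n_reports` preserves how heavily each event was covered. A real procedure would likely need fuzzier matching on time and place than the exact-key grouping assumed here.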

Author information

Corresponding author

Correspondence to Deepti Joshi.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Joshi, D., Werum, R., Hazelwood, D. et al. Deduplication of the media-based event databases. J Comput Soc Sc 8, 76 (2025). https://doi.org/10.1007/s42001-025-00409-4

  • DOI: https://doi.org/10.1007/s42001-025-00409-4
