Abstract
Event databases play a crucial role in documenting and examining spatiotemporally distributed events, ranging from protests and disasters to public health emergencies. The Global Database of Events, Language, and Tone (GDELT) is notable for its comprehensive automated cataloging of events from many news sources. However, this extensive scope increases the risk of counting a single event repeatedly when multiple sources report it, which leads to an inaccurate assessment of event-level dynamics. This article presents a new, automated deduplication technique specifically designed to improve the event identification and counting accuracy of GDELT data. By consolidating redundant event entries into a unified, comprehensive record, our approach mitigates overcounting while enhancing the integrity and usefulness of the data. We assess the effectiveness of our method through thorough algorithmic testing and by comparing the results with other established datasets, such as the Armed Conflict Location and Event Dataset (ACLED) and the Integrated Crisis Early Warning System (ICEWS). The comparative analysis employs the deduplicated GDELT data to predict local protest levels, demonstrating that our deduplication procedure not only decreases the number of overcounted events but also better aligns GDELT with other event databases, thereby validating the effectiveness of our methodology. Our findings inform empirical research that depends on media-reported event databases like GDELT, with broader implications for other fields that rely on data collection and sources affected by directional measurement error.
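To make the idea of consolidating redundant entries concrete, the following is a minimal sketch, not the procedure developed in this article. It assumes that two records describe the same event when they share a date, an event code, and coordinates that agree after rounding. The field names (`date`, `event_code`, `lat`, `lon`, `source`), the `precision` parameter, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def deduplicate(events, precision=1):
    """Merge records that share a date, event code, and rounded coordinates.

    Illustrative matching rule only: the keys used here are assumptions,
    not the matching criteria of any published deduplication method.
    """
    groups = defaultdict(list)
    for ev in events:
        key = (
            ev["date"],
            ev["event_code"],
            round(ev["lat"], precision),
            round(ev["lon"], precision),
        )
        groups[key].append(ev)

    merged = []
    for (date, code, lat, lon), members in groups.items():
        merged.append({
            "date": date,
            "event_code": code,
            "lat": lat,
            "lon": lon,
            # Keep every distinct source so no reporting information is lost.
            "sources": sorted({m["source"] for m in members}),
            # Number of raw records folded into this unified record.
            "n_reports": len(members),
        })
    return merged


# Hypothetical records: the first two describe the same event from two outlets.
events = [
    {"date": "2020-06-01", "event_code": "141", "lat": 38.90, "lon": -77.04,
     "source": "outlet-a.example"},
    {"date": "2020-06-01", "event_code": "141", "lat": 38.91, "lon": -77.03,
     "source": "outlet-b.example"},
    {"date": "2020-06-02", "event_code": "141", "lat": 38.90, "lon": -77.04,
     "source": "outlet-a.example"},
]

unique_events = deduplicate(events)
```

In this toy input, three raw records collapse into two unified records, and the first retains both reporting sources. A stricter or looser `precision` changes how aggressively nearby reports are merged, which is the kind of trade-off any rule-based matching of this sort has to tune.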













Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
About this article
Cite this article
Joshi, D., Werum, R., Hazelwood, D. et al. Deduplication of the media-based event databases. J Comput Soc Sc 8, 76 (2025). https://doi.org/10.1007/s42001-025-00409-4
DOI: https://doi.org/10.1007/s42001-025-00409-4