Got 404s? Crawling and Analyzing an Institution’s Web Domain

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2022)

Abstract

Link rot, the disappearance of web resources, is detrimental to an institution’s web presence, which is commonly used to communicate, for example, research highlights and organizational news. Organizations, especially taxpayer-funded ones such as the Los Alamos National Laboratory (LANL), therefore emphasize the availability and authenticity of their institutional record on the web. We conducted a web crawl of the lanl.gov domain and investigated the scale of missing resources and the ratio of resources that could be recovered from public web archives. We found a noticeable number of special cases of link rot (soft404s) and transient errors, and had little success in recovering resources from web archives. We argue that, as an institution, we could become a better steward of our web content by establishing an institutional web archive to improve the availability and authenticity of web resources.
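
The distinction between missing resources, soft404s, and transient errors can be sketched in code. The following is a minimal illustration, not the crawler or the classification used in the study. It assumes that a soft404 shows up as a redirect to a generic error page that answers with HTTP 200. The `CrawlResult` type, the `classify` function, the error-page path, and the set of status codes treated as transient are all assumptions made for this example.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Assumption: paths of generic error pages that are served with HTTP 200.
# A real crawl would need to discover these per site.
ERROR_PAGE_PATHS = {"/errors/service-unavailable.php"}

# Assumption: status codes treated as transient (worth retrying later).
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


@dataclass
class CrawlResult:
    """Outcome of fetching one URL (hypothetical record layout)."""
    requested_url: str
    final_url: str  # URL after following redirects
    status: int     # HTTP status of the final response


def classify(result: CrawlResult) -> str:
    """Label a crawl result as ok, soft404, missing, transient, or other."""
    final_path = urlsplit(result.final_url).path
    redirected = result.final_url != result.requested_url

    # Soft404: the server reports success, but only after redirecting
    # the request to a known generic error page.
    if result.status == 200 and redirected and final_path in ERROR_PAGE_PATHS:
        return "soft404"
    # Hard link rot: the server states that the resource is gone.
    if result.status in (404, 410):
        return "missing"
    # Transient error: the same request may succeed at a later time.
    if result.status in TRANSIENT_STATUSES:
        return "transient"
    if 200 <= result.status < 300:
        return "ok"
    return "other"


if __name__ == "__main__":
    example = CrawlResult(
        requested_url="https://lanl.gov/discover/news-release-archive/"
                      "2017/July/0719-ultracold-reactions.php",
        final_url="https://www.lanl.gov/errors/service-unavailable.php",
        status=200,
    )
    print(classify(example))
```

Checking the final URL rather than only the status code is the point of the sketch: a crawler that trusts the HTTP 200 alone would count such a page as available.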

Author information

Correspondence to Martin Klein.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Klein, M., Balakireva, L. (2022). Got 404s? Crawling and Analyzing an Institution’s Web Domain. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_48
