Abstract
Link rot, the disappearance of web resources, is detrimental to an institution’s web presence, which is commonly used to communicate, for example, research highlights and organizational news. Organizations, especially taxpayer-funded ones such as the Los Alamos National Laboratory (LANL), therefore emphasize the availability and authenticity of their institutional record on the web. We conducted a web crawl of the lanl.gov domain and investigated the scale of missing resources and the share of those resources that could be recovered from public web archives. We found a noticeable number of special cases of link rot (soft404s) and transient errors, and had little success in recovering resources from web archives. We argue that, as an institution, we could become a better steward of our web content by establishing an institutional web archive to improve the availability and authenticity of web resources.
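To illustrate what separates a soft404 from an ordinary missing page or a transient error, the following is a minimal sketch in Python. It is not the crawler used for the study. It assumes a soft404 on lanl.gov shows up as a redirect to the error page /errors/service-unavailable.php that answers with HTTP 200. The function name, the category labels, and the choice of status codes treated as transient are all hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: path of the error page that soft404s on lanl.gov redirect to.
ERROR_PATHS = {"/errors/service-unavailable.php"}


def classify(requested_url: str, final_url: str, status: int) -> str:
    """Classify one crawl result (hypothetical labels).

    requested_url: the URL the crawler asked for
    final_url:     the URL reached after following redirects
    status:        the HTTP status code of the final response
    """
    if status == 200:
        landed_on_error = urlsplit(final_url).path in ERROR_PATHS
        asked_for_error = urlsplit(requested_url).path in ERROR_PATHS
        # A 200 response whose body is really the error page is a soft404,
        # unless the crawler requested the error page itself.
        if landed_on_error and not asked_for_error:
            return "soft404"
        return "ok"
    if status in (404, 410):
        return "missing"
    # Assumption: timeouts, rate limiting, and server errors count as transient.
    if status in (408, 429) or 500 <= status < 600:
        return "transient"
    return "other"
```

Called with a requested news-release URL, the final URL https://www.lanl.gov/errors/service-unavailable.php, and status 200, the function returns "soft404". A check on the status code alone would have counted that page as available.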
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Klein, M., Balakireva, L. (2022). Got 404s? Crawling and Analyzing an Institution’s Web Domain. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_48
DOI: https://doi.org/10.1007/978-3-031-16802-4_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science