Skip to content

Bug: Fails to parse list of URLs txt file #968

@rossvor

Description

@rossvor

Describe the bug

I can't seem to get archivebox to add any URLs from simple txt file with a newline separated list of URLs.
Based on error message it fails to parse it. I may be doing something wrong.

Steps to reproduce

  1. Create txt file with some URLs. Eg.
https://www.example.com/
https://example.com/
  1. Run archivebox add /tmp/urls.txt

Screenshots or log output

Here's the output I get:

ross@xx> archivebox add /tmp/urls.txt                                                                                                                                                                                     /tmp/archivebox
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
    > /tmp/archivebox

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            

[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650470713-import.txt
                                                                                                                                                                                                                        0.0% (0/240sec)[X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                                                                                                                                           
    > Found 0 new URLs not already in index

[*] [2022-04-20 16:05:13] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/ross/.local/bin/archivebox                                            
 √  PYTHON_BINARY         v3.10.4         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py   
 √  CURL_BINARY           v7.82.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 X  SINGLEFILE_BINARY     ?               invalid   single-file                                                                 
 X  READABILITY_BINARY    ?               invalid   readability-extractor                                                       
 X  MERCURY_BINARY        ?               invalid   mercury-parser                                                              
 √  GIT_BINARY            v2.35.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /home/ross/.local/bin/youtube-dl                                            
 √  CHROME_BINARY         v100.0.4896.88  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ross/.local/lib/python3.10/site-packages/archivebox                   
 √  TEMPLATES_DIR         3 files         valid     /home/ross/.local/lib/python3.10/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /tmp/archivebox                                                             
 √  SOURCES_DIR           3 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           0 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3                                                             

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
  

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions