Bug: Pocket since high-water-mark gets set even when indexing fails #726

@cpmsmith

Describe the bug

I'm setting up Pocket importing for the first time, so I'm importing a lot of old links, some of which point to now-defunct websites. When one of them fails to download, the entire import fails, but the since value in pocket_api.db is still updated. When I then re-run the Pocket import, it only retrieves items newer than that value, leaving me with no URLs archived.
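To make the expected behavior concrete, here is a minimal sketch of the ordering described above: the high-water-mark is persisted only after indexing succeeds. This is not ArchiveBox's actual code. The names `load_since`, `save_since`, `import_feed`, the `fetch`/`index` callables, and the JSON state file are all hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional, Tuple


def load_since(state_path: Path) -> Optional[int]:
    """Return the stored high-water-mark, or None if nothing was saved yet."""
    if not state_path.exists():
        return None
    return json.loads(state_path.read_text())["since"]


def save_since(state_path: Path, since: int) -> None:
    """Persist the high-water-mark by writing a temp file, then renaming it."""
    tmp = state_path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"since": since}))
    tmp.replace(state_path)


def import_feed(
    fetch: Callable[[Optional[int]], Tuple[List[str], int]],
    index: Callable[[List[str]], None],
    state_path: Path,
) -> List[str]:
    """Fetch items newer than the stored mark, index them, then advance the mark.

    `fetch` and `index` are hypothetical stand-ins for the feed request and
    the archiving step. If `index` raises, `save_since` is never reached,
    so a retry fetches the same items again.
    """
    since = load_since(state_path)
    urls, new_since = fetch(since)
    index(urls)                        # may raise on a dead URL
    save_since(state_path, new_since)  # only runs after indexing succeeded
    return urls
```

With this ordering, a failed run leaves the stored value untouched, so the next run sees the full list again instead of zero URLs.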

Steps to reproduce

  1. Set up Pocket config, per #528 (Add parser for Pocket API)
  2. Have a URL in Pocket on a domain that refuses connections or does not exist
  3. Import from Pocket:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:00:05] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619618411-import.txt
        > Parsed 169 URLs from input (Pocket API)
    
    [*] Starting crawl of 169 sites 1 hop out from starting point
        > Downloading http://my-working-url.com/ contents
        > Saved verbatim input to sources/1619618411-crawl-my-working-url.com.txt
        > Parsed 12 URLs from input (Generic TXT)
        > Downloading http://my-defunct-url.com/ contents
    [!] Failed to download http://my-defunct-url.com/
    
         HTTPConnectionPool(host='my-defunct-url.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xb4dbf9b8>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    
  4. Remove broken URL from Pocket
  5. Try importing again:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:39:35] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619620775-import.txt
                                                                                                                                   0.1% (0/240sec)
    [X] No links found using Pocket API parser
        Hint: Try a different parser or double check the input?
    
        > Parsed 0 URLs from input (Pocket API)
        > Found 0 new URLs not already in index
    
    [*] [2021-04-28 14:39:35] Writing 0 links to main index...
        √ ./index.sqlite3
    

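One possible way to recover from the state reached in step 5 is to clear the stored since value so the next import fetches everything again. The sketch below assumes that pocket_api.db is an INI-style file readable by Python's `configparser`, with one section per Pocket username and a `since` key. That layout is an assumption, not something confirmed in this report, so check the file contents and back it up first. The helper name `clear_since` is hypothetical.

```python
from configparser import ConfigParser
from pathlib import Path


def clear_since(db_path: Path, username: str) -> bool:
    """Remove the stored `since` value for one user.

    Assumes an INI layout like:
        [myUserName]
        since = 1619618411
    Returns True if a value was removed, False if none was found.
    """
    parser = ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read(db_path)      # a missing file is silently treated as empty
    if not parser.has_section(username):
        return False
    if not parser.has_option(username, "since"):
        return False
    parser.remove_option(username, "since")
    with open(db_path, "w") as fh:
        parser.write(fh)
    return True
```

For example, `clear_since(Path("pocket_api.db"), "myUserName")` would be called with whatever path the file actually has in the data directory, then the import from step 3 re-run.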
ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.4.79-v7l+-armv7l-with-glibc2.28 armv7l
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.4          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.8          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.07     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data
 √  SOURCES_DIR           28 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           0 files         valid     ./archive
 √  CONFIG_FILE           204.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3
