Closed
Labels
status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet), why: functionality (Intended to improve ArchiveBox functionality or features)
Description
I am importing just under 1 million links supplied by forum users over 7 years. Not all of the links work, and I need the system to skip over the ones it cannot import.
Type
- General question or discussion
- Propose a brand new feature
- Request modification of existing behavior or design
What is the problem that your feature request solves
I have a list of about 250,000 links that match an %archive.%/% format. Since the links are user-supplied, not all of them are necessarily valid. When I tried to import the list, the import quickly bailed out, citing an "Invalid IPv6 URL" error.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
archivebox add < /tmp/links.txt --ignore-errors
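I imagine the flag could simply skip URLs that urlparse rejects instead of letting the exception abort the whole import. A rough sketch of what I mean, loosely based on the archivable_links function in the traceback below (the ignore_errors parameter and the surrounding details are made up, not actual ArchiveBox code):

from urllib.parse import urlparse

def archivable_links(links, ignore_errors=False):
    """Yield only links with an archivable scheme; optionally skip malformed URLs."""
    for link in links:
        try:
            scheme_is_valid = urlparse(link.url).scheme.lower() in ('http', 'https', 'ftp')
        except ValueError:
            if ignore_errors:
                continue  # e.g. "Invalid IPv6 URL": drop the bad entry and keep going
            raise
        if scheme_is_valid:
            yield link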
What hacks or alternative solutions have you tried to solve the problem?
I can't imagine any way of accomplishing what I need without putting each URL in a separate file and writing a bash script to run add on each one.
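The only other thing I can think of is pre-filtering the list outside ArchiveBox with the same urlparse check it applies, so malformed entries never reach archivebox add. A rough, untested sketch; the file paths are just examples:

from urllib.parse import urlparse

with open('/tmp/links.txt') as infile, open('/tmp/links.clean.txt', 'w') as outfile:
    for line in infile:
        url = line.strip()
        if not url:
            continue
        try:
            # the same check that crashes inside ArchiveBox; skip anything it rejects
            if urlparse(url).scheme.lower() in ('http', 'https', 'ftp'):
                outfile.write(url + '\n')
        except ValueError as err:
            print('skipping malformed url:', url, '(', err, ')')

and then: archivebox add < /tmp/links.clean.txt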
How badly do you want this new feature?
- It's an urgent deal-breaker, I can't live without it
- It's important to add it in the near-mid term future
- It would be nice to have eventually
- I'm willing to contribute dev time / money to fix this issue
- I like ArchiveBox so far / would recommend it to a friend
- I've had a lot of difficulty getting ArchiveBox set up
sudo -u archive archivebox add < /tmp/archives.txt
[i] [2020-08-15 16:32:11] ArchiveBox v0.4.13: archivebox add < /dev/stdin
> /opt/archive
[+] [2020-08-15 16:32:12] Adding 228732 links to index (crawl depth=0)...
> Saved verbatim input to sources/1597509132-import.txt
Traceback (most recent call last):
File "/usr/local/bin/archivebox", line 10, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/archivebox/cli/__init__.py", line 126, in main
pwd=pwd or OUTPUT_DIR,
File "/usr/local/lib/python3.7/dist-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/usr/local/lib/python3.7/dist-packages/archivebox/cli/archivebox_add.py", line 72, in main
out_dir=pwd or OUTPUT_DIR,
File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/archivebox/main.py", line 544, in add
new_links += parse_links_from_source(write_ahead_log)
File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 284, in parse_links_from_source
new_links = validate_links(raw_links)
File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 130, in validate_links
links = sorted_links(links) # deterministically sort the links based on timstamp, url
File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 175, in sorted_links
return sorted(links, key=sort_func, reverse=True)
File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 142, in archivable_links
scheme_is_valid = scheme(link.url) in ('http', 'https', 'ftp')
File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 30, in <lambda>
scheme = lambda url: urlparse(url).scheme.lower()
File "/usr/lib/python3.7/urllib/parse.py", line 368, in urlparse
splitresult = urlsplit(url, scheme, allow_fragments)
File "/usr/lib/python3.7/urllib/parse.py", line 435, in urlsplit
raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
I don't see any IPv6 links in my list, by the way; I can send it over as well. It looks like it may be a broken IPv6-style link on the page itself.
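For what it's worth, the error doesn't seem to require an actual IPv6 address at all: Python's urlparse raises this exact ValueError for any URL whose host contains an unbalanced square bracket. A minimal reproduction (the example URL is made up):

from urllib.parse import urlparse

urlparse('http://example.com/page')         # parses fine
urlparse('http://[broken-bracket.example')  # raises ValueError: Invalid IPv6 URL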