
Conversation

@neodyne (Contributor) commented Sep 19, 2023

Inspired by #1541.

This PR adds support for Apple Podcasts web links in the format of
https://podcasts.apple.com/COUNTRY/podcast/PODCAST_TITLE/idPODCAST_ID.

utils.parse_apple_podcasts_url() opens https://itunes.apple.com/lookup?id=PODCAST_ID and returns the feedUrl field from the JSON response, or the original URL on error, for further processing.

I also cleaned up and updated the 'accelerators' in utils.normalize_feed_url(), fixed the test for FeedBurner, and added one for Apple Podcasts.
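
For reference, a minimal sketch of that flow; this is illustrative rather than the actual diff, and the regex and error handling are assumptions:

    import json
    import re
    import urllib.request

    def parse_apple_podcasts_url(url):
        # Extract the podcast ID from an Apple Podcasts web link
        # (https://podcasts.apple.com/COUNTRY/podcast/TITLE/idPODCAST_ID).
        m = re.match(r'https?://podcasts\.apple\.com/\w+/podcast/[^/]+/id(\d+)', url or '')
        if m is None:
            return url  # not an Apple Podcasts link: pass through unchanged
        try:
            # Resolve the ID via the iTunes lookup endpoint and read feedUrl.
            with urllib.request.urlopen('https://itunes.apple.com/lookup?id=' + m.group(1)) as response:
                lookup = json.loads(response.read().decode('utf-8'))
            return lookup['results'][0]['feedUrl']
        except Exception:
            return url  # on error, return the original URL for further processing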

@tpikonen (Contributor) left a comment


Thanks, this works and looks fine, except for a small nitpick.

@auouymous (Member) commented

https://podcasts.apple.com/it/podcast/il-podcast-di-alessandro-barbero-lezioni-e/id1501956064 works fine. However, https://podcasts.apple.com/us/podcast/marvels-wastelanders-black-widow/id1600546725, a podcast I found on the front page of podcasts.apple.com, does not work.

  File "util.py", line 318, in parse_apple_podcasts_url
    url = normalize_feed_url(lookup['results'][0]['feedUrl'])
                             ~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
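
A minimal guard against the empty results list, keeping the names from the traceback (a sketch, not the eventual fix):

    # Guard for the empty 'results' case shown in the traceback:
    results = lookup.get('results') or []
    feed_url = results[0].get('feedUrl') if results else None
    url = normalize_feed_url(feed_url) if feed_url else None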

@neodyne (Contributor, Author) commented Sep 22, 2023

@auouymous that page doesn't load in the browser either; it is stuck on 'Connecting to Apple Podcasts'. Try any random nonexistent number as the ID: you get the same landing page. With real podcasts, this happens when the RSS feed is gone or unreachable; parts of the Apple website UI may still feature the podcast, but the actual podcast page is delisted. In this case the feed, found by other means ( https://feeds.simplecast.com/tysWFCDv ), still works, so the page must have been delisted by the publisher. No listing means nothing to look up.

curl -s https://itunes.apple.com/lookup?id=1600546725

{
 "resultCount":0,
 "results": []
}

A bigger issue, though, judging by similar ex-paywalled Marvel podcasts, might be something like Marvel's Wastelanders: Hawkeye:

curl -s https://itunes.apple.com/lookup?id=1586348056

{
 "resultCount":1,
 "results": [
  {
   "wrapperType":"track",
   "kind":"podcast",
   "collectionId":1586348056,
   "trackId":1586348056,
   "artistName":"Marvel & SiriusXM",
   "collectionName":"Marvel's Wastelanders: Hawkeye",
   "trackName":"Marvel's Wastelanders: Hawkeye",
   "collectionCensoredName":"Marvel's Wastelanders: Hawkeye",
   "trackCensoredName":"Marvel's Wastelanders: Hawkeye",
   "collectionViewUrl":"https://podcasts.apple.com/us/podcast/marvels-wastelanders-hawkeye/id1586348056?uo=4",
   "trackViewUrl":"https://podcasts.apple.com/us/podcast/marvels-wastelanders-hawkeye/id1586348056?uo=4",
   "artworkUrl30":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/30x30bb.jpg",
   "artworkUrl60":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/60x60bb.jpg",
   "artworkUrl100":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/100x100bb.jpg",
   "collectionPrice":0.00,
   "trackPrice":0.00,
   "collectionHdPrice":0,
   "releaseDate":"2022-09-12T07:00:00Z",
   "collectionExplicitness":"notExplicit",
   "trackExplicitness":"cleaned",
   "trackCount":16,
   "trackTimeMillis":107,
   "country":"USA",
   "currency":"USD",
   "primaryGenreName":"Fiction",
   "contentAdvisoryRating":"Clean",
   "artworkUrl600":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/600x600bb.jpg",
   "genreIds":["1483", "26"],
   "genres":["Fiction", "Podcasts"]
  }
 ]
}

feedUrl is missing entirely, but the Apple Podcasts listing is valid and episodes play (from https://feeds.simplecast.com/ZnLWkl9A ).

All Wastelanders podcasts are folded into one feed on Spotify, and the Marvel webpage advertises that exclusively for all storylines, but that feed doesn't have all episodes folded in; maybe they are mid-migration, or this is some 'have to be logged in / subscribed in iTunes' type of deal. I strongly suspect this is a Marvel thing: I have been using the lookup endpoint via a userscript on the iTunes webpage for years and haven't had any issues with (by now ~300 subscribed) podcasts from different publishers. Still, a gPodder user would see a valid Apple Podcasts listing, copy the link to gPodder, and it wouldn't work.

The whole iTunes Search API is old (it returns a JSON string as text/javascript, and the docs were archived in 2017), so I'm not sure I can coerce it to give up more data. I could do a more granular check of the JSON response (nonsense; valid JSON with empty results; valid JSON without a feedUrl), log something more meaningful ('No feedUrl in response, is this an exclusive podcast?'), and alter the docstring to match (...tries to resolve...). On failure it would return the original URL, though, which would fail later (HTML instead of a feed).
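
A sketch of what that more granular check could look like (feed_url_from_lookup is a hypothetical helper name; the log messages are illustrative):

    import logging
    logger = logging.getLogger(__name__)

    def feed_url_from_lookup(lookup):
        # Distinguish the three failure modes above: nonsense response,
        # valid JSON with empty results, valid JSON without a feedUrl.
        if not isinstance(lookup, dict) or 'results' not in lookup:
            logger.warning('Lookup response is not a valid results object')
            return None
        if not lookup['results']:
            logger.warning('Lookup returned no results, podcast may be delisted')
            return None
        feed_url = lookup['results'][0].get('feedUrl')
        if feed_url is None:
            logger.warning('No feedUrl in response, is this an exclusive podcast?')
            return None
        return feed_url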


EDIT:
Going over this Super User question from 2009, I tested:

https://getrssfeed.com/
https://www.labnol.org/podcast/
https://picklemonkey.net/feedflipper-home/
https://www.djm.org.uk/posts/reveal-rss-feed-url-itunes-podcast/

It seems they all use the same lookup endpoint, so predictably they all choke on the Hawkeye URL. Might be a one-off.

@auouymous (Member) commented

Better error messages would be nice to have, and I think parse_apple_podcasts_url() should return None if it fails to parse the URL. The None check after parse_youtube_url() adds the original URL to a failed list instead of queuing it and later causing 'not found' or 'invalid HTML' errors. We know the original Apple URL will always fail if not resolved, correct?

The error_messages dict should be moved above the loop, the apple/youtube parse functions could return url, error_message, and the None check could add the error_message to error_messages[url]. Ideally normalize_feed_url() would also return an error_message, and its other callsites could use or discard it.
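
A rough sketch of that shape, with hypothetical names (urls, queued) standing in for the real loop:

    # Parsers return (url, error_message) tuples, and failures are
    # collected in error_messages instead of being queued and failing later.
    error_messages = {}  # moved above the loop, as suggested
    queued = []
    for input_url in urls:
        url, error_message = parse_apple_podcasts_url(input_url)
        if url is None:
            error_messages[input_url] = error_message or 'unknown'
            continue
        queued.append(url)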

I also just tried subscribing to https://www.youtube.com/c/jsgdajfgjsadfjh, and its exception causes no dialog to appear. So I will need to fix parse_youtube_url() to not throw exceptions, or to catch them at the callsite and set error_message to str(e).
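
The callsite variant could be as simple as (a sketch, assuming the tuple return from the previous snippet):

    try:
        url, error_message = parse_youtube_url(url)
    except Exception as e:
        # Surface the exception as an error message instead of losing it.
        url, error_message = None, str(e)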

Thoughts?

@neodyne (Contributor, Author) commented Sep 22, 2023

We know the original apple URL will always fail if not resolved, correct?

Yes

parse_apple_podcasts_url() should return None

If you look through the review of my (albeit flawed) original patch, besides the original URL, None used to be a valid return, and it was the caller's responsibility to handle it (1, 2). This was considered wasteful in the gpo CLI because it meant passing None around, even though the youtube parser is hooked into gtkui/main.py in the same manner. After the reorder, the gpo CLI would return "invalid url" early, but the GUI would still silently fail, as it does now with the youtube parser; to prevent that, the succeeding None check should be extended to report errors, I agree.

Based on the review, the function now always returns some URL, and None only if None is passed, which mirrors the youtube parser. If there's consensus on whether this is better or worse, I can change the function accordingly (I lean towards returning None on error as well).

I think the additional error message return is out of scope for this PR. Caller functions would have to be changed to accommodate it, both parsers (parse_youtube_url() also calls itself when looking for the channel URL on a page) and normalize_feed_url() (in parse_apple_podcasts_url(), the feedUrl from the lookup has to be normalized; we can't really trust the endpoint not to return something like {...,"feedUrl":"error",...}).

need to fix parse_youtube_url()

The parser could be case-insensitive too (so /cHaNnEL/, /uSeR/ etc.), the same way the Vimeo and Soundcloud (and Apple) parsers are, which would mean lowercasing the URL for the check (normalize_feed_url() only lowercases the netloc) or using a regex.
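
For example (a sketch; the helper name and path list are illustrative, not the actual youtube.py code):

    from urllib.parse import urlparse

    def is_youtube_channel_path(url):
        # Case-insensitive check: lowercase the path before matching,
        # since normalize_feed_url() only lowercases the netloc.
        path = urlparse(url).path.lower()
        return path.startswith(('/channel/', '/user/', '/c/'))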

In my opinion, youtube.py in general needs refactoring/simplification. There are multiple references to the now-dead gdata.* YouTube API, and code paths that are hard to follow (for me, at least). We have youtube.get_youtube_id() and youtube.get_channel_id_url(); what does youtube.get_real_channel_url() do?

@neodyne (Contributor, Author) commented Sep 22, 2023

Changes:

  • made regex stricter
  • function now returns one of (see the sketch after this list):
    • original url when it's not an apple podcasts link (regex doesn't match)
    • url if lookup contains a valid feedUrl after normalization
    • None everywhere else
  • adjusted docstring
  • more granular logging, doesn't include exception info anymore
  • moved the None check in bin/gpo again, so it comes after both the apple parser and the youtube one (in anticipation of changes)
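
Putting the list above together, a minimal sketch of the revised contract (illustrative body rather than the actual diff, reusing the imports and logger from the earlier sketches; normalize_feed_url() is the existing helper):

    def parse_apple_podcasts_url(url):
        # Revised contract: original url for non-Apple links,
        # a normalized feedUrl on success, None everywhere else.
        m = re.match(r'https?://podcasts\.apple\.com/\w+/podcast/[^/]+/id(\d+)$', url or '')
        if m is None:
            return url  # regex doesn't match: not an Apple Podcasts link
        try:
            with urllib.request.urlopen('https://itunes.apple.com/lookup?id=' + m.group(1)) as response:
                lookup = json.loads(response.read().decode('utf-8'))
        except Exception:
            logger.warning('Apple Podcasts lookup failed for %s', url)
            return None
        results = lookup.get('results') or []
        feed_url = results[0].get('feedUrl') if results else None
        # None unless the lookup contains a feedUrl that survives normalization
        return normalize_feed_url(feed_url) if feed_url else None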

The gpo CLI now reports all unsuccessfully parsed Apple Podcasts links with "Invalid url: %s" and doesn't process them further. The GUI doesn't process them either, but no specific message is shown, same as with the current youtube parser: after entering one of the mentioned Marvel podcasts into the Add podcast URL dialog, the standard 'Some podcasts could not be added' msgbox is shown, with the URL and 'unknown' as the text. The same box with the same text is shown when trying to subscribe to 'abcdef', i.e. whenever normalize_feed_url(input_url) returns None.

EDIT: accelerators/prefixes in normalize_feed_url() are not resolved for the error message:

sc:123
'Unknown'

This shouldn't even be a None return: it should be a successful parse into https://soundcloud.com/123 (once soundcloud support in gpodder gets fixed), or, for now, a failure in the soundcloud code because of the pinned issue.

The length criterion (more than 8 chars) should not apply to (accelerator + id); in fact I think the only requirement on id should be len(id) > 0 and no whitespace after str.strip().
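
In code, that rule might look like this (a sketch inside normalize_feed_url()'s prefix loop; prefix, expansion and url as in the snippet in the next comment):

    # Proposed check for (accelerator + id): the id only has to be
    # non-empty and whitespace-free after str.strip().
    podcast_id = url[len(prefix):].strip()
    if not podcast_id or any(c.isspace() for c in podcast_id):
        return None
    url = expansion % (podcast_id,)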

@neodyne (Contributor, Author) commented Sep 23, 2023

This is not strictly related to the apple podcasts parser, but I believe the validation of prefixed URLs in normalize_feed_url() should be corrected: 3-char soundcloud usernames exist (2-char too, but less common), as well as 4-character youtube usernames, and both currently fail as 'prefix' URLs. As I see it, we can

  • lower the length condition for URLs to fail to len(url) < 6 (or 7).

This would allow sc:123 and yt:abcd. Longer prefixes would have a shorter allowed minimum ID length (len(url) - len(prefix)). Empty prefixed URLs with a sufficiently long prefix would resolve to empty patterns like https://patte.rn?id= instead of returning None (the loop could include a check to prevent this). Inconsistent, but the smallest code change. How was the URL min_length == 8 originally determined, from 'https://'? Will shorter allowed normalized URLs break something, especially since normalize_feed_url() is used in different places, e.g. validating the enclosure URL?

  • move the length check after prefix expansion.
    if not url:
        return None

    (...)

    for prefix, expansion in PREFIXES.items():
        if url.lower().startswith(prefix):
            # ^case-insensitive comparison
            if len(url) == len(prefix):
                # ^this could be `len(url) <= len(prefix)+some_arbitrary_id_min_length-1`
                return None
            elif '@' not in url:
                # ^fix for 'sc:PassWord@example.com' -> 'https://soundcloud.com/PassWord@example.com'
                # further processing possible here
                url = expansion % (url[len(prefix):],)

            break

    if len(url) < 8:
        return None

    (...)

This would allow any prefix followed by an ID of at least id_min_length characters. It would keep min_length == 8 for non-prefixed URLs, and id_min_length would be consistent across prefixes of any length; id_min_length == 1 would simply mean a non-empty ID for a prefix. With more changes it would allow for processing on expansion: moving the ' ' in url condition as well would potentially allow 'prefix:<space>id' -> 'https://patte.rn?=id' (strip) or 'prefix:<space>i<space>d' -> 'https://patte.rn?=id' (replace whitespace), if desired.
The prefix check would effectively come first, though, right after the None check. So a short garbage string with a valid prefix would immediately return None in the first/old scenario, but here it wouldn't at first, only (maybe) after prefix expansion.
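
A sketch of that optional whitespace handling on expansion (hypothetical; both the strip and the replace-whitespace variants shown together):

    # Whitespace handling before expansion:
    podcast_id = url[len(prefix):]
    podcast_id = podcast_id.strip()              # 'sc: 123'   -> '123'
    podcast_id = ''.join(podcast_id.split())     # 'sc: 1 2 3' -> '123'
    url = expansion % (podcast_id,)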

