
Conversation

@neodyne (Contributor) commented Sep 19, 2023

Inspired by #1541.

This PR adds support for Apple Podcasts web links in the format of
https://podcasts.apple.com/COUNTRY/podcast/PODCAST_TITLE/idPODCAST_ID.

utils.parse_apple_podcasts_url() opens https://itunes.apple.com/lookup?id=PODCAST_ID and returns the feedUrl field from the JSON response, or the original URL on error, for further processing.

I also cleaned up and updated the 'accelerators' in utils.normalize_feed_url(), fixed the test for FeedBurner, and added one for Apple Podcasts.
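
For reference, a minimal sketch of that flow; this is illustrative rather than the actual diff, and the regex and error handling are assumptions:

    import json
    import re
    import urllib.request

    def parse_apple_podcasts_url(url):
        # Extract the podcast ID from an Apple Podcasts web link
        # (https://podcasts.apple.com/COUNTRY/podcast/TITLE/idPODCAST_ID).
        m = re.match(r'https?://podcasts\.apple\.com/\w+/podcast/[^/]+/id(\d+)', url or '')
        if m is None:
            return url  # not an Apple Podcasts link: pass through unchanged
        try:
            # Resolve the ID via the iTunes lookup endpoint and read feedUrl.
            with urllib.request.urlopen('https://itunes.apple.com/lookup?id=' + m.group(1)) as response:
                lookup = json.loads(response.read().decode('utf-8'))
            return lookup['results'][0]['feedUrl']
        except Exception:
            return url  # on error, return the original URL for further processing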

@tpikonen (Contributor) left a comment


Thanks, this works and looks fine, except for a small nitpick.

@auouymous (Member) commented

https://podcasts.apple.com/it/podcast/il-podcast-di-alessandro-barbero-lezioni-e/id1501956064 works fine. However, https://podcasts.apple.com/us/podcast/marvels-wastelanders-black-widow/id1600546725, a podcast I found on the front page of podcasts.apple.com, does not work.

  File "util.py", line 318, in parse_apple_podcasts_url
    url = normalize_feed_url(lookup['results'][0]['feedUrl'])
                             ~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
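
A minimal guard against the empty results list, keeping the names from the traceback (a sketch, not the eventual fix):

    # Guard for the empty 'results' case shown in the traceback:
    results = lookup.get('results') or []
    feed_url = results[0].get('feedUrl') if results else None
    url = normalize_feed_url(feed_url) if feed_url else None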

@neodyne (Contributor, Author) commented Sep 22, 2023

@auouymous that page doesn't load in the browser either; it is stuck on 'Connecting to Apple Podcasts'. Try any random nonexistent number as the ID: you get the same landing page. With real podcasts, this happens when the RSS feed is gone or unreachable; parts of the Apple website UI may still feature the podcast, but the actual podcast page is delisted. In this case the feed, found by other means ( https://feeds.simplecast.com/tysWFCDv ), still works, so the page must have been delisted by the publisher. No listing means nothing to look up.

curl -s https://itunes.apple.com/lookup?id=1600546725

{
 "resultCount":0,
 "results": []
}

A bigger issue, though, judging by similar ex-paywalled Marvel podcasts, might be something like Marvel's Wastelanders: Hawkeye:

curl -s https://itunes.apple.com/lookup?id=1586348056

{
 "resultCount":1,
 "results": [
  {
   "wrapperType":"track",
   "kind":"podcast",
   "collectionId":1586348056,
   "trackId":1586348056,
   "artistName":"Marvel & SiriusXM",
   "collectionName":"Marvel's Wastelanders: Hawkeye",
   "trackName":"Marvel's Wastelanders: Hawkeye",
   "collectionCensoredName":"Marvel's Wastelanders: Hawkeye",
   "trackCensoredName":"Marvel's Wastelanders: Hawkeye",
   "collectionViewUrl":"https://podcasts.apple.com/us/podcast/marvels-wastelanders-hawkeye/id1586348056?uo=4",
   "trackViewUrl":"https://podcasts.apple.com/us/podcast/marvels-wastelanders-hawkeye/id1586348056?uo=4",
   "artworkUrl30":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/30x30bb.jpg",
   "artworkUrl60":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/60x60bb.jpg",
   "artworkUrl100":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/100x100bb.jpg",
   "collectionPrice":0.00,
   "trackPrice":0.00,
   "collectionHdPrice":0,
   "releaseDate":"2022-09-12T07:00:00Z",
   "collectionExplicitness":"notExplicit",
   "trackExplicitness":"cleaned",
   "trackCount":16,
   "trackTimeMillis":107,
   "country":"USA",
   "currency":"USD",
   "primaryGenreName":"Fiction",
   "contentAdvisoryRating":"Clean",
   "artworkUrl600":"https://is1-ssl.mzstatic.com/image/thumb/Podcasts115/v4/b5/48/67/b54867a3-bf18-1016-92b2-9dce0e6a3a99/mza_13697071505097526305.jpg/600x600bb.jpg",
   "genreIds":["1483", "26"],
   "genres":["Fiction", "Podcasts"]
  }
 ]
}

feedUrl is missing entirely, but the Apple Podcasts listing is valid and episodes play (from https://feeds.simplecast.com/ZnLWkl9A ).

All Wastelanders podcasts are folded into one feed on Spotify, and the Marvel webpage advertises that exclusively for all storylines, but that feed doesn't have all episodes folded in; maybe they are mid-migration, or this is some 'have to be logged in / subscribed in iTunes' type of deal. I strongly suspect this is a Marvel thing: I have been using the lookup endpoint via a userscript on the iTunes webpage for years and haven't had any issues with (by now ~300 subscribed) podcasts from different publishers. Still, a gPodder user would see a valid Apple Podcasts listing, copy the link to gPodder, and it wouldn't work.

The whole iTunes Search API is old (it returns a JSON string as text/javascript, and the docs were archived in 2017), so I'm not sure I can coerce it to give up more data. I could do a more granular check of the JSON response (nonsense; valid JSON with empty results; valid JSON without a feedUrl), log something more meaningful ('No feedUrl in response, is this an exclusive podcast?'), and alter the docstring to match (...tries to resolve...). On failure it would return the original URL, though, which would fail later (HTML instead of a feed).
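
A sketch of what that more granular check could look like (feed_url_from_lookup is a hypothetical helper name; the log messages are illustrative):

    import logging
    logger = logging.getLogger(__name__)

    def feed_url_from_lookup(lookup):
        # Distinguish the three failure modes above: nonsense response,
        # valid JSON with empty results, valid JSON without a feedUrl.
        if not isinstance(lookup, dict) or 'results' not in lookup:
            logger.warning('Lookup response is not a valid results object')
            return None
        if not lookup['results']:
            logger.warning('Lookup returned no results, podcast may be delisted')
            return None
        feed_url = lookup['results'][0].get('feedUrl')
        if feed_url is None:
            logger.warning('No feedUrl in response, is this an exclusive podcast?')
            return None
        return feed_url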


EDIT:
Going over this Super User question from 2009, I tested:

https://getrssfeed.com/
https://www.labnol.org/podcast/
https://picklemonkey.net/feedflipper-home/
https://www.djm.org.uk/posts/reveal-rss-feed-url-itunes-podcast/

It seems they all use the same lookup endpoint, so predictably they all choke on the Hawkeye URL. Might be a one-off.

@auouymous (Member) commented

Better error messages would be nice to have, and I think parse_apple_podcasts_url() should return None if it fails to parse the URL. The None check after parse_youtube_url() adds the original URL to a failed list instead of queuing it and later causing 'not found' or 'invalid HTML' errors. We know the original Apple URL will always fail if not resolved, correct?

The error_messages dict should be moved above the loop, the apple/youtube parse functions could return url, error_message, and the None check could add the error_message to error_messages[url]. Ideally normalize_feed_url() would also return an error_message, and its other callsites could use or discard it.
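
A rough sketch of that shape, with hypothetical names (urls, queued) standing in for the real loop:

    # Parsers return (url, error_message) tuples, and failures are
    # collected in error_messages instead of being queued and failing later.
    error_messages = {}  # moved above the loop, as suggested
    queued = []
    for input_url in urls:
        url, error_message = parse_apple_podcasts_url(input_url)
        if url is None:
            error_messages[input_url] = error_message or 'unknown'
            continue
        queued.append(url)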

I also just tried subscribing to https://www.youtube.com/c/jsgdajfgjsadfjh, and its exception causes no dialog to appear. So I will need to fix parse_youtube_url() to not throw exceptions, or to catch them at the callsite and set error_message to str(e).
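
The callsite variant could be as simple as (a sketch, assuming the tuple return from the previous snippet):

    try:
        url, error_message = parse_youtube_url(url)
    except Exception as e:
        # Surface the exception as an error message instead of losing it.
        url, error_message = None, str(e)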

Thoughts?

@neodyne (Contributor, Author) commented Sep 22, 2023

We know the original apple URL will always fail if not resolved, correct?

Yes

parse_apple_podcasts_url() should return None

If you look through the review of my (albeit flawed) original patch, besides the original URL, None used to be a valid return, and it was the caller's responsibility to handle it (1, 2). This was considered wasteful in the gpo CLI because it meant passing None around, even though the youtube parser is hooked into gtkui/main.py in the same manner. After the reorder, the gpo CLI would return "invalid url" early, but the GUI would still silently fail, as it does now with the youtube parser; to prevent that, the succeeding None check should be extended to report errors, I agree.

Based on the review, the function now always returns some URL, and None only if None is passed, which mirrors the youtube parser. If there's consensus on whether this is better or worse, I can change the function accordingly (I lean towards returning None on error as well).

I think the additional error message return is out of scope for this PR. Caller functions would have to be changed to accommodate it, both parsers (parse_youtube_url() also calls itself when looking for the channel URL on a page) and normalize_feed_url() (in parse_apple_podcasts_url(), the feedUrl from the lookup has to be normalized; we can't really trust the endpoint not to return something like {...,"feedUrl":"error",...}).

need to fix parse_youtube_url()

The parser could be case-insensitive too (so /cHaNnEL/, /uSeR/ etc.), the same way the Vimeo and Soundcloud (and Apple) parsers are, which would mean lowercasing the URL for the check (normalize_feed_url() only lowercases the netloc) or using a regex.
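
For example (a sketch; the helper name and path list are illustrative, not the actual youtube.py code):

    from urllib.parse import urlparse

    def is_youtube_channel_path(url):
        # Case-insensitive check: lowercase the path before matching,
        # since normalize_feed_url() only lowercases the netloc.
        path = urlparse(url).path.lower()
        return path.startswith(('/channel/', '/user/', '/c/'))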

In my opinion, youtube.py in general needs refactoring/simplification. There are multiple references to the now-dead gdata.* YouTube API, and code paths that are hard to follow (for me, at least). We have youtube.get_youtube_id() and youtube.get_channel_id_url(); what does youtube.get_real_channel_url() do?

@neodyne (Contributor, Author) commented Sep 22, 2023

Changes:

  • made regex stricter
  • function now returns one of (see the sketch after this list):
    • original url when it's not an apple podcasts link (regex doesn't match)
    • url if lookup contains a valid feedUrl after normalization
    • None everywhere else
  • adjusted docstring
  • more granular logging, doesn't include exception info anymore
  • moved the None check in bin/gpo again, so it comes after both the apple parser and the youtube one (in anticipation of changes)
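
Putting the list above together, a minimal sketch of the revised contract (illustrative body rather than the actual diff, reusing the imports and logger from the earlier sketches; normalize_feed_url() is the existing helper):

    def parse_apple_podcasts_url(url):
        # Revised contract: original url for non-Apple links,
        # a normalized feedUrl on success, None everywhere else.
        m = re.match(r'https?://podcasts\.apple\.com/\w+/podcast/[^/]+/id(\d+)$', url or '')
        if m is None:
            return url  # regex doesn't match: not an Apple Podcasts link
        try:
            with urllib.request.urlopen('https://itunes.apple.com/lookup?id=' + m.group(1)) as response:
                lookup = json.loads(response.read().decode('utf-8'))
        except Exception:
            logger.warning('Apple Podcasts lookup failed for %s', url)
            return None
        results = lookup.get('results') or []
        feed_url = results[0].get('feedUrl') if results else None
        # None unless the lookup contains a feedUrl that survives normalization
        return normalize_feed_url(feed_url) if feed_url else None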

The gpo CLI now reports all unsuccessfully parsed Apple Podcasts links with "Invalid url: %s" and doesn't process them further. The GUI doesn't process them either, but no specific message is shown, same as with the current youtube parser: after entering one of the mentioned Marvel podcasts into the Add podcast URL dialog, the standard 'Some podcasts could not be added' msgbox is shown, with the URL and 'unknown' as the text. The same box with the same text is shown when trying to subscribe to 'abcdef', i.e. whenever normalize_feed_url(input_url) returns None.

EDIT: accelerators/prefixes in normalize_feed_url() are not resolved for the error message:

sc:123
'Unknown'

This shouldn't even be a None return: it should be a successful parse into https://soundcloud.com/123 (once soundcloud support in gpodder gets fixed), or, for now, a failure in the soundcloud code because of the pinned issue.

The length criterion (more than 8 chars) should not apply to (accelerator + id); in fact I think the only requirement on id should be len(id) > 0 and no whitespace after str.strip().
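
In code, that rule might look like this (a sketch inside normalize_feed_url()'s prefix loop; prefix, expansion and url as in the snippet in the next comment):

    # Proposed check for (accelerator + id): the id only has to be
    # non-empty and whitespace-free after str.strip().
    podcast_id = url[len(prefix):].strip()
    if not podcast_id or any(c.isspace() for c in podcast_id):
        return None
    url = expansion % (podcast_id,)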

@neodyne (Contributor, Author) commented Sep 23, 2023

This is not strictly related to the apple podcasts parser, but I believe the validation of prefixed URLs in normalize_feed_url() should be corrected: 3-char soundcloud usernames exist (2-char too, but less common), as well as 4-character youtube usernames, and both currently fail as 'prefix' URLs. As I see it, we can

  • lower the length condition for URLs to fail to len(url) < 6 (or 7).

This would allow sc:123 and yt:abcd. Longer prefixes would have a shorter allowed minimum ID length (len(url) - len(prefix)). Empty prefixed URLs with a sufficiently long prefix would resolve to empty patterns like https://patte.rn?id= instead of returning None (the loop could include a check to prevent this). Inconsistent, but the smallest code change. How was the URL min_length == 8 originally determined, from 'https://'? Will shorter allowed normalized URLs break something, especially since normalize_feed_url() is used in different places, e.g. validating the enclosure URL?

  • move the length check after prefix expansion.
    if not url:
        return None

    (...)

    for prefix, expansion in PREFIXES.items():
        if url.lower().startswith(prefix):
            # ^case-insensitive comparison
            if len(url) == len(prefix):
                # ^this could be `len(url) <= len(prefix)+some_arbitrary_id_min_length-1`
                return None
            elif '@' not in url:
                # ^fix for 'sc:PassWord@example.com' -> 'https://soundcloud.com/PassWord@example.com'
                # further processing possible here
                url = expansion % (url[len(prefix):],)

            break

    if len(url) < 8:
        return None

    (...)

This would allow any prefix followed by an ID of at least id_min_length characters. It would keep min_length == 8 for non-prefixed URLs, and id_min_length would be consistent across prefixes of any length; id_min_length == 1 would simply mean a non-empty ID for a prefix. With more changes it would allow for processing on expansion: moving the ' ' in url condition as well would potentially allow 'prefix:<space>id' -> 'https://patte.rn?=id' (strip) or 'prefix:<space>i<space>d' -> 'https://patte.rn?=id' (replace whitespace), if desired.
The prefix check would effectively come first, though, right after the None check. So a short garbage string with a valid prefix would immediately return None in the first/old scenario, but here it wouldn't at first, only (maybe) after prefix expansion.
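
A sketch of that optional whitespace handling on expansion (hypothetical; both the strip and the replace-whitespace variants shown together):

    # Whitespace handling before expansion:
    podcast_id = url[len(prefix):]
    podcast_id = podcast_id.strip()              # 'sc: 123'   -> '123'
    podcast_id = ''.join(podcast_id.split())     # 'sc: 1 2 3' -> '123'
    url = expansion % (podcast_id,)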

