
Further analysis of PyPI typosquatting


By Jake Edge
October 14, 2020

We have looked at the problem of confusingly named packages in repositories such as the Python Package Index (PyPI) before. In general, malicious actors create these packages with names that can be mistaken for those of legitimate packages in the repository in a form of "typosquatting". Since our 2016 article, the problem has not gone away—no surprise—but there has been some recent analysis of it, as well as some efforts to combat it.

On the IQT blog, John Speed Meyers and Bentz Tozer recently posted some analysis they had done to quantify PyPI typosquatting attacks and to categorize them. They started by looking at examples of actual attacks against PyPI users from 2017 to 2020; they found 40 separate instances over that time span. The criteria used were that the package had a name similar to that of another package in PyPI, contained malware, and was identified and removed from the repository.

They identified two types of package typosquatting: misspelling and confusion. The first relies on package names that are slightly misspelled: djanga instead of django, or urlib3 instead of urllib3. Confusion attacks instead rely on changing the order of the "words" in the name (e.g. nmap-python rather than python-nmap), removing or changing separators (e.g. easyinstall vs. easy_install), or otherwise changing the elements of the name (e.g. crypt/crypto, python-sqlite/pysqlite). Of the 40 attacks identified, 18 were of the misspelling variety and 24 were of the confusion variety; two fell into both categories, which accounts for the overlap.
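
As an illustration of how mechanical these transformations are, here is a minimal sketch (not code from the IQT post) that generates separator- and order-based confusion variants for a given package name:

    import itertools
    import re

    def confusion_variants(name):
        """Generate confusion-style variants of a package name:
        separator changes and word reordering."""
        variants = set()
        # Separator changes: drop, or swap, "-" and "_".
        for sep in ("", "-", "_"):
            variants.add(re.sub(r"[-_]", sep, name))
        # Word reordering: permute the separator-delimited words.
        words = re.split(r"[-_]", name)
        for perm in itertools.permutations(words):
            variants.add("-".join(perm))
        variants.discard(name)
        return variants

    print(confusion_variants("python-nmap"))
    # e.g. {'pythonnmap', 'python_nmap', 'nmap-python'}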

The blog post noted that William Bengston had done some research on one type of confusion attack in particular: separator changes. In July 2018, Bengston registered around 1,100 package names on PyPI by eliminating the separators (i.e. "-" or "_") in the names of the top 10,000 packages on PyPI. The packages registered would simply cause an error when they were installed; that error would redirect the user to the correct package name. Bengston later reported on the results:

In a little over two years there have been 530,950 total pip install commands run on 1,131 packages! This does not include any mirrors or internal package registries that have cloned these packages privately. Malicious packages in PyPI have been known to steal credentials stored on the local file system such as SSH credentials in ~/.ssh/, GPG keys, or perhaps AWS credentials stored in ~/.aws/credentials. If these typosquat packages were written with malicious intent and we assume one attempt per install, that would mean 530,950 machines could have been compromised over the two year period.
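
Bengston's placeholder packages failed at install time and pointed users at the correct name. A minimal sketch of that style of defensive package (hypothetical code, not his actual implementation) is a setup.py that aborts with a helpful message:

    # setup.py for a defensively registered name such as "pythonnmap"
    # (hypothetical sketch; not Bengston's actual code)
    import sys

    CORRECT_NAME = "python-nmap"

    sys.exit(
        "You probably meant to install '%s'.\n"
        "This placeholder exists to protect against typosquatting.\n"
        "Run: pip install %s" % (CORRECT_NAME, CORRECT_NAME)
    )

pip treats the nonzero exit as a failed build and surfaces the message in its error output.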

The IQT researchers found that separator attacks made up only a small portion of the typosquatting incidents they examined, though, so the number of systems that have fallen prey to typosquatting overall is likely far higher: "Separator attacks account for only three of the 26 confusion attacks in this dataset though, suggesting that Bengston’s already frightening estimate of PyPI user susceptibility to typosquatting is a lower bound of overall user susceptibility to typosquatting attacks."

As might be guessed, typosquatters concentrated their efforts on the most popular PyPI packages. The IQT researchers found that 28% of the instances targeted the 50 most popular PyPI packages, and well over half of the attacks (63%) were against the top 500.

Finding typosquatting attacks, or preventing such packages from being created in the first place, is obviously worth doing. The misspelling attacks are relatively easy to detect using the Levenshtein distance between two names; that distance measures the number of single-character changes needed to turn one string into another. The misspellings that the researchers found had a Levenshtein distance of one (15 attacks) or two (3 attacks). While there may be perfectly valid reasons for a package to have a name only slightly different from that of an existing package, a small distance could serve as a trigger for administrator scrutiny. The confusion category, on the other hand, generally had distances of three or more (17 of 24), making those attacks more difficult to detect automatically.
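
As a rough sketch of such a check, here is a plain dynamic-programming implementation of the distance (any of the usual libraries would serve equally well):

    def levenshtein(a, b):
        """Number of single-character insertions, deletions, and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    # Misspelling-style squats sit at distance one or two:
    print(levenshtein("djanga", "django"))    # 1
    print(levenshtein("urlib3", "urllib3"))   # 1
    # Confusion-style squats are usually much further away:
    print(levenshtein("nmap-python", "python-nmap"))  # far larger than 2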

Python has made some efforts to help reduce the typosquatting problem. In 2017, code was added to block new PyPI packages with names that are the same as those of standard library modules. Existing PyPI packages that conflict are not being removed—some are backports of newer functionality to older versions of the language, for example—but they are being audited to determine their validity. Some malware checking has also been added to Warehouse, which is the web application behind PyPI.
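
A minimal sketch of that kind of name check (assuming Python 3.10 or later for sys.stdlib_module_names; the actual Warehouse code differs):

    import sys

    def conflicts_with_stdlib(package_name):
        """True if a proposed name collides with a standard-library
        module, ignoring case and separator differences."""
        normalized = package_name.lower().replace("-", "_")
        return normalized in {m.lower() for m in sys.stdlib_module_names}

    print(conflicts_with_stdlib("asyncio"))   # True: would be blocked
    print(conflicts_with_stdlib("requests"))  # False: fine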

The researchers also noted several different papers about ways to detect and stop malware from being distributed from package repositories such as PyPI. Other language repositories, npm for JavaScript and RubyGems for Ruby, are also considered in these papers. A team largely from the University of Kansas looked specifically at typosquatting defenses [PDF] for npm and PyPI, while a Georgia Tech team "built a sophisticated anti-malware analysis pipeline that repositories could employ to find malicious software, including typosquatters, hiding in repositories" (paper [PDF]). The amusingly titled "Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks" [PDF] analyzes nearly 200 malicious packages from npm, RubyGems, and PyPI to try to extract information that can be used to detect new malware based on the characteristics of the existing malicious packages.

There is, of course, a more draconian solution to the problem: connecting packages with the real-life identities of their maintainers. It is the semi-anonymous nature of the repositories that makes these kinds of attacks easy to perform with little risk of personal repercussions for perpetrating them. There are lots of advantages to the free-for-all nature of repositories like PyPI (as well as GitHub and friends), but there are some downsides to it as well. On the other hand, of course, it is hard to imagine that attackers would not find ways around some ID requirement—or some way to pin their attack on an innocent bystander. Finding ways to thwart these attacks without resorting to that kind of policing is important.

Better vetting for packages is another potential solution, but it is not really practical. The volume of changes flowing into these repositories is so enormous that it rapidly outpaces the limited bandwidth of administrators. Even large commercial companies are unable to handle this problem—the various app stores have been known to distribute malware—so projects like Python cannot even begin to keep up. Automated efforts seem to be slowly getting better, but it would be unsurprising to see more instances of malicious typosquatting in PyPI and elsewhere in the coming years.


Index entries for this article:
Security/Package repositories
Security/Python
Python/Packaging



Further analysis of PyPI typosquatting

Posted Oct 14, 2020 22:35 UTC (Wed) by iabervon (subscriber, #722) [Link] (3 responses)

It might be worth doing some extra scrutiny on well-known packages, and having pip alert the user if a package isn't one of these. If I've decided to try an obscure package, I'm not going to be concerned about it not having been vetted, but if I'm downloading reqeusts or mcok, I'm going to stop and take a second look if PyPI didn't think it was sufficiently notable to verify.

Further analysis of PyPI typosquatting

Posted Oct 15, 2020 8:13 UTC (Thu) by LtWorf (subscriber, #124958) [Link] (2 responses)

Having pip regularly issue loads of warnings would be a great idea to have people ignore warnings.

Further analysis of PyPI typosquatting

Posted Oct 15, 2020 19:33 UTC (Thu) by logang (subscriber, #127618) [Link] (1 responses)

Seems to me like it already does this. I can never tell when PIP fails because it generates so much noise during normal operation. I think that's a real problem.

Further analysis of PyPI typosquatting

Posted Nov 20, 2020 20:38 UTC (Fri) by nix (subscriber, #2304) [Link]

Worse than that, you have a choice of "pass -v --no-binary :all: to pip and sift through huge amounts of noise" or "trust other people's binaries" (or it whines in -v mode about every single binary found on pypi) or "leave -v out, and don't get to see any compiler output or URLs, even to stderr" (so a hostile package could come from anywhere and literally print out I AM DELETING YOUR DISK NOW HA HA HA and you'd never see it).

The tradeoffs in pip's verbosity just seem wrong.

Further analysis of PyPI typosquatting

Posted Oct 15, 2020 18:36 UTC (Thu) by sjj (guest, #2020) [Link] (26 responses)

Commercial companies could solve this for themselves. It just takes money. And people. The latter is the problem, big tech is trying very hard to rid themselves of humans unless they perform piecework as meat robots. See also content moderation hellscape.

Further analysis of PyPI typosquatting

Posted Oct 18, 2020 5:50 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (25 responses)

> See also content moderation hellscape.

OK, since you asked...

Take YouTube. They get 500+ hours of content uploaded per minute,[1] which translates to 30,000+ hours of content per hour. Let's round off the "+" to make the numbers easier.

If you want an employee to watch everything before (or shortly after) it goes live, you need 30,000 employees watching videos at any given time to keep up. Assuming everyone works 40 hours per week, you need 126,000 employees in total (because each person is working 40/168 of the time). At US federal minimum wage (USD $7.25 an hour), they would collectively get paid a grand total of ~$1.9 billion per annum (excluding benefits). YouTube brought in $15 billion in ads revenue in 2019, so this might sound affordable, but that's before costs (and in particular, it is before revenue sharing with the video creators). Alphabet as a whole "only" made $34 billion in net income in 2019, and I doubt investors would be pleased with that number taking a $2B haircut just for YouTube.[2] Finally, this would more than double Alphabet's current (as of year-end 2019) workforce, and such a dramatic increase in employment would create substantial additional expense in the areas of HR, tech support, office space (assuming our interminable pandemic ends some day), etc. I will not attempt to put a number on that, because I don't see a good way of estimating it from publicly-available sources. Suffice it to say, $2B is a rather conservative number.
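
(A quick check of that arithmetic, under the comment's own assumptions:)

    hours_per_minute = 500                   # uploads, per the tubefilter figure
    watching_now = hours_per_minute * 60     # 30,000 hours arriving per hour
    employees = watching_now * 168 / 40      # 40-hour weeks covering 168 hours
    annual_pay = employees * 40 * 52 * 7.25  # US federal minimum wage
    print(int(employees))                    # 126000
    print(round(annual_pay / 1e9, 1))        # ~1.9 (billion USD per year)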

Even then, you still can't have humans process appeals, because all 126,000 employees are spending 100% of their working time watching new videos, and don't have any time left over for watching old videos again. Furthermore, they don't have any time between videos to think about their moderation decisions, so those decisions are going to be hilariously inconsistent and biased. If you want a 100% human-moderated system that doesn't suck, it's obviously going to cost far more than $2B. At that point, you would begin to face very serious questions about whether "this whole YouTube thing" is even worth it, from a financial perspective. Remember, you still need to pay for storage, servers, fiber, peering, etc., none of which are cheap for a service on the scale of YouTube.

So, unless you want to go down that road, you have to let at least some videos onto the service without a human reviewing them. The trouble is, everyone *knows* there's no humans reviewing the videos as they go up. So malicious users and bots will upload whatever garbage they like (spam, porn, copyrighted material,[3] etc.), in huge quantities, and now your options are to either block it with automation as best you can, or don't block it and let it go up (and *maybe* take it down later, when users complain about it). YouTube chose the former.

I'm not saying this is the ideal way for YouTube to operate. It (probably) isn't, and providing a more human touch would probably be a good thing, overall. I'm saying "just throw humans/money at the problem" is not a complete, workable solution all by itself. You need to be more specific: Where should money be spent? Which videos should get reviewed by a human moderator? How much will that realistically cost? Did you remember to assume that users will try to game the system, as soon as they know how it works? How much will it realistically improve the end user experience? How do you measure the latter, anyway?

(Disclaimer: I work for Google, which owns YouTube. I don't work on YouTube, nor does my work substantially intersect with content moderation of any kind. Views are my own, and are based solely on publicly-available financial information.)

[1]: https://www.tubefilter.com/2019/05/07/number-hours-video-...
[2]: https://abc.xyz/investor/static/pdf/2019Q4_alphabet_earni...
[3]: Copyright is a whole *other* clusterfuck, for reasons having less to do with scale and more to do with lawyers and rich, moneyed interests. See https://www.youtube.com/watch?v=1Jwo5qc78QU for further discussion.

Further analysis of PyPI typosquatting

Posted Oct 18, 2020 13:06 UTC (Sun) by farnz (subscriber, #17727) [Link]

Continuing on, and taking YouTube as an example: that $2bn/year number of yours is $1/year per YouTube monthly active user (MAU), and YouTube's ARPU is about $7/year per MAU. At YouTube's scale, that's unpopular with investors but manageable - it'll damage profits, but it can be done by reducing YouTube's ARPU by about 15%. Even if your number is half the cost of human moderation at scale, it'll only reduce ARPU by 30%.

For comparison, ARPU at Vimeo on a MAU basis is around $2/year. If costs scale similarly, you've taken 50% of Vimeo's revenue for active moderation - if your number is an underestimate as above, you've taken 100% of Vimeo's ARPU just on moderation. The effect of forcing human moderation would be to establish YouTube as the only profitable video site. Now, from Alphabet's point of view, that might well be worthwhile; eliminating the competition completely and establishing Alphabet as the only choice for video hosting allows them monopoly pricing. From a social point of view, however, that might not be a good thing, not least because costs don't start to scale with users until you get big; employing enough people for 24/7 moderation when you only have a few thousand users is going to cost a lot more (proportionally) than maintaining 24/7 coverage does for Alphabet.

This makes regulation for YouTube hard to design. "Fixing" 90% of the problem while eliminating all competitors to YouTube is not a great outcome.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 16:03 UTC (Mon) by raven667 (subscriber, #5198) [Link] (22 responses)

> you need 30,000 employees watching videos at any given time to keep up
> At US federal minimum wage (USD $7.25 an hour), they would collectively get paid a grand total of ~$1.9 billion per annum (excluding benefits)

Minimum wage should be at least $15 so double that to $4-5B

> YouTube brought in $15 billion in ads revenue in 2019,
> At that point, you would begin to face very serious questions about whether "this whole YouTube thing" is even worth it, from a financial perspective.

Maybe, but even at that price they are still making money, just less of it. I understand that people might _like_ to make more money, but there should be a limit on how much one can cut corners and disclaim responsibility for outcomes, and if you don't have a sustainable business after taking costs into consideration, then you don't have a sustainable business.

> you have to let at least some videos onto the service without a human reviewing them

? Have to ? YouTube doesn't _have to_ exist at all if their costs (financial and social) outweigh their benefits. I guess this is a fundamental difference in perspective.

In any event it does probably make sense to develop trust in content creators so that it is no longer necessary to review everything, by manually vetting creators in the same way that any publisher chooses what they publish, and not recommending or monetizing content that isn't vetted. My guess is that advertisers would find the service more valuable if they had more confidence on the kind of content their ads would run on, and malicious actors would find less value if they couldn't monetize their content. One could go as far as charging hosting fees for non-vetted, monetized uploads, restricting distribution by default until content is vetted, maybe with two levels of community standards, a high standard for the content that YouTube wants to promote (because they are under no obligation to promote and monetize everything), and a more relaxed standard for personal content.

> > Commercial companies could solve this for themselves. It just takes money. And people. The latter is the problem, big tech is trying very hard to rid themselves of humans unless they perform piecework as meat robots. See also content moderation hellscape.

> I'm saying "just throw humans/money at the problem" is not a complete, workable solution all by itself.

I very much doubt that the OP, in their brief message, intended to convey that literally spending money hiring a bunch of people for content moderation would magically solve everything by itself, and that no other action could be recommended, so that seems a bit of a straw man. All the questions you ask are solvable problems, and spending significant money on content moderation and licensing of content is a prerequisite to solving them (moderation AI has not worked). The fact that Alphabet and YouTube are choosing not to solve them because they like keeping the money and don't mind the pollution they are enabling does not mean that we should just throw up our hands and give up.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 16:49 UTC (Mon) by farnz (subscriber, #17727) [Link] (14 responses)

Doubling his costs reinforces my reply - if you double the cost of content moderation, then smaller platforms like Vimeo will spend over 100% of revenue on moderation. Yes, you've dealt with the limited moderation on YouTube, but by doing so, you've set a standard that no other platform can keep up with, thus ensuring that YouTube's monopoly is permanent.

Now, maybe this is the desired end state - but if I were Alphabet, I would be wary of doing something so anti-competitive without a legislative compulsion. At least if they limit moderation to what Vimeo et al can afford, competition can in theory fix certain classes of error on the part of YouTube's moderators. If you insist on moderation that costs Vimeo more than 100% of revenue, then should YouTube err on the side of caution, there's now no alternative host to run to.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 19:34 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (13 responses)

Wouldn't a for-pay account being required to upload help with this? Why are we assuming that these platforms need to be free-to-use for either side of the equation? Possibly something Patreon-alike where you pay uploaders for access to videos early (LWN-like) or in general (other, larger, news media outlets)?

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 20:01 UTC (Mon) by farnz (subscriber, #17727) [Link] (12 responses)

I chose Vimeo for a reason; they charge all but the smallest uploaders for accounts (unless you pay, you're limited to 5 GiB of video storage and 10 uploads totalling no more than 500 MiB per week), and their ARPU is low enough that they'd have to massively hike prices to afford the sort of moderation that was described.

In a competitive market, it's a fair bet that everyone has set their prices to maximise their own profits (which starts by maximising revenue). If moderation is going to put a majority of players out of business, and leave us all at the mercy of Alphabet and Facebook's decisions, that's not exactly selling the idea…

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 21:36 UTC (Mon) by raven667 (subscriber, #5198) [Link] (11 responses)

If the centralized video service model truly cannot be made to work safely, which I don't actually think is true as they can innovate around the new market fundamentals and cost structure, then content creators can go back to hosting their own videos on their own websites, where they only have to meet the ToS of the hosting provider, which can be more lax than youtube because the incentives and costs are different.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 21:45 UTC (Mon) by pizza (subscriber, #46) [Link] (2 responses)

> where they only have to meet the ToS of the hosting provider

What makes you think that any hosting provider's ToS will be any better? After all, the host is liable for whatever the user "publishes" in this new regime.

No, what will actually happen is that user-generated content will get shut down, hard, across the board. And only folks with deep pockets will be able to afford to play.

Mission accomplished, I guess.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 22:04 UTC (Mon) by raven667 (subscriber, #5198) [Link] (1 responses)

The publisher in this case is the person running the website, not the hardware they are running it on. And yes, if you are renting space you will probably have some ToS (like don't commit crimes) and at some point they may just choose not to do business with you (like Stormfront), but in practice you have a lot more leeway on your own site than publishing under someone else's brand like YouTube.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 0:21 UTC (Tue) by pizza (subscriber, #46) [Link]

> The publisher in this case is the person running the website, not the hardware the are running it on.

You conveniently ignore hosting providers being targeted solely because it's their hardware. Or the network provider.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 8:24 UTC (Tue) by farnz (subscriber, #17727) [Link] (7 responses)

That's again why I chose Vimeo - they are effectively a hosting company that specialises in hosting videos for pay, and not a centralized video service. A regulatory model that requires content moderation on video content means that all hosting companies have to comply when they are used to host video - otherwise how do you distinguish Vimeo and YouTube (hosting videos that are uploaded to the site) from my personal web host (also hosting videos that are uploaded to the site)?

This is why it's a damned hard problem, and simple solutions won't work. Imagine if you were required by law to employ someone to moderate all video content you upload to your personal home server and expose on the Internet; that's the direction that requiring human moderation goes in, because all uploaded videos have been moderated by the uploader already.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 16:32 UTC (Tue) by raven667 (subscriber, #5198) [Link] (6 responses)

> Vimeo - they are effectively a hosting company that specialises in hosting videos for pay, and not a centralized video service

That seems an easier moderation problem to solve, as you have a more formal business relationship with content creators, increasing the costs for bad actors, who will lose their channel and their hosting fees if they violate the content policies.

> Imagine if you were required by law to employ someone to moderate all video content you upload to your personal home server and expose on the Internet;

Why would I imagine that? It sounds like an absurd scenario; you'd already be responsible, as you are the one distributing it. If the content was egregious enough then maybe your ISP would stop doing business with you, or if it was fully illegal then someone might come knocking on your door, but you'd probably be the first point of contact.

At the end of the day, if you have your own space (domain name, hosting, etc) you can have whatever content policy you want, within some pretty wide limits (no CP for example), because there is a certain amount of consent and choice between the community members, but if you want to be licensed by some popular place for distribution, like YouTube, then the policies have to be quite a bit more narrow, because the communities share space.

Does any of this make sense?

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 16:40 UTC (Tue) by farnz (subscriber, #17727) [Link] (5 responses)

Vimeo has the same problems as YouTube with insufficient moderation, and the same difficulties taking them all down. It also has a fraction of the ARPU, so while you can take out bad actors more easily, they find it much harder to detect them due to costs. They also have a different set of bad actor problems, relating to stolen cards and the like (fraud).

And yes, it sounds absurd - but YouTube is also just distributing content, too. If regulation requires all content distributors to have video moderated by persons other than the uploaders, that's exactly what the end result is - you must pay someone to monitor the content you upload to your distribution system, because the regulations don't treat you any differently to YouTube.

At the end of the day, working out how to regulate a content distributor like YouTube in a sensible fashion is a very hard problem, and the easy ways to do it have huge ramifications. I'm not saying that sensible regulation of YouTube is impossible, merely that it's a very hard problem to get right without accidentally setting things up so that YouTube is the monopoly due to regulatory capture.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 17:40 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> And yes, it sounds absurd - but YouTube is also just distributing content, too. If regulation requires all content distributors to have video moderated by persons other than the uploaders, that's exactly what the end result is - you must pay someone to monitor the content you upload to your distribution system, because the regulations don't treat you any differently to YouTube.

Except you're confusing *distributing* with *providing*. If my server only has my stuff I've put on it, I'm a provider. If (as it does) YouTube has loads of stuff from loads of different people, they are a distributor.

To put it a bit differently, it's the difference between a publisher and a printer. If the law can cope with the difference between those two, it can cope with the difference between a private server and a content distributor.

Cheers,
Wol

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 17:45 UTC (Tue) by farnz (subscriber, #17727) [Link]

Under (at least) US federal and UK laws, the publisher and the printer are not treated differently - they are equivalent. A printer gets into a different legal space by not offering the printed product directly to the public, but rather accepting stuff to print from a publisher and sending the printed output back to the publisher for onward handling.

The important difference between a printer and a publisher is thus who they make the material available to - a printer merely makes it available to the party paying for the printing (putting them closer to something like BackBlaze), whereas a publisher makes it available "to the public" in some form (usually for money). If you, as a "provider" in your terminology, provide that material to the general public, then you are also a distributor.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 21:45 UTC (Tue) by raven667 (subscriber, #5198) [Link] (2 responses)

> I'm not saying that sensible regulation of YouTube is impossible, merely that it's a very hard problem to get right without accidentally setting things up so that YouTube is the monopoly due to regulatory capture.

Sure, this is pretty much a textbook case if YouTube is allowed to write the rules so that they are the only ones who can afford to meet them, but one could also write the rules to scale with volume, so that small sites and personal sites aren't held to the same standard as large ones (based on number of videos, viewership counts, subscriber numbers, whatever), because the risk of harm is different: a large userbase is what makes the big platforms attractive for terrorist recruitment and disinfo campaigns, while someone's personal blog with some hosted video doesn't have as big an effect, so it doesn't need the same level of regulation.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 22:41 UTC (Tue) by pizza (subscriber, #46) [Link]

> so that small sites, personal sites, aren't held to the same standard as large ones (based on number of videos, viewership counts, subscriber numbers, whatever)

The only "size" that matters in this context is that of the network pipe.

After all, most youtube (or whatever) videos have only a small handful of views. Until one suddenly doesn't, at which point it's too late to "moderate" it.

Further analysis of PyPI typosquatting

Posted Oct 21, 2020 8:46 UTC (Wed) by farnz (subscriber, #17727) [Link]

That's part of the problem. The risk of harm for any given YouTube video is very, very low - it's just that YouTube has sufficient capacity to stay up if one of those videos goes viral, whereas smaller providers may not.

If you write the rules to scale with volume, you set up a situation where YouTube always complies, whereas your personal VPS suddenly comes out of compliance when a video goes viral, and you go from small volume to high volume overnight. You might not even be awake when the rules that apply to you change - but suddenly, you've got to show compliance because your volume has shot up.

FWIW, I think regulating the recommendation engines (which includes Facebook, Google Search, Bing, YouTube et al - anything that takes user input and attempts to find them something they might want or like) is a better approach. To escape regulation, you simply stop recommending things to people (which isn't hard to do if you're Vimeo or equivalent, relying on paid accounts for the majority of your revenue); otherwise, you are expected to have applied an appropriate degree of moderation to the content you recommend - less if it's directly relevant to user-entered search terms, more moderation if it's a "we think you would like" sort of recommendation.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 21:18 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (6 responses)

> Maybe, but even at that price they are still making money, just less of it.

The public documents do not list net income for YouTube alone, so it's hard to say whether this is actually true. You would have to add subscription income (for services like YouTube TV) and subtract other costs (such as servers, fiber, etc. as I mentioned). I don't think all of those numbers are public.

> ? Have to ? YouTube doesn't _have to_ exist at all if their costs (financial and social) outweigh their benefits. I guess this is a fundamental difference in perspective.

No, it's not a fundamental difference in perspective. I just took "or else YouTube wouldn't exist at all" as a given. Perhaps I should have been more explicit about that. Obviously, "shut down YouTube" will solve all problems arising out of YouTube's existence; I did not realize I needed to say so explicitly.

> In any event it does probably make sense to develop trust in content creators so that it is no longer necessary to review everything, by manually vetting creators in the same way that any publisher chooses what they publish, and not recommending or monetizing content that isn't vetted.

Then you're not really YouTube any more. You're Netflix, or Hulu. IMHO that's in the same territory as YouTube not existing, since it would be a fundamentally different service, and one which other people are already selling.

(For the record, YouTube actually does sell licensed content alongside its free or ad-supported user-generated content.)

> I very much doubt that the OP, in their brief message, intended to convey that literally spending money hiring a bunch of people for content moderation would magically solve everything by itself, and that no other action could be recommended, so that seems a bit of a straw man.

You're probably right. But I hear this argument so often, from so many different people, and so rarely with any kind of nuance, that I felt the need to make my point anyway.

> All the questions you ask are solvable problems

Yes, they probably are.

> and spending significant money on content moderation and licensing of content is a pre-requisite to solving them (moderation AI has not worked)

I agree that AI is no better than humans, and that it is actively causing problems right now. However, AI that is properly designed should not be too much *worse* than humans, in statistical aggregate (because if your machines are statistically different from your humans, then your ML algorithm is poorly-trained). In my view, the real problem is that, when the AI tells you "no," there is often no one to escalate to, whereas when a human tells you "no," you very often can escalate to another human. YouTube ought to fix that problem, and I agree that the solution probably should involve hiring more humans in some capacity. Of course, if you give every user the right to escalate to a human every time the AI says "no," then you might as well escalate to the human automatically, at which point you're going to be hiring a lot of humans. So it probably ends up being more complicated than that.

(I'm deliberately not expressing an opinion on whether YouTube's AI is "properly designed" because I have not worked on it and have no idea whether this is the case.)

> the fact that Alphabet and YouTube are choosing not to solve them because they like keeping the money and don't mind the pollution they are enabling, does not mean that we should just throw up our hands and give up.

I didn't say that we should give up. But, as I mentioned, your proposal would take YouTube out of the realm of a user-generated content service and into the realm of a curated content service. Maybe user-generated content just doesn't work, or maybe there are other solutions that would allow it to continue existing. I really don't know the best way out of this hole, but I do know that it's a harder problem than is usually appreciated. That was all I wanted to convey; I genuinely *don't know* what we should do about this hard problem.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 22:00 UTC (Mon) by raven667 (subscriber, #5198) [Link] (5 responses)

> Then you're not really YouTube any more. You're Netflix, or Hulu. IMHO that's in the same territory as YouTube not existing, since it would be a fundamentally different service, and one which other people are already selling.

If you really think that effective content moderation would fundamentally destroy the service - that hosting harmful and fraudulent content is so core to the business model that kicking it off the platform would leave them unable to differentiate from Netflix - well then I guess you are very sympathetic to the current management of YouTube, who also are unwilling to moderate, but you are fundamentally arguing for it to be shut down if the negative social costs of its operation are accounted for.

> However, AI that is properly designed should not be too much *worse* than humans, in statistical aggregate (because if your machines are statistically different from your humans, then your ML algorithm is poorly-trained).

I'm just going to assume that the AIs which moderate content and recommend videos are competently built, and that the lack of good results is because the approach is fundamentally wrong and the incentives are wrong (prioritizing watch time and revenue over quality and safety)

> Maybe user-generated content just doesn't work

I mean, without setting and enforcing some standard it really doesn't, every place will eventually turn into 4chan/8kun or whatever if you don't reliably, aggressively and consistently kick people off when they misbehave. Those people who are no longer welcome can always go make their own service (eg. Parler or whatever)

> That was all I wanted to convey; I genuinely *don't know* what we should do about this hard problem.

It's probably going to require changes to the terms of service, and it's probably going to make less money; once you accept that as a possibility, it's easier to conceive of ways to reduce the size of the problem. Fixing this without changing anything is impossible.

Further analysis of PyPI typosquatting

Posted Oct 19, 2020 23:15 UTC (Mon) by Wol (subscriber, #4433) [Link]

> > However, AI that is properly designed should not be too much *worse* than humans, in statistical aggregate (because if your machines are statistically different from your humans, then your ML algorithm is poorly-trained).

> I'm just going to assume that the AIs which moderate content and recommend videos are competently built, and that the lack of good results is because the approach is fundamentally wrong and the incentives are wrong (prioritizing watch time and revenue over quality and safety)

Unfortunately, at present, it's a case of "what humans can design, humans can circumvent". All the evidence is that humans will always be able to game the system. Look at the current worries over deep fakes. An un-gameable system is currently beyond our capabilities ...

Cheers,
Wol

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 0:18 UTC (Tue) by pizza (subscriber, #46) [Link] (3 responses)

> that hosting harmful and fraudulent content is so core to the business model

So, pray tell, what is "harmful" content? And why is your definition better than, say, mine?

Or why is the United States' definition better than, say, China's?

Even "fraudulent content" is completely okay most of the time when it's called "advertising" or even "propaganda"

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 16:06 UTC (Tue) by raven667 (subscriber, #5198) [Link] (2 responses)

> So, pray tell, what is "harmful" content? And why is your definition better than, say, mine?

Ok, so you are saying that there is no difference between good and bad things, no way to make value judgements, and no way to hold people or organizations accountable for harm. That's a real galaxy brain take, but the counter example is all of human history, laws, regulation, ethics, community standards, etc.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 16:32 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

Some of what I consider harmless (or even praiseworthy), the Shūrā-ye Negahbān or the Republican Party would find harmful, and vice versa. (And I probably don't even have to reach that far along the political spectrum, either.)

So sadly, I have to agree that "harmful" is unlikely to be a satisfactory descriptor with which to label the disfavoured content.

Further analysis of PyPI typosquatting

Posted Oct 20, 2020 18:33 UTC (Tue) by pizza (subscriber, #46) [Link]

> That's a real galaxy brain take, but the counter example is all of human history, laws, regulation, ethics, community standards, etc.

Indeed, "all of human history" has plenty of examples of brutal warfare between communities that didn't agree on laws, regulations, ethics, or standards.

Further analysis of PyPI typosquatting

Posted Oct 26, 2020 12:59 UTC (Mon) by flussence (guest, #85566) [Link]

There's a time-tested and sufficiently Silicon Valleyish method of addressing this problem that other sites used to cope with exponentially growing userbases a decade before YouTube existed: give users tools and subtle encouragement (YT just needs to push the latter more) to snitch on each other to moderators anonymously.
They'll work tirelessly for nothing but a dopamine buzz in return. For free!

Further analysis of PyPI typosquatting

Posted Oct 15, 2020 19:36 UTC (Thu) by amarao (guest, #87073) [Link] (6 responses)

Wouldn't a GPG web of trust solve this problem? If a package is signed by a key that is trusted by many, it's good. If it's totally alone with zero signatures, one should inspect the package before installing.

If you notice a good package, sign its key with low trust. If you know the author, give them medium or high trust.

Further analysis of PyPI typosquatting

Posted Oct 16, 2020 0:32 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (4 responses)

Maybe something like crev then? https://github.com/crev-dev/crev

Further analysis of PyPI typosquatting

Posted Oct 16, 2020 3:43 UTC (Fri) by pabs (subscriber, #43278) [Link] (3 responses)

Crowdsourced code review is definitely something awesome that the world needs. I imagine it would be an incredibly complex thing to do correctly though.

Further analysis of PyPI typosquatting

Posted Oct 16, 2020 15:39 UTC (Fri) by smoogen (subscriber, #97) [Link] (1 responses)

Crowdsourcing sounds good on paper when people have a lot of 'free' time that they enjoy spending on reviewing tidbits of other people's code. What tends to happen, however, is that most people who would like to do it eventually have other things that take up their time, or just get tired of it quickly, since it means dealing with prickly people. Instead, the people who stick around are people paid to 'review', and the groups with the largest interest in getting code reviewed are the criminal groups, who can pay people to 'review' things just not in the interest of everyone else.

In the end it isn't as much 'who watches the watchmen?' as much as 'who pays the watchmen.'

Further analysis of PyPI typosquatting

Posted Oct 17, 2020 19:38 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

You can trust reviewers individually in crev I think. So if you suspect someone of astroturfing reviews, feel free to ignore them.

And sure, doing it takes a lot of time, but sometimes you're in there to patch something anyway. Why do we need infrastructure and businesses for that and not the code itself? Why not review it as you go (especially for smaller libraries)? In any case, use of a library over time should be considered, as in "I've used this library for X years and it's been rock solid for me". Kind of like trusting GPG keys used consistently by pseudonymous people on lists.

Further analysis of PyPI typosquatting

Posted Oct 18, 2020 13:23 UTC (Sun) by kushal (subscriber, #50806) [Link]

For the upcoming securedrop-workstation project we are doing this at https://github.com/freedomofpress/securedrop-debian-packa...
There is also a very low-interaction https://mail.python.org/mailman3/lists/diff-review.python... list.

My related talk at PyCon US 2019 can be found at https://www.youtube.com/watch?v=wRHi8Ui5vWA

Further analysis of PyPI typosquatting

Posted Oct 16, 2020 16:15 UTC (Fri) by amacater (subscriber, #790) [Link]

Web of trust - it's hard enough with a thousand people who more or less know each other and can vouch for each other's code - and then you get people who can't use GPG.
Congratulations, you've just re-invented Debian - only without the minimal barrier of competency and (relative) agreement on common values :(

Also test against a spelling corrector

Posted Oct 15, 2020 20:46 UTC (Thu) by davecb (subscriber, #1574) [Link]

A variant of the same problem, that of detecting putative typos against the words in a particular dictionary, is addressed by spelling correctors.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds