Further analysis of PyPI typosquatting
We have looked at the problem of confusingly named packages in repositories such as the Python Package Index (PyPI) before. In general, malicious actors create these packages with names that can be mistaken for those of legitimate packages in the repository in a form of "typosquatting". Since our 2016 article, the problem has not gone away—no surprise—but there has been some recent analysis of it, as well as some efforts to combat it.
On the IQT blog, John Speed Meyers and Bentz Tozer recently posted some analysis they had done to quantify PyPI typosquatting attacks and to categorize them. They started by looking at the examples of actual attacks against PyPI users from 2017 to 2020; they found 40 separate instances over that time span. The criteria used were that the package had a name similar to another in PyPI, contained malware, and was identified and removed from the repository.
They identified two types of package typosquatting: misspelling and confusion. The first type relies on package names that are slightly misspelled, djanga instead of django or urlib3 instead of urllib3. The confusion attacks rely upon changing the order of the "words" in the name (e.g. nmap-python rather than python-nmap), removing or changing separators (e.g. easyinstall vs. easy_install), or otherwise changing the elements of the name (e.g. crypt/crypto, python-sqlite/pysqlite). Of the 40 attacks identified, 18 were of the misspelling variety, while 24 were confusing—two were both, which accounts for the overlap.
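The first two kinds of confusion described above (separator changes and word reordering) are mechanical enough to sketch in a few lines of Python. The function names `tokens` and `confusion_kind` are hypothetical, not something from the IQT post, and the sketch deliberately ignores the third kind (changed name elements such as crypt/crypto), which has no simple rule.

```python
import re
from typing import List, Optional


def tokens(name: str) -> List[str]:
    """Split a package name into lowercase words on '-', '_' and '.'."""
    return [t for t in re.split(r"[-_.]+", name.lower()) if t]


def confusion_kind(candidate: str, target: str) -> Optional[str]:
    """Return 'separator', 'word-order', or None (hypothetical labels)."""
    if candidate.lower() == target.lower():
        return None  # same name, nothing to flag
    cand, targ = tokens(candidate), tokens(target)
    # Same characters once separators are dropped: easyinstall vs. easy_install
    if "".join(cand) == "".join(targ):
        return "separator"
    # Same words in a different order: nmap-python vs. python-nmap
    if sorted(cand) == sorted(targ):
        return "word-order"
    return None
```

A pair like crypt/crypto falls through to `None` here; catching it would need a different signal, such as edit distance.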
The blog post noted that William Bengston had done some research on one type of confusion attack in particular: separator changes. In July 2018, Bengston registered around 1,100 package names on PyPI by eliminating any separators (i.e. - or _) in the names of the top 10,000 packages on PyPI. The packages he registered would simply cause an error when they were installed; that error would redirect the user to the correct package name.
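As a rough illustration of the separator-removal step, the following sketch derives the separator-free variant of each popular name and keeps only the variants that differ from the original and are not already taken. It is an assumption about how such a list could be built, not Bengston's actual tooling; `separator_free_variants` and both of its parameters are hypothetical.

```python
import re
from typing import Dict, Iterable


def separator_free_variants(
    popular_names: Iterable[str], existing_names: Iterable[str]
) -> Dict[str, str]:
    """Map each unclaimed separator-free variant to the name it imitates."""
    existing = {n.lower() for n in existing_names}
    variants = {}
    for name in popular_names:
        lowered = name.lower()
        squashed = re.sub(r"[-_]", "", lowered)
        # Skip names with no separators and variants that are already registered.
        if squashed != lowered and squashed not in existing:
            variants[squashed] = name
    return variants
```

Given the popular names `python-nmap`, `easy_install`, and `requests`, with `easyinstall` already registered, only `pythonnmap` would be returned as a candidate to register defensively.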
The IQT researchers found that separator attacks made up a small portion of the typosquatting incidents they looked at, though. So the number of potential systems that have fallen prey to the typosquatting problem is likely far higher. "Separator attacks account for only three of the 26 confusion attacks in this dataset though, suggesting that Bengston’s already frightening estimate of PyPI user susceptibility to typosquatting is a lower bound of overall user susceptibility to typosquatting attacks."
As might be guessed, typosquatters concentrated their efforts on the most popular PyPI packages. The IQT researchers found that 28% of their instances were typosquatting the top 50 most popular PyPI packages and more than half of the attacks (63%) were against the top 500.
Finding typosquatting attacks, or preventing those kinds of packages from being created in the first place, is obviously worth doing. The misspelling attacks are relatively easy to detect using the Levenshtein distance between two names. That distance is the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another; the misspellings that the researchers found have a Levenshtein distance of one (15 attacks) or two (3 attacks). While there may be perfectly valid reasons for a package to have a name that is only slightly different from that of an existing package, a small distance could be used as a trigger for administrator scrutiny. The confusion category, on the other hand, generally had distances of three or more (17 of 24), making those attacks more difficult to detect automatically.
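For readers who want to see the distance check concretely, here is a minimal sketch of the standard dynamic-programming Levenshtein computation, plus a hypothetical `near_misses` helper that flags popular names within a small distance of a proposed name. The threshold of two simply mirrors the misspelling distances reported above; it is an illustrative assumption, not a policy that PyPI is known to apply.

```python
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def near_misses(
    candidate: str, popular_names: Iterable[str], max_distance: int = 2
) -> List[Tuple[str, int]]:
    """Popular names within max_distance of candidate (hypothetical helper)."""
    hits = []
    for name in popular_names:
        distance = levenshtein(candidate.lower(), name.lower())
        if 0 < distance <= max_distance:
            hits.append((name, distance))
    return hits
```

Both djanga/django and urlib3/urllib3 come out at a distance of one, while a reordering such as nmap-python/python-nmap is well above two, which is why a simple threshold misses much of the confusion category.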
Python has made some efforts to help reduce the typosquatting problem. In 2017, code was added to block new PyPI packages with names that are the same as those of standard library modules. Existing PyPI packages that conflict are not being removed—some are backports of newer functionality to older versions of the language, for example—but they are being audited to determine their validity. Some malware checking has also been added to Warehouse, which is the web application behind PyPI.
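The standard-library name check can be approximated locally. The sketch below is not Warehouse's implementation; it only assumes an interpreter that provides `sys.stdlib_module_names` (added in Python 3.10), and the function name `shadows_stdlib` is hypothetical.

```python
import sys


def shadows_stdlib(name: str) -> bool:
    """True if a proposed package name matches a standard library module.

    Relies on sys.stdlib_module_names (Python 3.10+); on older
    interpreters the set is empty and this always returns False.
    """
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    # Module names cannot contain '-', so compare with '_' instead.
    return name.lower().replace("-", "_") in stdlib
```

On a recent interpreter, `shadows_stdlib("json")` is true and `shadows_stdlib("requests")` is false.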
The researchers also noted several different papers about ways to detect and stop malware from being distributed from package repositories such as PyPI. Other language repositories, npm for JavaScript and RubyGems for Ruby, are also considered in these papers. A team largely from the University of Kansas looked specifically at typosquatting defenses for npm and PyPI, while a Georgia Tech team "built a sophisticated anti-malware analysis pipeline that repositories could employ to find malicious software, including typosquatters, hiding in repositories". The amusingly titled "Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks" analyzes nearly 200 malicious packages from npm, RubyGems, and PyPI to try to extract information that can be used to detect new malware based on the characteristics of the existing malicious packages.
There is, of course, a more draconian solution to the problem: connecting packages with the real-life identities of their maintainers. It is the semi-anonymous nature of the repositories that makes these kinds of attacks easy to perform with little risk of personal repercussions for perpetrating them. There are lots of advantages to the free-for-all nature of repositories like PyPI (as well as GitHub and friends), but there are some downsides to it as well. On the other hand, of course, it is hard to imagine that attackers would not find ways around some ID requirement—or some way to pin their attack on an innocent bystander. Finding ways to thwart these attacks without resorting to that kind of policing is important.
Better vetting for packages is another potential solution, but it is not really practical. The number of changes that goes into these repositories is so enormous that it rapidly outpaces the limited bandwidth of administrators. Even large commercial companies are unable to handle this problem—the various app stores have been known to provide malware—so projects like Python cannot even begin to keep up. It would seem that automated efforts are slowly getting somewhat better, but it would be unsurprising to see more instances of malicious typosquatting in PyPI and elsewhere in coming years.
