
Analyzing PyPI package downloads¶
This section covers how to use the public PyPI download statistics dataset to learn more about downloads of a package (or packages) hosted on PyPI. For example, you can use it to discover the distribution of Python versions used to download a package.
Background¶
PyPI does not display download statistics for a number of reasons: [1]
Inefficient to make work with a Content Distribution Network (CDN): Download statistics change constantly. Including them in project pages, which are heavily cached, would require invalidating the cache more often, and reduce the overall effectiveness of the cache.
Highly inaccurate: A number of things prevent the download counts from being accurate, some of which include:
pip’s download cache (lowers download counts)
Internal or unofficial mirrors (can either raise or lower download counts)
Packages not hosted on PyPI (for comparison’s sake)
Unofficial scripts or attempts at download count inflation (raises download counts)
Known historical data quality issues (lowers download counts)
Not particularly useful: Just because a project has been downloaded a lot doesn’t mean it’s good; similarly, just because a project hasn’t been downloaded a lot doesn’t mean it’s bad!
In short, because its value is low for various reasons, and the tradeoffs required to make it work are high, it has not been an effective use of limited resources.
Public dataset¶
As an alternative, the Linehaul project streams download logs from PyPI to Google BigQuery [2], where they are stored as a public dataset.
Data schema¶
Linehaul writes an entry in a table for each download. The table contains information about what file was downloaded and how it was downloaded. Some useful columns from the table schema include:
Column | Description | Examples |
---|---|---|
timestamp | Date and time | |
file.project | Project name | |
file.version | Package version | |
details.installer.name | Installer | pip, bandersnatch |
details.python | Python version | |
Useful queries¶
Run queries in the BigQuery web UI by clicking the “Compose query” button.
Note that the rows are stored in a partitioned table, which helps limit the cost of queries. These example queries analyze downloads from recent history by filtering on the timestamp column.
Counting package downloads¶
The following query counts the total number of downloads for the project “pytest”.
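The query text itself does not survive in this copy of the page. As a minimal sketch, the Python snippet below builds one plausible BigQuery standard-SQL query as a string. The table name `bigquery-public-data.pypi.file_downloads`, the helper name `count_downloads_query`, and the 30-day window are assumptions made for illustration; none of them appear in the text above.

```python
import re

# Assumption: the public table name is not given in the text above.
TABLE = "bigquery-public-data.pypi.file_downloads"


def count_downloads_query(project: str, days: int = 30) -> str:
    """Return SQL that counts downloads of `project` over the last `days` days."""
    # Reject anything that is not a plain project name, since the value is
    # interpolated directly into the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    return (
        "SELECT COUNT(*) AS num_downloads\n"
        f"FROM `{TABLE}`\n"
        f"WHERE file.project = '{project}'\n"
        "  -- Filter on the timestamp column to scan only recent partitions\n"
        "  AND DATE(timestamp)\n"
        f"    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL {int(days)} DAY)\n"
        "    AND CURRENT_DATE()"
    )


print(count_downloads_query("pytest"))
```

The printed SQL can be pasted into the BigQuery web UI. Keeping the date filter in place matters more than the exact window, because it is what limits how much of the partitioned data is scanned.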
To only count downloads from pip, filter on the details.installer.name column.
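A pip-only count might look like the sketch below, which adds a condition on the installer column (details.installer.name in the schema table above). The table name and the 30-day window are again assumptions, not values taken from the text.

```python
# Assumption: the public table name is not given in the text above.
TABLE = "bigquery-public-data.pypi.file_downloads"

# Counts downloads of "pytest" made by pip over the last 30 days.
PIP_DOWNLOADS_QUERY = f"""\
SELECT COUNT(*) AS num_downloads
FROM `{TABLE}`
WHERE file.project = 'pytest'
  AND details.installer.name = 'pip'
  -- Filter on the timestamp column to scan only recent partitions
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    AND CURRENT_DATE()
"""

print(PIP_DOWNLOADS_QUERY)
```

Swapping `'pip'` for another installer name from the schema table, such as `'bandersnatch'`, would count downloads made by that tool instead.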
Package downloads over time¶
To group downloads by month, truncate the timestamp column to the month with a date-truncation function (for example, BigQuery's DATE_TRUNC). Filtering on the same column also reduces query cost.
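One way to write such a monthly grouping is sketched below, using BigQuery's `DATE_TRUNC` on the date part of the timestamp. The table name, the six-month window, and the choice of `DATE_TRUNC` are assumptions for illustration. The figures in the table that follows come from the source and were not produced by running this sketch.

```python
# Assumption: the public table name is not given in the text above.
TABLE = "bigquery-public-data.pypi.file_downloads"

# Monthly download counts for "pytest", newest month first.
MONTHLY_DOWNLOADS_QUERY = f"""\
SELECT
  COUNT(*) AS num_downloads,
  DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
FROM `{TABLE}`
WHERE file.project = 'pytest'
  -- Only scan roughly the last 6 months of partitions
  AND DATE(timestamp)
    BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
    AND CURRENT_DATE()
GROUP BY `month`
ORDER BY `month` DESC
"""

print(MONTHLY_DOWNLOADS_QUERY)
```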
num_downloads | month |
---|---|
1956741 | 2018-01-01 |
2344692 | 2017-12-01 |
1730398 | 2017-11-01 |
2047310 | 2017-10-01 |
1744443 | 2017-09-01 |
1916952 | 2017-08-01 |
Python versions over time¶
Extract the Python version from the details.python column. Warning: This query processes over 500 GB of data.
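As a hedged sketch, the extraction could be done with a regular expression that keeps only the major.minor part of the full version string in the Python version column (details.python in the schema table above). The table name, the pattern, the six-month window, and the helper `major_minor` are assumptions for illustration. The helper mirrors the same pattern locally so its effect can be checked without running a query.

```python
import re

# Assumption: the public table name is not given in the text above.
TABLE = "bigquery-public-data.pypi.file_downloads"

# Matches the leading "major.minor" part of a version such as "3.6.4".
VERSION_PATTERN = r"[0-9]+\.[0-9]+"

PYTHON_VERSIONS_QUERY = (
    "SELECT\n"
    f'  REGEXP_EXTRACT(details.python, r"{VERSION_PATTERN}") AS python,\n'
    "  COUNT(*) AS num_downloads\n"
    f"FROM `{TABLE}`\n"
    "WHERE\n"
    "  -- Only scan roughly the last 6 months of partitions\n"
    "  DATE(timestamp)\n"
    "    BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)\n"
    "    AND CURRENT_DATE()\n"
    "GROUP BY python\n"
    "ORDER BY num_downloads DESC"
)


def major_minor(full_version):
    """Local equivalent of the extraction: '3.6.4' -> '3.6', no match -> None."""
    match = re.search(VERSION_PATTERN, full_version or "")
    return match.group(0) if match else None


print(major_minor("3.6.4"))   # 3.6
print(major_minor("2.7.12"))  # 2.7
print(major_minor(None))      # None
print(PYTHON_VERSIONS_QUERY)
```

Rows where no version can be extracted are grouped under null, which is consistent with the null row in the results below. Because this query has no project filter, it scans every download in the window, hence the data-volume warning.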
python | num_downloads |
---|---|
3.7 | 12990683561 |
3.6 | 9035598511 |
2.7 | 8467785320 |
3.8 | 4581627740 |
3.5 | 2412533601 |
null | 1641456718 |
Caveats¶
In addition to the caveats listed in the background above, Linehaul suffered from a bug which caused it to significantly under-report download statistics prior to July 26, 2018. Downloads before this date are proportionally accurate (e.g. the percentage of Python 2 vs. Python 3 downloads) but total numbers are lower than actual by an order of magnitude.