Visualizing an Individual Percentile Against a Population in R

packageRank: compute and visualize package download counts and rank percentiles

'packageRank' is an R package that helps put package download counts into context. It does so via two core functions, cranDownloads() and packageRank(), a set of filters that reduces package download count inflation, and other assorted functions that help you assess interest in your package.

I discuss these topics in four sections; a fifth discusses package-related issues.

  • I Package Download Counts describes how cranDownloads() extends the functionality of cranlogs::cran_downloads() by adding a more convenient interface and by providing a generic R plot() method that makes visualization easy.
  • II Package Download Rank Percentiles describes how packageRank() uses rank percentiles, a nonparametric statistic that tells you the percent of packages with fewer downloads, to help you see how your package is doing relative to all other CRAN packages.
  • III Package Download Filters describes the functions that filter out software and behavioral artifacts from the download logs that contribute to inflated download counts.
  • IV Other Functions describes six other 'packageRank' functions that help you better understand interest in your package.
  • V Notes discusses issues associated with country code top-level domains, memoization, time zone effects, and the internet connection time out problem.

getting started

To install 'packageRank' from CRAN:

install.packages("packageRank")

To install the development version from GitHub:

# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)

Note that 'packageRank' has two upstream online dependencies: 1) RStudio's CRAN package download logs, which record traffic to the "0-Cloud" mirror at cloud.r-project.org (formerly RStudio's CRAN mirror); and 2) Gábor Csárdi's 'cranlogs' R package, which is an interface to a database that computes R and R package download counts using the aforementioned logs.

When everything is working right, the CRAN package download logs for the previous day will be posted by 17:00 UTC and the results for 'cranlogs' will be available soon after. However, occasionally issues with "today's" data can emerge due to the downstream nature of the dependencies (illustrated below).

CRAN Download Logs --> 'cranlogs' --> 'packageRank'

If there's a problem with the logs (e.g., they're not posted on time), both 'cranlogs' and 'packageRank' will be affected. Here, depending on the function, you'll see things like an unexpected zero count(s) for your package(s) (actually, it's zero downloads for all of CRAN), data from "yesterday", or a "Log is not (yet) on the server" error message.

If there's a problem with 'cranlogs' but not with the logs, only packageRank::cranDownloads() will be affected (the zero downloads problem). All the other 'packageRank' functions should work since they directly access the logs.

Usually, these errors resolve themselves the next time the underlying scripts are run (typically "tomorrow", if not sooner).

I - computing package download counts

cranDownloads() uses all the same arguments as cranlogs::cran_downloads():

cranlogs::cran_downloads(packages = "HistData")
>         date count  package
> 1 2020-05-01   338 HistData

The only difference is that cranDownloads() adds four features:

i) "spell check" for package names

cranDownloads(packages = "GGplot2")
## Error in cranDownloads(packages = "GGplot2") :
##   GGplot2: misspelled or not on CRAN.

cranDownloads(packages = "ggplot2")
>         date count cumulative package
> 1 2020-05-01 56357      56357 ggplot2

This also works for inactive or "retired" packages in the Archive:

cranDownloads(packages = "vr")
## Error in cranDownloads(packages = "vr") :
##  vr: misspelled or not on CRAN/Archive.

cranDownloads(packages = "VR")
>         date count cumulative package
> 1 2020-05-01    11         11      VR

ii) two additional date formats

With cranlogs::cran_downloads(), you specify a time frame using the from and to arguments. The downside of this is that you must use the "yyyy-mm-dd" date format. For convenience's sake, cranDownloads() also allows you to use "yyyy-mm" or "yyyy" (yyyy also works).

"yyyy-mm"

Let's say you want the download counts for 'HistData' for February 2020. With cranlogs::cran_downloads(), you'd have to type out the whole date and remember that 2020 was a leap year:

cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01", to = "2020-02-29")

With cranDownloads(), you can just specify the year and month:

cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")

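To make concrete what the "yyyy-mm" shorthand saves you, here is a minimal base R sketch that expands a year-month string into the first and last day of that month. The helper monthRange() is hypothetical and is not part of 'packageRank'; it only illustrates the date arithmetic, not how cranDownloads() actually resolves dates.

```r
# Hypothetical helper (not a 'packageRank' function): expand "yyyy-mm" into
# the first and last calendar day of that month as "yyyy-mm-dd" strings.
monthRange <- function(yyyy.mm) {
  first <- as.Date(paste0(yyyy.mm, "-01"))
  # Stepping one month ahead gives the first day of the following month.
  next.first <- seq(first, by = "month", length.out = 2)[2]
  # The day before that is the last day of the requested month.
  format(c(first, next.first - 1))
}

monthRange("2020-02")
# [1] "2020-02-01" "2020-02-29"
```

Because the last day is derived from the calendar rather than typed by hand, leap years take care of themselves.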
"yyyy" or yyyy

Let's say you want the year-to-date download counts for 'rstan'. With cranlogs::cran_downloads(), you'd type something like:

cranlogs::cran_downloads(packages = "rstan", from = "2022-01-01", to = Sys.Date() - 1)

With cranDownloads(), you can use:

cranDownloads(packages = "rstan", from = "2022")

or

cranDownloads(packages = "rstan", from = 2022)

iii) check date validity

cranDownloads(packages = "HistData", from = "2019-01-15", to = "2019-01-35")
## Error in resolveDate(to, type = "to") : Not a valid date.

iv) cumulative count

cranDownloads(packages = "HistData", when = "last-week")
>         date count cumulative  package
> 1 2020-05-01   338        338 HistData
> 2 2020-05-02   259        597 HistData
> 3 2020-05-03   321        918 HistData
> 4 2020-05-04   344       1262 HistData
> 5 2020-05-05   324       1586 HistData
> 6 2020-05-06   356       1942 HistData
> 7 2020-05-07   324       2266 HistData

visualizing package download counts

cranDownloads() makes visualizing package downloads easy. Just use plot():

plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))

If you pass a vector of package names for a single day, plot() returns a dotchart:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020-03-01", to = "2020-03-01"))

If you pass a vector of package names for multiple days, plot() uses ggplot2 facets:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"))

If you want to plot those data in a single frame, set multi.plot = TRUE:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), multi.plot = TRUE)

If you want to plot those data in separate plots but use the same scale, set graphics = "base" (you'll be prompted for each plot):

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), graphics = "base")

If you want to do the above on separate, independent scales, set same.xy = FALSE:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), graphics = "base", same.xy = FALSE)

unit of observation

If you want to visualize the data from a unit of observation other than the default ("day"), pass "month" or "year" to the unit.observation argument. For example, below is the plot for the daily downloads of 'HistData' from January 2021 through December 15, 2021.

plot(cranDownloads(packages = "HistData", from = "2021", to = "2021-12-15"))

Here is the plot for the same data aggregated by month:

plot(cranDownloads(packages = "HistData", from = "2021", to = "2021-12-15"), unit.observation = "month")

There are three things to note with these aggregated plots. First, if an aggregate observation is still in-progress (e.g., in the plot above, we've only seen the first half of December), that observation is split into two separate points: 1) a "grayed-out" point for the in-progress or observed total (the black empty square) and 2) a highlighted point for the projected or estimated total (the red empty circle). The estimate is based on how much of the unit of observation is completed. In the plot above, there are 2,708 downloads between December 1 and December 15. Thus, the estimate for the whole month is 5,597 or 31 / 15 * 2708. Second, all other points represent the total count at the end of an aggregate period. For example, the first solid point, on the far left, records the total download count for the month of January and is plotted on January 31. Third, if you include a smoother, using the smooth = TRUE argument, the curve only uses complete, not in-progress, data.
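The projection for an in-progress month is simple proportional scaling. The sketch below reproduces the arithmetic from the December example in base R; projectMonth() is a hypothetical helper written for illustration, not the function the package uses internally.

```r
# Hypothetical helper: scale an in-progress monthly total up to a full month,
# assuming downloads accrue at a constant daily rate.
projectMonth <- function(observed.count, last.obs.date) {
  last.obs.date <- as.Date(last.obs.date)
  days.elapsed <- as.integer(format(last.obs.date, "%d"))
  # Last day of the month = day before the first day of the next month.
  first <- as.Date(format(last.obs.date, "%Y-%m-01"))
  last <- seq(first, by = "month", length.out = 2)[2] - 1
  days.in.month <- as.integer(format(last, "%d"))
  round(observed.count * days.in.month / days.elapsed)
}

projectMonth(2708, "2021-12-15")
# [1] 5597
```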

logarithm of download counts

To use the base 10 logarithm of the download count in a plot, set log.count = TRUE:

plot(cranDownloads(packages = "HistData", from = "2021", to = "2021-12-15"), log.count = TRUE)

packages = NULL

cranlogs::cran_downloads(packages = NULL) computes the total number of package downloads from CRAN. You can plot these data by using:

plot(cranDownloads(from = 2019, to = 2019))

packages = "R"

cranlogs::cran_downloads(packages = "R") computes the total number of downloads of the R application (note that you can only use "R" or a vector of package names, not both!). You can plot these data by using:

plot(cranDownloads(packages = "R", from = 2019, to = 2019))

If you want the total count of R downloads, set r.total = TRUE:

plot(cranDownloads(packages = "R", from = 2019, to = 2019), r.total = TRUE)

smoothers and confidence intervals

To add a lowess smoother to your plot, use smooth = TRUE:

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"), smooth = TRUE)

With graphs that use 'ggplot2', se = TRUE will add confidence intervals:

plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"), from = "2020", to = "2020-03-20"), smooth = TRUE, se = TRUE)

package and R release dates

To annotate a graph with a package's release dates:

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"), package.version = TRUE)

To annotate a graph with R release dates:

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"), r.version = TRUE)

plot growth curves (cumulative download counts)

To plot growth curves, set statistic = "cumulative":

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), statistic = "cumulative", multi.plot = TRUE, points = FALSE)

population plot

To visualize a package's downloads relative to "all" other packages over time:

plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"), population.plot = TRUE)

This longitudinal view of package downloads plots the date (x-axis) against the base 10 logarithm of the selected package's downloads (y-axis). To get a sense of how the selected package's performance stacks up against all other packages, a set of smoothed curves representing a stratified random sample of packages is plotted in gray in the background (the "typical" pattern of downloads on CRAN for the selected time period). Specifically, within each 5% interval of rank percentiles (e.g., 0 to 5, 5 to 10, 95 to 100, etc.), a random sample of 5% of packages is selected and tracked.
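The stratified sampling idea can be sketched in base R as follows. This is only an illustration on simulated counts: the data, the seed, and the "at least one package per interval" rule are my assumptions, and the code is not taken from the package.

```r
set.seed(2020)  # for a reproducible illustration

# Simulated download counts for 2,000 hypothetical packages.
counts <- setNames(round(rlnorm(2000, meanlog = 3, sdlog = 1.5)),
                   paste0("pkg", seq_len(2000)))

# Rank percentile: percent of packages with strictly fewer downloads.
pct <- 100 * (rank(counts, ties.method = "min") - 1) / length(counts)

# Assign each package to a 5% interval: [0, 5), [5, 10), ..., [95, 100].
bin <- cut(pct, breaks = seq(0, 100, by = 5), right = FALSE,
           include.lowest = TRUE)

# Within each non-empty interval, draw roughly 5% of its packages
# (at least one, which is an assumption made for this sketch).
sampled <- unlist(lapply(split(names(counts), bin), function(pkgs) {
  if (length(pkgs) == 0) return(character(0))
  sample(pkgs, size = max(1, round(0.05 * length(pkgs))))
}), use.names = FALSE)

length(sampled)
```

Sampling within intervals, rather than from all packages at once, keeps both rarely and heavily downloaded packages represented in the background curves.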

II - computing package download rank percentiles

After spending some time with nominal download counts, the "compared to what?" question will come to mind. For instance, consider the data for the 'cholera' package from the first week of March 2020:

plot(cranDownloads(packages = "cholera", from = "2020-03-01", to = "2020-03-07"))

Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual? One way to answer these questions is to locate your package in the overall frequency distribution of download counts.

Below are the distributions of the logarithm of download counts for Wednesday and Saturday. Each vertical segment (along the x-axis) represents a download count. The height of a segment represents that download count's frequency. The location of 'cholera' in the distribution is highlighted in red.

plot(packageDistribution(package = "cholera", date = "2020-03-04"))

plot(packageDistribution(package = "cholera", date = "2020-03-07"))

While these plots give us a better picture of where 'cholera' is located, comparisons between Wednesday and Saturday are impressionistic at best: all we can confidently say is that the download counts for both days were greater than the mode.

To facilitate interpretation and comparison, I use the rank percentile of a download count instead of the simple nominal download count. This nonparametric statistic tells you the percentage of packages that had fewer downloads. In other words, it gives you the location of your package relative to the locations of all other packages. More importantly, by rescaling download counts to lie on the bounded interval between 0 and 100, rank percentiles make it easier to compare packages within and across distributions.

For example, we can compare Wednesday ("2020-03-04") to Saturday ("2020-03-07"):

packageRank(package = "cholera", date = "2020-03-04")
>         date packages downloads            rank percentile
> 1 2020-03-04  cholera        38 5,556 of 18,038       67.9

On Wednesday, we can see that 'cholera' had 38 downloads, came in 5,556th place out of the 18,038 different packages downloaded, and earned a spot in the 68th percentile.

packageRank(package = "cholera", date = "2020-03-07")
>         date packages downloads            rank percentile
> 1 2020-03-07  cholera        29 3,061 of 15,950         80

On Saturday, we can see that 'cholera' had 29 downloads, came in 3,061st place out of the 15,950 different packages downloaded, and earned a spot in the 80th percentile.

So contrary to what the nominal counts tell us, one could say that the interest in 'cholera' was actually greater on Saturday than on Wednesday.

calculating rank percentile

To compute rank percentiles, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using 'cholera' from Wednesday as an example:

pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")
downloads <- pkg.rank$freqtab
round(100 * mean(downloads < downloads["cholera"]), 1)
> [1] 67.9

To put it differently:

(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
> [1] 12250
(tot.pkgs <- length(downloads))
> [1] 18038
round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
> [1] 67.9
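If you'd like to try the computation without downloading a log, the same calculation works on any named vector of counts. The toy data and the rankPercentile() helper below are made up for illustration; only the formula comes from the text above.

```r
# Made-up download counts for five hypothetical packages.
downloads <- c(a = 5, b = 12, c = 12, d = 38, e = 100)

# Percent of packages with strictly fewer downloads than 'pkg'.
rankPercentile <- function(counts, pkg) {
  round(100 * mean(counts < counts[pkg]), 1)
}

rankPercentile(downloads, "d")
# [1] 60
rankPercentile(downloads, "b")
# [1] 20
```

Note that 'b' and 'c', which have the same count, get the same rank percentile.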

nominal ranks

In the example above, 38 downloads puts 'cholera' in 5,556th place among 18,038 observed packages. This rank is "nominal" because it's possible that multiple packages can have the same number of downloads. As a result, a package's nominal rank, but not its rank percentile, can be affected by its name. For example, because packages with the same number of downloads are sorted in alphabetical order, 'cholera' benefits from the fact that it is 31st in the list of 263 packages with 38 downloads:

pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")
downloads <- pkg.rank$freqtab
which(names(downloads[downloads == 38]) == "cholera")
> [1] 31
length(downloads[downloads == 38])
> [1] 263
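A small made-up example shows how alphabetical tie-breaking moves the nominal rank while leaving the rank percentile untouched. The package names and counts are invented, and the sort below is a sketch of the idea rather than the package's own code.

```r
# Made-up counts: three packages tie at 38 downloads.
downloads <- c(zebra = 38, cholera = 38, abc = 38, top = 500, low = 2)

# Nominal rank: sort by descending count, breaking ties alphabetically.
ord <- order(-downloads, names(downloads))
nominal.rank <- setNames(seq_along(ord), names(downloads)[ord])
nominal.rank
#     top     abc cholera   zebra     low
#       1       2       3       4       5

# Rank percentile: identical for all three tied packages.
rank.percentile <- sapply(names(downloads), function(pkg) {
  100 * mean(downloads < downloads[pkg])
})
rank.percentile[c("abc", "cholera", "zebra")]
#     abc cholera   zebra
#      20      20      20
```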

visualizing package download rank percentiles

To visualize packageRank(), use plot().

plot(packageRank(packages = "cholera", date = "2020-03-04"))

plot(packageRank(packages = "cholera", date = "2020-03-07"))

The graphs above, which are customized here to be on the same scale, plot the rank order of packages' download counts (x-axis) against the logarithm of those counts (y-axis). Each graph then highlights (in red) a package's position in the distribution along with its rank percentile and download count. In the background, the 75th, 50th and 25th percentiles are plotted as dotted vertical lines. The package with the most downloads, 'magrittr' in both cases, is at top left (in blue). The total number of downloads is at the top right (in blue).

III - filtering package download counts

We compute the number of package downloads by simply counting log entries. While straightforward, this approach can run into problems. Putting aside the question of whether package dependencies should be counted, what I have in mind here is what I believe to be two types of "invalid" log entries. The first, a software artifact, stems from entries that are smaller, often orders of magnitude smaller, than a package's actual binary or source file. The second, a behavioral artifact, emerges from efforts to download all of CRAN. In both cases, a reliance on nominal counts will give you an inflated sense of the degree of interest in your package. For those interested, an early but detailed analysis and discussion of both types of inflation is included as part of this R-hub blog post.

software artifacts

When looking at package download logs, the first thing you'll notice are wrongly sized log entries. They come in two sizes. The "small" entries are approximately 500 bytes in size. The "medium" entries are variable in size: they fall somewhere between a "small" entry and a full download (i.e., "small" <= "medium" <= full download). "Small" entries manifest themselves as standalone entries, paired with a full download, or as part of a triplet alongside a "medium" and a full download. "Medium" entries manifest themselves as either standalone entries or as part of a triplet.

The example below illustrates a triplet:

packageLog(date = "2020-07-01")[4:6, -(4:6)]
>               date     time    size package version country ip_id
> 3998633 2020-07-01 07:56:15   99622 cholera   0.7.0      US  4760
> 3999066 2020-07-01 07:56:15 4161948 cholera   0.7.0      US  4760
> 3999178 2020-07-01 07:56:15     536 cholera   0.7.0      US  4760

The "medium" entry is the first observation (99,622 bytes). The full download is the second entry (4,161,948 bytes). The "small" entry is the last observation (536 bytes). At a minimum, what makes a triplet a triplet (or a pair a pair) is that all members share system configuration (e.g., IP address, etc.) and have identical or adjacent time stamps.

To deal with the inflationary effect of "small" entries, I filter out observations smaller than 1,000 bytes (the smallest package on CRAN appears to be 'source.gist', which weighs in at 1,200 bytes). "Medium" entries are harder to handle. I remove them using either a triplet-specific filter or a filter that looks up a package's actual size.
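As a rough illustration of the size-based rule, the sketch below applies the 1,000-byte threshold to the three entries of the triplet shown above. The data frame is typed in by hand and the code is not the package's implementation; it also shows why the "medium" entry needs a different filter.

```r
# The triplet from the log excerpt above, entered by hand.
log.entries <- data.frame(
  package = c("cholera", "cholera", "cholera"),
  size = c(99622, 4161948, 536),
  stringsAsFactors = FALSE
)

small.threshold <- 1000  # bytes, the cutoff stated in the text

# Keep only entries at or above the threshold.
filtered <- log.entries[log.entries$size >= small.threshold, ]
filtered$size
# [1]   99622 4161948
```

The 536-byte entry is removed, but the 99,622-byte "medium" entry survives a simple size threshold.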

behavioral artifacts

While wrongly sized entries are fairly easy to spot, seeing the effect of efforts to download all of CRAN requires a change of perspective. While details and further evidence can be found in the R-hub blog post mentioned above, I'll illustrate the problem with the following example:

packageLog(packages = "cholera", date = "2020-07-31")[8:14, -(4:6)]
>              date     time    size package version country ip_id
> 132509 2020-07-31 21:03:06 3797776 cholera   0.2.1      US    14
> 132106 2020-07-31 21:03:07 4285678 cholera   0.4.0      US    14
> 132347 2020-07-31 21:03:07 4109051 cholera   0.3.0      US    14
> 133198 2020-07-31 21:03:08 3766514 cholera   0.5.0      US    14
> 132630 2020-07-31 21:03:09 3764848 cholera   0.5.1      US    14
> 133078 2020-07-31 21:03:11 4275831 cholera   0.6.0      US    14
> 132644 2020-07-31 21:03:12 4284609 cholera   0.6.5      US    14

Here, we see that seven different versions of the package were downloaded as a sequential bloc. A little digging shows that these seven versions represent all versions of 'cholera' available on that date:

packageHistory(package = "cholera")
>   Package Version       Date Repository
> 1 cholera   0.2.1 2017-08-10    Archive
> 2 cholera   0.3.0 2018-01-26    Archive
> 3 cholera   0.4.0 2018-04-01    Archive
> 4 cholera   0.5.0 2018-07-16    Archive
> 5 cholera   0.5.1 2018-08-15    Archive
> 6 cholera   0.6.0 2019-03-08    Archive
> 7 cholera   0.6.5 2019-06-11    Archive
> 8 cholera   0.7.0 2019-08-28       CRAN

While there are "legitimate" reasons for downloading past versions (e.g., research, container-based software distribution, etc.), I'd argue that examples like the above are "fingerprints" of efforts to download CRAN. While this is not necessarily problematic, it does mean that when your package is downloaded as part of such efforts, that download is more a reflection of an interest in CRAN itself (a collection of packages) than of an interest in your package per se. And since one of the uses of counting package downloads is to assess interest in your package, it may be useful to exclude such entries.

To do so, I try to filter out these entries in two ways. The first identifies IP addresses that download "too many" packages and then filters out campaigns, large blocs of downloads that occur in (nearly) alphabetical order. The second looks for campaigns not associated with "greedy" IP addresses and filters out sequences of past versions downloaded in a narrowly defined time window.
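To give a feel for the second approach, here is a deliberately simplified sketch that flags a bloc of past versions fetched by one IP within a short window. The 60-second window, the three-version minimum, and the extra 0.7.0 entry from ip_id 99 are all assumptions invented for this example; the package's actual filter may work quite differently.

```r
# Seven past-version entries from the excerpt above, plus one made-up
# download of the current version from a different (hypothetical) IP.
log.entries <- data.frame(
  time = as.POSIXct(c("2020-07-31 21:03:06", "2020-07-31 21:03:07",
                      "2020-07-31 21:03:07", "2020-07-31 21:03:08",
                      "2020-07-31 21:03:09", "2020-07-31 21:03:11",
                      "2020-07-31 21:03:12", "2020-07-31 22:15:00"),
                    tz = "UTC"),
  version = c("0.2.1", "0.4.0", "0.3.0", "0.5.0",
              "0.5.1", "0.6.0", "0.6.5", "0.7.0"),
  ip_id = c(rep(14, 7), 99),
  stringsAsFactors = FALSE
)

current.version <- "0.7.0"
window.secs <- 60   # assumed width of the time window
min.versions <- 3   # assumed minimum number of distinct past versions

is.past <- log.entries$version != current.version
drop <- rep(FALSE, nrow(log.entries))

# For each IP, drop its past-version entries if enough distinct versions
# were fetched within the window.
for (ip in unique(log.entries$ip_id[is.past])) {
  sel <- is.past & log.entries$ip_id == ip
  span <- as.numeric(difftime(max(log.entries$time[sel]),
                              min(log.entries$time[sel]), units = "secs"))
  n.versions <- length(unique(log.entries$version[sel]))
  if (n.versions >= min.versions && span <= window.secs) drop[sel] <- TRUE
}

filtered <- log.entries[!drop, ]
nrow(filtered)
# [1] 1
```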

example usage

To get an idea of how inflated your package's download count may be, use filteredDownloads(). Below are the results for 'ggplot2' for 15 September 2021.

filteredDownloads(package = "ggplot2", date = "2021-09-15")
>         date package downloads filtered.downloads inflation
> 1 2021-09-15 ggplot2    113842              57951     96.45

While there were 113,842 nominal downloads, applying all the filters reduced that number to 57,951, an inflation of 96%.
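
The reported figure is consistent with measuring inflation relative to the filtered count. The arithmetic below reproduces it from the two counts above; the formula is inferred from those numbers rather than taken from the package source.

```r
nominal <- 113842   # downloads before filtering
filtered <- 57951   # downloads after all filters

# Excess downloads as a percentage of the filtered count.
inflation <- 100 * (nominal - filtered) / filtered
round(inflation, 2)
# 96.45
```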

Note that the filters are computationally demanding. Excluding the time it takes to download the log file, the filters in the above example take approximately 75 seconds to run using parallelized code (currently only available on macOS and Unix) on a 3.1 GHz Dual-Core Intel Core i5 processor.

There are 5 filters. You can control them using the following arguments (listed in order of application):

  • ip.filter: removes campaigns of "greedy" IP addresses.
  • triplet.filter: reduces triplets to a single observation.
  • small.filter: removes entries smaller than 1,000 bytes.
  • sequence.filter: removes blocs of past versions.
  • size.filter: removes entries smaller than a package's binary or source file.
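
As a rough picture of the simplest of these, the sketch below applies the 1,000 byte cutoff described for small.filter to a toy log. The data frame and the `small.cutoff` name are made up for the example; the package's own implementation may differ.

```r
# Toy log: two full-size downloads and one tiny entry.
toy.log <- data.frame(
  package = c("cholera", "cholera", "cholera"),
  size = c(4285678, 538, 4109051)
)

small.cutoff <- 1000  # bytes, per the description of small.filter

# Remove entries smaller than the cutoff.
kept <- toy.log[toy.log$size >= small.cutoff, ]
nrow(kept)
```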

These filters are off by default (e.g., ip.filter = FALSE). To apply them, set the argument for the filter you want to TRUE:

packageRank(package = "cholera", small.filter = TRUE)

Alternatively, you can simply set all.filters = TRUE.

packageRank(package = "cholera", all.filters = TRUE)

Note that all.filters = TRUE is contextual. Depending on the function used, you'll either get the CRAN-specific or the package-specific set of filters. The former sets ip.filter = TRUE and size.filter = TRUE; it works independently of packages at the level of the entire log. The latter sets triplet.filter = TRUE, sequence.filter = TRUE and size.filter = TRUE; it relies on package-specific information (e.g., size of source or binary file).

Ideally, we'd like to use both sets. However, the package-specific set is computationally expensive because it needs to be applied individually to every package in the log, which can involve tens of thousands of packages. While not infeasible, currently this takes a long time. For this reason, when all.filters = TRUE, packageRank(), ipPackage(), countryPackage(), countryDistribution() and packageDistribution() use only the CRAN-specific filters while packageLog(), packageCountry(), and filteredDownloads() use both the CRAN-specific and package-specific filters.

IV - other functions

Six other functions (some used above) may be of interest: 1) packageDistribution() plots the location of your package in the overall frequency distribution of package downloads; 2) packageHistory() retrieves your package's release history; 3) packageLog() extracts your package's entries from the CRAN download counts log; 4) filteredDownloads() computes an estimate of your package's download count inflation (computationally intensive!); and 5 & 6) bioconductorDownloads() and bioconductorRank() offer analogous but limited functionality to the two main functions.

V - notes

country codes (top level domains)

While IP addresses are anonymized, packageCountry() and countryPackage() make use of the fact that the logs provide corresponding ISO country codes or top level domains (e.g., AT, JP, US). Note that coverage extends to about 85% of observations (i.e., approximately 15% of country codes are NA). Also, for what it's worth, there seem to be a couple of typos for country codes: "A1" (A + number one) and "A2" (A + number two). According to RStudio's documentation, this coding was done using MaxMind's free database, which no longer seems to be available and may be a bit out of date.

memoization

To avoid the bottleneck of downloading multiple log files, packageRank() is currently limited to individual calendar dates. To reduce the bottleneck of re-downloading logs, which can be upwards of 50 MB, 'packageRank' makes use of memoization via the 'memoise' package.

Here's the relevant code:

fetchLog <- function(url) data.table::fread(url)

mfetchLog <- memoise::memoise(fetchLog)

if (RCurl::url.exists(url)) {
  cran_log <- mfetchLog(url)
}

# Note that data.table::fread() relies on R.utils::decompressFile().

This means that logs are intelligently cached; those that have already been downloaded in your current R session will not be downloaded again.
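
For readers unfamiliar with the idea, here is a minimal base R sketch of what memoization does. It is not how 'memoise' is implemented; `memoize()`, `slowFetch()` and the placeholder key are hypothetical names used only to show that a repeated request is served from the cache.

```r
# Wrap a one-argument function so that results are cached by argument value.
memoize <- function(f) {
  cache <- new.env()
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)  # first request: compute and store
    }
    get(key, envir = cache)               # later requests: read from cache
  }
}

calls <- 0
slowFetch <- function(url) {
  calls <<- calls + 1  # count how often the "download" actually runs
  paste("contents of", url)
}

mSlowFetch <- memoize(slowFetch)

key <- "placeholder-log-url"  # stands in for a real log URL
first <- mSlowFetch(key)
second <- mSlowFetch(key)

calls
# 1
```

The second call returns the same value without running slowFetch() again, which is the behavior described above for log files within a single R session.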

time zones

The calendar date (e.g. "2021-01-01") is the unit of observation for 'packageRank' functions. However, because the typical use case involves the latest log file, time zone differences can come into play.

Let's say that it's 09:01 on 01 January 2021 and you want to compute the rank percentile for 'ergm' for the last day of 2020. You might be tempted to use the following:

packageRank(packages = "ergm")

However, depending on where you make this request, you may not get the data you expect. In Honolulu, USA, you will, but in Sydney, Australia, you won't. The reason is that you've somehow forgotten a key piece of trivia: RStudio typically posts yesterday's log around 17:00 UTC the following day.

The expression works in Honolulu because 09:01 HST on 01 January 2021 is 19:01 UTC on 01 January 2021. So the log you want has been available for two hours. The expression fails in Sydney because 09:01 AEDT on 01 January 2021 is 22:01 UTC on 31 December 2020. The log you want won't actually be available for another nineteen hours.
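
The conversions can be checked in base R. The sketch below assumes the request is made at 09:01 local time on 01 January 2021 and that the log for 31 December 2020 is posted at 17:00 UTC on 01 January 2021, as described above.

```r
# The same wall-clock time in two places.
request.hnl <- as.POSIXct("2021-01-01 09:01:00", tz = "Pacific/Honolulu")
request.syd <- as.POSIXct("2021-01-01 09:01:00", tz = "Australia/Sydney")

# Express both requests in UTC.
format(request.hnl, "%Y-%m-%d %H:%M", tz = "UTC")
format(request.syd, "%Y-%m-%d %H:%M", tz = "UTC")

# Approximate posting time of the 2020-12-31 log.
posted <- as.POSIXct("2021-01-01 17:00:00", tz = "UTC")

# Hours since posting (Honolulu) and hours until posting (Sydney).
hours.since.hnl <- as.numeric(difftime(request.hnl, posted, units = "hours"))
hours.until.syd <- as.numeric(difftime(posted, request.syd, units = "hours"))
round(c(hours.since.hnl, hours.until.syd), 1)
```

The Honolulu request lands about two hours after the posting time, while the Sydney request lands about nineteen hours before it.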

To make life a little easier, 'packageRank' does two things. First, when the log for the date you want is not available (due to time zone rather than server issues), you'll just get the last available log. If you specified a date in the future, you'll either get an error message or a warning that provides an estimate of when that log should be available.

Using the Sydney example and the expression above, you'd get the results for 30 December 2020:

packageRank(packages = "ergm")
>         date packages downloads          rank percentile
> 1 2020-12-30     ergm       292 873 of 20,077       95.6

If you had specified the date, you'd get an additional warning:

packageRank(packages = "ergm", date = "2021-01-01")
>         date packages downloads          rank percentile
> 1 2020-12-30     ergm       292 873 of 20,077       95.6

Warning message:
2020-12-31 log arrives in appox. 19 hours at 02 January 04:00 AEDT. Using last available!

Second, to help you check/remember when logs are posted in your location, there's logDate() and logPostInfo(). The former silently returns the date of the current available log. The latter adds the approximate local and UTC times when logs of the desired date are posted to RStudio's server.

Here's what you'd see using the Honolulu example:

logDate()

and

logPostInfo()
> $log.date
> [1] "2021-01-01"
>
> $GMT
> [1] "2021-01-01 17:00:00 GMT"
>
> $local
> [1] "2021-01-01 07:00:00 HST"

For both functions, the default is to use your local time zone. To see the results in a different time zone, pass the desired zone name from OlsonNames() to the tz argument. Here are the results for Sydney when the functions are called from Honolulu (19:01 UTC):

logDate(tz = "Australia/Sydney")

and

logPostInfo(tz = "Australia/Sydney")
> $log.date
> [1] "2021-01-01"
>
> $GMT
> [1] "2021-01-01 17:00:00 GMT"
>
> $local
> [1] "2021-01-01 04:00:00 AEDT"

This functionality depends on R's ability to compute your local time and time zone (e.g., Sys.time()). My understanding is that there may be operating system or platform specific issues that could undermine this.

timeout

With R 4.0.3, the timeout value for internet connections became more explicit. Here are the relevant details from that release's "New features":

              The default value for options("timeout") can be set from environment variable R_DEFAULT_INTERNET_TIMEOUT, still defaulting to 60 (seconds) if that is not set or invalid.

This change can affect functions that download logs. This is particularly true over slower internet connections or when you're dealing with large log files. To fix this, fetchCranLog() will, if needed, temporarily set the timeout to 600 seconds.
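
A temporary change of this kind can be sketched in base R with options() and on.exit(). This is only an illustration of the pattern, not the code inside fetchCranLog(); `withLongTimeout()` is a hypothetical helper.

```r
# Evaluate an expression with a longer timeout, then restore the old value.
withLongTimeout <- function(expr, seconds = 600) {
  # Raise the timeout only if the current value is lower.
  old <- options(timeout = max(seconds, getOption("timeout")))
  on.exit(options(old), add = TRUE)  # restore on exit, even after an error
  expr  # lazy evaluation: the expression runs here, under the new timeout
}

before <- getOption("timeout")
inside <- withLongTimeout(getOption("timeout"))
after <- getOption("timeout")

c(before = before, inside = inside, after = after)
```

In practice, the expression passed in would be the download call itself, so that the longer timeout applies only while the log is being fetched.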


Source: https://github.com/lindbrook/packageRank
