Analyzing the ggcharts CRAN Downloads. Part 2: Data Visualization

Introduction

In part 1 of this post, I described how I obtained the data on the ggcharts CRAN downloads. In this follow-up post, I will walk you through turning that data into an informative visualization using ggplot2 and patchwork.

library(ggplot2)
library(patchwork)

Daily Downloads

The first plot will show the daily downloads over time. I will use a line chart for that purpose. Remember from part 1 that the daily_downloads dataset contains the aggregated number of downloads per day.

p1 <- ggplot(daily_downloads, aes(date, N)) +
  geom_line(color = "#94C1E0", size = 1.25) +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X") + 
  ggtitle("Daily Downloads")
p1

Currently, the y axis does not include 0. In addition, the line extends beyond the highest labeled value on the y axis, which I don’t particularly like. Let’s change this.

p1 + ylim(0, 80)

That’s not quite what I wanted to achieve. Notice the gap between the x axis line and the grid line for 0. By default, ggplot2 adds 5% of padding on each side of a continuous axis. This can be changed with the expand argument, as follows.

p1 <- p1 +
  scale_y_continuous(
    limits = c(0, 80),
    expand = expansion(mult = c(0, .05))
  )
p1

Neat!
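
To make the effect of expand more concrete, here is a minimal base R sketch of the arithmetic. It assumes the multiplicative padding is applied to the width of the limits, and expanded_range() is a hypothetical helper written for this illustration, not a ggplot2 function.

```r
# Hypothetical helper (not part of ggplot2): approximates how a
# multiplicative expansion widens continuous axis limits.
# mult[1] pads the lower end, mult[2] pads the upper end.
expanded_range <- function(limits, mult = c(0.05, 0.05)) {
  width <- diff(limits)
  c(limits[1] - mult[1] * width, limits[2] + mult[2] * width)
}

# Default behaviour: 5% padding on both sides
default_range <- expanded_range(c(0, 80))

# No padding at the bottom, 5% at the top,
# mirroring expansion(mult = c(0, .05))
custom_range <- expanded_range(c(0, 80), mult = c(0, 0.05))

print(default_range)
print(custom_range)
```

With limits of c(0, 80), the default padding pushes the bottom of the panel to -4, which is the gap below the grid line for 0. Setting the lower multiplier to 0 removes that gap while keeping some headroom at the top.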

Cumulative Downloads

Next, let’s continue with plotting the cumulative downloads over time. This will be a line chart as well.

p2 <- ggplot(daily_downloads, aes(date, cumulative_N)) +
  geom_line(color = "#94C1E0", size = 1.25) +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X") + 
  ggtitle("Cumulative Downloads") +
  scale_y_continuous(expand = expansion(mult = c(0, .05)))
p2

Doesn’t look too bad. However, notice that when I put p1 and p2 next to each other, the grid lines and axis annotations don’t align, which is very unsightly.

p1 + p2

Let’s change this by tweaking the y scale.

p2 <- p2 +
  scale_y_continuous(
    limits = c(0, 1200),
    breaks = seq(from = 0, to = 1200, by = 300),
    expand = expansion(mult = c(0, .05))
  )

When putting the plots side by side, they now create a harmonious picture.

p1 + p2
## Warning: Removed 9 row(s) containing missing values (geom_path).
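
A side note on the warning above: it first shows up after limits is set on the y scale of p2, which suggests (this is my assumption, not something I verified against the data) that some values fall outside the limits. Limits set on a scale replace out-of-range values with NA before drawing, whereas coord_cartesian() only zooms the view. The sketch below uses a made-up toy data frame, not daily_downloads, to show the difference.

```r
library(ggplot2)

# Toy data standing in for daily_downloads (values are made up)
toy <- data.frame(day = 1:5, n = c(10, 50, 90, 40, 120))

# Limits on the scale: values outside [0, 80] are turned into NA
clipped <- ggplot(toy, aes(day, n)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 80))

# Limits on the coordinate system: the view is zoomed, the data stay intact
zoomed <- ggplot(toy, aes(day, n)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 80))

# Count the y values that were censored in each version
n_missing_clipped <- sum(is.na(layer_data(clipped)$y))
n_missing_zoomed <- sum(is.na(layer_data(zoomed)$y))

print(n_missing_clipped)
print(n_missing_zoomed)
```

If losing the out-of-range points matters, widening the limits or switching to coord_cartesian() would be the options to consider.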

Distribution of Daily Downloads

The third plot will show the distribution of daily downloads.

ggplot(daily_downloads, aes(N)) +
  geom_histogram(fill = "#94C1E0", color = "#494F5C") +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

By default, geom_histogram() draws 30 bins. You almost never want to leave it like this; instead, tweak it by setting the bins argument. After some trial and error, I chose to go with 10 bins.

p3 <- ggplot(daily_downloads, aes(N)) +
  geom_histogram(fill = "#94C1E0", color = "#494F5C", bins = 10) +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X") +
  ggtitle("Distribution of Daily Downloads")
p3
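
If you prefer to think in terms of bar width rather than bar count, geom_histogram() also takes a binwidth argument, as the message above hints. The sketch below uses simulated counts (an assumption for illustration, not the real download data) and inspects the computed bins with layer_data() instead of plotting.

```r
library(ggplot2)

# Simulated daily counts standing in for daily_downloads$N
set.seed(1)
toy <- data.frame(N = rpois(100, lambda = 30))

# Fix the number of bins ...
by_bins <- ggplot(toy, aes(N)) + geom_histogram(bins = 10)

# ... or fix the width of each bin and let ggplot2 derive the number
by_width <- ggplot(toy, aes(N)) + geom_histogram(binwidth = 5)

# layer_data() returns one row per bin
n_bins <- nrow(layer_data(by_bins))
n_bins_width <- nrow(layer_data(by_width))

print(n_bins)
print(n_bins_width)
```

Which of the two is easier to reason about depends on the data. For counts, a binwidth that is a whole number can be more natural.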

Daily downloads are counts, and counts are integers. Thus, having decimal numbers on the y axis looks odd to me. Let’s fix this and, in addition, get rid of the gap between the bars and the x axis.

p3 <- p3 +
  scale_y_continuous(
    breaks = c(0, 2, 4, 6, 8),
    expand = expansion(mult = c(0, .05))
  )
p3
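
Hard-coding the breaks works here, but they would need updating whenever the data change. Since the breaks argument of scale_y_continuous() also accepts a function that receives the axis limits, one alternative is a small helper that keeps only whole numbers. integer_breaks() below is a hypothetical helper written for this sketch, not something from ggplot2 or ggcharts.

```r
# Hypothetical helper: take the candidate breaks from pretty()
# and keep only those that are whole numbers.
integer_breaks <- function(limits) {
  candidates <- pretty(limits)
  candidates[candidates %% 1 == 0]
}

# Try it on two made-up axis ranges
breaks_wide <- integer_breaks(c(0, 8.4))
breaks_narrow <- integer_breaks(c(0, 2.5))

print(breaks_wide)
print(breaks_narrow)
```

Passing breaks = integer_breaks to scale_y_continuous() would then replace the hard-coded vector.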

Downloads by Country

The last plot will display the number of downloads by country. Remember from part 1 that this data is stored in the downloads_by_country dataset.

p4 <- ggcharts::bar_chart(downloads_by_country, country, N, bar_color = "#94C1E0", top_n = 10) +
  ggcharts::theme_hermit(axis = "y", grid = "Y") +
  labs(
    title = "Downloads by Country",
    subtitle = "Top 10"
  )
p4

Apparently, there are quite a lot of downloads for which the country is unknown. Let’s turn the NA values into "Unknown".

downloads_by_country[is.na(country), country := "Unknown"]
p4 <- ggcharts::bar_chart(downloads_by_country, country, N, bar_color = "#94C1E0", top_n = 10) +
  ggcharts::theme_hermit(axis = "y", grid = "Y") +
  labs(
    title = "Downloads by Country",
    subtitle = "Top 10"
  )
p4
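
The replacement above uses data.table syntax, so it only works if downloads_by_country is a data.table. For a plain data frame, a base R equivalent could look like the sketch below. The toy data are made up for illustration.

```r
# Toy data frame standing in for downloads_by_country (values are made up)
toy_downloads <- data.frame(
  country = c("US", NA, "DE"),
  N = c(120, 95, 40),
  stringsAsFactors = FALSE
)

# Overwrite only the missing entries, leaving all other rows untouched
toy_downloads$country[is.na(toy_downloads$country)] <- "Unknown"

print(toy_downloads)
```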

Next, let’s label the bars with the values they actually represent.

p4 <- p4 +
  geom_text(aes(label = N), hjust = -.2, color = "#D6DDE1", size = 5) +
  scale_y_continuous(expand = expansion(mult = c(0, .15)))
p4

If you have no idea what I just did, check out this post to learn more about labeling bar charts.

When labeling the bars, I prefer to hide the x axis text and remove the grid lines as well.

p4 <- p4 +
  theme(
    axis.text.x = element_blank(),
    panel.grid.major.x = element_blank()
  )
p4

Putting it All Together

Combining these four plots into one data visualization is a piece of cake thanks to the patchwork package. Simply add them together.

p1 + p2 + p3 + p4
## Warning: Removed 9 row(s) containing missing values (geom_path).

That looks quite nice. The axis labels make the plot a bit messy, though. I think they are redundant: it’s very clear from the titles what is displayed, so the axis labels only add visual clutter.

plot <- p1 + p2 + p3 + p4 & labs(x = NULL, y = NULL)
plot
## Warning: Removed 9 row(s) containing missing values (geom_path).

Notice that I used & instead of the usual +. By using &, the labels of all plots in the patchwork are adjusted. With +, only the labels of the last plot, i.e. p4, would be changed.
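
The difference between the two operators is easy to try out on a smaller patchwork. The sketch below uses two toy plots built from the built-in mtcars dataset (my choice for illustration, unrelated to the download data).

```r
library(ggplot2)
library(patchwork)

# Two toy plots built from the built-in mtcars dataset
a <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
b <- ggplot(mtcars, aes(mpg, hp)) + geom_point()

# `+` applies the title to the last plot only (here: b)
last_only <- a + b + ggtitle("Shared title")

# `&` applies the title to every plot in the patchwork
all_plots <- (a + b) & ggtitle("Shared title")
```

Printing last_only and all_plots side by side shows the title on one panel in the first case and on both panels in the second. The parentheses around a + b are not strictly needed, since + binds more tightly than & in R, but they make the intent explicit.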

Finally, let’s add an overall title, subtitle and a caption.

plot +
  plot_annotation(
    title = "ggcharts is on the Rise",
    subtitle = "A Summary of Downloads from the RStudio CRAN Mirror",
    caption = "Source: RStudio CRAN Logs"
  )
## Warning: Removed 9 row(s) containing missing values (geom_path).

Ok, that looks awful! Fortunately, this can be fixed by passing the same theme used for the individual plots to the patchwork.

plot +
  plot_annotation(
    title = "ggcharts is on the Rise",
    subtitle = "A Summary of Downloads from the RStudio CRAN Mirror",
    caption = "Source: RStudio CRAN Logs",
    theme = ggcharts::theme_hermit()
  )
## Warning: Removed 9 row(s) containing missing values (geom_path).

Much better, but I would prefer the overall title to be (much) larger than the titles of the individual plots. That can be achieved by tweaking the theme passed to plot_annotation().

plot_theme <- ggcharts::theme_hermit() +
  theme(
    plot.title = element_text(face = "bold", size = 30),
    plot.subtitle = element_text(size = 24),
    plot.caption = element_text(size = 12)
  )
plot <- plot +
  plot_annotation(
    title = "ggcharts is on the Rise",
    subtitle = "A Summary of Downloads from the RStudio CRAN Mirror",
    caption = "Source: RStudio CRAN Logs",
    theme = plot_theme
  )
plot
## Warning: Removed 9 row(s) containing missing values (geom_path).

I am almost satisfied with this data visualization. One thing that still bothers me, though, is the position of the individual titles. They are aligned with the edge of the plot, which I think is great for a standalone plot but not for a patchwork. Let’s align them with the panel, i.e. the part of the plot that actually shows the data.

plot & theme(plot.title.position = "panel")
## Warning: Removed 9 row(s) containing missing values (geom_path).

It took quite some effort to get here, but I really like the end result. How about you? Let me know in the comments.

Just the Code

library(ggplot2)
library(patchwork)

p1 <- ggplot(daily_downloads, aes(date, N)) +
  geom_line(color = "#94C1E0", size = 1.25) +
  scale_y_continuous(
    limits = c(0, 80),
    expand = expansion(mult = c(0, .05))
  ) +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X") + 
  labs(
    x = NULL,
    y = NULL,
    title = "Daily Downloads"
  )

p2 <- ggplot(daily_downloads, aes(date, cumulative_N)) +
  geom_line(color = "#94C1E0", size = 1.25) +
  scale_y_continuous(
    limits = c(0, 1200),
    breaks = seq(from = 0, to = 1200, by = 300),
    expand = expansion(mult = c(0, .05))
  ) +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X") + 
  labs(
    x = NULL,
    y = NULL,
    title = "Cumulative Downloads"
  )

p3 <- ggplot(daily_downloads, aes(N)) +
  geom_histogram(fill = "#94C1E0", color = "#494F5C", bins = 10) +
  scale_y_continuous(
    breaks = c(0, 2, 4, 6, 8),
    expand = expansion(mult = c(0, .05))
  ) +
  ggcharts::theme_hermit(axis = "x", ticks = "x", grid = "X") +
  labs(
    x = NULL,
    y = NULL,
    title = "Distribution of Daily Downloads"
  )

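# Label downloads with a missing country as "Unknown" (data.table syntax)
downloads_by_country[is.na(country), country := "Unknown"]

p4 <- ggcharts::bar_chart(
  downloads_by_country,
  country, N,
  bar_color = "#94C1E0",
  top_n = 10
) +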
  geom_text(
    mapping = aes(label = N),
    hjust = -.2,
    color = "#D6DDE1",
    size = 5
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, .15))) +
  ggcharts::theme_hermit(axis = "y") +
  theme(axis.text.x = element_blank()) +
  labs(
    x = NULL,
    y = NULL,
    title = "Downloads by Country",
    subtitle = "Top 10"
  )


plot_theme <- ggcharts::theme_hermit() +
  theme(
    plot.title = element_text(face = "bold", size = 30),
    plot.subtitle = element_text(size = 24),
    plot.caption = element_text(size = 12)
  )
plot <- p1 + p2 + p3 + p4 +
  plot_annotation(
    title = "ggcharts is on the Rise",
    subtitle = "A Summary of Downloads from the RStudio CRAN Mirror",
    caption = "Source: RStudio CRAN Logs",
    theme = plot_theme
  )
plot & theme(plot.title.position = "panel")