Reproducible and honest visualizations

Overview

Teaching: 90 min
Exercises: 90-120 min

Questions

How to create graphs reproducibly?

How to transmit information truthfully in graphs?

What are the good practice principles for visualizations?

Objectives

Learn about dos and donts for honest graphs

Learn about good practice for visualizations

Apply the principles to concrete examples

Learn the necessary R code

Data visualization

Data visualization is omnipresent in science. Visualizations range from presenting raw data to illustrating analysis results or modeling outcomes. The way visualizations are constructed should, as any other part of the analysis, be reproducible and adhering to the basic principles of good scientific practice. For visualizations it is specifically important to honestly show the data without distorting the contained information towards an intended message, i.e. we present how to transmit information as truthfully as possible in graphs. You will practice reproducible data analysis skills while learning about best practice for graphs.

Good practice for visualizations

In the following sections we will have a look at different visualizations and things to be aware of when using them with the goal of transmitting information truthfully. The most important principles of good practice for visualizations are

1. Be simple, clear and to the point
2. Show the data
3. Be honest about the axes
4. Use colors sensibly

We provide code in ggplot. A short introduction to ggplot is provided in this episode of the carpentries course: R for Social Scientists. A good reference for both systems, plots in base R and ggplot is the book by RD Peng: Exploratory data analysis with R.

1. Be simple, clear and to the point

Encoding data using visual cues

As a basic principle it is useful to consider the relationship of visual cues, i.e. the type of visual encoding of quantitative data such as bars or areas, and the accuracy of the understanding of a viewer of these visualizations. The graph below shows how accurately the visualizations are perceived for different types of representation. Lengths (in form of bars) represent the data most accurately while volumes are rather generic and are more difficult to be perceived accurately.

Image credit: P Aldhous

The linked picture is based on Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods by William S. Cleveland and Robert McGill.

Therefore, when creating a visualization you should consider the best type of visual cue that represents the data best with the goal of transmitting the intended message. For good perception of a message it is clearly better to provide simple visualizations. We discuss some specific points in more detail below.

3D

Providing simple and easily perceptible visualizations implies that you should avoid 3-dimensional graphical representations in most circumstances. Consider the following visualization:

Image credit: Josh Taylor

As you can see (or not see!) some data is hidden behind the different bars. Furthermore it is rather difficult (and misleading) to compare the height from different depths.

Avoid occlusion

As a general principle we can conclude from the 3D example that you should always avoid occlusion of some parts of the visualization. An example can be found in the following plot showing multiple densities in the same panel. The different densities where colored according to group but only the density in the front is fully visible.

An alternative is to plot lines which allows us to see all groups completely.

Pie charts

Pie charts can be considered an alternative to bar charts, although often not a good one since they use angles as visual cues. For instance look at the following three visualizations of a change of the count of three factors over time. First a barplot, second a stacked barplot and lastly three pie charts (on top of each other). Where are differences most visible?

Arrangement of plots

The arrangement of multiple plots and panels can also contribute to increasing the clarity of a visualization. Have a look at the following plot.

Two inconsistencies are present. First of all the order of the sample of the top row and the bottom row is not the same. Secondly in the top row var1 is on the y-axis while in the bottom row it is on the x-axis. Staying consistent and in general have an arrangement that makes sense helps to have a clear representation that transmits the desired information efficiently. A better alternative for the above plot is:

2. Show the data

Boxplots

Boxplots are used to give a rough overview of the distribution of a data set based on a few summary characteristics (quantiles). Consider the following three boxplots each representing a different dataset. The boxplots look identical even tough the underlying distributions may not be.

The code for the above plot:

ggplot(df_long) +
  geom_boxplot(aes(y = y, x = dataset))

Violin plots are an alternative to boxplots. They are based on an estimation of the underlying probability density, i.e. they use more information inherent in the data set. Have a look at the following three violin plots of the same datasets as above. Again, two of the violin plots look similar but the underlying data may not be identical.

Let’s finally have a look at the actual data. As you can see the samples x1 and x3 are in fact very distinct, or more precisely, x3 seems to have only 5 possible values.

So why did the boxplot not show the distributional differences? Since boxplots only show certain quantiles (usually the quartiles, i.e., 25%, 50% and 75%, plus “outliers”) plots of different datasets having the same or similar quantiles appear identical. The quartiles of the three data sets are

dataset	q25	q75
x1	-2.1	2.1
x2	-2.0	2.0
x3	-2.0	2.0

Violin plots show a mirrored estimation of the underlying density using a smoothing technique. Loosely speaking smoothing means that starting from a histogram a smooth version of the underlying probability distribution is created. The degree of smoothing, ranging in this case from histogram to straight line, determines the actual appearence of the plot. For the violin plot the degree of smoothing is chosen automatically. Already histograms with 5 bins for the data x1 and x3 would be very similar and hence the smoothed versions thereof as well.

Caution is furthermore advised if the datasets that are compared are of very different size, because often more data gives you a higher confidence in the observed distribution. It is therefore advised to initially always have a look at the actual data and not just the summaries (like boxplots and violin plots) to detect anomalies.

Another option is the use of geom_jitter (or geom_sina from the ggforce package) in combination with violin plots:

The code for the above plot:

ggplot(df_long) +
  geom_violin(aes(y = y, x = dataset)) +
  geom_jitter(aes(y = y, x = dataset), width = 0.3)

The advantage is that individual points as well as the distribution are shown.

Boxplots together with geom_jitter are another possibility.

Another possibility is to only show the jittered data:

Bar plots

The same as discussed before for boxplots also holds for barplots. If you have continuous data and see the following barplots you might conclude that the data sets are the same:

But if you also show the individual points you can see clear differences:

Important to keep in mind when using barplots with error bars is to state what the error bars mean. Do they correspond to the standard deviation, the standard error or a confidence interval? There is no clear answer to which one to use and, if possible, other types of visualizations should be used.

3. Be honest about the axes

The axes of plots determine how much information you provide and where you put the focus. You could cut axes, blow certain parts of an axis up through transformation or hide information on certain scales if you do not transform. You can expose or hide information by choosing the aspect ratio between the x and y axis. You can provide clear and precise information through meaningful labeling of axes and axis tick marks or you can obscure the same information by deliberately choosing uninformative tick locations, for example. These issues are illustrated through example in the following

Cutting axes

Let’s consider the following two barplots. The first has a shortened axis range and shows clear differences between the datasets. The second plot on the other hand shows the enire axis starting from zero and the differences disappear.

See this concrete example of cutting an axis, which makes differences appear much huger than they are in reality:

Image credit: Helena Jambor

Axis transformation

In some cases you might have data that is on completely different scales, meaning that there are differences to be seen at different orders of magnitudes. In these cases it can often to be helpful to do an axis-transformation. For instance consider the following untransformed plot:

There seems to be some structure but especially for the low values it is not clear what is going on. If instead you do a log10 transformation of the x-axis things get much clearer. Axis transformations are also something to consider if you have for example non linear scales. But beware, transformations can also be used to showcase differences that do not really matter in practice.

Aspect ratio

The aspect ratio is another important parameter that can be manipulated to overstress certain patterns. For example, have a look at the following two plots. The first as a ratio of one, meaning the scale of the x and y axis are the same. The second plot has an aspect ration of 1/4 meaning the x axis is substantially longer.

Code for the above plot:

ggplot(df) +
  geom_point(aes(x, y)) +
  coord_fixed(ratio = 1)
ggplot(df) +
  geom_point(aes(x, y)) +
  coord_fixed(ratio = 1 / 4)

Visually the second plot implies that the variance of x is much higher than of y, which is not the case:

summarise(df, x = var(x), y = var(y))

          x         y
1 0.8332328 0.9350631

Also consider the following real example. Where does the increase look the most dramatic?

Image credit: Demetris Christodoulou

Bin width of histograms

The appearance of a histogram is determined by the bin width that is used to create it. If you have a very large binwidth (or a low total number of bins) you might see something like this and you would probably consider the distribution to be approximately uniformly distributed.

If on the other hand you decrease the binwidth (or increase the number of bins) you might see something like this:

Making it quite obvious that the distribution is most definitely not uniformly distributed (on this scale). Choosing the correct bin width is not easy and depends largely on the context.

With geom_rug you can mark the position of individual observations:

Code for the above plot:

ggplot(df, aes(x)) +
  geom_histogram(binwidth = 0.5) +
  geom_rug()

Axis scales in multiple plots

If you provide plots in multiple panels, each using the same variables, you need to pay attention to the scale of each subplot. For example have a look at the following plot.

At first glance the distribution of each of the three samples looks the same. But if you look closely you can see that the scales are not the same for each subplot. If you instead keep the scale the same across subplots you get a visualization with clear differences of the distributions between the different samples.

Code for the above plot:

ggplot(df) +
  geom_point(aes(x, y)) +
  facet_wrap( ~ sample)

4. Use colors sensibly

Color contrast

Trying to encode more than 8 category with colors is usually not a good idea as distinction between colors can become very difficult:

In such a case it can be a better idea to directly label the points:

Code for the above plot:

mtcars %>% 
  rownames_to_column() %>% 
  ggplot() +
  geom_point(aes(mpg, disp, color = cyl)) +
  ggrepel::geom_label_repel(aes(mpg, disp, label = rowname),
                            size = 2.5, label.size = 0.1,
                            label.padding = 0.1)

See also: Common pitfalls of color use in Fundamentals of Data Visualization.

Color vision deficiency

About 1 of every 12 people is affected by some type of color vision deficiency (see, e.g., here). This is important to keep in mind when choosing colors for visualizations. For example consider the following scatter plot using a Red-Yellow-Green color palette, knowing that Red-Green colorblindness is the most frequent type of color deficiency.

To check how the plots appear for color deficient persons you can use the cvd_grip function from the colorblindr package (install instructions on the Github colorblindr repo).

colorblindr::cvd_grid() +
  medtheme()

Using a different color palette can help. For example the following:

Code for the above plot:

ggplot(mtcars) +
  geom_point(aes(mpg, disp, color = factor(carb))) +
  scale_color_OkabeIto()

Another option is the dichromat package (CRAN link) which features multiple palettes for people with red-green colorblindness.

Quiz

Let’s have a look at the graph, : Image credit: We discovered the graph here: https://statmodeling.stat.columbia.edu/2008/06/06/new_candidate_f/. The picture is not from a scientific study but originates from a blog: http://www.shakesville.com/2007/09/rotundity-revealed.html.

1

What aspect of the “Be simple, clear and to the point” input has been violated?

3D

occlusion

use of pie charts

arrangement of multiple plots

Solution

T 3D
T occlusion
F use of pie charts
F arrangement of multiple plots

2

Is the data shown appropriately by the plot through

the height of bars?

the values on bars?

additional values in white font?

the tick marks indicating to which value each bar belongs?

the raw data?

Solution

the height of bars?

the values on bars?

additional values in white font?

the tick marks indicating to which value each bar belongs?

the raw data?

3

What could be the reason that the Y-axis is shown starting at the value 55?

the value 55 could be the smallest possible value in the context

the differences between the bars are more pronounced than if the Y-axis starts at zero.

the values below 55 need to be hidden

Solution

T the value 55 could be the smallest possible value in the context
T the differences between the bars are more pronounced than if the Y-axis starts at zero.
T the values below 55 need to be hidden

4

When you think about the information regarding the axes, please tick which one of the following items is present in the plot

Sensible X-axis tick label display

Sensible X-axis label

Sensible Y-axis tick values

Sensible Y axis label

Solution

F Sensible X-axis tick label display
F Sensible X-axis label
F Sensible Y-axis tick values
F Sensible Y axis label

5

Does the displayed grid help to determine the height of the color sections of the bars?

Yes

No

Solution

F Yes
T No

6

Is the used color palette color-blind friendly?

Yes

No

Solution

F Yes
T No

7

Which of the following additional information items does the plot feature?

Informative title

Informative legend labels

Provenance of data

Context of data

Solution

F Informative title
F Informative legend labels
F Provenance of data
F Context of data

8

Does the course team think this is a good plot?

Yes

No

Solution

F Yes
T No

Episode challenge

For this challenge we will work with climate data published by the Bundesamt für Statistic BFS in which various climate related variables measured at different locations in Switzerland have been put together. The data has already been wrangled into a csv file that you can download from here.

The source data was downloaded from here: https://www.bfs.admin.ch/asset/de/je-d-02.03.03.02 and here: https://www.bfs.admin.ch/asset/de/je-d-02.03.03.03

Task 1: data exploration

In this first task read in the climate_data.csv file and do a short exploration of the dataset.

1.1 First look

Show the top 3 rows of the dataset and additionally a short summary of the dataset (Hint: use summary). Describe what you observe in a few words.

Solution

head(climatedf_comp, n = 3)

  Year       Location Sunshine_duration Altitude Annual_Precipitation
1931 BaselBinningen          1594.317      316                816.0
1931 BernZollikofen          1742.500      553               1137.6
1931          Davos          1767.600     1594               1077.3
  Annual_temperature Annual_ice_days Annual_frost_days Annual_summer_days
              8.5              NA                NA                 NA
              7.2              NA                NA                 NA
              1.7              NA                NA                 NA
  Annual_heat_days Annual_tropic_days Annual_precipitation_days
             NA                 NA                        NA
             NA                 NA                        NA
             NA                 NA                        NA

summary(climatedf_comp)

      Year        Location         Sunshine_duration    Altitude     
 Min.   :1931   Length:1170        Min.   :1046      Min.   : 273.0  
 1st Qu.:1953   Class :character   1st Qu.:1557      1st Qu.: 411.0  
 Median :1976   Mode  :character   Median :1725      Median : 485.0  
 Mean   :1976                      Mean   :1759      Mean   : 805.9  
 3rd Qu.:1998                      3rd Qu.:1937      3rd Qu.: 776.0  
 Max.   :2020                      Max.   :2741      Max.   :2501.0  
                                   NA's   :129                       
 Annual_Precipitation Annual_temperature Annual_ice_days  Annual_frost_days
 Min.   : 338.9       Min.   :-3.300     Min.   :  0.00   Min.   :  1.00   
 1st Qu.: 829.9       1st Qu.: 6.925     1st Qu.:  5.00   1st Qu.: 60.25   
 Median :1050.5       Median : 9.000     Median : 17.00   Median : 87.00   
 Mean   :1212.9       Mean   : 7.755     Mean   : 33.05   Mean   :107.99   
 3rd Qu.:1411.8       3rd Qu.:10.400     3rd Qu.: 41.00   3rd Qu.:120.00   
 Max.   :3704.2       Max.   :13.900     Max.   :218.00   Max.   :289.00   
                                         NA's   :364      NA's   :364      
 Annual_summer_days Annual_heat_days Annual_tropic_days
 Min.   :  0.00     Min.   : 0.000   Min.   : 0.000    
 1st Qu.: 12.25     1st Qu.: 0.000   1st Qu.: 0.000    
 Median : 38.00     Median : 2.000   Median : 0.000    
 Mean   : 36.95     Mean   : 6.257   Mean   : 1.561    
 3rd Qu.: 56.00     3rd Qu.:10.000   3rd Qu.: 1.000    
 Max.   :125.00     Max.   :56.000   Max.   :40.000    
 NA's   :364        NA's   :364      NA's   :364       
 Annual_precipitation_days
 Min.   : 64.0            
 1st Qu.:100.0            
 Median :120.0            
 Mean   :119.8            
 3rd Qu.:136.0            
 Max.   :229.0            
 NA's   :365              

1.2 Which has been the hottest year?

Solution

climatedf_comp %>% 
  dplyr::group_by(Year) %>% 
  dplyr::summarise(mean_temp = mean(Annual_temperature)) %>% 
  dplyr::filter(mean_temp == max(mean_temp)) %>% 
  dplyr::pull(Year)

[1] 2018

1.3 Which has been the coldest year?

Solution

climatedf_comp %>% 
  dplyr::group_by(Year) %>% 
  dplyr::summarise(mean_temp = mean(Annual_temperature)) %>% 
  dplyr::filter(mean_temp == min(mean_temp)) %>% 
  dplyr::pull(Year)

[1] 1956

Task 2: visualization

2.1 Association of `Annual_temperature` and `Year`

The goal is to visualize the association of Annual_temperature and Year. To increase the visibility we will only look at the locations ZürichFluntern, Säntis, Samedan, LocarnoMonti.

Choose a suitable visualization (maybe consider looking at the decision tree) and plot the respective graph.

Solution

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern", 
                                "Säntis", 
                                "Samedan", 
                                "LocarnoMonti")) %>%
  ggplot() +
  geom_line(aes(Year, Annual_temperature, color = Location)) +
  labs(y = "Annual temperature")

2.2 Add information on the altitude

Based on the previous plot update / change your plot to also include the information about the altitude. Make sure that the location information is also provided.

Solution

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern", 
                                "Säntis", 
                                "Samedan", 
                                "LocarnoMonti")) %>%
  ggplot() +
  geom_line(aes(Year, Annual_temperature, color = Altitude, group = Location)) +
  geom_label(aes(Year, Annual_temperature, label = Location), 
             data = climatedf_comp %>% 
               dplyr::filter(Location %in% c("ZürichFluntern", 
                                             "Säntis", 
                                             "Samedan", 
                                             "LocarnoMonti")) %>% 
               dplyr::filter(Year==min(Year)+5),nudge_y = 1) +
  labs(y = "Annual temperature")

2.3 Normalization

In the next step we want to normalize the Annual temperature by using the values of the years <1951 as a base. I.e. calculate the mean Annual_temperature for Year<1951 for each Location and subtract this value from Annual_temperature. Present a visualization that allows to study the deviation from this annual mean by location.

Solution

climatedf_comp_translated <- climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern", 
                                "Säntis", 
                                "Samedan", 
                                "LocarnoMonti")) %>% 
  dplyr::group_by(Location) %>% 
  dplyr::mutate(mean_temperature = mean(Annual_temperature[Year < 1951]),
                Annual_temperature = Annual_temperature - mean_temperature)

ggplot(climatedf_comp_translated) +
  geom_line(aes(Year, Annual_temperature, color = Altitude, group = Location)) +
  facet_wrap( ~ Location) +
  # geom_label(aes(Year, Annual_temperature, label = Location), 
  #            data = climatedf_comp_translated %>%
  #              dplyr::filter(Location %in% c("ZürichFluntern",
  #                                            "Säntis",
  #                                            "Samedan",
  #                                            "LocarnoMonti")) %>%
  #              dplyr::filter(Year == min(Year) + 5), nudge_y = 1) +
  labs(y = "Annual temperature deviation from average up to 1951")

2.4 Associations between `Annual_Precipitation`, and `Sunshine_duration`

The next goal is to explore associations between Annual_Precipitation, and Sunshine_duration for the locations ZürichFluntern,Säntis,Samedan,LocarnoMonti. Present at least two different types of plots.

Solution

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern",
                                "Säntis",
                                "Samedan",
                                "LocarnoMonti")) %>%
  tidyr::drop_na() %>%
  ggplot() +
  geom_point(aes(Annual_Precipitation, Sunshine_duration, color = Location))

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern",
                                "Säntis",
                                "Samedan",
                                "LocarnoMonti")) %>% 
  # dplyr::group_by(Location) %>% 
  # dplyr::mutate(Sunshine_duration = scale(Sunshine_duration)) %>%
  tidyr::drop_na() %>%
  ggplot() +
  geom_violin(aes(Location, Annual_Precipitation, color = Sunshine_duration)) +
  ggforce::geom_sina(aes(Location, Annual_Precipitation, color = Sunshine_duration))

Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

climatedf_comp %>% 
  dplyr::filter(Location %in% c("ZürichFluntern",
                                "Säntis",
                                "Samedan",
                                "LocarnoMonti")) %>% 
  # dplyr::group_by(Location) %>% 
  # dplyr::mutate(Sunshine_duration = scale(Sunshine_duration))  %>%
  tidyr::drop_na() %>%
  ggplot() +
  geom_boxplot(aes(Location, Sunshine_duration, color = Annual_Precipitation)) +
  geom_jitter(aes(Location, Sunshine_duration, color = Annual_Precipitation))

Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

  # ggforce::geom_sina(aes(Location, Annual_Precipitation, color = Sunshine_duration))

Bonus challenge

We have already shortly had a look at facets which allow to easily arrange multiple plots. But so far we have only considered the case where each subplot shows the same variables, e.g Sunshine_duration vs. Annual_frost_days. What if instead you would like to use facets to plot multiple variables? For instance you would like to do a plot containing two subplots, the first Annual_frost_days vs. Sunshine_duration and the second Annual_summer_days vs. Sunshine_duration?

There are basically two options:

Do both plots separately and then combine them
Do both plots simultaneously using facets

We will in the following explore both options.

Combine plots

There are many options available how to combine plots. Two useful packages are cowplot (for all graphics) and ggpubr (for ggplots). In this exercise we will use ggpubr.

Exercise 1

Create two ggplot2 scatterplots, Annual_frost_days vs. Sunshine_duration and Annual_summer_days vs. Sunshine_duration, color by location. Combine the two plots using ggpubr::ggarrange and make sure to have only one legend. Also make sure to have the same axis range in both plots.

Solution

minmax <- c(min=min(na.omit(c(climatedf_comp$Annual_frost_days, 
                              climatedf_comp$Annual_summer_days))),
               max=max(na.omit(c(climatedf_comp$Annual_frost_days, 
                                 climatedf_comp$Annual_summer_days))))
pl1 <- climatedf_comp %>% 
  ggplot() +
  geom_point(aes(Annual_frost_days, Sunshine_duration, color = Location)) +
  xlim(minmax)+
  labs(x="Annual frost days",
       y="Sunshine duration")

pl2 <- climatedf_comp %>% 
  ggplot() +
  geom_point(aes(Annual_summer_days, Sunshine_duration, color = Location)) +
  xlim(minmax)+
  labs(x="Annual summer days",
       y="Sunshine duration")

ggpubr::ggarrange(pl1, pl2, common.legend = TRUE)

The second option is to use facets (e.g. ggplot2::facet_wrap). Since our data is currently not in the correct format we first have to bring it into shape. This can be done using tidyr::pivot_longer which transforms data from wide to long format. The wide format means we have multiple values per row while the long format means we only have a single value while the remaining columns act as an identifier of the sample. You can learn more about pivot, long and wide formats by running vignette("pivot",package = "tidyr") in the console.

Exercise 2

Use tidyr::pivot_longer to bring the data into long format and plot Annual_frost_days vs. Sunshine_duration and Annual_summer_days vs. Sunshine_duration in the same plot using ggplot2::facet_wrap.

Hint: The columns to pivot into longer format are Annual_frost_days and Annual_summer_days.

Solution

climatedf_comp %>% 
  dplyr::select(Location, Year, Sunshine_duration, 
                Annual_frost_days, Annual_summer_days) %>% 
  dplyr::rename(`Annual frost days` = Annual_frost_days,
                `Annual summer days` = Annual_summer_days) %>% 
  tidyr::pivot_longer(cols = c("Annual frost days", "Annual summer days")) %>% 
  ggplot() +
  geom_point(aes(value, Sunshine_duration, color = Location)) +
  facet_wrap( ~ name) +
  labs(x = "Days", y = "Sunshine duration") 

Optional: ggplot2 theme

Rotate axis text

In some situations where labels on the x-axis are long they can overlap with the default setting:

A solution can be to rotate the labels:

Reproduce the above plot.
Hint: use the argument axis.text.x in the theme function and make sure to check the expected input class in axis.text.x.

Solution

climatedf_comp %>% 
  ggplot() + 
  geom_violin(aes(Location, Sunshine_duration, color = Location)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(y = "Sunshine duration", x = "")

Custom colors

You can generate custom colors using RColorBrewer::brewer.pal. The generated colors can then be used in combination with scale_color_manual(values=generated_colors).

Solution

climatedf_comp_red <- climatedf_comp[climatedf_comp$Location %in% 
                                       c("Luzern","ZürichFluntern","Lugano"), ]
colors_use <- RColorBrewer::brewer.pal(length(unique(climatedf_comp_red$Location)), 
                                       "Set2")

climatedf_comp_red %>% 
  ggplot() +
  geom_violin(aes(Location, Sunshine_duration, color = Location, fill = Location)) +
  scale_color_manual(values = colors_use) +
  scale_fill_manual(values = colors_use) +
  labs(y = "Sunshine duration", x = "")

Key Points

Be simple, clear and to the point

Show the data

Be honest about the axes

Use colors sensibly

previous episode

Understanding Open and Reproducible Science

lesson home

Reproducible and honest visualizations

Overview

Data visualization

Good practice for visualizations

1. Be simple, clear and to the point

Encoding data using visual cues

3D

Avoid occlusion

Pie charts

Arrangement of plots

2. Show the data

Boxplots

Bar plots

3. Be honest about the axes

Cutting axes

Axis transformation

Aspect ratio

Bin width of histograms

Axis scales in multiple plots

4. Use colors sensibly

Color contrast

Color vision deficiency

Quiz

1

Solution

2

Solution

3

Solution

4

Solution

5

Solution

6

Solution

7

Solution

8

Solution

Episode challenge

Task 1: data exploration

1.1 First look

Solution

1.2 Which has been the hottest year?

Solution

1.3 Which has been the coldest year?

Solution

Task 2: visualization

2.1 Association of Annual_temperature and Year

Solution

2.2 Add information on the altitude

Solution

2.3 Normalization

Solution

2.4 Associations between Annual_Precipitation, and Sunshine_duration

Solution

Bonus challenge

Combine plots

Exercise 1

Solution

Facets

Exercise 2

Solution

Optional: ggplot2 theme

Rotate axis text

Solution

Custom colors

Solution

Key Points

previous episode

lesson home

2.1 Association of `Annual_temperature` and `Year`

2.4 Associations between `Annual_Precipitation`, and `Sunshine_duration`