ggGallery: Post #43. Pair them up—creating scatterplot matrices and correlation heatmaps in ggplots

Introduction

Welcome to the first post of 2026! It’s been over a half year since my last post, so supper happy to be back in the saddle!

Suppose you have a dataset (with continuous variables, categorical variables, or a mix of both) and you would like to explore the bivariate relationships between those variables, what would you do? Well, there might be many ways to approach this task, but a scatterplot matrix will probably be something you are looking for: It visualizes all pairwise combinations of variables using a range of plot types, for example, a scatterplot for two continuous variables and a boxplot for a continuous vs. a categorical variable. If you have only continuous variables in the dataset, you can further visualize their pairwise correlation coefficients in a correlation heatmap, which uses color to show the sign and the strength of correlations. Interesting patterns will often reveal themselves in scatterplot matrices and correlation heatmaps!

So in this post, we’ll be exploring scatterplot matrices and correlation heatmaps in ggplots using the extension package GGally and ggcorrheatmap. Are you ready? Let’s start pairing things up!

Part I. Create scatterplot matrices with `GGally`

In the first part of the post, we’ll create scatterplot matrices using the package GGally. We’ll use the “Soils” dataset from the package cardata: the dataset contains various physical and chemical characteristics of soils from different topographic positions. To keep things simple, we’ll just use two continuous (“pH” and “N”) and two categorical (“Contour” and “Depth”) variables in the following examples. This also allows us to explore all three possible variable combinations (continuous–continuous, continuous–categorical, and categorical–categorical).

(1) Create a basic scatterplot matrix

The function for creating a scatterplot matrix is ggpair(), which simply takes a dataframe and returns a scatterplot matrix. By default, the diagonal plots will be density plots for continuous variables and barplots of level counts for categorical variables. For the off-diagonal plots, it depends on the types of variable combinations:

For continuous–continuous variable combinations, one triangular matrix will display scatterplots and the other triangular matrix will display correlation coefficients.
For continuous–categorical variable combinations, one triangular matrix will display histograms of the continuous variable faceted by the categorical variable and the other triangular matrix will display boxplots of the continuous variable by each level of the categorical variable.
For categorical–categorical variable combinations, one triangular matrix will display mosaic plots of the frequencies of the two categorical variables and the other triangular matrix will display barplots of the level counts of one categorical variable faceted by the other categorical variables.

We can also map a categorical variable to the “color” aesthetic and use both scale_color_XXX and scale_fill_XXX to adjust the colors in the plots.

library(tidyverse)
library(GGally)
library(carData)  # for the dataset "Soils"

### Select four focal columns
Soils_subset <- Soils %>% 
  select(Contour, Depth, pH, N)

### Create a basic scatterplot matrix
ggpairs(Soils_subset, aes(color = Contour),  # color the plots by the variable "Coutour"
        columnLabels = c("Topography", "Soil depth", "pH", "N (%)")) + 
  scale_color_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) + 
  scale_fill_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) +  # need to specify the fill scale too  
  theme_bw()

(2) Change plot types in the scatterplot matrix

If you are not satisfied with the default plot types, we can of course change them! To do so, we’ll use the arguments “upper” and “lower”, which control the plots in the upper and the lower triangular matrices. These two arguments take a list of three possible named elements: “continuous”, “combo”, and “discrete”, representing the continuous-continuous, the continuous-categorical, and the categorical-categorical variable combinations. We can specify new plot types in those named elements using character strings of GGally’s built-in plot options (e.g., “smooth_lm”). A full list of the built-in options is available via vig_ggally("ggally_plots").

Also, if we want to change the default parameter settings for the plots, we can wrap() the plot names and specify the new parameter values.

### Change plot types using built-in plot options
ggpairs(Soils_subset, aes(color = Contour),
        columnLabels = c("Topography", "Soil depth", "pH", "N (%)"),
        
        # new plot types for the upper triangular matrix
        upper = list(continuous = wrap("smooth_lm", se = F),  # wrap the plot name to change the parameter settings
                     combo = wrap("facetdensity", linewidth = 1),
                     discrete = "crosstable"),
        
        # new plot types for the lower triangular matrix
        lower = list(continuous = wrap("smooth_loess", se = F),
                     combo = wrap("dot", width = 0.05),
                     discrete = "count")) + 
  scale_color_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) + 
  scale_fill_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) +
  theme_bw()

Besides the built-in plot options, we can also create our own custom plot functions! The custom plot functions should follow the structure function(data, mapping, ...) { } and return a ggplot. For example, we can create a custom plot function for the continuous-categorical variable combination that shows the means and the standard errors. We can then pass this function to the list.

### Create a custom plot function for continuous vs. categorical variables
pointrange_fun <- function(data, mapping, ...) {
   ggplot(data = data, mapping = mapping) +
    stat_summary(geom = "pointrange", fun.data = "mean_se", position = position_dodge(width = 1))
}

### Change plot type using the custom plot function
ggpairs(Soils_subset, aes(color = Contour),
        columnLabels = c("Topography", "Soil depth", "pH", "N (%)"),
        legend = 1,  # create a legend based on plot #1
        upper = list(continuous = wrap("smooth_lm", se = F),
                     combo = pointrange_fun,  # use the custom plot function "pointrange_fun" for plotting continuous vs. categorical variables
                     discrete = "crosstable"),
        lower = list(continuous = wrap("smooth_loess", se = F),
                     combo = wrap("dot", width = 0.05),
                     discrete = "count")) + 
  scale_color_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) + 
  scale_fill_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) +
  theme_bw() + 
  theme(legend.position = "top",
        legend.key.spacing.x = unit(0.5, "in"),
        legend.title = element_blank())

(3) Modify a certain plot in the scatterplot matrix

We can further modify a certain plot in the scatterplot matrix! First, we’ll store the scatterplot matrix in an object. We’ll then extract the plot using the row and the column indices and make changes to the extracted plot. Finally, we’ll replace the original plot with the new one using the same row and column indices:

### Store the previous scatterplot matrix in an object
p_scatterplot_matrix <- ggpairs(Soils_subset, aes(color = Contour),
        columnLabels = c("Topography", "Soil depth", "pH", "N (%)"),
        legend = 1,
        upper = list(continuous = wrap("smooth_lm", se = F),
                     combo = pointrange_fun,
                     discrete = "crosstable"),
        lower = list(continuous = wrap("smooth_loess", se = F),
                     combo = wrap("dot", width = 0.05),
                     discrete = "count")) + 
  scale_color_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) + 
  scale_fill_manual(values = c("#43a2ca", "#006d2c", "#d95f0e")) +
  theme_bw() + 
  theme(legend.position = "top",
        legend.key.spacing.x = unit(0.5, "in"),
        legend.title = element_blank())

### Extract the plot at row 2 and column 4
p_2_4_original <- p_scatterplot_matrix[2, 4]

### Modify the extracted plot
p_2_4_new <- p_2_4_original + aes(shape = Contour)

### Replace the original plot with the new one
p_scatterplot_matrix[2, 4] <- p_2_4_new
p_scatterplot_matrix

With this functionality, we can essentially customize the scatterplot matrix in any way we like!

Part II. Create correlation heatmaps with `ggcorrheatmap`

In the second part of this post, we’ll focus on the continuous variables in the Soils dataset. ggcorrheatmap offers a wide variety of functionality for creating correlation heatmaps. We’ll start with the basics and add more complexity as we go.

(1) Create a basic correlation heatmap

ggcorrhm() is the master function for creating correlation heatmaps. Simply pass your data to the function and you’ll get a basic heatmap (note that the default choice of the correlation measure used in the plot is the Pearson’s r)!

library(ggcorrheatmap)

### Select continuous variables
Soils_continuous <- Soils %>% 
  select(is.numeric)

### Create a basic correlation heatmap
ggcorrhm(Soils_continuous)

We can also change the cell color/shape and remove the diagonal cells (which all have a correlation coefficient of 1).

### Change cell color and shape
ggcorrhm(Soils_continuous,
         high = "#de2d26",  # color for the upper limit
         mid = "white",  # color for the midpoint
         low = "#006d2c",  # color for the lower limit
         mode = 21,  # shape of the cell (a number between 1 to 25)
         include_diag = F,  # remove diagonal cells
         size_range = c(5, 10),  # cell size range
         layout = "bottomright")  # show only the bottom-right triangular matrix

Looks like there are some strong positive correlations (e.g., between Na and Conductance and between N and P) and some strong negative ones (e.g., between N and Density and between P and Density). So what are the actual coefficients (and are they statistical significant)?

(2) Create a correlation heatmap with a mixed layout

ggcorrhm() allows users to create correlation heatmaps with a mixed layout (e.g., a mix of heatmap and text display). To do so, we first have to tell the function that we want two triangular matrices, using the “layout” argument. We will also need to specify what we would like to display in these two triangular matrices using the “mode” argument: “heatmap” gives the colored cells, whereas “none” gives empty cells. Next, we can use the “p_values” argument to specify whether we would like to display the significance stars in each of the two triangular matrices. Finally, we can use the “cell_labels” argument to specify whether we would like to display the correlation coefficients in each of the two triangular matrices.

### A mixed layout with color cells and p-values
ggcorrhm(Soils_continuous,
         layout = c("topleft", "bottomright"),  # create two triangular matrices
         mode = c("heatmap", "none"),  # what to display in each of the two triangular matrices
         include_diag = T,
         p_values = c(F, T),  # whether to show the significance stars in each of the two triangular matrices
         p_adjust = "none",  # can be any of the method in "p.adjust.methods"
         p_thresholds = c("***" = 0.001, "**" = 0.01, "*" = 0.05, 1),  # thresholds for annotating the significance stars
         cell_labels = c(T, F))  # whether to show the correlation coefficients in each of the two triangular matrices

All correlation coefficients are significant (without p-value adjustment)!!!

(3) Highlight certain cells in the heatmap

Now that we’ve learned how to make a basic correlation heatmap, we can spruce it up! One thing we can do is to highlight certain cells. For example, we can highlight the cells with very strong correlations (absolute correlation coefficients larger than 0.8).

To do so, we’ll first create a basic heatmap and set “return_data = T”. This allows us to access the data underlying the heatmap. We can then save the data as a new object and create two new columns for the cell border width and the cell border color based on the correlation coefficients. Finally, we recreate the original heatmap and add a new “geom_tile()” layer where we specify our new cell border width and color.

### Create a basic plot
p_basic <- ggcorrhm(Soils_continuous, 
                    layout = "bottomright", 
                    include_diag = F,
                    return_data = T)  # return the underlying data

### Create two new columns for the cell border width and color
p_data <- p_basic$plot_data %>% 
  mutate(border_width = if_else(abs(value) < 0.8, 0, 1),  # thicker cell border if |r| > 0.8
         border_color = if_else(abs(value) < 0.8, NA, "red"))  # red cell border if |r| > 0.8

### Add a new geom_tile() layer to the original plot
ggcorrhm(Soils_continuous, 
         layout = "bottomright", 
         include_diag = F,
         p_values = T) + 
  geom_tile(data = p_data,
            aes(x = col, y = row),
            fill = NA,
            linewidth = p_data$border_width,  # new cell border width 
            color = p_data$border_color)  # new cell border color

(4) Cluster the variables and annotate the heatmap

There are two more things we can do to make the heatmap look even fancier: (1) clustering the variables with a dendrogram and (2) annotating the plot.

To cluster the variables, we need to specify “cluster_rows = T” and “cluster_cols = T”, which will perform hierarchical clustering on the original data using the function hclust(). We can specify the clustering distance (default is the Euclidean distance) and the agglomeration method (default is the complete method). We can also decide whether we would like to display the row/column dendrogram using the argument “show_dend_rows/show_dend_cols = T/F”. Finally, there are multiple “dend_XXX” arguments for adjusting the dendrogram appearance.

To annotate the heatmap, we’ll first create a new dataframe with a column indicating the variable names and one or more columns of annotation variables (which can be both categorical or continuous). Simply pass the data to the argument “annot_rows_df =” (or “annot_cols_df =” depending on the side of the heatmap on which you would like annotate) and the annotation will show up in the plot! We can further customize the annotation using a variety of “annot_XXX” arguments, as shown in the example below.

### Data for highlighting cell borders
p_basic <- ggcorrhm(Soils_continuous, 
                    layout = "topright", 
                    include_diag = F,
                    return_data = T)

p_data <- p_basic$plot_data %>% 
  mutate(border_width = if_else(abs(value) < 0.8, 0, 1),
         border_color = if_else(abs(value) < 0.8, NA, "#fed976"))

### Create a dataframe for annotation
annotation_df <- data.frame(.names = colnames(Soils_continuous),  # variable names
                            Type = c("Non-element", "Element", "Non-element",  # annotation variable
                                     "Element", "Element", "Element",
                                     "Element", "Element", "Non-element"))

### Cluster the variables and annotate the heatmap
ggcorrhm(Soils_continuous, 
         layout = "topright", 
         include_diag = T,
         high = "#e34a33",
         mid = "white",
         low = "#4575b4",
         p_values = T,
         
         # arguments for clustering
         cluster_rows = T,  # cluster the rows  
         cluster_cols = T,  # cluster the columns
         cluster_distance = "euclidean",  # clustering distance measure
         cluster_method = "complete",  # clustering method
         show_dend_cols = F,  # hide column dendrogram
         dend_height = 0.7,  # rescale dendrogram branch length
         dend_lwd = 0.7,  # dendrogram line width
         
         # arguments for annotation
         annot_rows_df = annotation_df,  # dataframe for annotation
         annot_rows_names_side = "top",  # side of annotation title (this has no effect here since the annotation name is hidden)
         show_annot_names = F,  # hide the annotation title ("Type")
         annot_dist = 0.3,  # distance between annotation and heatmap
         annot_rows_col = list("Type" = scale_fill_manual(name = "Variable type", values = c("#f1a340", "#01665e"))),  # annotation tile color
         annot_border_lwd = 0.5,  # annotation border width
         annot_border_col = "white") +  # annotation border color
  geom_tile(data = p_data, aes(x = col, y = row),
            fill = NA,
            linewidth = p_data$border_width,
            color = p_data$border_color) + 
  theme(legend.position = "inside",
        legend.position.inside = c(0.1, 0.35))

Awesome! Looks like there are two main clusters of variables: [N, P, K, pH, and Ca] vs. [Mg, Density, Na, and Conductance].

Summary

In the first part of the post, we learned how to make scatterplot matrices with the package GGally. We started with a basic scatterplot matrix and then modified the plots in the scatterplot matrix using the package’s built-in options as well as the custom plot functions we created on our own. In the second part of the post, we focused specifically on continuous variables and learned how to make correlation heatmaps with the package ggcorrheatmap. Again we started with a basic correlation heatmap and then modified the heatmap layout to add text of correlation coefficients. We also highlighted certain cells in the heatmap by changing the color and the width of cell borders. Finally, we clustered the variables with a dendrogram and added an annotation to the heatmap. Scatterplot matrices and correlation heatmaps are pretty common, and we now know there are many ways to make them more informative and engaging!

Hope you learn something useful from this post and don’t forget to leave your comments and suggestions below if you have any!

Comment on this article Share:

Post #43. Pair them up—creating scatterplot matrices and correlation heatmaps in ggplots