
A “NetCDF 4 in R” cheatsheet


I recently promised a "NetCDF in R" cheat sheet to a friend, and I thought it might make a useful tutorial. NetCDF files are often used to distribute gridded, multidimensional spatial data such as sea surface temperature, chlorophyll-a levels and so on. NetCDF is more than just a file format, and so googling it can be a little intimidating. I hope this helps make these files a little easier to use in R.

A full specification for NetCDF can be found here. NetCDF files are really great for data distribution because they are self-describing; in other words, they tell you what's inside. Additionally, if you have a very large NetCDF file, you can pull out only the subset of data you are interested in instead of opening the whole thing. There's more to them than this, but those are the important bits to me so far.

Data

We’ll use (amazingly) free chlorophyll-a data from http://www.globcolour.info. These are gridded longitude by latitude values. Bear in mind that your NetCDF files may contain higher dimensions (e.g. longitude by latitude by time).

R

I’m using ncdf4 along with Hadley Wickham’s reshape2 and dplyr libraries.

library(ncdf4)
library(reshape2)
library(dplyr)
setwd("D:/My Folders/R/2016/blog/20160804_r_netcdf_cheatsheet")
# retrieve a list of nc files in my data folder:
flist <- list.files(path = "data/", pattern = "^.*\\.(nc|NC|Nc|nC)$")

Investigate a NetCDF file

We’ll dissect a single NetCDF file first and then we’ll tie it all together in a function at the end.

# Open a connection to the first file in our list
nc <- nc_open(paste0("data/", flist[1]))

You can get a description of what’s inside the nc file using print(nc) which dumps info to your console. The output can be quite long so I prefer to save it to a text file. That way I can keep it open and avoid continuous scrolling in my console window.

# Save the print(nc) dump to a text file (same name as the nc file with a txt extension)
{
    sink(paste0("data/", flist[1], ".txt"))
    print(nc)
    sink()
}

We can also access info about what's inside the NetCDF file by investigating its attributes. The term "attributes" can get a little confusing here because it refers to two different things. I'll try to clarify which I'm referring to.

  • We use R’s attributes() function to get a list of a variable’s attributes in the R workspace (?attributes for more).

  • The NetCDF data has its own list of global attributes (product info, spatial extents etc) as well as attributes for each NetCDF variable (units etc).

Calling the R attributes of the NetCDF file connection provides access to some information about the file, e.g. variable names (var), number of variables (nvars) etc.

# Get a list of the NetCDF's R attributes:
attributes(nc)$names
##  [1] "filename"    "writable"    "id"          "safemode"    "format"     
##  [6] "is_GMT"      "groups"      "fqgn2Rindex" "ndims"       "natts"      
## [11] "dim"         "unlimdimid"  "nvars"       "var"
print(paste("The file has",nc$nvars,"variables,",nc$ndims,"dimensions and",nc$natts,"NetCDF attributes"))
## [1] "The file has 3 variables, 2 dimensions and 52 NetCDF attributes"

Retrieve the data

Calling the R attributes of "var" (revealed above) gives us the NetCDF variable names. These names in turn give us access to the data and the associated NetCDF attributes (units etc.).

Our data contains three grids: "CHL1_mean", "CHL1_flags" and "CHL1_error". We'll just focus on the chlorophyll-a data. The flags grid relates to cloud cover, and the error grid describes the error associated with each data point. These are fully described in the Product User Guide. We use the ncvar_get() function to save the data locally; refer to ?ncvar_get for more on extracting subsets.

# Get a list of the nc variable names.
attributes(nc$var)$names
## [1] "CHL1_mean"  "CHL1_flags" "CHL1_error"
# Take a look at the chlorophyll variable's nc attributes (units etc).
ncatt_get(nc, attributes(nc$var)$names[1])
## $standard_name
## [1] "mass_concentration_of_chlorophyll_a_in_sea_water"
## 
## $long_name
## [1] "Chlorophyll concentration - Mean of the binned pixels"
## 
## $`_FillValue`
## [1] -999
## 
## $units
## [1] "mg/m3"
## 
## $pct_characterised_error
## [1] 43.31
# Retrieve a matrix of the chlorophyll data using the ncvar_get function:
chla_mean <- ncvar_get(nc, attributes(nc$var)$names[1])

# Print the data's dimensions
dim(chla_mean)
## [1] 453 256
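
As an aside, ncvar_get() can read just a subset (hyperslab) of a large variable via its start and count arguments, so you never have to load the full grid into memory. A minimal sketch, with arbitrary counts:

# Read only the first 100 longitudes and 50 latitudes of the chlorophyll grid.
# start is the first index along each dimension; count is how many values to read.
chla_sub <- ncvar_get(nc, "CHL1_mean", start = c(1, 1), count = c(100, 50))
dim(chla_sub)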

Now we will retrieve the latitude and longitude data stored as NetCDF dimensions ("dim"). We can compare the extents of these dimensions with those of our data matrix to confirm that they match. We will then use them to assign meaningful row and column names to our data.

# Retrieve the latitude and longitude values.
attributes(nc$dim)$names
## [1] "lat" "lon"
nc_lat <- ncvar_get( nc, attributes(nc$dim)$names[1])
nc_lon <- ncvar_get( nc, attributes(nc$dim)$names[2])

print(paste(dim(nc_lat), "latitudes and", dim(nc_lon), "longitudes"))
## [1] "256 latitudes and 453 longitudes"

These dimensions match those of the chl-a data, which is arranged as an nc_lon (rows) by nc_lat (columns) matrix. This also matches what we saw in the text file.
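
If you want to confirm this programmatically, a quick sanity check:

# Confirm that the grid dimensions match the dimension vectors.
stopifnot(dim(chla_mean)[1] == length(nc_lon), dim(chla_mean)[2] == length(nc_lat))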

Tidy up a bit

We can make the data a little easier to think about with a bit of labelling and by transposing the data matrix so that the latitudes are in the rows and longitudes are in the columns. This isn’t really critical though.

# a quick look at a (pseudo) random section of this data
chla_mean[35:37, 245:247]
##           [,1]      [,2]      [,3]
## [1,] 0.1415056 0.1358830 0.1303250
## [2,] 0.1339336 0.1328560 0.1255906
## [3,] 0.1358152 0.1276533 0.1243571
# Change the dimension names of our matrix to "lon" and "lat", 
# and the row and column names to the latitude and longitude values.
dimnames(chla_mean) <- list(lon=nc_lon, lat=nc_lat)
chla_mean[35:37, 245:247]
##                   lat
## lon                -36.4791717529297 -36.5208358764648 -36.5625038146973
##   16.6458396911621         0.1415056         0.1358830         0.1303250
##   16.6875057220459         0.1339336         0.1328560         0.1255906
##   16.7291717529297         0.1358152         0.1276533         0.1243571
# lastly, you may want to transpose this matrix.
chla_mean <- t(chla_mean)
chla_mean[245:247, 35:37]
##                    lon
## lat                 16.6458396911621 16.6875057220459 16.7291717529297
##   -36.4791717529297        0.1415056        0.1339336        0.1358152
##   -36.5208358764648        0.1358830        0.1328560        0.1276533
##   -36.5625038146973        0.1303250        0.1255906        0.1243571

Global attributes

Let’s take a look at the global NetCDF attributes. We saw above and in the text file that there were 52 global attributes in this file and they contain all kinds of useful info. Which attributes you will need is dependent on your analysis. We’ll retrieve all the attributes with the ncatt_get() function. The names() function will give us a list of attribute names. These names will give us access to the relevant values. Thereafter it’s easy to store whichever you want. I’ll store the start and end date-times for this data set.

# Retrieve the attributes
nc_atts <- ncatt_get(nc, 0)

# List all the attributes (commented to save space).
# names(nc_atts)

# Retrieve the start and end date-times
date_time_start <- as.POSIXct(nc_atts$start_time, format = "%Y%m%dT%H%M%SZ", tz = "UTC")
date_time_end <- as.POSIXct(nc_atts$end_time, format = "%Y%m%dT%H%M%SZ", tz = "UTC")

It's always a good idea to close the connection to the nc file when you're done.

nc_close(nc)

Processing multiple NetCDF files

We've investigated various properties of a NetCDF file and seen how to extract and store variables. We have a whole directory of daily chlorophyll-a data files though, and we don't want to process them one at a time by hand. We'll write a small function to loop through them and, using what we've learned, extract the data and add it to a dataframe in long format. If you need other data structures it should be easy to adjust the function as required.

# Define our function
process_nc <- function(files){
    # iterate through the nc files
    for (i in seq_along(files)){
        # open a connection to the ith nc file
        nc_tmp <- nc_open(paste0("data/", files[i]))
        # store values from variables and attributes
        nc_chla <- ncvar_get(nc_tmp, attributes(nc_tmp$var)$names[1])
        nc_lat <- ncvar_get(nc_tmp, attributes(nc_tmp$dim)$names[1])
        nc_lon <- ncvar_get(nc_tmp, attributes(nc_tmp$dim)$names[2])
        nc_atts <- ncatt_get(nc_tmp, 0)
        nc_start_date <- as.Date(nc_atts$period_start_day, format = "%Y%m%d", tz = "UTC")
        # close the connection since we're finished
        nc_close(nc_tmp)
        # set the dimension names and values of your matrix to the appropriate latitude and longitude values
        dimnames(nc_chla) <- list(lon=nc_lon, lat=nc_lat)

        # I'm choosing to store all the data in long format.
        # depending on your workflow you can make different choices here...
        # Your variable may get unmanageably large here
        # if you have high spatial and temporal resolution nc data.
        tmp_chl_df <- melt(nc_chla, value.name = "chla")
        tmp_chl_df$date_start <- nc_start_date

        # bind the new data to the output variable, creating it on the first pass.
        # inherits = FALSE stops exists() from picking up a variable of the
        # same name in the calling environment.
        if (exists("chla_data_monthly", inherits = FALSE)){
            chla_data_monthly <- bind_rows(chla_data_monthly, tmp_chl_df)
        }else{
            chla_data_monthly <- tmp_chl_df
        }
        # tidy up; not strictly necessary, but neater
        rm(nc_chla, nc_lat, nc_lon, nc_tmp, nc_atts, nc_start_date, tmp_chl_df)
    }

    return(chla_data_monthly)
}


data <- process_nc(flist)

Ok that’s it. I hope someone finds this useful :). Please feel free to comment below.

South African Local Government Elections 2016


A visual comparison of party effort across municipalities / districts.

With South Africa's local government elections in a few days' time (3rd August 2016), I wondered about the effort being expended by the parties in each district or municipality. Has it increased or decreased since the 2011 elections? I must point out that I am not a political analyst or a political scientist. I am nevertheless a scientist with an appreciation for analytics.

Employing and promoting candidates costs money. Assuming that our political parties don’t have infinite financial resources it follows that investigating where they invest their resources may be a reasonable proxy for effort. Furthermore, looking at the change in effort adds a temporal dimension, suggesting where effort has increased or decreased between the elections.

The map below shows the change in PR candidate representation by each party across the country. If you hover your mouse pointer over a district, you can see by how much the representation has changed in terms of two metrics:

Relative change: The change in representation as a proportion of all representatives. E.g. In 2011 parties A and B each had 50% of a district’s PR candidates. In 2016 party A has 60% and party B has 40%. Party A’s relative change is 10% and party B’s is -10%.

Absolute change: The change in actual number of PR candidates in a party from 2011 to 2016.
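
In R terms, the two metrics for that toy example look like this (a sketch with made-up counts):

# Hypothetical PR candidate counts in one district for parties A and B.
counts.2011 <- c(A = 5, B = 5)
counts.2016 <- c(A = 6, B = 4)
# Relative change: the change in each party's share of all PR candidates.
counts.2016 / sum(counts.2016) - counts.2011 / sum(counts.2011)
##    A    B 
##  0.1 -0.1
# Absolute change: the change in the raw candidate counts.
counts.2016 - counts.2011
##  A  B 
##  1 -1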

Scroll below the map for more info regarding data sources and methods.

Disclaimer: As I said though, I am no political scientist and so maybe this is all nonsense.

Embed this map using the following iframe code:

<iframe width="100%" height="730px" src="https://philmassie.github.io/election_effort_d3/za_election_effort_2016.html" scrolling = "no" frameborder="0" seamless name="iframe-election-content" id="iframe-election-content">
<p>Your browser does not support iframes. Click <a href="https://philmassie.github.io/election_effort_d3/">here</a> to navigate to the content.</p>
</iframe>

Data sources

The Independent Electoral Commission have published the 2016 candidate lists as pdf documents. The good people at openAfrica have already gone to the trouble of processing these pdfs (which can be a bit painful) so I used their lists instead. openAfrica also host the 2011 candidate lists. After learning a little more about the electoral system, it seemed that the proportional representative candidates would be the interesting group to look at.

The Municipal Demarcation Board website hosts shape files for South Africa’s districts and provinces.

R data processing

Some simple wrangling in R (scroll down for the process) exposed the list of district codes which I joined with the appropriate candidate lists from each election year. From there I produced a list of parties, ordered by the total number of district or municipal representatives. The Economic Freedom Fighters lead this list with a total of 1501 candidates, 106 more than the African National Congress and 139 more than the Democratic Alliance. These are the three largest parties which is why I mention them. I chose to load the map with ANC data because the EFF didn’t exist during the 2011 elections and so they’ve not decreased in any provinces. I believe that a few candidates have withdrawn from the elections. I’m not sure who they were but I doubt their withdrawal would make any substantive change to this analysis.

## Libraries
library(rgdal)
library(xlsx)
library(plyr)
library(stringr)
library(dplyr)
setwd("D:/My Folders/R/2016/blog/20160714_election_effort")
# Load the district shape file to get a list of district codes
za.districts <- readOGR("Districts", layer="DistrictMunicipalities2011")
za.district.list <- unique(za.districts@data$DISTRICT)
rm(za.districts)
# load and process electoral candidate data
# 2011 Proportional Representative candidates
pr.cand.2011 <- read.xlsx2("data/2011-pr-candidate-lists.xls", 1, startRow = 3)
# A few minor edits to party names
pr.cand.2011$Party <- str_to_title(str_replace_all(pr.cand.2011$Party, "  ", " "))
pr.cand.2011$Party[pr.cand.2011$Party == "Democratic Alliance/Demokratiese Alliansie"] <- "Democratic Alliance"
pr.cand.2011$Party[pr.cand.2011$Party == "Independent Ratepayers Association Of Sa"] <- "Independent Ratepayers Association Of SA"
pr.cand.2011$Party[pr.cand.2011$Party == "South African Maintanance And Estate Beneficiaries Associati"] <- "South African Maintanance And Estate Beneficiaries Association"
# 2016 ward & proportional Representative candidates were in the same data set
ward.pr.cand.2016 <- read.csv("data/Electoral_Candidates_2016.csv")
# A few minor edits to party names
ward.pr.cand.2016$Party <- str_to_title(str_replace_all(ward.pr.cand.2016$Party, "  ", " "))
ward.pr.cand.2016$Party[ward.pr.cand.2016$Party == "Independent Ratepayers Association Of Sa"] <- "Independent Ratepayers Association Of SA"
ward.pr.cand.2016$Party[ward.pr.cand.2016$Party == "South African Maintanance And Estate Beneficiaries Associati"] <- "South African Maintanance And Estate Beneficiaries Association"

# A number of the 2016 district codes had a weird "\f" prefix.
# I consulted the original IEC data, confirmed that these were errors, and tidied them up.
err_index <- grep("\f", ward.pr.cand.2016$Municipality)
ward.pr.cand.2016$Municipality[err_index] <- str_replace(ward.pr.cand.2016$Municipality[err_index], "\f", "")
ward.pr.cand.2016 <- droplevels(ward.pr.cand.2016)

#### Split the data into ward and pr data
# change variable names so they match later
pr.cand.2016 <- subset(ward.pr.cand.2016, ward.pr.cand.2016$PR.List.OrderNo...Ward.No < 1000)
names(pr.cand.2016)[4] <- "list.order.no"
#### Process ward and PR data
processor <- function(df){
    # get the variable name of passed data
    data.name <- (deparse(substitute(df)))
    # split the municipality names into codes and names
    dist.split <- str_split_fixed(df$Municipality, " - ", 2)
    df$dist.code <- dist.split[,1]
    df$dist.name <- dist.split[,2]
    # Keep only the district values from the shape file. Main centers and DC areas.
    df <- df[df$dist.code %in% za.district.list, ]
    # split data by district and party, counting how many candidates per party per district
    df.long <- ddply(df, c("dist.code", "Party"), function(df) nrow(df))
    # temporary naming, just to keep track
    names(df.long)[3] <- "num"
    # split new data by district, summing tot candidates from all parties (in district)
    df.tot <- ddply(df.long, "dist.code", function(df) sum(df$num))
    # join tot numbers by district to main data
    df.long <- join(df.long, df.tot, by = "dist.code")
    # temporary naming, just to keep track
    names(df.long)[4] <- "tot"
    # calculate the proportion of each party to total, for each district
    df.long$prop <- df.long$num / df.long$tot
    # quick sorting, not critical
    df.long <- arrange(df.long, dist.code, -prop)
    # renaming based on variable name
    names(df.long)[c(3:5)] <- c(paste0(data.name, ".num"), paste0(data.name, ".tot"), paste0(data.name, ".prop"))
    return(df.long)
}
pr.cand.2011.long <- processor(pr.cand.2011)
pr.cand.2016.long <- processor(pr.cand.2016)

# Join data, dropping parties that no longer exist
pr.cand <- join(pr.cand.2011.long, pr.cand.2016.long, by = c("dist.code", "Party"), type = "right")

# replace NAs with zero for parties new in 2016
pr.cand[is.na(pr.cand)] <- 0

# calculate relative and absolute change
pr.cand$rel <- pr.cand$pr.cand.2016.prop - pr.cand$pr.cand.2011.prop
pr.cand$abs <- pr.cand$pr.cand.2016.num - pr.cand$pr.cand.2011.num
# generate data for d3 map
# full data list
d3_data_all <- pr.cand[, c(1, 2, 9, 10)]
names(d3_data_all) <- c("dist_code", "party", "Relative Change", "Absolute Change")
write.csv(d3_data_all, file="data_d3/d3_data_all.csv", row.names = FALSE, quote = FALSE)

# Party data list, ordered by total number of district PR candidates
# group the data by Party
grouped <- group_by(pr.cand, Party)
# Summarise by summing the total candidates per party, excluding zeros and sorting
d3_party_list <- summarise(grouped, tot = sum(pr.cand.2016.num))
d3_party_list <- d3_party_list[d3_party_list$tot > 0, ]
d3_party_list <- arrange(d3_party_list, -tot)
# now we only need the party list
d3_party_list <- data.frame(party = d3_party_list$Party)

write.csv(d3_party_list, file="data_d3/d3_party_list.csv", row.names = FALSE, quote = FALSE)

The map was built with JavaScript and the mighty D3 libraries.

Plot a wind rose in R


Introduction

This is another post regarding some plots that I needed to make for a publication. As before, I relied heavily on Stack Exchange and many other sites for figuring out how to get my plot looking the way I needed it to, and so this is my attempt to contribute back to the broader community.
In my article I wanted a graphic which illustrated the preferred outward post-moult migration direction of adult female southern elephant seals from Marion Island. In reality it doesn’t matter too much what you want to plot, and these sorts of plots are more generally used for wind direction illustrations. Really, they are like radial histograms. In fact, if you look through the ggplot2 call, it is basically a histogram until the last couple of lines, where it is wrapped into a wind rose.

The data

As I mentioned, my data was related to seal swimming directions, gathered from satellite tags. For this tutorial I will simulate 100000 directions using the wrapped normal function (rwrpnorm) from the CircStats package.

library(ggplot2)
library(CircStats)

set.seed(1)
data <- data.frame(direction = deg(rwrpnorm(100000, rad(225), rho=0.8, sd=1)))

# Take a quick look...
hist(data$direction, main = "Histogram of hypothetical direction frequencies.", xlab = "Direction", ylab = "Frequency")

That looks like a reasonable distribution of directions, favouring 225°.

Method

There are various ways to split and plot these data. Some searching led me to this one. First we will define the bin width (30°), then we will define dir.breaks, which stores the range of each bin as follows: 345°-15°, 15°-45°, 45°-75° etc.

# choose bin size (degrees/bin)
# (note: assigning deg here masks the CircStats deg() function, which we no longer need)
deg <- 30
# define the range of each bin
dir.breaks <- seq(0-(deg/2), 360+(deg/2), deg)

Now we generate a factor variable, exchanging the directions for the bin ranges. We'll also generate some pretty labels and assign them as levels of the new factor. Note that the first (-15° to 15°) and last (345° to 375°) ranges both receive the label "0"; assigning duplicate level names merges them into a single north-facing bin. Finally we'll attach the new variable to the main dataset.

# assign each direction to a bin range
dir.binned <- cut(data$direction,
                       breaks = dir.breaks,
                       ordered_result = TRUE)
# generate pretty labels
dir.labels <- as.character(c(seq(0, 360-deg, by = deg), 0))
# replace ranges with pretty bin labels
levels(dir.binned) <- dir.labels
# Assign bin names to the original data set
data$dir.binned <- dir.binned

Plotting

Theming

The important points here are blank values for panel.border and panel.grid. This turns off the default background for ggplot2 so that we can define the borders and grids manually later.

thm <- theme_bw() + 
    theme(axis.text.x = element_text(size=8, face = "plain"),
          axis.text.y = element_text(size=8, face = "plain"),
          axis.title.x = element_blank(),
          axis.title.y = element_text(size=8, face = "plain", hjust = 0.9, vjust = 1.3),
          panel.border = element_blank(),
          panel.grid  = element_blank())

Now the main plot…

The main plot sets things up quite specifically. What we need to know before going on is the frequency of the most common bin.

summary(dir.binned)
##     0    30    60    90   120   150   180   210   240   270   300   330 
##    79     1     5    83   884  4993 15875 28134 28403 15751  4926   866

So there are 28403 counts in the 240° bin. We need this value to set our y-scale correctly. Our y-axis will have breaks every 5000 counts, so we'll use 30000 as the y-axis maximum. The first part of this plot code draws a rectangular histogram; only the coord_polar() call at the end wraps it into a wind rose.
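
If you would rather not hard-code those limits, here is a small sketch that derives them from the data (assuming gridlines every 5000 counts):

# Round the largest bin count up to the nearest 5000 to get the y-axis maximum.
y.interval <- 5000
y.max <- ceiling(max(summary(dir.binned)) / y.interval) * y.interval
y.max
## [1] 30000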

# initialise the plot
plt.dirrose <- ggplot() +
    # since the plot background is blank we'll add a series of horizontal lines, at 5000 count intervals, up to 25000.
    geom_hline(yintercept = seq(0, 25000, by = 5000), colour = "grey60", size = 0.3) +
    # Now we add a darker horizontal line as the top border at 30000.
    geom_hline(yintercept = 30000, colour = "black", size = 0.3) +
    # We want 12 vertical lines representing the centers of the 30° ranges.
    geom_vline(xintercept = c(seq(1,12,1)), colour = "grey60", size = 0.3) +
    # On top of everything we place the histogram bars.
    geom_bar(data = data, aes(x = dir.binned), width = 1, colour="black", size = 0.3, alpha=0.5) + 
    # Add the x-axis labels
    scale_x_discrete( drop = FALSE, labels = c(0, "", "", 90, "", "", 180, "", "", 270, "", "")) +
    # Add the y-axis labels
    scale_y_continuous(limits = c(0, 30000), expand = c(0, 0), 
                       breaks = c(0, 5000, 10000, 15000, 20000, 25000, 30000), 
                       labels = c(0, 5, 10, 15, 20, 25, 30)) +
    # Add the axis titles
    labs(x = 'Outward step bearing (°)', y = 'Count of outward steps (x10³)') +
    # If you only use the plot code up till here you will get a histogram.
    # the next line wraps the histogram into a windrose
    coord_polar(start = -(deg/2)*(pi/180)) +
    # apply theme
    thm

plt.dirrose
Frequencies of hypothetical directions.

There we have it: a grey-scale wind rose illustrating our data exactly as we need it.

Saving the plot according to the publisher's requirements

For completeness, this is one approach to saving your plots:

ggsave(filename = 'plt.dirrose.png', plot = plt.dirrose, width = 84, height = 84, units="mm", dpi = 300, type="cairo-png")

Hope this helps 🙂

Plotting greyscale contoured data in R with ggplot2

Region of elevated mesoscale activity, or eddy field (dashed rectangle), to the south-west of Marion Island. Mean eddy kinetic energy values for the period 2008–2010 are plotted, and the 3000 m isobaths show the series of faults cross-cutting the Southwest Indian Ridge (SWIR).

Introduction

Preparing figures for publication can take a long time (well it does for me anyway), and I relied very heavily on numerous online resources to figure out some of the dos and don’ts. Obviously I owe massive thanks to the hundreds of blogs and Stack Exchange questions and answers I relied on. Where I can, I will try to link to them.

Additionally, this tutorial covers a little bit about working with reshaped gridded data, shape files, GeoTIFF data and even a simple homemade polygon. This then is my contribution to folks trying to get their plots to look the way they need them to.

As most people know, publishing in colour is way more expensive than in grey scale. The costs were completely prohibitive for myself and my co-authors, so I made efforts to change my beautiful, spectacularly coloured plots, which looked so nice in slide shows, into grey scale.

This is also my first attempt to use R Markdown in a real-world scenario. Please feel free to provide critical feedback.

The data

This plot shows mean values of the derived variable eddy kinetic energy (EKE) to the south-west of Marion Island (in the Southern Ocean) from 2008 to 2010. Interactions between the Antarctic Circumpolar Current and a series of undersea faults in the South West Indian Ridge give rise to an area of instability, and it is the degree and extent of this instability that I set out to investigate via EKE.

EKE is derived from u and v components like this. The friendly folks at AVISO have kindly let me use and make available this subset of their data for the purpose of this tutorial. Perhaps if someone is interested I’ll make another post showing how I went about the calculations and wrangling.
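
For reference, the standard definition is EKE = ½(u′² + v′²), where u′ and v′ are the geostrophic velocity anomaly components. A minimal sketch, assuming u.anom and v.anom are hypothetical matrices of velocity anomalies (in cm/s) on the same grid:

# EKE in cm²/s², matching the units used in the plots below.
# u.anom and v.anom are placeholder names, not AVISO variable names.
eke <- 0.5 * (u.anom^2 + v.anom^2)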

If you are interested in this sort of data, please check out AVISO. They make available unfathomable amounts of absolutely beautiful data for free. All you need is an account and it’s yours – Honestly access to data like this blows my mind. I wish it was like this with South African regional data… but that’s another story.

Method

Start by loading all the necessary libraries, setting working directories and loading data. You can download the EKE data set here.

library(sp)
library(rgdal)
library(raster)
library(ggplot2)
library(plyr)
setwd("D:/My Folders/R/2016/blog/20160622_ggplot2_contours")
load(file = "data/data.eke.rdata")
head(data.eke)
##       lon     lat          eke
## 1 -18.875 -68.625 1.052298e-04
## 2 -18.875 -68.375 1.009409e-04
## 3 -18.875 -68.125 7.404519e-05
## 4 -18.875 -67.875 6.044744e-05
## 5 -18.875 -67.625 8.102543e-05
## 6 -18.875 -67.375 9.289138e-05
# When we derive EKE, grid cells over land produce NaNs, so we must replace them with NAs
data.eke$eke[which(is.nan(data.eke$eke))] <- NA

Define our map extents

The EKE data spans from -18.875°E to 43.875°E and from -38.88°N to -68.62°N.
Those are the extents needed for the full study, but they're too much for this plot, so we must subset the data.

latmin <- -55
latmax <- -43
lonmin <- 20
lonmax <- 43
map.extents <- extent(lonmin, lonmax, latmin, latmax)
data.eke.sub <- subset(data.eke,
                       data.eke$lat >= latmin &
                           data.eke$lat <= latmax &
                           data.eke$lon >= lonmin &
                           data.eke$lon <= lonmax)

Shape Files

The plot area includes the Prince Edward Islands. Instead of leaving blank, blocky spaces, let's use some shape files to make it look neater. These shape files come from http://www.naturalearthdata.com. They are from the fine-scale (1:10m) 'Admin 0 - Countries' set, available for free from here.

# read in the shape file
countries <- readOGR("shapes", layer="ne_10m_admin_0_countries")
# Add an id column to the data slot of countries, based on rownames
countries@data$id <- rownames(countries@data)
# Fortify the spatial polygons for ggplot2 - likes long format data
# The resulting data frame contains all the spatial data and the id but no information labels.
countries.df <- fortify(countries)
# Use the id column to join all the label data in countries@data to the new dataframe.
countries.df <- join(countries.df, countries@data, by="id")
# Now we can subset the new dataframe to include only South Africa and her Islands and tidy up a bit
za <- subset(countries.df, countries.df$ISO_A2=="ZA")
rm(countries, countries.df)

GeoTIFF

The plot needs some bathymetric contour lines. We will use the ETOPO1 data for this via a GeoTIFF. There are higher resolution data sets around (e.g. ETOPO2 and GEBCO) but ETOPO1 will suffice for this application. The ETOPO1 data is (freely?) available from here.

# Load up the geotiff
etopo1.full <- raster("etopo/ETOPO1_Ice_c_geotiff.tif")
# Crop out our ROI
etopo.crop <- crop(etopo1.full, map.extents)
# Prepare for ggplot2 (long format) and tidy up a bit
etopo.crop.df <- as.data.frame(etopo.crop, xy=TRUE)
names(etopo.crop.df) <- c("long", "lat", "z")
rm(etopo1.full)

Polygon

Lastly we’ll build a simple polygon to demarcate an area we consider within the region of elevated EKE (eddy field).

# Define extents
eflatmin <- -53
eflatmax <- -47.33
eflonmin <- 27.33
eflonmax <- 37.66
# bind the coordinates into a sequence outlining the new polygon, then generate a Polygon object
ef.coords <- cbind(c(eflonmin, eflonmax, eflonmax, eflonmin, eflonmin),
                   c(eflatmax, eflatmax, eflatmin, eflatmin, eflatmax))
ef.poly <- Polygon(ef.coords)
# Prepare for ggplot2 and tidy up
ef.df <- fortify(ef.poly)
# set an individual group for plotting
ef.df$group <-1
rm(ef.coords, ef.poly)

Plot time

Finally we have all the bits in place so we can build a plot. We’ll start with the colourful version, and then see how to refine it to a grey scale contoured version.

Preliminaries

We start by defining the plot label positions and text and then set some custom theme stuff. See here for more on theming.

# define some labels and their coordinates
lbl <- data.frame(x = c(40, 36, 30, 37.1, 22.5),
                  y = c(-46.5, -53.4, -50, -45, -51.5),
                  txt = c("Marion Island", "Eddy Field", " ", "SWIR", "SWIR"))
# Define the theme data as I like it for these plots...
thm <- theme_bw() + 
    theme(axis.text.x = element_text(size=8, face = "plain"),
          axis.text.y = element_text(size=8, face = "plain"),
          axis.title.x = element_text(size=8, face = "plain"),
          axis.title.y = element_text(size=8, face = "plain"),
          axis.ticks.x = element_line(size=0.3),
          axis.ticks.y = element_line(size=0.3),          
          legend.key.height = unit(13, units="mm"),
          legend.text = element_text(size=8, face = "plain"),
          legend.title = element_text(size=8, angle = 90, vjust = 1, face = "plain"),
          legend.title.align = 0.5,
          panel.border = element_rect(colour = "black", fill=NA, size=.3))

Coloured tiles plot

Ggplot2 is even more awesome than anything you might have read, but it can be intimidating (I find). The best I can suggest is to keep playing and you'll start to get a handle on it. One of the biggest difficulties for me was figuring out what order the layers should be in. This is where I ended up. I'd like to tell you that I had figured out an overriding set of rules, but there was some trial and error involved.

I have tried to comment the coloured plot extensively.

# initiate the plot
plot.eke.colour <- ggplot() +
    # include a layer with the eke data as coloured tile data
    geom_tile(data=data.eke.sub, aes(x=lon,y=lat,fill=eke)) +
    # include the shape file data
    geom_polygon(data = za, aes(x = long, y = lat, group = group), 
                 colour="black", fill="white", alpha=1, size = 0.3) + 
    # include the home made polygon data
    geom_polygon(data=ef.df, aes(x=long, y=lat, group=group), 
                 colour="black", fill="white", alpha=0, linetype="dashed", size = 0.3) +
    # include the bathymetry data with geom_contour. Define a single custom break
    geom_contour(data=etopo.crop.df, aes(x=long,y=lat,z=z), 
                 breaks=c(-3000), colour="black", size = 0.3) +
    # axis labels
    labs(x = 'Longitude (ºE)', y = 'Latitude (ºN)') +
    # map projection - boring choice but suits the task at hand
    # (note: the coord_fixed call below replaces this coordinate system,
    # since ggplot2 only keeps the last coord_* added)
    coord_map("mercator") +
    # this trims unwanted whitespace around the plot
    coord_fixed(xlim = c(lonmin, lonmax), ylim = c(latmin, latmax), expand = FALSE) +
    # change the default EKE colour to more traditional rainbow colours (oceanography etc)
    scale_fill_gradientn(colours = rev(rainbow(7, end=4/6)), 
                         space = "Lab", 
                         guide = guide_colorbar(title="Eddy kinetic energy (cm²/s²)", 
                                                title.position="right")) +
    # include the labels defined earlier
    geom_text(data=lbl, aes(x=x, y=y, label=txt), size=rel(3), colour="white", fontface="bold", alpha = 1) +
    geom_text(data=lbl, aes(x=x+0.01, y=y+0.01, label=txt), size=rel(3), colour="black", fontface="bold") +
    # include the theme settings
    thm
plot.eke.colour
Region of elevated mesoscale activity, or eddy field (dashed rectangle), to the south-west of Marion Island. Mean eddy kinetic energy values for the period 2008–2010 are plotted, and the 3000 m isobaths show the series of faults cross-cutting the Southwest Indian Ridge (SWIR).

That looks pretty good. Unfortunately when we save the file, things don’t always look the same as they do on the screen so there is generally some fine tuning involved. This version, as it stands, makes rather a good print. The colour palette is familiar to all my MATLAB/Python oceanographic colleagues too.

Saving the plot according to the publisher's requirements

For completeness, this is one approach to saving your plots:

ggsave(filename = 'plot.eke.colour.png', plot = plot.eke.colour, width = 174, height = 105, units="mm", dpi = 300, type="cairo-png")

Greyscale contoured plot

If we just jump in and plot the existing data using the stat_contour() approach we immediately hit a problem. It’s a little difficult to explain, but easy enough to understand if you play around a bit. I’ll try my best.

Some of the contoured values exist as closed polygons within the plot frame. These plot as we would expect. Unfortunately, other contoured values are (now) polylines. These lines would have closed somewhere outside of the plot extents, but were cropped, resulting in open polygons. ggplot2 can't fill them with shading as required, so it closes them automatically by joining their ends, which is pretty catastrophic. You can see what happens for yourself by exchanging data.eke.sub.contour for data.eke.sub in the plot code below.

Calculate a low arbitrary value

To avoid this we need to add extra, make-believe EKE data just outside the plotting frame. We set this to a low, arbitrary value, which allows ggplot2 to close off the contour polygons we are interested in. The exact value is not important, as long as it is lower than everything else. I found this approach on Stack Overflow here. It looks at the data spread, divides it into n bins, and sets the arbitrary value to the lowest value minus 1.5 times the bin width.

bins <- 50
bin.width <- diff(range(na.omit(data.eke.sub$eke))) / bins
arbitrary.value <- min(na.omit(data.eke.sub$eke)) - bin.width * 1.5

# Build some data frames representing 1 degree of extra EKE data in each direction, with the arbitrary value as EKE
# (note that min and max here temporarily mask the base functions of the same name)
min <- sapply(data.eke.sub, min, na.rm = TRUE)
max <- sapply(data.eke.sub, max, na.rm = TRUE)

seq.lat <- seq(min['lat'], max['lat'], 0.25)
seq.lon <- seq(min['lon'], max['lon'], 0.25)

west.edge <- data.frame(lon = rep(min['lon'] - 1, length(seq.lat)), lat = seq.lat, eke = arbitrary.value)
east.edge <- data.frame(lon = rep(max['lon'] + 1, length(seq.lat)), lat = seq.lat, eke = arbitrary.value)
south.edge <- data.frame(lon = seq.lon, lat = rep(min['lat'] - 1, length(seq.lon)), eke = arbitrary.value)
north.edge <- data.frame(lon = seq.lon, lat = rep(max['lat'] + 1, length(seq.lon)), eke = arbitrary.value)

# rbind the new data to the existing data and tidy up
data.eke.sub.contour <- rbind(data.eke.sub, west.edge, east.edge, north.edge, south.edge)

rm(bins, bin.width, arbitrary.value, min, max, seq.lat, seq.lon, west.edge, east.edge, north.edge, south.edge)

Now we plot

Now, with our newly padded dataset, we can complete the greyscale plot. I’ve only commented the changed lines in this plot (vs. the colour plot above).

plot.eke.contour.grey <- ggplot() +
    # This line denotes the newly padded data, as a contour statistic.
    # ..level.. indicates that the fill aesthetic is calculated by the contour statistic.
    # breaks manually define which contours we want to keep.
    stat_contour(data = data.eke.sub.contour, aes(x = lon, y = lat, z = eke, fill = ..level..), 
                 geom = "polygon", breaks = c(0, 0.02, 0.04, 0.06, 0.08)) +
    geom_polygon(data = za, aes(x = long, y = lat, group = group), 
                 colour="black", fill="white", alpha=1, size = 0.3) + 
    geom_polygon(data=ef.df, aes(x=long, y=lat, group=group), 
                 colour="black", fill="white", alpha=0, linetype="dashed", size = 0.3) +
    geom_contour(data=etopo.crop.df, aes(x=long,y=lat,z=z), breaks=c(-3000), colour="black", size = 0.3) +
    coord_map("mercator") +
    coord_fixed(xlim = c(lonmin, lonmax), ylim = c(latmin, latmax), expand = FALSE) +
    labs(x = 'Longitude (ºE)', y = 'Latitude (ºN)') +
    # Here the grey scale colours are defined for the plot and the legend
    scale_fill_gradient(low = "gray25", high = "gray95",
                        space = "Lab", 
                        guide = guide_colorbar(title="Eddy kinetic energy (cm²/s²)", title.position="right")) +
    geom_text(data=lbl, aes(x=x, y=y, label=txt), size=rel(3), colour="white") +
    thm
plot.eke.contour.grey

Region of elevated mesoscale activity, or eddy field (dashed rectangle), to the south-west of Marion Island. Mean eddy kinetic energy values for the period 2008–2010 are plotted, and the 3000 m isobaths show the series of faults cross-cutting the Southwest Indian Ridge (SWIR).

There we go! The contours look great: there aren't too many of them, they don't have distracting borders, and the 3000 m bathymetry contour is visible without being overpowering.

Saving the plot according to the publisher's requirements

For completeness, this is one approach to saving your plots:

ggsave(filename = 'plot.eke.contour.grey.png', plot = plot.eke.contour.grey, width = 174, height = 105, units="mm", dpi = 300, type="cairo-png")

I hope this helps someone 🙂