R and the NetCDF library

The ncdump package was an early attempt at a tidyverse-inspired NetCDF package for R.

The key idea is to integrate interactive exploration of what is in the source with lazy-specification of subset requests - so that a user or developer gets helpers that show:

  • the data available in variables, and how they relate to each other structurally
  • the structural array axes available for “slicing”
  • the impact of coordinate- or index-based filter-expressions using dplyr idioms
  • the indexes that the raw API understands for a given slice.
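
As a rough illustration of the last two points, the sketch below turns a coordinate-based filter into the 1-based start/count pair that raw readers such as ncdf4::ncvar_get() take. The axis values and the helper name axis_slice are made up for illustration, and this is not how tidync itself is implemented.

```r
# Hypothetical axis coordinates for a small lon/lat grid (not from a real file)
lon <- seq(100, 160, by = 10)   # 7 values
lat <- seq(-60, -20, by = 10)   # 5 values

# Turn a logical filter on one axis into a 1-based start/count pair.
# Assumes the selected coordinates form a single contiguous run.
axis_slice <- function(keep) {
  idx <- which(keep)
  if (length(idx) == 0L) stop("filter selects no coordinates")
  if (any(diff(idx) != 1L)) stop("filter is not a contiguous run")
  c(start = idx[1L], count = length(idx))
}

lon_slice <- axis_slice(lon >= 120 & lon <= 150)
lat_slice <- axis_slice(lat > -45)

# One entry per axis, in the order the variable declares its dimensions
start <- c(lon_slice[["start"]], lat_slice[["start"]])
count <- c(lon_slice[["count"]], lat_slice[["count"]])
```

Here the filter is expressed on coordinates, while start and count are what would eventually be handed to the underlying library.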

This would allow any data read to be “lazy”: delayed until the last moment, when the output form is chosen (long-form data frame, raw array, bespoke format such as raster, image, streaming to another service on demand, etc.).
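
A minimal sketch of that idea in base R, with an in-memory array standing in for the source: the request only records start/count, and nothing is read until an output form is asked for. The names lazy_slice and read_as are hypothetical and exist only for this example.

```r
# Stand-in for a variable on disk: a 4 x 3 array (hypothetical data)
source_array <- array(seq_len(12), dim = c(4, 3))

# Record a slice request without reading anything
lazy_slice <- function(start, count) {
  structure(list(start = start, count = count), class = "lazy_slice")
}

# The read happens only here, once the output form is known
read_as <- function(x, form = c("array", "data_frame")) {
  form <- match.arg(form)
  rows <- seq(x$start[1], length.out = x$count[1])
  cols <- seq(x$start[2], length.out = x$count[2])
  values <- source_array[rows, cols, drop = FALSE]
  if (form == "array") return(values)
  # Long form: one row per cell, with its index on each axis
  data.frame(
    row = rep(rows, times = length(cols)),
    col = rep(cols, each = length(rows)),
    value = as.vector(values)
  )
}

req  <- lazy_slice(start = c(2, 1), count = c(2, 2))
arr  <- read_as(req, "array")
long <- read_as(req, "data_frame")
```

The same request object serves both output forms, which is the point of deferring the read.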

https://github.com/hypertidy/tidync

This work needed a systematic “metadata-extraction” language. Currently ncmeta and tidync are the core of that, wrapping ncdf4 and RNetCDF, with exploratory wrappers around rhdf5 and rgdal for other cases.

Some poor choices were made in an early version of “ncdump” on CRAN (chiefly that it used the class name “NetCDF”, which RNetCDF already uses), so the current direction involves streamlining ncdump with ncmeta and then getting the tidync package onto CRAN. The partial visibility of groups had also obscured what these packages were enabling, and the insights required to transcend the format details in specific cases.

Why so many NetCDF packages?

Support in R for NetCDF is piecemeal and fragmented. The following sections describe the various facilities of this format and the patchy support for them in various R packages.

NetCDF had a very large, breaking-change update in the move from version 3 to version 4.

NetCDF classic (version 3)

The “original” format of NetCDF was pretty straightforward. A source could have variables, dimensions and attributes. This is well supported by RNetCDF and ncdf4 on CRAN, both of which are provided for multiple architectures (Windows and MacOS). This was also supported by ncdf, but that was superseded by ncdf4 (by the same author) and ncdf is now removed from CRAN (end of 2015).

When ncdf was removed from CRAN, the raster package was also updated to remove its references to that package. It had previously used ncdf4 in preference, deferring to ncdf when required, i.e. on Windows.

The rgdal package can include the NetCDF library as a driver, but no CRAN build has ever done so. Unlike raster, GDAL's use of the NetCDF library is independent of these other R packages, and users are expected to build it in if it’s required (as is true for many other drivers).

The relationship between raster and rgdal is a little complex. raster has its own interpretation of these sources that uses ncdf4 directly, but if its own support fails on a read it will fall back and see whether rgdal can read the source. The user cannot ask raster to go via rgdal without masking the visibility of the ncdf4 package. The model interpretations provided by raster and rgdal are analogous, but different and independent. They may “fail to support” a given source for the same broad reason, but the details can be very different.

NetCDF modern (version 4)

This was a complex update to NetCDF, essentially a re-engineering of the library on top of HDF5. It enabled a number of new facilities:

  • groups (hierarchical structure within a source, like a file system - directories of variables)
  • internal compression
  • “chunking” (i.e. multi-dimensional tiling, the layout on disk of the values relative to the logical layout of the array)
  • compound types (struct-like custom data types, commonly used to approximate “tables” i.e. sets of same-length-different-type 1-D arrays)
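
To make the chunking point concrete, here is a small sketch that counts how many chunks a slice touches on each axis. The array shape, chunk shape, and the helper name chunks_touched are assumptions for illustration only. Real chunk layouts are reported by the library for each variable.

```r
# Which chunks does a slice touch? start/count are 1-based, one entry per axis.
chunks_touched <- function(start, count, chunk_dim) {
  first <- (start - 1) %/% chunk_dim + 1          # first chunk index per axis
  last  <- (start + count - 2) %/% chunk_dim + 1  # last chunk index per axis
  rbind(first = first, last = last, n = last - first + 1)
}

# Hypothetical 1000 x 500 x 365 variable stored in 100 x 100 x 1 chunks
touched <- chunks_touched(start = c(150, 1, 10),
                          count = c(200, 50, 30),
                          chunk_dim = c(100, 100, 1))

# Total number of chunks that must be read (and decompressed) for the slice
n_chunks <- prod(touched["n", ])
```

A slice that is small in logical terms can still touch many chunks if it runs against the grain of the on-disk layout, which is why the layout matters for read performance.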

The ncdf4 package in its original form supported all of these new features except for compound types, and it also supports the classic “version 3” forms.

Both raster and rgdal support NetCDF version 4 in all of the cases above apart from compound types. The specification of a source within groups is quite specific though, and how these packages handle groups has had little exercise. Neither supports “non-regular”, non-affine-based georeferencing: both rely on the rectilinear-axes-coordinate model used by NetCDF being degenerate-rectilinear. Again, the heuristics applied are different for different sources, so this is a complex area to summarize.

The rhdf5 package supports NetCDF version 4 including compound types. Specifically, it has a straightforward way to read these as data frames when it makes sense to do so. There’s no limit on what NetCDF version 4 can be read, but the interpretation is very much lower-level than either raster or rgdal. This package is on Bioconductor, so it is obscure to the normal CRAN user, but it is supported cross-platform. rhdf5 cannot read the classic NetCDF version 3 format.
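
Since which package can open a file depends on the container format, it can help to check the file signature first. The sketch below reads the leading bytes with base R only. The function name nc_container is hypothetical, and it assumes the HDF5 signature sits at byte 0 (HDF5 also permits a user block before it, which this does not handle).

```r
# Classify a file by its leading bytes (base R only, no NetCDF library needed)
nc_container <- function(path) {
  sig <- readBin(path, what = "raw", n = 8L)
  hdf5_magic <- as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a))
  if (length(sig) == 8L && identical(sig, hdf5_magic)) {
    return("HDF5 (NetCDF-4 or plain HDF5)")
  }
  # Classic files start with "CDF" followed by a one-byte variant number
  if (length(sig) >= 4L && identical(sig[1:3], charToRaw("CDF"))) {
    return(switch(as.character(as.integer(sig[4])),
                  "1" = "classic",
                  "2" = "classic, 64-bit offset",
                  "5" = "classic, 64-bit data (CDF-5)",
                  "unknown CDF variant"))
  }
  "not recognised"
}

# Demonstration on two tiny fake files that contain only a signature
f3 <- tempfile(fileext = ".nc")
writeBin(c(charToRaw("CDF"), as.raw(1)), f3)
f4 <- tempfile(fileext = ".nc")
writeBin(as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a)), f4)

kind3 <- nc_container(f3)
kind4 <- nc_container(f4)
```

A file reported as classic would need ncdf4 or RNetCDF, while an HDF5-based one could also be tried with rhdf5.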

DODS, OpenDAP, Thredds

(DODS is the old system, sequentially replaced by OpenDAP and now Thredds - these are synonymous terms as far as I know, but “DODS” is the name of the GDAL driver, for raster and vector sources).

The NetCDF driver can be OpenDAP-aware. The missing OpenDAP support on Windows / MacOS is a lower-level shared-library issue that also affects the Windows builds of the ncdf4 and RNetCDF packages.

GDAL has an independent driver DODS, but NetCDF itself can also be DODS/OpenDAP capable. Similar overlap occurs with NetCDF(4) and HDF5, and you can see conflicts with raw HTTP sources and these DODS/OpenDAP/Thredds sources because the “same syntax” triggers driver-choice on connect. All driver conflicts within a given GDAL build can be resolved by prepending the driver identifier to the data source string, as far as I know.
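
As one example of the prefix form, GDAL's NetCDF driver accepts data source strings of the shape NETCDF:"file":variable. The sketch below only builds such a string. The helper name gdal_netcdf_dsn and the file and variable names are hypothetical, and whether the string opens successfully depends on the drivers in the GDAL build at hand.

```r
# Build a driver-prefixed data source string for GDAL's NetCDF driver
gdal_netcdf_dsn <- function(path, variable) {
  sprintf('NETCDF:"%s":%s', path, variable)
}

# Hypothetical file and variable names
dsn <- gdal_netcdf_dsn("sst.nc", "sst")
```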

Both RNetCDF and ncdf4 support these server systems when the NetCDF library is built with that support (so usually only for Linux users who can install the requirements). NetCDF can be installed from source and configured with these options, or installed from distros - essentially the unstable-ubuntu-gis stack + libnetcdf-dev is the simplest way.

Groups are partly why this is so confusing

Groups are a way to add an extra level to collections of variables within a single data source. A “group” effectively allows a file to contain more than one file, where “file” corresponds to an available set of dimensions.
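
The file-system analogy can be sketched with a nested list standing in for a source with groups, and a recursive walk that lists every variable by its full path. The structure and all names here are invented for illustration and do not reflect how ncdf4, RNetCDF or rhdf5 represent groups.

```r
# Hypothetical source: the root group holds one variable and two child groups
src <- list(
  vars = c("time_bounds"),
  groups = list(
    ocean = list(vars = c("sst", "chlor_a"), groups = list()),
    atmos = list(vars = c("wind_u"),
                 groups = list(upper = list(vars = "ozone", groups = list())))
  )
)

# Walk the hierarchy and return variable names as file-system-like paths
list_vars <- function(group, path = "") {
  own <- if (length(group$vars)) paste0(path, "/", group$vars) else character()
  child <- unlist(lapply(names(group$groups), function(nm) {
    list_vars(group$groups[[nm]], paste0(path, "/", nm))
  }))
  c(own, child)
}

all_vars <- list_vars(src)
```

A reader that silently skips a group would simply return a shorter listing, which is why the omission is easy to miss.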

Both RNetCDF and ncdf4 support groups, but neither will list the contents of any group that contains compound types. Groups with those types are silently ignored, so we don’t notice at the R level unless they are the only type in the file, in which case we notice because it simply fails to work at all.

NOTE: I am referring to the current CRAN version of RNetCDF 1.8-2 - the development version on R-forge already has new support for version 4.0 and groups.

http://r-forge.r-project.org/R/?group_id=2008

Supporting groups in full requires a re-write, a super-package to transcend ncdf4, RNetCDF and rhdf5 - wrappers at the R level could drive these for a virtual super-package, but it’s complicated by the cross-platform problems.

Ultimately groups provide a nice analogy for dealing with sets of files, which is a standard model for long-running observations or large model output with long temporal axes. Dealing with this level of hierarchy will enable a true abstraction over these file system artefacts and provide a proper virtual array with database-like support.
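
A minimal sketch of the bookkeeping such a virtual array needs: mapping a position on one long, logical time axis to a file and a local index within it. The file names, step counts and the helper name locate_step are hypothetical.

```r
# Hypothetical monthly files, each holding one month of daily time steps
files <- c("model_2001-01.nc", "model_2001-02.nc", "model_2001-03.nc")
steps_per_file <- c(31, 28, 31)

# Map global time-step indexes to (file, local index) pairs
locate_step <- function(i, steps_per_file) {
  ends <- cumsum(steps_per_file)
  starts <- ends - steps_per_file + 1
  file_idx <- findInterval(i, starts)   # which file each step falls in
  data.frame(global = i, file = file_idx, local = i - starts[file_idx] + 1)
}

loc <- locate_step(c(1, 31, 32, 60), steps_per_file)
loc$file_name <- files[loc$file]
```

With this mapping in place, a filter on the logical time axis can be split into per-file start/count requests without the user ever seeing the file boundaries.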