---
title: "Introduction to cmemsarco"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to cmemsarco}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

cmemsarco provides cloud-native access to Copernicus Marine Service (CMEMS) Analysis-Ready Cloud-Optimized (ARCO) Zarr datasets. The package builds a catalog of GDAL-ready data source names (DSNs), letting you go straight from URL to pixels without downloading files, listing directories, or wrangling formats and tools.

```{r}
library(cmemsarco)

# The bundled catalog
cmems_catalog_data
```

## The catalog

The catalog is built by walking the CMEMS STAC API. Each row represents one versioned dataset, with URLs to its Zarr stores in several formats:

| Column | Description |
|--------|-------------|
| `product_id` | CMEMS product identifier |
| `dataset_id` | Dataset identifier (without version) |
| `version` | 6-digit version (YYYYMM) |
| `timeChunked_url` | HTTPS URL to timeChunked.zarr |
| `geoChunked_url` | HTTPS URL to geoChunked.zarr |
| `*_gdal` | GDAL DSN using `/vsicurl/` |
| `*_gdals3` | GDAL DSN using `/vsis3/` |
| `*_s3` | S3 URI (`s3://bucket/path`) |

Use `cmems_latest()` to keep only the most recent version of each dataset, and `cmems_arco_only()` to drop datasets that have no Zarr URLs (static or native-only datasets).

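
To make "most recent version of each dataset" concrete, here is a minimal base R sketch on a toy data frame. It is not the implementation of `cmems_latest()`, and the dataset identifiers are hypothetical; it only illustrates that 6-digit YYYYMM strings sort chronologically, so the maximum per `dataset_id` is the latest version.

```{r}
# Toy catalog with hypothetical identifiers (not real CMEMS datasets)
toy <- data.frame(
  dataset_id = c("ds_a", "ds_a", "ds_b"),
  version    = c("202211", "202406", "202311"),
  stringsAsFactors = FALSE
)

# YYYYMM strings compare chronologically, so max() per dataset is the latest
latest_version <- ave(toy$version, toy$dataset_id, FUN = max)

# Keep only the rows whose version equals the latest for their dataset
toy_latest <- toy[toy$version == latest_version, ]
toy_latest
```

The package functions applied to the real catalog look like this:
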
```{r}
cmems_catalog_data |>
  cmems_arco_only() |>
  cmems_latest()
```

## Chunking strategies

CMEMS provides two Zarr stores for each dataset, optimised for different access patterns:

**timeChunked** (chunks: 1 × 720 × 512 in time × lat × lon)

- One time step per chunk in the time dimension
- Use for spatial queries: maps, regional extracts, spatial analysis
- Efficient when you need a large area at one or few time steps

**geoChunked** (chunks: 138 × 32 × 64 in time × lat × lon)

- Many time steps per chunk, small spatial footprint
- Use for time series: point extraction, temporal analysis
- Efficient when you need many time steps at one or few locations

Choosing the wrong chunking strategy means many more HTTP requests and slower performance.

## URL formats

Each Zarr store is available in four formats. Use whichever suits your tooling:

### `*_gdal`: zero configuration (recommended)

Uses GDAL's `/vsicurl/` handler, which works without any environment setup:

```{r, eval = FALSE}
dsn <- cmems_catalog_data$timeChunked_gdal[1]
#> 'ZARR:"/vsicurl/https://s3.waw3-1.cloudferro.com/mdl-arco-time-045/..."'

# Works immediately with any GDAL-based tool
#vapour::vapour_raster_info(dsn)
#terra::rast(dsn)
```

### `*_gdals3`: S3 protocol

Uses GDAL's `/vsis3/` handler, which requires calling `cmems_setup()` first to configure the AWS endpoint:

```{r}
cmems_setup()  # Sets AWS_NO_SIGN_REQUEST=YES, AWS_S3_ENDPOINT=...

dsn <- cmems_catalog_data$timeChunked_gdals3[1]
dsn
```

This may offer better performance in some cases due to S3-specific optimisations in GDAL.

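
The difference between the two GDAL DSN styles explains why only `/vsis3/` needs setup. The sketch below builds both forms by hand from a hypothetical endpoint, bucket and key; these values and the exact `/vsis3/` string layout are assumptions for illustration, and the strings stored in the catalog may differ.

```{r}
# Hypothetical endpoint, bucket and key, for illustration only
endpoint <- "s3.example.com"
bucket   <- "example-bucket"
key      <- "product/dataset_202406/timeChunked.zarr"

# /vsicurl/ style: the endpoint host is part of the DSN itself
dsn_curl <- sprintf('ZARR:"/vsicurl/https://%s/%s/%s"', endpoint, bucket, key)

# /vsis3/ style: only bucket and key appear, so GDAL has to learn the
# endpoint from configuration (the AWS_S3_ENDPOINT option)
dsn_s3 <- sprintf('ZARR:"/vsis3/%s/%s"', bucket, key)

dsn_curl
dsn_s3

# The endpoint host is absent from the /vsis3/ form
grepl(endpoint, dsn_s3, fixed = TRUE)
```

Because the `/vsis3/` form carries no host name, it only resolves once the endpoint and unsigned-request options have been set.
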
### `*_s3`: S3 URI

Standard `s3://` URIs for use with S3-aware tools:

```{r}
uri <- cmems_catalog_data$timeChunked_s3[1]
uri
```

### `*_url`: raw HTTPS

The underlying HTTPS URLs, useful if you need to construct your own access pattern:

```{r}
url <- cmems_catalog_data$timeChunked_url[1]
url
```

## Typical workflow

```{r}
library(cmemsarco)

# Find your dataset
sla <- cmems_catalog_data |>
  dplyr::filter(grepl("SEALEVEL.*NRT", product_id)) |>
  cmems_latest()

# Grab the DSN (no setup needed)
dsn <- sla$timeChunked_gdal[1]
dsn
```

## Refreshing the catalog

The bundled catalog is a snapshot. To rebuild it with the latest datasets:

```{r, eval = FALSE}
fresh <- cmems_catalog()
```

This walks the STAC API and takes a few minutes for all ~330 products.

## Why this works

The CMEMS S3 buckets don't allow LIST operations, but GDAL's Zarr driver doesn't need them. It reads `/.zmetadata` to understand the array structure, then fetches only the chunks required for your read operation. No directory listings and no full downloads, just the bytes you need.
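
To put rough numbers on "only the chunks required", the sketch below counts how many chunks a read touches under each of the two chunk shapes quoted in this vignette. The helper `chunks_touched()` and the two read sizes are hypothetical, and the count assumes the read starts on a chunk boundary, so it is a best case rather than a measurement of real requests.

```{r}
# Chunks touched by a read of `extent` cells per axis, assuming the read
# starts on a chunk boundary (best case)
chunks_touched <- function(extent, chunk) prod(ceiling(extent / chunk))

# Chunk shapes as (time, lat, lon)
time_chunked <- c(time = 1,   lat = 720, lon = 512)
geo_chunked  <- c(time = 138, lat = 32,  lon = 64)

# Hypothetical reads, in grid cells (time, lat, lon)
one_map     <- c(1, 720, 1440)  # one time step, large area
time_series <- c(1380, 1, 1)    # many time steps, single cell

chunks_touched(one_map, time_chunked)
chunks_touched(one_map, geo_chunked)
chunks_touched(time_series, time_chunked)
chunks_touched(time_series, geo_chunked)
```

Under these assumptions the map read touches 3 chunks in the timeChunked store versus 529 in the geoChunked store, while the time series touches 1380 chunks in timeChunked versus 10 in geoChunked. Each chunk is at least one HTTP request, which is why matching the store to the access pattern matters.
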