--- title: "GDAL SQL" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{GDAL SQL} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(lazysf) ``` GDAL is a complex tool and there are many intricacies to using SQL with it. We haven't explored this comprehensively so [feedback is very welcome](https://github.com/hypertidy/lazysf/issues). ## OGRSQL https://gdal.org/user/ogr_sql_dialect.html Unfortunate label. ## Real databases and file databases PG, GPKG etc https://gdal.org/user/sql_sqlite_dialect.html ## Virtual SQL shapefile, OpenFileGDB etc. ESRI "Geodatabases", known as 'OpenFileGDB' to GDAL *are not real databases*. Amazing! Sadly this means we cannot perform subquerys, which occurs with some contructs. For example, this is fine - filter where on variables and then count the number of rows. There is only one 'SELECT ...' statement. ```R ## here f is a file path to the directory containing list_2d_building_polys_hobart from TheLIST opendata lazysf(f) %>% filter(SHAPE_Area < 15) %>% tally() %>% show_query() SELECT COUNT(*) AS "n" FROM "list_2d_building_polys_hobart" WHERE ("SHAPE_Area" < 15.0) ## we predicted this woud work, even on our not-a-database geodatabase lazysf(f) %>% filter(SHAPE_Area < 15) %>% tally() %>% collect() # A tibble: 1 x 1 n 1 56 ``` However, if we do something more complex like a GROUP BY and then `slice_min()` we cannot run the query. ```R lazysf(f) %>% group_by(BUILD_TY) %>% slice_min(SHAPE_Area) %>% show_query() SELECT "BUILD_ID", "BUILD_TY", "BUILD_NAME", "BLD_PUR", "MEAN_HGT", "UFI", "CREATED_ON", "LIST_GUID", "NOM_REG_NO", "SHAPE_Length", "SHAPE_Area", "SHAPE" FROM (SELECT "BUILD_ID", "BUILD_TY", "BUILD_NAME", "BLD_PUR", "MEAN_HGT", "UFI", "CREATED_ON", "LIST_GUID", "NOM_REG_NO", "SHAPE_Length", "SHAPE_Area", "SHAPE", RANK() OVER (PARTITION BY "BUILD_TY" ORDER BY "SHAPE_Area") AS "q01" FROM "list_2d_building_polys_hobart") "q01" WHERE ("q01" <= 1) lazysf(f) %>% group_by(BUILD_TY) %>% slice_min(SHAPE_Area) Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : Query execution failed, cannot open layer. In addition: Warning message: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : ``` Even a simpler version `slice_min()` without grouping fails because it would be generating a subquery to do the work (this might change, but for now consider if you can re-imagine your query as something that will not require nested statements, or write it yourself.) ```R lazysf(f) %>% slice_min(SHAPE_Area) %>% show_query() SELECT "BUILD_ID", "BUILD_TY", "BUILD_NAME", "BLD_PUR", "MEAN_HGT", "UFI", "CREATED_ON", "LIST_GUID", "NOM_REG_NO", "SHAPE_Length", "SHAPE_Area", "SHAPE" FROM (SELECT "BUILD_ID", "BUILD_TY", "BUILD_NAME", "BLD_PUR", "MEAN_HGT", "UFI", "CREATED_ON", "LIST_GUID", "NOM_REG_NO", "SHAPE_Length", "SHAPE_Area", "SHAPE", RANK() OVER (ORDER BY "SHAPE_Area") AS "q01" FROM "list_2d_building_polys_hobart") "q01" WHERE ("q01" <= 1) ``` At worst, we can GROUP_BY and filter in the database and then sub select. Note that this kind of query works on a "real database", so consider relating your experiences to your friendly local data provider. ```{r real-gpkg} library(lazysf) library(dplyr) f <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE) #lazysf(f) %>% group_by(BIR74) %>% slice_min(PERIMETER) %>% show_query() #lazysf(f) %>% group_by(BIR74) %>% slice_min(PERIMETER) %>% select(NAME, PERIMETER, geom) %>% st_as_sf() ``` Other things are less obvious, why doesn't this work? The tally fails but filter/collect is fine ```R lazysf("C:/temp/temp.xlsx") %>% group_by(a) %>% tally() %>% show_query() SELECT "a", COUNT(*) AS "n" FROM "Encoding Time" GROUP BY "a" lazysf("C:/temp/temp.xlsx") %>% filter(d < 675868) %>% group_by(a) %>% collect() # A tibble: 31 x 6 # Groups: a [1] a .... ```