library(lazysf)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, unionlazysf translates dplyr verbs into SQL and executes them via GDAL. The SQL dialect determines what’s possible — and lazysf defaults to SQLITE, which is the most capable option.
GDAL provides two SQL dialects for querying vector data:
GROUP BY, ORDER BY,
CAST, and spatial SQL functions. Available for any
format — GDAL runs SQLite in-process even for Shapefiles.%LIKE%, but no subquery
support.The SQLITE dialect is what makes lazysf work well with dbplyr. Most dplyr verbs generate subqueries, and only SQLITE handles those correctly.
f <- system.file("extdata/nc.gpkg", package = "lazysf", mustWork = TRUE)
lf <- lazysf(f)
## filter + select + arrange
lf |>
filter(AREA < 0.1) |>
select(NAME, AREA, geom) |>
arrange(AREA) |>
collect()
#> # A tibble: 34 × 4
#> FID NAME AREA geom
#> <int64> <chr> <dbl> <wk_wkb>
#> 1 0 New Hanover 0.042 <MULTIPOLYGON (((-77.96073 34.18924, -77.96587 34.…
#> 2 1 Chowan 0.044 <MULTIPOLYGON (((-76.68874 36.29452, -76.64822 36.…
#> 3 2 Clay 0.051 <MULTIPOLYGON (((-83.938 34.98939, -83.98855 34.98…
#> 4 3 Pasquotank 0.053 <MULTIPOLYGON (((-76.29893 36.21423, -76.32423 36.…
#> 5 4 Mitchell 0.059 <MULTIPOLYGON (((-82.11885 35.81853, -82.14665 35.…
#> 6 5 Polk 0.06 <MULTIPOLYGON (((-82.21017 35.19313, -82.27833 35.…
#> 7 6 Alleghany 0.061 <MULTIPOLYGON (((-81.23989 36.36536, -81.24069 36.…
#> 8 7 Camden 0.062 <MULTIPOLYGON (((-76.00897 36.3196, -75.95718 36.1…
#> 9 8 Perquimans 0.063 <MULTIPOLYGON (((-76.48053 36.07979, -76.53696 36.…
#> 10 9 Avery 0.064 <MULTIPOLYGON (((-81.94135 35.95498, -81.9614 35.9…
#> # ℹ 24 more rowsAggregation works:
lazysf provides SQL translations for spatial functions. These
translate R-style st_*() names to their SQL equivalents
(e.g. st_area() → ST_Area()).
These work on any system with GDAL’s SQLITE dialect:
## Measurements and accessors
lf |>
mutate(
area = st_area(geom),
srid = st_srid(geom)
) |>
head(5) |>
collect()Functions available without SpatiaLite include
st_area(), st_srid(), st_minx(),
st_maxx(), st_miny(),
st_maxy().
With SpatiaLite linked, the full set of spatial SQL becomes available:
## Spatial predicates
lf |>
filter(st_intersects(geom, st_geomfromtext("POLYGON((-80 35, -79 35, -79 36, -80 36, -80 35))"))) |>
collect()
## Geometry operations
lf |>
mutate(
wkt = st_astext(geom),
centroid = st_centroid(geom),
buffered = st_buffer(geom, 0.01)
) |>
collect()
## Coordinate transformation
lf |>
mutate(geom_3857 = st_transform(geom, 3857)) |>
collect()To check if SpatiaLite is available in your GDAL build:
OGRSQL is simpler but has format-specific features. Set it explicitly:
lf_ogr <- lazysf(f, dialect = "OGRSQL")
## Basic filter/select work
lf_ogr |>
filter(AREA < 0.1) |>
select(NAME, AREA) |>
collect()OGRSQL limitations with dbplyr:
rename(), transmute(), or chained verbs that
dbplyr wraps in subqueries will fail.GROUP BY with subqueries: Simple
aggregations may work, complex ones won’t.Use OGRSQL when you need format-specific features or are working with a data source where SQLITE dialect isn’t available.
You can pass raw SQL directly:
lazysf(f, query = "SELECT NAME, AREA, geom FROM nc WHERE AREA < 0.1 ORDER BY AREA") |>
collect()
#> # A tibble: 34 × 4
#> FID NAME AREA geom
#> <int64> <chr> <dbl> <wk_wkb>
#> 1 0 New Hanover 0.042 <MULTIPOLYGON (((-77.96073 34.18924, -77.96587 34.…
#> 2 1 Chowan 0.044 <MULTIPOLYGON (((-76.68874 36.29452, -76.64822 36.…
#> 3 2 Clay 0.051 <MULTIPOLYGON (((-83.938 34.98939, -83.98855 34.98…
#> 4 3 Pasquotank 0.053 <MULTIPOLYGON (((-76.29893 36.21423, -76.32423 36.…
#> 5 4 Mitchell 0.059 <MULTIPOLYGON (((-82.11885 35.81853, -82.14665 35.…
#> 6 5 Polk 0.06 <MULTIPOLYGON (((-82.21017 35.19313, -82.27833 35.…
#> 7 6 Alleghany 0.061 <MULTIPOLYGON (((-81.23989 36.36536, -81.24069 36.…
#> 8 7 Camden 0.062 <MULTIPOLYGON (((-76.00897 36.3196, -75.95718 36.1…
#> 9 8 Perquimans 0.063 <MULTIPOLYGON (((-76.48053 36.07979, -76.53696 36.…
#> 10 9 Avery 0.064 <MULTIPOLYGON (((-81.94135 35.95498, -81.9614 35.9…
#> # ℹ 24 more rowsA few things to know when working with GDAL’s SQL engine:
FID is sticky. GDAL always includes the feature ID in results, like sf’s sticky geometry column. Your SQL might select two columns but you’ll get three back.
Table-qualified wildcards. When dbplyr generates
SELECT "nc".*, GDAL returns column names prefixed with the
table name (nc.AREA, nc.NAME). lazysf strips
these prefixes automatically.
No window functions. RANK(),
ROW_NUMBER(), DENSE_RANK() etc. are not
available. This means slice_min(),
slice_max(), and slice_head() won’t work. Use
arrange() |> head() instead.
For large datasets, the Arrow C stream interface provides columnar
data transfer from GDAL, which can be significantly faster than the
default row-based fetch():
This uses GDALVector$getArrowStream() via nanoarrow.
Data moves from GDAL to R in batches of up to 65,536 features without
going through the OGRFeature abstraction. Requires GDAL >= 3.6 and
the nanoarrow package.
The Arrow path produces identical results to the default path — same column names, same geometry handling, same wk type marking.
lazysf can return geometry in several formats:
## WKB (default) — compact binary, good for sf conversion
lazysf(f, geom_format = "WKB")
## WKT — human-readable text
lazysf(f, geom_format = "WKT")
## BBOX — bounding box per feature (fast, no full geometry)
lazysf(f, geom_format = "BBOX")
## NONE — no geometry at all (fastest for attribute-only queries)
lazysf(f, geom_format = "NONE")All geometry columns are marked with wk types (wk::wkb,
wk::wkt, wk::rct) so they interoperate with
the wk ecosystem and convert cleanly to sf via
sf::st_as_sf().
lazysf respects these options for defaults:
lazysf.dialect: default SQL dialect (default:
"SQLITE")lazysf.geom_format: default geometry format (default:
"WKB")lazysf.use_arrow: default Arrow stream setting
(default: FALSE)lazysf.query.debug: if TRUE, print SQL and
row counts for every query