A sparklyr extension for analyzing geospatial data

sparklyr.sedona is now available as the sparklyr-based R interface for Apache Sedona.

To install sparklyr.sedona from GitHub using the remotes package, run

remotes::install_github(repo = "apache/incubator-sedona", subdir = "R/sparklyr.sedona")

In this blog post, we will provide a quick introduction to sparklyr.sedona, outlining the motivation behind this sparklyr extension, and presenting some example sparklyr.sedona use cases involving Spark spatial RDDs, Spark dataframes, and visualizations.

Motivation for sparklyr.sedona

A suggestion from the mlverse survey results earlier this year mentioned the need for up-to-date R interfaces for Spark-based GIS frameworks. While looking into this suggestion, we learned about Apache Sedona, a geospatial data system powered by Spark that is modern, efficient, and easy to use. We also realized that while our friends from the Spark open-source community had developed a sparklyr extension for GeoSpark, the predecessor of Apache Sedona, there was no similar extension making more recent Sedona functionalities easily accessible from R yet. We therefore decided to work on sparklyr.sedona, which aims to bridge the gap between Sedona and R.

The lay of the land

We hope you are ready for a quick tour through some of the RDD-based and Spark-dataframe-based functionalities in sparklyr.sedona, and also some bedazzling visualizations derived from geospatial data in Spark.

In Apache Sedona, Spatial Resilient Distributed Datasets (SRDDs) are basic building blocks of distributed spatial data encapsulating “vanilla” RDDs of geometrical objects and indexes. SRDDs support low-level operations such as Coordinate Reference System (CRS) transformations, spatial partitioning, and spatial indexing. For example, with sparklyr.sedona, SRDD-based operations we can perform include the following:

  • Importing some external data source into an SRDD:

library(sparklyr)
library(sparklyr.sedona)

sedona_git_repo <- normalizePath("~/incubator-sedona")
data_dir <- file.path(sedona_git_repo, "core", "src", "test", "resources")

sc <- spark_connect(master = "local")

pt_rdd <- sedona_read_dsv_to_typed_rdd(
  sc,
  location = file.path(data_dir, "arealm.csv"),
  type = "point"
)
  • Applying spatial partitioning to all data points:

sedona_apply_spatial_partitioner(pt_rdd, partitioner = "kdbtree")
  • Building a spatial index on each partition:

sedona_build_index(pt_rdd, type = "quadtree")
  • Joining one spatial data set with another using “contain” or “overlap” as the join predicate:

polygon_rdd <- sedona_read_dsv_to_typed_rdd(
  sc,
  location = file.path(data_dir, "primaryroads-polygon.csv"),
  type = "polygon"
)

pts_per_region_rdd <- sedona_spatial_join_count_by_key(
  pt_rdd,
  polygon_rdd,
  join_type = "contain",
  partitioner = "kdbtree"
)

It is worth mentioning that sedona_spatial_join() will perform spatial partitioning and indexing on the inputs using the partitioner and index_type only if the inputs are not partitioned or indexed as specified already.

From the examples above, one can see that SRDDs are great for spatial operations requiring fine-grained control, e.g., for ensuring a spatial join query is executed as efficiently as possible with the right types of spatial partitioning and indexing.
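To make concrete what a count-by-key join with the “contain” predicate computes, here is a minimal base R sketch on made-up data. It is a brute-force illustration only, not how Sedona implements the join: the points, the regions, and the restriction to axis-aligned rectangles are all assumptions chosen to keep the example short.

```r
# Toy data (hypothetical): five points and two axis-aligned rectangular regions.
pts <- data.frame(x = c(1, 2, 6, 7, 9), y = c(1, 3, 6, 7, 1))
regions <- data.frame(
  id   = c("A", "B"),
  xmin = c(0, 5), xmax = c(4, 8),
  ymin = c(0, 5), ymax = c(4, 8)
)

# Brute force: test every point against every region and count the hits.
pts_per_region <- vapply(
  seq_len(nrow(regions)),
  function(i) {
    r <- regions[i, ]
    sum(pts$x >= r$xmin & pts$x <= r$xmax &
        pts$y >= r$ymin & pts$y <= r$ymax)
  },
  integer(1)
)
names(pts_per_region) <- regions$id

print(pts_per_region)
```

Every point is compared with every region here, so the cost grows with the product of the two input sizes. Spatial partitioning and indexing exist to avoid exactly that when the inputs are large.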

Finally, we can try visualizing the join result above, using a choropleth map:

sedona_render_choropleth_map(
  pts_per_region_rdd,
  resolution_x = 1000,
  resolution_y = 600,
  output_location = tempfile("choropleth-map-"),
  boundary = c(-126.790180, -64.630926, 24.863836, 50.000),
  base_color = c(63, 127, 255)
)

which gives us the following:

Wait, but something seems amiss. To make the visualization above look nicer, we can overlay it with the contour of each polygonal region:

contours <- sedona_render_scatter_plot(
  polygon_rdd,
  resolution_x = 1000,
  resolution_y = 600,
  output_location = tempfile("scatter-plot-"),
  boundary = c(-126.790180, -64.630926, 24.863836, 50.000),
  base_color = c(255, 0, 0),
  browse = FALSE
)

sedona_render_choropleth_map(
  pts_per_region_rdd,
  resolution_x = 1000,
  resolution_y = 600,
  output_location = tempfile("choropleth-map-"),
  boundary = c(-126.790180, -64.630926, 24.863836, 50.000),
  base_color = c(63, 127, 255),
  overlay = contours
)

which gives us the following:

With some low-level spatial operations taken care of using the SRDD API and the right spatial partitioning and indexing data structures, we can then import the results from SRDDs to Spark dataframes. When working with spatial objects within Spark dataframes, we can write high-level, declarative queries on these objects using dplyr verbs in conjunction with Sedona spatial UDFs. For example, the following query tells us whether each of the 8 nearest polygons to the query point contains that point, and also the convex hull of each polygon.

tbl <- DBI::dbGetQuery(
  sc, "SELECT ST_GeomFromText('POINT(-66.3 18)') AS `pt`"
)
pt <- tbl$pt[[1]]
knn_rdd <- sedona_knn_query(
  polygon_rdd, x = pt, k = 8, index_type = "rtree"
)

knn_sdf <- knn_rdd %>%
  sdf_register() %>%
  dplyr::mutate(
    contains_pt = ST_contains(geometry, ST_Point(-66.3, 18)),
    convex_hull = ST_ConvexHull(geometry)
  )

knn_sdf %>% print()
# Source: spark<?> [?? x 3]
  geometry                         contains_pt convex_hull
  <list>                           <lgl>       <list>
1 <POLYGON ((-66.335674 17.986328… TRUE        <POLYGON ((-66.335674 17.986328,…
2 <POLYGON ((-66.335432 17.986626… TRUE        <POLYGON ((-66.335432 17.986626,…
3 <POLYGON ((-66.335432 17.986626… TRUE        <POLYGON ((-66.335432 17.986626,…
4 <POLYGON ((-66.335674 17.986328… TRUE        <POLYGON ((-66.335674 17.986328,…
5 <POLYGON ((-66.242489 17.988637… FALSE       <POLYGON ((-66.242489 17.988637,…
6 <POLYGON ((-66.242489 17.988637… FALSE       <POLYGON ((-66.242489 17.988637,…
7 <POLYGON ((-66.24221 17.988799,… FALSE       <POLYGON ((-66.24221 17.988799, …
8 <POLYGON ((-66.24221 17.988799,… FALSE       <POLYGON ((-66.24221 17.988799, …
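As a rough intuition for what a convex hull is, the following base R sketch computes one for a small, made-up concave polygon using grDevices::chull(). It is only a local analogue on hypothetical coordinates and says nothing about how ST_ConvexHull is implemented in Sedona.

```r
# Vertices of a hypothetical concave polygon: a square with a notch at (2, 2).
xy <- cbind(
  x = c(0, 4, 4, 2, 0),
  y = c(0, 0, 4, 2, 4)
)

# chull() returns the row indices of the vertices lying on the convex hull.
hull_idx <- grDevices::chull(xy)

# The notch vertex (row 4) lies strictly inside the hull, so it is dropped.
hull <- xy[hull_idx, ]
print(hull)
```

The hull keeps the four outer corners and discards the inward-pointing vertex, so it is the smallest convex shape enclosing the original polygon.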

Acknowledgements

The author of this blog post would like to thank Jia Yu, the creator of Apache Sedona, and Lorenz Walthert for their suggestion to contribute sparklyr.sedona to the upstream incubator-sedona repository. Jia has provided extensive code-review feedback to ensure sparklyr.sedona complies with coding standards and best practices of the Apache Sedona project, and has also been very helpful in the instrumentation of CI workflows verifying sparklyr.sedona works as expected with snapshot versions of Sedona libraries from development branches.

The author is also grateful to his colleague Sigrid Keydana for valuable editorial suggestions on this blog post.

That’s all. Thank you for reading!

Photo by NASA on Unsplash
