
The geospatial capabilities of Microsoft Fabric and Esri GeoAnalytics, demonstrated

It is often claimed that 80% of the data collected, stored and maintained by governments is related to geographic location. Although never proven empirically, the claim illustrates the importance of location within data. At the same time, the ever increasing amount of data imposes limitations on systems that process geospatial data: common big data compute engines were originally designed to scale for text data and need adaptations to work effectively with geospatial data – think of spatial indexes, partitioning and operators. Here I introduce and explain how to leverage Microsoft Fabric Spark compute with the integrated ESRI GeoAnalytics engine# for geospatial big data processing and analysis.

This optional analytics capability within Fabric enables the processing and analysis of vector geospatial data, where vector data refers to points, lines and polygons. It includes over 150 spatial functions to create geometries and to test for and select on spatial relationships. As it extends Spark, the GeoAnalytics functions can be called from Python, SQL or Scala. The spatial operations automatically apply spatial indexing, which makes the Spark compute engine effective for this kind of data as well. Besides the data source formats supported within Spark, it can load and save more than 10 common spatial data formats. This blog post highlights the scalable geospatial computing engine, as introduced in my earlier post on geospatial in the AI era.
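
To give a feel for the API before diving into the demo, here is a minimal sketch of calling a spatial function from the Python API. It assumes a Fabric notebook with the GeoAnalytics capability enabled; the module and function names match the snippets used later in this post.

# Minimal sketch: create a point geometry from plain coordinate columns
# (assumes a Fabric notebook with the GeoAnalytics capability enabled)
import geoanalytics_fabric
from geoanalytics_fabric.sql import functions as ST

df = spark.createDataFrame([(5.10, 52.30)], ["lon", "lat"])

# make_point builds a point geometry; srid reports its spatial reference
df = (df.select(ST.make_point(x="lon", y="lat").alias("geom"))
        .withColumn("srid", ST.srid("geom")))
df.show()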

The demonstration explained

Here I demonstrate some of these spatial functions through data manipulation and analysis steps on a large dataset. Point cloud data (a bulk of X, Y, Z values) forms a huge dataset with just a few tiles, while still covering a relatively small area. The Dutch open AHN dataset is a national digital elevation and surface model, currently in its fifth acquisition cycle, spanning nearly 30 years. Here, the data from the second, third and fourth acquisitions are used, as these have full national coverage (the fifth does not yet) and the first version had no point cloud release (only derived grid versions).

Another Dutch open dataset, the BAG building dataset, is used to illustrate spatial selection. It contains building footprints as polygons and currently holds more than 11 million buildings. To test the spatial capabilities, I use only 4 AHN tiles per AHN version – 12 tiles in this case, each tile 5 x 6.25 km. That amounts to more than 3.5 billion points in total for an area of 125 square kilometers. The selected area covers the municipality of Loppersum, which is prone to subsidence due to natural gas extraction.

The steps taken include selecting the buildings within the Loppersum area and then selecting the X, Y, Z points on the buildings' roofs. The 3 datasets are then brought together into one dataframe for further analysis: a geographically weighted regression to predict the expected height of a building from its own height history and that of the buildings directly around it. Not necessarily the most suitable analysis for actually predicting these values*, but it serves well to demonstrate the spatial processing capabilities of ESRI GeoAnalytics for Fabric. All code snippets below are also available as a notebook on GitHub.

Step 1: Read the data

Spatial data can come in many different formats. For further processing here, I settle on the GeoParquet format. The BAG building data, with building footprints and the accompanying municipal boundaries, comes in GeoParquet format. The point cloud AHN data (versions 2, 3 and 4), however, comes as LAZ files, the compressed industry standard format for point clouds. I have not found a Spark library that reads LAZ (please leave a message if you know of one), so I first converted it to TXT files separately, using LAStools+.
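
As an illustration of that pre-processing step, the sketch below drives LAStools' las2txt from Python. The tile file names are placeholders, and LAStools must be installed and on the PATH.

# Hypothetical pre-processing: convert LAZ tiles to space-delimited TXT
# with LAStools' las2txt (tile names are placeholders)
import subprocess

for tile in ["tile_1.laz", "tile_2.laz"]:
    subprocess.run(
        ["las2txt", "-i", tile, "-parse", "xyz", "-sep", "space",
         "-o", tile.replace(".laz", ".txt")],
        check=True,
    )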

# ESRI - FABRIC reference: 

# Import the required modules
import geoanalytics_fabric
from geoanalytics_fabric.sql import functions as ST
from geoanalytics_fabric import extensions

# Read ahn file from OneLake
# AHN lidar data source: 

ahn_csv_path = "Files/AHN lidar/AHN4_csv"
lidar_df = spark.read.options(delimiter=" ").csv(ahn_csv_path)
lidar_df = lidar_df.selectExpr("_c0 as X", "_c1 as Y", "_c2 as Z")

lidar_df.printSchema()
lidar_df.show(5)
lidar_df.count()

The code snippet& above provides the following results:

Now, with the spatial functions make_point and srid, the X, Y, Z columns are converted to a point geometry and set to the specific Dutch coordinate system (SRID = 28992); see the code snippet below&:

# Create point geometry from the x, y, z columns and set the spatial reference system
lidar_df = lidar_df.select(ST.make_point(x="X", y="Y", z="Z").alias("rd_point"))
lidar_df = lidar_df.select(ST.srid("rd_point", 28992).alias("rd_point")) \
                   .withColumn("srid", ST.srid("rd_point"))

lidar_df.printSchema()
lidar_df.show(5)

The building and municipal data can be read with the spark.read geoparquet extension; see the code snippet&:

# Import the required modules
from pyspark.sql.functions import col

# Read building polygon data
path_building = "Files/BAG NL/BAG_pand_202504.parquet"
df_buildings = spark.read.format("geoparquet").load(path_building)

# Read woonplaats data (=municipality)
path_woonplaats = "Files/BAG NL/BAG_woonplaats_202504.parquet"
df_woonplaats = spark.read.format("geoparquet").load(path_woonplaats)

# Filter the DataFrame where the "woonplaats" column contains the string "Loppersum"
df_loppersum = df_woonplaats.filter(col("woonplaats").contains("Loppersum"))

Step 2: Select

In the accompanying notebook, I write intermediate results to GeoParquet and read them back, to ensure each step picks up the correct data as a dataframe.

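A minimal sketch of that pattern, assuming the geoparquet format is registered for writing as well as reading by the extensions (the path is a placeholder):

# Sketch: persist an intermediate dataframe as GeoParquet and read it back
out_path = "Files/intermediate/lidar_points.parquet"
lidar_df.write.format("geoparquet").mode("overwrite").save(out_path)
lidar_df = spark.read.format("geoparquet").load(out_path)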

With all the data available as dataframes, performing the spatial selection becomes a simple step. The following code snippet& shows how the buildings within the Loppersum municipal boundary are selected, as well as a further selection of buildings that existed throughout the whole period (the point cloud AHN-2 data in this area was acquired in 2009). This results in 1196 of the currently 2492 buildings.

# Import the required tool
from geoanalytics_fabric.tools import Clip

# Clip the BAG buildings to the gemeente Loppersum boundary
df_buildings_roi = Clip().run(input_dataframe=df_buildings,
                              clip_dataframe=df_loppersum)

# Select only buildings older than the AHN data (AHN2 (Groningen) = 2009)
# and with a status in use ("Pand in gebruik"); column names follow the BAG schema
df_buildings_roi_select = df_buildings_roi.where(
    (df_buildings_roi.bouwjaar < 2009) &
    (df_buildings_roi.status == "Pand in gebruik"))

The point clouds of the three AHN versions used (2, 3 and 4) – referred to as T1, T2 and T3 respectively – are then aggregated onto the selected buildings. The AggregatePoints function calculates, in this case, statistics on the height (z-values) per roof, such as the mean and the standard deviation; see the code snippet&:

# Import the required tool
from geoanalytics_fabric.tools import AggregatePoints

# Select and aggregate lidar points from buildings within the ROI

df_ahn2_result = AggregatePoints() \
            .setPolygons(df_buildings_roi_select) \
            .addSummaryField(summary_field="T1_z", statistic="Mean", alias="T1_z_mean") \
            .addSummaryField(summary_field="T1_z", statistic="stddev", alias="T1_z_stddev") \
            .run(df_ahn2)

df_ahn3_result = AggregatePoints() \
            .setPolygons(df_buildings_roi_select) \
            .addSummaryField(summary_field="T2_z", statistic="Mean", alias="T2_z_mean") \
            .addSummaryField(summary_field="T2_z", statistic="stddev", alias="T2_z_stddev") \
            .run(df_ahn3)

df_ahn4_result = AggregatePoints() \
            .setPolygons(df_buildings_roi_select) \
            .addSummaryField(summary_field="T3_z", statistic="Mean", alias="T3_z_mean") \
            .addSummaryField(summary_field="T3_z", statistic="stddev", alias="T3_z_stddev") \
            .run(df_ahn4)

Step 3: Combine and regression

Since the GeoAnalytics geographically weighted regression (GWR) function only works on point data, the centroids of the building polygons are extracted with the centroid function. The 3 dataframes are then joined into one (see also the notebook), ready for the GWR function to run. In this case it predicts the T3 (AHN4) height based on a locally weighted regression.
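
A sketch of that preparation step is given below; the join key "identificatie" (the BAG building identifier) and the geometry column name are assumptions here – the notebook contains the exact code.

# Sketch: join the three aggregation results and extract roof centroids
# ("identificatie" as join key and "geometry" as geometry column are assumptions)
df_buildingsT123_points = (
    df_ahn2_result
    .join(df_ahn3_result.select("identificatie", "T2_z_mean", "T2_z_stddev"),
          on="identificatie")
    .join(df_ahn4_result.select("identificatie", "T3_z_mean", "T3_z_stddev"),
          on="identificatie")
    .withColumn("geometry", ST.centroid("geometry"))
)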

# Import the required modules
from geoanalytics_fabric.tools import GWR

# Run the GWR tool to predict AHN4 (T3) height values for buildings at Loppersum
resultGWR = GWR() \
            .setExplanatoryVariables("T1_z_mean", "T2_z_mean") \
            .setDependentVariable(dependent_variable="T3_z_mean") \
            .setLocalWeightingScheme(local_weighting_scheme="Bisquare") \
            .setNumNeighbors(number_of_neighbors=10) \
            .runIncludeDiagnostics(dataframe=df_buildingsT123_points)

Alongside the predicted z-values, diagnostics can be produced, which in this case give the following results. Note again that these results should not be used for real-world applications, as the data and method may not be suitable for subsidence modeling purposes – they merely illustrate the Fabric GeoAnalytics functions.

R2: 0.994
Adjusted R2: 0.981
AICc: 1509
Sigma2: 0.046
Effective DoF: 378

Step 4: Visualize the results

With the spatial plot function, results can be visualized as maps within the notebook – available with the Python API on Spark only. First, all buildings within the municipality of Loppersum are visualized.

# visualize Loppersum buildings
df_buildings.st.plot(basemap="light", geometry="geometry", edgecolor="black", alpha=0.5)

And here a visualization of the height difference between T3 (AHN4) and the T3 prediction (T3 predicted minus T3 measured).
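
The plotted dataframe is derived from the GWR result; a sketch of that derivation is shown below. The predicted-value column name and the roughly 11-year interval between the AHN2 and AHN4 acquisitions are assumptions here – check the resultGWR schema and the notebook for the exact code.

# Sketch: derive the plotted subsidence rate from the GWR output
# (the "PREDICTED" column name and the ~11-year AHN2-AHN4 interval are assumptions)
from pyspark.sql import functions as F

years_between = 11  # assumed interval between AHN2 (2009) and AHN4 acquisitions
df_with_difference = resultGWR.withColumn(
    "subsidence_mm_per_yr",
    (F.col("PREDICTED") - F.col("T3_z_mean")) * 1000 / years_between)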

# Visualize the difference between predicted and measured height for the Loppersum area and buildings

axes = df_loppersum.st.plot(basemap="light", edgecolor="black", figsize=(7, 7), alpha=0)
axes.set(xlim=(244800, 246500), ylim=(594000, 595500))
df_buildings.st.plot(ax=axes, basemap="light", alpha=0.5, edgecolor="black") #, color='xkcd:sea blue'
df_with_difference.st.plot(ax=axes, basemap="light", cmap_values="subsidence_mm_per_yr", cmap="coolwarm_r", vmin=-10, vmax=10, geometry="geometry")

Summary

This blog post discussed the importance of geospatial data and the challenges the growing amount of data poses to systems processing it: traditional big data engines must be adapted to handle geospatial data efficiently. It introduced Microsoft Fabric Spark compute with the integrated ESRI GeoAnalytics engine, and demonstrated how to use it for geospatial big data processing and analysis.

The opinions expressed here are my own.

Footnotes

# In preview

* Land subsidence can be modeled with higher accuracy and temporal frequency; other methods and data can be used for that, such as the satellite-based InSAR method (see also the Bodemdalingskaart)

+ LAStools was used standalone here; it would be interesting to test Fabric User Data Functions (preview) or Azure Functions for this purpose

& Code snippets are written for readability, not necessarily for efficiency. Multiple data processing steps could be chained.

References

GitHub repo with notebook: Delange/Fabric_geoanalytics

Microsoft Fabric: Microsoft Fabric documentation – Microsoft Fabric | Microsoft Learn

ESRI GeoAnalytics for Fabric: Overview | ArcGIS GeoAnalytics for Microsoft Fabric | ArcGIS Developers

AHN: Home | AHN

BAG: Over BAG – Basisregistratie Adressen en Gebouwen – Kadaster.nl Zakelijk

LAStools: convert, filter, view, process and compress LiDAR data in LAS and LAZ formats

Surface and object motion map: Bodemdalingskaart –
