
Random Forest ML algorithm utilized for Geospatial Analysis

  • Writer: Arpit Shah
  • Nov 8, 2020
  • 7 min read

Updated: 5 days ago


Yes, Machine Learning can be applied to geospatial datasets. In fact, ML techniques are extremely effective in workflows involving classification, clustering, prediction, and large-scale pattern detection.


Among the most popular supervised ML techniques is Random Forest, developed by Leo Breiman and Adele Cutler. As an ensemble learning method, Random Forest builds numerous decision trees and combines their outputs to reach fast, accurate, and robust classifications.


The workflow begins by training the algorithm on a pre-classified sample dataset. Once trained, the model is exposed to unknown data: it passes each pixel or feature through its decision trees, using the explanatory variables it has learned from, and generates a predicted class. The quality, representativeness, and randomness of the training samples directly influence the accuracy of the final classification (in a manner of speaking, a form of scientific guesswork at computing speed). Refer to this time-stamped portion of the video.
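
To make the train-then-predict idea concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the implementation used by SNAP or ArcGIS Pro: each "tree" is a single-split stump trained on a bootstrap resample and one randomly chosen variable, and the forest predicts by majority vote. The function names and the two-band backscatter values are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue  # a split must put samples on both sides
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # The feature is constant in this resample: always predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap resample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random explanatory variable
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Each stump votes for a class; the most common vote wins."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical training samples: (band 1, band 2) backscatter values in dB.
training_rows = [
    (-8.1, -14.2), (-7.5, -13.8), (-8.9, -15.0), (-7.9, -14.5),    # vegetation
    (-13.2, -21.5), (-12.8, -22.3), (-14.1, -20.9), (-13.5, -21.8),  # barren
]
training_labels = ["vegetation"] * 4 + ["barren"] * 4

forest = fit_forest(training_rows, training_labels)
print(predict(forest, (-7.0, -13.0)))
print(predict(forest, (-14.5, -23.0)))
```

A production Random Forest grows much deeper trees and considers several variables at each split, but the three ingredients shown here (resampling, random variable choice, voting) are the same.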


Here is an excellent explainer on Esri’s Forest-based Classification and Regression Tool, which is tailored specifically for geospatial ML analysis and will be used in Workflow 3:


Video 1: Forest-based Classification & Regression Algorithm Explained. Source: Esri's Spatial Data Science MOOC

In Slider 1 below, the left image displays Synthetic Aperture Radar imagery over the Chaco forest region in northern Paraguay—an area heavily affected by deforestation.


  • The areas under blue polygons are verified as vegetation

  • The areas under yellow polygons are verified as barren / deforested land


These verified observations form the supervised training dataset for Random Forest. After learning from these samples, the algorithm is run on the full imagery scene to classify every pixel into forested (green) or deforested (grey). The classified output appears on the right.


The Slider is best viewed on PC.


Slider 1: Training Data (left) and Random Forest Classification Output (right). Derived using ESA’s SNAP software


The classification appears highly accurate based on visual interpretation (if you wish to see the actual processing steps, refer to this video tutorial).


While this demonstration is intentionally simple, it is easy to imagine the same workflow scaled up—training on many parameters, ingesting massive imagery datasets, and classifying thousands of square kilometres for deforestation analysis with remarkable speed.


And that is exactly what the next workflow builds upon.


The study area below is covered by Sentinel-2 optical imagery over Seville, Spain (2017). The multi-coloured polygons represent the verified crop types observed on the ground—tomato, wheat, corn, and several others. These labelled samples form the training dataset.


Figure 1: Training Dataset for Crop Classification. Basemap - Sentinel-2 optical imagery. Source: RUS Copernicus

This workflow is significantly more complex than the previous one:

  • It requires training the model on multiple explanatory variables.

  • The area to be classified spans thousands of agricultural parcels.

  • The imagery includes diverse spectral bands, enhancing predictive accuracy.


Once trained, the Random Forest model is executed to classify the crop type for every pixel in the dataset. Below is the output:

(if you wish to see the actual processing steps, refer to this video tutorial)


Figure 2: Random Forest Crop-Type Classification Over the Full Study Area
Figure 3: Zoomed-in View of the Classified Output

As expected, accuracy improves with more numerous, diverse, and precise training samples. Randomness in training selection reduces overfitting, as explained in Video 1 around the 01:40 mark.
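
One common way this randomness enters a Random Forest is bootstrap sampling: each tree is trained on a resample drawn with replacement, so no two trees see quite the same data. The sketch below (plain Python, hypothetical sample counts) estimates what share of the training samples a single tree actually sees, which should approach 1 - 1/e, roughly 63%.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement, as done for each tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 1000   # hypothetical number of training samples
n_trees = 200      # hypothetical number of trees

fractions = []
for _ in range(n_trees):
    seen = set(bootstrap_indices(n_samples, rng))
    fractions.append(len(seen) / n_samples)

mean_fraction = sum(fractions) / len(fractions)
print(f"average share of samples seen per tree: {mean_fraction:.3f}")
print(f"theoretical limit 1 - 1/e:              {1 - math.exp(-1):.3f}")
```

Because every tree misses about a third of the samples, no single noisy sample can dominate the whole forest, which is one reason the ensemble overfits less than a single tree.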


Both Workflow 1 and Workflow 2 were executed using ESA’s SNAP—a powerful toolset for running Random Forest on raster-based geospatial datasets. The same approach can extend to urban land-cover classification, hydrological analysis, soil-type mapping, and countless other geospatial applications.


While the previous two workflows applied the Random Forest algorithm to raster datasets, this workflow demonstrates how its vector-focused variant—the Forest-based Classification and Regression tool—can be applied to polygonal geospatial data. The objective is to evaluate the predictive strength of five demographic and behavioural parameters believed to influence voter turnout in U.S. National Elections. By understanding which variables are most potent, the model can be refined to predict the 2020 turnout more accurately.


Surveys conducted across select U.S. counties in 2019 serve as the supervised training dataset. These county-level responses—linked spatially—are used to train the machine learning model. The algorithm’s predictions are then assessed against actual voter turnout from the 2016 election, thereby validating the potency of the chosen parameters.

Credits: Esri Learn ArcGIS, Esri ArcGIS Pro


Survey responses were aggregated to the county level for the following training variables:

  1. % Population with at least High School Education

  2. Median Age of Population

  3. Per Capita Income

  4. % Population Who Own a Selfie Stick (a deliberately wacky but illustrative variable)

  5. Distance to Nearest City Class

    • Ten city classes, each representing proximity to urban settlements of increasing population sizes (10,000 up to 100,000).

    • This proxy aims to capture how distance from an urban centre influences turnout behaviour.
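
The distance variable can be sketched as follows. This is an illustration under stated assumptions, not the Esri tool's method: it assumes class k means "cities with at least k × 10,000 residents", measures great-circle (haversine) distance from a county centroid, and uses made-up cities and coordinates.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def distance_to_city_classes(centroid, cities, class_thresholds):
    """For each population threshold, distance to the nearest city at least that big."""
    lat, lon = centroid
    result = {}
    for threshold in class_thresholds:
        candidates = [c for c in cities if c["population"] >= threshold]
        result[threshold] = min(
            (haversine_km(lat, lon, c["lat"], c["lon"]) for c in candidates),
            default=None,  # no city of this size in the dataset
        )
    return result


# Hypothetical cities and county centroid.
cities = [
    {"name": "Town A", "population": 12_000, "lat": 40.10, "lon": -100.00},
    {"name": "City B", "population": 55_000, "lat": 40.50, "lon": -100.50},
    {"name": "City C", "population": 250_000, "lat": 41.50, "lon": -101.50},
]
county_centroid = (40.0, -100.0)
class_thresholds = [10_000 * k for k in range(1, 11)]  # 10,000 up to 100,000

distances = distance_to_city_classes(county_centroid, cities, class_thresholds)
for threshold, km in distances.items():
    print(f">= {threshold:>7,} residents: {km:.1f} km")
```

Each county then carries ten distance values, one per class, which the model can treat as ten explanatory variables.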

Figure 4: Actual Voter Turnout (2016) by County. Color-coded based on standard deviation from the national mean
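
For readers unfamiliar with standard-deviation colour coding, the sketch below shows the arithmetic behind a map like Figure 4. The class breaks (±0.5 and ±1.5 standard deviations) and the county turnout values are assumptions for illustration; the breaks actually used in the figure are not stated here.

```python
import statistics


def std_dev_class(value, mean, std):
    """Bucket a value by how many standard deviations it sits from the mean."""
    z = (value - mean) / std
    if z < -1.5:
        return "below -1.5 SD"
    if z < -0.5:
        return "-1.5 to -0.5 SD"
    if z <= 0.5:
        return "within 0.5 SD of mean"
    if z <= 1.5:
        return "+0.5 to +1.5 SD"
    return "above +1.5 SD"


# Hypothetical county turnout shares.
turnout = {"County A": 0.42, "County B": 0.55, "County C": 0.61,
           "County D": 0.58, "County E": 0.74}

mean_turnout = statistics.mean(turnout.values())
std_turnout = statistics.pstdev(turnout.values())

for county, share in turnout.items():
    print(county, std_dev_class(share, mean_turnout, std_turnout))
```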

Using ArcGIS Pro, the Forest-based Classification and Regression tool—run in Train and Predict mode—allows us to:

  1. Train the algorithm on 2019 survey responses.

  2. Predict voter turnout for all 3,244 counties.

  3. Validate predictions using actual 2016 results.


Figure 5: Screenshot of the Forest-based Classification and Regression Tool in ArcGIS Pro
Figure 6: Predicted Voter Turnout (%) for All 3,244 U.S. Counties (Model Output)

How did the model perform against the validation dataset (2016 Voter Turnout)?


Figure 7: Regression Diagnostics Output

The Validation Data: Regression Diagnostics output reports a Coefficient of Determination (R²) of 61.9%. In other words, the five test parameters explain roughly 62% of the variation in actual voter turnout. This is a respectable score given the complexity of voter behaviour and the small number of predictors used.
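
As a reminder of what R² measures, here is the textbook calculation in plain Python: one minus the ratio of unexplained variation to total variation. The turnout numbers are hypothetical, and the exact formula ArcGIS Pro applies to its validation data is not shown in this post, so treat this as a sketch of the standard definition.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual vs. predicted turnout (%) for five counties.
actual_turnout = [48.0, 55.0, 61.0, 66.0, 72.0]
predicted_turnout = [52.0, 54.0, 58.0, 68.0, 69.0]

print(f"R² = {r_squared(actual_turnout, predicted_turnout):.3f}")
```

An R² of 1 means the predictions match the actual values exactly, while an R² of 0 means the model does no better than predicting the average turnout everywhere.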


Which of the five parameters are the most reliable predictors of Voter Turnout?

Figure 8: Variable Importance Box Plot

The results are revealing:

  • Per Capita Income and High School Education emerge as the strongest predictors.

  • Distance to Nearest City Class has minimal explanatory power.

  • Surprisingly, owning a Selfie Stick is a better predictor of turnout than proximity to urban centres—illustrating how behavioural proxies can sometimes outperform structural ones.
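
How the tool computes variable importance is not described in this post. One model-agnostic way to build intuition is permutation importance, a different technique shown here purely for illustration: shuffle one variable at a time and measure how much the prediction error grows. The `toy_model` function and all data values below are hypothetical.

```python
import random


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(model, rows, actual, n_repeats=20, seed=0):
    """Average increase in error after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(actual, [model(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only this one column replaced.
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            error = mean_squared_error(actual, [model(r) for r in permuted])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a trained forest; it ignores distance entirely."""
    income, education, distance = row
    return 20 + 0.5 * income + 0.3 * education


# Columns: per capita income ($k), % high school education, distance to city (km).
rows = [(30, 80, 5), (45, 90, 40), (25, 70, 12), (60, 95, 3), (35, 85, 25), (50, 88, 60)]
actual = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, actual)
for name, score in zip(["income", "education", "distance"], importances):
    print(f"{name:>10}: {score:.2f}")
```

A variable the model never relies on scores zero, which mirrors the way a weak predictor sits at the bottom of an importance chart.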


How would the performance of these test parameters change if the algorithm were to make predictions at a more granular level i.e. by increasing the geographic resolution to Census Tract-level data instead of County-level?


A Census Tract represents a neighbourhood-scale division (84,414 tracts nationwide). The original survey was conducted at the individual level, meaning responses can be re-aggregated to Census Tracts rather than counties.
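
Re-aggregation simply means grouping the same individual responses by a different geographic key. The sketch below uses hypothetical field names (`county`, `tract`, `age`, `income`, `high_school`) and made-up responses to show how the training variables change when the grouping key changes.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical individual survey responses.
responses = [
    {"county": "001", "tract": "001-A", "age": 34, "income": 40_000, "high_school": True},
    {"county": "001", "tract": "001-A", "age": 52, "income": 60_000, "high_school": True},
    {"county": "001", "tract": "001-B", "age": 23, "income": 20_000, "high_school": False},
    {"county": "001", "tract": "001-B", "age": 41, "income": 30_000, "high_school": True},
    {"county": "002", "tract": "002-A", "age": 67, "income": 50_000, "high_school": True},
    {"county": "002", "tract": "002-A", "age": 29, "income": 30_000, "high_school": False},
]


def aggregate(responses, key):
    """Summarise individual responses per geographic unit ('county' or 'tract')."""
    groups = defaultdict(list)
    for response in responses:
        groups[response[key]].append(response)
    return {
        unit: {
            "pct_high_school": 100 * sum(r["high_school"] for r in rows) / len(rows),
            "median_age": median([r["age"] for r in rows]),
            "per_capita_income": mean([r["income"] for r in rows]),
            "n_responses": len(rows),
        }
        for unit, rows in groups.items()
    }


by_county = aggregate(responses, "county")
by_tract = aggregate(responses, "tract")
print(by_county["001"])
print(by_tract["001-A"], by_tract["001-B"])
```

In this toy data, county 001 averages out two quite different tracts, which is the kind of spatial averaging that finer units avoid.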

Figure 9: U.S. Census Tracts Dataset (84,414 polygons)

Before reading on, consider:


Would predictions become more accurate or less accurate at this finer spatial scale?


Figure 10: Forest-Based Tool with Census-Tract Aggregation

The algorithm is retrained using Survey data aggregated to Census Tracts.


Notice in Figure 10 that:

  • The Selfie Stick variable is omitted because its data existed only at the county level.

  • The Distance to City Class variable remains because it can be computed geospatially for each tract.

Figure 11: Predicted Turnout (Census Tract-Level)

How did the model fare against the new Validation dataset (Actual Voter Turnout data aggregated at Census Tract-level from the 2016 election)?

Figure 12: Regression Diagnostics (Census Tract-Level)

The new R² is 62.9%, an improvement of one percentage point over the county-level model (61.9%).

Did you anticipate this? Why did accuracy increase despite a dramatic rise in prediction volume?

Initially, one might expect accuracy to fall because:

  • The model must now generate >80,000 predictions (vs. 3,244 counties).

  • More predictions typically introduce more variance.


However, accuracy improves, likely because:

  • Census Tracts more precisely capture differences in education, income, age and urban proximity.

  • Training data aggregated at a finer scale becomes more directly attributable to the areas being predicted.

  • The algorithm benefits from reduced spatial averaging, enabling clearer signal extraction.


Another possibility is that the variables themselves—education, income, age, proximity—are structural indicators not highly sensitive to geographic scale, meaning predictive potency remains stable.


I know what some of you may be thinking: it would help to know whether the potency of the test parameters changes discernibly at the Census Tract-level.


Let's explore the Box Plot:

Figure 13: Variable Importance Box Plot (Census Tract-Level)

Surprisingly, the ranking barely changes from what was depicted in Figure 8.


  • High School Education and Per Capita Income remain dominant.

  • The other variables retain roughly the same influence.

  • Thus, the test parameters show remarkable stability across spatial resolutions.


There is another statistic that reveals an interesting insight, though:

Figure 14: Prediction Interval Graph Generated by the Forest-based Classification and Regression Tool

The Prediction Interval graph plots:

  • X-axis: Census Tracts (sorted by predicted turnout)

  • Y-axis: Predicted turnout percentage (with P05–P95 intervals)


A striking pattern emerges:

  • Low-turnout regions (<50%) show wide prediction intervals, indicating weaker predictive confidence

  • High-turnout regions (>50%) show narrow prediction intervals, indicating stronger predictive confidence
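
One simple way to picture a P05–P95 interval is to look at the spread of the individual trees' predictions for a single tract. The sketch below takes the 5th and 95th percentiles of such predictions. Whether the Esri tool derives its intervals in exactly this way is not stated in this post, and the per-tree values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def prediction_interval(tree_predictions, low=5, high=95):
    """Return the (P05, P95) bounds of the per-tree predictions."""
    return percentile(tree_predictions, low), percentile(tree_predictions, high)


# Hypothetical per-tree turnout predictions (%) for two Census Tracts.
high_turnout_tract = [68, 70, 71, 69, 72, 70, 71, 69, 70, 73]
low_turnout_tract = [22, 35, 41, 28, 47, 31, 39, 25, 44, 36]

for name, predictions in [("high-turnout", high_turnout_tract),
                          ("low-turnout", low_turnout_tract)]:
    p05, p95 = prediction_interval(predictions)
    print(f"{name}: P05 = {p05:.1f}, P95 = {p95:.1f}, width = {p95 - p05:.1f}")
```

When the trees broadly agree, the interval is narrow. When they disagree, as in the low-turnout example, the interval widens and the prediction deserves less trust.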


This means:

Education, Income, Age and Urban Proximity are far more reliable predictors of high voter turnout than of low turnout.

This is a valuable behavioural insight!

Some ways to improve prediction reliability:

  • Gather more survey responses from a broader set of counties

  • Add additional behavioural or socioeconomic variables

  • Remove weak predictors

  • Increase the number of validation runs

  • Allow the algorithm to generate more decision trees


Through this demonstration, it becomes evident that the Forest-based Classification and Regression algorithm can model a complex social phenomenon like voter turnout with surprising accuracy, using only a handful of explanatory variables.


Even more impressive is the algorithm’s speed—ArcGIS Pro processed thousands of geospatial features and learned meaningful relationships within minutes.


This same method can be extended to countless applications:

  • Predicting accident-prone road corridors

  • Identifying high-potential tourist zones

  • Classifying online content based on viewer behaviour

  • Forecasting retail demand

  • Mapping disease vulnerability zones

  • Detecting credit risk clusters


Which other applications come to mind? Feel free to share.

ABOUT US - OPERATIONS MAPPING SOLUTIONS FOR ORGANIZATIONS


Intelloc Mapping Services, Kolkata | Mapmyops.com offers a suite of Mapping and Analytics solutions that seamlessly integrate with Operations Planning, Design, and Audit workflows. Our capabilities include — but are not limited to — Drone Services, Location Analytics & GIS Applications, Satellite Imagery Analytics, Supply Chain Network Design, Subsurface Mapping and Wastewater Treatment. Projects are executed pan-India, delivering actionable insights and operational efficiency across sectors.


My firm's services can be split into two categories: Geographic Mapping and Operations Mapping. Our range of offerings is listed in the infographic below.

Range of solutions that Intelloc Mapping Services (Mapmyops.com) offers

A majority of our Mapping for Operations-themed workflows (50+) can be accessed from this website's landing page. We respond well to documented queries/requirements. Demonstrations/PoC can be facilitated on a paid basis. Looking forward to being of service.


Regards,

Mapmyops I Intelloc Mapping Services
