
Spatial Applications of Random Forest Algorithm

Updated: Jan 21, 2022

Machine Learning algorithms can be applied to spatial data to solve classification, clustering and prediction problems. Random Forest, a popular ensemble method within Machine Learning, is particularly well suited to classification and prediction tasks on spatial data. The technique involves training the model on labelled data and building many decision trees, whose combined output is, in general, quite accurate.
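To make the idea concrete, here is a minimal sketch of the technique using scikit-learn, with a made-up toy dataset (the features, labels and parameter values are purely illustrative):

```python
# A Random Forest is an ensemble of decision trees; each tree is
# trained on a bootstrap sample of the data, and the trees vote.
from sklearn.ensemble import RandomForestClassifier

# Toy training data: each row is a sample, each column a feature.
X_train = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]
y_train = [0, 1, 0, 0, 1, 1]  # class labels for each sample

# Fit 100 decision trees; the majority vote is the prediction.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(model.predict([[2.5, 2.5]]))  # classify an unseen sample
```

The 'training the data' step is the `fit` call; the 'decision trees' are the 100 estimators built internally, each seeing a slightly different random slice of the data.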

Below is an excellent video explaining the technique -

Source: Esri's Spatial Data Science MOOC

Liked what you've seen? See another, slightly longer (5-minute) explainer video here. It will help you relate better to the examples discussed below.


Below are a few applications of Random Forest technique on spatial data -

Mapping Deforestation

The geographic extent below (left image) shows radar imagery over a part of the Chaco forest region in northern Paraguay. Some of the darker patches (vegetation) are manually marked with blue polygons, while some of the lighter patches (barren land) are marked with yellow polygons.

The algorithm learns from this 'training data' and classifies the entire geographic extent (right image) as either forested (green) or deforested (grey). The output, as you'd observe, appears very accurate. This is a simple classification problem solved using the Random Forest algorithm.
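In code, this step boils down to: pixels under the training polygons become labelled samples, and the fitted forest then classifies every pixel in the scene. The sketch below fakes a single-band radar image with random numbers; a real workflow would read the backscatter band with a library such as rasterio, and the thresholds used here to stand in for the hand-drawn polygons are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Fake backscatter image (dB): darker = vegetation, brighter = barren.
image = np.where(rng.random((50, 50)) < 0.5,
                 rng.normal(-15, 1, (50, 50)),   # darker forest pixels
                 rng.normal(-6, 1, (50, 50)))    # brighter barren pixels

# Stand-ins for the training polygons: masks over known pixels.
forest_mask = image < -12
barren_mask = image > -9
X = np.concatenate([image[forest_mask], image[barren_mask]]).reshape(-1, 1)
y = np.concatenate([np.zeros(forest_mask.sum()),   # 0 = forested
                    np.ones(barren_mask.sum())])   # 1 = deforested

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Classify every pixel in the scene, not just the training pixels.
classified = clf.predict(image.reshape(-1, 1)).reshape(image.shape)
```

The output array maps one-to-one onto the image grid, which is exactly what the green/grey classified raster on the right represents.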



Mapping Crop Types

A slightly more complex use case of Random Forest: here, polygons have been drawn manually over optical imagery of Seville, Spain, labelling a few parcels of agricultural land by crop type (tomato, wheat, corn and so on). The Random Forest algorithm then attempts to classify each pixel in the imagery as per the polygons provided, i.e. the 'training data'. The output again appears reasonably accurate.

Needless to say, the more training data (polygons) one can input, the more accurate the output is expected to become.
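The crop-type case differs from the deforestation one mainly in that each pixel now carries several spectral band values and there are multiple classes. The sketch below fabricates four-band "spectral signatures" per crop (Sentinel-2 actually has 13 bands; every number here is a synthetic placeholder for pixels sampled from inside the training polygons):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
crops = ["tomato", "wheat", "corn"]
# Invented mean reflectance vector (4 bands) per crop type.
means = {"tomato": [0.1, 0.2, 0.4, 0.5],
         "wheat":  [0.3, 0.3, 0.3, 0.6],
         "corn":   [0.2, 0.4, 0.5, 0.7]}

# 200 training pixels per crop, drawn around each crop's signature.
X = np.vstack([rng.normal(means[c], 0.02, (200, 4)) for c in crops])
y = np.repeat(crops, 200)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# A new pixel whose spectrum matches the wheat signature:
print(clf.predict([[0.3, 0.3, 0.3, 0.6]]))
```

More polygons mean more labelled pixels per crop, which is why the output improves with additional training data, as noted above.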

Sentinel 2 Optical Imagery over Seville region (2017) with Training Polygons manually inputted. Source: RUS Copernicus

Random Forest algorithm generates crop type classification over the total geographic extent under study.

Zoomed in view of Random Forest Classifier output

One can possibly use Random Forest algorithm to 'classify' virtually any terrain on earth, including urban areas. Isn't this fascinating?


Predicting Voter Turnout

In this exercise, we will predict voter turnout in the USA and then identify the potency of the variables affecting it. In the examples above, we saw how Random Forest was used 'to classify' output; here we will use it 'to predict' voter turnout. Another difference: the examples above applied Random Forest to the 'raster' form of spatial data, whereas the exercise below applies it to the 'vector' form.

2016 County Level Voter Turnout: Color coded as per voter turnout's standard deviation from the mean/average.

Let's set up the training data first. We'll use 5 input variables (aggregated county-wise from the year 2019):

a. Percent of population with at most a high school education

b. Median age

c. Per capita income

d. Percent of population who own a selfie stick (a wacky parameter)

e. Distance of the county to the nearest city of a given class (simply put, a city class of 10 means a city with 100,000 residents, whereas city class 5 means a city with 50,000 residents; essentially, we are trying to see how a county's urban or rural character affects its voter turnout)

These 5 variables are, in our opinion, good predictors of actual voter turnout and hence form the training data which we feed into the algorithm.
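Since turnout is a continuous quantity, this is a regression rather than a classification problem. A minimal sketch of the set-up follows, with an entirely fabricated county table; the column names mirror the five variables above, but all values and the turnout relationship are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 500  # pretend counties
df = pd.DataFrame({
    "pct_high_school_max": rng.uniform(10, 60, n),
    "median_age": rng.uniform(25, 55, n),
    "per_capita_income": rng.uniform(15000, 70000, n),
    "pct_selfie_stick": rng.uniform(0, 30, n),
    "dist_to_city_class_10": rng.uniform(0, 300, n),
})
# Fake target, loosely driven by income and education plus noise.
df["turnout"] = (0.4 * df["per_capita_income"] / 1000
                 - 0.2 * df["pct_high_school_max"]
                 + rng.normal(0, 3, n))

features = df.drop(columns="turnout")
model = RandomForestRegressor(n_estimators=100, random_state=2)
model.fit(features, df["turnout"])
print(round(model.score(features, df["turnout"]), 2))  # in-sample R^2
```

The only structural change from the classification examples is swapping `RandomForestClassifier` for `RandomForestRegressor`: the trees now average their numeric predictions instead of voting on a class.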

Model Parameters Snapshot

Upon running the model, the output is as below -

Random Forest Predicted Voter Turnout at County Level

So how accurate is the model prediction (based on the 2019 variables used) compared to the actual voter turnout in 2016?

A: 61.9%

How potent are the variables to the model?

Relative Importance of Variables

Among the variables, Per Capita Income and High School Education were the best indicators of voter turnout. The distance-to-nearest-city variables weren't that important (distances to city classes 9 and 10 were better indicators than distances to other city classes). Owning a selfie stick was actually a better indicator of voter turnout than proximity to city classes!
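A chart like the one above can be reproduced (in spirit) from any fitted scikit-learn forest via its `feature_importances_` attribute. The data and variable names below are placeholders; the target is deliberately built to depend mostly on the first column, so that column should rank on top:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.random((300, 3))
names = ["per_capita_income", "median_age", "dist_to_city_class_10"]
# Target depends strongly on column 0, weakly on column 1.
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

model = RandomForestRegressor(n_estimators=200, random_state=3).fit(X, y)

# Rank the variables by their share of the trees' impurity reduction.
for name, imp in sorted(zip(names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```

The importances sum to 1, so they are read as relative shares rather than absolute effects, which is also how the box-plot diagram above should be interpreted.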

Next, we'll see how these variables perform at the Census Tract level. A census tract is representative of a 'neighborhood' in US terminology. In relative terms, we are re-running the model, which we had earlier run on South Mumbai (CBD)-sized regions, on Nariman Point-sized regions, i.e. even smaller areas. Essentially, this tests how well the model performs at a more granular level using the same variables.

US Census Tract base layer

Model parameters snapshot. Notice that we removed the % Selfie Stick Ownership variable, as that data is not available at the more granular Census Tract level.

Random Forest Predicted Voter Turnout at Tract Level

So how accurate is the model prediction to the actual voter turnout, at a tract level?

A: 62.9% (while not directly comparable due to technical changes in the parameters, model accuracy increased by one percentage point at the tract level compared to the county-level result.)

How potent are the variables to the model?

Relative Importance of Variables: There are only minor changes compared to the previous box-plot diagram (county level). The relative importance of each variable to the model is still largely the same.

Prediction Interval

In the chart above, we can see that the prediction intervals (stripes) for low voter-turnout levels are much wider than those for high voter-turnout levels. We can interpret this to mean that the variables we've used are much better at predicting high-turnout regions than low-turnout regions.
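One common way to obtain such per-prediction intervals from a forest (a sketch of the general idea, not necessarily what the GIS tool does internally) is to collect each individual tree's prediction and take percentiles of that spread; synthetic data below:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.random((400, 2))
y = 10 * X[:, 0] + rng.normal(0, 1, 400)

model = RandomForestRegressor(n_estimators=200, random_state=4).fit(X, y)

x_new = np.array([[0.5, 0.5]])
# Each estimator in the ensemble makes its own prediction.
per_tree = np.array([t.predict(x_new)[0] for t in model.estimators_])
low, high = np.percentile(per_tree, [5, 95])
print(f"90% interval: [{low:.2f}, {high:.2f}]")
```

A wide spread among the trees signals lower confidence in that prediction, which is exactly what the wide stripes over low-turnout regions indicate.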


There are several ways to improve the overall model: adding more relevant variables, removing less relevant ones, increasing the number of validation runs, and increasing the number of decision trees (if necessary), to name a few.
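The "more validation runs / more trees" part of that tuning loop can be sketched with cross-validation. The data here is synthetic; real tuning would run on the county or tract table:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.random((300, 4))
# Only the first two columns actually matter to the target.
y = X @ np.array([3.0, 1.0, 0.0, 0.0]) + rng.normal(0, 0.2, 300)

# Compare forest sizes using 5-fold cross-validated R^2.
for n_trees in (10, 100):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=5)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{n_trees} trees: mean R^2 = {score:.3f}")
```

Held-out scores like these are a more honest accuracy estimate than in-sample fit, and they show when adding trees stops paying off.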

What you'd appreciate, however, is the ability to predict a complex phenomenon such as voter turnout using multiple variables of our choice, and to do so within minutes, thanks to ML algorithms such as Random Forest. For those wondering about the difference between Regression and Random Forest, here are some answers.

What other applications / uses of Random Forest algorithm can you think of?

Using Random Forest, we can determine which roads are more accident-prone, which places tourists are most likely to visit, which ads and content you are likely to watch and appreciate online, and much more.

Isn't this useful?


Intelloc Mapping Services is engaged in selling products which capture geo-data (drones) and process geo-data (Geographic Information Systems), as well as services (PoI datasets and satellite imagery). Together, these help organizations benefit from geo-intelligence for purposes such as operations improvement, project management and digitally enabled growth.

Write to us. Download our one-page profile here. Request a demo.



Many thanks to RUS Copernicus & Esri for the training material.
