Using 'Random Forest' ML Algorithm in 3 Geospatial Workflows
Updated: Jun 13
Machine Learning algorithms can be applied to spatial data to solve classification, clustering and prediction problems. Random Forest is a popular Ensemble Method within Machine Learning that is particularly suited to classification and prediction. The technique involves training a model on labelled 'training data' and building many 'decision trees', whose combined votes arrive at a high-probability conclusion.
Below is an excellent video explaining the technique -
Video Source: Esri's Spatial Data Science MOOC
Liked what you've seen? See another slightly longer (5 mins) explainer video here. It will help you relate better to the examples discussed below.
The geographic extent below (left image) shows radar imagery over a part of the Chaco forest region in northern Paraguay. Some of the darker patches (vegetation) are manually marked with blue polygons, while some of the lighter patches (barren) are manually marked with yellow polygons.
The algorithm learns from this 'training data' and classifies every pixel in the geographic extent (right image) as either forested (green pixels) or deforested (grey pixels). The output, as you'd observe, appears very accurate. This is a simple 'classification' problem which can be addressed using the Forest technique.
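To make the idea concrete, here is a minimal sketch of the technique in plain Python. It is not the tool used for the images above. The pixel values (brightness and texture), the function names and the use of one-split 'stump' trees are all assumptions made for illustration. Each tree is trained on a random resample of the labelled pixels and a randomly chosen band, and the forest classifies a new pixel by majority vote.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]  # (feature, threshold, left_label, right_label)

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each one-split tree on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))              # random band for this tree
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest

def classify(forest, pixel):
    """Majority vote across all trees."""
    votes = Counter(left if pixel[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training pixels drawn from the hand-marked polygons:
# (radar brightness 0-255, texture 0-1). Darker pixels are vegetation.
training_pixels = [(40, 0.80), (55, 0.70), (48, 0.75), (62, 0.85), (70, 0.65),
                   (150, 0.20), (175, 0.15), (160, 0.30), (190, 0.10), (200, 0.25)]
training_labels = ["forested"] * 5 + ["deforested"] * 5

forest = fit_forest(training_pixels, training_labels)
print(classify(forest, (30, 0.90)))   # a dark, textured pixel
print(classify(forest, (210, 0.05)))  # a bright, smooth pixel
```

A production tool grows much deeper trees over many more pixels, but the resample-then-vote structure is the same.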
In a slightly more complex use case of Random Forest, polygons have been created manually on an Optical Satellite Image of Seville in Spain to label a few parcels of agricultural land by crop type (Tomato, Wheat, Corn and so on). The Random Forest algorithm then attempts to classify each pixel in the imagery as per this 'training data'. The output generated - refer Figure 2 & Figure 3 - is believed to be accurate and can be better validated using in-situ observations.
Needless to say, the more quality training data is fed to the algorithm, the more accurate the output will tend to become.
One can possibly use the Random Forest algorithm to 'classify' virtually any terrain on earth, including urban areas. Isn't this fascinating?
In this exercise, we will predict voter turnout in the USA and then identify the potency of the variables affecting it. In the examples above, we saw how Random Forest was used to 'classify' the raw input. Here, we will use Random Forest to 'predict' voter turnout. Another difference is that while the examples above applied Random Forest to the 'raster' form of spatial data, the exercise below applies it to the 'vector' form of spatial data.
Let's set up the Training Data first. We'll use 5 input variables (aggregated county-wise from the year 2019): a. Percent of population with max. High School Education, b. Median Age, c. Per Capita Income, d. Percent of population who own a selfie stick (wacky parameter) and e. Distance of that county to the nearest city of each city class (simply put, City Class 10 means a city with 100,000 residents whereas City Class 5 means a city with 50,000 residents; essentially, we are trying to see how a county's urban or rural characteristics affect its voter turnout). These 5 variables, in our opinion, are good predictors of actual voter turnout and hence form the training data which we feed into the algorithm.
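As a rough sketch of what 'feeding the training data into the algorithm' means for prediction, the plain-Python example below grows a small forest of regression trees and averages their outputs. The county values, the turnout figures, the single distance variable and all function names are made up for illustration. They are not the data or the tool used in this exercise.

```python
import random
from statistics import mean

def grow_tree(rows, targets, rng, depth=0, max_depth=2, min_leaf=2):
    """Grow a small regression tree; a leaf is the mean turnout of its rows."""
    if depth == max_depth or len(rows) < 2 * min_leaf:
        return mean(targets)
    best = None
    for f in rng.sample(range(len(rows[0])), k=2):       # random subset of variables
        for t in sorted({row[f] for row in rows})[:-1]:  # candidate thresholds
            left = [y for row, y in zip(rows, targets) if row[f] <= t]
            right = [y for row, y in zip(rows, targets) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:                                     # no valid split found
        return mean(targets)
    _, f, t = best
    left_rows, left_y = zip(*[(r, y) for r, y in zip(rows, targets) if r[f] <= t])
    right_rows, right_y = zip(*[(r, y) for r, y in zip(rows, targets) if r[f] > t])
    return (f, t,
            grow_tree(left_rows, left_y, rng, depth + 1, max_depth, min_leaf),
            grow_tree(right_rows, right_y, rng, depth + 1, max_depth, min_leaf))

def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train every tree on its own bootstrap sample of the counties."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], rng))
    return forest

def tree_predict(node, row):
    while isinstance(node, tuple):                       # internal node: (f, t, left, right)
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def forest_predict(forest, row):
    """The forest's prediction is the average of all tree predictions."""
    return mean([tree_predict(tree, row) for tree in forest])

# Hypothetical counties: (% max. high school, median age, per capita income,
#                         % selfie stick ownership, km to nearest class-10 city)
counties = [(55.0, 41.0, 24000.0, 3.0, 120.0), (48.0, 39.0, 29000.0, 5.0, 80.0),
            (40.0, 37.0, 35000.0, 6.0, 35.0), (35.0, 36.0, 42000.0, 8.0, 10.0),
            (60.0, 44.0, 21000.0, 2.0, 150.0), (45.0, 38.0, 31000.0, 4.0, 60.0),
            (38.0, 35.0, 39000.0, 7.0, 20.0), (52.0, 42.0, 26000.0, 3.5, 100.0)]
turnout = [48.0, 55.0, 62.0, 68.0, 45.0, 57.0, 66.0, 51.0]  # % of eligible voters

forest = fit_forest(counties, turnout)
new_county = (42.0, 38.0, 33000.0, 5.5, 45.0)
print(f"Predicted turnout: {forest_predict(forest, new_county):.1f}%")
```

The structural difference from classification is small: leaves hold average values instead of class labels, and the trees' outputs are averaged rather than counted as votes.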
Upon running the model, the output is as below -
So how accurate is the model prediction (based on the 2019 variables used) when compared with the actual voter turnout in 2016?
How potent are the variables to the model?
Relative Importance of Variables
Among the variables, Per Capita Income and High School Education were the best indicators of voter turnout. The distance-to-nearest-city variables weren't that important (distance to city classes 9 & 10 were better indicators than distance to other city classes). Owning a selfie stick was actually a better indicator of voter turnout than proximity to city classes!
Next, we'll see how these variables perform at the Census Tract level. A Census Tract roughly corresponds to a 'neighborhood' in US terminology. In relative terms, we are re-running the model, which we had earlier run on extents the size of South Mumbai (CBD), on regions the size of Nariman Point, i.e. an even smaller extent. This shows how well the model performs at a more granular level using the same variables.
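One common way to measure how much a model relies on a variable is 'permutation importance': shuffle one variable's values across the records and see how much the prediction error grows. The sketch below illustrates that idea only. It is not necessarily how the chart above was computed, and `toy_model` is a hypothetical stand-in for a trained forest, with made-up numbers throughout.

```python
import random

def permutation_importance(predict, rows, targets, seed=0):
    """Increase in mean squared error after shuffling each variable in turn."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - y) ** 2 for r, y in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    scores = []
    for f in range(len(rows[0])):
        column = [row[f] for row in rows]
        rng.shuffle(column)  # break the link between this variable and the target
        shuffled = [row[:f] + (v,) + row[f + 1:] for row, v in zip(rows, column)]
        scores.append(mse(shuffled) - baseline)
    return scores

def toy_model(row):
    # Hypothetical stand-in for a fitted forest: it uses education and income
    # and deliberately ignores the distance variable.
    pct_high_school_max, per_capita_income, _dist_to_city_km = row
    return 40.0 + 0.0008 * per_capita_income - 0.2 * pct_high_school_max

# Hypothetical counties: (% max. high school, per capita income, km to nearest city)
counties = [(55.0, 24000.0, 120.0), (48.0, 29000.0, 80.0), (40.0, 35000.0, 35.0),
            (35.0, 42000.0, 10.0), (60.0, 21000.0, 150.0), (45.0, 31000.0, 60.0)]
turnout = [48.0, 55.0, 61.0, 67.0, 44.0, 56.0]

names = ["% max. high school", "per capita income", "distance to city"]
for name, score in zip(names, permutation_importance(toy_model, counties, turnout)):
    print(f"{name}: error increase {score:.2f}")
```

A variable the model never uses scores exactly zero, because shuffling it leaves every prediction unchanged.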
Notice that we removed the % Selfie Stick Ownership variable above. That is because this geodata is not available at the more granular Census Tract level.
So how accurate is the model prediction to the actual voter turnout, at a tract level?
A: 62.9% (while not directly comparable because of technical changes in the parameters, the model accuracy at the tract level was about 1% higher than the county-level result)
How potent are the variables to the model?
Relative Importance of Variables: There have only been minor changes when compared to the previous box plot diagram (county level in Figure 8). The relative importance of each variable to the model is still largely the same.
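The post does not say which metric sits behind the accuracy figure. For forest-based regression, a common choice is R-squared on held-out validation data, i.e. the share of the variation in actual turnout that the predictions explain. Assuming that metric, a minimal sketch of the calculation (with made-up turnout values) looks like this:

```python
def r_squared(actual, predicted):
    """1 minus (unexplained variation / total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical validation tracts: actual vs. predicted turnout (%)
actual = [50.0, 60.0, 70.0]
predicted = [54.0, 60.0, 66.0]
print(f"R-squared: {r_squared(actual, predicted):.3f}")
```

Under this reading, a score of 1.0 means perfect predictions, while 0.0 means the model does no better than always guessing the average turnout.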
In the chart above, we can see that the confidence intervals (stripes) for low voter turnout levels are much larger than the confidence intervals for high voter turnout levels. We can interpret this to mean that the variables we've used are much better at predicting high voter turnout regions than low voter turnout regions.
There are several ways to improve the overall model: adding more relevant variables, removing less relevant variables, increasing the number of runs for validation and, if necessary, increasing the number of decision trees.
What you'd appreciate, however, is the ability to predict a complex phenomenon such as voter turnout using multiple variables of our choice. This can be done within minutes thanks to Machine Learning-based algorithms such as Random Forest. For those wondering about the difference between Regression & Random Forest, here are some answers.
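One way such intervals can be derived, and this is an assumption since the post does not describe how the stripes were computed, is from the spread of the individual trees' predictions for each tract: the more the trees disagree, the wider the interval. A minimal sketch with made-up per-tree predictions:

```python
from statistics import quantiles

def prediction_interval(tree_predictions):
    """Return the (5th, 95th) percentiles of the individual tree predictions."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%
    cuts = quantiles(tree_predictions, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical per-tree turnout predictions (%) for two census tracts
low_turnout_tract = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
high_turnout_tract = [70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0]

for name, preds in [("low", low_turnout_tract), ("high", high_turnout_tract)]:
    lower, upper = prediction_interval(preds)
    print(f"{name}-turnout tract: {lower:.1f}% to {upper:.1f}% "
          f"(width {upper - lower:.1f})")
```

In this toy data the trees disagree far more about the low-turnout tract, so its interval is wider, which mirrors the pattern described above.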
What other applications / uses of Random Forest algorithm can you think of?
Using Random Forest, we can determine which roads are more accident-prone, which places tourists are most likely to visit, and which ads & content you are likely to watch & appreciate online, among numerous other applications. Can you think of some unique applications?
Intelloc Mapping Services | Mapmyops is engaged in providing mapping solutions to organizations which facilitate operations improvement, planning & monitoring workflows. These include but are not limited to Supply Chain Design Consulting, Drone Solutions, Location Analytics & GIS Applications, Site Characterization, Remote Sensing, Security & Intelligence Infrastructure, & Polluted Water Treatment. Projects can be conducted pan-India and overseas.