Machine Learning algorithms can be applied to spatial data to solve classification, clustering and prediction problems. Random Forest is a popular ensemble method within Machine Learning that is particularly suited to classification and prediction on spatial data. The technique involves training a model on labelled data: many 'decision trees' are built, each from a random sample of the training data, and their outputs are combined (by majority vote for classification, by averaging for prediction) to arrive at a high-probability conclusion.
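To make the 'many trees, one conclusion' idea concrete, here is a minimal pure-Python sketch. It is a teaching toy, not a real Random Forest: each 'tree' is a one-split stump, and the two brightness values per sample, the labels and the function names are all made up for illustration.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Pick the single feature/threshold split that misclassifies the fewest rows."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xx, y in rows if xx[f] <= t]
            right = [y for xx, y in rows if xx[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label

def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict(trees, x):
    """Majority vote across trees; also return the share of trees that agreed."""
    votes = Counter(tree(x) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)

# Hypothetical training samples: (brightness in band 1, brightness in band 2) -> label
rows = [
    ((0.10, 0.30), "forest"), ((0.15, 0.25), "forest"),
    ((0.20, 0.35), "forest"), ((0.25, 0.20), "forest"),
    ((0.70, 0.60), "barren"), ((0.80, 0.75), "barren"),
    ((0.75, 0.65), "barren"), ((0.90, 0.80), "barren"),
]
trees = fit_forest(rows)
dark_label, dark_share = predict(trees, (0.05, 0.20))
bright_label, bright_share = predict(trees, (0.95, 0.90))
print(dark_label, dark_share)
print(bright_label, bright_share)
```

The share of agreeing trees is what gives the method its 'high-probability' flavour: the more trees vote the same way, the more confident the classification.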
Below is an excellent video explaining the technique -
Video Source: Esri's Spatial Data Science MOOC
Liked what you've seen? See another, slightly longer (5 mins) explainer video here. It will help you relate better to the examples discussed below.
The geographic extent below (left image) shows radar imagery over a part of the Chaco forest region (in the north of Paraguay). Some of the darker patches (vegetation) are manually marked with blue polygons while some of the lighter patches (barren) are manually marked with yellow polygons.
The algorithm learns from this 'training data' and classifies every pixel in the geographic extent (right image) as either forested (green pixels) or deforested (grey pixels). The output, as you'd observe, appears very accurate. This is a simple 'classification' problem which can be addressed using the Forest technique.
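The same workflow (label a few patches, train, classify every pixel) can be sketched with scikit-learn. This is an illustration under stated assumptions, not the tool used for the images above: the two-band 'scene' is randomly generated, and the rectangle positions stand in for the hand-drawn polygons.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for a radar scene: 60 x 80 pixels, 2 bands.
# Class 0 = forested (darker values), class 1 = deforested (lighter values).
height, width = 60, 80
truth = np.zeros((height, width), dtype=int)
truth[:, width // 2:] = 1
band_means = np.array([[0.2, 0.3], [0.7, 0.8]])
image = band_means[truth] + rng.normal(0, 0.05, size=(height, width, 2))

# Hypothetical training polygons, simplified to rectangles.
train_mask = np.zeros((height, width), dtype=bool)
train_mask[10:20, 5:15] = True    # "blue polygon" over vegetation
train_mask[30:40, 60:70] = True   # "yellow polygon" over barren land

X_train = image[train_mask]       # shape: (n_training_pixels, 2)
y_train = truth[train_mask]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classify every pixel, then fold the result back into a map.
classified = model.predict(image.reshape(-1, 2)).reshape(height, width)
agreement = (classified == truth).mean()
print("share of pixels matching the synthetic ground truth:", agreement)
```

With real imagery there is no `truth` array for the whole scene, which is why in-situ or independently labelled validation samples are needed to measure accuracy.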
A slightly more complex use case of Random Forest: here, polygons have been drawn manually on an optical satellite image of Seville, Spain, to label a few parcels of agricultural land by crop type (Tomato, Wheat, Corn and so on). The Random Forest algorithm then attempts to classify each pixel in the imagery based on this 'training data'. The output generated (refer to Figure 2 & Figure 3) is believed to be accurate and can be validated more rigorously using in-situ observations.
Needless to say, the more quality training data is fed to the algorithm, the more accurate the output will tend to become.
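One way to see the effect of training data volume is to train the same model on progressively larger samples and score each on the same held-out pixels. The sketch below assumes four made-up crop classes with randomly generated three-band signatures; none of the numbers come from the Seville exercise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical mean reflectance per crop class in three bands.
crop_names = ["tomato", "wheat", "corn", "other"]
class_means = np.array([
    [0.2, 0.5, 0.3],
    [0.4, 0.6, 0.5],
    [0.6, 0.4, 0.4],
    [0.8, 0.7, 0.6],
])

def sample_pixels(n):
    """Draw n labelled pixels with noisy, overlapping class signatures."""
    labels = rng.integers(0, len(crop_names), size=n)
    pixels = class_means[labels] + rng.normal(0, 0.1, size=(n, 3))
    return pixels, labels

X_test, y_test = sample_pixels(2000)   # held-out pixels used only for scoring

accuracy = {}
for n_train in (20, 200, 2000):
    X_train, y_train = sample_pixels(n_train)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    accuracy[n_train] = model.score(X_test, y_test)
    print(n_train, "training pixels -> accuracy", round(accuracy[n_train], 3))
```

Accuracy usually rises and then flattens as samples are added; past that point, better-quality labels or more informative bands tend to help more than sheer volume.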
One can possibly use Random Forest algorithm to 'classify' virtually any terrain on earth, including urban areas. Isn't this fascinating?
In this exercise, we will predict voter turnout in the USA and then identify how strongly each variable influences it. In the examples above, we saw how Random Forest was used to 'classify' the raw input. Here, we will use Random Forest to 'predict' voter turnout. Another difference: while the examples above applied Random Forest to spatial data in 'raster' form, the exercise below applies it to spatial data in 'vector' form.
Let's set up the Training Data first. We'll use 5 input variables (aggregated county-wise, from the year 2019): a. Percent of population whose highest education is High School, b. Median Age, c. Per Capita Income, d. Percent of population who own a selfie stick (a wacky parameter) and e. Distance from the county to the nearest city of each city class (simply put, City Class 10 means a city with 100,000 residents whereas City Class 5 means a city with 50,000 residents; essentially, we are trying to see how a county's urban or rural character affects its voter turnout). In our opinion, these 5 variables are good predictors of actual voter turnout and hence form the training data which we feed into the algorithm.
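In code, this setup amounts to a table with one row per county, one column per input variable, and the known turnout as the value to predict. The sketch below uses scikit-learn rather than the tool used in this exercise, and the table is randomly generated: the column meanings, value ranges and the turnout formula are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_counties(n, seed=0):
    """Hypothetical county table; every value is randomly generated."""
    rng = np.random.default_rng(seed)
    hs_only_pct = rng.uniform(20, 60, n)        # a. % with High School as highest education
    median_age = rng.uniform(25, 55, n)         # b. median age
    income = rng.uniform(15_000, 60_000, n)     # c. per capita income
    selfie_pct = rng.uniform(0, 30, n)          # d. % owning a selfie stick
    dist_city_km = rng.uniform(0, 300, n)       # e. distance to nearest city (one class only)
    # Invented relationship, used only so the model has a signal to learn.
    turnout = (40 + 0.0005 * income - 0.3 * hs_only_pct + 0.2 * median_age
               + rng.normal(0, 5, n))
    X = np.column_stack([hs_only_pct, median_age, income, selfie_pct, dist_city_km])
    return X, turnout

X, turnout = make_counties(500)

# A regression forest: each tree predicts a number and the forest averages them.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, turnout)

predicted = model.predict(X[:5])
print("predicted turnout (%) for the first five counties:", predicted.round(1))
```

Note that the model never sees coordinates here; the 'spatial' part enters through variables such as distance to the nearest city.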
Upon running the model, the output is as below -
So how accurate is the model's prediction (based on the 2019 variables used) compared to the actual voter turnout in 2016?
A: 61.9%
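The post does not state which metric the percentage above refers to. One common way to score a regression forest is R-squared on data held back from training, which reads as the share of variation in turnout that the model explains. The sketch below shows that calculation with scikit-learn on a randomly generated table; the data, names and split ratio are assumptions, and the numbers it prints are unrelated to the result above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the county table (five predictors, one target).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(20, 60, n),           # % with High School as highest education
    rng.uniform(25, 55, n),           # median age
    rng.uniform(15_000, 60_000, n),   # per capita income
    rng.uniform(0, 30, n),            # % owning a selfie stick
    rng.uniform(0, 300, n),           # distance to nearest city
])
y = 40 + 0.0005 * X[:, 2] - 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, n)

# Hold back a quarter of the counties so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
oob_r2 = model.oob_score_   # R-squared estimated from rows each tree did not train on
print("held-out R-squared:", round(r2, 3))
print("out-of-bag R-squared:", round(oob_r2, 3))
```

Scoring on the training rows themselves would look far better than either figure, which is why a held-out or out-of-bag estimate is the one worth reporting.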
How potent are the variables to the model?
Relative Importance of Variables
Among the variables, Per Capita Income and High School Education were the best indicators of voter turnout. The distance to nearest city variables weren't that important (distance to city class 9 & 10 respectively were better indicators than distance to other city classes). Owning a selfie stick was actually a better indicator of voter turnout than proximity to city classes!
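A ranking like this can be read off a fitted forest. The sketch below uses scikit-learn's impurity-based `feature_importances_` on a randomly generated table in which, by construction, income matters most and selfie stick ownership has no effect at all; the variable names and data are assumptions, not the values used in this exercise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

names = ["hs_only_pct", "median_age", "per_capita_income",
         "selfie_stick_pct", "dist_nearest_city_km"]

# Randomly generated table; only the first three columns influence the target.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(20, 60, n),
    rng.uniform(25, 55, n),
    rng.uniform(15_000, 60_000, n),
    rng.uniform(0, 30, n),
    rng.uniform(0, 300, n),
])
y = 40 + 0.0005 * X[:, 2] - 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalised to sum to 1 across all variables.
importance = dict(zip(names, model.feature_importances_))
for name, value in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{name:22s} {value:.3f}")
```

Even a variable with no real effect receives a small non-zero importance, because trees will occasionally split on noise. That is worth keeping in mind when a quirky variable such as selfie stick ownership outranks others.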
Next, we'll see how these variables perform at the Census Tract level. A Census Tract is roughly equivalent to a 'neighborhood' in USA terminology. In relative terms, we are re-running the model, which we had earlier run on geographic units the size of South Mumbai (CBD), on units the size of Nariman Point, i.e. an even smaller extent. This is to see how well the model performs at a more granular level using the same variables.
Notice that we removed the % Selfie Stick Ownership variable above. That is because this geodata is not available at the more granular Census Tract level.
So how accurate is the model prediction to the actual voter turnout, at a tract level?
A: 62.9% (while not directly comparable due to technical changes in the parameters, the model accuracy increased by one percentage point at the tract level compared to the county-level result of 61.9%)
How potent are the variables to the model?
Relative Importance of Variables: There have only been minor changes compared to the previous box plot diagram (county level, Figure 8). The relative importance of each variable to the model is still largely the same.
In the chart above, we can see that the confidence intervals (stripes) for low voter turnout levels are much larger than the confidence intervals for high voter turnout levels. The way we can interpret this is that the variables we've used are much better at predicting high voter turnout regions than they are for predicting low voter turnout regions.
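A rough way to reproduce this kind of stripe is to look at how much the individual trees disagree for each prediction. The sketch below takes the 5th and 95th percentiles of the per-tree predictions from a scikit-learn forest. This is an approximation chosen for illustration (it measures disagreement among trees, not a formal prediction interval), it may differ from how the chart above was produced, and the data is randomly generated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Randomly generated stand-in data: four predictors, one target.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(20, 60, n),
    rng.uniform(25, 55, n),
    rng.uniform(15_000, 60_000, n),
    rng.uniform(0, 300, n),
])
y = 40 + 0.0005 * X[:, 2] - 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:-10], y[:-10])
X_new = X[-10:]   # ten rows the forest was not trained on

# One prediction per tree per row: shape (n_trees, n_rows).
per_tree = np.stack([tree.predict(X_new) for tree in model.estimators_])

point = per_tree.mean(axis=0)              # matches the forest's averaged prediction
lower = np.percentile(per_tree, 5, axis=0)
upper = np.percentile(per_tree, 95, axis=0)
band_width = upper - lower

for p, lo, hi in zip(point, lower, upper):
    print(f"predicted {p:5.1f}  range [{lo:5.1f}, {hi:5.1f}]")
```

Rows with a wide band are those where the trees disagree most, which is the same signal the wider stripes convey for low-turnout regions.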
There are several ways to improve the overall model - adding more relevant variables, removing less relevant variables, increasing the number of runs for validation and increasing the number of decision trees (if necessary), being some of them.
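Two of these levers, the number of trees and the number of validation runs, are easy to experiment with. The sketch below compares forests of different sizes using 5-fold cross-validation in scikit-learn, treating each fold as one validation run; the data is randomly generated and the tree counts are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Randomly generated stand-in data: three predictors, one target.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.uniform(20, 60, n),
    rng.uniform(25, 55, n),
    rng.uniform(15_000, 60_000, n),
])
y = 40 + 0.0005 * X[:, 2] - 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, n)

results = {}
for n_trees in (10, 50, 200):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    # Five validation runs: each fold is held out once and scored with R-squared.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[n_trees] = scores.mean()
    print(f"{n_trees:4d} trees: mean R-squared {scores.mean():.3f} (std {scores.std():.3f})")
```

Gains from extra trees typically level off quickly, so once the score stops moving, adding or removing variables is usually the more productive change.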
What you'd appreciate, however, is the ability to predict a complex phenomenon such as voter turnout using multiple variables of our choice. This can be done within minutes thanks to Machine Learning-based algorithms such as Random Forest. For those who are wondering what the difference is between Regression & Random Forest, here are some answers.
What other applications / uses of Random Forest algorithm can you think of?
Using Random Forest, we can determine which roads are more accident-prone, which places tourists are most likely to visit, which ads & content you are likely to watch & appreciate online, and numerous other things. Can you think of some unique applications?
ABOUT US
Intelloc Mapping Services | Mapmyops.com is based in Kolkata, India and engages in providing Mapping solutions that can be integrated with Operations Planning, Design and Audit workflows. These include but are not limited to - Drone Services, Subsurface Mapping Services, Location Analytics & App Development, Supply Chain Services & Remote Sensing Services. The services can be rendered pan-India, some even globally, and will aid an organization to meet its stated objectives especially pertaining to Operational Excellence, Cost Reduction, Sustainability and Growth.
Much Thanks to RUS Copernicus & Esri for the training material.