Anytime you have lat/long coordinates, you have an opportunity to do data science with K-Means clustering and visualization on a map. This is a story about how I used geo data and K-Means clustering to explore a topic that has affected me personally - wildfires!

Every summer, wildfires become front-of-mind for thousands of people who live in the West, Pacific Northwest, and Northern Rockies regions of the United States. Odds are, if you don't see the flames firsthand, you will probably see smoke-influenced weather, road closures, and calls for caution from local authorities.

I've lived in Oregon for about 10 years. In that time I've had more than one close encounter with a forest fire. This past summer was especially bad. A fire in the Columbia River Gorge blew smoke and ash through my neighborhood. Earlier in the year I crossed paths with firefighters attempting to control a fire in steep, rugged terrain in southern Washington. I was stunned by the size of their equipment - trucks so badass they could be in a Mad Max movie.

Firefighting is big business. Wildland fire suppression costs exceeded $2 billion in 2017, making it the most expensive year on record for the Forest Service. Let's look at one small way in which data science could be applied to streamline firefighting operations and reduce costs.

The Problem

The cost of moving heavy firefighting equipment is probably a "drop in the bucket" compared to total suppression costs, but it's the type of problem that can be optimized with a little data wrangling and applied math. By staging heavy firefighting equipment as close as possible to where fires are likely to occur, we can minimize the cost of moving it to where it's needed.

The Solution

My goal is to predict where forest fires are prone to occur by partitioning the locations of past burns into clusters. The cluster centroids can then be used to place heavy firefighting equipment as near as possible to where fires are likely to occur. The K-Means clustering algorithm is well suited to this purpose.
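
For context, K-Means treats each historical fire as a point x_i = (lat, lon) and searches for k centroids \mu_1, \ldots, \mu_k that minimize the within-cluster sum of squared distances:

\min_{\mu_1,\ldots,\mu_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

Each centroid \mu_j is the point closest, in average squared distance, to the past fires in its cluster C_j, which is exactly the property we want in a staging location. (Treating latitude and longitude as Euclidean coordinates is an approximation, but a reasonable one at this regional scale.)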

The United States Forest Service provides downloadable datasets that describe forest fires that have occurred in Canada and the United States since the year 2000. Unfortunately for my purposes, the data comes in shapefile format, which is inconvenient; it needs to be transformed to CSV in order to be easily usable by Spark. Also, the records after 2008 have a different schema than prior years, so after converting the shapefiles to CSV they'll need to be ingested into Spark using separate user-defined schemas (a sketch of that follows below). By the way, this complexity is typical. Raw data is hardly ever suitable for machine learning without cleansing. The process of cleaning and unifying messy datasets is called "data wrangling", and it frequently comprises the bulk of the effort involved in real-world machine learning.
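
To make the two-schema point concrete, here is a minimal sketch of one way to combine the two eras in Spark. It is not the exact pipeline from my notebook; schemaPre2008, schemaPost2008, the file paths, and the selected columns are placeholders:

// Each era gets its own user-defined schema because the column layouts differ
val preFires ="com.databricks.spark.csv").option("header", "true").schema(schemaPre2008).load("/user/mapr/data/fires/pre2008/*.csv")
val postFires ="com.databricks.spark.csv").option("header", "true").schema(schemaPost2008).load("/user/mapr/data/fires/post2008/*.csv")
// Keep only the columns both eras share, then combine them into a single DataFrame
val fires ="lat", "lon", "date").unionAll("lat", "lon", "date"))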

Apache Zeppelin

The data wrangling that precedes machine learning typically involves writing expressions in R, SQL, Scala, and/or Python that join and transform sampled datasets. Getting these expressions right often involves a lot of trial and error, and ideally you want to test them without the burden of compiling and running a full program. Data scientists have embraced web-based notebooks, such as Apache Zeppelin, for this purpose because they let you transform datasets interactively and know right away whether what you're trying to do will work.

The Zeppelin notebook I wrote for this study contains a combination of Bash, Python, Scala, and Angular code.
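
In Zeppelin, each paragraph declares the interpreter that runs it with a directive on its first line, which is what lets one notebook mix all four. Roughly speaking (interpreter names can vary with your Zeppelin configuration):

%sh        runs the shell paragraphs that download the raw data
%python    runs the DBF-to-CSV conversion
%spark     runs the Scala paragraphs that ingest the data and train the K-Means model
%angular   renders the resulting centroids on a map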

Here’s the bash code I use to download the dataset:

# Create a working directory on the cluster filesystem
mkdir -p /mapr/
cd /mapr/
# Download each fire detection archive (one curl per zip file)
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
curl -s --remote-name
# Extract only the .dbf attribute tables from each archive
find modis*.zip | xargs -I {} unzip {} modis*.dbf
find mcd*.zip | xargs -I {} unzip {} mcd*.dbf

Here’s the python code I use to convert the downloaded datasets to CSV files:

import csv
from dbfpy import dbf
import os
import sys

# Directory containing the extracted .dbf files
DATADIR = '/mapr/'

for filename in os.listdir(DATADIR):

    if filename.endswith('.dbf'):
        print "Converting %s to csv" % filename
        csv_fn = DATADIR + filename[:-4] + ".csv"
        with open(csv_fn, 'wb') as csvfile:
            in_db = dbf.Dbf(DATADIR + filename)
            out_csv = csv.writer(csvfile)
            # Write the header row from the DBF field names
            names = []
            for field in in_db.header.fields:
            out_csv.writerow(names)
            # Write one CSV row per DBF record
            for rec in in_db:
                out_csv.writerow(rec.fieldData)
            in_db.close()
        print "Done..."

Here’s the Scala code I use to ingest the CSV files and train a K-Means model with Spark libraries:

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.mllib.linalg.Vectors
import sqlContext.implicits._
import sqlContext._

// Schema for the converted MODIS fire detection CSV files
val schema = StructType(Array(
  StructField("area", DoubleType, true),
  StructField("perimeter", DoubleType, true),
  StructField("firenum", DoubleType, true),
  StructField("fire_id", DoubleType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("date", TimestampType, true),
  StructField("julian", IntegerType, true),
  StructField("gmt", IntegerType, true),
  StructField("temp", DoubleType, true),
  StructField("spix", DoubleType, true),
  StructField("tpix", DoubleType, true),
  StructField("src", StringType, true),
  StructField("sat_src", StringType, true),
  StructField("conf", IntegerType, true),
  StructField("frp", DoubleType, true)))

// Load the CSV files with the user-defined schema
val df_all ="com.databricks.spark.csv").option("header", "true").schema(schema).load("/user/mapr/data/fires/modis*.csv")
// Include only fires with coordinates in Cascadia
val df = df_all.filter($"lat" > 42).filter($"lat" < 50).filter($"lon" > -124).filter($"lon" < -110)

// Assemble the lat/lon columns into the feature vector K-Means expects
val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val df2 = assembler.transform(df)
val Array(trainingData, testData) = df2.randomSplit(Array(0.7, 0.3), 5043)

// Train a 400-cluster K-Means model on the training split
val kmeans = new KMeans().setK(400).setFeaturesCol("features").setMaxIter(5)
val model =
println("Final Centers: ")
// Save the model to disk
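
One quick way to sanity-check the choice of k = 400 is the within-cluster sum of squared errors on the held-out split. A minimal sketch, assuming the Spark 2.x KMeansModel.computeCost API (later Spark versions replace it with ClusteringEvaluator):

// Lower WSSSE means tighter clusters; comparing it across several values of k helps pick a good k
val WSSSE = model.computeCost(testData)
println(s"Within Set Sum of Squared Errors = $WSSSE")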

The resulting cluster centers are shown below. Where would you stage firefighting equipment?

These centroids were calculated from the locations of fires that have occurred in the past. They can be used to stage firefighting equipment as near as possible to regions prone to burn, but how do we know which staging area should respond when a new forest fire starts? We can use our previously saved model to answer that question. The Scala code for that looks like this:

// Score a new fire location; the column names must match the assembler's input columns
val test_coordinate = Seq((42.3, -112.2)).toDF("lat", "lon")
val df3 = assembler.transform(test_coordinate)
val categories = model.transform(df3)
// The "prediction" column holds the id of the nearest cluster centroid
val centroid_id ="prediction").rdd.map(r => r(0)).collect()(0).asInstanceOf[Int]
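
The centroid id on its own only names a cluster; to recover the staging coordinates it refers to, look it up in the model's cluster centers (the prediction is an index into that array):

// clusterCenters(i) is the (lat, lon) centroid of cluster i
val staging_location = model.clusterCenters(centroid_id)
println(s"Nearest staging area for this fire: $staging_location")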

Using Zeppelin in Docker with the MapR Data Science Refinery

Installing and using data science notebooks is relatively straightforward. However, integrating notebooks with separate clusters, so that you can, for example, read large datasets from Hadoop and process them in Spark, is difficult to set up and often slow due to data movement.

MapR solves this problem by distributing Zeppelin in a dockerized data science container preconfigured with secure read/write access to all the data on a MapR cluster. MapR's support for Zeppelin makes it the fastest and most scalable notebook for data science: because it has direct access to everything on your cluster, you can analyze data in place, whether it lives in streams, files, or tables, without moving it.

To illustrate the value of this, check out the Zeppelin notebook I developed for the firefighting problem described above. In it you will see how data can be ingested and processed through a variety of data engineering and machine learning libraries, with seamless access to the MapR Converged Data Platform.

The MapR Convergence Conference is coming to Portland!

If you are trying to build data science applications for your business, I would like to personally invite you to join me at MapR's one-day Big Data conference on Thursday, November 16th, at the Nines Hotel in downtown Portland, Oregon.

We will be discussing the following:

  • Multi-Cloud and Data Integration
  • IoT and Edge Computing
  • Data Ops and Global Data Fabric
  • Machine Learning Logistics

When you attend this event, you’ll have the opportunity to engage with other attendees and industry experts to explore new ideas and find practical solutions to your own Big Data challenges.

Register with the following link to receive a free pass:

Please provide your feedback to this article by adding a comment to

Did you enjoy the blog? Did you learn something useful? If you would like to support this blog please consider making a small donation. Thanks!