How to Stop Hardcoding Service Endpoints in Vue.js

One of the most common misunderstandings with Vue.js concerns how to define endpoints for backend services that cannot be resolved at build time. In this post I’m going to describe how to define dynamic configurations like backend endpoints so they can be determined at runtime. Vue.js is a very... [Read More]
Tags: aws, lambda, python, mediainfo, docker

Running MediaInfo as an AWS Lambda Function

This post describes how to package MediaInfo so it can be used in applications hosted by AWS Lambda. AWS Lambda is a cloud service from Amazon that lets you run code without the complexity of building and managing servers. MediaInfo is a very popular tool for people who do video... [Read More]
Tags: aws, lambda, python, mediainfo, docker

Deep Dive into CORS Configs on AWS S3

I originally published this article on the AWS blog. For several weeks I’ve been trying to diagnose Cross-Origin Resource Sharing (CORS) errors in a web component I built for uploading files to AWS S3. This has been one of the hardest software defects I’ve had to solve in... [Read More]
Tags: aws, s3, python

Running OpenCV as an AWS Lambda Function

This post describes how to package the OpenCV Python library so it can be used in applications that run in AWS Lambda. AWS Lambda is a Function-as-a-Service (FaaS) offering from Amazon that lets you run code without the complexity of building and maintaining the underlying infrastructure. OpenCV is one of... [Read More]
Tags: aws, lambda, python, opencv, docker

Data Management Strategies for Computer Vision

Computer Vision (CV) developers often find that the biggest barrier to success is data management, and yet so much of what you’ll find about CV is about the algorithms, not the data. In this blog I’ll describe three separate data management strategies I’ve used with applications that process images. Through... [Read More]
Tags: computer vision, docker, kafka, mapr, twitter

Business Innovation through Data Transformation

Today I presented at the Seattle Technology Leadership Summit, which was a gathering of CxOs and upper management from a variety of companies. I made the case that companies can become more competitive by innovating with data-intensive applications and (secondarily) that MapR provides the best data platform to make... [Read More]
Tags: mapr, business strategy

Using StreamSets and MapR together in Docker

In this post I demonstrate how to integrate StreamSets with MapR in Docker. This is made possible by the MapR persistent application client container (PACC). The fact that any application can use MapR simply by mapping /opt/mapr through Docker volumes is really powerful! Installing the PACC is a piece of... [Read More]
Tags: streamsets, mapr, docker, data pipelines

Creating Data Pipelines for IoT with StreamSets

If you think building data pipelines requires advanced software development skills, think again. A company called StreamSets has created software which enables you to build data pipelines using a drag-and-drop GUI. It frees you from the burden of writing code with the application programming interfaces (APIs) needed to ingest data,... [Read More]
Tags: streamsets, mqtt, iot, opentsdb, grafana, data visualization, data pipelines

The MapR-DB Connector for Apache Spark

MapR just released Python and Java support for their MapR-DB connector for Spark. It also supports Scala, but Python and Java are new. I recorded a video to help them promote it, but I also learned a lot in the process, relating to how databases can be used in Spark.... [Read More]
Tags: nosql, json, spark

Predicting Time-Series data from OpenTSDB with RNNs in Tensorflow

I’ve been learning a lot of really interesting stuff about time-series data lately. Over the past month I’ve learned how to consume Factory IoT sensor data from an MQTT server, process it in StreamSets, persist it in OpenTSDB, visualize it in Grafana, and forecast it with Tensorflow. It’s really amazing... [Read More]
Tags: rnn, tensorflow, machine learning, data science, opentsdb, python

Predicting Forest Fires with Spark Machine Learning

Anytime you have lat / long coordinates, you have an opportunity to do data science with kmeans clustering and visualization on a map. This is a story about how I used geo data with kmeans clustering that relates to a topic which has affected me personally - wildfires! Every summer... [Read More]
Tags: spark, kmeans, machine learning, data science, data engineering, data wrangling, scala, python

Joining streams and NoSQL tables for Customer 360 analytics in Spark

“MapR-DB is the perfect database for Customer 360 applications”. That’s the tag line I used to describe a demo I created for MapR for the Strata Data Conference in New York in September of 2017. Describing Customer 360 as a use case for MapR-DB was the focus of this demo... [Read More]
Tags: spark, streaming, sql, machine learning, customer 360, master data management

Using Tensorflow on a Raspberry Pi in a Chicken Coop

Ever since I first heard about Tensorflow and the promises of Deep Learning I’ve been anxious to give it a whirl. Tensorflow is a powerful and easy to use library for machine learning. It was open-sourced by Google in November 2015. In less than 2 years it has become one... [Read More]
Tags: tensorflow, python

How to plot data on maps in Jupyter using Matplotlib, Plotly, and Bokeh

If you’re trying to plot geographical data on a map then you’ll need to select a plotting library that provides the features you want in your map. And if you haven’t plotted geo data before then you’ll probably find it helpful to see examples that show different ways to do... [Read More]
Tags: data discovery, data integration, apache drill

How to combine relational and NoSQL datasets with Apache Drill

It is rarely the case that enterprise data science applications can operate on data which is entirely contained within a single database system. Take for instance a company which wants to build a Customer 360 application that uses data sources across its enterprise to develop marketing campaigns or recommendation engines... [Read More]
Tags: data discovery, data integration, apache drill

Visualizing K-Means Clusters in Jupyter Notebooks

The information technology industry is in the middle of a powerful trend towards machine learning and artificial intelligence. These are difficult skills to master but if you embrace them and just do it, you’ll be making a very significant step towards advancing your career. As with any learning curve, it’s... [Read More]
Tags: machine learning, python, jupyter, kmeans, customer 360

How To Clone Virtual Machines in Azure

I use Azure a lot to create virtual machines for demos and application prototypes. It often takes me a long time to set up these rigs, so once I finally get things the way I like them I really don’t want to duplicate that effort. Fortunately, Azure lets us clone VMs.... [Read More]
Tags: azure

Kafka vs MapR Streams Benchmark

A lot of people choose MapR as their core platform for processing and storing big data because of its advantages for speed and performance. MapR consistently performs faster than any other big data platform for all kinds of applications, including Hadoop, distributed file I/O, NoSQL data storage, and data streaming.... [Read More]
Tags: performance, kafka, mapr streams

Automating MapR with MapR Stanzas

In my life as a technical marketeer for MapR I have configured more clusters than you can shake a stick at. So, imagine my excitement when I heard that MapR installations can be automated with a new capability called “MapR Stanzas”. MapR Stanzas allow you to automate the MapR installation... [Read More]
Tags: mapr, automation

What's wrong with using small batch sizes in Kafka?

What is Kafka’s batch size? Kafka producers will buffer unsent records for each partition. These buffers are of a size specified by the batch.size config. You can achieve higher throughput by increasing the batch size, but there is a trade-off between more batching and increased end-to-end latency. The larger your... [Read More]
Tags: java, kafka
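
The batching trade-off described in this excerpt can be illustrated with a toy model. This is only a sketch in plain Python, not the real Kafka producer: the `ToyBatcher` class and `count_requests` helper are hypothetical names invented for illustration, and the model ignores time-based sending (the real producer also has a `linger.ms` setting). It assumes a batch is sent as soon as the buffered bytes for a partition reach the configured size.

```python
from collections import defaultdict


class ToyBatcher:
    """Toy model of per-partition batching (hypothetical, not the Kafka client)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size      # bytes per batch, analogous to batch.size
        self.buffers = defaultdict(list)  # partition -> unsent records
        self.sent = []                    # list of (partition, records) batches

    def send(self, partition, record):
        # Buffer the record; send the batch once it reaches batch_size bytes.
        buf = self.buffers[partition]
        buf.append(record)
        if sum(len(r) for r in buf) >= self.batch_size:
            self.flush(partition)

    def flush(self, partition):
        # Send whatever is buffered for this partition, if anything.
        if self.buffers[partition]:
            self.sent.append((partition, self.buffers[partition]))
            self.buffers[partition] = []


def count_requests(batch_size, n_records=100, record_bytes=100):
    """Return how many batches are sent for a fixed workload on one partition."""
    batcher = ToyBatcher(batch_size)
    for _ in range(n_records):
        batcher.send(0, b"x" * record_bytes)
    batcher.flush(0)  # send any partially filled batch at the end
    return len(batcher.sent)


if __name__ == "__main__":
    for size in (100, 1000, 16384):
        print(size, count_requests(size))
```

In this model a larger batch size means fewer, larger sends for the same workload, which is where the throughput gain comes from. The cost is that records sit in the buffer longer before they leave, which is the latency side of the trade-off.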

How to deploy a 3 node Kafka cluster in Azure

In an earlier post I described how to set up a single-node Kafka cluster in Azure so that you can quickly familiarize yourself with basic Kafka operations. However, most real-world Kafka applications will run on more than one node to take advantage of Kafka’s replication features for fault tolerance.... [Read More]
Tags: azure, kafka, cluster

How to persist Kafka streams as JSON in No-SQL storage

Streaming data is like, “Now you see it. Now you don’t!” One of the challenges when working with streams, especially streams of fast data, is the transitory nature of the data. Kafka streams are characterized by a retention period that defines the point at which messages will be permanently deleted.... [Read More]
Tags: mapr, kafka, maprdb, drill, json

How to Use JUnit to Optimize Throughput in Kafka Streams

Finding the optimal set of configurations for Kafka in order to achieve the fastest possible throughput for real-time/stream analytics can be a time-consuming process of trial and error. Automating that process with parametrized JUnit tests can be an excellent way to find optimal Kafka configurations without guesswork and... [Read More]
Tags: Kafka, Performance, JUnit, R

How to quickly get started using Kafka

This post describes how to quickly install Apache Kafka on a one node cluster and run some simple producer and consumer experiments. Apache Kafka is a distributed streaming platform. It lets you publish and subscribe to streams of data like a messaging system. You can also use it to store... [Read More]
Tags: azure, mapr, kafka

How to "Right Size" VMs in Azure for Kafka Streaming

I’ve been using Azure for hosting a 3 node MapR cluster with which I’m running a streaming application that uses Kafka and Spark to process a fast data stream. My use case requires that I be able to ingest 1.7 GB of data into Kafka within 1 minute (approximately 227... [Read More]
Tags: azure, kafka

How To Debug Remote Spark Jobs With IntelliJ

Application developers often use debuggers to find and fix defects in their code. Attaching a debugger to a running application is straightforward when the runtime is local on a laptop but trickier when that code runs on a remote server. This is even more confusing for Big Data applications since... [Read More]
Tags: intellij, spark

How To Install MapR In Azure

Last week MapR released a new version of their Converged Data Platform. Today I installed it on Azure, and kept notes on all the commands I used. It’s possible to automate this installation, and I’m pretty sure MapR has documented how to do that, but until I find that doc,... [Read More]
Tags: azure, mapr

How To Monitor Kafka Apps With JProfiler

I’ve been spending a lot of time trying to maximize throughput for a Kafka data streaming pipeline. A large part of this effort has involved optimizations to data structures in my Java code. Generally speaking, anytime I use a data structure which is not a byte array, I sacrifice performance.... [Read More]
Tags: jprofiler, kafka

How To Build A Data Lake In 15 Minutes

When getting acquainted with new technologies I believe users should be able to get started without spending more than 15 minutes setting up a sandbox environment. But when it comes to setting up a big data cluster, 15 minutes is a lofty goal. However, it is possible to get started... [Read More]
Tags: azure, mapr

How To Use Java Serializers With Kafka

Apache Kafka is a distributed pub-sub messaging system that scales horizontally and has built-in message durability and delivery guarantees. It’s a cluster-based technology and has evolved from its origins at LinkedIn to become the de facto standard messaging system enterprises use to move massive amounts of data through transformation pipelines. I’ve... [Read More]
Tags: java, kafka