One of the most common points of confusion with Vue.js is how to define endpoints for backend services that are not resolvable at build time. In this post I’m going to describe how to define dynamic configurations, like backend endpoints, so they can be determined at runtime. Vue.js is a very...
[Read More]
Running MediaInfo as an AWS Lambda Function
This post describes how to package MediaInfo so it can be used in applications hosted by AWS Lambda. AWS Lambda is a cloud service from Amazon that lets you run code without the complexity of building and managing servers. MediaInfo is a very popular tool for people who do video...
[Read More]
Deep Dive into CORS Configs on AWS S3
I originally published this article on the AWS blog, here: https://aws.amazon.com/blogs/media/deep-dive-into-cors-configs-on-aws-s3/ For several weeks I’ve been trying to diagnose Cross-Origin Resource Sharing (CORS) errors in a web component I built for uploading files to AWS S3. This has been one of the hardest software defects I’ve had to solve in...
[Read More]
Running OpenCV as an AWS Lambda Function
This post describes how to package the OpenCV python library so it can be used in applications that run in AWS Lambda. AWS Lambda is a Function-as-a-Service (FaaS) offering from Amazon that lets you run code without the complexity of building and maintaining the underlying infrastructure. OpenCV is one of...
[Read More]
Data Management Strategies for Computer Vision
Computer Vision (CV) developers often find that the biggest barrier to success is data management, and yet so much of what you’ll find written about CV is about the algorithms, not the data. In this blog I’ll describe three separate data management strategies I’ve used with applications that process images. Through...
[Read More]
Business Innovation through Data Transformation
Today I presented at the Seattle Technology Leadership Summit, a gathering of CxOs and upper management from a variety of companies. I made the case that companies can become more competitive by innovating with data-intensive applications and (secondarily) that MapR provides the best data platform to make...
[Read More]
Using StreamSets and MapR together in Docker
In this post I demonstrate how to integrate StreamSets with MapR in Docker. This is made possible by the MapR persistent application client container (PACC). The fact that any application can use MapR simply by mapping /opt/mapr through Docker volumes is really powerful! Installing the PACC is a piece of...
[Read More]
Creating Data Pipelines for IoT with StreamSets
If you think building data pipelines requires advanced software development skills, think again. A company called StreamSets has created software which enables you to build data pipelines using a drag-and-drop GUI. It frees you from the burden of writing code with the application programming interfaces (APIs) needed to ingest data,...
[Read More]
The MapR-DB Connector for Apache Spark
MapR just released Python and Java support for their MapR-DB connector for Spark. It also supports Scala, but Python and Java are new. I recorded a video to help them promote it, but I also learned a lot in the process, relating to how databases can be used in Spark....
[Read More]
Predicting Time-Series data from OpenTSDB with RNNs in Tensorflow
I’ve been learning a lot of really interesting stuff about time-series data, lately. Over the past month I’ve learned how to consume Factory IoT sensor data from an MQTT server, process it in StreamSets, persist it in OpenTSDB, visualize it in Grafana, and forecast it with Tensorflow. It’s really amazing...
[Read More]
Predicting Forest Fires with Spark Machine Learning
Anytime you have lat/long coordinates, you have an opportunity to do data science with kmeans clustering and visualization on a map. This is a story about how I used geo data with kmeans clustering for a topic that has affected me personally - wildfires! Every summer...
[Read More]
Joining streams and NoSQL tables for Customer 360 analytics in Spark
“MapR-DB is the perfect database for Customer 360 applications”. That’s the tag line I used to describe a demo I created for MapR for the Strata Data Conference in New York in September of 2017. Describing Customer 360 as a use case for MapR-DB was the focus of this demo...
[Read More]
Using Tensorflow on a Raspberry Pi in a Chicken Coop
Ever since I first heard about Tensorflow and the promises of Deep Learning I’ve been eager to give it a whirl. Tensorflow is a powerful and easy-to-use library for machine learning. It was open-sourced by Google in November 2015. In less than 2 years it has become one...
[Read More]
How to plot data on maps in Jupyter using Matplotlib, Plotly, and Bokeh
If you’re trying to plot geographical data on a map then you’ll need to select a plotting library that provides the features you want in your map. And if you haven’t plotted geo data before then you’ll probably find it helpful to see examples that show different ways to do...
[Read More]
How to combine relational and NoSQL datasets with Apache Drill
It is rarely the case that enterprise data science applications can operate on data which is entirely contained within a single database system. Take for instance a company which wants to build a Customer 360 application that uses data sources across its enterprise to develop marketing campaigns or recommendation engines...
[Read More]
Visualizing K-Means Clusters in Jupyter Notebooks
The information technology industry is in the middle of a powerful trend towards machine learning and artificial intelligence. These are difficult skills to master but if you embrace them and just do it, you’ll be making a very significant step towards advancing your career. As with any learning curve, it’s...
[Read More]
How To Clone Virtual Machines in Azure
I use Azure a lot to create virtual machines for demos and application prototypes. It often takes me a long time to set up these rigs, so once I finally get things the way I like them I really don’t want to duplicate that effort. Fortunately, Azure lets us clone VMs....
[Read More]
Kafka vs MapR Streams Benchmark
A lot of people choose MapR as their core platform for processing and storing big data because of its advantages for speed and performance. MapR consistently performs faster than any other big data platform for all kinds of applications, including Hadoop, distributed file I/O, NoSQL data storage, and data streaming....
[Read More]
Automating MapR with MapR Stanzas
In my life as a technical marketer for MapR I have configured more clusters than you can shake a stick at. So, imagine my excitement when I heard that MapR installations can be automated with a new capability called “MapR Stanzas”. MapR Stanzas allow you to automate the MapR installation...
[Read More]
What's wrong with using small batch sizes in Kafka?
What is Kafka’s batch size? Kafka producers buffer unsent records for each partition, and each buffer’s size is set by the batch.size config. You can achieve higher throughput by increasing the batch size, but there is a trade-off between more batching and increased end-to-end latency. The larger your...
[Read More]
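To make that trade-off concrete, here is a minimal sketch of a producer configuration, assuming a local broker at localhost:9092; the batch.size and linger.ms values are illustrative, not recommendations:

```java
import java.util.Properties;

public class ProducerBatchConfig {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // batch.size is in bytes, per partition buffer; Kafka's default is 16384.
        // A larger value allows bigger batches and higher throughput.
        props.put("batch.size", "65536");
        // linger.ms delays sends so batches have time to fill, trading a little
        // end-to-end latency for better batching.
        props.put("linger.ms", "5");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println("batch.size=" + props.getProperty("batch.size"));
        System.out.println("linger.ms=" + props.getProperty("linger.ms"));
    }
}
```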
What data types are most suitable for fast Kafka data streams? [Part Two]
In my last post I explained how important it is to format data types as byte arrays rather than other types, such as POJOs or json objects, in order to achieve minimal overhead when serializing data records to Kafka’s native byte array format. However, although serialization may be faster byte...
[Read More]
What data types are most suitable for fast Kafka data streams? [Part One]
The data types you choose to represent data can have a big impact on how fast you can stream that data through Kafka. A typical Kafka pipeline includes multiple stages that access streaming data to perform some kind of operation. Each stage will typically need to consume messages...
[Read More]
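As a rough illustration of why payload representation matters, the sketch below compares a raw byte-array payload with the same data wrapped in a Serializable POJO; the Reading class and its fields are made up for the example:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class SerializationCompare {
    // Hypothetical record type standing in for a pipeline message.
    static class Reading implements Serializable {
        String sensor = "s1";
        double value = 42.0;
    }

    // A raw byte-array payload carries only the bytes themselves.
    static byte[] rawBytes() {
        return "s1,42.0".getBytes(StandardCharsets.UTF_8);
    }

    // Java object serialization adds class metadata on top of the field data.
    static byte[] pojoBytes() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Reading());
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("raw bytes:  " + rawBytes().length);
        System.out.println("POJO bytes: " + pojoBytes().length);
    }
}
```

The serialized POJO is many times larger than the raw bytes, and that extra size is paid on every record in the stream.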
How to deploy a 3 node Kafka cluster in Azure
In an earlier post I described how to set up a single node Kafka cluster in Azure so that you can quickly familiarize yourself with basic Kafka operations. However, most real-world Kafka applications will run on more than one node to take advantage of Kafka’s replication features for fault tolerance....
[Read More]
How to persist Kafka streams as JSON in No-SQL storage
Streaming data is like, “Now you see it. Now you don’t!” One of the challenges when working with streams, especially streams of fast data, is the transitory nature of the data. Kafka streams are characterized by a retention period that defines the point at which messages will be permanently deleted....
[Read More]
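The pattern can be sketched with a stand-in for the NoSQL table; in a real application the HashMap below would be a database client, and the JSON messages are hypothetical examples of what a consumer poll might return:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamToTable {
    // Stand-in for a NoSQL table keyed by a record id. A real app would call
    // a database API here instead of an in-memory map.
    static final Map<String, String> table = new HashMap<>();

    // Persist one consumed message so it outlives the stream's retention period.
    static void persist(String key, String json) {
        table.put(key, json);
    }

    public static void main(String[] args) {
        // Hypothetical JSON messages, as they might arrive from a consumer poll.
        List<String> polled = List.of(
            "{\"id\":\"a1\",\"temp\":21.5}",
            "{\"id\":\"a2\",\"temp\":22.1}");
        int offset = 0;
        for (String msg : polled) {
            persist("offset-" + offset++, msg);
        }
        System.out.println("persisted " + table.size() + " records");
    }
}
```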
How to Use JUnit to Optimize Throughput in Kafka Streams
Finding the optimal set of configurations for Kafka in order to achieve the fastest possible throughput for real-time/stream analytics can be a time-consuming process of trial and error. Automating that process with parametrized JUnit tests can be an excellent way to find optimal Kafka configurations without guesswork and...
[Read More]
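The idea can be sketched without JUnit as a plain parameter sweep; the throughput function below is a made-up stand-in for an actual timed Kafka benchmark run, and the candidate batch sizes are illustrative:

```java
import java.util.Arrays;
import java.util.List;

public class ThroughputSweep {
    // Stand-in workload model: per-record overhead shrinks as batching grows.
    // A real parametrized test would time actual producer sends here.
    static double throughput(int batchSize) {
        double perRecordOverheadMs = 0.01 + 10.0 / batchSize; // hypothetical
        return 1.0 / perRecordOverheadMs; // records per ms
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = Arrays.asList(1024, 16384, 65536, 262144);
        int best = -1;
        double bestTput = 0;
        for (int size : batchSizes) {
            double tput = throughput(size);
            System.out.printf("batch.size=%d -> %.1f records/ms%n", size, tput);
            if (tput > bestTput) {
                bestTput = tput;
                best = size;
            }
        }
        System.out.println("best batch.size=" + best);
    }
}
```

A parametrized JUnit test applies the same loop, with each candidate configuration becoming one test case and the framework reporting the results.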
How to quickly get started using Kafka
This post describes how to quickly install Apache Kafka on a one node cluster and run some simple producer and consumer experiments. Apache Kafka is a distributed streaming platform. It lets you publish and subscribe to streams of data like a messaging system. You can also use it to store...
[Read More]
How to "Right Size" VMs in Azure for Kafka Streaming
I’ve been using Azure for hosting a 3 node MapR cluster with which I’m running a streaming application that uses Kafka and Spark to process a fast data stream. My use case requires that I be able to ingest 1.7 GB of data into Kafka within 1 minute (approximately 227...
[Read More]
How To Debug Remote Spark Jobs With IntelliJ
Application developers often use debuggers to find and fix defects in their code. Attaching a debugger to a running application is straightforward when the runtime is local on a laptop but trickier when that code runs on a remote server. This is even more confusing for Big Data applications since...
[Read More]
How To Install MapR In Azure
Last week MapR released a new version of their Converged Data Platform. Today I installed it on Azure, and kept notes on all the commands I used. It’s possible to automate this installation, and I’m pretty sure MapR has documented how to do that, but until I find that doc,...
[Read More]
How To Monitor Kafka Apps With JProfiler
I’ve been spending a lot of time trying to maximize throughput for a Kafka data streaming pipeline. A large part of this effort has involved optimizations to data structures in my Java code. Generally speaking, anytime I use a data structure which is not a byte array, I sacrifice performance....
[Read More]
How To Build A Data Lake In 15 Minutes
When getting acquainted with new technologies I believe users should be able to get started without spending more than 15 minutes setting up a sandbox environment. But when it comes to setting up a big data cluster, 15 minutes is a lofty goal. However, it is possible to get started...
[Read More]
How To Use Java Serializers With Kafka
Apache Kafka is a distributed pub-sub messaging system that scales horizontally and has built-in message durability and delivery guarantees. It’s a cluster-based technology that has evolved from its origins at LinkedIn to become the de facto standard messaging system enterprises use to move massive amounts of data through transformation pipelines. I’ve...
[Read More]
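As a taste of what serializer configuration looks like, here is a minimal sketch assuming a local broker; StringSerializer and ByteArraySerializer are serializers that ship with the Kafka client library:

```java
import java.util.Properties;

public class SerializerConfig {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // A producer needs one serializer for record keys and one for values;
        // Kafka ships implementations for common types.
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildProps().getProperty("key.serializer"));
    }
}
```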