Visualizing K-Means Clusters in Jupyter Notebooks

The information technology industry is in the middle of a powerful trend towards machine learning and artificial intelligence. These are difficult skills to master, but if you embrace them and just do it, you’ll be taking a significant step towards advancing your career. As with any learning curve, it’s... [Read More]
Tags: machine learning, python, jupyter, kmeans, customer 360

How To Clone Virtual Machines in Azure

I use Azure a lot to create virtual machines for demos and application prototypes. It often takes me a long time to set up these rigs, so once I finally get things the way I like them, I really don’t want to duplicate that effort. Fortunately, Azure lets us clone VMs.... [Read More]
Tags: azure

Kafka vs MapR Streams Benchmark

A lot of people choose MapR as their core platform for processing and storing big data because of its advantages for speed and performance. MapR consistently performs faster than any other big data platform for all kinds of applications, including Hadoop, distributed file I/O, NoSQL data storage, and data streaming.... [Read More]
Tags: performance, kafka, mapr streams

Automating MapR with MapR Stanzas

In my life as a technical marketeer for MapR, I have configured more clusters than you can shake a stick at. So, imagine my excitement when I heard that MapR installations can be automated with a new capability called “MapR Stanzas”. MapR Stanzas allow you to automate the MapR installation... [Read More]
Tags: mapr, automation

What's wrong with using small batch sizes in Kafka?

What is Kafka’s batch size? Kafka producers will buffer unsent records for each partition. These buffers are of a size specified by the batch.size config. You can achieve higher throughput by increasing the batch size, but there is a trade-off between more batching and increased end-to-end latency. The larger your... [Read More]
Tags: java, kafka

How to deploy a 3 node Kafka cluster in Azure

In an earlier post I described how to set up a single-node Kafka cluster in Azure so that you can quickly familiarize yourself with basic Kafka operations. However, most real-world Kafka applications will run on more than one node to take advantage of Kafka’s replication features for fault tolerance.... [Read More]
Tags: azure, kafka, cluster

How to persist Kafka streams as JSON in No-SQL storage

Streaming data is like, “Now you see it. Now you don’t!” One of the challenges when working with streams, especially streams of fast data, is the transitory nature of the data. Kafka streams are characterized by a retention period that defines the point at which messages will be permanently deleted.... [Read More]
Tags: mapr, kafka, maprdb, drill, json

How to Use JUnit to Optimize Throughput in Kafka Streams

Finding the optimal set of configurations for Kafka in order to achieve the fastest possible throughput for real-time/stream analytics can be a time-consuming process of trial and error. Automating that process with parametrized JUnit tests can be an excellent way to find optimal Kafka configurations without guesswork and... [Read More]
Tags: Kafka, Performance, JUnit, R

How to quickly get started using Kafka

This post describes how to quickly install Apache Kafka on a one-node cluster and run some simple producer and consumer experiments. Apache Kafka is a distributed streaming platform. It lets you publish and subscribe to streams of data like a messaging system. You can also use it to store... [Read More]
Tags: azure, mapr, kafka

How to "Right Size" VMs in Azure for Kafka Streaming

I’ve been using Azure to host a 3-node MapR cluster on which I’m running a streaming application that uses Kafka and Spark to process a fast data stream. My use case requires that I be able to ingest 1.7 GB of data into Kafka within 1 minute (approximately 227... [Read More]
Tags: azure, kafka

How To Debug Remote Spark Jobs With IntelliJ

Application developers often use debuggers to assist with application development and fix problems in their code. Typically, developers run and debug their applications locally on their workstation. However, one of the challenges with developing big data applications is that they’re designed to be run on a multi-node cluster. This presents... [Read More]
Tags: intellij, spark

How To Install MapR In Azure

Last week MapR released a new version of their Converged Data Platform. Today I installed it on Azure, and kept notes on all the commands I used. It’s possible to automate this installation, and I’m pretty sure MapR has documented how to do that, but until I find that doc,... [Read More]
Tags: azure, mapr

How To Monitor Kafka Apps With JProfiler

I’ve been spending a lot of time trying to maximize throughput for a Kafka data streaming pipeline. A large part of this effort has involved optimizations to data structures in my Java code. Generally speaking, anytime I use a data structure which is not a byte array, I sacrifice performance.... [Read More]
Tags: jprofiler, kafka

How To Build A Data Lake In 15 Minutes

When getting acquainted with new technologies I believe users should be able to get started without spending more than 15 minutes setting up a sandbox environment. But when it comes to setting up a big data cluster, 15 minutes is a lofty goal. However, it is possible to get started... [Read More]
Tags: azure, mapr

How To Use Java Serializers With Kafka

Apache Kafka is a distributed pub-sub messaging system that scales horizontally and has built-in message durability and delivery guarantees. It’s a cluster-based technology and has evolved from its origins at LinkedIn to become the de facto standard messaging system enterprises use to move massive amounts of data through transformation pipelines. I’ve... [Read More]
Tags: java, kafka