In this post I demonstrate how to integrate StreamSets with MapR in Docker. This is made possible by the MapR Persistent Application Client Container (PACC). Any application can use MapR simply by mounting /opt/mapr through Docker volumes, which is really powerful! Installing the PACC is a piece of cake, too.

Introduction

I use StreamSets a lot for creating and visualizing data pipelines. I recently realized I’d been installing it the hard way, by downloading the tar installer. Now I run it in Docker, and I like the isolation and reproducibility that provides.

To use StreamSets with MapR, the mapr-client package needs to be installed on the StreamSets host. Alternatively, and this is the important part, you can run a separate CentOS Docker container that has the mapr-client package installed and share /opt/mapr from it as a Docker volume with the StreamSets container. I like this approach because the MapR installer (mapr-setup.sh, downloaded in the first step below) can configure a mapr-client container for me! MapR calls this container the Persistent Application Client Container (PACC).

Here is the procedure I used to create and configure the PACC and StreamSets in Docker:

Start the MapR Client in Docker

Here’s a short video showing how to create, configure, and run the PACC:

For more information about creating the PACC image, see https://maprdocs.mapr.com/home/AdvancedInstallation/CreatingPACCImage.html.

Here are the steps I used for creating the PACC:

wget http://package.mapr.com/releases/installer/mapr-setup.sh -P /tmp
bash /tmp/mapr-setup.sh docker client
vi /tmp/docker_images/client/mapr-docker-client.sh
  # Set these properties:
  # MAPR_CLUSTER=nuc.cluster.com
  # MAPR_CLDB_HOSTS=10.0.0.10
  # MAPR_MOUNT_PATH=/mapr
  # MAPR_DOCKER_ARGS="-v /opt/mapr --name mapr-client"
bash /tmp/docker_images/client/mapr-docker-client.sh
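
If you’d rather not edit the script by hand in vi, the same properties can be set with a small helper. This is only a sketch: set_prop is a name I made up for illustration, and it assumes the generated script keeps each property on its own line in KEY=value form, so check that against your copy of mapr-docker-client.sh before relying on it. It also assumes GNU sed (for the -i flag) and values that contain no | or & characters.

```shell
# set_prop FILE KEY VALUE (hypothetical helper, not part of the MapR tooling)
# Replaces an existing "KEY=..." line in FILE, or appends one if none exists.
set_prop() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    # Use | as the sed delimiter so values containing / are handled.
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example usage, with the values from the steps above:
#   f=/tmp/docker_images/client/mapr-docker-client.sh
#   set_prop "$f" MAPR_CLUSTER nuc.cluster.com
#   set_prop "$f" MAPR_CLDB_HOSTS 10.0.0.10
#   set_prop "$f" MAPR_MOUNT_PATH /mapr
#   set_prop "$f" MAPR_DOCKER_ARGS '"-v /opt/mapr --name mapr-client"'
```

The --name mapr-client argument matters later, because the StreamSets container refers to the PACC by that name.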

Start StreamSets in Docker

In another terminal session, start the StreamSets Docker container with the following command:

docker run --restart on-failure -it -p 18630:18630 -d --volumes-from mapr-client --name sdc streamsets/datacollector

Normally we would need to install the MapR client on the StreamSets host, but since --volumes-from mounts /opt/mapr from the PACC into the StreamSets container, the container already has it!
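
A quick way to confirm the volume really is shared is to look for the MapR client binary from inside the StreamSets container. The helper below is a minimal sketch: has_mapr_client is a hypothetical name, and it assumes the image ships a test utility, which I’d expect but haven’t confirmed for every image version.

```shell
# has_mapr_client CONTAINER (hypothetical helper)
# Succeeds if /opt/mapr/bin/mapr exists and is executable inside CONTAINER.
has_mapr_client() {
  local container="$1"
  docker exec "$container" test -x /opt/mapr/bin/mapr
}

# Example usage:
#   if has_mapr_client sdc; then
#     echo "MapR client is visible in sdc"
#   else
#     echo "MapR client not found; is the mapr-client container running?"
#   fi
```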

Now go to the StreamSets Package Manager and install the MapR libraries:

You’ll see several MapR packages in StreamSets.

  • MapR 6.0.0
  • MapR 6.0.0 MEP 4
  • MapR Spark 2.1.0 MEP 3

You’ll want to install the first one, “MapR 6.0.0”. That package lets you use MapR filesystem, MapR-DB, and MapR Streams. If you want Hive and cluster mode execution, then install “MapR 6.0.0 MEP 4” as well as “MapR 6.0.0”. If you want Spark, then also install “MapR Spark 2.1.0 MEP 3”.
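
The selection rule above can be written down as a small function, which I find easier to scan than the prose. This is just an illustration of the rule as I’ve described it: mapr_packages_for and its feature keywords (fs, db, streams, hive, cluster, spark) are names I invented, and the output is the display names from the Package Manager, not installable package IDs.

```shell
# mapr_packages_for FEATURE... (hypothetical helper)
# Prints the Package Manager entries to install for the requested features.
mapr_packages_for() {
  local feature need_mep=0 need_spark=0
  for feature in "$@"; do
    case "$feature" in
      fs|db|streams) ;;              # covered by the base package
      hive|cluster)  need_mep=1 ;;   # Hive and cluster mode execution
      spark)         need_spark=1 ;;
      *) echo "unknown feature: $feature" >&2; return 1 ;;
    esac
  done
  echo "MapR 6.0.0"                  # always required
  if [ "$need_mep" -eq 1 ]; then echo "MapR 6.0.0 MEP 4"; fi
  if [ "$need_spark" -eq 1 ]; then echo "MapR Spark 2.1.0 MEP 3"; fi
}

# Example usage:
#   mapr_packages_for fs streams
#   mapr_packages_for hive spark
```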

For more details on why the MapR package was split up like this, see this particular commit https://github.com/streamsets/datacollector/commit/9452a03489ddf8ae2af81be9afaa904c7e766a55#diff-fd75725ca8cdddff01e7533e9b740e44

After you install the package, don’t forget to run the setup-mapr command as described in the setup guide.

You’ll be prompted to restart StreamSets. After it’s restarted, run these commands to finish the MapR setup:

# Open a root shell inside the running StreamSets container
docker exec -u 0 -it sdc /bin/bash
# The remaining commands run inside that shell
export SDC_HOME=/opt/streamsets-datacollector-3.3.0
export SDC_CONF=/etc/sdc
# Put the MapR client classpath on the StreamSets classpath at startup
echo "export CLASSPATH=\`/opt/mapr/bin/mapr classpath\`" >> "$SDC_HOME/libexec/sdc-env.sh"
"$SDC_HOME/bin/streamsets" setup-mapr
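
One thing to watch: the echo with >> appends a new line every time it runs, so repeating these steps leaves duplicate CLASSPATH lines in sdc-env.sh. A minimal sketch of an idempotent append follows; append_once is a hypothetical helper name, not something StreamSets provides.

```shell
# append_once FILE LINE (hypothetical helper)
# Appends LINE to FILE only if FILE does not already contain that exact line.
append_once() {
  local file="$1" line="$2"
  # -x matches whole lines, -F treats LINE as a fixed string, not a regex.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example usage inside the container (single quotes keep the backticks literal):
#   append_once /opt/streamsets-datacollector-3.3.0/libexec/sdc-env.sh \
#     'export CLASSPATH=`/opt/mapr/bin/mapr classpath`'
```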

Restart StreamSets again from the gear menu.
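
StreamSets takes a little while to come back after a restart. If you’re scripting any of this, a polling loop like the one below saves guessing. It’s a sketch under a couple of assumptions: wait_for_url is a hypothetical helper, curl is installed on the Docker host, and the UI is reachable at http://localhost:18630 because of the -p 18630:18630 port mapping used earlier.

```shell
# wait_for_url URL [TRIES] [DELAY_SECONDS] (hypothetical helper)
# Polls URL until curl gets a non-error HTTP response, or gives up.
wait_for_url() {
  local url="$1" tries="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$tries" ]; do
    # --fail makes curl exit non-zero on HTTP errors (status 400 and above).
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example usage:
#   wait_for_url http://localhost:18630 60 2 && echo "StreamSets is up"
```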

When it comes up you will be able to use MapR in StreamSets data pipelines. Here’s a basic pipeline example that saves the output of tailing a file to a file on MapR-FS:


Please provide your feedback on this article by adding a comment to https://github.com/iandow/iandow.github.io/issues/11.