ELK Stack for Kubernetes Metrics

Part of the work of the Platform Services team at Alfresco is around deployment, specifically Kubernetes.

Monitoring a Kubernetes (K8s) cluster is critical in production, and there are many great tools out there like Prometheus and Grafana, but I wanted to ‘play’ with getting K8s metrics into the Elasticsearch, Logstash, and Kibana (ELK) stack, all deployed in my local Docker Desktop Kubernetes environment.

I found several resources on various parts of the solution, but didn’t find one that tied all of it together, so hopefully this post will be of use to someone.

Prerequisites

You’ll need a local Kubernetes environment (in my case Docker Desktop with Kubernetes enabled) and Helm installed.

1. Deploy the ELK Stack

Helm provides excellent package management for K8s and is always my first choice for deployment, so I started there for the ELK stack, and sure enough, there’s a Helm chart for that (elastic-stack).

You can easily override/merge a chart’s default values with your own values file, so I created one that enables filebeat, points its output at logstash (and its index template load at the elasticsearch client), and points logstash at the elasticsearch client:

my-elastic-stack.yaml:

logstash:
  enabled: true
  elasticsearch:
    host: elastic-stack-elasticsearch-client

filebeat:
  enabled: true
  config:
    output.file.enabled: false
    output.logstash:
      hosts: ["elastic-stack-logstash:5044"]
  indexTemplateLoad:
    - elastic-stack-elasticsearch-client:9200

Then deployed with those values:

helm install --name elastic-stack stable/elastic-stack -f ~/Desktop/my-elastic-stack.yaml

You’ll see some output detailing what you’ve deployed and you can check the status of the pods with kubectl:

kubectl get pods -l "release=elastic-stack"

2. Port Forward Kibana and Set Up an Index

Once the pods are all showing proper READY status (which may take a minute or two) you can set up port forwarding to the Kibana pod:

export POD_NAME=$(kubectl get pods --namespace default -l "app=kibana,release=elastic-stack" -o jsonpath="{.items[0].metadata.name}");
kubectl port-forward --namespace default $POD_NAME 5601:5601

Then open http://127.0.0.1:5601 in your browser to access the Kibana UI.

You can then, of course, create an index pattern for filebeat and browse those entries.
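If you’d rather confirm the index from the command line than from the Kibana UI, here’s a quick sketch: port-forward the Elasticsearch client service (its name comes from the chart release above) and list the filebeat indices.

```shell
# Forward the Elasticsearch client service to localhost (backgrounded)
kubectl port-forward --namespace default svc/elastic-stack-elasticsearch-client 9200:9200 &

# Give the forward a moment to establish, then list any filebeat indices
sleep 2
curl -s "http://127.0.0.1:9200/_cat/indices/filebeat-*?v"
```

A healthy deployment should show at least one filebeat-* index with a growing docs.count.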

3. Deploy kube-state-metrics

The Metricbeat Kubernetes module (which we’ll deploy in a moment) relies on kube-state-metrics for cluster-state metrics. Sure enough, there’s a chart for that (kube-state-metrics):

helm install --name kube-state-metrics stable/kube-state-metrics
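To sanity-check that kube-state-metrics is up and serving, you can port-forward its service and sample the Prometheus-format metrics it exposes. A sketch, assuming the service name matches the release name used above:

```shell
# Forward the kube-state-metrics service (its default port is 8080)
kubectl port-forward svc/kube-state-metrics 8080:8080 &

# Sample a few cluster-state metrics; kube_pod_status_phase is one of many
sleep 2
curl -s http://127.0.0.1:8080/metrics | grep '^kube_pod_status_phase' | head -5
```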

4. Deploy Metricbeat

Metricbeat collects (easy guess) metrics and, through its Kubernetes module, can get data from K8s nodes and the API. Sure enough, (you know where this is going…) there’s a chart for that (metricbeat).

In this case we have to override values to:

  • Use the open source version of the metricbeat Docker image (metricbeat-oss)
  • Point the daemonset Kubernetes module to the proper K8s SSL metrics host (and disable SSL verification)
  • Point the deployment to the elasticsearch client and kibana in the cluster

my-elastic-metricbeat.yaml:

image:
  repository: docker.elastic.co/beats/metricbeat-oss
daemonset:
  config:
    output.file:
    output.elasticsearch:
      hosts: ["elastic-stack-elasticsearch-client:9200"]
  modules:
    kubernetes:
      config:
        - module: kubernetes
          metricsets:
            - node
            - system
            - pod
            - container
            - volume
          period: 10s
          host: ${NODE_NAME}
          hosts: ["https://${HOSTNAME}:10250"]
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          ssl.verification_mode: "none"
deployment:
  config:
    output.file:
    output.elasticsearch:
      hosts: ["elastic-stack-elasticsearch-client:9200"]
    setup.kibana:
      host: "elastic-stack-kibana:443"
    setup.dashboards.enabled: true

Then deploy with those values:

helm install --name elastic-metricbeat stable/metricbeat -f ~/Desktop/my-elastic-metricbeat.yaml

5. Set Up the Metricbeat Index and K8s Dashboard

Metricbeat will start feeding K8s metrics to Elasticsearch, and should set up another index pattern and load several dashboards into Kibana.
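Before heading to Kibana you can verify both halves from the command line; a sketch that checks the metricbeat pods and then port-forwards the Elasticsearch client service to count the documents arriving:

```shell
# One metricbeat pod per node (the daemonset) plus one deployment pod
kubectl get pods -l "release=elastic-metricbeat"

# Forward the Elasticsearch client and count documents in the metricbeat indices
kubectl port-forward --namespace default svc/elastic-stack-elasticsearch-client 9200:9200 &
sleep 2
curl -s "http://127.0.0.1:9200/metricbeat-*/_count?pretty"
```

The count should climb steadily as the 10-second collection periods tick by.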

You can then open the [Metricbeat Kubernetes] Overview dashboard.

Of course, there are lots of ways this could be done, and it could all easily be tied together in a parent Helm chart for easier release management.

Tools like Prometheus and Grafana may be targeted more specifically for this sort of task out of the box, but it’s great to see the Elastic stack as another viable option.

Are any of you using the ELK stack for K8s monitoring in production?

Daemontools and File Permissions

Contegix turned me on to daemontools, which is a great way to manage services you need to keep alive. A great use for it is managing multiple Apache Tomcat instances, each of which can easily be configured to run under a different user with different options.

I was recently running into a problem where a webapp running in Tomcat didn’t have access to a particular directory mounted via FUSE. I could verify that the user Tomcat was running under had access to the directory by logging in as that user and creating a file there. WTF?

Turns out the run script being used for the Tomcat service was using the daemontools program setuidgid, i.e.

exec setuidgid ${USER+"$USER"} ./bin/catalina.sh run

and the daemontools manual for that program states that it removes “all supplementary groups”. Doh!

The Tomcat user’s access to this particular directory is in fact granted through a secondary group membership, and in this case it should really stay that way. setuidgid only takes a single account argument that’s used for both the user and group, so I couldn’t change the running group, and there’s no option to keep the user’s secondary groups.

I opted to use su instead of setuidgid, which leaves the secondary groups intact and gave the desired access to the directory:

exec su ${USER+"$USER"} -c "./bin/catalina.sh run"
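The difference is easy to see by comparing the groups each approach leaves the process with. A sketch, assuming a tomcat user whose directory access comes from a secondary group (run as root):

```shell
# Groups the user normally belongs to, including secondary groups
id -Gn tomcat

# Under setuidgid, only the primary group survives
setuidgid tomcat id -Gn

# su keeps the supplementary groups intact
su tomcat -c "id -Gn"
```

The middle command should print only the primary group, while the other two list the full set.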

Run catalina, run.

What to Look for in Managed Hosting

If you find the need to outsource some or all of your server environment during your bid to take over the world, remember that there’s a huge difference between a hosting provider that merely gives you access to hardware, be it physical or virtual, and one who truly manages those servers with you. Here are a few traits to consider that can prove invaluable, not only in day-to-day operations but when it comes time to upgrade, expand, or add to your deployment:

Level of Competence

No single provider can excel in every area for which you might need a server. An Asterisk VoIP deployment has very different requirements from a LAMP app, for example. On the other hand, most apps one might deploy these days require knowledge of quite a few technologies, and most providers will probably tell you they can do whatever you need, so it’s important that you vet the provider’s competency in the areas that will benefit you the most.

Of course the level of competence in any area can vary from complete jackassery, to being able to solve a problem after exhaustive research, to being real innovators in a space, on the forefront of the subject, even disseminating their knowledge through blogging or other means.

A Good Trouble Ticket System

Phone support is often overrated and inefficient, especially when it concerns servers. A phone call to support is almost always going to spawn a support ticket at some point, requiring follow-ups, attachments, etc. so it just makes sense to initiate it there.

A good trouble ticket system will allow you to email support with your issue, identifying your account with your email address, and will maintain any CCs you included on the original request.

Seamless Handoff Between Techs

When an issue reaches a support tech on the provider’s end who might not be the best qualified to handle it, the transition from that tech to another is a crucial step in the problem-solving process. The customer shouldn’t be responsible for bringing each new tech assigned to the ticket up to speed.

Great support departments handle that transition so well that you have to double check the ticket to see who you’re replying to.

Specialization Rather Than Escalation

Most of us have had trouble tickets rise through tiers of support where the first is really just gathering information or digging through their knowledgebase for a previous answer, the second has some problem solving skills, and the third or higher are some form of ‘engineering’.

It’s fun to screw with first-tier techs and tell them whatever they need to hear to escalate the ticket, but it’s more fun when you get the sense that all of your provider’s support staff are ‘engineers’. If a ticket does get handed off to another tech, it’s most likely due only to a shift change, or because the second tech really knows the crap out of the subject and the support team thinks that person is the best to tackle it.

Response Time

Whether the issue is handled via email, ticket system, instant message, or other medium, response time is key.

Most deployment servers are mission critical (for at least someone’s mission) and downtime of more than a few hours can cause rapid hair loss.

Even if an issue isn’t resolved immediately, some indicator that the provider is looking into it gives great peace of mind.

Monitoring

It’s no secret that you and your support staff love dealing with angry users, but it’s often beneficial to have trouble brought to your attention before your users notice. Maintaining resource monitoring and alert systems is a job in itself and a good provider will let you know not just when your server or app has been down for a while but when it’s getting close to consuming its allocated resources, is under some sort of malicious attack, or otherwise exhibiting odd behavior.

When done right the level of monitoring is so good that you feel like a jerk when you forget to tell the provider you’re going to work on your own application and will be taking it down.

They Take Action

Being notified that something isn’t working is one thing, but having a provider that takes obvious first steps to solve the problem, like restarting a service, can save all sorts of time and may remove the need for you to get involved at all.

They Call You

There are times when phone support does come in handy.

No one likes to be woken up in the middle of the night because an app has crashed and can’t be brought back up, but if the provider’s team has done everything they can without interaction from you, and you’ve selfishly closed your eyes for a few hours without a mechanism to poke you for each new email received, a phone call may be the only recourse, and it’s certainly better than waking up to find that a critical service has been down for hours.

Trust

It’s only when you really trust your provider that you can realize the true value of their services.

When you trust them to install and setup a third party application rather than tackling it yourself you save the inevitable hours searching through documentation and forums to resolve some stupid issue unique to your deployment environment.

When you trust their recommendation on which app to use to fulfill some need you can save hours of additional research time.

When you trust that their network, power, and, most importantly, people infrastructure can handle your app when it becomes the mega success you’ve dreamt it will be, you’re free to dream of the next.

If trust could be quantified and put on a feature list we’d certainly have an easier job of comparing these folks, but as we all know, this one can only be earned over time.

An Unsolicited Shout Out

I can say from experience that Contegix excels in all of these areas. I’m in no way affiliated with them, just a satisfied customer.

Good luck in your search.