On Code, And Other Things: operations


Saturday, 1 February 2020

Automating Heap Dumps For Java Containers in Google Kubernetes Engine

Heap dumps are an indispensable tool for debugging memory issues in Java processes. The typical way to take one is with the jmap command:

jmap -dump:format=b,file=/tmp/heap.dump 2592

This will trigger a heap dump for the process with id 2592 (assuming it's a Java process) and store it in a file /tmp/heap.dump. This can be analyzed later with a heap dump analyzer like jhat, MAT, or VisualVM - which are free tools.

Triggering a heap dump for a Java process that is running in a container, inside a pod, in a Google Kubernetes Engine (GKE) cluster node is not so straightforward. There are many layers of infrastructure that you have to cross to get at the Java process. Your process would usually run as part of a managed abstraction like a Deployment or a StatefulSet in your Kubernetes cluster. Your starting point would be just the pod name.

But,

- Knowing the pod name is not enough - you also have to locate the cluster node where it's running and ssh into it.

- A GKE node might be running many pods, and many Java processes - you have to identify the correct one once you have ssh'ed into it. "docker ps" can help here.

- The GKE node might not have jmap. It's not straightforward to install the JDK there because it would typically be running COS. So you have to get inside the container and trigger the dump.

- You have to copy the dump to an accessible location, maybe a GCS bucket, from where you can download it to analyze. Uploading to a GCS bucket requires gsutil, which is not present by default in a COS node.

I have automated this entire process using just shell scripts and gcloud commands. The source code is on GitHub. The scripts also use the toolbox utility, which Google provides as a container for running debug tools. Invoking "toolbox" inside your GKE node launches this container.

These scripts have an assumption which might not be valid for your cluster - I'll point it out at the relevant point in the code.

Here's a step by step explanation of the flow.

There are three shell scripts. k8s-debug-client.sh is the one you run from your dev box or bastion host. It invokes k8s-debug-vm.sh inside the GKE cluster node (i.e. the VM), which in turn invokes k8s-debug-toolbox.sh.

First we find out the node on which the pod is running

node_name=`kubectl get pod ${pod_name} -o json | jq '. | .spec.nodeName'`

and get its public IP

public_ip=`gcloud compute instances list --filter="name=(${node_name})" --format="value(networkInterfaces[].accessConfigs[0].natIP)"`

copy the other two scripts to it

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${keyfile} k8s-debug-vm.sh k8s-debug-toolbox.sh ${user}@${public_ip}:

and trigger them

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${keyfile} ${user}@${public_ip} sh k8s-debug-vm.sh ${pod_name} ${action} ${bucket}

I turn off the ssh warnings as I trigger this from a CI system on demand. If you run them manually, you can remove the -o options.
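
If the pod is not scheduled yet, or the node has no external IP, the lookups above return an empty string and the later scp and ssh calls fail with confusing errors. Here is a minimal sketch of guarded lookups. The function names pod_node and node_public_ip are my own and are not taken from the scripts on GitHub. The first one uses kubectl's jsonpath output instead of jq, which also avoids the quotes that jq prints around strings by default.

```sh
# Hypothetical helpers, not part of the original scripts.

# Print the node a pod is scheduled on, or fail if the lookup returns nothing.
pod_node() {
    node=$(kubectl get pod "$1" -o jsonpath='{.spec.nodeName}') || return 1
    if [ -z "${node}" ]; then
        echo "pod $1 is not scheduled on any node" >&2
        return 1
    fi
    printf '%s\n' "${node}"
}

# Print the external IP of a node, or fail if it has none.
node_public_ip() {
    ip=$(gcloud compute instances list --filter="name=($1)" \
        --format="value(networkInterfaces[].accessConfigs[0].natIP)") || return 1
    if [ -z "${ip}" ]; then
        echo "node $1 has no public IP" >&2
        return 1
    fi
    printf '%s\n' "${ip}"
}

# Possible use in the client script:
#   node_name=$(pod_node "${pod_name}") || exit 1
#   public_ip=$(node_public_ip "${node_name}") || exit 1
```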

Inside the GKE node, k8s-debug-vm.sh figures out the correct container id and uses docker exec to trigger a heap dump inside it.

container_id=`docker ps | grep ${pod_name} | grep -v "POD" | awk '{print $1}'`
docker exec ${container_id} sh -c "jmap -dump:format=b,file=heap.dump 1"
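
The grep pipeline above can return more than one container id, for example when one pod name is a prefix of another, and docker exec would then fail or hit the wrong container. The sketch below does the same filtering but insists on exactly one match. The helper name pick_container_id is hypothetical. Like the grep above, it assumes that the pause container is the only line containing "POD".

```sh
# Hypothetical helper: reads `docker ps` output on stdin and prints the id of
# the single non-pause container whose line mentions the given pod name.
pick_container_id() {
    # Literal substring match on the pod name, skipping lines that contain "POD".
    ids=$(awk -v pod="$1" 'index($0, pod) && !index($0, "POD") {print $1}')
    # Count the non-empty lines in the result.
    count=$(printf '%s' "${ids}" | awk 'NF {n++} END {print n+0}')
    if [ "${count}" -ne 1 ]; then
        echo "expected exactly 1 container for $1, found ${count}" >&2
        return 1
    fi
    printf '%s\n' "${ids}"
}

# Possible use on the node:
#   container_id=$(docker ps | pick_container_id "${pod_name}") || exit 1
```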

Note that the heap dump is inside the container, and not in the VM. You may not be able to push it to a GCS bucket from the container as there is no gsutil and no permissions. So we need to copy the dump to the VM. We copy it to a uniquely named file.

dttime=`date '+%d-%b-%Y-%H-%M-%S'`
filename=${pod_name}-${dttime}.hdump
docker cp ${container_id}:heap.dump ${filename}

Now you have the dump file in the VM but there is no gsutil. So you need to invoke toolbox, which has gsutil inside it. Does that mean we need to copy the dump inside toolbox now? No, because toolbox mounts several useful directories by default from the VM it's running on.

So we just invoke toolbox and pass it the path to k8s-debug-toolbox.sh as it appears from inside toolbox. The script is in the home directory of the user you are logged in as on the VM, and that home directory is also mounted inside toolbox.
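
The path translation can be sketched as below. It assumes that toolbox exposes the VM's root filesystem under /media/root, which is what the upload command later in this post relies on. The helper name toolbox_path, the TOOLBOX_ROOT variable and the arguments in the commented invocation are hypothetical, and the invocation itself assumes that toolbox runs whatever command it is given as arguments.

```sh
# Assumption: the VM's root filesystem is visible inside toolbox under this path.
TOOLBOX_ROOT=${TOOLBOX_ROOT:-/media/root}

# Translate an absolute path on the VM into the path seen from inside toolbox.
toolbox_path() {
    case $1 in
        /*) printf '%s%s\n' "${TOOLBOX_ROOT}" "$1" ;;
        *)  echo "toolbox_path: need an absolute path, got '$1'" >&2
            return 1 ;;
    esac
}

# Hypothetical invocation on the node (script arguments are illustrative):
#   toolbox sh "$(toolbox_path "${HOME}/k8s-debug-toolbox.sh")" "${filename}" "${bucket}"
```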

Inside toolbox, we can use gsutil to upload the dump file (which is also available inside toolbox because it's in the home directory of the user you are logged in as in the VM). But here's a catch. gsutil requires permissions to upload to a GCS bucket. One way to provide them is with a service account key file in JSON format. But how does it get to the VM?

This is the caveat I mentioned above. In the infrastructure I manage, almost every Java pod has a config map with a permissions file that the pod uses to access Google Cloud services. This file is accessible as a mounted directory inside toolbox, so, voila!

/google-cloud-sdk/bin/gcloud auth activate-service-account --key-file=${dir}/key.json
/google-cloud-sdk/bin/gsutil cp /media/root/home/${user}/${filename} gs://${bucket}/kdev-debug/${filename}

If you don't have this shortcut, you will need to set the permissions somehow. There are multiple ways of doing it - one being to run a custom container instead of toolbox that has gsutil installed and can mount a config map which has the permissions. Another is to upload the permissions file to the VM when you run the command, use it from inside toolbox, and then delete it. The second one is a tad risky.
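
For the second option, the main risk is that the key stays behind on the VM when something fails midway. Here is a minimal sketch of a wrapper that always deletes its copy of the key, whether the wrapped command succeeds or not. The name with_temp_key and the KEY_FILE variable are hypothetical and not part of the scripts on GitHub. It does not protect against the shell being killed before the cleanup runs.

```sh
# Hypothetical wrapper: copies a key file to a private temp file, runs the
# given command with KEY_FILE pointing at the copy, then removes the copy.
with_temp_key() {
    key_src=$1
    shift
    key_tmp=$(mktemp) || return 1
    # Restrict access before any key material is written to the file.
    chmod 600 "${key_tmp}"
    status=0
    if cp "${key_src}" "${key_tmp}"; then
        # Run the wrapped command and remember its exit status.
        KEY_FILE=${key_tmp} "$@" || status=$?
    else
        status=1
    fi
    # Delete the copy whether or not the command succeeded.
    rm -f "${key_tmp}"
    return "${status}"
}

# Possible use (command is illustrative):
#   with_temp_key key.json sh -c 'gcloud auth activate-service-account --key-file="$KEY_FILE"'
```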

These scripts can be modified to be usable for any Kubernetes cluster and not just GKE. Most of the changes will be in the commands that fetch the list of running nodes. If you are using another OS for your K8S VMs, you can install Java directly on the VM and trigger the dump, after you find out the mapping between the container ids and the process ids as visible from the VM.
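
For the container id to process id mapping, docker inspect reports the host PID of a container's main process in its State.Pid field. A small sketch, with a hypothetical helper name, is below. It only covers the lookup. Whether jmap on the VM can then attach to that process depends on your JDK and container setup, which I have not covered here.

```sh
# Hypothetical helper: print the host PID of a container's main process.
container_host_pid() {
    pid=$(docker inspect --format '{{.State.Pid}}' "$1") || return 1
    # Docker reports 0 for a container that is not running.
    case ${pid} in
        ''|0|*[!0-9]*)
            echo "container $1 is not running" >&2
            return 1 ;;
    esac
    printf '%s\n' "${pid}"
}

# Possible use on the VM:
#   host_pid=$(container_host_pid "${container_id}") || exit 1
```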

Saturday, 20 October 2012

Lessons learned while managing technical operations

for a cloud based SaaS product, and which might be useful to you if you’re doing the same.

- There is no substitute for knowing your fundamentals. Whatever you’re managing - your own datacenter or a suite of apps on a public cloud - you have to know your Operating Systems, your Computer Networking, your Linux, your VMs.
- Know your tools. Find out what tools you need to monitor, maintain and debug your systems. Know how they work, keep up with updates and play with them often. It will save you time when the crisis hits.
- Know one editor and know it well - vim, emacs or other. Know all the common shortcuts, complicated copy and paste routines, tips and tricks - in times of crisis, every second counts.
- Learn a little every day. Share what you learn even if you think nobody’s listening. Soon you’ll find like-minded people you can share ideas with.
- Visibility to other teams of what you’re doing is very important. Graph it, present it, blog and talk about it.
- Try to fill your team with the right people. The best people in technical operations have an eye for detail but do not lose sight of the big picture. They make good split-second decisions and are experts at prioritizing in times of crisis. And of course, they know their stuff or are smart enough to figure it out if they don’t.
- Know your industry. Study what others are doing, and why.
- Keep up to date. Know what is new in your field - subscribe to the best newsletters, RSS feeds and podcasts, and follow the best conferences. There is a lot of noise, so take the time to sift out the useful parts, adopt what is good for your operations and forget the rest.
- Keep an open mind. Fads will come and go, old ideas will be repackaged and sold with a new coating every few years. Whatever the case, keep up with trends - they always have something to teach.
- Know your organization’s business. Interface and build relationships with all teams. If you cut away the trappings of the DevOps movement, the most important point that remains is collaboration. How you achieve it depends on you.

Monday, 8 August 2011

The operational mentality in software development

A talk by Theo Schlossnagle spurred this line of thought. The talk is described as being "about the evolution of a career in web operations", but he talks more about why developers need to think operationally. In other words, he takes a different stance on the meaning of DevOps from the prevalent one. DevOps is usually described as increased collaboration between development and operations, with knowledge sharing between the two groups leading to better delivery.

Theo Schlossnagle is of the view that developers need to take the operational view when writing code. It's a very valid point of view and something that I suspect most of us overlook.

I'm going to expand on what he said and share my ideas on that.

Let's take the software you're writing now (assuming it's web software). Is it operable?
It probably does at least these things:

- Fulfills your requirements document
- Passes your unit tests and integration tests
- Has a usable and responsive UI

But is it operable? Once it's deployed, can it survive unprecedented load? Fringe cases? Subsystems going down? Third party services it depends on becoming unavailable?

And present a front of graceful degradation as it does all this?

Selective Failure
Most of the time, we stress systems before deployment, using load tests to simulate real world conditions. That takes care of one aspect. But most of us don't think of failures of selective systems, especially when the system is distributed and its components interact in complex ways. The latter is true of most big web applications.

Handling selective subsystem failures is not purely an operations responsibility. The application has to be written keeping selective failure in mind.
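
To make this concrete, here is a tiny sketch of one way to write code with selective failure in mind: when a dependency fails, the caller gets a safe default instead of an error. It is in shell only to keep it short, the function name with_fallback is made up for illustration, and a real application would do this in its own language with timeouts and logging.

```sh
# Hypothetical helper: run a command and print its output, or print a
# fallback value if the command fails.
with_fallback() {
    fallback=$1
    shift
    if output=$("$@" 2>/dev/null); then
        printf '%s\n' "${output}"
    else
        # The dependency is down or misbehaving: degrade instead of failing.
        printf '%s\n' "${fallback}"
    fi
}

# Illustrative use: show an empty list if a (made-up) recommendations
# command is unavailable.
#   items=$(with_fallback "[]" fetch_recommendations "${user_id}")
```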

In the video, Theo brings up an analogy with security. Security is not a feature, but a way of thinking.
Sanitizing user input before putting it in a database query is not a feature.
Not allowing access to your internal web services is not a feature.
These are security restrictions you automatically think of when you develop.

In the same way, operational thinking should not be something postponed until deployment. It should be de rigueur in the design and development process.
