On Code, And Other Things: automation


Saturday, 1 February 2020

Automating Heap Dumps For Java Containers in Google Kubernetes Engine

Heap dumps are an indispensable tool for debugging memory issues in Java processes. The typical way of taking a heap dump is with the jmap command:

jmap -dump:format=b,file=/tmp/heap.dump 2592

This will trigger a heap dump for the process with id 2592 (assuming it's a Java process) and store it in a file /tmp/heap.dump. This can be analyzed later with a heap dump analyzer like jhat, MAT, or VisualVM - which are free tools.

Triggering a heap dump for a Java process that is running in a container, inside a pod, in a Google Kubernetes Engine (GKE) cluster node is not so straightforward. There are many layers of infrastructure that you have to cross to get at the Java process. Your process would usually run as part of a managed abstraction like a Deployment or a StatefulSet in your Kubernetes cluster. Your starting point would be just the pod name.

But,

- Knowing the pod name is not enough - you also have to locate the cluster node where it's running and ssh into it.

- A GKE node might be running many pods, and many Java processes - you have to identify the correct one once you have ssh'ed into it. "docker ps" can help here.

- The GKE node might not have jmap. It's not straightforward to install the JDK there because it would typically be running COS. So you have to get inside the container and trigger the dump.

- You have to copy the dump to an accessible location, maybe a GCS bucket, from where you can download it to analyze. Uploading to a GCS bucket requires gsutil, which is not present by default in a COS node.

I have automated this entire process using just shell scripts and gcloud commands. The source code is on GitHub. The scripts also use the toolbox utility that Google provides as a container for running debug tools. Invoking "toolbox" inside your GKE node will launch this container.

These scripts have an assumption which might not be valid for your cluster - I'll point it out at the relevant point in the code.

Here's a step by step explanation of the flow.

There are 3 shell scripts - k8s-debug-client.sh being the one to run from your dev box or bastion host. This one invokes k8s-debug-vm.sh (inside the GKE cluster node, i.e. the VM) which in turn invokes the k8s-debug-toolbox.sh.

First we find out the node on which the pod is running

node_name=`kubectl get pod ${pod_name} -o json | jq -r '.spec.nodeName'`

and get its public IP

public_ip=`gcloud compute instances list --filter="name=(${node_name})" --format="value(networkInterfaces[].accessConfigs[0].natIP)"`

copy the other two scripts to it

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${keyfile} k8s-debug-vm.sh k8s-debug-toolbox.sh ${user}@${public_ip}:

and trigger them

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${keyfile} ${user}@${public_ip} sh k8s-debug-vm.sh ${pod_name} ${action} ${bucket}

I turn off the ssh warnings as I trigger this from a CI system on demand. If you run them manually, you can remove the -o options.

Inside the GKE node, k8s-debug-vm.sh figures out the correct container id and uses docker exec to trigger a heap dump inside it.

container_id=`docker ps | grep ${pod_name} | grep -v "POD" | awk '{print $1}'`
docker exec ${container_id} sh -c "jmap -dump:format=b,file=heap.dump 1"

Note that the heap dump is inside the container, and not in the VM. You may not be able to push it to a GCS bucket from the container as there is no gsutil and no permissions. So we need to copy the dump to the VM. We copy it to a uniquely named file.

dttime=$(date '+%d-%b-%Y-%H-%M-%S')
filename=${pod_name}-${dttime}.hdump
docker cp ${container_id}:heap.dump ${filename}

Now you have the dump file in the VM but there is no gsutil. So you need to invoke toolbox, which has gsutil inside it. Does that mean we need to copy the dump inside toolbox now? No, because toolbox mounts several useful directories by default from the VM it's running on.

So we just invoke toolbox and pass it the path to the k8s-debug-toolbox.sh (which is in the home directory of the user you are logged in as in the VM) as it would appear from inside toolbox (since the home directory is also mounted inside toolbox).
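A minimal sketch of that invocation could look like the following. It assumes toolbox exposes the VM's root filesystem under /media/root (the gsutil path further down relies on the same layout). The helper names toolbox_path and run_in_toolbox, and the argument order passed to the script, are made up for illustration and are not taken from the repository.

```shell
# Hypothetical helper: map a file in the VM user's home directory to the
# path at which it is visible from inside toolbox, assuming the VM's root
# filesystem is mounted at /media/root.
toolbox_path() {
    user=$1
    file=$2
    printf '/media/root/home/%s/%s\n' "$user" "$file"
}

# Hypothetical wrapper: run the upload script inside toolbox.
# The arguments passed to the script are illustrative only.
run_in_toolbox() {
    user=$1
    filename=$2
    bucket=$3
    toolbox sh "$(toolbox_path "$user" k8s-debug-toolbox.sh)" \
        "$user" "$filename" "$bucket"
}
```

Keeping the path translation in one small function means the /media/root assumption lives in a single place if your node image mounts things differently.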

Inside toolbox, we can use gsutil to upload the dump file (which is also available inside toolbox because it's in the home directory of the user you are logged in as in the VM). But here's a catch. gsutil requires permissions to upload to a GCS bucket. One way to provide this permission is with an IAM permissions JSON file. But how does it get to the VM?

This is the caveat I mentioned above. In the infrastructure I manage, almost every Java pod has a config map with a permissions file that the pod uses to access Google Cloud services. This file is accessible as a mounted directory inside toolbox, so, voila!

/google-cloud-sdk/bin/gcloud auth activate-service-account --key-file=${dir}/key.json
/google-cloud-sdk/bin/gsutil cp /media/root/home/${user}/${filename} gs://${bucket}/kdev-debug/${filename}

If you don't have this shortcut, you will need to set the permissions somehow. There are multiple ways of doing it - one being to run a custom container instead of toolbox that has gsutil installed and can mount a config map which has the permissions. Another is to upload the permissions file to the VM when you run the command, use it from inside toolbox, and then delete it. The second one is a tad risky.

These scripts can be modified to be usable for any Kubernetes cluster and not just GKE. Most of the changes will be in the commands that fetch the list of running nodes. If you are using another OS for your K8S VMs, you can install Java directly on the VM and trigger the dump, after you find out the mapping between the container ids and the process ids as visible from the VM.
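For the container id to process id mapping, one possible approach is to ask Docker directly. The sketch below assumes the node uses Docker as its container runtime and relies on the .State.Pid field reported by "docker inspect", which is the PID of the container's main process as seen from the VM. The function names are hypothetical. Note that this only finds the JVM if it is the container's main process, the same assumption the jmap call above makes by using PID 1.

```shell
# Print the PID, as seen from the VM, of the main process of one container.
host_pid_of_container() {
    docker inspect --format '{{.State.Pid}}' "$1"
}

# Print one "container_id host_pid" line for every running container.
list_container_pids() {
    docker ps -q | while read -r id; do
        printf '%s %s\n' "$id" "$(host_pid_of_container "$id")"
    done
}
```

Depending on the JDK version, attaching to a JVM that lives in another container's namespaces may need extra steps, so treat the PID lookup as the first piece of the puzzle and not the whole solution.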

Monday, 7 December 2009

Adding MySQL server instances using mysqlmanager

The MySQL instance manager - mysqlmanager - provides a way to manage multiple MySQL server instances on the same installation. All these instances use a common my.cnf file - but each can be configured individually (using the same file). mysqlmanager itself provides a command line interface to control the individual instances.

Part of a sample my.cnf with multiple MySQL instances:

[mysqld1]
user    = mysql
datadir = /data/mysql-1
socket  = /tmp/mysql-1.sock
port    = 3306

[mysqld2]
user    = mysql
datadir = /data/mysql-2
socket  = /tmp/mysql-2.sock
port    = 3307

The ability to set up multiple database servers fast is particularly useful on development boxes where fresh DBs need to be created often. In my team, we often need to do this. Every time a new DB has to be set up, we have to go through the steps of creating a datadir, installing the system tables, adding a root password, adding the entries to the my.cnf file and starting the instance using the mysqlmanager shell.
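As a rough illustration of just one of those steps, adding the entries to my.cnf could be scripted along these lines. This is a sketch and not the actual script; the function name append_instance_section and its argument order are made up for the example, and the keys mirror the sample section shown above.

```shell
# Hypothetical helper: append a section for a new instance to a my.cnf file.
# Arguments: config file, group name, user, datadir, socket, port.
append_instance_section() {
    cnf=$1
    group=$2
    user=$3
    datadir=$4
    socket=$5
    port=$6
    # A leading blank line keeps the new section visually separate.
    cat >> "$cnf" <<EOF

[$group]
user    = $user
datadir = $datadir
socket  = $socket
port    = $port
EOF
}
```

For example, "append_instance_section /etc/my.cnf mysqld3 mysql /data/mysql-3 /tmp/mysql-3.sock 3308" would add a third instance next to the two in the sample.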

So I whipped up a small Linux shell script which automates this process.

Here it is.

It's still in a quite primitive state - but it works!

Usage is simple -

add-mysql-instance.sh mysql config-file-location datadir groupname username password instance port instance-name mysqlmanager-user mysqlmanager-password mysqlmanager-socket-file

Of course, mysqlmanager has to be running for this to work.

I'll be adding improvements to this script - like the ability to generate a mysql instance name based on existing instances (instance names are usually mysqld1, mysqld2 etc), picking up the user name from the file itself etc.
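The first of those improvements might look something like the sketch below. It assumes instance groups are always named mysqldN, as in the sample config above, and simply picks the number after the largest one in use. The function name next_instance_name is hypothetical.

```shell
# Hypothetical helper: print the next free "mysqldN" group name for a my.cnf.
next_instance_name() {
    cnf=$1
    # Pull N out of every "[mysqldN]" header (dropping leading zeros so the
    # arithmetic below never sees an octal-looking number), keep the largest.
    max=$(sed -n 's/^\[mysqld0*\([0-9][0-9]*\)].*/\1/p' "$cnf" | sort -n | tail -n 1)
    # With no existing instances, max is empty and we start at mysqld1.
    echo "mysqld$(( ${max:-0} + 1 ))"
}
```

This does not reuse gaps left by removed instances; it always goes one past the highest number, which keeps the logic simple.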

Sunday, 12 July 2009

Consistency in Development

Consistency in Development?

Simply put, it means following a set of basic guidelines in all development activities, from coding to deployment. This does not imply having rigid protocols and processes, because immutable rules don't help development but obstruct it. What it does imply is having simple, tried and tested conventions and some formal processes that people are comfortable with - 'Whatever works best for the team'. The key to getting the most out of consistency is to arrive at these rules by consensus and making sure everyone follows them.

It has to start at the very bottom.

Coding Standards
I cannot stress this enough. While there is no such thing as a perfect code convention, there should definitely be an agreed upon convention for a team, and everyone should follow it. This is true for any language. Imagine the plight of a developer who has to work on code originally written by someone else, and it takes him hours to figure out what the code does because the coding style followed by the original author is completely different. Junior developers often don't get this. The standards can be chosen democratically by involving every member of the team, freezing the conventions and applying them to everyone's IDE (most IDEs support code style import/export). Ideally I would trust the developers to follow this, but it can also be enforced at the source control level where style checks can fail a checkin in case somebody messes up.

Speak the same dialect so that others understand you and vice versa.

Maintenance becomes much easier.

Architecture
Don't run and start to hack away the moment you get the requirements. Stop for a moment - think awhile. Put your thoughts on the whiteboard and discuss with your peers. Run through your design with somebody who knows the big picture. Come up with at least two or three different solutions to the problem - that way you would know that you have looked at it from various angles and chosen the right one.

Application Structure
You might have multiple web applications in your project, differing in configuration files, HTMLs, images, css, libraries, server side scripts etc. Storing them in a consistent manner across applications helps keep them organized and makes it easier for developers, especially new ones, to find things. You will not be surprised by the completely different disk layout of an application if it follows the same directory structure as all the others. Deep down it also appeals to the organized mindset that most good developers have.

Issue Tracking
New bugs/features come up every day. Small teams can probably manage these for some time with sticky notes and paper and pen. Some developers have their own ways of keeping track of their ToDo lists - but without a unified interface, you are not going to scale. You will have chaos - files missed in checkins, people clueless about who is looking into a particular issue, wondering about the status of a critical bug. When your team grows large, you need some kind of tracking system to track milestones, open issues and tasks to do, to assign issues to developers and to prioritize them. There are lots of good bug tracking systems out there - get one which works best for you.
Track issues in the same way across people

Deployment
Web applications will need to scale somewhere in their lifetimes, especially if they are successful. Think about your initial deployment environment - one webserver and one database server (on the same machine). As time goes by and your app becomes popular, you add servers. And features. New features translate to new application modules and new databases. The simple script you used to upload and deploy your small app is useless now.

At this point, scaling has multiple meanings:

        The ability to handle increasing load and maintain baseline performance, and

        The ability to push code into deployment quickly.

Both of these are affected by having (or not having) a consistent model.

The first point is actual application scaling - the complexity of your infrastructural setup is going to increase hugely as it grows. The second point has to do with how your team scales to increasing demands.

They are related. Critical fixes and features might need to be pushed immediately. These demands would reach proportions where you cannot afford to spend time figuring out why Server No 6 in your cluster does not have the latest changes. Automation is the key here. Automation demands formal, well-defined processes (for making builds, uploading to production servers etc). Formal deployment processes imply consistency. They do not mean a bureaucracy - just well followed and automated rules about how to deploy a change into production. As your app and infrastructure grow, it becomes more and more important to be able to roll back changes if necessary. This is possible only if you have well-defined deployment paths and scripts.

Automate wherever possible. Minimize the number of things you have to keep in your head. And in the process, lower your stress levels!

Note: These thoughts are not entirely mine - these are the culmination of what I feel about consistency after having read the experiences and opinions of many others in blogs, books and articles, coupled with my own experiences in developing products. Also, you might have noted that I have been talking of web application development in some of the sections above, but these apply to any kind of software development.
