Troubleshooting Java applications on OpenShift

What is it about?

OpenShift has seen a lot of traction since the release of its third version, based on Kubernetes, a couple of years ago. More and more companies, after a thorough evaluation of OpenShift Container Platform (OCP), have built an on-premise or cloud-based PaaS. As the next step, they have started to run their applications on OCP. One important aspect of running applications in production is the ability to quickly restore service to the normal level after an incident, followed by the identification and resolution of the underlying problem. In this respect, I want to present in this blog a few approaches for troubleshooting Java applications running on OpenShift. Similar approaches can be taken with other languages.

Debugging applications during development phase can be done thanks to features like:

  • Debug mode for resolving issues during startup.
  • Port forwarding for connecting an IDE like JBDS to an application running in a remote container and debugging it with breakpoints and object inspection.

This has been presented in other blog posts.

In this blog, on the contrary, I want to focus on troubleshooting applications in production and to cover things like capturing heap and thread dumps and measuring resource consumption per thread. These are techniques that have more than once been helpful in the past for resolving deadlocks, memory leaks or performance degradation due to excessive garbage collection, for instance.

Let’s get into the heart of the matter!
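To give a taste of what is coming: capturing a thread or heap dump from a running pod can be as simple as the following commands. The pod name is an example, and it is assumed that the Java process runs as PID 1 and that jcmd ships in the image:

```
# oc exec mypod -- jcmd 1 Thread.print > thread-dump.txt
# oc exec mypod -- jcmd 1 GC.heap_dump /tmp/heap.hprof
# oc rsync mypod:/tmp/ ./dumps/
```

The first command captures a thread dump locally, the second writes a heap dump inside the container, and oc rsync copies it back for analysis.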

Building images from artefacts with Maven

The OpenShift S2I build process is a very convenient way of building Docker images from source code. It is however not unusual that companies already have advanced processes and infrastructure in place for building and storing application artefacts.

In such a situation I have seen different approaches:

  • Artefacts may get copied directly into a deployments directory of a git source repository. The drawback of this approach is that version control systems are not meant to store binary artefacts.
  • Binary builds are used and a CI tool like Jenkins copies the artefacts into a deployments directory before triggering the binary build, which induces an unnecessary additional step.
  • The assemble script in charge of the build part of the S2I process is amended so that the artefacts get downloaded during the build process with curl or wget.

There is however another possible approach for Java images using Maven as build tool (similar approaches may be available with other tools). Maven comes indeed with the convenient maven-dependency-plugin. This plugin allows copying artefacts from Maven repositories using the usual Maven coordinates: groupId, artifactId, version. Therefore a single pom.xml specifying the artefacts to download is all that needs to be defined. This pom.xml can then be stored in the company's version control system.
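As a sketch, such a pom.xml could look like the following. The Jolokia coordinates and version are illustrative, and the goal used is dependency:copy from the maven-dependency-plugin:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>artefact-download</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>copy</goal></goals>
            <configuration>
              <artifactItems>
                <artifactItem>
                  <groupId>org.jolokia</groupId>
                  <artifactId>jolokia-war</artifactId>
                  <version>1.3.5</version>
                  <type>war</type>
                </artifactItem>
              </artifactItems>
              <!-- artefacts land here and get picked up by the S2I assemble script -->
              <outputDirectory>deployments</outputDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```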

You can find an example in my git repository, which just copies the Jolokia WAR from Maven Central into the deployments directory of a Tomcat image. To try the example you can just run the following two commands against your OpenShift cluster:

# oc new-build jboss-webserver30-tomcat8-openshift~https://github.com/fgiloux/ocp-mvn-deps.git
# oc new-app ocp-mvn-deps

Development workflows with Fuse Integration Services (FIS)

This post was originally published on Red Hat Developers, the community to learn, code, and share faster.

Development cycles

Fuse Integration Services (FIS) is a great product bringing routing (Apache Camel), SOAP and REST services (CXF) and messaging (JMS) to the modern age of containers and PaaS with all its goodies: encapsulation, immutability, scalability and self-healing. OpenShift provides the PaaS infrastructure for FIS.

In this blog entry I want to present the available development workflows with their pros and cons, and I will stop there: a lot of information and great blog entries have already been written that describe approaches for CI and promotion.

A developer may well implement a module in isolation on his own machine, but it often makes sense, especially when we talk about integration services, to have the code validated in a complex integrated environment, something OpenShift is great at providing. It is so easy to spawn an environment including frontend, backend and peer services that you may not want to wait for the integration test phase to see how your code performs and to identify where issues may happen. You want to see it right away, during development! This blog entry details a couple of workflows, with their pros and cons, for doing just that.

Project start

FIS provides several Maven archetypes. Generating a complete working project, including a selection of modules (CXF or Camel for instance) and an already configured Maven pom for generating a Docker image and pushing it into a registry, is as simple as selecting a few items in the dropdown lists of your favourite IDE or running a single command:

# mvn archetype:generate \
    -DarchetypeCatalog=https://repo.fusesource.com/nexus/content/groups/public/archetype-catalog.xml \
    -DarchetypeGroupId=io.fabric8.archetypes \
    -DarchetypeVersion=2.2.0.redhat-079 \
    -DarchetypeArtifactId=java-camel-spring-archetype

When that’s done you already have a project able to build your source code, to create a Docker container and to run it on OpenShift! You can focus on your code.

Fabric8 maven workflow

Let's start with the default workflow in FIS. It consists of having Maven build the Java artefacts and the Docker image locally and then push the generated image into the OpenShift registry. This works great and is as easy as running the "mvn -Pf8-deploy" and "mvn fabric8:json fabric8:apply" commands.

This is well explained in the documentation [1].

The advantage of this workflow is that the developer can use the same configuration to build and run the application in a pre-Docker world and to have it deployed on OpenShift. That said, it has a couple of requirements that may not make it suitable for every developer:

  • The developer will need to have Docker available where the build happens (on his machine or on a server, root access is currently required for Docker)
  • The developer will need to have the rights to push into the OpenShift registry (or to an external registry)

These are things that may be restricted in some companies. In the future the Maven workflow may support something called binary build to work around these limitations. We will see what it is and how you can use it today in the following chapters.

Source to Image (S2I) workflow

S2I is a framework provided by OpenShift that makes it easy to create Docker images from source code. The build happens directly in OpenShift so that you don’t need to have Docker available locally and to remotely push to the registry.

Per default the S2I process extracts the code from your source repository, spawns a container that builds your source code, generates a Docker image for your application and pushes it into the OpenShift registry. This is a great process, perfect for continuous integration, where you want a reproducible way of creating your application images.
Now, looking at it from a developer's point of view, you may want to save a couple of steps in your development workflow and not commit every change you are trying out into a git repository. That's where binary builds can help you; we will look at them in the next chapter. Here is first an overview of the S2I workflow and where the time may get spent between the point where you trigger it and the point where your application is running in OpenShift. At first it may be quite disappointing, as it takes a long time, but we will look at how this can be addressed.

Overview:
  1. The code is pulled from a git repository or from the developer machine after having been compressed and copied onto the S2I builder image
  2. The S2I builder image is pulled from the registry into the node, where the build will take place
  3. The maven build takes place, pulling the required dependencies and generating the application artefacts.
  4. A new Docker image is created, based on the FIS S2I image and containing the application artefacts.
  5. The Docker image is pushed into the OpenShift registry
  6. The container gets started on an OpenShift node

On my laptop, which runs everything including OpenShift, I get the following figures for the Camel Spring quickstart. You may get different values with dedicated servers but the orders of magnitude should be similar.

  1. Copying the source code: between 1 and 16 seconds. I will explain the two values later on.
  2. Pulling the S2I image: 15 seconds
  3. Maven build: 25 minutes!
  4. Docker build: 16 seconds
  5. Docker push: 35 seconds
  6. Container start: 7 seconds

Binary build

So we want to use binary builds to avoid pushing our changes into a git repository. A binary build consists of streaming your source code directly from your computer to the builder container. Nothing complicated, and you can use the same build configuration that you use for your CI build.

Let's assume that you have already created your application using the template quickstart-template.json provided by the Maven archetype. You could have done that with the following command:

# oc new-app -f quickstart-template.json -p GIT_REPO=https://github.com/yourrepo/osev3-example.git

You will need to point the URL of the git repo to your own repository.

Triggering a binary build is then just a single command:

# oc start-build BUILDCONFIG --follow --from-repo=.

BUILDCONFIG needs to be replaced by the name of your build configuration.

An interesting point is that you don't need to modify your build configuration for that. The command overrides the source type "Git" with "Binary" under the hood, which means that you can still use the same build configuration as in your CI process. In one case the sources are transferred from your local file system, in the other from your git server. Neat!

There are three flavours of binary builds:

  • --from-dir: the files in the specified directory are streamed to the builder image. Note that binaries in the target directory may increase the time needed to copy the sources if the application has first been built locally.
  • --from-repo: the files in the local git repository are streamed to the builder image. It requires a git commit (not a push). The target directory is usually excluded through .gitignore.
  • --from-file: a zip archive is streamed to the builder image.
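For illustration, the three flavours are triggered as follows (BUILDCONFIG stands for the name of your build configuration, and the archive name in the last command is just an example):

```
# oc start-build BUILDCONFIG --follow --from-dir=.
# oc start-build BUILDCONFIG --follow --from-repo=.
# oc start-build BUILDCONFIG --follow --from-file=mysources.zip
```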

You can display the build logs with the following command:

# oc logs bc/BUILDCONFIG

In the logs you can see for instance:

I0906 10:43:10.557946       1 sti.go:581] [INFO] ------------------------------------------------------------------------
I0906 10:43:10.557968       1 sti.go:581] [INFO] BUILD SUCCESS
I0906 10:43:10.557972       1 sti.go:581] [INFO] ------------------------------------------------------------------------
I0906 10:43:10.557975       1 sti.go:581] [INFO] Total time: 25:56 min

And that’s preceded by a long list of dependencies that get downloaded.

Using a maven mirror

Well, a usual way of addressing this kind of issue in the old world is to use a Maven mirror, like Nexus or Artifactory, to cache the dependencies. It is not too different with OpenShift, but dependency caching is critical here as the builder container does not have a pre-populated local .m2 repository (this may change in the future) and is like a newborn baby every time a new build happens. You can set an environment variable so that your build uses a mirror: edit the build configuration and add the Maven mirror under the source strategy:

"sourceStrategy": {
  "from": {
    "kind": "ImageStreamTag",
    "namespace": "openshift",
    "name": "fis-java-openshift:1.0"
  },
  "forcePull": true,
  "env": [
    {
      "name": "BUILD_LOGLEVEL",
      "value": "5"
    },
    {
      "name": "MAVEN_MIRROR_URL",
      "value": "http://nexus.ci-infra:8081/content/groups/public/"
    }
  ]
}

If your company does not already have a maven mirror available there is a Docker image with Nexus, which makes it possible to deploy a maven mirror in OpenShift in minutes. See this page for the details:

https://github.com/openshift/openshift-docs/blob/master/dev_guide/app_tutorials/maven_tutorial.adoc

Incremental builds

Another option to limit the amount of time required to download dependencies is using incremental builds. With incremental builds the dependencies contained in the .m2 directory get persisted at the end of the build and copied again to the builder image by the next build. You then have them locally.

A drawback of this approach compared to a maven mirror is that the artefacts are not shared between different applications, which means that your first build will still take very, very long.

To turn on incremental builds, which will be activated per default with the next FIS release [2], you just need to amend the source strategy in the build configuration accordingly:

"sourceStrategy": {
  "from": {
    "kind": "ImageStreamTag",
    "namespace": "openshift",
    "name": "fis-java-openshift:1.0"
  },
  "forcePull": true,
  "env": [
    {
      "name": "BUILD_LOGLEVEL",
      "value": "5"
    }
  ],
  "incremental": true
}

Further considerations

Using a maven mirror or incremental builds we have now reduced the build time to a couple of minutes. That is totally acceptable as part of a CI process… but still not that good for a developer wanting to quickly test changes.

A few other considerations allow you to further reduce the build time. The first one is nothing new: you may want to review your pom file and make sure that you need all the listed dependencies and plugins. Depending on your workflow you may for instance not need the Jolokia/Docker plugin. Even if these plugins are not called during the artefact generation, they have to be downloaded or copied into the builder image.

Another point to consider is that you don't want your local target directory (and possibly others) to get streamed to the builder image. It may well be over 60 MB. If you use the --from-repo option, make sure that you have .gitignore configured properly; your git server will also thank you for that. If you use --from-dir, make sure that your target directory has been cleaned first.

One step that costs around 15 seconds is that the Docker daemon running your build first downloads the latest S2I image before starting the build. You can save a bit of time by deactivating this mechanism, called "force pull", in the build configuration:

"sourceStrategy": {
  "from": {
    "kind": "ImageStreamTag",
    "namespace": "openshift",
    "name": "fis-java-openshift:1.0"
  },
  "forcePull": false
}

I would however strongly recommend keeping the force pull option activated as part of your integration build, which generates your reference image. If you deactivate it in the build configuration used during development, you will have to maintain two different versions of the build configuration.

Hot deployment

Independently of whether you use the Maven plugin or the S2I workflow, you now have a process for building and deploying a change that takes a few minutes. This is similar to what you may have had in the past when developing locally without a PaaS. A usual solution was then to hot deploy, shortening the time to something acceptable, probably under 30 seconds. How can you achieve this with FIS running on OpenShift?

Container technologies follow the principle that an image should be immutable. It should be possible to promote an image from one environment to the next, from QA to production for instance, and be sure to have the same behaviour in both environments. Image immutability provides consistency and reproducibility. This speaks against hot deployment, and that is surely one reason why it has been deactivated in the Karaf base image and is not available with Hawtapp.

This may sound reasonable to you, but it does not help with keeping the cycle for applying changes short.

We have two FIS flavours, one based on the OSGi container Karaf, the other one plain Java. Let's look at them one after the other.

Hot deployment with Apache Karaf

Hot deployment in Karaf is simply done by dropping a file into the deploy directory: Karaf will detect it and deploy it. As mentioned earlier, this mechanism is deactivated in the Karaf assembly generated by the default pom. It can easily be activated by adding a couple of lines under the configuration of the karaf-maven-plugin:

<!-- deployer and aries-blueprint are required for hot deployment during the dev phase -->
<feature>deployer</feature>
<feature>aries-blueprint</feature>
<!-- shell and ssh are required to log in with the client during the dev phase -->
<feature>shell</feature>
<feature>ssh</feature>

Remember, you should not have hot deployment in your reference image (the one being promoted from integration to production via QA and staging). Therefore it is best to create a dev profile in your pom, which is only activated on the developer machine. Again, for reproducibility purposes the reference image should be created by your CI tool, not manually by a developer.
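As a sketch, such a dev profile could wrap the karaf-maven-plugin configuration like this. The profile id and the exact configuration element are assumptions; adapt them to the pom generated by your archetype:

```xml
<!-- activated explicitly on the developer machine with: mvn -Pdev ... -->
<profile>
  <id>dev</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.karaf.tooling</groupId>
        <artifactId>karaf-maven-plugin</artifactId>
        <configuration>
          <bootFeatures>
            <!-- deployer and aries-blueprint enable hot deployment -->
            <feature>deployer</feature>
            <feature>aries-blueprint</feature>
            <!-- shell and ssh allow logging in with the client -->
            <feature>shell</feature>
            <feature>ssh</feature>
          </bootFeatures>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>
```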

So, we now have a hot deployment capable image, but it has no access to your local file system. How does it work?

OpenShift has a nice feature for that: "oc rsync". It allows copying files to or from a container. In our case:

# oc rsync ./deployment/ POD:/deployments/karaf/deploy

This transfers the files from the local deployment directory to the container. You just need to replace POD with the real name of your pod. Have a look at "oc rsync --help" for further options.

Last point: You can simply use “oc rsh POD” to get a shell inside your container. Karaf configuration is under /deployments/karaf/etc so that you can quickly test changes. Remember once the container is restarted all changes are lost.

That's great for trying things out, and you can always revert to the baseline (your image) by restarting the container. To change the baseline… just use the Maven plugin or the S2I workflow.

Hot deployment with Hawtapp

Hawtapp does not support hot deployment. This means that you may need to look at hot swapping for a solution, which is outside the scope of this blog entry. Note that OpenShift allows opening a tunnel between your local machine and the remote container thanks to the "oc port-forward" command. Use the "--help" option for the details.

So yes, we are not quite there with Hawtapp, but the engineering team is looking at using Spring Boot for FIS 2.0 [4], and this comes with nice hot deployment and troubleshooting capabilities!

In the meantime a hybrid workflow with quick changes being tested locally and broader development tests being done in OpenShift may be the solution.

Java Console

Another nice feature of OpenShift is that it integrates a Java console for all containers exposing a port named "jolokia", which both the Karaf and Hawtapp images do per default. This gives you access to a nice web console, similar to what you have with standalone Fuse instances. From the console you can follow the messages going through your Camel routes, trace them, debug them or even modify them. This is great for troubleshooting!

To open the Java console, follow the link “Open Java Console” in the details page of the pod in your OpenShift web console:

(screenshot: link to the Java Console)

You can then display your route:

(screenshot: Camel route diagram)

 

And modify the source:

(screenshot: route source view)

I hope you enjoyed reading this blog entry and congratulations if you made it till this last sentence!

 

Deactivation of non-TLS routes in OpenShift v3

Introduction

OpenShift v3 comes with HAProxy as its default router. The router is used to expose services delivered by applications running inside OpenShift to the external world. With the current version, OpenShift 3.1, two ports, HTTP and TLS/SNI, are configured on the router.
I have recently provided consulting on OpenShift for a customer who wanted to disable the HTTP port on the router. When a route is created it is possible to define whether it is secured (encrypted) and what to do with the non-encrypted port (block, redirect or keep). The customer however wanted to take that option away from the route designer and enforce the company policy (external communication has to be encrypted) at the router level.
There may be more than one way of doing that. Here I just want to show how HAProxy can be customized to fulfil this request by enforcing the redirection of unsecured connections to the TLS port. This approach may be used for other HAProxy customizations.

Customization

The default or current configuration of HAProxy can be saved into an haproxynew folder with the following command:

# docker run --rm --interactive=true --tty --entrypoint=cat \
    registry.access.redhat.com/openshift3/ose-haproxy-router:v3.1.1.6 \
    haproxy-config.template > haproxynew/haproxy-config.template

 

In the HAProxy configuration there is a section dedicated to the processing of unsecured requests:

# check if we need to redirect/force using https.
acl secure_redirect base,map_beg(/var/lib/haproxy/conf/os_edge_http_redirect.map) -m found
redirect scheme https if secure_redirect

 

This section can be replaced as follows:

# force redirect/force using https.
redirect scheme https

 

A Dockerfile needs to be created so that an image can be generated with the new configuration:

FROM openshift/origin-haproxy-router
ADD haproxynew/haproxy-config.template /var/lib/haproxy/conf/
ENV TEMPLATE_FILE=/var/lib/haproxy/conf/haproxy-config.template \
    RELOAD_SCRIPT=/var/lib/haproxy/reload-haproxy
ENTRYPOINT ["/usr/bin/openshift-router"]

 

The image can then get built locally:

# docker build -t sandbox/haproxynew haproxynew

 

Finally the image can get tagged and pushed into the OpenShift registry, or made available on the nodes running the router. The new router can now get deployed:

# sudo oadm router router --replicas=1 \
    --credentials='/etc/origin/master/openshift-router.kubeconfig' \
    --selector='region=infra' --service-account=router \
    --default-cert=apps-sandbox-com.pem --images='sandbox/haproxynew'

 
We now have the following scenarios with the new router configuration:

  • The route has been created without TLS: a 503 "service unavailable" gets returned.
  • The route has been created with HTTP and TLS: the insecureEdgeTerminationPolicy is ignored and the HTTP connection always gets redirected to HTTPS.

That’s it!

Sandpit in OpenShift v3

Introduction

I have been doing an OpenShift v3 PoC for a customer, and one member of the customer's staff came up with an interesting use case for OpenShift, which is not what customers usually use it for. The idea is to be able to quickly spawn a container to evaluate or validate a package or configuration and to get it destroyed afterwards. The customer well understood that when the container terminates all modifications are lost, and was fine with that. All he wanted was really to have a sandpit, similar to what can be done with this Docker command:

# sudo docker run -i -t rhel7 /bin/bash

The command just starts a RHEL image with a bash process.

The rationale behind doing it in OpenShift rather than directly with Docker is that, on one side, sudo rights are needed to run Docker on a server, and these will only get granted to a limited number of employees. On the other side, the team in charge of managing computers may not be willing to provide support for Docker on staff computers. OpenShift provides fine granularity for managing user rights and allows the execution of privileged operations in a secured way.

The OpenShift way

A container can easily get started with a shell process in OpenShift with a command similar to the following one:

# oc run --restart=Never --attach --stdin --tty --image rhel7 rhel7 /bin/sh

Where things get more complicated is when the user wants to install new packages. This is generally not possible for security reasons when the container gets started in OpenShift by a standard user. A way to work around that is to create a Dockerfile extending the base image with the required packages. The image build will be launched by the builder service account, which is able to run privileged containers for this purpose. Here is an example of such a Dockerfile.

# RHEL 7.2 image extension for my sandpit
FROM rhel7.2
#MAINTAINER Your Name <yourname@yourcompany.com>

RUN INSTALL_PKGS="tar httpd" && \
    yum install -y $INSTALL_PKGS && \
    rpm -V $INSTALL_PKGS && \
    yum clean all

CMD ["/bin/bash"]

Once the image has been built you can “oc run” it as shown earlier. That’s it!

Tips for automated performance tests

Introduction

Depending on the application field, non-functional requirements like response time, throughput or availability may be as important as functional ones. Not being able to fulfil them may make an application unfit for production.
Based on that, it may be critical to keep an eye on response times or throughput during the development and maintenance life cycles. This is best addressed by automated tests, which can be orchestrated by a continuous integration tool like Jenkins.
In this blog post I don't want to write about big designs, complex infrastructure or comparisons between tools. My aim is more humble: I just want to mention two pitfalls you may hit when integrating your performance or load tests in your favourite CI tool, and to give possible workarounds.

Be on time

You want to start your load tests when your application has fully started, but not before. Sure, you can just allow enough time for your application to start before you launch your tests, but it is not that much more complicated to watch for an event stating that your application has finished starting. Most applications write something in the logs. On a Linux system like RHEL it is straightforward to watch the logs, so you may not need to make any change to your log monitoring system if you happen to use something like Logstash or Graylog. This short shell script does the job:

#!/bin/sh
# start-monitor.sh
FILE=${1:-server.log}
SEMAPHORE=${2:-JBAS015874}

if [ ! -f "$FILE" ]; then
  echo "File could not be found!"
  echo "Usage: start-monitor.sh FILE SEMAPHORE"
  exit 1
fi

tail -f "$FILE" | while read LOGLINE
do
  [[ "${LOGLINE}" == *"${SEMAPHORE}"* ]] && pkill -P $$ tail
done
echo "Server started, tests can begin!"

It is self-explanatory: the trick is to read from tail and to check whether the start-completed event has been recorded. Note that with JBoss AS 7 / JBoss EAP 6 the message code is JBAS015874.

Quick but not too quick

So we are now able to start our load tests when we are sure that the application has started. This is already a good point. But some time ago I was doing some performance testing at a customer and it reminded me that, from a statistical point of view, when you look at a decent size population a few elements may have a significant impact on the global outcome. At telcos it is quite usual to monitor the 95th percentile for round trip time and other metrics. By doing so extreme values get filtered out. What does this have to do with our load tests? We are already taking care of the fact that the application is fully started. Fully started yes, but initialised?
An application server like JBoss EAP may load modules on demand, depending on its configuration, to reduce the starting time. In the same way, threads or connection pools may only get populated when there is the need. Are you using a second-level or some other type of cache? Values first need to get populated into it before the database round trips can be avoided.
Coming back to my customer: they were simulating 60 parallel user sessions, with 4 web services being repeatedly called.

Starting the tests right after server start-up gave the following results:

            number of runs   max time (ms)   average time (ms)
Service 1   ~50              31000           3750
Service 2   ~4000            23000           560
Service 3   ~2000            22000           310
Service 4   ~1500            11000           260

The second time, we started recording measures only after we had performed a few calls on each service, and the results looked quite different:

            number of runs   max time (ms)   average time (ms)
Service 1   ~50              1400            1150
Service 2   ~4000            2000            50
Service 3   ~2000            1600            50
Service 4   ~1500            850             45

I am not saying that edge conditions should not be considered, but know what you are measuring; otherwise you may get surprised… as I did.
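The 95th percentile mentioned above is easy to compute from raw measurements. Here is a small sketch using the nearest-rank method (exact when 0.95·N is an integer) that shows how a single outlier gets filtered out:

```shell
# Read one response time per line on stdin, print the 95th percentile
percentile95() {
  sort -n | awk '{ v[NR] = $1 } END { idx = int(NR * 0.95); if (idx < 1) idx = 1; print v[idx] }'
}

# 20 samples, one extreme outlier (900 ms)
printf '%s\n' 10 11 12 10 11 13 10 12 11 10 12 11 10 13 12 11 10 12 11 900 | percentile95
# → 13
```

The outlier does not show up in the 95th percentile, whereas it would dominate the maximum and noticeably shift the average.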

Controlling resources with cgroups for performance testing

Introduction

Today I want to write about the options available to limit the resources in use when running performance tests in a shared environment. A very powerful tool for this is cgroups [1]. It is a Linux kernel feature that allows limiting the resource usage (CPU, memory, disk I/O, etc.) of a collection of processes.
Nowadays it is easy with virtual machines or container technologies like Docker (which uses cgroups by the way) to compartmentalise applications and make sure that they are not eating resources that have not been allocated to them. But you may face, as I did, situations where these technologies are not available, and if your application happens to run on a Linux distribution like RHEL 6 or later you can directly rely on cgroups for this purpose.
I was at a customer who was in the process of migrating their servers to a virtualisation platform. At that point in time they had a virtualised production platform, whereas the integration platform, where the tests were to be performed, was based on physical hardware whose specs were not aligned with the production platform the application was running on. A server of the integration platform was indeed used to run two applications that were split into two different VMs in production. But I was neither able nor willing to create load for both applications; the performance tests were designed for only one of them. If I had run them just like that, I would not have had results representative of the production platform. The customer was indeed questioning the virtualised environment, as they were seeing degraded performance compared to the physical environment used for the integration tests. I decided to use Linux capabilities to limit CPU access and make my tests an acceptable simulation of production. By doing this I helped the customer regain trust in their new virtualisation environment.
Another scenario where one may use cgroups is for evaluating the resource specifications required by an application. It is then easy to perform tests on the same server with one, two, four or sixteen CPUs (provided you have them available). You could tell me that the same can easily be achieved with the hypervisor of a virtualised environment and you would be right. But what if you don't have access to it? The process of getting hold of the person with the rights to apply your changes may take days rather than minutes.
So let's come to the point now and look at how easy it is to configure cgroups for this purpose, after enumerating a few important caveats:
– You need sudo rights to change the cgroups configuration.
– Be careful if you already have cgroups configured on your system. It is easy to destroy your existing configuration. Don't blame me for that. Know what you are doing.
If one of these points prevents you from using cgroups, taskset, described later in this blog entry, may be a good alternative. Though it only applies to CPU affinity, whereas cgroups can be used to limit other resource types as well.
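As a preview of the taskset alternative: pinning a process to given CPUs requires no configuration and no root rights for your own processes. A quick sketch, where sleep stands in for your real workload (for instance java -jar myapp.jar):

```shell
# Start a process pinned to logical CPU 0
taskset -c 0 sleep 30 &
PID=$!

# Display its affinity; prints e.g. "pid 1234's current affinity list: 0"
taskset -cp "$PID"

kill "$PID"
```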

Installation

Cgroups are already set up on RHEL 7.
On RHEL 6 the libcgroup package is the easiest way to work with cgroups and its installation is straightforward: yum install libcgroup
Note however that this package is deprecated on RHEL 7, since it can easily create conflicts with the default cgroup hierarchy.

Configuration

Cgroups are organized hierarchically, like processes, and child cgroups inherit some of the attributes of their parents. A subsystem (also called resource controller) represents a single resource, such as CPU time or memory. Note that the configuration differs between RHEL 6 and 7. See the Red Hat Enterprise Linux 6 and 7 Resource Management Guides, [2] and [2'], for more information.

RHEL 7

By default systemd automatically creates a hierarchy of slice, scope and service units and mounts hierarchies for the important resource controllers under the /sys/fs/cgroup/ directory.
Slices organize the hierarchy in which scopes and services are placed; processes are attached to services and scopes, not to slices. A service is a process or a group of processes started by systemd, whereas a scope is a process or a group of processes created externally (by fork) and registered with systemd at runtime.

So let's start a new process. The command systemd-run allows us to start a new process and to attach it to a specific transient unit. If the unit does not exist it gets created. In the same way a slice can be created with the option --slice. By default the command creates a new service; the option --scope requests the creation of a scope instead.

# sudo systemd-run --unit=fredunit --scope --slice=fredslice top

We now have a new slice and it is possible to visualise it:

# systemctl status fredslice.slice
fredslice.slice
Loaded: loaded
Active: active since Sat 2015-08-15 09:04:09 CEST; 9min ago
CGroup: /fredslice.slice
└─fredunit.scope
└─4277 /bin/top

In the same way it is possible to display the new scope:

# systemctl status fredunit.scope
fredunit.scope – /bin/top
Loaded: loaded (/run/systemd/system/fredunit.scope; static)
Drop-In: /run/systemd/system/fredunit.scope.d
└─90-Description.conf, 90-SendSIGHUP.conf, 90-Slice.conf
Active: active (running) since Sat 2015-08-15 09:04:09 CEST; 8min ago
CGroup: /fredslice.slice/fredunit.scope
└─4277 /bin/top

Transient cgroups are stored in /run/systemd/system.

Let's look at our new process:

# cat /proc/4277/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpuacct,cpu:/
2:cpuset:/
1:name=systemd:/fredslice.slice/fredunit.scope

Cgroups use subsystems (also called resource controllers) to represent a single resource, such as CPU time or memory. The cpuset subsystem [3] provides a mechanism for assigning a set of CPUs and memory nodes to a set of tasks.
If the cpuset hierarchy is not already mounted, this can easily be done:

# sudo mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset

We can now navigate to the cpuset hierarchy and create a new cpuset called fred:

# cd /sys/fs/cgroup/cpuset
# sudo mkdir fred

It is possible to move into the newly created cpuset and to configure it, for instance to allocate the first CPU to it:

# cd fred
# echo 0 | sudo tee -a cpuset.cpus

We will need a memory node as well

# echo 0 | sudo tee -a cpuset.mems

And we can attach our process to the newly created cpuset

# echo 4277 | sudo tee -a tasks

Let's look at the process configuration again. The cpuset fred is now attached to it:

# more /proc/4277/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpuacct,cpu:/
2:cpuset:/fred
1:name=systemd:/fredslice.slice/fredunit.scope

For tests we may also want to limit the amount of memory available to processes. For that we can use the MemoryLimit property:

# sudo systemctl set-property --runtime fredunit.scope MemoryLimit=1G

And this now appears at the process level:

# more /proc/4277/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/fredslice.slice/fredunit.scope
3:cpuacct,cpu:/
2:cpuset:/fred
1:name=systemd:/fredslice.slice/fredunit.scope

After reloading the systemd daemon there is now an additional file, 90-MemoryLimit.conf, in the scope's drop-in directory:

# sudo systemctl daemon-reload
# systemctl status fredunit.scope
fredunit.scope – /bin/top
Loaded: loaded
Drop-In: /run/systemd/system/fredunit.scope.d
└─90-Description.conf, 90-MemoryLimit.conf, 90-SendSIGHUP.conf, 90-Slice.conf
Active: active (running) since Sat 2015-08-15 12:07:08 CEST; 19min ago
CGroup: /fredslice.slice/fredunit.scope
└─14869 /bin/top

This file contains the memory limit in bytes:

# sudo cat /run/systemd/system/fredunit.scope.d/90-MemoryLimit.conf
[Scope]
MemoryLimit=1073741824

Looks good!
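As a quick sanity check (simple arithmetic, not part of the original output): systemd has stored 1G interpreted as gibibytes.

```shell
# systemd parses MemoryLimit=1G as 1 GiB, i.e. 1024^3 bytes
echo $(( 1024 * 1024 * 1024 ))
```

This prints 1073741824, exactly the value written into 90-MemoryLimit.conf.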

We have achieved our aim of limiting a process to a single CPU and to 1G of memory.

Using the exact same mechanism it is possible to start a shell. The nice thing is that every process started from this shell will then inherit the cgroup configuration.

To start the shell and to set its memory limit:

# sudo systemd-run --unit=fredunit --scope --slice=fredslice sh
# sudo systemctl set-property --runtime fredunit.scope MemoryLimit=1G
# sudo systemctl daemon-reload

We can check that the shell process has been bound to the cgroup and that the memory limitation has been applied:

# systemctl status fredunit.scope
fredunit.scope – /bin/sh
Loaded: loaded
Drop-In: /run/systemd/system/fredunit.scope.d
└─90-Description.conf, 90-MemoryLimit.conf, 90-SendSIGHUP.conf, 90-Slice.conf
Active: active (running) since Sat 2015-08-15 12:37:30 CEST; 3min 0s ago
CGroup: /fredslice.slice/fredunit.scope
└─16842 /bin/sh

Again we add the shell process to the tasks of the fred cpuset created earlier, and this gets reflected in the cgroups attached to the process:

# echo 16842 | sudo tee -a tasks
# more /proc/16842/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/fredslice.slice/fredunit.scope
3:cpuacct,cpu:/
2:cpuset:/fred
1:name=systemd:/fredslice.slice/fredunit.scope

From the newly created shell we can start an application:

# top

And we can check in another shell that the new process has inherited the cgroup configuration:

# systemctl status fredunit.scope
fredunit.scope – /bin/sh
Loaded: loaded
Drop-In: /run/systemd/system/fredunit.scope.d
└─90-Description.conf, 90-MemoryLimit.conf, 90-SendSIGHUP.conf, 90-Slice.conf
Active: active (running) since Sat 2015-08-15 12:37:30 CEST; 5min ago
CGroup: /fredslice.slice/fredunit.scope
├─16842 /bin/sh
└─17163 top

# cat /proc/17163/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/fredslice.slice/fredunit.scope
3:cpuacct,cpu:/
2:cpuset:/fred
1:name=systemd:/fredslice.slice/fredunit.scope

Awesome!

RHEL 6

I haven't tested the settings on RHEL 6, but with libcgroup, cgroups can be configured in /etc/cgconfig.conf.
The default /etc/cgconfig.conf file installed with the libcgroup package creates and mounts an individual hierarchy for each subsystem and attaches the subsystems to these hierarchies. You can mount subsystems and define groups with different access to CPU and memory as follows:

mount {
    cpuset = /cgroup/cpuset;
    memory = /cgroup/memory;
}

# no limitation group
group nolimit {
    cpuset {
        # No alternate memory nodes if the system is not NUMA
        cpuset.mems="0";
        # Make my 2 CPU cores available to tasks
        cpuset.cpus="0,1";
    }
    memory {
        # Allocate my 4 GB of memory to tasks
        memory.limit_in_bytes="4G";
    }
}

# group with limitation
group limited {
    cpuset {
        # No alternate memory nodes if the system is not NUMA
        cpuset.mems="0";
        # Make only one of my CPU cores available to tasks
        cpuset.cpus="0";
    }
    memory {
        # Allocate at most 2 GB of memory to tasks
        memory.limit_in_bytes="2G";
    }
}

You must restart the cgconfig service for the changes in /etc/cgconfig.conf to take effect. Note that restarting this service causes the entire cgroup hierarchy to be rebuilt, which removes any previously existing cgroups.
To start a process in a control group: cgexec -g controllers:path_to_cgroup command arguments
It is also possible to add the --sticky option before the command to keep any child processes in the same cgroup.

Taskset as an alternative

If you don't have the rights to use cgroups, or you only want to limit the number of CPU cores used by your application, it is still possible to do so with taskset. For a Java application you would then rely on the JVM settings for memory limitation. This is obviously not the same, but it may be an acceptable approximation. As the "man" page describes, taskset [4] is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new COMMAND with a given CPU affinity. CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs.

For instance, if you have one Java process running, taskset -p provides its current affinity mask:

# ps -ef | grep java
frederic 971 1 99 17:56 pts/0 00:00:09 java…
# taskset -p 971
pid 971's current affinity mask: 3

With taskset -cp you can then get the list of processors that your process is bound to:

# taskset -cp 971
pid 971's current affinity list: 0,1

How should we interpret the result of these two commands?
Let's look at what "man" says: "The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU."
My process is allocated two CPUs, represented by the bits 01 and 10 in binary; together they form the mask 11 in binary, which is 3 in hexadecimal (and in decimal).
Let's take another example that may explain it better. If you have 8 processors, they are represented by the following bits:
00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
if your process has access to all 8 of them, the mask will be 11111111, which is 0x1 + 0x2 + 0x4 + 0x8 + 0x10 + 0x20 + 0x40 + 0x80 = 0xff in hexadecimal
if your process has only access to the 1st and 2nd ones, as was the case for me, the mask is 11 in binary, or 3 in hexadecimal.
now if the process had only access to the 1st and the 8th ones, the mask would be 10000001, which is 0x81 in hexadecimal.
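The bit arithmetic above can be sketched as a tiny helper function (hypothetical, added here for illustration): set bit N for each logical CPU N the process may run on, then print the result in hexadecimal.

```shell
# Compute the hexadecimal affinity mask from a list of CPU indices.
# Bit N set means the process is allowed to run on logical CPU N.
cpus_to_mask() {
  mask=0
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
}

cpus_to_mask 0 1                # CPUs 0 and 1 -> prints 3
cpus_to_mask 0 7                # CPUs 0 and 7 -> prints 81
cpus_to_mask 0 1 2 3 4 5 6 7    # all 8 CPUs  -> prints ff
```

This reproduces the three cases discussed above, including the 0x81 mask for the first and eighth CPUs.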

Note that my process runs in a virtual machine to which I have allocated two CPUs. It is not that I should throw my laptop in the garbage. ;o)

Well, now that we know what we are doing, we can bind the process to a single CPU, the first one for instance.
taskset -p mask pid is the command to use. To bind the process to the first processor:

# taskset -p 1 971
pid 971's current affinity mask: 3
pid 971's new affinity mask: 1

If you want to launch a new command rather than operate on an existing process (which is actually taskset's default behavior), it is no more complicated: taskset mask command [arguments]
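A minimal sketch of launching a fresh command with a given affinity (using sleep as a stand-in for a real application such as a Java process):

```shell
# Launch a new command pinned to the first logical CPU.
# -c takes a CPU list; the equivalent mask form would be: taskset 0x1 sleep 2
taskset -c 0 sleep 2 &
pid=$!

# Check the affinity the child was started with
taskset -cp "$pid"

wait "$pid"
```

The taskset -cp call reports an affinity list containing only CPU 0, confirming the pinning.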

Et voilà!

[1] https://en.wikipedia.org/wiki/Cgroups
[2] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
[2'] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/Resource_Management_Guide/Red_Hat_Enterprise_Linux-7-Resource_Management_Guide-en-US.pdf
[3] https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt
[4] https://en.wikipedia.org/wiki/Processor_affinity

PS: I hurried to write this blog entry, as it may quickly become deprecated with the rocketing adoption of containers 😉

Hello world!

My name is Frédéric and I am a Red Hat employee with a focus on middleware. In my job I help Red Hat's customers get the best out of our products, and I sometimes face issues, limitations or solutions that are not well documented. The first aim of this blog is to provide information on approaches that have been taken to address such issues and limitations. In doing so I hope to help people get the best out of our great open source products. I will also post links to articles presenting the latest developments in technologies I am interested in and share my views on them.
I hope you will enjoy reading and that you won't have to wait too long for the first posts! Feedback, contributions and even controversies are welcome. Courtesy is the only requirement.

Frédéric