The case for PerfKit

I wanted to take a few moments to explain my silly little vision of perfkit's destiny. It already has the basis for some of this; however, that doesn't change the fact that it is still a toy in its infancy. (I will continue to emphasize that it is a toy until it actually solves a non-trivial problem for me.)

To explain my motivation, I need to provide a little background. While I was an engineer at MySpace, I spent the majority of my time on the so-called "Dev Ops" team. We were responsible for creating tools to provide insight into our infrastructure. One tool we came to depend on every moment of every day was a performance collection system. We knew as much as necessary about each of our thousands of servers at every moment, whether that was CPU, memory (especially page-fault information), or requests per second. There was a simple query interface, live graphs updating every second, and the ability to deep-dive.

Additionally, we had tools that would use this data to see what was happening on clusters of machines and respond intelligently. So if a server fell outside the standard deviation for a particular counter, the NOC could take appropriate action.

So that was nice. I miss it. I miss the instant feedback of typing a command on a system and seeing how it affected real web-server performance.

So I kept this in mind when designing the perfkit infrastructure. I tried to create a balance that will allow us to move gracefully between the traditional "systems monitoring" space and the profiling space. Does it do this today? No, because it's just a toy and it is worthless. Will it do it in the future? Yes. I believe it is capable of it.

Perfkit is separated into a few components. There is an agent that is intended to run on the target host; it is a goal that this agent be light enough to run on my phone someday. Tools talk to the agent using libperfkit, a shared library that contains all the proper RPCs for communicating with the agent. The RPC transport is abstracted, and I'll be adding new transports specialized for given situations. Currently, the control channel is over DBus and profiling data is sent over a Unix socket. I intend to write a TCP transport.

I should also note that protocols are provided as plug-ins. So if you hate DBus and use another transport mechanism instead, the DBus plug-in won't be loaded into the process.

I like scripting systems, so I created a command-line interface that can connect to the agent and give you a readline-based shell (with auto-completion) to interact with the agent. It's aptly named perfkit-shell. It looks something like this:

perfkit> manager add-channel
    channel: 0
perfkit> channel=$1
perfkit> channel set-target $channel /bin/nc
perfkit> channel set-args $channel '-l 10000'

You get the idea. You can write scripts and pipe them in or load them from within the shell. Basic variables work too, as you can see above: the $number variables correspond to the output parameters of the given RPC.
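To make the piping concrete: a script is just the same commands without the prompt. As a sketch, using only the commands shown above (the file name is my own), you could save this:

    manager add-channel
    channel=$1
    channel set-target $channel /bin/nc
    channel set-args $channel '-l 10000'

and then feed it to the shell:

    perfkit-shell < setup.script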

But let me repeat myself. Auto-completion. We can haz it.

I had a few problems with GObject introspection a while ago, but I'll eventually finish fixing the documentation to support it. Once that is done, I'd like to make it simple to script data collection across your clusters for one-off metrics and analytics (with Python or similar). This would be ideal for web shops that have a plug-in written to extract counters from Apache, Nginx, Cherokee, etc. (No, those plug-ins do not exist today; in fact, no plug-ins really exist today.)
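To give a feel for where I want this to go, here is a purely hypothetical Python sketch of cluster-wide collection via introspection. Nothing in it exists today: the Perfkit module, the Connection class, and every method name are assumptions rather than a real API.

    from gi.repository import Perfkit     # hypothetical: requires the unfinished introspection support

    for host in ("web01", "web02", "web03"):
        conn = Perfkit.Connection.new("dbus://" + host)  # hypothetical constructor
        channel = conn.add_channel()                     # mirrors "manager add-channel"
        channel.set_target("/usr/sbin/apache2")          # would need the (nonexistent) Apache plug-in
        channel.start()

You get the shape: the same RPCs as the shell, driven from a scripting language across many hosts at once.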

Additionally, I want a simple GUI that a general sysadmin can use to connect to a host and do basic troubleshooting. Maybe even have some pretty uber-graph integration. I have the basis for this, but it needs a lot of love before I would subject a sysadmin to it. The ability to grab detailed system state for post-analysis is probably a good idea too.

So, now, the profiling GUI. It's hard work, so if I don't move as fast as you want me to, I'm truly sorry. I already don't have a life, and I'm not sure I can move any faster alone. But let's look on the bright side: you are probably more clever and a better programmer than me, and you can help make it go faster.

Let me make this clear: I have no intention of replacing systems like lttng, systemtap, oprofile, valgrind, perf, ftrace, etc. My goal is to consume their data and present it in a novel and genuinely useful way. Abstractions are leaky, so these tools will still have a purpose. I just, as a developer, can't stand using them every day anymore. Because let's face it, those tools are a PITA.

That said, it's time to start rolling up our sleeves and coming up with clever ways to extract interesting information from your systems. The data visualization in perfkit uses Clutter, which I think will allow for some pretty cool OpenGL-based visualizers in the future.

Will you help me?

-- Christian Hergert 2010-11-05