The benefit of using this interface to write a profiler is that it can gather stack traces asynchronously. This avoids the inaccuracies of only being able to profile code at safe points, and the overhead of having to stop the JVM to gather a stack trace. The result is a more accurate profiler that avoids the 10-20% overhead of something like hprof. For a more thorough explanation as to why this is interesting, see my 2010 writeup on the subject.
Another feature of the profiler is that it will tell you what stack traces are being executed, rather than, like many sampling Java profilers, what stack traces could be executed. That is, with traditional profilers, if you stop the VM and ask it for stack traces, it will tell you all of the stack traces from all of the threads that are not currently blocked. This profiler tells you the stack trace from the thread that got interrupted when you asked for the stack trace. The tradeoff is reasonably important if you are trying to think about thread starvation and scheduling; with this approach, you get to see what isn't being scheduled. With the "get everything" approach, you get more samples, so there is a meaningful benefit to both approaches.
Code is here. I'm not likely to invest a great deal of time in making it wonderful, but patches from interested parties are welcome.
Edited to add: Some of the commenters below wonder why the stack trace grabber can't print out the stack traces when it gets them. The answer to this lies in what I meant when I called this function asynchronous. I was a little (deliberately) vague on what "asynchronous" means. When I say that the timer that goes off to grab a stack trace interrupts a running thread asynchronously, I mean that it can interrupt the thread at any point. The thread can be in the middle of GC, or the middle of adding a frame to the stack, or in the middle of acquiring a lock. The reader can imagine that writing something asynchronous-safe is somewhat more tricky than writing something thread-safe: not only do you have to worry about what other threads are doing, you also have to worry about what the current thread is doing.
Needless to say, there are a lot of things you can't (or shouldn't) do asynchronously. For example, you can't call malloc, because if the thread is already in the middle of executing malloc when you interrupt it, you can destroy malloc's internal data structures. In fact, for the standard C library functions, the POSIX standard specifies a list of functions that have to be async-safe (see man 7 signal for the list, if you are curious). Note that functions like printf are not async-safe, so you can't call them asynchronously. As far as I know, AsyncGetCallTrace is pretty much the only non-trivial async-safe JNI/JVMTI call. Others, including the standard JVMTI stack trace gathering function and the functions that make sense out of what AsyncGetCallTrace returns, are not async-safe, and are likely to crash your program if you try to call them asynchronously.
For this programming exercise, I was lazy, and printed the stack traces at VM exit. That doesn't mean that a sufficiently clever approach can't get the stack traces incrementally. A separate, synchronous thread can read the data that was written out asynchronously. You can't use a standard blocking lock to protect the data, though, because standard lock implementations aren't async-safe. You could use a spin lock, but you would have to be careful about the async handler interrupting a thread that holds the spin lock. In that case, the async handler won't be able to acquire the lock, so it would have to do a trylock and give up if it can't complete the acquire. There are also clever things you can do with atomic operations if you apply some creativity; I'll leave those as an exercise for the reader.