After several months of playing with a simulated Serengeti machine, I have started to have some fun with real hardware containing an UltraSPARC III CPU. It is a Sun Blade 1500 workstation lent to me by the globalization division of Sun Microsystems.

A Spanish (or French?) keyboard is connected to it (you know, globalization division…), but that is not the most interesting challenge. The most interesting one is how to debug the kernel on such a machine.

The kernel crashes before the output gets initialized, so it is not possible to use printf to identify the exact point of failure. With some help from the OpenBoot PROM, however, it is possible to find the area of the physical address space to which the screen is mapped (pixel by pixel) and to draw something on the screen by writing to that area.

Code that does so can look like this:

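/* Draw a thin horizontal stripe directly into the Sun Blade 1500 framebuffer. */
static void sb1500stripe(int color, int position)
{
    int i;

    /* Store one byte at a time; ASI 0x15 makes stba use a physical address. */
    for (i = 0; i < 0xA00; i++)
        asm volatile ("stba %0, [%1] 0x15 \n" ::
            "r" ((color)),
            "r" (0x7f708000000 + (position)*0x1400 + i));
}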

This will draw a thin horizontal stripe. The value 0x7f708000000 is the physical address where the framebuffer starts, and position determines the distance of the stripe from the top of the screen.
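To make the address arithmetic concrete, here is a small sketch that computes where a given stripe lands in memory. The names FB_BASE, STRIPE_STRIDE, STRIPE_LEN and stripe_addr are mine, introduced only for illustration; the constants are the ones used in sb1500stripe.

```c
#include <stdint.h>

#define FB_BASE       0x7f708000000ULL  /* physical start of the framebuffer */
#define STRIPE_STRIDE 0x1400ULL         /* bytes between consecutive positions */
#define STRIPE_LEN    0xA00ULL          /* bytes written per stripe */

/* Physical address of the i-th byte of the stripe at the given position. */
static uint64_t stripe_addr(unsigned int position, unsigned int i)
{
    return FB_BASE + (uint64_t) position * STRIPE_STRIDE + i;
}
```

For example, stripe_addr(2, 0) is 0x7f708002800 and the last byte written for that stripe is at 0x7f7080031ff. Since only 0xA00 of every 0x1400 bytes are written, stripes at neighbouring positions never overlap in memory.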

The basic idea of how to use this for debugging is simple: call sb1500stripe from several places in the function you suspect of causing the crash, each time passing it a different value of position. The last stripe that appears on the screen then tells you which line of your source code is to blame.

The idea is simple, but it does not always work. There are basically two problems with this approach. The first is that some functions do not crash in general, but only when called with particular arguments. If such a function is called n times before it crashes, it succeeds n - 1 times, each time drawing all the stripes. When the crash finally comes, the programmer sees all the stripes on the screen, so it seems that the function did not crash at all. The solution is straightforward: at its beginning, the function must clear (blank) the area where it is about to draw its stripes.
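A minimal sketch of the blanking step follows. So that it can be compiled and tried on any machine, it paints into an ordinary array instead of the real framebuffer. The names fake_fb, stripe, blank_stripes, is_painted and suspect are hypothetical, color 0 is assumed to be the background color, and on the Sun Blade the body of stripe would be the stba loop from sb1500stripe.

```c
#include <string.h>

#define STRIPE_STRIDE 0x1400   /* bytes between stripe positions */
#define STRIPE_LEN    0xA00    /* bytes painted per stripe */
#define MAX_STRIPES   16       /* assumption: size of the stand-in buffer */

/* Stand-in for the framebuffer so the sketch runs anywhere. */
static unsigned char fake_fb[MAX_STRIPES * STRIPE_STRIDE];

static void stripe(int color, int position)
{
    memset(fake_fb + position * STRIPE_STRIDE, color, STRIPE_LEN);
}

/* Blank (paint with color 0) the stripes at positions first..last. */
static void blank_stripes(int first, int last)
{
    for (int p = first; p <= last; p++)
        stripe(0, p);
}

/* Helper for inspecting the stand-in buffer; on real hardware you just look. */
static int is_painted(int position)
{
    return fake_fb[position * STRIPE_STRIDE] != 0;
}

/* Hypothetical suspect function: "crashes" (returns early) when arg < 0. */
static void suspect(int arg)
{
    blank_stripes(0, 2);   /* forget stripes left by earlier calls */
    stripe(0xff, 0);       /* checkpoint: function entered */
    if (arg < 0)
        return;            /* stands in for the crash */
    stripe(0xff, 1);       /* checkpoint: past the dangerous part */
    stripe(0xff, 2);       /* checkpoint: about to return */
}
```

After calling suspect(1) and then suspect(-1), only the stripe at position 0 is left painted, which points at the code between the first and the second checkpoint. Without the blank_stripes call, all three stripes from the first call would still be visible.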

The second problem is much trickier. When I first encountered it, it seemed that malloc failed when calling slab_alloc, yet slab_alloc itself seemed to run all the way to its return statement. I was quite confused. What the hell can break during a return? I checked the value of the CANRESTORE register (maybe some strange error during a register window fill?), but its value was 3. Then I tried to find out (using binary search) the value of the %i7 register (which holds the return address during the function call). This was complicated by the fact that the value depends on the kernel binary and hence changes whenever you modify the kernel code (which I was doing). After several hours I found out that even the %i7 register was correct.

Actually, the return statement was causing no failure at all! It was the debugging method that was wrong. What happened was that slab_alloc called itself recursively. The nested call succeeded (and painted all the stripes), but the outer call, the one that made the recursive call, did not. Blanking the area for the stripes did not help here, as all the stripes were painted during the recursive call (i.e. after the blanking took place).

The solution I came up with, which works just fine for me, is to define a static variable (call_order) in the suspect function. It is incremented (and copied to a local variable, call_order_copy) at the beginning of each call. The position of a particular stripe then depends not only on the place within the function from which it is painted, but also on the value of call_order_copy. So every call of the suspect function has its own set of stripes, and from that the exact point of failure within the suspect function can be determined.
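Here is a rough sketch of the call_order trick, again painting into an ordinary array so that it can be run anywhere. Only call_order and call_order_copy come from my actual code; fake_fb, stripe, is_painted, suspect, STRIPES_PER_CALL and MAX_CALLS are hypothetical names, and the early return merely stands in for the crash.

```c
#include <string.h>

#define STRIPE_STRIDE    0x1400
#define STRIPE_LEN       0xA00
#define STRIPES_PER_CALL 3      /* number of checkpoints in suspect() */
#define MAX_CALLS        4      /* assumption: calls that fit on the stand-in screen */

/* Stand-in for the framebuffer so the sketch runs anywhere. */
static unsigned char fake_fb[MAX_CALLS * STRIPES_PER_CALL * STRIPE_STRIDE];

static void stripe(int color, int position)
{
    memset(fake_fb + position * STRIPE_STRIDE, color, STRIPE_LEN);
}

static int is_painted(int position)
{
    return fake_fb[position * STRIPE_STRIDE] != 0;
}

/* Hypothetical suspect function: recurses once when depth > 0, and the
 * outer call then "crashes" (returns -1) right after the nested call. */
static int suspect(int depth)
{
    static int call_order = 0;
    int call_order_copy = call_order++;   /* this call's private number */
    /* Wrap around so the sketch never paints outside the buffer. */
    int base = (call_order_copy % MAX_CALLS) * STRIPES_PER_CALL;

    stripe(0xff, base + 0);       /* checkpoint 0: entered */
    if (depth > 0) {
        suspect(depth - 1);       /* nested call paints its own set */
        return -1;                /* stands in for the crash */
    }
    stripe(0xff, base + 1);       /* checkpoint 1 */
    stripe(0xff, base + 2);       /* checkpoint 2: about to return */
    return 0;
}
```

After suspect(1), stripes 3, 4 and 5 (the nested call) are all painted, while of the outer call's set only stripe 0 is. With a single shared set of stripes the two calls would be indistinguishable.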