Automated string compaction in Linux

The size of strings in the Linux kernel is somewhat concerning. It's probably worse in user-space software, and maybe this idea will find applicability elsewhere, but I care about the Linux kernel.

Back in the early nineties, I used to use a system called RISC OS. It certainly had its flaws, but it had some ideas which have not been implemented in Linux which I think may have some merit.

One such idea is called OS_PrettyPrint. It somewhat resembles printf() but with an API designed by an assembly programmer who might have heard of C, but certainly doesn't approve (this was a common attitude at the time; if you wanted to write something quickly, you wrote it in BASIC, and if you wanted it to run fast, you wrote it in assembler).

Human readable strings are quite bulky, and I managed to achieve great savings (30%?) in the size of my programs by using OS_PrettyPrint. If we want to compact Linux strings by using a concept like "the system dictionary", how might we go about it?

Some ground rules are important. We need humans to be able to write freeform text and have tools to do the compaction. Back in the nineties, people were willing to embed random sequences of characters into their strings, but it's more important now to be able to grep for an error message. This is a bit of a shame because using the system dictionary meant that error messages were somewhat standardised, but greppability above all else.

First, we should mark strings for compaction. The gettext people have shown us the way with their _() macro. That hasn't caught on inside the kernel yet, because we don't internationalise our kernel messages (for very good reasons), but this offers us a good pattern to follow.

Next we need a tool that analyses the corpus of text and finds the best compaction. It doesn't need to take into consideration how frequently each string is printed; we're only trying to save bytes in the kernel image.
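
To make "best" a little more concrete, here is a rough sketch of how a single candidate substring might be scored in net bytes saved. The cost model is my assumption, based on the format proposed further down: each use shrinks to two bytes (marker plus index), and each dictionary entry costs a 16-bit offset plus the string and its NUL. The names count_occurrences() and net_saving() are made up for illustration. A real tool would also have to deal with candidates that overlap one another, which this does not attempt.

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in haystack. */
size_t count_occurrences(const char *haystack, const char *needle)
{
	size_t n = 0, len = strlen(needle);
	const char *p = haystack;

	if (len == 0)
		return 0;
	while ((p = strstr(p, needle)) != NULL) {
		n++;
		p += len;
	}
	return n;
}

/*
 * Net bytes saved by adding `candidate` to the dictionary.
 * Assumed cost model: every use shrinks from len bytes to 2, and the
 * entry itself costs 2 bytes of offset table plus len + 1 bytes of text.
 */
long net_saving(const char *const *corpus, size_t nstrings,
		const char *candidate)
{
	long len = (long)strlen(candidate);
	long uses = 0;
	size_t i;

	for (i = 0; i < nstrings; i++)
		uses += (long)count_occurrences(corpus[i], candidate);
	return uses * (len - 2) - (len + 3);
}
```

Under this model a short substring has to appear many times before it pays for its own dictionary entry, so the tool would simply discard candidates with a non-positive score.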

Then we need a tool that does the compaction. This should be a separate tool from the analysis tool because we may wish to build modules later to insert into the running kernel, and they should use the same dictionary as the existing kernel. Also, we may not wish to run the analysis tool on every build (maybe it will take some time to run?).

Finally, we need support in printk(), more specifically in format_decode(). We should figure out how to get the system dictionary linked into the kernel. I do not think we need to make the system dictionary replaceable or augmentable (per thread? per cpu? Just Say No).

I suggest we do not follow RISC OS in using ASCII 27 (ESC) to indicate a dictionary entry; ESC sequences are already well standardised by ANSI and we should not collide with them. Instead, I propose we use ASCII 16 (DLE, hex 10), as it seems quite unused, followed by a single byte giving the dictionary index. I also propose we use the sequence 10 FF to indicate a literal DLE.
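
On the compaction side, the literal-DLE rule could be handled with something like the following sketch. The names DICT_MARKER, DICT_LITERAL and escape_dle() are placeholders of mine, not anything that exists in the kernel, and a real tool would do this as part of substituting dictionary entries rather than as a separate pass.

```c
#include <stddef.h>

#define DICT_MARKER  0x10 /* ASCII DLE, introduces a dictionary reference */
#define DICT_LITERAL 0xFF /* index reserved to mean "a literal DLE byte" */

/*
 * Copy src to dst, rewriting each literal DLE as the pair 10 FF.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * dst is too small.
 */
long escape_dle(const char *src, char *dst, size_t dstsize)
{
	size_t o = 0;

	for (; *src; src++) {
		size_t need = ((unsigned char)*src == DICT_MARKER) ? 2 : 1;

		/* Leave room for these bytes and the trailing NUL. */
		if (o + need >= dstsize)
			return -1;
		dst[o++] = *src;
		if (need == 2)
			dst[o++] = (char)DICT_LITERAL;
	}
	if (o >= dstsize)
		return -1;
	dst[o] = '\0';
	return (long)o;
}
```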

We need to decide on a format for the dictionary. I suggest not using the RISC OS format, but rather the following:

Start with a table of offsets. These can be just 16 bits each, as a 64KiB limit on the dictionary seems sufficient. We do not have to use all 255 entries (index FF is taken by the literal DLE).
The table is followed by the NUL-terminated strings themselves.

That lets printk() look up the intended string directly instead of chasing offsets through the dictionary as the RISC OS dictionary format did.
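
A minimal sketch of that direct lookup, assuming little-endian 16-bit offsets measured from the start of the dictionary and a caller that knows how many entries there are. dict_lookup() and nr_entries are names I have invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return entry `index` of a dictionary laid out as a table of 16-bit
 * offsets followed by NUL-terminated strings, or NULL if the index is
 * out of range. Assumes little-endian offsets from the start of dict.
 */
const char *dict_lookup(const uint8_t *dict, size_t nr_entries,
			uint8_t index)
{
	uint16_t off;

	if (index >= nr_entries)
		return NULL;
	off = (uint16_t)(dict[2 * index] | (dict[2 * index + 1] << 8));
	return (const char *)dict + off;
}
```

The lookup is one bounds check and one 16-bit load, with no walk over earlier entries.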

Here's an example of what I think it should look like:

The programmer writes:
printk(_("Hello, %s world\n"), x ? "happy" : "cruel");

The analysis tool happens to produce a dictionary with two elements in it,
"ello" and "%s wor".  This is not a very likely outcome, but it's good for
this example.  That dictionary, on a little endian machine, looks like this:
04 00 09 00 e l l o 00 % s 20 w o r 00
(just two entries, first one starting at offset 4, second at offset 9).

The compaction tool turns the string into this:
H 10 00 , 20 10 01 l d 0A 00
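
To check the example by hand, here is a sketch of the expansion step as a standalone function. It is not how format_decode() would really be modified, and dict_expand() and its parameters are hypothetical. It assumes little-endian offsets, and it always reads the byte after DLE as an index, even when that byte is 00, which is what the compacted string above requires.

```c
#include <stddef.h>
#include <stdint.h>

#define DICT_MARKER  0x10 /* ASCII DLE */
#define DICT_LITERAL 0xFF /* 10 FF means a literal DLE */

/*
 * Expand a compacted string into out and NUL-terminate it.
 * Returns the expanded length, or -1 on overflow or a bad index.
 * Only a NUL that does not directly follow a DLE ends the input.
 */
long dict_expand(const uint8_t *dict, size_t nr_entries,
		 const uint8_t *in, char *out, size_t outsize)
{
	size_t o = 0;

	while (*in) {
		const uint8_t *entry;
		uint8_t idx;

		if (*in != DICT_MARKER) {
			if (o + 1 >= outsize)
				return -1;
			out[o++] = (char)*in++;
			continue;
		}
		idx = in[1];
		in += 2;
		if (idx == DICT_LITERAL) {
			if (o + 1 >= outsize)
				return -1;
			out[o++] = DICT_MARKER;
			continue;
		}
		if (idx >= nr_entries)
			return -1;
		/* Little-endian 16-bit offset from the start of dict. */
		entry = dict + (dict[2 * idx] | (dict[2 * idx + 1] << 8));
		for (; *entry; entry++) {
			if (o + 1 >= outsize)
				return -1;
			out[o++] = (char)*entry;
		}
	}
	if (o >= outsize)
		return -1;
	out[o] = '\0';
	return (long)o;
}
```

Feeding it the two-entry dictionary and the compacted bytes above should give back the original "Hello, %s world\n", with the %s still in place for printk() to process as usual.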

This example demonstrates that you can put a %s in a dictionary entry. I don't think we need the ability to refer to one dictionary entry from another. I also don't want to allow the argument to a "%s" format string to be compacted -- this might introduce a vulnerability if an attacker somehow gets a DLE byte into an argument.