I got bogged down in the messy detail of handling the AVR program space functions, especially when trying to make the library also compile and run on the ESP8266. The Arduino core for ESP8266 provides similar-looking functions, but they don't behave exactly the same. I also ran into some messiness when trying to use them inside a C++ class, in ways I should have realised were verboten.
Regardless, now if you pass the library a string type argument stored in program space it marks it as such and accesses it differently, depending on the processor. This includes using different methods to find the length of strings for centring and so on.
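To make that concrete, here is a minimal sketch of the idea rather than the library's actual code. The names `TextRef`, `ramText`, `progmemText`, `textLength`, `textCharAt` and `centredStart` are all made up for illustration, and the non-Arduino fallback macros are only there so the sketch can be compiled and checked on a desktop machine. The real calls it leans on are `strlen_P` and `pgm_read_byte`, which both the AVR and ESP8266 cores provide.

```cpp
#include <stddef.h>
#include <string.h>

#if defined(ARDUINO)
  #include <Arduino.h>  // brings in PROGMEM, strlen_P and pgm_read_byte
#else
  // Assumption: desktop build for testing only, where memory is flat
  // and "program space" is just ordinary memory.
  #define PROGMEM
  #define strlen_P(s) strlen(s)
  #define pgm_read_byte(p) (*(const unsigned char *)(p))
#endif

// Hypothetical tagged pointer: remembers where the text lives.
struct TextRef {
  const char *ptr;
  bool inProgmem;
};

inline TextRef ramText(const char *s)     { return TextRef{s, false}; }
inline TextRef progmemText(const char *s) { return TextRef{s, true}; }

// Length lookup picks the accessor that matches the storage.
inline size_t textLength(TextRef t) {
  return t.inProgmem ? strlen_P(t.ptr) : strlen(t.ptr);
}

// Single character read, again routed by storage type.
inline char textCharAt(TextRef t, size_t i) {
  return t.inProgmem ? (char)pgm_read_byte(t.ptr + i) : t.ptr[i];
}

// Start column that centres the text in a field `width` characters wide;
// returns 0 when the text is as wide as, or wider than, the field.
inline size_t centredStart(TextRef t, size_t width) {
  size_t len = textLength(t);
  return len >= width ? 0 : (width - len) / 2;
}

// Example string placed in program space (plain memory on a desktop build).
const char kTitle[] PROGMEM = "Hello";
```

With this shape, the drawing code only ever sees a `TextRef`, so something like `centredStart(progmemText(kTitle), 16)` works the same way whether the text sits in flash or in RAM.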
Time-consuming, fiddly stuff, but it had to be done, and finally you can run the example code on an Arduino Nano again, consuming only 1124 of 2048 bytes of dynamic memory. The sketch does use most of the program storage, which is down to the large amount of text stored there. It runs stably, something you can easily compromise when throwing lots of string data around with only a small margin on the stack.
I've enjoyed this learning exercise, but I now need to test my library with other architectures supported by the Arduino IDE, such as a Teensy with an ARM Cortex-M4. In these cases it shouldn't be so necessary to store small chunks of text in program storage, but I would like to be sure that it works. I believe I have a Teensy 3.2 somewhere.