gccemacs

Table of Contents

1 gccemacs

<2019-11-27 Wed>

gccemacs is a modified Emacs capable of compiling and running Emacs Lisp as native code in form of re-loadable elf files. As the name suggests this is achieved blending together Emacs and the gcc infrastructure.

The scope of this project is to experiment and discuss the proposed compiler design on the emacs-devel mailing list (part1, part2).

I've no current plan to have this as a long standing maintained out of tree Emacs fork.

Main tech goals for gccemacs are:

  • show that elisp run-time can be considerably improved by the proposed compiler approach.
  • test for a viable and non disruptive way to have a more capable lisp implementation.

Having elisp self-hosted would enable not only a faster environment but also an easier Emacs to be modified requiring less C code to be written.

The actual code can be found here:

https://gitlab.com/koral/gccemacs

1.1 Not a jitter

gccemacs is not a jitter. The compilation output (.eln files) are elf files suitable for being reloaded between different sessions.

I believe this solution has advantages over the jitter approach not having the duty to recompile at every start-up the same code.

This is a key advantage especially for an application sensitive to the start-up time as Emacs is. Moreover more time can be spent for advanced compile time optimizations.

Some complication to have the system re-loadable exists but is worth paying.

A further advantage of this solution is the potential for reducing load-time. Optimizing for load time should be further investigated.

1.2 libgccjit

Plugging into the gcc infrastructure is done through libgccjit.

Despite what the name suggest libgccjit is usable not just for making jitters but for any other kind of compiler.

libgccjit can be seen as a way to drive the gcc in form of shared library using function calls. In other words this is probably the 'cheapest' way to write what is technically named a front-end in compiler jargon.

1.3 Usage

1.3.1 Which libgccjit

libgccjit was added to gcc 5 thanks to David Malcolm. gccemacs should not be using any recently added entry point therefore in theory any libgccjit should be fine. BUT:

  • gcc 9 was affected by PR91928 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91928. I fixed it for the trunk and did the back port into the 9 branch so as of revision 276625 also gcc 9 should be functional again. gcc 8 and earlier should not be affected but I had no time to do any serious test with them.
  • gcc 7 shipped with Ubuntu 18.04 was reported not to pass gccemacs tests.

Consider these information if you want to try the libgccjit shipped by your distribution.

I suggest a recent gcc 9 if you are going to compile. Please report any success or failure of using gccemacs with different versions of gcc otherwise.

libgccjit compilation instructions and doc is here: https://gcc.gnu.org/onlinedocs/jit/index.html

1.3.2 Compilation

Once libgccjit is installed a normal autogen.sh + configure + make should do the job for gccemacs.

Once compiled, the native compiler tests can be run with:

~/emacs/src/emacs -batch -l ert -l ~/emacs/test/src/comp-tests.el -f ert-run-tests-batch-and-exit

(adjust path accordingly).

1.3.3 Entry points

gccemacs adds two main entry points for compilation:

  • native-compile
  • native-compile-async

Some special variables influence compilation most notably comp-speed. This can range from 0 to 3 as in Common Lisp style.

See the doc for details.

Once a compilation unit is compiled the output is an .eln file. This similarly to an .elc and can be loaded by require or manually.

A lisp file can be compiled as follows:

(native-compile-async "file.el")

Once the compilation completes, the produced file.eln can be loaded conventionally.

Please note that file.el has to be lexically bound in order to be compiled.

A native compiled function foo should then present on a describe-function something like:

foo is a built-in function in ‘C source code’.

A more common usage is to compile more files or entire folders. The following example will compile asynchronously all the installed packages (the number 4 is equivalent to -j4 for make).

(native-compile-async "~/.emacs.d/elpa/" 4 t)

Progress can be monitored into the *Async-native-compile-log* buffer.

1.4 A foreword about why optimizing outside gcc

While I'm writing I recall few (I think good) reasons why I decided to replicate some optimizing compiler algorithmics outside gcc:

  • The gcc infrastructure has no knowledge that a call to a primitive (say Fcons) will return an object with certain tag bits set (in this case say is a cons). The only feasible way to inform it would be to use LTO but I've some doubts this would be practical for this application.
  • I decided to layout the activation frame with an array to hold lisp objects local to the function. This not to have to move and store these whenever a call by ref is emitted. To avoid having the compiler clobber all the local objects in the activation frame every time one of these function calls is emitted, I decided to lift the objects that are not involved by this kind of calls in the equivalent of conventional automatic local variables. To do this I need to propagate the ref property within the control flow graph.
  • Gcc infrastructure has no knowledge of what lisp pure functions are therefore this is a barrier for its propagation.
  • Gcc does not provide help for boxing and unboxing values.
  • The propagation engine can be used for giving warnings and errors at compile time.

1.5 Internals

gccemacs compilation process can be divided into 3 main phases:

  • byte-compilation

    This is the conventional Emacs byte compilation phase, is needed because the output of the byte compiler is used as input for the following phase.

  • middle-end

    Here the code goes through different transformations. This is implemented into lisp/emacs-lisp/comp.el

  • back-end

    This defines inliner and helper functions and drives libgccjit for the final code generation. It is implemented in src/comp.c.

1.5.1 Passes

The gccemacs native compiler is divided in the following passes:

  • spill-lap:

    The input used for compiling is the internal representation created by the byte compiler (LAP). This is used to get the byte-code before being assembled. This pass is responsible for running the byte compiler end extracting the LAP IR.

  • limplify:

    The main internal representation used by gccemacs is called LIMPLE (as a tribute to the gcc GIMPLE). This pass is responsible for converting LAP output into LIMPLE. At the base of LIMPLE is the struct comp-mvar representing a meta-variable.

       (cl-defstruct (comp-mvar (:constructor make--comp-mvar))
         "A meta-variable being a slot in the meta-stack."
         (slot nil :type fixnum
               :documentation "Slot number.
       -1 is a special value and indicates the scratch slot.")
         (id nil :type (or null number)
             :documentation "SSA number when in SSA form.")
         (const-vld nil :type boolean
                    :documentation "Valid signal for the following slot.")
         (constant nil
                   :documentation "When const-vld non nil this is used for holding
        a value known at compile time.")
         (type nil
               :documentation "When non nil indicates the type when known at compile
    time.")
         (ref nil :type boolean
              :documentation "When t the m-var is involved in a call where is passed by
        reference."))
    

    As an example the following LIMPLE insn assigns to the frame slot number 9 the symbol a (found as third element into the relocation array).

    (setimm #s(comp-mvar 9 1 t 3 nil) 3 a)
    

    As other example, this is moving the 5th frame slot content into the 2nd frame slot.

    (set #s(comp-mvar 2 6 nil nil nil nil) #s(comp-mvar 5 4 nil nil nil nil))
    

    At this stage the function gets decomposed into basic blocks being each of these a list of insns.

    In this phase the original push & pop within the byte-vm stack is translated into a sequence of assignments within the frame slots of the function. Every slot into the frame is represented by a comp-mvar. All this spurious moves will be eventually optimized. It is important to highlight that doing this we are moving all the push and pop machinery from the run-time into the compile time.

  • ssa

    This is responsible for porting LIMPLE into SSA form. After this pass every m-var gets a unique id and is assigned only once.

    Contextually also phi functions are placed.

  • propagate

    This pass iteratively propagates within the control flow graph for each m-var the following properties: value, type and if the m-var will be used for a function call by reference.

    Propagate removes also function calls to pure functions when all the arguments are or become known during propagation.

    NOTE: this is done also by the byte optimizer but propagate has greater chances to succeeds due to the CFG analysis itself.

  • call-optim

    This is responsible for (depending on the optimization level) trying to emit call that do not go through the funcall trampoline.

    Note that after this all calls to subrs can have equal dignity despite the fact that they originally got a dedicate byte-op-code ore not.

    Please note also that at speed 3 the compiler is optimizing also calls within the same compilation unit. This saves from the trampoline cost and allow gcc for further in-lining and propagation but makes this functions non safely re-definable unless the whole compilation unit is recompiled.

  • dead-code

    This tries to remove unnecessary assignments.

    Spurious assignment to non floating frame slots cannot be optimized out by gcc in case the frame is clobbered by function calls with the frame used by reference so is good to remove them.

  • final

    This drives LIMPLE into libgccjit IR and invokes the compilation. Final is also responsible for:

    • Defining the inline functions that gives gcc visibility on the lisp implementation.
    • Suggesting to them the correct type if available while emitting the function call.
    • Lifting variables from the frame array if these are to be located into the floating frame.

1.6 Compiler hints

Having realized it was quite easy to feed the propagation engine with additional information I've implemented two entry points.

These are:

  • comp-hint-fixnum
  • comp-hint-cons

As an example this is to promise that the result of the following expression evaluates to a cons:

(comp-hint-cons x)

Propagation engine will use this information to propagate for all the m-vars influenced by the evaluation result of (comp-hint-cons x) in the control flow graph.

At comp-speed smaller then 3 the compiler hints will translate into assertions verifying the property declared.

These low level primitives could be used to implement something like the CL the.

1.7 Type checks removal

This section applies to the (few) functions exposed to the gcc, namely: car, cdr, setcar, setcdr, 1+, 1-, - (Fnegate).

At comp-speed 3 type checks will not be emitted if the propagation has proved the type to be the expected one for the called function.

1.8 Debugging the compiler

The native compiler can be debugged in several ways but I think the most propaedeutic one is to set comp-debug to 1. In this way the compiler will depose a .c file aside the .eln.

This is not a true compilable C file but something sufficiently close to understand what's going on. Also debug symbols are emitted into the .eln therefore is possible to trap or step into it using gdb.

Note that comp-debug set to 1 has a negative impact on compile time.

1.9 Quick performance discussion

In order to have something to optimize for and measure we have put together a small collection of elisp (nano and non) benchmarks.

https://gitlab.com/koral/elisp-benchmarks

The choice of the benchmarks and their weight is certainly very arbitrary but still… better than nothing.

All benchmarks can be considered u-benchmarks except nbody and dhrystone. Worth nothing that these last two has not been tuned specifically for the native compiler.

On the machine I'm running (i7-6600U) the latest performance figure I've got is the following:

name byte-code native-bench native-all native-all vs.
  (sec) (sec) (sec) byte-code
bubble 30.44 6.54 6.29 4.8x
bubble-no-cons 42.42 10.22 10.20 4.2x
fibn 39.36 16.96 16.98 2.3x
fibn-rec 20.64 8.06 7.99 2.6x
fibn-tc 19.43 6.33 6.89 2.8x
inclist-tc 20.43 2.16 2.18 9.4x
listlen-tc 17.66 0.66 0.70 25.2x
inclist-no-type-hints 45.57 5.60 5.81 7.8x
inclist-type-hints 45.89 3.67 3.70 12.4x
nbody 112.99 23.65 24.44 4.6x
dhrystone 112.40 64.67 47.12 2.4x
tot 507.23 148.52 132.3 3.8x

Column explanation:

  • byte-code benchmark running byte compiled (baseline).
  • native-bench benchmark running native compiled on a non native compiled emacs.
  • native-all benchmark running with the benchmark itself and all emacs lisp native compiled.

All benchmarks has been compiled with comp-speed 3 while the Emacs lisp for the native-all column with comp-speed 2.

Here follows a very preliminary and incomplete performance discussion:

bubble bubble-no-cons fibn u-benchmarks shows fair results with no specific tuning.

Recursive benchmarks (suffixed with -rec or -tc) shows an enormous uplifts most likely dominated by much faster calling convention. Note that no TCO here is performed.

inclist-no-type-hints and inclist-type-hints shows a considerable performance uplift just due to type checks being optimized out thanks to the information coming from the type hints.

nbody is ~4.6x faster with no lisp specific optimization triggered. This benchmark calls only primitives (as all previous u-benchs) therefore having all Emacs native compiled does not gives any advantage. Probably intra compilation unit call optimization plays a role into the result.

On the other side dhrystone benefits quite a lot from the full Emacs compilation having to funcall into store-substring and string>.

Hope this indicates a potential for considerable performance benefits.

1.10 Status and verification.

A number of tests is defined in comp-tests.el. This is including a number of micro tests (some of these inherited by Tom Tromey's jitter) plus a classical bootstrap test were the compiler compiles itself two times and the binary is compared.

A couple of colleagues and me are running in production gccemacs with all the Emacs lexically scoped lisp compiled plus all the elpa folder for each of our setups.

While I'm writing this I've loaded on this instance 18682 native compiled elisp functions loaded from 669 .eln files.

So far this seems stable.

1.11 Current limitations

The following limitations are not due to design limitations but just the state of a "work in progress":

  • Only lexically scoped code is compiled.
  • Interactive functions are included into eln files but as byte-compiled. (2)
  • No doc string support for native compiled functions. (2)
  • Limited lisp implementation exposure: Just few function are defined and exposed to gcc: car, cdr, setcar, setcdr, 1+, 1-, - (Fnegate). This functions can be fully in-lined and optimized.
  • Type propagation:

    Very few functions have a return type declared to the compiler (see comp-known-ret-types). Moreover type propagation can't relay on the CL equivalent of sub-type-p.

  • Garbage collection support has to be done:

    Once a .eln file is loaded it is never unloaded even if the subrs there defined are redefined.

  • No integration with the build system:

    No native compilation is automatically triggered by the build process.

  • Native compiler is not re-entrant:

    Just top level functions are native compiled, the others (lambda included) are still kept as byte-code.

1.12 Further compiler improvements

  • Better lisp exposure

    As mentioned in the previous paragraph the lisp machinery exposed to the gcc infrastructure is at this point quite limited and should be improved.

  • Unboxing

    Introducing the unboxing mechanism was considered out of scope for this investigation but is certainly possible to extend the current infrastructure to have it. Unboxing would have a big impact on performance. Combined with compiler hints should be possible to generate efficient code closing some more gap with C (in some cases probably all).

1.13 Long term

  • Further improving load time performance using the portable dumper for object de/serialization within elns (currently the print read mechanism is in use).
  • Provide better compile time warning and errors relying on the propagation engine?
  • The real final goal would be to be able to assemble the Emacs image having the native compiled functions replacing all the byte-compiled one out of the box.

1.14 Final word

All of this is very experimental. Take it (at your risk) as what it is, a proof of concept.

Feedback is very welcome.

2 Update 1

<2019-12-08 Sun>

I've implemented the support for docstrings and interactive functions.

As a consequence as of today all lexically scoped Elisp functions can be native compiled with no exceptions.

Ex: this is the output I get for C-h f org-mode:

org-mode is native compiled Lisp function in ‘org.el’.

(org-mode)

  Parent mode: ‘outline-mode’.
  Probably introduced at or before Emacs version 22.1.

Outline-based notes management and organizer, alias
"Carsten’s outline-mode for keeping track of everything."

Org mode develops organizational tasks around a NOTES file which
contains information about projects as plain text.  Org mode is
...

As expected clicking on org.el correctly drives me to the original location in the source code.

Being native compiled org-mode function cell satisfies subrp.

(subrp (symbol-function #'org-mode)) => t

I'll try to keep the /master branch as the one with the most updated working state for the project.

Please fire a mail to akrl AT sdf DOT org in case you find it to be broken.

Author: Andrea Corallo

Created: 2019-12-09 Mon 08:25

Validate