I believe that powerpc should be able to, in theory, compile this program that uses a common pattern in interpreters and parsers of tail calling a pointer from a lookup table:
https://godbolt.org/z/7P63fGMzW
@JUMP_TABLE = dso_local constant [2 x ptr] [
ptr dso_local_equivalent @increment,
ptr dso_local_equivalent @decrement
], align 8
define dso_local i32 @increment(ptr %stream.0, i64 %stream.1, i32 %state) unnamed_addr {
start:
%_3 = add i32 %state, 1
%0 = musttail call i32 @dispatch(ptr %stream.0, i64 %stream.1, i32 %_3)
ret i32 %0
}
define dso_local i32 @decrement(ptr %stream.0, i64 %stream.1, i32 %state) unnamed_addr {
start:
%_3 = sub i32 %state, 1
%0 = musttail call i32 @dispatch(ptr %stream.0, i64 %stream.1, i32 %_3)
ret i32 %0
}
define dso_local i32 @dispatch(ptr %stream.0, i64 %stream.1, i32 %state) unnamed_addr {
start:
%_5 = icmp uge i64 %stream.1, 1
br i1 %_5, label %bb2, label %bb1
bb1:
%_3 = icmp eq i64 %stream.1, 0
call void @llvm.assume(i1 %_3)
ret i32 %state
bb2:
%rest.0 = getelementptr inbounds nuw i8, ptr %stream.0, i64 1
%rest.1 = sub i64 %stream.1, 1
%1 = load i8, ptr %stream.0, align 1
%_11 = trunc nuw i8 %1 to i1
%_12 = zext i1 %_11 to i64
%_14 = icmp ult i64 %_12, 2
call void @llvm.assume(i1 %_14)
%slot = getelementptr inbounds [2 x ptr], ptr @JUMP_TABLE, i64 0, i64 %_12
%f = load ptr, ptr %slot, align 8
%2 = musttail call i32 %f(ptr %rest.0, i64 %rest.1, i32 %state), !callees !0
ret i32 %2
}
!0 = !{ptr @increment, ptr @decrement}
Currently this program fails because
fatal error: error in backend: failed to perform tail call elimination on a call site marked musttail
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: /opt/compiler-explorer/clang-assertions-trunk/bin/clang++ -g -o /app/output.s -masm=intel -fno-verbose-asm -S -x ir -fcolor-diagnostics -fno-crash-diagnostics -O3 -target powerpc64le-unknown-linux-gnu <source>
1. Code generation
2. Running pass 'Function Pass Manager' on module '<source>'.
3. Running pass 'PowerPC DAG->DAG Pattern Instruction Selection' on function '@increment'
Tail calls on powerpc are tricky due to the TOC. My understanding of this mechanism is that a per-object side table is used to keep instruction encodings smaller. A pointer to this table, the TOC base, is stored in a register. When a call is made to a function in a different object, the TOC base is swapped to the one of the object, the call is performed, and then the TOC is restored. The trouble with tail calls is that there is no way to restore the TOC base.
The actual implementation is quite conservative, excluding all indirect calls from being promoted to tail calls. However in this example the !callees metadata combined with all the functions being dso_local means that the calls can be resolved. Hence, I believe, but am not sure, that powerpc (both 32-bit and 64-bit) could compile this program.
The above is a simplified version of the implementation that many interpreters use (e.g. wasmtime, cpython). Using tail calls duplicates the branching logic which helps the branch predictor (on modern hardware this provides a small but still significant benefit), and makes it possible to optimize small functions in isolation. These are valuable programming tools, and it would be neat if powerpc could make use of them.
Mostly I'm interested in whether this is feasible at all. Even if it is, actually implementing it appears non-trivial. After some quick experimentation, there are several places where it is implicitly assumed that an indirect call cannot tail call (and vice versa).
This issue was reported at rust-lang/rust#153827 where PowerPC is unable to compile a basic practical example of how tail calls work. It seems possible for rustc to provide the !callees annotation, but currently that would not be used.
Guaranteed tail calls have a long history on PowerPC, see e.g. #63214.
I believe that powerpc should be able to, in theory, compile this program that uses a common pattern in interpreters and parsers of tail calling a pointer from a lookup table:
https://godbolt.org/z/7P63fGMzW
Currently this program fails because
Tail calls on powerpc are tricky due to the TOC. My understanding of this mechanism is that a per-object side table is used to keep instruction encodings smaller. A pointer to this table, the TOC base, is stored in a register. When a call is made to a function in a different object, the TOC base is swapped to the one of the object, the call is performed, and then the TOC is restored. The trouble with tail calls is that there is no way to restore the TOC base.
The actual implementation is quite conservative, excluding all indirect calls from being promoted to tail calls. However in this example the
!calleesmetadata combined with all the functions beingdso_localmeans that the calls can be resolved. Hence, I believe, but am not sure, that powerpc (both 32-bit and 64-bit) could compile this program.The above is a simplified version of the implementation that many interpreters use (e.g. wasmtime, cpython). Using tail calls duplicates the branching logic which helps the branch predictor (on modern hardware this provides a small but still significant benefit), and makes it possible to optimize small functions in isolation. These are valuable programming tools, and it would be neat if powerpc could make use of them.
Mostly I'm interested in whether this is feasible at all. Even if it is, actually implementing it appears non-trivial. After some quick experimentation, there are several places where it is implicitly assumed that an indirect call cannot tail call (and vice versa).
This issue was reported at rust-lang/rust#153827 where PowerPC is unable to compile a basic practical example of how tail calls work. It seems possible for
rustcto provide the!calleesannotation, but currently that would not be used.Guaranteed tail calls have a long history on PowerPC, see e.g. #63214.