2020-07-12 02:52:58 hazel[m] yk what 2020-07-12 02:53:08 hazel[m] i am going to write a lisp that compiles to laser tomorrow 2020-07-12 02:53:12 hazel[m] solely as a bit
TLDR: i designed a minimal low-level intermediate representation language for lowering to arbitrary targets. it's called (predictably)
shark-compiler and the code is available at https://git.lain.faith/haskal/shark-compiler/ (with thanks to hazel for helpful contributions). the version from when this blog post was written can be found at https://git.lain.faith/haskal/shark-compiler/src/commit/45920d016132b5f984fc527d2bff20038ef346c6
DISCLAIMER: i have not read the dragon book yet. i'm sorry. i know it has a dragon on it, which makes it mandatory reading
the main motivation behind shark-compiler LLIR is retargetability. i wanted it to be as easy as possible to plug minimal target-specific code into the compiler system and have it able to emit working code (by "target" i mean some language or assembly or lambda calculus or whatever that you are able to compile programs for). the goal is kind of met right now because right now it compiles this weird befunge-like language and the non-abstracted targeting part of the code is 125 lines, which could probably be improved with more abstraction.
ELVM (https://github.com/shinh/elvm) is a project that provides compiler infrastructure for compiling to esoteric languages. unfortunately, it's written in C and retargeting seems overly difficult, judging from a couple of the target implementations i read. but if you're looking for a real project that isn't a toy that still has most actual functionality missing, look to ELVM probably.
okay C is pretty bad for sure but haskal why did you write this in Racket aaaaaaaa
i like Racket, shush
first, here's the spec for LLIR
;; LLIR ;; defs ;; (global x) ;; (const x) ;; TODO ;; (set! x immediate) ;; ;; the only thing that accepts immediates is set! ;; all other ops need globals or consts ;; all globals, constants, and labels share a single namespace. ;; for example, there cannot be a global named helloworld and a label also named helloworld ;; ;; branching ;; (label x) ;; (goto x) ;; (ifeq a b x y) ;; note: implement ifne by swapping x,y ;; (iflt a b x y) ;; note: implement ifgt by swapping a,b ;; ifle by swapping a,b and x,y ;; ifge by swapping x,y ;; (terminate) ;; ;; arithmetic ;; (add! a b) a += b ;; (sub! a b) a -= b ;; (mul! a b) etc ;; (div! a b) ;; (mod! a b) ;; bitwise ;; (and! a b) ;; (or! a b) ;; (xor! a b) ;; (invert! a) ;; ;; target language ;; passes all args to the target assembler ;; (asm! ...) ;; (asmgoto! [x y] ...) ;; impl-defined branching operation ;; (prelude) implementation-defined prelude
as a summary, here's what LLIR provides
- variables (they're all global all the time, for simplicity) and constants (with important note on the constants,) and it's able to set variables to an immediate value. note word size is not specified. LLIR really doesn't try to make any guarantees on properties like word size because those could get annoying to actually provide. instead, the idea is to work with the target properties that get exposed through LLIR to higher level code.
- branching and conditionals (unconditional goto, and conditional branching based on = and <). termination is also counted as a "branch"
- basic arithmetic. this part is really target-defined for the most part, and all the operations might not be necessarily available
- target-specific "inline assembly" - a plain variant and a variant that is allowed to include branching. finally, an internal prelude op that is for target-specific setup - think of it as the opposite of terminate
besides leaving properties like word size unspecified, it also specifically doesn't provide any concept of arrays or stacks in memory, only single variables. this is a current design choice that is very likely to change in the future but i made it this way right now for simplicity1. this means LLIR is concretely not Turing-complete (as opposed to e.g. "theoretically Turing-complete but limited by available computer RAM") but this is useless pedanticism :)
how do you lower shark-LLIR into target code? this is the approximate flow
- basic source transformation to enable better machine-processability
- convert flat IR listing to a graph of basic blocks
- invoke targeting code to assemble each node of the BB graph into target code
- simplify, prune, and flatten the BB graph into an ordered list of blocks
- invoke the target "linker" to lay out this flat assembled block list in the target's code format
the compilation process is abstracted such that targeting code only has to provide three operations
- assemble: takes a basic block of LLIR and emits a basic block of target instructions
- merge: combines two basic blocks of target instructions into one block that contains the instructions of the first one followed by the second. this is an operation only because i didn't want to make any assumptions about the data type of the actual target instructions. sure they're probably just strings, but what if you wanted to target piet?
- link: takes a list of target instruction basic blocks and emits the final program in target format
when these operations are called, there is an effort to provide as much preprocessing as possible to make the rest of the work as easy as it could be. in particular, variable names are replaced with sequential numeric references because for most languages where memory is represented as a tape or stack, numbers directly correlate with the locations in memory. and targeting code doesn't need to worry about what the original variable names were. also, basic blocks are numbered for the linking step, which makes it easier to point branches at the blocks when linking the target program
assembling is the easier targeting step. it's where you take one BB, with a list of LLIR goodies in it and convert that to a list of target instructions, with the tail branch instruction. the easy way to do this is to expect some certain target state to be set at the beginning and end of each LLIR instruction and maintain that state. for example, the current demo compiles to a befunge-like language with multiple stacks, and the design choice for compiling i made was that there would be one input stack, one output stack, one variables stack, and one scratch space stack, and at the beginning of each LLIR instruction it would be expected that the cursor is currently on the input stack. so each instruction gets compiled as 1) focus the necessary stack to start the operation, usually the variables stack 2) do the operation 3) re-focus the input stack. additionally, to actually access individual variables on the variables stack, there is a target "stack rotate" instruction that is used, and the targeting makes sure the variables stack is always rotated to the same place for each LLIR instruction.
now this is a really inefficient way to do things. most assembly steps involve setting up some target state in this way, then doing the required operation, and then resetting the target state, which when chained together ends up in a lot of useless state setting and resetting. i ignore this for now, because the point is not efficiency or speed but ease of targeting, and making this "consistent target state" assumption makes assembly really easy. any sort of actual target state abstraction between sequential LLIR instructions would make target assembly harder, but i might try it in the future.
the abstracted parts
ok let's back up, what is a "basic block graph" anyway?
the basic block graph is a representation of the program that's simplified so it's easy to reason about for compiling. the rules of the LLIR basic block graph are that 1) there is an entry BB 2) each BB contains a body of instructions that do not branch 3) each BB ends with exactly one instruction that performs a branch, which includes terminating. the tail branch could branch to zero or more other BBs to execute next, which can include the current BB. so, each node of this graph is a coherent set of instructions that execute in sequence, and then one branch that moves to a different node of the graph. the initial graph is built by going through the input LLIR listing and splitting off a new basic block every time a label or branching instruction is encountered, such that we end up with a graph of BBs that each end with exactly one LLIR branch instruction.
this gets a bit complicated after the body of each BB is assembled from LLIR to target code. because depending on the target, branches could require a sequence of target instructions before and after the branch occurs. for example, in the demo implementation each comparison-based branch requires 1) fetching the variables for the comparison, 2) doing the comparison, 3) doing the branch based on the result, and then 4) whether the branch is taken or not the result of the comparison needs to be deleted afterwards. the before part is easy, that can just be included with the rest of the body. the after part, which i call "cleanup" gets annoying because there can't be any instructions after the branch of a BB. so, part of the compilation splits off the cleanup steps into their own BBs, which restores the property of BBs ending in a branch only. since this complicates the graph a lot, there is also a step that does basic simplification by merging BBs (like, the inverse of edge-splitting). if there are two BBs A and B such that A only has an outward edge to B and B only has an inward edge from A, then A and B can be merged. this is where the merge operation comes in.
so now, there's a BB graph with actual target instructions, converted out of LLIR. the final step is what i call "linking" and this is entirely target-dependent. the goal of "linking" is to actually point the branches from each basic block at the actual place they're supposed to go instead of abstractly representing it in a graph. there are a few types of linking that could be required, as examples
- your typical CPU assembly sort of thing where everything goes in one linear blob and branches are based on relative offsets which you need to calculate
- the whole execution model is two dimensional, like befunge
- the execution model doesn't provide arbitrary enough branching capabilities, like brainfuck, so you actually need to emulate the BB graph using a state machine
i actually got really worried that i needed to re-architect this entire thing specifically for brainfuck linking, but it turns out that's not necessary because you can do the CFG emulation thing. and this should be generalizable to any other targets with wacky control flow too. (thanks to iitalics for help with this part 🦈)
right now for this entire step you're basically on your own, but i will elaborate on various implementation ideas (the demo currently does the befunge-style thing)
in befunge-like languages, execution can flow in any of the four cardinal directions. normally, it keeps going in the same direction but it can be changed by unconditional direction-changing instructions as well as conditional ones (the branches). so a linking strategy that can be used is to put each basic block on every 3rd line, and have 2 other lines as "return" lines from that basic block, which could have up to 2 return paths (ignoring potentially fancier instructions which you don't need). then, on the left side of the program implement a "switching matrix" with a diagonal sequence of
> such that entering any column i of that region will move execution to the _i_th basic block line. it looks like this
> basic block 1 code here bb 1 return path 1 bb 1 return path 2 > basic block 2 code here bb 2 return path 1 bb 2 return path 2 > basic block 3 code here bb 3 return path 1 bb 3 return path 2
now if bb 2 return path 1 needs to go to bb 1 you insert the corresponding switch character in the matrix
> basic block 1 code here bb 1 return path 1 ^ bb 1 return path 2 > basic block 2 code here bb 2 return path 1 bb 2 return path 2 > basic block 3 code here bb 3 return path 1 bb 3 return path 2
and that's basically it, this is how you can connect arbitrary BBs to each other. incidentally, this is exactly what the demo currently does
regular assembly style
this is fairly straightforward. assume each branch instruction has a fixed size (if not, choose the largest possible size and pad with nops or something). compute the size of each BB body including the tail branch instructions. then compute the offsets for each basic block in the final blob by summing the sizes in order. finally, add the branch instructions with the offsets to the actual target BB that are needed, and concatenate all the blocks.
ah, brainfuck. fuck brainfuck tbh
brainfuck only has
] and that's about it, so you can't actually lower an arbitrary BB graph with arbitrary branches between the blocks. but you can emulate the graph with a state machine. first, dedicate a tape location to the current state and one to next state. assume state 0 is terminate, and otherwise each basic block is numbered with a sequential state. initialize the state to the entry BB number. have an outer loop on the state variable (which will exit once the state is zero), and inside the loop, insert each BB guarded such that it will only execute if the state variable matches its state number - and the rest of the blocks get skipped. each block writes the next state that will be executed to the next state variable. finally, copy the next state to the current state and outer loop.
so in my opinion, LLIR is a bit nicer to program in than annoying esoteric targets, but it's still obviously low level. the next step that i plan on is to create a higher level language that emits LLIR, or perhaps just a compiler for an existing high level language to LLIR, and then shark-compiler would be actually a useful thing.
the original goal of this was to provide infrastructure for writing complex languages for arbitrary targets really fast -- particularly for esolang-coding CTF challenges. right now i'm not entirely convinced that writing the targeting code is really worth it compared to just manually coding but i don't know. with some work maybe this could be actually worth using. contributions are welcome 🦈✨
indirect tape access in brainfuck for example is really obtuse and i wanted to avoid having to bloat targeting code with such constructs. you could realistically avoid indirect tape access by always working relatively to a current "stack position" - this would mean if you have stack frames with local variables as well as global variables that are accessed from any stack frame you'd actually need to transform that so the set of "global variables" actually lives in each stack frame and gets copied to the next frame when entering one, and copied back when exiting a frame. this comes with the drawback that you still cannot have any variable-sized regions of tape. sizes must always be known. and if that's okay, this is a totally reasonable way to do things but right now, shark-compiler doesn't do it. perhaps in the future it will be an option