Toy frontend for LLVM written in Rust: a Guide for beginners
translator's note
Given in the article code compiled with old enough versions of mainframes peg and peg_syntax_ext. For current versions in the source code you need to make minimal changes. I pasted the changed sections in spoilers in the text of the article. To build the code, install the nightly Rust compiler.
Full source code with my changes you can download it here: https://github.com/arktur04/rust-llvm-toy-frontend
Full source code with my changes you can download it here: https://github.com/arktur04/rust-llvm-toy-frontend
I am currently working on a compiler that is written in Rust, and generates the LLVM IR. LLVM API looks a bit scary for beginners, and it is not so many tutorials (and they are all in C++, so it is not quite obvious how to do the same to Rust). I wish someone had held my hand when I started all this, and this article is what I would like to show myself at the time.

In Rust the best opportunity to interact with LLVM — through crate llvm-sys. One kind person posted the documentation here. Of course, you should also consider guide for LLVM, since it will help you to understand how LLVM “thinks”. This post is mostly a translation into Rust subsets of this guide.
The full source code for this tutorial is here.
the
Get a working development environment
For beginners, here's a way to start working with LLVM:
the
# `curl` is just so we can next install Rust
sudo apt-get -y install clang llvm curl-3.8-dev
curl https://sh.rustup.rs -sSf | sh
# The `llvm-sys` crate expects something called `llvm-config` on your PATH.
sudo ln-s /usr/bin/llvm-config-3.8 /usr/bin/llvm-config
If you are working on a fresh Ubuntu (you may need apt-get update), all ready, can begin. If not, you can start the virtual machine for Vagrant, using Vagrantfile:
the
Vagrant.configure("2") do |config|
config.vm.box = "./ubuntu 16.04"
end
You can start by running cargo init llvm-example --bin and placing the following(taken from llvm-sys) in src/main.rs:
the
//! Construct a function that does nothing in LLVM IR.
extern crate llvm_sys as llvm;
use std::ptr;
fn main() {
unsafe {
// Set up a context, and module builder in that context.
let context = llvm::core::LLVMContextCreate();
let module = llvm::core::LLVMModuleCreateWithName(b"nop\0".as_ptr() as *const _);
let builder = llvm::core::LLVMCreateBuilderInContext(context);
// Get the type signature for void nop(void);
// Then create it in our module.
let void = llvm::core::LLVMVoidTypeInContext(context);
let function_type = llvm::core::LLVMFunctionType(void, ptr::null_mut(), 0, 0);
let function = llvm::core::LLVMAddFunction(module b"nop\0".as_ptr() as *const _,
function_type);
// Create a basic block in the function and set our builder to generate
// code in it.
let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function,
b"entry\0".as_ptr() as *const _);
llvm::core::LLVMPositionBuilderAtEnd(builder, bb);
// Emit a `ret void` into the function
llvm::core::LLVMBuildRetVoid(builder);
// Dump the module as IR to stdout.
llvm::core::LLVMDumpModule(module);
// Clean up. Values created in the context mostly get cleaned up there.
llvm::core::LLVMDisposeBuilder(builder);
llvm::core::LLVMDisposeModule(module);
llvm::core::LLVMContextDispose(context);
}
}
And in your Cargo.toml:
the
[package]
name = "llvm-example"
version = "0.1.0"
authors = ["Ulysse Carion <ulysse@ulysse.io>"]
[[bin]]
name = "main"
[dependencies]
llvm-sys = "0.2"
You should get the following:
the
vagrant@vagrant:/vagrant$ cargo run
Compiling llvm-example v0.1.0 (file:///vagrant)
Running `target/debug/main`
; ModuleID = 'nop'
define void @nop() {
entry:
ret void
}
Yay! We can begin to write its own programme.
the
a Slightly less trivial program
For a start, compile a program that just returns the exit code by returning an integer from a function main.
Here's how I did it: (soon We will need a parser, and I've added it now; I used crate peg):
Approx. pens.
Text Cargo.toml:
the
the
[package]
name = "llvm-example"
version = "0.1.0"
authors = ["Ulysse Carion <ulysse@ulysse.io>"]
[[bin]]
name = "main"
[dependencies]
llvm-sys = "38"
peg = "0.5.4"
peg-syntax-ext = "0.5.2"
the
#![feature(plugin)]
#![plugin(peg_syntax_ext)]
extern crate llvm_sys as llvm;
use std::ffi::CString;
use std::fs::File;
use std::io::Read;
use std::ptr;
fn main() {
let mut input = String::new();
let mut f = File::open("in.ex").unwrap();
f.read_to_string(&mut input).unwrap();
let parsed_input = parser::program(&input).unwrap();
unsafe {
codegen(parsed_input);
}
}
peg! parser(r#"
#[pub]
program -> String
= i:int_literal "\n" { i }
int_literal -> String
= [0-9]+ { match_str.to_owned() }
"#);
unsafe fn codegen(input: String) {
let context = llvm::core::LLVMContextCreate();
let module = llvm::core::LLVMModuleCreateWithName(b"example_module\0".as_ptr() as *const _);
let builder = llvm::core::LLVMCreateBuilderInContext(context);
// In LLVM, you get your types from functions.
let int_type = llvm::core::LLVMInt64TypeInContext(context);
let function_type = llvm::core::LLVMFunctionType(int_type, ptr::null_mut(), 0, 0);
let function = llvm::core::LLVMAddFunction(module b"main\0".as_ptr() as *const _, function_type);
let entry_name = CString::new("entry").unwrap();
let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function, entry_name.as_ptr());
llvm::core::LLVMPositionBuilderAtEnd(builder, bb);
// The juicy part: construct a `LLVMValue` from a Rust value:
let int_value: u64 = input.parse().unwrap();
let int_value = llvm::core::LLVMConstInt(int_type, int_value, 0);
llvm::core::LLVMBuildRet(builder, int_value);
// Instead of dumping to stdout, let's write out the IR to `out.ll`
let out_file = CString::new("out.ll").unwrap();
llvm::core::LLVMPrintModuleToFile(module, out_file.as_ptr(), ptr::null_mut());
llvm::core::LLVMDisposeBuilder(builder);
llvm::core::LLVMDisposeModule(module);
llvm::core::LLVMContextDispose(context);
}
Approx. pens.
Change in the parser:
the
the
peg! parser(r#"
#[pub]
program -> String
= i:int_literal "\n" { i }
int_literal -> String
= n:$([0-9]+) { n.to_owned() }
"#);
Works! Check:
the
vagrant@vagrant:/vagrant$ cat in.ex
42
vagrant@vagrant:/vagrant$ cargo run
Running `target/debug/main`
vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $?
42
Cool! Here is out.ll:
the
; ModuleID = 'example_module'
define i64 @main() {
entry:
ret i64 42
}
the
Arithmetic
Will add support for addition, subtraction, multiplication, and division of numbers. To do this, we need to extend our grammar. Let's introduce the enum, which represents the AST:
the
pub enum Expr {
Add(Box<Expr> Box<Expr>),
Sub(Box<Expr> Box<Expr>),
Mul(Box<Expr> Box<Expr>),
Div(Box<Expr> Box<Expr>),
Literal(String)
}
We also need to expand the grammar:
the
// `product` and `sum` are that way to get operator precedence
peg! parser(r#"
use super::Expr;
#[pub]
program -> Expr
= e:expression '\n ' {e }
expression -> Expr
= sum
sum -> Expr
= a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
/ product
product -> Expr
= a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
/ int_literal
int_literal -> Expr
= [0-9]+ { Expr::Literal(match_str.to_owned()) }
_ = " "*
"#);
Approx. pens.
Changes to the parser:
the
the
// `product` and `sum` are that way to get operator precedence
peg! parser(r#"
use super::Expr;
#[pub]
program -> Expr
= e:expression '\n ' {e }
expression -> Expr
= sum
sum -> Expr
= a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
/ product
product -> Expr
= a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
/ int_literal
int_literal -> Expr
= n:$([0-9]+) { Expr::Literal(n.to_owned()) }
_ = " "*
"#);
Then we generate code. You can specify strings such as “addtmp” and they will be used as part of the name of the corresponding register in the IR.
the
// When you write out instructions in LLVM, you get back `LLVMValueRef's. You
// can then use these references in other instructions.
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, expr: Expr) -> LLVMValueRef {
expr match {
Expr::Literal(int_literal) => {
let int_type = llvm::core::LLVMInt64TypeInContext(context);
llvm::core::LLVMConstInt(int_type, int_literal.parse().unwrap(), 0)
},
Expr::Add(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("addtmp").unwrap();
llvm::core::LLVMBuildAdd(builder, lhs, rhs, name.as_ptr())
},
Expr::Sub(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("subtmp").unwrap();
llvm::core::LLVMBuildSub(builder, lhs, rhs, name.as_ptr())
},
Expr::Mul(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("multmp").unwrap();
llvm::core::LLVMBuildMul(builder, lhs, rhs, name.as_ptr())
},
Expr::Div(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("divtmp").unwrap();
llvm::core::LLVMBuildUDiv(builder, lhs, rhs, name.as_ptr())
}
}
Now you can execute programs like 10 * 4 + 20 / 2 — 8! Very cool, I think.
the
Variables
We go the easy route and say that our program does various annoying things such as references to undefined variables. We just keep variables in registers, and save them in a HashMap<String, LLVMValueRef>. It works because the program has only one path of execution.
Extensible language and parser:
the
pub enum Expr {
Literal(String)
Ref(String),
Assign(String, Box<Expr>),
Add(Box<Expr> Box<Expr>),
Sub(Box<Expr> Box<Expr>),
Mul(Box<Expr> Box<Expr>),
Div(Box<Expr> Box<Expr>),
}
peg! parser(r#"
use super::Expr;
#[pub]
program -> Vec<Expr>
= e:(expression ** "\n") "\n" { e }
expression -> Expr
= i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) }
/ sum
sum -> Expr
= a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
/ product
product -> Expr
= a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
/ ref_or_literal
ref_or_literal -> Expr
= i:identifier { Expr::Ref(i) }
/ int_literal
identifier -> String
= [a-zA-Z]+ { match_str.to_owned() }
int_literal -> Expr
= [0-9]+ { Expr::Literal(match_str.to_owned()) }
_ = " "*
"#);
Approx. pens.
Change in the parser:
the
the
peg! parser(r#"
use super::Expr;
#[pub]
program -> Vec<Expr>
= e:(expression ** "\n") "\n" { e }
expression -> Expr
= i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) }
/ sum
sum -> Expr
= a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
/ product
product -> Expr
= a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
/ ref_or_literal
ref_or_literal -> Expr
= i:identifier { Expr::Ref(i) }
/ int_literal
identifier -> String
= n:$([a-zA-Z]+) { n.to_owned() }
int_literal -> Expr
= n:$([0-9]+) { Expr::Literal(n.to_owned()) }
_ = " "*
"#);
Then add support for two new expressions:
the
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, names: &mut HashMap<String, LLVMValueRef> expr: Expr) -> LLVMValueRef {
expr match {
// ...
Expr::Ref(name) => {
*names.get(&name).unwrap()
},
Expr::Assign(name, expr) = > {
let new_value = codegen_expr(context, builder, names, *expr);
names.insert(name, new_value);
new_value
},
}
}
And slightly change the function of codegen:
the
let int_type = llvm::core::LLVMInt64TypeInContext(context);
let zero = llvm::core::LLVMConstInt(int_type, 0, 0);
let mut names = HashMap::new();
let mut return_value = zero; // return value on empty program
for expr in input {
return_value = codegen_expr(context, builder, &mut names, expr);
}
llvm::core::LLVMBuildRet(builder, return_value);
Voila! Check:
the
vagrant@vagrant:/vagrant$ cat in.ex
a = 3
b = 76
a + b
vagrant@vagrant:/vagrant$ cargo run
Running `target/debug/main`
vagrant@vagrant:/vagrant$ cat out.ll
; ModuleID = 'example_module'
define i64 @main() {
entry:
ret i64 79
}
the
If statement
if everything becomes more complicated. The easiest way to get it to work is to keep local variables on the stack, and allow LLVM to perform optimization. In LLVM, you create a stack variable with the command alloca, and then read/write commands load/store.
In order to do this, we are again expanding the language and grammar, adding new rules to the parser:
the
expression -> Expr
= if_expression
/ i:identifier _ "=" _ s:expression { Expr::Assign(i, Box::new(s)) }
/ sum
if_expression -> Expr
= "if" _ e:expression _ "{\n" _ then_body:statements _ "}" _ "else" _ "{\n" _ else_body:statements _ "}" {
Expr::If(Box::new(e), then_body, else_body)
}
And add new node type AST:
the
pub enum Expr {
Literal(String)
Ref(String),
Assign(String, Box<Expr>),
Add(Box<Expr> Box<Expr>),
Sub(Box<Expr> Box<Expr>),
Mul(Box<Expr> Box<Expr>),
Div(Box<Expr> Box<Expr>),
If(Box<Expr> Vec<Expr> Vec<Expr>),
}
Finally, the generated code for the expression if:
the
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, func: LLVMValueRef, names: &mut HashMap<String, LLVMValueRef> expr: Expr) -> LLVMValueRef {
expr match {
// ...
Expr::If(condition, then_body, else_body) => {
let condition_value = codegen_expr(context, builder, func, names, *condition);
let int_type = llvm::core::LLVMInt64TypeInContext(context);
let zero = llvm::core::LLVMConstInt(int_type, 0, 0);
// `is_nonzero` is a 1-bit integer
let name = CString::new("is_nonzero").unwrap();
let is_nonzero = llvm::core::LLVMBuildICmp(builder, llvm::LLVMIntPredicate::LLVMIntNE, condition_value, zero, name.as_ptr());
// It's fine to create blocks first, and then fill them in later.
let then_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
let else_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
let merge_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
llvm::core::LLVMBuildCondBr(builder, is_nonzero, then_block, else_block);
llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
let mut then_return = zero;
for expr in then_body {
then_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
let mut else_return = zero;
for expr in else_body {
else_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
// Position the builder so that it's ready to work on the next
// expression.
llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
zero
}
}
}
A lot of code, but it does what we expected. We can now run this program:
the
a = 1
if a {
a = 42
} else {
a = 13
}
a
which produces this IR:
the
; ModuleID = 'example_module'
define i64 @main() {
entry:
%a = alloca i64
store i64 1, i64* %a
%a1 = load i64, i64* %a
%is_nonzero = icmp ne i64 %a1, 0
br i1 %is_nonzero, label %entry2, label %entry3
entry2: ; preds = %entry
store i64 42, i64* %a
br label %entry4
entry3: ; preds = %entry
store i64 13, i64* %a
br label %entry4
entry4: ; preds = %entry3, %entry2
%a5 = load i64, i64* %a
ret i64 %a5
}
However, we are not yet finished. Now our "expression" if is always zero. Instead, if must evaluate to then_return if you put “then”, or else_return otherwise.
How to get LLVM to keep track of which way was made? With the help of node “Phi”. You give the user phi a list of pairs of (unit, value), and then the phi-node will return the value corresponding to the block that was being executed in front of him.
We are finished with if. Noted that we need to update then_block and else_block. This is done in order to get the last block in the construction of “then”/”else”, and formerly then_block was the first unit in the “then”/”else".
the
// ...
// This is mostly the same code as before, just note the new calls to
// `LLVMGetInsertBlock`.
llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
let mut then_return = zero;
for expr in then_body {
then_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let then_block = llvm::core::LLVMGetInsertBlock(builder);
llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
let mut else_return = zero;
for expr in else_body {
else_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let else_block = llvm::core::LLVMGetInsertBlock(builder);
// Insert the phi node
llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
let phi_name = CString::new("iftmp").unwrap();
let phi = llvm::core::LLVMBuildPhi(builder, int_type, phi_name.as_ptr());
let mut values = vec![then_return, else_return];
let mut blocks = vec![then_block, else_block];
llvm::core::LLVMAddIncoming(phi, values.as_mut_ptr(), blocks.as_mut_ptr(), 2);
phi
And here it is, amazing the compiler:
the
vagrant@vagrant:/vagrant$ cat in.ex
a = 1
b = 0
c = if a {
if b {
11
} else {
40
}
} else {
if b {
10
} else {
20
}
}
c + 2
vagrant@vagrant:/vagrant$ cargo run
Running `target/debug/main`
vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $?
42
Cool! Here is the code that is generated for the sample input program:
the
; ModuleID = 'example_module'
define i64 @main() {
entry:
%a = alloca i64
%b = alloca i64
%c = alloca i64
store i64 1, i64* %a
store i64 0, i64* %b
%a1 = load i64, i64* %a
%is_nonzero = icmp ne i64 %a1, 0
br i1 %is_nonzero, label %entry2, label %entry3
entry2: ; preds = %entry
%b5 = load i64, i64* %b
%is_nonzero6 = icmp ne i64 %b5, 0
br i1 %is_nonzero6, label %entry7, label %entry8
entry3: ; preds = %entry
%b10 = load i64, i64* %b
%is_nonzero11 = icmp ne i64 %b10, 0
br i1 %is_nonzero11, label %entry12, label %entry13
entry4: ; preds = %entry14, %entry9
%iftmp16 = phi i64 [ %iftmp, %entry9 ], [ %iftmp15, %entry14 ]
store i64 %iftmp16, i64* %c
%c17 = load i64, i64* %c
%addtmp = add i64 %c17, 2
ret i64 %addtmp
entry7: ; preds = %entry2
br label %entry9
entry8: ; preds = %entry2
br label %entry9
entry9: ; preds = %entry8, %entry7
%iftmp = phi i64 [ 11, %entry7 ], [ 40, %entry8 ]
br label %entry4
entry12: ; preds = %entry3
br label %entry14
entry13: ; preds = %entry3
br label %entry14
entry14: ; preds = %entry13, %entry12
%iftmp15 = phi i64 [ 10, %entry12 ], [ 20, %entry13 ]
br label %entry4
}
Note the pattern formed by the blocks except for the block entry, they form groups of three, where the first branch goes "then", then the branch "else", and then block the "merge" (which can be found at the phi instructions). This is a consequence of the fact that every time we find the expression "if", we add three new blocks in main. Three blocks are arranged in the order in which the tree traversal AST.
That's all! I hope that now you have a sufficient basis to act independently.
Комментарии
Отправить комментарий