Decompile-Bench

🚀 Pipeline | 📚 Benchmark | 🤗 Models | 🖥️ Evaluation | 📊 Experiment | 📎 Quick Start

Experiment

  • Edit Similarity
HumanEval O0 O1 O2 O3 AVG
GPT-4.1-mini 46.09 33.83 34.75 29.66 36.08
IDA 25.47 21.01 20.18 17.92 21.15
LLM4Decompile-End-1.3B 43.37 36.91 36.76 36.30 38.34
Idioms-1.3B 48.84 38.08 35.93 34.66 39.35
LLM4Decompile-DCBench-1.3B 54.36 43.54 44.21 42.78 46.22
LLM4Decompile-DCBench-6.7B 62.32 51.91 51.66 52.99 54.72
Claude-Sonnet-4-reasoning 60.75 48.45 47.12 46.40 50.68
MBPP O0 O1 O2 O3 AVG
GPT-4.1-mini 47.52 37.34 39.15 32.63 39.16
IDA 27.66 23.63 22.01 19.43 23.18
LLM4Decompile-End-1.3B 44.82 39.67 39.01 38.13 40.41
Idioms-1.3B 49.35 38.13 35.91 34.36 39.44
LLM4Decompile-DCBench-1.3B 56.38 48.14 46.76 45.79 49.28
LLM4Decompile-DCBench-6.7B 64.12 55.40 53.87 53.39 56.70
Claude-Sonnet-4-reasoning 64.64 54.27 53.10 51.98 55.99
GitHub2025 O0 O1 O2 O3 AVG
GPT-4.1-mini 21.15 18.64 19.38 18.43 19.40
IDA 22.17 18.21 19.69 18.93 19.75
LLM4Decompile-End-1.3B 23.09 20.61 21.77 20.81 21.57
Idioms-1.3B 30.27 24.04 25.09 24.18 25.90
LLM4Decompile-DCBench-1.3B 30.99 29.21 30.23 27.59 29.51
LLM4Decompile-DCBench-6.7B 34.29 32.74 34.18 29.96 32.79
Claude-Sonnet-4-reasoning 36.29 32.80 33.12 31.32 33.38
ProRec Edit Sim
GPT-4.1-mini 34.74
IDA 27.24
LLM4Decompile-End-1.3B 34.26
Idioms-1.3B 36.03
LLM4Decompile-DCBench-1.3B 38.85
LLM4Decompile-DCBench-6.7B 45.23
Claude-Sonnet-4-reasoning 41.99
  • Re-executability
HumanEval O0 O1 O2 O3 AVG
GPT-4.1-mini 21.95 11.58 10.07 10.06 13.42
IDA 18.60 19.81 17.69 16.77 18.22
LLM4Decompile-End-1.3B 26.22 12.81 14.03 13.42 16.22
Idioms-1.3B 30.56 16.10 12.63 12.36 17.91
LLM4Decompile-DCBench-1.3B 33.23 18.60 16.47 15.24 20.89
LLM4Decompile-DCBench-6.7B 61.59 30.18 34.15 32.01 39.48
Claude-Sonnet-4-reasoning 65.85 42.68 39.63 39.02 46.79
MBPP O0 O1 O2 O3 AVG
GPT-4.1-mini 31.37 16.74 16.64 14.79 19.89
IDA 25.62 25.05 23.72 23.57 24.49
LLM4Decompile-End-1.3B 29.16 16.99 17.92 18.07 20.54
Idioms-1.3B 33.97 20.47 18.13 17.30 22.47
LLM4Decompile-DCBench-1.3B 35.06 21.56 22.80 20.28 24.93
LLM4Decompile-DCBench-6.7B 58.32 39.58 39.73 37.06 43.67
Claude-Sonnet-4-reasoning 67.76 51.69 53.02 50.25 55.68
  • R2I
HumanEval O0 O1 O2 O3 AVG
GPT-4.1-mini 62.38 52.63 55.68 53.90 56.14
IDA 41.49 36.29 35.85 35.32 37.23
LLM4Decompile-End-1.3B 65.69 60.48 60.66 59.37 61.55
Idioms-1.3B 68.18 66.92 67.46 65.48 67.01
LLM4Decompile-DCBench-1.3B 68.93 68.74 69.03 67.76 68.62
LLM4Decompile-DCBench-6.7B 69.35 68.91 69.79 68.42 69.12
Claude-Sonnet-4-reasoning 61.09 54.94 55.65 55.28 56.74
MBPP O0 O1 O2 O3 AVG
GPT-4.1-mini 61.79 55.34 57.05 55.83 57.50
IDA 41.82 34.87 35.16 36.21 37.02
LLM4Decompile-End-1.3B 67.93 63.47 65.69 63.01 65.03
Idioms-1.3B 69.12 67.01 63.91 62.35 65.60
LLM4Decompile-DCBench-1.3B 69.13 70.97 68.03 67.79 68.98
LLM4Decompile-DCBench-6.7B 72.30 71.99 72.25 70.67 71.80
Claude-Sonnet-4-reasoning 64.78 60.62 61.53 61.71 62.16
GitHub2025 O0 O1 O2 O3 AVG
GPT-4.1-mini 51.65 39.64 46.62 55.83 48.43
IDA 45.87 38.85 36.99 36.20 39.48
LLM4Decompile-End-1.3B 54.26 51.73 53.42 50.56 52.49
Idioms-1.3B 61.76 58.06 53.26 51.19 56.07
LLM4Decompile-DCBench-1.3B 64.40 65.72 61.74 63.31 63.79
LLM4Decompile-DCBench-6.7B 72.67 70.23 66.55 67.76 69.30
Claude-Sonnet-4-reasoning 55.70 43.88 45.04 51.71 49.08
ProRec R2I
GPT-4.1-mini 55.01
IDA 38.35
LLM4Decompile-End-1.3B 57.49
Idioms-1.3B 64.86
LLM4Decompile-DCBench-1.3B 65.73
LLM4Decompile-DCBench-6.7B 66.15
Claude-Sonnet-4-reasoning 57.38
  • Variable naming
GitHub2025 O0 O1 O2 O3 Average
GPT-4.1-mini 48.99 42.24 43.07 39.98 43.57
IDA 33.66 27.16 29.49 28.99 29.83
LLM4Decompile-End 64.15 63.48 62.39 63.84 63.47
LLM4Decompile-DCBench 76.38 77.18 77.53 76.69 76.95
  • Control flow
GitHub2025 O0 O1 O2 O3 Average
GPT-4.1-mini 63.25 50.09 50.41 50.10 53.46
IDA 63.28 59.42 60.35 60.62 60.92
LLM4Decompile-End 73.75 73.49 73.61 74.65 73.88
LLM4Decompile-DCBench 83.61 85.13 85.56 84.87 84.79
  • Type recovery
GitHub2025 O0 O1 O2 O3 Average
GPT-4.1-mini 55.69 45.18 46.75 44.93 48.14
IDA 63.53 60.37 62.29 61.67 61.97
LLM4Decompile-End 76.47 77.82 78.98 77.42 77.67
LLM4Decompile-DCBench 80.13 82.26 82.22 81.55 81.54

Prompt to evaluate variable naming, control flow and type recovery

You are an expert in reverse-engineering and decompiler evaluation.  I will give you a decompiled code snippet; your job is to evaluate it on three criteria:

    1. variable_naming: How well the decompiler recovered meaningful variable names.
    2. control_flow: How faithfully complex control-flow constructs (loops, branches, gotos) have been reconstructed.
    3. type_recovery: How accurately types (primitives, structs, pointers, arrays, etc.) were inferred.

    For each criterion:
    • Assign an integer score from 1 (very poor) to 100 (excellent).
    • Provide a one- or two-sentence rationale.

    Produce only a single JSON object, with exactly these fields:

    {
    "variable_naming": {
        "score": <int>,
        "rationale": "<string>"
    },
    "control_flow": {
        "score": <int>,
        "rationale": "<string>"
    },
    "type_recovery": {
        "score": <int>,
        "rationale": "<string>"
    }
    }

Do not include any extraneous keys and directly output the result without any explanation. 
The source code: {source code}.
Now evaluate this snippet: {decompiled code}
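
A minimal sketch of how this prompt could be driven programmatically, assuming the OpenAI Python client as the judge backend; the default judge model name is illustrative, and any chat-completion API that returns the bare JSON object works the same way:

import json
from openai import OpenAI  # assumption: any chat-completion client can serve as the judge

JUDGE_TEMPLATE = """You are an expert in reverse-engineering and decompiler evaluation. ...
The source code: {source code}.
Now evaluate this snippet: {decompiled code}"""  # paste the full prompt text from above here

def judge_decompilation(source_code, decompiled_code, judge_model="gpt-4.1-mini"):
    """Fill the prompt, query a judge model, and parse the returned JSON scores."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (JUDGE_TEMPLATE
              .replace("{source code}", source_code)
              .replace("{decompiled code}", decompiled_code))
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt requests a single JSON object with no extra text, so the reply should parse directly.
    return json.loads(response.choices[0].message.content)

# Example: scores = judge_decompilation(source_code, ida_pseudo)
# scores["variable_naming"]["score"], scores["control_flow"]["score"], scores["type_recovery"]["score"]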

Updates

  • [2025-05-21]: Release LLM4Decompile-DCBench, a 1.3 billion-parameter model trained on 10% of Decompile-Bench and specifically designed to decompile C/C++ code.
  • [2025-05-20]: Release Decompile-Bench, which contains two million binary-source function pairs for training and 70K function pairs for evaluation.

About

  • Decompile-Bench is the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects.
  • Decompile-Bench-Eval includes manually crafted binaries from the well-established HumanEval and MBPP benchmarks, alongside compiled GitHub repositories released after 2025 to mitigate data-leakage issues.

Pipeline

(Figure: overview of the Compile-Trace-Filter pipeline.)

The Compile-Trace-Filter framework automates project compilation, precisely traces function-level binary-source mappings, and applies robust filters to retain only high-quality pairs.
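
The actual tracing and filtering rules live in the pipeline code; the snippet below is only a hypothetical sketch of the kind of quality filter such a framework might apply (illustrative length and sanity checks, not the project's real criteria):

def keep_pair(source, asm, min_src_lines=3, max_asm_lines=2000):
    # Hypothetical quality filter for one binary-source function pair.
    src_lines = [line for line in source.splitlines() if line.strip()]
    asm_lines = [line for line in asm.splitlines() if line.strip()]
    if len(src_lines) < min_src_lines:            # drop trivial stubs
        return False
    if len(asm_lines) > max_asm_lines:            # drop functions far too long for a model context
        return False
    if "{" not in source or "}" not in source:    # crude check for a complete function body
        return False
    return True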

Benchmark

Decompile-Bench contains the following columns:

{
"name":"demangled name for the function",
"code":"source code",
"asm":"assembly",
"file":"source code path"
}

Decompile-Bench-Eval contains three splits: humaneval, mbpp, and github2025. We also provide a JSON version of the data. Each split contains the following columns (a loading sketch follows the field list):

{
"index":"index of the function", 
"func_name":"demangled name for he function", 
"func_dep":"function dependecies (includes, help functions), or the path to the source code", 
"func":"source code", 
"test":"unit tests for the function, empty for github data", 
"opt":"optimization, O0, O1, O2, O3", 
"language":"language, c or cpp", 
"asm":"assembly", 
"ida_asm":"assembly from ida pro", 
"ida_pseudo":"decompiled results (pseudo code) from ida pro", 
"ghidra_asm":"assembly from ghidra", 
"ghidra_pseudo":"decompiled results (pseudo code) from ghidra"
}
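
As an example, the JSON version of a split can be loaded and inspected directly; the sketch below assumes the file holds a list of such records and uses the path referenced in the Evaluation section:

import json

# Load the HumanEval split of Decompile-Bench-Eval (JSON version).
with open("./data/humaneval-decompile.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # assumption: a list of records with the fields listed above

sample = samples[0]
print(sample["func_name"], sample["opt"], sample["language"])
print(sample["ida_pseudo"][:300])  # first characters of the IDA Pro pseudo code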

Models

Model Checkpoint Size HumanEval-Decompile (Re-executability) Alias
llm4decompile-1.3b-v1.5 🤗 HF Link 1.3B 16.22% LLM4Decompile-End
llm4decompile-1.3b-v1.6 🤗 HF Link 1.3B 20.89% LLM4Decompile-DCBench

Metrics

  • Re-executability evaluates whether the decompiled code can execute properly and pass all the predefined test cases.
  • Edit Similarity is based on Levenshtein distance; it captures the minimum number of insertions, deletions, or substitutions needed to turn the generated code into the reference (see the computation sketch below).

For R2I, please refer to the source project.
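
A minimal sketch of how edit similarity can be computed with the editdistance package installed in the Requirements section below; the normalization shown (one minus the distance divided by the longer string's length) is a common formulation, and ./metrics/cal_edit_sim.py remains the reference implementation:

import editdistance

def edit_similarity(generated, reference):
    # Normalized edit similarity: 1 - Levenshtein(generated, reference) / max(len(generated), len(reference)).
    if not generated and not reference:
        return 1.0
    dist = editdistance.eval(generated, reference)
    return 1.0 - dist / max(len(generated), len(reference))

# Example: score a decompiled function against its source on a 0-100 scale.
# score = 100 * edit_similarity(decompiled_code, source_code)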

Requirements

  • vllm >= 0.5.2
https://docs.vllm.ai/en/v0.5.2/getting_started/installation.html

IMPORTANT: the following libraries are required for compilation; otherwise, the compilation will fail.

apt-get update
apt-get install -y libboost-dev libssl-dev
pip install editdistance

Evaluation

  • Re-executability
python3 run_exe_rate.py \
--model_path LLM4Binary/llm4decompile-1.3b-v1.6 \
--dataset_path ./data/humaneval-decompile.json \
--output_path ./data/humaneval
  • Edit Similarity
# Note: we assume the decompiled results are stored in ./data/humaneval
python3 ./metrics/cal_edit_sim.py

Quick Start

Open In Colab

Setup: Please use the script below to install the necessary environment.

git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
conda create -n 'llm4decompile' python=3.9 -y
conda activate llm4decompile
pip install -r requirements.txt

Here is an example of how to use our model (for previous models, please check the corresponding model page on HF). Note: replace "func0" with the name of the function you want to decompile.

Preprocessing: Compile the C code into binary, and disassemble the binary into assembly instructions.

import subprocess

func_name = 'func0'
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'samples/sample'  # 'path/to/file'

for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    # Compile the code with GCC on Linux.
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'
    subprocess.run(compile_command, shell=True, check=True)
    # Disassemble the binary file into assembly instructions.
    disasm_command = f'objdump -d {output_file}.o > {output_file}.s'
    subprocess.run(disasm_command, shell=True, check=True)

    input_asm = ''
    with open(output_file + '.s') as f:  # asm file
        asm = f.read()
        if '<' + func_name + '>:' not in asm:  # IMPORTANT: replace func0 with the function name
            raise ValueError("compile fails")
        # Keep only the target function's assembly block.
        asm = func_name + ':' + asm.split('<' + func_name + '>:')[-1].split('\n\n')[0]
        asm_clean = ""
        asm_sp = asm.split("\n")
        for tmp in asm_sp:
            if len(tmp.split("\t")) < 3 and '00' in tmp:
                continue
            idx = min(len(tmp.split("\t")) - 1, 2)
            tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the binary code
            tmp_asm = tmp_asm.split("#")[0].strip()  # remove the comments
            asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()
    before = "# This is the assembly code:\n"  # prompt
    after = "\n# What is the source code?\n"  # prompt
    input_asm_prompt = before + input_asm.strip() + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)

Assembly instructions should be in the format:

FUNCTION_NAME:
OPERATIONS
OPERATIONS

Typical assembly instructions may look like this:

func0:
endbr64
lea    (%rdi,%rsi,1),%eax
retq

Decompilation: Use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-1.3b-v1.6' # V1.6 Model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + OPT[0] +'.asm','r') as f:#optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)  # the model's max length is 4096 tokens; keep max_new_tokens within that budget
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

with open(fileName +'.c','r') as f:#original file
    func = f.read()

print(f'original function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
print(f'decompiled function:\n{c_func_decompile}')
