Decompile-Bench

🚀 Pipeline | 📚 Benchmark | 🤗 Models | 🖥️ Evaluation | 📊 Experiment | 📎 Quick Start

Experiment

  • Edit Similarity
HumanEval O0 O1 O2 O3 AVG
GPT-4.1-mini 46.09 33.83 34.75 29.66 36.08
IDA 25.47 21.01 20.18 17.92 21.15
LLM4Decompile-End-1.3B 43.37 36.91 36.76 36.30 38.34
Idioms-1.3B 48.84 38.08 35.93 34.66 39.35
LLM4Decompile-DCBench-1.3B 54.36 43.54 44.21 42.78 46.22
LLM4Decompile-DCBench-6.7B 62.32 51.91 51.66 52.99 54.72
Claude-Sonnet-4-reasoning 60.75 48.45 47.12 46.40 50.68
MBPP O0 O1 O2 O3 AVG
GPT-4.1-mini 47.52 37.34 39.15 32.63 39.16
IDA 27.66 23.63 22.01 19.43 23.18
LLM4Decompile-End-1.3B 44.82 39.67 39.01 38.13 40.41
Idioms-1.3B 49.35 38.13 35.91 34.36 39.44
LLM4Decompile-DCBench-1.3B 56.38 48.14 46.76 45.79 49.28
LLM4Decompile-DCBench-6.7B 64.12 55.40 53.87 53.39 56.70
Claude-Sonnet-4-reasoning 64.64 54.27 53.10 51.98 55.99
GitHub2025 O0 O1 O2 O3 AVG
GPT-4.1-mini 21.15 18.64 19.38 18.43 19.40
IDA 22.17 18.21 19.69 18.93 19.75
LLM4Decompile-End-1.3B 23.09 20.61 21.77 20.81 21.57
Idioms-1.3B 30.27 24.04 25.09 24.18 25.90
LLM4Decompile-DCBench-1.3B 30.99 29.21 30.23 27.59 29.51
LLM4Decompile-DCBench-6.7B 34.29 32.74 34.18 29.96 32.79
Claude-Sonnet-4-reasoning 36.29 32.80 33.12 31.32 33.38
ProRec Edit Sim
GPT-4.1-mini 34.74
IDA 27.24
LLM4Decompile-End-1.3B 34.26
Idioms-1.3B 36.03
LLM4Decompile-DCBench-1.3B 38.85
LLM4Decompile-DCBench-6.7B 45.23
Claude-Sonnet-4-reasoning 41.99
  • Re-executability
HumanEval O0 O1 O2 O3 AVG
GPT-4.1-mini 21.95 11.58 10.07 10.06 13.42
IDA 18.60 19.81 17.69 16.77 18.22
LLM4Decompile-End-1.3B 26.22 12.81 14.03 13.42 16.22
Idioms-1.3B 30.56 16.10 12.63 12.36 17.91
LLM4Decompile-DCBench-1.3B 33.23 18.60 16.47 15.24 20.89
LLM4Decompile-DCBench-6.7B 61.59 30.18 34.15 32.01 39.48
Claude-Sonnet-4-reasoning 65.85 42.68 39.63 39.02 46.79
MBPP O0 O1 O2 O3 AVG
GPT-4.1-mini 31.37 16.74 16.64 14.79 19.89
IDA 25.62 25.05 23.72 23.57 24.49
LLM4Decompile-End-1.3B 29.16 16.99 17.92 18.07 20.54
Idioms-1.3B 33.97 20.47 18.13 17.30 22.47
LLM4Decompile-DCBench-1.3B 35.06 21.56 22.80 20.28 24.93
LLM4Decompile-DCBench-6.7B 58.32 39.58 39.73 37.06 43.67
Claude-Sonnet-4-reasoning 67.76 51.69 53.02 50.25 55.68
  • R2I
HumanEval O0 O1 O2 O3 AVG
GPT-4.1-mini 62.38 52.63 55.68 53.90 56.14
IDA 41.49 36.29 35.85 35.32 37.23
LLM4Decompile-End-1.3B 65.69 60.48 60.66 59.37 61.55
Idioms-1.3B 68.18 66.92 67.46 65.48 67.01
LLM4Decompile-DCBench-1.3B 68.93 68.74 69.03 67.76 68.62
LLM4Decompile-DCBench-6.7B 69.35 68.91 69.79 68.42 69.12
Claude-Sonnet-4-reasoning 61.09 54.94 55.65 55.28 56.74
MBPP O0 O1 O2 O3 AVG
GPT-4.1-mini 61.79 55.34 57.05 55.83 57.50
IDA 41.82 34.87 35.16 36.21 37.02
LLM4Decompile-End-1.3B 67.93 63.47 65.69 63.01 65.03
Idioms-1.3B 69.12 67.01 63.91 62.35 65.60
LLM4Decompile-DCBench-1.3B 69.13 70.97 68.03 67.79 68.98
LLM4Decompile-DCBench-6.7B 72.30 71.99 72.25 70.67 71.80
Claude-Sonnet-4-reasoning 64.78 60.62 61.53 61.71 62.16
GitHub2025 O0 O1 O2 O3 AVG
GPT-4.1-mini 51.65 39.64 46.62 55.83 48.43
IDA 45.87 38.85 36.99 36.20 39.48
LLM4Decompile-End-1.3B 54.26 51.73 53.42 50.56 52.49
Idioms-1.3B 61.76 58.06 53.26 51.19 56.07
LLM4Decompile-DCBench-1.3B 64.40 65.72 61.74 63.31 63.79
LLM4Decompile-DCBench-6.7B 72.67 70.23 66.55 67.76 69.30
Claude-Sonnet-4-reasoning 55.70 43.88 45.04 51.71 49.08
ProRec R2I
GPT-4.1-mini 55.01
IDA 38.35
LLM4Decompile-End-1.3B 57.49
Idioms-1.3B 64.86
LLM4Decompile-DCBench-1.3B 65.73
LLM4Decompile-DCBench-6.7B 66.15
Claude-Sonnet-4-reasoning 57.38
  • Variable naming
GitHub2025 O0 O1 O2 O3 Average
GPT-4.1-mini 48.99 42.24 43.07 39.98 43.57
IDA 33.66 27.16 29.49 28.99 29.83
LLM4Decompile-End 64.15 63.48 62.39 63.84 63.47
LLM4Decompile-DCBench 76.38 77.18 77.53 76.69 76.95
  • Control flow
GitHub2025 O0 O1 O2 O3 Average
GPT-4.1-mini 63.25 50.09 50.41 50.10 53.46
IDA 63.28 59.42 60.35 60.62 60.92
LLM4Decompile-End 73.75 73.49 73.61 74.65 73.88
LLM4Decompile-DCBench 83.61 85.13 85.56 84.87 84.79
  • Type recovery
GitHub2025 O0 O1 O2 O3 Average
GPT-4.1-mini 55.69 45.18 46.75 44.93 48.14
IDA 63.53 60.37 62.29 61.67 61.97
LLM4Decompile-End 76.47 77.82 78.98 77.42 77.67
LLM4Decompile-DCBench 80.13 82.26 82.22 81.55 81.54

Prompt to evaluate variable naming, control flow and type recovery

You are an expert in reverse-engineering and decompiler evaluation.  I will give you a decompiled code snippet; your job is to evaluate it on three criteria:

    1. variable_naming: How well the decompiler recovered meaningful variable names.
    2. control_flow: How faithfully complex control-flow constructs (loops, branches, gotos) have been reconstructed.
    3. type_recovery: How accurately types (primitives, structs, pointers, arrays, etc.) were inferred.

    For each criterion:
    • Assign an integer score from 1 (very poor) to 100 (excellent).
    • Provide a one- or two-sentence rationale.

    Produce only a single JSON object, with exactly these fields:

    {
    "variable_naming": {
        "score": <int>,
        "rationale": "<string>"
    },
    "control_flow": {
        "score": <int>,
        "rationale": "<string>"
    },
    "type_recovery": {
        "score": <int>,
        "rationale": "<string>"
    }
    }

Do not include any extraneous keys and directly output the result without any explanation. 
The source code: {source code}.
Now evaluate this snippet: {decompiled code}
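
A minimal sketch of how this prompt could be driven programmatically, assuming the OpenAI Python client as the judge backend; the default judge model name is illustrative, and any chat-completion API that returns the bare JSON object works the same way:

import json
from openai import OpenAI  # assumption: any chat-completion client can serve as the judge

JUDGE_TEMPLATE = """You are an expert in reverse-engineering and decompiler evaluation. ...
The source code: {source code}.
Now evaluate this snippet: {decompiled code}"""  # paste the full prompt text from above here

def judge_decompilation(source_code, decompiled_code, judge_model="gpt-4.1-mini"):
    """Fill the prompt, query a judge model, and parse the returned JSON scores."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (JUDGE_TEMPLATE
              .replace("{source code}", source_code)
              .replace("{decompiled code}", decompiled_code))
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt requests a single JSON object with no extra text, so the reply should parse directly.
    return json.loads(response.choices[0].message.content)

# Example: scores = judge_decompilation(source_code, ida_pseudo)
# scores["variable_naming"]["score"], scores["control_flow"]["score"], scores["type_recovery"]["score"]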

Updates

  • [2025-05-21]: Release LLM4Decompile-DCBench, a 1.3 billion-parameter model trained on 10% of Decompile-Bench and specifically designed to decompile C/C++ code.
  • [2025-05-20]: Release Decompile-Bench, which contains two million binary-source function pairs for training and 70K function pairs for evaluation.

About

  • Decompile-Bench is the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects.
  • Decompile-Bench-Eval includes manually crafted binaries from the well-established HumanEval and MBPP benchmarks, alongside compiled GitHub repositories released after 2025 to mitigate data-leakage issues.

Pipeline

(Figure: overview of the Compile-Trace-Filter pipeline.)

The Compile-Trace-Filter framework automates project compilation, precisely traces function-level binary-source mappings, and applies robust filters to retain only high-quality pairs.
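
The actual tracing and filtering rules live in the pipeline code; the snippet below is only a hypothetical sketch of the kind of quality filter such a framework might apply (illustrative length and sanity checks, not the project's real criteria):

def keep_pair(source, asm, min_src_lines=3, max_asm_lines=2000):
    # Hypothetical quality filter for one binary-source function pair.
    src_lines = [line for line in source.splitlines() if line.strip()]
    asm_lines = [line for line in asm.splitlines() if line.strip()]
    if len(src_lines) < min_src_lines:            # drop trivial stubs
        return False
    if len(asm_lines) > max_asm_lines:            # drop functions far too long for a model context
        return False
    if "{" not in source or "}" not in source:    # crude check for a complete function body
        return False
    return True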

Benchmark

Decompile-Bench contains the following columns:

{
"name":"demangled name for the function",
"code":"source code",
"asm":"assembly",
"file":"source code path"
}

Decompile-Bench-Eval contains three splits: humaneval, mbpp, and github2025. We also provide a JSON version of the data. Each split contains the following columns (a loading sketch follows the field list):

{
"index":"index of the function", 
"func_name":"demangled name for he function", 
"func_dep":"function dependecies (includes, help functions), or the path to the source code", 
"func":"source code", 
"test":"unit tests for the function, empty for github data", 
"opt":"optimization, O0, O1, O2, O3", 
"language":"language, c or cpp", 
"asm":"assembly", 
"ida_asm":"assembly from ida pro", 
"ida_pseudo":"decompiled results (pseudo code) from ida pro", 
"ghidra_asm":"assembly from ghidra", 
"ghidra_pseudo":"decompiled results (pseudo code) from ghidra"
}
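
As an example, the JSON version of a split can be loaded and inspected directly; the sketch below assumes the file holds a list of such records and uses the path referenced in the Evaluation section:

import json

# Load the HumanEval split of Decompile-Bench-Eval (JSON version).
with open("./data/humaneval-decompile.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # assumption: a list of records with the fields listed above

sample = samples[0]
print(sample["func_name"], sample["opt"], sample["language"])
print(sample["ida_pseudo"][:300])  # first characters of the IDA Pro pseudo code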

Models

Model Checkpoint Size HumanEval-Decompile (Re-executability) Alias
llm4decompile-1.3b-v1.5 🤗 HF Link 1.3B 16.22% LLM4Decompile-End
llm4decompile-1.3b-v1.6 🤗 HF Link 1.3B 20.89% LLM4Decompile-DCBench

Metrics

  • Re-executability evaluates whether the decompiled code can execute properly and pass all the predefined test cases.
  • Edit Similarity is based on Levenshtein distance; it captures the minimum number of insertions, deletions, or substitutions needed to turn the generated code into the reference (see the computation sketch below).

For R2I, please refer to the source project.
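
A minimal sketch of how edit similarity can be computed with the editdistance package installed in the Requirements section below; the normalization shown (one minus the distance divided by the longer string's length) is a common formulation, and ./metrics/cal_edit_sim.py remains the reference implementation:

import editdistance

def edit_similarity(generated, reference):
    # Normalized edit similarity: 1 - Levenshtein(generated, reference) / max(len(generated), len(reference)).
    if not generated and not reference:
        return 1.0
    dist = editdistance.eval(generated, reference)
    return 1.0 - dist / max(len(generated), len(reference))

# Example: score a decompiled function against its source on a 0-100 scale.
# score = 100 * edit_similarity(decompiled_code, source_code)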

Requirements

  • vllm >= 0.5.2
https://docs.vllm.ai/en/v0.5.2/getting_started/installation.html

IMPORTANT: the following libraries are required for compilation; otherwise, the compilation will fail.

apt-get update
apt-get install -y libboost-dev libssl-dev
pip install editdistance

Evaluation

  • Re-executability
python3 run_exe_rate.py \
--model_path LLM4Binary/llm4decompile-1.3b-v1.6 \
--dataset_path ./data/humaneval-decompile.json \
--output_path ./data/humaneval
  • Edit Similarity
# Note: we assume the decompiled results are stored in ./data/humaneval
python3 ./metrics/cal_edit_sim.py

Quick Start

Open In Colab

Setup: Please use the script below to install the necessary environment.

git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
conda create -n 'llm4decompile' python=3.9 -y
conda activate llm4decompile
pip install -r requirements.txt

Here is an example of how to use our model (for previous models, please check the corresponding model page on HF). Note: replace "func0" with the name of the function you want to decompile.

Preprocessing: Compile the C code into binary, and disassemble the binary into assembly instructions.

import subprocess

func_name = 'func0'
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'samples/sample'  # 'path/to/file'

for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    # Compile the code with GCC on Linux.
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'
    subprocess.run(compile_command, shell=True, check=True)
    # Disassemble the binary file into assembly instructions.
    disasm_command = f'objdump -d {output_file}.o > {output_file}.s'
    subprocess.run(disasm_command, shell=True, check=True)

    input_asm = ''
    with open(output_file + '.s') as f:  # asm file
        asm = f.read()
        if '<' + func_name + '>:' not in asm:  # IMPORTANT: replace func0 with the function name
            raise ValueError("compile fails")
        # Keep only the target function's assembly block.
        asm = func_name + ':' + asm.split('<' + func_name + '>:')[-1].split('\n\n')[0]
        asm_clean = ""
        asm_sp = asm.split("\n")
        for tmp in asm_sp:
            if len(tmp.split("\t")) < 3 and '00' in tmp:
                continue
            idx = min(len(tmp.split("\t")) - 1, 2)
            tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the binary code
            tmp_asm = tmp_asm.split("#")[0].strip()  # remove the comments
            asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()
    before = "# This is the assembly code:\n"  # prompt
    after = "\n# What is the source code?\n"  # prompt
    input_asm_prompt = before + input_asm.strip() + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)

Assembly instructions should be in the format:

FUNCTION_NAME:
OPERATIONS
OPERATIONS

Typical assembly instructions may look like this:

func0:
endbr64
lea    (%rdi,%rsi,1),%eax
retq

Decompilation: Use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-1.3b-v1.6' # V1.6 Model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + OPT[0] +'.asm','r') as f:#optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)  # the model's max length is 4096 tokens; keep max_new_tokens within that budget
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

with open(fileName +'.c','r') as f:#original file
    func = f.read()

print(f'original function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
print(f'decompiled function:\n{c_func_decompile}')
