- Create the training dataset
- Fine-tune the model
- Copy and switch to the eval directory
- Create the evaluation dataset
- Evaluate the model
- Build the eval files based on the LLM output
- Evaluate Clang
- Evaluate correctness and compute the metrics
- Show the evaluation
```shell
cd single_transformations/train
python3 create_training_data_single.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --number_of_samples 3000
python3 llm.py --model_type deepseek-coder-instruct --train_model deepseek-ai/deepseek-coder-6.7b-instruct --train_file datasets/obfuscation_dataset_encode_arithmetic_6144.txt --trained_model_path models/deepseek-coder-instruct-7b-encode_arithmetic --max_tokens 6144
# It is important to set two different directories for training and eval data to prevent source files from being overwritten!
mkdir ../eval/models ../eval/datasets
cp -r models/deepseek-coder-instruct-7b-encode_arithmetic ../eval/models/deepseek-coder-instruct-7b-encode_arithmetic
cd ../eval
python3 create_eval_data_single.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --number_of_samples 200
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-encode_arithmetic/ --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --max_tokens 6144 --data_suffix _encode_arithmetic
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-encode_arithmetic/ --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --max_tokens 6144 --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --build_eval_files 1
python3 llvm.py --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic
python3 eval_deobf.py --eval_dataset_path datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --io_path datasets/input_samples
python3 show_eval.py --eval_dataset_path datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated
```
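The commands above repeat the same tokenizer, token budget, and data suffix throughout. As a minimal sketch (the variable names and the dry-run `run` helper are ours, not part of the repo; the scripts and flags are taken verbatim from the commands above), the evaluation stage can be parameterized so a different transformation suffix only needs to be set once:

```shell
#!/bin/sh
# Sketch: parameterize the eval stage of the single-transformation pipeline.
# Assumption: file names compose as datasets/obfuscation_dataset<SUFFIX>_<MAX_TOKENS>_eval.json,
# matching the concrete names used above.
set -e
TOKENIZER=deepseek-ai/deepseek-coder-6.7b-instruct
MAX_TOKENS=6144
SUFFIX=_encode_arithmetic
MODEL=models/deepseek-coder-instruct-7b-encode_arithmetic
EVAL_FILE=datasets/obfuscation_dataset${SUFFIX}_${MAX_TOKENS}_eval.json

# Dry-run wrapper: prints each command instead of executing it.
# Replace the echo with "$@" to actually run the pipeline.
run() { echo "+ $*"; }

run python3 create_eval_data_single.py --tokenizer "$TOKENIZER" --max_tokens "$MAX_TOKENS" --number_of_samples 200
run python3 llm.py --model_type deepseek-coder-instruct --eval_model "$MODEL" --eval_out_path datasets/deobfuscated --eval_file "$EVAL_FILE" --max_tokens "$MAX_TOKENS" --data_suffix "$SUFFIX"
run python3 llvm.py --eval_file "$EVAL_FILE" --obfs_data_suffix "$SUFFIX" --data_suffix "$SUFFIX"
```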
- Create the training dataset
- Fine-tune the model
- Copy and switch to the eval directory
- Create the evaluation dataset
- Evaluate the model
- Build the evaluation files around the LLM-generated samples
- Evaluate Clang
- Evaluate correctness and compute the metrics
- Show the evaluation
```shell
python3 create_training_data_chain.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --chain_length 1 --number_of_samples 3000
python3 llm.py --model_type deepseek-coder-instruct --train_model deepseek-coder-instruct-7b-chain-6144 --train_file datasets/obfuscation_dataset_chain_6144_all_training.txt --trained_model_path models/deepseek-coder-instruct-7b-chain-6144 --max_tokens 6144
cp -r models/deepseek-coder-instruct-7b-chain-6144 ../eval/models/deepseek-coder-instruct-7b-chain-6144
cd ../eval
python3 create_eval_data_chain.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --chain_length 1 --number_of_samples 1000
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-chain-6144 --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --max_tokens 6144 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-chain-6144 --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --max_tokens 6144 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --build_eval_files 1
python3 llvm.py --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144
python3 llvm.py --eval_file datasets/obfuscation_dataset_2_chain_eval2_l.json --orig_data_suffix _2 --obfs_data_suffix _2_chain --data_suffix _2_chain_6144
python3 eval_deobf.py --eval_dataset_path datasets/obfuscation_dataset_1_chain_eval2_l.json --no_metrics --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --io_path datasets/input_samples
python3 show_eval.py --eval_dataset_path datasets/obfuscation_dataset_1_chain_eval2_l.json --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --original_path datasets/original --original_io_path datasets/original_eval --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated
```
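The Clang baseline step above runs once per chain length (the `_1_chain` and `_2_chain` suffixes differ only in the chain-length number). As a sketch (the `chain_baseline` helper name is ours; only chain lengths 1 and 2 appear above, so anything beyond that is an assumption), a loop avoids duplicating the command:

```shell
#!/bin/sh
# Sketch: compose the per-chain-length Clang baseline command.
# This is a dry run; pipe the printed lines to sh once the repo and datasets exist.
chain_baseline() {
  # $1 is the chain length; all suffixes and the eval-file name derive from it,
  # following the naming pattern of the two concrete commands above.
  printf 'python3 llvm.py --eval_file datasets/obfuscation_dataset_%s_chain_eval2_l.json --orig_data_suffix _%s --obfs_data_suffix _%s_chain --data_suffix _%s_chain_6144\n' "$1" "$1" "$1" "$1"
}

for N in 1 2; do
  chain_baseline "$N"
done
```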
- Build the memorization dataset
- Evaluate the model with the memorized samples
- Build the evaluation files around the LLM-generated samples
- Evaluate correctness and compute the metrics
- Show the evaluation
- Manually check for memorized constants (only the correctness part of the evaluation is needed here, since we only have to examine semantically incorrect samples and do not need the deobfuscation metrics)
```shell
python3 build_memorization_dataset.py --input_dataset ../FineTuning-6144-eval/datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --output_dataset _gpt-4-32k-0314-memorization --original_path ../FineTuning-6144-eval/datasets/original --obfuscated_path ../FineTuning-6144-eval/datasets/obfuscated --deobfuscated_path ../FineTuning-6144-eval/datasets/deobfuscated --data_suffix _encode_arithmetic_gpt-4-32k-0314,_encode_branches_gpt-4-32k-0314,_flatten_gpt-4-32k-0314,_opaque_gpt-4-32k-0314,_randomize_arguments_gpt-4-32k-0314
python3 llm.py --model_type openai --eval_model ../FineTuning-6144-eval/models/codellama-7b-encode_arithmetic-6144/ --eval_out_path MemTest/deobfuscated_modified --eval_file obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --max_tokens 6144 --data_suffix _encode_arithmetic_gpt-4-32k-0314
python3 llm.py --model_type openai --eval_model ../FineTuning-6144-eval/models/codellama-7b-encode_arithmetic-6144/ --eval_out_path MemTest/deobfuscated_modified --eval_file obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --obfs_data_suffix _encode_arithmetic --max_tokens 6144 --data_suffix _encode_arithmetic_gpt-4-32k-0314 --build_eval_files 1
python3 eval_deobf.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic_gpt-4-32k-0314 --io_path ../FineTuning-6144-eval/datasets/input_samples
python3 show_eval.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic_gpt-4-32k-0314
```
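The commands above only run the `encode_arithmetic` variant, while `build_memorization_dataset.py` lists five transformation suffixes. As a sketch (the `mem_eval` helper name is ours, and we assume the other eval files follow the same `obfuscation_dataset_gpt-4-32k-0314-memorization<suffix>.json` naming pattern as the one shown above, which is not verified against the repo), the correctness check could be looped over all five transformations:

```shell
#!/bin/sh
# Sketch: compose the per-transformation correctness evaluation (dry run).
# Pipe the printed lines to sh once the MemTest files for each suffix exist.
MODEL_TAG=_gpt-4-32k-0314

mem_eval() {
  # $1 is a transformation suffix such as _flatten; the data suffix appends
  # the model tag, mirroring the concrete eval_deobf.py command above.
  printf 'python3 eval_deobf.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization%s.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix %s --data_suffix %s%s --io_path ../FineTuning-6144-eval/datasets/input_samples\n' "$1" "$1" "$1" "$MODEL_TAG"
}

for T in _encode_arithmetic _encode_branches _flatten _opaque _randomize_arguments; do
  mem_eval "$T"
done
```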