Variable Name Recovery and Visualization Method of Obfuscated Code
【Author】 Yang Tao;
【Advisor】 Hu Haibo;
【Author information】 Chongqing University, Software Engineering, 2022, Master's
【Abstract】 Before a front-end project is deployed, developers usually obfuscate and compress its JavaScript code, without changing its runtime behavior, in order to improve execution efficiency, reduce network transfer overhead, and avoid exposing the original source in the client application. For security analysts, however, reviewing source code through reverse engineering is often necessary, and obfuscated, compressed code makes that review considerably harder. A variable name recovery method is therefore needed to help analysts quickly understand and analyze the execution logic of such code. In principle, the original variable names cannot be derived from the information carried by the obfuscated code alone; but because most code has identical or similar implementations in existing code bases, predicting the original names of obfuscated variables from their context is feasible. Existing solutions, moreover, have low recovery accuracy and rely on analysts reading the source code to re-evaluate the predictions, which is usually time-consuming. Addressing these problems requires, on the one hand, improving the variable name recovery model to raise its accuracy and, on the other, an intuitive way to help analysts read and understand the code. Starting from these two aspects, this thesis proposes a more efficient and accurate variable name recovery scheme based on deep learning and visualization techniques. The main contributions are as follows: (1) More than ten thousand high-quality front-end projects were collected from the open-source community GitHub, and over 1 million JavaScript source files, containing 6.68 million functions in total, were extracted from them to build a JavaScript obfuscated code dataset. (2) A BERT-based encoder-decoder model was designed and implemented; it incorporates the pre-trained CodeBERT model, which extracts features from code well, and can effectively recover the original variable names in obfuscated code. (3) A series of code visualization views was designed, including an enhanced editor view, an abstract syntax tree representation graph, and a logical node life cycle diagram; together with the deployed variable name recovery model, they form an efficient, usable visualization tool for variable name recovery. Trained on the dataset built in this study, the variable name recovery model achieves an accuracy of 75.69% and a character error rate of 21.12%, outperforming existing work. The visualization tool provides complete interaction logic and helps analysts quickly understand code logic. A user study shows that the deep-learning-based variable name recovery and visualization method proposed in this thesis helps analysts analyze code and determine original variable names quickly and effectively.
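The abstract reports two evaluation figures, accuracy and character error rate, without defining them in this record. The following is a minimal Python sketch of how such metrics are commonly computed, assuming accuracy means exact match between predicted and original names and character error rate means total Levenshtein edit distance divided by the total number of reference characters. These definitions, the function names, and the sample names are illustrative assumptions and may differ from the ones used in the thesis.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def _check(predicted: Sequence[str], reference: Sequence[str]) -> None:
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predicted names identical to the original names (assumed definition)."""
    _check(predicted, reference)
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def character_error_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Total edit distance divided by total reference characters (assumed definition)."""
    _check(predicted, reference)
    total_edits = sum(edit_distance(p, r) for p, r in zip(predicted, reference))
    total_chars = sum(len(r) for r in reference)
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical original names and model predictions, for illustration only.
    reference = ["index", "callback", "result", "userName"]
    predicted = ["index", "callback", "results", "username"]
    print("accuracy:", exact_match_accuracy(predicted, reference))
    print("character error rate:", character_error_rate(predicted, reference))
```

Under these assumed definitions the two metrics complement each other: a near miss such as `results` for `result` counts as a full error for accuracy but adds only one edit to the character error rate.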
【Keywords】 Code Obfuscation; Code Visualization; CodeBERT; Code Representation; Variable Name Recovery;
- 【Online publication contributor】 Chongqing University 【Online publication year/issue】 2024, Issue 09
- 【Classification number】 TP311.52