NLP Review 1
TITLE:
AI BASED SOURCE CODE ANALYSIS
BY,
CHANAKYA B M (21BCE1010)
S ANISH RISHI (21BCE5999)
K K SHIVARAM (21BCE6171)
CONTENTS
ABSTRACT
INTRODUCTION
LITERATURE REVIEW
METHODOLOGY
ABSTRACT
This literature review examines methods for source code summarization, which generates comments to improve code readability.
It covers various approaches, including Statistical Machine Translation, Natural Language Generation, Tree-based CNNs, SrcML
and Deep Reinforcement Learning. Each method's strengths and weaknesses are discussed. The review emphasizes the need for
accurate documentation and proposes future research, such as integrating human evaluation criteria to enhance summarization
quality.
This work proposes a novel approach to source code summarization that generates natural language summaries from code
snippets. Instead of conventional RNN or CNN encoders, it derives tokens from the code's syntax tree and feeds them to a
Transformer model with a self-attention mechanism, which is especially well suited to capturing long-range dependencies.
The method is applied to generate succinct, informative summaries of Java code snippets.
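To illustrate the idea, the sketch below linearizes a syntax tree into a token sequence using Python's built-in ast module. This is only a minimal stand-in for the approach described above, which operates on Java code; the function name and token scheme are our own assumptions.

```python
import ast

def ast_tokens(source: str) -> list[str]:
    """Linearize a code snippet's syntax tree into a token sequence.

    A sequence like this can be fed to a Transformer encoder in place of
    raw text tokens, exposing the code's structure to the model.
    """
    tokens = []
    for node in ast.walk(ast.parse(source)):
        tokens.append(type(node).__name__)  # structural token (node type)
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            tokens.append(node.name)        # keep identifiers as lexical tokens
        elif isinstance(node, ast.Name):
            tokens.append(node.id)
    return tokens

print(ast_tokens("def add(a, b):\n    return a + b"))
```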
LITERATURE REVIEW
3. Improved Code Summarization via a Graph Neural Network, Alexander LeClair 2020
Source code summarization has evolved from heuristic-based methods to advanced neural models, integrating techniques like
neural machine translation (NMT) and graph-based approaches. Recent advancements focus on using Abstract Syntax Trees
(ASTs) and Graph Neural Networks (GNNs) to capture code structure more effectively for generating accurate summaries.
LITERATURE REVIEW
4. Summarizing Source Code with Transferred API Knowledge, Xing Hu et al. 2018
Code summarization aims to create concise, plain language descriptions of source code, aiding in software
maintenance by enhancing code understanding and searchability. Traditional methods focused on extracting
key API knowledge from similar code examples to generate summaries, but these methods were often
inadequate. A new approach, TL-CodeSum, leverages API knowledge from related tasks, significantly improving
the efficiency and quality of code summarization in Java projects.
This paper presents a novel approach to code summarization that leverages interpretability techniques to
enhance the generation of concise and meaningful descriptions of source code. It proposes using
interpretative mechanisms, such as attention maps and feature-importance scores, to guide summary
generation. By focusing on the underlying logic and structure of the code, the approach improves the
relevance and precision of the generated summaries. Empirical experiments show notable improvements
over traditional summarization techniques in readability and informativeness.
LITERATURE REVIEW
6. Entity-Based Source Code Summarization (EBSCS), Chitti Babu K 2016
This paper introduces Entity-Based Source Code Summarization (EBSCS), a method that automatically builds
source code summaries from program entities such as classes, methods, and comments. By extracting semantic
content and pertinent terminology, EBSCS generates succinct summaries that help developers comprehend and
navigate code more effectively. By improving program understanding and software documentation, the method
can shorten the time needed for code analysis.
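To make the entity-based idea concrete, here is a rough sketch in Python (the paper targets Java): it collects class and method entities with their docstrings via the standard ast module and joins them into a summary line. Names and output format are illustrative assumptions, not the paper's implementation.

```python
import ast

def entity_summary(source: str) -> str:
    """Summarize code from its entities: classes, methods, and docstrings."""
    parts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            parts.append(f"class {node.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node) or "no description"
            parts.append(f"method {node.name}: {doc}")
    return "; ".join(parts)

code = '''
class Stack:
    def push(self, item):
        """Add an item to the top of the stack."""
        self.items.append(item)
'''
print(entity_summary(code))
# class Stack; method push: Add an item to the top of the stack.
```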
7. Enhancing Source Code Summarization from Structure and Semantics, Xurong Lu 2023
This paper presents Code Structure and Semantic Fusion (CSSF), a technique that combines structural and
semantic insights to improve source code summarization. CSSF extracts and fuses code features in a
multimodal manner using an improved Transformer model and a heterogeneous graph attention network. On a
Java dataset, CSSF produces more accurate summaries than previous techniques, highlighting its ability to
generate accurate and succinct code summaries.
8. Multi-Modal Code Summarization with Retrieved Summary, Lile Lin 2022
This paper proposes a method for automatic code summarization that bridges the gap between code snippets
and their summaries by exploiting multiple code modalities: lexical, syntactic, and semantic. By combining
code tokens, abstract syntax trees (ASTs), and a retrieval model modeled after translation memory (TM), the
approach significantly improves the accuracy of natural language summaries produced by neural machine
translation (NMT) models.
LITERATURE REVIEW
9. Naturalness in Source Code Summarization, Claudio Ferretti 2023
Source code summarization produces code descriptions in clear natural language, aiding documentation and
comprehension. This paper investigates the sensitivity of neural models to code elements such as identifiers
and comments, observing a decrease in performance when these are hidden. The authors propose optimizing the
BRIO model using an intermediate pseudo-language, attaining results competitive with state-of-the-art
methods such as PLBART and CodeBERT. Finally, they point out the shortcomings of NLP-based models in
analyzing source code and make recommendations for further study.
10. Automatic Source Code Summarization of Context for Java Methods, Paul W. McBurney 2016
Source code summarization, usually carried out by experts, produces readable descriptions of software
functionality. Although automated approaches exist, they frequently lack context, merely describing a
method's operation without explaining its purpose within the software. This paper presents a technique that
examines the invocation context of Java methods to generate English summaries of those methods. Through user
surveys, the authors found that although the technique does not match human-written summaries, it
outperforms existing automated tools in several important respects.
LITERATURE REVIEW
11. Static Code Analysis in the AI Era, Gang Fan 2023
The AI-driven approach to static code analysis, utilizing models like GPT-3/4, has shown significant
improvements in detecting code errors and logic inconsistencies. This method, known as Intelligent Code
Analysis with AI (ICAA), enhances bug detection accuracy, reduces false positives, and achieves a recall rate
of 60.8%. However, this effectiveness comes with the trade-off of high token consumption by large language
models (LLMs).
12. CodeBERT: A Pre-Trained Model for Programming and Natural Languages, Zhangyin Feng 2020
CodeBERT is a model pre-trained on both programming languages and natural language for tasks such as code
search and code documentation generation. The study demonstrates how CodeBERT handles code-related tasks by
exploiting bi-modal pre-training. Experiments show that CodeBERT outperforms existing models on code search
and code summarization, making it an effective tool for analyzing and comprehending source code, especially
for tasks that require bridging code and plain language.
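CodeBERT's released checkpoint is available on Hugging Face; the minimal sketch below encodes a natural-language/code pair, the bi-modal input format the model was pre-trained on. The query and snippet are invented examples, and scoring relevance from the [CLS] embedding is one common pattern, not the paper's full pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the publicly released CodeBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

# Bi-modal input: a natural-language query paired with a code snippet.
nl = "return the maximum of two numbers"
code = "def mx(a, b): return a if a > b else b"
inputs = tokenizer(nl, code, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] embedding can be used to score NL-code relevance for code search.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```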
LITERATURE REVIEW
13. Self-Supervised Learning for Code, Marc Brockschmid 2021
The approach taken focuses on learning meaningful code representations by predicting masked tokens or
next sequences in code. The results show that self-supervised models excel in various code-related tasks,
such as code completion, code translation, and bug detection. The paper emphasizes the potential of
self-supervised learning to advance code understanding, especially in scenarios where labeled data is
scarce.
14. Enhancing Code Representation with Self-Supervised Learning, Zehn Lee 2021
This study covers the use of self-supervised learning to train models on sizable code datasets without the
need for labeled data. The method centers on predicting masked tokens or the next code sequence in order to
learn meaningful code representations. The findings demonstrate that self-supervised models perform well on
a variety of code-related tasks, including bug detection, code translation, and code completion. The study
highlights how self-supervised learning can improve code understanding, particularly in settings with
limited labeled data.
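A minimal sketch of the masked-token objective these two papers describe: hide a fraction of code tokens and keep the originals as labels, so the data supervises itself. The whitespace token splitting, mask rate, and names here are simplifying assumptions for illustration.

```python
import random

MASK = "<mask>"

def mask_code_tokens(tokens, mask_prob=0.15, seed=0):
    """Create a self-supervised training pair by hiding random tokens.

    The model's objective is to recover the original token at each masked
    position; no human-written labels are required.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)   # the label is the original token itself
        else:
            corrupted.append(tok)
            targets.append(None)  # position not scored by the loss
    return corrupted, targets

tokens = "def add ( a , b ) : return a + b".split()
print(mask_code_tokens(tokens))
```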
LITERATURE REVIEW
15. Automated Code Review: A Survey and Future Research Directions, Dongsun Kim 2021
This study thoroughly surveys automated code review methods, including those driven by AI and machine
learning. It discusses the benefits and drawbacks of using machine learning models and static analysis
tools to find bugs, security flaws, and code smells in source code. The survey observes that AI-driven
code review solutions, which use models such as GPT to identify errors and offer contextual feedback,
are becoming increasingly popular. It also points out gaps in the literature and recommends future work,
such as improving the precision and scalability of AI-driven code reviews.
METHODOLOGY
Research Methodology
1. Data Collection Methods:
We plan to collect large-scale source code datasets from publicly available repositories such as GitHub, covering multiple
programming languages (e.g., Python, Java, C++). The datasets will include various types of code such as full projects, snippets,
and code with known vulnerabilities or bugs.
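As a sketch of how such collection might start, the snippet below queries GitHub's public repository search API for popular repositories per language using requests. The endpoint and parameters follow the documented GitHub search API, but authentication, cloning, and filtering for known-vulnerable code are omitted here.

```python
import requests

def top_repos(language: str, n: int = 5) -> list[str]:
    """Return clone URLs of the n most-starred repos for a language.

    Unauthenticated requests are heavily rate-limited; in practice a
    personal access token should be sent in the Authorization header.
    """
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": f"language:{language}", "sort": "stars", "per_page": n},
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return [repo["clone_url"] for repo in resp.json()["items"]]

for lang in ("python", "java", "cpp"):
    print(lang, top_repos(lang))
```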
2. Model Selection:
We will start with established models like CodeBERT and GPT-3/4 to understand their performance on code analysis tasks. These
models are pre-trained on both natural language and source code, making them suitable for tasks like code summarization, bug
detection, and vulnerability identification. We will then fine-tune these models on our specific dataset, focusing on
enhancing their ability to detect logic inconsistencies and security vulnerabilities.
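A minimal fine-tuning sketch under the stated plan, framing bug detection as binary classification over snippets with the Hugging Face Trainer. The two-example dataset, hyperparameters, and output directory are placeholders for our real data and configuration, not final choices.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = clean, 1 = buggy

# Toy stand-in for our labeled snippets; the real dataset is far larger.
data = Dataset.from_dict({
    "code": ["def f(x): return x / 0", "def g(x): return x + 1"],
    "label": [1, 0],
})

def encode(batch):
    return tokenizer(batch["code"], truncation=True,
                     padding="max_length", max_length=128)

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-bug-detector",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```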
METHODOLOGY
3. Evaluation Metrics:
Accuracy: The overall percentage of correct classifications when identifying bugs, vulnerabilities, and code smells.
Precision and Recall: Precision will assess the proportion of true positive results among the
identified issues, while recall will measure the proportion of actual issues correctly identified.
F1-Score: The harmonic mean of precision and recall, providing a balanced evaluation of the
model’s performance.
False Positive Rate (FPR): Evaluates how often the model incorrectly identifies a bug or
vulnerability in correct code.
Code Quality Metrics: Metrics like cyclomatic complexity and maintainability index will be used
to assess the model's impact on overall code quality.
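For clarity, the detection metrics above reduce to simple confusion-matrix arithmetic; the self-contained sketch below computes them over binary labels, with the example predictions invented for illustration.

```python
def detection_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and false positive rate
    for binary bug/vulnerability predictions (1 = issue, 0 = clean)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }

print(detection_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```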
METHODOLOGY
Data Analysis
Dataset Characteristics: Our dataset includes a balanced mix of high-quality and low-quality code,
with 40% of the snippets containing at least one bug or vulnerability. The dataset spans multiple
domains, including web development, data processing, and system-level programming.
Data Analysis: We analyzed the data to understand common code issues and how code
quality varies. Insights from this analysis indicated that the most common issues are related
to improper input validation and memory management in languages like C++.
METHODOLOGY
Resources and Tools
Python: The primary programming language used to build and connect the source code summarizer's components,
such as text processing, model interaction, and API handling.
LangChain: A framework for combining and orchestrating language models, such as GPT-3, with custom prompts
and logic, enabling the construction of the source code summarization pipeline.
Flask: A lightweight Python web framework that serves the source code summarizer's API endpoint and lets
users interact with the service over HTTP.
OpenAI GPT-3: The underlying language model, which uses the patterns and semantics in submitted source code
to understand it and produce summaries.
ChromaDB/MongoDB: Databases that provide efficient storage and retrieval of embeddings created from source
code, allowing fast access to similar code structures and improving summarization accuracy.
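Putting the pieces together, a minimal sketch of the summarization endpoint is shown below, calling OpenAI's chat completions API directly from a Flask route. The route name, model choice, and prompt wording are our assumptions, and the LangChain orchestration and ChromaDB retrieval steps are omitted for brevity.

```python
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@app.route("/summarize", methods=["POST"])
def summarize():
    """Accept a JSON body {"code": "..."} and return a short summary."""
    code = request.get_json(force=True).get("code", "")
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You summarize source code concisely for developers."},
            {"role": "user", "content": f"Summarize this code:\n\n{code}"},
        ],
    )
    return jsonify({"summary": completion.choices[0].message.content})

if __name__ == "__main__":
    app.run(port=5000)
```

A client would then POST JSON such as {"code": "def f(): ..."} to /summarize and receive a {"summary": "..."} response.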
METHODOLOGY
Potential Challenges