# Design Space for Code Search

Code search is a critical tool for developers, enabling them to find, understand, and reuse existing code.

ast-grep is, at its core, a code search tool: its other features, such as linting and rewriting, are all derived from the basic code search functionality.

This post is a recap of an excellent survey paper: [Code Search: A Survey of Techniques for Finding Code](https://www.lucadigrazia.com/papers/acmcsur2022.pdf).

We will not cover every detail of the paper. Instead, we focus on the design space of a code search tool, which has three main factors:

1. Query design
2. Indexing
3. Retrieval

## Query Design

The starting point of every search is a query. We define a query as an explicit expression of the intent of the user of a code search engine. This intent can be expressed in various ways, and different code search engines support different kinds of queries.

The designers of a code search engine typically aim at several goals when deciding what kinds of queries to support:

* Ease. A query should be easy to formulate, enabling users to use the code search engine without extensive training. If formulating an effective query is too difficult, users may be discouraged from using the code search engine.
* Expressiveness. Users should be able to formulate whatever intent they have when searching for code. If a user is unable to express a particular intent, the search engine cannot find the desired results.
* Precision. The queries should allow specifying the intent as unambiguously as possible. If the queries are imprecise, the search is likely to yield irrelevant results.

These three goals are often at odds with each other. For example, a free-text query is easy to write but imprecise, whereas a formal query language is precise and expressive but takes more effort to learn.

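The tension between ease and precision can be illustrated with a small sketch. It uses Python's standard `re` and `ast` modules rather than ast-grep itself, and the names `SOURCE`, `text_search`, and `call_search` are hypothetical, chosen only for this example. The textual query is trivial to write but also matches comments and string literals; the structural query takes more effort to formulate but matches only real calls.

```python
import ast
import re

# Hypothetical snippet to search through.
SOURCE = '''
# print the banner later
message = "print(x) is not a call here"
print(message)
'''


def text_search(source: str, needle: str) -> list[int]:
    """Easy but imprecise: line numbers of every line containing the text."""
    pattern = re.compile(re.escape(needle))
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def call_search(source: str, func_name: str) -> list[int]:
    """More precise: line numbers of actual calls to `func_name`."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )


textual_hits = text_search(SOURCE, "print")
structural_hits = call_search(SOURCE, "print")
```

Here the textual query reports the comment, the string literal, and the call, while the structural query reports only the call on the last line.
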
## Preprocessing and Expansion of Queries

The query provided by a user may not be the best possible query to obtain the results the user expects. One reason is that natural language queries suffer from the inherent imprecision of natural language. Another reason is that the vocabulary used in a query may not match the vocabulary used in a potential search result. For example, a query about “container” is syntactically different from “collection”, but both refer to similar concepts. Finally, a user may initially be unsure what exactly she wants to find, which can cause the initial query to be incomplete.

To address the limitations of user-provided queries, approaches for preprocessing and expanding queries have been developed. The survey discusses these approaches along three dimensions: (i) the user interface, i.e., if and how a user gets involved in modifying queries, (ii) the information used to modify queries, i.e., what additional source of knowledge an approach consults, and (iii) the actual technique used to modify queries. Table 1 of the survey summarizes the different approaches along these three dimensions.

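As a minimal sketch of one possible expansion technique, the following code adds synonyms to a query so that “container” also retrieves results that mention “collection”. The synonym table and the function name `expand_query` are assumptions made for illustration; the survey covers many other sources of knowledge and techniques.

```python
# Hypothetical synonym table. A real engine might derive it from
# documentation, a thesaurus, or co-occurrence statistics.
SYNONYMS: dict[str, set[str]] = {
    "container": {"collection"},
    "collection": {"container"},
    "remove": {"delete"},
}


def expand_query(query: str, synonyms: dict[str, set[str]] = SYNONYMS) -> list[str]:
    """Return the query terms followed by their known synonyms, without duplicates."""
    terms = query.lower().split()
    expanded = list(dict.fromkeys(terms))  # de-duplicate, keep original order
    for term in terms:
        for alternative in sorted(synonyms.get(term, ())):
            if alternative not in expanded:
                expanded.append(alternative)
    return expanded


expanded_terms = expand_query("sort container")
```

The original terms stay first, so a ranking step could still weight them higher than the added synonyms.
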
## Indexing or Training, Followed by Retrieval of Code

Perhaps the most important component of a code search engine is the one that retrieves code examples relevant to a given query. The vast majority of approaches follow a two-step process inspired by general information retrieval. First, they either index the data to search through, e.g., by representing features of code examples in a numerical vector, or train a model that learns representations of that data. Then, they retrieve relevant data items based on the pre-computed index or the trained model. To simplify the presentation, we refer to the first phase as “indexing” and mean both indexing in the sense of information retrieval and training a model on the data to search through.

The primary goal of indexing and retrieval is effectiveness, i.e., the ability to find the “right” code examples for a query. To effectively identify these code examples, various ways of representing code and queries to compare them with each other have been proposed. A secondary goal, which is often at odds with effectiveness, is efficiency. As users typically expect code search engines to respond within seconds [108], building an index that is fast to query is crucial. Moreover, as the code corpora to search through keep growing, the scalability of both indexing and retrieval is important as well [4].

The survey organizes the many different approaches to indexing, training, and retrieval along four dimensions, as illustrated in its Figure 4. Its Section 4.1 discusses what kind of artifacts a search engine indexes. Section 4.2 describes different ways of representing the extracted information. Section 4.3 presents techniques for comparing queries and code examples with each other. Table 2 summarizes the approaches along these first three dimensions. Finally, Section 4.4 discusses different levels of granularity of the source code retrieved by search engines.

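The two-phase structure can be sketched with a tiny inverted index. This is only an illustration of the classic information retrieval variant, not of a trained model, and the corpus, the crude tokenizer, and the function names are all assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Deliberately crude tokenizer: lowercase alphanumeric runs."""
    return {token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)}


def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Phase 1 (indexing): map each token to the snippets that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, code in corpus.items():
        for token in tokenize(code):
            index[token].add(name)
    return dict(index)


def retrieve(index: dict[str, set[str]], query: str) -> list[str]:
    """Phase 2 (retrieval): snippets that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Hypothetical corpus of named snippets.
CORPUS = {
    "read_file": "def read_file(path): return open(path).read()",
    "write_file": "def write_file(path, data): open(path, 'w').write(data)",
    "add": "def add(a, b): return a + b",
}
INDEX = build_index(CORPUS)
```

Building the index costs one pass over the corpus, after which each query only touches the posting sets of its own tokens. That is the efficiency argument for paying the indexing cost up front.
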
## Representing the Information for Indexing

* Individual Code Elements: Representing code as sets of individual elements, such as tokens or function calls, without considering their order or relationships.
* Sequences of Code Elements: Preserving the order of code elements by extracting sequences from Abstract Syntax Trees (ASTs) or control flow graphs.
* Relations between Code Elements: Extracting and representing relationships between code elements, such as parent-child relationships, method calls, and data flow.

ast-grep indexes individual code elements.

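To make the three representations concrete, here is a rough sketch that derives each of them from the same snippet using Python's standard `ast` module. It is not how ast-grep or any surveyed tool works internally; the function names and the two sample snippets are hypothetical.

```python
import ast


def call_names(source: str) -> list[str]:
    """Names of directly called functions, in source order."""
    calls = [
        node
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    ]
    calls.sort(key=lambda node: (node.lineno, node.col_offset))
    return [node.func.id for node in calls]


def as_elements(source: str) -> set[str]:
    """Individual elements: an unordered set of called function names."""
    return set(call_names(source))


def as_sequence(source: str) -> list[str]:
    """Sequence: the same names, with their order preserved."""
    return call_names(source)


def as_relations(source: str) -> set[tuple[str, str]]:
    """Relations: (parent node kind, child node kind) pairs from the AST."""
    tree = ast.parse(source)
    return {
        (type(parent).__name__, type(child).__name__)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    }


# Two hypothetical snippets with the same calls in a different order.
PIPELINE = "data = load(path)\nclean(data)\nsave(data)\n"
REORDERED = "save(data)\nclean(data)\ndata = load(path)\n"
```

The set representation cannot tell `PIPELINE` and `REORDERED` apart, the sequence representation can, and the relation pairs additionally record that one of the calls sits under an assignment.
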
## Representing the Information for Retrieval

Techniques to compare queries and code:

* Feature Vectors: Algorithmically extracted feature vectors represent code and queries as numerical vectors. Standard distance measures like cosine similarity or Euclidean distance are used to compare these vectors.
* Machine Learning-Based Techniques: End-to-end neural learning models embed both queries and code into a joint vector space, allowing for efficient retrieval based on learned representations.
* Database-Based Techniques: General-purpose databases, such as NoSQL or relational databases, store and retrieve code examples based on precise matches to the query.
* Graph-Based Matching: Code and queries are represented as graphs, and graph similarity scores or rewrite rules are used to match them.
* Solver-Based Matching: SMT solvers are used to match queries against code examples by solving constraints that describe input-output relationships.

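The feature vector technique is the easiest of these to sketch. The example below represents both query and code as token-count vectors and ranks snippets by cosine similarity. The tokenizer, the corpus, and the function names are assumptions made for illustration, and real systems typically use richer features and weighting.

```python
import math
import re
from collections import Counter


def featurize(text: str) -> Counter:
    """Token-count vector over lowercase alphanumeric runs."""
    return Counter(token.lower() for token in re.findall(r"[A-Za-z0-9]+", text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, corpus: dict[str, str]) -> list[tuple[str, float]]:
    """Score every snippet against the query, best match first."""
    query_vector = featurize(query)
    scored = [(name, cosine(query_vector, featurize(code))) for name, code in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical corpus of named snippets.
CORPUS = {
    "read_file": "def read_file(path): return open(path).read()",
    "add": "def add(a, b): return a + b",
}
ranking = rank("read file", CORPUS)
```

Because the comparison is purely lexical, a query such as "load file" would score lower against `read_file` than "read file" does, which is the vocabulary mismatch that learned joint embeddings aim to reduce.
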
## References

* [Code Search: A Survey of Techniques for Finding Code](https://www.lucadigrazia.com/papers/acmcsur2022.pdf)
* [Aroma: Code recommendation via structural search](https://arxiv.org/pdf/1812.01158)
* [Deep code search](https://guxd.github.io/papers/deepcs.pdf)