Commit c8da0d7

doc: add some blog to do
1 parent 5577145 commit c8da0d7

1 file changed

+94
-0
lines changed

website/blog/code-search-design-space.md

Lines changed: 94 additions & 0 deletions
@@ -1,5 +1,99 @@
# Design Space for Code Search

Code search is a critical tool for developers, enabling them to find, understand, and reuse existing code.

ast-grep is, at its core, a code search tool: its other features, such as linting and rewriting, are all derived from the basic code search functionality.

This post recaps an excellent review paper, Code Search: A Survey of Techniques for Finding Code: https://www.lucadigrazia.com/papers/acmcsur2022.pdf

We will not cover every detail of the paper. Instead, we focus on the design space of a code search tool, which has three main factors:

1. Query design
2. Indexing
3. Retrieval

## Query Design

The starting point of every search is a query. We define a query as an explicit expression of the intent of the user of a code search engine. This intent can be expressed in various ways, and different code search engines support different kinds of queries.

The designers of a code search engine typically aim at several goals when deciding what kinds of queries to support:

* Ease. A query should be easy to formulate, enabling users to use the code search engine without extensive training. If formulating an effective query is too difficult, users may get discouraged from using the code search engine.
* Expressiveness. Users should be able to formulate whatever intent they have when searching for code. If a user is unable to express a particular intent, the search engine cannot find the desired results.
* Precision. The queries should allow specifying the intent as unambiguously as possible. If the queries are imprecise, the search is likely to yield irrelevant results.

These three goals are often at odds with each other. For example, free-form natural language queries are easy to write but imprecise, whereas a formal query language is precise but takes more effort to learn.

## Preprocessing and Expansion of Queries

The query provided by a user may not be the best possible query to obtain the results the user expects. One reason is that natural language queries suffer from the inherent imprecision of natural language. Another reason is that the vocabulary used in a query may not match the vocabulary used in a potential search result. For example, a query about “container” is syntactically different from “collection”, but both refer to similar concepts. Finally, a user may initially be unsure what exactly she wants to find, which can cause the initial query to be incomplete.

To address the limitations of user-provided queries, approaches for preprocessing and expanding queries have been developed. The survey discusses these approaches along three dimensions: (i) the user interface, i.e., if and how a user gets involved in modifying queries, (ii) the information used to modify queries, i.e., what additional source of knowledge an approach consults, and (iii) the actual technique used to modify queries. Table 1 of the survey summarizes different approaches along these three dimensions.

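The vocabulary mismatch described above can be sketched with a tiny synonym-based expansion. This is only an illustration under stated assumptions: the `SYNONYMS` table and the functions `expand_query` and `matches` are hypothetical, and real approaches surveyed in the paper draw on richer sources of knowledge than a hand-written dictionary.

```python
import re

# Hypothetical synonym table; a real system would consult an external source.
SYNONYMS = {
    "container": {"collection"},
    "collection": {"container"},
}


def expand_query(query: str, synonyms: dict[str, set[str]]) -> set[str]:
    """Return the query terms plus any known synonyms of those terms."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded


def matches(terms: set[str], document: str) -> bool:
    """Return True if the document shares at least one word with the terms."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return bool(terms & words)


if __name__ == "__main__":
    code = "def sort_collection(items): ..."
    print(matches({"container"}, code))                        # no shared word
    print(matches(expand_query("container", SYNONYMS), code))  # matches after expansion
```
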
## Indexing or Training, Followed by Retrieval of Code

Perhaps the most important component of a code search engine is the one that retrieves code examples relevant to a given query. The vast majority of approaches follow a two-step process inspired by general information retrieval. First, they either index the data to search through, e.g., by representing features of code examples in a numerical vector, or train a model that learns representations of the data to search through. Then, they retrieve relevant data items based on the pre-computed index or the trained model. To simplify the presentation, we refer to the first phase as “indexing” and mean both indexing in the sense of information retrieval and training a model on the data to search through.

The primary goal of indexing and retrieval is effectiveness, i.e., the ability to find the “right” code examples for a query. To effectively identify these code examples, various ways of representing code and queries to compare them with each other have been proposed. A secondary goal, which is often at odds with achieving effectiveness, is efficiency. As users typically expect code search engines to respond within seconds [108], building an index that is fast to query is crucial. Moreover, as the code corpora to search through are continuously increasing in size, the scalability of both indexing and retrieval is important as well [4].

The survey covers the many different approaches to indexing, training, and retrieval in code search engines along four dimensions, as illustrated in its Figure 4. Section 4.1 of the survey discusses what kind of artifacts a search engine indexes. Section 4.2 describes different ways of representing the extracted information. Section 4.3 presents techniques for comparing queries and code examples with each other. Table 2 of the survey summarizes the approaches along these first three dimensions. Finally, Section 4.4 discusses different levels of granularity of the source code retrieved by search engines.

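The two-step process (index first, then retrieve) can be sketched with a small inverted index. This is a simplified illustration, not the method of any particular engine in the survey: the corpus, the file names, and the functions `tokenize`, `build_index`, and `retrieve` are all assumptions, and the ranking simply counts shared tokens.

```python
import re
from collections import defaultdict

# Hypothetical corpus mapping file names to code snippets.
CORPUS = {
    "read_file.py": "def read_file(path): return open(path).read()",
    "parse_json.py": "def parse_json(text): return json.loads(text)",
    "read_json.py": "def read_json(path): return json.load(open(path))",
}


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens (underscores separate tokens)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Step 1 (indexing): map each token to the files that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, code in corpus.items():
        for token in tokenize(code):
            index[token].add(name)
    return index


def retrieve(index: dict[str, set[str]], query: str) -> list[str]:
    """Step 2 (retrieval): rank files by the number of query tokens they contain."""
    scores: dict[str, int] = defaultdict(int)
    for token in set(tokenize(query)):
        for name in index.get(token, ()):
            scores[name] += 1
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    index = build_index(CORPUS)
    print(retrieve(index, "read json"))
```

Building the index once and answering many queries against it is what makes retrieval fast; the trade-off is that the index must be kept up to date as the corpus grows.
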
## Representing the Information for Indexing

* Individual Code Elements: Representing code as sets of individual elements, such as tokens or function calls, without considering their order or relationships.
* Sequences of Code Elements: Preserving the order of code elements by extracting sequences from Abstract Syntax Trees (ASTs) or control flow graphs.
* Relations between Code Elements: Extracting and representing relationships between code elements, such as parent-child relationships, method calls, and data flow.

ast-grep indexes individual code elements.

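The three representations listed above can be illustrated on one small snippet using Python's standard `ast` module. This is a sketch for intuition only, not a description of ast-grep's internals; the snippet and the function names `individual_elements`, `element_sequence`, and `parent_child_relations` are hypothetical.

```python
import ast

# Hypothetical snippet to represent in three different ways.
SNIPPET = "total = add(price, tax)\nprint(total)"


def preorder(node: ast.AST):
    """Yield AST nodes in pre-order (parent before its children)."""
    yield node
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)


def individual_elements(tree: ast.AST) -> set[str]:
    """Individual elements: the set of identifiers, ignoring order and repeats."""
    return {node.id for node in preorder(tree) if isinstance(node, ast.Name)}


def element_sequence(tree: ast.AST) -> list[str]:
    """Sequence of elements: identifiers in the order they occur in the tree."""
    return [node.id for node in preorder(tree) if isinstance(node, ast.Name)]


def parent_child_relations(tree: ast.AST) -> set[tuple[str, str]]:
    """Relations: (parent type, child type) pairs taken from the AST."""
    return {
        (type(parent).__name__, type(child).__name__)
        for parent in preorder(tree)
        for child in ast.iter_child_nodes(parent)
    }


if __name__ == "__main__":
    tree = ast.parse(SNIPPET)
    print(individual_elements(tree))
    print(element_sequence(tree))
    print(sorted(parent_child_relations(tree)))
```

Each step keeps more structure than the previous one: the set forgets order, the sequence forgets nesting, and the relations keep who contains whom.
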
## Representing the Information for Retrieval

Techniques to compare queries and code:

* Feature Vectors: Algorithmically extracted feature vectors represent code and queries as numerical vectors. Standard distance measures like cosine similarity or Euclidean distance are used to compare these vectors.
* Machine Learning-Based Techniques: End-to-end neural learning models embed both queries and code into a joint vector space, allowing for efficient retrieval based on learned representations.
* Database-Based Techniques: General-purpose databases, such as NoSQL or relational databases, store and retrieve code examples based on precise matches to the query.
* Graph-Based Matching: Code and queries are represented as graphs, and graph similarity scores or rewrite rules are used to match them.
* Solver-Based Matching: SMT solvers are used to match queries against code examples by solving constraints that describe input-output relationships.

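As an illustration of the feature-vector technique in the list above, the sketch below represents a query and code snippets as bag-of-tokens count vectors and compares them with cosine similarity. The tokenization and the function names `bag_of_tokens` and `cosine_similarity` are assumptions chosen for brevity; actual engines extract more elaborate features.

```python
import math
import re
from collections import Counter


def bag_of_tokens(text: str) -> Counter:
    """Represent text as a sparse vector of lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = bag_of_tokens("read file")
    candidates = ["def read_file(path)", "def parse_json(text)"]
    for code in candidates:
        print(code, round(cosine_similarity(query, bag_of_tokens(code)), 3))
```

The candidate sharing more tokens with the query receives the higher score, so sorting candidates by this value yields a ranked result list.
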
* Code Search: A Survey of Techniques for Finding Code: https://www.lucadigrazia.com/papers/acmcsur2022.pdf
* Aroma: Code recommendation via structural search: https://arxiv.org/pdf/1812.01158
* Deep code search: https://guxd.github.io/papers/deepcs.pdf
