
LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

Songhao Han*1  Le Zhuo*1  Yue Liao1  Si Liu1

arXiv:2311.11904v2 [cs.CV] 19 Feb 2024

*Equal contribution. 1Beihang University, Beijing, China. Correspondence to: Yue Liao <liaoyue.ai@gmail.com>.

Abstract

Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advancements in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and inaccuracy. We attribute this to two primary factors: 1) the reliance on single-turn textual interactions with LLMs, leading to a mismatch between the generated text and the visual concepts of VLMs; 2) the oversight of inter-class relationships, resulting in descriptors that fail to differentiate similar classes effectively. In this paper, we propose a novel framework that integrates LLMs and VLMs to find the optimal class descriptors. Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors. We demonstrate that our optimized descriptors are of high quality and effectively improve classification accuracy on a wide range of benchmarks. Additionally, these descriptors offer explainable and robust features, boosting performance across various backbone models and complementing fine-tuning-based methods.

Figure 1: Schematic of the method. (a) Previous methods use an LLM to generate descriptive prompts for each class directly, e.g., asking "What are useful features for distinguishing a Cactus Wren in a photo?" and receiving descriptors such as "desert-dwelling", "striped plumage", "short tail", and "short stubby beak". (b) Our method optimizes class descriptions through an evolutionary process. We utilize a VLM (such as CLIP (Radford et al., 2021)) to obtain visual feedback, e.g., the confusion matrix, assessing the quality of the current descriptions. Given this visual feedback, the LLM generates refined category descriptions, iterating multiple times to reach the final optimal category descriptions.

1. Introduction

In recent years, a plethora of vision-language models (VLMs) (Radford et al., 2021; Alayrac et al., 2022; Li et al., 2023) has emerged, showcasing impressive transfer learning capabilities across diverse visual tasks. These models, by pretraining on large datasets, learn to align images and text within a shared embedding space. Unlike conventional models, VLMs classify images by computing the similarity between the input image and textual descriptions. Notably, CLIP (Radford et al., 2021) achieves outstanding results in zero-shot image classification across various datasets. This is achieved by employing a combination of class names and pre-defined templates as input prompts and then matching images to the most similar prompt.

This paradigm, though effective, is highly dependent on the quality of the class prompts. For instance, on datasets with abstract or ambiguous class names, such as CUB (Wah et al., 2011) and Flowers102 (Nilsback & Zisserman, 2008), CLIP struggles to effectively distinguish images using class names as the sole prompts. The advent of Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Chowdhery et al., 2022) has prompted research (Menon & Vondrick, 2023; Roth et al., 2023; Novack et al., 2023; Pratt et al., 2023) into enhancing class descriptions through LLMs. These methods exploit LLMs' extensive world knowledge to generate more detailed and semantically rich descriptions for each category, thereby enriching the class prompts. Despite these advancements, current approaches exhibit several drawbacks.

LLMs, trained exclusively on text, lack a nuanced understanding of visual concepts. Consequently, when provided only with textual class names, LLMs tend to produce ambiguous or inaccurate descriptions (Menon & Vondrick, 2023), e.g., "short stubby beak" for Cactus Wren, which actually has a curved and relatively long beak. Evidence from WaffleCLIP (Roth et al., 2023) also suggests that replacing LLM-generated class descriptions with random, meaningless characters does not hurt the overall classification performance, questioning the effectiveness of these methods. Moreover, the fundamental goal of LLM-based methods is to approximate the globally optimal centroids within the CLIP embedding space for all classes using descriptive texts generated by LLMs. Achieving this necessitates considering inter-category relationships and engaging in an iterative optimization process. Current methodologies, as depicted in Figure 1, are tailored to generate descriptions for individual classes in a single iteration. As a result, the generated descriptions tend to be overly general, with multiple categories sharing similar phrases, e.g., "various colors" in bird classification. This generality hinders the ability to effectively discriminate between similar categories.

In light of these limitations, a key question arises: how can we design an automated pipeline that empowers LLMs to discover globally optimal class descriptions, thereby improving the overall visual classification performance? In this paper, we introduce a novel approach, named Iterative Optimization with Visual Feedback, which demonstrates how an LLM agent can collaborate with VLMs, using the feedback of visual classification to progressively refine class descriptions (Figure 1). Our method formulates this task as a combinatorial optimization problem: identifying the combination of class descriptions for each category that maximizes VLM image classification performance. Given the problem's vast search space, we develop an LLM agent integrated with a genetic algorithm, in which descriptions are evolved toward better solutions. Within each iterative cycle, the agent first conducts mutation based on the last round's descriptions and then performs crossover among various candidates to produce optimized concepts. This dual process of mutation and crossover allows the agent to explore the solution space both locally and globally, searching for the most effective visual concepts. We further introduce the concept of visual feedback, built from CLIP's image classification metrics, to reduce variance across runs and computational cost. Visual feedback serves as both reward and memory for our agent, steering the LLM toward rational optimization and mitigating random-walk behavior during the process.

Extensive experiments conducted across nine image classification benchmark datasets reveal that our approach significantly outperforms current LLM-based methods as well as vanilla CLIP. We demonstrate that our LLM agent is able to iteratively discover highly descriptive visual descriptions that are conducive to image classification. We highlight another key insight: the final optimized class descriptions serve as robust visual knowledge with strong interpretability and transferability, which can consistently enhance model performance across different backbones and augment the efficacy of fine-tuning-based methods.

In summary, our contributions are: 1) we identify the limitations of existing methods and propose a novel paradigm for LLM-augmented visual classification that optimizes class descriptions with visual feedback from VLMs; 2) to find the optimized descriptions, we design a genetic algorithm-inspired agent for iterative refinement without training; 3) we conduct extensive experiments across 9 image classification datasets to validate the effectiveness of our optimized descriptions, along with their additional explainability and transferability. The code of our method and baselines is provided in the Supplementary Material.

2. Related Work

Large Language Models. Large Language Models (LLMs) have exhibited remarkable proficiency and sophisticated reasoning skills, significantly influencing various domains within artificial intelligence (Devlin et al., 2018; Raffel et al., 2020; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023; Longpre et al., 2023; Chowdhery et al., 2022). These models have proven capable of solving complex tasks once thought to be solely within human capability, such as mathematical reasoning (Wei et al., 2022; Imani et al., 2023), drug discovery (Liang et al., 2023; Liu et al., 2023c), and decision making (Yu et al., 2023; Ma et al., 2023). Their success in these areas underscores the planning and reasoning capabilities of LLMs. Furthermore, LLMs have shown immense potential in the multimodal domain (Huang et al., 2023; Zhu et al., 2023; Liu et al., 2023b; Zhao et al., 2023; Wu et al., 2023). Most researchers align well-trained encoders from various modalities with LLMs through instruction tuning, equipping these models to interpret multimodal inputs. In contrast, our approach leverages a gradient-free method to integrate visual knowledge into LLMs without any need for fine-tuning. Nevertheless, the inherent limitations of LLMs in comprehending other modalities exacerbate the phenomenon of "hallucination" in multimodal contexts (Yin et al., 2023; Liu et al., 2023a; Cui et al., 2023), leading to inaccurate or even erroneous outputs. Addressing this issue of hallucination in LLMs thus represents a pivotal challenge.

[Figure 2 overview: class names and images are encoded by CLIP's text and image encoders; the resulting overall accuracy (92.68%), class-wise accuracies, and N×N confusion matrix over confusable classes (e.g., Peruvian Lily, Tiger Lily, Sword Lily, Wallflower, Silverbush, Gazania) form the visual feedback, which, together with memory banks of positive descriptors (e.g., "linear leaves", "recurved") and negative descriptors (e.g., "herbaceous", "bulbous plant"), is given to the LLM to improve the class descriptors through initialization, mutation, crossover, and natural selection, iterated N times; panel (c) traces how the descriptors of Peruvian Lily and Tiger Lily evolve from iteration 0 to iteration N.]
Figure 2: Illustration of iterative optimization with visual feedback. (a) Given raw class names as input, we first prompt
the LLM to generate an initialization of class descriptors. These descriptors undergo an iterative optimization comprising
three stages: mutation, where diverse new candidates are generated; crossover, involving mixing and matching across
different candidates to produce better candidates; and natural selection, selecting the most suitable candidate based on a
fitness function. (b) In each iteration, we compute visual metrics including classification accuracy and confusion matrix for
current class descriptors. We further use these metrics to construct visual feedback, update memory banks, and pick the best
candidate in natural selection. (c) Through this iterative optimization, the LLM progressively identifies the most effective
class descriptors, thereby enhancing the differentiation between ambiguous classes.

Prompt Engineering. Originating in NLP, prompt engineering significantly impacts the performance of VLMs on downstream tasks, prompting extensive research into identifying the optimal prompt. Prompt tuning (Jia et al., 2022; Huang et al., 2022; Zhang et al., 2022; Gao et al., 2021; Zhou et al., 2022), a method of parameter-efficient fine-tuning, introduces learnable parameters before the input text or image, which are then optimized through gradient updates. For instance, CoOp (Gao et al., 2021) improves class descriptions by incorporating a set of parameters that represent dataset context, optimizing prediction accuracy via cross-entropy loss minimization. While prompt tuning notably increases accuracy, it requires additional training. Our method achieves comparable results without any training and serves as a complementary approach to prompt tuning, offering further precision improvements when applied subsequently.

Using LLMs for Prompt Engineering. Recent advancements have seen the emergence of methods that employ LLMs to generate semantically richer descriptions for improving class prompts (Menon & Vondrick, 2023; Roth et al., 2023; Pratt et al., 2023; Novack et al., 2023; Yan et al., 2023). Menon & Vondrick (2023) first demonstrated that ensembling class-dependent descriptions generated by LLMs can improve classification accuracy. Building on this, WaffleCLIP (Roth et al., 2023) incorporated high-level concepts related to the dataset to mitigate ambiguities in class names. CuPL (Pratt et al., 2023) utilizes a series of hand-crafted prompt templates to enable LLMs to produce diverse descriptions for each class. However, we observed that descriptions generated by these LLM-based methods often suffer from inaccuracies and ambiguities. Consequently, we propose a method involving iterative optimization to continuously refine descriptions, integrating visual feedback into the optimization process so that LLMs can maximize the differentiation between distinct categories.

3. Method

3.1. Classification with Descriptor Ensembling

CLIP (Radford et al., 2021) consists of an image encoder and a text encoder, trained on 400M image-text pairs to learn a joint embedding space.

Given a query image x and a predefined set of classes C = {c1, c2, c3, ..., cn} expressed in natural language, CLIP performs zero-shot image classification by first encoding both the image and the class names into the shared embedding space, then computing the cosine similarity between the image and each class, and finally selecting the class with the highest similarity as the prediction,

    \tilde{c} = \arg\max_{c \in C} \cos\big(\phi_I(x), \phi_T(f(c))\big),        (1)

where \phi_I and \phi_T are the image encoder and text encoder, and f(c) is a prompt template such as "A photo of a {c}".

Prior work (Menon & Vondrick, 2023) proposed a simple yet effective method to augment the class names C using LLMs. They prompt LLMs to generate a set of descriptors for each category, e.g., Hen: two legs; red, brown, or white feathers; a small body. With these descriptive texts, they improve image classification accuracy by computing a comprehensive similarity score for each category:

    \tilde{c} = \arg\max_{c \in C} \frac{1}{|D(c)|} \sum_{d \in D(c)} \cos\big(\phi_I(x), \phi_T(d)\big),        (2)

where D(c) denotes the descriptors for class c (abbreviated as D). By averaging the scores of all class descriptors, i.e., prompt ensembling, we argue that this reduces the noise of the class-name embedding and leads to more robust visual classification.
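Eq. (2) is straightforward to compute on top of precomputed CLIP embeddings. The following is a minimal sketch, assuming the image features and per-descriptor text features have already been extracted and L2-normalized with a CLIP backbone; the function and variable names are illustrative and not part of the released code.

import torch

def classify_with_descriptors(image_feats, descriptor_feats):
    """Zero-shot classification with descriptor ensembling (Eq. 2).

    image_feats:      (B, E) L2-normalized CLIP image embeddings.
    descriptor_feats: list of length n_classes; entry c is a (D_c, E) tensor
                      of L2-normalized embeddings of class c's descriptors.
    Returns predicted class indices of shape (B,).
    """
    scores = []
    for feats_c in descriptor_feats:
        # For unit vectors the cosine similarity is a dot product;
        # averaging over a class's descriptors gives prompt ensembling.
        scores.append((image_feats @ feats_c.T).mean(dim=1))   # (B,)
    scores = torch.stack(scores, dim=1)                         # (B, n_classes)
    return scores.argmax(dim=1)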
3.2. Iterative Optimization with Visual Feedback

A key drawback of existing LLM-based methods is that they generate class descriptors in a single run, where the LLM is frozen and never updated. By contrast, the way humans recognize new objects involves a dynamic learning process: we gradually update our knowledge of objects through interaction with the environment, remembering useful features and forgetting useless ones. Inspired by this, we identify two fundamental ingredients that are missing in existing methods: interaction with the environment and iterative optimization. We formulate the search for optimal class descriptors as a combinatorial optimization problem and propose a novel method that dynamically optimizes the set of class descriptors in an iterative manner, integrated with visual feedback from CLIP.

Taking advantage of the extensive world knowledge and remarkable reasoning skills showcased by LLMs (OpenAI, 2023; Wei et al., 2022), we empower LLMs to act as prompt-optimization agents that search for the best combination of class descriptors. Given the complex solution space of possible descriptor combinations, we introduce an evolutionary process to search for the optimal class descriptors; classic genetic algorithms have proven effective at solving such complex optimization problems. The evolution starts with a population of randomly generated samples. In each iteration, it generates the next generation via two predefined operators, mutation and crossover, and then selects the most promising offspring based on a fitness function that evaluates the quality of an individual. We illustrate the detailed process of each iteration in Algorithm 1 and explain each step in the following paragraphs.

Initialization. Since the number of class labels can be large, especially for datasets like ImageNet (Deng et al., 2009), we first design a splitting strategy that groups all classes into clusters based on the similarity of their names. We extract the text embeddings of the class names and employ the K-means algorithm to cluster them into groups, where each group represents semantically similar classes. To ensure that the LLM always focuses on confusing classes, this clustering step is conducted dynamically, not only at initialization but also at the start of each iteration, where we cluster on the average embedding of all descriptors of each class.
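As a rough illustration of this grouping step, the sketch below clusters class-name embeddings with scikit-learn's KMeans, using roughly ten classes per cluster as described in Section 4.1. The embedding function is a placeholder for the CLIP text encoder; this is not the authors' released code.

import numpy as np
from sklearn.cluster import KMeans

def group_classes(class_names, embed_text, classes_per_group=10, seed=0):
    """Cluster semantically similar class names so the LLM can focus on
    confusable classes within each group."""
    embeddings = np.stack([embed_text(name) for name in class_names])   # (n, E)
    n_groups = max(1, round(len(class_names) / classes_per_group))
    labels = KMeans(n_clusters=n_groups, random_state=seed, n_init=10).fit_predict(embeddings)
    groups = [[] for _ in range(n_groups)]
    for name, g in zip(class_names, labels):
        groups[g].append(name)
    return groups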
After clustering, we condition the LLM to generate the initial class descriptors D0. Specifically, we instruct the LLM to generate n0 descriptors for each class, providing a task description, output formatting, and some design tips; more implementation details about our prompts are given in the Supplementary Materials. This initialization step is similar to previous methods (Menon & Vondrick, 2023; Roth et al., 2023; Pratt et al., 2023): the LLM generates plausible descriptors, usually related to colors, shapes, textures, etc. On the first try, however, the LLM sometimes generates features that are ambiguous or irrelevant to visual classification, owing to the high diversity of its outputs and the lack of visually grounded pre-training data. It is therefore necessary to introduce both feedback from CLIP and iterative optimization to mitigate these issues.

Visual Feedback. The key idea of visual feedback is to ground LLMs with the visual knowledge in VLMs so that they can better distinguish ambiguous classes during the optimization process. Instead of directly updating model parameters via instruction tuning, as in recent multimodal LLMs, we propose a gradient-free way to inject visual knowledge into LLMs. Specifically, given a current set of class descriptors D, we construct visual feedback V(D) from task-related evaluation metrics of CLIP, e.g., top-1 overall accuracy, class-wise accuracy, and the confusion matrix in image classification. These metrics offer a holistic evaluation of model performance under the current class descriptors.

Beyond these conventional metrics, we propose an improved version of the confusion matrix to more effectively capture intricate relationships between classes. We define a confusion threshold λ and mark a prediction as a positive sample for a class whenever its cosine similarity score exceeds λ times the cosine similarity score of the ground-truth label. The improved confusion matrix, denoted M̃, is computed by aggregating these positive-sample indicators.
Another challenge is that LLMs sometimes struggle to interpret the raw confusion matrix of shape |D| × |D|, particularly as the number of classes |D| increases. Hence, we refine this process by extracting the top-m classes from each row of M̃, i.e., the classes that are most confusing for CLIP. The visual feedback using the improved confusion matrix is formulated as follows:

    \mathrm{Pos}(x, d, d') = \begin{cases} 1 & \text{if } \cos(x, d') > \lambda \cos(x, d) \\ 0 & \text{otherwise,} \end{cases}        (3)

    \tilde{M}_{dd'} = \sum_{x \in X_d} \mathrm{Pos}(x, d, d') \quad \text{for } d, d' \in D,        (4)

    V(D) = \bigcup_{d \in D} \mathrm{Top}\text{-}m\big(\tilde{M}_{d*}\big),        (5)

where X_d denotes the images of class d used to compute the classification metrics in the visual feedback, and we fix λ = 0.9 and m = |D|/2 in all experiments.
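To make Eqs. (3)-(5) concrete, here is a small NumPy sketch of the improved confusion matrix and the top-m extraction. It assumes a per-image similarity matrix between images and classes (e.g., the ensembled scores from Section 3.1) has already been computed; the variable names are illustrative, and the diagonal is dropped for readability.

import numpy as np

def improved_confusion_feedback(sims, labels, lam=0.9, m=None):
    """Improved confusion matrix (Eqs. 3-5).

    sims:   (num_images, num_classes) cosine-similarity scores.
    labels: (num_images,) ground-truth class indices.
    Returns (M_tilde, feedback), where feedback[d] lists the m classes
    most often confused with class d.
    """
    n_classes = sims.shape[1]
    m = m if m is not None else n_classes // 2
    M = np.zeros((n_classes, n_classes), dtype=int)
    for s, d in zip(sims, labels):
        # A class d' counts as a positive if its score exceeds
        # lambda times the ground-truth score (Eq. 3).
        positives = s > lam * s[d]
        M[d] += positives.astype(int)            # Eq. 4
    np.fill_diagonal(M, 0)                       # keep only cross-class confusions
    feedback = {d: list(np.argsort(-M[d])[:m]) for d in range(n_classes)}   # Eq. 5
    return M, feedback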
Though this metric-based visual feedback is simple to construct, it plays an important role in estimating the divergence between LLM-generated descriptors and the optimal classification centroids in CLIP's latent space. It has three major applications in our optimization process. First, we convert the visual feedback V(D) into natural language; during mutation and crossover, this textual version of V(D) is integrated into the prompts to help the LLM distinguish the target category from confusing classes. Second, we adopt V(D) as the fitness function to evaluate sample quality and perform natural selection. Finally, we introduce memory banks M consisting of positive and negative historical class descriptors, which we dynamically update based on V(D) at the end of each iteration.
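As an illustration of the first application, the feedback of Eq. (5) can be verbalized before being placed into the LLM prompt. The exact prompt wording used by the authors is given in their Supplementary Materials, so the template below is only an illustrative sketch.

def feedback_to_text(class_name, overall_acc, class_acc, confused_with):
    """Render CLIP's metrics for one class as natural-language feedback."""
    confusions = ", ".join(confused_with)
    return (
        f"Overall accuracy: {overall_acc:.2%}. "
        f"Accuracy for '{class_name}': {class_acc:.2%}. "
        f"'{class_name}' is most often confused with: {confusions}. "
        f"Please revise the descriptors of '{class_name}' to emphasize features "
        f"that distinguish it from these classes."
    )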
Iterative Optimization. We first define the mutation and crossover operators that generate the next generation of class descriptors from the previous one. For the i-th iteration, given the previous set of class descriptors D_{i-1}, its visual feedback V(D_{i-1}), and the memory banks M, we query the LLM to pick the n_i least useful descriptors in the current set and replace them with n_i new descriptors that emphasize distinct visual features; this constitutes the mutation operation. We generate K independent candidates {D_i^0, D_i^1, ..., D_i^K} from the LLM in each iteration to ensure sufficient genetic diversity. For the crossover operation, we provide the K mutated candidates and their corresponding visual feedback as inputs and prompt the LLM to mix and match descriptors across samples, producing a new sample D_i^{K+1}. The key idea of crossover is to ensemble useful descriptors from different samples into an offspring with better overall performance.

Algorithm 1: Iterative Optimization with Visual Feedback
Require: class labels C, language model LLM, prompt prompt, visual feedback V
Hyperparameters: number of iterations N, mutation sample size K
 1: C ← K-means(C)
 2: for i = 1 to |C| do
 3:     D_{0,i} = LLM(c_i, prompt)
 4: end for
 5: M ← ∅, D̃ ← D_0
 6: for i = 1 to N do
 7:     D_{i-1} ← K-means(D_{i-1})
 8:     for j = 1 to |D_{i-1}| do
 9:         D_{i,j}^0, ..., D_{i,j}^K = LLM(prompt, D_{i-1,j}, M, V)
10:         D_{i,j}^{K+1} = LLM(prompt, D_{i-1,j}, D_{i,j}^0, ..., D_{i,j}^K, V)
11:         D_{i,j} = arg max_{D ∈ {D_{i,j}^0, ..., D_{i,j}^{K+1}}} V(D)
12:     end for
13:     M = Update(M, D_{i-1}, D_i, V)
14:     D̃ = arg max_{D ∈ {D_i, D̃}} V(D)
15: end for
Output: D̃

At the end of each iteration, we select the best-performing candidate from the population of generated class descriptors {D_i^0, D_i^1, ..., D_i^{K+1}} as D_i, the updated descriptor set. We adopt the overall accuracy from our visual feedback as the fitness score for natural selection. This fitness-based process ensures that our method always chooses the best candidate as the starting point of the next iteration and produces class descriptors that better discriminate between categories, thus gradually moving toward the global optimum. To assess the impact of descriptors and update the memory banks M, we compare D_i and D_{i-1} in detail, resulting in three groups of descriptors: unchanged, deleted, and added. We then compute the visual feedback for these groups and compare their overall accuracies. If the accuracy of D_i is greater than that of the unchanged descriptors, the added descriptors are beneficial for CLIP and we add them to the positive memory bank; otherwise, we add them to the negative one. Similarly, if the accuracy of D_{i-1} is greater than that of the unchanged descriptors, the deleted descriptors are beneficial for CLIP and we add them to the negative memory bank; otherwise, we add them to the positive one.
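To make the procedure concrete, here is a schematic Python rendering of one outer iteration (lines 6-14 of Algorithm 1). It is a sketch rather than the released implementation: llm_mutate, llm_crossover, and fitness stand in for the prompted GPT-4 calls and the CLIP accuracy evaluation.

def optimize_round(groups, descriptors, memory, K, llm_mutate, llm_crossover, fitness):
    """One iteration of the genetic-algorithm-inspired refinement.

    groups:      clusters of confusable classes (from K-means on descriptor embeddings).
    descriptors: dict mapping class name -> current list of descriptor strings.
    memory:      positive/negative memory banks of past descriptors.
    fitness:     evaluates a descriptor set with CLIP and returns
                 (overall_accuracy, textual_visual_feedback).
    """
    updated = dict(descriptors)
    for group in groups:
        current = {c: descriptors[c] for c in group}
        _, feedback = fitness(current)
        # Mutation: K independent candidates, each replacing the least useful descriptors.
        candidates = [llm_mutate(current, feedback, memory) for _ in range(K)]
        # Crossover: mix and match descriptors across the K mutated candidates.
        candidates.append(llm_crossover(candidates, [fitness(c) for c in candidates]))
        # Natural selection: keep the candidate with the highest overall accuracy.
        best = max(candidates, key=lambda cand: fitness(cand)[0])
        updated.update(best)
    # The memory banks are then updated by comparing the accuracies of the new set,
    # the previous set, and their shared (unchanged) descriptors, as described above.
    return updated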

3.3. Comparison with Other Methods

Our method distinguishes itself from existing LLM-based techniques (Menon & Vondrick, 2023; Roth et al., 2023; Pratt et al., 2023) in its reliance on labeled image data for optimization, deviating from the conventional zero-shot image classification paradigm. However, the ultimate goal of our method is to obtain an optimal set of class descriptions for the target dataset. Considering the prevalent issues of ambiguity and inaccuracy in category descriptions generated by LLMs in a zero-shot manner, the integration of VLMs into the optimization process is essential. Our experiments demonstrate that the class descriptions derived through our method more effectively enhance CLIP in image classification and yield comparable results even in a low-shot setting.

Furthermore, our method also diverges significantly from fine-tuning-based approaches (Gao et al., 2021; Zhou et al., 2022). Our approach is training-free, thereby avoiding potential overfitting issues, and is orthogonal to these works. Our empirical results underscore the robustness and transferability of the optimized class descriptions, demonstrating their efficacy across various model backbones. This adaptability suggests promising applications, including integrating these descriptions into fine-tuning processes to further elevate their performance and generalizability, a claim we substantiate in our experimental section.

4. Experiments

4.1. Experimental Setup

Implementation Details. Our method has four main hyperparameters: the number of iterations N, the number of descriptors at initialization n0, the number of descriptors changed in mutation and crossover ni, and the number of mutated candidates in each iteration K. We set N = 10, n0 = 30, ni = 15, and K = 4 for all datasets. For the number of groups in K-means clustering, we use round(|C|/10) so that each group contains roughly 10 classes. Unless specified otherwise, we adopt GPT-4 (OpenAI, 2023) (with temperature fixed at 1.0) to construct our LLM agent and the CLIP ViT-B/32 backbone (Radford et al., 2021) to extract image and text embeddings.

Datasets. Our experiments leverage the dataset partitioning introduced by CoOp (Gao et al., 2021) on 9 different image classification benchmarks: ImageNet (Deng et al., 2009), EuroSAT (Helber et al., 2019), UCF101 (Soomro et al., 2012), Scene UNderstanding (SUN) (Xiao et al., 2016), Caltech (Fei-Fei et al., 2004), Describable Textures Dataset (DTD) (Cimpoi et al., 2014), CIFAR-10 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), and CUB (Wah et al., 2011).

Compared Methods. We compare our work with vanilla CLIP and three state-of-the-art methods that use LLMs to augment class descriptions. CLIP (Radford et al., 2021) uses the simple template "A photo of a {class name}" as the input prompt. DCLIP (Menon & Vondrick, 2023) uses LLM-generated class descriptors with a few in-context examples to improve CLIP classification, building prompts in the format "{classname}, which (is/has/etc) {descriptor}". WaffleCLIP (Roth et al., 2023) further improves DCLIP by introducing high-level concepts at the beginning of the prompt and replacing class descriptors with random characters; the extended prompt is "A photo of a {concept}: a {classname}, which (is/has/etc) {random sequence}". CuPL (Pratt et al., 2023) designs hand-crafted prompts for LLMs to generate diverse descriptive sentences.

As clarified in Section 3.3, our method utilizes label information from training samples to guide the optimization process. We therefore do not compare against the above methods in the conventional zero-shot setting; instead, we directly compare the effectiveness of the generated descriptions via image classification evaluation.

4.2. Main Result

Table 1: Quality comparison of generated descriptions with other LLM-based methods and vanilla CLIP. All values are top-1 accuracy (%) on the test set; bold figures indicate the highest accuracy (consistently achieved by our method) and underlined figures the second-highest among the compared methods. ∆ CLIP denotes the absolute improvement of our method over vanilla CLIP.

Method      ImageNet  EuroSAT  UCF101  SUN    Caltech  DTD    CIFAR-10  Flowers102  CUB    Average
CLIP        61.80     36.83    61.01   61.51  91.24    42.73  84.28     63.59       52.26  61.69
DCLIP       63.00     49.70    61.46   62.51  91.68    43.38  85.23     67.21       53.25  64.16
WaffleCLIP  62.83     49.69    60.88   63.65  89.29    43.97  85.61     66.58       53.47  64.00
CuPL        64.02     48.05    63.26   64.74  91.72    46.04  85.29     65.25       53.21  64.62
Ours        64.53     56.28    67.01   66.22  92.70    51.42  86.33     72.19       56.13  68.09
∆ CLIP      +2.73     +19.45   +6.00   +4.71  +1.46    +8.69  +2.05     +8.60       +3.87  +6.40

As shown in Table 1, the quality of the descriptions generated by our method surpasses that of existing LLM-based methods across a wide range of datasets and is significantly higher than that of the raw category labels used by vanilla CLIP. Notably, our method achieves an average improvement of over 6% compared to CLIP. Our method excels on relatively abstract, visually challenging datasets, such as the texture classification dataset DTD and the satellite image classification dataset EuroSAT, with 8.69% and 19.45% absolute improvements, respectively.

Figure 3: Ablation on iterative optimization (panels: EuroSAT, Flowers102, DTD). X-axis: iteration rounds (0-9); Y-axis: accuracy (%). Each panel plots the fitness score and the test accuracy; red stars represent the accuracy of vanilla CLIP.

Table 2: Ablation on visual feedback.

Method        Caltech  EuroSAT  Flowers102
Ours (iCM)    92.01    52.26    71.34
Ours (CM)     91.98    48.14    70.47
w/o Memory    90.89    50.59    67.97
w/o Feedback  91.11    47.16    68.32
For datasets with finer category granularity, like Flowers102, CUB, UCF101, and SUN, our method also outperforms CLIP by over 3.5%, with a remarkable 8.6% increase on the Flowers102 dataset. The best performances of other LLM-based methods on these datasets are limited. This highlights our method's superiority in finding better descriptions for various datasets, where multi-round iterations and visual feedback are essential for distinguishing closely related and confusing categories. This is not achievable with single-category, single-turn optimization methods like DCLIP, WaffleCLIP, and CuPL. Without the visual knowledge in pre-trained VLMs, these methods generate ambiguous and inaccurate descriptions in a single run, providing no discriminative information, e.g., "black bill" for most bird classes in CUB. On datasets like CIFAR-10, Caltech, and ImageNet, the margins between our approach and other LLM-based baselines are relatively smaller. We attribute this to the fact that these datasets already contain distinct classes, which are easier for LLMs to recognize and understand through class names alone.

4.3. Ablation Study

Iterative Optimization. First, we discuss the impact of iterative optimization. Figure 3 shows that as the number of iterations increases, both the fitness score and the test accuracy exhibit initial growth followed by stable oscillations. After several iterations of optimization, the final results significantly surpass the performance of the CLIP baseline and of single-turn methods at iteration 0. These results indicate that our genetic algorithm-inspired optimization effectively moves the class descriptors toward better solutions, advancing the overall classification accuracy. Furthermore, the trends of the two lines in each graph are closely aligned, showing that there are no "overfitting" issues during optimization. This emphasizes the potential of our final descriptors as a universal textual feature across different model backbones, which is demonstrated below.

Visual Feedback. Next, we study the importance of each component of visual feedback in our method. As illustrated in Table 2, we design four baselines for ablations: 1) Ours with improved confusion matrix (iCM). This is our standard setting, where we adopt both the improved confusion matrix as visual feedback and the memory banks to enhance the descriptor optimization described in Section 3.2. 2) Ours with confusion matrix (CM). This baseline simply replaces the improved confusion matrix with the conventional confusion matrix. 3) Without Memory. We remove the memory bank recording iteration history, but keep the improved confusion matrix. 4) Without Feedback. In this setting, we remove all components related to visual feedback, including both the confusion matrix and the memory banks.

As shown in Table 2, the improved confusion matrix yields the best performance on all three datasets. The standard confusion matrix is second best, highlighting the effectiveness of the memory bank and visual feedback components designed in our work. Notably, removing either the memory bank or the visual feedback components leads to lower accuracy and less stable optimization. For detailed graphs, please refer to Appendix C, where our method shows the most stable improvements toward better descriptions.

Few-shot Setting. Our few-shot ablation results are shown in Appendix A, demonstrating that high-quality class descriptions can be obtained with our method even from a very small proportion of samples.

4.4. Transferability

Transfer to Different Backbones. To verify the generalization and transferability of our method, we tested the optimized descriptions obtained with CLIP ViT-B/32 on different backbones, including RN101, ViT-B/16, and ViT-L/14. Our results, presented in Table 3, demonstrate that our optimized class descriptions consistently surpass the baseline CLIP model in accuracy. A notable observation is the significant performance gain on the Flowers102 dataset, with increases exceeding 9% when transferred to RN101 and ViT-B/16. These results emphasize that our method generates more generalizable natural language prompts, avoiding overfitting to a specific architecture.

Transfer to Fine-tuning Setting. We further demonstrate that our optimized descriptions can be transferred to fine-tuning-based approaches for CLIP. We take CLIP-A-Self (Maniparambil et al., 2023) as an example, which employs a self-attention-based adapter for feature selection and aggregation. As indicated in Table 4, integrating descriptions generated by our method into the CLIP-A-Self framework results in notable performance enhancements across all datasets. Moreover, our method exhibits a clear advantage in low-shot learning scenarios.
Table 3: Transferring optimized descriptors to different model architectures.

                   ImageNet               EuroSAT                Caltech                Flowers102
CLIP Architecture  Ours   CLIP   ∆        Ours   CLIP   ∆        Ours   CLIP   ∆        Ours   CLIP   ∆
ViT-B/32           64.53  61.80  +2.73    56.28  36.83  +19.45   92.70  91.24  +1.46    72.19  63.59  +8.60
RN101              63.36  60.65  +2.71    36.30  32.02  +4.28    91.81  89.49  +2.32    70.85  61.39  +9.46
ViT-B/16           69.51  66.63  +2.88    52.09  42.96  +9.13    94.48  92.54  +1.94    75.48  66.46  +9.02
ViT-L/14           76.11  72.85  +3.26    67.40  52.86  +14.54   96.80  94.04  +2.76    81.73  75.96  +5.77

[Figure 4 panels: Ours (first iteration), Ours (last iteration), and CuPL, each listing descriptions for Residential Building (EuroSAT) and Prince of Wales Feathers (Flowers102); e.g., our final descriptors for Residential Building include "residential estates", "apartment complexes", and "townhouses", whereas CuPL produces overlapping phrases such as "structures where people live" and "building where people live".]

Figure 4: Examples of interpretability. We select two categories from EuroSAT and Flowers102 and list the top-3 and last-3 descriptions for each category, ranked by their similarity scores averaged across the class's test samples.

Table 4: Fine-tuning results using generated descriptions. "Ours" indicates whether our optimized descriptions are integrated into the CLIP-A-Self framework (✓) or not (✗).

Setting   Ours  ImageNet  EuroSAT  UCF   DTD   Flowers102
1 shot    ✗     62.6      46.0     65.6  53.1  72.8
          ✓     64.8      56.9     67.9  56.0  74.2
2 shot    ✗     62.4      55.5     68.6  54.6  75.9
          ✓     64.3      62.7     70.2  57.9  77.1
4 shot    ✗     63.6      58.4     73.2  57.3  84.9
          ✓     65.1      61.5     73.7  57.4  82.6
8 shot    ✗     65.0      63.8     76.9  61.3  90.2
          ✓     65.8      68.7     77.3  61.9  90.1
16 shot   ✗     66.6      74.3     81.1  66.6  94.0
          ✓     67.0      75.4     81.2  67.6  93.9

4.5. Interpretability and Analysis

Figure 4 visualizes the LLM-generated descriptions that contribute the most and the least to classifying images of different classes. Specifically, we select our results at the first and last iterations, as well as results from CuPL (Pratt et al., 2023), for visualization, using the class Residential Building in EuroSAT and Prince of Wales Feathers in Flowers102. To compare the effectiveness of descriptions, we compute a similarity score for each description as the average over all test images in that category. After sorting them in descending order, we visualize the top-3 and last-3 descriptions and their similarity scores in each setting.
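The ranking used in Figure 4 can be reproduced with a few lines on top of precomputed embeddings. A minimal sketch, assuming L2-normalized CLIP features for the test images of one class and for its descriptors (names are illustrative):

import torch

def rank_descriptors(class_image_feats, descriptor_feats, descriptors):
    """Rank a class's descriptors by their mean similarity to the
    class's test images (as in Figure 4).

    class_image_feats: (N, E) normalized image embeddings of one class.
    descriptor_feats:  (D, E) normalized text embeddings of its descriptors.
    descriptors:       list of D descriptor strings.
    """
    mean_sim = (class_image_feats @ descriptor_feats.T).mean(dim=0)   # (D,)
    order = torch.argsort(mean_sim, descending=True)
    return [(descriptors[i], mean_sim[i].item()) for i in order]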

Figure 4 visualizes LLM-generated descriptions that con- 5. Conclusion


tribute the most and least for classifying images of different
classes. Specifically, we select our results at the first and last In this work, we present a novel paradigm for image classifi-
iteration, as well as results from CuPL (Pratt et al., 2023) for cation that leverages LLMs to iteratively refine class descrip-
visualization, with class Residential Building in EuroSAT tors with visual feedback from VLMs to guide the optimiza-
and Prince of Wales Feathers in Flowers102. To compare tion process. We validate the effectiveness of optimized
the effectiveness of descriptions, we compute the similar- descriptions by our method across 9 image classification
ity score for each description using the average of all test datasets, showcasing superior performance with multiple
images in that category. After sorting them in descending benefits including interpretability and transferability.
order, we visualize the top-3 and last-3 descriptions and
their similarity scores in each setting.

6. Broader Impact

The integration of Large Language Models and Vision-Language Models for optimized image classification, as discussed in this paper, presents both ethical and social implications. Ethically, it enhances accuracy and explainability in critical applications like medical diagnostics, while raising concerns about potential biases in training data affecting fairness. Socially, the increased interpretability of models may foster greater trust and wider adoption in various sectors. However, it also poses risks related to privacy and misuse in surveillance. Addressing these challenges necessitates future research focused on ensuring fairness, privacy, and transparency.

References

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35, 2022.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS, 2020.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In CVPR, 2014.

Cui, C., Zhou, Y., Yang, X., Wu, S., Zhang, L., Zou, J., and Yao, H. Holistic analysis of hallucination in GPT-4V(ision): Bias and interference challenges. arXiv preprint, 2023.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVPR Workshop, 2004.

Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., and Qiao, Y. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.

Helber, P., Bischke, B., Dengel, A., and Borth, D. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.

Huang, T., Chu, J., and Wei, F. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022.

Imani, S., Du, L., and Shrivastava, H. MathPrompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023.

Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., and Lim, S.-N. Visual prompt tuning. In ECCV, 2022.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.

Liang, Y., Zhang, R., Zhang, L., and Xie, P. DrugChat: Towards enabling ChatGPT-like capabilities on drug molecule graphs. arXiv preprint arXiv:2309.03907, 2023.

Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.

Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., and Xiao, C. ChatGPT-powered conversational drug editing using retrieval and domain feedback. arXiv preprint arXiv:2305.18090, 2023c.
Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., et al. The Flan Collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.

Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

Maniparambil, M., Vorster, C., Molloy, D., Murphy, N., McGuinness, K., and O'Connor, N. E. Enhancing CLIP with GPT-4: Harnessing visual descriptions as prompts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 262-271, 2023.

Menon, S. and Vondrick, C. Visual classification via description from large language models. In ICLR, 2023.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.

Novack, Z., Garg, S., McAuley, J., and Lipton, Z. CHiLS: Zero-shot image classification with hierarchical label sets. In ICML, 2023.

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Pratt, S., Covert, I., Liu, R., and Farhadi, A. What does a platypus look like? Generating customized prompts for zero-shot image classification. In ICCV, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

Roth, K., Kim, J. M., Koepke, A. S., Vinyals, O., Schmid, C., and Akata, Z. Waffling around for performance: Visual classification with random words and broad concepts. 2023.

Soomro, K., Zamir, A. R., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. 2011.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.

Wu, S., Fei, H., Qu, L., Ji, W., and Chua, T.-S. NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519, 2023.

Xiao, J., Ehinger, K. A., Hays, J., Torralba, A., and Oliva, A. SUN database: Exploring a large collection of scene categories. IJCV, 2016.

Yan, A., Wang, Y., Zhong, Y., Dong, C., He, Z., Lu, Y., Wang, W. Y., Shang, J., and McAuley, J. Learning concise and descriptive attributes for visual recognition. In ICCV, 2023.

Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., and Chen, E. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023.

Yu, W., Gileadi, N., Fu, C., Kirmani, S., Lee, K.-H., Arenas, M. G., Chiang, H.-T. L., Erez, T., Hasenclever, L., Humplik, J., et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023.

Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., and Li, H. Tip-Adapter: Training-free CLIP-adapter for better vision-language modeling. In ECCV, 2022.

Zhao, J., Zhuo, L., Shen, Y., Qu, M., Liu, K., Bronstein, M., Zhu, Z., and Tang, J. GraphText: Graph reasoning in text space. arXiv preprint arXiv:2310.01089, 2023.

Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Conditional prompt learning for vision-language models. In CVPR, 2022.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
10
LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., S., Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. Lan-
Askell, A., et al. Language models are few-shot learners. guage is not all you need: Aligning perception with lan-
In NeurIPS, 2020. guage models. arXiv preprint arXiv:2302.14045, 2023.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, Huang, T., Chu, J., and Wei, F. Unsupervised prompt
G., Roberts, A., Barham, P., Chung, H. W., Sutton, learning for vision-language models. arXiv preprint
C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, arXiv:2204.03649, 2022.
S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer,
N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Imani, S., Du, L., and Shrivastava, H. Mathprompter: Math-
Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, ematical reasoning using large language models. arXiv
G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, preprint arXiv:2303.05398, 2023.
S., Michalewski, H., Garcia, X., Misra, V., Robinson,
Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S.,
K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim,
Hariharan, B., and Lim, S.-N. Visual prompt tuning. In
H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D.,
ECCV, 2022.
Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S.,
Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polo- Krizhevsky, A., Hinton, G., et al. Learning multiple layers
zov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, of features from tiny images. 2009.
M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K.,
Eck, D., Dean, J., Petrov, S., and Fiedel, N. PaLM: Scal- Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: boot-
ing language modeling with pathways. arXiv preprint strapping language-image pre-training with frozen image
arXiv:2204.02311, 2022. encoders and large language models. In ICML, 2023.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Liang, Y., Zhang, R., Zhang, L., and Xie, P. Drugchat: to-
Vedaldi, A. Describing textures in the wild. In CVPR, wards enabling chatgpt-like capabilities on drug molecule
2014. graphs. arXiv preprint arXiv:2309.03907, 2023.

Cui, C., Zhou, Y., Yang, X., Wu, S., Zhang, L., Zou, J., and Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L.
Yao, H. Holistic analysis of hallucination in gpt-4v(ision): Aligning large multi-modal model with robust instruction
Bias and interference challenges, 2023. tuning. arXiv preprint arXiv:2306.14565, 2023a.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
tuning. arXiv preprint arXiv:2304.08485, 2023b.
L. Imagenet: A large-scale hierarchical image database.
In CVPR, 2009. Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H.,
and Xiao, C. Chatgpt-powered conversational drug edit-
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT:
ing using retrieval and domain feedback. arXiv preprint
Pre-training of deep bidirectional transformers for lan-
arXiv:2305.18090, 2023c.
guage understanding. arXiv preprint arXiv:1810.04805,
2018. Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W.,
Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., et al. The
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative flan collection: Designing data and methods for effec-
visual models from few training examples: An incremen- tive instruction tuning. arXiv preprint arXiv:2301.13688,
tal bayesian approach tested on 101 object categories. 2023.
CVPR Workshop, 2004.
Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O.,
Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A.
Y., Li, H., and Qiao, Y. Clip-adapter: Better vision- Eureka: Human-level reward design via coding large lan-
language models with feature adapters. arXiv preprint guage models. arXiv preprint arXiv: Arxiv-2310.12931,
arXiv:2110.04544, 2021. 2023.

Helber, P., Bischke, B., Dengel, A., and Borth, D. Eurosat: Maniparambil, M., Vorster, C., Molloy, D., Murphy, N.,
A novel dataset and deep learning benchmark for land McGuinness, K., and O’Connor, N. E. Enhancing clip
use and land cover classification. IEEE Journal of Se- with gpt-4: Harnessing visual descriptions as prompts. In
lected Topics in Applied Earth Observations and Remote Proceedings of the IEEE/CVF International Conference
Sensing, 2019. on Computer Vision, pp. 262–271, 2023.

11
LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

Menon, S. and Vondrick, C. Visual classification via description from large language models. In ICLR, 2023.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.

Novack, Z., Garg, S., McAuley, J., and Lipton, Z. Chils: Zero-shot image classification with hierarchical label sets. In ICML, 2023.

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Pratt, S., Covert, I., Liu, R., and Farhadi, A. What does a platypus look like? generating customized prompts for zero-shot image classification. In ICCV, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

Roth, K., Kim, J. M., Koepke, A. S., Vinyals, O., Schmid, C., and Akata, Z. Waffling around for performance: Visual classification with random words and broad concepts, 2023.

Soomro, K., Zamir, A. R., and Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. 2011.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.

Wu, S., Fei, H., Qu, L., Ji, W., and Chua, T.-S. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.

Xiao, J., Ehinger, K. A., Hays, J., Torralba, A., and Oliva, A. Sun database: Exploring a large collection of scene categories. IJCV, 2016.

Yan, A., Wang, Y., Zhong, Y., Dong, C., He, Z., Lu, Y., Wang, W. Y., Shang, J., and McAuley, J. Learning concise and descriptive attributes for visual recognition. In ICCV, 2023.

Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., and Chen, E. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023.

Yu, W., Gileadi, N., Fu, C., Kirmani, S., Lee, K.-H., Arenas, M. G., Chiang, H.-T. L., Erez, T., Hasenclever, L., Humplik, J., et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023.

Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., and Li, H. Tip-adapter: Training-free clip-adapter for better vision-language modeling. In ECCV, 2022.

Zhao, J., Zhuo, L., Shen, Y., Qu, M., Liu, K., Bronstein, M., Zhu, Z., and Tang, J. Graphtext: Graph reasoning in text space. arXiv preprint arXiv:2310.01089, 2023.

Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Conditional prompt learning for vision-language models. In CVPR, 2022.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.


A. Ablation on Few-shot Study

Table 5: Ablation on the number of shots used for calculating the confusion matrix.

Setting UCF Caltech Flowers102 CUB


1 shot 64.26 89.68 69.07 54.87
2 shot 64.31 90.81 68.45 55.28
4 shot 65.19 90.68 70.48 55.37
8 shot 66.14 91.04 69.54 55.37
16 shot 65.27 91.33 70.65 55.54

In this section, we present the experimental results of a few-shot study, as shown in Table 5. The term "shot" refers to the
number of labeled samples used per category to compute the visual feedback. Note that our few-shot setting differs from
conventional ones, since we do not require any training. Overall, the quality of the generated descriptions improves as the
number of shots increases, yet even with a very limited number of shots the method achieves satisfactory performance. A
minimal sketch of how such feedback can be computed is given below.
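
To make this concrete, the snippet below is a minimal sketch, not the paper's released code, of how per-class visual feedback could be computed from k labeled images per class: CLIP scores each image against the current class descriptions, and the resulting predictions form a confusion matrix. The data layout (`few_shot_images`, `class_descriptions`) and the description format are assumptions.

```python
import clip
import torch
from sklearn.metrics import confusion_matrix

# Hypothetical sketch of the few-shot feedback computation.
# `few_shot_images`: dict {class_index: [PIL.Image, ...]} with k images per class (assumed layout).
# `class_descriptions`: list of strings such as "cactus wren: striped plumage, short tail, ...".

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def compute_confusion(few_shot_images, class_descriptions):
    # Encode the current class descriptions once.
    text_tokens = clip.tokenize(class_descriptions, truncate=True).to(device)
    text_feats = model.encode_text(text_tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    y_true, y_pred = [], []
    for label, images in few_shot_images.items():
        batch = torch.stack([preprocess(img) for img in images]).to(device)
        img_feats = model.encode_image(batch)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        # Zero-shot prediction: the description with the highest cosine similarity wins.
        pred = (img_feats @ text_feats.T).argmax(dim=-1)
        y_true.extend([label] * len(images))
        y_pred.extend(pred.tolist())

    return confusion_matrix(y_true, y_pred, labels=list(range(len(class_descriptions))))
```

Even with k = 1 such a matrix already gives a usable, if noisy, estimate of which classes CLIP confuses, which is consistent with the small gap between the 1-shot and 16-shot rows in Table 5.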

B. Iterative Optimization Visualization


Figure 5: Iterative Optimization Visualization. From left to right, the sequence is as follows: the 0-th round of our method,
the 4-th round of our method, the 9-th round of our method, and the CuPL (Pratt et al., 2023) method. The darker the color
in the heatmap, the higher the corresponding value in the confusion matrix.

To further demonstrate the benefit of iterative optimization, in this section we present the confusion matrices as heatmaps,
offering a more intuitive visualization of the improvements brought by the optimization process and a comparison with a
previous state-of-the-art method. As shown in Figure 5, the three images on the left are the confusion-matrix heatmaps of
our method at the initial, mid-stage, and final rounds, while the image on the far right is the heatmap of the confusion matrix
obtained with descriptions generated by CuPL (Pratt et al., 2023). For our approach, in the three randomly selected regions,
the color intensity of the off-diagonal cells fades as the iterations progress. This indicates that fewer categories are confused
with the corresponding diagonal class over time, underscoring the effectiveness of iterative optimization. Comparing our
method with CuPL, many rows in the CuPL heatmap are much darker, indicating a severe blurring of class distinctions
around the diagonal. This also suggests that the strategy of previous methods, generating detailed descriptions with an LLM
in a single pass, has limited impact on classification effectiveness. A generic sketch for rendering such heatmaps is given below.
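
For reference, a confusion matrix like those in Figure 5 can be rendered as a heatmap with a few lines of matplotlib. This is a generic plotting sketch rather than the exact script used for the figure; the row normalization is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_heatmap(cm: np.ndarray, title: str, path: str) -> None:
    # Normalize each row so colors reflect the fraction of samples per true class.
    cm_norm = cm / np.clip(cm.sum(axis=1, keepdims=True), 1, None)
    plt.figure(figsize=(6, 6))
    plt.imshow(cm_norm, cmap="Blues")  # darker cells correspond to higher confusion values
    plt.title(title)
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.colorbar()
    plt.tight_layout()
    plt.savefig(path, dpi=200)
    plt.close()
```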

C. Visual Feedback Experiment


In Figure 6, solid lines represent the complete setup, while dashed lines indicate scenarios where components of visual
feedback are omitted. Comparing the two solid lines, the improved confusion matrix demonstrates overall superior
performance to the standard confusion matrix.

[Figure 6: three panels, EuroSAT, Flowers102, and Caltech, each plotting accuracy (%) on the y-axis against iteration (0–9) on the x-axis for the four settings listed in the caption below.]

Figure 6: Ablation on visual feedback. X-axis: iteration rounds, where iteration 0 indicates initialization, Y-axis: Accuracy
(%). “iCM” indicates improved confusion matrix, “CM” indicates confusion matrix, “wo Memory” indicates without
memory banks, and “wo Feedback” indicates without all components related to visual feedback.

The conventional confusion matrix not only contains excessive redundant information, which may hamper the LLM's
understanding, but also discards certain information that is critical for discriminating related classes, since it relies solely
on top-1 accuracy for its construction. Our improved version effectively addresses both issues. Examining the "wo Memory"
dashed line alongside the two solid lines shows that removing the memory bank leads to significant fluctuations in accuracy:
although it occasionally reaches relatively high accuracy, such a model exhibits high variance and lacks robustness. Hence,
the memory bank significantly enhances the stability and robustness of the optimization process. In the setting without
feedback, i.e., with both the confusion matrix and the memory banks removed, accuracy increases only marginally or even
drops, reaffirming the crucial role of visual feedback. A rough sketch of one possible memory-bank implementation is given below.
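
As a rough illustration of the memory-bank component ablated above, the sketch below keeps per-class lists of description variants that raised or lowered class-wise accuracy between iterations, which could then fill the {positive_list} and {negative_list} slots of the mutation prompt in Appendix D. The class structure and update rule are assumptions about one plausible implementation, not the authors' exact bookkeeping.

```python
from collections import defaultdict

class DescriptionMemory:
    """Hypothetical memory bank: records which class descriptions helped or hurt accuracy."""

    def __init__(self):
        self.positive = defaultdict(list)  # class_name -> descriptions that improved class-wise accuracy
        self.negative = defaultdict(list)  # class_name -> descriptions that reduced class-wise accuracy

    def update(self, class_names, descriptions, prev_acc, curr_acc):
        # prev_acc / curr_acc: per-class accuracy before and after the latest mutation step.
        for name, desc, before, after in zip(class_names, descriptions, prev_acc, curr_acc):
            if after > before:
                self.positive[name].append(desc)
            elif after < before:
                self.negative[name].append(desc)

    def as_prompt_fields(self, class_name):
        # These lists would fill the {positive_list} / {negative_list} slots of the mutation prompt.
        return self.positive[class_name], self.negative[class_name]
```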

D. Full Prompts
In this section, we provide the full prompts for our proposed method of iterative optimization with visual feedback. There are
three components in this framework: initialization, mutation, and crossover; each component has a system prompt and a
user prompt. A sketch showing how these templates can be assembled into an actual LLM query follows Listing 6.

Listing 1: Initialization system prompt.


You are a prompt engineer trying to optimize the text description of class labels for image classification. The CLIP
model performs zero-shot image classification by computing the cosine similarities between input images and class
labels. You are given original class labels and some hints of confusion classes that are indistinguishable to CLIP
for each class. Your goal is to generate a list of visual concepts to improve the description of current class labels
to maximize their distinctions for CLIP to better recognize.
The output class names should be a Python list of strings in the following format:
‘‘‘
["original_class_name1: concept1, concept2, concept3, ...", "original_class_name2: concept1, concept2, concept3,
...", ...]
‘‘‘

Some helpful tips for optimizing the class descriptions:


1. You should generate {n_concepts_init} different high-level concept words for each class to emphasize the
distinct visual features of this class but not appear in other classes, split them with ",".

2. DO NOT give me non-visual words!

3. The class description must also be general enough to cover most training images of this class.

4. Most importantly, always focus on the visual features of all classes, since there are only images input. Do
not produce text describing invisible features, e.g., voice, mental character, etc.

5. Do not include any other class names in the description of one class, which means do not use text like "not
like a cat" in dogs or "distinct from birds" in plants. Only focus on the features of the class itself.

6. The concept words for each class should be diverse and not too similar to each other.

Listing 2: Initialization user prompt.


The original class names are:
{current_class}

The confusion classes for CLIP are:


{visual_feedback}
Note, classes in the front are more prone to be confused than the ones in the back.

Write the optimized class names leveraging the hints of confusion classes list above. Directly generate your answer
in the given format without any extra information.

Listing 3: Mutation system prompt.


You are a prompt engineer trying to optimize the text description of class labels for image classification. The CLIP
model performs zero-shot image classification by computing the cosine similarities between input images and class
labels. You are given current class labels, some hints of confusion classes that are indistinguishable to CLIP, and
lists of good and bad history words for each class. Your goal is to optimize the description of current class labels
to maximize their distinctions for CLIP to better recognize images.
The output class names should be a Python list of strings in the following format:
‘‘‘
["original_class_name1: concept1, concept2, concept3, ...", "original_class_name2: concept1, concept2, concept3,
...", ...]
‘‘‘

Some helpful tips for optimizing the class descriptions:


1. Based on all the provided information, you should identify {n_concepts} bad high-level concept words in each
class and replace them with better ones to emphasize the distinct visual feature of this class but not appear in
others', split them with ",".

2. DO NOT give me non-visual words!

3. The class description must also be general enough to cover most training images of this class.

4. Most importantly, always focus on the visual features of all classes, since there are only images input. Do
not produce text describing invisible features, e.g., voice, mental character, etc.

5. Do not include any other class names in the description of one class, which means do not use text like "not
like a cat" in dogs or "distinct from birds" in plants. Only focus on the features of the class itself.

6. The concept words for each class should be diverse and not too similar to each other.

Listing 4: Mutation user prompt.


The current class names are:
{current_class}

The confusion classes for CLIP are:


{visual_feedback}
Note, classes in the front are more prone to be confused than the ones in the back.

History list class descriptions that will improve classification accuracy are:
{positive_list}

History list records for class descriptions that will decrease classification accuracy are:
{negative_list}

Write the optimized class names leveraging the hints of confusion classes and two lists above. Directly generate
your answer in the given format without any extra information.

Listing 5: Crossover system prompt.


You are a prompt engineer trying to optimize the text description of class labels for image classification. The CLIP
model performs zero-shot image classification by computing the cosine similarities between input images and class
labels. You are given {n_samples} versions of class labels with different text descriptions, and classification metrics
for each of them. Your task is to find the best combination of given text description for each class label based
on the classification metrics to achieve the best overall classification performance.
The output class names should be a Python list of strings in the following format:
‘‘‘
["original_class_name1: concept1, concept2, concept3, ...", "original_class_name2: concept1, concept2, concept3,
...", ...]
‘‘‘

Some helpful tips for optimizing the class descriptions:


1. You should only select from the existing concept words in class descriptions. You cannot create new words.

2. For each class, you should combine various concepts in different versions of descriptions, and generate
the description that is most conducive to distinguishing this class.

3. You should leverage the classification metrics for your selection. The overall accuracy indicates the global
performance, while the class-wise accuracy indicates the performance for each class.

4. Keep the number of concepts in each class description unchanged, i.e., {n_concepts_init} concepts for each
class.


Listing 6: Crossover user prompt.


The {n_samples} versions of class descriptions and corresponding classification metrics are:{class_samples}

Output the optimized class descriptions according to the classification metrics above. Directly generate your answer
in the given format without any extra information.
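
To illustrate how these templates might be wired together, the sketch below fills the initialization prompts (Listings 1 and 2) and sends them to a chat LLM through the OpenAI Python client. The specific model name, the abbreviated system string, and the surrounding helper function are assumptions rather than part of the released pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Abbreviated here; the full text is given in Listing 1.
INIT_SYSTEM = "You are a prompt engineer trying to optimize the text description of class labels ..."
# User prompt from Listing 2, with its two template slots.
INIT_USER = (
    "The original class names are:\n{current_class}\n\n"
    "The confusion classes for CLIP are:\n{visual_feedback}\n"
    "Note, classes in the front are more prone to be confused than the ones in the back.\n\n"
    "Write the optimized class names leveraging the hints of confusion classes list above. "
    "Directly generate your answer in the given format without any extra information."
)

def query_initialization(class_names: str, visual_feedback: str, n_concepts_init: int = 10) -> str:
    # Fill the template slots; .replace avoids clashing with other braces in the system prompt.
    system = INIT_SYSTEM.replace("{n_concepts_init}", str(n_concepts_init))
    user = INIT_USER.format(current_class=class_names, visual_feedback=visual_feedback)
    response = client.chat.completions.create(
        model="gpt-4",  # the model choice here is an assumption
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    # The reply is expected to be a Python-style list of "class: concept1, concept2, ..." strings.
    return response.choices[0].message.content
```

The mutation and crossover prompts (Listings 3–6) would be assembled in the same way, with the memory-bank lists and the {n_samples} candidate description sets filling their respective slots.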

E. Result Example
In this section, we showcase some examples generated by our method. We select a subset of class labels from two datasets
for demonstration. For the complete set of labels, please refer to the .txt file in our Supplementary Material.
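
Before the examples, here is a minimal sketch of how such concept lists could be consumed at inference time: each class's concepts are encoded with CLIP and averaged into a single class embedding, which images are then matched against by cosine similarity. The JSON layout matches the listings below, but the prompt template and the averaging strategy are assumptions; encoding the whole comma-joined description as one string is an equally plausible reading of the output format.

```python
import json
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def class_embeddings_from_descriptions(desc_json_path: str):
    # desc_json_path: a JSON file shaped like Listings 7/8, {class_name: [concept, ...]}.
    with open(desc_json_path) as f:
        descriptions = json.load(f)

    class_names, class_feats = [], []
    for name, concepts in descriptions.items():
        prompts = [f"a photo of a {name}, {c}" for c in concepts]  # prompt template is an assumption
        tokens = clip.tokenize(prompts, truncate=True).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        class_feats.append(feats.mean(dim=0))  # one averaged embedding per class
        class_names.append(name)

    class_feats = torch.stack(class_feats)
    return class_names, class_feats / class_feats.norm(dim=-1, keepdim=True)
```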
Listing 7: Examples of generated descriptions for Flowers102.
{
"tiger lily": [
"erect perennial growth",
"pollen-covered anthers",
"intense orange hue",
"lance-shaped leaf structure",
"vibrant orange-red coloring",
"summer garden fixture",
"bold tiger-striped petals",
"straight growth form",
"midsummer peak",
"lanceolate leaf shape"
],
"giant white arum lily": [
"robust flora",
"monochromatic palette",
"tropically adapted",
"arrow-shaped leaves",
"swamp inhabitant",
"stately appearance",
"massive spoon-like spathe",
"lush marsh foliage",
"imposing white florescence",
"prominent central spadix",
"large ovate bracts",
"waxy texture",
"South African native",
"water-adjacent growth"
],
"fire lily": [
"south african endemic",
"scorching colors",
"wild habitat specialist",
"volcanic color palette",
"excessive pollen attractor",
"specific African habitat",
"nectar-abundant flower",
"curved petal silhouettes"
],
"orange dahlia": [
"horticultural pride",
"summer crescendo",
"elegant growth habit",
"fiery orange inflorescence",
"lush verdant foliage",
"decorative cutting flower",
"opposite leaf arrangement",
"compacted petal rows",
"bold spherical blooms",
"mid-summer grandeur",
"fiery",
"ornamental",
"foliage"
],
"pink-yellow dahlia": [
"cut flower favorite",
"standout horticultural beauty",
"large petal-packed inflorescence",
"bedding plant",
"luminous garden feature",
"autumn flowering",
"ornamental garden treasure",
"delicate pink-yellow gradient petals",
"soft petal curvature",


"opulent look",
"gradient",
"autumn",
"opulent"
],
"cautleya spicata": [
"woodland understory preference",
"shade-loving",
"soft yellow tones",
"Himalayan forest dweller",
"orchid resemblance",
"clusters of golden flowers",
"semi-evergreen",
"rhizomes rooting",
"delicate mountainous ginger",
"forest dweller"
],
"japanese anemone": [
"tall asiatic perennial",
"serene woodland flower",
"east-asian native flower",
"stoloniferous growth",
"silky pink petal radiance",
"woodland garden favorite",
"east-asian perennial blossom",
"late-season floral display",
"stately forest edge flower",
"tranquil late bloomer",
"carpel rich center",
"long-flowering autumnal star",
"tall asiatic ornamental beauty",
"delicate purplish-pink blossom array",
"windflower delicate beauty"
],
"black-eyed susan": [
"rugged hairy rudbeckia",
"coneflower-like black-eyed plant",
"central black cone flower",
"pollinator-friendly rudbeckia",
"golden-yellow wildflower",
"distinct black-eyed blossom",
"sunny field susan",
"daisy family rudbeckia",
"showpiece planting",
"fuzzy-stemmed rudbeckia",
"brown-domed center",
"golden petal surround",
"easy-growing rudbeckia species",
"sunny open site favorite",
"bright American wildflower"
]
...
}

Listing 8: Examples of generated descriptions for SUN.


{
"abbey": [
"religious architecture",
"peaceful prayer courtyard",
"historic ecclesiastical complex",
"monastic buildings",
"carved stone reliefs",
"silent contemplative gardens",
"clerical chambers",
"ancient liturgical hall",
"sequestered religious refuge",
"spiritual monastic center",
"high-altitude abbey setting",
"medieval landmark",
"picturesque monastery",
"tranquil cloister enclosure",
"secluded pilgrimage destination",
"ancestral cloister",
"spirituality center",
"parochial complex",
"ecclesiastical estate",
"tranquil abbey gardens",
"monastic cellar vaults",
"placid cloister quarters",


"benedictine architecture",
"clerical stonework",
"Gregorian chant resonance",
"secluded spiritual refuge",
"illuminated manuscript repository",
"divine service chapel",
"clerestory window designs",
"ancestral cloister"
],
"airplane cabin": [
"aeronautical window shape",
"fuselage cross-section",
"overhead bin latch",
"cabin window shade",
"recessed cabin lighting",
"seatback tray table",
"emergency exit handle",
"aircraft cabin",
"cabin crew interphone",
"cabin altitude sign",
"air nozzle",
"avionics panel",
"in-flight magazine pouch",
"fuselage interior design",
"cabin class divider"
],
"airport terminal": [
"terminal retail stores",
"digital flight board",
"entrance to arrivals",
"airport lounge",
"tax-free goods shop",
"automated check-in spot",
"security screening zone",
"airport seating arrangements",
"passenger check-in desks",
"departure lounge seating",
"airport check-in island",
"gate podiums",
"flight information boards"
],
"engine room": [
"turbine vibration control zone",
"mechanical performance monitoring space",
"heavy machinery operational platform",
"maritime technical engineering station",
"powertrain thermal regulation sector",
"central power distribution chamber",
"diesel generator maintenance location",
"kinetic propulsion system area",
"ship engine operation center",
"marine engine service station",
"mechanical room vibration analysis",
"power distribution control centre",
"engine diagnostic and repair shop",
"power regeneration equipment area"
],
"indoor escalator": [
"dynamic stairway",
"motorized belt",
"visible step cleats",
"longitudinal motion path",
"smooth balustrade glass",
"public indoor traversal",
"automated staircase",
"shopping center feature",
"kinetic stair mechanism",
"dynamic railing",
"continuous movement",
"moving handrail",
"glass side panels",
"flat escalator landing"
],
"excavation": [
"archaeological dig layers",
"artefact excavation pits",
"ancient civilization studies",
"soil strata analysis",
"dig site grid system",
"stratigraphy profiles",
"excavation site documentation",


"earth sifting screens",


"fossil dig uncovering",
"sedimentary layer analysis",
"archeological site",
"dig zone",
"soil screening",
"historical excavation",
"artifact discovery",
"stratigraphy",
"archeological survey",
"relic preservation"
],
"fairway": [
"expansive fairway views",
"tree-obstacle layout",
"picturesque water features",
"distinctive golf hole design",
"strategic golf hole locations",
"bunker-guarded green",
"rough grass edges",
"undulating golfer’s terrain",
"serpentine cart paths",
"slope-contoured play lane"
]
...
}
