
Highlights

Text in the Dark: Extremely Low-Light Text Image Enhancement

Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach

arXiv:2404.14135v1 [cs.CV] 22 Apr 2024

• We present a new method to enhance low-light images, especially scene text regions.
• We develop a novel Supervised-DCE model to synthesize extremely low-light images.
• We create three new low-light text datasets: SID-Sony-Text, SID-Fuji-Text, and LOL-Text.
• Our new datasets assess enhanced low-light images with scene text extraction tasks.
• Our method achieves the best results on all datasets, both quantitatively and qualitatively.
Text in the Dark: Extremely Low-Light Text Image Enhancement

Che-Tsung Lin (a,1), Chun Chet Ng (b,1), Zhi Qin Tan (b), Wan Jun Nah (b), Xinyu Wang (d), Jie Long Kew (b), Pohao Hsu (c), Shang Hong Lai (c), Chee Seng Chan (b,*), Christopher Zach (a)

a Chalmers University of Technology, Gothenburg, Sweden
b Universiti Malaya, Kuala Lumpur, Malaysia
c National Tsing Hua University, Hsinchu, Taiwan
d The University of Adelaide, Adelaide, Australia

* Corresponding Author. Email address: cs.chan@um.edu.my (Chee Seng Chan)
1 These authors contributed equally to this work.

Abstract

Text extraction in extremely low-light images is challenging. Although existing low-light image enhancement methods can enhance images as pre-processing before text extraction, they do not focus on scene text. Further research is also hindered by the lack of extremely low-light text datasets. Thus, we propose a novel extremely low-light image enhancement framework with an edge-aware attention module to focus on scene text regions. Our method is trained with text detection and edge reconstruction losses to emphasize low-level scene text features. Additionally, we present a Supervised Deep Curve Estimation model to synthesize extremely low-light images based on the public ICDAR15 (IC15) dataset. We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to benchmark extremely low-light scene text tasks. Extensive experiments prove our model outperforms state-of-the-art methods on all datasets. Code and dataset will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark.

Keywords: Extremely Low-Light Image Enhancement, Edge Attention, Text Aware Augmentation, Scene Text Detection, Scene Text Recognition

1. Introduction

Figure 1: From left to right: (a) Original images; (b) Enhanced results with our proposed method; (c-d) Zoomed-in (2x) regions of the blue and green bounding boxes. Top row: SID-Sony-Text; Middle row: SID-Fuji-Text; Bottom row: LOL-Text. Extremely low-light images in the SID dataset are significantly darker than those in the LOL dataset, and our model enhances the images to the extent that texts are clearly visible with sharp edges.

Scene text understanding involves extracting text information from images through text detection and recognition, which is a fundamental task in computer vision. However, performance drops sharply when images are captured under low-light conditions. The main difficulty in detecting text in low-light images is that low-level features, such as edges and character strokes, are no longer prominent or are hardly visible. Moreover, enhancing images captured in extremely low-light conditions poses a greater challenge than enhancing ordinary low-light images due to the higher noise levels and greater information loss. For instance, we show the difference in darkness level in Figure 1, where it is evident that the See In the Dark (SID) datasets [1] are darker and, in theory, more difficult to enhance than the LOw-Light (LOL) dataset [2]. Quantitatively, we calculated the PSNR and SSIM values for two subsets of SID, SID-Sony and SID-Fuji, and for LOL by comparing each image against pure black images in Table 1. Based on each dataset's average perceptual lightness (L* in the CIELAB color space), images in SID are at least 15 times darker than those in LOL. Hence, low-light image enhancement is a necessary pre-processing step for scene text extraction under such conditions.
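For concreteness, the darkness statistics of Table 1 can be reproduced with standard tools. The sketch below is our own illustration, not the authors' evaluation script; it assumes 8-bit RGB short-exposure images loaded as NumPy arrays, uses scikit-image, and assumes that the Avg. L* column reports CIELAB lightness rescaled from [0, 100] to [0, 1].

# Minimal sketch (not the authors' code): darkness statistics of a short-exposure image.
import numpy as np
from skimage import color
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def darkness_stats(rgb_uint8: np.ndarray):
    """rgb_uint8: HxWx3 uint8 short-exposure image."""
    black = np.zeros_like(rgb_uint8)                  # pure black reference, as in Table 1
    psnr = peak_signal_noise_ratio(black, rgb_uint8, data_range=255)
    ssim = structural_similarity(black, rgb_uint8, channel_axis=-1, data_range=255)
    # CIELAB L* lies in [0, 100]; we assume Table 1 reports it rescaled to [0, 1].
    lab = color.rgb2lab(rgb_uint8 / 255.0)
    avg_l = lab[..., 0].mean() / 100.0
    return psnr, ssim, avg_l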
Over the years, many general or low-light image enhancement models
have been proposed to improve the interpretability and extraction of infor-
mation in images by providing better input for subsequent image content
analysis. Early methods [3, 18, 19] typically attempted to restore the sta-
tistical properties of low-light images to those of long-exposure images from
a mathematical perspective. On the other hand, deep learning-based meth-
ods [2, 1, 23, 25] aim to learn the mapping between low-light images and
their corresponding long-exposure versions via regression. To the best of our
knowledge, most existing low-light image enhancement works have not ex-
plicitly addressed the restored image quality in terms of downstream scene
text tasks.
Recent advancements in visual attention mechanisms have demonstrated
their effectiveness in identifying and boosting salient features in images.

Dataset        PSNR ↑   SSIM ↑   Avg. L* ↓
SID-Sony [1]   44.350   0.907    0.009
SID-Fuji [1]   41.987   0.820    0.004
LOL [2]        23.892   0.195    0.142
Pure Black     ∞        1.000    0.000

Table 1: The difference between the extremely low-light dataset, SID, and the ordinary low-light dataset, LOL, is shown in terms of PSNR and SSIM values, computed by comparing short-exposure images against pure black images. Avg. L* is the average perceptual lightness in the CIELAB color space, calculated based on short-exposure images. Scores are averaged across training and test sets. Higher PSNR and SSIM values, along with lower Avg. L*, indicate darker images that are more challenging for image enhancement and scene text extraction.

Channel-only attention [11, 12, 13], spatial attention [14, 15] or the subse-
quent channel-spatial attention [16, 17] modules were proposed to emphasize
the most informative areas. However, these methods cannot preserve texture
details, especially fine-grained edge information that is intuitively needed to
enhance extremely low-light images with complex textures. To overcome this
limitation, we introduce Edge-Aware Attention (Edge-Att). This novel at-
tention module simultaneously performs channel and spatial attention-based
feature learning on high-level image and edge features. Our model also con-
siders text information in the image through a text-aware loss function. This
way, our model can effectively enhance low-light images while preserving
fine-grained edge information, texture details, and legibility of text.
The scarcity of extremely low-light text datasets presents a hurdle for
further research. To address this, we annotated all text instances in both
the training and testing sets of the SID and LOL datasets, creating three
new low-light text datasets: SID-Sony-Text, SID-Fuji-Text, and LOL-Text.
We then proposed a novel Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light scene text images based on the commonly used ICDAR15 (IC15) scene text dataset. It allows researchers
to easily translate naive scene text datasets into extremely low-light text
datasets. In addition to the previously published conference version of this
work [45], we have made four significant extensions. Firstly, we propose
a novel dual encoder-decoder framework that can achieve superior perfor-
mance on low-light scene text tasks (Section 3.1). Secondly, we introduce a
new image synthesis method capable of generating more realistic extremely
low-light text images (Section 4.1). Thirdly, we have further annotated texts
in the Fuji and LOL datasets, thereby forming the largest low-light scene
text datasets to date (Section 5). Fourthly, comprehensive experiments and
analyses are carried out to study the latest methods along with our pro-
posed methods on all synthetic and real low-light text datasets. The main
contributions of our work are as follows:

• We present a novel scene text-aware extremely low-light image enhancement framework with dual encoders and decoders to enhance extremely low-light images, especially the scene text regions within them. Our proposed method is equipped with Edge-Aware Attention modules and trained with a new Text-Aware Copy-Paste (Text-CP) augmentation. Our model can restore images in challenging lighting conditions without losing low-level features.

• We developed a Supervised-DCE model to synthesize extremely low-light images. This allows us to use existing publicly available scene text datasets such as IC15, alongside genuine ones, to train our model for scene text research under such extreme lighting conditions.

• We labeled the texts in the SID-Sony, SID-Fuji, and LOL datasets and named them SID-Sony-Text, SID-Fuji-Text, and LOL-Text, respectively. This provides a new perspective for objectively assessing enhanced extremely low-light images through scene text tasks.

2. Related Works

Low-light Image Enhancement. Retinex theory assumes that an image can be decomposed into illumination and reflectance. Most Retinex-based
methods enhance results by removing the illumination part [3], while oth-
ers such as LIME [18] keep a portion of the illumination to preserve natu-
ralness. BIMEF [19] further designs a dual-exposure fusion framework for
accurate contrast and lightness enhancement. RetinexNet [2] combines deep
learning and Retinex theory, adjusting illumination for enhancement after
image decomposition. The recent successes of generative adversarial net-
works (GANs) [20] have attracted attention from low-light image enhance-
ment because GANs have proven successful in image translation. Pix2pix [21]
and CycleGAN [22] have shown good image-translation results in paired
and unpaired image settings, respectively. To overcome the complexity of
CycleGAN, EnlightenGAN [23] proposed an unsupervised one-path GAN
structure. Besides general image translation, [1] proposed learning-based
low-light image enhancement on raw sensor data to replace much of the tra-
ditional image processing pipeline, which tends to perform poorly on such
data. EEMEFN [24] also attempted to enhance images using multi-exposure
raw data that is not always available.
Zero-Reference Deep Curve Estimation (Zero-DCE) [25] designed a lightweight CNN to estimate pixel-wise high-order curves for dynamic range adjustment of a given image without needing paired images. [26] designed
a novel Self-Calibrated Illumination (SCI) learning with an unsupervised
training loss to constrain the output at each stage under the effects of a
self-calibrated module. ChebyLighter [27] learns to estimate an optimal
pixel-wise adjustment curve under the paired setting. Recently, the Trans-
former [28] architecture has become the de-facto standard for Natural Lan-
guage Processing (NLP) tasks. ViT [29] applied the attention mechanism
in the vision task by splitting the image into tokens before sending it into
Transformer. Illumination Adaptive Transformer (IAT) [30] uses attention
queries to represent and adjust ISP-related parameters. Most existing models
enhance images in the spatial domain. Fourier-based Exposure Correction
Network (FECNet) [31] presents a new perspective for exposure correction
with spatial-frequency interaction and has shown that their model can be
extended to low-light image enhancement.
Scene Text Extraction. Deep neural networks have been widely used for
scene text detection. CRAFT [32] predicts two heatmaps: the character re-
gion score map and the affinity score map. The region score map localizes
individual characters in the image, while the affinity score map groups each
character into a single word instance. Another notable scene text detection
method is Pixel Aggregation Network (PAN) [33] which is trained to pre-
dict text regions, kernels, and similarity vectors. Both text segmentation
models have proven to work well on commonly used scene text datasets such
as IC15 [34] and TotalText [35]. Inspired by them, we introduced a text
detection loss in our proposed model to focus on scene text regions during extremely low-light image enhancement. Furthermore, state-of-the-art text
recognition methods such as ASTER [36] and TRBA [37] are known to per-
form well on images captured in complex scenarios. ASTER [36] employs
a flexible rectification module to straighten the word images before passing
them to a sequence-to-sequence model with the bi-directional decoder. The
experimental results of ASTER showed that the rectification module could
achieve superior performance on multiple scene text recognition datasets,
including the likes of IC15 and many more. Besides, TRBA [37] provided
interesting insights by breaking down the scene text recognition framework
into four main stages: spatial transformation, character feature extraction,
followed by sequence modeling, and the prediction of character sequences.
Given these methods’ robustness on difficult texts, they are well-suited to
recognize texts from enhanced low-light images.

3. Extremely Low-Light Text Image Enhancement

3.1. Problem Formulations

Let x ∈ R^{W×H×3} be a short-exposure image of width W and height H. An ideal image enhancement expects that a neural network LE(x; θ), parameterized by θ, can restore this image to its corresponding long-exposure image, y ∈ R^{W×H×3}, i.e., LE(x; θ) ≃ y. However, previous works normally
pursued the lowest per-pixel intensity difference, which should not be the
goal for image enhancement because we usually expect that some high-level
computer vision tasks can work reasonably well on those enhanced images.
For example, in terms of text detection, the goal of the neural network can be
the lowest discrepancy between detected bounding boxes, i.e., B(LE(x; θ)) ≃ B(y).

Our novel image enhancement model consists of a U-Net accommodating
extremely low-light images and edge maps using two independent encoders.
During model training, instead of channel attention, the encoded edges guide
the spatial attention sub-module in the proposed Edge-Att to attend to edge
pixels related to text representations. Besides the image enhancement losses,
our model incorporates text detection and edge reconstruction losses into the
training process. This integration effectively guides the model’s attention to-
wards text-related features and regions, facilitating improved image textual
content analysis. As a pre-processing step, we introduced a novel augmen-
tation technique called Text-CP to increase the presence of non-overlapping
and unique text instances in training images, thereby promoting comprehen-
sive learning of text representations.

3.2. Network Design

Our model was inspired by U-Net [1] with some refinements. Firstly, the
network expects heterogeneous inputs, i.e., extremely low-light images, x,
and the corresponding RCF [38] edge maps, e. Secondly, input-edge pairs
are handled by two separate encoders with edge-aware attention modules
between them. The attended features are then bridged with the decoder
through skip connections. Finally, our multi-tasking network predicts the
enhanced image, x′ , and the corresponding reconstructed edge, e′ . The over-
all architecture of our network can be seen in Figure 2 and modeled as:

x′ , e′ = LE(x, e; θ). (1)

Figure 2: Illustration of the architecture of our proposed framework, designed to enhance
extremely low-light images while incorporating scene text awareness.
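As a rough orientation, a minimal PyTorch skeleton of the dual encoder-decoder interface in Eq. (1) is sketched below. The block layout, channel widths, and the EdgeGate class (a simplified placeholder for the Edge-Att module of Section 3.4) are our own assumptions for illustration, not the released architecture.

# Minimal sketch of the dual encoder-decoder interface of Eq. (1); not the released code.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class EdgeGate(nn.Module):
    # Simplified stand-in for the Edge-Att module (see the sketch after Eq. (12)).
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, 1, 1)
    def forward(self, f, g):
        return f * torch.sigmoid(self.gate(g))   # edge features gate the image features

class DualEncoderEnhancer(nn.Module):
    def __init__(self, base=32):
        super().__init__()
        self.img_enc1, self.img_enc2 = conv_block(3, base), conv_block(base, 2 * base)
        self.edge_enc1, self.edge_enc2 = conv_block(1, base), conv_block(base, 2 * base)
        self.pool = nn.MaxPool2d(2)
        self.att1, self.att2 = EdgeGate(base), EdgeGate(2 * base)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.img_dec = conv_block(3 * base, base)    # skip (base) + upsampled (2*base)
        self.edge_dec = conv_block(3 * base, base)
        self.to_img, self.to_edge = nn.Conv2d(base, 3, 1), nn.Conv2d(base, 1, 1)

    def forward(self, x, e):
        f1, g1 = self.img_enc1(x), self.edge_enc1(e)
        f2, g2 = self.img_enc2(self.pool(f1)), self.edge_enc2(self.pool(g1))
        a1, a2 = self.att1(f1, g1), self.att2(f2, g2)          # attended skip features
        x_feat = self.img_dec(torch.cat([a1, self.up(a2)], 1))
        e_feat = self.edge_dec(torch.cat([g1, self.up(a2)], 1))
        return torch.sigmoid(self.to_img(x_feat)), torch.sigmoid(self.to_edge(e_feat))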

3.3. Objectives

Our proposed model is trained to optimize four loss functions. The first
two, Smooth L1 loss and multi-scale SSIM loss focus on enhancing the overall
image quality. The third, text detection loss, targets the enhancement of
scene text regions specifically. The fourth, edge reconstruction loss, focuses
on crucial low-level edge features.
Firstly, we employ smooth L1 loss as the reconstruction loss to better
enforce low-frequency correctness [21] between x′ and y as:

 0.5 · (x′ − y)2 /δ, if |x′ − y| < δ
Lrecons = (2)
 |x′ − y| − 0.5 · δ, otherwise
where we empirically found that δ = 1 achieved good results. The authors of Pix2Pix [21] showed that by utilizing L1 loss, the model can achieve better results, as the generated images are less blurry, and proved that L1 loss can better enforce the learning of low-frequency details, which is also essential for OCR tasks. On the other hand, the L1 norm is less sensitive to outliers than the L2 norm, thus resulting in a more robust model towards extreme pixel intensities.

Figure 3: (a) Visual representation of our edge decoder, wherein A and B represent the output from the corresponding convolution blocks in Figure 2 and S denotes the scaling of the image. (b) Illustration of the proposed Edge-Aware Attention module.
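As a concrete reference, with δ = 1 Eq. (2) coincides with the standard smooth L1 (Huber) formulation. The short PyTorch sketch below is our illustration of this term, not the released training code.

# Sketch of the reconstruction loss of Eq. (2); with delta=1 it matches F.smooth_l1_loss.
import torch
import torch.nn.functional as F

def reconstruction_loss(x_hat, y, delta=1.0):
    diff = torch.abs(x_hat - y)
    loss = torch.where(diff < delta, 0.5 * diff ** 2 / delta, diff - 0.5 * delta)
    return loss.mean()

# equivalently: F.smooth_l1_loss(x_hat, y, beta=1.0)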
Secondly, the multi-scale SSIM metric was proposed in [39] for reference-
based image quality assessment, focusing on image structure consistency. An
M -scale SSIM between the enhanced image x′ and ground truth image y is:
$$\mathrm{SSIM}_{MS}(x', y) = [l_M(x', y)]^{\tau} \cdot \prod_{j=1}^{M} [c_j(x', y)]^{\phi}\,[s_j(x', y)]^{\psi}, \quad (3)$$

where lM is the luminance at the M-scale; cj and sj represent the contrast and the structure similarity measures at the j-th scale; τ, ϕ, and ψ are parameters to adjust the importance of the three components. Inspired by [39], we adopted
the M -scale SSIM loss function in our work to enforce the image structure
of x′ to be close to that of y:

$$L_{SSIM_{MS}} = 1 - \mathrm{SSIM}_{MS}(x', y). \quad (4)$$

Thirdly, a well-enhanced extremely low-light image implies that we could obtain similar text detection results on both the enhanced and ground truth
images. As such, we propose to employ CRAFT [32] to localize texts in
images through its region score heatmap. To implicitly enforce our model to
focus on scene text regions, we define the text detection loss, Ltext as:

Ltext = ∥R(x′ ) − R(y)∥1 , (5)

where R(x′ ) and R(y) denote the region score heatmaps of the enhanced and
ground truth images, respectively.
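A hedged sketch of Eq. (5) is given below; region_score_fn is a hypothetical wrapper around a frozen CRAFT model that returns its region score heatmap for a batch of images (the exact output format depends on the CRAFT implementation used).

# Sketch of the text detection loss of Eq. (5); region_score_fn is an assumed wrapper.
import torch

def text_detection_loss(x_hat, y, region_score_fn):
    # L1 distance between CRAFT region score heatmaps of the enhanced and GT images.
    with torch.no_grad():
        region_gt = region_score_fn(y)          # CRAFT weights stay frozen
    region_pred = region_score_fn(x_hat)        # gradients reach the enhancer via x_hat
    return torch.mean(torch.abs(region_pred - region_gt))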
Fourthly, the edge reconstruction decoder in our model is designed to
extract edges better, which are essential for text pixels. Figure 3(a) shows
an overview of the edge decoder. The loss at pixel i of detected edge, ei , with
respect to the ground truth edge, gi is defined as:

$$l(e_i) = \begin{cases} \alpha \cdot \log(1 - P(e_i)), & \text{if } g_i = 0 \\ \beta \cdot \log P(e_i), & \text{if } g_i = 1 \end{cases} \quad (6)$$
where
$$\alpha = \lambda \cdot \frac{|Y^+|}{|Y^+| + |Y^-|}, \qquad \beta = \frac{|Y^-|}{|Y^+| + |Y^-|}, \quad (7)$$
Y + and Y − denote the positive and negative sample sets, respectively. λ is set
to 1.1 to balance both types of samples. The ground truth edge is generated using a Canny edge detector [40], and P(e_i) is the sigmoid function. Then,
the overall edge reconstruction loss can be formulated as:
$$L_{edge} = \sum_{i=1}^{|I|} \left( \sum_{j=1}^{J} l(e_i^j) + l(e_i') \right), \quad (8)$$

where l(e_i^j) is the loss for the predicted edge at pixel i and side-output level j; J = 3 is the number of side edge outputs in our model; e_i' is the final predicted edge map obtained from the concatenation of the side outputs; and |I| is the number of pixels in a cropped image during training.
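For illustration, the class-balanced edge loss of Eqs. (6)-(8) can be written compactly as below. This is our paraphrase, with λ = 1.1 as in the text; note that we add the usual negative sign so that the objective is minimized, which Eq. (6) leaves implicit.

# Sketch of the edge reconstruction loss of Eqs. (6)-(8); gt is a {0,1} Canny edge map.
import torch

def edge_loss(pred_logits, gt, lam=1.1, eps=1e-6):
    p = torch.sigmoid(pred_logits)
    n_pos, n_neg = gt.sum(), (1 - gt).sum()
    alpha = lam * n_pos / (n_pos + n_neg)      # weight on negative (non-edge) pixels
    beta = n_neg / (n_pos + n_neg)             # weight on positive (edge) pixels
    loss = -(alpha * (1 - gt) * torch.log(1 - p + eps)
             + beta * gt * torch.log(p + eps))
    return loss.sum()

def total_edge_loss(side_logits, fused_logits, gt):
    # Eq. (8): sum over the J side outputs plus the final (fused) prediction.
    return sum(edge_loss(s, gt) for s in side_logits) + edge_loss(fused_logits, gt)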
Finally, the total joint loss function, Ltotal en of our proposed model is:

$$L_{total\_en} = \omega_{recons} L_{recons} + \omega_{text} L_{text} + \omega_{SSIM_{MS}} L_{SSIM_{MS}} + \omega_{edge} L_{edge}, \quad (9)$$

where ωrecons , ωtext , ωSSIMM S , and ωedge are the weights to address the im-
portance of each loss term during training.

3.4. Edge-Aware Attention

Polarized Self-Attention (PSA) [41] is one of the first works to propose an attention mechanism catered to high-quality pixel-wise regression tasks.
However, we found that the original PSA module that only considers a single
source of feature map for both channel and spatial attention is ineffective for
extremely low-light image enhancement. Under low light conditions, the
details of content such as the edges of the texts are barely discernible which
is less effective in guiding the network to attend to spatial details. Therefore,
we designed our Edge-Aware Attention (Edge-Att) module to take in feature
maps from two encoders and process them differently, i.e., the feature maps
of extremely low-light images from the image encoder are attended channel-
wise, whereas the spatial attention submodule attends to feature maps from the edge encoder. By doing so, we can ensure that Edge-Att can attend
to rich images and edge features simultaneously. The proposed attention
module is illustrated in Figure 3(b).
Firstly, the feature map from the image encoder, F is fed into the channel
attention, Ach (F ) ∈ RC×1×1 with calculation as follows:

Ach (F ) = σ3 [FSG (Wz (σ1 (Wv (F )) × FSM (σ2 (Wq (F )))))] , (10)

where Wq , Wv , and Wz are 1x1 convolution layers, σ1 , σ2 and σ3 are tensor


reshape operators. FSM (.) and FSG (.) refer to softmax and sigmoid operators.
The output of this branch is Ach(F) ⊙ch F ∈ R^{C×H×W}, where ⊙ch is a channel-wise multiplication operator.
Secondly, given the edge-branch feature map E, the edge-aware spatial
attention, Asp (E) ∈ R1×H×W , is defined as:

Asp (E) = FSG [σ3 (FSM (σ1 (FGP (Wq (E)))) × σ2 (Wv (E)))] , (11)

where Wq and Wv are 1x1 convolution layers, σ1 , σ2 , and σ3 are three tensor
reshape operators. FSM(·) is a softmax operator, FGP(·) is a global pooling operator, and FSG(·) is a sigmoid operator. The output of this branch is Asp(E) ⊙sp F ∈ R^{C×H×W}, where ⊙sp is a spatial-wise multiplication operator, and F is the image enhancement branch's feature map. Finally, the output of the proposed Edge-Att module is the composition of the two submodules:

Edge-Att(F, E) = Ach (F ) ⊙ch F + Asp (E) ⊙sp F. (12)
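For illustration, a PyTorch sketch of Eqs. (10)-(12) following the polarized self-attention layout of [41] is given below. The internal channel reduction (C/2) and layer names are our assumptions rather than the released implementation.

# Sketch of the Edge-Aware Attention module of Eqs. (10)-(12); an illustrative layout.
import torch
import torch.nn as nn

class EdgeAwareAttention(nn.Module):
    # Channel attention on image features F (Eq. 10), spatial attention on edge
    # features E (Eq. 11); both attention maps gate F (Eq. 12).
    def __init__(self, channels):
        super().__init__()
        c = max(channels // 2, 1)
        self.ch_q, self.ch_v = nn.Conv2d(channels, 1, 1), nn.Conv2d(channels, c, 1)
        self.ch_z = nn.Conv2d(c, channels, 1)
        self.sp_q, self.sp_v = nn.Conv2d(channels, c, 1), nn.Conv2d(channels, c, 1)

    def forward(self, f, e):
        b, ch, h, w = f.shape
        # channel attention over F
        q = torch.softmax(self.ch_q(f).view(b, 1, h * w), dim=-1)       # (B,1,HW)
        v = self.ch_v(f).view(b, -1, h * w)                              # (B,C/2,HW)
        a_ch = torch.sigmoid(self.ch_z((v @ q.transpose(1, 2)).view(b, -1, 1, 1)))
        # edge-aware spatial attention over E
        q_e = torch.softmax(self.sp_q(e).mean(dim=(2, 3)), dim=1)        # global pooling + softmax
        v_e = self.sp_v(e).view(b, -1, h * w)                            # (B,C/2,HW)
        a_sp = torch.sigmoid((q_e.unsqueeze(1) @ v_e).view(b, 1, h, w))
        return a_ch * f + a_sp * f                                       # Eq. (12)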

3.5. Text-Aware Copy-Paste Augmentation

This work aims to enhance extremely low-light images to improve text detection and recognition. However, the dataset's limited number of text instances could hinder the model's ability. Although Copy-Paste Augmentation [42] can increase the number of text instances, overlapping texts introduced by random placement might confuse CRAFT in the text detection loss, since CRAFT is not trained to detect such texts. In commonly used scene text datasets such as ICDAR15 [34], overlapping texts are marked as "do not care" regions, which are excluded from models' training and evaluation. Thus, to adhere to ICDAR's standard and to address overlapping text issues, we propose a novel approach called Text-Aware Copy-Paste Augmentation (Text-CP). Text-CP considers each text box's location and size by leveraging uniform and Gaussian distributions derived from the dataset.

Figure 4: Illustration of the Text-Aware Copy-Paste (Text-CP) data augmentation (steps: i. sample text instances; ii. dataset-aware sampling; iii. resize text; iv. paste text). Compared with the original Copy-Paste, our method generates images with non-overlapping text instances that allow the detection of texts outside their usual context.

For a training image t of width wt and height ht to be augmented, we initialize the set of labeled text boxes in the training set as C:

$$C = \{(u_1, v_1, w_1, h_1), \ldots, (u_{|C|}, v_{|C|}, w_{|C|}, h_{|C|})\}, \quad (13)$$

where each tuple represents the top left position of a text located at uk and vk
with width, wk , and height, hk with k representing the index of the current
text’s box in the set. We then sample a target number of text instances,
ntarget , from the set of C to form Ct , defined as the set of text boxes to be
pasted on that training image, t. The next step is to crop and paste the
sampled texts without overlapping. For each ck ∈ Ct , we adopt two uniform
distributions in modeling the position of the texts:

$$\hat{u}_k \sim U(0, w_t), \qquad \hat{v}_k \sim U(0, h_t). \quad (14)$$
As for the widths and heights, they are sampled from Gaussian distributions as:

$$\hat{w}_k \sim \mathcal{N}(\mu_W, \sigma_W^2), \qquad \hat{h}_k \sim \mathcal{N}(\mu_H, \sigma_H^2), \quad (15)$$

where µ and σ² are the estimated means and variances of width W and height
H from all the labeled texts in the training set. We illustrate the overall
data augmentation process of Text-CP and its augmented results in Figure
4. The pseudocode of Text-CP is detailed in the supplementary material.

4. Extremely Low-Light Image Synthesis

4.1. Problem Formulations


To the best of our knowledge, the research community has not exten-
sively explored extremely low-light image synthesis, mainly due to the lim-
ited availability of datasets designed explicitly for extremely low-light scene text. While the extremely low-light dataset, SID, and the low-light dataset, LOL,
exist, they are not primarily collected with scene text in mind. This scarcity
of dedicated datasets for extremely low-light scene text poses challenges for
evaluating the performance of existing image enhancement methods in terms
of image quality and scene text metrics. In order to address this issue, we
define the extremely low-light image synthesis problem as follows:

x̂ = LS(y; θs ), (16)

where given a long-exposure image y, a low-light image synthesis neural network, LS(y; θs), parameterized by θs, will synthesize a set of images x̂,
such that B(LS(y; θs )) ≃ B(x). We want the synthesized extremely low-
light images to be as realistic as possible to genuine low-light images, x.
Therefore, we introduce a Supervised-DCE model focusing on synthe-
sizing a set of realistic extremely low-light images, enabling existing image
enhancement techniques to leverage publicly available scene text datasets.
Consequently, existing low-light image enhancement methods can benefit
from training with synthetic data to the extent that they can perform better
on the downstream scene text detection task, as detailed in Section 6.5.

4.2. Network Design

Zero-DCE [25] was originally proposed to perform image enhancement through curve estimation. However, its network can only adjust brightness
slightly since the per-pixel trainable curve parameter, α, in the quadratic
curve limits the pixel variation. The advantage of performing intensity ad-
justment in terms of the quadratic curve is that the pixel range can be bet-
ter constrained. In this work, we propose a Supervised-DCE model that learns to provide reconstructable extremely low-light images with paired short- and long-exposure images. The framework of our image synthesis network, Supervised-DCE, can be seen in Figure 5. Our goal is to push most pixel values closer to zero in the context of synthesizing extremely low-light images.

Figure 5: Illustration of the proposed Supervised-DCE model for extremely low-light image synthesis.
Accordingly, we propose a reformulation of the DCE model as follows:

x̂ = −(H(y) + U (y))y 2 + (1 + H(y))y, (17)

where y is the input (i.e., long-exposure image); x̂ is the synthesized low-light image; H(y) and U(y) are the outputs of the Tanh and ReLU branches,
respectively. By introducing the second U (y) branch, we eliminate the need
for iteratively applying the model to produce the desired output, and drastic
intensity adjustment can be done with only a single iteration.
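The reformulated curve in Eq. (17) is a single quadratic adjustment applied in one pass. A minimal sketch of the two-branch output head is shown below; the per-pixel feature backbone is left abstract, and the layer names and clamping are our assumptions for illustration.

# Sketch of the Supervised-DCE curve of Eq. (17); backbone and head widths are assumed.
import torch
import torch.nn as nn

class SupervisedDCEHead(nn.Module):
    # x_hat = -(H(y) + U(y)) * y^2 + (1 + H(y)) * y, applied in a single iteration.
    def __init__(self, backbone: nn.Module, feat_channels: int):
        super().__init__()
        self.backbone = backbone                                      # per-pixel feature extractor
        self.h_branch = nn.Conv2d(feat_channels, 3, 3, padding=1)     # Tanh branch H(y)
        self.u_branch = nn.Conv2d(feat_channels, 3, 3, padding=1)     # ReLU branch U(y)

    def forward(self, y):
        feat = self.backbone(y)
        h = torch.tanh(self.h_branch(feat))
        u = torch.relu(self.u_branch(feat))
        x_hat = -(h + u) * y ** 2 + (1 + h) * y                       # Eq. (17)
        return x_hat.clamp(0, 1), h, u                                # h, u also feed the TV losses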
In the original Zero-DCE model, image enhancement is learned by setting
the exposure value to 0.6 in the exposure control loss. However, manually
setting an exposure value to synthesize extremely low-light images is too
heuristic and inefficient. In our proposed synthesis framework, the overall learning is done by training SID long-exposure images to be similar to their
short-exposure counterparts. Most importantly, these images are translated in the hope that their text information is deteriorated in the same way as in genuine extremely low-light images and cannot be easily reconstructed.
Then, the trained model can be used to transform scene text datasets in
the public domain to boost the performance of extremely low-light image
enhancement in terms of text detection.

4.3. Objectives

During extremely low-light image synthesis, we expect the output to maintain spatial consistency while reducing the overall proximity loss:

Lprox = ∥x̂ − x∥1 + Lentropy (x̂, x) + Lsmoothness (x̂, x), (18)

where x̂ is the synthesized extremely low-light image given the long-exposure image y, and x is the genuine low-light image, i.e., the ground truth for x̂.
Entropy loss, Lentropy , and smoothness loss, Lsmoothness [43], are also used to
encourage the differences to be both sparse and local. With the introduction
of Lprox , we removed the color constancy loss of the original Zero-DCE model
since color constancy can be enforced through the supervised loss.
The spatial consistency loss, Lspa encourages spatial coherence of the
synthesized image by preserving the difference of neighboring regions between
the input image and its synthesized low-light version:
$$L_{spa} = \frac{1}{M} \sum_{i=1}^{M} \sum_{j \in \omega(i)} \left( |\hat{X}_i - \hat{X}_j| - \alpha_s \log_{10}\!\left(9\,|Y_i - Y_j| + 1\right) \right)^2, \quad (19)$$

where M is the number of local regions, and ω(i) is four neighboring regions
(top, down, left, right) centered at region i. X̂ and Y are the averaged intensity values of local regions of the synthesized images and the long-exposure
images, respectively. We introduced logarithm operation and αs parameter
to reduce the large spatial difference of Y where αs is set to 0.05. We set the
local region size to 4 × 4, following the original setting of Zero-DCE.
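A sketch of the spatial consistency loss of Eq. (19) is given below, assuming images in [0, 1], 4 × 4 average-pooled local regions, and αs = 0.05. The roll-based neighbor differencing wraps around at the image border, a simplification we accept here for brevity.

# Sketch of the spatial consistency loss of Eq. (19); illustrative, not the released code.
import torch
import torch.nn.functional as F

def spatial_consistency_loss(x_hat, y, region=4, alpha_s=0.05):
    xm = F.avg_pool2d(x_hat.mean(1, keepdim=True), region)   # averaged intensity of x_hat
    ym = F.avg_pool2d(y.mean(1, keepdim=True), region)       # averaged intensity of y
    loss = 0.0
    for dh, dw in [(-1, 0), (1, 0), (0, -1), (0, 1)]:         # top, down, left, right neighbors
        dx = torch.abs(xm - torch.roll(xm, shifts=(dh, dw), dims=(2, 3)))
        dy = alpha_s * torch.log10(
            9.0 * torch.abs(ym - torch.roll(ym, shifts=(dh, dw), dims=(2, 3))) + 1.0)
        loss = loss + ((dx - dy) ** 2).mean()
    return loss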
Besides spatial consistency, we also expect the monotonicity relation be-
tween neighboring pixels to be preserved. To achieve this, we reused the
illumination smoothness loss:
$$L_{tv_Z} = \sum_{\forall c \in \xi} \left( |\nabla_x Z^c| + |\nabla_y Z^c| \right)^2, \quad \xi = \{R, G, B\}, \quad (20)$$

where ∇x and ∇y are gradient operations on the x-axis and y-axis, respec-
tively. Illumination smoothness loss, LtvZ , is applied on both H(y) and U (y),
i.e., the curve parameter maps of the two branches, respectively, by substi-
tuting Z with H and U , resulting in LtvH and LtvU .
In summary, the overall learning objective, Ltotal syn to train our ex-
tremely low-light image synthesis network is defined as:

Ltotal syn = ωprox Lprox + ωspa Lspa + ωtvH LtvH + ωtvU LtvU . (21)

5. New Low-Light Text Datasets

                Training Set                                                     Testing Set
Dataset         GT Img.   Leg.   Illeg.   µW        µH       σW        σH        GT Img.   Leg.   Illeg.
SID-Sony-Text   161       5937   2128     79.270    34.122   123.635   50.920    50        611    359
SID-Fuji-Text   135       6213   4534     128.579   57.787   183.199   68.466    41        1018   1083
LOL-Text        485       613    1423     23.017    14.011   21.105    17.542    15        28     45
IC15            1000      4468   7418     78.410    29.991   55.947    24.183    500       2077   3153

Table 2: Statistics reported based on long-exposure images for all datasets. GT Img. stands for ground truth image count, while Leg. and Illeg. stand for legible and illegible text count, respectively.
In this work, we annotated all text instances in the extremely low-light
dataset, SID [1], and the ordinary low-light dataset, LOL [2]. SID has two
subsets: SID-Sony, captured by Sony α7S II, and SID-Fuji, captured by
Fujifilm X-T2. For this work, we included 878/810 short-exposure images
and 211/176 long-exposure images at a resolution of 4240×2832/6000×4000
from SID-Sony and SID-Fuji, respectively. The short-exposure times are 1/30, 1/25, and 1/10 seconds, while the corresponding reference (long-exposure) images were captured with 100 to 300 times longer exposure, i.e., 10 to 30 seconds.
In our experiments, we converted short- and long-exposure SID images to
RGB format. The LOL dataset provides low/normal-light image pairs taken
from real scenes by controlling exposure time and ISO. There are 485 and 15
images at a resolution of 600×400 in the training and test sets, respectively.
We closely annotated text instances in the SID and LOL datasets following
the common IC15 standard. We show some samples in Figure 6. The newly
annotated datasets are named SID-Sony-Text, SID-Fuji-Text, and LOL-Text
to differentiate them from their low-light counterparts.

Figure 6: Green boxes represent legible texts, and blue boxes represent illegible texts. (a) SID-Sony-Text; (b) SID-Fuji-Text; (c) LOL-Text.

IC15 dataset was introduced in the ICDAR 2015 Robust Reading Com-
petition for incidental scene text detection and recognition. It contains 1500
scene text images at a resolution of 1280 × 720. In this study, IC15 is primarily used to synthesize extremely low-light scene text images. Detailed
statistics of the text annotations for SID-Sony-Text, SID-Fuji-Text, LOL-
Text, and IC15 are shown in Table 2, where we included the statistics for
long-exposure images only for the sake of brevity. In this table, we also report
relevant statistics of the mean and standard deviation of labeled texts’ width
and height to be used by the proposed Text-Aware Copy-Paste augmen-
tation. The text annotations for SID-Sony-Text, SID-Fuji-Text, and LOL-
Text datasets will be released at https://github.com/chunchet-ng/Text-in-
the-Dark.
Moreover, we synthesized extremely low-light images based on IC15 by
using U-Net and our proposed Supervised-DCE model, respectively. To study
the difference between these two variations of image synthesis methods, we
generated a total of four sets of images by using the aforementioned two
models trained on SID-Sony and SID-Fuji, individually. Naming convention
of such synthetic datasets follows the format of “{Syn-IC15}-{Sony/Fuji}-
{v1/v2}”. “{Sony/Fuji}” is an indication of which dataset the image synthe-
sis model is trained on, while “{v1/v2}” differentiates the image synthesis
models where v1 is U-Net and v2 is our proposed Supervised-DCE model.
For instance, the synthetic images generated by a U-Net trained on SID-Sony
and SID-Fuji are named Syn-IC15-Sony-v1 and Syn-IC15-Fuji-v1, while synthetic images generated by our proposed Supervised-DCE model are denoted as Syn-IC15-Sony-v2 and Syn-IC15-Fuji-v2.

6. Experimental Results

6.1. Experiment Setup

Datasets and Metrics. All low-light image enhancement methods are trained and tested on the datasets detailed in Section 5. They are then
evaluated in terms of intensity metrics (PSNR, SSIM), perceptual similar-
ity (LPIPS), and text detection (H-Mean). For the SID-Sony-Text, SID-
Fuji-Text, and LOL-Text datasets, which are annotated with text bound-
ing boxes only, we used well-known and commonly used scene text detec-
tors (CRAFT [32] and PAN [33]) to analyze the enhanced images. For
IC15, which provides both text detection and text recognition labels, we
conducted a two-stage text spotting experiment using the aforementioned
text detectors (CRAFT, PAN) and two robust text recognizers (TRBA [37]
and ASTER [36]) on the synthesized IC15 images after enhancement. The
metric for text spotting is case-insensitive word accuracy.
Implementation Details. We trained our image enhancement model for
4000 epochs using the Adam optimizer [44] with a batch size of 2. The initial
learning rate is set to 1e−4 and decreased to 1e−5 after 2000 epochs. At each
training iteration, we randomly cropped a 512 × 512 patch with at least one
labeled text box inside and applied random flipping and image transpose as
data augmentation strategies. The weightings of each loss term, i.e., ωrecons ,
ωtext , ωSSIMM S , and ωedge , were empirically set to 0.2125, 0.425, 0.15, and
0.2125 respectively, following the work of ELIE STR [45]. For other image
enhancement methods, we re-trained them on all datasets using the best set
of hyperparameters specified in their respective code repositories or papers.
As for the Supervised-DCE model, we used a batch size of 8 and trained for 200 epochs using the Adam optimizer with default parameters and a fixed
learning rate of 1e−4. It was trained on 256 × 256 image patches with loss weightings ωprox, ωspa, ωtvH, and ωtvU set to 1, 20, 10, and 10, respectively.
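The enhancement-model schedule described above maps directly onto standard PyTorch utilities. The sketch below is a hedged illustration only: model, train_loader, and total_loss (the combined objective of Eq. (9)) are assumed to be defined elsewhere.

# Sketch of the training schedule: Adam, lr 1e-4 for 2000 epochs, then 1e-5.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)        # model: the enhancer (assumed)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2000], gamma=0.1)

for epoch in range(4000):
    for x, e, y in train_loader:             # 512x512 crops with at least one text box (assumed loader)
        x_hat, e_hat = model(x, e)
        loss = total_loss(x_hat, e_hat, y)   # Eq. (9) with the weights listed above (assumed helper)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()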

6.2. Results on SID-Sony-Text and SID-Fuji-Text Datasets

Our model's performance is demonstrated in Table 3, achieving the highest H-Mean scores on all datasets with CRAFT and PAN. Following [45],
we illustrate the CRAFT text detection results on SID-Sony-Text in Figure
7. Qualitative results of existing methods on SID-Fuji-Text are presented
in the supplementary material. The effectiveness of our model in enhancing
extremely low-light images to a level where text can be accurately detected
is readily apparent. In Figure 7, only the images enhanced by our proposed
model yield accurate text detection results. On the other hand, existing
methods generally produce noisier images, resulting in inferior text detec-
tion results. While GAN-enhanced images tend to be less noisy, the text
regions are blurry, making text detection challenging. Moreover, our model
achieves the highest PSNR and SSIM scores on both SID-Sony-Text and
SID-Fuji-Text datasets, showing that our enhanced images are the closest to
the image quality of ground truth images. In short, better text detection is
achieved on our enhanced images through the improvement of overall image
quality and preservation of fine details within text regions.

SID-Sony-Text
Type   Method               PSNR ↑   SSIM ↑   LPIPS ↓   CRAFT ↑   PAN ↑
-      Input                -        -        -         0.057     0.026
TRAD   LIME [18]            13.870   0.135    0.873     0.127     0.057
TRAD   BIMEF [19]           12.870   0.110    0.808     0.136     0.079
ZSL    Zero-DCE [25]        10.495   0.080    0.999     0.196     0.157
ZSL    Zero-DCE++ [46]      12.368   0.076    0.982     0.218     0.162
ZSL    SCI [26]             11.814   0.100    1.000     0.201     0.151
UL     CycleGAN [22]        15.340   0.453    0.832     0.090     0.053
UL     EnlightenGAN [23]    14.590   0.426    0.793     0.146     0.075
SL     RetinexNet [2]       15.490   0.368    0.785     0.115     0.040
SL     Pix2Pix [21]         21.070   0.662    0.837     0.266     0.190
SL     ChebyLighter [27]    15.418   0.381    0.787     0.260     0.184
SL     FECNet [31]          22.575   0.648    0.788     0.245     0.188
SL     IAT [30]             19.234   0.562    0.778     0.244     0.176
SL     ELIE STR [45]        25.507   0.716    0.789     0.324     0.266
SL     Ours                 25.596   0.751    0.751     0.368     0.298
-      GT                   -        -        -         0.842     0.661

SID-Fuji-Text
Type   Method               PSNR ↑   SSIM ↑   LPIPS ↓   CRAFT ↑   PAN ↑
-      Input                -        -        -         0.048     0.005
ZSL    Zero-DCE [25]        8.992    0.035    1.228     0.249     0.061
ZSL    Zero-DCE++ [46]      11.539   0.047    1.066     0.262     0.077
ZSL    SCI [26]             10.301   0.056    1.130     0.300     0.073
UL     CycleGAN [22]        17.832   0.565    0.735     0.277     0.191
UL     EnlightenGAN [23]    18.834   0.572    0.822     0.310     0.277
SL     Pix2Pix [21]         19.601   0.599    0.803     0.353     0.296
SL     ChebyLighter [27]    20.313   0.616    0.791     0.412     0.318
SL     FECNet [31]          18.863   0.365    0.829     0.382     0.185
SL     IAT [30]             19.647   0.537    0.844     0.445     0.277
SL     ELIE STR [45]        19.816   0.614    0.801     0.426     0.333
SL     Ours                 21.880   0.649    0.788     0.487     0.356
-      GT                   -        -        -         0.775     0.697

LOL-Text
Type   Method               PSNR ↑   SSIM ↑   LPIPS ↓   CRAFT ↑   PAN ↑
-      Input                -        -        -         0.333     0.133
ZSL    Zero-DCE [25]        14.928   0.587    0.328     0.421     0.229
ZSL    Zero-DCE++ [46]      15.829   0.537    0.408     0.389     0.242
ZSL    SCI [26]             14.835   0.549    0.335     0.421     0.171
UL     CycleGAN [22]        19.826   0.734    0.288     0.250     0.133
UL     EnlightenGAN [23]    15.800   0.654    0.300     0.343     0.125
SL     Pix2Pix [21]         20.581   0.771    0.247     0.353     0.129
SL     ChebyLighter [27]    19.820   0.769    0.199     0.353     0.176
SL     FECNet [31]          20.432   0.787    0.231     0.378     0.229
SL     IAT [30]             20.437   0.772    0.234     0.421     0.188
SL     ELIE STR [45]        19.782   0.824    0.167     0.462     0.235
SL     Ours                 21.330   0.828    0.163     0.474     0.294
-      GT                   -        -        -         0.439     0.205

Table 3: Quantitative results of PSNR, SSIM, LPIPS, and text detection H-Mean for low-light image enhancement methods on the SID-Sony-Text, SID-Fuji-Text, and LOL-Text datasets. Please note that TRAD, ZSL, UL, and SL stand for traditional methods, zero-shot learning, unsupervised learning, and supervised learning, respectively. Scores in bold are the best of all.

Figure 7: Comparison with state-of-the-art methods on the SID-Sony-Text dataset. Panels: (a) Low-Light; (b) LIME [18]; (c) BIMEF [19]; (d) Zero-DCE [25]; (e) Zero-DCE++ [46]; (f) SCI [26]; (g) CycleGAN [22]; (h) EnlightenGAN [23]; (i) RetinexNet [2]; (j) Pix2Pix [21]; (k) ChebyLighter [27]; (l) FECNet [31]; (m) IAT [30]; (n) ELIE STR [45]; (o) Ours; (p) Ground Truth. For each column, the first row displays enhanced images marked with blue boxes as regions of interest, and the second row displays zoomed-in regions of enhanced images overlaid with red text detection boxes from CRAFT [32]. Column 7a displays the low-light image, columns 7b to 7o show image enhancement results from all related methods, and the last cell displays the ground truth images.

6.3. Results on LOL-Text Dataset

To demonstrate the effectiveness of our model in enhancing low-light images with varying levels of darkness, we conducted experiments on the widely used LOL dataset, which is relatively brighter than the SID dataset, as depicted in Table 1. Interestingly, we found that our enhanced images achieved
the best detection results on LOL-Text among existing methods, as shown in
Table 3. Surprisingly, despite the lower resolution (600x400) of the images in
LOL, our method’s enhanced images with sharper and crisper low-level de-
tails surpassed the ground truth images’ H-Mean scores. Qualitative results
on the LOL-Text dataset are illustrated in the supplementary material. Al-
though certain methods yielded output images with acceptable image quality
(i.e., bright images without color shift), their text detection results were in-
ferior to ours. Furthermore, our superior results on the LOL-Text dataset
emphasize our method’s ability to generalize well on both ordinary and ex-
tremely low-light images, effectively enhancing a broader range of low-light
images while making the text clearly visible.

6.4. Effectiveness of the Proposed Supervised-DCE Model

The goal of image synthesis in our work is to translate images captured in well-illuminated scenarios to extremely low light. In this work, we choose
the commonly used IC15 scene text dataset as our main synthesis target.
The synthesized dataset then serves as additional data to train better scene
text-aware image enhancement models, which are studied in Section 6.5.
Intuitively, realistic synthesized images should possess similar character-
istics to genuine extremely low-light images. To verify the effectiveness of
our synthesis model, we compared our proposed Supervised-DCE model (v2)
with the U-Net proposed in SID [1] (v1). Specifically, we trained the syn-
thesizing models on the training set and synthesized the images based on
the corresponding test set. Then, we calculated the PSNR and SSIM of the synthesized images by comparing them with the genuine ones, along with the
average perceptual lightness in CIELAB color space. The comparison was
made on two SID datasets, SID-Sony and SID-Fuji.
In Table 4, we show that v2’s PSNR and SSIM are higher than v1’s, in-
dicating higher similarity between our synthesized and genuine images. Our
new method (v2) also exhibits closer Avg. L* values and H-Mean scores to
the genuine images than v1, indicating darker and more accurate deterio-
ration of fine text details. In addition, qualitative results for the proposed
Supervised-DCE model and results of synthetic IC15 datasets including Syn-
IC15-Sony-v1, Syn-IC15-Sony-v2, Syn-IC15-Fuji-v1, and Syn-IC15-Fuji-v2
are presented in the supplementary material for comprehensive analyses.

Dataset            PSNR     SSIM    Avg. L*   CRAFT   PAN
Syn-SID-Sony-v1    41.095   0.809   0.176     0.294   0.083
Syn-SID-Sony-v2    45.442   0.942   0.003     0.135   0.014
Genuine SID-Sony   -        -       0.008     0.057   0.026
Syn-SID-Fuji-v1    39.187   0.784   0.172     0.402   0.042
Syn-SID-Fuji-v2    41.881   0.863   0.002     0.093   0.002
Genuine SID-Fuji   -        -       0.004     0.048   0.005

Table 4: The difference between the genuine extremely low-light dataset, SID, and synthetic extremely low-light images generated using U-Net (v1) and Supervised-DCE (v2). Note that the synthetic images' PSNR and SSIM values are computed against genuine low-light images in the test set, rather than against the pure black images used in Table 1. The v2 images are more realistic and darker, i.e., more similar to genuine extremely low-light images, as indicated by their higher PSNR and SSIM values and closer Avg. L*.

6.5. Results on Training with Mixed Datasets


We trained top-performing models from Section 6.2 using a mixture of
genuine (SID) and synthetic low-light (IC15) datasets to test whether ex-
tremely low-light image enhancement can benefit from synthesized images.

The trained models were evaluated on their respective genuine low-light
datasets. Results in Table 5 showed a significant increase in H-Mean, and we
found that both versions (v1 and v2) can fill the gap caused by the scarcity
of genuine low-light images. This justifies the creation of a synthetic IC15
dataset for such a purpose. Furthermore, v2-images, i.e., extremely low-
light images synthesized by our proposed Supervised-DCE, further pushed
the limit of H-mean scores on genuine extremely low-light images, and our
enhancement model benefited the most because it could learn more from text
instances and reconstruct necessary details to represent texts. Despite our
method’s success, a noticeable gap exists between our results and the ground
truth, emphasizing the need for further research and development to achieve
even more accurate and reliable scene text extraction in low-light conditions.

                            SID-Sony-Text +      SID-Sony-Text +      SID-Fuji-Text +      SID-Fuji-Text +
                            Syn-IC15-Sony-v1     Syn-IC15-Sony-v2     Syn-IC15-Fuji-v1     Syn-IC15-Fuji-v2
Type   Method               CRAFT ↑   PAN ↑      CRAFT ↑   PAN ↑      CRAFT ↑   PAN ↑      CRAFT ↑   PAN ↑
-      Input                0.057     0.026      0.057     0.026      0.048     0.005      0.048     0.005
ZSL    Zero-DCE++ [46]      0.230     0.159      0.242     0.153      0.274     0.080      0.281     0.076
ZSL    SCI [26]             0.240     0.154      0.243     0.160      0.307     0.076      0.313     0.084
UL     CycleGAN [22]        0.180     0.071      0.219     0.143      0.297     0.284      0.310     0.277
UL     EnlightenGAN [23]    0.205     0.146      0.237     0.163      0.329     0.246      0.342     0.282
SL     ELIE STR [45]        0.348     0.278      0.361     0.296      0.444     0.359      0.466     0.375
SL     Ours                 0.383     0.311      0.395     0.319      0.515     0.392      0.549     0.416
-      GT                   0.842     0.661      0.842     0.661      0.775     0.697      0.775     0.697

Table 5: Text detection H-Mean on genuine extremely low-light datasets when trained on a combination of genuine and synthetic datasets. Scores in bold are the best of all.

6.6. Ablation Study of Proposed Modules

Proposed Modules                                      Image Quality                  H-Mean
Text-CP   Dual Encoder   Edge-Att   Edge Decoder      PSNR ↑   SSIM ↑   LPIPS ↓      CRAFT ↑   PAN ↑
-         -              -          -                 21.847   0.698    0.783        0.283     0.205
✓         -              -          -                 21.263   0.658    0.771        0.304     0.252
✓         ✓              -          -                 20.597   0.655    0.780        0.335     0.261
✓         ✓              ✓          -                 21.440   0.669    0.776        0.342     0.256
✓         ✓              -          ✓                 21.588   0.674    0.779        0.353     0.285
✓         -              ✓          ✓                 23.074   0.712    0.783        0.350     0.281
-         ✓              ✓          ✓                 24.192   0.738    0.784        0.356     0.292
✓         ✓              ✓          ✓                 25.596   0.751    0.751        0.368     0.298

Table 6: Ablation study of proposed modules in terms of PSNR, SSIM, LPIPS, and text detection H-Mean on the SID-Sony-Text dataset. Scores in bold are the best of all.

To understand the effect of each component of our model, we conducted several ablation experiments by either adding or removing them one at a time. Results are presented in Table 6. The baseline was a plain U-Net without any proposed modules. We initiated the ablation study by adding Text-CP data augmentation, which improved CRAFT H-Mean from 0.283
to 0.304, indicating that involving more text instances during training is
relevant to text-aware image enhancement for models to learn text represen-
tation. Moreover, scores increased steadily by gradually stacking the baseline
with more modules. For instance, with the help of the dual encoder structure
and Edge-Att module in our proposed framework, CRAFT H-Mean increased
from 0.304 to 0.342. This shows that they can extract image features bet-
ter and attend to edges that shape texts in enhanced images. The edge
reconstruction loss calculated based on predictions from the edge decoder
helped strengthen the learning of edge features and empowered encoders in
our model. Interestingly, we found that removing one of the two most repre-
sentative modules (i.e., dual encoder or Edge-Att module) led to significant
differences in H-Mean because these two modules’ designs allow them to ex-
tract and attend to significant image features independently. We concluded
the experiment by showing that combining all proposed modules led to the
highest scores, as each module played an integral role in our final network.

Further analysis of Edge-Att and Text-CP is included in the supplementary
material to study their effectiveness as compared to the original versions.

7. Conclusion

This paper presents a novel scene text-aware extremely low-light image enhancement framework consisting of a Text-Aware Copy-Paste augmenta-
tion method as a pre-processing step, followed by a new dual-encoder-decoder
architecture armed with Edge-Aware attention modules. With further as-
sistance from text detection and edge reconstruction losses, our model can
enhance images to the extent that high-level perceptual reasoning tasks can
be better fulfilled. Extremely low-light image synthesis has rarely been dis-
cussed over the years. Thus, we proposed a novel Supervised-DCE model
to provide better synthesized extremely low-light images so that extremely
low-light image enhancement can benefit from publicly available scene text
datasets such as IC15. Furthermore, our proposed extremely low-light im-
age enhancement model has been rigorously evaluated against various com-
peting methods, including traditional techniques and deep learning-based
approaches, on challenging datasets such as SID-Sony-Text, SID-Fuji-Text,
LOL-Text, and synthetic IC15. Through extensive experimentation, our find-
ings consistently demonstrate our model’s superiority in extremely low-light
image enhancement and text extraction tasks.

References

[1] C. Chen, Q. Chen, J. Xu, V. Koltun, Learning to see in the dark, in:
CVPR, 2018.

[2] C. Wei, W. Wang, W. Yang, J. Liu, Deep retinex decomposition for


low-light enhancement, arXiv preprint arXiv:1808.04560 (2018).

[3] D. J. Jobson, Z.-u. Rahman, G. A. Woodell, Properties and performance


of a center/surround retinex, IEEE Transactions on Image Processing
6 (3) (1997) 451–462.

[4] T. Çelik, T. Tjahjadi, Contextual and variational contrast enhancement,


IEEE Transactions on Image Processing 20 (2011) 3431–3441.

[5] C. Lee, C. Lee, C.-S. Kim, Contrast enhancement based on layered


difference representation of 2d histograms, IEEE Transactions on Image
Processing 22 (12) (2013) 5372–5384.

[6] L. Tao, C. Zhu, J. Song, T. Lu, H. Jia, X. Xie, Low-light image enhance-
ment using cnn and bright channel prior, in: ICIP, 2017.

[7] L. Tao, C. Zhu, G. Xiang, Y. Li, H. Jia, X. Xie, LLCNN: A convolutional


neural network for low-light image enhancement, in: VCIP, 2017.

[8] K. G. Lore, A. Akintayo, S. Sarkar, LLNet: A deep autoencoder ap-


proach to natural low-light image enhancement, Pattern Recognition 61
(2017) 650–662.

[9] M. Gharbi, J. Chen, J. Barron, S. W. Hasinoff, F. Durand, Deep bilat-
eral learning for real-time image enhancement, ACM Transactions on
Graphics 36 (4) (2017) 1–12.

[10] M. Liu, L. Tang, S. Zhong, H. Luo, J. Peng, Learning noise-decoupled


affine models for extreme low-light image enhancement, Neurocomput-
ing 448 (2021) 21–29.

[11] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: CVPR,


2018.

[12] J. Hu, L. Shen, S. Albanie, G. Sun, A. Vedaldi, Gather-excite: Exploit-


ing feature context in convolutional neural networks, in: NIPS, 2018.

[13] C. Chen, B. Li, An interpretable channelwise attention mechanism based


on asymmetric and skewed gaussian distribution, Pattern Recognition
139 (2023) 109467.

[14] L. Ju, J. Kittler, M. A. Rana, W. Yang, Z. Feng, Keep an eye on faces:


Robust face detection with heatmap-assisted spatial attention and scale-
aware layer attention, Pattern Recognition 140 (2023) 109553.

[15] X. Hou, M. Liu, S. Zhang, P. Wei, B. Chen, Canet: Contextual infor-


mation and spatial attention based network for detecting small defects
in manufacturing industry, Pattern Recognition 140 (2023) 109558.

[16] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block


attention module, in: ECCV, 2018.

[17] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention
network for scene segmentation, in: CVPR, 2019.

[18] X. Guo, Y. Li, H. Ling, Lime: Low-light image enhancement via illumi-
nation map estimation, IEEE Transactions on Image Processing 26 (2)
(2016) 982–993.

[19] Z. Ying, G. Li, W. Gao, A bio-inspired multi-exposure fusion frame-


work for low-light image enhancement, arXiv preprint arXiv:1711.00591
(2017).

[20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,


S. Ozair, A. C. Courville, Y. Bengio, Generative adversarial nets, in:
NIPS, 2014.

[21] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation


with conditional adversarial networks, in: CVPR, 2017.

[22] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image trans-


lation using cycle-consistent adversarial networks, in: ICCV, 2017.

[23] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou,


Z. Wang, Enlightengan: Deep light enhancement without paired super-
vision, ArXiv abs/1906.06972 (2019).

[24] M. Zhu, P. Pan, W. Chen, Y. Yang, EEMEFN: Low-light image enhance-


ment via edge-enhanced multi-exposure fusion network., in: AAAI,
2020.

[25] C. G. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, R. Cong, Zero-
reference deep curve estimation for low-light image enhancement, in:
CVPR, 2020.

[26] L. Ma, T. Ma, R. Liu, X. Fan, Z. Luo, Toward fast, flexible, and robust
low-light image enhancement, in: CVPR, 2022, pp. 5637–5646.

[27] J. Pan, D. Zhai, Y. Bai, J. Jiang, D. Zhao, X. Liu, Chebylighter: Opti-


mal curve estimation for low-light image enhancement, in: ACM MM,
2022.

[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,


L. Kaiser, I. Polosukhin, Attention is all you need, NIPS (2017).

[29] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,


T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al.,
An image is worth 16x16 words: Transformers for image recognition at
scale, in: ICLR, 2021.

[30] Z. Cui, K. Li, L. Gu, S. Su, P. Gao, Z. Jiang, Y. Qiao, T. Harada, You
only need 90k parameters to adapt light: a light weight transformer for
image enhancement and exposure correction, in: BMVC, 2022.

[31] J. Huang, Y. Liu, F. Zhao, K. Yan, J. Zhang, Y. Huang, M. Zhou,


Z. Xiong, Deep fourier-based exposure correction network with spatial-
frequency interaction, in: ECCV, 2022.

[32] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Character region awareness


for text detection, in: CVPR, 2019.

[33] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, C. Shen,
Efficient and accurate arbitrary-shaped text detection with pixel aggre-
gation network, in: ICCV, 2019.

[34] D. Karatzas, L. G. I. Bigorda, A. Nicolaou, S. Ghosh, A. D. Bag-


danov, M. Iwamura, J. Matas, L. Neumann, V. Chandrasekhar, S. Lu,
F. Shafait, S. Uchida, E. Valveny, ICDAR 2015 competition on robust
reading, in: ICDAR, 2015.

[35] C.-K. Ch’ng, C. S. Chan, C.-L. Liu, Total-text: toward orientation ro-
bustness in scene text detection, International Journal on Document
Analysis and Recognition (IJDAR) 23 (1) (2020) 31–52.

[36] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, X. Bai, ASTER: An atten-


tional scene text recognizer with flexible rectification, IEEE Transactions
on Pattern Analysis and Machine Intelligence 41 (9) (2019) 2035–2048.

[37] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, H. Lee,


What is wrong with scene text recognition model comparisons? dataset
and model analysis, in: ICCV, 2019.

[38] Y. Liu, M.-M. Cheng, X. Hu, J. Bian, L. Zhang, X. Bai, J. Tang, Richer
convolutional features for edge detection, IEEE Transactions on Pattern
Analysis and Machine Intelligence 41 (8) (2019) 1939–1946.

[39] Z. Wang, E. Simoncelli, A. Bovik, Multiscale structural similarity for


image quality assessment, in: ACSSC, 2003.

[40] J. Canny, A computational approach to edge detection, IEEE Transac-


tions on Pattern Analysis and Machine Intelligence (6) (1986) 679–698.

[41] H. Liu, F. Liu, X. Fan, D. Huang, Polarized self-attention: Towards
high-quality pixel-wise regression, arXiv preprint arXiv:2107.00782
(2021).

[42] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V.


Le, B. Zoph, Simple copy-paste is a strong data augmentation method
for instance segmentation, in: CVPR, 2021.

[43] P. Samangouei, A. Saeedi, L. Nakagawa, N. Silberman, Explaingan:


Model explanation via decision boundary crossing transformations, in:
ECCV, 2018.

[44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization,


arXiv preprint arXiv:1412.6980 (2015).

[45] P.-H. Hsu, C.-T. Lin, C. C. Ng, J. L. Kew, M. Y. Tan, S.-H. Lai, C. S.
Chan, C. Zach, Extremely low-light image enhancement with scene text
restoration, in: ICPR, 2022.

[46] C. Li, C. Guo, C. C. Loy, Learning to enhance low-light image via zero-
reference deep curve estimation, IEEE Transactions on Pattern Analysis
and Machine Intelligence 44 (8) (2021) 4225–4238.

37
Supplementary Material: Text in the Dark: Extremely Low-Light Text Image Enhancement

Che-Tsung Lina,1, Chun Chet Ngb,1, Zhi Qin Tanb, Wan Jun Nahb, Xinyu Wangd, Jie Long Kewb, Pohao Hsuc, Shang Hong Laic, Chee Seng Chanb,∗, Christopher Zacha

a Chalmers University of Technology, Gothenburg, Sweden
b Universiti Malaya, Kuala Lumpur, Malaysia
c National Tsing Hua University, Hsinchu, Taiwan
d The University of Adelaide, Adelaide, Australia
Abstract

This supplementary material presents the pseudocode for Text-CP, as well as qualitative results for the SID-Fuji-Text and LOL-Text datasets and the proposed Supervised-DCE model. We also provide both quantitative and qualitative results for the synthesized IC15 datasets. Additionally, we analyze the effectiveness of our proposed Edge-Att and Text-CP methods. Overall, our proposed method outperforms existing approaches on all datasets.

1. Pseudocode of Text-CP

Algorithm 1 shows the detailed pseudocode of the proposed Text-CP.




Algorithm 1 Text-Aware Copy-Paste Algorithm

Require: C = {(u1, v1, w1, h1), ..., (u|C|, v|C|, w|C|, h|C|)}: all text instances in the dataset; t: training image to be augmented; Ct ⊆ C: text instances in t; γ: minimum width-to-height ratio of text instances; n_target: target number of text instances in an image
1:  while |Ct| < n_target do
2:      bbox(u, v, w, h) ∼ (C − Ct)              ▷ i. Sample a text instance not in image t
3:      û ∼ U(0, wt)                              ▷ ii. Dataset-aware sampling
4:      v̂ ∼ U(0, ht)
5:      ŵ ∼ N(µW, σ²W)
6:      ĥ ∼ N(µH, σ²H)
7:      if ŵ/ĥ ≥ γ and (û + ŵ) ≤ wt and (v̂ + ĥ) ≤ ht then
8:          Resize bbox(u, v, w, h) to bbox(û, v̂, ŵ, ĥ)   ▷ iii. Resize text instance
9:      end if
10:     if bbox(û, v̂, ŵ, ĥ) does not overlap with any box in Ct then
11:         Paste bbox on t at (û, v̂)             ▷ iv. Paste text instance
12:         Add bbox to Ct
13:     end if
14: end while
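
For reference, a minimal Python sketch of Algorithm 1 is given below. It assumes axis-aligned boxes stored as (u, v, w, h) tuples and two hypothetical helpers, `crop_text` and `paste_patch`, for fetching a text patch from the source dataset and blending it into the training image; it illustrates the sampling logic rather than reproducing our exact implementation.

```python
import random

import cv2
import numpy as np


def overlaps(box_a, box_b):
    """Axis-aligned overlap test for boxes given as (u, v, w, h)."""
    au, av, aw, ah = box_a
    bu, bv, bw, bh = box_b
    return not (au + aw <= bu or bu + bw <= au or
                av + ah <= bv or bv + bh <= av)


def text_cp(image, boxes_in_image, all_boxes, crop_text, paste_patch,
            mu_w, sigma_w, mu_h, sigma_h, gamma=1.0, n_target=10, max_tries=200):
    """Paste randomly sampled, non-overlapping text instances onto `image`."""
    h_t, w_t = image.shape[:2]
    boxes = list(boxes_in_image)
    candidates = [b for b in all_boxes if b not in boxes]
    tries = 0
    while len(boxes) < n_target and candidates and tries < max_tries:
        tries += 1
        src_box = random.choice(candidates)           # i. sample a text instance not in t
        u = np.random.uniform(0, w_t)                 # ii. dataset-aware position sampling
        v = np.random.uniform(0, h_t)
        w = np.random.normal(mu_w, sigma_w)           # size sampled from dataset statistics
        h = np.random.normal(mu_h, sigma_h)
        if h <= 0 or w <= 0 or w / h < gamma:
            continue                                  # reject implausible aspect ratios
        if u + w > w_t or v + h > h_t:
            continue                                  # pasted box must stay inside t
        new_box = (u, v, w, h)
        if any(overlaps(new_box, b) for b in boxes):
            continue                                  # iv. keep pasted texts disjoint
        patch = crop_text(src_box)                    # hypothetical helper: fetch text patch
        patch = cv2.resize(patch, (int(w), int(h)))   # iii. resize to the sampled box
        image = paste_patch(image, patch, int(u), int(v))  # hypothetical helper: blend patch
        boxes.append(new_box)
    return image, boxes
```

The rejection-sampling loop mirrors steps i to iv of Algorithm 1; the `max_tries` cap is an added safeguard, not part of the algorithm, for images where the target count cannot be reached.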

2. Qualitative Results of Existing Methods on the SID-Fuji-Text and LOL-Text

We present qualitative results of existing methods on the SID-Fuji-Text dataset in Figure 1 and on the LOL-Text dataset in Figure 2. In both figures, the images enhanced by our proposed method yield text detection results that are closest to those obtained on the ground truth images.

Figure 1: Comparison with state-of-the-art methods on the SID-Fuji-Text dataset. Panels: (a) Low-Light, (b) Zero-DCE [? ], (c) Zero-DCE++ [? ], (d) SCI [? ], (e) CycleGAN [? ], (f) EnlightenGAN [? ], (g) Pix2Pix [? ], (h) ChebyLighter [? ], (i) FECNet [? ], (j) IAT [? ], (k) ELIE STR [? ], (l) Ours, (m) Ground Truth. For each panel, the first row shows the enhanced image with a blue box marking the region of interest, and the second row shows the zoomed-in region overlaid with red text detection boxes from CRAFT [? ]. Panel (a) shows the low-light input, panels (b) to (l) show the enhancement results of all related methods, and the last panel shows the ground truth image.

Figure 2: Comparison with state-of-the-art methods on the LOL-Text dataset. Panels: (a) Low-Light, (b) Zero-DCE [? ], (c) Zero-DCE++ [? ], (d) SCI [? ], (e) CycleGAN [? ], (f) EnlightenGAN [? ], (g) Pix2Pix [? ], (h) ChebyLighter [? ], (i) FECNet [? ], (j) IAT [? ], (k) ELIE STR [? ], (l) Ours, (m) Ground Truth. For each panel, the first row shows the enhanced image with a blue box marking the region of interest, and the second row shows the zoomed-in region overlaid with red text detection boxes from CRAFT [? ]. Panel (a) shows the low-light input, panels (b) to (l) show the enhancement results of all related methods, and the last panel shows the ground truth image.

3. Qualitative Results of the Proposed Supervised-DCE Model

As shown in Figure 3, compared with the synthetic v1 images, the enhanced synthetic v2 images show only minor differences from the enhanced genuine extremely low-light images in terms of text detection and recognition. This confirms that our proposed low-light synthesis model, Supervised-DCE, synthesizes more realistic extremely low-light images whose intensity and noise distributions closely match those of genuine extremely low-light images.

4. Results on Synthesized IC15 Datasets

In this assessment, we first cropped text regions from the detection results, keeping only detected boxes with an Intersection over Union (IoU) greater than 0.5 with the ground truth bounding boxes. We then performed text recognition (using TRBA and ASTER) on the cropped detected text regions. Since the quality of text detection plays a crucial role in text spotting, the fact that our proposed method outperforms the runner-up, ELIE STR, by a significant margin highlights its effectiveness in preserving fine details within text regions while enhancing low-light images. The complete quantitative results are presented in Table 1.
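
As an illustration of this evaluation protocol, the Python sketch below matches detected boxes to ground truth by IoU, crops the surviving detections, and scores case-insensitive word accuracy. The helper names (`detector`, `recognizer`) and the axis-aligned box format are assumptions for illustration only; they are not the exact interfaces of CRAFT, PAN, TRBA, or ASTER.

```python
import numpy as np


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def two_stage_spotting_accuracy(image, gt_boxes, gt_words, detector, recognizer,
                                iou_thresh=0.5):
    """Case-insensitive word accuracy of a detect-then-recognize pipeline."""
    det_boxes = detector(image)                       # e.g., CRAFT or PAN predictions
    correct = 0
    for gt_box, gt_word in zip(gt_boxes, gt_words):
        # Find the detection that best overlaps this ground-truth word.
        best = max(det_boxes, key=lambda b: iou(b, gt_box), default=None)
        if best is None or iou(best, gt_box) <= iou_thresh:
            continue                                  # counted as a failed ("nil") case
        x1, y1, x2, y2 = map(int, best)
        crop = image[y1:y2, x1:x2]                    # crop the detected text region
        pred = recognizer(crop)                       # e.g., TRBA or ASTER output string
        if pred.lower() == gt_word.lower():
            correct += 1
    return correct / max(len(gt_words), 1)
```

The detection H-Mean columns in Table 1 are computed with the detectors' standard evaluation protocols; the sketch only covers the recognition accuracy columns (C+T, C+A, P+T, P+A).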
As for the qualitative results of the two-stage text spotting task, we
present comprehensive results in Figure 4 and Figure 5. In the first row,
we display the entire image to provide an overview of the scene. In the sub-
sequent row, we zoom in on regions of interest that contain text instances,
allowing for a closer examination of text spotting performance. Please note
that text recognition is performed only when the IoU of detected boxes with
respect to ground truth bounding boxes is larger than 0.5. Any recognition

Figure 3: Text detection results (CRAFT) and text recognition results (ASTER) of the enhanced images (genuine SID-{Sony, Fuji}, Syn-SID-{Sony, Fuji}-v1, and Syn-SID-{Sony, Fuji}-v2). Panels: (a) a genuine SID-Sony low-light image and its enhanced counterpart with detection and recognition results; (b) a Syn-SID-Sony-v1 low-light image and its enhanced counterpart; (c) a Syn-SID-Sony-v2 low-light image and its enhanced counterpart. Text recognition results of each detection box are listed top-to-bottom, then left-to-right: "rightly, solar, compactor", "bigbelly, solar, compactor", "you, solar, compactor" (first row) and "apple, nil, nil", "apple, apple, ampi", "apple, nil, nil" (second row).

results below this threshold are considered failed text recognition and labeled
as “nil”. Additionally, correctly predicted words are colored in green, while
incorrect text recognition results are indicated in red.
Existing methods often introduce excessive noise and artifacts when applying global and local image enhancement techniques simultaneously. As a result, their enhanced images exhibit heavy pixelation and visual distortions. In contrast, our model generates enhanced images that are significantly less noisy and contain fewer artifacts, even in local text regions. Although some text recognition results on our enhanced images are not perfect, they consistently bear the closest resemblance to the ground truth among all compared methods.

In summary, given extremely low-light images, our model generates globally well-enhanced results while retaining important local details such as text features, leading to better text detection and recognition.

5. Ablation Studies of Edge-Aware Attention and Text-Aware Copy-Paste Augmentation

In this experiment, we compared our design choices in the proposed framework against the original modules (i.e., the original PSA vs. the proposed Edge-Att, and Copy-Paste vs. the proposed Text-CP) while keeping the other modules of our model unchanged. The quantitative results are reported in Table 2. The original PSA module feeds the same feature map into both its channel and spatial attention submodules. In contrast, the proposed Edge-Att module takes two heterogeneous inputs: low-light image feature maps for the channel attention branch and edge feature maps for the spatial attention branch. This allows Edge-Att to attend to rich high-level features from the low-light encoder while focusing on low-level features from the edge encoder, and it enables our full model to outperform the naive-PSA variant on all metrics.
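
To make this wiring concrete, the PyTorch sketch below shows one plausible way to combine the two heterogeneous inputs: channel attention computed from the low-light image features and spatial attention computed from the edge features. The module and layer choices here are simplified assumptions for illustration; they are not our exact Edge-Att implementation nor the original PSA code.

```python
import torch
import torch.nn as nn


class DualInputAttention(nn.Module):
    """Illustrative edge-aware attention: channel attention from image features,
    spatial attention from edge features (a simplified stand-in for Edge-Att)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention branch, driven by the low-light image features.
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention branch, driven by the edge features.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, image_feat, edge_feat):
        # image_feat, edge_feat: (B, C, H, W) feature maps from the two encoders.
        channel_weights = self.channel_fc(image_feat)   # (B, C, 1, 1)
        spatial_weights = self.spatial_conv(edge_feat)  # (B, 1, H, W)
        return image_feat * channel_weights * spatial_weights


if __name__ == "__main__":
    # Example usage with random tensors standing in for encoder outputs.
    att = DualInputAttention(channels=64)
    img_f = torch.randn(2, 64, 128, 128)
    edg_f = torch.randn(2, 64, 128, 128)
    print(att(img_f, edg_f).shape)  # torch.Size([2, 64, 128, 128])
```

In our actual model the attention follows the polarized self-attention formulation; the point of the sketch is only the routing of image features to channel attention and edge features to spatial attention.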
Furthermore, we studied the effectiveness of our proposed Text-CP com-
pared to the original Copy-Paste augmentation. The results in Table 2 show

Type  Method             PSNR↑    SSIM↑   LPIPS↓   C↑      P↑      C+T↑    C+A↑    P+T↑    P+A↑

Syn-IC15-Sony-v1
-     Input              -        -       -        0.355   0.192   0.114   0.118   0.062   0.067
ZSL   Zero-DCE++ [? ]    13.505   0.575   0.587    0.379   0.329   0.101   0.093   0.089   0.079
ZSL   SCI [? ]           14.195   0.589   0.486    0.479   0.443   0.154   0.132   0.133   0.121
UL    CycleGAN [? ]      23.420   0.720   0.397    0.428   0.458   0.119   0.144   0.122   0.138
UL    EnlightenGAN [? ]  21.030   0.661   0.469    0.458   0.461   0.140   0.157   0.123   0.139
SL    ELIE STR [? ]      28.410   0.840   0.253    0.622   0.631   0.193   0.219   0.197   0.221
SL    Ours               28.568   0.859   0.248    0.660   0.662   0.222   0.256   0.221   0.248
-     GT                 -        -       -        0.800   0.830   0.526   0.584   0.555   0.591

Syn-IC15-Sony-v2
-     Input              -        -       -        0.198   0.068   0.046   0.047   0.018   0.019
ZSL   Zero-DCE++ [? ]    8.189    0.063   0.753    0.259   0.238   0.059   0.049   0.051   0.044
ZSL   SCI [? ]           9.055    0.111   0.711    0.266   0.278   0.066   0.052   0.066   0.055
UL    CycleGAN [? ]      20.212   0.768   0.404    0.443   0.429   0.113   0.114   0.104   0.107
UL    EnlightenGAN [? ]  19.610   0.753   0.440    0.477   0.456   0.116   0.114   0.099   0.104
SL    ELIE STR [? ]      22.724   0.792   0.339    0.570   0.566   0.126   0.136   0.122   0.130
SL    Ours               22.938   0.809   0.318    0.593   0.577   0.138   0.157   0.134   0.143
-     GT                 -        -       -        0.800   0.830   0.526   0.584   0.555   0.591

Syn-IC15-Fuji-v1
-     Input              -        -       -        0.347   0.175   0.100   0.102   0.097   0.095
ZSL   Zero-DCE++ [? ]    13.124   0.569   0.591    0.379   0.323   0.099   0.092   0.094   0.083
ZSL   SCI [? ]           15.426   0.615   0.599    0.468   0.463   0.144   0.129   0.139   0.124
UL    CycleGAN [? ]      24.476   0.797   0.377    0.406   0.371   0.109   0.113   0.097   0.104
UL    EnlightenGAN [? ]  25.923   0.803   0.364    0.420   0.406   0.115   0.121   0.104   0.106
SL    ELIE STR [? ]      27.528   0.837   0.264    0.646   0.645   0.199   0.216   0.193   0.200
SL    Ours               28.437   0.856   0.248    0.668   0.668   0.218   0.234   0.211   0.217
-     GT                 -        -       -        0.800   0.830   0.526   0.584   0.555   0.591

Syn-IC15-Fuji-v2
-     Input              -        -       -        0.133   0.028   0.023   0.022   0.006   0.009
ZSL   Zero-DCE++ [? ]    8.793    0.134   0.708    0.193   0.137   0.040   0.033   0.028   0.025
ZSL   SCI [? ]           8.071    0.069   0.789    0.213   0.128   0.048   0.042   0.034   0.030
UL    CycleGAN [? ]      23.920   0.733   0.367    0.397   0.386   0.070   0.074   0.066   0.073
UL    EnlightenGAN [? ]  24.600   0.740   0.389    0.415   0.396   0.089   0.091   0.081   0.084
SL    ELIE STR [? ]      25.536   0.819   0.301    0.556   0.546   0.121   0.133   0.117   0.127
SL    Ours               25.999   0.830   0.283    0.598   0.583   0.143   0.152   0.128   0.144
-     GT                 -        -       -        0.800   0.830   0.526   0.584   0.555   0.591

Table 1: Quantitative results of PSNR, SSIM, LPIPS, and text detection H-Mean for low-light image enhancement methods on the Syn-IC15-Sony-v1, Syn-IC15-Sony-v2, Syn-IC15-Fuji-v1, and Syn-IC15-Fuji-v2 datasets. Two-stage text spotting case-insensitive word accuracies are also reported, where C, P, T, and A denote CRAFT, PAN, TRBA, and ASTER, respectively. Please note that ZSL, UL, and SL stand for zero-shot learning, unsupervised learning, and supervised learning, respectively. Scores in bold are the best of all.

Model Variations                PSNR↑    SSIM↑   LPIPS↓   H-Mean CRAFT↑   H-Mean PAN↑
Full Model (naive-PSA)          23.768   0.725   0.774    0.354           0.291
Full Model (naive-Copy-Paste)   22.822   0.714   0.787    0.356           0.284
Full Model                      25.596   0.751   0.751    0.368           0.298

Table 2: Ablation study of our full model using different versions of PSA and Copy-Paste
on SID-Sony-Text in terms of PSNR, SSIM, LPIPS, and text detection H-Mean. Scores
in bold are the best of all.

that our model trained with the proposed Text-CP scores better than the one trained with naive Copy-Paste. By keeping pasted texts separated on the training image patches, Text-CP prevents text detection models from being confused by overlapping texts, so they can generate better text heatmaps for computing the text detection loss. This design also encourages our model to focus on text instances: without overlapping texts, text features are clearer and more distinctive than those of other objects.

Figure 4: Comparison of our model with other state-of-the-art methods on the Syn-IC15-Sony-v1 (top) and Syn-IC15-Sony-v2 (bottom) datasets. Panels: (a) Low-Light, (b) Zero-DCE++ [? ], (c) SCI [? ], (d) CycleGAN [? ], (e) EnlightenGAN [? ], (f) ELIE STR [? ], (g) Ours, (h) Ground Truth. Recognized words shown for panels (a) to (h), top row: nil, nil, nil, nil, nil, thing, dining, dining; bottom row: nil, nil, nil, nil, nil, was, world, world. PAN [? ] is used as the text detector, while ASTER [? ] is used for text recognition. Blue boxes represent the zoomed-in regions, while red boxes are drawn based on the predictions of PAN. Text recognition is performed on word images cropped using PAN's predictions (panels (a) to (g)) and ground truth text boxes (panel (h)). "nil" stands for either no recognition result or an invalid prediction box (IoU ≤ 0.5), which is not considered for subsequent recognition.

Figure 5: Comparison of our model with other state-of-the-art methods on the Syn-IC15-Fuji-v1 (top) and Syn-IC15-Fuji-v2 (bottom) datasets. Panels: (a) Low-Light, (b) Zero-DCE++ [? ], (c) SCI [? ], (d) CycleGAN [? ], (e) EnlightenGAN [? ], (f) ELIE STR [? ], (g) Ours, (h) Ground Truth. Recognized words shown for panels (a) to (h), top row: "nil"; "nil, the, for"; "nil, out, from"; "from, an, nil"; "and, cut, but"; "ereshly, cut, and"; "freshly, cut, from"; "freshly, cut, fruit". Bottom row: "nil"; "nil"; "nil"; "nil"; "nil"; "greads, butter"; "bread, butter"; "bread, butter". PAN [? ] is used as the text detector, while ASTER [? ] is used for text recognition. Blue boxes represent the zoomed-in regions, while red boxes are drawn based on the predictions of PAN. Text recognition is performed on word images cropped using PAN's predictions (panels (a) to (g)) and ground truth text boxes (panel (h)). "nil" stands for either no recognition result or an invalid prediction box (IoU ≤ 0.5), which is not considered for subsequent recognition.
