Text in the Dark: Extremely Low-Light Text Image
Enhancement
Che-Tsung Lina,1 , Chun Chet Ngb,1 , Zhi Qin Tanb , Wan Jun Nahb , Xinyu
Wangd , Jie Long Kewb , Pohao Hsuc , Shang Hong Laic , Chee Seng Chanb,∗,
Christopher Zacha
a Chalmers University of Technology, Gothenburg, Sweden
b Universiti Malaya, Kuala Lumpur, Malaysia
c National Tsing Hua University, Hsinchu, Taiwan
d The University of Adelaide, Adelaide, Australia
Abstract
∗ Corresponding Author. Email address: cs.chan@um.edu.my (Chee Seng Chan)
1 These authors contributed equally to this work.
1. Introduction
Figure 1: From left to right: (a) Original images; (b) Enhanced results with our proposed
method; (c-d) Zoomed-in (2x) regions of the blue and green bounding boxes. Top row:
SID-Sony-Text; Middle row: SID-Fuji-Text; Bottom row: LOL-Text. Extremely low-light
images in the SID dataset are significantly darker than those in the LOL dataset, and our
model enhances the images to the extent that texts are clearly visible with sharp edges.
are no longer prominent or hardly visible. On the other hand, enhancing
images captured in extremely low-light conditions poses a greater challenge
than enhancing ordinary low-light images due to the higher noise levels and greater
information loss. For instance, we show the difference in darkness level in
Figure 1, where it is evident that the See In the Dark (SID) datasets [1] are
darker and, in theory, more difficult to enhance than the LOw-Light (LOL)
dataset [2]. Quantitatively, Table 1 reports the PSNR and SSIM values for the
two SID subsets, SID-Sony and SID-Fuji, and for LOL, computed by comparing each
image against a pure black image. Based on each dataset’s average per-
ceptual lightness (L* in the CIELAB color space), images in SID are at least
15 times darker than those in LOL. Hence, low-light image enhancement is a
necessary pre-processing step for scene text extraction under such conditions.
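For illustration, such darkness statistics can be computed as in the minimal sketch below, assuming 8-bit RGB images and scikit-image (version 0.19 or later); the helper name, file handling, and the normalisation of L* to [0, 1] are our choices, not the paper's.

```python
# Sketch (not from the paper): estimating how dark a low-light image is by
# comparing it against a pure black reference and by averaging CIELAB L*.
import numpy as np
from skimage import io, color
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def darkness_stats(path):
    img = io.imread(path)[..., :3]             # H x W x 3, uint8 RGB
    black = np.zeros_like(img)                 # pure black reference
    psnr = peak_signal_noise_ratio(black, img, data_range=255)
    ssim = structural_similarity(black, img, channel_axis=-1, data_range=255)
    # CIELAB L* lies in [0, 100]; divide by 100 to express it on a 0-1 scale.
    avg_l = color.rgb2lab(img)[..., 0].mean() / 100.0
    return psnr, ssim, avg_l

# Higher PSNR/SSIM against pure black and lower average L* indicate a darker image.
```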
Over the years, many general or low-light image enhancement models
have been proposed to improve the interpretability and extraction of infor-
mation in images by providing better input for subsequent image content
analysis. Early methods [3, 18, 19] typically attempted to restore the sta-
tistical properties of low-light images to those of long-exposure images from
a mathematical perspective. On the other hand, deep learning-based meth-
ods [2, 1, 23, 25] aim to learn the mapping between low-light images and
their corresponding long-exposure versions via regression. To the best of our
knowledge, most existing low-light image enhancement works have not ex-
plicitly addressed the restored image quality in terms of downstream scene
text tasks.
Recent advancements in visual attention mechanisms have demonstrated
their effectiveness in identifying and boosting salient features in images.
Dataset PSNR ↑ SSIM ↑ Avg. L* ↓
SID-Sony [1] 44.350 0.907 0.009
SID-Fuji [1] 41.987 0.820 0.004
LOL [2] 23.892 0.195 0.142
Pure Black ∞ 1.000 0.000
Table 1: The difference between the extremely low-light dataset, SID, and the ordinary
low-light dataset, LOL, is shown in terms of PSNR and SSIM values, computed by compar-
ing short-exposure images against pure black images. Avg. L* is the average perceptual
lightness in the CIELAB color space, calculated based on short-exposure images. Scores
are averaged across training and test sets. Higher PSNR and SSIM values, along with
lower Avg. L*, indicate darker images that are more challenging for image enhancement
and scene text extraction.
Channel-only attention [11, 12, 13], spatial attention [14, 15] or the subse-
quent channel-spatial attention [16, 17] modules were proposed to emphasize
the most informative areas. However, these methods cannot preserve texture
details, especially fine-grained edge information that is intuitively needed to
enhance extremely low-light images with complex textures. To overcome this
limitation, we introduce Edge-Aware Attention (Edge-Att). This novel at-
tention module simultaneously performs channel and spatial attention-based
feature learning on high-level image and edge features. Our model also con-
siders text information in the image through a text-aware loss function. This
way, our model can effectively enhance low-light images while preserving
fine-grained edge information, texture details, and legibility of text.
The scarcity of extremely low-light text datasets presents a hurdle for
further research. To address this, we annotated all text instances in both
the training and testing sets of the SID and LOL datasets, creating three
new low-light text datasets: SID-Sony-Text, SID-Fuji-Text, and LOL-Text.
We then proposed a novel Supervised Deep Curve Estimation (Supervised-
DCE) model to synthesize extremely low-light scene text images based on the
commonly used ICDAR15 (IC15) scene text dataset. It allows researchers
to easily translate naive scene text datasets into extremely low-light text
datasets. In addition to the previously published conference version of this
work [45], we have made four significant extensions. Firstly, we propose
a novel dual encoder-decoder framework that can achieve superior perfor-
mance on low-light scene text tasks (Section 3.1). Secondly, we introduce a
new image synthesis method capable of generating more realistic extremely
low-light text images (Section 4.1). Thirdly, we have further annotated texts
in the Fuji and LOL datasets, thereby forming the largest low-light scene
text datasets to date (Section 5). Fourthly, comprehensive experiments and
analyses are carried out to study the latest methods along with our pro-
posed methods on all synthetic and real low-light text datasets. The main
contributions of our work are as follows:
• We labeled the texts in the SID-Sony, SID-Fuji, and LOL datasets
and named them SID-Sony-Text, SID-Fuji-Text, and LOL-Text, re-
spectively. This provides a new perspective for objectively assessing
enhanced extremely low-light images through scene text tasks.
• Our method achieves the best results on all datasets quantitatively and
qualitatively.
2. Related Works
weight CNN to estimate pixel-wise high-order curves for dynamic range ad-
justment of a given image without needing paired images. SCI [26] introduced
a novel Self-Calibrated Illumination learning scheme with an unsupervised
training loss that constrains the output at each stage under the effect of a
self-calibrated module. ChebyLighter [27] learns to estimate an optimal
pixel-wise adjustment curve under the paired setting. Recently, the Trans-
former [28] architecture has become the de facto standard for Natural Lan-
guage Processing (NLP) tasks. ViT [29] applied the attention mechanism
in the vision task by splitting the image into tokens before sending it into
Transformer. Illumination Adaptive Transformer (IAT) [30] uses attention
queries to represent and adjust ISP-related parameters. Most existing models
enhance images in the spatial domain. Fourier-based Exposure Correction
Network (FECNet) [31] presents a new perspective for exposure correction
with spatial-frequency interaction and has shown that their model can be
extended to low-light image enhancement.
Scene Text Extraction. Deep neural networks have been widely used for
scene text detection. CRAFT [32] predicts two heatmaps: the character re-
gion score map and the affinity score map. The region score map localizes
individual characters in the image, while the affinity score map groups each
character into a single word instance. Another notable scene text detection
method is Pixel Aggregation Network (PAN) [33] which is trained to pre-
dict text regions, kernels, and similarity vectors. Both text segmentation
models have proven to work well on commonly used scene text datasets such
as IC15 [34] and TotalText [35]. Inspired by them, we introduced a text
detection loss in our proposed model to focus on scene text regions during
extremely low-light image enhancement. Furthermore, state-of-the-art text
recognition methods such as ASTER [36] and TRBA [37] are known to per-
form well on images captured in complex scenarios. ASTER [36] employs
a flexible rectification module to straighten the word images before passing
them to a sequence-to-sequence model with the bi-directional decoder. The
experimental results of ASTER showed that the rectification module achieves
superior performance on multiple scene text recognition datasets, including
IC15. TRBA [37] provided further insights by breaking down the scene text
recognition framework into four main stages: spatial transformation, character
feature extraction, sequence modeling, and character sequence prediction.
Given these methods’ robustness on difficult texts, they are well-suited to
recognize texts from enhanced low-light images.
Our novel image enhancement model consists of a U-Net accommodating
extremely low-light images and edge maps using two independent encoders.
During model training, instead of channel attention, the encoded edges guide
the spatial attention sub-module in the proposed Edge-Att to attend to edge
pixels related to text representations. Besides the image enhancement losses,
our model incorporates text detection and edge reconstruction losses into the
training process. This integration effectively guides the model’s attention to-
wards text-related features and regions, facilitating improved image textual
content analysis. As a pre-processing step, we introduced a novel augmen-
tation technique called Text-CP to increase the presence of non-overlapping
and unique text instances in training images, thereby promoting comprehen-
sive learning of text representations.
Our model was inspired by U-Net [1] with some refinements. Firstly, the
network expects heterogeneous inputs, i.e., extremely low-light images, x,
and the corresponding RCF [38] edge maps, e. Secondly, input-edge pairs
are handled by two separate encoders with edge-aware attention modules
between them. The attended features are then bridged with the decoder
through skip connections. Finally, our multi-tasking network predicts the
enhanced image, x′, and the corresponding reconstructed edge, e′. The overall
architecture of our network, shown in Figure 2, can thus be viewed as a mapping
from the input pair (x, e) to the output pair (x′, e′).
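The exact layer configuration is not reproduced here; the following PyTorch-style sketch only mirrors the structure described above (two encoders, attention-bridged skip connections, and two prediction heads). Channel widths, depth, the NaiveFusion placeholder, and all module names are illustrative assumptions, not the authors' implementation.

```python
# Minimal structural sketch of the dual encoder-decoder (not the authors' code).
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class NaiveFusion(nn.Module):
    """Placeholder for Edge-Att: fuses image and edge features with a 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, 1)
    def forward(self, f, e):
        return self.mix(torch.cat([f, e], dim=1))

class DualEncoderUNet(nn.Module):
    def __init__(self, attn=NaiveFusion):
        super().__init__()
        self.img_e1, self.img_e2 = block(3, 32), block(32, 64)      # image encoder
        self.edge_e1, self.edge_e2 = block(1, 32), block(32, 64)    # edge encoder
        self.att1, self.att2 = attn(32), attn(64)                   # attention bridges
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.img_dec, self.edge_dec = block(64, 32), block(64, 32)  # two decoders
        self.to_img, self.to_edge = nn.Conv2d(32, 3, 1), nn.Conv2d(32, 1, 1)

    def forward(self, x, e):
        f1, g1 = self.img_e1(x), self.edge_e1(e)
        f2, g2 = self.img_e2(self.pool(f1)), self.edge_e2(self.pool(g1))
        a1, a2 = self.att1(f1, g1), self.att2(f2, g2)   # attended skip features
        d = torch.cat([self.up(a2), a1], dim=1)         # upsample + skip connection
        return self.to_img(self.img_dec(d)), torch.sigmoid(self.to_edge(self.edge_dec(d)))

model = DualEncoderUNet()
x, e = torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256)
x_prime, e_prime = model(x, e)    # enhanced image and reconstructed edge map
```

In the full model, the edge decoder additionally produces side outputs (see Section 3.3); the sketch keeps a single output per head for brevity.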
Figure 2: Illustration of the architecture of our proposed framework, designed to enhance
extremely low-light images while incorporating scene text awareness.
3.3. Objectives
Our proposed model is trained to optimize four loss functions. The first
two, the smooth L1 loss and the multi-scale SSIM loss, focus on enhancing the overall
image quality. The third, text detection loss, targets the enhancement of
scene text regions specifically. The fourth, edge reconstruction loss, focuses
on crucial low-level edge features.
Firstly, we employ smooth L1 loss as the reconstruction loss to better
enforce low-frequency correctness [21] between x′ and y as:
L_{\mathrm{recons}} = \begin{cases} 0.5 \cdot (x' - y)^2 / \delta, & \text{if } |x' - y| < \delta \\ |x' - y| - 0.5 \cdot \delta, & \text{otherwise} \end{cases} \quad (2)
where we empirically found that δ = 1 achieves good results. The authors of
Pix2Pix [21] showed that by utilizing L1 loss, the model can achieve better
Figure 3: (a) Visual representation of our edge decoder, wherein A and B represent the
output from the corresponding convolution blocks in Figure 2 and S denotes the scaling
of the image. (b) Illustration of the proposed Edge-Aware Attention module.
results, as the generated images are less blurry, and that L1 loss better
enforces the learning of low-frequency details, which is also essential
for OCR tasks. On the other hand, the L1 norm is less sensitive to outliers
than the L2 norm, thus resulting in a more robust model towards extreme
pixel intensities.
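For reference, Eq. (2) can be implemented directly as in the sketch below; with δ = 1 it coincides with PyTorch's built-in torch.nn.SmoothL1Loss(beta=1.0). The function name is ours.

```python
# Eq. (2): smooth L1 (Huber-style) reconstruction loss between the enhanced
# image x_prime and the long-exposure ground truth y.
import torch

def smooth_l1_loss(x_prime, y, delta=1.0):
    diff = torch.abs(x_prime - y)
    loss = torch.where(diff < delta,
                       0.5 * diff ** 2 / delta,
                       diff - 0.5 * delta)
    return loss.mean()
```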
Secondly, the multi-scale SSIM metric was proposed in [39] for reference-
based image quality assessment, focusing on image structure consistency. An
M -scale SSIM between the enhanced image x′ and ground truth image y is:
\mathrm{SSIM}_{MS}(x', y) = [l_M(x', y)]^{\tau} \cdot \prod_{j=1}^{M} [c_j(x', y)]^{\phi} \, [s_j(x', y)]^{\psi}, \quad (3)
where lM is the luminance at M -scale; cj and sj represent the contrast and the
structure similarity measures at the j-th scale; τ , ϕ, and ψ are parameters to
adjust the importance of the three components. Inspired by [39], we adopted
the M -scale SSIM loss function in our work to enforce the image structure
of x′ to be close to that of y. Thirdly, the text detection loss, Ltext, encourages
consistent text region predictions between the enhanced and ground truth images,
where R(x′) and R(y) denote the region score heatmaps of the enhanced and
ground truth images, respectively.
Fourthly, the edge reconstruction decoder in our model is designed to
extract edges better, which are essential for text pixels. Figure 3(a) shows
an overview of the edge decoder. The loss at pixel i of detected edge, ei , with
respect to the ground truth edge, gi is defined as:
l(e_i) = \begin{cases} \alpha \cdot \log(1 - P(e_i)), & \text{if } g_i = 0 \\ \beta \cdot \log P(e_i), & \text{if } g_i = 1 \end{cases} \quad (6)
where
\alpha = \lambda \cdot \frac{|Y^+|}{|Y^+| + |Y^-|}, \qquad \beta = \frac{|Y^-|}{|Y^+| + |Y^-|}, \quad (7)
and Y^+ and Y^- denote the positive and negative sample sets, respectively. λ is set
to 1.1 to balance both types of samples. The ground truth edge is generated
using a Canny edge detector [40], and P(ei ) is the sigmoid function. Then,
the overall edge reconstruction loss can be formulated as:
L_{\mathrm{edge}} = \sum_{i=1}^{|I|} \Bigl( \sum_{j=1}^{J} l(e_i^j) + l(e_i') \Bigr), \quad (8)
where l(e_i^j) is the loss for the predicted edge at pixel i and side-output level j. J = 3 is the number
of side edge outputs in our model. e_i' is the final predicted edge map from
the concatenation of side outputs. |I| is the number of pixels in a cropped
image during training.
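A sketch of the class-balanced edge objective of Eqs. (6)-(8) is given below, written as a negative log-likelihood to be minimized; the sign convention and the reduction over pixels are our assumptions, and pred is taken to hold sigmoid probabilities P(e_i). Function names are ours.

```python
# Class-balanced edge loss in the spirit of Eqs. (6)-(8). `pred` holds sigmoid
# probabilities P(e_i) in (0, 1); `gt` is the binary Canny edge map g_i.
import torch

def edge_loss(pred, gt, lambda_=1.1, eps=1e-6):
    num_pos = gt.sum()                      # |Y+|
    num_neg = gt.numel() - num_pos          # |Y-|
    alpha = lambda_ * num_pos / (num_pos + num_neg)
    beta = num_neg / (num_pos + num_neg)
    loss = -(beta * gt * torch.log(pred + eps)
             + alpha * (1.0 - gt) * torch.log(1.0 - pred + eps))
    return loss.sum()

def total_edge_loss(side_outputs, fused_output, gt):
    """Eq. (8): sum over the J side outputs plus the fused edge map."""
    loss = sum(edge_loss(p, gt) for p in side_outputs)
    return loss + edge_loss(fused_output, gt)
```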
Finally, the total joint loss function, Ltotal en of our proposed model is:
L_{\mathrm{total\_en}} = \omega_{\mathrm{recons}} L_{\mathrm{recons}} + \omega_{\mathrm{text}} L_{\mathrm{text}} + \omega_{\mathrm{SSIM_{MS}}} L_{\mathrm{SSIM_{MS}}} + \omega_{\mathrm{edge}} L_{\mathrm{edge}}, \quad (9)
where ω_recons, ω_text, ω_SSIM_MS, and ω_edge are the weights that control the importance of each loss term during training.
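In code, Eq. (9) amounts to a simple weighted sum; the sketch below uses our own function name and dictionary keys, and the weight values are unspecified hyperparameters.

```python
# Eq. (9): the total enhancement objective as a weighted sum of the four terms.
# The individual loss values are assumed to be scalar tensors computed elsewhere.
def total_enhancement_loss(l_recons, l_text, l_ssim_ms, l_edge, w):
    return (w["recons"] * l_recons + w["text"] * l_text
            + w["ssim_ms"] * l_ssim_ms + w["edge"] * l_edge)
```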
the edge encoder. By doing so, we can ensure that Edge-Att can attend
to rich image and edge features simultaneously. The proposed attention
module is illustrated in Figure 3(b).
Firstly, the feature map from the image encoder, F, is fed into the channel
attention, A_ch(F) ∈ R^{C×1×1}, calculated as follows:
A_{ch}(F) = \sigma_3 \bigl[ F_{SG} \bigl( W_z \bigl( \sigma_1(W_v(F)) \times F_{SM}(\sigma_2(W_q(F))) \bigr) \bigr) \bigr], \quad (10)
Meanwhile, the encoded edge features, E, from the edge encoder are fed into the
spatial attention, A_sp(E), computed as:
A_{sp}(E) = F_{SG} \bigl[ \sigma_3 \bigl( F_{SM}(\sigma_1(F_{GP}(W_q(E)))) \times \sigma_2(W_v(E)) \bigr) \bigr], \quad (11)
where W_q, W_v, and W_z are 1×1 convolution layers, and σ_1, σ_2, and σ_3 are three tensor
reshape operators. F_SM(·) is a softmax operator, F_GP(·) is a global pooling
operator, and F_SG(·) is a sigmoid operator. The output of this branch is
A_sp(E) ⊗_sp F ∈ R^{C×H×W}, where ⊗_sp is a spatial-wise multiplication opera-
tor, and F is the image enhancement branch's feature map. Finally, the output
of the proposed Edge-Att module is the composition of the two attention submodules.
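Eqs. (10)-(11) follow a polarized-attention-style formulation; rather than reproduce the exact tensor algebra, the sketch below is a deliberately simplified stand-in (an SE-style channel gate on F composed with an edge-driven spatial gate from E) that only mirrors the channel-then-spatial structure, not the paper's exact Edge-Att. It can replace the NaiveFusion placeholder in the earlier architectural sketch.

```python
# Simplified stand-in for Edge-Att (not the exact Eqs. (10)-(11)): a channel
# gate computed from the image features F and a spatial gate computed from the
# encoded edge features E, composed onto F. Intended only to show the structure.
import torch.nn as nn

class SimplifiedEdgeAtt(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(               # analogue of A_ch(F), C x 1 x 1
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(               # analogue of A_sp(E), 1 x H x W
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, f, e):
        f = f * self.channel_gate(f)     # channel-wise re-weighting of image features
        f = f * self.spatial_gate(e)     # edge-guided spatial re-weighting
        return f
```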
Figure 4: Overview of the Text-CP augmentation pipeline, shown before and after augmentation: (i) sample text instances, (ii) dataset-aware sampling, (iii) resize text, and (iv) paste text.
labeled text boxes in the training set as C, which is:
C = \{(u_1, v_1, w_1, h_1), \ldots, (u_{|C|}, v_{|C|}, w_{|C|}, h_{|C|})\}, \quad (13)
where each tuple represents the top left position of a text located at uk and vk
with width, wk , and height, hk with k representing the index of the current
text’s box in the set. We then sample a target number of text instances,
ntarget , from the set of C to form Ct , defined as the set of text boxes to be
pasted on that training image, t. The next step is to crop and paste the
sampled texts without overlapping. For each ck ∈ Ct , we adopt two uniform
distributions in modeling the position of the texts, ûk and v̂k :
\hat{u}_k \sim U(0, w_t), \qquad \hat{v}_k \sim U(0, h_t), \quad (14)
As for w_k and h_k, they are sampled from Gaussian distributions as:
\hat{w}_k \sim \mathcal{N}(\mu_W, \sigma_W^2), \qquad \hat{h}_k \sim \mathcal{N}(\mu_H, \sigma_H^2), \quad (15)
where µ and σ² are the estimated means and variances of width W and height
H from all the labeled texts in the training set. We illustrate the overall
data augmentation process of Text-CP and its augmented results in Figure
4. The pseudocode of Text-CP is detailed in the supplementary material.
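Since the pseudocode is deferred to the supplementary material, the following is only a hedged sketch of the sampling logic of Eqs. (13)-(15); the rejection of out-of-bounds or overlapping candidates, the retry limit, and the function names are our simplifications, and the pixel-level crop-and-paste step is omitted.

```python
# Sketch of the Text-CP sampling logic: positions are drawn uniformly over the
# training image and sizes from the dataset's text width/height statistics;
# candidates that overlap already placed boxes are rejected.
import random

def overlaps(box, placed):
    u, v, w, h = box
    for (pu, pv, pw, ph) in placed:
        if u < pu + pw and pu < u + w and v < pv + ph and pv < v + h:
            return True
    return False

def sample_text_boxes(img_w, img_h, mu_w, sigma_w, mu_h, sigma_h,
                      n_target, max_tries=100):
    placed = []
    for _ in range(n_target):
        for _ in range(max_tries):
            u = random.uniform(0, img_w)                 # u_k ~ U(0, w_t)
            v = random.uniform(0, img_h)                 # v_k ~ U(0, h_t)
            w = max(1.0, random.gauss(mu_w, sigma_w))    # w_k ~ N(mu_W, sigma_W^2)
            h = max(1.0, random.gauss(mu_h, sigma_h))    # h_k ~ N(mu_H, sigma_H^2)
            if u + w <= img_w and v + h <= img_h and not overlaps((u, v, w, h), placed):
                placed.append((u, v, w, h))
                break
    return placed
```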
text. While the extremely low-light dataset SID and the ordinary low-light dataset LOL
exist, they were not primarily collected with scene text in mind. This scarcity
of dedicated datasets for extremely low-light scene text poses challenges for
evaluating the performance of existing image enhancement methods in terms
of image quality and scene text metrics. In order to address this issue, we
define the extremely low-light image synthesis problem as follows:
x̂ = LS(y; θs ), (16)
Figure 5: Illustration of the proposed Supervised-DCE model for extremely low-light image
synthesis.
learning is done by training SID long-exposure images to be similar to their
short-exposure counterparts. Most importantly, these images are translated
such that their text information deteriorates in a manner similar to that of
genuine extremely low-light images and cannot be easily reconstructed.
Then, the trained model can be used to transform scene text datasets in
the public domain to boost the performance of extremely low-light image
enhancement in terms of text detection.
4.3. Objectives
where M is the number of local regions, and ω(i) is the set of four neighboring
regions (top, down, left, right) centered at region i. X̂ and Y are the averaged
intensity values of local regions of the synthesized images and the long-exposure
images, respectively. We introduced a logarithm operation and the parameter α_s
to reduce the large spatial differences of Y, where α_s is set to 0.05. We set the
local region size to 4 × 4, following the original setting of Zero-DCE.
Besides spatial consistency, we also expect the monotonicity relation be-
tween neighboring pixels to be preserved. To achieve this, we reused the
illumination smoothness loss:
L_{tv_Z} = \sum_{\forall c \in \xi} \bigl( |\nabla_x Z^c| + |\nabla_y Z^c| \bigr)^2, \quad \xi = \{R, G, B\}, \quad (20)
where ∇x and ∇y are gradient operations on the x-axis and y-axis, respec-
tively. Illumination smoothness loss, LtvZ , is applied on both H(y) and U (y),
i.e., the curve parameter maps of the two branches, respectively, by substi-
tuting Z with H and U , resulting in LtvH and LtvU .
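A minimal sketch of this smoothness term (Eq. (20)) is given below; the mean-absolute-gradient reduction over batch and pixels is our assumption, since only the per-channel form is stated above, and the function name is ours.

```python
# Eq. (20): illumination smoothness (total-variation style) loss on a curve
# parameter map Z of shape (B, C, H, W); applied to both H(y) and U(y).
def illumination_smoothness_loss(z):
    grad_x = (z[..., :, 1:] - z[..., :, :-1]).abs().mean(dim=(0, 2, 3))  # |∇x Z^c|
    grad_y = (z[..., 1:, :] - z[..., :-1, :]).abs().mean(dim=(0, 2, 3))  # |∇y Z^c|
    return ((grad_x + grad_y) ** 2).sum()       # sum over the R, G, B channels
```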
In summary, the overall learning objective, L_{total\_syn}, to train our ex-
tremely low-light image synthesis network is defined as:
L_{\mathrm{total\_syn}} = \omega_{\mathrm{prox}} L_{\mathrm{prox}} + \omega_{\mathrm{spa}} L_{\mathrm{spa}} + \omega_{tv_H} L_{tv_H} + \omega_{tv_U} L_{tv_U}. \quad (21)
Table 2: Statistics reported based on long-exposure images for all datasets. GT Img.
stands for ground truth image count, where Leg. and Illeg. stand for legible and illegible
text count, respectively.
In this work, we annotated all text instances in the extremely low-light
dataset, SID [1], and the ordinary low-light dataset, LOL [2]. SID has two
subsets: SID-Sony, captured by Sony α7S II, and SID-Fuji, captured by
Fujifilm X-T2. For this work, we included 878/810 short-exposure images
and 211/176 long-exposure images at a resolution of 4240×2832/6000×4000
from SID-Sony and SID-Fuji, respectively. The short-exposure times are 1/30 s,
1/25 s, and 1/10 s, while the corresponding reference (long-exposure) images
were captured with 100 to 300 times longer exposure, i.e., 10 to 30 seconds.
In our experiments, we converted short- and long-exposure SID images to
RGB format. The LOL dataset provides low/normal-light image pairs taken
from real scenes by controlling exposure time and ISO. There are 485 and 15
images at a resolution of 600×400 in the training and test sets, respectively.
We closely annotated text instances in the SID and LOL datasets following
the common IC15 standard. We show some samples in Figure 6. The newly
annotated datasets are named SID-Sony-Text, SID-Fuji-Text, and LOL-Text
to differentiate them from their low-light counterparts.
Figure 6: Green boxes represent legible texts, and blue boxes represent illegible texts.
The IC15 dataset was introduced in the ICDAR 2015 Robust Reading Com-
petition for incidental scene text detection and recognition. It contains 1500
scene text images at a resolution of 1280 × 720. In this study, IC15 is pri-
marily used to synthesize extremely low-light scene text images. Detailed
statistics of the text annotations for SID-Sony-Text, SID-Fuji-Text, LOL-
Text, and IC15 are shown in Table 2, where we included the statistics for
long-exposure images only for the sake of brevity. In this table, we also report
relevant statistics of the mean and standard deviation of labeled texts’ width
and height to be used by the proposed Text-Aware Copy-Paste augmen-
tation. The text annotations for SID-Sony-Text, SID-Fuji-Text, and LOL-
Text datasets will be released at https://github.com/chunchet-ng/Text-in-
the-Dark.
Moreover, we synthesized extremely low-light images based on IC15 by
using U-Net and our proposed Supervised-DCE model, respectively. To study
the difference between these two variations of image synthesis methods, we
generated a total of four sets of images by using the aforementioned two
models trained on SID-Sony and SID-Fuji, individually. Naming convention
of such synthetic datasets follows the format of “{Syn-IC15}-{Sony/Fuji}-
{v1/v2}”. “{Sony/Fuji}” is an indication of which dataset the image synthe-
sis model is trained on, while “{v1/v2}” differentiates the image synthesis
models where v1 is U-Net and v2 is our proposed Supervised-DCE model.
For instance, the synthetic images generated by a U-Net trained on SID-Sony
and SID-Fuji are named Syn-IC15-Sony-v1 and Syn-IC15-Fuji-v1, respectively, while syn-
thetic images generated by our proposed Supervised-DCE model are denoted
as Syn-IC15-Sony-v2 and Syn-IC15-Fuji-v2.
6. Experimental Results
for 200 epochs using the Adam optimizer with default parameters and a fixed
learning rate of 1e−4 . It was trained on 256 × 256 image patches with loss
weightings of ω_prox, ω_spa, ω_tvH, and ω_tvU set to 1, 20, 10, and 10, respectively.
SID-Sony-Text
Type   Method               PSNR ↑   SSIM ↑   LPIPS ↓   CRAFT ↑   PAN ↑
-      Input                -        -        -         0.057     0.026
TRAD   LIME [18]            13.870   0.135    0.873     0.127     0.057
TRAD   BIMEF [19]           12.870   0.110    0.808     0.136     0.079
ZSL    Zero-DCE [25]        10.495   0.080    0.999     0.196     0.157
ZSL    Zero-DCE++ [46]      12.368   0.076    0.982     0.218     0.162
UL     EnlightenGAN [23]    15.800   0.654    0.300     0.343     0.125
SL     Pix2Pix [21]         20.581   0.771    0.247     0.353     0.129
SL     ChebyLighter [27]    19.820   0.769    0.199     0.353     0.176
SL     FECNet [31]          20.432   0.787    0.231     0.378     0.229
SL     IAT [30]             20.437   0.772    0.234     0.421     0.188
SL     ELIE STR [45]        19.782   0.824    0.167     0.462     0.235
SL     Ours                 21.330   0.828    0.163     0.474     0.294
-      GT                   -        -        -         0.439     0.205
Table 3: Quantitative results of PSNR, SSIM, LPIPS, and text detection H-Mean for
low-light image enhancement methods on SID-Sony-Text, SID-Fuji-Text, and LOL-Text
datasets. Please note that TRAD, ZSL, UL, and SL stand for traditional methods, zero-
shot learning, unsupervised learning, and supervised learning respectively. Scores in bold
are the best of all.
[Qualitative comparison figure; panels: (a) Low-Light, (b) LIME [18], (c) BIMEF [19], (d) Zero-DCE [25], (e) Zero-DCE++ [46], (f) SCI [26], (g) CycleGAN [22], (h) EnlightenGAN [23], (i) RetinexNet [2], (j) Pix2Pix [21], (k) ChebyLighter [27], (l) FECNet [31], (m) IAT [30], (n) ELIE STR [45], (o) Ours, (p) Ground Truth.]
used LOL dataset, which is relatively brighter than the SID dataset, as de-
picted in Table 1. Interestingly, we found that our enhanced images achieved
the best detection results on LOL-Text among existing methods, as shown in
Table 3. Surprisingly, despite the lower resolution (600x400) of the images in
LOL, our method’s enhanced images with sharper and crisper low-level de-
tails surpassed the ground truth images’ H-Mean scores. Qualitative results
on the LOL-Text dataset are illustrated in the supplementary material. Al-
though certain methods yielded output images with acceptable image quality
(i.e., bright images without color shift), their text detection results were in-
ferior to ours. Furthermore, our superior results on the LOL-Text dataset
emphasize our method’s ability to generalize well on both ordinary and ex-
tremely low-light images, effectively enhancing a broader range of low-light
images while making the text clearly visible.
synthesized images by comparing them with the genuine ones along with the
average perceptual lightness in CIELAB color space. The comparison was
made on two SID datasets, SID-Sony and SID-Fuji.
In Table 4, we show that v2’s PSNR and SSIM are higher than v1’s, in-
dicating higher similarity between our synthesized and genuine images. Our
new method (v2) also exhibits closer Avg. L* values and H-Mean scores to
the genuine images than v1, indicating darker and more accurate deterio-
ration of fine text details. In addition, qualitative results for the proposed
Supervised-DCE model and results of synthetic IC15 datasets including Syn-
IC15-Sony-v1, Syn-IC15-Sony-v2, Syn-IC15-Fuji-v1, and Syn-IC15-Fuji-v2
are presented in the supplementary material for comprehensive analyses.
Table 4: The difference between genuine extremely low-light dataset, SID, and synthetic
extremely low-light images generated using U-Net (v1) and Supervised-DCE (v2). Please
note that synthetic images’ PSNR and SSIM values are based on comparison against
genuine low-light images in the test set instead of pure black images calculated in Table
1. Additionally, the v2 images are more realistic and darker, i.e., closer to genuine
extremely low-light images, as indicated by their higher PSNR and SSIM values along
with closer Avg. L*.
The trained models were evaluated on their respective genuine low-light
datasets. Results in Table 5 showed a significant increase in H-Mean, and we
found that both versions (v1 and v2) can fill the gap caused by the scarcity
of genuine low-light images. This justifies the creation of a synthetic IC15
dataset for such a purpose. Furthermore, v2-images, i.e., extremely low-
light images synthesized by our proposed Supervised-DCE, further pushed
the limit of H-mean scores on genuine extremely low-light images, and our
enhancement model benefited the most because it could learn more from text
instances and reconstruct necessary details to represent texts. Despite our
method’s success, a noticeable gap exists between our results and the ground
truth, emphasizing the need for further research and development to achieve
even more accurate and reliable scene text extraction in low-light conditions.
Table 5: Text detection H-Mean on genuine extremely low-light datasets when trained on
a combination of genuine and synthetic datasets. Scores in bold are the best of all.
Proposed Modules Image Quality H-Mean
Text-CP Dual Encoder Edge-Att Edge Decoder PSNR ↑ SSIM ↑ LPIPS ↓ CRAFT ↑ PAN ↑
- - - - 21.847 0.698 0.783 0.283 0.205
✓ - - - 21.263 0.658 0.771 0.304 0.252
✓ ✓ - - 20.597 0.655 0.780 0.335 0.261
✓ ✓ ✓ - 21.440 0.669 0.776 0.342 0.256
✓ ✓ - ✓ 21.588 0.674 0.779 0.353 0.285
✓ - ✓ ✓ 23.074 0.712 0.783 0.350 0.281
- ✓ ✓ ✓ 24.192 0.738 0.784 0.356 0.292
✓ ✓ ✓ ✓ 25.596 0.751 0.751 0.368 0.298
Table 6: Ablation study of proposed modules in terms of PSNR, SSIM, LPIPS, and text
detection H-Mean on the SID-Sony-Text dataset. Scores in bold are the best of all.
Further analyses of Edge-Att and Text-CP are included in the supplementary
material to study their effectiveness compared to their original counterparts.
7. Conclusion
References
[1] C. Chen, Q. Chen, J. Xu, V. Koltun, Learning to see in the dark, in:
CVPR, 2018.
[6] L. Tao, C. Zhu, J. Song, T. Lu, H. Jia, X. Xie, Low-light image enhance-
ment using cnn and bright channel prior, in: ICIP, 2017.
[9] M. Gharbi, J. Chen, J. Barron, S. W. Hasinoff, F. Durand, Deep bilat-
eral learning for real-time image enhancement, ACM Transactions on
Graphics 36 (4) (2017) 1–12.
[17] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention
network for scene segmentation, in: CVPR, 2019.
[18] X. Guo, Y. Li, H. Ling, Lime: Low-light image enhancement via illumi-
nation map estimation, IEEE Transactions on Image Processing 26 (2)
(2016) 982–993.
[25] C. G. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, R. Cong, Zero-
reference deep curve estimation for low-light image enhancement, in:
CVPR, 2020.
[26] L. Ma, T. Ma, R. Liu, X. Fan, Z. Luo, Toward fast, flexible, and robust
low-light image enhancement, in: CVPR, 2022, pp. 5637–5646.
[30] Z. Cui, K. Li, L. Gu, S. Su, P. Gao, Z. Jiang, Y. Qiao, T. Harada, You
only need 90k parameters to adapt light: a light weight transformer for
image enhancement and exposure correction, in: BMVC, 2022.
[33] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, C. Shen,
Efficient and accurate arbitrary-shaped text detection with pixel aggre-
gation network, in: ICCV, 2019.
[35] C.-K. Ch’ng, C. S. Chan, C.-L. Liu, Total-text: toward orientation ro-
bustness in scene text detection, International Journal on Document
Analysis and Recognition (IJDAR) 23 (1) (2020) 31–52.
[38] Y. Liu, M.-M. Cheng, X. Hu, J. Bian, L. Zhang, X. Bai, J. Tang, Richer
convolutional features for edge detection, IEEE Transactions on Pattern
Analysis and Machine Intelligence 41 (8) (2019) 1939–1946.
[41] H. Liu, F. Liu, X. Fan, D. Huang, Polarized self-attention: Towards
high-quality pixel-wise regression, arXiv preprint arXiv:2107.00782
(2021).
[45] P.-H. Hsu, C.-T. Lin, C. C. Ng, J. L. Kew, M. Y. Tan, S.-H. Lai, C. S.
Chan, C. Zach, Extremely low-light image enhancement with scene text
restoration, in: ICPR, 2022.
[46] C. Li, C. Guo, C. C. Loy, Learning to enhance low-light image via zero-
reference deep curve estimation, IEEE Transactions on Pattern Analysis
and Machine Intelligence 44 (8) (2021) 4225–4238.
Supplementary Material: Text in the Dark: Extremely
Low-Light Text Image Enhancement
Che-Tsung Lina,1 , Chun Chet Ngb,1 , Zhi Qin Tanb , Wan Jun Nahb , Xinyu
Wangd , Jie Long Kewb , Pohao Hsuc , Shang Hong Laic , Chee Seng Chanb,∗,
Christopher Zacha
a Chalmers University of Technology, Gothenburg, Sweden
b Universiti Malaya, Kuala Lumpur, Malaysia
c National Tsing Hua University, Hsinchu, Taiwan
d The University of Adelaide, Adelaide, Australia
1. Pseudocode of Text-CP
∗ Corresponding Author. Email address: cs.chan@um.edu.my (Chee Seng Chan)
1 These authors contributed equally to this work.
[Two qualitative comparison figures; panels in each: (a) Low-Light, (b) Zero-DCE, (c) Zero-DCE++, (d) SCI, (e) CycleGAN, (j) IAT, (k) ELIE STR, (l) Ours, (m) Ground Truth.]
3. Qualitative Results of the Proposed Supervised-DCE Model
In this assessment, we first cropped the texts based on the detection re-
sults using a constraint of Intersection over Union (IoU) greater than 0.5 with
the ground truth bounding boxes. Subsequently, we performed text recogni-
tion (using TRBA and ASTER) on the cropped detected text regions. Given
that the quality of text detection plays a crucial role in text spotting, our pro-
posed method’s success in outperforming the runner-up method, ELIE STR,
by a significant margin, highlights the effectiveness of our approach in pre-
serving fine details within text regions while enhancing low-light images. The
complete quantitative results are presented in Table 1.
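A minimal sketch of this two-stage evaluation protocol is given below, assuming axis-aligned boxes (x1, y1, x2, y2) and a caller-supplied recognizer (e.g., TRBA or ASTER); the function names and the axis-aligned simplification are ours.

```python
# Sketch of the two-stage text spotting evaluation: a detection is matched to a
# ground truth box when IoU > 0.5; matched crops are passed to the recognizer,
# while unmatched detections are counted as "nil".
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def spot_texts(image, detections, gt_boxes, recognize_fn, iou_thresh=0.5):
    results = []
    for det in detections:
        if max((iou(det, gt) for gt in gt_boxes), default=0.0) > iou_thresh:
            x1, y1, x2, y2 = map(int, det)
            results.append(recognize_fn(image[y1:y2, x1:x2]))   # crop, then recognize
        else:
            results.append("nil")                               # failed detection
    return results
```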
As for the qualitative results of the two-stage text spotting task, we
present comprehensive results in Figure 4 and Figure 5. In the first row,
we display the entire image to provide an overview of the scene. In the sub-
sequent row, we zoom in on regions of interest that contain text instances,
allowing for a closer examination of text spotting performance. Please note
that text recognition is performed only when the IoU of detected boxes with
respect to ground truth bounding boxes is larger than 0.5. Any recognition
rightly, solar, compactor bigbelly, solar, compactor you, solar, compactor
Figure 3: Text detection results (CRAFT) and text recognition results (ASTER) of the
enhanced images (genuine SID-{Sony, Fuji}, Syn-SID-{Sony, Fuji}-v1, and Syn-SID-{Sony,
Fuji}-v2). (a) a genuine SID-Sony low-light image and its enhanced counterpart with
detection and recognition results; (b) a Syn-SID-Sony-v1 low-light image and its enhanced
counterpart with detection and recognition results; (c) a Syn-SID-Sony-v2 low-light image
and its enhanced counterpart with detection and recognition results. Text recognition
results of each detection box are listed from top-to-bottom then left to right in the captions.
results below this threshold are considered failed text recognition and labeled
as “nil”. Additionally, correctly predicted words are colored in green, while
incorrect text recognition results are indicated in red.
Existing methods often suffer from introducing excessive noise and arti-
facts when applying global and local image enhancement techniques simul-
taneously. As a result, enhanced images produced by these methods exhibit
high pixelation and visual distortions. In contrast, our model generates en-
hanced images that are significantly less noisy and exhibit fewer artifacts,
even in local text regions. While it is true that some text recognition results
on our enhanced images may not be perfect, our results consistently exhibit
the closest resemblance to ground truth compared to other methods.
In summary, our model can generate globally well-enhanced images given
extremely low-light images while retaining important local details such as
text features, leading to better text detection and recognition results.
Image Quality H-Mean Case-Insensitive Accuracy
Type Method
PSNR ↑ SSIM ↑ LPIPS ↓ C ↑ P ↑ C+T ↑ C+A ↑ P+T ↑ P+A ↑
Input - - - 0.355 0.192 0.114 0.118 0.062 0.067
Syn-IC15-Sony-v1
Zero-DCE ++ [? ] 13.505 0.575 0.587 0.379 0.329 0.101 0.093 0.089 0.079
ZSL
SCI [? ] 14.195 0.589 0.486 0.479 0.443 0.154 0.132 0.133 0.121
CycleGAN [? ] 23.420 0.720 0.397 0.428 0.458 0.119 0.144 0.122 0.138
UL
EnlightenGAN [? ] 21.030 0.661 0.469 0.458 0.461 0.140 0.157 0.123 0.139
ELIE STR [? ] 28.410 0.840 0.253 0.622 0.631 0.193 0.219 0.197 0.221
SL
Ours 28.568 0.859 0.248 0.660 0.662 0.222 0.256 0.221 0.248
GT - - - 0.800 0.830 0.526 0.584 0.555 0.591
Input - - - 0.198 0.068 0.046 0.047 0.018 0.019
Syn-IC15-Sony-v2
Zero-DCE ++ [? ] 8.189 0.063 0.753 0.259 0.238 0.059 0.049 0.051 0.044
ZSL
SCI [? ] 9.055 0.111 0.711 0.266 0.278 0.066 0.052 0.066 0.055
CycleGAN [? ] 20.212 0.768 0.404 0.443 0.429 0.113 0.114 0.104 0.107
UL
EnlightenGAN [? ] 19.610 0.753 0.440 0.477 0.456 0.116 0.114 0.099 0.104
ELIE STR [? ] 22.724 0.792 0.339 0.570 0.566 0.126 0.136 0.122 0.130
SL
Ours 22.938 0.809 0.318 0.593 0.577 0.138 0.157 0.134 0.143
GT - - - 0.800 0.830 0.526 0.584 0.555 0.591
Input - - - 0.347 0.175 0.100 0.102 0.097 0.095
Syn-IC15-Fuji-v1
Zero-DCE ++ [? ] 13.124 0.569 0.591 0.379 0.323 0.099 0.092 0.094 0.083
ZSL
SCI [? ] 15.426 0.615 0.599 0.468 0.463 0.144 0.129 0.139 0.124
CycleGAN [? ] 24.476 0.797 0.377 0.406 0.371 0.109 0.113 0.097 0.104
UL
EnlightenGAN [? ] 25.923 0.803 0.364 0.420 0.406 0.115 0.121 0.104 0.106
ELIE STR [? ] 27.528 0.837 0.264 0.646 0.645 0.199 0.216 0.193 0.200
SL
Ours 28.437 0.856 0.248 0.668 0.668 0.218 0.234 0.211 0.217
GT - - - 0.800 0.830 0.526 0.584 0.555 0.591
Input - - - 0.133 0.028 0.023 0.022 0.006 0.009
Syn-IC15-Fuji-v2
Zero-DCE ++ [? ] 8.793 0.134 0.708 0.193 0.137 0.040 0.033 0.028 0.025
ZSL
SCI [? ] 8.071 0.069 0.789 0.213 0.128 0.048 0.042 0.034 0.030
CycleGAN [? ] 23.920 0.733 0.367 0.397 0.386 0.070 0.074 0.066 0.073
UL
EnlightenGAN [? ] 24.600 0.740 0.389 0.415 0.396 0.089 0.091 0.081 0.084
ELIE STR [? ] 25.536 0.819 0.301 0.556 0.546 0.121 0.133 0.117 0.127
SL
Ours 25.999 0.830 0.283 0.598 0.583 0.143 0.152 0.128 0.144
GT - - - 0.800 0.830 0.526 0.584 0.555 0.591
Table 1: Quantitative results of PSNR, SSIM, LPIPS, and text detection H-Mean for
low-light image enhancement methods on the Syn-IC15-Sony-v1, Syn-IC15-Sony-v2, Syn-
IC15-Fuji-v1, and Syn-IC15-Fuji-v2 datasets. Two-stage text spotting case-insensitive
word accuracies are also reported, where C, P, T, and A are CRAFT, PAN, TRBA, and
ASTER, respectively. Please note that TRAD, ZSL, UL, and SL stand for traditional
methods, zero-shot learning, unsupervised learning, and supervised learning respectively.
Scores in bold are the best of all.
Image Quality H-Mean
Model Variations
PSNR ↑ SSIM ↑ LPIPS ↓ CRAFT ↑ PAN ↑
Full Model (naive-PSA) 23.768 0.725 0.774 0.354 0.291
Full Model (naive-Copy-Paste) 22.822 0.714 0.787 0.356 0.284
Full Model 25.596 0.751 0.751 0.368 0.298
Table 2: Ablation study of our full model using different versions of PSA and Copy-Paste
on SID-Sony-Text in terms of PSNR, SSIM, LPIPS, and text detection H-Mean. Scores
in bold are the best of all.
Table 2 shows that our model trained with our proposed Text-CP scored better than the
naive-Copy-Paste. By ensuring that texts are separated when pasted onto
training image patches, our proposed method ensures that text detection
models will not be confused by overlapping texts and can generate better
text heatmaps for calculating text detection loss. This design also encour-
ages our model to focus better on text instances since, without overlapping
texts, text features will be clearer and more distinctive than other objects.
nil nil nil nil
Figure 4: Comparison of our model with other state-of-the-art methods on the Syn-IC15-
Sony-v1 (top) and Syn-IC15-Sony-v2 (bottom) datasets. PAN is used as the text
detector, while ASTER is used for text recognition. Blue boxes represent the
zoomed-in regions, while red boxes are drawn based on the predictions of PAN. Text
recognition is performed on the word images cropped by using PAN’s predictions (from
column 4a to column 4g) and ground truth text boxes (column 4h). “nil” stands for
either no recognition results or invalid prediction boxes (with IoU ≤ 0.5), which are not
considered for the subsequent recognition.
nil nil, the, for nil, out, from from, an, nil
and, cut, but ereshly, cut, and freshly, cut, from freshly, cut, fruit
Figure 5: Comparison of our model with other state-of-the-art methods on the Syn-IC15-
Fuji-v1 (top) and Syn-IC15-Fuji-v2 (bottom) datasets. PAN is used as the text
detector, while ASTER is used for text recognition. Blue boxes represent the
zoomed-in regions, while red boxes are drawn based on the predictions of PAN. Text
recognition is performed on the word images cropped by using PAN’s predictions (from
column 5a to column 5g) and ground truth text boxes (column 5h). “nil” stands for
either no recognition results or invalid prediction boxes (with IoU ≤ 0.5), which are not
considered for the subsequent recognition.