Raw or Cooked? Object Detection on RAW Images
1 Computer Vision Laboratory, Linköping University, 581 83 Linköping, Sweden
{william.ljungbergh, michael.felsberg}@liu.se
2 Zenseact, Lindholmspiren 2, 417 56 Gothenburg, Sweden
{joakim.johnander, christoffer.petersson}@zenseact.com
1 Introduction
Image sensors commonly collect RAW data in a one-channel Bayer pattern [2, 22]; these RAW images are then converted into three-channel RGB images via a camera Image Signal Processing (ISP) pipeline. This pipeline comprises a number of low-level vision functions – such as decompanding [18], demosaicing [16] (or debayering [22]), denoising, white balancing, and tone mapping [31, 40]. Each function is designed to tackle a particular phenomenon, and the pipeline as a whole is aimed at producing a visually pleasing image.
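For concreteness, the following is a minimal, illustrative Python sketch of such a pipeline, assuming a 12-bit RGGB Bayer input. The half-resolution binning that stands in for demosaicing, the white-balance gains, and the gamma value are simplifying assumptions for illustration, not the ISP of any particular camera.

import numpy as np

def toy_isp(raw, black_level=0, wb_gains=(2.0, 1.0, 1.5), gamma=2.2, bit_depth=12):
    """Toy ISP sketch: black-level subtraction, crude half-resolution
    'demosaicing' of an RGGB mosaic, white balancing, and gamma tone
    mapping. Real camera ISPs are considerably more elaborate."""
    raw = np.clip(raw.astype(np.float32) - black_level, 0, None)
    raw /= 2 ** bit_depth - 1                        # normalize to [0, 1]
    # Bin each RGGB quad into one RGB pixel (assumes even height/width).
    rgb = np.stack([
        raw[0::2, 0::2],                             # R
        0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2]),   # average of the two Gs
        raw[1::2, 1::2],                             # B
    ], axis=-1)
    rgb = rgb * np.asarray(wb_gains, dtype=np.float32)   # white balance
    return np.clip(rgb, 0.0, 1.0) ** (1.0 / gamma)       # gamma tone mapping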
In recent years, image-based computer vision tasks have seen a leap in performance due to the advent of neural networks. Most computer vision tasks – such as image classification or object detection – are based on RGB image inputs.
However, some recent works [33, 49] have considered the possibility of removing
the camera ISP and instead directly feeding the RAW image into the neural
network. The intuition is that the high flexibility of the neural network should
Fig. 1. Three qualitative examples from the PASCALRAW dataset. We show the
ground-truth (top), the RGB baseline detector (center), and the RAW RGGB detector
with a learnable Yeo-Johnson operation (bottom). Compared to the RGB baseline, our proposed RAW RGGB detector manages to detect objects even under poor lighting conditions.
enable it to approximate the camera ISP, if that is the optimal way to transform the RAW data. It is important to note that the camera ISP is in general not optimized for the downstream task, and the neural network might by itself be able to learn a more suitable transformation of the RAW data during training. One possibility is that the ISP removes information that could be crucial in adverse conditions, such as low light. Moreover, the camera ISP may introduce image content based on image priors, which can result in spurious network responses [21].
In this work, we investigate object detection on RAW data, following the hypothesis that RAW input images lead to superior detection performance. We aim to identify the minimal set of operations on the RAW data that yields performance exceeding that of traditional RGB-based detectors. Our main contributions are the following:
1. We show that naïvely feeding RAW data into an object detector leads to poor performance.
2. We propose three simple yet effective strategies to mitigate the performance
drop. The outputs of the best performing strategy – a learnable version of
the Yeo-Johnson transformation – are visualized in Figure 1.
3. We provide an empirical study on the publicly available PASCALRAW
dataset.
2 Related Work
Object detection: Object detection has been an active area of research for
many years, and has been approached in many different ways. It is common
to divide object detectors into two categories: (i) two-stage methods [11, 24, 37] that first generate proposals and then localize and classify the objects in each proposal; and (ii) one-stage detectors that either make use of a predefined set of anchors [25, 35] or make dense (anchor-free) [42, 51] predictions across the entire image.
Carion et al. [5] observed that both these categories of detectors rely on hand-
crafted post-processing steps, such as non-maximum suppression, and proposed
an end-to-end trainable object detector, DETR, that directly outputs a set of
objects. One drawback of DETR is that its convergence is slow, and several follow-up works [27, 29, 41, 43, 48, 52] have proposed schemes to alleviate this issue. All the works above share one property: they rely on RGB image data.
RAW image data: RAW image data is traditionally fed through a camera ISP
that produces an RGB image. Substantial research efforts have been devoted to the design of this ISP, usually with the aim of producing visually pleasing RGB images. A large number of works have studied the different sub-tasks, e.g.,
demosaicing [9,16,23,28], denoising [3,7,10], and tone mapping [20,34,36]. Several
recent works propose to replace the camera ISP with deep neural networks [8,19,
39,50]. More precisely, these works aim to find a mapping between RAW images
and high-quality RGB images produced by a digital single-lens reflex camera
(DSLR).
Object detection using RAW image data: In this work, we aim to train
an object detector that takes RAW images as input. We are not the first to
explore this direction. Buckler et al. [4] found that, for processing RAW data, only demosaicing and gamma correction are crucial operations. In contrast to their work, we find that these two operations can also be avoided. Yoshimura et al. [46],
Yoshimura et al. [47], and Morawski et al. [30] strive to construct a learnable
ISP that, together with an object detector, is trained for the object detection
task. Based on our experiments, we argue that the learnable ISP can also be replaced with very simple operations. Most closely related to our work is that of Hong et al. [17], who propose to only demosaic RAW images before feeding them into an object detector. In contrast to their work, we find no need for an auxiliary image construction loss or for demosaicing.
3 Method
In this section, we first introduce a strategy for downsampling RAW Bayer images (Section 3.1). This enables us to downsample high-resolution images to a size more suitable for standard computer vision pipelines while maintaining the Bayer pattern of the RAW image. In Section 3.2, we introduce the three learnable operations.
Fig. 2. Downsampling method for Bayer-pattern RAW data. Each of the colors in the
filter array of the downsampled RAW image (right) is the average over all cells in the
corresponding region in the original image with the same color (left and center). The
figure illustrates the downsampling of an original image patch of size 2d × 2d (with
d = 5 in this example), down to a patch of size 2 × 2, i.e. with a downsampling factor
d in each dimension.
Each element of the downsampled 2 × 2 patch is computed as

x_{i,j} = \frac{1}{N} \sum_{m=0}^{(d-1)/2} \sum_{n=0}^{(d-1)/2} x^{\mathrm{orig}}_{di+2m,\, dj+2n},    (1)

where x ∈ R^{2×2} is the downsampled patch, x^{orig} ∈ R^{2d×2d} is the original patch, d is the (odd) downsampling factor, N = (d+1)^2/4 is the number of elements averaged over, and i, j ∈ {0, 1}. All downsampled patches are then concatenated to form the downsampled RAW image x ∈ R^{H/d×W/d}.
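For reference, a minimal NumPy sketch of the downsampling in Eq. (1) is given below. It is our own illustration rather than a released implementation, and it assumes an odd factor d and image dimensions divisible by 2d.

import numpy as np

def bayer_downsample(raw, d):
    """Downsample a Bayer-mosaic image by an odd factor d, as in Eq. (1):
    each output cell averages only the same-color cells of its source
    region, so the RGGB pattern is preserved."""
    assert d % 2 == 1, "d must be odd to preserve the Bayer phase"
    H, W = raw.shape
    assert H % (2 * d) == 0 and W % (2 * d) == 0
    k = (d + 1) // 2              # terms per axis; N = k * k = (d + 1)**2 / 4
    out = np.zeros((H // d, W // d), dtype=np.float64)
    for i in (0, 1):              # position inside each 2 x 2 output patch
        for j in (0, 1):
            acc = np.zeros((H // (2 * d), W // (2 * d)), dtype=np.float64)
            for m in range(k):
                for n in range(k):
                    acc += raw[d * i + 2 * m :: 2 * d, d * j + 2 * n :: 2 * d]
            out[i::2, j::2] = acc / (k * k)
    return out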
It would be possible to feed the downsampled RAW image, x, directly into an object detector. There is, however, one thing to note about the first layer of the image encoder. In the standard RGB image setting, each weight in this layer is only ever applied to one modality – red, green, or blue. This enables the first layer to capture color-specific information, such as gradients from one color to another. When fed with RAW images, as described above, we can preserve the same property by ensuring that the stride of the first layer is an even number. Luckily, this is the case for the standard ResNet [14] architecture.
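As an illustration of this point, the sketch below replaces the three-channel stem of a torchvision ResNet-50 with a one-channel convolution; the concrete setup (torchvision, a 7 × 7 stem with stride 2) is an assumption for illustration. Because the stride is even, each kernel weight at a fixed spatial offset always multiplies pixels of the same Bayer color.

import torch
import torch.nn as nn
from torchvision.models import resnet50

# One-channel stem for RAW input; kernel size, stride, and padding match
# the standard ResNet stem. With an even stride (2), the input location
# hit by kernel offset (u, v) always has the same parity, i.e. the same
# Bayer color, so the first layer can learn color-specific filters.
model = resnet50(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

raw = torch.rand(1, 1, 224, 224)   # stand-in for a normalized RAW image
features = model.conv1(raw)        # shape (1, 64, 112, 112)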
[Figure 3 diagram: pipelines A, B, and C. Pipeline A passes the RAW image through the non-learnable ISP modules decompanding, demosaicing, denoising, white balancing, tone mapping, and compression; pipeline C instead applies the single learnable operation F.]
Fig. 3. Traditional (A), naïve (B), and proposed (C) detection pipelines. The traditional pipeline uses a set of common image signal processing operations, such as demosaicing, denoising, and tone mapping, and then feeds the object detector with the processed RGB images. The naïve pipeline feeds the RAW image directly into the detector, while our proposed pipeline first feeds the RAW image through a learnable non-linear operation, F, which can be viewed as being part of the end-to-end trainable object detection network.
towards the end task, we can optimize the Yeo-Johnson transformation with
respect to the end goal, rather than towards a Gaussian distribution. Inspired
by this, we define the Learnable Yeo-Johnson transformation as a point-wise
non-linear operation
F_{\mathrm{YJ}}(x) = \frac{(x + 1)^{\lambda} - 1}{\lambda},    (5)

where λ ∈ R_+ is the learnable parameter.
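A minimal PyTorch sketch of such a module is given below. The initial value λ = 0.35 matches Figure 4, while the softplus reparameterization that keeps λ strictly positive is our own assumption; the paper does not specify how the constraint λ ∈ R_+ is enforced.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableYeoJohnson(nn.Module):
    """Point-wise transform F(x) = ((x + 1)**lam - 1) / lam, Eq. (5), for
    non-negative inputs, with a single learnable parameter lam > 0."""

    def __init__(self, lam_init=0.35):
        super().__init__()
        # Inverse softplus, so that softplus(raw_lam) == lam_init at start.
        self.raw_lam = nn.Parameter(torch.log(torch.expm1(torch.tensor(lam_init))))

    def forward(self, x):
        lam = F.softplus(self.raw_lam)  # keep lam strictly positive
        return ((x + 1.0).pow(lam) - 1.0) / lam

# Usage: apply the transform to 12-bit RAW intensities; with lam = 0.35
# the outputs lie roughly in [0, 50], as in Figure 4. The gradient of the
# detection loss flows into raw_lam during end-to-end training.
layer = LearnableYeoJohnson()
raw = torch.randint(0, 2 ** 12, (1, 1, 8, 8)).float()
out = layer(raw)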
4 Experiments
4.1 Dataset
Table 1. Object detection results on the PASCALRAW dataset. The results are presented in terms of AP (higher is better), and we report the mean and standard deviation over three separate runs.
[Figure 4 plots: top-right, the parameter value of λ over 150 000 training iterations, annotated with its initial value 0.35 and final value 0.11; bottom-right, the density of RAW pixel values over [0, 4095]; left, the output activation value as a function of pixel value.]
Fig. 4. Evolution of the learnable parameter λ during the entire training (top-right), the distribution of the RAW pixel values in PASCALRAW (bottom-right), and the functional form – before and after training – of the Learnable Yeo-Johnson operation (left). In the left plot, the output activation values are shown across the full input range [0, 2^12 − 1].
5 Conclusion
Motivated by the observation that camera ISP pipelines are typically optimized towards producing visually pleasing images for the human eye, we have in this work experimented with object detection on RAW images. While naïvely feeding RAW images directly into the object detection backbone led to poor performance, we proposed three simple, learnable operations that all led to good performance. Two of these operations, the Learnable Gamma and the Learnable Yeo-Johnson, led to superior performance compared to the RGB baseline detector. Based on a qualitative comparison, the RAW detector performs better in low-light conditions than the RGB detector.
References
1. Åström, F., Zografos, V., Felsberg, M.: Density driven diffusion. In: Scandinavian
Conference on Image Analysis. pp. 718–730. Springer (2013)
2. Bayer, B.E.: Color imaging array. United States Patent 3,971,065 (1976)
3. Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05). vol. 2, pp. 60–65. IEEE (2005)
4. Buckler, M., Jayasuriya, S., Sampson, A.: Reconfiguring the imaging pipeline for computer vision. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 975–984 (2017)
5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
6. Ciufolini, I., Paolozzi, A.: Mathematical prediction of the time evolution of the COVID-19 pandemic in Italy by a Gauss error function and Monte Carlo simulations. The European Physical Journal Plus 135(4), 355 (2020)
7. Condat, L.: A simple, fast and efficient approach to denoisaicking: Joint demosaicking and denoising. In: 2010 IEEE International Conference on Image Processing. pp. 905–908. IEEE (2010)
8. Dai, L., Liu, X., Li, C., Chen, J.: AWNet: Attentive wavelet network for image ISP. In: European Conference on Computer Vision. pp. 185–201. Springer (2020)
9. Dubois, E.: Filter design for adaptive frequency-domain Bayer demosaicking. In: 2006 International Conference on Image Processing. pp. 2705–2708. IEEE (2006)
10. Foi, A., Trimeche, M., Katkovnik, V., Egiazarian, K.: Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Transactions on Image Processing 17(10), 1737–1754 (2008)
11. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 580–587 (2014)
12. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
13. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision. pp. 1026–1034 (2015)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
15. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
16. Hirakawa, K., Parks, T.W.: Adaptive homogeneity-directed demosaicing algorithm. IEEE Transactions on Image Processing 14(3), 360–369 (2005)
17. Hong, Y., Wei, K., Chen, L., Fu, Y.: Crafting object detection in very low light.
In: BMVC. vol. 1, p. 3 (2021)
18. HP, A.W., Prasetyo, H., Guo, J.M.: Autoencoder-based image companding. In: 2020 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-Taiwan). pp. 1–2. IEEE (2020)
19. Ignatov, A., Van Gool, L., Timofte, R.: Replacing mobile camera ISP with a single deep learning model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 536–537 (2020)
20. Krawczyk, G., Myszkowski, K., Seidel, H.P.: Lightness perception in tone reproduction for high dynamic range images. In: Computer Graphics Forum. vol. 24, pp. 635–646 (2005)
21. Kriesel, D.: Traue keinem scan, den du nicht selbst gefälscht hast. Mitteilungen
der Deutschen Mathematiker-Vereinigung 22(1), 30–34 (2014)
22. Langseth, R., Gaddam, V.R., Stensland, H.K., Griwodz, C., Halvorsen, P.: An evaluation of debayering algorithms on GPU for real-time panoramic video recording. In: 2014 IEEE International Symposium on Multimedia. pp. 110–115. IEEE (2014)
23. Li, X., Gunturk, B., Zhang, L.: Image demosaicing: A systematic survey. In: Visual
Communications and Image Processing 2008. vol. 6822, pp. 489–503. SPIE (2008)
24. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 2117–2125 (2017)
25. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection. In: Proceedings of the IEEE international conference on computer vision.
pp. 2980–2988 (2017)
26. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
27. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022
(2021)
28. Malvar, H.S., He, L.W., Cutler, R.: High-quality linear interpolation for demosaicing of Bayer-patterned color images. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 3, pp. iii–485. IEEE (2004)
29. Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional DETR for fast training convergence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3651–3660 (2021)
30. Morawski, I., Chen, Y.A., Lin, Y.S., Dangi, S., He, K., Hsu, W.H.: GenISP: Neural ISP for low-light machine cognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 630–639 (2022)
31. Mujtaba, N., Khan, I.R., Khan, N.A., Altaf, M.A.B.: Efficient flicker-free tone mapping of HDR videos. In: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP). pp. 01–06. IEEE (2022)
32. Olli Blom, M., Johansen, T.: End-to-end object detection on raw camera data
(2021)
33. Omid-Zohoor, A., Ta, D., Murmann, B.: PASCALRAW: Raw image database for object detection (2014)
34. Poynton, C.: Digital Video and HD: Algorithms and Interfaces. Elsevier (2012)
35. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
36. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic tone reproduction
for digital images. In: Proceedings of the 29th annual conference on Computer
graphics and interactive techniques. pp. 267–276 (2002)
37. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
38. Riechert, M.: Rawpy. https://github.com/letmaik/rawpy (2022)
39. Shekhar Tripathi, A., Danelljan, M., Shukla, S., Timofte, R., Van Gool, L.: Transform your smartphone into a DSLR camera: Learning the ISP in the wild. In: European Conference on Computer Vision. pp. 625–641. Springer (2022)
40. Suma, R., Stavropoulou, G., Stathopoulou, E.K., Van Gool, L., Georgopoulos, A., Chalmers, A.: Evaluation of the effectiveness of HDR tone-mapping operators for photogrammetric applications. Virtual Archaeology Review 7(15), 54–66 (2016)
41. Sun, Z., Cao, S., Yang, Y., Kitani, K.M.: Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3611–3620 (2021)
42. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9627–9636 (2019)
43. Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor DETR: Query design for transformer-based detector. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 2567–2575 (2022)
44. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github.com/facebookresearch/detectron2 (2019)
45. Yeo, I.K., Johnson, R.A.: A new family of power transformations to improve normality or symmetry. Biometrika 87(4), 954–959 (2000)
46. Yoshimura, M., Otsuka, J., Irie, A., Ohashi, T.: DynamicISP: Dynamically controlled image signal processor for image recognition. arXiv preprint arXiv:2211.01146 (2022)
47. Yoshimura, M., Otsuka, J., Irie, A., Ohashi, T.: Rawgment: Noise-accounted raw
augmentation enables recognition in a wide variety of environments. arXiv preprint
arXiv:2210.16046 (2022)
48. Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)
49. Zhang, X., Zhang, L., Lou, X.: A RAW image-based end-to-end object detection accelerator using HOG features. IEEE Transactions on Circuits and Systems I: Regular Papers 69(1), 322–333 (2021)
50. Zhang, Z., Wang, H., Liu, M., Wang, R., Zhang, J., Zuo, W.: Learning RAW-to-sRGB mappings with inaccurately aligned supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4348–4358 (2021)
51. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint
arXiv:1904.07850 (2019)
52. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)