TensorFlow Lite Micro
Abstract
TensorFlow Lite Micro (TFLM) is an open-source ML inference framework for running deep-learning models on
embedded systems. TFLM tackles the efficiency requirements imposed by embedded-system resource constraints
and the fragmentation challenges that make cross-platform interoperability nearly impossible. The framework
adopts a unique interpreter-based approach that provides flexibility while overcoming these challenges.
In this paper, we explain the design decisions behind TFLM and describe its implementation. We present an
evaluation of TFLM to demonstrate its low resource requirements and minimal run-time performance overheads.
• Inability to easily and portably deploy models across multiple embedded hardware architectures
• Lack of optimizations that take advantage of the underlying hardware without requiring framework developers to make platform-specific efforts
• Lack of productivity tools that connect training pipelines to deployment platforms and tools
• Incomplete infrastructure for compression, quantization, model invocation, and execution
• Minimal support features for performance profiling, debugging, orchestration, and so on
• No benchmarks that allow vendors to quantify their chip's performance in a fair and reproducible manner
• Lack of testing in real-world applications.

To address these issues, we introduce TensorFlow Lite Micro (TFLM), which mitigates the slow pace and high cost of training and deploying models to embedded hardware by emphasizing portability and flexibility. TFLM makes it easy to get TinyML applications running across architectures, and it allows hardware vendors to incrementally optimize kernels for their devices. It gives vendors a neutral platform on which to prove their performance and offers these benefits:

• Our interpreter-based approach is portable, flexible, and easily adapted to new applications and features
• We minimize external dependencies and library requirements to remain hardware agnostic
• We enable hardware vendors to provide platform-specific optimizations on a per-kernel basis without writing target-specific compilers
• We allow hardware vendors to easily integrate their kernel optimizations to ensure performance in production and comparative hardware benchmarking
• Our model-architecture framework is open to a wide machine-learning ecosystem and to the TensorFlow Lite model conversion and optimization infrastructure
• We provide benchmarks that are being adopted by industry-leading benchmark bodies such as MLPerf
• Our framework supports popular, well-maintained Google applications that are in production.

This paper makes several contributions. First, we clearly lay out the challenges of developing a machine-learning framework for embedded devices that supports the fragmented embedded ecosystem. Second, we provide design and implementation details for a system specifically created to cope with these challenges. And third, we demonstrate that an interpreter-based approach, which is traditionally viewed as a low-performance alternative to compilation, is in fact highly suitable for the embedded domain, specifically for machine learning. Because machine-learning performance is largely dictated by linear-algebra computations, the interpreter design imposes minimal run-time overhead.

2 Technical Challenges

Many issues make developing an ML framework for embedded systems particularly difficult, as discussed here.

2.1 Missing Features

Embedded platforms are defined by their tight limitations. Therefore, many advances from the past few decades that have made software development faster and easier are unavailable on these platforms because the resource tradeoffs are too expensive. Examples include dynamic memory management, virtual memory, an operating system, a standard instruction set, a file system, floating-point hardware, and other tools that seem fundamental to modern programmers (Kumar et al., 2017). Though some platforms provide a subset of these features, a framework targeting widespread adoption in this market must avoid relying on them.

2.2 Fragmented Market and Ecosystem

Many embedded-system uses require only fixed software developed alongside the hardware, usually by an affiliated team. The lack of applications capable of running on the platform is therefore much less important than it is for general-purpose computing. Moreover, backward instruction-set-architecture (ISA) compatibility with older software matters less than in mainstream systems because everything that runs on an embedded system is probably compiled from source code anyway. Thus, embedded hardware can aggressively diversify to meet power requirements, whereas even the latest x86 processor can still run instructions that are nearly three decades old (Intel, 2013).

These differences mean the pressure to converge on one or two dominant platforms or ISAs is much weaker in the embedded space, leading to fragmentation. Many ISAs have thriving ecosystems, and the benefits they bring to particular applications outweigh developers' cost of switching. Companies even allow developers to add their own ISA extensions (Waterman & Asanovic, 2019; ARM, 2019).

Matching the wide variety of embedded architectures are the numerous tool chains and integrated development environments (IDEs) that support them. Many of these systems are available only through a commercial license with the hardware manufacturer, and in cases where a customer has requested specialized instructions, they may be inaccessible to anyone else. These arrangements have no open-source ecosystem, leading to device fragmentation that prevents a lone development team from producing software that runs well on many different embedded platforms.
allocations can be discarded after that function is done, and the memory is reusable for evaluation variables. This approach also enables advanced applications to reuse the arena's function-lifetime section in between evaluation calls.

4.4.2 Memory Planner

A more complex optimization opportunity involves the space required for intermediate calculations during model evaluation. An operator may write to one or more output buffers, and later operators may read them as inputs. If the output is not exposed to the application as a model output, its contents need only remain until the last operation that needs them has finished. Its presence is also unnecessary until just before the operation that populates it executes. Memory reuse is possible by overlapping allocations that are unneeded during the same evaluation sections.

The memory allocations required over time can be visualized as rectangles (Figure 4a), where one dimension is memory size and the other is the time during which each allocation must be preserved. The overall memory can be substantially reduced if some areas are reused or compacted together. Figure 4b shows a more optimal memory layout.

[Figure 4. Intermediate allocation strategies: (a) naive; (b) bin packing.]

Memory compaction is an instance of bin packing (Martello, 1990). Calculating the perfect allocation strategy for arbitrary models without exhaustively trying all possibilities is an unsolved problem, but a first-fit decreasing algorithm (Garey et al., 1972) usually provides reasonable solutions.

In our case, this approach consists of gathering a list of all temporary allocations, including size and lifetime; sorting the list in descending order by size; and placing each allocation in the first sufficiently large gap, or at the end of the buffer if no such gap exists. We do not support dynamic shapes in the TFLM framework, so we must know at initialization all the information necessary to perform this algorithm. The "Memory Planner" (shown in Figure 2) encapsulates this process; it allows us to minimize the arena portion devoted to intermediate tensors. Doing so offers a substantial memory-use reduction for many models.
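To make the first-fit decreasing procedure concrete, the C++ sketch below plans offsets for a list of temporary buffers with known sizes and lifetimes. It is an illustrative standalone implementation of the algorithm described above, not TFLM's actual memory-planner code; the Allocation struct and PlanArena function are hypothetical names.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One temporary tensor: how many bytes it needs and during which
    // operator indices (inclusive) it must stay alive.
    struct Allocation {
      size_t size;
      int first_use;   // index of the operator that produces the buffer
      int last_use;    // index of the last operator that reads it
      size_t offset;   // assigned arena offset (output of the planner)
    };

    // Greedy first-fit decreasing planner: sort by size, then place each
    // buffer at the lowest offset that does not collide with any already
    // placed buffer whose lifetime overlaps. Returns the arena bytes needed.
    size_t PlanArena(std::vector<Allocation>& allocs) {
      std::vector<Allocation*> order;
      for (auto& a : allocs) order.push_back(&a);
      std::sort(order.begin(), order.end(),
                [](const Allocation* a, const Allocation* b) {
                  return a->size > b->size;  // largest first
                });
      size_t arena_size = 0;
      std::vector<Allocation*> placed;
      for (Allocation* a : order) {
        size_t offset = 0;
        bool retry = true;
        while (retry) {
          retry = false;
          for (const Allocation* p : placed) {
            const bool lifetimes_overlap =
                a->first_use <= p->last_use && p->first_use <= a->last_use;
            const bool space_overlaps =
                offset < p->offset + p->size && p->offset < offset + a->size;
            if (lifetimes_overlap && space_overlaps) {
              offset = p->offset + p->size;  // step past this buffer, rescan
              retry = true;
              break;
            }
          }
        }
        a->offset = offset;
        placed.push_back(a);
        arena_size = std::max(arena_size, offset + a->size);
      }
      return arena_size;
    }

Sorting by size first lets the large buffers claim low offsets early, and the smaller buffers then fill the gaps left between them, which is what produces the compacted layout of Figure 4b.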
Memory planning at run time incurs more overhead during model preparation than a preplanned memory-allocation strategy. This cost, however, comes with the benefit of model generality: TFLM models simply list their operator and tensor requirements, and we allocate memory at run time, which enables this capability for many model types.

Offline-planned tensor allocation is an alternative memory-planning feature of TFLM. It allows a more compact memory plan, gives memory-plan ownership and control to the end user, imposes less overhead on the MCU during initialization, and enables more-efficient power options by allowing different memory banks to store certain memory areas. We allow the user to create a memory layout on a host before run time. The memory layout is stored as model FlatBuffer metadata and contains an array of fixed memory-arena offsets for an arbitrary number of variable tensors.

4.5 Multitenancy

Embedded-system constraints can force application-model developers to create several specialized models instead of one large monolithic model. Hence, supporting multiple models on the same embedded system may be necessary.

If an application has multiple models that need not run simultaneously, it is possible to have two separate instances running in isolation from one another. However, this is inefficient because the temporary space cannot be reused. Instead, TFLM supports multitenancy with some memory-planner changes that are transparent to the developer. TFLM supports memory-arena reuse by enabling multiple model interpreters to allocate memory from a single arena.

We allow interpreter-lifetime areas to stack on each other in the arena and reuse the function-lifetime section for model evaluation. The reusable (nonpersistent) part is set to the largest requirement, based on all models allocating in the arena. The nonreusable (persistent) allocations grow for each model, since these allocations are model specific (Figure 5).

4.6 Multithreading

TFLM is thread-safe as long as no state corresponding to the model is kept outside the interpreter and the model's memory allocation within the arena.

The interpreter's only variables are kept in the arena, and each interpreter instance is uniquely bound to a specific model. Therefore, TFLM can safely support multiple interpreter instances running from different tasks or threads.

TFLM can also run safely on multiple MCU cores. Since the only variables used by the interpreter are kept in the arena, this works well in practice. The executable code is shared, but the arenas ensure there are no threading issues.
[Figure 5. Memory-allocation strategy for a single model versus a multi-tenancy scenario. In TFLM, there is a one-to-one binding between a model, an interpreter, and the memory allocations made for the model (which may come from a shared memory arena).]
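As a concrete illustration of the binding shown in Figure 5, the sketch below sets up one interpreter over one model and one statically declared arena. It is a minimal example assuming the commonly documented TFLM C++ entry points (tflite::GetModel, MicroMutableOpResolver, MicroInterpreter, AllocateTensors, Invoke); exact headers and constructor signatures have varied across TFLM releases, and g_model_data, kArenaSize, tensor_arena, and RunOnce are illustrative names rather than parts of the library.

    #include <cstdint>

    #include "tensorflow/lite/micro/micro_interpreter.h"
    #include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
    #include "tensorflow/lite/schema/schema_generated.h"

    // Placeholder: a FlatBuffer-serialized INT8 model compiled into the binary.
    extern const unsigned char g_model_data[];

    // Single statically allocated arena. The multitenancy support described
    // above lets several interpreters draw their allocations from one arena.
    constexpr int kArenaSize = 20 * 1024;
    alignas(16) static uint8_t tensor_arena[kArenaSize];

    TfLiteStatus RunOnce() {
      const tflite::Model* model = tflite::GetModel(g_model_data);

      // Register only the operators the model actually uses to keep code small.
      tflite::MicroMutableOpResolver<3> resolver;
      resolver.AddConv2D();
      resolver.AddFullyConnected();
      resolver.AddSoftmax();

      // One-to-one binding: this interpreter serves exactly this model, and all
      // of its persistent and nonpersistent allocations come from tensor_arena.
      tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                           kArenaSize);
      if (interpreter.AllocateTensors() != kTfLiteOk) return kTfLiteError;

      // Fill interpreter.input(0)->data.int8 with input data here, then run.
      return interpreter.Invoke();
    }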
4.7 Operator Support

Operators are the calculation units in neural-network graphs. They represent a sizable amount of computation, typically requiring many thousands or even millions of individual arithmetic operations (e.g., multiplies or additions). They are functional, with well-defined inputs, outputs, and state variables, and no side effects beyond them.

Because the model execution's latency, power consumption, and code size tend to be dominated by the implementations of these operations, they are typically specialized for particular platforms to take advantage of hardware characteristics. In practice, we attracted library optimizations from hardware vendors such as Arm, Cadence, Ceva, and Synopsys.

Well-defined operator boundaries mean it is possible to define an API that communicates the inputs and outputs but hides implementation details behind an abstraction. Several chip vendors have provided a library of neural-network kernels designed to deliver maximum performance when running on their processors. For example, Arm has provided optimized CMSIS-NN libraries divided into several functions, each covering a category: convolution, activation, fully connected layer, pooling, softmax, and optimized basic math. TFLM uses CMSIS-NN to deliver high performance, as we demonstrate in Section 5.

4.8 Platform Specialization

TFLM gives developers the flexibility to modify the library code. Because operator implementations (kernels) often consume the most time when executing models, they are prominent targets for platform-specific optimization.

We wanted to make swapping in new implementations easy. To do so, we allow specialized versions of the C++ source code to override the default reference implementation. Each kernel has a reference implementation in a directory, and subfolders contain optimized versions for particular platforms (e.g., the Arm CMSIS-NN library).

As we explain in Section 4.9, the platform-specific source files replace the reference implementations during all build steps when targeting the named platform or library (e.g., using TAGS="cmsis-nn"). Each platform is given a unique tag. The tag is a command-line argument to the build system that causes the reference kernels to be replaced during compilation. In a similar vein, library modifiers can swap or change the implementations incrementally with no changes to the build scripts and the overarching build system we put in place.

4.9 Build System

To address the embedded market's fragmentation (Section 2.2), we needed our code to compile on many platforms. We therefore wrote the code to be highly portable, with few dependencies, but portability alone was insufficient to give potential users a good experience on a particular device.

Most embedded developers employ a platform-specific IDE or tool chain that abstracts many details of building subcomponents and presents libraries as interface modules. Simply giving developers a folder hierarchy of source-code files would still leave them with multiple steps before they could build and compile that code into a usable library. Therefore, we chose a single makefile-based build system to determine which files the library required, then generated the project files for the associated tool chains. The makefile held the source-file list, and we stored the platform-specific project files as templates that the project-generation process filled in with the source-file information. That process may also perform other postprocessing to convert the source files to a format suitable for the target tool chain.

Our platform-agnostic approach has enabled us to support a variety of tool chains with minimal engineering work, but it does have some drawbacks. We implemented the project generation through an ad hoc mixture of makefile scripts and Python, which makes the process difficult to debug, maintain, and extend. Our intent is for future versions to keep the concept of a master source-file list that only the makefile holds, but to delegate the actual generation to better-structured Python in a more maintainable way.
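To make the kernel abstraction in Sections 4.7 and 4.8 concrete, the sketch below shows the general shape of a TFLM operator kernel: a registration that bundles Init, Prepare, and Invoke functions behind a fixed interface, so a platform-specific version (for example, one backed by CMSIS-NN and selected with TAGS="cmsis-nn") can replace the reference file without touching the interpreter. The structure follows TFLM's public kernel style, but the exact registration types and helpers have changed across releases, and MyOpInit, MyOpPrepare, MyOpInvoke, and Register_MY_OP are illustrative names rather than real TFLM symbols.

    #include <cstddef>

    #include "tensorflow/lite/c/common.h"

    namespace {

    // Scratch data the kernel computes once during model preparation.
    struct MyOpData {
      int output_elements;
    };

    // Init: carve persistent scratch memory out of the arena (no malloc).
    void* MyOpInit(TfLiteContext* context, const char* buffer, size_t length) {
      return context->AllocatePersistentBuffer(context, sizeof(MyOpData));
    }

    // Prepare: validate shapes and precompute anything reusable across calls.
    TfLiteStatus MyOpPrepare(TfLiteContext* context, TfLiteNode* node) {
      auto* data = static_cast<MyOpData*>(node->user_data);
      data->output_elements = 0;  // would be derived from input tensor shapes
      return kTfLiteOk;
    }

    // Invoke: the per-inference math. This is the function an optimized
    // platform-specific kernel typically replaces with, e.g., a CMSIS-NN call.
    TfLiteStatus MyOpInvoke(TfLiteContext* context, TfLiteNode* node) {
      // Reference implementation: plain C++ loops over the tensors.
      return kTfLiteOk;
    }

    }  // namespace

    // The registration that an op resolver hands to the interpreter.
    TfLiteRegistration Register_MY_OP() {
      TfLiteRegistration r = {};
      r.init = MyOpInit;
      r.prepare = MyOpPrepare;
      r.invoke = MyOpInvoke;
      return r;
    }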
5 System Evaluation

TFLM has undergone extensive testing and has been deployed widely on processors based on the Arm Cortex-M architecture (Arm, 2020). It has been ported to other architectures, including the ESP32 (Espressif, 2020) and many digital signal processors (DSPs). The framework is also available as an Arduino library, and it can generate projects for environments such as Mbed (ARM, 2020). In this section, we use two representative platforms to assess and quantify TFLM's computational and memory overheads.

5.1 Experimental Setup

Our benchmarks focus on (1) the performance benefits of optimized kernels and (2) the platforms we can support and the performance we achieve on them. We therefore focus on extreme endpoints rather than on the overall spectrum. Specifically, we evaluate two extreme hardware designs and two extreme ML models.

We evaluate two extreme hardware designs: an MCU (general purpose) and an ultra-low-power DSP (specialized). Table 1 lists the details of the two hardware platforms. The first is the Sparkfun Edge, which has an Ambiq Apollo3 MCU. The Apollo3 is powered by an Arm Cortex-M4 core and operates in burst mode at 96 MHz (Ambiq Micro, 2020). The second platform is an Xtensa HiFi Mini DSP, which is based on the Cadence Tensilica architecture (Cadence, 2020).

Table 1. Embedded-platform benchmarking.

  Platform                        Processor            Clock    Flash   RAM
  Sparkfun Edge (Ambiq Apollo3)   Arm Cortex-M4 CPU    96 MHz   1 MB    0.38 MB
  Xtensa HiFi Mini                Tensilica HiFi DSP   10 MHz   1 MB    1 MB

We evaluate two extreme ML models in terms of model size and complexity for embedded devices. We use the Visual Wake Words (VWW) person-detection model (Chowdhery et al., 2019), which represents a common microcontroller vision task: identifying whether a person appears in a given image. The model is trained and evaluated on images from the Microsoft COCO data set (Lin et al., 2014). It primarily stresses and measures the performance of convolutional operations. We also use the Google Hotword model, which aids in detecting the key phrase "OK Google." This model is designed to be small and fast enough to run constantly on a low-power DSP in smartphones and other devices with Google Assistant. Because it is proprietary, we use a version with scrambled weights and biases. Broader evaluation would be preferable, but TinyML is nascent and few benchmarks exist. The benchmarks we use are part of TinyMLPerf (Banbury et al., 2020) and are also used by MCUNet (Lin et al., 2020).

Our benchmarks are INT8 TensorFlow Lite models in a serialized FlatBuffer format. The benchmarks run multiple inputs through a single model, measuring the time to process each input and produce an inference output. The benchmark does not measure the time necessary to bring up the model and configure the run time, since the recurring inference cost dominates total CPU cycles on most long-running systems.

5.2 Benchmark Performance

We provide two sets of benchmark results. First are the baseline results from running the benchmarks on reference kernels, which are simple operator-kernel implementations designed for readability rather than performance. Second are results for optimized kernels compared with the reference kernels. The optimized versions employ the high-performance Arm CMSIS-NN and Cadence libraries (Lai et al., 2018).

The results in Table 2 are for the CPU (Table 2a) and DSP (Table 2b). The total run time appears under the "Total Cycles" column, and the run time excluding the interpreter appears under the "Calculation Cycles" column. The difference between them is the interpreter overhead. The "Interpreter Overhead" column in both Table 2a and Table 2b shows that this overhead is insignificant compared with the total model run time on both the CPU and DSP. The overhead on the microcontroller CPU (Table 2a) is less than 0.1% for long-running models such as VWW. For short-running models such as Google Hotword, the overhead is still minimal at about 3% to 4%. The same general trend holds in Table 2b for non-CPU architectures like the Xtensa HiFi Mini DSP.

Comparing the reference kernel versions to the optimized kernel versions reveals considerable performance improvement. For example, between "VWW Reference" and "VWW Optimized," the CMSIS-NN library offers more than a 4x speedup on the Cortex-M4 microcontroller. Optimization on the Xtensa HiFi Mini DSP offers a 7.7x speedup. For "Google Hotword," the optimized kernel speed on the Cortex-M4 is only 25% better than the baseline reference model because less time goes to the kernel calculations: each inner loop accounts for less of the benchmark model's total run time. On the specialized DSP, the optimized kernels have a significant impact on performance.

5.3 Memory Overhead

We assess TFLM's total memory usage. TFLM's memory usage includes the code size for the interpreter, memory allocator, memory planner, and so on, plus any operators that the model requires. Hence, the total memory usage varies greatly by model. Large models and models with complex operators (e.g., VWW) consume more memory than their smaller counterparts, such as Google Hotword. In addition to VWW and Google Hotword, in this section we added an even smaller reference convolution model containing just two convolution layers, a max-pooling layer, a dense layer, and an activation layer to emphasize the differences.
Overall, TFLM applications have a small footprint. The interpreter footprint by itself is less than 2 KB. Table 3 shows that for the convolutional and Google Hotword models, the memory consumed is at most 13 KB. For the larger VWW model, the framework consumes 26.5 KB.

To further analyze memory usage, recall that TFLM allocates program memory into two main sections: persistent and nonpersistent. Table 3 reveals that depending on the model characteristics, one section can be larger than the other. The results show that we adjust to the needs of the different models while maintaining a small footprint.

Table 3. Memory consumption on Sparkfun Edge.

  Model                       Persistent Memory   Nonpersistent Memory   Total Memory
  Convolutional Reference     1.29 kB             7.75 kB                9.04 kB
  Google Hotword Reference    12.12 kB            680 bytes              12.80 kB
  VWW Reference               26.50 kB            55.30 kB               81.79 kB
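Developers sizing an arena for their own models can approximate these nonpersistent and persistent totals at run time. A minimal sketch, assuming the arena_used_bytes() accessor present in recent TFLM releases (its name and availability may differ by version), with the interpreter constructed as in the earlier example:

    #include <cstddef>

    #include "tensorflow/lite/micro/micro_interpreter.h"

    // After AllocateTensors() succeeds, report how much of the statically
    // declared arena the model actually claimed, so the arena can be trimmed
    // to the measured value plus a safety margin in the next build.
    bool ArenaFitsBudget(tflite::MicroInterpreter& interpreter,
                         size_t declared_arena_bytes, size_t safety_margin,
                         size_t* used_out) {
      const size_t used = interpreter.arena_used_bytes();
      if (used_out != nullptr) *used_out = used;
      return used + safety_margin <= declared_arena_bytes;
    }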
5.4 Benchmarking and Profiling

TFLM provides a set of benchmarks and profiling APIs (TensorFlow, 2020c) to compare hardware platforms and to let developers measure performance and identify opportunities for optimization. Benchmarks provide a consistent and fair way to measure hardware performance. MLPerf (Reddi et al., 2020; Mattson et al., 2020) adopted the TFLM benchmarks, and the tinyMLPerf benchmark suite imposes accuracy metrics for them (Banbury et al., 2020).

Although benchmarks measure performance, profiling is necessary to gain useful insights into model behavior. TFLM has hooks that let developers instrument specific code sections (TensorFlow, 2020d). These hooks allow a TinyML application developer to measure the overhead of using a general-purpose interpreter rather than a custom neural-network engine for a specific model, and to examine a model's performance-critical paths. These features allow identification, profiling, and optimization of bottleneck operators.
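A minimal profiling sketch in the spirit of those hooks, assuming the MicroProfiler class shipped with recent TFLM releases (the MicroInterpreter constructor argument list and the profiler's logging methods have varied across versions, so treat the exact calls as illustrative):

    #include <cstddef>
    #include <cstdint>

    #include "tensorflow/lite/micro/micro_interpreter.h"
    #include "tensorflow/lite/micro/micro_op_resolver.h"
    #include "tensorflow/lite/micro/micro_profiler.h"

    // Attach a profiler so each operator invocation is timed, then dump the
    // per-operator timings after a single inference.
    TfLiteStatus ProfileOneInference(const tflite::Model* model,
                                     tflite::MicroOpResolver& resolver,
                                     uint8_t* arena, size_t arena_size) {
      tflite::MicroProfiler profiler;
      tflite::MicroInterpreter interpreter(model, resolver, arena, arena_size,
                                           /*resource_variables=*/nullptr,
                                           &profiler);
      if (interpreter.AllocateTensors() != kTfLiteOk) return kTfLiteError;
      const TfLiteStatus status = interpreter.Invoke();
      profiler.Log();  // print per-operator timing events
      return status;
    }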
6 Related Work

There are a number of compiler frameworks for inference on TinyML systems. Examples include Microsoft's ELL (Microsoft, 2020), a cross-compiler tool chain that enables users to run ML models on resource-constrained platforms similar to those we have evaluated. Graph Lowering (GLOW) (Rotem et al., 2018) is an open-source compiler that accelerates neural-network performance across a range of hardware platforms. STM32Cube.AI (STMicroelectronics, 2020) takes models from Keras, TensorFlow Lite, and others to generate code optimized for a range of STM32-series MCUs. TinyEngine (Lin et al., 2020) is a code-generator-based compiler that helps eliminate memory overhead for MCU deployments. TVM (Chen et al., 2018) is an open-source ML compiler for CPUs, GPUs, and ML accelerators that has been ported to the Cortex-M7 and other MCUs. uTensor (uTensor, 2020), a precursor to TFLM, consists of an offline tool that translates a TensorFlow model into Arm microcontroller C++ machine code, along with a run time for execution management.

In contrast to all of these related works, TFLM adopts a unique interpreter-based approach for flexibility. An interpreter-based approach provides an alternative design point for others to consider when engineering their inference systems to address the ecosystem challenges (Section 2).

7 Conclusion

TFLM enables the transfer of deep learning onto embedded systems, significantly broadening the reach of ML. TFLM is a framework that has been specifically engineered to run machine learning effectively and efficiently on embedded devices with only a few kilobytes of memory. TFLM's fundamental contributions are the design decisions that we made to address the unique challenges of embedded systems: hardware heterogeneity in a fragmented ecosystem, missing software features, and severe resource constraints.
Acknowledgements

TFLM is a community-based, open-source project. As such, it rests on the work of many. We extend our gratitude to many individuals, teams, and organizations: Fredrik Knutsson and the CMSIS-NN team; Rod Crawford and Matthew Mattina from Arm; Raj Pawate from Cadence; Erich Plondke and Evgeni Gousef from Qualcomm; Jamie Campbell from Synopsys; Yair Siegel from Ceva; Sai Yelisetty from DSP Group; Zain Asgar from Stanford; Dan Situnayake from Edge Impulse; Neil Tan from the uTensor project; Sarah Sirajuddin, Rajat Monga, Jeff Dean, Andy Selle, Tim Davis, Megan Kacholia, Stella Laurenzo, Benoit Jacob, Dmitry Kalenichenko, Andrew Howard, Aakanksha Chowdhery, and Lawrence Chan from Google; and Radhika Ghosal, Sabrina Neuman, Mark Mazumder, and Colby Banbury from Harvard University.

References
Banbury, C. R., Reddi, V. J., Lam, M., Fu, W., Fazel, A., Holleman, J., Huang, X., Hurtado, R., Kanter, D., Lokhmotov, A., et al. Benchmarking TinyML systems: Challenges and direction. arXiv preprint arXiv:2003.04821, 2020.

Cadence. Tensilica HiFi DSP Family, 2020. URL https://ip.cadence.com/uploads/928/TIP_PB_HiFi_DSP_FINAL-pdf.

Chavarriaga, R., Sagha, H., Calatroni, A., Digumarti, S. T., Tröster, G., Millán, J. d. R., and Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013.

Chen, G., Parada, C., and Heigold, G. Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091. IEEE, 2014.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, 2018.

Chollet, F. et al. Keras, 2015. URL https://keras.io/.

Chowdhery, A., Warden, P., Shlens, J., Howard, A., and Rhodes, R. Visual Wake Words dataset. arXiv preprint arXiv:1906.05721, 2019.

Espressif. Espressif ESP32, 2020. URL https://www.espressif.com/en/products/socs/esp32.

Intel. Intel 64 and IA-32 architectures software developer's manual. Volume 3A: System Programming Guide, Part 1, 2013.

Koizumi, Y., Saito, S., Uematsu, H., Harada, N., and Imoto, K. ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 308–312, November 2019. URL https://ieeexplore.ieee.org/document/8937164.

Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

Kumar, A., Goyal, S., and Varma, M. Resource-efficient machine learning in 2 KB RAM for the Internet of Things. In International Conference on Machine Learning, pp. 1935–1944, 2017.
Lai, L., Suda, N., and Chandra, V. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601, 2018.

Lin, J., Chen, W.-M., Lin, Y., Cohn, J., Gan, C., and Han, S. MCUNet: Tiny deep learning on IoT devices. arXiv preprint arXiv:2007.10319, 2020.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.

Martello, S. Chapter 8: Bin packing. In Knapsack Problems: Algorithms and Computer Implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1990.

Mattson, P., Reddi, V. J., Cheng, C., Coleman, C., Diamos, G., Kanter, D., Micikevicius, P., Patterson, D., Schmuelling, G., Tang, H., et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro, 40(2):8–16, 2020.

Microsoft. Embedded Learning Library, 2020. URL https://microsoft.github.io/ELL/.

Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., et al. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 446–459. IEEE, 2020.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907, 2018.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

TensorFlow. TensorFlow Lite Guide, 2020b. URL https://www.tensorflow.org/lite/guide.

TensorFlow. TensorFlow Lite Micro Benchmarks, 2020c. URL https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/benchmarks.

TensorFlow. TensorFlow Lite Micro Profiler, 2020d. URL https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/micro_profiler.cc.

TensorFlow. TensorFlow Core Ops, 2020e. URL https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/ops.pbtxt.

uTensor. uTensor, 2020. URL https://github.com/uTensor/uTensor.

Waterman, A. and Asanovic, K. The RISC-V instruction set manual, Volume I: Unprivileged ISA, document version 20190608-Base-Ratified. RISC-V Foundation, Tech. Rep., 2019.

Wu, X., Lee, I., Dong, Q., Yang, K., Kim, D., Wang, J., Peng, Y., Zhang, Y., Saliganc, M., Yasuda, M., et al. A 0.04 mm3 16 nW wireless and batteryless sensor system with integrated Cortex-M0+ processor and optical communication for cellular temperature measurement. In 2018 IEEE Symposium on VLSI Circuits, pp. 191–192. IEEE, 2018.

Zhang, M. and Sawchuk, A. A. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1036–1043, 2012.

Zhang, Y., Suda, N., Lai, L., and Chandra, V. Hello Edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.