Algorithmic Theories of Everything
Version 2.0; 20 Dec 2000
Minor revision of Version 1.0 [75], quant-ph/0011122
Jürgen Schmidhuber
IDSIA, Galleria 2, 6928 Manno (Lugano), Switzerland
juergen@idsia.ch - http://www.idsia.ch/~juergen
Abstract
The probability distribution P from which the history of our universe is sampled
represents a theory of everything or TOE. We assume P is formally describable. Since
most (uncountably many) distributions are not, this imposes a strong inductive bias.
We show that P (x) is small for any universe x lacking a short description, and study the
spectrum of TOEs spanned by two P s, one reflecting the most compact constructive
descriptions, the other the fastest way of computing everything. The former derives
from generalizations of traditional computability, Solomonoff’s algorithmic probability,
Kolmogorov complexity, and objects more random than Chaitin’s Omega, the latter
from Levin’s universal search and a natural resource-oriented postulate: the cumula-
tive prior probability of all x incomputable within time t by this optimal algorithm
should be 1/t. Between both P s we find a universal cumulatively enumerable mea-
sure that dominates traditional enumerable measures; any such CEM must assign low
probability to any universe lacking a short enumerating program. We derive P -specific
consequences for evolving observers, inductive reasoning, quantum physics, philosophy,
and the expected duration of our universe.
Note: This is a slightly revised version of a recent preprint [75]. The essential results should
be of interest from a purely theoretical point of view independent of the motivation through
formally describable universes. To get to the meat of the paper, skip the introduction and go
immediately to Subsection 1.1 which provides a condensed outline of the main theorems.
Contents
1 Introduction to Describable Universes 3
1.1 Outline of Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Preliminaries 7
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Turing Machines: Monotone TMs (MTMs), General TMs (GTMs), Enumerable Output Machines (EOMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Infinite Computations, Convergence, Formal Describability . . . . . . . . . . 9
2.4 Formally Describable Functions . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Weak Decidability and Convergence Problem . . . . . . . . . . . . . . . . . . 11
6 Temporal Complexity 27
6.1 Fast Computation of Finite and Infinite Strings . . . . . . . . . . . . . . . . 27
6.2 FAST: The Most Efficient Way of Computing Everything . . . . . . . . . . . 28
6.3 Speed-Based Characterization of the Describable . . . . . . . . . . . . . . . . 29
6.4 Enumerable Priors vs FAST . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.5 Speed Prior S and Algorithm GUESS . . . . . . . . . . . . . . . . . . . . . 31
6.6 Speed Prior-Based Inductive Inference . . . . . . . . . . . . . . . . . . . . . 32
6.7 Practical Applications of Algorithm GUESS . . . . . . . . . . . . . . . . . . 33
8 Concluding Remarks 42
What about our universe, or more precisely, its entire past and future history? Is it
individually describable by a finite sequence of bits, just like a movie stored on a compact
disc, or a never ending evolution of a virtual reality determined by a finite algorithm? If so,
then it is very special in a certain sense, just like the comparatively few describable reals are
special.
Example 1.2 (In which universe am I?) Let h(y) represent a property of any possibly
infinite bitstring y, say, h(y) = 1 if y represents the history of a universe inhabited by a
particular observer (say, yourself) and h(y) = 0 otherwise. According to the weak anthropic
principle [24, 4], the conditional probability of finding yourself in a universe compatible with
your existence equals 1. But there may be many y’s satisfying h(y) = 1. What is the
probability that y = x, where x is a particular universe satisfying h(x) = 1? According to
Bayes,
P(x = y \mid h(y) = 1) = \frac{P(h(y) = 1 \mid x = y)\, P(x = y)}{\sum_{z: h(z)=1} P(z)} \propto P(x)   (1)
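As a toy numerical illustration of Equation (1), the following sketch computes the posterior over a handful of hypothetical candidate universes; the prior values and the predicate h are invented solely for this example and are not taken from the paper.

    # Toy illustration of Equation (1): the posterior over universes compatible with h
    # is proportional to the prior P. All values below are made up for illustration.
    P = {"u1": 0.50, "u2": 0.25, "u3": 0.15, "u4": 0.10}   # hypothetical prior
    h = {"u1": 1, "u2": 0, "u3": 1, "u4": 1}               # 1 = observer exists in this universe

    Z = sum(P[z] for z in P if h[z] == 1)                  # normalizer: sum over z with h(z) = 1
    posterior = {x: P[x] / Z for x in P if h[x] == 1}
    print(posterior)   # {'u1': 0.666..., 'u3': 0.2, 'u4': 0.133...}: proportional to P(x)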
Each prior P stands for a particular “theory of everything” or TOE. Once we know
something about P we can start making informed predictions. Parts of this paper deal with
the question: what are plausible properties of P ? One very plausible assumption is that P is
approximable for all finite prefixes x̄ of x in the following sense. There exists a possibly never halting computer which outputs a sequence of numbers T(t, x̄) at discrete times t = 1, 2, ... in response to input x̄ such that for each real ε > 0 there exists a finite time t_0 such that for all t ≥ t_0:

|P(x̄) − T(t, x̄)| < ε.   (2)
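A minimal sketch of approximability in the sense of Equation (2): a process that never halts, but whose successive outputs T(t, x̄) eventually stay within any ε of the target. The target here (ln 2 via the alternating harmonic series) is an arbitrary stand-in for P(x̄), chosen only so that the sketch is runnable.

    import math

    def approximations():
        """Never-halting approximation: yields T(1), T(2), ... converging to ln 2."""
        estimate, sign, t = 0.0, 1.0, 1
        while True:                          # conceptually runs forever
            estimate += sign / t             # alternating harmonic series -> ln 2
            sign, t = -sign, t + 1
            yield estimate

    eps = 1e-3
    for t, T_t in enumerate(approximations(), start=1):
        if abs(math.log(2) - T_t) < eps:     # for this alternating series the error only shrinks
            print(f"within eps for all t >= {t}; current estimate {T_t:.5f}")
            break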
Approximability in this sense is essentially equivalent to formal describability (Lemma 2.1
will make this more precise). We will show (Section 5) that the mild assumption above adds
enormous predictive power to the weak anthropic principle: it makes universes describable
by short algorithms immensely more likely than others. Any particular universe evolution
is highly unlikely if it is determined not only by simple physical laws but also by additional
truly random or noisy events. To a certain extent, this will justify “Occam’s razor” (e.g.,
[11]) which expresses the ancient preference of simple solutions over complex ones, and which
is widely accepted not only in physics and other inductive sciences, but even in the fine arts
[74].
All of this will require an extension of earlier work on Solomonoff’s algorithmic probabil-
ity, universal priors, Kolmogorov complexity (or algorithmic information), and their refine-
ments [50, 82, 26, 100, 52, 54, 35, 27, 36, 77, 83, 28, 5, 29, 93, 56]. We will prove several
theorems concerning approximable and enumerable objects and probabilities (Sections 2-5;
see outline below). These theorems shed light on the structure of all formally describable
objects and extend traditional computability theory; hence they should also be of interest
without motivation through describable universes.
The calculation of the subjects of these theorems, however, may occasionally require
excessive time, itself often not even computable in the classic sense. This will eventually
motivate a shift of focus on the temporal complexity of “computing everything” (Section
6). If you were to sit down and write a program that computes all possible universes, which
would be the best way of doing so? Somewhat surprisingly, a modification of Levin Search
[53] can simultaneously compute all computable universes in an interleaving fashion that
outputs each individual universe as quickly as its fastest algorithm running just by itself,
save for a constant factor independent of the universe’s size. This suggests a more restricted
TOE that singles out those infinite universes computable with countable time and space
resources, and a natural resource-based prior measure S on them. Given this “speed prior”
S, we will show that the most likely continuation of a given observed history is computable
by a fast and short algorithm (Section 6.6).
The S-based TOE will provoke quite specific prophecies concerning our own universe
(Section 7.5). For instance, the probability that it will last 2^n times longer than it has lasted so far is at most 2^{-n}. Furthermore, all apparently random events, such as beta decay or
collapses of Schrödinger’s wave function of the universe, actually must exhibit yet unknown,
possibly nonlocal, regular patterns reflecting subroutines (e.g., pseudorandom generators) of
our universe’s algorithm that are not only short but also fast.
1.1 Outline of Main Results
Some of the novel results herein may be of interest to theoretical computer scientists and
mathematicians (Sections 2-6), some to researchers in the fields of machine learning and
inductive inference (the science of making predictions based on observations, e.g., 6-7), some
to physicists (e.g., 6-8), some to philosophers (e.g., 7-8). Sections 7-8 might help those
usually uninterested in technical details to decide whether they would also like to delve into
the more formal Sections 2-6. In what follows, we summarize the main contributions and
provide pointers to the most important theorems.
Section 2 introduces universal Turing Machines (TMs) more general than those considered
in previous related work: unlike traditional TMs, General TMs or GTMs may edit their
previous outputs (compare inductive TMs [18]), and Enumerable Output Machines (EOMs)
may do this provided the output does not decrease lexicographically. We will define: a
formally describable object x has a finite, never halting GTM program that computes x
such that each output bit is revised at most finitely many times; that is, each finite prefix
of x eventually stabilizes (Defs. 2.1-2.5); describable functions can be implemented by such
programs (Def. 2.10); weakly decidable problems have solutions computable by never halting
programs whose output is wrong for at most finitely many steps (Def. 2.11). Theorem 2.1
generalizes the halting problem by demonstrating that it is not weakly decidable whether
a finite string is a description of a describable object (compare a related result for analytic
TMs by Hotz, Vierke and Schieffer [45]).
Section 3 generalizes the traditional concept of Kolmogorov complexity or algorithmic
information [50, 82, 26] of finite x (the length of the shortest halting program computing
x) to the case of objects describable by nonhalting programs on EOMs and GTMs (Defs.
3.2-3.4). It is shown that the generalization for EOMs is describable, but the one for GTMs
is not (Theorem 3.1). Certain objects are much more compactly encodable on EOMs than
on traditional monotone TMs, and Theorem 3.3 shows that there are also objects with short
GTM descriptions yet incompressible on EOMs and therefore “more random” than Chaitin’s
Ω [28], the halting probability of a TM with random input, which is incompressible only on
monotone TMs. This yields a natural TM type-specific complexity hierarchy expressed by
Inequality (14).
Section 4 discusses probability distributions on describable objects as well as the non-
describable convergence probability of a GTM (Def. 4.14). It also introduces describable
(semi)measures as well as cumulatively enumerable measures (CEMs, Def. 4.5), where
the cumulative probability of all strings lexicographically greater than a given string x is
EOM-computable or enumerable. Theorem 4.1 shows that there is a universal CEM that
dominates all other CEMs, in the sense that it assigns higher probability to any finite y,
save for a constant factor independent of y. This probability is shown to be equivalent
to the probability that an EOM whose input bits are chosen randomly produces an out-
put starting with y (Corollary 4.3 and Lemma 4.2). The nonenumerable universal CEM
also dominates enumerable priors studied in previous work by Solomonoff, Levin and oth-
ers [82, 100, 54, 35, 27, 36, 77, 83, 28, 56]. Theorem 4.2 shows that there is no universal
approximable measure (proof by M. Hutter).
Section 5 establishes relationships between generalized Kolmogorov complexity and gen-
eralized algorithmic probability, extending previous work on enumerable semimeasures by
Levin, Gács, and others [100, 54, 35, 27, 36, 56]. For instance, Theorem 5.3 shows that the universal CEM assigns a probability to each enumerable object proportional to 1/2 raised to the power of the length of its minimal EOM-based description, times a small corrective factor. Similarly, objects with approximable probabilities yet without very short descriptions
on GTMs are necessarily very unlikely a priori (Theorems 5.4 and 5.5). Additional suspected
links between generalized Kolmogorov complexity and probability are expressed in form of
Conjectures 5.1-5.3.
Section 6 addresses issues of temporal complexity ignored in the previous sections on
describable universe histories (whose computation may require excessive time without re-
cursive bounds). In Subsection 6.2, Levin’s universal search algorithm [53, 55] (which takes
into account program runtime in an optimal fashion) is modified to obtain the fastest way of
computing all “S-describable” universes computable within countable time (Def. 6.1, Section
6.3); uncountably many other universes are ignored because they do not even exist from a
constructive point of view. Postulate 6.1 then introduces a natural resource-oriented bias
reflecting constraints of whoever calculated our universe (possibly as a by-product of a search
for something else): we assign to universes prior probabilities inversely proportional to the
time and space resources consumed by the most efficient way of computing them. Given the
resulting “speed prior S” (Def. 6.5) and past observations x, Theorem 6.1 and Corollary 6.1
demonstrate that the best way of predicting a future y is to minimize the Levin complexity
of (x, y).
Section 7 puts into perspective the algorithmic priors (recursive and enumerable) intro-
duced in previous work on inductive inference by Solomonoff and others [82, 83, 56, 47],
as well as the novel priors discussed in the present paper (cumulatively enumerable, ap-
proximable, resource-optimal). Collectively they yield an entire spectrum of algorithmic
TOEs. We evaluate the plausibility of each prior being the one from which our own uni-
verse is sampled, discuss its connection to “Occam’s razor” as well as certain physical and
philosophical consequences, argue that the resource-optimal speed prior S may be the most
plausible one (Section 7.4), analyze the inference problem from the point of view of an
observer [13, 14, 91, 99, 87, 68] evolving in a universe sampled from S, make appropriate
predictions for our own universe (Section 7.5), and discuss their falsifiability.
2 Preliminaries
2.1 Notation
Much but not all of the notation used here is similar or identical to the one used in the
standard textbook on Kolmogorov complexity by Li and Vitányi [56].
Since sentences over any finite alphabet are encodable as bitstrings, without loss of generality we focus on the binary alphabet B = {0, 1}. λ denotes the empty string, B* the set of finite sequences over B, B∞ the set of infinite sequences over B, and B♯ = B* ∪ B∞. x, y, z, z_1, z_2 stand for strings in B♯. If x ∈ B* then xy is the concatenation of x and y (e.g., if x = 10000 and y = 1111 then xy = 100001111). Let us order B♯ lexicographically: if x precedes y alphabetically (like in the example above) then we write x ≺ y or y ≻ x; if x may also equal y then we write x ⪯ y or y ⪰ x (e.g., λ ≺ 001 ≺ 010 ≺ 1 ≺ 1111...). The context will make clear where we also identify x ∈ B* with a unique nonnegative integer 1x (e.g., string 0100 is represented by integer 10100 in the dyadic system or 20 = 1·2^4 + 0·2^3 + 1·2^2 + 0·2^1 + 0·2^0 in the decimal system). Indices i, j, m, m_0, m_1, n, n_0, t, t_0 range over the positive integers, constants c, c_0, c_1 over the positive reals, f, g denote functions mapping integers to integers, log the logarithm with base 2, and lg(r) = max_k {integer k : 2^k ≤ r} for real r > 0. For x ∈ B*\{λ}, 0.x stands for the real number with dyadic expansion x (note that 0.x0111... = 0.x1 = 0.x10 = 0.x100... for x ∈ B*, although x0111... ≠ x1 ≠ x10 ≠ x100...). For x ∈ B*, l(x) denotes the number of bits in x, where l(x) = ∞ for x ∈ B∞; l(λ) = 0. x_n is the prefix of x consisting of the first n bits, if l(x) ≥ n, and x otherwise (x_0 := λ). For those x ∈ B* that contain at least one 0-bit, x′ denotes the lexicographically smallest y ≻ x satisfying l(y) ≤ l(x) (x′ is undefined for x of the form 111...1). We write f(n) = O(g(n)) if there exist c, n_0 such that f(n) ≤ c·g(n) for all n > n_0.
(but not out of bounds); (b) delete square at the right end of the output tape (if it is not the
initial square or above the scanning head); (c) write 1 or 0 on square above output scanning
head. Compare Burgin’s inductive TMs and super-recursive algorithms [18, 19].
Enumerable Output Machines (EOMs). Like GTMs, EOMs can edit their previous
output, but not such that it decreases lexicographically. The expressive power of EOMs lies in
between those of MTMs and GTMs, with interesting computability-related properties whose
analogues do not hold for GTMs. EOMs are like MTMs, except that the only permitted
output instruction sequences are: (a) shift output tape scanning head left/right unless this
leads out of bounds; (b) replace bitstring starting above the output scanning head by the
string to the right of the scanning head of the second work tape, readjusting output tape
size accordingly, but only if this lexicographically increases the contents of the output tape.
The necessary test can be hardwired into the finite TM transition table.
T (p) = x (3)
for “p computes x on T and halts”. Much of this paper, however, deals with programs that
never halt, and with TMs that do not need halt instructions.
Definition 2.1 (Convergence) Let p ∈ B♯ denote the input string or program read by TM T. Let T_t(p) denote T's finite output string after t instructions. We say that p and p's output stabilize and converge towards x ∈ B♯ iff for each n satisfying 0 ≤ n ≤ l(x) there exists a positive integer t_n such that for all t ≥ t_n: T_t(p)_n = x_n and l(T_t(p)) ≤ l(x). Then we write

T(p) ❀ x.   (4)
Although each beginning or prefix of x eventually becomes stable during the possibly infi-
nite computation, there need not be a halting program that computes an upper bound of
stabilization time, given any p and prefix size. Compare the concept of computability in the
limit [39, 65, 34] and [41, 63].
Definition 2.3 (Universal TMs) Let C denote a set of TMs. C has a universal element if there is a TM U^C ∈ C such that for each T ∈ C there exists a constant string p_T ∈ B* (the compiler) such that for all possible programs p, if T(p) ❀ x then U^C(p_T p) ❀ x.
Definition 2.4 (M, E, G) Let M denote the set of MTMs, E denote the set of EOMs, G
denote the set of GTMs.
M, E, G all have universal elements, according to the fundamental compiler theorem (for
instance, a fixed compiler can translate arbitrary LISP programs into equivalent FORTRAN
programs).
Definition 2.5 (Individual Describability) Let C denote a set of TMs with universal element U^C. Some x ∈ B♯ is C-describable or C-computable if it is U^C-describable. E-describable strings are called enumerable. G-describable strings are called formally describable or simply describable.
Definition 2.6 (Always converging TMs) TM T always converges if for all of its pos-
sible programs p ∈ B ♯ there is an x ∈ B ♯ such that T (p) ❀ x.
Definition 2.7 (Approximability) Let 0.x denote a real number, x ∈ B♯\{λ}. 0.x is approximable by TM T if there is a p ∈ B* such that for each real ε > 0 there exists a t_0 such that

|0.x − 0.T_t(p)| < ε

for all times t ≥ t_0. 0.x is approximable if there is at least one GTM T as above — compare (2).
Definition 2.8 (Encoding B ∗ ) Encode x ∈ B ∗ as a self-delimiting input p(x) for an ap-
propriate TM, using
l(p(x)) = l(x) + 2log l(x) + O(1) (5)
bits as follows: write l(x) in binary notation, insert a “0” after every “0” and a “1” after
every “1,” append “01” to indicate the end of the description of the size of the following
string, then append x.
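A small Python sketch of the encoding of Def. 2.8: write l(x) in binary with every bit doubled, append the delimiter "01", then append x itself. The decoder is included only to check the round trip; it is not part of the definition.

    def encode(x: str) -> str:
        """Self-delimiting code of Def. 2.8: doubled binary length, '01' delimiter, then x."""
        length_bits = bin(len(x))[2:]                  # l(x) in binary
        doubled = "".join(b + b for b in length_bits)  # '0' -> '00', '1' -> '11'
        return doubled + "01" + x

    def decode(p: str) -> str:
        """Inverse of encode (round-trip check only): read doubled bits until '01', then x."""
        i, length_bits = 0, ""
        while p[i:i + 2] != "01":
            length_bits += p[i]
            i += 2
        n = int(length_bits, 2)
        return p[i + 2:i + 2 + n]

    x = "10110"
    p = encode(x)
    assert decode(p) == x
    print(p, len(p))   # 13 bits here: l(x) plus roughly 2 log l(x) plus a constant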
Definition 2.10 (Describable Functions) Let T denote a TM using the encoding of Def. 2.8. A function h : D_1 ⊂ B* → D_2 ⊂ B♯ is T-describable if for all x ∈ D_1: T(p(x)) ❀ h(x). Let C denote a set of TMs using encoding 2.8, with universal element U^C. h is C-describable or C-computable if it is U^C-computable. If the T above is universal among the GTMs with such input encoding (see Def. 2.3) then h is describable.
Compare functions in the arithmetic hierarchy [67] and the concept of Δ^0_n-describability, e.g., [56, p. 46-47].
Example 2.2 Is a given string p ∈ B ∗ a halting program for a given MTM? The problem
is not decidable in the traditional sense (no halting algorithm solves the general halting
problem [92]), but weakly decidable and even E-decidable, by a trivial algorithm: print "0" on the first output square; simulate the MTM on the work tapes and apply it to p; once it halts after having read no more than l(p) bits, print "1" on the first output square.
Example 2.3 It is weakly decidable whether a finite bitstring p is a program for a given
TM. Algorithm: print “0”; feed p bitwise into the internally simulated TM whenever it
requests a new input bit; once the TM has requested l(p) bits, print “1”; if it requests an
additional bit, print “0”. After finite time the output will stabilize forever.
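A sketch of the weak-decision procedure of Example 2.3 in Python; the "given TM" is mocked up as something that simply requests some number of input bits, which is enough to see that the printed output is revised only finitely often and then stabilizes.

    def weakly_decide(p: str, bits_requested: int):
        """Example 2.3: output '0'; flip to '1' once the simulated TM has requested exactly
        l(p) bits; flip back to '0' if it requests one more. The last value is the answer."""
        outputs = ["0"]                        # initial, possibly wrong, output
        for consumed in range(1, bits_requested + 1):
            if consumed == len(p):
                outputs.append("1")            # p supplied exactly the bits requested so far
            if consumed == len(p) + 1:
                outputs.append("0")            # the TM wants more bits than p contains
                break
        return outputs

    print(weakly_decide("1011", bits_requested=4))   # ['0', '1']      -> stabilizes on '1'
    print(weakly_decide("1011", bits_requested=6))   # ['0', '1', '0'] -> stabilizes on '0'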
Theorem 2.1 (Convergence Problem) Given a GTM, it is not weakly decidable whether
a finite bitstring is a converging program, or whether some of the output bits will fluctuate
forever.
Proof. A proof conceptually quite similar to the one below was given by Hotz, Vierke
and Schieffer [45] in the context of analytic TMs [25] derived from R-Machines [10] (the
alphabet of analytic TMs is real-valued instead of binary). Version 1.0 of this paper [75] was
written without awareness of this work. Nevertheless, the proof in Version 1.0 is repeated
here because it does serve illustrative purposes.
In a straightforward manner we adapt Turing’s proof of the undecidability of the MTM
halting problem [92], a reformulation of Gödel’s celebrated result [38], using the diagonaliza-
tion trick whose roots date back to Cantor’s proof that one cannot count the real numbers
[23]. Let us write T(x) ↓ if there is a z ∈ B♯ such that T(x) ❀ z. Let us write T(x) ↕ if T's output fluctuates forever in response to x (e.g., by flipping from 1 to 0 and back forever). Let A_1, A_2, ... be an effective enumeration of all GTMs. Uniquely encode all pairs of finite strings (x, y) in B* × B* as finite strings code(x, y) ∈ B*. Suppose there were a GTM U such that (*): for all x, y ∈ B*: U(code(x, y)) ❀ 1 if A_x(y) ↓, and U(code(x, y)) ❀ 0 otherwise. Then one could construct a GTM T with T(x) ❀ 1 if U(code(x, x)) ❀ 0, and T(x) ↕ otherwise. Let y be the index of T = A_y; then A_y(y) ↓ if U(code(y, y)) ❀ 0, otherwise A_y(y) ↕. By (*), however, U(code(y, y)) ❀ 1 if A_y(y) ↓, and U(code(y, y)) ❀ 0 if A_y(y) ↕.
Contradiction. ✷
K(x) = \min_p \{l(p) : U(p) = x\}.   (6)
3.1 Generalized Kolmogorov Complexity for EOMs and GTMs
Definition 3.2 (Generalized K_T) Given any TM T, define

K_T(x) = \min_p \{l(p) : T(p) ❀ x\}.

This is justified by an appropriate Invariance Theorem [50, 82, 26]: there is a positive constant c such that K_{U^C}(x) ≤ K_T(x) + c for all x, since the size of the compiler that translates arbitrary programs for T into equivalent programs for U^C does not depend on x. Consider Def. 2.4. If C denotes a set of TMs with universal TM U^C, then define Km^C(x) = Km_{U^C}(x).
Km^C is a generalization of Schnorr's [77] and Levin's [52] complexity measure Km^M for MTMs.
Describability issues. K(x) is not computable by a halting program [50, 82, 26], but obviously G-computable or describable; the z with 0.z = 1/K(x) is even enumerable. Even K^E(x) is describable, using the following algorithm:
Run all EOM programs in “dovetail style” such that the n-th step of the i-th
program is executed in the n + i-th phase (i = 1, 2, . . .); whenever a program
outputs x, place it (or its prefix read so far) in a tentative list L of x-computing
programs or program prefixes; whenever an element of L produces output ≻ x,
delete it from L; whenever an element of L requests an additional input bit,
update L accordingly. After every change of L replace the current estimate of K^E(x) by the length of the shortest element of L. This estimate will eventually stabilize forever.
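A sketch of the dovetailing estimator above, with "EOM programs" mocked up as bitstrings paired with lists of successive (lexicographically nondecreasing) outputs; the program set and its outputs are invented purely to make the bookkeeping concrete.

    def dovetail_KE_estimate(x, programs, rounds=50):
        """Sketch of the dovetailing estimator of K^E(x). `programs` maps a program
        bitstring to the list of its successive outputs on an EOM-like machine."""
        steps = {p: 0 for p in programs}           # instructions executed per program
        L = set()                                  # tentative list of x-computing programs
        estimate = None
        for r in range(1, rounds + 1):             # in round r, advance programs 1..r
            for i, p in enumerate(sorted(programs), start=1):
                if i > r or steps[p] >= len(programs[p]):
                    continue
                steps[p] += 1
                out = programs[p][steps[p] - 1]
                if out == x:
                    L.add(p)                       # p currently outputs x
                elif p in L and out > x:           # output moved lexicographically past x
                    L.discard(p)
                if L:
                    estimate = len(min(L, key=len))
        return estimate

    programs = {                                   # invented toy programs
        "0":     ["0", "01", "011"],               # reaches "01" but then overshoots it
        "101":   ["0", "01"],                      # converges to "01"
        "11110": ["01"],                           # also reaches "01", but is longer
    }
    print(dovetail_KE_estimate("01", programs))    # 3: length of the shortest stable program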
Proof. Identify finite bitstrings with the integers they represent. If K^G(x) were describable then also

h(x) = max_y {K^G(y) : 1 ≤ y ≤ g(x)},   (8)

where g is any fixed recursive function, and also

Since the number of descriptions p with l(p) < n − O(1) cannot exceed 2^{n−O(1)}, but the number of strings x with l(x) = n equals 2^n, most x cannot be compressed by more than O(1) bits; that is, K^G(x) ≥ log x − O(1) for most x. From (9) we therefore obtain K^G(f(x)) > log g(x) − O(1) for large enough x, because f(x) picks out one of the incompressible y ≤ g(x). However, obviously we also would have K^G(f(x)) ≤ l(x) + 2 log l(x) + O(1), using the encoding of Def. 2.8. Contradiction for quickly growing g with low complexity, such as g(x) = 2^{2^x}. ✷
Proof. Define
where g is recursive. Then K^G(f′(x)) = O(l(x) + K(g)) (where K(g) is the size of the minimal halting description of function g), but K(f′(x)) > log g(x) − O(1) for sufficiently large x — compare the proof of Theorem 3.1. Therefore K(f′(x)) − K^G(f′(x)) ≥ O(log g(x)) for infinitely many x and quickly growing g with low complexity. ✷
Similarly, some x are compactly describable on EOMs but not on MTMs. To see this, consider Chaitin's Ω, the halting probability of an MTM whose input bits are obtained by tossing an unbiased coin whenever it requests a new bit [28]. Ω is enumerable (dovetail over all programs p and sum up the contributions 2^{-l(p)} of the halting p), but there is no recursive upper bound on the number of instructions required to compute Ω_n, given n. This implies K(Ω_n) = n + O(1) [28] and also K^M(Ω_n) = n + O(1). It is easy to see, however, that on nonhalting EOMs there are much more compact descriptions:
3.2.2 GTMs More Expressive Than EOMs — Objects Less Regular Than Ω
We will now show that there are describable strings that have a short GTM description yet
are “even more random” than Chaitin’s Omegas, in the sense that even on EOMs they do
not have any compact descriptions.
First note that the dyadic expansion of Ξ(x) is EOM-computable or enumerable. The algo-
rithm works as follows:
V approximates Ξ(x) from below in enumerable fashion — infinite p are not worrisome as T must only read a finite prefix of p to observe 0.y > 0.x if the latter indeed holds. We will now show that knowledge of Ξ(x)_n, the first n bits of Ξ(x), allows for constructing a bitstring z with K^E(z) ≥ n − O(1) when x has low complexity.
Suppose we know Ξ(x)_n. Once algorithm A above yields V > Ξ(x)_n we know that no programs p with l(p) < n will contribute any more to V. Choose the shortest z satisfying 0.z = (0.y_min − 0.x)/2, where y_min is the lexicographically smallest y previously computed by algorithm A such that 0.y > 0.x. Then z cannot be among the strings T-describable with fewer than n bits. Using the Invariance Theorem (compare Def. 3.3) we obtain K^E(z) ≥ n − O(1).
While prefixes of Ω are greatly compressible on EOMs, z is not. On the other hand, z is compactly G-describable: K^G(z) ≤ K(x) + K(n) + O(1). For instance, choosing a low-complexity x, we have K^G(z) ≤ O(K(n)) ≤ O(log n). ✷
The discussion above reveals a natural complexity hierarchy. Ignoring additive constants, we have

K^G(x) ≤ K^E(x) ≤ K^M(x),   (14)

where for each "≤" relation above there are x which allow for replacing "≤" by "<."
is a normalizing factor. The most likely continuation y is determined by P (xy), the prior
probability of xy — compare the similar Equation (1). Now what are the formally describable
ways of assigning prior probabilities to universes? In what follows we will first consider
describable semimeasures on B ∗ , then probability distributions on B ♯ .
A notational difference to the approach of Levin [100] (who writes µ(x) ≥ µ(x0) + µ(x1)) is the explicit introduction of µ̄. Compare the introduction of an undefined element u by Li and Vitányi [56, p. 281]. Note that \sum_{x∈B*} µ̄(x) ≤ 1. Later we will discuss the interesting case µ̄(x) = P(x), the a priori probability of x.
Definition 4.2 (Dominant Semimeasures) A semimeasure µ0 dominates another semimea-
sure µ if there is a constant cµ such that for all x ∈ B ∗
Note that we could replace “l(x)” by “l(x)+c” in the definition above. Recall that x′ denotes
the smallest y ≻ x with l(y) ≤ l(x) (x′ may be undefined). We have
Then µ(x) is the difference of two finite enumerable values, according to (20).
Proof. We first show that one can enumerate the CEMs, then construct a universal CEM
from the enumeration. Check out differences to Levin’s related proofs that there is a universal
discrete semimeasure and a universal enumerable semimeasure [100, 52], and Li and Vitányi’s
presentation of the latter [56, p. 273 ff], attributed to J. Tyszkiewicz.
Without loss of generality, consider only EOMs without halt instruction and with fixed
input encoding of B ∗ according to Def. 2.8. Such EOMs are enumerable, and correspond
to an effective enumeration of all enumerable functions from B ∗ to B ♯ . Let EOMi denote
the i-th EOM in the list, and let EOMi (x, n) denote its output after n instructions when
applied to x ∈ B ∗ . The following procedure filters out those EOMi that already represent
CEMs, and transforms the others into representations of CEMs, such that we obtain a way
of generating all and only CEMs.
    START: let Vµ_i(x), Vµ̄_i(x) and VCµ_i(x) denote variable functions on B*.
    Set Vµ_i(λ) := Vµ̄_i(λ) := VCµ_i(λ) := 1, and Vµ_i(x) := Vµ̄_i(x) := VCµ_i(x) := 0
    for all other x ∈ B*. Define VCµ_i(x′) := 0 for undefined x′. Let z denote a
    string variable.
    FOR n = 1, 2, . . . DO:
    (1) Lexicographically order and rename all x with l(x) ≤ n:
        x_1 := λ ≺ x_2 := 0 ≺ x_3 ≺ . . . ≺ x_{2^{n+1}−1} := 11...1 (n ones).
    (2) FOR k = 2^{n+1} − 1 down to 1 DO:
        (2.1) Systematically search for the smallest m ≥ n such that
        z := EOM_i(x_k, m) ≠ λ AND 0.z ≥ VCµ_i(x_{k+1}) if k < 2^{n+1} − 1;
        set VCµ_i(x_k) := 0.z.
    (3) For all x ≻ λ satisfying l(x) ≤ n, set Vµ_i(x) := VCµ_i(x) − VCµ_i(x′).
        For all x with l(x) < n, set Vµ̄_i(x) := Vµ_i(x) − Vµ_i(x1) − Vµ_i(x0).
        For all x with l(x) = n, set Vµ̄_i(x) := Vµ_i(x).
If EOM_i indeed represents a CEM µ_i then each search process in (2.1) will terminate, and the VCµ_i(x) will enumerate the Cµ_i(x) from below, and the Vµ_i(x) and Vµ̄_i(x) will approximate the true µ_i(x) and µ̄_i(x), respectively, not necessarily from below though. Otherwise there will be a nonterminating search at some point, leaving Vµ_i from the previous loop as a trivial CEM. Hence we can enumerate all CEMs, and only those. Now define (compare [52]):

µ_0(x) = \sum_{n>0} α_n µ_n(x),   µ̄_0(x) = \sum_{n>0} α_n µ̄_n(x),   where α_n > 0, \sum_n α_n = 1,

\sum_{n>0} α_n \Big[ \sum_{y ⪰ x: l(x)=l(y)} µ_n(y) + \sum_{y ≻ x: l(x)>l(y)} µ̄_n(y) \Big] = \sum_{n>0} α_n Cµ_n(x)   (22)

is enumerable, since α_n and Cµ_n(x) are (dovetail over all n). That is, µ_0(x) is approximable as the difference of two enumerable finite values, according to Equation (20). ✷
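The dominance claimed in Theorem 4.1 can be read off this mixture directly; spelling out the constant (for any describable choice of positive weights α_n with \sum_n α_n = 1, for instance α_n proportional to n^{-2}):

    µ_0(x) = \sum_{m>0} α_m µ_m(x) ≥ α_n µ_n(x)   for every fixed n and all x ∈ B*,

so µ_0 dominates each CEM µ_n up to the constant factor 1/α_n.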
4.3 Approximable and Cumulatively Enumerable Distributions
To deal with infinite x, we will now extend the treatment of semimeasures on B ∗ in the
previous subsection by discussing probability distributions on B ♯ .
Proof. The following proof is due to M. Hutter (personal communication by email following a discussion of enumerable and approximable universes on 2 August 2000 in Munich). It is an extension of a modified¹ proof [56, p. 249 ff] that there is no universal recursive semimeasure.
It suffices to focus on x ∈ B*. Identify strings with integers, and assume P(x) is a universal approximable semidistribution. We construct an approximable semidistribution Q(x) that is not dominated by P(x), thus contradicting the assumption. Let P_0(x), P_1(x), ... be a sequence of recursive functions converging to P(x). We recursively define a sequence Q_0(x), Q_1(x), ... converging to Q(x). The basic idea is: each contribution to Q is the sum of n consecutive P probabilities (n increasing). Define Q_0(x) := 0; I_n := {y : n^2 ≤ y < (n+1)^2}. Let n be such that x ∈ I_n. Define j_t^n (k_t^n) as the element with smallest P_t (largest Q_{t−1}) probability in this interval, i.e., j_t^n := minarg_{x∈I_n} P_t(x) (k_t^n := maxarg_{x∈I_n} Q_{t−1}(x)). If n·P_t(k_t^n) is less than twice and n·P_t(j_t^n) is more than half of Q_{t−1}(k_t^n), set Q_t(x) = Q_{t−1}(x).
¹ As pointed out by M. Hutter (14 Nov. 2000, personal communication) and even earlier by A. Fujiwara (1998, according to P. M. B. Vitányi, personal communication, 21 Nov. 2000), the proof on the bottom of p. 249 of [56] should be slightly modified. For instance, the sum could be taken over x_{i−1} < x ≤ x_i. The sequence of inequalities \sum_{x_{i−1} < x ≤ x_i} P(x) > x_i P(x_i) is then satisfiable by a suitable x_i sequence, since lim inf_{x→∞} {xP(x)} = 0. The basic idea of the proof is correct, of course, and very useful.
Otherwise set Q_t(x) = n·P_t(j_t^n) for x = j_t^n and Q_t(x) = 0 for x ≠ j_t^n. Q_t(x) is obviously total recursive and non-negative. Since 2n ≤ |I_n|, we have

\sum_{x∈I_n} Q_t(x) ≤ 2n·P_t(j_t^n) = 2n·\min_{x∈I_n} P_t(x) ≤ \sum_{x∈I_n} P_t(x).
Note that \sum_x m(x) < 1 if T is universal. Let us now generalize this to B♯ and nonhalting programs:

Definition 4.12 (P_T, KP_T) Suppose T's input bits are obtained by tossing an unbiased coin whenever a new one is requested.

P_T(x) = \sum_{p: T(p) ❀ x} 2^{-l(p)},   KP_T(x) = −lg P_T(x) for P_T(x) > 0,   (26)

where x, p ∈ B♯.
Program Continua. According to Def. 4.12, most infinite x have zero probability, but not those with finite programs, such as the dyadic expansion of 0.5√2. However, a nonvanishing part of the entire unit of probability mass is contributed by continua of mostly incompressible strings, such as those with cumulative probability 2^{-l(q)} computed by the following class of uncountably many infinite programs with a common finite prefix q: "repeat forever: read and print next input bit." The corresponding traditional measure-oriented notation for

\sum_{x: T(qx) ❀ x} 2^{-l(qx)} = 2^{-l(q)}

would be

\int_{0.q}^{0.q + 2^{-l(q)}} dx = 2^{-l(q)}.

For notational simplicity, however, we will continue using the \sum sign to indicate summation over uncountable objects, rather than using a measure-oriented notation for probability densities. The reader should not feel uncomfortable with this — the theorems in the remainder of the paper will focus on those x ∈ B♯ with P(x) > 0; density-like nonzero sums over uncountably many bitstrings, each with individual measure zero, will not play any critical role in the proofs.
Lemma 4.1 For x ∈ B*, CP^E(x) is enumerable.

Proof. The following algorithm computes CP^E(x) (compare proof of Theorem 3.3):

Initialize the real-valued variable V by 0, run all possible programs of EOM T dovetail style; whenever the output of a program prefix q starts with some y ⪰ x for the first time, set V := V + 2^{-l(q)}; henceforth ignore continuations of q.

In this way V enumerates CP^E(x). Infinite p are not problematic as only a finite prefix of p must be read to establish y ⪰ x if the latter indeed holds. ✷
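A sketch of this enumeration with invented toy "program prefixes" (a bitstring together with its stream of outputs); the point is only that V never decreases, i.e. CP^E(x) is approximated from below. Crediting a prefix once and then skipping it stands in for "henceforth ignore continuations of q".

    def enumerate_CPE_from_below(x, prefixes, rounds=20):
        """Sketch of Lemma 4.1: dovetail over program prefixes and add 2^(-l(q)) the first
        time q's output is lexicographically >= x (equivalently, starts with some y >= x)."""
        V = 0.0
        credited = set()
        steps = {q: 0 for q in prefixes}
        for r in range(1, rounds + 1):
            for i, q in enumerate(sorted(prefixes), start=1):
                if i > r or q in credited or steps[q] >= len(prefixes[q]):
                    continue
                steps[q] += 1
                out = prefixes[q][steps[q] - 1]
                if out >= x:                       # output has reached or passed x
                    V += 2.0 ** (-len(q))
                    credited.add(q)
        return V

    prefixes = {"0": ["", "1"], "10": ["0", "01"], "111": ["00"]}   # invented toy prefixes
    print(enumerate_CPE_from_below("01", prefixes))                 # 0.5 + 0.25 = 0.75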
4.6 Universal CEM vs EOM with Random Input
Corollary 4.3 and Lemma 4.2 below imply that µ^E and µ_0 are essentially the same thing: randomly selecting the inputs of a universal EOM yields output prefixes whose probabilities are determined by the universal CEM.

Corollary 4.3 Let µ_0 denote the universal CEM of Theorem 4.1. For x ∈ B*,

µ^E(x) = O(µ_0(x)).
Proof. In the enumeration of EOMs in the proof of Theorem 4.1, let EOM_0 be an EOM representing µ_0. We build an EOM T such that µ_T(x) = µ_0(x). The rest follows from the Invariance Theorem (compare Def. 3.3).
T applies EOM_0 to all x ∈ B* in dovetail fashion, and simultaneously simply reads randomly selected input bits forever. At a given time, let string variable z denote T's input string read so far. Starting at the right end of the unit interval [0, 1), as the Vµ̄_0(x) are being updated by the algorithm of Theorem 4.1, T keeps updating a chain of finitely many, variable, disjoint, consecutive, adjacent, half-open intervals VI(x) of size Vµ̄_0(x) in alphabetic order on x, such that VI(y) is to the right of VI(x) if y ≻ x. After every variable update and each increase of z, T replaces its output by the x of the VI(x) with 0.z ∈ VI(x). Since neither z nor the VCµ_0(x) in the algorithm of Theorem 4.1 can decrease (that is, all interval boundaries can only shift left), T's output cannot either, and therefore is indeed EOM-computable. Obviously the following holds:
and

µP_T(x) = \sum_{z∈B♯} P_T(xz) = µ_0(x).
✷
In this special case, the contributions of the shortest programs dominate the probabilities of objects computable in the traditional sense. As shown by Gács [36] for the case of MTMs, however, contrary to Levin's [52] conjecture, µ^M(x) ≠ O(2^{-Km^M(x)}); but a slightly worse bound does hold:

Theorem 5.2

Kµ^M(x) − 1 ≤ Km^M(x) ≤ Kµ^M(x) + Km^M(Kµ^M(x)) + O(1).   (37)

The term −1 on the left-hand side stems from the definition of lg(x) ≤ log(x). We will now consider the case of probability distributions that dominate m, and semimeasures that dominate µ^M, starting with the case of enumerable objects.
Using K^E(y) ≤ log y + 2 log log y + O(1) for y interpreted as an integer — compare Def. 2.8 — this yields

2^{-K^E(x)} < P^E(x) ≤ O(2^{-K^E(x)}) (K^E(x))^2.   (39)
That is, objects that are hard to describe (in the sense that they have only long enumerating
descriptions) have low probability.
Proof. The left-hand inequality follows by definition. To show the right-hand side, one can build an EOM T that computes x ∈ B♯ using not more than KP^E(x) + K_T(KP^E(x)) + O(1) input bits in a way inspired by Huffman coding [46]. The claim then follows from the
Invariance Theorem. The trick is to arrange T ’s computation such that T ’s output converges
yet never needs to decrease lexicographically. T works as follows:
of the largest x coincides with the right end of [0, 1), and IV (y) is to the right
of IV (x) if y ≻ x. After every variable update and each change of s, replace the
output of T by the x of the IV (x) with 0.s ∈ IV (x).
This will never violate the EOM constraints: the enumerable s cannot shrink, and since
EOM outputs cannot decrease lexicographically, the interval boundaries RV (x) and LV (x)
cannot grow (their negations are enumerable, compare Lemma 4.1), hence T ’s output cannot
decrease.
For x ∈ B* the IV(x) converge towards an interval I(x) of size P^E(x). For x ∈ B∞ with P^E(x) > 0, we have: for any ε > 0 there is a time t_0 such that for all time steps t > t_0 in T's computation, an interval I_ε(x) of size P^E(x) − ε will be completely covered by certain IV(y) satisfying x ≻ y and 0.x − 0.y < ε. So for ε → 0 the I_ε(x) also converge towards an interval I(x) of size P^E(x). Hence T will output larger and larger y approximating x from below, provided 0.s ∈ I(x).
Since any interval of size c within [0, 1) contains a number 0.z with l(z) = −lg c, in both cases there is a number 0.s (encodable by some r satisfying l(r) ≤ l(s) + K_T(l(s)) + O(1)) with l(s) = −lg P^E(x) + O(1), such that T(r) ❀ x, and therefore K_T(x) ≤ l(s) + K_T(l(s)) + O(1).
✷
Theorem 5.4 Let TM T induce approximable CP_T(x) for all x ∈ B* (compare Defs. 4.10 and 4.12; an EOM would be a special case). Then for x ∈ B♯, P_T(x) > 0:
Proof. Modify the proof of Theorem 5.3 for approximable as opposed to enumerable interval
boundaries and approximable 0.s. ✷
A similar proof, but without the complication for the case x ∈ B ∞ , yields:
As a consequence,

\frac{µ(x)}{Kµ(x) \log^2 Kµ(x)} ≤ O(2^{-Km^G(x)});   \frac{µ̄(x)}{Kµ̄(x) \log^2 Kµ̄(x)} ≤ O(2^{-K^G(x)}).   (43)
Proof. Initialize variables V_λ := 1 and IV_λ := [0, 1). Dovetailing over all x ≻ λ, approximate the GTM-computable µ̄(x) = µ(x) − µ(x0) − µ(x1) in variables V_x initialized by zero, and create a chain of adjacent intervals IV_x analogously to the proof of Theorem 5.3.
The IV_x converge towards intervals I_x of size µ̄(x). Hence x is GTM-encodable by any program r producing an output s with 0.s ∈ I_x: after every update, replace the GTM's output by the x of the IV_x with 0.s ∈ IV_x. Similarly, if 0.s is in the union of adjacent intervals I_y of strings y starting with x, then the GTM's output will converge towards some string starting with x. The rest follows in a way similar to the one described in the final paragraph of the proof of Theorem 5.3. ✷
Using the basic ideas in the proofs of Theorem 5.3 and 5.5 in conjunction with Corollary 4.3
and Lemma 4.2, one can also obtain statements such as:
Theorem 5.6 Let µ_0 denote the universal CEM from Theorem 4.1. For x ∈ B*,

Kµ_0(x) − O(1) ≤ Km^E(x) ≤ Kµ_0(x) + Km^E(Kµ_0(x)) + O(1)   (44)
While P^E dominates P^M and P^G dominates P^E, the reverse statements are not true. In fact, given the results from Sections 3.2 and 5, one can now make claims such as the following ones:

Proof. For the cases µ^E and P^E, apply Theorems 5.2, 5.6 and the unboundedness of (12). For the case P^G, apply Theorems 3.3 and 5.3.
The work of Gács has already shown, however, that analogous conjectures for semimeasures such as µ^M (as opposed to distributions) are false [36].
5.3 Between EOMs and GTMs?
The dominance of P^G over P^E comes at the expense of occasionally "unreasonable," nonconverging outputs. Are there classes of always converging TMs more expressive than EOMs? Consider a TM called a PEOM whose inputs are pairs of finite bitstrings x, y ∈ B* (code them using 2 log l(x) + 2 log l(y) + l(xy) + O(1) bits). The PEOM uses dovetailing to run all self-delimiting programs on the y-th EOM of an enumeration of all EOMs, to approximate the probability P^{EOM}(y, x) (again encoded as a string) that the EOM's output starts with x. P^{EOM}(y, x) is approximable (we may apply Theorem 5.5) but not necessarily enumerable. On the other hand, it is easy to see that PEOMs can compute all enumerable strings describable on EOMs. In this sense PEOMs are more expressive than EOMs, yet
strings describable on EOMs. In this sense PEOMs are more expressive than EOMs, yet
never diverge like GTMs. EOMs can encode some enumerable strings slightly more com-
pactly, however, due to the PEOM’s possibly unnecessarily bit-consuming input encoding.
An interesting topic of future research may be to establish a partially ordered expressiveness
hierarchy among classes of always converging TMs, and to characterize its top, if there is
one, which we doubt. Candidates to consider may include TMs that approximate certain
recursive or enumerable functions of enumerable strings.
6 Temporal Complexity
So far we have completely ignored the time necessary to compute objects from programs. In fact, the objects that are highly probable according to P^G and P^E and µ^E introduced in the previous sections yet quite improbable according to less dominant priors studied earlier (such as µ^M and recursive priors [100, 54, 83, 36, 56]) are precisely those whose computation requires immense time. For instance, the time needed to compute the describable, even enumerable Ω_n grows faster than any recursive function of n, as shown by Chaitin [28]. Analogous statements hold for the z of Theorem 3.2. Similarly, many of the semimeasures discussed above are approximable, but the approximation process is excessively time-consuming.
Now we will study the opposite extreme, namely, priors with a bias towards the fastest
way of producing certain outputs. Without loss of generality, we will focus on computations
on a universal MTM. For simplicity let us extend the binary alphabet such that it contains
an additional output symbol “blank.”
There are much faster ways though. For instance, the algorithm used in the previous paper on the computable universes [72] sequentially computes all computable bitstrings by a particular form of dovetailing. Let p_i denote the i-th possible program. Program p_1 is run for one instruction every second step (to simplify things, if the TM has a halt instruction and p_1 has halted we assume nothing is done during this step — the resulting loss of efficiency is not significant for what follows). Similarly, p_2 is run for one instruction every second of the remaining steps, and so on.
Following Li and Vitányi [56, p. 503 ff], let us call this popular dovetailer "SIMPLE." It turns out that SIMPLE actually is the fastest in a certain sense. For instance, the n-th bit of string "11111111..." now will appear after at most O(n) steps (as opposed to at least O(n2^n) steps for ALPHABET). Why? Let p_k be the fastest algorithm that outputs "11111111...". Obviously p_k computes the n-th bit within O(n) instructions. Now SIMPLE will execute one instruction of p_k every 2^k steps; that is, a constant fraction 2^{-k} of all steps is spent on p_k, and this fraction does not depend on n.
Generally speaking, suppose p_k is among the fastest finite algorithms for string x and computes x_n within at most O(f(n)) instructions, for all n. Then x's first n symbols will appear after at most O(f(n)) steps of SIMPLE. In this sense SIMPLE essentially computes each string as quickly as its fastest algorithm, although it is in fact computing all computable strings simultaneously. This may seem counterintuitive.
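A sketch of SIMPLE's step allocation in Python, with programs mocked up as generators; it makes explicit that p_k is served a constant fraction of roughly 2^{-k} of all steps, independent of n.

    def simple_dovetailer(programs, total_steps):
        """SIMPLE: p1 gets every 2nd step, p2 every 2nd of the remaining steps, and so on.
        Step t goes to p_k where 2^(k-1) is the largest power of two dividing t
        (capped at the last program of this finite mock-up)."""
        executed = [0] * len(programs)
        for t in range(1, total_steps + 1):
            k = 1
            while t % (2 ** k) == 0 and k < len(programs):
                k += 1
            next(programs[k - 1], None)            # run one instruction of p_k
            executed[k - 1] += 1
        return executed

    def mock_program():
        """Stand-in program: every 'instruction' just produces the next output bit."""
        while True:
            yield "1"

    programs = [mock_program() for _ in range(5)]
    print(simple_dovetailer(programs, total_steps=1000))
    # [500, 250, 125, 63, 62]: p_k receives roughly a 2^(-k) fraction of all steps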
Following Levin [53], within 2^{k+1} TM steps, each of order O(1) "micro-steps" (no excessive computational overhead due to storage allocation etc.), FAST will generate all prefixes x_n satisfying Kt(x_n) ≤ k, where x_n's Levin complexity Kt(x_n) is defined as

Kt(x_n) = \min_q \{ l(q) + \log t(q, x_n) \},

where program prefix q computes x_n in t(q, x_n) time steps. The computational complexity of the algorithm is not essentially affected by the fact that PHASE i = 2, 3, ..., repeats the computation of PHASE i − 1, which for large i takes roughly half as long (ignoring nonessential speed-ups due to halting programs if there are any).
One difference between SIMPLE and FAST is that SIMPLE may allocate steps to algorithms with a short description less frequently than FAST. Suppose no finite algorithm computes x faster than p_k, which needs at most f(n) instructions for x_n, for all n. While SIMPLE needs 2^{k+1} f(n) steps to compute x_n, following Levin [53] it can be shown that FAST requires at most 2^{K(p_k)+1} f(n) steps — compare [56, p. 504 ff]. That is, SIMPLE and FAST share the same order of time complexity (ignoring SIMPLE's "micro-steps" for storage organization), but FAST's constant factor tends to be better.
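For comparison, a sketch of FAST's phase structure; the full definition is given in Subsection 6.2, and the sketch assumes, as in Levin search, that PHASE i executes 2^{i−l(p)} instructions of every program prefix p with l(p) ≤ i.

    def fast_phases(programs, max_phase):
        """Sketch of FAST: in PHASE i, run 2^(i - l(p)) instructions of each program p
        with l(p) <= i. `programs` maps a program bitstring to a generator of outputs."""
        progress = {}
        for i in range(1, max_phase + 1):                    # PHASE i
            for p, gen in programs.items():
                if len(p) > i:
                    continue
                for _ in range(2 ** (i - len(p))):           # p's instruction budget in PHASE i
                    out = next(gen, None)
                    if out is not None:
                        progress[p] = out
        return progress

    def counter():
        """Mock program: its t-th instruction extends the output to t bits
        (only the length is reported)."""
        t = 0
        while True:
            t += 1
            yield t

    programs = {"0": counter(), "110": counter()}
    print(fast_phases(programs, max_phase=8))                # {'0': 255, '110': 63}
    # the shorter program receives roughly 2^2 = 4 times as many instructions
    # (their lengths differ by 2 bits), mirroring FAST's 2^(-l(p)) weighting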
Note that an observer A evolving in one of the universes computed by FAST might
decide to build a machine that simulates all possible computable universes using FAST,
and so on, recursively. Interestingly, this will not necessarily cause a dramatic exponential
slowdown: if the n-th discrete time step of A’s universe (compare Example 1.1) is computable
within O(n) time then A’s simulations can be as fast as the “original” simulation, save for
a constant factor. In this sense a “Great Programmer” [72] who writes a program that runs
all possible universes would not be superior to certain nested Great Programmers inhabiting
his universes.
To summarize: the effort required for computing all computable objects simultaneously
does not prevent FAST from computing each object essentially as quickly as its fastest
algorithm. No other dovetailer can have a better order of computational complexity. This
suggests a notion of describability that is much more restricted yet perhaps much more
natural than the one used in the earlier sections on description size-based complexity.
Lemma 6.1 With countable time and space requirements, FAST computes all S-describable
strings.
To see this, recall that FAST will output any S-describable string as fast as its fastest algorithm, save for a constant factor. Those x with polynomial time bounds on the computation of x_n (e.g., O(n^{37})) are S-describable, but most x ∈ B♯ are not, as obvious from Cantor's insight [23].
The prefixes x_n of all x ∈ B♯, even of those that are not S-describable, are computed within at most O(n2^n) steps, at least as quickly as by ALPHABET. The latter, however, never is faster than that, while FAST often is. Now consider infinite strings x whose fastest individual finite program needs even more than O(n2^n) time steps to output x_n and nothing but x_n, such as Chaitin's Ω (or the even worse z from Theorem 3.3) — recall that the time for computing Ω_n grows faster than any recursive function of n [28]. We observe that this result is irrelevant for FAST, which will output Ω_n within O(n2^n) steps, but only because it also outputs many other strings besides Ω_n — there is still no fast way of identifying Ω_n among all the outputs. Ω is not S-describable because it is not generated any more quickly than uncountably many other infinite and incompressible strings, which are not S-describable either.
We observe that

µ^M(x) = \lim_{i→∞} \sum_{p →_i x} 2^{-l(p)},   (45)

otherwise µ^M(x) would be recursive. Therefore we might argue that the use of prior µ^M is essentially equivalent to using a probabilistic version of FAST which randomly selects a phase according to a distribution assigning zero probability to any phase with recursively computable number. Since the time and space consumed by PHASE i is at least O(2^i), we are approaching uncountable resources as i goes to infinity. From any reasonable computational perspective, however, the probability of a phase consuming more than countable resources clearly should be zero. This motivates the next subsection.
6.5 Speed Prior S and Algorithm GUESS
A resource-oriented point of view suggests the following postulate.
Postulate 6.1 The cumulative prior probability measure of all x incomputable within time
t by the most efficient way of computing everything should be inversely proportional to t.
Since the most efficient way of computing all x is embodied by FAST, and since each
phase of FAST consumes roughly twice the time and space resources of the previous phase,
the cumulative prior probability of each finite phase should be roughly half the one of the
previous phase; zero probability should be assigned to infinitely resource-consuming phases.
Postulate 6.1 therefore suggests the following definition.
Since x ∈ B* is first computed in PHASE Kt(x) within 2^{Kt(x)+1} steps, we may rewrite:

S(x) = 2^{-Kt(x)} \sum_{i=1}^{∞} 2^{-i} S_{Kt(x)+i-1}(x) ≤ 2^{-Kt(x)}   (47)
Algorithm GUESS:
1. Toss an unbiased coin until heads is up; let i denote the number of required trials; set t := 2^i.
2. If the number of steps executed so far exceeds t then exit. Execute
one step; if it is a request for an input bit, toss the coin to determine
the bit, and set t := t/2.
3. Go to 2.
In the spirit of FAST, algorithm GUESS makes twice the computation time half as likely,
and splits remaining time in half whenever a new bit is requested, to assign equal runtime
to the two resulting sets of possible program continuations. Note that the expected runtime of GUESS is unbounded since \sum_i 2^{-i} 2^i does not converge. Expected runtime is countable, however, and expected space is of the order of expected time, due to numerous short algorithms producing a constant number of output bits per constant time interval.
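A direct Python rendering of Algorithm GUESS; the program being run is mocked up as a plain sequence of actions ("step" or "need bit"), which is a stand-in for executing an arbitrary self-delimiting program.

    import random

    def guess(actions, rng=random):
        """Algorithm GUESS (sketch): draw t = 2^i with i the number of coin tosses until
        heads; run the program, halving the remaining budget t at every input-bit request."""
        i = 1
        while rng.random() < 0.5:                  # 1. toss a coin until heads is up
            i += 1
        t = float(2 ** i)
        steps, bits = 0, []
        for action in actions:
            if steps >= t:                         # 2. budget exhausted: exit
                break
            steps += 1                             #    execute one step
            if action == "need bit":
                bits.append(rng.randint(0, 1))     #    toss the coin for the requested bit...
                t /= 2                             #    ...and split the remaining time in half
        return steps, bits

    program = ["step", "step", "need bit", "step", "need bit"] + ["step"] * 100
    print(guess(program))                          # (number of executed steps, random input bits)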
Assuming our universe is sampled according to GUESS implemented on some machine,
note that the true distribution is not essentially different from the estimated one based on
our own, possibly different machine.
6.6 Speed Prior-Based Inductive Inference
Given S, as we observe an initial segment x ∈ B ∗ of some string, which is the most likely
continuation? Consider x's finite continuations xy, y ∈ B*. According to Bayes (compare Equation (15)),

S(xy | x) = \frac{S(x | xy) S(xy)}{S(x)} = \frac{S(xy)}{S(x)},   (48)
where S(z_2 | z_1) is the measure of z_2, given z_1. Having observed x we will predict those y that maximize S(xy | x). Which are those? In what follows, we will confirm the intuition that for n → ∞ the only probable continuations of x_n are those with fast programs. The sheer number of "slowly" computable strings cannot balance the speed advantage of "more quickly" computable strings with equal beginnings.
Definition 6.4 (p →^{<k} x etc.) Write p →^{<k} x if finite program p (p → x) computes x within less than k steps, and p →^{<k}_i x if it does so within PHASE i of FAST. Similarly for p →^{≤k} x and p →^{≤k}_i x (at most k steps), p →^{=k} x (exactly k steps), p →^{≥k} x (at least k steps), and p →^{>k} x (more than k steps).
Proof. Since no program that requires at least g(n) steps for producing x_n can compute x_n in a phase with number < \log g(n), we have

Q(x, g, f) ≤ \lim_{n→∞} \frac{\sum_{i=1}^{∞} 2^{-\log g(n) - i} \sum_{p →^{≥g(n)}_{i+\log g(n)} x_n} 2^{-l(p)}}{\sum_{i=1}^{∞} 2^{-\log f(n) - i} \sum_{p →^{=f(n)}_{i} x_n} 2^{-l(p)}} ≤ \lim_{n→∞} \frac{f(n) \sum_{p → x_n} 2^{-l(p)}}{g(n) \sum_{p →^{=f(n)} x_n} 2^{-l(p)}} ≤ \lim_{n→∞} \frac{f(n)}{g(n)} \cdot \frac{1}{2^{-l(p_x)}} = 0.

Here we have used the Kraft inequality [51] to obtain a rough upper bound for the numerator: when no p is prefix of another one, then \sum_p 2^{-l(p)} ≤ 1. ✷
Hence, if we know a rather fast finite program p_x for x, then Theorem 6.1 allows for predicting: if we observe some x_n (n sufficiently large) then it is very unlikely that it was produced by an x-computing algorithm much slower than p_x.
Among the fastest algorithms for x is FAST itself, which is at least as fast as p_x, save for a constant factor. It outputs x_n after O(2^{Kt(x_n)}) steps. Therefore Theorem 6.1 tells us:
Corollary 6.1 Let x ∈ B∞ be S-describable. For n → ∞, with probability 1 the continuation of x_n is computable within O(2^{Kt(x_n)}) steps.
Given observation x with l(x) → ∞, we predict a continuation y with minimal Kt(xy).
Example 6.1 Consider Example 1.2 and Equation (1). According to the weak anthropic
principle, the conditional probability of a particular observer finding herself in one of the
universes compatible with her existence equals 1. Given S, we predict a universe with
minimal Kt. Short futures are more likely than long ones: the probability that the universe's history so far will extend beyond the one computable in the current phase of FAST (that is, that it will be prolonged into the next phase) is at most 50%. Infinite futures have measure
zero.
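The arithmetic behind this prediction, sketched under the assumption stated above that each further PHASE of FAST carries at most half the remaining cumulative measure while roughly doubling the available runtime:

    P(history survives n further phases) ≤ (1/2)^n = 2^{-n};
    runtime available after n further phases ≈ 2^n × (runtime consumed so far);
    hence P(the universe lasts 2^n times longer than it has lasted so far) ≤ 2^{-n},

which is the figure quoted in the introduction (Section 7.5).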
certain IQ tests, however, the answer “250” will not yield maximal score, because it does not
seem to be the “simplest” answer consistent with the data (compare [73]). And physicists
and others favor “simple” explanations of observations.
Roughly forty years ago Solomonoff set out to provide a theoretical justification of this quest for simplicity [82]. He and others have made substantial progress over the past decades.
In particular, technical problems of Solomonoff’s original approach were partly overcome by
Levin [54] who introduced self-delimiting programs, m and µM mentioned above, as well
as several theorems relating probabilities to complexities — see also Chaitin’s and Gács’
independent papers on prefix complexity and m [35, 27]. Solomonoff’s work on inductive
inference helped to inspire less general yet practically more feasible principles of minimum
description length [95, 66, 44] as well as time-bounded restrictions of Kolmogorov complexity,
e.g., [42, 2, 96, 56], as well as the concept of “logical depth” of x, the runtime of the shortest
program of x [8].
Equation (15) makes predictions of the entire future, given the past. This seems to be
the most general approach. Solomonoff [83] focuses just on the next bit in a sequence. Al-
though this provokes surprisingly nontrivial problems associated with translating the bitwise
approach to alphabets other than the binary one — only recently Hutter managed to do this
[48] — it is sufficient for obtaining essential insights [83].
Given an observed bitstring x, Solomonoff assumes the data are drawn according to a recursive measure µ; that is, there is an MTM program that reads x ∈ B* and computes µ(x) and halts. He estimates the probability of the next bit (assuming there will be one), using the fact that the enumerable µ^M dominates the less general recursive measures:

µ^M(x) ≥ c_µ µ(x),

where c_µ is a constant depending on µ but not on x. Compare [56, p. 282 ff]. Solomonoff showed that the µ^M-probability of a particular continuation converges towards µ as the observation size goes to infinity [83]. Hutter recently extended his results by showing that the
number of prediction errors made by universal Solomonoff prediction is essentially bounded
by the number of errors made by any other recursive prediction scheme, including the optimal
scheme based on the true distribution µ [47]. Hutter also extended Solomonoff’s passive
universal induction framework to the case of agents actively interacting with an unknown
environment [49].
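A toy sketch of mixture-based next-bit prediction in this spirit, with a small hand-picked hypothesis class standing in for the enumerable mixture (the real µ^M is of course not computable); the hypotheses and weights are invented for illustration.

    def predict_next_bit(x, hypotheses, prior):
        """Bayes-mixture prediction of the next bit being '1': the ratio of the mixture's
        measure of x1 to its measure of x (compare the conditioning in Equation (1))."""
        def mix(s):
            return sum(prior[name] * mu(s) for name, mu in hypotheses.items())
        return mix(x + "1") / mix(x)

    hypotheses = {                                       # each mu is a measure on prefixes
        "all ones":    lambda s: 1.0 if set(s) <= {"1"} else 0.0,                # 111111...
        "alternating": lambda s: 1.0 if s == ("01" * len(s))[:len(s)] else 0.0,  # 010101...
        "fair coin":   lambda s: 0.5 ** len(s),                                  # coin flips
    }
    prior = {"all ones": 1 / 3, "alternating": 1 / 3, "fair coin": 1 / 3}

    for x in ["1", "111", "1111111"]:
        print(x, round(predict_next_bit(x, hypotheses, prior), 4))
    # the predicted probability of another '1' approaches 1 as the observed run of 1s grows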
A previous paper on computable universes [72, Section: Are we Run by a Short Algo-
rithm?] applied the theory of inductive inference to entire universe histories, and predicted
that simple universes are more likely; that is, observers are likely to find themselves in a
simple universe compatible with their existence (compare everything mailing list archive [30],
messages dated 21 Oct and 25 Oct 1999: http://www.escribe.com/science/theory/m1284.html
and m1312.html). There are two ways in which one could criticize this approach. One sug-
gests it is too general, the other suggests it is too restrictive.
1. Recursive priors too general? µ^M(x) is not recursively computable, hence there is no general, practically feasible algorithm to generate optimal predictions. This suggests looking at more restrictive priors, in particular S, which will receive additional motivation further below.
2. Recursive priors too restricted? If we want to explain the entire universe, then
the assumption of a recursive P on the possible universes may even be insufficient. In
particular, although our own universe seems to obey simple rules — a discretized version of
Schrödinger’s wave function could be implemented by a simple recursive algorithm — the
apparently noisy fluctuations that we observe on top of the simple laws might be due to a
pseudorandom generator (PRG) subroutine whose output is describable, even enumerable,
but not recursive — compare Example 2.1.
In particular, the fact that nonrecursive priors may not allow for recursive bounds on
the time necessary to compute the initial history of some universe does not necessarily rule
them out. Each describable initial history is potentially relevant, as there is an infinite
computation during which it remains stable for all but finitely many steps. This suggests
looking at more general priors such as µE, P E, P G, which will be done next, before we come
back to the speed prior S.
exponentially fast.
Hence, the relatively mild assumption that the probability distribution from which our
universe is drawn is cumulatively enumerable provides a theoretical justification of the pre-
diction that the most likely continuations of our universe are computable by short EOM
algorithms. However, given P E , Occam’s razor (e.g., [11]) is only partially justified because
the sum of the probabilities of the most complex xy does not vanish:
\lim_{n\to\infty} \sum_{xy \in B^\sharp : K^E(xy) > n} P^E(xy) > 0.
To see this, compare Def. 4.12 and the subsequent paragraph on program continua. There
would be a nonvanishing chance for an observer to end up in one of the maximally complex
universes compatible with his existence, although only universes with finite descriptions have
nonvanishing individual probability.
We will conclude this subsection by addressing the issue of falsifiability. If P E or µE
were responsible for the pseudorandom aspects of our universe (compare Example 2.1),
then this might indeed be effectively undetectable in principle, because some approximable
and enumerable patterns cannot be proven to be nonrandom in recursively bounded time.
Therefore the results above may be of interest mainly from a philosophical point of view, not
from a practical one: yes, universes computable by short EOM algorithms are much more
likely indeed, but even if we inhabit one then we may not be able to find its short algorithm.
unlikely; much more likely are those histories where our lives are deterministically computed
by a short algorithm, where the algorithmic entropy (compare [98]) of the universe does
not increase over time, because a finite program conveying a finite amount of information is
responsible for everything, and where concepts such as “free will” are just an illusion in a
certain sense. Nevertheless, there may not be any effective way of proving or falsifying this.
can conclude that the Great Programmer’s resources are sufficient to compute at least one
instance of A. What A does not know, however, is the current phase of FAST, or whether
the Great Programmer is interested in or aware of A, or whether A is just an accidental
by-product of some Great Programmer’s search for something else, etc.
Here is where a resource-oriented bias comes in naturally. It seems to make sense for
A to assume that the Great Programmer is also bound by the limits of computability, that
infinitely late phases of FAST consuming uncountable resources are infinitely unlikely, that
any Great Programmer’s a priori probability of investing computational resources into some
search problem tends to decrease with growing search costs, and that the prior probability
of anything whose computation requires more than O(n) resources by the optimal method
is indeed inversely proportional to n. This immediately leads to the speed prior S.
Believing in S, A could use Theorem 6.1 to predict the future (or “postdict” unknown
aspects of the past) by assigning highest probability to those S-describable futures (or pasts)
that are (a) consistent with A’s experiences and (b) computable by short and fast algo-
rithms. The appropriate simplicity measure minimized by this resource-oriented version of
Occam’s razor is the Levin complexity Kt.
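To make this ordering concrete, here is a minimal toy sketch (in Python) of how a Levin-style cost of the form |p| + log t could be used to rank candidate explanations of the history observed so far. It is not the GUESS algorithm of Section 6.5; the interpreter producing the candidate triples is assumed to exist and is not shown, and all names below are illustrative only.

import math

def levin_cost(program_bits: str, runtime_steps: int) -> float:
    # Levin-style cost: description length in bits plus log2 of runtime.
    # (Toy version of the Kt-flavoured trade-off between "short" and "fast".)
    return len(program_bits) + math.log2(max(runtime_steps, 1))

def rank_explanations(candidates, observed: str):
    # candidates: (program_bits, runtime_steps, produced_output) triples, e.g.
    # produced by some hypothetical bounded interpreter (not shown here).
    # Keep only programs whose output is consistent with the observed history,
    # then prefer those that are short AND fast, i.e. have small Levin-style cost.
    consistent = [c for c in candidates if c[2].startswith(observed)]
    return sorted(consistent, key=lambda c: levin_cost(c[0], c[1]))

if __name__ == "__main__":
    # Hypothetical candidates (encoding, steps used, output produced).
    candidates = [
        ("0101", 8, "10101010101010"),        # short and fast, consistent
        ("01", 10**6, "10101010101010"),      # shorter still, but very slow
        ("011011010", 12, "10101011111000"),  # fast, but inconsistent with data
    ]
    for prog, steps, _ in rank_explanations(candidates, "1010101010"):
        print(prog, steps, round(levin_cost(prog, steps), 2))

The only point of the sketch is that, among programs consistent with the data, short and fast ones come first.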
When exactly will a particular neutron decay into a proton, an electron and an antineutrino?
Is the moment of its death correlated with other events in our universe? Conventional wisdom
rejects this idea and suggests that beta decay is a source of true randomness. According
to S, however, this cannot be the case. Never-ending true randomness is neither formally
describable (Def. 2.5) nor S-describable (Def. 6.1); its computation would not be possible
using countable computational steps.
This encourages a re-examination of beta decay or other types of particle decay: given
S, a very simple and fast but maybe not quite trivial PRG should be responsible for the
decay pattern of possibly widely separated neutrons. (If the PRG were too trivial and too
obvious then maybe the resulting universe would be too simple to permit evolution of our
type of consciousness, thus being ruled out by the weak anthropic principle.) Perhaps the
main reason for the current absence of empirical evidence in this vein is that nobody has
systematically looked for it yet.
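As a purely illustrative example of how little code such a “very simple and fast but maybe not quite trivial PRG” could take, here is a toy xorshift-style generator in Python; the particular rule and seed are arbitrary, and no claim is made that actual decay statistics look like this.

def toy_decay_bits(seed: int, n: int):
    # A tiny xorshift-style pseudorandom generator: a program of a few dozen
    # symbols already yields bit sequences that look random to casual
    # inspection.  Purely illustrative; no claim that beta decay works this way.
    state = seed & 0xFFFFFFFF
    bits = []
    for _ in range(n):
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        bits.append(state & 1)  # pretend: 1 = "a decay observed in this interval"
    return bits

if __name__ == "__main__":
    sample = toy_decay_bits(seed=20001220, n=64)
    print("".join(map(str, sample)), " ones:", sum(sample))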
Everett’s many worlds hypothesis [33] essentially states: whenever our universe’s quantum
mechanics based on Schrödinger’s equation allows for alternative “collapses of the wave
function,” all are made and the world splits into separate universes. The previous paper
[72] already pointed out that from our algorithmic point of view there are no real splits —
there are just a bunch of different algorithms which yield identical results for some time,
until they start computing different outputs corresponding to different possible observations
in different universes. According to P G , P E , µE , µM , S, however, most of these alternative
continuations are much less likely than others.
In particular, the outcomes of experiments involving entangled states, such as the obser-
vations of the correlated spins of initially close but soon distant particles, are currently
widely assumed to be random. Given S, however, whenever there are several possible con-
tinuations of our universe corresponding to different wave function collapses, and all are
compatible with whatever it is we call our consciousness, we are more likely to end up in one
computable by a short and fast algorithm. A re-examination of split experiment data might
reveal unexpected, nonobvious, nonlocal algorithmic regularity due to a PRG.
This prediction runs against current mainstream trends in physics, with the possible
exception of hidden variable theory, e.g., [7, 12, 90].
Given S, the probability that the history of the universe so far will reach into the next phase
of FAST is at most 1/2 — compare Example 6.1. Does that mean there is a 50 % chance that
our universe will get at least twice as old as it is now? Not necessarily, if the computation
of its state at the n-th time step (local time) requires more than O(n) time.
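A sketch of the arithmetic behind these bounds, reading the resource-oriented postulate of Section 6 (cumulative prior probability 1/t for everything not computable by FAST within t steps) as an equality: if more than t steps have already been needed, then

\[
P(\text{more than } 2^n t \text{ steps needed} \mid \text{more than } t \text{ steps needed})
= \frac{1/(2^n t)}{1/t} = 2^{-n},
\]

which for n = 1 gives the bound of 1/2 for surviving into the next, roughly twice as expensive phase of FAST, and for general n the bound 2^{-n} quoted further below.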
As long as there is no compelling contrarian evidence, however, a reasonable guess would
be that our universe is indeed among the fastest ones with O(1) output bits per constant time
interval consumed by algorithm FAST. It may even be “locally” computable through simple
simulated processors, each interacting with only a few neighbouring processors, assuming that
the pseudorandom aspects of our universe do not require any more global communication
between spatio-temporally separated parts than the well-known physical laws. Note that the
fastest universe evolutions include those representable as sequences of substrings of constant
length l, where each substring stands for the universe’s discretized state at a certain discrete
time step and is computable from the previous substring in O(l) time (compare Example
1.1). However, the fastest universes also include those whose representations of successive
discrete time steps do grow over time and where more and more time is spent on their
computation. The expansion of certain computable universes actually requires this.
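A minimal sketch (in Python) of what “locally computable in O(l) time per step” can mean for a constant-length state string: each new symbol depends only on a few neighbours. The particular update rule (Wolfram’s elementary rule 110) is an arbitrary illustration and is not singled out by anything in this paper.

def step(state: str) -> str:
    # One update of a length-l state string in O(l) time: each new cell depends
    # only on its three neighbours (toy locality), here via Wolfram's rule 110.
    rule = 110
    l = len(state)
    out = []
    for i in range(l):
        left, mid, right = state[(i - 1) % l], state[i], state[(i + 1) % l]
        idx = (int(left) << 2) | (int(mid) << 1) | int(right)
        out.append(str((rule >> idx) & 1))
    return "".join(out)

if __name__ == "__main__":
    s = "0" * 31 + "1" + "0" * 32  # constant-length discretized "universe state"
    for _ in range(8):
        print(s)
        s = step(s)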
In any case, the probability that ours will last 2^n times longer than it has lasted so far
is at most 2^{-n} (except, of course, when its early states are for some reason much harder to
compute than later ones and we are still in an early state). This prediction also differs from
those of current mainstream physics (compare [40] though), but obviously is not verifiable.
list archive: http://www.escribe.com/science/theory/m1284.html and m1312.html, as well as
recent papers by Standish and Soklakov [86, 81], and see Calude and Meyerstein [22] for a
somewhat contrarian view.
The current paper introduces simplicity measures more dominant than the traditional
ones [50, 82, 83, 26, 100, 52, 54, 35, 27, 36, 77, 28, 37, 56], and provides a more general, more
technical, and more detailed account, incorporating several novel theoretical results based
on generalizations of Kolmogorov complexity and algorithmic probability. In particular,
it stretches the notions of computability and constructivism to the limits, by considering
not only MTM-based traditional computability but also less restrictive GTM-based and
EOM-based describability, and proves several relevant “Occam’s razor theorems.” Unlike
the previous paper [72] it also analyzes fundamental time constraints on the computation of
everything, and derives predictions based on these restrictions.
Rather than pursuing the computability-oriented path laid out in [72], Tegmark recently
suggested what at first glance seems to be an alternative ensemble of possible universes based
on a (somewhat vaguely defined) set of “self-consistent mathematical structures” [89], thus
going beyond his earlier, less general work [88] on physical constants and Everett’s many
world variants [33] of our own particular universe — compare also Marchal’s and Bostrom’s
theses [60, 15]. It is not quite clear whether Tegmark would like to include universes that
are not formally describable according to Def. 2.5. It is well-known, however, that for any
set of mathematical axioms there is a program that lists all provable theorems in order of
the lengths of their shortest proofs encoded as bitstrings. Since the TM that computes all
bitstrings outputs all these proofs for all possible sets of axioms, Tegmark’s view [89] seems in
a certain sense encompassed by the algorithmic approach [72]. On the other hand, there are
many formal axiomatic systems powerful enough to encode all computations of all possible
TMs, e.g., number theory. In this sense the algorithmic approach is encompassed by number
theory.
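A sketch of the enumeration just invoked, assuming a hypothetical, always-halting proof checker is_valid_proof(axioms, candidate) and a decoder decode_theorem(candidate) for the chosen formal system; both names are illustrative. The generator lists provable theorems in order of the lengths of their shortest proofs by running through all bitstrings in length-lexicographic order.

from itertools import count, product

def enumerate_theorems(axioms, is_valid_proof, decode_theorem):
    # Yield all provable theorems of the given axiom system in order of the
    # lengths of their shortest proofs, by enumerating candidate proofs as
    # bitstrings in length-lexicographic order.  is_valid_proof and
    # decode_theorem are assumed to be recursive (always halting) procedures.
    seen = set()
    for n in count(1):
        for bits in product("01", repeat=n):
            candidate = "".join(bits)
            if is_valid_proof(axioms, candidate):
                theorem = decode_theorem(candidate)
                if theorem not in seen:
                    seen.add(theorem)
                    yield theorem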
The algorithmic approach, however, offers several conceptual advantages: (1) It provides
the appropriate framework for issues of information-theoretic complexity traditionally ig-
nored in pure mathematics, and imposes natural complexity-based orderings on the possible
universes and subsets thereof. (2) It taps into a rich source of theoretical insights on com-
putable probability distributions relevant for establishing priors on possible universes. Such
priors are needed for making probabilistic predictions concerning our own particular uni-
verse. Although Tegmark suggests that “... all mathematical structures are a priori given
equal statistical weight” [89](p. 27), there is no way of assigning equal nonvanishing prob-
ability to all (infinitely many) mathematical structures. Hence we really need something
like the complexity-based weightings discussed in [72] and especially the paper at hand. (3)
The algorithmic approach is the obvious framework for questions of temporal complexity
such as those discussed in this paper, e.g., “what is the most efficient way of simulating all
universes?”
8 Concluding Remarks
There is an entire spectrum of ways of ordering the describable things, spanned by two
extreme ways of doing it. Sections 2-5 analyzed one of the extremes, based on minimal
constructive description size on generalized Turing Machines more expressive than those
considered in previous work on Kolmogorov complexity and algorithmic probability and
inductive inference. Section 6 discussed the other extreme based on the fastest way of
computing all computable things.
Between the two extremes we find methods for ordering describable things by (a) their
minimal nonhalting enumerable descriptions (also discussed in Sections 2-5), (b) their min-
imal halting or monotonic descriptions (this is the traditional theory of Kolmogorov com-
plexity or algorithmic information), and (c) the polynomial-time complexity-oriented criteria
that are the subject of most work in theoretical computer science. Theorems in Sections 2-6 reveal
some of the structure of the computable and enumerable and constructively describable
things.
Both extremes of the spectrum as well as some of the intermediate points yield natural
prior distributions on describable objects. The approximable and cumulatively enumerable
description size-based priors (Sections 4-5) suggest algorithmic theories of everything (TOEs)
partially justifying Occam’s razor in a way more general than previous approaches: given
several explanations of your universe, those requiring few bits of information are much more
probable than those requiring many bits (Section 7). However, there may not be an effective
procedure for discovering a compact and complete explanation even if there is one.
The resource-optimal, less dominant, yet arguably more plausible extreme (Section 6)
leads to an algorithmic TOE without excessive temporal complexity: no calculation of any
universe computable in countable time needs to suffer from an essential slow-down due to
simultaneous computation of all the others. Based on the rather weak assumption that the
world’s creator is constrained by certain limits of computability, and considering that all of
us may be just accidental by-products of His optimally efficient search for a solution to some
computational problem, the resulting “speed prior” predicts that a fast and short algorithm
is responsible not only for the apparently simple laws of physics but even for what most
physicists currently classify as noise or randomness (Section 7). It may not be all that hard
to find; we should search for it.
Much of this paper highlights differences between countable and uncountable sets. It is
argued (Sections 6, 7) that things such as uncountable time and space and incomputable
probabilities actually should not play a role in explaining the world, for lack of evidence
that they are really necessary. Some may feel tempted to counter this line of reasoning by
pointing out that for centuries physicists have calculated with continua of real numbers, most
of them incomputable. Even quantum physicists who are ready to give up the assumption
of a continuous universe usually do take for granted the existence of continuous probability
distributions on their discrete universes, and Stephen Hawking explicitly said: “Although
there have been suggestions that space-time may have a discrete structure I see no reason
to abandon the continuum theories that have been so successful.” Note, however, that all
physicists in fact have only manipulated discrete symbols, thus generating finite, describable
proofs of their results derived from enumerable axioms. That real numbers really exist
in a way transcending the finite symbol strings used by everybody may be a figment of
imagination — compare Brouwer’s constructive mathematics [17, 6] and the Löwenheim-
Skolem Theorem [58, 79] which implies that any first order theory with an uncountable
model such as the real numbers also has a countable model. As Kronecker put it: “Die
ganze Zahl schuf der liebe Gott, alles Übrige ist Menschenwerk” (“God created the integers,
all else is the work of man” [20]). Kronecker greeted with scepticism Cantor’s celebrated
insight [23] about real numbers, mathematical objects Kronecker believed did not even exist.
A good reason to study algorithmic, noncontinuous, discrete TOEs is that they are the
simplest ones compatible with everything we know, in the sense that universes that cannot
even be described formally are obviously less simple than others. In particular, the speed
prior-based algorithmic TOE (Sections 6, 7) neither requires an uncountable ensemble of
universes (not even describable in the sense of Def. 6.1), nor infinitely many bits to specify
nondescribable real-valued probabilities or nondescribable infinite random sequences. One
may believe in the validity of algorithmic TOEs until (a) there is evidence against them, e.g.,
someone shows that our own universe is not formally describable and would not be possible
without, say, the existence of incomputable numbers, or (b) someone comes up with an even
simpler explanation of everything. But what could that possibly be?
Philosophers tend to create theories inspired by recent scientific developments. For in-
stance, Heisenberg’s uncertainty principle and Gödel’s incompleteness theorem greatly in-
fluenced modern philosophy. Are algorithmic TOEs and the “Great Programmer Religion”
[72] just another reaction to recent developments, some in hindsight obvious by-product of
the advent of good virtual reality? Will they soon become obsolete, as so many previous
philosophies? We find it hard to imagine so, even without a boost to be expected for al-
gorithmic TOEs in case someone should indeed discover a simple subroutine responsible for
certain physical events hitherto believed to be irregular. After all, algorithmic theories of
the describable do encompass everything we will ever be able to talk and write about. Other
things are simply beyond description.
Acknowledgments
At the age of 17 my brother Christof declared that the universe is a mathematical structure
inhabited by observers who are mathematical substructures (private communication, Mu-
nich, 1981). As he went on to become a theoretical physicist, discussions with him about
the relation between superstrings and bitstrings became a source of inspiration for writing
both the earlier paper [72] and the present one, both based on computational complexity
theory, which seems to provide the natural setting for his more physics-oriented ideas (pri-
vate communication, Munich 1981-86; Pasadena 1987-93; Princeton 1994-96; Berne/Geneva
1997–; compare his notion of “mathscape” [70]). Furthermore, Christof’s 1997 remarks on
similarities and differences between Feynman path integrals and “the sum of all computable
universes” and his resulting dissatisfaction with the lack of a discussion of temporal aspects
in [72] triggered Section 6 on temporal complexity.
I am grateful to Ray Solomonoff for his helpful comments on earlier work [73] making use
of the probabilistic algorithm of Section 6, and to Paul Vitányi for useful information rele-
vant to the proof of Theorem 4.2. I would also like to express my thanks to numerous posters
and authors (e.g., [60, 89, 62, 15, 86, 32, 43, 59, 69]) of the everything mailing list created by
Wei Dai [30] (everything-list@eskimo.com). Some of the text above actually derives from my
replies to certain postings (see archive at http://www.escribe.com/science/theory/). Finally
I am indebted to Marcus Hutter and Sepp Hochreiter for independently checking the theo-
rems, to Leonora Bianchi, Wei Dai, Doug Eck, Felix Gers, Ivo Kwee, Carlo Lepori, Leonid
Levin, Monaldo Mastrolilli, Andrea Rizzoli, Nicol N. Schraudolph, and Marco Zaffalon, for
comments on (parts of) earlier drafts or of Version 1.0 [75], to Wilfried Brauer and Karl
Svozil for relevant pointers and references, and especially to Marcus Hutter for the proof of
Theorem 4.2.
References
[1] L. Adleman. Time, space, and randomness. Technical Report MIT/LCS/79/TM-131,
Laboratory for Computer Science, MIT, 1979.
[4] J. D. Barrow and F. J. Tipler. The Anthropic Cosmological Principle. Clarendon Press,
Oxford, 1986.
[7] J. S. Bell. On the problem of hidden variables in quantum mechanics. Rev. Mod. Phys.,
38:447–452, 1966.
[8] C. H. Bennett. Logical depth and physical complexity. In The Universal Turing
Machine: A Half Century Survey, volume 1, pages 227–258. Oxford University Press,
Oxford and Kammerer & Unverzagt, Hamburg, 1988.
[10] L. Blum, M. Shub, and S. Smale. On a theory of computation and complexity over the
real numbers: NP completeness, recursive functions, and universal machines. Bulletin
AMS, 21, 1989.
[12] D. Bohm and B. J. Hiley. The Undivided Universe. Routledge, New York, N.Y., 1993.
[18] M. S. Burgin. Inductive Turing machines. Notices of the Academy of Sciences of the
USSR (translated from Russian), 270(6):1289–1293, 1991.
[20] F. Cajori. History of mathematics (2nd edition). Macmillan, New York, 1919.
[21] C. S. Calude. Chaitin Ω numbers, Solovay machines and Gödel incompleteness. The-
oretical Computer Science, 2000. In press.
[22] C. S. Calude and F. W. Meyerstein. Is the universe lawful? Chaos, Solitons & Fractals,
10(6):1075–1084, 1999.
[23] G. Cantor. Über eine Eigenschaft des Inbegriffes aller reellen algebraischen Zahlen.
Crelle’s Journal für Mathematik, 77:258–263, 1874.
[24] B. Carter. Large number coincidences and the anthropic principle in cosmology. In
M. S. Longair, editor, Proceedings of the IAU Symposium 63, pages 291–298. Reidel,
Dordrecht, 1974.
[25] T. Chadzelek and G. Hotz. Analytic machines. Theoretical Computer Science, 219:151–
167, 1999.
[26] G.J. Chaitin. On the length of programs for computing finite binary sequences: sta-
tistical considerations. Journal of the ACM, 16:145–159, 1969. Submitted 1965.
[27] G.J. Chaitin. A theory of program size formally identical to information theory. Journal
of the ACM, 22:329–340, 1975.
[28] G.J. Chaitin. Algorithmic Information Theory. Cambridge University Press, Cam-
bridge, 1987.
[31] D. Deutsch. The Fabric of Reality. Allen Lane, New York, NY, 1997.
[32] M. J. Donald. Quantum theory and the brain. Proceedings of the Royal Society
(London) Series A, 427:43–93, 1990.
[33] H. Everett III. ‘Relative State’ formulation of quantum mechanics. Reviews of Modern
Physics, 29:454–462, 1957.
[35] P. Gács. On the symmetry of algorithmic information. Soviet Math. Dokl., 15:1477–
1480, 1974.
[36] P. Gács. On the relation between descriptional complexity and algorithmic probability.
Theoretical Computer Science, 22:71–93, 1983.
[38] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und ver-
wandter Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931.
[40] J. R. Gott, III. Implications of the Copernican principle for our future prospects.
Nature, 363:315–319, 1993.
[42] J. Hartmanis. Generalized Kolmogorov complexity and the structure of feasible com-
putations. In Proc. 24th IEEE Symposium on Foundations of Computer Science, pages
439–445, 1983.
[44] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
[45] G. Hotz, G. Vierke, and B. Schieffer. Analytic machines. Technical Report TR95-
025, Electronic Colloquium on Computational Complexity, 1995. http://www.eccc.uni-
trier.de/eccc/.
[46] D. A. Huffman. A method for construction of minimum-redundancy codes. Proceedings
IRE, 40:1098–1101, 1952.
[47] M. Hutter. New error bounds for Solomonoff prediction. Journal of Computer and
System Science, in press, 2000. http://xxx.lanl.gov/abs/cs.AI/9912008.
[48] M. Hutter. Optimality of universal prediction for general loss and alphabet. Technical
report, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Manno (Lugano), CH,
December 2000. In progress.
[51] L. G. Kraft. A device for quantizing, grouping, and coding amplitude modulated
pulses. M.Sc. Thesis, Dept. of Electrical Engineering, MIT, Cambridge, Mass., 1949.
[52] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416,
1973.
[54] L. A. Levin. Laws of information (nongrowth) and aspects of the foundation of prob-
ability theory. Problems of Information Transmission, 10(3):206–210, 1974.
[60] B. Marchal. Calculabilité, Physique et Cognition. PhD thesis, L’Université des Sciences
et Technologies De Lilles, 1998.
[61] P. Martin-Löf. The definition of random sequences. Information and Control, 9:602–
619, 1966.
[62] H. Moravec. Robot. Wiley Interscience, 1999.
[64] R. Penrose. The Emperor’s New Mind. Oxford University Press, 1989.
[65] H. Putnam. Trial and error predicates and the solution to a problem of Mostowski.
Journal of Symbolic Logic, 30(1):49–57, 1965.
[66] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–
1100, 1986.
[67] H. Rogers, Jr. Theory of Recursive Functions and Effective Computability. McGraw-
Hill, New York, 1967.
[68] Otto E. Rössler. Endophysics. The World as an Interface. World Scientific, Singapore,
1998. With a foreword by Peter Weibel.
[69] H. Ruhl. The use of complexity to solve dilemmas in physics. Continually modi-
fied draft, Dec 2000. http://www.connix.com/∼hjr/model01.html (nonpermanent con-
tents).
[71] J. Schmidhuber. Discovering solutions with low Kolmogorov complexity and high gen-
eralization capability. In A. Prieditis and S. Russell, editors, Machine Learning: Pro-
ceedings of the Twelfth International Conference, pages 488–496. Morgan Kaufmann
Publishers, San Francisco, CA, 1995.
[72] J. Schmidhuber. A computer scientist’s view of life, the universe, and everything. In
C. Freksa, M. Jantzen, and R. Valk, editors, Foundations of Computer Science: Po-
tential - Theory - Cognition, volume 1337, pages 201–208. Lecture Notes in Computer
Science, Springer, Berlin, 1997. Submitted 1996.
[73] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high
generalization capability. Neural Networks, 10(5):857–873, 1997.
[76] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story al-
gorithm, adaptive Levin search, and incremental self-improvement. Machine Learning,
28:105–130, 1997.
[77] C. P. Schnorr. Process complexity and effective random tests. Journal of Computer
Systems Science, 7:376–388, 1973.
[78] C. E. Shannon. A mathematical theory of communication (parts I and II). Bell System
Technical Journal, XXVII:379–423, 1948.
[80] T. Slaman. Randomness and recursive enumerability. Technical report, Univ. of Cali-
fornia, Berkeley, 1999. Preprint, http://www.math.berkeley.edu/∼slaman.
[81] A. N. Soklakov. Occam’s razor as a formal basis for a physical theory. Technical Report
math-ph/0009007, Univ. London, Dept. Math., Royal Holloway, Egham, Surrey TW20
OEX, September 2000. http://arXiv.org/abs/math-ph/0009007.
[82] R.J. Solomonoff. A formal theory of inductive inference. Part I. Information and
Control, 7:1–22, 1964.
[85] R. M. Solovay. A version of Ω for which ZFC can not predict a single bit. In C. S. Calude
and G. Păun, editors, Finite Versus Infinite. Contributions to an Eternal Dilemma,
pages 323–334. Springer, London, 2000.
[86] R. Standish. Why Occam’s razor? Technical report, High Performance Computing
Support Unit, Univ. New South Wales, Sydney, 2052, Australia, July 2000.
[88] M. Tegmark. Does the universe in fact contain almost no information? Foundations
of Physics Letters, 9(1):25–42, 1996.
[89] M. Tegmark. Is “the theory of everything” merely the ultimate ensemble theory?
Annals of Physics, 270:1–51, 1998. Submitted 1996.
[91] T. Toffoli. The role of the observer in uniform systems. In G. Klir, editor, Applied
General Systems Research. Plenum Press, New York, London, 1978.
[92] A. M. Turing. On computable numbers, with an application to the Entscheidungsprob-
lem. Proceedings of the London Mathematical Society, Series 2, 41:230–267, 1936.
[97] M.A. Wiering and J. Schmidhuber. Solving POMDPs with Levin search and EIRA.
In L. Saitta, editor, Machine Learning: Proceedings of the Thirteenth International
Conference, pages 534–542. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[98] W. H. Zurek. Algorithmic randomness and physical entropy I. Phys. Rev., A40:4731–
4751, 1989.
[99] W. H. Zurek. Decoherence and the transition from quantum to classical. Physics
Today, 44(10):36–44, 1991.
[100] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the algorithmic
concepts of information and randomness. Russian Math. Surveys, 25(6):83–124, 1970.