Fredrik Rømming
Department of Informatics
The Faculty of Mathematics and Natural Sciences
UNIVERSITY OF OSLO
Spring 2021
Learning to Reason
Fredrik Rømming
© 2021 Fredrik Rømming
Learning to Reason
http://www.duo.uio.no/
Contents
1 Introduction 1
5 A General Framework for ML-Guided ATP 47
5.1 From Prolog to Python . . . . . . . . . . . . . . . . . . . . . . 47
5.1.1 Logic + Control . . . . . . . . . . . . . . . . . . . . . . 47
5.1.2 Object-orientation . . . . . . . . . . . . . . . . . . . . 48
5.1.3 Rapid Prototyping . . . . . . . . . . . . . . . . . . . . 50
5.2 Incorporating Learning . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Proof Search as an MDP . . . . . . . . . . . . . . . . . 51
5.2.2 Model Module . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 1
Introduction
The two sentences above the horizontal line are premises which together imply the sentence below the line, the logical conclusion.
Definition 1.0.2 (Induction). Inductive inference (not to be confused
with "mathematical induction") makes steps from premises, which are
observations providing some evidence, to conclusions saying something
general about the phenomenon underlying the particular observations.
An example of inductive inference is the generalization:
The proportion Q of the sample has attribute A
──────────────────────────────────────────────
Therefore, the proportion Q of the population has attribute A
The sentence above the line states some specific observations, while the
sentence below the line says something general about all possible potential
observations.
In other words, deduction is concluding the particular from the general,
while induction is concluding the general from the particular.
Systems automating the process of reasoning play a central role in software
and hardware verification, and have applications in any area where logical
reasoning is required, such as in artificial general intelligence, mathematics
and philosophy. Although the overall goal of automated reasoning is
to mechanize different forms of reasoning, the term has largely been
identified with deductive reasoning as practiced in mathematics and
formal logic. In this thesis, we explore the use of automated inductive
reasoning (machine learning) to guide automated deductive reasoning
(theorem proving).
In Chapters 2, 3, and 4 we introduce the fields of automated theorem proving and machine learning, and review what has been done to combine the two. In Chapter 5 we propose a high-level framework for machine learning guided automated theorem proving, and in Chapters 6 and 7 we describe how we developed a machine learning guided automated theorem prover within this framework.
Chapter 2
To reach conclusions through deduction in this way, we need to use logical
systems whose syntax and semantics are expressive enough and whose
deductive systems are strong enough to prove theorems corresponding to
the statements in our domain of discourse. What formal framework to use
is highly dependent on the nature of the problem/theorem domain. There
is a wide range of imaginable domains in which we would like to perform
automated deduction. Table 2.1 shows some example domains and their
most common formalizations [1]:
Domain                                               Common formalizations
General-purpose theorem proving and problem solving  First-order logic, Simple type theory
Program verification                                 Typed first-order logic, Higher-order logic
Distributed and concurrent systems                   Modal logic, Temporal logic
Program synthesis                                    Intuitionistic logic
Hardware verification                                Higher-order logic, Propositional logic
Logic programming                                    Horn logic
Constraint satisfaction                              Propositional logic, Satisfiability Modulo Theories
Computational metaphysics                            Higher-order modal logic

Table 2.1: Example domains and their most common formalizations
We will now have a look at the two most well-known categories of logical
frameworks: propositional logic and first-order logic. For a more detailed
overview and proofs of associated theorems, see [2].
2.2.1 Syntax
To formalize the notion of a proposition in a formal language for
propositional logic, we build propositional formulae using two atomic
types of symbols, logical connectives and propositional variables:
Definition 2.2.2 (Logical connectives). The logical connectives are the symbols ¬ (negation), ∧ (conjunction), ∨ (disjunction), and → (implication).
Definition 2.2.3 (Propositional variables). A propositional variable is a
symbol which is not a logical connective, usually a lower case letter p, q, r,
etc.
Using these we can build a formal language which will be a set of well-
formed propositional formulae recursively defined as follows:
Definition 2.2.4 (Propositional formulae). A propositional variable is an
atomic propositional formula. Any atomic propositional formula is a
propositional formula. If F and G are propositional formulae then the
following are also propositional formulae:
- ¬F
- (F ∧ G)
- (F ∨ G)
- (F → G)
Some examples of well-formed formulas are:
(¬p → (q → p)) and ((p → q) ∧ (q → p))
It is common to impose a precedence, ≻, over the connectives to decrease the number of parentheses needed to write a formula. If • ≻ ◦, then p • q ◦ r means ((p • q) ◦ r) and not (p • (q ◦ r)). The standard precedence is: ¬ ≻ ∧ ≻ ∨ ≻ →. This precedence will be assumed throughout the rest of this thesis.
2.2.2 Semantics
To make sense of the formulae we defined in the previous subsection, we
introduce some semantics to interpret them.
Definition 2.2.5 (Truth values and propositional interpretations). An
interpretation, I, is an assignment of truth values, true or false, to each
propositional variable.
We extend this notion of interpretation to formulas of the formal language
described in the last subsection recursively as follows:
Definition 2.2.6 (Interpretation of propositional formulae). If F is an atomic formula then I(F) = true precisely when we have assigned true to F, otherwise I(F) = false. Otherwise if:
- F = ¬G, then I(F) = true if I(G) = false, otherwise I(F) = false.
- F = (G ∧ H), then I(F) = true when both I(G) = true and I(H) = true, otherwise I(F) = false.
- F = (G ∨ H), then I(F) = false when both I(G) = false and I(H) = false, otherwise I(F) = true.
- F = (G → H), then I(F) = false when both I(G) = true and I(H) = false, otherwise I(F) = true.
This way, the symbols ¬, ∧, ∨ and → match our intuitive notions of the natural language logical connectives "not", "and", "or", and "implies" respectively. To make this more concrete, let p be a propositional variable representing the proposition "It rains" and q be a propositional variable representing the proposition "I go outside". The formula p → ¬q then reads: "If it is raining then I do not go outside". Notice how this is true precisely when the following sentence is true: "It is not the case that it rains and I go outside". In propositional syntax that is: ¬(p ∧ q).
Definition 2.2.7 (Semantic equivalence). Two formulae F and G are semantically equivalent if I(F) = I(G) for all possible interpretations I. This is usually written F ≡ G.
Definition 2.2.8 (Semantic consequence). Formula F is a logical consequence of the set of formulae Γ if, whenever I(G) = true for all G ∈ Γ simultaneously, then I(F) = true, for all interpretations I. This is usually written Γ ⊨ F.
Definition 2.2.9 (Tautology). Formula F is a tautology if it is true for every
interpretation.
In propositional logic, the formulae F and G are semantically equivalent if
for all possible assignments of truth values to their propositional variables,
if F is true, then G is true, and if F is false, then G is false. We can show this
using what are known as truth tables (T = true and F = false):
p   q   ¬q   p → ¬q   p ∧ q   ¬(p ∧ q)
T   T   F    F        T       F
T   F   T    T        F       T
F   T   F    T        F       T
F   F   T    T        F       T
2.2.3 Deductive systems
We previously defined a semantic notion of what logical consequence is
by semantic consequence. We now introduce a purely syntactic way to
determine whether conclusions follow from premises without considering
interpretations.
Definition 2.2.10 (Syntactic consequence). Given a formal system as defined in Definition 2.1.1, a formula F is a syntactic consequence of a set of formulae Γ, written Γ ⊢ F, if F can be inferred from Γ according to the inference rules of the formal system.
Definition 2.2.11 (Formal Proof). Given a formal system, a formal proof is
a sequence of formulas, each of which is an axiom of the formal system or
is a syntactic consequence of the preceding formulae in the sequence. The
last formula in a formal proof is called a theorem.
This notion of a formal proof will not mean much unless our logical system
has two core properties:
Definition 2.2.12 (Semantical soundness). A logical system is called sound,
iff any theorem is a tautology of the system.
Definition 2.2.13 (Semantical completeness). A logical system is called
complete, iff any tautology is a theorem of the system.
Hence we want to connect the semantic notion of truth with the syntactic
notion of proof, by defining a sound and complete logical system.
There are many formal systems which when combined with the standard
semantics for propositional logic create a sound and complete logical
system. First we will introduce some standard syntax for all deductive
systems (proof calculi) to make it easier to compare different proof calculi:
Definition 2.2.14 (General calculus and proof syntax). A proof calculus consists of:
- Axioms, written:

  ───── (Axiom name)
    w

- Rules, written:

  w1   w2   · · ·   wn
  ───────────────────── (Rule name)
           w
Table 2.3 shows an example of a proof calculus for propositional logic, Łukasiewicz's calculus:

  ───────────────────────────────────────── (axiom 1)
  F → (G → F)

  ───────────────────────────────────────── (axiom 2)
  ((F → (G → H)) → ((F → G) → (F → H)))

  ───────────────────────────────────────── (axiom 3)
  ((¬F → ¬G) → (G → F))

  F    F → G
  ─────────── (modus ponens)
       G
Notice that Łukasiewicz’ calculus only uses the ¬ and → connectives. For
this calculus to be sound and complete we must first translate all formulae
containing ∧ and ∨ as follows:
- Any formula of the form F ∨ G becomes ¬ F → G
- Any formula of the form F ∧ G becomes ¬( F → ¬ G )
The proof is omitted here, but one can verify that Łukasiewicz’ calculus
(Table 2.3) with standard propositional semantics is sound and complete.
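As a small illustration, this translation can be written as a recursive rewriter over the tuple encoding from the earlier sketch (again only an illustrative sketch, not part of the thesis implementation):

def eliminate_and_or(f):
    # Rewrite a formula so that it only uses 'not' and 'imp':
    #   F or G   becomes  (not F) -> G
    #   F and G  becomes  not (F -> not G)
    if isinstance(f, str):
        return f
    if f[0] == 'not':
        return ('not', eliminate_and_or(f[1]))
    F, G = eliminate_and_or(f[1]), eliminate_and_or(f[2])
    if f[0] == 'or':
        return ('imp', ('not', F), G)
    if f[0] == 'and':
        return ('not', ('imp', F, ('not', G)))
    return ('imp', F, G)  # the remaining case is 'imp' itself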
2.3.1 Syntax
We build first-order formal languages as follows:
Definition 2.3.1 (First-order symbols). Any first-order formal language is
built using the following disjoint sets of symbols, the logical symbols:
- Logical connectives: ¬, ∧, ∨, →.
- Quantifiers: ∀ (universal), ∃ (existential)
- Variables: v1 , v2 , v3 , .. (countably infinitely many)
and the non-logical symbols:
- Constants: c1 , c2 , c3 , ...
- Functions: f 1 , f 2 , f 3 , ...
- Predicates: P1 , P2 , P3 , ...
Definition 2.3.2 (First-order signature). The non-logical symbols of a first-order language make up what is called the signature of the language. A signature is defined as a tuple σ = (Scon, Sfun, Spre, ar) such that:
- Scon is the set of constant symbols
- Sfun is the set of function symbols
- Spre is the set of predicate symbols
- ar : Sfun ∪ Spre → ℕ assigns a natural number, called the arity, to every function and predicate symbol
Definition 2.3.3 (First-order terms). We define the set of terms inductively
as the smallest set such that:
- Every variable and constant is a term denoted by t1 , t2 , t3 , ...
- If f is a function of arity ar ( f ) = n and t1 , ..., tn are terms, then so is
f (t1 , ..., tn )
Definition 2.3.4 (Atomic first-order formula). We define the set of atomic
first-order formulae inductively as the smallest set such that for each
predicate symbol P:
- If P is a predicate symbol of arity ar ( P) = n = 0, then P is an atomic
formula
- If P is a predicate symbol of arity ar ( P) = n > 0 and t1 , ..., tn are terms
then P(t1 , ..., tn ) is an atomic formula
Definition 2.3.5 (First-order formulae). We define the set of formulae
inductively as the smallest set such that:
- All atomic formulae are formulae
- If F and G are formulae, then ¬ F, ( F ∧ G ), ( F ∨ G ), and ( F → G ) are
formulae
- If F is a formula and v is a variable, then ∀vF and ∃vF are formulae
All occurrences of the variable v in a formula F are said to be "bound" in the formulae ∀vF and ∃vF, and "inside the scope" of the respective quantifier in each formula.
As with propositional logic, there is a standard precedence for first-order formulae, defined in order to reduce parentheses. The precedence is: ∀ ≈ ∃ ≻ ¬ ≻ ∧ ≻ ∨ ≻ →, where "≈" means same precedence and "≻" is as before.
2.3.2 Semantics
Again, as with propositional logic, we want to make sense of the formulae
we just defined. We are no longer dealing with propositions as atomic
entities, instead we will be interested in the truth values of the first-order
analogs of propositions, predicates:
Definition 2.3.6 (First-order predicates). A predicate is an atomic formula
containing one or more placeholders and has a truth value of true or false
depending on the value of the placeholders.
Definition 2.3.7 (First-order interpretation). A first-order interpretation of
a first-order language with signature σ is a tuple I = ( D, ν) such that:
- D is a set of elements, called the domain of discourse
- ν is an interpretation function, such that:
- For every constant symbol c ∈ σ, ν(c) ∈ D
- For every function symbol f ∈ σ of arity ar ( f ) = n, ν( f ) : D n →
D
- For every relation symbol P ∈ σ of arity ar ( P) = n, ν( P) ⊆ D n
Definition 2.3.8 (First-order substitution and closed formulae). A substitu-
tion τ is a total mapping from the set of variables to terms. If F is a formula,
we write τ (F) to mean the result of substituting each free (not in scope of
quantifier ∀v) occurrence of all variables v in the domain of τ with τ (v) in
F. We also write F[v/t] to mean the substitution of each free (not in scope
of quantifier ∀v) occurrence of the variable v with the term t in F.
Definition 2.3.9 (Closed formula). A first-order formula F is said to be
closed if it contains no free variables. We say F is closed under the
substitution τ if τ ( F ) contains no free variables.
Definition 2.3.10 (Interpretation of first-order terms). Given a first-order interpretation, I = (D, ν), we interpret a term t recursively as follows, if:
- t is a constant, then ν(t) ∈ D is given directly by the interpretation function
- t is a function application f(t1, ..., tm) with ar(f) = m, then ν(f(t1, ..., tm)) = ν(f)(ν(t1), ..., ν(tm))
Definition 2.3.11 (Interpretation of closed first-order formulae). Given a
first-order interpretation, I, we can define what it means for a formula F to
be true in the interpretation I = ( D, ν), recursively as follows:
- A closed atomic formula F = P(t1 , ..., tn ) is true in I if
(ν(t1 ), ..., ν(tn )) ∈ ν( P).
- Otherwise if:
- F = ¬ G, then F is true in I if G is not true in I
- F = G ∧ H is true in I if both G and H are true in I
- F = G ∨ H is true in I if not both G and H are false in I
- F = G → H is true in I if it is not the case that G is true in I and H is false in I
- F = ∀xG is true in I if G[x/a] is true in I for all a ∈ D
- F = ∃xG is true in I if G[x/a] is true in I for at least one a ∈ D
Definition 2.3.12 (First-order satisfiability and validity). If F is true in an
interpretation I, we say that I satisfies F. If there is an interpretation I which
satisfies F, then we say F is satisfiable, otherwise we say it is unsatisfiable.
If a formula is true in every interpretation, then we say it is valid, otherwise
we say it is invalid.
  ───────────────────────────────────────── (axiom 1)
  F → (G → F)

  ───────────────────────────────────────── (axiom 2)
  ((F → (G → H)) → ((F → G) → (F → H)))

  ───────────────────────────────────────── (axiom 3)
  ((¬F → ¬G) → (G → F))

  ───────────────────────────────────────── (axiom 4)
  ∀xF → F[x/t]

  ───────────────────────────────────────── (axiom 5)
  ∀x(F → G) → (∀xF → ∀xG)

  ───────────────────────────────────────── (axiom 6)
  F → ∀xF   (where x does not occur free in F)
Notice that, just like Łukasiewicz's calculus for propositional logic, the simplified Hilbert calculus uses a limited number of connectives. For this calculus to be sound and complete we must first translate all formulae containing ∧, ∨, and ∃ as follows:
- Any formula of the form F ∨ G becomes ¬ F → G
- Any formula of the form F ∧ G becomes ¬( F → ¬ G )
- Any formula of the form ∃xF becomes ¬∀x¬F
Combining Hilbert’s deductive system with the first-order syntax and se-
mantics we have defined so far, we have a semantically sound and com-
plete logical system expanding the propositional system with quantifica-
tion. First-order logic is a key logic for a multitude of reasons:
Theorem 2.3.1 (Compactness). A set of formulae, Γ, has a satisfying interpretation if and only if every finite subset, Γ′ ⊆ Γ, has a satisfying interpretation.
Theorem 2.3.2 ((downward) Löwenheim–Skolem). If a set of formulas, Γ, has
a satisfying interpretation, it is satisfiable in an interpretation with a countable
domain.
Theorem 2.3.3 (Lindström's theorem). First-order logic is the most expressive logic having both the compactness property and the (downward) Löwenheim–Skolem property.
Definition 2.3.13 (Syntactic completeness). A formal system is syntactically
complete if for each formula, F, in the formula language, either F or ¬ F is
provable.
Theorem 2.3.4 (Undecidability of first-order logic (Church–Turing)). First-order logic is not syntactically complete. Moreover, there is no algorithm that always terminates and decides whether an arbitrary first-order formula is a theorem or not. We say that first-order logic is "undecidable".
First-order logic can express a large part of mathematics and encode any computable problem. Therefore, Lindström's theorem and the semantical soundness and completeness of first-order logic make it an interesting system for automating proof search. Unfortunately, as the Church–Turing theorem states, first-order logic is undecidable.
However, since first-order logic is semantically complete, there is an
algorithm that given a valid first-order proposition terminates with a proof
of this formula, but given an invalid formula might not terminate. In
this case we say first-order logic validity is semi-decidable (or Turing
recognizable).
2.4.1 Natural Deduction and Sequent Calculus
The calculi presented earlier in Tables 2.3 and 2.4 are examples of so called
Hilbert-style systems. These came bundled with the early exploration of
modern logic in the late-19th/early-20th century by Frege, Hilbert, and
Russell and are quite compact and elegant; only using the modus ponens
rule and a few axioms. However, these calculi are quite far from the
way humans normally reason. Wishing to construct a formalism closer
to human reasoning, Gerhard Gentzen proposed the natural deduction
system in 1934 [6]. In this system Gentzen addresses the fact that
humans do not normally start proofs from axioms, but instead make
claims under some assumptions and then analyze the assumptions and
claims separately and combine them later in the proof. Assumptions are
combined into claims by what are known as introduction rules, while
assumptions are split into their parts by elimination rules. However, this two-way search makes natural deduction ill-suited for automation. Gentzen later realized that he could convert this into a one-way procedure by making assumptions local and encoding the derivability relations of natural deduction in what he called sequents. A sequent, written Γ ⊢ Δ, consists of an antecedent Γ and a succedent Δ, where Γ and Δ are sets of formulae. Instead of formulae, the words of the calculus are sequents.
  ──────────── (axiom)
  Γ, A ⊢ A, Δ

  Γ, A, B ⊢ Δ                       Γ ⊢ A, Δ    Γ ⊢ B, Δ
  ───────────── (∧-left)            ───────────────────── (∧-right)
  Γ, A ∧ B ⊢ Δ                      Γ ⊢ A ∧ B, Δ

  Γ, A ⊢ Δ    Γ, B ⊢ Δ              Γ ⊢ A, B, Δ
  ───────────────────── (∨-left)    ───────────── (∨-right)
  Γ, A ∨ B ⊢ Δ                      Γ ⊢ A ∨ B, Δ

  Γ ⊢ A, Δ    Γ, B ⊢ Δ              Γ, A ⊢ B, Δ
  ───────────────────── (→-left)    ───────────── (→-right)
  Γ, A → B ⊢ Δ                      Γ ⊢ A → B, Δ

  Γ ⊢ A, Δ                          Γ, A ⊢ Δ
  ────────── (¬-left)               ────────── (¬-right)
  Γ, ¬A ⊢ Δ                         Γ ⊢ ¬A, Δ

  Γ, A[x\ti], ∀xA ⊢ Δ               Γ ⊢ A[x\tj], Δ
  ──────────────────── (∀-left)     ──────────────── (∀-right)
  Γ, ∀xA ⊢ Δ                        Γ ⊢ ∀xA, Δ

  Γ, A[x\tj] ⊢ Δ                    Γ ⊢ ∃xA, A[x\ti], Δ
  ─────────────── (∃-left)          ──────────────────── (∃-right)
  Γ, ∃xA ⊢ Δ                        Γ ⊢ ∃xA, Δ

Table 2.5: The sequent calculus LK
Theorem 2.4.1 (Correctness and Completeness of LK). A formula F is valid iff there is a proof for "⊢ F".
When applying the ∀ and ∃ rules of the Table 2.5 LK calculus above, we choose arbitrary terms ti and tj for x. We would like to do something more intelligent than randomly guessing the terms we substitute in, so that we are more likely to end up with two syntactically equivalent formulae in the antecedent and succedent; that is, more likely to end up in an axiom.
Definition 2.4.1 (Unification and most general unifier). We say that two terms s and t are unifiable if there exists a substitution σ such that σ(s) is syntactically equivalent to σ(t), written σ(s) = σ(t). For example, s = f(x) and t = f(a) are unifiable using the substitution σ = {x\a}. We call σ a "unifier" for s and t. A unifier σ1 is a "most general unifier" (mgu) for s and t if:
- σ1 is a unifier for s and t
- for every unifier σ2 of s and t, there exists a substitution τ such that σ2 = τ ∘ σ1
Keeping track of an mgu σ and using so-called "free variables" allows us to delay substitution decisions, instantiating terms only when absolutely necessary, and thus reduces the search space. The calculi we look at next all use unification and are more machine-oriented than the previous calculi. Most real-world ATP systems for first-order logic use unification.
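For concreteness, here is a compact sketch of Robinson-style unification over a tuple term encoding of our own (variables are strings starting with an upper-case letter, compound terms are tuples of a function symbol followed by its arguments); a production implementation would also need an occurs check, which is omitted here:

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def walk(t, subst):
    # Follow variable bindings until an unbound term is reached
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(s, t, subst=None):
    # Returns an m.g.u. extending subst, or None if s and t do not unify
    subst = dict(subst or {})
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(a):
            subst[a] = b              # no occurs check (sketch only)
        elif is_var(b):
            subst[b] = a
        elif isinstance(a, tuple) and isinstance(b, tuple) \
                and a[0] == b[0] and len(a) == len(b):
            stack.extend(zip(a[1:], b[1:]))
        else:
            return None               # clash of function symbols/arities
    return subst

# f(X) and f(a) are unifiable with the m.g.u. {X\a}:
assert unify(('f', 'X'), ('f', 'a')) == {'X': 'a'}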
A formula is in negation normal form (NNF) if ¬ only occurs directly in front of atomic formulae and → does not occur at all. Any formula can be translated into NNF by repeatedly applying the following rewrites:
- Any formula of the form A → B becomes ¬A ∨ B
- Any formula of the form ¬( A ∧ B) becomes (¬ A ∨ ¬ B)
- Any formula of the form ¬( A ∨ B) becomes (¬ A ∧ ¬ B)
- Any formula of the form ¬∀ xA becomes ∃ x ¬ A
- Any formula of the form ¬∃ xA becomes ∀ x ¬ A
- Any formula of the form ¬¬ A becomes A
Definition 2.4.3 (Skolemization). Through Skolemization, any first-order formula F′ can be translated into an equisatisfiable formula F″ which does not contain ∃. Equisatisfiable means that F′ is satisfiable if and only if F″ is satisfiable; however, they may be satisfied by different variable instantiations, so they are not necessarily semantically equivalent. A formula F′ in NNF is Skolemized as follows:
- Any formula of the form ∀y1, . . . , ∀yn ∃xA becomes ∀y1, . . . , ∀yn A[x\f(y1, . . . , yn)], where f is a new function symbol.
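A sketch of this step over the tuple encoding used earlier, with quantifiers written ('forall', v, F) and ('exists', v, F) (our own convention; the input is assumed to be in NNF with uniquely named bound variables):

import itertools

fresh = ('sk%d' % n for n in itertools.count())

def substitute(f, var, term):
    # Replace every occurrence of var in f by term
    # (safe because bound variables are assumed uniquely named)
    if f == var:
        return term
    if isinstance(f, tuple):
        return tuple(substitute(g, var, term) for g in f)
    return f

def skolemize(f, universals=()):
    # universals: variables bound by enclosing universal quantifiers
    if isinstance(f, tuple) and f[0] == 'forall':
        return ('forall', f[1], skolemize(f[2], universals + (f[1],)))
    if isinstance(f, tuple) and f[0] == 'exists':
        # The existential variable becomes a new Skolem function
        # applied to the enclosing universally quantified variables
        skolem_term = (next(fresh),) + universals
        return skolemize(substitute(f[2], f[1], skolem_term), universals)
    if isinstance(f, tuple) and f[0] in ('and', 'or'):
        return (f[0], skolemize(f[1], universals), skolemize(f[2], universals))
    return f  # a literal

# forall Y exists X P(X, Y)  becomes  forall Y P(sk0(Y), Y):
print(skolemize(('forall', 'Y', ('exists', 'X', ('P', 'X', 'Y')))))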
Combining the notions of one-sided sequents, unification, and Skolemized negation normal form, we can build the fairly compact, sound and complete block tableau calculus shown in Table 2.6, with the ⊢ sign omitted.

  ──────────── (axiom)
  L1, ¬L2, Δ
  (such that σ is an m.g.u. of L1 and L2)

  F, G, Δ
  ────────── (α-rule)
  F ∧ G, Δ

  F, Δ    G, Δ
  ───────────── (β-rule)
  F ∨ G, Δ

  F[x\x*], ∀xF, Δ
  ──────────────── (γ-rule)
  ∀xF, Δ
  (x* is a new variable)
Figure 2.1: Two block tableau proof trees for ¬p ∧ q, (p ∨ r) ∨ ¬q
In the left tree in Fig. 2.1 we write ∆ explicitly in each node. We have
reached an axiom when a node contains both A and ¬ A. In the right tree,
we only write the new formulae generated by the rule application in each
node. An axiom is then reached when both A and ¬ A are on the same
branch.
  ──────────── (axiom)
  {}, M, Path

  C2, M, {}
  ────────── (start)
  ε, M, ε
  (C2 is a copy of a clause in M)

  C, M, Path ∪ {L2}
  ───────────────────────── (reduction)
  C ∪ {L1}, M, Path ∪ {L2}
  (σ(L1) and σ(L2) are complementary)

  C2 \ {L2}, M, Path ∪ {L1}    C, M, Path
  ───────────────────────────────────────── (extension)
  C ∪ {L1}, M, Path
  (C2 is a copy of a clause in M containing L2, with σ(L1) and σ(L2) complementary)
  ────────────────────── (axiom)
  C1, . . . , {}, . . . , Cn

  C1, . . . , Ci ∪ {L1}, . . . , Cj ∪ {L2}, . . . , Cn, σ(Ci ∪ Cj)
  ───────────────────────────────────────────────────── (resolution)
  C1, . . . , Ci ∪ {L1}, . . . , Cj ∪ {L2}, . . . , Cn
  (where σ is an m.g.u. of L1 and L2)

  C1, . . . , Ci ∪ {L1, . . . , Lm}, . . . , Cn, σ(Ci ∪ {L1})
  ───────────────────────────────────────────────────── (factorization)
  C1, . . . , Ci ∪ {L1, . . . , Lm}, . . . , Cn
  (where σ is an m.g.u. of L1, . . . , Lm)
Chapter 3
3.2 ML Paradigms
Machine learning approaches have traditionally been divided into three
categories, depending on the nature of the training data available and the
use case of the algorithm.
Figure 3.1: Schematic comparison of the three machine learning paradigms
Supervised learning
While all machine learning algorithms learn input/output behaviour by
optimizing a function, called the objective function or loss function, and
are in this sense "supervised" by that goal, "supervised" in the case of
supervised learning means that the training data consists of both example
inputs and their desired outputs. The goal is to learn a function that best
generalizes this input → output behaviour. Most of the examples we will
see in the next subsection are in this category.
Definition 3.2.1 (Regression). Supervised learning task where the output
space is continuous.
Definition 3.2.2 (Classification). Supervised learning task where the output
space is discrete.
Unsupervised learning
In unsupervised learning, while we are still "supervised" by our goal of
optimizing with respect to an objective function, our training data only
consists of example inputs no example outputs. The goal of unsupervised
algorithms is to discover hidden patterns in inputs and produce output
summarizing found patterns.
Definition 3.2.3 (Dimensionality reduction). Unsupervised learning task
where the output space is continuous.
Definition 3.2.4 (Clustering). Unsupervised learning task where the output
space is discrete.
Reinforcement learning
Reinforcement learning is perhaps the most general of the three paradigms,
and is the category concerned with agents which interact with environ-
ments. In reinforcement learning, training data consists of example inputs,
outputs, and rewards for giving the respective outputs for the respective
inputs. The goal is to learn a function from input to output which maxim-
izes reward.
Definition 3.2.5 (Optimal control/decision-making). Reinforcement learn-
ing task where the output space can be both discrete and continuous (ac-
tions).
Yi = g(Xi) + ei
Figure 3.2: Decision tree training data and decision boundary
We can visualize the splits of the feature space shown in Figure 3.2 as a tree
as shown in Figure 3.3, hence the name "decision tree". f ( Xi ) is the leaf
found by following the correct branch corresponding to the values of the
features of Xi .
Figure 3.3: The decision tree corresponding to the splits of the feature space in Figure 3.2
discussed in detail in the next subsection) or gradient boosting models (usually ensembles of decision trees) [21]. These models are heavily applied
in real-world applications and research since they tend to show good
results on general data.
f(Xi; w) = E(Yi | Xi)

f(Xi; w) = E(Yi | Xi) = μ = h(Xi · w)

h(z) = z

f(Xi; w) = h(Xi · w) = Xi · w
The retrieved model is what is known as "linear regression".

f(Xi) = 1.8x1 + 0.97

h(z) = σ(z) = 1 / (1 + e^(−z))

f(Xi; w) = h(Xi · w) = σ(Xi · w)
Figure 3.5: Logistic regression with two features
We now have two features of the data we use to predict, so every point is described as a vector Xi = (x0, x1, x2) = (1, x1, x2), and f is parameterized by w = (w0, w1, w2), giving f(Xi; w) = σ(Xi · w). The blue line in Figure 3.5 shows where f(Xi) = 0.5. That is, the decision boundary. The decision boundary generalizes to a hyperplane in higher dimensions if we predict using more features.
f(Xi; w) = h(Xi · w)
Figure 3.6: Neuron model
The simplest model based on this concept is called the "Perceptron". This model uses the step function as h:

h(z) = 1 if z > 0,  h(z) = 0 if z ≤ 0
Figure 3.7: Neurons composed in width
a(0) = Xi

a(j) = h(j)(a(j−1) · W(j))

f(Xi; W(1), . . . , W(d)) = a(d)
Composing neurons both in width and depth, we arrive at a family of models called "Artificial Neural Networks" (ANN). In practice, most ANNs have a simpler structure than what is theoretically possible. The same h function is typically used for every inner neuron; it is usually simple and provides a smooth (differentiable) transition as input values change. A commonly used activation function is the rectified linear unit, ReLU(z) = max(0, z).
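The layer recursion above is short enough to write out directly; a minimal NumPy sketch of the forward pass with ReLU between layers (the weights here are random placeholders, not a trained model):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, weights):
    # a(0) = x; a(j) = relu(a(j-1) @ W(j)); identity on the last layer
    a = x
    for W in weights[:-1]:
        a = relu(a @ W)
    return a @ weights[-1]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 8)), rng.normal(size=(8, 1))]
print(forward(rng.normal(size=(1, 3)), weights))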
Figure 3.9: Temporal data (2D). Figure 3.10: Spatial data. Figure 3.11: Arbitrary relational data.
h_v^0 = X_v

h_v^t = q^t( h_v^{t−1}, ⋃_{u ∈ N(v)} f^t( h_v^{t−1}, h_u^{t−1}, e_{u→v} ) )
This is shown graphically for one node in the Figure 3.11 graph in Figure
3.12.
Figure 3.12: Computing the embedding of a target node in the Figure 3.11 graph by aggregating over its neighbours
3.4.1 Markov Decision Processes (MDP)
The environment of an RL problem is usually formalized as a Markov
Decision Process. A Markov decision process is a discrete-time stochastic
control process, which means it provides a mathematical framework for
modeling decision making where outcomes are partly random and partly
controlled by a decision maker. A Markov decision process is a 4-tuple
(S, A, Pa , R a ), where:
- S is a set of states called the "state space",
- A is a set of actions called the "action space",
- Pa (s, s0 ) is the probability that taking action a when in state s at time
t will lead to state s0 at time t + 1. This is called the "transition
probability",
- R a (s, s0 ) is the reward received for transitioning from state s to state
s0 by action a.
Figure 3.13: The agent-environment loop of an MDP: the environment sends a state and a reward to the agent, which responds with an action
The optimal policy π maximizes the expected discounted return

E[ ∑_{t=0}^{∞} γ^t R_{π(s_t)}(s_t, s_{t+1}) ]

where:
- the expectation is taken over s_{t+1} ∼ P_{π(s_t)}(s_t, s_{t+1}),
- π is the optimal policy and π(s_t) is the optimal action taken at time t,
- 0 ≤ γ ≤ 1 is a discount factor motivating taking actions early.
The field of reinforcement learning studies algorithms for finding optimal
π in different scenarios, depending on the nature of the 4 components of
the MDP.
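As a toy illustration, a two-state MDP can be written down as plain dictionaries and the expected discounted return of a fixed policy estimated by sampling (all names and numbers below are made up for the example):

import random

# P[s][a] is a list of (probability, next_state) pairs; R[s][a][s2] a reward
P = {'s0': {'a': [(0.9, 's0'), (0.1, 's1')]}, 's1': {'a': [(1.0, 's1')]}}
R = {'s0': {'a': {'s0': 0.0, 's1': 1.0}}, 's1': {'a': {'s1': 0.0}}}
policy = {'s0': 'a', 's1': 'a'}
gamma = 0.9

def sample_return(s, horizon=100):
    # One sampled trajectory's discounted return, truncated at horizon
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        probs, successors = zip(*P[s][a])
        s2 = random.choices(successors, weights=probs)[0]
        total += discount * R[s][a][s2]
        discount *= gamma
        s = s2
    return total

estimate = sum(sample_return('s0') for _ in range(1000)) / 1000
print(estimate)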
3.4.2 Model-free vs. Model-based RL Algorithms
The term "Model" in "model-free vs. model-based RL", refers to the MDP of
the RL problem. In model-free RL we do not model the MDP. That is, we do
not try to find explicit expressions defining P and R, we rely on sampling
and simulation to estimate the optimal policy. We do not need to know the
inner workings of the MDP. Intuitively, model-based RL is the opposite. In
model-based RL we use the MDP directly to estimate the optimal policy π.
Model-based methods are thought to be more sample efficient than model-
free methods since they are able to explicitly plan ahead and therefore don’t
need to test as many trajectories as model-free methods to find the optimal
policy.
Figure 3.14: A taxonomy of RL algorithms, divided into model-free and model-based methods
The algorithms in Figure 3.14 differ by what they learn and how they use
what they learn to determine a policy. They can learn:
- policies (which action to take from a state)
- action-value functions (quantifies how good it is to take an action
from a state)
- value functions (quantifies how good a state is)
- environment models (how the world works)
When an agent makes actions only according to its learned policy, it has high exploitation and low exploration. It is important to strike a good balance in order to converge to good and/or optimal policies.
Selection
In this step, the algorithm starts in the root node (the current state) of
the search tree and recursively chooses a child until a leaf node (a state
with potential, but unrealized child nodes) is reached. The child node
maximizing equation 3.1 is chosen at each step:
U(i) = w_i/n_i + c · √(ln N_i / n_i)        (3.1)
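Equation 3.1 translates directly into a small helper; a sketch, with the convention (our own) that unvisited children are preferred unconditionally:

import math

def uct(w_i, n_i, N_i, c=math.sqrt(2)):
    # Exploitation term + exploration term from Equation 3.1
    if n_i == 0:
        return math.inf  # try unvisited children first
    return w_i / n_i + c * math.sqrt(math.log(N_i) / n_i)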
Figure 3.15: The selection step in an adversarial setting; each node is labelled with its w_i/n_i counts
Expansion
The algorithm has now arrived at a leaf node with potential, but unrealized, children in the search tree. It chooses one of them, adds it to the tree as a child of the current node, and calls this child node v_i.
Figure 3.16 shows the expansion step coming after the selection step in
Figure 3.15 (still in the adversarial setting).
Figure 3.16: The expansion step; the newly added child starts with counts 0/0
Simulation
The algorithm then completes one rollout/simulation from the state vi . A
rollout consists of choosing moves (usually randomly or semi-randomly)
until some stopping condition is met (game is won/lost, maximum number
of moves reached, etc.).
Figure 3.17 shows the simulation step coming after the expansion step in
Figure 3.16 (still in the adversarial setting).
Figure 3.17: The simulation step; a rollout is played out from the expanded node
Backpropagation
The algorithm now uses the result of the playout to update the values of the nodes on the path from v_i to the root. Visit counts n_i are increased by 1, and the reward (or "goodness") estimates w_i are updated according to the result of the simulation.
Figure 3.18 shows the backpropagation step coming after the simulation step in Figure 3.17 (still in the adversarial setting).
Figure 3.18: The backpropagation step; counts along the path from the new node to the root are updated
This loop is continued for a set number of iterations. In the end, the
algorithm chooses the action from the root node leading to the child with
highest wi .
Since humans have some intuition for how to do the splitting, i.e. finding lemmas that combine to prove the theorem, machines should be able to learn this too.
We somehow need to teach the machine, based purely on the symbols of
the theorem, which rules should be applied to which parts of the theorem
in which order. The success of this proof search guidance will be reliant on
the interplay between two crucial components: the proof calculus and the
machine learning approach.
There are many parts of the general procedure of going from a natural
language statement to a formal proof which are candidates for learning:
1. Autoformalization: Translating a statement in natural language or
LaTeX into a statement in a formal unambiguous language such as
first-order logic.
2. Premise selection from large corpora: Determining which premises/axioms
to add to your formula to make it more effectively provable (or prov-
able at all), from a corpus of premises/axioms.
3. Lemmatization and conjecturing: Determining which already found
derivations to add to your proof search to make it more effectively
provable, from a corpus of derivations.
4. Internal proof guidance: Determining which rule applications to use
during proof search.
We will only consider internal proof guidance and will mainly look at results related to this. In particular, we will look at many results from the highly successful EU-funded project AI4REASON. The project and related work have shown that learning-based guidance can significantly enhance the power of existing ATP systems.
  ──────────── (axiom)
  {}, M, Path

  C2, M, {}
  ────────── (start)
  ε, M, ε
  (C2 is a copy of a clause in M)

  C, M, Path ∪ {L2}
  ───────────────────────── (reduction)
  C ∪ {L1}, M, Path ∪ {L2}
  (σ(L1) and σ(L2) are complementary)

  C2 \ {L2}, M, Path ∪ {L1}    C, M, Path
  ───────────────────────────────────────── (extension)
  C ∪ {L1}, M, Path
  (C2 is a copy of a clause in M containing L2, with σ(L1) and σ(L2) complementary)
Several systems combine the MCTS strategy with strong learners such as gradient boosted trees and GNNs (both of which were introduced in Section 3.3). These systems include rlCoP [43], plCoP [44], and the system described in [45] (we will refer to this as graphCoP). In all these MCTS planning systems, states are the
current state of the proof (the tableau proof tree), actions are inference steps,
and the simulation step is done by making d proof inference steps (actions)
b times. In chess, the simulation step is comprised of playing moves until
the game is finished. The reason we can’t do the same in first-order theorem
proving, that is, make inferences until a proof is found, is because there
isn’t necessarily a proof to be found. The proof search problem is semi-
decidable, and hence the search tree is potentially infinitely deep.
rlCoP, plCoP and graphCoP extend MCTS by adding:
1. learning-based mechanisms for estimating the prior probability of
inferences to lead to a proof (policy),
2. learning-based mechanisms for assigning heuristic value to the proof
state.
1 and 2 are realized by altering the UCT formula (equation 3.1) as follows:
U(i) = w_i/n_i + c · p_i · √(ln N_i / n_i)        (4.1)
In Equation 4.1, p_i is the learned prior probability of going from node i's parent to node i. w_i is the sum of the w_i's of the node's children; if the node has no children, w_i is equal to a learned value of the node, V(v_i). The rest is as in Equation 3.1.
In rlCoP and plCoP, data collection and model training are intertwined in the DAgger meta-learning algorithm [46], sketched below. The policy and value functions are gradient boosted trees (XGBoost [47]), trained at each iteration of DAgger with hyperparameters found by grid search.
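In outline, the DAgger-style interleaving of proving and training looks roughly as follows; prove, extract_examples, and train are placeholder callables standing in for the actual rlCoP/plCoP machinery:

def dagger(problems, prove, extract_examples, train, iterations=3):
    # Iteration 0 runs unguided; later iterations are guided by models
    # trained on all proof search data gathered so far.
    policy_model, value_model = None, None
    policy_data, value_data = [], []
    for _ in range(iterations):
        for problem in problems:
            trace = prove(problem, policy_model, value_model)  # guided MCTS
            p_examples, v_examples = extract_examples(trace)
            policy_data += p_examples
            value_data += v_examples
        policy_model = train(policy_data)   # e.g. XGBoost, grid-searched
        value_model = train(value_data)
    return policy_model, value_model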
learning. rlCoP is given 2000 MCTS iterations per decision. We see that
after three iterations, rlCoP is able to solve about 40% more problems than
unguided leanCoP (mlCoP).
The previous models have all been based on hand-crafted features of proof
states. In general, the more information about the data we are able to
model, the more the learner will be able to learn. One of the strong suits of
neural networks is that they can learn good functions from very general
input spaces to very general output spaces. However, neural networks
can quickly require too many resources to train and perform inference.
So neural network libraries, architecture, hardware needs to be optimized
for this to be feasible in a setting such as ATP where we want to do as
many inferences per second as possible. The last couple of years, all these
three points have seen a lot of progress, inspiring the use of Graph neural
networks for guiding ATP.
In Table 4.3, graphCoP is compared to rlCoP on the Miz40 dataset. This
time, both provers are given 200 MCTS iterations per decision, down from
2000 in Table 4.2. We see that after one iteration, graphCoP is able to prove
about 50% more problems than rlCoP and that the difference is slowly
diminishing.
systems, such as nanoCoP [50, 51] for classical logic, and ileanCoP and
MleanCoP, two of the fastest ATP systems for non-classical logics [52, 53].
  ────────────────────── (axiom)
  C1, . . . , {}, . . . , Cn

  C1, . . . , Ci ∪ {L1}, . . . , Cj ∪ {L2}, . . . , Cn, σ(Ci ∪ Cj)
  ───────────────────────────────────────────────────── (resolution)
  C1, . . . , Ci ∪ {L1}, . . . , Cj ∪ {L2}, . . . , Cn
  (where σ is an m.g.u. of L1 and L2)

  C1, . . . , Ci ∪ {L1, . . . , Lm}, . . . , Cn, σ(Ci ∪ {L1})
  ───────────────────────────────────────────────────── (factorization)
  C1, . . . , Ci ∪ {L1, . . . , Lm}, . . . , Cn
  (where σ is an m.g.u. of L1, . . . , Lm)
These ATP systems maintain two sets of clauses: processed (P) and unprocessed (U). At each loop iteration, a given clause g from U is selected and moved to P, and U is extended with new inferences from g and P. This process continues until a contradiction (the empty clause) is found, U becomes empty, or a resource limit is reached. The search space grows quickly, and selection of the right given clauses is critical [54].
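The loop itself is simple; in sketch form (clauses as frozensets, with select and infer supplied as callables, the former being exactly where learned guidance such as ENIGMA's clause classifier plugs in):

def given_clause_loop(axioms, select, infer, limit=100000):
    processed, unprocessed = set(), set(axioms)
    for _ in range(limit):
        if not unprocessed:
            return 'saturated'              # U empty: no proof exists
        g = select(unprocessed)             # choose the given clause
        unprocessed.remove(g)
        processed.add(g)
        for clause in infer(g, processed):  # e.g. resolution/factorization
            if clause == frozenset():       # empty clause: contradiction
                return 'proof'
            unprocessed.add(clause)
    return 'resource limit reached'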
The ENIGMA system [55] and later iterations [54, 56] incorporate learning
into the E prover, learning which clause is best to process next. This was
done by extracting hand-crafted feature vector representations from the
clauses and feeding them to linear classifier models (SVM and logistic
regression in LIBLINEAR [57]).
ENIGMA-NG [54] improved on this by using more expressive gradient boosted tree and recursive neural network classifiers, combined with fast feature hashing to reduce the number of features the classifiers have to consider and hence speed up inference, since inference will generally be slower with more expressive models.
Table 4.5 shows benchmarks on the MPTP2078 bushy [41] dataset; MPTP2078 contains 2078 problems coming from the MPTP translation [58] of the Mizar Mathematical Library (MML) [59] into first-order logic.
In Tables 4.5 and 4.6, S is an E proof search strategy, and S + M is S with
M as added clause selection guidance.
Table 4.5:
          S      S + Linear   S + GB Trees   S + RNN
  solved  1086   1210         1256           1197

Table 4.6:
          S       S + GB Trees   S + GNN
  solved  14966   24347          23262
In Table 4.7, all provers are run on the MPTP2078 problem set, with E run in auto mode. We see that although TRAIL does better than the leanCoP modifications discussed in Section 4.1, it does not beat the pure E prover, unlike the ENIGMA systems, which trained on E proof traces instead of training from scratch as TRAIL does.
4.3 Interactive Provers and Other Related Problems
Many of the same ideas for guiding rlCoP and related systems have also
been used for internal selection of tactical steps inside proof search for
interactive theorem provers (ITPs) such as HOL4 [62, 63] and Coq [64,
65]. These systems are based on higher-order logic, which is not even semi-decidable and hence poses many new problems that will not be detailed here. However, TacticToe combined with E proves about 70% of the HOL4 library, which has inspired a number of follow-up works from Google [66] and OpenAI [67].
Graph convolutional neural networks have also been used with success to guide search for other combinatorially exploding problems such as mixed integer programming [68].
- Formula embeddings should be invariant to symbol renaming across problems. The following two formulas should have similar, if not equal, encodings: P(y, x) and P(x, y).
Chapter 5
knowledge is used. The main idea of the declarative logic programming
paradigm, and languages such as Prolog, is to have the programmer take
care of the logic component and then the logic programming language
takes care of the control component. As Zombori et al. mention, this is great
for rapid ATP prototyping, in the sense that it is easy to state the rules of a
calculus in Prolog and have it run. However, when our purpose stretches
beyond simply implementing a calculus to also guiding the execution of
the rules, the idea of separating logic and control loses ground. From an
abstraction perspective, having the programmer interface with a Prolog
program to guide its execution goes against the underlying ideas of the
language. In this regard, using an imperative language for the task makes
a lot more sense.
5.1.2 Object-orientation
Figure 5.1 shows a class diagram of our proposed architecture for
implementing theorem provers based on arbitrary proof calculi using
arbitrary proof search strategies. The triangle-shaped relations show
inheritance, while the diamond shaped relations show dependence (white
is aggregation and black is composition).
Figure 5.1: Class diagram of the proposed architecture
This architecture separates the search procedure from the calculus-specific elements of the theorem prover without losing the possibility of using calculus-specific elements during search and without introducing intersystem
communication overhead.
Separation of search and calculus is done, at least implicitly, whenever one
wants to guide a prover in any way, but we argue that the explicitness of
object-orientation and the relative simplicity of this architecture makes it
easier to reason about and compare general ATP systems by boxing the
different components.
The framework is heavily inspired by reinforcement learning environments such as OpenAI Gym [70]. As the authors of this framework say: "RL is very general, encompassing all problems that involve making a sequence of decisions". Theorem proving is such a problem; we have the state of our proof (i.e. the derivation tree) and need to make a decision about which inference rule to apply next (i.e. which leaf of the derivation tree to expand next).
To implement an arbitrary proof calculus in the framework shown in
Figure 5.1, one has to implement the ProofState, ProofAction, and
ProofSearchProblem classes which inherit from the State, Action, and
SearchProblem classes respectively. The ProofState class represents the
state of a proof, e.g. the derivation tree, while the ProofAction class
represents possible inferences, e.g. extension in the connection calculus
shown in Table 2.7. Both of these classes could in the simplest case just be
reduced to tuples, so the only function which needs overriding is the init
function, as shown in Figure 5.2. However, as we will see in Chapter 6, it
can be useful to let these classes be more complex and include options for
more functions or even inner classes.
class State:
# Abstract class defining the states of a search problem
def __init__(self):
# Initialize state variables
pass
class Action:
# Abstract class defining the actions of a search problem
def __init__(self):
# Initialize action variables
pass
Figure 5.2: Abstract Python state and action classes used in search problems
class SearchProblem:
# Abstract class defining the dynamics of a search problem
def __init__(self):
# Define initial state and action spaces
pass
def start(self):
# Returns the start state(s)
pass
def reset(self):
# Reset the search problem
pass
Figure 5.3: Abstract Python problem class for defining proof search dynamics
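As a concrete example of filling in the framework, the connection calculus classes could in the simplest case wrap plain tuples, overriding only the init functions; a sketch (not the actual implementation described in Chapter 6):

class ProofState(State):
    # A connection calculus proof state is the tuple "C, M, Path"
    def __init__(self, clause, matrix, path):
        self.clause, self.matrix, self.path = clause, matrix, path

class ProofAction(Action):
    # An inference: the rule name plus the objects it operates on
    def __init__(self, rule, literal=None, clause=None):
        self.rule, self.literal, self.clause = rule, literal, clause

class ProofSearchProblem(SearchProblem):
    def __init__(self, matrix):
        self.matrix = matrix

    def start(self):
        # The start state "ε, M, ε" of the connection calculus
        return ProofState((), self.matrix, ())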
other syntactic operations for free. This, in addition to the SearchAgent class being implemented, leaves us with something reminiscent of a logic programming environment, where most of what we have to implement are the rules of the system. A big difference is that it is now very natural to also change the SearchAgent class, giving the possibility of rapid prototyping both for logic and control.
The framework discussed so far is object-oriented and can in theory be
implemented in any object-oriented language. The reasons we chose
Python for our implementation, instead of a fast compiled language such
as C++ or Java, are very reminiscent of the points made by Zombori et al.
mentioned in the beginning of Section 5.1:
1. The high level nature of Python allows for rapid prototyping as well
as compact and easy to extend implementations.
2. Python is the lingua franca in the ML community. Hence integration
of machine learning will be natural.
A drawback of the high-level features of Python is that they make the language slow compared to slightly lower-level languages such as C++ and Java (roughly 10x slower according to [71]). For our purposes this is fine, as rapid prototyping is valued higher than speed. Due to the (worse than) exponential time complexity of theorem proving, inference quality is a lot more important than inference quantity. A Python program making good inferences (i.e. inferences leading to a proof) will find proofs quicker than a C++ program making bad inferences 100 times as fast. Hence, from an engineering perspective it is better to spend more time prototyping than implementing. To sum up this point: inference quality ≫ inference quantity. Additionally, Python interoperates well with C, so making the switch later down the road is plausible with minor (but time-consuming) changes.
Figure 5.5: The MDP is already naturally represented in the system design
Figure 5.6: Abstraction levels of the framework: the general search framework (SearchProblem, SearchAgent), the RL setting (State, Action), and the proof calculus specific classes (ProofSearchProblem, ProofState, ProofAction)
5.2.2 Model Module
Urban et al. [39] propose the architecture shown in Figure 5.7 to modularly combine existing theorem provers with learning modules.
The following is mentioned in the same paper: "The slow external advice is currently a clear bottleneck". There are two obvious potential reasons for this: communication overhead and advisor model complexity. Our framework addresses the first by incorporating learning directly into the object-oriented system, simply adding the Model class and the ProofModel subclass as shown in Figure 5.8.
Not only does the framework allow for tight coupling between ML advice and proof search; the modular property of the framework still holds. To incorporate learning, we only need to implement the ProofModel class. We can implement this using any ML library we want (scikit-learn, TensorFlow, PyTorch, etc.); we just need to override the functions shown in Figure 5.9. The SearchAgent can use the post_processed data however it wants in its search strategies, whether it be iterative deepening or MCTS.
Figure 5.8: The framework extended with the Model class and the ProofModel subclass for learning guidance
class Model:
# Abstract class defining the model guiding a search problem
def __init__(self):
# Initialize model (i.e. a PyTorch NN)
pass
Figure 5.9: Abstract Python model class used to guide search problems
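A PyTorch-backed ProofModel might then look as follows. This is only a sketch: we assume, based on the surrounding text, that the Model interface consists of pre_process, predict, and post_process, and encode_as_graph is a hypothetical helper for the encoding described in Chapter 6:

import torch

class ProofModel(Model):
    def __init__(self, net):
        self.net = net  # e.g. a PyTorch graph neural network

    def pre_process(self, state, actions):
        # Tensorize the proof state and candidate actions; depends on
        # the internals of ProofState/ProofAction (hypothetical helper)
        return encode_as_graph(state, actions)

    def predict(self, tensors):
        with torch.no_grad():       # inference only, no gradients
            return self.net(tensors)

    def post_process(self, scores):
        # A probability distribution over the legal actions
        return torch.softmax(scores, dim=-1)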
To conclude, the framework consists of three main modular parts:
- Proof Calculus (ProofSearchProblem, ProofState, ProofAction)
- Search Strategy (SearchAgent)
- Learning guidance (ProofModel)
While the ProofModel's pre_process function relies on the implementation details of ProofState and ProofAction, all other parts of the framework can be swapped out without regard for the other components. For example, if you have an implementation of resolution with a BFS strategy in this framework and want to try an MCTS strategy, all you have to change is the SearchAgent. If you instead want to try the connection calculus with a BFS strategy, all you have to change are the proof dynamics classes (ProofSearchProblem, ProofState, ProofAction).
We have developed a Python version of the framework which provides primitives and primitive operations such as literals, terms, formulae/matrices, unification, etc. In the next few chapters, we will discuss how we also implemented the connection calculus, iterative deepening, and a model based on graph neural networks using the PyTorch deep learning library [72] in our framework.
Chapter 6
Implementing an ML-Guided
ATP System
  ──────────── (axiom)
  {}, M, Path

  C2, M, {}
  ────────── (start)
  ε, M, ε
  (C2 is a copy of a clause in M)

  C, M, Path ∪ {L2}
  ───────────────────────── (reduction)
  C ∪ {L1}, M, Path ∪ {L2}
  (σ(L1) and σ(L2) are complementary)

  C2 \ {L2}, M, Path ∪ {L1}    C, M, Path
  ───────────────────────────────────────── (extension)
  C ∪ {L1}, M, Path
  (C2 is a copy of a clause in M containing L2, with σ(L1) and σ(L2) complementary)
Non-learning theorem provers rely on a priori optimizations to prune the search space and to determine which parts of the search space to search first. leanCoP 2.0, shown in Algorithm 4, improves on leanCoP 1.0 by adding a list of new techniques. The remaining subsections of Section 6.1 provide descriptions and examples of these techniques. The proofs of their corresponding theorems can be found in [73–75]. Our base prover implements two of them: "positive start clauses" and "regularity", both of which prune the search space in a way such that completeness is maintained. That is, they only remove redundant search states. Since "positive start clauses" was already in leanCoP 1.0, our base connection prover is a Python equivalent of leanCoP 1.0 + regularity, described in Algorithm 5.
6.1.1 Positive start clauses
Theorem 6.1.1 (Positive start clause). The connection calculus remains correct
and complete if the clause C1 of the start rule is restricted to positive clauses. A
positive clause is a clause that does not contain negated literals.
Consider the formula M shown in Figure 6.2. Theorem 6.1.1 then states that we only have to consider the trees rooted with {P1, P2} as premise to maintain a complete proof search. Hence the only potential root for a connection calculus proof tree for M is shown in Figure 6.2, while all other roots, shown in Figure 6.3, do not need to be considered since they contain negated literals.
{P1, P2}, M, {}
──────────────────────────────────────────────────────────────────────── (start)
ε, {{P1, P2}, {¬P1, P3(v1)}, {¬P3(c1), P1}, {¬P3(c2), ¬P1}, {P1, ¬P2}}, ε

Figure 6.2: Potential root of proof tree for M
Figure 6.3: Pruned roots of potential proof trees for M due to Theorem 6.1.1; the start rule applied to {¬P1, P3(v1)}, {¬P3(c1), P1}, {¬P3(c2), ¬P1}, and {P1, ¬P2} is crossed out, since each of these clauses contains a negated literal
6.1.2 Regularity
Definition 6.1.1 (Regularity). A connection proof is regular iff no literal
occurs more than once in the active path.
Theorem 6.1.2 (Regularity). A formula M in clausal form is valid iff there is a
regular connection proof for “e, M, e”.
Using the partial proof tree for M shown in Figure 6.4 as an example,
regularity tells us we can safely backtrack from the node with highlighted
P1 ’s, and do not need to continue searching from it for our proof search
to be complete. The intuition here, is that to close the new P1 , we would
have to use another path than was used to close the first P1 to not end up in
an infinite loop. Therefore, we must necessarily have been able to use the
same new path to close the first P1 .
   (?)                                  (axiom)
{P1}, M, {P1, P3(v1′)}              {}, M, {P1}
──────────────────────────────────────────────── (extension)
{P3(v1′)}, M, {P1}                  {P2}, M, {}
──────────────────────────────────────────────── (extension)
{P1, P2}, M, {}
──────────────── (start)
ε, M, ε

Figure 6.4: Partial connection proof tree for M; the node marked (?) repeats P1 in the active path
6.1.3 Lemmata
One additional optimization which is included in leanCoP 2.0 and
commonly included in re-implementations of leanCoP is "Lemmata".
However, this technique typically doesn’t increase performance notably
and is therefore omitted in our base connection prover. For completeness,
it is described here anyways.
Definition 6.1.2 (Lemmata). The connection calculus in Table 2.7 is modified by adding a set of literals Lem, called lemmata, to all tuples "C, M, Path". The empty set is added to the premise of the new start rule, and ε is added to its conclusion. The set Lem ∪ {L1} is added to the premise of the new reduction rule and the right premise of the extension rule. Furthermore, the following rule is added to the connection calculus:

  C, M, Path, Lem ∪ {L2}
  ────────────────────────────── (lemma)
  C ∪ {L1}, M, Path, Lem ∪ {L2}
  (with σ(L1) = σ(L2))
                 (axiom)
  {}, M, {P2}, {P1}
  ──────────────────── (lemma)                       (axiom)
  {P1}, M, {P2}, {P1}                         {}, M, {}, {P1, P2}
       ...        (...)                       ──────────────────── (extension)
  {...}, M, {P1}, {}                          {P2}, M, {}, {P1}
  ─────────────────────────────────────────────────────────────── (extension)
  {P1, P2}, M, {}, {}
  ──────────────────── (start)
  ε, M, ε, ε

Figure 6.5: Partial connection calculus proof tree for M with the lemma rule
L iff L is the principal literal of the proof step. An extension step solves a
literal L iff L is the principal literal of the proof step and there is a proof
for the left premise, i.e. there is a derivation for the left premise so that all
leaves are axioms.
Definition 6.1.4 (Essential backtracking/proof step). Let R1 , . . . , Rn be
instances of rules with the same principal literal L1 applicable to a
node of a derivation in the connection calculus. If the literal L1 can
be solved by applying the rule Ri , but not by applying the rules R1
to Ri−1 , then backtracking over the rules R2 , . . . , Ri is called essential
backtracking; backtracking over the rules Ri+1 , . . . , Rn is called non-
essential backtracking. The application of one of the rules R1 , ..., Ri is an
essential proof step; the application of one of the rules Ri+1 , . . . , Rn is a
nonessential proof step.
Definition 6.1.5 (Restricted backtracking). Let R1 , . . . , Ri , . . . , Rn be the
instances of (reduction, extension or lemma) rules with principal literal L1
that are applicable to a node of a derivation in the connection calculus and
rule Ri solves L1 . Restricted backtracking does not apply the alternative
rules Ri+1 , . . . , Rn anymore.
Restricted backtracking restricts backtracking by only trying to solve a literal once. This makes the search procedure incomplete, but it has been shown to be very effective at proving more problems in fewer proof step iterations [75].
Figure 6.6 compares this type of tree search (b) to the search strategy used in leanCoP 1.0 (a) and the MCTS search strategy (c) used in various extensions of leanCoP supporting machine learning [43–45].
6.2 Formula Tensorization and State/Action Embeddings
Now that we have the proof dynamics in place, we will have a look at
how we can discriminate which actions to take from a given proof state by
approximating a function from states to actions using machine learning.
Our goal is to implement the ProofModel class such that the SearchAgent
will receive a probability distribution over legal actions from the Model
after giving the Model a State and a list of potential Actions.
Most interesting and effective machine learning models, such as the ones
discussed in Section 3.3, learn functions from real valued tensors to other
real valued tensors. However, it is not immediately clear how one
would express connection calculus proof states and actions as tensors. As
discussed in Section 4, this is one of the main bottlenecks of incorporating
machine learning into theorem provers: how do we properly embed proof states into the space of real-valued tensors?
Since this is quite a crucial step in the process of incorporating ML into
theorem provers, there has been quite a bit of work done on this. Here are
a few approaches:
- Meta-information such as the theory name and presence in various
databases [77]
- Term walks of length 2 [77]
- Term walks of length 3 [55]
- Symbol level token embeddings [78]
- Linear chains [60]
One of the problems with these previous methods is that they mostly lack necessary invariance properties, such as invariance to literal/clause order and invariance to symbol renaming. Newer embedding strategies are therefore based on representation learning with graph neural networks, exploiting the relational structure of formulas and proof states. Combined with the relatively large literature on graph representation learning using GNNs, reducing the problem of tensorizing proof states to that of finding a graph representation of a proof state, as shown in Figure 6.7, is thought to be beneficial. Here are some graph encoding approaches:
- Embed the formula parse trees extended with subexpression sharing
(parse DAGs) [66, 79–81], see Figure 6.8
- Embed proof states as hypergraphs [45]
Figure 6.7: A proof state (matrix and current tableau/substitution) and its possible actions/inferences
Formally, an undirected hypergraph H is a 2-tuple H = (V, E) where:
- V is a set of nodes.
- E ⊆ P(V) \ {∅} is a set of hyperedges.
6.2.2 Graph Construction 2
The previously presented embedding is invariant to variable renaming,
literal and clause reordering, and partially solves the problem of term
ordering by connecting successive terms by hyperedges. However, as the
authors mention, in this encoding, f (t1 , t2 , t1 ) would be encoded the same
way as f (t2 , t1 , t2 ).
Definition 6.2.2 (Heterogeneous graph). A heterogeneous graph is a
generalization of a graph (see Definition 3.3.1) where each node and edge
is associated with a type. Formally, a heterogeneous graph G is a 6-tuple
G = (V, E, TV , TE , τ, φ) where:
- V is a set of nodes
- E ⊆ V × V is a set of edges
- TV is a set of node types
- TE is a set of edge types
- τ : V → TV is a node type mapping
- φ : E → TE is an edge type mapping
Figure 6.10: Node and edge types of the heterogeneous graph constructions
Given a connection calculus proof state, the state defining graph is built as
follows:
- Every clause in the matrix M is represented by a CLA node
- Every literal in a clause is represented by a LIT node
- Every term in a literal is represented by a:
- FUN node if the term is a function
- CON node if the term is a constant
- VAR node if the term is a variable
- Every term in a function is represented by a:
- FUN node if the term is a function
- CON node if the term is a constant
- VAR node if the term is a variable
- Every pair of LIT nodes representing literals with the same predicate symbol has an Equal edge between them.
- Every pair of FUN nodes representing functions with the same function symbol has an Equal edge between them.
- Every pair of CON nodes representing constants with the same constant symbol has an Equal edge between them.
- Every pair of VAR nodes representing the same variable has an Equal edge between them (note: there should be no Equal edges between VAR nodes originating from different clauses).
- Every pair of LIT nodes representing literals with the same predicate symbol, but where one contains ¬ and the other does not, has a Complements edge between them.
- Every pair of nodes where one represents an immediate sub-expression of the other has a Contains/In edge between them ("immediate sub-expression" refers to a literal in a clause or an argument of a literal or function).
- Every pair of FUN, CON, and VAR nodes where one node represents the argument immediately after the other in their originating super-expression has a Successor edge going from the first node to the second.
An example of this conversion is shown in Figure 6.11 for a subset of the
clauses of an arbitrary matrix.
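To make the construction concrete, here is a minimal sketch of encoding a tiny matrix fragment, assuming PyTorch Geometric's HeteroData container (introduced in PyG releases newer than the one used for our experiments); the clause {p(x), ¬q(c)} and the node-type indices for the one-hot features are illustrative:

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
# One-hot input features per node type (dimension 5, as in Section 7.1);
# the index assignment CLA=0, LIT=1, FUN=2, CON=3, VAR=4 is arbitrary.
data['CLA'].x = torch.eye(5)[[0]]          # 1 clause node: {p(x), ¬q(c)}
data['LIT'].x = torch.eye(5)[[1, 1]]       # 2 literal nodes: p(x), ¬q(c)
data['VAR'].x = torch.eye(5)[[4]]          # 1 variable node: x
data['CON'].x = torch.eye(5)[[3]]          # 1 constant node: c
# Contains/In edges: clause -> literal, literal -> argument.
data['CLA', 'Contains', 'LIT'].edge_index = torch.tensor([[0, 0], [0, 1]])
data['LIT', 'Contains', 'VAR'].edge_index = torch.tensor([[0], [0]])
data['LIT', 'Contains', 'CON'].edge_index = torch.tensor([[1], [0]])
```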
[Figure 6.11: Example conversion from a connection calculus matrix to a heterogeneous graph, showing the embedding of clauses C1, C4, C5]
[Figure 6.12: A more planar version of the graph in Figure 6.11]
6.3 Model Architecture
Now that we have graph representations of the proof states, we can use graph neural networks both to embed the proof states and, furthermore, to perform inference on them.
Following the hypergraph construction above, let E_ct denote the set of clause–term/literal edges and E_st the set of symbol hyperedges, each carrying a polarity g ∈ {1, −1}. For every node index j we define the index sets

\[
\begin{aligned}
F_{ct}^{j} &= \{\, a : (C_j, T_a) \in E_{ct} \,\} \\
F_{tc}^{j} &= \{\, a : (C_a, T_j) \in E_{ct} \,\} \\
F_{st}^{j} &= \{\, (a, b, c, g) : (S_j, T_a, T_b, T_c, g) \in E_{st} \,\} \\
F_{ts,1}^{j} &= \{\, (a, b, c, g) : (S_c, T_j, T_a, T_b, g) \in E_{st} \,\} \\
F_{ts,2}^{j} &= \{\, (a, b, c, g) : (S_c, T_a, T_j, T_b, g) \in E_{st} \,\} \\
F_{ts,3}^{j} &= \{\, (a, b, c, g) : (S_c, T_a, T_b, T_j, g) \in E_{st} \,\}
\end{aligned}
\]

Message passing layer i then updates the clause embeddings c_{i,j}, the symbol embeddings s_{i,j} and the term/literal embeddings t_{i,j} as follows:

\[
\begin{aligned}
c_{i+1,j} &= \mathrm{ReLU}\Big(B_c^i + M_c^i \cdot c_{i,j} + M_{ct}^i \cdot \operatorname{red}_{a \in F_{ct}^j}(t_{i,a})\Big) \\
x_{a,b,c}^i &= B_{ts}^i + M_{ts,1}^i \cdot t_{i,a} + M_{ts,2}^i \cdot t_{i,b} + M_{ts,3}^i \cdot t_{i,c} \\
s_{i+1,j} &= \tanh\Big(M_s^i \cdot s_{i,j} + M_{ts}^i \cdot \operatorname{red}'_{(a,b,c,g) \in F_{st}^j}\big(g \cdot x_{a,b,c}^i\big)\Big) \\
y_{i,d}^{a,b,c,g} &= B_{st}^i + M_{st,i}^{1,d} \cdot t_{i,a} + M_{st,i}^{2,d} \cdot t_{i,b} + M_{st,i}^{3,d} \cdot s_{i,c} \cdot g \\
z_{i,j,d} &= M_{st,d}^i \cdot \operatorname{red}_{(a,b,c,g) \in F_{ts,d}^j}\big(\mathrm{ReLU}(y_{i,d}^{a,b,c,g})\big) \\
v_{i,j} &= M_{tc}^i \cdot \operatorname{red}_{a \in F_{tc}^j}(c_{i,a}) \\
t_{i+1,j} &= \mathrm{ReLU}\Big(B_t^i + M_t^i \cdot t_{i,j} + v_{i,j} + \sum_{d \in \{1,2,3\}} z_{i,j,d}\Big)
\end{aligned}
\]
Here, all the B symbols represent learnable vectors (biases), and all the M
symbols represent learnable matrices. The aggregation operations used are
defined as follows:
\[
\begin{aligned}
\operatorname{red}_{i \in I}(u_i) &= \max_{i \in I}(u_i) \parallel \operatorname{mean}_{i \in I}(u_i) \\
\operatorname{red}'_{i \in I}(u_i) &= \big(\max_{i \in I}(u_i) - \min_{i \in I}(u_i)\big) \parallel \operatorname{mean}_{i \in I}(u_i)
\end{aligned}
\]
In the above, ∥ denotes concatenation, so that if the u_i are of dimension d, the resulting vector is of dimension 2d. All the aggregations (max, min, mean) are applied pointwise.
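As a small sketch, the two reductions can be written directly in PyTorch, assuming u stacks the vectors u_i along dimension 0 (shape [|I|, d]); the result has dimension 2d:

```python
import torch

def red(u):        # max ‖ mean
    return torch.cat([u.max(dim=0).values, u.mean(dim=0)], dim=-1)

def red_prime(u):  # (max − min) ‖ mean
    return torch.cat([u.max(dim=0).values - u.min(dim=0).values,
                      u.mean(dim=0)], dim=-1)
```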
After L message passing layers, we obtain the embeddings c_{L,j}, s_{L,j}, t_{L,j} of the clauses C_j, symbols S_j, and terms and literals T_j, respectively. These are fed to a regular feed-forward neural network whose outputs are used to compute the logits for taking an action. An action, in this case, corresponds to using axiom C_i and connecting its literal T_j with the current goal. Let C_k be the clause containing all the remaining goals. The logit for the action corresponding to (c_{L,i}, t_{L,j}) is then found by feeding the concatenation of c_{L,i}, t_{L,j}, and c_{L,k} through a hidden layer of size 64 with ReLU activation, followed by a linear output layer (without an activation function). The distribution over actions is obtained by applying the softmax function to the logits.
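A sketch of this scoring head follows; the hidden size 64 is from the text, while the embedding dimension d and the random inputs are assumptions:

```python
import torch
import torch.nn as nn

d = 32
head = nn.Sequential(
    nn.Linear(3 * d, 64),    # input: c_{L,i} ‖ t_{L,j} ‖ c_{L,k}
    nn.ReLU(),
    nn.Linear(64, 1),        # one logit per candidate action
)

def action_distribution(c_i, t_j, c_k):
    """c_i, t_j, c_k: [num_actions, d] rows, one per candidate action."""
    logits = head(torch.cat([c_i, t_j, c_k], dim=-1)).squeeze(-1)
    return torch.softmax(logits, dim=-1)

# Example with 5 candidate actions sharing the same remaining-goals clause:
probs = action_distribution(torch.randn(5, d), torch.randn(5, d),
                            torch.randn(1, d).expand(5, d))
```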
We can specialize this definition to take different edge types into account by having separate weights for each type of relation. The R-GCN model from [82] does exactly this, as shown in Equation 6.2:
\[
h_v^{t+1} = \sigma\Big( \sum_{r \in \mathcal{R}} \sum_{u \in N_v^r} \frac{1}{c_{v,r}} W_r^t h_u^t + W_0^t h_v^t \Big) \tag{6.2}
\]
Here, N_v^r denotes the set of neighbor indices of node v under relation r ∈ R, and W_r^t denotes the weight matrix for relation r at GNN layer t. c_{v,r} is a problem-specific normalization constant that can either be learned or chosen in advance (such as c_{v,r} = |N_v^r|), and σ is a non-linear activation function (such as ReLU).
The underlying idea is that different relation types now contribute different information to v, even when the message-sending node u is held constant.
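Equation 6.2 corresponds to PyTorch Geometric's RGCNConv layer; a single layer over a toy graph might look like the sketch below (the sizes and the number of relations are assumptions; adding inverse or self relations would increase num_relations):

```python
import torch
from torch_geometric.nn import RGCNConv

conv = RGCNConv(in_channels=32, out_channels=32, num_relations=4)
x = torch.randn(10, 32)                          # h_v^t for 10 nodes
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
edge_type = torch.tensor([0, 1, 3])              # relation r of each edge
h_next = torch.relu(conv(x, edge_index, edge_type))   # h_v^{t+1}
```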
R-GCN is one way to incorporate different relations into the message
passing formulation. Figure 6.13 shows some other examples which
incorporate more advanced features and information.
To use this for inference in our theorem prover, we frame proof search guidance as a link prediction problem on the graph embedding. That is, given the LIT node representing the current goal on our connection calculus path, which node is it most likely to have an Extension edge to? Note that the graph stays constant throughout a proof search problem; we never actually add any Extension edges to the graph.
Our model for the heterogeneous graph construction is reminiscent of the link prediction graph auto-encoder model used in [82]. The encoder consists of multiple R-GCN layers, while the decoder is simply a DistMult factorization [84]. That is, we score the potential Extension edge between the LIT node u representing the current open goal and a LIT node v representing a unifiable complementary literal by the function
\[
f(u, v) = e_u^{\top} W e_v
\]
where W is a learnable matrix and e_u = h_u^L and e_v = h_v^L are the encoded representations of u and v, respectively (L being the number of R-GCN layers in the encoder).
Since the number of supervision edges per sample state/action is relatively low, we use a standard cross-entropy loss without negative sampling to train the model:
\[
\mathcal{L} = -\frac{1}{|A|} \sum_{(u,v) \in A} \Big( y \log l\big(f(u, v)\big) + (1 - y) \log\big(1 - l(f(u, v))\big) \Big)
\]
Here A is the set of all possible Extension edges and l is the logistic sigmoid function. The indicator y is set to y = 1 for edges representing actions which successfully led to proofs, and to y = 0 for the complementary actions, that is, edges representing actions which did not lead to proofs.
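A sketch of the decoder and loss follows; here W is restricted to a learnable diagonal, as in DistMult [84], and the tensors are illustrative stand-ins for the R-GCN node embeddings:

```python
import torch
import torch.nn as nn

d = 32
w = nn.Parameter(torch.ones(d))                    # diagonal of W

def score(e_u, e_v):                               # f(u, v) = e_u^T W e_v
    return (e_u * w * e_v).sum(dim=-1)

e_goal = torch.randn(d)                            # current open goal (LIT u)
e_cands = torch.randn(7, d)                        # 7 candidate literals (LIT v)
y = torch.zeros(7); y[2] = 1.0                     # action 2 led to a proof

logits = score(e_goal.expand_as(e_cands), e_cands)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
```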
Chapter 7
7.1 Hardware and Model Training
All testing is run on a server with four Intel(R) Xeon(R) L7555 CPUs @ 1.87 GHz, totaling 64 cores. Among other things, these CPUs lack support for AVX, so vector operations run somewhat less efficiently on them than on more modern processors. A rather lightweight R-GCN model is therefore used. The encoder consists of three R-GCN layers with input dimension 32 and output dimension 32, while the decoder is a simple DistMult over the relevant node embeddings, as discussed in Section 6.3.2. Input node embeddings are one-hot encoded vectors of dimension 5, one for each node type. We train this network for 50 epochs using the Adam optimizer [85] with a learning rate of 2e-6 (for both problem sets), found using the algorithm described in [86]. Our network is implemented in the PyTorch deep learning library [72], using the PyTorch Geometric extension library [87] for the R-GCN layers and the PyTorch Lightning extension library [88] for training and logging.
Due to time and resource constraints, we perform only one solving → training → solving iteration. In each solving step, we attempt to solve every problem in the given problem set, generating training data along the way. The training data are then used to train the network, minimizing the cross-entropy with the target link probabilities as described in Section 6.3.2.
The first solving step is run without guidance to save time, only storing the necessary training data for each inference step. In both solving steps, the prover makes a maximum of 100 000 total inference steps per problem before moving on to the next problem. For both guidance and training data gathering, we use iterative deepening with restricted backtracking up to a depth of 5, falling back to iterative deepening without restricted backtracking, instead of the MCTS used in systems such as rlCoP [43] and AlphaZero [33]. This is to make sure we find proofs in a relatively short amount of time while gathering training data.
Our network is lightweight enough to complete more than 2000 inferences per second in total while proving problems in parallel. This is notably less than the non-guided version (about 10 000 on both benchmarks), but still faster than other deep learning guided systems such as graphCoP [45]. Another consequence of the model being lightweight is that it gains nothing from being put on a GPU during inference, due to the communication overhead. Lastly, since the graph itself is stationary, we only send the graph through the encoder once per problem, and only query the decoder for the remaining steps. This is why the number of inferences the prover makes per second is comparable to that of the non-learning-guided version.
7.2 Results on MPTP2078
Tables 7.1 - 7.3 contain results of running the experiments described in the
beginning of Section 7 on the MPTP2078 1 bushy problem set.
           Base      RB        Learning
Proved     351       462       451
[%]        16.89%    22.23%    21.70%

           Base      RB        Learning
Base       -         57        54
RB         168       -         68
Learning   154       57        -

(In the pairwise comparison, each cell gives the number of problems proved by the row configuration but not by the column configuration.)
                Base      RB        RB + Learning
Proved          822       846       815
[%]             41.03%    42.23%    40.68%

                Base      RB        RB + Learning
Base            -         182       124
RB              206       -         143
RB + Learning   117       112       -
Chapter 8
The recency of most of the papers cited in Section 4, and the apparent superiority of the models they describe, make them a promising basis for future work on learning-guided ATP.
List of Figures
6.2 Potential root of proof tree for M . . . . . . . . . . . . . . . . 60
6.3 Pruned roots of potential proof trees for M due to theorem
6.1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.4 Non-regularity in a partial connection calculus proof tree for
M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.5 Partial connection calculus proof tree for M with lemma rule 61
6.6 Comparison of tree search strategies [76] . . . . . . . . . . . . 62
6.7 Reduction of representation learning problem to graph
representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.8 A syntax-tree and a DAG representation of the formula
∀ A, B, C.r ( A, B) ∧ p( A) ∨ ¬q( B, f ( A)) ∨ q(C, f ( A)) from
[80]. Here A, B, C are variables and p, q, r are predicates . . . 64
6.9 Example of a connection calculus proof state and actions . . 65
6.10 Node and edge types of the heterogeneous graph constructions 67
6.11 Example conversion from connection calculus matrix to
heterogeneous graph showing embedding of clauses C1 , C4 , C5 69
6.12 More planar version of the graph in Figure 6.11 . . . . . . . 69
6.13 Some relational GNNs from [83] . . . . . . . . . . . . . . . . 72
List of Tables
Bibliography
[13] Laura Kovács and Andrei Voronkov. ‘First-Order Theorem Proving
and Vampire’. In: Computer Aided Verification. Ed. by Natasha Shary-
gina and Helmut Veith. Berlin, Heidelberg: Springer Berlin Heidel-
berg, 2013, pp. 1–35. ISBN: 978-3-642-39799-8.
[14] Stephan Schulz, Simon Cruanes and Petar Vukmirović. ‘Faster,
Higher, Stronger: E 2.3’. In: Proc. of the 27th CADE, Natal, Brasil. Ed. by
Pacal Fontaine. LNAI 11716. Springer, 2019, pp. 495–507.
[15] ‘The CADE-27 Automated Theorem Proving System Competition – CASC-27’. English (US). In: AI Communications 32.5-6 (2020), pp. 373–389. ISSN: 0921-7126. DOI: 10.3233/AIC-190627.
[16] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern
Approach. 4th ed. Pearson, 2020.
[17] Gareth James et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2013. URL: https://faculty.marshall.usc.edu/gareth-james/ISL/.
[18] William L. Hamilton. ‘Graph Representation Learning’. In: Synthesis Lectures on Artificial Intelligence and Machine Learning 14.3 (2020), pp. 1–159.
[19] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning.
http://www.deeplearningbook.org. MIT Press, 2016.
[20] M. Tim Jones. Models for machine learning. Dec. 2017. URL: https://developer.ibm.com/technologies/artificial-intelligence/articles/cc-models-machine-learning/.
[21] Anthony Goldbloom. "What algorithms are most successful on Kaggle?"
2016 (accessed September 17, 2020). URL: https://www.kaggle.com/
antgoldbloom/what-algorithms-are-most-successful-on-kaggle.
[22] Volodymyr Mnih et al. ‘Asynchronous Methods for Deep Reinforce-
ment Learning’. In: Proceedings of the 33rd International Conference on
International Conference on Machine Learning - Volume 48. ICML’16.
New York, NY, USA: JMLR.org, 2016, pp. 1928–1937.
[23] John Schulman et al. Proximal Policy Optimization Algorithms. 2017.
arXiv: 1707.06347 [cs.LG].
[24] John Schulman et al. ‘Trust Region Policy Optimization’. In: Proceed-
ings of the 32nd International Conference on International Conference on
Machine Learning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015,
pp. 1889–1897.
[25] Scott Fujimoto, Herke van Hoof and David Meger. ‘Addressing
Function Approximation Error in Actor-Critic Methods’. In: Proceed-
ings of the 35th International Conference on Machine Learning, ICML
2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Ed. by
Jennifer G. Dy and Andreas Krause. Vol. 80. Proceedings of Machine
Learning Research. PMLR, 2018, pp. 1582–1591.
[26] Tuomas Haarnoja et al. ‘Soft Actor-Critic: Off-Policy Maximum
Entropy Deep Reinforcement Learning with a Stochastic Actor’. In:
Proceedings of the 35th International Conference on Machine Learning. Ed.
by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine
Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR,
Oct. 2018, pp. 1861–1870.
[27] Volodymyr Mnih et al. ‘Playing Atari With Deep Reinforcement
Learning’. In: NIPS Deep Learning Workshop. 2013.
[28] Marc G. Bellemare, Will Dabney and Rémi Munos. ‘A Distributional
Perspective on Reinforcement Learning’. In: Proceedings of the 34th
International Conference on Machine Learning - Volume 70. ICML’17.
Sydney, NSW, Australia: JMLR.org, 2017, pp. 449–458.
[29] Will Dabney et al. Distributional Reinforcement Learning With Quantile
Regression. 2018.
[30] Marcin Andrychowicz et al. ‘Hindsight Experience Replay’. In:
Proceedings of the 31st International Conference on Neural Information
Processing Systems. NIPS’17. Long Beach, California, USA: Curran
Associates Inc., 2017, pp. 5055–5065. ISBN: 9781510860964.
[31] David Ha and Jürgen Schmidhuber. ‘Recurrent World Models Fa-
cilitate Policy Evolution’. In: Advances in Neural Information Pro-
cessing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates,
Inc., 2018. URL: https : / / proceedings . neurips . cc / paper / 2018 / file /
2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf.
[32] Sébastien Racanière et al. ‘Imagination-Augmented Agents for Deep
Reinforcement Learning’. In: Proceedings of the 31st International
Conference on Neural Information Processing Systems. NIPS’17. Long
Beach, California, USA: Curran Associates Inc., 2017, pp. 5694–5705.
ISBN: 9781510860964.
[37] Wikimedia Commons Rmoss92. The 4 steps of Monte Carlo tree search: selection, expansion, simulation, and backpropagation. File: MCTS-steps.svg. 2020. URL: https://commons.wikimedia.org/wiki/File:MCTS-steps.svg.
[38] Wolfgang Ertel, Johann M. Ph. Schumann and Christian B. Suttner.
‘Learning Heuristics for a Theorem Prover using Back Propagation’.
In: 5. Österreichische Artificial-Intelligence-Tagung. Ed. by Johannes
Retti and Karl Leidlmair. Berlin, Heidelberg: Springer, 1989, pp. 87–
95. ISBN: 978-3-642-74688-8.
[39] Josef Urban, Jiří Vyskočil and Petr Štěpánek. ‘MaLeCoP: Machine Learning Connection Prover’. In: Automated Reasoning with Analytic
Tableaux and Related Methods. Ed. by Kai Brünnler and George
Metcalfe. Berlin, Heidelberg: Springer, 2011, pp. 263–277. ISBN: 978-
3-642-22119-4.
[40] Cezary Kaliszyk and Josef Urban. ‘FEMaLeCoP: Fairly Efficient
Machine Learning Connection Prover’. In: Logic for Programming,
Artificial Intelligence, and Reasoning. Ed. by Martin Davis et al. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2015, pp. 88–96. ISBN: 978-3-
662-48899-7.
[41] Jesse Alama et al. ‘Premise Selection for Mathematics by Corpus
Analysis and Kernel Methods’. In: Journal of Automated Reasoning 52.2
(Feb. 2014), pp. 191–213. ISSN: 1573-0670. DOI: 10.1007/s10817-013-9286-5. URL: https://doi.org/10.1007/s10817-013-9286-5.
[42] Michael Färber, Cezary Kaliszyk and Josef Urban. ‘Monte Carlo
Tableau Proof Search’. In: Automated Deduction – CADE 26. Ed. by
Leonardo de Moura. Cham: Springer International Publishing, 2017,
pp. 563–579. ISBN: 978-3-319-63046-5.
[43] Cezary Kaliszyk et al. ‘Reinforcement Learning of Theorem Proving’.
In: Advances in Neural Information Processing Systems. Ed. by S. Bengio
et al. Vol. 31. Curran Associates, Inc., 2018.
[44] Zsolt Zombori, Josef Urban and Chad E. Brown. ‘Prolog Technology
Reinforcement Learning Prover’. In: Automated Reasoning. Ed. by
Nicolas Peltier and Viorica Sofronie-Stokkermans. Cham: Springer
International Publishing, 2020, pp. 489–507. ISBN: 978-3-030-51054-1.
[45] Miroslav Olsák, Cezary Kaliszyk and Josef Urban. ‘Property Invari-
ant Embedding for Automated Reasoning’. In: ECAI 2020 - 24th
European Conference on Artificial Intelligence, 29 August-8 September
2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 -
Including 10th Conference on Prestigious Applications of Artificial In-
telligence (PAIS 2020). Ed. by Giuseppe De Giacomo et al. Vol. 325.
Frontiers in Artificial Intelligence and Applications. IOS Press, 2020,
pp. 1395–1402.
[46] Stephane Ross, Geoffrey Gordon and Drew Bagnell. ‘A Reduction of
Imitation Learning and Structured Prediction to No-Regret Online
Learning’. In: Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics. Ed. by Geoffrey Gordon, David
Dunson and Miroslav Dudík. Vol. 15. Proceedings of Machine
Learning Research. 2011, pp. 627–635.
[47] Tianqi Chen and Carlos Guestrin. ‘XGBoost: A Scalable Tree Boosting
System’. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. KDD ’16. San
Francisco, California, USA: Association for Computing Machinery,
2016, pp. 785–794. ISBN: 9781450342322. DOI: 10.1145/2939672.2939785. URL: https://doi.org/10.1145/2939672.2939785.
[48] Cezary Kaliszyk and Josef Urban. ‘MizAR 40 for Mizar 40’. In: Journal
of Automated Reasoning 55.3 (Oct. 2015), pp. 245–256. ISSN: 1573-0670.
DOI: 10.1007/s10817-015-9330-8. URL: https://doi.org/10.1007/s10817-015-9330-8.
[49] Adam Grabowski, Artur Kornilowicz and Adam Naumowicz.
‘Mizar in a Nutshell’. In: Journal of Formalized Reasoning 3 (Dec. 2010),
pp. 153–245. DOI: 10.6092/issn.1972-5787/1980. URL: https://jfr.unibo.it/article/view/1980.
[50] Jens Otten. ‘nanoCoP: Natural Non-clausal Theorem Proving’. In:
Proceedings of the Twenty-Sixth International Joint Conference on Artificial
Intelligence, IJCAI 2017, Melbourne, Australia. Ed. by Carles Sierra.
ijcai.org, 2017, pp. 4924–4928.
[51] Wolfgang Bibel and Jens Otten. ‘From Schütte’s Formal Systems to
Modern Automated Deduction’. In: The Legacy of Kurt Schütte. Ed. by
Reinhard Kahle and Michael Rathjen. Springer, 2020, pp. 217–251.
[52] Jens Otten. ‘Non-clausal Connection Calculi for Non-classical Lo-
gics’. In: 26th International Conference on Automated Reasoning with
Analytic Tableaux and Related Methods. Ed. by R. Schmidt and
C. Nalon. Vol. 10501. Lecture Notes in Artificial Intelligence.
Springer, 2017, pp. 209–227.
[53] Jens Otten and Wolfgang Bibel. ‘Advances in Connection-based
Automated Theorem Proving’. In: Provably Correct Systems. Ed.
by Jonathan Bowen, Mike Hinchey and Ernst-Rüdiger Olderog.
NASA Monographs in Systems and Software Engineering. London:
Springer, 2017, pp. 211–241.
[54] Karel Chvalovský et al. ‘ENIGMA-NG: Efficient Neural and
Gradient-Boosted Inference Guidance for E’. In: Automated Deduc-
tion – CADE 27. Ed. by Pascal Fontaine. Cham: Springer International
Publishing, 2019, pp. 197–215. ISBN: 978-3-030-29436-6.
[55] Jan Jakubův and Josef Urban. ‘ENIGMA: Efficient Learning-Based
Inference Guiding Machine’. In: Intelligent Computer Mathematics. Ed.
by Herman Geuvers et al. Cham: Springer International Publishing,
2017, pp. 292–302. ISBN: 978-3-319-62075-6.
[56] Jan Jakubův et al. ‘ENIGMA Anonymous: Symbol-Independent
Inference Guiding Machine (System Description)’. In: Automated
Reasoning. Ed. by Nicolas Peltier and Viorica Sofronie-Stokkermans.
Cham: Springer International Publishing, 2020, pp. 448–463. ISBN:
978-3-030-51054-1.
[57] Rong-En Fan et al. ‘LIBLINEAR: A Library for Large Linear Classi-
fication’. In: J. Mach. Learn. Res. 9 (June 2008), pp. 1871–1874. ISSN:
1532-4435.
[58] Josef Urban. ‘MPTP 0.2: Design, Implementation, and Initial Exper-
iments’. In: Journal of Automated Reasoning 37.1 (Aug. 2006), pp. 21–
43. ISSN: 1573-0670. DOI: 10.1007/s10817-006-9032-3. URL: https://doi.org/10.1007/s10817-006-9032-3.
[59] Grzegorz Bancerek et al. ‘Mizar: State-of-the-art and Beyond’. In:
Intelligent Computer Mathematics. Ed. by Manfred Kerber et al. Cham:
Springer International Publishing, 2015, pp. 261–279. ISBN: 978-3-319-
20615-8.
[60] Maxwell Crouse et al. ‘A Deep Reinforcement Learning Approach to
First-Order Logic Theorem Proving’. In: CoRR abs/1911.02065 (2019).
arXiv: 1911.02065. URL: http://arxiv.org/abs/1911.02065.
[61] Peter Baumgartner, Joshua Bax and Uwe Waldmann. ‘Beagle – A
Hierarchic Superposition Theorem Prover’. In: Automated Deduction -
CADE-25. Ed. by Amy P. Felty and Aart Middeldorp. Cham: Springer
International Publishing, 2015, pp. 367–377. ISBN: 978-3-319-21401-6.
[62] Thibault Gauthier, Cezary Kaliszyk and Josef Urban. ‘TacticToe:
Learning to Reason with HOL4 Tactics’. In: LPAR-21. 21st Interna-
tional Conference on Logic for Programming, Artificial Intelligence and
Reasoning. Ed. by Thomas Eiter and David Sands. Vol. 46. EPiC Series
in Computing. EasyChair, 2017, pp. 125–143. DOI: 10.29007/ntlb. URL:
https://easychair.org/publications/paper/WsM.
[63] Thibault Gauthier et al. ‘TacticToe: Learning to Prove with Tactics’.
In: Journal of Automated Reasoning 65.2 (Feb. 2021), pp. 257–286. ISSN:
1573-0670. DOI: 10.1007/s10817-020-09580-x. URL: https://doi.org/10.
1007/s10817-020-09580-x.
[64] Lasse Blaauwbroek, Josef Urban and Herman Geuvers. ‘The Tac-
tician’. In: Intelligent Computer Mathematics. Ed. by Christoph Ben-
zmüller and Bruce Miller. Cham: Springer International Publishing,
2020, pp. 271–277. ISBN: 978-3-030-53518-6.
[65] Lasse Blaauwbroek, Josef Urban and Herman Geuvers. ‘Tactic Learn-
ing and Proving for the Coq Proof Assistant’. In: LPAR23. LPAR-23:
23rd International Conference on Logic for Programming, Artificial Intelli-
gence and Reasoning. Ed. by Elvira Albert and Laura Kovacs. Vol. 73.
EPiC Series in Computing. EasyChair, 2020, pp. 138–150. DOI: 10 .
29007/wg1q. URL: https://easychair.org/publications/paper/JLdB.
[66] Aditya Paliwal et al. ‘Graph Representations for Higher-Order Logic
and Theorem Proving’. In: Proceedings of the AAAI Conference on
Artificial Intelligence 34.03 (Apr. 2020), pp. 2967–2974. DOI: 10.1609/
aaai.v34i03.5689. URL: https://ojs.aaai.org/index.php/AAAI/article/
view/5689.
[67] Stanislas Polu and Ilya Sutskever. ‘Generative Language Modeling
for Automated Theorem Proving’. In: CoRR abs/2009.03393 (2020).
URL : https://arxiv.org/abs/2009.03393.
[78] Alexander A. Alemi et al. ‘DeepMath - Deep Sequence Models for
Premise Selection’. In: Proceedings of the 30th International Conference
on Neural Information Processing Systems. NIPS’16. Barcelona, Spain:
Curran Associates Inc., 2016, pp. 2243–2251. ISBN: 9781510838819.
[79] Mingzhe Wang et al. ‘Premise Selection for Theorem Proving by Deep
Graph Embedding’. In: Advances in Neural Information Processing
Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc.,
2017. URL: https://proceedings.neurips.cc/paper/2017/file/18d10dc6e666eab6de9215ae5b3d54df-Paper.pdf.
[80] Ibrahim Abdelaziz et al. ‘An Experimental Study of Formula Em-
beddings for Automated Theorem Proving in First-Order Logic’. In:
CoRR abs/2002.00423 (2020). arXiv: 2002.00423. URL: https://arxiv.org/abs/2002.00423.
[81] Maxwell Crouse et al. Improving Graph Neural Network Representations
of Logical Formulae with Subgraph Pooling. 2020. arXiv: 1911.06904 [cs.AI].
[82] Michael Sejr Schlichtkrull et al. ‘Modeling Relational Data with
Graph Convolutional Networks’. In: The Semantic Web - 15th Interna-
tional Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018,
Proceedings. Ed. by Aldo Gangemi et al. Vol. 10843. Lecture Notes in
Computer Science. Springer, 2018, pp. 593–607. DOI: 10.1007/978-3-319-93417-4_38. URL: https://doi.org/10.1007/978-3-319-93417-4_38.
[83] Marc Brockschmidt. ‘GNN-FiLM: Graph Neural Networks with
Feature-wise Linear Modulation’. In: Proceedings of the 37th Interna-
tional Conference on Machine Learning. Ed. by Hal Daumé III and Aarti
Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR,
13–18 Jul 2020, pp. 1144–1152. URL: http://proceedings.mlr.press/v119/
brockschmidt20a.html.
[84] Bishan Yang et al. ‘Embedding Entities and Relations for Learning
and Inference in Knowledge Bases’. In: Proceedings of the International
Conference on Learning Representations (ICLR) 2015. May 2015. URL: https://www.microsoft.com/en-us/research/publication/embedding-entities-and-relations-for-learning-and-inference-in-knowledge-bases/.
[85] Diederik P. Kingma and Jimmy Ba. ‘Adam: A Method for Stochastic
Optimization’. In: 3rd International Conference on Learning Representa-
tions, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. URL: http://arxiv.org/abs/1412.6980.
[86] Leslie N. Smith. ‘Cyclical Learning Rates for Training Neural Net-
works’. In: 2017 IEEE Winter Conference on Applications of Computer
Vision (WACV). 2017, pp. 464–472. DOI: 10.1109/WACV.2017.58.
[87] Matthias Fey and Jan E. Lenssen. ‘Fast Graph Representation Learn-
ing with PyTorch Geometric’. In: ICLR 2019 Workshop on Representa-
tion Learning on Graphs and Manifolds. New Orleans, USA, 2019. URL:
https://arxiv.org/abs/1903.02428.
[88] William Falcon et al. ‘PyTorch Lightning’. In: GitHub (2019). URL: https://github.com/PyTorchLightning/pytorch-lightning.