
Learning to Reason

Fredrik Rømming

Thesis submitted for the degree of


Master in Informatics: Programming and System
Architecture
60 credits

Department of Informatics
The Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

Spring 2021
Learning to Reason

Fredrik Rømming
© 2021 Fredrik Rømming

Learning to Reason

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

Theorem proving formalizes the notion of deductive reasoning, while


machine learning formalizes the notion of inductive reasoning. In this
thesis, we present an overview of the current state of machine learning
guided first-order automated theorem proving systems and outline a novel
high-level modular object-oriented framework for combining arbitrary
machine learning models with arbitrary proof calculi. Additionally, we
present an example implementation in the aforementioned framework
taking a novel approach to combining graph neural networks with the
first-order connection calculus, generating a new Python implementation of the
leanCoP theorem prover as a by-product.

Contents

1 Introduction

2 Automated Theorem Proving (ATP)
  2.1 Formalizing Deductive Reasoning
  2.2 Propositional Logic
    2.2.1 Syntax
    2.2.2 Semantics
    2.2.3 Deductive systems
  2.3 First-order Logic
    2.3.1 Syntax
    2.3.2 Semantics
    2.3.3 Deductive system
  2.4 First-Order ATP
    2.4.1 Natural Deduction and Sequent Calculus
    2.4.2 Method of Analytic Tableaux
    2.4.3 Connection Calculus
    2.4.4 Resolution Calculus

3 Machine Learning (ML)
  3.1 Formalizing Inductive Reasoning
  3.2 ML Paradigms
  3.3 ML Algorithms and Models
    3.3.1 Non-parametric Models
    3.3.2 Parametric Models
  3.4 Reinforcement Learning (RL)
    3.4.1 Markov Decision Processes (MDP)
    3.4.2 Model-free vs. Model-based RL Algorithms
    3.4.3 Exploration vs. Exploitation
    3.4.4 Monte Carlo Tree Search (MCTS)

4 ML-Guided ATP Literature Review
  4.1 Connection-based Provers
  4.2 Saturation-based Provers
  4.3 Interactive Provers and Other Related Problems
  4.4 Main Takeaways

5 A General Framework for ML-Guided ATP
  5.1 From Prolog to Python
    5.1.1 Logic + Control
    5.1.2 Object-orientation
    5.1.3 Rapid Prototyping
  5.2 Incorporating Learning
    5.2.1 Proof Search as an MDP
    5.2.2 Model Module

6 Implementing an ML-Guided ATP System
  6.1 Base Connection Prover
    6.1.1 Positive start clauses
    6.1.2 Regularity
    6.1.3 Lemmata
    6.1.4 Restricted Backtracking
  6.2 Formula Tensorization and State/Action Embeddings
    6.2.1 Graph Construction 1
    6.2.2 Graph Construction 2
  6.3 Model Architecture
    6.3.1 Model for Graph Construction 1
    6.3.2 Model for Graph Construction 2

7 Evaluating the ML-Guided ATP System
  7.1 Hardware and Model Training
  7.2 Results on MPTP2078
    7.2.1 Unguided Proof Search
    7.2.2 Guided Proof Search
  7.3 Results on M2k
    7.3.1 Unguided Proof Search
    7.3.2 Guided Proof Search

8 Conclusion and Future Work
Chapter 1

Introduction

“If we had it [a characteristica universalis], we should be able to


reason in metaphysics and morals in much the same way as in
geometry and analysis.
If controversies were to arise, there would be no more need of
disputation between two philosophers than between two
accountants. For it would suffice to take their pencils in their
hands, to sit down to their slates, and to say to each other . . . :
Let us calculate.”
- Gottfried Wilhelm Leibniz

Reasoning is the ability to make steps of inference from premises to


consequences. We traditionally divide inference into two subcategories:
deduction and induction.
Definition 1.0.1 (Deduction). Deductive inference makes steps from
premises, which are known or assumed to be true, to logical conclusions
by following rules of valid reasoning.
An example of deductive inference is the ancient Greek syllogism:

All men are mortal.


Socrates is a man.

Therefore, Socrates is mortal.

The two sentences above the horizontal line are premises which combined
imply the sentence below the line, which is a logical conclusion.
Definition 1.0.2 (Induction). Inductive inference (not to be confused
with "mathematical induction") makes steps from premises, which are
observations providing some evidence, to conclusions saying something
general about the phenomenon underlying the particular observations.
An example of inductive inference is the generalization:

The proportion Q of the sample has attribute A
Therefore, the proportion Q of the population has attribute A

The sentence above the line states some specific observations, while the
sentence below the line says something general about all potential
observations.
In other words, deduction is concluding the particular from the general,
while induction is concluding the general from the particular.
Systems automating the process of reasoning play a central role in software
and hardware verification, and have applications in any area where logical
reasoning is required, such as in artificial general intelligence, mathematics
and philosophy. Although the overall goal of automated reasoning is
to mechanize different forms of reasoning, the term has largely been
identified with deductive reasoning as practiced in mathematics and
formal logic. In this thesis, we explore the use of automated inductive
reasoning (machine learning) to guide automated deductive reasoning
(theorem proving).
In Chapters 2, 3, and 4 we introduce the fields of automated theorem proving
and machine learning, and review what has been done to combine the
two. In Chapter 5 we propose a high-level framework for machine learning
guided automated theorem proving, and in Chapters 6 and 7 we describe
how we developed a machine learning guided automated theorem prover
in the aforementioned framework.

Chapter 2

Automated Theorem Proving


(ATP)

2.1 Formalizing Deductive Reasoning


Deduction is the type of reasoning concerned with necessity, where each
conclusion is a logical consequence of the premise. But what exactly does it
mean for a statement to be a "logical consequence"? What are "valid rules of
inference"? To formalize our natural notion of this, we use what are known
as formal and logical systems:
Definition 2.1.1 (Formal system). A formal system consists of:
1. A formal language, which is formed by:
- A set of symbols, the alphabet, strings of which are called
formulas.
- Rules for creating syntactically valid formulas (syntax).
2. A deductive system (proof calculus), which consists of:
- A set of axioms. Formulas providing a start or end point for our
deduction.
- A set of inference rules operating on formulas in the formal
language, determining what are valid inferences.

Definition 2.1.2 (Logical system). A logical system consists of:


1. A formal system.
2. Rules for interpreting formulas in the formal language of the formal
system (semantics).
Using these notions, we can formalize the process of deduction as
mechanical theorem proving. We will therefore use the terms "theorem
proving" and "deduction" interchangeably from now on.

To reach conclusions through deduction in this way, we need to use logical
systems whose syntax and semantics are expressive enough and whose
deductive systems are strong enough to prove theorems corresponding to
the statements in our domain of discourse. What formal framework to use
is highly dependent on the nature of the problem/theorem domain. There
is a wide range of imaginable domains in which we would like to perform
automated deduction. Table 2.1 shows some example domains and their
most common formalizations [1]:
General-purpose theorem proving and problem solving: First-order logic, Simple type theory
Program verification: Typed first-order logic, Higher-order logic
Distributed and concurrent systems: Modal logic, Temporal logic
Program synthesis: Intuitionistic logic
Hardware verification: Higher-order logic, Propositional logic
Logic programming: Horn logic
Constraint satisfaction: Propositional logic, Satisfiability Modulo Theories
Computational metaphysics: Higher-order modal logic

Table 2.1: Deduction domains and common formalizations

We will now have a look at the two most well-known categories of logical
frameworks: propositional logic and first-order logic. For a more detailed
overview and proofs of associated theorems, see [2].

2.2 Propositional Logic


Propositional logic, also called zeroth-order logic, is an umbrella term for
a set of equally expressive logical systems which deal with propositions as
the atomic entity.
Definition 2.2.1 (Proposition). A proposition is a statement which is either
true or false.
The statement "It is raining" is an example of a proposition.

2.2.1 Syntax
To formalize the notion of a proposition in a formal language for
propositional logic, we build propositional formulae using two atomic
types of symbols, logical connectives and propositional variables:
Definition 2.2.2 (Logical connectives). The logical connectives are the
symbols ¬ (negation), ∧ (conjunction), ∨ (disjunction), and → (implication).
Definition 2.2.3 (Propositional variables). A propositional variable is a
symbol which is not a logical connective, usually a lower case letter p, q, r,
etc.
Using these we can build a formal language which will be a set of
well-formed propositional formulae recursively defined as follows:

Definition 2.2.4 (Propositional formulae). A propositional variable is an
atomic propositional formula. Any atomic propositional formula is a
propositional formula. If F and G are propositional formulae then the
following are also propositional formulae:
- ¬F
- ( F ∧ G)
- ( F ∨ G)
- ( F → G)
Some examples of well-formed formulas are:
(¬ p → (q → p)) and ((p → q) ∧ (q → p))
It is common to impose a precedence, ≻, over the connectives to decrease
the number of parentheses needed to write a formula. If • ≻ ◦, then p • q ◦ r
means ((p • q) ◦ r) and not (p • (q ◦ r)). The standard precedence is:
¬ ≻ ∧ ≻ ∨ ≻ →. This precedence will be assumed throughout the rest
of this thesis.

2.2.2 Semantics
To make sense of the formulae we defined in the previous subsection, we
introduce some semantics to interpret them.
Definition 2.2.5 (Truth values and propositional interpretations). An
interpretation, I, is an assignment of truth values, true or false, to each
propositional variable.
We extend this notion of interpretation to formulas of the formal language
described in the last subsection recursively as follows:
Definition 2.2.6 (Interpretation of propositional formulae). If F is an atomic
formula then I(F) = true precisely when we have assigned true to F,
otherwise I(F) = false. Otherwise if:
- F = ¬G, then I(F) = true if I(G) = false, otherwise I(F) = false.
- F = (G ∧ H), then I(F) = true when both I(G) = true and I(H) = true,
otherwise I(F) = false.
- F = (G ∨ H), then I(F) = false when both I(G) = false and I(H) =
false, otherwise I(F) = true.
- F = (G → H), then I(F) = false when both I(G) = true and I(H) =
false, otherwise I(F) = true.
This way, the symbols: ¬, ∧, ∨ and → match our intuitive notions of
the natural language logical connectives: "not", "and", "or", and "implies"
respectively. To make this more concrete, let p be a propositional variable
representing the proposition "It rains" and q be a propositional variable
representing the proposition "I go outside". The formula p → ¬ q then
reads: "If it is raining then I do not go outside". Notice how this is true

precisely when the following sentence is true: "It is not the case that it rains
and I go outside". In propositional syntax that is: ¬ ( p ∧ q).
Definition 2.2.7 (Semantic equivalence). Two formulae F and G are
semantically equivalent if I ( F ) = I ( G ) for all possible interpretations I. This
is usually written F ≡ G.
Definition 2.2.8 (Semantic consequence). Formula F is a logical
consequence of the set of formulae Γ if, whenever I(G) = true for all G ∈ Γ
simultaneously, then I(F) = true, for all interpretations I. This is usually
written Γ ⊨ F.
Definition 2.2.9 (Tautology). Formula F is a tautology if it is true for every
interpretation.
In propositional logic, the formulae F and G are semantically equivalent if
for all possible assignments of truth values to their propositional variables,
if F is true, then G is true, and if F is false, then G is false. We can show this
using what are known as truth tables (T = true and F = false):

p | q | p → ¬q | ¬(q ∧ p) | ¬p | ¬p → (p → ¬q)
T | T |   F    |    F     | F  |       T
T | F |   T    |    T     | F  |       T
F | T |   T    |    T     | T  |       T
F | F |   T    |    T     | T  |       T

Table 2.2: Truth table

The leftmost columns of Table 2.2 indicate truth assignments to the
propositional variables p and q. Two variables and two truth values (true,
false) give 2² = 4 possible interpretations, hence there are 4 rows of truth
values in the table. Row 1: p = T and q = T, Row 2: p = T and q = F, etc. The
remaining columns give the truth value of the formula above them in each
interpretation, computed as defined in Definition 2.2.6. We can now formally
say that ¬(q ∧ p) and p → ¬q are
semantically equivalent, since their truth values are the same for all truth
assignments to p and q. We also see that both ¬(q ∧ p) and p → ¬q
are, according to Definition 2.2.8, semantic consequences of ¬p, since if
we know that I(¬p) = true, then both I(¬(q ∧ p)) = true and I(p → ¬q) =
true. Checking with our intuition: if we know that "It is not raining", then
we know that "It is not the case that it rains and I go outside", since the first
conjunct of the "and" cannot be satisfied. Lastly we see that ¬p → (p
→ ¬q) is a tautology because it is true irrespective of interpretation.
Notice: To determine the truth value of a formula, it does not matter what
the constituent propositional variables represent, only what their truth
values are.
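To make the recursive interpretation of Definition 2.2.6 concrete, here is a minimal sketch in Python. The nested-tuple representation of formulae and the connective tags are illustrative choices made for this example, not notation from the thesis:

```python
# Minimal sketch: interpreting propositional formulae (Definition 2.2.6).
# Atomic formulae are strings; compound formulae are nested tuples such as
# ("imp", ("not", "p"), ("imp", "p", ("not", "q"))).
from itertools import product

def evaluate(formula, interp):
    """Truth value of `formula` under `interp` (a dict: variable -> bool)."""
    if isinstance(formula, str):                 # atomic formula
        return interp[formula]
    op, *args = formula
    if op == "not":
        return not evaluate(args[0], interp)
    if op == "and":
        return evaluate(args[0], interp) and evaluate(args[1], interp)
    if op == "or":
        return evaluate(args[0], interp) or evaluate(args[1], interp)
    if op == "imp":
        return (not evaluate(args[0], interp)) or evaluate(args[1], interp)
    raise ValueError(f"unknown connective: {op}")

def is_tautology(formula, variables):
    """True iff the formula is true in every interpretation (Definition 2.2.9)."""
    return all(evaluate(formula, dict(zip(variables, values)))
               for values in product([True, False], repeat=len(variables)))

# The rightmost formula of Table 2.2, ¬p → (p → ¬q):
f = ("imp", ("not", "p"), ("imp", "p", ("not", "q")))
print(is_tautology(f, ["p", "q"]))  # True, as the truth table shows
```

Enumerating all 2^n interpretations is exactly the truth-table method; it is exponential in the number of variables, which is one motivation for the deductive systems introduced next.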

2.2.3 Deductive systems
We previously defined a semantic notion of what logical consequence is
by semantic consequence. We now introduce a purely syntactic way to
determine whether conclusions follow from premises without considering
interpretations.
Definition 2.2.10 (Syntactic consequence). Given a formal system as
defined in Definition 2.1.1, a formula F is a syntactic consequence of a set
of formulae Γ, written Γ ⊢ F, if F can be inferred from Γ according to the
inference rules of the formal system.
Definition 2.2.11 (Formal Proof). Given a formal system, a formal proof is
a sequence of formulas, each of which is an axiom of the formal system or
is a syntactic consequence of the preceding formulae in the sequence. The
last formula in a formal proof is called a theorem.
This notion of a formal proof will not mean much unless our logical system
has two core properties:
Definition 2.2.12 (Semantical soundness). A logical system is called sound,
iff any theorem is a tautology of the system.
Definition 2.2.13 (Semantical completeness). A logical system is called
complete, iff any tautology is a theorem of the system.
Hence we want to connect the semantic notion of truth with the syntactic
notion of proof, by defining a sound and complete logical system.
There are many formal systems which when combined with the standard
semantics for propositional logic create a sound and complete logical
system. First we will introduce some standard syntax for all deductive
systems (proof calculi) to make it easier to compare different proof calculi:
Definition 2.2.14 (General calculus and proof syntax). A proof calculus
consists of:

- Axioms, written:

      ────────── (Axiom name)
          w

- Rules, written:

      w1   w2   · · ·   wn
      ──────────────────── (Rule name)
               w

Where w1, . . . , wn are premises and w is the conclusion. A derivation of w
is then a tree growing upwards such that:
- Nodes are axioms or rules of the calculus
- The conclusion in the root node is w
- The conclusion of each inner node is a premise of its parent node
A proof of w is a derivation of w for which all leaves are axioms.

Table 2.3 shows an example of a proof calculus for propositional logic:

  ─────────────── (axiom 1)
   F → (G → F)

  ──────────────────────────────────────── (axiom 2)
   (F → (G → H)) → ((F → G) → (F → H))

  ─────────────────────── (axiom 3)
   (¬F → ¬G) → (G → F)

   F     F → G
  ────────────── (Modus Ponens)
        G

Table 2.3: Łukasiewicz’ calculus for propositional logic [3]

Notice that Łukasiewicz’ calculus only uses the ¬ and → connectives. For
this calculus to be sound and complete we must first translate all formulae
containing ∧ and ∨ as follows:
- Any formula of the form F ∨ G becomes ¬ F → G
- Any formula of the form F ∧ G becomes ¬( F → ¬ G )
The proof is omitted here, but one can verify that Łukasiewicz’ calculus
(Table 2.3) with standard propositional semantics is sound and complete.

2.3 First-order Logic


Propositional logic treats propositions as atomic entities. It turns out,
however, that it is very limiting to only be able to reason about propositions.
In propositional logic, we are easily able to reason about singular things
that may or may not be true, like whether it is raining. However, to reason
that when it rains, grass gets wet, we would need a proposition
for every single blade of grass in the world saying that it is wet if it rains.
First-order logic is a collection of equivalently expressive formal systems
which extends propositional logic by breaking up propositions to allow for
quantification over things in the world.

2.3.1 Syntax
We build first-order formal languages as follows:
Definition 2.3.1 (First-order symbols). Any first-order formal language is
built using the following disjoint sets of symbols, the logical symbols:
- Logical connectives: ¬, ∧, ∨, →.
- Quantifiers: ∀ (universal), ∃ (existential)
- Variables: v1, v2, v3, ... (countably infinitely many)

and the non-logical symbols:
- Constants: c1 , c2 , c3 , ...
- Functions: f 1 , f 2 , f 3 , ...
- Predicates: P1 , P2 , P3 , ...
Definition 2.3.2 (First-order signature). The non-logical symbols of a first-
order language make up what is called the signature of the language. A
signature is defined as a tuple σ = (Scon, Sfun, Spre, ar) such that:
- Scon is the set of constant symbols
- Sfun is the set of function symbols
- Spre is the set of predicate symbols
- ar : Sfun ∪ Spre → N; ar assigns a natural number called the arity to
every function and predicate symbol
Definition 2.3.3 (First-order terms). We define the set of terms inductively
as the smallest set such that:
- Every variable and constant is a term denoted by t1 , t2 , t3 , ...
- If f is a function of arity ar ( f ) = n and t1 , ..., tn are terms, then so is
f (t1 , ..., tn )
Definition 2.3.4 (Atomic first-order formula). We define the set of atomic
first-order formulae inductively as the smallest set such that for each
predicate symbol P:
- If P is a predicate symbol of arity ar ( P) = n = 0, then P is an atomic
formula
- If P is a predicate symbol of arity ar ( P) = n > 0 and t1 , ..., tn are terms
then P(t1 , ..., tn ) is an atomic formula
Definition 2.3.5 (First-order formulae). We define the set of formulae
inductively as the smallest set such that:
- All atomic formulae are formulae
- If F and G are formulae, then ¬ F, ( F ∧ G ), ( F ∨ G ), and ( F → G ) are
formulae
- If F is a formula and v is a variable, then ∀vF and ∃vF are formulae
All occurrences of the variable v in a formula F are said to be "bound" in
the formulae ∀vF and ∃vF, and "inside the scope" of the respective quantifier
in each formula.
As with propositional logic, there is a standard precedence for first-order
formulae, defined in order to reduce parentheses. The precedence is: ∀ ≈
∃ ≻ ¬ ≻ ∧ ≻ ∨ ≻ →, where "≈" means same precedence and "≻" is as
before.

2.3.2 Semantics
Again, as with propositional logic, we want to make sense of the formulae
we just defined. We are no longer dealing with propositions as atomic
entities, instead we will be interested in the truth values of the first-order
analogs of propositions, predicates:
Definition 2.3.6 (First-order predicates). A predicate is an atomic formula
containing one or more placeholders and has a truth value of true or false
depending on the value of the placeholders.
Definition 2.3.7 (First-order interpretation). A first-order interpretation of
a first-order language with signature σ is a tuple I = ( D, ν) such that:
- D is a set of elements, called the domain of discourse
- ν is an interpretation function, such that:
- For every constant symbol c ∈ σ, ν(c) ∈ D
- For every function symbol f ∈ σ of arity ar(f) = n, ν(f) : D^n →
D
- For every predicate symbol P ∈ σ of arity ar(P) = n, ν(P) ⊆ D^n
Definition 2.3.8 (First-order substitution and closed formulae). A
substitution τ is a total mapping from the set of variables to terms. If F is a
formula, we write τ(F) to mean the result of substituting each free (not in the
scope of a quantifier ∀v or ∃v) occurrence of every variable v in the domain
of τ with τ(v) in F. We also write F[v/t] to mean the substitution of each free
occurrence of the variable v with the term t in F.
Definition 2.3.9 (Closed formula). A first-order formula F is said to be
closed if it contains no free variables. We say F is closed under the
substitution τ if τ ( F ) contains no free variables.
Definition 2.3.10 (Interpretation of first-order terms). Given a first-order
interpretation, I = (D, ν), we interpret a closed term t recursively as follows, if:
- t is a constant, then ν(t) ∈ D
- t is a function application f(t1, ..., tm) with ar(f) = m, then ν(f(t1, ..., tm)) =
ν(f)(ν(t1), ..., ν(tm))
Definition 2.3.11 (Interpretation of closed first-order formulae). Given a
first-order interpretation, I, we can define what it means for a formula F to
be true in the interpretation I = ( D, ν), recursively as follows:
- A closed atomic formula F = P(t1 , ..., tn ) is true in I if
(ν(t1 ), ..., ν(tn )) ∈ ν( P).
- Otherwise if:
- F = ¬ G, then F is true in I if G is not true in I
- F = G ∧ H is true in I if both G and H are true in I

- F = G ∨ H is true in I if not both G and H are false in I
- F = G → H is true in I if it is not the case that G is true in I and
H is false in I
- F = ∀xG is true in I if G[x/a] is true in I for all a ∈ D
- F = ∃xG is true in I if G[x/a] is true in I for at least one a ∈ D
Definition 2.3.12 (First-order satisfiability and validity). If F is true in an
interpretation I, we say that I satisfies F. If there is an interpretation I which
satisfies F, then we say F is satisfiable, otherwise we say it is unsatisfiable.
If a formula is true in every interpretation, then we say it is valid, otherwise
we say it is invalid.

2.3.3 Deductive system


The previously defined notions of semantic and syntactic consequence hold
for first-order logic as well. Table 2.4 shows Hilbert’s proof calculus for
first-order logic leading to a semantically sound and complete first-order
logical system.

  ─────────────── (axiom 1)
   F → (G → F)

  ──────────────────────────────────────── (axiom 2)
   (F → (G → H)) → ((F → G) → (F → H))

  ─────────────────────── (axiom 3)
   (¬F → ¬G) → (G → F)

  ──────────────── (axiom 4)
   ∀xF → F[x/t]

  ──────────────────────────── (axiom 5)
   ∀x(F → G) → (∀xF → ∀xG)

  ──────────── (axiom 6)
   F → ∀xF
  (provided x does not occur free in F)

   F     F → G
  ────────────── (Modus Ponens)
        G

Table 2.4: A simplified Hilbert calculus [3]

Notice that, just like Łukasiewicz’ calculus for propositional logic, the
simplified Hilbert calculus uses a limited number of connectives. For this
calculus to be sound and complete we must first translate all formulae
containing ∧, ∨, and ∃ as follows:
- Any formula of the form F ∨ G becomes ¬F → G
- Any formula of the form F ∧ G becomes ¬(F → ¬G)
- Any formula of the form ∃xF becomes ¬∀x¬F

Combining Hilbert’s deductive system with the first-order syntax and
semantics we have defined so far, we have a semantically sound and
complete logical system expanding the propositional system with
quantification. First-order logic is a key logic for a multitude of reasons:
Theorem 2.3.1 (Compactness). A set of formulae, Γ, has a satisfying
interpretation if and only if every finite subset Γ′ ⊆ Γ has a satisfying
interpretation.
Theorem 2.3.2 ((downward) Löwenheim–Skolem). If a set of formulas, Γ, has
a satisfying interpretation, it is satisfiable in an interpretation with a countable
domain.
Theorem 2.3.3 (Lindström’s theorem). First-order logic is the most
expressive logic having both the compactness property and the (downward)
Löwenheim–Skolem property.
Definition 2.3.13 (Syntactic completeness). A formal system is syntactically
complete if for each formula, F, in the formal language, either F or ¬F is
provable.
Theorem 2.3.4 (Undecidability, Church and Turing). First-order logic is not
syntactically complete. Moreover, there is no algorithm that always terminates
and decides whether an arbitrary first-order formula is a theorem or not. We
say that first-order logic is "undecidable".
First-order logic can express a large part of mathematics and encode any
computable problem. Therefore, Lindström’s theorem and the semantical
soundness and completeness of first-order logic make it an interesting
system for the automation of proof search. Unfortunately, as Theorem 2.3.4
states, first-order logic is undecidable.
However, since first-order logic is semantically complete, there is an
algorithm that given a valid first-order proposition terminates with a proof
of this formula, but given an invalid formula might not terminate. In
this case we say first-order logic validity is semi-decidable (or Turing
recognizable).

2.4 First-Order ATP


The goal of automating theorem proving procedures was an early and
major impetus for the development of the field of computer science. The
Turing machine was invented to prove that first-order logic is undecidable
[4], and the formalization of the most important problem in computer
science, the P/NP problem, is based on ATP in propositional logic [5]. We
will now have a look at the state-of-the-art algorithms which semi-decide
validity of first-order formulae.
Automating first-order theorem proving boils down to automating a
first-order deductive system.

2.4.1 Natural Deduction and Sequent Calculus
The calculi presented earlier in Tables 2.3 and 2.4 are examples of so called
Hilbert-style systems. These came bundled with the early exploration of
modern logic in the late-19th/early-20th century by Frege, Hilbert, and
Russell and are quite compact and elegant; only using the modus ponens
rule and a few axioms. However, these calculi are quite far from the
way humans normally reason. Wishing to construct a formalism closer
to human reasoning, Gerhard Gentzen proposed the natural deduction
system in 1934 [6]. In this system Gentzen addresses the fact that
humans do not normally start proofs from axioms, but instead make
claims under some assumptions and then analyze the assumptions and
claims separately and combine them later in the proof. Assumptions are
combined into claims by what are known as introduction rules, while
compound assumptions are split into their parts by elimination rules. However,
this two-way search makes natural deduction not well suited for automation.
Gentzen also later realized that he could convert this into a one-way
procedure by making assumptions local and encoding the derivability
relations of natural deduction in what he called sequents. A sequent, written
Γ ⊢ ∆, consists of an antecedent Γ and a succedent ∆, where Γ and ∆ are sets
of formulae. Instead of formulae, the words of the calculus are sequents.

  ───────────── (axiom)
   Γ, A ⊢ A, ∆

   Γ, A, B ⊢ ∆
  ───────────── (∧-left)
   Γ, A ∧ B ⊢ ∆

   Γ ⊢ A, ∆    Γ ⊢ B, ∆
  ───────────────────── (∧-right)
   Γ ⊢ A ∧ B, ∆

   Γ, A ⊢ ∆    Γ, B ⊢ ∆
  ───────────────────── (∨-left)
   Γ, A ∨ B ⊢ ∆

   Γ ⊢ A, B, ∆
  ───────────── (∨-right)
   Γ ⊢ A ∨ B, ∆

   Γ ⊢ A, ∆    Γ, B ⊢ ∆
  ───────────────────── (→-left)
   Γ, A → B ⊢ ∆

   Γ, A ⊢ B, ∆
  ───────────── (→-right)
   Γ ⊢ A → B, ∆

   Γ ⊢ A, ∆
  ────────── (¬-left)
   Γ, ¬A ⊢ ∆

   Γ, A ⊢ ∆
  ────────── (¬-right)
   Γ ⊢ ¬A, ∆

   Γ, A[x\ti], ∀xA ⊢ ∆
  ──────────────────── (∀-left)
   Γ, ∀xA ⊢ ∆

   Γ ⊢ A[x\tj], ∆
  ─────────────── (∀-right)
   Γ ⊢ ∀xA, ∆

   Γ, A[x\tj] ⊢ ∆
  ─────────────── (∃-left)
   Γ, ∃xA ⊢ ∆

   Γ ⊢ ∃xA, A[x\ti], ∆
  ──────────────────── (∃-right)
   Γ ⊢ ∃xA, ∆

(For ∀-right and ∃-left, tj must not occur in the conclusion)

Table 2.5: Sequent Calculus LK

Theorem 2.4.1 (Correctness and Completeness of LK). A formula F is valid
iff there is a proof for "⊢ F".
When applying the ∀ and ∃ rules of the LK calculus in Table 2.5, we
must choose arbitrary terms ti and tj for x. We would like to
do something more intelligent than random guessing of the terms we are
substituting in, so that we are more likely to end up with two syntactically
equivalent formulae in the antecedent and succedent. That is, more likely
to end up in an axiom.
Definition 2.4.1 (Unification and most general unifier). We say that two
terms s and t are unifiable if there exists a substitution σ such that σ(s) is
syntactically equivalent to σ(t), written σ(s) = σ(t). E.g., s = f(x) and
t = f(a) are unifiable using the substitution σ = {x\a}. We call σ a "unifier"
for s and t. A unifier σ1 is a "most general unifier" (mgu) for s and t if:
- σ1 is a unifier for s and t
- for every unifier σ2 of s and t, there exists a substitution τ such that
σ2 = τ(σ1)
Keeping track of an mgu σ and using so-called "free variables" allows
us to delay making substitution decisions, i.e. instantiating terms, until they
are absolutely necessary, and thus reduces the search space. The following
calculi we will look at all use unification and are more machine-oriented
than the previous calculi. Most real-world ATP systems for first-order logic
use unification.
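As an illustration of Definition 2.4.1, the following sketch computes an mgu by syntactic unification with an occurs check. The term representation (variables as capitalized strings, a Prolog-like convention; constants as lower-case strings; compound terms as tuples) is an assumption made for this example:

```python
# Sketch of first-order unification with occurs check (Definition 2.4.1).
# Variables: capitalized strings ("X"); constants: lower-case strings ("a");
# compound terms: (functor, arg1, ..., argn).

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def walk(t, subst):
    """Follow variable bindings in the substitution."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    """Occurs check: does variable v appear inside term t?"""
    t = walk(t, subst)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, subst) for a in t[1:])

def unify(s, t, subst=None):
    """Return an mgu extending `subst`, or None if s and t are not unifiable."""
    subst = {} if subst is None else subst
    s, t = walk(s, subst), walk(t, subst)
    if s == t:
        return subst
    if is_var(s):
        return None if occurs(s, t, subst) else {**subst, s: t}
    if is_var(t):
        return unify(t, s, subst)
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        for a, b in zip(s[1:], t[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None

# f(X) and f(a) are unifiable with mgu {x\a}, as in the example above:
print(unify(("f", "X"), ("f", "a")))  # {'X': 'a'}
```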

2.4.2 Method of Analytic Tableaux


The semantics of sequents are such that a sequent Γ ⊢ ∆ is true in every
interpretation if all interpretations that satisfy all formulae in Γ satisfy at
least one formula in ∆. In classical logics such as first-order logic, one way
to prove that a formula F is true in every interpretation is to show that
¬F is false in every interpretation. This is a consequence of "the law of
the excluded middle": either a statement is true or its negation is true.
Expressed in terms of sequents, F is true in every interpretation if the
sequent {¬F} ⊢ {} is true in every interpretation. A proof of this is called
a "proof by contradiction" or "refutation".
By negating every formula in the succedent of a sequent and moving them
to the antecedent, we can create a sequent calculus where all rules have
empty succedents in both the premises and conclusions. These are called
"one-sided sequent calculi" or "block tableaux". We will see an example of
this shortly, but first we will have a look at some further useful notions.
Definition 2.4.2 (Negation Normal Form). Any first-order formula F can
be translated into a semantically equivalent formula F 0 in negation normal
form (NNF). Formulae in NNF do not contain →, and ¬ only occurs
(directly) in front of atomic formulae. Formulae are translated into NNF
as follows (the rules are also sketched as code after the list):

- Any formula of the form A → B becomes ¬ A ∨ B
- Any formula of the form ¬( A ∧ B) becomes (¬ A ∨ ¬ B)
- Any formula of the form ¬( A ∨ B) becomes (¬ A ∧ ¬ B)
- Any formula of the form ¬∀ xA becomes ∃ x ¬ A
- Any formula of the form ¬∃ xA becomes ∀ x ¬ A
- Any formula of the form ¬¬ A becomes A
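These rules read off directly as a recursive procedure. In the following sketch, formulae are nested tuples with illustrative connective tags and atoms are plain strings; this representation is an assumption made for the example:

```python
# Sketch of the NNF translation (Definition 2.4.2): eliminate → and push ¬
# inward until it only occurs directly in front of atomic formulae.

def nnf(f):
    if isinstance(f, str):                      # atomic formula
        return f
    op, *args = f
    if op == "imp":                             # A → B becomes ¬A ∨ B
        return nnf(("or", ("not", args[0]), args[1]))
    if op in ("and", "or"):
        return (op, nnf(args[0]), nnf(args[1]))
    if op in ("forall", "exists"):
        return (op, args[0], nnf(args[1]))
    if op == "not":
        g = args[0]
        if isinstance(g, str):                  # negated atom: already in NNF
            return f
        gop, *gargs = g
        if gop == "not":                        # ¬¬A becomes A
            return nnf(gargs[0])
        if gop == "and":                        # ¬(A ∧ B) becomes ¬A ∨ ¬B
            return ("or", nnf(("not", gargs[0])), nnf(("not", gargs[1])))
        if gop == "or":                         # ¬(A ∨ B) becomes ¬A ∧ ¬B
            return ("and", nnf(("not", gargs[0])), nnf(("not", gargs[1])))
        if gop == "imp":                        # ¬(A → B) becomes A ∧ ¬B
            return ("and", nnf(gargs[0]), nnf(("not", gargs[1])))
        if gop == "forall":                     # ¬∀xA becomes ∃x¬A
            return ("exists", gargs[0], nnf(("not", gargs[1])))
        if gop == "exists":                     # ¬∃xA becomes ∀x¬A
            return ("forall", gargs[0], nnf(("not", gargs[1])))
    raise ValueError(f"unexpected formula: {f}")

# ¬∀x(P → Q) becomes ∃x(P ∧ ¬Q):
print(nnf(("not", ("forall", "x", ("imp", "P", "Q")))))
```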
Definition 2.4.3 (Skolemization). Through Skolemization, any first-order
formula F′ can be translated into an equisatisfiable formula F″ which
does not contain ∃. Equisatisfiable means that F′ is satisfiable if and only
if F″ is satisfiable; however, they may be satisfied by different variable
instantiations, so they are not necessarily semantically equivalent. A
formula F′ in NNF is Skolemized as follows:
- Any formula of the form ∀y1, . . . , ∀yn ∃xA becomes
∀y1, . . . , ∀yn A[x\f(y1, . . . , yn)], where f is a new function symbol.
Combining the notions of one-sided sequents, unification, and Skolemized
negation normal form, we can build the fairly compact, sound, and complete
block tableau calculus shown in Table 2.6, with the ⊢ sign omitted.

  ───────────── (axiom)
   L1, ¬L2, ∆
  (such that σ is an m.g.u. of L1 and L2)

   F, G, ∆
  ────────── (α-rule)
   F ∧ G, ∆

   F, ∆    G, ∆
  ────────────── (β-rule)
   F ∨ G, ∆

   F[x\x*], ∀xF, ∆
  ───────────────── (γ-rule)
   ∀xF, ∆
  (x* is a new variable)

Table 2.6: Block Tableau Calculus

Theorem 2.4.2 (Correctness and Completeness of the Block Tableau


Calculus). A formula F is valid iff there is a Block Tableau proof for the Skolemized
negation normal form of ¬ F for a single substitution σ.
This calculus has "tableau" in the name because it has a quite nice
(unambiguous) graphical representation, due to only having one branching
rule and two non-branching rules. For example, a proof tree for refuting the
propositional set of formulae {¬p ∧ q, (p ∨ r) ∨ ¬q} can be drawn as shown
in Figure 2.1.

[Figure 2.1 shows two proof tableaux refuting {¬p ∧ q, (p ∨ r) ∨ ¬q}: the left tree writes the full set ∆ in every node, the right tree writes only the newly generated formulae.]

Figure 2.1: Proof tableau, Left: Explicit, Right: Implicit [7]

In the left tree in Fig. 2.1 we write ∆ explicitly in each node. We have
reached an axiom when a node contains both A and ¬ A. In the right tree,
we only write the new formulae generated by the rule application in each
node. An axiom is then reached when both A and ¬ A are on the same
branch.

2.4.3 Connection Calculus


Building on the formula transformations defined earlier, we arrive at the
normal form which is used by most state-of-the-art theorem provers.
Definition 2.4.4 (Clausal Form). Any formula F′ in Skolemized NNF can
be translated into an equisatisfiable formula F″ in clausal form,
represented as a set of clauses (sets of literals). The translation is as follows:
1. Any subformula of the form A ∨ (B ∧ C) becomes (A ∨ B) ∧ (A ∨ C)
2. For any subformula of the form ∀xA in F′, A becomes A[x\x*] and F′
becomes ∀x* F′, where x* is a new unique variable.
3. The resulting formula should now be of the form ∀x1, . . . , ∀xn M,
where M (called the matrix) is a conjunction of disjunctions of literals. For
every disjunction in M, create the set of its constituent literals. F″ is
then the set of all these sets (steps 1 and 3 are sketched in code below).
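For a quantifier-free matrix in Skolemized NNF, steps 1 and 3 can be sketched as follows; representing literals as plain strings is an assumption made for this example:

```python
# Sketch of steps 1 and 3 of Definition 2.4.4: distribute ∨ over ∧, then
# collect the matrix M into a set of clauses (sets of literals).

def distribute(f):
    """Return a list of clauses (lists of literals) equivalent to f."""
    if isinstance(f, str):                       # a literal
        return [[f]]
    op, a, b = f
    if op == "and":                              # clauses of both conjuncts
        return distribute(a) + distribute(b)
    if op == "or":                               # A ∨ (B ∧ C) becomes
        return [ca + cb                          # (A ∨ B) ∧ (A ∨ C)
                for ca in distribute(a)
                for cb in distribute(b)]
    raise ValueError(f"expected ∧/∨ over literals, got: {op}")

def clausal_form(f):
    """The matrix as a set of sets of literals (step 3)."""
    return {frozenset(c) for c in distribute(f)}

# (¬p ∧ q) ∧ ((p ∨ r) ∨ ¬q), the set of formulae refuted in Figure 2.1:
m = ("and", ("and", "¬p", "q"), ("or", ("or", "p", "r"), "¬q"))
print(clausal_form(m))
# e.g. {frozenset({'¬p'}), frozenset({'q'}), frozenset({'¬q', 'p', 'r'})}
```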
Using the clausal form of a formula we can define another proof calculus
reminiscent of the block tableau calculus called the connection calculus.

  ───────────── (axiom)
   {}, M, Path

   C2, M, {}
  ─────────── (start)
   ε, M, ε

   C, M, Path ∪ {L2}
  ───────────────────────── (reduction)
   C ∪ {L1}, M, Path ∪ {L2}

   C2 \ {L2}, M, Path ∪ {L1}    C, M, Path
  ───────────────────────────────────────── (extension)
   C ∪ {L1}, M, Path

In all cases C2 is a copy of C1 ∈ M, L2 ∈ C2, and σ(L1) is the complement of σ(L2)

Table 2.7: Connection Calculus [8]

The complement L̄ of a literal L is P if L is of the form ¬P, and ¬L otherwise.


In Table 2.7, the words of the calculus are tuples “C, M, Path”, where M
is the matrix of the given formula in clausal form, while C and Path are
sets whose elements are removed and added according to the rules of the
calculus.
Theorem 2.4.3 (Correctness and Completeness of the Connection Calculus). A
formula F in clausal form is valid iff there is a Connection proof for "ε, M, ε",
where M is the matrix of ¬F, for a single substitution σ.
Further description of the intuition behind these sets, and proofs of
soundness and completeness, can be found in Bibel’s book [8]. This calculus
is the basis of the leanCoP [9, 10] prover, which we will have a closer look
at later in Chapter 4.

2.4.4 Resolution Calculus


A common heuristic for writing efficient programs is to restrict branching
computation as much as possible. The resolution calculus has been the
standard basis for first-order proof search since it was introduced by Davis
and Putnam in 1960 [11] and improved by the introduction of unification
in 1965 by Robinson [12].

  ─────────────────────── (axiom)
   C1, . . . , {}, . . . , Cn

   C1, . . . , Ci ∪ {L1}, . . . , Cj ∪ {L2}, . . . , Cn, σ(Ci ∪ Cj)
  ──────────────────────────────────────────────────── (resolution)
   C1, . . . , Ci ∪ {L1}, . . . , Cj ∪ {L2}, . . . , Cn
  (where σ is an m.g.u. of L1 and the complement of L2)

   C1, . . . , Ci ∪ {L1, . . . , Lm}, . . . , Cn, σ(Ci ∪ {L1})
  ──────────────────────────────────────────── (factorization)
   C1, . . . , Ci ∪ {L1, . . . , Lm}, . . . , Cn
  (where σ is an m.g.u. of L1, . . . , Lm)

Table 2.8: Resolution Calculus

Theorem 2.4.4 (Correctness and Completeness of the Resolution Calculus).
A formula F is valid iff there is a Resolution proof (a derivation of the empty
clause) from the clausal form of ¬F.
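To illustrate the calculus in the ground (propositional) case, where unification degenerates to syntactic equality and the resolution rule alone suffices, the following sketch saturates a clause set and reports whether the empty clause is derivable. The clause representation is an assumption made for this example:

```python
# Sketch: propositional resolution refutation by naive saturation.
# Clauses are frozensets of literal strings such as "p" and "¬p".

def complement(lit):
    return lit[1:] if lit.startswith("¬") else "¬" + lit

def resolvents(ci, cj):
    """All clauses obtainable by resolving ci with cj on one literal."""
    return [(ci - {l}) | (cj - {complement(l)})
            for l in ci if complement(l) in cj]

def refutable(clauses):
    """Saturate under resolution; True iff the empty clause is derived."""
    clauses = set(clauses)
    while True:
        new = {frozenset(r)
               for ci in clauses for cj in clauses if ci != cj
               for r in resolvents(ci, cj)}
        if frozenset() in new:
            return True
        if new <= clauses:           # saturated without the empty clause
            return False
        clauses |= new

# {¬p}, {q}, {p, r, ¬q} is satisfiable; adding {¬r} makes it refutable:
cs = [frozenset({"¬p"}), frozenset({"q"}), frozenset({"p", "r", "¬q"})]
print(refutable(cs))                          # False
print(refutable(cs + [frozenset({"¬r"})]))    # True
```

Real saturation provers add clause-selection heuristics, subsumption, and indexing on top of this basic loop; choosing which pair of clauses to resolve next is exactly the kind of decision that machine learning can guide.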
Saturation-style calculi based on resolution are at the core of all of the
currently most efficient first-order ATP systems, such as Vampire [13] and E
[14] (measured on a subset of the TPTP theorem database during the CADE
ATP System Competitions [15]). These are both based on the superposition
calculus, which is an extension of resolution.

Chapter 3

Machine Learning (ML)

3.1 Formalizing Inductive Reasoning


Machine Learning (ML) is the subfield of Artificial Intelligence (AI)
concerned with algorithms which improve automatically through experience.
More concretely, these algorithms are capable of generalization. That is,
they provide sensible outputs for inputs not encountered previously, by
extracting relevant information from previous inputs and supervision
signals and applying it to analyze new inputs. The algorithms build
mathematical models based on sample data, known as "training data", in
order to make predictions or decisions they haven’t been explicitly
programmed to make. This section is loosely inspired by, and contains
paraphrasings from, the following textbooks: [16–19].

3.2 ML Paradigms
Machine learning approaches have traditionally been divided into three
categories, depending on the nature of the training data available and the
use case of the algorithm.

[Figure 3.1 shows three input-to-output pipelines: supervised learning receives an error signal from a critic, unsupervised learning has no critic, and reinforcement learning receives a reward signal from a critic.]

Figure 3.1: Comparison of ML paradigms [20]

Supervised learning
While all machine learning algorithms learn input/output behaviour by
optimizing a function, called the objective function or loss function, and
are in this sense "supervised" by that goal, "supervised" in the case of
supervised learning means that the training data consists of both example
inputs and their desired outputs. The goal is to learn a function that best
generalizes this input → output behaviour. Most of the examples we will
see in the next subsection are in this category.
Definition 3.2.1 (Regression). Supervised learning task where the output
space is continuous.
Definition 3.2.2 (Classification). Supervised learning task where the output
space is discrete.

Unsupervised learning
In unsupervised learning, while we are still "supervised" by our goal of
optimizing with respect to an objective function, our training data consists
only of example inputs, with no example outputs. The goal of unsupervised
algorithms is to discover hidden patterns in the inputs and produce output
summarizing the found patterns.
Definition 3.2.3 (Dimensionality reduction). Unsupervised learning task
where the output space is continuous.
Definition 3.2.4 (Clustering). Unsupervised learning task where the output
space is discrete.

Reinforcement learning
Reinforcement learning is perhaps the most general of the three paradigms,
and is the category concerned with agents which interact with
environments. In reinforcement learning, training data consists of example
inputs, outputs, and rewards for giving the respective outputs for the
respective inputs. The goal is to learn a function from input to output which
maximizes reward.
Definition 3.2.5 (Optimal control/decision-making). Reinforcement
learning task where the output space can be both discrete and continuous
(actions).

3.3 ML Algorithms and Models


Machine learning algorithms build models of input → output behaviour.
These models describe which output to give for any given input. ML
algorithms use training data to build models during what’s called "training".
Statistically, any machine learning problem can be expressed as follows:

Yi = g ( Xi ) + ei

Yi is the observed output, g is the underlying function describing the real
relationship between input and output, Xi = (x1, . . . , xn) is the input, and
ei represents un-modeled determinants of Yi or random statistical noise.
Our job is to find a model, f, which resembles g as closely as possible.
Different algorithms build different types of models; the choice of algorithm
will depend on the requirements of the model. The models differentiate
themselves in terms of expressive power, efficiency, and interpretability.

3.3.1 Non-parametric Models


Non-parametric models do not make any assumptions about the structure
of the function approximator or the underlying data distributions, and can
fit a large variety of functions. They are therefore good when there is little
prior knowledge about the data.

K-nearest neighbors (kNN)


This model has no training phase. Instead, to find f ( Xi ), it measures
the distance between Xi and every point X j in its training data according
to some metric (e.g. L2-distance) and considers the k closest points.
Classification is done by a plurality vote of the label of the k nearest
neighbors, while regression is done by averaging the label of the k nearest
neighbors.
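A minimal sketch of both variants, assuming the training data is stored in NumPy arrays:

```python
# Sketch of k-nearest-neighbour classification and regression using the
# L2 distance. X_train: (n, d) array of inputs; y_train: (n,) array of labels.
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Plurality vote among the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # L2 distance to every point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def knn_regress(X_train, y_train, x, k=3):
    """Average of the labels of the k nearest neighbours."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return y_train[nearest].mean()
```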

Decision Trees (DT)


Decision trees are more efficient at prediction time than kNN, in exchange
for having a training phase. In the decision tree model, during training,
we split the feature space into regions that best split the training data
according to some metric (usually based on information gain/entropy).
To find f ( Xi ), during prediction, instead of looking at the entire training
data set, we only have to determine which region Xi belongs to. Figure 3.2
shows a decision tree boundary for a decision tree classifier and the data
points used to train it. A decision boundary is the threshold where points
are classified as class 0 on one side and class 1 on the other side.

Figure 3.2: Decision tree training data and decision boundary

We can visualize the splits of the feature space shown in Figure 3.2 as a tree
as shown in Figure 3.3, hence the name "decision tree". f ( Xi ) is the leaf
found by following the correct branch corresponding to the values of the
features of Xi .

[Figure 3.3 shows the splits of Figure 3.2 as a binary tree: each inner node tests a feature threshold with Yes/No branches, and each leaf assigns Class 0 or Class 1.]

Figure 3.3: Decision tree prediction

In general, finding an optimal decision tree for a task is infeasible.


Therefore it is quite common to use ensemble methods which construct
more than one tree (e.g. boosting and random forests).
At the time of writing, most data science contests, such as the ones on
Kaggle.com, are won either by Neural Network models (which will be

discussed in detail in the next subsection) or Gradient boosting models
(usually ensembles of decision trees) [21]. These models are heavily applied
in real-world applications and research, since they tend to show good
results on general data.

3.3.2 Parametric Models


Unlike non-parametric models, parametric models have a predetermined
finite number of parameters, w. In particular, w ∈ Ω, where Ω is a
finite-dimensional parameter space. In these models, the same number
of parameters will be fit to approximate the input/output function, f ,
regardless of the nature and size of the training data. In all the models
we will look at, w is commonly referred to as the "weights" of the model.
The intuition is that the weights represent the discriminatory information
contained in the training data.
Fitting the function f (training) is usually done by either maximizing a
likelihood function, so that under the assumed conditions the observed
data is most probable, or by more general Bayesian methods.

Generalized Linear Models (GLM)


In these models, the predicted output Yi is assumed to have an error
distribution belonging to the exponential family. The exponential family is
a large class of statistical distributions including, among others, the normal,
binomial, and Poisson distributions. We let the range of f be the expected
value of Yi given the input:

f ( Xi ; w) = E(Yi | Xi )

Furthermore, f assumes the output is given by a linear combination of


the input features, Xi , determined by the parameters w. This linear
combination is then fed to a mean function h, giving the mean µ of the
assumed error distribution of the output. In general:

f ( Xi ; w) = E(Yi | Xi ) = µ = h( Xi · w)

If the predicted Yi is assumed to have a normally distributed error


distribution, h is the identity function:

h(z) = z

Our function approximator, f , then becomes:

f ( Xi ; w ) = h ( Xi · w ) = Xi · w

The retrieved model is what is known as "Linear regression".

Figure 3.4: Linear regression with one feature

The function found by linear regression in Figure 3.4 is:

f ( Xi ) = 1.8x1 + 0.97

That is, every data point is described as a vector Xi = ( x0 , x1 ) = (1, x1 ),


where x1 is the feature of our data we use to predict, and x0 is a bias term to
ensure the found function is affine. In terms of w, we say f is parameterized
by:

w = (w0 , w1 ) = (0.97, 1.8)

Now if the predicted Yi is assumed to have a Bernoulli distributed error


distribution, h is the logistic function:

h(z) = σ(z) = 1 / (1 + e^(−z))

Our function approximator, f , then becomes:

f(Xi; w) = h(Xi · w) = σ(Xi · w)

The retrieved model is what is known as "Logistic regression". This model


is commonly used for binary classification by introducing a threshold
function, g, to the above f , say:
g(f(Xi)) = 1 if f(Xi) > 0.5
g(f(Xi)) = 0 if f(Xi) ≤ 0.5

Figure 3.5: Logistic regression with two features

The function found by logistic regression in Figure 3.5 is:

f ( Xi ) = σ(1.67x2 + 1.30x1 − 3.97)

We now have two features of the data we use to predict, so every point is
described as a vector Xi = ( x0 , x1 , x2 ) = (1, x1 , x2 ), and f is parameterized
by:

w = (w0 , w1 , w2 ) = (−3.97, 1.30, 1.67)

The blue line in Figure 3.5 shows where f (z) = 0.5. That is, the decision
boundary. The decision boundary generalizes to a hyperplane in higher
dimensions if we predict using more features.
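As a sketch of how weights like these can be found, the following fits a logistic regression by gradient ascent on the log-likelihood. The learning rate, iteration count, and the convention that X already carries the bias column x0 = 1 are illustrative assumptions:

```python
# Sketch: logistic regression, f(Xi; w) = sigmoid(Xi · w), fitted by
# maximizing the likelihood with plain gradient ascent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=1000):
    """Weights w maximizing the likelihood of labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (y - sigmoid(X @ w))   # gradient of the log-likelihood
        w += lr * grad / len(y)
    return w

def predict(X, w, threshold=0.5):
    """The threshold function g applied to f(Xi; w)."""
    return (sigmoid(X @ w) > threshold).astype(int)
```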

Perceptron and Neural Networks (NN):


We will now look at some function approximators which take on the same
form as the GLM’s, that is:

f ( Xi ; w ) = h ( Xi · w )

However, they do not restrict the function h to arise from an error
distribution of the output belonging to the exponential family. Instead
they draw inspiration from the brain, where synapses carry signals between
neurons and neurons fire signals as a function of their received signals. In
this scenario, we typically refer to h as the "activation function", and its
output is called an "activation", denoted by a. All of this is illustrated in
Figure 3.6.

Figure 3.6: Neuron model

The simplest model based on this concept is called the "Perceptron". This
model uses the step function as h:

h(z) = 1 if z > 0
h(z) = 0 if z ≤ 0

Hence if the sum of incoming signals is greater than 0, the neuron


"fires", otherwise it doesn’t. This yields a simpler linear model for binary
classification than the logistic regression model, but would have similar
weights and plot if trained on the same data as the classifier in Figure 3.5.
The idea of neural networks is to compose these neurons to create a non-
linear model. Neural networks compose neurons in two ways: in width
and in depth. We compose in width by making w a matrix (denoted W)
instead of a vector, where each column of W corresponds to the weights
w of one neuron, and h a vector-valued function. Letting Xi W = z, f now
becomes the vector-valued function:

f(Xi; W) = h(Xi W) = (h0(z0), . . . , hl(zl)) = (a0, . . . , al)

This is shown graphically in Figure 3.7. Here is an explanation of the
notation used in Figures 3.7 and 3.8:

notation used in these figures 3.7 and 3.8:
- wi,j corresponds to the entry of W at row i and column j.
- xi means the i-th entry of vector x.
- x (d) means the version of x at depth d.
- w(d,q) is the q-th column of W (d)

Figure 3.7: Neurons composed in width

To compose the neurons depth-wise, we introduce a temporal element to


the model, by composing functions. f is now defined as follows:

a^(0) = Xi
a^(j) = h^(j)( a^(j−1) W^(j) )
f(Xi; W^(1,...,d)) = a^(d)

Figure 3.8: Neurons composed in depth

Composing neurons both in width and depth, we arrive at a family of
models called "Artificial Neural Networks" (ANN). In practice most ANNs
typically have a simpler structure than what is strictly possible. The same h
functions are typically used for every inner neuron; they are usually simple
and provide a smooth (differentiable) transition as input values change. A
commonly used activation function is the rectified linear unit:

h(z) = ReLU(z) = max(0, z)

The defining advantage of neural networks is that, by the universal


approximation theorem, they can approximate effectively any function to
any given precision just by increasing their width and depth enough. That
includes non-linear functions (due to the non-linear activation functions)!
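The forward pass defined by these equations fits in a few lines. The layer sizes and the choice of ReLU for the hidden layer in this sketch are illustrative assumptions:

```python
# Sketch of the forward pass a^(j) = h^(j)(a^(j-1) W^(j)) of a feed-forward
# neural network composed in width (matrices) and depth (the loop).
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, weights, activations):
    """weights: list of W^(1..d); activations: list of h^(1..d)."""
    a = x                                   # a^(0) = Xi
    for W, h in zip(weights, activations):
        a = h(a @ W)                        # a^(j) = h^(j)(a^(j-1) W^(j))
    return a                                # f(Xi; W^(1..d)) = a^(d)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 8)), rng.normal(size=(8, 1))]
hs = [relu, lambda z: 1 / (1 + np.exp(-z))]   # ReLU, then a sigmoid output
print(forward(rng.normal(size=4), Ws, hs))    # a 1-dimensional activation
```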

Graph Embeddings, Convolutions and Attention Networks


Deep learning is usually said to mean neural networks with many layers
(large d in the previous equations). In addition to heavily increased
computing power and available data, the recent explosion in interest for
machine learning can be largely attributed to developments in the design of
ANNs; exploiting the general a priori structure of input data. Introducing
convolutions for spatially structured data such as images has shown great
results for tasks such as image classification. Using the output of a
previous input as supplementary input for temporally structured data such
as sentences or audio has shown great results for tasks such as translation
and speech recognition. These spatial and temporal data structures are
very simple data structures, which can be viewed as special types of
graphs. Graphs in themselves are very simple, but also very powerful data
structures generalizing the notion of relational data.
Definition 3.3.1 (Graph). A graph is a 2-tuple G = (V, E) where:
- V is a set of vertices.
- E ⊆ {(x, y) | (x, y) ∈ V²} is a set of edges.
In Figure 3.9, vertices are values at certain time steps, and neighbouring
time steps are connected by edges. In Figure 3.10, vertices are
coordinates/pixels, and neighbouring coordinates are connected by edges. In
Figure 3.11, vertices and edges represent an arbitrary relational structure.
In order to process these types of data with any of our previously
introduced models, we need to somehow represent the data as
constant-shape tensors X.
For temporal data such as sentences or audio, a common solution is to
represent the nodes with some constant-dimensional node feature vector
Xt and have the model process neighbours sequentially, using the previous
output as input, i.e. ht = f(Xt, ht−1) where f is a neural network.

Figure 3.9: Temporal data (2D). Figure 3.10: Spatial data. Figure 3.11: Arbitrary relational data.

These architectures, recursing in a linear fashion, are called recurrent neural


networks.
For spatial data such as images, a common solution is to again have a
constant-dimensional feature vector for each node (pixel), i.e. RGB values,
luminosity, etc., but consider neighbours simultaneously by convolving a
constant-sized set of weights over the picture representation.
The goal in both of these cases is to produce what’s called an embedding of
the graph. These are vector representations of the graph, such that similar
graphs are close in the vector space. The embeddings can then be used to
discriminate/generate in any machine learning task.
Graph neural networks (GNN) are an attempt at generalizing these
concepts to arbitrary graph inputs G = (V, E), such that our model is
invariant to node order and to the number of nodes and edges. The goal is
to learn an embedding hv ∈ R^s for every node v ∈ V which contains
information about the neighborhood of v. In general, hv at step t, written
hv^t, is given by:

hv^0 = Xv

hv^t = q^t( hv^(t−1), ⋃_{u∈N(v)} f^t( hv^(t−1), hu^(t−1), e_(u→v) ) )

This is shown graphically for one node in the Figure 3.11 graph in Figure
3.12.

[Figure 3.12 shows one update for a target node in the Figure 3.11 graph: its new embedding is computed from its own previous embedding and aggregated messages from its neighbours.]

Figure 3.12: Graph neural network node embedding

Here f^t is a differentiable message function, extracting the most
important information from each neighbour of the current node; ⋃ is a
permutation-invariant function to aggregate the messages (like the sum or
the average); N(v) are the neighbors of v; and e_(u→v) is the edge attribute
of edge u → v. q^t is a differentiable update function, taking the previous
value of hv and the aggregation of the attended neighbours to produce the
embedding of v at the next time step.
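The following sketch implements one such message-passing step, with the sum as the permutation-invariant aggregator and single dense layers standing in for f^t and q^t; all shapes, weights, and the omission of edge attributes are illustrative assumptions:

```python
# Sketch of one GNN update: each node combines its previous embedding with
# the sum of messages computed from its neighbours' previous embeddings.
import numpy as np

def relu(z):
    return np.maximum(0, z)

def gnn_step(H, edges, Wf, Wq):
    """H: (n, s) node embeddings; edges: list of directed (u, v) pairs."""
    n, s = H.shape
    msgs = np.zeros((n, s))
    for u, v in edges:
        # message f from neighbour u to node v (edge attributes omitted)
        msgs[v] += relu(np.concatenate([H[v], H[u]]) @ Wf)
    # update q combines each previous embedding with its aggregated messages
    return relu(np.concatenate([H, msgs], axis=1) @ Wq)

rng = np.random.default_rng(0)
n, s = 5, 8
H = rng.normal(size=(n, s))                  # h_v^0 = X_v
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
Wf = rng.normal(size=(2 * s, s))             # message weights
Wq = rng.normal(size=(2 * s, s))             # update weights
print(gnn_step(H, edges, Wf, Wq).shape)      # (5, 8): one embedding per node
```

Stacking t such steps lets information from t-hop neighbourhoods flow into each node's embedding, which is what makes the embeddings useful for discriminating between similar graphs.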

3.4 Reinforcement Learning (RL)


RL is used in scenarios where we have no chance of knowing the correct
input/output pairs a priori, but we can reward good input/output
pairs when they pay off, such as with complex decision-making problems
where we don't know long-term consequences right away. We therefore
typically view the reinforcement learning problem as an intelligent agent,
the decision maker, trying to maximize long-term reward by interacting
with a reward-giving environment.

3.4.1 Markov Decision Processes (MDP)
The environment of an RL problem is usually formalized as a Markov
Decision Process. A Markov decision process is a discrete-time stochastic
control process, which means it provides a mathematical framework for
modeling decision making where outcomes are partly random and partly
controlled by a decision maker. A Markov decision process is a 4-tuple
(S, A, Pa, Ra), where:
- S is a set of states called the "state space",
- A is a set of actions called the "action space",
- Pa(s, s′) is the probability that taking action a when in state s at time
t will lead to state s′ at time t + 1. This is called the "transition
probability",
- Ra(s, s′) is the reward received for transitioning from state s to state
s′ by action a.

[Figure 3.13 shows the loop between the agent and the MDP (the environment): the agent sends an action, and the environment returns a new state and a reward.]

Figure 3.13: RL as interplay between an intelligent agent and an MDP

The goal of the agent in Figure 3.13 is to build a model, π : S → A, a


policy for which actions to take from any given state. In general, we want
to maximize discounted expected reward over a potentially infinite time
frame:

E[ ∑_{t=0}^{∞} γ^t R_{π(s_t)}(s_t, s_{t+1}) ]

Where:
- the expectation is taken over s_{t+1} ∼ P_{π(s_t)}(s_t, s_{t+1}),
- π is the optimal policy and π(s_t) is the optimal action taken at time t,
- 0 ≤ γ ≤ 1 is a discount factor motivating taking actions early.
The field of reinforcement learning studies algorithms for finding an optimal policy π in different scenarios, depending on the nature of the four components of the MDP.
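For a finite sampled trajectory, the discounted sum inside the expectation can be computed directly; a minimal sketch:

def discounted_return(rewards, gamma=0.99):
    # Sum of gamma^t * r_t over a sampled finite trajectory.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Later rewards count less: 1 + 0.5 + 0.25 = 1.75
assert discounted_return([1.0, 1.0, 1.0], gamma=0.5) == 1.75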

3.4.2 Model-free vs. Model-based RL Algorithms
The term "Model" in "model-free vs. model-based RL", refers to the MDP of
the RL problem. In model-free RL we do not model the MDP. That is, we do
not try to find explicit expressions defining P and R, we rely on sampling
and simulation to estimate the optimal policy. We do not need to know the
inner workings of the MDP. Intuitively, model-based RL is the opposite. In
model-based RL we use the MDP directly to estimate the optimal policy π.
Model-based methods are thought to be more sample efficient than model-
free methods since they are able to explicitly plan ahead and therefore don’t
need to test as many trajectories as model-free methods to find the optimal
policy.

Figure 3.14: A non-exhaustive but useful taxonomy of algorithms [22–33] in modern RL, from [34]. The taxonomy splits RL algorithms into model-free methods (policy optimization, e.g. Policy Gradient, A2C/A3C, PPO, TRPO; Q-learning, e.g. DQN, C51, QR-DQN, HER; and methods in between, e.g. DDPG, TD3, SAC) and model-based methods (learning the model, e.g. World Models, I2A, MBMF, MBVE, or being given the model, e.g. AlphaZero).

The algorithms in Figure 3.14 differ by what they learn and how they use
what they learn to determine a policy. They can learn:
- policies (which action to take from a state)
- action-value functions (quantifies how good it is to take an action
from a state)
- value functions (quantifies how good a state is)
- environment models (how the world works)

3.4.3 Exploration vs. Exploitation


A key concept in all of AI, and especially in RL, is the trade-off between how much to explore new solutions vs. how much to exploit known solutions. In RL, this refers to how the trajectories the agent explores are generated. If the agent takes actions in the environment completely at random, it has high exploration and low exploitation; if the agent takes actions only according to its learned policy, it has high exploitation and low exploration. It is important to strike a good balance to converge to good and/or optimal policies.
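A standard way to strike this balance is ε-greedy action selection; a minimal sketch, where the policy function and action list are assumed inputs:

import random

def epsilon_greedy(state, policy, legal_actions, epsilon=0.1):
    # With probability epsilon, explore a random action;
    # otherwise, exploit the learned policy.
    if random.random() < epsilon:
        return random.choice(legal_actions)  # exploration
    return policy(state)                     # exploitation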

3.4.4 Monte Carlo Tree Search (MCTS)


Monte Carlo Tree Search is a tree search algorithm which has seen use as the main planning procedure of choice in many model-based reinforcement learning algorithms. A particularly famous example is DeepMind's AlphaGo and AlphaZero systems [33, 35], which used MCTS in conjunction with deep neural networks to beat the best human players at various games.
The MCTS algorithm plans which action to take from a given state by iteratively building a search tree of reachable states from the current state, considering a certain number of states before committing to the best looking action. In each node i, we maintain the number of visits $n_i$ and the number of times the node has led to a good outcome, $w_i$. Each iteration has 4 stages: Selection, Expansion, Simulation, and Backpropagation.

Selection
In this step, the algorithm starts in the root node (the current state) of
the search tree and recursively chooses a child until a leaf node (a state
with potential, but unrealized child nodes) is reached. The child node
maximizing equation 3.1 is chosen at each step:

$$U(i) = \frac{w_i}{n_i} + c\sqrt{\frac{\ln N_i}{n_i}} \quad (3.1)$$

The U function defined in equation 3.1 is UCT (Upper Confidence Bound 1 applied to trees) [36]. UCT is a formula for balancing exploration and exploitation, where $N_i$ is the number of visits of the parent of node i, $n_i$ and $w_i$ are as defined above, and c is a parameter that determines the balance between exploration and exploitation, called the "exploration parameter" - usually $\sqrt{2}$ or chosen empirically. Higher c gives more emphasis to exploration (less frequently visited nodes, low $n_i$), while lower c gives more emphasis to exploitation (nodes with higher $w_i$).
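In code, equation 3.1 is a one-liner; a sketch assuming the per-node statistics $w_i$, $n_i$ and the parent's visit count $N_i$ are available (and $n_i \geq 1$):

import math

def uct(w_i, n_i, N_i, c=math.sqrt(2)):
    # Equation 3.1: exploitation term + exploration term.
    return w_i / n_i + c * math.sqrt(math.log(N_i) / n_i)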
If the recursion ends at a node with no potential children (a terminal state),
then the algorithm calls this node vi and skips to the backpropagation step.
Figure 3.15 shows a selection step in an adversarial two agent turn based
setting, where the white nodes represent the states of the agent trying to
maximize wi and the grey nodes represent the states of the agent trying to
minimize wi .

Figure 3.15: MCTS selection step [37]

Expansion
The algorithm has now arrived at a leaf node with potential, but unrealized, children in the search tree. It chooses one of them, adds it to the tree as the child of the current node, and calls this child node $v_i$.
Figure 3.16 shows the expansion step coming after the selection step in
Figure 3.15 (still in the adversarial setting).

Figure 3.16: MCTS expansion step [37]

Simulation
The algorithm then completes one rollout/simulation from the state vi . A
rollout consists of choosing moves (usually randomly or semi-randomly)
until some stopping condition is met (game is won/lost, maximum number
of moves reached, etc.).
Figure 3.17 shows the simulation step coming after the expansion step in
Figure 3.16 (still in the adversarial setting).

Figure 3.17: MCTS simulation step [37]

Backpropagation
The algorithm now uses the result of the playout to update the values of the nodes on the path from $v_i$ to the root. Visit counts $n_i$ are increased by 1, and reward - or "goodness" - estimates $w_i$ are increased according to the simulation outcome recorded at the new node.
Figure 3.18 shows the backpropagation step coming after the simulation step in Figure 3.17, still in the adversarial setting.

Figure 3.18: MCTS backpropagation step [37]

This loop is continued for a set number of iterations. In the end, the
algorithm chooses the action from the root node leading to the child with
highest wi .

Algorithm 1 Monte Carlo Tree Search (MCTS)


Initialize root node v0 with current state s as value
while within computational budget do
vi = EXPANSION(SELECTION(v0 ))
wi = SIMULATION(vi )
BACKPROPAGATE(vi , wi )
end while
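A compact Python rendering of Algorithm 1 follows as a sketch. The environment interface (legal_actions, step, rollout) is an assumption made for this example and is not the framework interface introduced later in this thesis.

import math, random

class Node:
    def __init__(self, state, env, action=None, parent=None):
        self.state, self.action, self.parent = state, action, parent
        self.children = []
        self.untried = list(env.legal_actions(state))  # unexpanded actions
        self.w, self.n = 0.0, 0

def mcts(root_state, env, iterations=1000, c=math.sqrt(2)):
    root = Node(root_state, env)
    for _ in range(iterations):
        node = root
        # SELECTION: descend by UCT until a node with untried actions
        # (or a terminal node) is reached.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch:
                       ch.w / ch.n + c * math.sqrt(math.log(node.n) / ch.n))
        # EXPANSION: realize one unexpanded child, if any.
        if node.untried:
            action = node.untried.pop()
            child = Node(env.step(node.state, action), env,
                         action=action, parent=node)
            node.children.append(child)
            node = child
        # SIMULATION: (semi-)random rollout from the new node.
        w = env.rollout(node.state)
        # BACKPROPAGATION: update statistics up to the root.
        while node is not None:
            node.n += 1
            node.w += w
            node = node.parent
    # Commit to the action leading to the child with highest w.
    return max(root.children, key=lambda ch: ch.w).action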

MCTS is an interesting way to choose actions in Markov decision problems such as chess. However, MCTS is not fast enough by itself to find superhuman-level policies in tractable time. What the team at DeepMind did with AlphaZero [33] and its predecessors was to use deep neural networks for the selection and simulation steps of MCTS. For the selection step, the NN is added to the U function, weighting each action given the state. For the simulation step, the NN evaluates the value of leaves, replacing the need to simulate play. The NN parameters are then trained to reflect the outcome of the game.
Neural networks have been used in search for a long time, even in ATP in
1989 [38]. Deep neural networks are powerful representations compared
to their linear function approximator counterparts used in other game AIs.
However, NNs may also introduce spurious approximation errors. Hence,
MCTS and NNs complement each other well, as MCTS averages over these
approximation errors which therefore tend to cancel out when evaluating
a large subtree.
Chapter 4

ML-Guided ATP Literature Review

As discussed in Section 2, determining the validity of a first-order formula is not only co-NP hard [5] (believed to be computationally inefficient), it is undecidable [4] (computationally impossible). However, luckily for us, the problem is semi-decidable. That is, the best algorithm we can construct can prove that a valid formula is valid, but not necessarily that an invalid formula is invalid. Still, the best algorithms that are able to prove valid formulae degenerate to exhaustively trying all possible rule applications of a calculus recursively, and then checking whether the resulting proof trees are valid proofs. What differentiates these algorithms is which calculi they base themselves on and in which order they apply the rules of the calculus to search for a proof. In this way, the problem is reminiscent of a hard combinatorial optimization problem where heuristics are used to search for optimal solutions. However, in our case, we might not be able to find a solution even given infinite time!
Machine learning is commonly used to deal with intractable combinatorial problems. When the best known solution is brute-force search coupled with a set of heuristics, such as in chess or scheduling, machine learning is used to find strategies, tactics, or heuristics that speed up the process by exploiting patterns found in training data. That is, to learn some intuition. ATP in all generality is a search problem: searching for a proof given axioms and a theorem/conjecture. When mathematicians prove theorems they have intuition for which lemmas and proof methods to use. For example, in a basic geometry problem, it would make sense to use geometric lemmas, maybe the Pythagorean theorem. For solving a partial differential equation, separation of variables might be a good idea. One would use some property of the problem to proceed. The general procedure for doing ATP is intuitively not awfully complex. We start with a theorem we want to prove, and then recursively split the theorem and the resulting statements based on the calculus rules until we are only left with axioms or previously proved lemmas. Just like how humans
have some intuition for how to do the splitting, i.e. finding lemmas that combine to prove the theorem, machines should be able to learn this too.
We somehow need to teach the machine, based purely on the symbols of
the theorem, which rules should be applied to which parts of the theorem
in which order. The success of this proof search guidance will be reliant on
the interplay between two crucial components: the proof calculus and the
machine learning approach.
There are many parts of the general procedure of going from a natural
language statement to a formal proof which are candidates for learning:
1. Autoformalization: Translating a statement in natural language or
LaTeX into a statement in a formal unambiguous language such as
first-order logic.
2. Premise selection from large corpora: Determining which premises/axioms
to add to your formula to make it more effectively provable (or prov-
able at all), from a corpus of premises/axioms.
3. Lemmatization and conjecturing: Determining which already found
derivations to add to your proof search to make it more effectively
provable, from a corpus of derivations.
4. Internal proof guidance: Determining which rule applications to use
during proof search.
We will only consider internal proof guidance and will mainly look at
results related to this. In particular, we will look at many results from
the highly successful EU-funded project AI4REASON 1 . The project and related work have shown that learning-based guidance can significantly enhance the power of existing ATP systems.

4.1 Connection-based Provers


leanCoP [9, 10] is a minimal first-order prover implemented in under 20 lines of Prolog code, based on the connection calculus [8] discussed in Section 2.4.3 and shown again in Table 4.1.
1 http://ai4reason.org

$$\frac{}{\{\}, M, Path} \ (axiom)$$

$$\frac{C_2, M, \{\}}{\varepsilon, M, \varepsilon} \ (start)$$

$$\frac{C, M, Path \cup \{L_2\}}{C \cup \{L_1\}, M, Path \cup \{L_2\}} \ (reduction)$$

$$\frac{C_2 \setminus \{L_2\}, M, Path \cup \{L_1\} \qquad C, M, Path}{C \cup \{L_1\}, M, Path} \ (extension)$$

In all cases $C_2$ is a copy of $C_1 \in M$, $L_2 \in C_2$, and $\sigma(L_1) = \sigma(\overline{L_2})$

Table 4.1: Connection Calculus

A calculus suitable for search guidance should have a transparent notion of proof state to which advising operations can be applied. Saturation-based calculi, such as resolution, are traditionally strong in that proofs do not explicitly branch. However, this leaves the proof states as large piles of clauses, which can be difficult to work with and find structure in. MaLeCoP (Machine Learning Connection Prover) [39], one early attempt at introducing learning into proof search, therefore built on top of leanCoP, since it provides a transparent notion of proof state in the form of open goals and branches that need to be closed.
During proof search, MaLeCoP chooses which clause to extend with using a naive Bayes classifier trained in a supervised way on good decisions. Training features were hand-crafted features of the proof tree, with the chosen clause as supervision signal. One important takeaway from this system is that the inference speed of the learner matters, as MaLeCoP was initially roughly 1000 times slower than the regular implementation of leanCoP. This was later improved in the system FEMaLeCoP (Fairly Efficient MaLeCoP) [40], which uses the same ideas as MaLeCoP but implements the entire system, both leanCoP and the learner, in OCaml. The OCaml implementation of leanCoP is called mlCoP. The learning guidance allowed FEMaLeCoP to achieve a 15% improvement over unguided leanCoP on the MPTP2078 [41] problems.
Later approaches at guiding leanCoP with learning have adopted single
player versions of the MCTS algorithm described in Section 3.4.4. This
makes sense considering theorem proving is best understood as a sequen-
tial decision process and most of the recent ML success stories which
brought about superhuman performance have been based on model-based
reinforcement learning applied to sequential decision processes, such as
AlphaZero [33].
One early approach purely swapped leanCoP’s iterative deepening search
strategy for MCTS [42], and then more recent approaches have enhanced

the MCTS strategy with strong learners such as gradient boosted trees
and GNNs (both of which were introduced in Section 3.3). These systems
include rlCoP [43], plCoP [44], and the system described in [45] (we will refer to this as graphCoP). In all these MCTS planning systems, states are the
current state of the proof (the tableau proof tree), actions are inference steps,
and the simulation step is done by making d proof inference steps (actions)
b times. In chess, the simulation step is comprised of playing moves until
the game is finished. The reason we can’t do the same in first-order theorem
proving, that is, make inferences until a proof is found, is because there
isn’t necessarily a proof to be found. The proof search problem is semi-
decidable, and hence the search tree is potentially infinitely deep.
rlCoP, plCoP and graphCoP extend MCTS by adding:
1. learning-based mechanisms for estimating the prior probability of
inferences to lead to a proof (policy),
2. learning-based mechanisms for assigning heuristic value to the proof
state.
1 and 2 are realized by altering the UCT formula (equation 3.1) as follows:

$$U(i) = \frac{w_i}{n_i} + c \cdot p_i \cdot \sqrt{\frac{\ln N_i}{n_i}} \quad (4.1)$$

In Equation 4.1, $p_i$ is the learned prior probability of going from node i's parent to node i. $w_i$ is the sum of the $w_i$'s of the node's children; if the node has no children, $w_i$ is equal to a learned value of the node, $V(v_i)$. The rest is as in equation 3.1.
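Equation 4.1 differs from plain UCT only by the learned factors; a minimal sketch, again assuming $n_i \geq 1$:

import math

def guided_uct(w_i, n_i, N_i, p_i, c=math.sqrt(2)):
    # Equation 4.1: the learned prior p_i scales the exploration term;
    # for childless nodes, w_i would be the learned value V(v_i).
    return w_i / n_i + c * p_i * math.sqrt(math.log(N_i) / n_i)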
In rlCoP and plCoP, data collection and model training are intertwined in the DAgger meta-learning algorithm [46], described below. The policy and value functions are gradient boosted trees (XGBoost [47]), trained at each iteration of DAgger with hyperparameters found by grid search.

Algorithm 2 DAgger (Dataset Aggregation)


Use expert demonstrations to train learners (supervised learning)
for i = 1 to N do
Use the learners to gather new data (MCTS)
Add the new data to the dataset
Train new learners on the new dataset
end for
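A sketch of the loop in Algorithm 2; the functions train and collect_with_mcts, and the dataset format, are placeholders assumed for illustration:

def dagger(expert_data, train, collect_with_mcts, iterations=3):
    # Algorithm 2: alternate guided data collection and retraining.
    dataset = list(expert_data)
    learners = train(dataset)          # bootstrap on expert demonstrations
    for _ in range(iterations):
        new_data = collect_with_mcts(learners)  # guided proof search
        dataset += new_data                     # aggregate datasets
        learners = train(dataset)               # retrain on all data
    return learners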

In Table 4.2, leanCoP is compared to rlCoP on the Miz40 dataset [48], which is a subset of the Mizar Mathematical Library [49] and consists of 32524 provable first-order problems and their proofs. The dataset is split in two: 90% training problems (29272 problems) and 10% test problems (3252 problems). "bare" rlCoP refers to rlCoP before it has learned anything, that is, the learners are pseudorandom functions. "iter 1" refers to one iteration of DAgger. "mlCoP" is an OCaml implementation of leanCoP without learning. rlCoP is given 2000 MCTS iterations per decision. We see that after three iterations, rlCoP is able to solve about 40% more problems than unguided leanCoP (mlCoP).

           mlCoP
overall    11581
training   10438
testing    1143

rlCoP      bare    iter. 1   iter. 2   iter. 3
overall    4615    13679     15268     15721
training   4184    12325     13749     14155
testing    431     1354      1519      1566

Table 4.2: Comparison of the number of problems solved by leanCoP (mlCoP) and leanCoP guided by XGBoost (rlCoP), from [43]

The previous models have all been based on hand-crafted features of proof states. In general, the more information about the data we are able to model, the more the learner will be able to learn. One of the strong suits of neural networks is that they can learn good functions from very general input spaces to very general output spaces. However, neural networks can quickly require too many resources to train and perform inference with. So neural network libraries, architectures, and hardware need to be optimized for this to be feasible in a setting such as ATP, where we want to make as many inferences per second as possible. In the last couple of years, all three of these points have seen a lot of progress, inspiring the use of graph neural networks for guiding ATP.
In Table 4.3, graphCoP is compared to rlCoP on the Miz40 dataset. This
time, both provers are given 200 MCTS iterations per decision, down from
2000 in Table 4.2. We see that after one iteration, graphCoP is able to prove
about 50% more problems than rlCoP and that the difference is slowly
diminishing.

graphCoP   bare    iter. 1   iter. 2   iter. 3
overall    5105    13300     14042     14002
training   4595    11978     12648     12642
testing    510     1322      1394      1360

rlCoP      bare    iter. 1   iter. 2   iter. 3
overall    5105    8920      10030     10959
training   4595    8012      9042      9874
testing    510     908       988       1085

Table 4.3: Comparison of the number of problems solved by leanCoP guided by the invariant-preserving GNN (graphCoP) and by XGBoost (rlCoP), from [45]

Clearly, the learning procedures introduced have been successful, and it will be interesting to see how they can be applied to other very similar ATP
systems, such as nanoCoP [50, 51] for classical logic, and ileanCoP and
MleanCoP, two of the fastest ATP systems for non-classical logics [52, 53].

4.2 Saturation-based Provers


Most current state-of-the-art first-order theorem provers, such as E [14] and Vampire [13], are based on saturation procedures such as the resolution calculus (discussed in Section 2.4.4 and shown again in Table 4.4) or extensions of it.

$$\frac{}{C_1, \ldots, \{\}, \ldots, C_n} \ (axiom)$$

$$\frac{C_1, \ldots, C_i \cup \{L_1\}, \ldots, C_j \cup \{L_2\}, \ldots, C_n, \ \sigma(C_i \cup C_j)}{C_1, \ldots, C_i \cup \{L_1\}, \ldots, C_j \cup \{L_2\}, \ldots, C_n} \ (resolution)$$
(where $\sigma$ is an m.g.u. of $L_1$ and $L_2$)

$$\frac{C_1, \ldots, C_i \cup \{L_1, \ldots, L_m\}, \ldots, C_n, \ \sigma(C_i \cup \{L_1\})}{C_1, \ldots, C_i \cup \{L_1, \ldots, L_m\}, \ldots, C_n} \ (factorization)$$
(where $\sigma$ is an m.g.u. of $L_1, \ldots, L_m$)

Table 4.4: Resolution calculus

These ATP systems maintain two sets of clauses: processed (P) and unprocessed (U). At each loop iteration, a given clause g from U is selected, moved to P, and U is extended with new inferences from g and P. This process continues until a contradiction (the empty clause) is found, U becomes empty, or a resource limit is reached. The search space grows quickly, and selection of the right given clauses is critical [54].
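The given-clause loop itself is short; a sketch in which select_given is exactly the choice point that the systems below learn, and infer (a placeholder) generates all inferences between the given clause and P:

def saturate(axioms, select_given, infer, max_steps=10**6):
    # Given-clause loop: returns True if the empty clause (a
    # contradiction) is derived, False otherwise.
    processed, unprocessed = [], list(axioms)
    for _ in range(max_steps):
        if not unprocessed:
            return False                 # saturated without a proof
        g = select_given(unprocessed)    # the learned choice point
        unprocessed.remove(g)
        processed.append(g)
        for clause in infer(g, processed):  # new inferences from g and P
            if not clause:               # empty clause found
                return True
            unprocessed.append(clause)
    return False                         # resource limit reached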
The ENIGMA system [55] and later iterations [54, 56] incorporate learning
into the E prover, learning which clause is best to process next. This was
done by extracting hand-crafted feature vector representations from the
clauses and feeding them to linear classifier models (SVM and logistic
regression in LIBLINEAR [57]).
ENIGMA-NG [54] improved on this by using more expressive gradient boosted tree and recursive neural network classifiers, combined with fast feature hashing to reduce the number of features the classifiers have to consider and hence speed up inference, since inference will generally be slower with more expressive models.
Table 4.5 shows benchmarks on the MPTP2078 bushy [41] dataset. MPTP2078 contains 2078 problems coming from the MPTP translation [58] of the Mizar Mathematical Library (MML) [59] to first-order logic.
In Tables 4.5 and 4.6, S is an E proof search strategy, and S + M is S with
M as added clause selection guidance.

         S      S + Linear   S + GB Trees   S + RNN
solved   1086   1210         1256           1197

Table 4.5: Comparison of different E strategies on MPTP2078 bushy, from [54]

ENIGMA-Anonymous [56] further expands on this by making features invariant to symbol name changes across problems and by employing graph neural networks as learners.
Table 4.6 shows benchmarks on 32524 Mizar40 [48] problems exported by MPTP [58]. Here we see a very impressive 70% improvement over the base strategy when gradient boosted trees are added to S with the new symbol-invariant feature representation and more suitable hyperparameters than in the Table 4.5 benchmark [56].

         S       S + GB Trees   S + GNN
solved   14966   24347          23262

Table 4.6: Comparison of different E strategies on Mizar40, from [56]

Another attempt to incorporate learning in saturation-based theorem proving was the TRAIL [60] system. Whereas the ENIGMA systems rely heavily on the E prover's guidance for finding proofs to train on, potentially biasing the learner towards the hand-crafted strategies, TRAIL learns to guide the underlying saturation prover (Beagle [61]) from scratch. TRAIL does this by posing the learning problem as an RL problem, much like what was done for the leanCoP extensions discussed in Section 4.1. Instead of using MCTS, TRAIL uses a simulated annealing approach to balance exploration and exploitation.
Key to TRAIL's design is a novel neural representation of the state of a theorem prover in terms of inferences and clauses, and a novel characterization of the inference selection process in terms of an attention-based action policy network [60].

         E     Beagle   mlCoP   rlCoP   plCoP   TRAIL
solved   998   742      502     733     782     910

Table 4.7: Comparison of different provers on MPTP2078 bushy, from [60]

In Table 4.7, all provers are run on the MPTP2078 problem set, where E is run in auto mode. We see that although TRAIL does better than the leanCoP modifications discussed in Section 4.1, it does not do better than the pure E prover, unlike the ENIGMA systems, which used E proof traces for training instead of training from scratch like TRAIL.

4.3 Interactive Provers and Other Related Problems
Many of the same ideas used for guiding rlCoP and related systems have also been used for internal selection of tactical steps inside proof search for interactive theorem provers (ITPs) such as HOL4 [62, 63] and Coq [64, 65]. These systems are based on higher-order logic, which is not even semi-decidable and hence poses a lot of new problems that will not be detailed here. However, TacticToe combined with E proves about 70% of the HOL4 library, which has inspired a number of follow-up works from Google [66] and OpenAI [67].

Figure 4.1: HOL4 guidance overview from [63]

Graph convolutional neural networks have also been used with success to guide search for other combinatorially exploding problems, such as mixed integer programming [68].

4.4 Main Takeaways


We now summarize the findings of the previous sections. To guide an ATP system using current machine learning methods, there are a couple of important things to keep in mind: formula vectorization and proof state embeddings are alpha and omega. In particular, vectorization should maintain some important properties:

- Formula embeddings should be invariant to symbol renaming across problems. The following two formulas should have similar, if not equal, encodings:

$$\forall y (\neg\neg P(y) \to P(y))$$
$$\forall x (\neg\neg P(x) \to P(x))$$

- Formula embeddings should be invariant to clause and literal reordering. The following two formulas should have similar, if not equal, encodings:

$$\forall y [(\neg P(y) \vee P(y)) \wedge (\neg P(a) \vee P(b))]$$
$$\forall y [(P(b) \vee \neg P(a)) \wedge (P(y) \vee \neg P(y))]$$

- Formula embeddings should not be invariant to term reordering. The following two formulas should have dissimilar encodings:

$$P(y, x)$$
$$P(x, y)$$
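The first property can, for instance, be obtained by normalizing variable names before encoding; a small sketch of such a canonicalization on a token list (the token format here is invented for the example):

def canonicalize_variables(tokens, is_variable):
    # Rename variables to X0, X1, ... in order of first occurrence so
    # that alpha-equivalent formulas get identical token streams.
    mapping, out = {}, []
    for tok in tokens:
        if is_variable(tok):
            tok = mapping.setdefault(tok, "X%d" % len(mapping))
        out.append(tok)
    return out

# The first two formulas above now coincide:
is_var = lambda t: t in ("x", "y")
f1 = canonicalize_variables(["~", "~", "P", "y", "->", "P", "y"], is_var)
f2 = canonicalize_variables(["~", "~", "P", "x", "->", "P", "x"], is_var)
assert f1 == f2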

Additionally, learner overhead should be taken heavily into consideration. The rlCoP paper [43] mentions that their system is 4 times as slow when adding simple XGBoost gradient boosted tree learners as opposed to not introducing learning. Their system made an average of 64335.5 inferences per second without learning and 16205.7 inferences per second with learning. The graphCoP paper [45], which introduces a more computationally heavy model, mentions that even with multiprocessing and batch queries to their learning model on the GPU, simply querying the model takes on average 0.1 seconds. Extrapolating from their reported numbers in [45], the graphCoP system overall averages less than 200 inferences per second.

Chapter 5

A General Framework for ML-Guided ATP

5.1 From Prolog to Python


The plCoP paper by Zombori et al. [44] argues that the Prolog setting is suitable for combining statistical and symbolic learning methods, citing what we read as three main reasons:
1. Prolog is traditionally associated with ATP research, and it has been
used for a number of provers, as well as for rapid ATP prototyping,
with core methods like unification for free.
2. Prolog is the basis for Inductive Logic Programming (ILP) style
systems.
3. Prolog provides for compact and easy to extend implementations of
theorem provers such as leanCoP [9] discussed in Section 6.1.
However, a common theme in all of these learning-modified leanCoP systems (rlCoP, plCoP, etc.) is, as Olšák et al. mention in the graphCoP paper [45]: "The current main speed issue turned out to be the communication overhead between the provers and the network." That is, there is major communication overhead between the Python learning models and the leanCoP inference engine.
In this section, we address the above and provide further reasoning for why
we have chosen to design a general framework for implementing arbitrary
proof calculi in Python, and why this makes sense for integrating learning.

5.1.1 Logic + Control


In his 1979 paper [69], Kowalski argues that any algorithm can be regarded as consisting of two components: a logic component, which specifies the knowledge to be used in solving problems, and a control component, which determines the problem-solving strategies by means of which that knowledge is used. The main idea of the declarative logic programming paradigm, and languages such as Prolog, is to have the programmer take care of the logic component and then have the logic programming language take care of the control component. As Zombori et al. mention, this is great for rapid ATP prototyping, in the sense that it is easy to state the rules of a calculus in Prolog and have it run. However, when our purpose stretches beyond simply implementing a calculus to also guiding the execution of the rules, the idea of separating logic and control loses ground. From an abstraction perspective, having the programmer interface with a Prolog program to guide its execution goes against the underlying ideas of the language. In this regard, using an imperative language for the task makes a lot more sense.

5.1.2 Object-orientation
Figure 5.1 shows a class diagram of our proposed architecture for
implementing theorem provers based on arbitrary proof calculi using
arbitrary proof search strategies. The triangle-shaped relations show
inheritance, while the diamond shaped relations show dependence (white
is aggregation and black is composition).

Figure 5.1: Generalized prover class diagram

This architecture separates the search procedure from the calculus-specific elements of the theorem prover, without losing the possibility of using calculus-specific elements during search and without introducing intersystem
communication overhead.
Separation of search and calculus is done, at least implicitly, whenever one
wants to guide a prover in any way, but we argue that the explicitness of
object-orientation and the relative simplicity of this architecture makes it
easier to reason about and compare general ATP systems by boxing the
different components.
The framework is heavily inspired by reinforcement learning environments such as OpenAI Gym [70]. As the authors of this framework say: "RL is very general, encompassing all problems that involve making a sequence of decisions". Theorem proving is such a problem; we have the state of our proof (i.e. the derivation tree) and need to make a decision of which inference rule to apply next (i.e. which leaf of the derivation tree to expand next).
To implement an arbitrary proof calculus in the framework shown in
Figure 5.1, one has to implement the ProofState, ProofAction, and
ProofSearchProblem classes which inherit from the State, Action, and
SearchProblem classes respectively. The ProofState class represents the
state of a proof, e.g. the derivation tree, while the ProofAction class
represents possible inferences, e.g. extension in the connection calculus
shown in Table 2.7. Both of these classes could in the simplest case just be reduced to tuples, so the only function which needs overriding is the init function, as shown in Figure 5.2. However, as we will see in Section 6, it can be useful to let these classes be more complex and include options for more functions or even inner classes.
class State:
# Abstract class defining the states of a search problem
def __init__(self):
# Initialize state variables
pass

class Action:
# Abstract class defining the actions of a search problem
def __init__(self):
# Initialize action variables
pass

Figure 5.2: Abstract Python state and action classes used in search problems

Once we have representations of the proof calculus' state and inference rules, all that's left is to define the dynamics of the calculus. This is done in the ProofSearchProblem class. All the functions which need to be overridden are shown in Figure 5.3.

class SearchProblem:
    # Abstract class defining the dynamics of a search problem
    def __init__(self):
        # Define initial state and action spaces
        pass

    def start(self):
        # Returns the start state(s)
        pass

    def legal_actions(self, state):
        # Returns a list of legal actions
        pass

    def step(self, state, action):
        # Execute action from state
        pass

    def undo(self, state, action):
        # Undo execution of action to state
        pass

    def is_goal(self, state):
        # Returns whether state is a goal state
        pass

    def reset(self):
        # Reset the search problem
        pass

Figure 5.3: Abstract Python problem class for defining proof search dynamics

The framework differentiates itself from other imperative/object-oriented implementations of a proof calculus in that we define the dynamics of the proof calculus, but we do not need to implement how the theorem prover explores the search space. This is taken care of by the SearchAgent class. The key design value of the architecture is exactly this: the SearchAgent class is agnostic to the concrete implementation of the search problem. The SearchAgent class can swap freely between arbitrary search strategies: Depth First Search (DFS), Breadth First Search (BFS), Iterative Deepening, MCTS, etc., and everything will still work without altering the ProofState, ProofAction, and ProofSearchProblem classes.
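As an illustration of this agnosticism, a minimal depth-limited DFS agent can be written purely against the abstract interface of Figure 5.3; it is a sketch, not the SearchAgent used in our evaluation, and it assumes step returns a successor state that undo can revert:

class DepthFirstSearchAgent:
    # A SearchAgent sketch: depth-limited DFS written purely against
    # the abstract SearchProblem interface (no calculus specifics).
    def __init__(self, problem, depth_limit=10):
        self.problem = problem
        self.depth_limit = depth_limit

    def search(self, state, depth=0):
        # Returns a list of actions leading to a goal, or None.
        if self.problem.is_goal(state):
            return []
        if depth >= self.depth_limit:
            return None
        for action in self.problem.legal_actions(state):
            successor = self.problem.step(state, action)
            tail = self.search(successor, depth + 1)
            if tail is not None:
                return [action] + tail   # prepend the successful action
            self.problem.undo(successor, action)
        return None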

5.1.3 Rapid Prototyping


Just as with Prolog, once implemented, the Term, Literal, and Matrix classes provide necessary operations such as unification, substitution, and other syntactic operations for free. This, in addition to the SearchAgent class being implemented, leaves us with something reminiscent of a logic
programming environment, where most of what we have to implement are
the rules of the system. A big difference is that it is now very natural to also
change the SearchAgent class, giving the possibility of rapid prototyping
both for logic and control.
The framework discussed so far is object-oriented and can in theory be
implemented in any object-oriented language. The reasons we chose
Python for our implementation, instead of a fast compiled language such
as C++ or Java, are very reminiscent of the points made by Zombori et al.
mentioned in the beginning of Section 5.1:
1. The high level nature of Python allows for rapid prototyping as well
as compact and easy to extend implementations.
2. Python is the lingua franca in the ML community. Hence integration
of machine learning will be natural.
A drawback of the high-level features of Python is that they make the language slow compared to slightly lower-level languages such as C++ and Java (roughly 10x slower according to [71]). For our purposes, this is fine, as rapid prototyping is valued higher than speed. Due to the (worse than) exponential time complexity of theorem proving, inference quality is a lot more important than inference quantity. A Python program making good inferences (i.e. inferences leading to a proof) will find proofs quicker than a C++ program making bad inferences 100 times as fast. Hence, from an engineering perspective it is better to spend more time prototyping than implementing. To sum up this point: inference quality ≫ inference quantity. Additionally, Python interoperates well with C, so making the switch later down the road is plausible with minor (but time-consuming) changes.

5.2 Incorporating Learning


5.2.1 Proof Search as an MDP
We recall from Section 3.4 that the ML paradigm associated with sequential
decision/search problems is reinforcement learning. The reinforcement
learning (RL) problem is framed as interplay between an agent and an
environment, where the environment is formalized as a Markov Decision
process (MDP) receiving actions and returning states and rewards as
shown in Figure 5.4.

Figure 5.4: General RL scenario

As discussed in the previous section, our framework is inspired by the OpenAI Gym framework, which is an RL framework. This is not a coincidence. We see that the MDP is quite naturally represented by the SearchProblem class, as illustrated in Figure 5.5.

Figure 5.5: The MDP is already naturally represented in the system design

Furthermore, the entire RL problem is contained in the search abstraction of the framework as interplay between the SearchAgent (the agent) and the SearchProblem (the environment/MDP), as shown in Figure 5.6.

Figure 5.6: The RL components are already naturally represented in the system design

5.2.2 Model Module
Urban et al. [39] propose the architecture shown in Figure 5.7 to modularly combine existing theorem provers with learning modules.

Figure 5.7: MaLeCoP general architecture from [39]

The following is mentioned in the same paper: "The slow external advice is currently a clear bottleneck". There are two obvious potential reasons for this: communication overhead and advisor model complexity. Our framework addresses the first by incorporating learning directly into the object-oriented system by simply adding the Model class and the ProofModel subclass, as shown in Figure 5.8.
Not only does the framework allow for tight coupling between ML advice and proof search, the modular property of the framework still holds. To now incorporate learning, we only need to implement the ProofModel class. We can implement this using any ML library we want (scikit-learn, TensorFlow, PyTorch, etc.); we just need to override the functions shown in Figure 5.9. The SearchAgent can use the post_processed data however it wants in its search strategies, whether it be iterative deepening or MCTS.

Figure 5.8: Complete framework with support for learning

class Model:
    # Abstract class defining the model guiding a search problem
    def __init__(self):
        # Initialize model (i.e. a PyTorch NN)
        pass

    def __call__(self, **kwargs):
        # Define how inference is done
        # Return state values and/or action probabilities
        pass

    def train(self, problems_path):
        # Define the training loop of the model
        pass

    def pre_process(self, state, actions):
        # Translate state/actions to model-readable input format
        pass

Figure 5.9: Abstract Python model class used to guide search problems
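As the simplest possible instance of this interface, consider a model that returns a uniform prior over actions and a constant state value; this is a sketch of a learning-free baseline, not the model used in our evaluation:

class UniformModel(Model):
    # A trivial Model sketch: uniform action priors and a constant
    # state value, useful as a learning-free baseline.
    def __call__(self, state, actions):
        return 0.5, [1.0 / len(actions)] * len(actions)

    def train(self, problems_path):
        pass  # nothing to learn

    def pre_process(self, state, actions):
        return state, actions  # identity: no tensorization needed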

To conclude, the framework consists of three main modular parts:
- Proof Calculus (ProofSearchProblem, ProofState, ProofAction)
- Search Strategy (SearchAgent)
- Learning guidance (ProofModel)
While the ProofModel's pre_process function relies on the implementation details of ProofState and ProofAction, all other parts of the framework can be swapped out without regard for the other components. For example, if you have an implementation of resolution with a BFS strategy in this framework and want to try an MCTS strategy, all you have to change is the SearchAgent; if you instead want to try the connection calculus with a BFS strategy, all you have to change are the proof dynamics classes (ProofSearchProblem, ProofState, ProofAction).
We have developed a Python version of the framework which provides primitives and primitive operations such as literals, terms, formulae/matrices, unification, etc. In the next few chapters, we will discuss how we also implemented the connection calculus, iterative deepening, and a model based on graph neural networks using the PyTorch deep learning library [72] in our framework.

Chapter 6

Implementing an ML-Guided ATP System

In this section, we describe how we implemented a learning-guided theorem prover in Python using the framework described in Section 5. The prover combines the following techniques:
1. Proof dynamics (ProofState, ProofAction, SearchProblem): The
connection calculus discussed in Section 2.4.3 and 4.1.
2. Search strategy (SearchAgent): Various modifications of restricted
iterative deepening, discussed in Section 6.1 and 7.1
3. Learning guidance (ProofModel): A novel graph neural network
model, discussed in Section 3.3 and 6.2.
The implementations of the supporting classes will not be discussed, but
allow for fast translation of TPTP CNF format first-order problems to an
internal object-oriented representation with support for unification and
other syntactic operations which the proof dynamics build on.

6.1 Base Connection Prover


leanCoP [9, 10] is a compact, sound, and complete automated theorem prover for classical first-order logic. It is based on the connection calculus described in Section 2.4.3 (shown again in Table 6.1) and implemented in Prolog. The entire code of leanCoP 1.0 is shown in Algorithm 3; for a full explanation of this code, see Otten [9]. For our purposes, leanCoP is nice precisely because it is so compact and has a relatively clear notion of proof state, as discussed in Section 4.1.

$$\frac{}{\{\}, M, Path} \ (axiom)$$

$$\frac{C_2, M, \{\}}{\varepsilon, M, \varepsilon} \ (start)$$

$$\frac{C, M, Path \cup \{L_2\}}{C \cup \{L_1\}, M, Path \cup \{L_2\}} \ (reduction)$$

$$\frac{C_2 \setminus \{L_2\}, M, Path \cup \{L_1\} \qquad C, M, Path}{C \cup \{L_1\}, M, Path} \ (extension)$$

In all cases $C_2$ is a copy of $C_1 \in M$, $L_2 \in C_2$, and $\sigma(L_1) = \sigma(\overline{L_2})$

Table 6.1: Connection Calculus

Algorithm 3 leanCoP 1.0 (original Prolog implementation)

prove(M,I) :- append(Q,[C|R],M), \+member(-_,C),
    append(Q,R,S), prove([!],[[-!|C]|S],[],I).
prove([],_,_,_).
prove([L|C],M,P,I) :- (-N=L; -L=N) -> (member(N,P);
    append(Q,[D|R],M), copy_term(D,E), append(A,[N|B],E),
    append(A,B,F), (D==E -> append(R,Q,S); length(P,K), K<I,
    append(R,[D|Q],S)), prove(F,S,[L|P],I)), prove(C,M,P,I).

The general flow of leanCoP 1.0 is as follows: a derivation for a formula in clausal form is generated by first applying the start rule (Table 6.1) and then repeatedly applying the reduction or extension rules. In each inference step a connection is identified along an active path. The search strategy used is iterative deepening, a form of repeated depth-first search where the search stops at a given depth limit and the depth limit is increased by one for each repetition. Here, the "depth" corresponds to the length of the active path.
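In the framework of Section 5, iterative deepening amounts to wrapping a depth-limited search in a loop; a sketch, where depth_limited_search is a placeholder for any depth-bounded strategy over the ProofSearchProblem:

def iterative_deepening(problem, depth_limited_search, max_limit=100):
    # Repeat depth-limited search with an increasing limit on the
    # active path length, as in leanCoP.
    for limit in range(1, max_limit + 1):
        problem.reset()
        proof = depth_limited_search(problem, problem.start(), limit)
        if proof is not None:
            return proof
    return None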

Algorithm 4 leanCoP 2.0 (original Prolog implementation)

prove(I,S) :- \+member(scut,S) -> prove([-(#)],[],I,[],S) ;
    lit(#,C,_) -> prove(C,[-(#)],I,[],S).
prove(I,S) :- member(comp(L),S), I=L -> prove(1,[]) ;
    (member(comp(_),S);retract(p)) -> J is I+1, prove(J,S).
prove([],_,_,_,_).
prove([L|C],P,I,Q,S) :- \+ (member(A,[L|C]), member(B,P),
    A==B), (-N=L;-L=N) -> ( member(D,Q), L==D ;
    member(E,P), unify_with_occurs_check(E,N) ; lit(N,F,H),
    (H=g -> true ; length(P,K), K<I -> true ;
    \+p -> assert(p), fail), prove(F,[L|P],I,Q,S) ),
    (member(cut,S) -> ! ; true), prove(C,P,I,[L|Q],S).

Non-learning theorem provers rely on a priori optimizations to prune the search space and to determine which parts of the search space to search first. leanCoP 2.0, shown in Algorithm 4, improves on leanCoP 1.0 by adding a list of new techniques. The remaining subsections of Section 6.1 provide descriptions and examples of these techniques; the proofs of their corresponding theorems can be found in [73–75]. Our base prover implements two of them: "positive start clauses" and "regularity", both of which prune the search space in a way such that completeness is maintained. That is, they only remove redundant search states. Since "positive start clauses" was already in leanCoP 1.0, our base connection prover is a Python equivalent of leanCoP 1.0 + regularity, described in Algorithm 5.

Algorithm 5 leanCoP 1.0 + regularity

prove(M,I) :- append(Q,[C|R],M), \+member(-_,C),
    append(Q,[[-!|C]|R],S), prove([!],S,[],I).
prove([],_,_,_).
prove([L|C],M,P,I) :- \+ (member(A,[L|C]), member(B,P),
    A==B),(-N=L; -L=N) -> (member(N,P); member(D,M),
    copy_term(D,E), length(P,K), K<I, append(A,[N|B],E),
    append(A,B,F), prove(F,M,[L|P],I)), prove(C,M,P,I).

In the remaining subsections of Section 6, we will consider the following formula (in clausal form) as a running example:

$$M = \{\{P_1, P_2\}, \{\neg P_1, P_3(v_1)\}, \{\neg P_3(c_1), P_1\}, \{\neg P_3(c_2), \neg P_1\}, \{P_1, \neg P_2\}\}$$

We typically represent this graphically by putting clauses next to each other and literals on top of each other, as shown in Figure 6.1. We call this the matrix representation of M.

Figure 6.1: Graphical representation of M

To prove M using the connection calculus, we have to build trees of rule applications with axioms as leaves, using the clauses of M and the start, reduction, and extension rules of the connection calculus as described in Section 2.4.

6.1.1 Positive start clauses
Theorem 6.1.1 (Positive start clause). The connection calculus remains correct
and complete if the clause C1 of the start rule is restricted to positive clauses. A
positive clause is a clause that does not contain negated literals.
Consider the formula M. Theorem 6.1.1 then states that we only have to
consider the trees rooted with { P1 , P2 } as premise to maintain a complete
proof search.
Hence, the only potential root for a connection calculus proof tree for M is shown in Figure 6.2, while all other roots do not need to be considered since their start clauses contain negated literals, as shown in Figure 6.3.

$$\frac{\{P_1, P_2\}, M, \{\}}{\varepsilon, \{\{P_1, P_2\}, \{\neg P_1, P_3(v_1)\}, \{\neg P_3(c_1), P_1\}, \{\neg P_3(c_2), \neg P_1\}, \{P_1, \neg P_2\}\}, \varepsilon} \ (start)$$

Figure 6.2: Potential root of proof tree for M

[Figure 6.3 shows the start rule applied to each of the remaining clauses $\{\neg P_1, P_3(v_1)\}$, $\{\neg P_3(c_1), P_1\}$, $\{\neg P_3(c_2), \neg P_1\}$, and $\{P_1, \neg P_2\}$, each crossed out because the clause contains a negated literal.]

Figure 6.3: Pruned roots of potential proof trees for M due to Theorem 6.1.1

6.1.2 Regularity
Definition 6.1.1 (Regularity). A connection proof is regular iff no literal
occurs more than once in the active path.
Theorem 6.1.2 (Regularity). A formula M in clausal form is valid iff there is a regular connection proof for "$\varepsilon, M, \varepsilon$".
Using the partial proof tree for M shown in Figure 6.4 as an example, regularity tells us we can safely backtrack from the node with highlighted $P_1$'s, and we do not need to continue searching from it for our proof search to remain complete. The intuition here is that to close the new $P_1$, we would have to use a different path than was used to close the first $P_1$, to avoid ending up in an infinite loop. Therefore, we must necessarily have been able to use the same new path to close the first $P_1$.
[Figure 6.4 shows a partial proof tree: from $\varepsilon, M, \varepsilon$ the start rule gives $\{P_1, P_2\}, M, \{\}$; an extension on $P_1$ gives the branches $\{P_3(v_1')\}, M, \{P_1\}$ and $\{P_2\}, M, \{\}$; a further extension gives $\{\}, M, \{P_1\}$ (axiom) and the open branch $\{P_1\}, M, \{P_1, P_3(v_1')\}$, in which $P_1$ repeats on the active path, violating regularity.]

Figure 6.4: Non-regularity in a partial connection calculus proof tree for M

6.1.3 Lemmata
One additional optimization which is included in leanCoP 2.0, and commonly included in re-implementations of leanCoP, is "lemmata". However, this technique typically does not increase performance notably and is therefore omitted in our base connection prover. For completeness, it is described here anyway.
Definition 6.1.2 (Lemmata). The connection calculus in Table 6.1 is modified by adding a set of literals Lem, called lemmata, to all tuples "C, M, Path". The empty set is added to the premise of the new start rule, and $\varepsilon$ is added to its conclusion. The set $Lem \cup \{L_1\}$ is added to the premise of the new reduction rule and the right premise of the extension rule. Furthermore, the following rule is added to the connection calculus:

$$\frac{C, M, Path, Lem \cup \{L_2\}}{C \cup \{L_1\}, M, Path, Lem \cup \{L_2\}} \ (lemma)$$

with $\sigma(L_1) = \sigma(L_2)$

Theorem 6.1.3 (Lemmata). A formula M in clausal form is valid iff there is a (regular) connection proof for "$\varepsilon, M, \varepsilon, \varepsilon$" in the connection calculus with lemmata.
Again considering the formula M from Section 6.1.1 and the partial proof tree in Figure 6.5, the added lemma rule lets us close branches faster, using the fact that we necessarily have found closing branches for the literal added to the path in the left-hand branch of any previous extension rule applications.

[Figure 6.5 shows a partial proof tree for M in which the branch for a second occurrence of $P_1$ is closed immediately by the lemma rule, since $P_1$ was already solved in the left-hand branch of an earlier extension step.]

Figure 6.5: Partial connection calculus proof tree for M with lemma rule

6.1.4 Restricted Backtracking


Lastly, we implement a core technique in leanCoP 2.0: "restricted backtracking". In Section 7, we will compare our base prover (leanCoP 1.0 + regularity) to our base prover with restricted backtracking (leanCoP 1.0 + regularity + restricted backtracking). The latter can, for all intents and purposes, be viewed as leanCoP 2.0 without lemmata.
Definition 6.1.3 (Principal literal, solved literal). When the reduction, extension or lemma rules are applied to a literal L, L is called the principal literal of the proof step. We say a reduction or lemma step solves a literal L iff L is the principal literal of the proof step. An extension step solves a literal L iff L is the principal literal of the proof step and there is a proof for the left premise, i.e. there is a derivation for the left premise such that all leaves are axioms.
Definition 6.1.4 (Essential backtracking/proof step). Let $R_1, \ldots, R_n$ be instances of rules with the same principal literal $L_1$ applicable to a node of a derivation in the connection calculus. If the literal $L_1$ can be solved by applying the rule $R_i$, but not by applying the rules $R_1$ to $R_{i-1}$, then backtracking over the rules $R_2, \ldots, R_i$ is called essential backtracking; backtracking over the rules $R_{i+1}, \ldots, R_n$ is called non-essential backtracking. The application of one of the rules $R_1, \ldots, R_i$ is an essential proof step; the application of one of the rules $R_{i+1}, \ldots, R_n$ is a non-essential proof step.
Definition 6.1.5 (Restricted backtracking). Let $R_1, \ldots, R_i, \ldots, R_n$ be the instances of (reduction, extension or lemma) rules with principal literal $L_1$ that are applicable to a node of a derivation in the connection calculus, and let rule $R_i$ solve $L_1$. Restricted backtracking does not apply the alternative rules $R_{i+1}, \ldots, R_n$ anymore.
Restricted backtracking restricts backtracking by only trying to solve a literal once. This makes the search procedure incomplete, but it has shown to be very effective at proving more problems in fewer proof step iterations [75].
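In a DFS implementation, restricted backtracking amounts to cutting the remaining alternatives once a literal has been solved; a sketch over an assumed rule-instance interface:

def alternatives(literal, rule_instances, solves, restricted=True):
    # Yield rule instances R1..Rn to try on the principal literal.
    # With restricted backtracking, stop offering alternatives as soon
    # as one instance solves the literal (leanCoP 2.0's cut): the
    # remaining Ri+1..Rn are never tried, trading completeness for speed.
    for rule in rule_instances:
        yield rule
        if restricted and solves(literal, rule):
            return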
Figure 6.6 compares this type of tree search (b) to the search strategy used in
leanCoP 1.0 (a) and the MCTS search strategy (c) used in various extensions
of leanCoP supporting machine learning [43–45].

Figure 6.6: Comparison of tree search strategies [76]

6.2 Formula Tensorization and State/Action Embeddings

Now that we have the proof dynamics in place, we will have a look at how we can discriminate which actions to take from a given proof state by approximating a function from states to actions using machine learning. Our goal is to implement the ProofModel class such that the SearchAgent will receive a probability distribution over legal actions from the Model after giving the Model a State and a list of potential Actions.
Most interesting and effective machine learning models, such as the ones discussed in Section 3.3, learn functions from real-valued tensors to other real-valued tensors. However, it is not immediately clear how one would express connection calculus proof states and actions as tensors. As discussed in Section 4, this is one of the main bottlenecks of incorporating machine learning into theorem provers: how do we properly embed proof states into the space of real-valued tensors?
Since this is quite a crucial step in the process of incorporating ML into
theorem provers, there has been quite a bit of work done on this. Here are
a couple of approaches:
- Meta-information such as the theory name and presence in various
databases [77]
- Term walks of length 2 [77]
- Term walks of length 3 [55]
- Symbol level token embeddings [78]
- Linear chains [60]
One of the problems with these previous methods is that they, for the most part, do not have necessary invariance properties such as invariance to literal/clause order and invariance to symbol renaming. Newer embedding strategies are therefore based on representation learning with graph neural networks, exploiting the relational structure of formulas and proof states. Since there is a relatively large amount of literature on graph representation learning using GNNs, reducing the problem of tensorizing proof states to finding a graph representation of a proof state, as shown in Figure 6.7, is thought to be beneficial. Here are some graph encoding approaches:
- Embed the formula parse trees extended with subexpression sharing
(parse DAGs) [66, 79–81], see Figure 6.8
- Embed proof states as hypergraphs [45]

Figure 6.7: Reduction of the representation learning problem to graph representation (proof state → graph representation → GNN → tensor representation)

Figure 6.8: A syntax-tree and a DAG representation of the formula $\forall A, B, C.\ r(A, B) \wedge p(A) \vee \neg q(B, f(A)) \vee q(C, f(A))$, from [80]. Here A, B, C are variables and p, q, r are predicates.

In the case of the connection calculus, on a macro scale, we typically have three defining components of a proof search state: the matrix, the current proof tableau, and the substitution. The actions are given by the possible tableau expansions. This is illustrated in Figure 6.9.

[Figure 6.9 shows a matrix, the current tableau with its substitution, and the possible actions/inferences (extensions and new tableaux).]

Figure 6.9: Example of a connection calculus proof state and actions

6.2.1 Graph Construction 1


As a first attempt, we implemented the tensorization and model described in [45]. The authors mention in [45]: "The current main speed issue turned out to be the communication overhead between the provers and the network. The average inference step in 100 agents and one network inference took on average 0.57 sec." This turned out to be too big of an issue in our case of limited resources, and hence all evaluation (Section 7) is done using the second approach, which we describe next. However, we still think it is worthwhile to describe this first approach, as it is part of our codebase.
Definition 6.2.1 (Hypergraph). A hypergraph is a generalization of a graph (see Definition 3.3.1) in which an edge can join any number of vertices. Formally, an undirected hypergraph H is a 2-tuple H = (V, E) where:
- V is a set of nodes.
- $E \subseteq \mathcal{P}(V) \setminus \{\emptyset\}$ is a set of hyperedges.

In [45], a hypergraph is constructed consisting of:
- C, S, T: three independent sets of nodes
- $E_{ct} \subset C \times T$: a binary set of edges between C and T
- $E_{st} \subset S \times T \times (T \cup \{T_0\})^2 \times \{1, -1\}$: a 4-ary labeled ($\{1, -1\}$) set of edges between S and T, where $T_0$ is a dummy node whose purpose will be made clear later.
To produce this hypergraph, the connection calculus proof state is converted to a list of clauses as follows:
- All clauses in M are added to the list of clauses
- All literals on the path from the root of the tableau to the currently
considered open goal are added to the list of clauses as individual
clauses
- All open goals (literals) are added to the list of clauses as one clause
The current substitution σ is applied to the list of clauses, and the node sets
C, S, T are generated as follows:
- All clauses in the newly constructed list of clauses are represented by
individual nodes in C
- All unique predicate and function symbols in the list are represented
by individual nodes in S
- All (symbolically) unique terms and literals are represented by
individual nodes in T
The edge sets are generated as follows:
- $E_{ct}$ contains all the pairs $(C_i, T_j)$ where $T_j$ is a literal contained in $C_i$.
- $E_{st}$ is constructed by the following procedure, applied to every literal or subterm $T_i$ that is not a variable:
  - If $T_i$ is a negative literal, $g = -1$, otherwise $g = 1$
  - $S_j$ is the outermost function or predicate symbol of $T_i$
  - If the arity of $T_i$ is 0, add $(S_j, T_i, T_0, T_0, g)$ to $E_{st}$
  - If the arity of $T_i$ is 1, add $(S_j, T_i, T_1, T_0, g)$ to $E_{st}$
  - If the arity of $T_i$ is $n \geq 2$, add $(S_j, T_i, T_k, T_{k+1}, g)$ to $E_{st}$ for $k \in \{1, \ldots, n-1\}$

6.2.2 Graph Construction 2
The previously presented embedding is invariant to variable renaming,
literal and clause reordering, and partially solves the problem of term
ordering by connecting successive terms by hyperedges. However, as the
authors mention, in this encoding, f (t1 , t2 , t1 ) would be encoded the same
way as f (t2 , t1 , t2 ).
Definition 6.2.2 (Heterogeneous graph). A heterogeneous graph is a generalization of a graph (see Definition 3.3.1) where each node and edge is associated with a type. Formally, a heterogeneous graph G is a 6-tuple $G = (V, E, T_V, T_E, \tau, \phi)$ where:
- V is a set of nodes
- $E \subseteq \{(x, y) \mid (x, y) \in V^2\}$ is a set of edges
- $T_V$ is a set of node types
- $T_E$ is a set of edge types
- $\tau : V \to T_V$ is a node type mapping
- $\phi : E \to T_E$ is an edge type mapping

Here we present a novel proof state/action embedding in the form of a heterogeneous graph. Heterogeneous graphs are graphs where each node and edge is associated with a type. The different types of nodes and edges tend to have different types of attributes, designed to capture the characteristics of what they are representing. Our construction uses 5 node types and 5 edge types, as shown in Figure 6.10.

Node types: CLA, LIT, FUN, CON, VAR
Edge types: Equal (equivalence), Complements (symmetric), Contains/Inn (inverse), Successor (antisymmetric), Extension (antisymmetric)

Figure 6.10: Node and edge types of the heterogeneous graph constructions

Given a connection calculus proof state, the state defining graph is built as
follows:
- Every clause in the matrix M is represented by a CLA node.
- Every literal in a clause is represented by a LIT node.
- Every term in a literal is represented by a:
  - FUN node if the term is a function
  - CON node if the term is a constant
  - VAR node if the term is a variable
- Every term in a function is represented by a:
  - FUN node if the term is a function
  - CON node if the term is a constant
  - VAR node if the term is a variable
- Every pair of LIT nodes representing literals with the same predicate symbol has an Equal edge between them.
- Every pair of FUN nodes representing functions with the same function symbol has an Equal edge between them.
- Every pair of CON nodes representing constants with the same constant symbol has an Equal edge between them.
- Every pair of VAR nodes representing variables with the same variable symbol has an Equal edge between them (note: there should be no Equal edges between VAR nodes originating from different clauses).
- Every pair of LIT nodes representing literals with the same predicate symbol, where one is negated and the other is not, has a Complements edge between them.
- Every pair of nodes where one represents an immediate sub-expression of the other has a Contains/Inn edge between them ("immediate sub-expression" refers to a literal in a clause or an argument of a literal or function).
- Every pair of FUN, CON, and VAR nodes where one node represents the argument immediately after the other in their originating super-expression has a Successor edge going from the first node to the second.
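To illustrate how such a typed graph can be assembled in practice, the following is a minimal sketch using the HeteroData container from PyTorch Geometric [87], which our implementation builds on. The flat clause representation is a hypothetical simplification, and only CLA/LIT nodes with Contains, Equal and Complements edges are shown; the FUN, CON and VAR nodes and the Successor edges follow the same pattern.

```python
import torch
from torch_geometric.data import HeteroData

# Hypothetical flat clause representation: (predicate symbol, negated?, args).
clauses = [
    [("p", False, ["X"]), ("q", True, ["a"])],   # C1: p(X) | ~q(a)
    [("q", False, ["a"])],                        # C2: q(a)
    [("p", False, ["b"])],                        # C3: p(b)
]

data = HeteroData()
lit_sym, lit_neg, contains = [], [], []
for ci, clause in enumerate(clauses):
    for sym, neg, _args in clause:
        lit_sym.append(sym)
        lit_neg.append(neg)
        contains.append((ci, len(lit_sym) - 1))   # CLA -> LIT containment

# One-hot node type features of dimension 5 (CLA is type 0, LIT is type 1).
data["CLA"].x = torch.eye(5)[0].repeat(len(clauses), 1)
data["LIT"].x = torch.eye(5)[1].repeat(len(lit_sym), 1)

data["CLA", "contains", "LIT"].edge_index = (
    torch.tensor(contains, dtype=torch.long).t().contiguous())

# Equal edges between same-symbol literals; Complements additionally
# between same-symbol literals of opposite polarity.
equal, compl = [], []
for i in range(len(lit_sym)):
    for j in range(len(lit_sym)):
        if i != j and lit_sym[i] == lit_sym[j]:
            equal.append((i, j))
            if lit_neg[i] != lit_neg[j]:
                compl.append((i, j))
data["LIT", "equal", "LIT"].edge_index = (
    torch.tensor(equal, dtype=torch.long).t().contiguous())
data["LIT", "complements", "LIT"].edge_index = (
    torch.tensor(compl, dtype=torch.long).t().contiguous())
```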
An example of this conversion is shown in Figure 6.11 for a subset of the
clauses of an arbitrary matrix.

[Figure 6.11: Example conversion from a connection calculus matrix to a heterogeneous graph, showing the embedding of clauses C1, C4, C5.]

A slightly more planar version of the graph in Figure 6.11 is shown in Figure 6.12.

[Figure 6.12: A more planar version of the graph in Figure 6.11.]

6.3 Model Architecture
Now that we have graph representations of the proof states, we can use graph neural networks both to embed the proof states and to perform inference on them.

6.3.1 Model for Graph Construction 1


The following describes the model running on the tensorization described in Section 6.2.1. To make the direction of message passing in the neural layers clear, helper sets are constructed such that the set F_xy^j contains the indices of the nodes of type y connected to the j-th receiving node of type x, as follows:

\[
\begin{aligned}
F^j_{ct} &= \big\{\, a : (C_j, T_a) \in E_{ct} \,\big\} \\
F^j_{tc} &= \big\{\, a : (C_a, T_j) \in E_{ct} \,\big\} \\
F^j_{st} &= \big\{\, (a, b, c, g) : (S_j, T_a, T_b, T_c, g) \in E_{st} \,\big\} \\
F^j_{ts,1} &= \big\{\, (a, b, c, g) : (S_c, T_j, T_a, T_b, g) \in E_{st} \,\big\} \\
F^j_{ts,2} &= \big\{\, (a, b, c, g) : (S_c, T_a, T_j, T_b, g) \in E_{st} \,\big\} \\
F^j_{ts,3} &= \big\{\, (a, b, c, g) : (S_c, T_a, T_b, T_j, g) \in E_{st} \,\big\}
\end{aligned}
\]

The neural layers are then defined by the following expressions:

 
\[
\begin{aligned}
c_{i+1,j} &= \mathrm{ReLU}\Big( B^i_c + M^i_c \, c_{i,j} + M^i_{ct} \, \mathrm{red}_{a \in F^j_{ct}}(t_{i,a}) \Big) \\
x^{a,b,c}_i &= B^i_{ts} + M^i_{ts,1} \, t_{i,a} + M^i_{ts,2} \, t_{i,b} + M^i_{ts,3} \, t_{i,c} \\
s_{i+1,j} &= \tanh\Big( M^i_s \, s_{i,j} + M^i_{ts} \, \mathrm{red}'_{(a,b,c,g) \in F^j_{st}}\big( g \cdot x^{a,b,c}_i \big) \Big) \\
y^{a,b,c,g}_{i,d} &= B^i_{st} + M^{1,d}_{st,i} \, t_{i,a} + M^{2,d}_{st,i} \, t_{i,b} + M^{3,d}_{st,i} \, s_{i,c} \cdot g \\
z_{i,j,d} &= M^i_{st,d} \, \mathrm{red}_{(a,b,c,g) \in F^j_{ts,d}}\Big( \mathrm{ReLU}\big( y^{a,b,c,g}_{i,d} \big) \Big) \\
v_{i,j} &= M^i_{tc} \, \mathrm{red}_{a \in F^j_{tc}}(c_{i,a}) \\
t_{i+1,j} &= \mathrm{ReLU}\Big( B^i_t + M^i_t \, t_{i,j} + v_{i,j} + \sum_{d \in \{1,2,3\}} z_{i,j,d} \Big)
\end{aligned}
\]

Here, all the B symbols represent learnable vectors (biases), and all the M
symbols represent learnable matrices. The aggregation operations used are
defined as follows:

\[
\begin{aligned}
\mathrm{red}_{i \in I}(u_i) &= \max_{i \in I}(u_i) \,\|\, \operatorname{mean}_{i \in I}(u_i) \\
\mathrm{red}'_{i \in I}(u_i) &= \big( \max_{i \in I}(u_i) - \min_{i \in I}(u_i) \big) \,\|\, \operatorname{mean}_{i \in I}(u_i)
\end{aligned}
\]

In the above, "∥" denotes concatenation, so that if the u_i are of dimension d, the resulting vector has dimension 2d. All the aggregations (max, min, mean) are applied pointwise.
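For illustration, the two aggregations can be realized in PyTorch roughly as follows, assuming the embeddings of the neighbors of a single receiving node are stacked into a tensor of shape (|I|, d):

```python
import torch

def red(u: torch.Tensor) -> torch.Tensor:
    """max ∥ mean over the neighbor dimension; u has shape (|I|, d)."""
    return torch.cat([u.max(dim=0).values, u.mean(dim=0)], dim=-1)

def red_prime(u: torch.Tensor) -> torch.Tensor:
    """(max - min) ∥ mean over the neighbor dimension."""
    return torch.cat([u.max(dim=0).values - u.min(dim=0).values,
                      u.mean(dim=0)], dim=-1)

u = torch.randn(4, 32)        # four neighbor embeddings of dimension 32
assert red(u).shape == (64,)  # output dimension is 2d
```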
After L message passing layers, we obtain the embeddings c_{L,j}, s_{L,j}, t_{L,j} of the clauses C_j, symbols S_j, and terms and literals T_j, respectively. These are fed to a regular feed-forward neural network whose outputs are used to compute the logits for taking an action. An action, in this case, corresponds to using axiom C_i and complementing its literal T_j with the current goal. Let C_k represent the clause containing all the remaining goals. The logit for the action corresponding to (c_{L,i}, t_{L,j}) is then found by feeding the concatenation of c_{L,i}, t_{L,j} and c_{L,k} through a hidden layer of size 64 with ReLU activation, followed by a linear output layer (without an activation function). The distribution over actions is obtained by applying the softmax function to the logits.
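A minimal sketch of this policy head in PyTorch might look as follows; the class name and the embedding dimension are our own illustrative choices:

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Scores an action from (clause, literal, remaining-goals) embeddings."""
    def __init__(self, d: int):
        super().__init__()
        self.hidden = nn.Linear(3 * d, 64)  # concatenation of c_i, t_j, c_k
        self.out = nn.Linear(64, 1)         # one logit per action

    def forward(self, c_i, t_j, c_k):
        h = torch.relu(self.hidden(torch.cat([c_i, t_j, c_k], dim=-1)))
        return self.out(h).squeeze(-1)

head = PolicyHead(d=32)
# Stack one row per candidate action, then softmax over the logits.
logits = head(torch.randn(7, 32), torch.randn(7, 32), torch.randn(7, 32))
action_probs = torch.softmax(logits, dim=0)
```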

6.3.2 Model for Graph Construction 2


The following describes the model running on the tensorization described
in Section 6.2.2.
Recall the definition of a general graph neural network node update from Section 3.3, restated here as Equation 6.1:

\[
h^{t+1}_v = q^t\Big( h^t_v,\; f^t\Big( \bigcup_{u \in \mathcal{N}(v)} \big\{ \big( h^t_v,\, h^t_u,\, e_{u \to v} \big) \big\} \Big) \Big) \tag{6.1}
\]

We can specialize this definition to take different edge types into account by having separate weights for each type of relation. The R-GCN model from [82] does this as shown in Equation 6.2:

\[
h^{t+1}_v = \sigma\Big( \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}^r_v} \frac{1}{c_{v,r}}\, W^t_r h^t_u + W^t_0 h^t_v \Big) \tag{6.2}
\]

Here, N_v^r denotes the set of neighbor indices of node v under relation r ∈ R, and W_r^t denotes the weight matrix for relation r at GNN layer t. The term c_{v,r} is a problem-specific normalization constant that can either be learned or chosen in advance (such as c_{v,r} = |N_v^r|), and σ is any non-linear function (such as ReLU). The underlying idea is that different relation types now contribute different information to v, even when the message-sending node u is held constant.
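Equation 6.2 corresponds closely to the RGCNConv layer from PyTorch Geometric [87], which is also what our implementation uses (see Section 7.1). A small usage sketch with placeholder dimensions and edge data:

```python
import torch
from torch_geometric.nn import RGCNConv

# One relation per edge type in Figure 6.10 (Extension edges are scored,
# not actually present in the graph).
conv = RGCNConv(in_channels=32, out_channels=32, num_relations=5)

x = torch.randn(10, 32)                            # node features h_v^t
edge_index = torch.tensor([[0, 1, 2], [3, 4, 5]])  # edges u -> v
edge_type = torch.tensor([0, 2, 4])                # relation r of each edge

h_next = torch.relu(conv(x, edge_index, edge_type))  # h_v^{t+1}
```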

R-GCN is one way to incorporate different relations into the message
passing formulation. Figure 6.13 shows some other examples which
incorporate more advanced features and information.

Figure 6.13: Some relational GNNs from [83]

To use this for inference in our theorem prover, we frame the problem of proof search guidance as a link prediction problem on the graph embedding. That is, given the LIT node representing the current goal on our connection calculus path, which node is it most likely to have an Extension edge to? Note that the graph stays constant throughout a proof search problem: we never actually add any Extension edges to the graph.
Our model for the heterogeneous graph construction is reminiscent of the link prediction graph auto-encoder model used in [82]. The encoder consists of multiple R-GCN layers, while the decoder is simply a DistMult factorization [84]. That is, we score the potential Extension edge between the LIT node u representing the current open goal and a LIT node v representing a unifiable complementary literal by the function:
\[
f(u, v) = e_u^{\top} W e_v
\]
where W is a learnable matrix and e_u = h_u^L, e_v = h_v^L are the encoded representations of u and v, respectively (L being the number of R-GCN layers in the encoder).
Since the number of supervision edges per sampled state/action is relatively low, we use a standard cross-entropy loss without negative sampling to train the model:

\[
\mathcal{L} = - \sum_{(u,v) \in A} \Big( y \log \ell\big(f(u,v)\big) + (1 - y) \log\big(1 - \ell(f(u,v))\big) \Big)
\]

Here, A is the set of all possible Extension edges and ℓ is the logistic sigmoid function. The indicator y is set to y = 1 for edges representing actions which successfully led to proofs, and to y = 0 for the complementary actions, that is, edges representing actions taken which did not lead to proofs.
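Putting the decoder score and the loss together, a PyTorch sketch could look as follows. The diagonal parameterization of W is what DistMult [84] prescribes; the node embeddings h and the candidate edge list are placeholders standing in for the encoder output and the possible Extension edges:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistMultDecoder(nn.Module):
    """Scores a potential Extension edge as f(u, v) = e_u^T W e_v."""
    def __init__(self, d: int):
        super().__init__()
        # DistMult restricts W to a diagonal matrix, stored as a vector.
        self.w = nn.Parameter(torch.randn(d))

    def forward(self, e_u: torch.Tensor, e_v: torch.Tensor) -> torch.Tensor:
        return (e_u * self.w * e_v).sum(dim=-1)

decoder = DistMultDecoder(d=32)
h = torch.randn(10, 32)                  # encoder output h^L for all nodes
cand = torch.tensor([[0, 3], [0, 7]])    # candidate Extension edges (u, v)
y = torch.tensor([1.0, 0.0])             # 1 = the action led to a proof

scores = decoder(h[cand[:, 0]], h[cand[:, 1]])
loss = F.binary_cross_entropy_with_logits(scores, y)  # the loss above
```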

Chapter 7

Evaluating the ML-Guided ATP System

In this chapter, we compare our prover, described in Chapters 5 and 6, to other provers on two benchmarks: M2k and MPTP2078. Both problem sets are subsets of the Mizar Mathematical Library [49]. The M2k [43] dataset contains 2003 problems selected randomly from the subset of Mizar that is known to be provable by existing ATPs, while MPTP2078 [41] contains 2078 problems selected regardless of whether or not they could be solved by an ATP system.
Without learning guidance, our prover is (assuming it is bug-free) functionally equivalent to leanCoP 1.0 + regularity, as discussed in Section 6.1. It is a Python translation of leanCoP, translated within our framework. Tracing the connection calculus steps of both provers on both problem sets, we see that they make exactly the same inferences at every step. It is therefore interesting to compare the speed (in seconds) of leanCoP and our base non-learning-guided prover, since they are implemented using different data structures and programming languages. We do this on both problem sets.
We also benchmark our prover with learning guidance on both problem sets. We guide the search by reordering the possible inference actions at each step using the model described in Section 6.3.2 (graph construction 2). Additionally, we add the restricted backtracking technique used in leanCoP 2.0 to our base prover for benchmarking. That is, we benchmark the following approaches on both M2k and MPTP2078, all using a maximum of 100 000 inference steps:
- Base prover (Base)
- Base prover + restricted backtracking (RB)
- Base prover + restricted backtracking + learning guidance (Learning)

7.1 Hardware and Model Training
All testing is run on a server with four Intel(R) Xeon(R) L7555 CPUs @ 1.87 GHz, totaling 64 cores. Among other things, these CPUs lack AVX support, so vector operations run somewhat less efficiently than on more modern processors. Therefore, a rather lightweight R-GCN model is used. The encoder consists of three R-GCN layers with input dimension 32 and output dimension 32, while the decoder is a simple DistMult over the relevant node embeddings, as discussed in Section 6.3.2. Input node embeddings are one-hot encoded vectors of dimension 5, one for each node type. We train this network for 50 epochs using the Adam optimizer [85] with a learning rate of 2e-6 (for both problem sets), found using the algorithm described in [86]. Our network is implemented in the PyTorch deep learning library [72], using the PyTorch Geometric extension library [87] for the R-GCN layers and the PyTorch Lightning extension library [88] for training and logging.
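For concreteness, the encoder and optimizer configuration described above could be set up as in the following sketch. We assume the first R-GCN layer lifts the 5-dimensional one-hot inputs to 32 dimensions (the text above leaves this detail implicit), and the PyTorch Lightning training loop is elided:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv

class Encoder(nn.Module):
    """Three R-GCN layers with hidden dimension 32, as in Section 7.1."""
    def __init__(self, num_relations: int = 5):
        super().__init__()
        self.layers = nn.ModuleList([
            RGCNConv(5, 32, num_relations),   # one-hot node type inputs
            RGCNConv(32, 32, num_relations),
            RGCNConv(32, 32, num_relations),
        ])

    def forward(self, x, edge_index, edge_type):
        for layer in self.layers:
            x = torch.relu(layer(x, edge_index, edge_type))
        return x

model = Encoder()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-6)  # [85], lr via [86]
```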
Due to time and resource constraints, we only perform one solving → training → solving iteration. In each solving step, we attempt to solve every problem in the given problem set, generating training data along the way. The training data are then used to train the network, minimizing the cross-entropy with the target link probabilities as described in Section 6.3.2.
The first solving step is run without guidance to save time, storing only the necessary training data for each inference step. In both solving steps, the prover makes a maximum of 100 000 total inference steps for each problem before moving on to the next problem. For both guidance and training data gathering, we use iterative deepening with restricted backtracking up to a depth of 5, falling back to iterative deepening without restricted backtracking, instead of the MCTS used in systems such as rlCoP [43] and AlphaZero [33]. This is to make sure we find proofs in a relatively short amount of time during training data gathering.
Our network is lightweight enough to complete more than 2000 inferences per second in total while proving problems in parallel. This is notably less than the non-guided version (about 10 000 on both benchmarks), but still faster than other deep learning guided systems such as graphCoP [45]. Another consequence of the model being lightweight is that it gains nothing from being put on a GPU during inference, due to communication overhead. Lastly, since the graph itself is stationary, we only send the graph through the encoder once per problem and only query the decoder for the remaining steps. This is why the number of inferences the prover makes per second is comparable to the non-learning-guided version.
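This encode-once pattern can be sketched as follows, reusing the Encoder and DistMultDecoder classes sketched earlier; since the graph never changes during search, only the cheap decoder runs at each inference step:

```python
import torch

encoder = Encoder()                  # from the sketch in Section 7.1
decoder = DistMultDecoder(d=32)      # from the sketch in Section 6.3.2
x = torch.eye(5)[torch.zeros(10, dtype=torch.long)]  # one-hot node types
edge_index = torch.tensor([[0, 1], [2, 3]])
edge_type = torch.tensor([0, 1])

# Encode the stationary proof-state graph once per problem.
with torch.no_grad():
    h = encoder(x, edge_index, edge_type)            # cached embeddings h^L

def score_extensions(goal: int, candidates: list) -> torch.Tensor:
    """Scores candidate Extension edges using only the decoder."""
    e_u = h[goal].expand(len(candidates), -1)
    return decoder(e_u, h[candidates])
```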

7.2 Results on MPTP2078
Tables 7.1–7.3 contain the results of running the experiments described at the beginning of this chapter on the MPTP2078¹ bushy problem set.

7.2.1 Unguided Proof Search


In Table 7.1, we see that the Prolog version is able to prove slightly more MPTP2078 bushy problems than the Python version on our hardware in 10 seconds of CPU time, with some discrepancies in which problems each prover is able to prove. By tracing both provers on all problems, we found that they make exactly the same inference steps. Hence, we conclude that the discrepancies are purely due to the different programming language and data structure implementations.

            leanCoP 1.0 + regularity   Python prover
Proved      362                        340
[%]         17.42%                     16.36%
Unique      26                         4

Table 7.1: leanCoP vs. base Python prover (MPTP2078 bushy)

7.2.2 Guided Proof Search


In Table 7.2, we see that adding restricted backtracking and learning guidance to our base prover greatly increases the number of problems our Python prover is able to solve. Table 7.3 shows how many unique problems each technique allows us to prove: each row of Table 7.3 shows the number of problems proven by that technique which are not proven by the technique in the corresponding column. We see that although the restricted backtracking prover with our learning model heuristic did not end up proving more problems than the pure restricted backtracking prover, it was able to prove 57 problems that the pure restricted backtracking prover could not.

            Base     RB       Learning
Proved      351      462      451
[%]         16.89%   22.23%   21.70%

Table 7.2: Comparison of various strategies (MPTP2078 bushy)


1 https://github.com/JUrban/MPTP2078

            Base   RB    Learning
Base        -      57    54
RB          168    -     68
Learning    154    57    -

Table 7.3: Unique problems proven by various strategies (MPTP2078 bushy)

7.3 Results on M2k


Tables 7.4–7.6 contain the results of running the experiments described at the beginning of this chapter on the M2k² subset of the Miz40³ problem set:

7.3.1 Unguided Proof Search


In Table 7.4, we see that the Prolog version is able to prove slightly more M2k problems than the Python version on our hardware in 10 seconds of CPU time. Again, tracing both provers on all problems shows that they make exactly the same inference steps. Hence, the discrepancies here, as with MPTP2078, are purely due to the different programming language and data structure implementations.

            leanCoP 1.0 + regularity   Python prover
Proved      876                        804
[%]         43.73%                     40.14%
Unique      72                         0

Table 7.4: leanCoP vs. base Python prover (M2k)

7.3.2 Guided Proof Search


In Table 7.5, we see that adding restricted backtracking to our base prover increases the number of problems our Python prover is able to solve, albeit by a smaller margin than we saw on MPTP2078 in Table 7.2, whereas our learning prover actually proves fewer problems than the base prover. Table 7.6 shows how many unique problems each technique allows us to prove. As in Table 7.3, each row of Table 7.6 shows the number of problems proven by that technique which are not proven by the technique in the corresponding column. Again, we see that although the different strategies prove similar numbers of problems, the sets of problems they prove are largely non-overlapping.
2 https://raw.githubusercontent.com/JUrban/deepmath/master/M2k_list
3 https://github.com/JUrban/deepmath

            Base     RB       RB + Learning
Proved      822      846      815
[%]         41.03%   42.23%   40.68%

Table 7.5: Comparison of various strategies (M2k)

                 Base   RB    RB + Learning
Base             -      182   124
RB               206    -     143
RB + Learning    117    112   -

Table 7.6: Unique problems proven by various restricted strategies (M2k)

Chapter 8

Conclusion and Future Work

In this thesis, we have developed a high-level modular object-oriented framework for combining arbitrary proof calculi with arbitrary learning models and search strategies. Furthermore, we have implemented the framework in Python and created a Python translation of the leanCoP theorem prover. On top of this, we have developed and implemented a novel graph embedding for connection calculus proof states/actions, framing the problem of proof guidance as a link prediction problem in said embedding using an R-GCN auto-encoder model implemented in PyTorch. The embedding was incorporated into our Python translation of leanCoP (made easy by our framework) to provide proof search guidance.
Regarding our novel learning approach (graph construction 2), immediate future work includes more testing, tweaking, and study of its limitations.
An important design choice for our framework is that it allows for easy integration of learning techniques with arbitrary proof calculi, not just the connection calculus. This includes calculi for non-classical logics. Future research could therefore explore the space of non-classical reasoners quite painlessly using our framework. Further ahead, future work would include porting the back-end parts of the framework, such as primitives, to a faster language such as C++.
On the topic of ML theory for ATP, an interesting issue for future research is how much the choice of proof calculus matters for learning. A related question is whether important information is lost, or other learning-related issues arise, in the various formula translations, such as the translation from standard first-order syntax to clausal form (which guarantees only equisatisfiability). Lastly, research on good models for guidance, on how to deal with sparse data, and on which search strategies to use should remain central topics in combining ML and ATP, since these are fundamental questions and the field is still in its infancy.

The recency of most of the papers cited in Section 4, and the apparent superiority of their models, form a promising basis for future work on learning-guided ATP.

List of Figures

2.1 Proof tableau, Left: Explicit, Right: Implicit [7] . . . . . . . . 16

3.1 Comparison of ML paradigms [20] . . . . . . . . . . . . . . . 19


3.2 Decision tree training data and decision boundary . . . . . . 22
3.3 Decision tree prediction . . . . . . . . . . . . . . . . . . . . . 22
3.4 Linear regression with one feature . . . . . . . . . . . . . . . 24
3.5 Logistic regression with two features . . . . . . . . . . . . . . 25
3.6 Neuron model . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.7 Neurons composed in width . . . . . . . . . . . . . . . . . . . 27
3.8 Neurons composed in depth . . . . . . . . . . . . . . . . . . . 27
3.9 Temporal data . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.10 Spatial (2D) data . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.11 Arbitrary relational data . . . . . . . . . . . . . . . . . . . . . 29
3.12 Graph neural network node embedding . . . . . . . . . . . . 30
3.13 RL as interplay between an intelligent agent and an MDP . . 31
3.14 A non-exhaustive, but useful taxonomy of algorithms [22–
33] in modern RL from [34] . . . . . . . . . . . . . . . . . . . 32
3.15 MCTS selection step [37] . . . . . . . . . . . . . . . . . . . . . 34
3.16 MCTS expansion step [37] . . . . . . . . . . . . . . . . . . . . 34
3.17 MCTS simulation step [37] . . . . . . . . . . . . . . . . . . . . 35
3.18 MCTS backpropagation step [37] . . . . . . . . . . . . . . . . 36

4.1 HOL4 guidance overview from [63] . . . . . . . . . . . . . . 44

5.1 Generalized prover class diagram . . . . . . . . . . . . . . . . 48


5.2 Abstract Python state and action classes used in search
problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Abstract Python problem class for defining proof search
dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.4 General RL scenario . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5 The MDP is already naturally represented in the system design 52
5.6 The RL components are already naturally represented in the
system design . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.7 MaLeCoP general architecture from [39] . . . . . . . . . . . . 54
5.8 Complete framework with support for learning . . . . . . . 55
5.9 Abstract Python model class used to guide search problems 55

6.1 Graphical representation of M . . . . . . . . . . . . . . . . . . 59

6.2 Potential root of proof tree for M . . . . . . . . . . . . . . . . 60
6.3 Pruned roots of potential proof trees for M due to theorem
6.1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.4 Non-regularity in a partial connection calculus proof tree for
M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.5 Partial connection calculus proof tree for M with lemma rule 61
6.6 Comparison of tree search strategies [76] . . . . . . . . . . . . 62
6.7 Reduction of representation learning problem to graph
representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.8 A syntax-tree and a DAG representation of the formula
∀ A, B, C.r ( A, B) ∧ p( A) ∨ ¬q( B, f ( A)) ∨ q(C, f ( A)) from
[80]. Here A, B, C are variables and p, q, r are predicates . . . 64
6.9 Example of a connection calculus proof state and actions . . 65
6.10 Node and edge types of the heterogeneous graph constructions 67
6.11 Example conversion from connection calculus matrix to
heterogeneous graph showing embedding of clauses C1 , C4 , C5 69
6.12 More planar version of the graph in Figure 6.11 . . . . . . . . . 69
6.13 Some relational GNNs from [83] . . . . . . . . . . . . . . . . 72

List of Tables

2.1 Deduction domains and common formalizations . . . . . . . 4


2.2 Truth table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Łukasiewicz’ calculus for propositional logic [3] . . . . . . . 8
2.4 A simplified Hilbert calculus [3] . . . . . . . . . . . . . . . . . 11
2.5 Sequent Calculus LK . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Block Tableau Calculus . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Connection Calculus [8] . . . . . . . . . . . . . . . . . . . . . 17
2.8 Resolution Calculus . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1 Connection Calculus . . . . . . . . . . . . . . . . . . . . . . . 39


4.2 Comparison of the number of problems solved by leanCoP
and leanCoP guided by XGBoost, from [43] . . . . . . . . . . 41
4.3 Comparison of the number of problems solved by leanCoP
guided by the invariant-preserving GNN and by XGBoost,
from [45] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Resolution calculus . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Comparison of different E strategies on MPTP2078 bushy
from [54] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Comparison of different E strategies on Mizar40 from [56] . 43
4.7 Comparison of different provers on MPTP2078 bushy from
[60] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

6.1 Connection Calculus . . . . . . . . . . . . . . . . . . . . . . . 58

7.1 leanCoP vs. base Python prover (MPTP2078 bushy) . . . . . 75


7.2 Comparison of various strategies (MPTP2078 bushy) . . . . 75
7.3 Unique problems proven by various strategies (MPTP2078
bushy) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.4 leanCoP vs base Python prover (M2k) . . . . . . . . . . . . . 76
7.5 Comparison of various strategies (M2k) . . . . . . . . . . . . 77
7.6 Unique problems proven by various restricted strategies (M2k) 77

Bibliography

[1] Frederic Portoraro. ‘Automated Reasoning’. In: The Stanford Encyclo-


pedia of Philosophy. Ed. by Edward N. Zalta. Spring 2019. Metaphysics
Research Lab, Stanford University, 2019.
[2] Mordechai Ben-Ari. Mathematical Logic for Computer Science. 3rd.
Springer Publishing Company, Incorporated, 2012. ISBN: 1447141288.
[3] Metamath. http://us.metamath.org/mpegif/mmset.html#scaxioms.
Accessed: 2021-06-13.
[4] A. M. Turing. ‘On Computable Numbers, with an Application to
the Entscheidungsproblem’. In: Proceedings of the London Mathematical
Society s2-42.1 (Jan. 1937), pp. 230–265. ISSN: 0024-6115. DOI: 10.1112/
plms/s2-42.1.230. URL: https://doi.org/10.1112/plms/s2-42.1.230.
[5] Stephen A. Cook. ‘The Complexity of Theorem-Proving Procedures’.
In: Proceedings of the Third Annual ACM Symposium on Theory of
Computing. STOC ’71. Shaker Heights, Ohio, USA: Association for
Computing Machinery, 1971, pp. 151–158. ISBN: 9781450374644. DOI:
10.1145/800157.805047. URL: https://doi.org/10.1145/800157.805047.
[6] G. Gentzen and M. E. Szabo. ‘The collected papers of Gerhard
Gentzen’. In: 1969.
[7] Raymond M. Smullyan. First-Order Logic. New York [Etc.]Springer-
Verlag, 1968.
[8] Wolfgang Bibel. Automated Theorem Proving. second. First edition
1982. Braunschweig: Vieweg Verlag, 1987.
[9] Jens Otten and Wolfgang Bibel. ‘leanCoP: Lean Connection-Based
Theorem Proving’. In: Journal of Symbolic Computation 36 (2003),
pp. 139–161.
[10] Jens Otten. ‘Restricting backtracking in connection calculi’. In: AI
Communications 23 (2010), pp. 159–182.
[11] Martin Davis and Hilary Putnam. ‘A Computing Procedure for
Quantification Theory’. In: J. ACM 7.3 (July 1960), pp. 201–215. ISSN:
0004-5411.
[12] J. A. Robinson. ‘A Machine-Oriented Logic Based on the Resolution
Principle’. In: J. ACM 12.1 (Jan. 1965), pp. 23–41. ISSN: 0004-5411.

[13] Laura Kovács and Andrei Voronkov. ‘First-Order Theorem Proving
and Vampire’. In: Computer Aided Verification. Ed. by Natasha Shary-
gina and Helmut Veith. Berlin, Heidelberg: Springer Berlin Heidel-
berg, 2013, pp. 1–35. ISBN: 978-3-642-39799-8.
[14] Stephan Schulz, Simon Cruanes and Petar Vukmirović. ‘Faster,
Higher, Stronger: E 2.3’. In: Proc. of the 27th CADE, Natal, Brasil. Ed. by
Pacal Fontaine. LNAI 11716. Springer, 2019, pp. 495–507.
[15] ‘The CADE-27 Automated theorem proving System Competition -
CASC-27’. English (US). In: AI Communications 32.5-6 (2020), pp. 373–
389. ISSN: 0921-7126. DOI: 10.3233/AIC-190627.
[16] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern
Approach. 4th ed. Pearson, 2020.
[17] Gareth James et al. An Introduction to Statistical Learning: with
Applications in R. Springer, 2013. URL: https://faculty.marshall.usc.edu/gareth-james/ISL/.
[18] William L. Hamilton. ‘Graph Representation Learning’. In: Synthesis
Lectures on Artificial Intelligence and Machine Learning 14.3 (2020), pp. 1–159.
[19] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning.
http://www.deeplearningbook.org. MIT Press, 2016.
[20] M. Tim Jones. Models for machine learning. Dec. 2017. URL: https://developer.ibm.com/technologies/artificial-intelligence/articles/cc-models-machine-learning/#.
[21] Anthony Goldbloom. "What algorithms are most successful on Kaggle?"
2016 (accessed September 17, 2020). URL: https://www.kaggle.com/
antgoldbloom/what-algorithms-are-most-successful-on-kaggle.
[22] Volodymyr Mnih et al. ‘Asynchronous Methods for Deep Reinforce-
ment Learning’. In: Proceedings of the 33rd International Conference on
International Conference on Machine Learning - Volume 48. ICML’16.
New York, NY, USA: JMLR.org, 2016, pp. 1928–1937.
[23] John Schulman et al. Proximal Policy Optimization Algorithms. 2017.
arXiv: 1707.06347 [cs.LG].
[24] John Schulman et al. ‘Trust Region Policy Optimization’. In: Proceed-
ings of the 32nd International Conference on International Conference on
Machine Learning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015,
pp. 1889–1897.
[25] Scott Fujimoto, Herke van Hoof and David Meger. ‘Addressing
Function Approximation Error in Actor-Critic Methods’. In: Proceed-
ings of the 35th International Conference on Machine Learning, ICML
2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Ed. by
Jennifer G. Dy and Andreas Krause. Vol. 80. Proceedings of Machine
Learning Research. PMLR, 2018, pp. 1582–1591.

[26] Tuomas Haarnoja et al. ‘Soft Actor-Critic: Off-Policy Maximum
Entropy Deep Reinforcement Learning with a Stochastic Actor’. In:
Proceedings of the 35th International Conference on Machine Learning. Ed.
by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine
Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR,
Oct. 2018, pp. 1861–1870.
[27] Volodymyr Mnih et al. ‘Playing Atari With Deep Reinforcement
Learning’. In: NIPS Deep Learning Workshop. 2013.
[28] Marc G. Bellemare, Will Dabney and Rémi Munos. ‘A Distributional
Perspective on Reinforcement Learning’. In: Proceedings of the 34th
International Conference on Machine Learning - Volume 70. ICML’17.
Sydney, NSW, Australia: JMLR.org, 2017, pp. 449–458.
[29] Will Dabney et al. Distributional Reinforcement Learning With Quantile
Regression. 2018.
[30] Marcin Andrychowicz et al. ‘Hindsight Experience Replay’. In:
Proceedings of the 31st International Conference on Neural Information
Processing Systems. NIPS’17. Long Beach, California, USA: Curran
Associates Inc., 2017, pp. 5055–5065. ISBN: 9781510860964.
[31] David Ha and Jürgen Schmidhuber. ‘Recurrent World Models Fa-
cilitate Policy Evolution’. In: Advances in Neural Information Pro-
cessing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates,
Inc., 2018. URL: https://proceedings.neurips.cc/paper/2018/file/2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf.
[32] Sébastien Racanière et al. ‘Imagination-Augmented Agents for Deep
Reinforcement Learning’. In: Proceedings of the 31st International
Conference on Neural Information Processing Systems. NIPS’17. Long
Beach, California, USA: Curran Associates Inc., 2017, pp. 5694–5705.
ISBN: 9781510860964.

[33] David Silver et al. ‘A general reinforcement learning algorithm that


masters chess, shogi, and Go through self-play’. In: Science 362.6419
(2018), pp. 1140–1144. ISSN: 0036-8075. DOI: 10.1126/science.aar6404.
[34] OpenAI. 2018 (accessed March 23, 2021). URL: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#part-2-kinds-of-rl-algorithms.
[35] David Silver et al. ‘Mastering the Game of Go with Deep Neural
Networks and Tree Search’. In: Nature 529.7587 (jan 2016), pp. 484–
489. ISSN: 0028-0836. DOI: 10.1038/nature16961.
[36] Levente Kocsis and Csaba Szepesvári. ‘Bandit Based Monte-Carlo
Planning’. In: Machine Learning: ECML 2006. Ed. by Johannes
Fürnkranz, Tobias Scheffer and Myra Spiliopoulou. Berlin, Heidel-
berg: Springer Berlin Heidelberg, 2006, pp. 282–293. ISBN: 978-3-540-
46056-5.

[37] Wikimedia Commons Rmoss92. The 4 steps of Monte Carlo tree
search: selection, expansion, simulation, and backpropagation. File:
MCTS-steps.svg. 2020. URL: https://commons.wikimedia.org/wiki/File:MCTS-steps.svg.
[38] Wolfgang Ertel, Johann M. Ph. Schumann and Christian B. Suttner.
‘Learning Heuristics for a Theorem Prover using Back Propagation’.
In: 5. Österreichische Artificial-Intelligence-Tagung. Ed. by Johannes
Retti and Karl Leidlmair. Berlin, Heidelberg: Springer, 1989, pp. 87–
95. ISBN: 978-3-642-74688-8.
[39] Josef Urban, Jiří Vyskočil and Petr Štěpánek. ‘MaLeCoP Machine
Learning Connection Prover’. In: Automated Reasoning with Analytic
Tableaux and Related Methods. Ed. by Kai Brünnler and George
Metcalfe. Berlin, Heidelberg: Springer, 2011, pp. 263–277. ISBN: 978-
3-642-22119-4.
[40] Cezary Kaliszyk and Josef Urban. ‘FEMaLeCoP: Fairly Efficient
Machine Learning Connection Prover’. In: Logic for Programming,
Artificial Intelligence, and Reasoning. Ed. by Martin Davis et al. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2015, pp. 88–96. ISBN: 978-3-
662-48899-7.
[41] Jesse Alama et al. ‘Premise Selection for Mathematics by Corpus
Analysis and Kernel Methods’. In: Journal of Automated Reasoning 52.2
(Feb. 2014), pp. 191–213. ISSN: 1573-0670. DOI: 10.1007/s10817-013-9286-5. URL: https://doi.org/10.1007/s10817-013-9286-5.
[42] Michael Färber, Cezary Kaliszyk and Josef Urban. ‘Monte Carlo
Tableau Proof Search’. In: Automated Deduction – CADE 26. Ed. by
Leonardo de Moura. Cham: Springer International Publishing, 2017,
pp. 563–579. ISBN: 978-3-319-63046-5.
[43] Cezary Kaliszyk et al. ‘Reinforcement Learning of Theorem Proving’.
In: Advances in Neural Information Processing Systems. Ed. by S. Bengio
et al. Vol. 31. Curran Associates, Inc., 2018.
[44] Zsolt Zombori, Josef Urban and Chad E. Brown. ‘Prolog Technology
Reinforcement Learning Prover’. In: Automated Reasoning. Ed. by
Nicolas Peltier and Viorica Sofronie-Stokkermans. Cham: Springer
International Publishing, 2020, pp. 489–507. ISBN: 978-3-030-51054-1.
[45] Miroslav Olsák, Cezary Kaliszyk and Josef Urban. ‘Property Invari-
ant Embedding for Automated Reasoning’. In: ECAI 2020 - 24th
European Conference on Artificial Intelligence, 29 August-8 September
2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 -
Including 10th Conference on Prestigious Applications of Artificial In-
telligence (PAIS 2020). Ed. by Giuseppe De Giacomo et al. Vol. 325.
Frontiers in Artificial Intelligence and Applications. IOS Press, 2020,
pp. 1395–1402.

[46] Stephane Ross, Geoffrey Gordon and Drew Bagnell. ‘A Reduction of
Imitation Learning and Structured Prediction to No-Regret Online
Learning’. In: Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics. Ed. by Geoffrey Gordon, David
Dunson and Miroslav Dudík. Vol. 15. Proceedings of Machine
Learning Research. 2011, pp. 627–635.
[47] Tianqi Chen and Carlos Guestrin. ‘XGBoost: A Scalable Tree Boosting
System’. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. KDD ’16. San
Francisco, California, USA: Association for Computing Machinery,
2016, pp. 785–794. ISBN: 9781450342322. DOI: 10.1145/2939672.2939785. URL: https://doi.org/10.1145/2939672.2939785.
[48] Cezary Kaliszyk and Josef Urban. ‘MizAR 40 for Mizar 40’. In: Journal
of Automated Reasoning 55.3 (Oct. 2015), pp. 245–256. ISSN: 1573-0670.
DOI: 10.1007/s10817-015-9330-8. URL: https://doi.org/10.1007/s10817-015-9330-8.
[49] Adam Grabowski, Artur Kornilowicz and Adam Naumowicz.
‘Mizar in a Nutshell’. In: Journal of Formalized Reasoning 3 (Dec. 2010),
pp. 153–245. DOI: 10.6092/issn.1972-5787/1980. URL: https://jfr.unibo.
it/article/view/1980.
[50] Jens Otten. ‘nanoCoP: Natural Non-clausal Theorem Proving’. In:
Proceedings of the Twenty-Sixth International Joint Conference on Artificial
Intelligence, IJCAI 2017, Melbourne, Australia. Ed. by Carles Sierra.
ijcai.org, 2017, pp. 4924–4928.
[51] Wolfgang Bibel and Jens Otten. ‘From Schütte’s Formal Systems to
Modern Automated Deduction’. In: The Legacy of Kurt Schütte. Ed. by
Reinhard Kahle and Michael Rathjen. Springer, 2020, pp. 217–251.
[52] Jens Otten. ‘Non-clausal Connection Calculi for Non-classical Lo-
gics’. In: 26th International Conference on Automated Reasoning with
Analytic Tableaux and Related Methods. Ed. by R. Schmidt and
C. Nalon. Vol. 10501. Lecture Notes in Artificial Intelligence.
Springer, 2017, pp. 209–227.
[53] Jens Otten and Wolfgang Bibel. ‘Advances in Connection-based
Automated Theorem Proving’. In: Provably Correct Systems. Ed.
by Jonathan Bowen, Mike Hinchey and Ernst-Rüdiger Olderog.
NASA Monographs in Systems and Software Engineering. London:
Springer, 2017, pp. 211–241.
[54] Karel Chvalovský et al. ‘ENIGMA-NG: Efficient Neural and
Gradient-Boosted Inference Guidance for E’. In: Automated Deduc-
tion – CADE 27. Ed. by Pascal Fontaine. Cham: Springer International
Publishing, 2019, pp. 197–215. ISBN: 978-3-030-29436-6.
[55] Jan Jakubův and Josef Urban. ‘ENIGMA: Efficient Learning-Based
Inference Guiding Machine’. In: Intelligent Computer Mathematics. Ed.
by Herman Geuvers et al. Cham: Springer International Publishing,
2017, pp. 292–302. ISBN: 978-3-319-62075-6.

[56] Jan Jakubův et al. ‘ENIGMA Anonymous: Symbol-Independent
Inference Guiding Machine (System Description)’. In: Automated
Reasoning. Ed. by Nicolas Peltier and Viorica Sofronie-Stokkermans.
Cham: Springer International Publishing, 2020, pp. 448–463. ISBN:
978-3-030-51054-1.
[57] Rong-En Fan et al. ‘LIBLINEAR: A Library for Large Linear Classi-
fication’. In: J. Mach. Learn. Res. 9 (June 2008), pp. 1871–1874. ISSN:
1532-4435.
[58] Josef Urban. ‘MPTP 0.2: Design, Implementation, and Initial Exper-
iments’. In: Journal of Automated Reasoning 37.1 (Aug. 2006), pp. 21–
43. ISSN: 1573-0670. DOI: 10.1007/s10817-006-9032-3. URL: https://doi.org/10.1007/s10817-006-9032-3.
[59] Grzegorz Bancerek et al. ‘Mizar: State-of-the-art and Beyond’. In:
Intelligent Computer Mathematics. Ed. by Manfred Kerber et al. Cham:
Springer International Publishing, 2015, pp. 261–279. ISBN: 978-3-319-
20615-8.
[60] Maxwell Crouse et al. ‘A Deep Reinforcement Learning Approach to
First-Order Logic Theorem Proving’. In: CoRR abs/1911.02065 (2019).
arXiv: 1911.02065. URL: http://arxiv.org/abs/1911.02065.
[61] Peter Baumgartner, Joshua Bax and Uwe Waldmann. ‘Beagle – A
Hierarchic Superposition Theorem Prover’. In: Automated Deduction -
CADE-25. Ed. by Amy P. Felty and Aart Middeldorp. Cham: Springer
International Publishing, 2015, pp. 367–377. ISBN: 978-3-319-21401-6.
[62] Thibault Gauthier, Cezary Kaliszyk and Josef Urban. ‘TacticToe:
Learning to Reason with HOL4 Tactics’. In: LPAR-21. 21st Interna-
tional Conference on Logic for Programming, Artificial Intelligence and
Reasoning. Ed. by Thomas Eiter and David Sands. Vol. 46. EPiC Series
in Computing. EasyChair, 2017, pp. 125–143. DOI: 10.29007/ntlb. URL:
https://easychair.org/publications/paper/WsM.
[63] Thibault Gauthier et al. ‘TacticToe: Learning to Prove with Tactics’.
In: Journal of Automated Reasoning 65.2 (Feb. 2021), pp. 257–286. ISSN:
1573-0670. DOI: 10.1007/s10817-020-09580-x. URL: https://doi.org/10.
1007/s10817-020-09580-x.
[64] Lasse Blaauwbroek, Josef Urban and Herman Geuvers. ‘The Tac-
tician’. In: Intelligent Computer Mathematics. Ed. by Christoph Ben-
zmüller and Bruce Miller. Cham: Springer International Publishing,
2020, pp. 271–277. ISBN: 978-3-030-53518-6.
[65] Lasse Blaauwbroek, Josef Urban and Herman Geuvers. ‘Tactic Learn-
ing and Proving for the Coq Proof Assistant’. In: LPAR23. LPAR-23:
23rd International Conference on Logic for Programming, Artificial Intelli-
gence and Reasoning. Ed. by Elvira Albert and Laura Kovacs. Vol. 73.
EPiC Series in Computing. EasyChair, 2020, pp. 138–150. DOI: 10.29007/wg1q. URL: https://easychair.org/publications/paper/JLdB.

[66] Aditya Paliwal et al. ‘Graph Representations for Higher-Order Logic
and Theorem Proving’. In: Proceedings of the AAAI Conference on
Artificial Intelligence 34.03 (Apr. 2020), pp. 2967–2974. DOI: 10.1609/
aaai.v34i03.5689. URL: https://ojs.aaai.org/index.php/AAAI/article/
view/5689.
[67] Stanislas Polu and Ilya Sutskever. ‘Generative Language Modeling
for Automated Theorem Proving’. In: CoRR abs/2009.03393 (2020).
URL: https://arxiv.org/abs/2009.03393.

[68] Maxime Gasse et al. ‘Exact Combinatorial Optimization with Graph


Convolutional Neural Networks’. In: Advances in Neural Information
Processing Systems 32. 2019.
[69] Robert Kowalski. ‘Algorithm = Logic + Control’. In: Commun. ACM
22.7 (July 1979), pp. 424–436. ISSN: 0001-0782. DOI: 10.1145/359131.
359136. URL: https://doi.org/10.1145/359131.359136.
[70] Greg Brockman et al. ‘Openai gym’. In: arXiv preprint arXiv:1606.01540
(2016).
[71] The Computer Language Benchmarks Game. https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html. Accessed: 2021-06-14.
[72] Adam Paszke et al. ‘PyTorch: An Imperative Style, High-Performance
Deep Learning Library’. In: Advances in Neural Information Processing
Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019,
pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[73] R. Letz, K. Mayr and C. Goller. ‘Controlled integration of the
cut rule into connection tableau calculi’. In: Journal of Automated
Reasoning 13.3 (Oct. 1994), pp. 297–337. ISSN: 1573-0670. DOI: 10.1007/
BF00881947. URL: https://doi.org/10.1007/BF00881947.
[74] Reinhold Letz and Gernot Stenz. ‘Model Elimination and Connection
Tableau Procedures’. In: Handbook of Automated Reasoning. NLD: El-
sevier Science Publishers B. V., 2001, pp. 2015–2112. ISBN: 0444508120.
[75] Jens Otten. ‘Restricting Backtracking in Connection Calculi’. In: AI
Commun. 23.2–3 (Apr. 2010), pp. 159–182. ISSN: 0921-7126.
[76] Michael Färber, Cezary Kaliszyk and Josef Urban. ‘Machine Learn-
ing Guidance for Connection Tableaux’. In: Journal of Automated Reas-
oning 65.2 (Feb. 2021), pp. 287–320. ISSN: 1573-0670. DOI: 10.1007/s10817-020-09576-7. URL: https://doi.org/10.1007/s10817-020-09576-7.
[77] Daniel Kühlwein et al. ‘MaSh: Machine Learning for Sledgehammer’.
In: Interactive Theorem Proving. Ed. by Sandrine Blazy, Christine
Paulin-Mohring and David Pichardie. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2013, pp. 35–50. ISBN: 978-3-642-39634-2.

[78] Alexander A. Alemi et al. ‘DeepMath - Deep Sequence Models for
Premise Selection’. In: Proceedings of the 30th International Conference
on Neural Information Processing Systems. NIPS’16. Barcelona, Spain:
Curran Associates Inc., 2016, pp. 2243–2251. ISBN: 9781510838819.
[79] Mingzhe Wang et al. ‘Premise Selection for Theorem Proving by Deep
Graph Embedding’. In: Advances in Neural Information Processing
Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc.,
2017. URL: https://proceedings.neurips.cc/paper/2017/file/18d10dc6e666eab6de9215ae5b3d54df-Paper.pdf.
[80] Ibrahim Abdelaziz et al. ‘An Experimental Study of Formula Em-
beddings for Automated Theorem Proving in First-Order Logic’. In:
CoRR abs/2002.00423 (2020). arXiv: 2002.00423. URL: https://arxiv.
org/abs/2002.00423.
[81] Maxwell Crouse et al. Improving Graph Neural Network Representations
of Logical Formulae with Subgraph Pooling. 2020. arXiv: 1911 . 06904
[cs.AI].
[82] Michael Sejr Schlichtkrull et al. ‘Modeling Relational Data with
Graph Convolutional Networks’. In: The Semantic Web - 15th Interna-
tional Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018,
Proceedings. Ed. by Aldo Gangemi et al. Vol. 10843. Lecture Notes in
Computer Science. Springer, 2018, pp. 593–607. DOI: 10.1007/978-3-319-93417-4_38. URL: https://doi.org/10.1007/978-3-319-93417-4_38.
[83] Marc Brockschmidt. ‘GNN-FiLM: Graph Neural Networks with
Feature-wise Linear Modulation’. In: Proceedings of the 37th Interna-
tional Conference on Machine Learning. Ed. by Hal Daumé III and Aarti
Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR,
13–18 Jul 2020, pp. 1144–1152. URL: http://proceedings.mlr.press/v119/
brockschmidt20a.html.
[84] Bishan Yang et al. ‘Embedding Entities and Relations for Learning
and Inference in Knowledge Bases’. In: Proceedings of the International
Conference on Learning Representations (ICLR) 2015. May 2015. URL:
https://www.microsoft.com/en-us/research/publication/embedding-entities-and-relations-for-learning-and-inference-in-knowledge-bases/.
[85] Diederik P. Kingma and Jimmy Ba. ‘Adam: A Method for Stochastic
Optimization’. In: 3rd International Conference on Learning Representa-
tions, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. URL: http://arxiv.org/abs/1412.6980.
[86] Leslie N. Smith. ‘Cyclical Learning Rates for Training Neural Net-
works’. In: 2017 IEEE Winter Conference on Applications of Computer
Vision (WACV). 2017, pp. 464–472. DOI: 10.1109/WACV.2017.58.

[87] Matthias Fey and Jan E. Lenssen. ‘Fast Graph Representation Learn-
ing with PyTorch Geometric’. In: ICLR 2019 Workshop on Representa-
tion Learning on Graphs and Manifolds. New Orleans, USA, 2019. URL:
https://arxiv.org/abs/1903.02428.
[88] William Falcon et al. ‘PyTorch Lightning’. In: GitHub. Note:
https://github.com/PyTorchLightning/pytorch-lightning 3 (2019).
