Part III Logic
Based on lectures by T. E. Forster
Notes taken by Dexter Chua
Lent 2017
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
This course is the sequel to the Part II courses in Set Theory and Logic and in Automata
and Formal Languages lectured in 2015-6. (It is already being referred to informally
as “Son of ST&L and Automata & Formal Languages”). Because of the advent of
that second course this Part III course no longer covers elementary computability in
the way that its predecessor (“Computability and Logic”) did, and this is reflected in
the change in title. It will say less about Set Theory than one would expect from a
course entitled ‘Logic’; this is because in Lent term Benedikt owe will be lecturing a
course entitled ‘Topics in Set Theory’ and I do not wish to tread on his toes. Material
likely to be covered include: advanced topics in first-order logic (Natural Deduction,
Sequent Calculus, Cut-elimination, Interpolation, Skolemisation, Completeness and
Undecidability of First-Order Logic, Curry-Howard, Possible world semantics, odel’s
Negative Interpretation, Generalised quantifiers. . . ); Advanced Computability (
λ
-
representability of computable functions, Tennenbaum’s theorem, Friedberg-Muchnik,
Baker-Gill-Solovay. . . ); Model theory background (ultraproducts, Los’s theorem, ele-
mentary embeddings, omitting types, categoricity, saturation, Ehrenfeucht-Mostowski
theorem. . . ); Logical combinatorics (Paris-Harrington, WQO and BQO theory at least
as far as Kruskal’s theorem on wellquasiorderings of trees. . . ). This is a new syllabus
and may change in the coming months. It is entirely in order for students to contact
the lecturer for updates.
Pre-requisites
The obvious prerequisites from last year’s Part II are Professor Johnstone’s Set Theory
and Logic and Dr Chiodo’s Automata and Formal Languages, and I would like to
assume that everybody coming to my lectures is on top of all the material lectured
in those courses. This aspiration is less unreasonable than it may sound, since in
2016-7 both these courses are being lectured the term before this one, in Michaelmas;
indeed supervisions for Part III students attending them can be arranged if needed:
contact me or your director of studies. I am lecturing Part II Set Theory and Logic
and I am even going to be issuing a “Sheet 5” for Set Theory and Logic, of material
likely to be of interest to people who are thinking of pursuing this material at Part
III. Attending these two Part II courses in Michaelmas is a course of action that may
app eal particularly to students from outside Cambridge.
Contents
1 Proof theory and constructive logic
1.1 Natural deduction
1.2 Curry–Howard correspondence
1.3 Possible world semantics
1.4 Negative interpretation
1.5 Constructive mathematics
2 Model theory
2.1 Universal theories
2.2 Products
2.3 Ehrenfeucht–Mostowski theorem
2.4 The omitting type theorem
3 Computability theory
3.1 Computability
3.2 Decidable and semi-decidable sets
3.3 Computability elsewhere
3.4 Logic
3.5 Computability by λ-calculus
3.6 Reducibility
4 Well-quasi-orderings
1 Proof theory and constructive logic
1.1 Natural deduction
The first person to have the notion of “proof” as a mathematical notion was
probably odel, and he needed this to write down the incompleteness theorem.
The notion of proof he had was a very unintuitive notion. It is not very easy to
manipulate, but they are easy to reason about.
In later years, people came up with more “natural” ways of defining proofs,
and they are called natural deduction. In the formalism we learnt in IID Logic
and Set Theory, we had three axioms only, and one rule of inference. In natural
deduction, we have many rules of deduction.
We write or rules in the following form:
A B
-int
A B
This says if we know
A
and
B
are true, then we can conclude
A B
. We call
the things above the line the premises, and those below the line the conclusions.
We can write out the other rules as follows:
A B
-int
A B
A B
-elim
A
A B
-elim
B
A
-int
A B
B
-int
A B
A A B
-elim
B
Here we are separating these rules into two kinds the first column is the
introduction rules. These tell us how we can introduce a
or
into our
conclusions. The second column is the elimination rules. These tell us how we
can eliminate the or from our premises.
In general, we can think of these rules as “LEGO pieces”, and we can use
them to piece together to get “LEGO assembles”, i.e. proofs.
Example. For example, we might have a proof that looks like
A
-int
A B A B C
-elim
C
This corresponds to a proof that we can prove
C
from
A
and
A B C
. Note
that sometimes we are lazy and don’t specify the rules we are using.
Instead of trying to formally describe how we can put these rules together
to form a proof, we will work through some examples as we go, and it should
become clear.
We see that we are missing some rules from the table, as there are no
introduction rule for and elimination rule for .
We work with
first. How we can prove
A B
? To do so, we assume
A
,
and then try to prove
B
. If we can do so, then we have proved
A B
. But we
cannot express this in the form of our previous rules. Instead what we want is
some “function” that takes proof trees to proof trees.
The actual rule is as follows: suppose we have derivation that looks like
A
.
.
.
C
This is a proof of C under the assumption A. The
-introduction rule says we
can take this and turn it into a proof of A C.
.
.
.
A C
This rule is not a LEGO piece. Instead, it is a magic wand that turns a LEGO
assembly into a LEGO assembly.
But we do not want magic wands in our proofs. We want to figure out some
more static way of writing this rule. We decided that it should look like this:
[A]
.
.
.
C
-int
A C
Here the brackets denotes that we have given up on our assumption
A
to obtain
the conclusion
A C
. After doing so, we are no longer assuming
A
. When we
work with complicated proofs, it is easy to get lost where we are eliminating the
assumptions. So we would label them, and write this as, say
[A]
1
.
.
.
C
-int (1)
A C
Example. We can transform our previous example to say
[A]
1
-int
A B A B C
-elim
C
-int (1)
A C
Originally, we had a proof that
A
and
A B C
proves
C
. Now what we have
is a proof that A B C implies A C.
Next, we need an elimination rule of
A B
. What should this be? Suppose
we proved both that
A
proves
C
, and
B
proves
C
. Then if we know
A B
, then
we know C must be true.
In other words, if we have
A B
A
.
.
.
C
B
.
.
.
C
then we can deduce C. We write this as
A B
[A]
.
.
.
C
[B]
.
.
.
C
-elim
C
There is an obvious generalization to many disjunctions:
A
1
· · · A
n
[A
1
]
.
.
.
C
[A
n
]
.
.
.
C
-elim
C
How about when we have an empty disjunction? The empty disjunction is just
false. So this gives the rule
B
In other words, we can prove anything assume falsehood. This is known as ex
falso sequitur quolibet.
Note that we did not provide any rules for talking about negation. We do
not need to do so, because we just take ¬A to be A .
So far, what we have described it constructive propositional logic. What is
missing? We cannot use our system to prove the law of excluded middle,
A ¬A
,
or the law of double negation
¬¬A A
. It is not difficult to convince ourselves
that it is impossible to prove these using the laws we have described above.
To obtain classical propositional logic, we need just one more rule, which is
[A ]
.
.
.
A
If we add this, then we get classical propositional calculus, and it is a theorem
that any truth-table-tautology (i.e. propositions that are always true for all
possible values of
A, B, C
etc) can be proved using natural deduction with the
law of excluded middle.
Example. Suppose we want to prove
A (B C) ((A B) (A C))
How can we possibly prove this? The only way we can obtain this is to get
something of the form
[A (B C)]
.
.
.
(A B) (A C)
-int
A (B C) ((A B) (A C))
Now the only way we can get the second-to-last conclusion is
A B [A]
.
.
.
C
A C
and then further eliminating
A B
gives us (
A B
)
(
A C
). At this
point we might see how we can patch them together to get a proof:
[A]
3
[A (B C)]
1
B C
[A B]
2
[A]
3
B
C
-int (3)
A C
-int (2)
(A B) (A C)
-int (1)
A (B C) ((A B) (A C))
Note that when we did
-int (3), we consumed two copies of
A
in one go. This
is allowed, as an assumption doesn’t become false after we use it once.
However, some people study logical systems that do not allow it, and demand
that assumptions can only be used once. These are known as resource logics,
one prominent example of which is linear logic.
Example. Suppose we want to prove A (B A). We have a proof tree
[A]
1
[B]
2
A
-int (2)
B A
-int (1)
A (B A)
How did we manage to do the step from
A B
to
A
? One can prove it as follows:
A B
-int
A B
-elim
A
but this is slightly unpleasant. It is redundant, and also breaks some nice
properties of natural deduction we are going to prove later, e.g. the subformula
property. So instead, we just put an additional weakening rule that just says we
can drop any assumptions we like at any time.
Example. Suppose we wanted to prove
A (B C) ((A B) (A C)).
This is indeed a truth-table tautology, as we can write out the truth table and
see this is always true.
If we try very hard, we will find out that we cannot prove it without using the
law of excluded middle. In fact, this is not valid in constructive logic. Intuitively,
the reason is that assuming
A
is true, which of
B
or
C
is true can depend on
why
A
is true, and this it is impossible to directly prove that either
A B
is
true, or A C is true.
Of course, this is true in classical logic.
Exercise. Prove the following:
(P Q) ((Q R) (P R))
(A C) ((A B) C)
((A B) C) (A C)
P (¬P Q)
A (A A)
(P Q) (((P R) (Q S)) R S)
(P Q) (((P R) (Q S)) R S)
A ((((A B) B) C) C)
((((P Q) P ) P ) Q) Q
The last two are hard.
Instead of thinking of natural deduction as an actual logical theory, we often
think of it as a “platform” that allows us to describe different logical theories,
which we can do by introducing other connectives.
Example.
For example, we might want to describe first-order logic. We can
give introduction rules for as follows:
ϕ(t)
-int
x
ϕ(x)
Similarly, we can have an elimination rule for easily:
x
ϕ(x)
-elim
ϕ(t)
The introduction rule for
is a bit more complicated. The rule again looks
something like
ϕ(x)
-int
x
ϕ(x)
but for such a proof to be valid, the
x
must be “arbitrary”. So we need an
additional condition on this rule that there is no free occurrence of x’s.
Finally, suppose we know
x
ϕ
(
x
). How can we use this to deduce some
p
?
Suppose we know that
ϕ
(
x
) implies
p
, and also
p
does not depend on
x
. Then
just knowing that there exists some x such that ϕ(x), we know p must be true.
x
ϕ(x)
ϕ(x)
.
.
.
p
-elim
p
Again we need x to be arbitrary in p.
Example.
We can also use natural deduction for “naive set theory”, which we
will later find to be inconsistent. We can give an -introduction rule
ϕ(x)
-int
x {y : ϕ(y)}
and a similar elimination rule
x {y : ϕ(y)}
-elim
ϕ(x)
One cannot expect that we can write down whatever introduction and
elimination rule we like, and expect to get a sensible theory. For example, if we
introduce a connective & with the following silly rules:
A
&-int
A&B
A&B
&-elim
B
then this would immediately give us a theory that proves anything we like.
There are in fact two problems with our introduction and elimination rules.
The first is that once we use our
A
to deduce
A
&
B
, it is completely gone, and
we cannot recover anything about
A
from
A
&
B
. The second problem is that,
obviously, proving
A
&
B
doesn’t really involve
B
at all, but we can use
A
&
B
to
deduce something about B.
In general, to obtain a sensible natural deduction theory, we need harmony.
Definition
(Harmony)
.
We say rules for a connective $ are harmonious if
φ
$
ψ
is the strongest assertion you can deduce from the assumptions in the rule of
$-introduction, and
φ
$
ψ
is the weakest thing that implies the conclusion of the
$-elimination rule.
We will not make the notion more precise than this.
Example.
The introduction and elimination rules for
we presented are cer-
tainly harmonious, because the conclusions and assumptions are in fact equiva-
lent.
One can also check that the introduction and elimination rules for
and
are harmonious.
Example (Russel’s paradox). In naive set theory, consider
R = {x : x x ⊥}.
Then we can have the deductions
R R
R R
-elim
R R
-elim
We can then move on to use this to deduce that R R:
[R R]
1
[R R]
1
-elim
R R
-elim
-int (1)
R R
-int
R R
Now we have proved
R R
. But we previously had a proof that
R R
gives a
contradiction. So what we are going to do is to make two copies of this proof:
[R R]
1
[R R]
1
-elim
R R
-elim
-int (1)
R R
-int
R R
[R R]
2
[R R]
2
-elim
R R
-elim
-int (2)
R R
-elim
So we showed that we have an inconsistent theory.
Note that we didn’t use a lot of “logical power” here. We didn’t use the
law of excluded middle, so this contradiction manifests itself even in the case
of constructive logic. Moreover, we didn’t really use many axioms of naive set
theory. We didn’t even need things like extensionality for this to work.
Now we notice that this proof is rather weird. We first had an
-introduction
(2), and then subsequently eliminated it immediately. In general, if we have a
derivation of the form
P
.
.
.
Q
-int
P Q
.
.
.
P
-elim
Q
We can just move the second column to the top of the first to get
.
.
.
Q
In general,
Definition
(Maximal formula)
.
We say a formula in a derivation is maximal iff
it is both the conclusion of an occurrence of an introduction rule, and the major
premiss of an occurrence of the elimination rule for the same connective.
Here it is important that we are talking about the major premise. In a
deduction
A B A
-elim
B
we call
A B
the major premise, and
B
the minor premise. In the case of
and , the terms are symmetric, and we do not need such a distinction.
This distinction is necessary since a deduction of the form
(A B) C
[A]
.
.
.
B
-int
A B
-elim
C
One can prove that any derivation in propositional logic can be converted to
one that does not contain maximal formulae. However, if we try to eliminate
the maximal formula in our proof of inconsistency of naive set theory, we will
find that we will end up with the same proof!
1.2 Curry–Howard correspondence
Return to the days where we did IID Logic and Set Theory. Instead of natural
deduction, we bootstrapped our system with two axioms K and S:
K: A (B A)
S: [A (B C)] [(A B) (A C)]
It is a fact that
{K, S}
axiomatizes the conditional (
) fragment of constructive
propositional logic. To axiomatize the conditional fragment of classical logic, we
need a new law, Peirce’s law :
((A B) A) A.
In particular, we cannot prove Peirce’s law from K and S.
How can we go about proving that Peirce’s law is unprovable? The most
elementary way of doing so would be to try to allow for more truth values, such
that
K
,
S
and modus ponens hold, but Peirce’s law does not. It turns out three
truth values are sufficient. We consider the following truth table for :
1 2 3
1 1 2 3
2 1 1 3
3 1 1 1
Here we interpret 1 as “true”, 3 as “false”, and 2 as “maybe”.
By writing out the truth tables, one can verify that the formulas
K
and
S
always take truth value 1. Also, if
A
and
A B
are both 1, then so is
B
. So
by induction on the structure of the proof, we know that anything deduced from
K and S and modus ponens will have truth value 1.
However, if we put
A
= 2 and
B
= 3, then ((
A B
)
A
)
A
has value 2,
not 1. So it cannot possibly be deduced from K and S.
This is, of course, a very unsatisfactory answer. The truth table was just
pulled out of a hat, and we just checked to see that Peirce’s law doesn’t hold. It
doesn’t tell us where the truth table came from, and why we can’t prove Peirce’s
law.
We will answer the second question first using the Curry–Howard correspon-
dence. Then we will see how we can come up with this truth table using possible
world semantics in the next chapter.
Consider the piece of syntax
A B.
So far, we have viewed this as a logical formula
A
and
B
are formulae, and
denotes implication. However, in mathematics, there is another common use
of this notation we can think of
A
and
B
as sets, and
A B
is the set of
all functions from
A
to
B
. The idea of the Curry–Howard correspondence is to
view implication as functions between sets.
While this might seem like a bizarre thing to do, it actually makes sense.
Given a formula
A
, we can view
A
as a “set”, whose “elements” are the proofs
of
A
. Now if we know
A B
, this means any proof of
A
becomes a proof of
B
. In other words, we have a function
A B
. As we move on, we will see that
this idea of viewing a formula
A
as the set of all proofs of
A
gives a very close
correspondence between the logical connectives and set-theoretic constructions.
This is also known as propositions as types, and it is common to call
A
,
B
etc “types” instead of “sets”.
Under this correspondence, modus ponens has a very natural interpretation.
We can just think of the rule
A A B
B
as function application! If we have an element
a A
, and an element
f A B
,
then we have
f
(
a
)
B
. Now it is awkward to write
f A B
instead of
f
:
A B
. So in fact, we will write
f
:
A B
instead. Moreover, we will use
the colon instead of for all types. So we can write
x : A f : A B
f(x) : B
Often, we will just write
fx
instead of
f
(
x
). If we write
fxy
, then it means
(fx)y, but we will usually put brackets to indicate.
Example.
What is the interpretation of
K
under the Curry–Howard correspon-
dence? It corresponds to a function
A (B A).
Let’s try to think of a function that does this. Given an element
a
:
A
, we need
to find a function
B A
. Since the only thing we know about is
a
, we could
return the constant function that always returns
a
. So for each
a
:
A
, we define
K(a) to be the constant function a.
This is indeed where the name
K
came from. It’s just that the inventor was
German, and so it is not C.
What about conjunction? To prove
A B
, we need to prove
A
and
B
. In
other words, an element of
A B
should be an element of the Cartesian product
A × B.
We can write the introduction rule as
x : A y : B
hx, yi : A B
Here we are using the angled brackets to denote ordered pairs. The elimination
rules are just given by the projection fst : A × B A and snd : A × B B:
x : A B
fst(x) : A
x : A B
snd(x) : B
Some people write fst and snd as π
1
and π
2
instead.
We are not going to talk about the rules for “or”, because they are horrendous.
But if we think hard about it, then
A B
corresponds to the disjoint union of
A and B.
So far, we only know how to decorate our LEGO pieces. When we assemble
them to get proofs, we can get more complicated tress. Suppose we had the tree
[A B]
1
-elim
B
[A B]
1
-elim
A
-int
B A
-int (1)
(A B) (B A)
We now try to decorate this tree:
x : [A B]
1
-elim
snd(x) : B
x : [A B]
1
-elim
fst(x) : A
-int
hsnd(x), fst(x)i : B A
-int (1)
(A B) (B A)
Note that both [
A B
]
1
are decorated using the same letter
x
. This is since
they are cancelled at the same time, so we are using the same instance of
A B
in the statement (A B) (B A).
Now the remaining question to answer is how we are going to decorate the
-introduction. We want some notation that denotes the function that takes
x
and returns
hsnd
(
x
)
, fst
(
x
). More generally, if we have an expression
ϕ
(
x
) in
x
,
we want a notation that denotes the function that takes in x and returns ϕ(x).
The convention is to denote this using a λ:
λx.ϕ(x).
These expressions we produce are known as λ expressions.
For example, in the previous case, the final result is
λx.hfst(x), snd(x)i : (A B) (B A).
The general decoration rule for -introduction is
x : [A]
1
.
.
.
ϕ(x) : C
-int (1)
λx.ϕ(x) : A C
Recall that previously we had a rule for getting rid of maximal formulae. Given
a tree of the form
P
.
.
.
Q
-int
P Q
.
.
.
P
-elim
Q
We can just move the second column to the top of the first to get
.
.
.
Q
This conversion corresponds to the conversion
(λx.ϕ(x))a ϕ(a),
which is just applying the function! This process is known as β-reduction.
Since there is
β
-reduction, there is, of course, also
α
-reduction, but this
is boring. It is just re-lettering
λx.φ
(
x
) to
λy
(
y
). There will also be an
η-reduction, and we will talk about that later.
In the following exercise/example, note that we have a convention that if we
write A B C, then we mean A (B C).
Exercise. Find λ-expressions that encode proofs of
(i) ((A B) (C D)) ((A C) (B D))
(ii) (A B) (B C) (A C)
(iii) ((A B) A) (A B) B
(iv) (A B C) (B A C)
(v) (A B C) ((A B) C)
(vi) (B A C) (A B C)
Note that statements of the form
A B C
are to be interpreted as
A
(B C).
Solutions.
(i) λf.λx.hfst(f)(fst(x)), snd(f)(snd(x))i
(ii) λf.λg.λa.g(fa)
(iii) λf.λg.g(fg)
(iv) λf.λb.λa.(f a)b
(v) λf.λx.f(fst x)(snd x)
(vi) λf.λa.λb.fhb, ai
One can write out the corresponding trees explicitly. For example, (iii) can be
done by
g : [A B]
1
f : [(A B) A]
2
-elim
fg : A g : [A B]
1
-elim
g(fg)
-int (1)
λg.g(fg) : (A B) B
-int (2)
λf.λg.g(fg) : (A B) A (A B) B
Note that we always decorate assumptions with a single variable, say
f
, even
if they are very complicated. For example, if we have an assumption
A B
, it
might be tempting to decorate it as ha, bi, but we do not.
Now what we have is that for any proof tree, we obtain a
λ
formula. But for
any formula, it can have many proofs, and so it can be given by many
λ
formula.
Example. Consider
x : [A]
1
f : [A A]
2
-elim
fx : A
-int (1)
λx.fx : A A
-int (2)
λf.λx.fx : (A A) (A A)
This is really just the identity function!
We can also try something slightly more complicated.
x : [A]
1
f : [A A]
2
fx : A f : [A A]
2
f(fx) : A
-int (1)
λx.f(f(x)) : A A
-int (2)
λf.λx.f(fx) : (A A) (A A)
This is another
λ
term for (
A A
)
(
A A
). This says given any
f
,
we return the function that does
f
twice. These are two genuinely different
functions, and thus these two
λ
terms correspond to two different proofs of
(A A) (A A).
Note that given any proof tree, we can produce some
λ
expression representing
the proof rather systematically. However, given any
λ
expression, it is difficult
to figure out what it is a proof of, or if it represents a proof at all. For example,
if we write down
λx.xx,
then we have no idea what this can possibly mean.
The problem is that we don’t know what formula, or type each variable is
representing. The expression
λf.λx.f(fx)
is confusing, because we need to guess that
x
is probably just some arbitrary
variable of type
A
, and that
f
is probably a function
A A
. Instead, we can
make these types explicit:
λf
AA
.λx
A
.f
AA
(f
AA
x
A
).
Given such a term, we can systematically deduce that it is of type (
A A
)
(
A A
), i.e. it encodes a proof of (
A A
)
(
A A
). Such an expression
with explicit types is known as a typed
λ
term, and one without types is known
as an untyped λ term.
Moreover, it is also straightforward to reconstruct the proof tree from this
expression. Of course, we need to make sure the typing of the expression is valid.
For example, we cannot write
f
AA
x
B
,
since
f
is not a function from
B
. Assuming we restrict to these valid terms, we
have established a correspondence
Proofs in propositional natural deduction are in a 1-to-1 correspondence
with typed λ terms.
That is, if we also make it clear what the rules for are.
Finally, what about
? To what set does the proposition
correspond? The
proposition
should have no proofs. So we would think
should correspond
to .
In propositional calculus, the defining property of
is that we can prove any
proposition we like from
. Indeed, the empty set is a set that admits a function
to any set
X
whatsoever, and in fact this function can be uniformly defined for
all sets, by the empty function. We again see that the ideas of propositional
calculus correspond nicely with the properties of sets.
Example. Consider the law of excluded middle:
((A ) ) A.
Can we write down a λ expressions for this?
We will argue rather informally. If there is a
λ
expression of this type, then
it should define the function “uniformly” without regards to what
A
is about.
We now see what this function could be.
If
A
is empty, then
A
contains exactly the empty function. So
(
A
)
is the empty set, since there is no function from a non-
empty set to the empty set. And there is a function
A
, namely the
empty function again.
If
A
is non-empty, then
A
contains no functions, so (
A
)
contains the empty function. So ((
A
)
)
A
is given by exactly
an element of A.
We see that to construct such a function, we need to first know if
A
is empty or
not. If it is non-empty, then we have to arbitrarily pick an element of
A
. Thus,
there is no way we can uniformly construct such a function without knowing
anything about A.
Proposition.
We cannot prove ((
A B
)
A
)
A
in natural deduction
without the law of excluded middle.
Proof. Suppose there were a lambda term P : ((A B) A) A.
We pick
B = {0, 1}, A = {0, 1, 2, 3, 4}.
In this setting, any function f : A B identifies a distinguished member of B,
namely the one that is hit by more members of
A
. We know
B A
, so this
is an element in
A
. So we have a function
F
: (
A B
)
A
. Then
P
(
F
) is a
member of A. We claim that in fact P (F ) B.
Indeed, we write
P
(
F
) =
a A
. Let
π
be a 3-cycle moving everything in
A \ B
and fixing
B
. Then this induces an action on functions among
A
and
B
by conjugation. We have.
π(P (F )) = π(a).
But since
P
is given by a
λ
term, it cannot distinguish between the different
memebers of A \ B. So we have
P (π(F)) = π(a).
We now claim that
π
(
F
) =
F
. By construction,
π
(
F
) =
π
1
F π
. Then for
all f : A B, we have
π(F )(f ) = (π
1
f π)(f) = (π
1
F )(π(f))
= π
1
(F (π(f))) = π
1
(F (f)) = F (f ).
Noting that
π
fixes
f
, the only equality that isn’t obvious from unravelling the
definitions is
F
(
π
(
f
)) =
F
(
f
). Indeed, we have
π
(
f
) =
π
1
f π
=
f π
. But
F
(
f π
) =
F
(
f
) because our explicit construction of
F
depends only on the
image of
f
. So
π
(
F
) =
F
. So this implies
a
=
π
(
a
), which means
a B
. So we
always have P (F ) B.
But this means we have found a way of uniformly picking a distinguished
element out of a two-membered set. Indeed, given any such set
B
, we pick any
three random objects, and combine it with
B
to obtain
A
. Then we just apply
P (F ). This is clearly nonsense.
1.3 Possible world semantics
For classical logic, semantics was easy. We had a collection of propositions
P
,
and a model is just a function
P {>, ⊥}
. We can then define truth in this
model recursively in the obvious way.
It turns out to get a sensible model theory of constructive logic, we need to
consider many such objects.
Definition
(Possible world semantics)
.
Let
P
be a collection of propositions.
A world is a subset
w P
. A model is a collection
W
of worlds, and a partial
order on W called accessibility, satisfying the persistence property:
If p P is such that p w and w
0
w, then p w
0
.
Given any proposition ϕ, we define the relation w ϕ by
w 6
If ϕ is atomic (and not ), then then w ϕ iff ϕ w.
w ϕ ψ iff w ϕ and w ψ.
w ϕ ψ iff w ϕ or w ψ.
w ϕ ψ iff (for all w
0
w, if w
0
ϕ, then w
0
ψ).
We will say that w “believes” ϕ if w ϕ, and that w “sees” w
0
if w
0
w.
We also require that there is a designated minimum element under
, known
as the root world . We can think of a possible world model as a poset decorated
with worlds, and such a poset is called a frame.
All but the last rule are obvious. The idea is that each world can have some
beliefs, namely the subset
w p
. The important idea is that if a world
w
does
not believe in
p
, it does not mean it believes that
p
is false. It just that
w
remains ignorant about whether p is true.
Now we have w
0
w if w
0
a more knowledgable version of w that has more
beliefs. Alternatively,
w
0
is a world whose beliefs are compatible with that of
w
. It is easy to show by induction that if
ϕ
is any formula that
w
believes, and
w
0
w
, then
w
0
also believes in
ϕ
. Then we can interpret the law of implication
as saying
w
believes that
p q
iff for any possible thinkable world where
p
is
true, then q is also true”.
Note that if we only have one world, then we recover models of classical logic.
What happens to negation? By definition, we have
w ¬A
iff
w A
.
Expanding the definition, believing that
¬A
means there is no possible world
compatible with
w
in which
A
is true. This is much stronger than just saying
w 6 A, which just says we do not hold any opinion on whether A is true.
Example.
We want to construct a world
w
where
w 6 A ¬A
. So we need a
W such that w neither believes A nor believes ¬A.
We can achieve the former by just making the world not believe in
A
. To
make the world not believe
¬A
, it needs to be able to see a world
w
0
such that
w
0
A. So we can consider the following world:
w
0
A
w
6 A
Example.
This time consider the statement
w ¬¬A A
. This takes some
work to unravel.
We need a world
w
such that for all
w
0
w
, if
w
0
¬¬A
, then
w
0
A
. If
we unravel a bit more, we find that
w
0
¬¬A
means for any
w
00
w
0
, we can
find some w
000
w
00
such that w
000
A.
It is easy to see that actually in the model of the previous question,
w 6
¬¬A A, since w ¬¬A, but w 6 A.
Example. The following model is a counter-model for ¬¬A ¬A:
6 A
A
6 A
where the bottom world believes in neither
¬¬A
nor
¬A
, since it sees worlds
where ¬A is true and also worlds where A is true.
Usually, we don’t talk about a particular world believing in something. We
talk about an entire structure of possible worlds believing something.
Notation. If the root world of our model is w, then we write
ϕ w ϕ.
Almost by definition, if ϕ, then for any world w
0
, we have w
0
ϕ.
Exercise. Find a possible world model that does not believe Peirce’s law.
One can readily verify that classically, we have (
A B
)
B
is equivalent
to A B, but they are not constructively equivalent.
Exercise.
Find a possible world model that
(
A B
)
B
, but does not
A B.
Exercise.
Find a possible world model that
(
A B
)
B
but does not
believe (B A) A.
Exercise.
Find a possible world model that does not believe in
(
A B
)
(B A).
Possible world semantics can be used for all sorts of logical systems, not just
propositional logic. In general, the notion of accessibility need not be a partial
order, but can be anything we want. Also, we do not necessarily have to require
persistence. Of course, we also need to modify the according notion of a world.
Example.
If we want to produce possible world semantics for constructive
first-order logic, we need to modify our definition of worlds so that they have
“elements”. Assuming we have done so, we can encode
and
using the following
rules:
w
x
, F (x) if there is some y w for which w F (y).
w
x
, F (x) if for all w
0
w and x w
0
, we have w
0
F (x).
The second condition is much stronger than the naive “for all
x w
, we have
w F
(
x
)”, because the
w
00
w
may have many more elements. What our
condition say is that
w
somehow has a “global” way of showing that everything
satisfies F .
Example. The proposition A ¬¬A is valid all possible world models.
For classical logic, we had semantics given by truth tables. This easily shows
that classical truth is decidable. Similarly, given any formula, it involves only
finitely many variables, there are only finitely many distinct possible world
models one has to check. So whether something is refuted by possible world
models is again a decidable problem.
Now suppose we have a possible world model in front of us. Then given a
formula
ϕ
, the set of worlds that believe
ϕ
is an upward-closed subset of the
frame. So we can think of the truth value of ϕ as these upward-closed set.
Example. Suppose we have the following worlds:
w
2
w
1
Then the possible truth values are
, {w
2
}, {w
1
, w
2
}.
Suppose the top world believes
A
, but the bottom world does not. Then we say
that
A
has truth value
{w
2
}
. This is indeed the three-valued algebra we used to
disprove Peirce’s law.
What can we say about the collection of upward-closed subset of frames, i.e.
our truth values? Under inclusion, it is easy to see that these are complete and
co-complete, i.e. we have all infinite unions and intersections. More importantly,
we can do “implications”.
Definition
(Heyting algebra)
.
A Heyting algebra is a poset with
>
,
,
and
and an operator such that A B is the largest C such that
C A B.
If a poset is complete co-complete, then we would expect the implication to
be given by
A B =
_
{C : C A B}.
Indeed, if the poset is a Heyting algebra, then it must be given by this. So we
just have to check that this works.
The thought is that Heyting algebras would “validate” all constructive theses,
by interpreting
with
,
with
and
with
. In other words, if we try to
build a truth table for a formula with values in a Heyting algebra, we will get
the top element
>
all the time iff the formula is constructively valid. So we have
a parallel between the relation between classical logic and boolean algebras, and
constructive logic and Heyting algebras.
It is an easy inductive proof that any constructively valid statement is “true”
in all Heyting algebras, and the other direction is also true, but takes more work
to prove.
Note that in classical logic, we only ever use the two-element boolean algebra.
If we use the 4-element Boolean algebra, then we would detect the same set of
tautologies. However, in the case of constructive logic we genuinely need all
Heyting algebras.
Finally, we prove half the completeness theorem for possible world models.
Lemma.
Any formula with a natural deduction proof not using the rule for
classical negation is true in all possible world models.
For any frame
F
, every such formula is satisfied by every possible world
model on that frame. We say it is valid on
F
. Such formulae are valid on all
posets with a bottom element.
Proof.
The hard case is implication. Suppose we have a natural deduction
proof of
A B
. The last rule is a
-introduction. So by the induction
hypothesis, every world that believes
A
also believes
B
. Now let
w
be a world
that believes all the other undischarged assumptions in
A B
. By persistence,
every
w
0
w
believes similarly. So any
w
0
w
that believes
A
also believes
B
.
So w A B.
1.4 Negative interpretation
Suppose a constructive logician found a classical logician, and then they try to
talk to each other. Then they would find that they disagree on a lot of things.
The classical logician has no problems accepting
¬¬A A
, but the constructive
logician thinks that is nonsense. Can they make sense of each other?
The classical logician can understand the constructive logician via possible
world semantics. He could think that when the constructive logician says
a proposition is false, they mean there are possible world models where the
proposition fails. This is all fine.
But can a constructive logician make sense of the classical logician? When the
constructivist sees
A B
, he thinks that either we can prove
A
, or we can prove
B
. However, the classical logician views this as a much weaker statement. It is
more a long the lines of “it is not true that
A
and
B
are both false”. In general,
we can try to interpret classical statements by putting in enough negations:
Classical proposition Constructive interpretation
A B ¬(¬A ¬B)
A B A B
x
W (x) ¬∀
x
¬W (x)
A B ¬(A ¬B)
This gives rise to the negative interpretation of classical logic into constructive
logic, due to odel:
Definition
(Negative interpretation)
.
Given a proposition
φ
, the interpretation
φ
is defined recursively by
= .
If ϕ is atomic, then ϕ
= ¬¬ϕ.
If ϕ is negatomic, then ϕ
= ϕ.
If ϕ = ψ θ, then ϕ
= ψ
θ
.
If ϕ = ψ θ, then ϕ
= ¬(¬ψ
¬θ
).
If ϕ =
x
ψ(x), then (
x
)(ψ
(x)).
If ϕ = ψ θ, then ϕ
= ¬(ψ
¬θ
).
If ϕ =
x
ψ(x), then ϕ
= ¬∀
x
¬ψ
(x).
One can check that we always have
` ϕ ϕ
. So the constructive version is
always stronger.
Definition (Stable formula). A formula is stable if
` ϕ
ϕ.
Lemma.
Any formula built up from negated and doubly negated atomics by
¬
,
and is stable.
Proof.
By induction on formulae. The base case is immediate, using the fact
that ¬¬¬A ¬A. This follows from the more general fact that
(((p q) q) q) p q.
It is less confusing to prove this in two steps, and we will write
λ
-terms for our
proofs. First note that if we have
f
:
A B
, then we can obtain
f
T
: (
B
q) A q for any q, using
f
T
= λg
Bq
.λa
A
.g(f(a)).
So it suffices to prove that
p (p q) q,
and the corresponding term is
λx
p
.λg
pq
.gx
We now proceed by induction.
Now assume that we have proofs of
¬¬p p
and
¬¬q q
. We want
to prove
p q
from
¬¬
(
p q
). It suffices to prove
¬¬p
and
¬¬q
form
¬¬(p q), and we will just do the first one by symmetry.
We suppose
¬p
. Then we know
¬
(
p q
). But we know
¬¬
(
p q
). So we
obtain a contradiction. So we have proved that ¬¬p.
Note that we have
`
x
¬¬ϕ(x) ¬¬∃
x
ϕ(x),
but not the other way round. For universal quantification, we have
¬¬∀
x
ϕ(x)
x
¬¬ϕ(x),
but not the other way round. We can construct a proof as follows:
[
x
ϕ(x)]
1
-elim
ϕ(a) [¬ϕ(x)]
2
-elim
-int (1)
¬∀
x
ϕ(x) [¬¬∀
x
ϕ(x)]
3
-elim
-int (2)
¬¬ϕ(a)
-int
¬¬∀
x
ϕ(x)
-int (3)
¬¬∀
x
F (x)
x
¬¬F (x)
We now want to show that if
ϕ
is stable, then
x
ϕ
(
x
) is stable. In other
words, we want
¬¬∀
x
ϕ
(x)
x
ϕ
(x).
But form ¬¬∀
x
ϕ
(x), we can deduce
x
¬¬ϕ
(x), which implies
x
ϕ
(x).
So every formula in the range of the negative interpretation is stable. Every
stable formula is equivalent to its double negation (classically). Every formula is
classically equivalent to a stable formula.
So if a constructive logician meets a classical logician who is making some
bizarre assertions about first order logic. To make sense of this, the constructive
logician can translate it to its negative interpretation, which the classical logician
thinks is equivalent to the original one.
1.5 Constructive mathematics
We now end with some remarks about how we can do mathematics “construc-
tively”, and the subtleties we have to be careful of.
We first discuss the notion of finiteness. Even classically, if we do not have
the axiom of choice, then we have many distinct notions of “finite set”. When
we go constructive, things get more complicated.
We can produce (at least) two different definitions of finiteness.
Definition
(Kuratowski finite)
.
We define “finite” recursively:
is Kuratowski
finite. If x is Kuratowski finite, then so is x {y}.
There is a separate definition of N-finiteness:
Definition
(
N
-finite)
.
is
N
-finite. If
x
is
N
-finite, and
y 6∈ x
, then
x {y}
is
N-finite.
These two definitions are not the same! In a
N
-finite set, we know any
two things are either equal, or they are not, as this is built into the definition
of an
N
-finite set. However, in the case of Kuratowski finite, this is not true,
since we don’t have the law of excluded middle. We say
N
-finite sets have
decidable equality. If we want to do constructive arithmetic, the correct notion
is N -finiteness.
We can still define natural numbers as the smallest set containing
and is
closed under successor. This gives us
N
-finite sets, but the least number principle
is dodgy. It turns out the least number principle implies excluded middle.
There are also subtleties involving the notion of a non-empty set!
Definition (Non-empty set). A set x is non-empty if ¬∀
y
y 6∈ x.
Definition (Inhabited set). A set x is inhabited if
y
y x.
They are not the same! Being inhabited is stronger than being non-empty!
We shall not go into more details about constructive mathematics, because,
disappointingly, very few people care.
2 Model theory
2.1 Universal theories
Recall the following definition:
Definition
(Universal theory)
.
A universal theory is a theory that can be
axiomatized in a way such that all axioms are of the form
···
(stuff not involving quantifiers) ()
For example, groups, rings, modules etc. are universal theories. It is easy to
see that if we have a model of a universal theory, then any substructure is also a
model.
It turns out the converse is also true! If a theory is such that every sub-
structure of a model is a model, then it is universal. This is a very nice result,
because it gives us a correspondence between syntax and semantics.
The proof isn’t very hard, and this is the first result we will prove about
model theory. We begin with some convenient definitions.
Definition
(Diagram)
.
Let
L
be a language and
M
a structure of this language.
The diagram of
M
is the theory obtained by adding a constant symbol
a
x
for
each
x M
, and then taking the axioms to be all quantifier-free sentences that
are true in M. We will write the diagram as D(M).
Lemma.
Let
T
be a consistent theory, and let
T
be the set of all universal
consequences of
T
, i.e. all things provable from
T
that are of the form (
). Let
M be a model of T
. Then T D(M) is also consistent.
Proof.
Suppose
T D
(
M
) is not consistent. Then there is an inconsistency
that can be derived from finitely many of the new axioms. Call this finite
conjunction
ψ
. Then we have a proof of
¬ψ
from
T
. But
T
knows nothing
about the constants we added to
T
. So we know
T `
x
¬ψ
. This is a universal
consequence of T that M does not satisfy, and this is a contradiction.
Theorem.
A theory
T
is universal if and only if every substructure of a model
of T is a model of T .
Proof.
is easy. For
, suppose
T
is a theory such that every substructure of
a model of T is still a model of T .
Let
M
be an arbitrary model of
T
. Then
T D
(
M
) is consistent. So it
must have a model, say
M
, and this is in particular a model of
T
. Moreover,
M is a submodel of M
. So M is a model of T .
So any model of
T
is also a model of
T
, and the converse is clearly true. So
we know T
is equivalent to T .
2.2 Products
In model theory, we would often like to produce new models of a theory from
old ones. One way to do so is via products.
We will use λ-notation to denote functions.
Definition
(Product of structures)
.
Suppose
{A
i
}
iI
is a family of structures
of the same signature. Then the product
Y
iI
A
i
has carrier set the set of all functions
α : I
[
iI
A
i
such that α(i) A
i
.
Given an
n
-ary function
f
in the language, the interpretation in the product
is given pointwise by
f(α
1
, · · · , α
n
) = λi.f(α
1
(i), · · · , α
n
(i)).
Relations are defined by
ϕ(α
1
, · · · , α
n
) =
^
iI
ϕ(α
1
(i), · · · , α
n
(i)).
The natural question to ask is if we have a model of a theory, then is the
product a model again? We know this is not always true. For example, the
product of groups is a group, but the product of total orders is not a total order.
We say a product preserves a formula ϕ if
Y
iI
A
i
ϕ
iI
, A
i
ϕ
What sort of things do products preserve? As we know from undergraduate
mathematics, theories such as groups and rings are preserved.
Definition
(Equational theory)
.
An equational theory is a theory all of whose
axioms are of the form
x
(w
1
(x) = w
2
(x)),
where w
i
are some terms in x.
It is not hard to see that models of equational theories are preserved under
products.
It turns out actually the class of things preserved by products is much more
general.
Definition
(Horn clause)
.
A Horn clause is a disjunction of atomics and
negatomics of which at most one disjunct is atomic.
It is usually better to think of Horn clauses as formulae of the form
^
ϕ
i
χ
where
ϕ
i
and
χ
are atomic formulae. Note that
is considered an atomic
formula.
A universal Horn clause is a universal quantifier followed by a Horn clause.
Example.
Transitivity, symmetry, antisymmetry, reflexivity are all universal
Horn clauses.
Proposition. Products preserve (universal) Horn formulae.
Proof. Suppose every factor
A
i
x
^
ϕ
i
(x) χ(x).
We want to show that the product believes in the same statement. So let
(
f
1
, · · · , f
k
) be a tuple in the product of the right length satisfying the antecedent,
i.e. for each n I, we have
A
n
ϕ
i
(f
1
(n), · · · , f
k
(n))
for each i, But then by assumption,
A
n
χ(f
1
(n), · · · , f
k
(n))
for all n. So the product also believes in ϕ
j
(f
1
, · · · , f
n
). So we are done.
Reduced products
Unfortunately, it is rarely the case that products give us new interesting models.
However, reduced products do.
Reduced products arise from having a filter on the index set.
Definition
(Filter)
.
Let
I
be a set. A filter on
I
is a (non-empty) subset
F P
(
I
) such that
F
is closed under intersection and superset. A proper filter
is a filter F 6= P (I).
The intuition to hang on to is that F captures the intuition of “largeness”.
Example.
Take
I
=
N
, and
F
the set of cofinite subsets of
N
, namely the sets
whose complement are finite, then F is a filter.
Similarly, we can take
F
to contain the set of naturals with asymptotic
density 1.
Example. Let
F = {x N : 17 N}
Then this is a maximal proper filter.
This is a rather silly filter. We will see later that model-theoretically, these
give us uninteresting reduced products.
Definition (Principal filter). A principal filter is a filter of the form
F = {X I : x 6∈ X}
for some x I.
We don’t tend to care about principal filter, as they are boring.
Definition
(Complete filter)
.
A filter
F
is
κ
-complete if it is closed under
intersection of < κ many things.
Example. By definition, every filter is
0
-complete.
Example. Principal filters are κ-complete for all κ.
A natural question is, are there any non-principal ultrafilters that are
1
complete? It turns out this is a deep question, and a lot of the study of set
theory originated from this question.
Filters on
I
form a complete poset under inclusion, with top element given
by
F
=
P
(
I
). Moreover, the proper filters are a chain complete poset. Thus, by
Zorn’s lemma, there are maximal proper filters.
Definition (Ultrafilter). An ultrafilter is a maximal filter.
We saw that principal filters are ultra. Are there non-principal ultrafilters?
Fortunately, by Zorn’s lemma, they do exist.
Example.
By Zorn’s lemma, there is a maximal filter extending the cofinite
filter on N, and this is non-principal.
It turns out it is impossible to explicitly construct a non-principal ultrafilter,
but Zorn’s lemma says they exist.
Now we can get to reduced products.
Definition
(Reduced product)
.
let
{A
i
:
i I}
be a family of structures, and
F a filter on I. We define the reduced product
Y
iI
A
i
/F
as follows: the underlying set is the usual product
Q
A
i
quotiented by the
equivalence relation
α
F
β {i : α(i) = β(i)} F
Given a function symbol
f
, the interpretation of
f
in the reduced product is
induced by that on the product.
Given a relational symbol ϕ, we define
ϕ(α
1
, · · · , α
n
) {i : ϕ(α
1
(i), · · · , α
n
(i))} F.
If
F
is an ultrafilter, then we call it the ultraproduct. If all the factors in an
ultraproduct are the same, then we call it an ultrapower.
It is an easy exercise to show that these are all well-defined. If we view the
filter as telling us which subsets of
I
are “large” then, our definition of reduced
product says we regard two functions in the reduced products as equivalent if
they agree on a “large” set.
Note that if our filter is principal, say
F
=
{J I
:
i J}
, then the reduced
product is just isomorphic to A
i
itself. So this is completely uninteresting.
Reduced products preserve various things, but not everything. For example,
the property of being a total order is in general not preserved by reduced products.
So we still have the same problem.
But if the filter F is an ultrafilter, then nice things happen.
Theorem
(
L
o´s theorem)
.
Let
{A
i
:
i I}
be a family of structures of the same
(first-order) signature, and U P (I) an ultrafilter. Then
Y
iI
A
i
/U ϕ {i : A
i
ϕ} U.
In particular, if A
i
are all models of some theory, then so is
Q
A
i
/U.
The key of the proof is the following lemma, which is a nice exercise:
Lemma. Let F be a filter on I. Then the following are equivalent:
(i) F is an ultrafilter.
(ii) For X I, either X F or I \ X F (“F is prime”).
(iii) If X, Y I and X Y I, then X I or Y I.
With this in mind, it is hard not to prove the theorem.
Let’s now look at some applications of Lo´s theorem.
Recall the compactness theorem of first order logic if we have a theory
T
such that every finite subset of
T
has a model, then so does
T
. The way
we proved it was rather roundabout. We proved the completeness theorem of
first order logic. Then we notice that if every finite subset of
T
has a model,
then every finite subset of
T
is consistent. Since proofs are finite, we know
T
is
consistent. So by completeness, T has a model.
Now that we are equipped with ultraproducts, it is possible to prove the
compactness theorem directly!
Theorem
(Compactness theorem)
.
Let
T
be a theory in first order logic such
that every finite subset has a model. Then T has a model.
Proof.
Let be such a theory. Let
S
=
P
0
(∆) be the set of all finite subsets
of ∆. For each s S, we pick a model M
s
of s.
Given s S, we define
X
s
= {t S : s t}.
We notice that
{X
s
:
s S}
generate a proper filter on
S
. We extend this to
ultrafilter U by Zorn’s lemma. Then we claim that
Y
sS
M
s
/U .
Indeed, for any ϕ ∆, we have
{s : M
s
ϕ} X
{ϕ}
U.
Example.
Let
T
be the theory consisting of all first-order statements true in
hR,
0
,
1
,
+
, ×i
. Add to
T
a constant
ε
and the axioms
ε <
1
n
for all
n N
. Then
for any finite subset of this new theory,
R
is still a model, by picking
ε
small
enough. So by the compactness theorem, we know this is consistent.
Using our proof of the compactness theorem, we now have a “concrete” model
of real numbers with infinitesimals we just take the ultraproduct of infinitely
many copies of R.
2.3 Ehrenfeucht–Mostowski theorem
Definition
(Skolem function)
.
Skolem functions for a structure are functions
f
ϕ
for each ϕ L such that if
M
x
y
ϕ(x, y),
then
M
x
ϕ(x, f
ϕ
(x)).
Definition
(Skolem hull)
.
The Skolem hull of a structure is obtained from the
constants term by closure under the Skolem functions.
Definition
(Elementary embedding)
.
Let Γ be a set of formulae. A function
i
:
M
1
M
2
is Γ-elementary iff for all
ϕ
Γ we have
M
1
ϕ
(
x
) implies
M
2
ϕ(i(x)).
If Γ is the set of all formulae in the language, then we just say it is elementary.
Usually, we take Γ to be the set of all formulae, and
M
1
to be a substructure
of M
2
.
Example.
It is, at least intuitively, clear that the inclusion
hQ, ≤i hR, ≤i
is
elementary.
However, Q as a field is not elementary as a substructure of R as a field.
Note that first order logic is not decidable. However, monadic first order
logic is.
Definition
(Monadic first order logic)
.
Monadic first-order logic is first order
logic with only one-place predicates, no equality and no function symbols.
Proposition. Monadic first-order logic is decidable.
Proof.
Consider any formula
ϕ
. Suppose it involves the one-place predicates
p
1
, . . . , p
n
. Given any structure M, we consider the quotient of M by
x y p
i
(x) = p
i
(y) for all i.
Then there are at most 2
n
things in the quotient.
Then given any transversal of the quotient, we only have to check if the
formula holds for this transversal, and this is finite. So we can decide.
Definition
(Set of indiscernibles)
.
We say
hI,
I
i
is a set of indiscernibles for
L
and a structure
M
with
I M
if for any
ϕ L
(
M
) of arity
n
, and for all
increasing tuples x, y I,
M ϕ(x) M ϕ(y)
This is weaker than just saying everything in there are the same.
Example. hQ, Ii is a set of indiscernibles for hR, ≤i.
Let
T
be a theory with infinite models. The general line of thought is can
we get a model of T with special properties?
Given any set
M
, invent names for every member of
M
. Add these names to
the language of
T
. We add axioms to say all these constants are distinct. Since
T has infinite models, we know this theory has a model.
Let be the set of countable ordinals, consider the language of
R
, and add
a name to this language for every countable ordinal. We further add axioms to
say that α < β in the ordering of R whenever α < β as ordinals.
Again by compactness, we find that this has a model. So we can embed the
set of countable ordinals into a model of reals, which we know is impossible for
the genuine
R
. However, in general, there is no reason to believe these elements
are a set of indiscernibles. What Ehrenfeucht–Mostowski says is that we can in
fact do so.
Theorem
(Ehrenfeucht–Mostowski theorem (1956))
.
Let
hI, ≤i
be a total order,
and let
T
be a theory with infinite models. Suppose we have a unary predicate
P and a 2-ary relation 4 L(T ) such that
T ` 4 is a total order on {x : P (x)}.
Then
T
has a model
M
with a copy of
I
as a sub-order of
4
, and the copy of
I
is a set of indiscernibles. Moreover, we can pick
M
such that every order-
automorphism of hI, ≤i extends to an automorphism of M.
We will give two proofs of this result. We first give the original proof of
Ehrenfeucht–Mostowski, which uses Ramsey theory.
Proof.
Let
T
and
hI, ≤i
be as in the statement of the theorem. We add to
L
(
T
)
names for every element of
I
, say
{c
i
:
i I}
. We add axioms that says
P
(
c
i
)
and
c
i
4 c
j
whenever
i < j
. We will thereby confuse the orders
and
4
, partly
because is much easier to type. We call this theory T
.
Now we add to
T
new axioms to say that the
c
i
form a set of indiscernibles.
So we are adding axioms like
ϕ(c
i
, c
j
) ϕ(c
i
0
, c
j
0
) ()
for all
i < j
and
i
0
< j
0
. We do this simultaneously for all
ϕ L
(
T
) and all
tuples of the appropriate length. We call this theory
T
I
, and it will say that
hI, ≤i
forms a set of indiscernibles. The next stage is, of course, to prove that
T
I
is consistent.
Consider any finite fragment
T
0
of
T
I
. We want to show that
T
0
is consistent.
By finiteness,
T
0
only mentions finitely many constants, say
c
1
< · · · < c
K
, and
only involve finitely many axioms of the form (
). Denote those predicates as
ϕ
1
, · · · , ϕ
n
. We let N be the supremum of the arities of the ϕ
i
.
Pick an infinite model M of T . We write
M
[N]
= {A M : |A| = N },
For each
ϕ
i
, we partition
M
[N]
as follows given any collection
{a
k
}
N
k=1
, we
use the order relation
4
to order them, so we suppose
a
k
4 a
k+1
. If
ϕ
i
has arity
m N
, then we can check whether
ϕ
i
(
a
1
, · · · , a
m
) holds, and the truth value
gives us a partition of M
[N]
into 2 bits.
If we do this for all
ϕ
i
, then we have finitely partitioned
M
[N]
. By Ramsey’s
theorem, this has an infinite monochromatic subset, i.e. a subset such that
any two collection of
N
members fall in the same partition. We pick elements
c
1
, · · · , c
K
, · · · , c
K+N
, in increasing order (under
4
). We claim that picking the
c
1
, · · · , c
K
to be our constants satisfy the axioms of T
0
.
Indeed, given any
ϕ
i
mentioned in
T
0
with arity
m < N
, and sequences
c
`
1
< · · · < c
`
m
and
c
`
0
1
< · · · < c
`
0
m
, we can extend these sequences on the right
by adding more of those c
i
. Then by our choice of the colouring, we know
ϕ
i
(c
`
1
, · · · , c
`
m
) ϕ
i
(c
`
0
1
, · · · ,
`
0
m
).
So we know
T
0
is consistent. So
T
I
is consistent. So we can just take a model of
T
I
, and the Skolem hull of the indiscernibles is the model desired.
There is another proof of Ehrenfeucht–Mostowski due to Gaifman, using
ultraproducts. We will sketch the proof here.
To do so, we need the notion of colimits.
Definition
(Colimit)
.
Let
{A
i
:
i I}
be a family of structures index by a poset
hI, ≤i
, with a family of (structure-preserving) maps
{σ
ij
:
A
i
A
j
| i j}
such
that whenever i j k, we have
σ
jk
σ
ij
= σ
ik
.
In particular
σ
ii
is the identity. A colimit or direct limit of this family of
structures is a “minimal” structure
A
with maps
σ
i
:
A
i
A
such that
whenever i j, then the maps
A
i
A
A
j
σ
ij
σ
i
σ
j
commute.
By “minimal”, we mean if
A
0
is another structure with this property, then
there is a unique inclusion map A
A
0
such that for any i I, the maps
A
i
A
A
0
σ
i
σ
0
i
Example.
The colimit of a family of sets is obtained by taking the union of all
of them, and then identifying x σ
ij
(x) for all x A
i
and i j.
The key observation is the following:
Every structure is a direct limit of its finitely generated substructures,
with the maps given by inclusion.
We will neither prove this nor make this more precise, but it will be true for the
different kinds of structures we are interested in. In particular, for a poset, a
finitely generated substructure is just a finite suborder, because there are no
function symbols to “generate” anything in the language of posets.
To prove Ehrenfeucht–Mostowski, we construct a model satisfying the state-
ment of Ehrenfeucht–Mostowski as the direct limit of structures indexed by finite
subset of I:
{M
s
: s P(I)}
and when
s t P
(
I
), we have an elementary embedding
M
s
M
t
. We
then note that if all the
σ
ij
are elementary, then the
σ
i
:
A
i
A
are also
elementary. In particular, if each M
s
are models of our theory, then so is A
.
We start by picking any infinite model
M
of
I
. By standard compactness
arguments, we may wlog there is no last
4
-thing in
M
, and also that
J
=
{x
:
P
(
x
)
}
is infinite. We pick
U
an ultrafilter on
J
that contains all terminal
segments of
4
. Since
J
does not have a last element, this is a non-principal
ultrafilter.
We now write
L(M) = M
|J|
/U.
We will define
M
s
= L
|s|
(M).
In particular, M
= M.
To construct the embedding, we consider the following two classes of maps:
(i)
For any structure
M
, there is a map
K
=
K
(
M
) :
M L
(
M
) given by
the constant embedding.
(ii) If i is an embedding from M into N , then there is an embedding
L(i) : L(M) L(N ).
which is “compose with i on the right”.
It is easy to see that both of these maps are elementary embeddings, where
we require
i
to be elementary in the second case. Moreover, these maps are
compatible in the sense that the following diagram always commutes:
M N
L(M) L(N )
i
K(M) K(N )
L(i)
We now further consider the functions
head
and
tail
defined on finite linear orders
by
head(a
1
< · · · < a
n
) = a
1
tail(a
1
< · · · < a
n
) = a
2
< · · · < a
n
.
We can now define the embeddings recursively. We write
I
(
s, t
) for the desired
inclusion from M
s
into M
t
. We set
If s = t, then I(s, t) is the identity.
If head(s) = head(t), then I(s, t) is L(I(tail(s), tail(t))).
Otherwise, I(s, t) is K I(s, tail(t)).
By our previous remark, we know these are all elementary embeddings, and that
these maps form a commuting set. Thus, we obtain a direct limit M
.
Now we want to produce a copy of
I
in the direct limit. So we want to
produce a copy of
s
in
M
s
for each
s
, such that the maps preserve these copies.
It suffices to be able to do it in the case
|s|
= 2, and then we can bootstrap it
up by induction, since if
|s| >
3, then we find two sets of
s
of order
< |s|
whose
union gives
s
, and compatibility means we can glue them together. To ensure
compatibility, we pick a fixed element
x L
(
M
), and set this as the desired
element in
M
s
whenever
|s|
= 1. It then suffices to prove that, say, if 0
<
1,
then
I
(
{
0
}, {
0
,
1
}
)(
x
)
< I
(
{
1
}, {
0
,
1
}
)(
x
). It is then a straightforward exercise
to check that
x = [λp.p] L(M) = M
|P |
/U
works.
Once we have done this, it is immediate that the copy of
I
in the direct limit
is a family of indiscernibles. Indeed, we simply have to check that the copy of
s
in
M
s
is a family of indiscernibles, since every formula only involves finitely
many things. Then since the inclusino of
M
s
into
M
is elementary, the result
follows.
2.4 The omitting type theorem
Definition
(Type)
.
A type is a set of formulae all with the same number of free
variables. An n-type is a set of formulae each with n free variables.
What is the motivation of this? A 1-type is a set of formulae all with one
free variable. So if we have a model
M
and an element
x M
, then we can
consider all formulae that x satisfies, and we see that this gives as a 1-type.
Definition
(Realization of type)
.
A model
M
realizes an
n
-type Σ if there
exists x
1
, · · · , x
n
M such that for all σ Σ, we have
M σ(x
1
, · · · , x
n
).
Any fool can realize a type, but it takes a model theorist to omit one!
Definition
(Omit a type)
.
A model
M
omits an
n
-type Σ if for all
x
1
, · · · , x
n
,
there exists σ Σ such that
M 6 σ(x
1
, · · · , x
n
).
Example.
Consider the language of PA, and consider the type containing
σ
i
(
x
)
saying x 6= s
i
(x). We really want to omit this type!
Of course, it is not hard in this case the standard model of PA does.
Let’s look for a condition on
T
and Σ for
T
to have a model that omits this
type.
Definition (Locally realize). We say ϕ realizes Σ locally if
T `
x
(ϕ(x) σ(x)).
Suppose there is some ϕ L(T ) such that
T `
x
ϕ(x)
and ϕ(x) locally realizes Σ, then we certainly cannot omit Σ.
Now if
T
x
¬ϕ
(
x
) whenever
ϕ
locally realizes Σ, then we have a chance.
We say T locally omits Σ. It turns out this is sufficient!
Theorem
(Omitting type theorem)
.
Let
T
be a first-order theory, and Σ an
n-type. If
T `
x
¬ϕ(x)
whenever ϕ locally realizes Σ, then T has a model omitting Σ.
We first prove a special case for propositional logic.
Theorem.
Let
T
be a propositional theory, and Σ
L
(
T
) a type (with
n
= 0).
If T locally omits Σ, then there is a T -valuation that omits Σ.
Proof.
Suppose there is no
T
-valuation omitting Σ. By the completeness theorem,
we know everything in Σ is a theorem of T . So T can’t locally omit Σ.
That was pretty trivial. But we can do something more than that the
extended omitting types theorem. It says we can omit countably many types
simultaneously.
Theorem.
Let
T
be a propositional theory, and for each
i N
, we let Σ
i
L
(
T
)
be types for each
i N
. If
T
locally omits each Σ
i
, then there is a
T
-valuation
omitting all Σ
i
.
Proof.
We will show that whenever
T A
1
, · · · , ¬A
i
}
is consistent, where
A
n
Σ
n
for n i, then we can find A
n+1
Σ
n+1
such that
T A
1
, · · · , ¬A
n
, ¬A
n+1
}
is consistent.
Suppose we can’t do this. Then we know that
T `
^
1jn
¬A
j
A
n+1
,
for every
A
n+1
Σ
n+1
. But by assumption,
T
locally omits Σ
i+1
. This implies
T ` ¬
^
1jn
¬A
j
,
contradicting the inductive hypothesis that T A
1
, · · · , ¬A
i
} is consistent.
Thus by compactness, we know
T A
1
, ¬A
2
, · · · }
is consistent, and a
model of this would give us a model of T omitting all those types.
We now prove the general case.
Proof.
Let
T
be a first order theory (in a countable language) locally omitting
Σ. For simplicity, we suppose Σ is a 1-type. We want to find a model omitting
Σ. Suppose
T
locally omits Σ, and let
{c
i
:
i N}
be a countable set of new
constant symbols. Let
hϕ
i
:
i Ni
be an enumeration of the sentences of
L
(
T
).
We will construct an increasing sequence
{T
i
:
i N}
of finite extensions of
T
such that for each m N,
(0) T
m+1
is consistent.
(i) T
m+1
decides ϕ
n
for n m, i.e. T
m+1
` ϕ
n
or T
m+1
` ¬ϕ
n
.
(ii)
If
ϕ
m
is
x
ψ
(
x
) and
ϕ
m
T
m+1
, then
ψ
(
c
p
)
T
m+1
, where
c
p
is the first
constant not occurring in T
m
or ϕ
m
.
(iii) There is a formula σ(x) Σ such that ¬σ(c
m
) T
m+1
.
We will construct this sequence by recursion. Given
T
m
, we construct
T
m+1
as
follows: think of T
m
as T {θ
1
, · · · , θ
r
}, and let
Θ =
^
jr
θ
j
.
We let {c
1
, · · · , c
N
} be the constants that have appeared in Θ, and let
Θ(x)
be the result of replacing
c
i
with
x
i
in
θ
. Then clearly, Θ(
x
) is consistent with
T . Since T locally omits Σ, we know there exists some σ(x) Σ such that
Θ ¬σ(x
m
)
is consistent with T . We put ¬σ(c
m
) into T
m+1
, and this takes care of (iii).
If
ϕ
m
is consistent with
T
m
σ
(
c
m
)
}
, then put it into
T
m+1
. Otherwise,
put in ¬ϕ
m
. This takes care of (i).
If
ϕ
m
is
x
ψ
(
x
) and it’s consistent with
T
m
σ
(
c
m
)
}
, we put
ψ
(
c
m
) into
T
m+1
. This takes care of (ii).
Now consider
T
=
[
nN
T
n
Then T
is complete by construction, and is complete by compactness.
Consider an arbitrary countable model of
T
, and the submodel generated
by the constants in
C
. This is a model of
T
T
and condition (iii) ensures
that it omits Σ.
3 Computability theory
3.1 Computability
We begin by discussing some number theory. Consider the Diophantine equation
x
2
+ y
2
= z
2
. ()
We can certainly write down some solutions to this equation, e.g.
3
2
+ 4
2
= 5
2
.
But is it possible to find all solutions? It is a well-known result that we can. Of
course, to do so, it suffices to enumerate all solutions where
x, y, z
are coprime
It turns out every such solution of () can be written as
x = m
2
n
2
, y = 2mn, z = m
2
+ n
2
for some
m, n N
, and conversely, for any
m, n N
, the (
x, y, z
) given by this
formula is a solution. Thus, if we want to list all solutions to (
), then it suffices
to plug in all possible values of
m
and
n
. In other words, we have an algorithm
that enumerates all solutions to ().
Given any Diophantine equation, it would be great to have some similar
algorithm that enumerates all solutions to the equation. But even before that, it
would be great to have some algorithms that tell us whether there is any solution
at all. This was Hilbert’s 10th problem is there an algorithm that determines
whether or not a Diophantine equation has a solution?
If such an algorithm existed, then to prove this, we merely have to exhibit
such an algorithm, and prove that it works? But how could we prove that there
isn’t one. To do so, we must formulate an “algorithm” as a mathematical object.
Alternatively, we want a formal notion of what is a “computable” function.
One way to do so is to define “computable functions” recursively. We will be
concerned with functions
N
n
N
m
for different
k
and
m
, and the idea is to list
some functions that any sensible computer should be able to compute, and then
close it under operations that really should preserve computability. We then
hope that the set of functions produced this way is exactly the functions that
can be “computed” by an “algorithm”.
We start with a naive attempt, and what we obtain is known as “primitive
recursive” functions.
Definition
(Primitive recursive functions)
.
The class of primitive recursive
functions N
n
N
m
for all n, m is defined inductively as follows:
The constantly zero function f (x) = 0, f : N
n
N is primitive recursive.
The successor function
succ
:
N N
sending a natural number to its
successor (i.e. it “plus one”) is primitive recursive.
The identity function id : N N is primitive recursive.
The projection functions
proj
i
j
(x) : N
j
N
(x
1
, · · · , x
j
) x
i
are primitive recursive.
Moreover,
Let
f
:
N
k
N
m
and
g
1
, · · · , g
k
:
N
n
N
be primitive recursive. Then
the function
(x
1
, · · · , x
n
) 7→ f(g
1
(x
1
, · · · , x
n
), · · · , g
k
(x
1
, · · · , x
n
)) : N
n
N
m
is primitive recursive.
Finally, we have closure under primitive recursion
If
g
:
N
k
N
m
and
f
:
N
m+k+1
N
m
are primitive recursive, then so is
the function h : N
k+1
N
m
defined by
h(0, x) = g(x)
h(succ n, x) = f(h(n, x), n, x).
Do these give us everything we want? We first notice that we can define
basic arithmetic using primitive recursion by
plus(0, n) = n
plus(succ m, n) = succ(plus(m, n))
mult(1, n) = n
mult(succ m, n) = plus(n, mult(m, n)),
Similarly, we can define exp etc.
What else can we do? We might want to define the Fibonacci function
Fib(0) = Fib(1) = 1, Fib(n + 1) = Fib(n) + Fib(n 1)?
But the definition of primitive recursion only allows us to refer to
h
(
n, x
) when
computing
h
(
succ n, x
). This isn’t really a problem. The trick is to not define
Fib directly, but to define the function
Fib
0
(n) =
Fib(n)
Fib(n 1)
instead, using primitive recursion, and then projecting back to the first compo-
nent. But this requires us to fix the number of things we are required to refer
back to. What if we want to refer back to everything less than n?
The solution to this is that we can actually encode pairs as natural numbers
themselves:
Proposition.
There exists primitive recursion functions
pair
:
N
2
N
and
unpair : N N
2
such that
unpair(pair(x, y)) = (x, y)
for all x, y N.
Proof. We can define the pairing function by
pair(x, y) =
x + y + 1
2
+ y.
The unpairing function can be shown to be primitive recursive, but is more
messy.
The benefit of this is that we can now encode lists as naturals. What would
it mean to be a list? We will informally denote a list by square brackets, e.g.
[x
1
, · · · , x
n
]. The idea is that we have functions head, tail, cons such that
head [x
1
, · · · , x
n
] = x
1
tail [x
1
, · · · , x
n
] = [x
2
, · · · , x
n
]
cons(x, [x
1
, · · · , x
n
]) = [x, x
1
, · · · , x
n
]
If we think a bit harder, then what we are really looking for is really the following
property:
Corollary.
There exists
cons
:
N N, head
:
N N
and
tail
:
N N
such that
cons(head x, tail x) = x
Proof.
Take
cons
=
pair
;
head
to be the first projection of the unpairing and
tail
to be a second projection.
So we can encode lists as well. We can lock ourselves in a room with enough
pen and paper, and try to prove that most functions
N N
we encounter “in
daily life” are indeed primitive recursive.
However, it turns out there are some contrived examples that really should
be computable, but not primitive recursive.
Definition (Ackermann function). The Ackermann function is defined to be
A(0, n) = n + 1
A(m, 0) = A(m 1, 1)
A(m + 1, n + 1) = A(m, A(m + 1, n)).
Proposition. The Ackermann function is well-defined.
Proof.
To see this, note that computing
A
(
m
+ 1
, n
+ 1) requires knowledge of
the values of A(m + 1, n), and A(m, A(m + 1, n)).
Consider
N × N
ordered lexicographically. Then computing
A
(
m
+ 1
, n
+ 1)
requires knowledge of the values of
A
at pairs lying below
hm
+ 1
, n
+ 1
i
in this
order. Since the lexicographic order of
N × N
is well-ordered (it has order type
ω × ω), by transfinite induction, we know A is well-defined.
We can indeed use this definition to compute
A
, and it is going to take
forever. But we can still do it.
However, this function is not primitive recursive! Intuitively, this is because
the definition of
A
does “transfinite recursion over
ω × ω
”, while the definition
of primitive recursion only allows us to do “transfinite recursion over ω”.
That is the idea, but obviously not a proof. To prove it properly, this is done
by proving that the Ackermann “grows too fast”.
Definition
(Dominating function)
.
Let
f, g
:
N N
be functions. Then we
write
f < g
if for all sufficiently large integer
n
, we have
f
(
n
)
< g
(
n
). We say
g
dominates f .
It is a remarkable theorem that
Theorem.
The function
n 7→ A
(
n, n
) dominates all primitive recursive functions.
We will not prove this, as it is messy, and not really the point of the course.
The point of bringing up the Ackermann function is just to know that we are
not doing good enough!
The obvious fix would be to allow some more complicated form of recursion,
and hope that we will be safe. But this doesn’t really do the job.
It turns out the problem is that we are only defining total functions. If we
tried to compute any of the primitive recursive functions, we are guaranteed
that we can do it in finite time. In particular, our functions are always defined
everywhere. But anyone with programming experience knows very well that
most programs we write are not everywhere defined. Often, they are nowhere
defined, and just doesn’t run, or perhaps gets stuck in a loop for ever! But we
did write some code for this program, so surely this nowhere defined function
should be computable as well!
Our way of thinking about functions that are not everywhere defined is via
non-termination. We imagine we have a computer program trying to compute
a function
f
. If
f
(42) is undefined, instead of the program running for a while
and then saying “I give up”, the program keeps running forever,
It turns out we can “fix” the problem merely by introducing one operation
inverses. This is slightly subtle. Imagine we are writing a compute program, and
suppose we have a (computable) function
f
:
N N
. We want to compute an
inverse to
f
. Say we want to find
f
1
(17). To compute this, we just compute
f
(0),
f
(1),
f
(2), etc., and if 17 is in the image, then we will eventually find
something that gets sent to 17.
There are two minor problems and one major problems with this:
f
might not be injective, and there are multiple values that get sent to 17.
If this happens, we just agree that we return the first natural number that
gets sent to 17, which is what is going to happen if we perform the above
procedure.
f
might not be surjective, and we can never find something that gets sent
to 17. But this is not a problem, since we just decided that our functions
don’t have to be total! If
f
is not surjective, we can just let the program
keep running forever.
But there is still a major problem. Since we decided functions need not be total,
what happens if
f
is total? We might have
f
(23) = 17, but
f
(11) never halts.
So we never learn that f(23) = 17.
We might think we can get around this problem by “diagonalizing”. We
assume our computers perform computations for discrete time steps. Then we
compute
f
(1) for one step; then
f
(1) and
f
(2) for two steps; then
f
(1),
f
(2)
and
f
(3) for three steps, etc. Then if
f
(23) = 17, and it takes 100 steps of
computation to compute this, then we will figure this out on the 100th iteration
of this process.
This doesn’t really solve our problem, though. If
f
is not injective, then this
doesn’t guarantee to return the minimum
x
such that
f
(
x
) = 17. It just returns
some arbitrary x such that f(x) = 17. This is bad.
So we make a compromise. The obvious compromise is to just allow computing
inverses of total functions, but since total functions are hard to invert, we shall
make a further compromise to just allow computing inverse of primitive recursive
functions.
Before we continue, we define some useful conventions:
Notation.
We write
f
(
x
)
if
f
(
x
) is undefined, or alternatively, after we define
what this actually means, if the computation of
f
(
x
) doesn’t halt. We write
f(x) otherwise.
Definition
(Partial recursive function)
.
The class of partial recursive functions
is given inductively by
Every primitive recursive function is partial recursive.
The inverse of every primitive recursive function is partial recursive, where if
f
:
N
k+n
N
m
, then the/an inverse of
f
is the function
f
1
:
N
k+m
N
n
given by
f
1
(x; y) =
(
min{z : f(x, z) = y} if exists
otherwise
.
The set of partial recursive functions is closed under primitive recursion
and composition.
Of course, this allows us to compute the inverse of functions of several inputs
using pairing and unpairing functions.
It turns out this definition is “correct”. What do we mean by this?
What we have described so far is syntactic declarations of functions. We
describe how we can build up our functions. We can also define computability
semantically, by describing some rules for constructing machines that compute
functions. In other words, we actually define what a “computer” is, and then a
function is computable if it can be computed by a computer. What we want to
prove is that the partial recursive functions are exactly the functions that can
be computed by a machine.
We will not go into details about how we can define these machines. It is pos-
sible to do so; it doesn’t really matter; and the actual details are messy. However,
any sensible construction of machines will satisfy the following description:
A machine can be described by a finite string. In particular, it is possible
to number all the possible machines. Thus, each machine can be specified
by a odel number uniquely. We write
{m}
for the machine corresponding
to the numbering m.
A machine has a countable (enumerated) number of possible states, and
one of them is called
HALT
. A machine also has a register that can store a
natural number.
A machine can take in a fixed number of inputs. Afterwards, in every
discrete time step, it (potentially) changes the content of its register, and
moves to a new state. However, if the current state is
HALT
, then it does
nothing.
For inputs x
1
, · · · , x
n
into the machine {m}, we write
{m}(x
1
, · · · , x
n
) =
(
value in the register after halting if machine halts
otherwise
Thus, we will also write {m} for the function computed by {m}.
The above process is “computable” in the following precise sense we
define Kleene’s
T
function
T
(
m, i, t
) :
N
3
N
whose value encodes (in a
primitive recursive way) the states and register contents during the first
t
steps in the evaluation of
{m}
(
i
). Then
T
is a primitive recursive function.
For any partial recursive function
f
, there is a machine
{m}
such that
{m}(x) = f(x) for all x N. In particular {m}(x) always halts.
Now of course, given our requirements for what a “machine” should be, it is
absolutely obvious that functions computable by programs in the above sense is
exactly the same as the functions that are partial recursive. The only perhaps
slightly non-trivial part is to show that every function computed by a machine
is partial recursive. To prove this, given a machine
{m}
, inversion gives us the
function
g(i) = min{t : T(m, i, t) has final state HALT},
and then we can define
h
(
i
) by computing
T
(
m, i, g
(
i
)) and then extracting the
register content in the final state, which we assumed is a primitive recursive
process. Then h(i) = {m}(i), and so {m} is a partial recursive function.
3.2 Decidable and semi-decidable sets
So we’ve managed to come up with a notion of computability of a function. From
now on, we will assume anything that is “intuitively computable” is actually
computable.
We now want to try to talk about “computable subsets” of
N
. There are two
possible things we might be interested in.
Definition
(Decidable set)
.
A subset
X N
is decidable if there is a total
computable function N N such that
f(n) =
(
1 n X
0 n 6∈ X
.
This is a rather strong notion. We can always tell, in finite time, whether
an element is in
X
. Often, a weaker notion is desired, namely semi-decidability.
Informally, a subset
X N
is semi-decidable if upon given some
n X
, if
n X
,
then we can learn this in finite time. If
n 6∈ X
, we might never figure this is the
case.
There are several ways to define semi-decidable sets properly.
Definition
(Semi-decidable set)
.
We say a subset
X N
is semi-decidable if it
satisfies one of the following equivalent definitions:
(i) X is the image of some partial computable function f : N N.
(ii) X is the image of some total computable function f : N N.
(iii) There is some partial computable function f : N N such that
X = {n N : f(n) ↓}
(iv) The function χ
X
: N {0} given by
χ
X
=
(
0 n X
n 6∈ X
is computable.
There are obvious generalizations of these definitions to subsets of
N
k
for any
k
. It is an easy, and also important, exercise to prove that these are equivalent.
We can prove another equivalent characterization of semi-decidable sets.
Proposition.
A set
X N
k
is semi-decidable iff it is a projection of a decidable
subset of N
k+1
.
Proof.
(
) Let
Y N
k+1
be such that
proj
k
Y
=
X
, where
proj
k
here denotes
the projection to the first
k
factors. Since
Y
is decidable, there is a computable
function
f
:
N N
k+1
such that
im f
=
Y
. Then
im
(
proj
k
f
) =
X
. So
X
is
the image of a computable function.
(
) Suppose
X
is semi-decidable. So
X
=
dom
(
{m}
) for some
m N
, i.e.
X = {n : {m}(n) ↓}. Then we can pick
Y = {(x, t) | {m}(x) halts in t steps}.
A crucial example of a semi-decidable set is the halting set.
Definition (Halting set). The halting set is
{hp, ii : {p}(i) ↓} N
2
.
Some people prefer to define it as
{m : {m}(m) ↓} N
instead.
These two definitions are “the same” in the sense that if we are given some
magic “oracle” that determines membership of one of these sets, then we can
use it to build a program that determines the other. (Exercise)
Theorem (Turing). The halting set is semi-decidable but not decidable.
The proof is just some version of Liar’s paradox.
Proof.
Suppose not, and
M
is a machine with two inputs such that for all
p, i
,
we have
M(p, i) =
(
yes {p}(i)
no {p}(i)
.
If there were such a machine, then we could do some “wrapping” if the output
is “yes”, we intercept this, and pretend we are still running. If the output is
“no”, then we halt. Call this
M
0
. From this, we construct
M
00
(
n
) =
M
0
(
n, n
).
Suppose this machine is coded by m.
Now does
{m}
(
m
) halt? Suppose it does. Then
M
(
m, m
) =
yes
, and hence
M
0
(
m, m
) does not halt. This means
m
(
m
) doesn’t halt, which is a contradiction.
Conversely, if
m
(
m
) does not halt, then
M
0
(
m, m
) says
no
. Thus,
m
(
m
)
halts. This is again a contradiction!
So M cannot exist.
So the problem of deciding whether a program halts is not decidable. In fact,
we can prove something much stronger any non-trivial question you can ask
about the behaviour of the function represented by a program is undecidable.
To prove this, we need to go through a few equally remarkable theorems.
Theorem
(
smn
theorem)
.
There is a total computable function
s
of two vari-
ables such that for all e, we have
{e}(b, a) = {s(e, b)}(a).
Similarly, we can find such an s for any tuples b and a.
This is called the
smn
theorem because the function is called
s
, and
m
and
n usually refers to the length of the tuples b and a.
Computer scientists call this process “currying”.
Proof. We can certainly write a program that does this.
Theorem
(Fixed point theorem)
.
Let
h
:
N N
be total computable. Then
there is an n N such that {n} = {h(n)} (as functions).
Proof. Consider the map
he, xi 7→ {h(s(e, e))}(x).
This is clearly computable, and is computed by a machine numbered
a
, say. We
then pick n = s(a, a). Then we have
{n}(x) = {s(a, a)}(x) = {a}(a, x) = {h(s(a, a))}(x) = {h(n)}(x).
This is a piece of black magic. When we do lambda calculus later, we will see
that this is an example of the
Y
combinator, which is what is used to produce
fixed points in general (and is also black magic).
Theorem
(Rice’s theorem)
.
Let
A
be a non-empty proper subset of the set of
all computable functions N N. Then {n : {n} A} is not decidable.
This says it is impossible to decide what a program does based on what its
code is.
Proof.
We fix an
A
. Suppose not. We let
χ
be the (total) characteristic function
of
{n
:
{n} A}
. By assumption,
χ
is computable. We find naturals
a, b
such
that
{a} A
and
{b} 6∈ A
. Then by the hypothesis, the following is computable:
g(n) =
(
b {n} A
a otherwise
.
The key point is that this is the wrong way round. We return something in
A
if the graph of
{n}
is not in
A
. Now by the fixed point theorem, there is some
n N such that
{n} = {g(n)}.
We now ask ourselves — do we have
{n} A
? If so, then we also have
{g
(
n
)
} A
.
But these separately imply
g
(
n
) =
b
and
g
(
g
(
n
)) =
b
respectively. This implies
g(b) = b, which is not possible.
Similarly, if
{n} 6∈ A
, then
{g
(
n
)
} 6∈ A
. These again separately imply
g
(
n
) =
a
and g(g(n)) = a. So we find g(a) = a, which is again a contradiction.
Corollary. It is impossible to grade programming homework.
3.3 Computability elsewhere
In all of mathematics, whenever the objects we talk about are countable, we can
try to sneak in computability theory. Examples include, for example, Ramsey
theory, or logic.
3.4 Logic
When we do logic, our theorems, proofs, rules etc. are all encoded as some finite
strings, which can certainly be encoded as numbers. Even though our theories
often have infinitely many axioms, the set of axioms is also decidable. It follows
from this that the set of theorems in our theory is semi-decidable, as to know if
something is a theorem, we can run through all possible strings and see if they
encode a proof of the statement. This also works if our set of axioms is just
semi-decidable, but we will need to use some diagonalization argument.
We start with a rather cute observation.
Theorem
(Craig’s theorem)
.
Every first-order theory with a semi-decidable set
of axioms has a decidable set of axioms.
Proof.
By assumption, there is a total computable function
f
such that the
axioms of the theory are exactly
{f
(
n
) :
n N}
. We write the
n
th axiom as
ϕ
n
.
The idea is to give an alternative axiom that says the same thing as
ϕ
n
, but
from the form of the new axiom itself, we can deduce the value of
n
, and so we
can compute
f
(
n
) to see if they agree. There are many possible ways to do this.
For example, we can say that we add
n
many useless brackets around
ϕ
n
and
take it as the new axiom.
Alternatively, without silly bracketing, we can take the nth axiom to be
φ
n
=
^
i<n
ϕ
i
!
ϕ
n
.
Then given any statement
ψ
, we can just keep computing
f
(1)
, f
(2)
, · · ·
, and
then if
ψ
=
φ
n
, then we will figure in finite time. Otherwise, we will, in finite
time, see that
ψ
doesn’t look like
f
(1)
f
(2)
· · ·
, and thus deduce it is not an
axiom.
We can also do some other fun things. Suppose we take our favorite copy of
N
with the usual 0 and successor functions. We let
T
be the theory of all true
first-order statements in
N
. We call this the theory of true arithmetic. We add
a constant c to this theory, and add axioms to say
0 < c, 1 < c, 2 < c, · · · .
Then by compactness, this has a model, and by Lowenheim–Skolem, it has a
countable model, which is a non-standard model of
N
. We wlog assume the
carrier set of this model is in fact N. How bad can this be?
Theorem
(Tennenbaum’s theorem)
.
For any countable non-standard model of
true arithmetic, the graph of + and × cannot be decidable.
Interestingly, computability theory offers us an easy proof of odel’s incom-
pleteness theorem. We first note the following result:
Theorem.
The set of odel numbers of machines that compute total functions
is not semi-decidable.
Proof.
Suppose it were. Then there is a computable total function
f
:
N N
such that every computable total function is
{f
(
n
)
}
for some
n
. Now consider
the function
g(n) = {f(n)}(n) + 1.
This is a total computable function! But it is not
{f
(
n
)
}
for any
n
, because it
differs from {f(n)} at n.
Note that this proof is very explicit and constructive. If we are presented
with a function
{e}
:
N N
, then there is an algorithm that returns an
n
such
that {n} is total, and is not in the image of {e}.
Definition
(Productive set)
.
A set
X N
is productive if there is a total
computable function
f
:
N N
such that for all
n N
, we have
f
(
n
)
X \ (im{n}).
Then the theorem says the set of all machines that compute total functions
is productive.
In some sense, productive sets are “decidably non-semi-decidable”.
If we work in, say, ZFC, then there exists a fixed, canonical copy of
N
. We
note the following definition:
Definition
(Sound theory)
.
A sound theory of arithmetic is one all of whose
axioms are true in N.
This is not to be confused with consistency, which does not refer to any
model. For example, the theory of Peano arithmetic plus the statement that PA
is inconsistent is actually a consistent theory, but is unsound (if PA is consistent).
Theorem
(G¨odel’s incompleteness theorem)
.
Let
T
be a recursively axiomatized
theory of arithmetic, i.e. the set of axioms is semi-decidable (hence decidable).
Suppose
T
is sufficiently strong to talk about computability of functions (e.g.
Peano arithmetic is). Then there is some proposition true in
N
that cannot be
proven in T .
Proof.
We know the set of theorems of
T
is semi-decidable. So
{n
:
T `
{n} is a total function}
is semi-decidable. So there must be some total function
such that T does not prove it is total. In other words, T is not complete!
Ramsey theory
In Ramsey theory, we have the following famous result:
Theorem
(Ramsey’s theorem)
.
We write
N
(k)
for the set of all subsets of
N
of
size k. Suppose we partition N
(k)
in m many distinct pieces. Then there exists
some infinite
X N
such that
X
is monochromatic, i.e.
X
(k)
N
(k)
lie entirely
within a partition.
The natural thing for us to do is to insert the word “decidable” everywhere
in the theorem, and see if it is true. Note that a partition of
N
(k)
into
k
pieces
is equivalently a function
N
(k)
{
0
,
1
, · · · , k
1
}
, and it is not hard to encode
k
-subsets of
N
as natural numbers, e.g. by encoding them as an increasing
k
-tuple.
Thus, it makes sense for us to ask if the partition is decidable, then are we
guaranteed to have a decidable infinite monochromatic set?
One might expect not, because the proof of Ramsey’s theorem is rather
non-constructive. Indeed, we have the following theorem:
Theorem
(Jockusch)
.
There exists a decidable partition of
N
(3)
into two pieces
with no infinite decidable monochromatic set.
Proof.
We define a partition
ρ
:
N
(3)
{
0
,
1
}
as follows. Given
x < y < z
, we
define
ρ
(
{x, y, z}
) to be 0 if for any
p, d < x
, we have
{p}
(
d
) halts in
y
steps iff
{p}
(
d
) halts in
z
steps”, and 1 otherwise. This is a rather weird colouring, but
let’s see what it gives us.
We first note that there can be no monochromatic set coloured 1, as given
any
x
, for sufficiently large
y
and
z
, we must have
{p}
(
d
) halts in
y
steps iff
{p}(d) halts in z steps.
We claim that an infinite decidable monochromatic set
A
for 0 will solve the
halting problem. Indeed, if we want to know if
{p}
halts on
d
, then we pick
some
x A
such that
p, d < x
. Then pick some
y A
such that
y > x
. Then
by construction of
ρ
, if
{p}
(
d
) halts at all, then it must halt in
x
steps, and we
can just check this.
Thus, there cannot be an infinite decidable monochromatic set for this
partition.
But what about partitioning
N
(2)
? The same trick doesn’t work, and the
situation is quite complex. We will not talk about this.
3.5 Computability by λ-calculus
We first introduced
λ
-calculus in the setting of decorating natural deduction
proofs. Remarkably, it turns out
λ
-calculus can be used to encode any program
at all! For the sake of concreteness, we will properly define λ calculus.
Definition (λ terms). The set Λ of λ terms is defined recursively as follows:
If x is any variable, then x Λ.
If x is a variable and g Λ, then λx. g Λ.
If f, g Λ, then (f g) Λ.
As always, we use variables sensibly, so that we don’t write things like
x
(
λx. fx
). We also write
f
(
x
) to denote a
λ
term that has a free variable
x
,
and then f(y) would mean taking f and replacing all occurrences of x with y.
We will often be lazy and write λxy.f instead of λx. λy. f .
We can define free variables and bound variables in the obvious way. If we
want to be really formal, we can define them as follows:
Definition (Free and bound variables).
In the λ term x, we say x is a free variable.
In the λ term λx. g, the free variables are all free variables of g except x.
In the
λ
term (
f g
), the free variables is the union of the free variables of
f and those of g,
The variables that are not free are said to be bound.
The definitions become more subtle when we have, say, a variable that is
free in
f
and bound in
g
, but this is not a course on pedantry. We will just not
worry about these, and keep ourself disciplined and not wrote things like that.
We will gloss over subtleties like this in our upcoming definitions as well.
As mentioned previously, we want to think of
λx. f
(
x
) as some sort of
function. So we want things like
(λx. f(x)) y = f (y)
to be true. Of course, they are not actually equal as expressions, but they are
more-or-less equivalent.
Definition
(
α
-equivalence)
.
We say two
λ
terms are
α
-equivalent if they are
the same up to renaming of bound variables.
Example. λx. x and λy. y are α-equivalent.
For any sane purposes at all in
λ
calculus,
α
-equivalent terms are really
considered equal.
Definition
(
β
-reduction)
.
If
f
= (
λx. y
(
x
))
z
and
g
=
y
(
z
), then we say
g
is
obtained from f via β-reduction.
This captures the idea that λ expressions are really functions.
Definition
(
η
-reduction)
. η
-conversion is the conversion from
λx.
(
f x
) to
f
,
whenever x is not free in f .
This captures “function extensionality”, i.e. the idea that functions are the
same iff they give the same results for all inputs.
Definition
(
β
-normal form)
.
We say a term is in
β
-normal form if it has no
possible β-reduction.
Our model of computation with
λ
calculus is as follows. A program is a
λ
expression
f
, and an input is another
λ
expression
x
. To run
f
on
x
, we take
f x
, and then try to
β
-reduce it as far as possible. If it reaches a
β
-normal form,
then we are done, and say that is the output. If we never get to a
β
-normal
form, then we say we don’t terminate.
To do so, we need a reduction strategy. It is possible that a
λ
expression can
be β-reduced in many ways, e.g. if we have
(λx. (λy. f y x) x) z,
then we can reduce this to either (
λy. f y z
)
z
or (
λx f x x
)
z
. These are not
α
-equivalent, but of course performing one more
β
-reduction will yield the same
terms. A reduction strategy is a prescription of which
β
-reduction we should
perform when there is more than one choice.
There are some things that might worry us. The first is that if we reduce in
different ways, then we might end up in different normal forms. Even worse, it
could be that one reduction strategy would not terminate, while another might
halt in 3 steps!
For this to work, we need the following theorem:
Theorem.
Every
λ
expression can be reduced to at most one
β
-normal form.
Moreover, there exists a reduction strategy such that whenever a
λ
expression
can be reduced to a
β
-normal form, then it will be reduced to it via this reduction
strategy.
This magic reduction strategy is just to always perform
β
-reduction on the
leftmost thing possible.
Notation ( ). We write f g if f β-reduces to g.
There is another thing we should note, which is currying. If we want to
implement a function with, say, 2 inputs, say a function
f
:
N × N N
, we can
instead implement it as a function
˜
f : N (N N), where
˜
f(x)(y) = f(x, y).
This allows us to always work with functions of one variable that can potentially
return functions. We will usually just write the type as N N N.
Note that in this setting, officially, our
λ
terms do not have “types”. However,
it is often convenient to imagine they have types to make things less confusing,
and we will often do so.
With the formalities out of the way, we should start doing things. We claimed
that we can do programming with
λ
terms. The first thing we might want to
try is to encode numbers. How can we do so?
Recalling that types don’t exist in
λ
calculus, consider a “generic type”
A
.
We will imagine that natural numbers take in “functions”
A A
and returns a
new function A A. We write
N = (A A) (A A)
for the “type” of natural numbers. The idea is that
n
represents the
λ
term that
sends f to the nth iteration f
n
.
Formally, we define them as follows:
Definition (Church numerals). We define
0 = K(id) = λf. λx. x : N
succ = λn. λf. λx. f ((n f) x) : N N
We write n = succ
n
(0).
We can define arithmetic as follows:
plus = λn. λm. λf. λx. (n f) ((m f) x) : N N N
mult = λn. λm. λf. λx. (n (m f)) x : N N N
exp = λn. λm. λf. λx. ((n m) f)) x : N N N
Here exp n m = n
m
.
Of course, our “typing” doesn’t really make sense, but helps us figure out
what we are trying to say.
Certain people might prefer to write mult instead as
mult = λn. λm. λf. n (m f).
We can define a lot of other things. It is possible to give motivation for these,
and if one thinks hard enough, these are “obviously” the right definitions to
make. However, we are not really concerned about this, because what we really
want to talk about is recursion. We can define
true = λxy. x
false = λxy. y
if b then x else y = λbxy. b x y
iszero = λn. n (λx. false) true
pair = λxyf. f x y
fst = λp. p true
snd = λp. p false
nil = λx. true
null = λp. p (λxy. false)
We can check that
fst (pair x y) = x
snd (pair x y) = y
Using
pair
,
fst
and
snd
, we can implement lists just as pairs of pair, with functions
head, tail, cons satisfying
cons (head x) (tail x) = x.
We by convention always end our lists with a
nil
, and
null
is a test of whether
something is nil.
Exercise.
Implement the predecessor function that sends
n
to (
n
1), if
n >
1,
and 0 otherwise. Also implement
and
,
or
, and test of equality for two general
church numerals.
We want to show that we can in fact represent all primitive recursive functions
with
λ
-terms. We will in fact do something stronger we will implement
general recursion. This is a version of recursion that doesn’t always guarantee
termination, but will be useful if we want to implement “infinite structures”.
We look at an example. We take our favorite example fact : N N
fact n = if iszero n then 1 else n · fact (n 1).
But this is not a valid definition, since it is circular, and doesn’t really give us a
λ
expression. The trick is to introduce a new function
metafact
: (
N N
)
(N N), given by
metafact f n = if n = 0 then 1 else n · f (n 1)
This is a genuinely legitimate definition of a function. Now suppose
f
is a fixed
point of metafact, i.e. f = metafact f . Then we must have
f(n) = (metafact f)(n) = if n = 0 then 1 else n · (f (n 1)).
So f is in fact the factorial function!
Thus, we have reduced the problem of recursion to the problem of computing
fixed points. If we have a general fixed point operator, then we can implement
all recursive functions.
Definition (Y-combinator). The Y-combinator is
Y = λf.
h
(λx.f(x x))(λx.f(x x))
i
.
Theorem. For any g, we have
Y g g (Y g).
Proof.
Y g λf.
h
(λx.f(x x))(λx.f(x x))
i
g
(λx.(g(x x)))(λx.g(x x))
g
(λx.(g(x x)))(λx.g(x x))
We also need to implement the minimization operator. To compute
f
1
(17),
our plan is as follows:
(i) Produce an “infinite list” (stream) of natural numbers.
(ii) Apply a function f to every element of the stream.
(iii) Find the first element in the stream that is sent to 17.
Note that infinite lists are genuinely allowed in
λ
-calculus. These behave like
lists, but they never have an ending
nil
element. We can always extract more
and more terms out of it.
We do them in a slightly different order
(ii)
We implement a function
map
that takes in a list or stream, and applies a
function f to every element of the list. We simply define it as
map f x = if null x then x else cons (f (head x)) (map f (tail x)).
Of course, we need to use our Y combinator to actually achieve this.
(i)
We can produce a stream of natural numbers by asking for the fixed point
of
λ. cons 0(map succ ).
(iii)
We will actually apply
map
to
λn. pair n
(
f n
), so that we remember the
index. We then use the function
find n x = if snd (head x) = n then fst (head x) else find n (tail x).
3.6 Reducibility
We’ve alluded to the notion of reducibility several times previously. For example,
we had two versions of the halting set, and said that if we could decide membership
of one of them, then we can decide membership of the other. When we proved that
there is a decidable partition of
N
(3)
with no infinite decidable monochromatic
set, we said if we had such a thing, then we could use it to solve the halting
problem.
The general idea is that we can often “reduce” one problem to another one.
The most basic way of expressing this is via many-to-one reducibility.
Definition
(Many-to-one reducibility)
.
Let
A, B N
. We write
B
m
A
if
there exists a total computable functions
f
:
N N
such that for all
n N
, we
have n B f(n) A.
Thus, if we have such a set, then whenever we can correctly answer questions
about membership in
A
, then we can answer questions about membership in
B
.
It is easy to see that this is a quasi-order. If we take the intersection
m
m
of the relations, then we get an equivalence relation. The equivalence classes are
called many-one degrees. We think of these as “degree of unsolvability”.
Proposition. Let A N. Then A
m
K iff A is semi-decidable.
This is a technical exercise. Another useful fact is the following:
Proposition. (N \ K)
M
A iff A is productive.
But it turns out this is a rather awkward definition of “relative computability”.
In our definition of many-one reducibility, we are only allowed to run
f
once,
and see if the answer is in A. But why once?
The idea is define reducibility semantically. We assume that when writing
programs, we can do all the usual stuff, but also have access to a magic function
f
, and we are allowed to call
f
as many times as we wish and use the results.
We call
f
an oracle. For our purposes,
f
is the (total) characteristic function of
A.
For technical reasons, we will suppose we have a fixed programming language
that just refers to an opaque function
f
, and then we can later plug different
things as
f
. This allows us to compare odel numberings for functions with
different oracles.
Definition
(Turing reducibility)
.
We say
B
T
A
, or simply
B A
, if it is
possible to determine membership of
B
whenever we have access to an oracle
that computes χ
A
.
Again
is a quasi-order, and intersecting with the reverse, we get an
equivalence relation, which we call Turing degree, or just degree.
Now the quasi-order gives us a relation on Turing degrees. A natural question
to ask is if this is actually a total order.
To begin with, Turing degrees certainly form an upper semi-lattice, as given
A
and
B
, we can form the disjoint union of
A
and
B
, and knowing about
membership of
A
`
B
means we know membership of
A
and
B
, and vice versa.
But are they linearly ordered by ? The answer is no!
Theorem
(Friedberg–Muchnik)
.
There exists two
A, B N
such that
A 6≤ B 6≤
A. Moreover, A and B are both semi-decidable.
Proof. We will obtain the sets A and B as
A =
[
n<ω
A
n
, B =
[
n<ω
B
n
,
where
A
n
and
B
n
are finite (and in particular decidable) and nested, i.e.
i < j
implies A
i
A
j
and B
i
B
j
.
We introduce a bit of notation. As mentioned, if we allow our programs to
access the oracle
B
, then our “programming language” will allow consultation of
B
. Then in this new language, we can again assign odel numbers to programs,
and we write
{e}
B
for the program whose G¨odel number is
e
. Instead of inventing
a new language for each
B
, we invent a language that allows calling an “oracle”,
without specifying what it is. We can then plug in different sets and run.
Our objective is to pick A, B such that
χ
A
6= {e}
B
χ
B
6= {e}
A
for all e. Here we are taking χ
A
to be the total version, i.e.
χ
A
(x) =
(
1 x A
0 x 6∈ A
.
The idea is, of course, to choose
A
m
such that it prevents
χ
A
from being
{m}
B
,
and vice versa. But this is tricky, because when we try to pick
A
m
, we don’t
know the whole of B yet.
It helps to spell out more explicitly what we want to achieve. We want to
find some n
i
, m
i
N such that for each i, we have
χ
A
(n
(i)
) 6= {i}
B
(n
(i)
)
χ
B
(m
(i)
) 6= {i}
A
(m
(i)
)
These n
i
and m
i
are the witness to the functions being not equal.
We now begin by producing infinite lists
N
i
= {n
(i)
1
, n
(i)
2
, · · · }
M
i
= {m
(i)
1
, m
(i)
2
, · · · }
of “candidate witnesses”. We will assume all of these are disjoint. For reasons
that will become clear later, we assign some “priorities” to these sets, say
N
1
> M
1
> N
2
> M
2
> N
3
> N
3
> · · · .
We begin by picking
A
0
= B
0
= .
Suppose at the
t
th iteration of the process, we have managed to produce
A
t1
and
B
t1
. We now look at
N
1
, . . . , N
t
and
M
1
, · · · , M
t
. If they have managed to
found a witness, then we leave them alone. Otherwise, suppose
N
i
hasn’t found
a witness yet. Then we run
{i}
B
t1
(
n
(i)
1
) for
t
many time steps. We consider
the various possibilities:
Let
n
be the first remaining element of
N
i
. If
{i}
B
t1
(
n
) halts within
t
steps and returns 0, then we put
n
into
A
t
, and then pick this as the
desired witness. We now look at all members of
B
t1
our computation of
{i}
B
t1
has queried. We remove all of these from all sets of lower priority
than N
i
.
Otherwise, do nothing, and move on with life.
Now the problem, of course, is that whenever we add things into
A
t
or
B
t
, it
might have changed the results of some previous computations. Suppose we
originally found that
{
10
}
B
4
(32) = 0, and therefore put 32 into
A
5
as a witness.
But later, we found that
{
3
}
A
23
(70) = 0, and so we put 70 into
B
24
. But if the
computation of
{
10
}
B
4
(32) relied on the fact that 70
6∈ B
4
, then we might have
{10}
B
24
(32) 6= 0,
and now our witness is “injured”. When this happens, we forget the fact that we
picked 32 as a witness, and then pretend
N
10
hasn’t managed to find a witness
after all.
Fortunately, this can only happen because
M
3
is of higher priority than
N
10
,
and thus 70 was not forbidden from being a witness of
M
3
. Since there are only
finitely many things of higher priorities, our witness can be injured only finitely
many times, and we will eventually be stabilized.
We are almost done. After all this process, we take the union and get
A
and
B
. If, say,
{i}
A
has a witness, then we are happy. What if it does not?
Suppose
{i}
A
is some characteristic function, and
m
is the first element in the
list of witnesses. Then since the lists of candidate witnesses are disjoint, we
know
m 6∈ B
. So it suffices to show that
{i}
A
(
m
)
6
= 0. But if
{i}
A
(
m
) = 0,
then since this computation only involves finitely many values of
A
, eventually,
the membership of these things in
A
would have stabilized. So we would have
{i}
A
(m) = 0 long ago, and made m a witness.
4 Well-quasi-orderings
We end by a discussion on well-quasi-orders. The beginning of all this is the
idea of well-foundedness. We begin by recalling the definition of a quasi-order.
Definition
(Quasi-order)
.
An order
hX, ≤i
is a quasi-order if it is transitive
and reflexive, i.e. for all a, b, c X,
If a b and b c, then a c.
We always have a a.
Note that unlike a poset, we do not require reflexivity, i.e. we can have distinct
a, b
satisfying
a b
and
b a
. There is a canonical way of turning a quasi-order
into a partial order, namely by actually identifying elements satisfying
a b
and
b a
, and so it appears we don’t lose much by restricting to partial orders.
However, it turns out a lot of the orders we construct are naturally quasi-orders,
not partial orders, and the extra hypothesis of being a partial order is not really
useful. So we will talk about quasi-orders.
The main idea behind the whole chapter is the notion of well-foundedness.
We probably first met this concept in set theory, where the axiom of foundation
said sets are well-founded. Chances are, it was motivated to get rid of scenarios
such as
x x
. However, as we do set theory, we figure that the real benefit it
brings us is the ability to do ε-induction.
In general, for an arbitrary relation
R
on
X
, we want to admit the following
R-induction principle:
xX
((
yX
R(y, x) ϕ(y)) ϕ(x))
xX
ϕ(x)
How can this fail? Suppose the top line holds, but the bottom line does not. We
set
A = {x : ¬ϕ(x)}.
Then we know this set is non-empty. Moreover, the top line says
xA
yA
R(y, x).
Thus, if we forbid such sets from existing, then we can admit the induction
principle.
Definition
(Well-founded relation)
.
A relation
R
on
X
is said to be well-founded
if for all
A X
non-empty, there is some
x A
such that for any
y A
, we
have ¬R(y, x).
There is another way of expressing the well-founded relation, if we believe in
the axiom of dependent choice.
Axiom
(Axiom of dependent choice)
.
Let
X
be a set an
R
a relation on
X
.
The axiom of dependent choice says if for all
x X
, there exists
y X
such
that
R
(
x, y
), then we can find a sequence
x
1
, x
2
, · · ·
such that
R
(
x
i
, x
i+1
) for
all i N.
This is a rather weak form of the axiom of choice, which we will freely assume.
Assuming this, then being well-founded is equivalent to the non-existing of
infinite decreasing R-chains.
Why do we care about well-foundedness? There are many reasons, and one we
can provide comes from computer science. In general, it guarantees termination.
When writing programs, we often find ourselves writing loops like
while ( some condition holds ) {
do something;
}
This repeats the “do something” until the condition fails. In general, there is no
reason why such a program should eventually terminate.
Often, the loop is actually of the form
while ( i != BOT ) {
do something with i;
}
where
i
takes value in some partial order
R
, with
BOT
being the bottom element.
Suppose every time the loop runs, the value of
i
decreases. Then if
R
is well-
founded, then we are guaranteed that the program will indeed terminate. More
generally, if
R
doesn’t have a unique bottom element, but is still well-founded,
then loops of the form
while ( i is not a minimal element ) {
do something with i;
}
are still guaranteed to terminate.
Example.
Consider typed
λ
calculus, which is the form of
λ
calculus where we
require each term to have a type, and all terms have to be typed, just as we did
when we used it to decorate natural deduction proofs.
We let
X
be the set of all typed
λ
terms up to
α
-equivalence, and for
f, g X
,
we write
f g
if
g
can be
β
-reduced to
f
. It is a remarkable theorem that
X
is well-founded. In other words, we can keep applying
β
-reduction in any
way we like, and we are guaranteed to always end up in a normal form. It is
also true that the normal form of a
λ
term is unique, but this is not implied by
well-foundedness.
It is clear from definition that a quasi-order is never well-founded, since it
is reflexive. It is also clear that it fails for a silly reason, and we can fix it by
defining
Definition
(Well-ordered quasi-order)
.
A quasi-order is well-ordered if there is
no strictly decreasing infinite sequence.
Here we are not worried about choice issues anymore.
Once we are convinced that being well-founded is a good thing, it is natu-
ral ask what operations preserve well-foundedness. We begin by listing some
constructions we can do.
Definition
(
P
(
X
))
.
Let
hX,
X
i
be a quasi-order. We define a quasi-order
+
X
on P(X) by
X
1
+
X
X
2
if
x
1
X
1
x
2
X
2
(x
1
X
x
2
).
Definition
(
X
)
.
Let
hX,
X
i
be a quasi-order. We let
X
be the set of all
finite lists (i.e. sequences) in
X
. We define a quasi-order on
X
recursively by
nil
1
, where nil is the empty list.
tail(
1
)
1
If tail(
1
) tail(
2
) and head(
1
)
X
head(
2
), then
1
2
.
for all lists
1
,
2
.
Equivalently, sequences
{x
i
}
n
i=1
s
{y
i
}
`
i=1
if there is a subsequence
{y
i
k
}
n
k=1
of {y
i
} such that x
k
y
i
k
for all k.
Finally, the construction we are really interested in is trees.
Definition
(Tree)
.
Let
hX,
X
i
be a quasi-order. The set of all trees (in
X
) is
defined inductively as follows:
If
x X
and
L
is a list of trees, then (
x, L
) is a tree. We call
x
a node of
the tree. In particular (x, nil) is a tree. We write
root(x, L) = x, children(x, L) = L.
Haskell programmers would define this by
data Tree a = Branch a [Tree a]
We write Trees(X) for the set of all trees on X.
We define an order relation
s
on
Trees
(
X
) as follows let
T
1
, T
2
Trees
(
X
).
Then T
1
T
2
if
(i) T
1
T
0
for some T
0
children(T
2
).
(ii) root(T
1
) root(T
2
) and children(T
1
) children(T
2
) as lists.
It is an exercise to think hard and convince yourself this recursively-defined
relation is indeed well-defined.
Which of these operations preserve well-foundedness? It is not hard to see
the following:
Lemma.
If
hX,
X
i
is a well-founded quasi-order, then so is
X
and
Trees
(
X
).
The reason why this works is that the lifts of the order on
X
to
X
and
Trees
(
X
) are “of finite character”. To compare two elements in
X
or
Trees
(
X
),
we only have to appeal to finitely many instances of
X
.
On the other hand,
hP
(
X
)
,
+
X
i
is in general not well-founded. For example,
hN,
=
i
is a well-founded quasi-order, being the reflexive transitive closure of the
trivial relation. But
hP
(
N
)
,
=
+
i
has an infinite decreasing sequence. Indeed, =
+
is just the subset relation.
Definition
(Well-quasi-order)
.
A well-quasi-order (WQO) is a well-founded
quasi-order hX,
X
i such that hP(X),
+
X
i is also well-founded.
Note that the power set of a well-quasi-order need not be a well-quasi-order.
The question we want to answer is whether the operations
X
and
Trees
(
X
)
preserve well-quasi-orders. To do so, we try to characterize well-quasi-orders in
a more concrete way.
When does a quasi-order fail to be a well quasi-order? Suppose
x
1
, x
2
, x
3
, · · ·
is an infinite antichain, i.e. a collection of mutually unrelated elements. Then as
above
h{x
i
: i > n} : n Ni
is an infinite descending sequence in
hP
(
X
)
,
+
i
. Of course, to be well-founded,
we also need the non-existence of infinite decreasing sequences.
Proposition. Let hX, be a quasi-order. Then the following are equivalent:
(i) hX, ≤i is a well-quasi-order.
(ii) There is no infinite decreasing sequence and no infinite anti-chain.
(iii)
Whenever we have any sequence
x
i
X
whatsoever, we can find
i < j
such that x
i
x
j
.
In standard literature, (iii) is often taken as the definition of a well-quasi-order
instead.
Proof.
(i) (ii): We just showed this.
(iii)
(i): Suppose
X
1
, X
2
, X
3
, · · ·
is a strictly decreasing chain in
P
(
X
).
Then by definition, we can pick
x
i
X
i
such that
x
i
is not
anything in
X
i+1
. Now for any
j > i
, we know
x
i
6≤ x
j
, because
x
j
is
something in
X
i+1
. So we have found a sequence {x
i
} such that x
i
6≤ x
j
for all i < j.
(iii) (ii) is clear.
(ii)
(iii), we show that whenever
x
i
is a sequence without
i < j
such
that
x
i
x
j
, then it either contains an infinite anti-chain, or an infinite
descending chain.
To do so, we two-colour [N] by
X({i < j}) =
(
0 x
i
> x
j
1 otherwise
By Ramsey’s theorem, this has an infinite monochromatic set. If we have
an infinite monochromatic set of colour 0, then this is a descending chain.
Otherwise, if it is of colour 1, then together with our hypothesis, we know
{x
i
} is an anti-chain.
If we want to reason about well-founded relations, then the following notion
is often a useful one to have in mind:
Definition
(Well-founded part of a relation)
.
Let
hX, Ri
be a set with a relation.
Then the well-founded part of a relation is the
-least subset
A X
satisfying
the property
If x is such that all predecessors of x are in A, then x A.
So for example, any minimal element of
X
would be in the well-founded part.
One can prove that the well-founded part is indeed well-founded without much
difficulty.
Is there a good notion of the WQO-part of a quasi-order? It turns out not,
or at least there isn’t an obvious one that works. For our later discussion, it is
convenient to have the following definition:
Definition
(Bad sequence)
.
Let
hX, ≤i
be a well-founded quasi-order. A se-
quence {x
i
} is bad if for all i < j, we have f(i) 6≤ f(j).
Then a quasi-order is a WQO iff it has no bad sequences.
The idea is now to pick a bad sequence that is “minimal” in some sense, and
then looking at things below this minimal bad sequence will give is a well-quasi-
order. This “WQO-part” isn’t some object canonically associated to the order
hX, ≤i
, because it depends on the choice of the minimal bad sequence, but is
very useful in our upcoming proof.
Definition
(Minimal bad sequence)
.
Let
hX, ≤i
be a quasi-order. A minimal
bad sequence is a bad sequence
{x
i
}
such that for each
k N
,
x
k
is a
-minimal
element in
{x : there exists a bad sequence starting with x
1
, x
2
, · · · , x
k1
, x}.
Lemma.
Let
hX, ≤i
be a well-founded quasi-order that is not a WQO. Then it
has a minimal bad sequence.
Proof.
Just pick each
x
i
according to the requirement in the definition. Such an
x
i
can always be found because hX, ≤i is well-founded.
What makes minimal bad sequences useful is the following promised lemma:
Lemma
(Minimal bad sequence lemma)
.
Let
hX, ≤i
be a quasi-order and
B = {b
i
} a minimal bad sequence. Let
X
0
= {x X :
nN
x < b
n
}.
Then hX
0
, ≤i is a WQO.
Proof.
Suppose
{s
i
}
is a bad sequence in
X
0
. We prove by induction that nothing
in {s
i
} is below b
n
, which gives a contradiction.
Suppose
s
i
< b
0
for some
i
. Then
s
i
, s
i+1
, s
i+2
, · · ·
is a bad sequence, whose
first element is less than b
0
. This then contradicts the minimality of B.
For the induction step, suppose nothing in
S
is strictly less than
b
0
, · · · , b
n
.
Suppose, for contradiction, that s
i
< b
n+1
. Now consider the sequence
b
0
, · · · , b
n
, s
i
, s
i+1
, s
i+2
, · · · .
By minimality of
B
, we know this cannot be a bad sequence. This implies there
is some
n, m
such that the
n
th element of the sequence is less than the
m
th
element. But they can’t be both from the
{b
k
}
or both from then
{s
k
}
. This
implies there is some b
j
s
k
with j n and k i + 1.
Consider
s
k
. Since
s
k
X
0
, it must be
b
m
for some
m
. Moreover, by
induction hypothesis, we must have
m > n
. Then by transitivity, we have
b
j
b
m
, contradicting the fact that B is bad.
We need one more technical lemma.
Lemma
(Perfect subsequence lemma)
.
Let
hX, ≤i
be a WQO. Then every
sequence
{x
n
}
in
X
has a perfect subsequence, i.e. an infinite subset
A N
such
that for all i < j A, we have f (i) f(j).
By definition of well-quasi order, we can always find some
i < j
such that
f(i) f(j). This says we can find an infinite subset such that this works.
Proof. We apply Ramsey’s theorem. We two-colour [N]
2
by
c(i < j) =
(
red f(i) f (j)
blue f(i) 6≤ f(j)
Then Ramsey’s theorem gives us an infinite monochromatic set. But we cannot
have an infinite monochromatic blue set, as this will give a bad sequence. So
there is a monochromatic red set, which is a perfect subsequence.
We are now in a position to prove that lists over a WQO are WQO’s.
Lemma
(Higman’s lemma)
.
Let
hX,
X
i
be a WQO. Then
hX
,
s
i
is a
WQO.
Proof.
We already know that
X
is well-founded. Suppose it is not a WQO.
Then there is a minimal bad sequence {x
i
} of X-lists under
s
.
Now consider the sequence
{head
(
x
i
)
}
. By the perfect subsequence lemma,
after removing some elements in the sequence, we may wlog this is a perfect
sequence.
Now consider the sequence
{tail
(
x
i
)
}
. Now note that
tail
(
x
i
)
< x
i
for each
i
(here it is crucial that lists are finite). Thus, using the notation in the minimal
bad sequence lemma, we know tail(x
i
) X
0
for all i.
But
X
0
is a well quasi-order. So we can find some
i < j
such that
tail
(
x
i
)
tail
(
x
j
). But also
head
(
x
i
)
head
(
x
j
) by assumption. So
x
i
x
j
, and this is a
contradiction.
We can apply a similar juggle to prove Kruskal’s theorem. The twist is that
we will have to apply Higman’s lemma!
Theorem
(Kruskal’s theorem)
.
Let
hX,
X
i
be a WQO. Then
hTrees
(
X
)
,
s
i
is a WQO.
Proof.
We already know that
Trees
(
X
) is well-founded. Suppose it is not a
WQO. Then we can pick a minimal bad sequence {x
i
} of trees.
As before, we may wlog {root(x
i
)} is a perfect sequence. Consider
Y = {T : T children(x
i
) for some i}.
Then
Y X
0
. So
Y
is a WQO, and hence by Higman’s lemma,
Y
is a
WQO. So there exists some
i, j
such that
children
(
x
i
)
children
(
x
j
). But
by perfectness, we have
root
(
x
i
)
root
(
x
j
). So by definition of the ordering,
x
i
x
j
.
What good is this theorem? It is possible to use Kruskal’s theorem to prove
the termination of some algorithms, but this is not a computer science course, and
we will not go into that. Instead, we look at some proof-theoretic consequences
of Kruskal’s theorem.
We begin with the simplest possible WQO the WQO with one element
only and the only possible order. Then a tree over this WQO is just a tree with
no labels on each node, i.e. a tree.
Consider the statement
P (k) =
there exists some
n N
such that there is no finite bad
sequence T
1
, · · · , T
n
where T
i
has k + i nodes
.
This is a finitary version of the statement of Kruskal’s theorem. We claim that
P
(
k
) is true for all
k
. Suppose not. Then for each
n
, we find a bad sequence
T
n
1
, · · · , T
n
n
, where
T
n
i
is a tree with
k
+
i
nodes. We can put these trees in a
grid:
T
1
1
T
2
1
T
2
2
T
3
1
T
3
2
T
3
3
T
4
1
T
4
2
T
4
3
T
4
4
T
5
1
T
5
2
T
5
3
T
5
4
T
5
5
We look at the first column. Each tree has
k
+ 1 nodes. But there are only
finitely many such trees. So there is some tree that appears infinitely often. Pick
one, call it
T
1
, and throw away all other rows. Similarly, we can then find some
tree that appears infinitely often in the second row, call it
T
2
, and throw the
other rows away. We keep going. Then we find a sequence
T
1
, T
2
, T
3
, · · ·
such that every initial segment is a bad sequence. So this is a bad sequence,
which is impossible.
Proposition
(Friedman’s finite form)
.
For all
k N
, the statement
P
(
k
) is
true.
Why do we care about this? The statement
P
(
k
) is something we can write
out in Peano arithmetic, as it is finitary in nature. But our proof certainly
didn’t live in Peano arithmetic, as we had to talk about infinite sets. It turns
out for each
k N
, we can indeed prove
P
(
k
) in Peano arithmetic. However,
it is impossible to prove
k
P
(
k
)! This in particular means we cannot “prove
Kruskal’s theorem” in Peano arithmetic (whatever that might mean).
We can also relate this to ordinal analysis. Given a WQO
hX, ≤i
, we can
construct a well-founded structure from it. The obvious thing to try is to take
hP
(
X
)
,
+
i
, but this is not what we want. The problem is that talking about
power sets involves talking about infinite sets, and this is not something we know
how to do in Peano arithmetic.
Instead, we let
˜
X
be the set of all finite bad sequences in
X
, and order them
by
s t
if
s
is an end-extension of
t
. Note that this is “upside down”. Then by
definition of a WQO, this is a well-founded downward-branching tree whose top
element is the empty sequence. Then by taking the rank of this structure, we
obtain an ordinal.
Using Kruskal’s theorem, we can produce some really huge WQO’s, which in
turn gives us some really huge ordinals. So in some sense, being able to prove
Kruskal’s theorem means we can prove the well-foundedness of some really big
ordinal in our theory.
We end with some natural generalization of WQO’s. Suppose
hX, ≤i
is
a WQO. Then
hP
(
X
)
,
+
i
need not be a WQO. Suppose it is not, and let
X
1
, X
2
, X
3
, · · ·
be a bad sequence of subsets. Then for each
i < j
, we can find
some x
ij
X
i
such that x
ij
is not less than anything in X
j
.
Using dependent choice, we can find a function
f : {hi, ji : i < j N} X
such that for all
i < j < k
, we have
f
(
i, j
)
6≤ f
(
j, k
). This is a strengthening of
bad sequence, where the sequence is now indexed by pairs (i < j).
Definition (ω
2
-good). A well-ordering hX, ≤i is ω
2
-good if for any
f : {hi, ji : i < j N} X,
there is some i < j < k such that f(i, j) f(j, k).
There is an analogue of the perfect subsequence lemma for these, but there
is no analogous notion of a minimal bad sequence (or at least no obvious one).