III Logic (Full)

Part III — Logic

Based on lectures by T. E. Forster

Notes taken by Dexter Chua

Lent 2017

These notes are not endorsed by the lecturers, and I have modified them (often

significantly) after lectures. They are nowhere near accurate representations of what

was actually lectured, and in particular, all errors are almost surely mine.

This course is the sequel to the Part II courses in Set Theory and Logic and in Automata

and Formal Languages lectured in 2015-6. (It is already being referred to informally

as “Son of ST&L and Automata & Formal Languages”). Because of the advent of

that second course this Part III course no longer covers elementary computability in

the way that its predecessor (“Computability and Logic”) did, and this is reflected in

the change in title. It will say less about Set Theory than one would expect from a

course entitled ‘Logic’; this is because in Lent term Benedikt L¨owe will be lecturing a

course entitled ‘Topics in Set Theory’ and I do not wish to tread on his toes. Material

likely to be covered include: advanced topics in first-order logic (Natural Deduction,

Sequent Calculus, Cut-elimination, Interpolation, Skolemisation, Completeness and

Undecidability of First-Order Logic, Curry-Howard, Possible world semantics, G¨odel’s

Negative Interpretation, Generalised quantifiers. . . ); Advanced Computability (

representability of computable functions, Tennenbaum’s theorem, Friedberg-Muchnik,

Baker-Gill-Solovay. . . ); Model theory background (ultraproducts, Los’s theorem, ele-

mentary embeddings, omitting types, categoricity, saturation, Ehrenfeucht-Mostowski

theorem. . . ); Logical combinatorics (Paris-Harrington, WQO and BQO theory at least

as far as Kruskal’s theorem on wellquasiorderings of trees. . . ). This is a new syllabus

and may change in the coming months. It is entirely in order for students to contact

the lecturer for updates.

Pre-requisites

The obvious prerequisites from last year’s Part II are Professor Johnstone’s Set Theory

and Logic and Dr Chiodo’s Automata and Formal Languages, and I would like to

assume that everybody coming to my lectures is on top of all the material lectured

in those courses. This aspiration is less unreasonable than it may sound, since in

2016-7 both these courses are being lectured the term before this one, in Michaelmas;

indeed supervisions for Part III students attending them can be arranged if needed:

contact me or your director of studies. I am lecturing Part II Set Theory and Logic

and I am even going to be issuing a “Sheet 5” for Set Theory and Logic, of material

likely to be of interest to people who are thinking of pursuing this material at Part

III. Attending these two Part II courses in Michaelmas is a course of action that may

app eal particularly to students from outside Cambridge.

Contents

1 Proof theory and constructive logic

1.1 Natural deduction

1.2 Curry–Howard correspondence

1.3 Possible world semantics

1.4 Negative interpretation

1.5 Constructive mathematics

2 Model theory

2.1 Universal theories

2.2 Products

2.3 Ehrenfeucht–Mostowski theorem

2.4 The omitting type theorem

3 Computability theory

3.1 Computability

3.2 Decidable and semi-decidable sets

3.3 Computability elsewhere

3.4 Logic

3.5 Computability by λ-calculus

3.6 Reducibility

4 Well-quasi-orderings

1 Proof theory and constructive logic

1.1 Natural deduction

The first person to have the notion of “proof” as a mathematical notion was

probably G¨odel, and he needed this to write down the incompleteness theorem.

The notion of proof he had was a very unintuitive notion. It is not very easy to

manipulate, but they are easy to reason about.

In later years, people came up with more “natural” ways of defining proofs,

and they are called natural deduction. In the formalism we learnt in IID Logic

and Set Theory, we had three axioms only, and one rule of inference. In natural

deduction, we have many rules of deduction.

We write or rules in the following form:

A B

∧-int

A ∧ B

This says if we know

and

are true, then we can conclude

A ∧ B

. We call

the things above the line the premises, and those below the line the conclusions.

We can write out the other rules as follows:

A B

∧-int

A ∧ B

∧-elim

A ∧ B

∧-elim

∨-int

A ∨ B

∨-int

A ∨ B

A A → B

→-elim

Here we are separating these rules into two kinds — the first column is the

introduction rules. These tell us how we can introduce a

∧

∨

into our

conclusions. The second column is the elimination rules. These tell us how we

can eliminate the ∧ or → from our premises.

In general, we can think of these rules as “LEGO pieces”, and we can use

them to piece together to get “LEGO assembles”, i.e. proofs.

Example. For example, we might have a proof that looks like

∨-int

A ∨ B A ∨ B → C

→-elim

This corresponds to a proof that we can prove

from

and

A ∨ B → C

. Note

that sometimes we are lazy and don’t specify the rules we are using.

Instead of trying to formally describe how we can put these rules together

to form a proof, we will work through some examples as we go, and it should

become clear.

We see that we are missing some rules from the table, as there are no

introduction rule for → and elimination rule for ∨.

We work with

→

first. How we can prove

A → B

? To do so, we assume

and then try to prove

. If we can do so, then we have proved

A → B

. But we

cannot express this in the form of our previous rules. Instead what we want is

some “function” that takes proof trees to proof trees.

The actual rule is as follows: suppose we have derivation that looks like

This is a proof of C under the assumption A. The

→

-introduction rule says we

can take this and turn it into a proof of A → C.

A → C

This rule is not a LEGO piece. Instead, it is a magic wand that turns a LEGO

assembly into a LEGO assembly.

But we do not want magic wands in our proofs. We want to figure out some

more static way of writing this rule. We decided that it should look like this:

[A]

→-int

A → C

Here the brackets denotes that we have given up on our assumption

to obtain

the conclusion

A → C

. After doing so, we are no longer assuming

. When we

work with complicated proofs, it is easy to get lost where we are eliminating the

assumptions. So we would label them, and write this as, say

[A]

→-int (1)

A → C

Example. We can transform our previous example to say

[A]

∨-int

A ∨ B A ∨ B → C

→-elim

→-int (1)

A → C

Originally, we had a proof that

and

A ∨ B → C

proves

. Now what we have

is a proof that A ∨ B → C implies A → C.

Next, we need an elimination rule of

A ∨ B

. What should this be? Suppose

we proved both that

proves

, and

proves

. Then if we know

A ∨ B

, then

we know C must be true.

In other words, if we have

A ∨ B

then we can deduce C. We write this as

A ∨ B

[A]

[B]

∨-elim

There is an obvious generalization to many disjunctions:

∨ · · · ∨ A

]

∨-elim

How about when we have an empty disjunction? The empty disjunction is just

false. So this gives the rule

⊥

In other words, we can prove anything assume falsehood. This is known as ex

falso sequitur quolibet.

Note that we did not provide any rules for talking about negation. We do

not need to do so, because we just take ¬A to be A → ⊥.

So far, what we have described it constructive propositional logic. What is

missing? We cannot use our system to prove the law of excluded middle,

A ∨ ¬A

or the law of double negation

¬¬A → A

. It is not difficult to convince ourselves

that it is impossible to prove these using the laws we have described above.

To obtain classical propositional logic, we need just one more rule, which is

[A → ⊥]

⊥

If we add this, then we get classical propositional calculus, and it is a theorem

that any truth-table-tautology (i.e. propositions that are always true for all

possible values of

A, B, C

etc) can be proved using natural deduction with the

law of excluded middle.

Example. Suppose we want to prove

A → (B → C) → ((A → B) → (A → C))

How can we possibly prove this? The only way we can obtain this is to get

something of the form

[A → (B → C)]

(A → B) → (A → C)

→-int

A → (B → C) → ((A → B) → (A → C))

Now the only way we can get the second-to-last conclusion is

A → B [A]

A → C

and then further eliminating

A → B

gives us (

A → B

)

→

(

A → C

). At this

point we might see how we can patch them together to get a proof:

[A]

[A → (B → C)]

B → C

[A → B]

[A]

→-int (3)

A → C

→-int (2)

(A → B) → (A → C)

→-int (1)

A → (B → C) → ((A → B) → (A → C))

Note that when we did

→

-int (3), we consumed two copies of

in one go. This

is allowed, as an assumption doesn’t become false after we use it once.

However, some people study logical systems that do not allow it, and demand

that assumptions can only be used once. These are known as resource logics,

one prominent example of which is linear logic.

Example. Suppose we want to prove A → (B → A). We have a proof tree

[A]

[B]

→-int (2)

B → A

→-int (1)

A → (B → A)

How did we manage to do the step from

A B

? One can prove it as follows:

A B

∧-int

A ∧ B

∧-elim

but this is slightly unpleasant. It is redundant, and also breaks some nice

properties of natural deduction we are going to prove later, e.g. the subformula

property. So instead, we just put an additional weakening rule that just says we

can drop any assumptions we like at any time.

Example. Suppose we wanted to prove

A → (B ∨ C) → ((A → B) ∨ (A → C)).

This is indeed a truth-table tautology, as we can write out the truth table and

see this is always true.

If we try very hard, we will find out that we cannot prove it without using the

law of excluded middle. In fact, this is not valid in constructive logic. Intuitively,

the reason is that assuming

is true, which of

is true can depend on

why

is true, and this it is impossible to directly prove that either

A → B

true, or A → C is true.

Of course, this is true in classical logic.

Exercise. Prove the following:

– (P → Q) → ((Q → R) → (P → R))

– (A → C) → ((A ∧ B) → C)

– ((A ∨ B) → C) → (A → C)

– P → (¬P → Q)

– A → (A → A)

– (P ∨ Q) → (((P → R) ∧ (Q → S)) → R ∨ S)

– (P ∧ Q) → (((P → R) ∨ (Q → S)) → R ∨ S)

– A → ((((A → B) → B) → C) → C)

– ((((P → Q) → P ) → P ) → Q) → Q

The last two are hard.

Instead of thinking of natural deduction as an actual logical theory, we often

think of it as a “platform” that allows us to describe different logical theories,

which we can do by introducing other connectives.

Example.

For example, we might want to describe first-order logic. We can

give introduction rules for ∃ as follows:

ϕ(t)

∃-int

∃

ϕ(x)

Similarly, we can have an elimination rule for ∀ easily:

∀

ϕ(x)

∀-elim

ϕ(t)

The introduction rule for

∀

is a bit more complicated. The rule again looks

something like

ϕ(x)

∀-int

∀

ϕ(x)

but for such a proof to be valid, the

must be “arbitrary”. So we need an

additional condition on this rule that there is no free occurrence of x’s.

Finally, suppose we know

∃

(

). How can we use this to deduce some

Suppose we know that

(

) implies

, and also

does not depend on

. Then

just knowing that there exists some x such that ϕ(x), we know p must be true.

∃

ϕ(x)

∃-elim

Again we need x to be arbitrary in p.

Example.

We can also use natural deduction for “naive set theory”, which we

will later find to be inconsistent. We can give an ∈-introduction rule

ϕ(x)

∈-int

x ∈ {y : ϕ(y)}

and a similar elimination rule

x ∈ {y : ϕ(y)}

∈-elim

ϕ(x)

One cannot expect that we can write down whatever introduction and

elimination rule we like, and expect to get a sensible theory. For example, if we

introduce a connective & with the following silly rules:

&-int

A&B

&-elim

then this would immediately give us a theory that proves anything we like.

There are in fact two problems with our introduction and elimination rules.

The first is that once we use our

to deduce

, it is completely gone, and

we cannot recover anything about

from

. The second problem is that,

obviously, proving

doesn’t really involve

at all, but we can use

deduce something about B.

In general, to obtain a sensible natural deduction theory, we need harmony.

Definition

(Harmony)

We say rules for a connective $ are harmonious if

is the strongest assertion you can deduce from the assumptions in the rule of

$-introduction, and

is the weakest thing that implies the conclusion of the

$-elimination rule.

We will not make the notion more precise than this.

Example.

The introduction and elimination rules for

∈

we presented are cer-

tainly harmonious, because the conclusions and assumptions are in fact equiva-

lent.

One can also check that the introduction and elimination rules for

∧

and

∨

are harmonious.

Example (Russel’s paradox). In naive set theory, consider

R = {x : x ∈ x → ⊥}.

Then we can have the deductions

R → R

∈-elim

R ∈ R → ⊥

→-elim

⊥

We can then move on to use this to deduce that R ∈ R:

[R → R]

∈-elim

R ∈ R → ⊥

→-elim

⊥

→-int (1)

R ∈ R → ⊥

∈-int

R ∈ R

Now we have proved

R ∈ R

. But we previously had a proof that

R ∈ R

gives a

contradiction. So what we are going to do is to make two copies of this proof:

[R → R]

∈-elim

R ∈ R → ⊥

→-elim

⊥

→-int (1)

R ∈ R → ⊥

∈-int

R ∈ R

[R → R]

∈-elim

R ∈ R → ⊥

→-elim

⊥

→-int (2)

R ∈ R → ⊥

→-elim

⊥

So we showed that we have an inconsistent theory.

Note that we didn’t use a lot of “logical power” here. We didn’t use the

law of excluded middle, so this contradiction manifests itself even in the case

of constructive logic. Moreover, we didn’t really use many axioms of naive set

theory. We didn’t even need things like extensionality for this to work.

Now we notice that this proof is rather weird. We first had an

→

-introduction

(2), and then subsequently eliminated it immediately. In general, if we have a

derivation of the form

→-int

P → Q

→-elim

We can just move the second column to the top of the first to get

In general,

Definition

(Maximal formula)

We say a formula in a derivation is maximal iff

it is both the conclusion of an occurrence of an introduction rule, and the major

premiss of an occurrence of the elimination rule for the same connective.

Here it is important that we are talking about the major premise. In a

deduction

A → B A

→-elim

we call

A → B

the major premise, and

the minor premise. In the case of

∧

and ∨, the terms are symmetric, and we do not need such a distinction.

This distinction is necessary since a deduction of the form

(A → B) → C

[A]

→-int

A → B

→-elim

One can prove that any derivation in propositional logic can be converted to

one that does not contain maximal formulae. However, if we try to eliminate

the maximal formula in our proof of inconsistency of naive set theory, we will

find that we will end up with the same proof!

1.2 Curry–Howard correspondence

Return to the days where we did IID Logic and Set Theory. Instead of natural

deduction, we bootstrapped our system with two axioms K and S:

K: A → (B → A)

S: [A → (B → C)] → [(A → B) → (A → C)]

It is a fact that

{K, S}

axiomatizes the conditional (

→

) fragment of constructive

propositional logic. To axiomatize the conditional fragment of classical logic, we

need a new law, Peirce’s law :

((A → B) → A) → A.

In particular, we cannot prove Peirce’s law from K and S.

How can we go about proving that Peirce’s law is unprovable? The most

elementary way of doing so would be to try to allow for more truth values, such

that

and modus ponens hold, but Peirce’s law does not. It turns out three

truth values are sufficient. We consider the following truth table for →:

→ 1 2 3

1 1 2 3

2 1 1 3

3 1 1 1

Here we interpret 1 as “true”, 3 as “false”, and 2 as “maybe”.

By writing out the truth tables, one can verify that the formulas

and

always take truth value 1. Also, if

and

A → B

are both 1, then so is

. So

by induction on the structure of the proof, we know that anything deduced from

K and S and modus ponens will have truth value 1.

However, if we put

= 2 and

= 3, then ((

A → B

)

→ A

)

→ A

has value 2,

not 1. So it cannot possibly be deduced from K and S.

This is, of course, a very unsatisfactory answer. The truth table was just

pulled out of a hat, and we just checked to see that Peirce’s law doesn’t hold. It

doesn’t tell us where the truth table came from, and why we can’t prove Peirce’s

law.

We will answer the second question first using the Curry–Howard correspon-

dence. Then we will see how we can come up with this truth table using possible

world semantics in the next chapter.

Consider the piece of syntax

A → B.

So far, we have viewed this as a logical formula —

and

are formulae, and

→

denotes implication. However, in mathematics, there is another common use

of this notation — we can think of

and

as sets, and

A → B

is the set of

all functions from

. The idea of the Curry–Howard correspondence is to

view implication as functions between sets.

While this might seem like a bizarre thing to do, it actually makes sense.

Given a formula

, we can view

as a “set”, whose “elements” are the proofs

. Now if we know

A → B

, this means any proof of

becomes a proof of

. In other words, we have a function

A → B

. As we move on, we will see that

this idea of viewing a formula

as the set of all proofs of

gives a very close

correspondence between the logical connectives and set-theoretic constructions.

This is also known as propositions as types, and it is common to call

etc “types” instead of “sets”.

Under this correspondence, modus ponens has a very natural interpretation.

We can just think of the rule

A A → B

as function application! If we have an element

a ∈ A

, and an element

f ∈ A → B

then we have

(

)

∈ B

. Now it is awkward to write

f ∈ A → B

instead of

A → B

. So in fact, we will write

A → B

instead. Moreover, we will use

the colon instead of ∈ for all types. So we can write

x : A f : A → B

f(x) : B

Often, we will just write

instead of

(

). If we write

fxy

, then it means

(fx)y, but we will usually put brackets to indicate.

Example.

What is the interpretation of

under the Curry–Howard correspon-

dence? It corresponds to a function

A → (B → A).

Let’s try to think of a function that does this. Given an element

, we need

to find a function

B → A

. Since the only thing we know about is

, we could

return the constant function that always returns

. So for each

, we define

K(a) to be the constant function a.

This is indeed where the name

came from. It’s just that the inventor was

German, and so it is not C.

What about conjunction? To prove

A ∧ B

, we need to prove

and

. In

other words, an element of

A ∧ B

should be an element of the Cartesian product

A × B.

We can write the introduction rule as

x : A y : B

hx, yi : A ∧ B

Here we are using the angled brackets to denote ordered pairs. The elimination

rules are just given by the projection fst : A × B → A and snd : A × B → B:

x : A ∧ B

fst(x) : A

x : A ∧ B

snd(x) : B

Some people write fst and snd as π

and π

instead.

We are not going to talk about the rules for “or”, because they are horrendous.

But if we think hard about it, then

A ∨ B

corresponds to the disjoint union of

A and B.

So far, we only know how to decorate our LEGO pieces. When we assemble

them to get proofs, we can get more complicated tress. Suppose we had the tree

[A ∧ B]

∧-elim

[A ∧ B]

∧-elim

∧-int

B ∧ A

→-int (1)

(A ∧ B) → (B ∧ A)

We now try to decorate this tree:

x : [A ∧ B]

∧-elim

snd(x) : B

x : [A ∧ B]

∧-elim

fst(x) : A

∧-int

hsnd(x), fst(x)i : B ∧ A

→-int (1)

(A ∧ B) → (B ∧ A)

Note that both [

A ∧ B

]

are decorated using the same letter

. This is since

they are cancelled at the same time, so we are using the same instance of

A ∧ B

in the statement (A ∧ B) → (B ∧ A).

Now the remaining question to answer is how we are going to decorate the

→

-introduction. We want some notation that denotes the function that takes

and returns

hsnd

(

)

, fst

(

). More generally, if we have an expression

(

) in

we want a notation that denotes the function that takes in x and returns ϕ(x).

The convention is to denote this using a λ:

λx.ϕ(x).

These expressions we produce are known as λ expressions.

For example, in the previous case, the final result is

λx.hfst(x), snd(x)i : (A ∧ B) → (B ∧ A).

The general decoration rule for →-introduction is

x : [A]

ϕ(x) : C

→-int (1)

λx.ϕ(x) : A → C

Recall that previously we had a rule for getting rid of maximal formulae. Given

a tree of the form

→-int

P → Q

→-elim

We can just move the second column to the top of the first to get

This conversion corresponds to the conversion

(λx.ϕ(x))a ϕ(a),

which is just applying the function! This process is known as β-reduction.

Since there is

-reduction, there is, of course, also

-reduction, but this

is boring. It is just re-lettering

λx.φ

(

) to

λy.φ

(

). There will also be an

η-reduction, and we will talk about that later.

In the following exercise/example, note that we have a convention that if we

write A → B → C, then we mean A → (B → C).

Exercise. Find λ-expressions that encode proofs of

(i) ((A → B) ∧ (C → D)) → ((A ∧ C) → (B ∧ D))

(ii) (A → B) → (B → C) → (A → C)

(iii) ((A → B) → A) → (A → B) → B

(iv) (A → B → C) → (B → A → C)

(v) (A → B → C) → ((A ∧ B) → C)

(vi) (B ∧ A → C) → (A → B → C)

Note that statements of the form

A → B → C

are to be interpreted as

A →

(B → C).

Solutions.

(i) λf.λx.hfst(f)(fst(x)), snd(f)(snd(x))i

(ii) λf.λg.λa.g(fa)

(iii) λf.λg.g(fg)

(iv) λf.λb.λa.(f a)b

(v) λf.λx.f(fst x)(snd x)

(vi) λf.λa.λb.fhb, ai

One can write out the corresponding trees explicitly. For example, (iii) can be

done by

g : [A → B]

f : [(A → B) → A]

→-elim

fg : A g : [A → B]

→-elim

g(fg)

→-int (1)

λg.g(fg) : (A → B) → B

→-int (2)

λf.λg.g(fg) : (A → B) → A → (A → B) → B

Note that we always decorate assumptions with a single variable, say

, even

if they are very complicated. For example, if we have an assumption

A ∧ B

, it

might be tempting to decorate it as ha, bi, but we do not.

Now what we have is that for any proof tree, we obtain a

formula. But for

any formula, it can have many proofs, and so it can be given by many

formula.

Example. Consider

x : [A]

f : [A → A]

→-elim

fx : A

→-int (1)

λx.fx : A → A

→-int (2)

λf.λx.fx : (A → A) → (A → A)

This is really just the identity function!

We can also try something slightly more complicated.

x : [A]

f : [A → A]

fx : A f : [A → A]

f(fx) : A

→-int (1)

λx.f(f(x)) : A → A

→-int (2)

λf.λx.f(fx) : (A → A) → (A → A)

This is another

term for (

A → A

)

→

(

A → A

). This says given any

we return the function that does

twice. These are two genuinely different

functions, and thus these two

terms correspond to two different proofs of

(A → A) → (A → A).

Note that given any proof tree, we can produce some

expression representing

the proof rather systematically. However, given any

expression, it is difficult

to figure out what it is a proof of, or if it represents a proof at all. For example,

if we write down

λx.xx,

then we have no idea what this can possibly mean.

The problem is that we don’t know what formula, or type each variable is

representing. The expression

λf.λx.f(fx)

is confusing, because we need to guess that

is probably just some arbitrary

variable of type

, and that

is probably a function

A → A

. Instead, we can

make these types explicit:

λf

A→A

.λx

A→A

Given such a term, we can systematically deduce that it is of type (

A → A

)

→

(

A → A

), i.e. it encodes a proof of (

A → A

)

→

(

A → A

). Such an expression

with explicit types is known as a typed

term, and one without types is known

as an untyped λ term.

Moreover, it is also straightforward to reconstruct the proof tree from this

expression. Of course, we need to make sure the typing of the expression is valid.

For example, we cannot write

A→A

since

is not a function from

. Assuming we restrict to these valid terms, we

have established a correspondence

Proofs in propositional natural deduction are in a 1-to-1 correspondence

with typed λ terms.

That is, if we also make it clear what the rules for ∨ are.

Finally, what about

⊥

? To what set does the proposition

⊥

correspond? The

proposition

⊥

should have no proofs. So we would think

⊥

should correspond

to ∅.

In propositional calculus, the defining property of

⊥

is that we can prove any

proposition we like from

⊥

. Indeed, the empty set is a set that admits a function

to any set

whatsoever, and in fact this function can be uniformly defined for

all sets, by the empty function. We again see that the ideas of propositional

calculus correspond nicely with the properties of sets.

Example. Consider the law of excluded middle:

((A → ⊥) → ⊥) → A.

Can we write down a λ expressions for this?

We will argue rather informally. If there is a

expression of this type, then

it should define the function “uniformly” without regards to what

is about.

We now see what this function could be.

–

is empty, then

A → ⊥

contains exactly the empty function. So

(

A → ⊥

)

→ ⊥

is the empty set, since there is no function from a non-

empty set to the empty set. And there is a function

∅ → A

, namely the

empty function again.

–

is non-empty, then

A → ⊥

contains no functions, so (

A → ⊥

)

→ ⊥

contains the empty function. So ((

A → ⊥

)

→ ⊥

)

→ A

is given by exactly

an element of A.

We see that to construct such a function, we need to first know if

is empty or

not. If it is non-empty, then we have to arbitrarily pick an element of

. Thus,

there is no way we can uniformly construct such a function without knowing

anything about A.

Proposition.

We cannot prove ((

A → B

)

→ A

)

→ A

in natural deduction

without the law of excluded middle.

Proof. Suppose there were a lambda term P : ((A → B) → A) → A.

We pick

B = {0, 1}, A = {0, 1, 2, 3, 4}.

In this setting, any function f : A → B identifies a distinguished member of B,

namely the one that is hit by more members of

. We know

B ⊆ A

, so this

is an element in

. So we have a function

: (

A → B

)

→ A

. Then

(

) is a

member of A. We claim that in fact P (F ) ∈ B.

Indeed, we write

(

) =

a ∈ A

. Let

be a 3-cycle moving everything in

A \ B

and fixing

. Then this induces an action on functions among

and

by conjugation. We have.

π(P (F )) = π(a).

But since

is given by a

term, it cannot distinguish between the different

memebers of A \ B. So we have

P (π(F)) = π(a).

We now claim that

(

) =

. By construction,

(

) =

−1

◦ F ◦ π

. Then for

all f : A → B, we have

π(F )(f ) = (π

−1

◦ f ◦ π)(f) = (π

−1

◦ F )(π(f))

= π

−1

(F (π(f))) = π

−1

(F (f)) = F (f ).

Noting that

fixes

, the only equality that isn’t obvious from unravelling the

definitions is

(

)) =

(

). Indeed, we have

(

) =

−1

◦ f ◦ π

f ◦ π

. But

(

f ◦ π

) =

(

) because our explicit construction of

depends only on the

image of

. So

(

) =

. So this implies

(

), which means

a ∈ B

. So we

always have P (F ) ∈ B.

But this means we have found a way of uniformly picking a distinguished

element out of a two-membered set. Indeed, given any such set

, we pick any

three random objects, and combine it with

to obtain

. Then we just apply

P (F ). This is clearly nonsense.

1.3 Possible world semantics

For classical logic, semantics was easy. We had a collection of propositions

and a model is just a function

P → {>, ⊥}

. We can then define truth in this

model recursively in the obvious way.

It turns out to get a sensible model theory of constructive logic, we need to

consider many such objects.

Definition

(Possible world semantics)

Let

be a collection of propositions.

A world is a subset

w ⊆ P

. A model is a collection

of worlds, and a partial

order ≥ on W called accessibility, satisfying the persistence property:

– If p ∈ P is such that p ∈ w and w

≥ w, then p ∈ w

Given any proposition ϕ, we define the relation w  ϕ by

– w 6 ⊥

– If ϕ is atomic (and not ⊥), then then w  ϕ iff ϕ ∈ w.

– w  ϕ ∧ ψ iff w  ϕ and w  ψ.

– w  ϕ ∨ ψ iff w  ϕ or w  ψ.

– w  ϕ → ψ iff (for all w

≥ w, if w

 ϕ, then w

 ψ).

We will say that w “believes” ϕ if w  ϕ, and that w “sees” w

if w

≥ w.

We also require that there is a designated minimum element under

≤

, known

as the root world . We can think of a possible world model as a poset decorated

with worlds, and such a poset is called a frame.

All but the last rule are obvious. The idea is that each world can have some

beliefs, namely the subset

w ⊆ p

. The important idea is that if a world

does

not believe in

, it does not mean it believes that

is false. It just that

remains ignorant about whether p is true.

Now we have w

≥ w if w

a more knowledgable version of w that has more

beliefs. Alternatively,

is a world whose beliefs are compatible with that of

. It is easy to show by induction that if

is any formula that

believes, and

≥ w

, then

also believes in

. Then we can interpret the law of implication

as saying “

believes that

p → q

iff for any possible thinkable world where

true, then q is also true”.

Note that if we only have one world, then we recover models of classical logic.

What happens to negation? By definition, we have

w  ¬A

iff

w  A → ⊥

Expanding the definition, believing that

¬A

means there is no possible world

compatible with

in which

is true. This is much stronger than just saying

w 6 A, which just says we do not hold any opinion on whether A is true.

Example.

We want to construct a world

where

w 6 A ∨ ¬A

. So we need a

W such that w neither believes A nor believes ¬A.

We can achieve the former by just making the world not believe in

. To

make the world not believe

¬A

, it needs to be able to see a world

such that

 A. So we can consider the following world:

 A

6 A

Example.

This time consider the statement

w  ¬¬A → A

. This takes some

work to unravel.

We need a world

such that for all

≥ w

, if

 ¬¬A

, then

 A

. If

we unravel a bit more, we find that

 ¬¬A

means for any

≥ w

, we can

find some w

000

≥ w

such that w

000

 A.

It is easy to see that actually in the model of the previous question,

w 6

¬¬A → A, since w  ¬¬A, but w 6 A.

Example. The following model is a counter-model for ¬¬A ∨ ¬A:

6 A

 A

6 A

where the bottom world believes in neither

¬¬A

nor

¬A

, since it sees worlds

where ¬A is true and also worlds where A is true.

Usually, we don’t talk about a particular world believing in something. We

talk about an entire structure of possible worlds believing something.

Notation. If the root world of our model is w, then we write

 ϕ ⇐⇒ w  ϕ.

Almost by definition, if  ϕ, then for any world w

, we have w

 ϕ.

Exercise. Find a possible world model that does not believe Peirce’s law.

One can readily verify that classically, we have (

A → B

)

→ B

is equivalent

to A ∨ B, but they are not constructively equivalent.

Exercise.

Find a possible world model that



(

A → B

)

→ B

, but does not

 A ∨ B.

Exercise.

Find a possible world model that



(

A → B

)

→ B

but does not

believe  (B → A) → A.

Exercise.

Find a possible world model that does not believe in



(

A → B

)

∨

(B → A).

Possible world semantics can be used for all sorts of logical systems, not just

propositional logic. In general, the notion of accessibility need not be a partial

order, but can be anything we want. Also, we do not necessarily have to require

persistence. Of course, we also need to modify the according notion of a world.

Example.

If we want to produce possible world semantics for constructive

first-order logic, we need to modify our definition of worlds so that they have

“elements”. Assuming we have done so, we can encode

∀

and

∃

using the following

rules:

– w  ∃

, F (x) if there is some y ∈ w for which w  F (y).

– w  ∀

, F (x) if for all w

≥ w and x ∈ w

, we have w

 F (x).

The second condition is much stronger than the naive “for all

x ∈ w

, we have

w  F

(

)”, because the

≥ w

may have many more elements. What our

condition say is that

somehow has a “global” way of showing that everything

satisfies F .

Example. The proposition A → ¬¬A is valid all possible world models.

For classical logic, we had semantics given by truth tables. This easily shows

that classical truth is decidable. Similarly, given any formula, it involves only

finitely many variables, there are only finitely many distinct possible world

models one has to check. So whether something is refuted by possible world

models is again a decidable problem.

Now suppose we have a possible world model in front of us. Then given a

formula

, the set of worlds that believe

is an upward-closed subset of the

frame. So we can think of the truth value of ϕ as these upward-closed set.

Example. Suppose we have the following worlds:

Then the possible truth values are

∅, {w

}, {w

, w

Suppose the top world believes

, but the bottom world does not. Then we say

that

has truth value

}

. This is indeed the three-valued algebra we used to

disprove Peirce’s law.

What can we say about the collection of upward-closed subset of frames, i.e.

our truth values? Under inclusion, it is easy to see that these are complete and

co-complete, i.e. we have all infinite unions and intersections. More importantly,

we can do “implications”.

Definition

(Heyting algebra)

A Heyting algebra is a poset with

⊥

∧

and

∨ and an operator ⇒ such that A ⇒ B is the largest C such that

C ∧ A ≤ B.

If a poset is complete co-complete, then we would expect the implication to

be given by

A ⇒ B =

{C : C ∧ A ≤ B}.

Indeed, if the poset is a Heyting algebra, then it must be given by this. So we

just have to check that this works.

The thought is that Heyting algebras would “validate” all constructive theses,

by interpreting

∧

with

∧

∨

with

∨

and

→

with

⇒

. In other words, if we try to

build a truth table for a formula with values in a Heyting algebra, we will get

the top element

all the time iff the formula is constructively valid. So we have

a parallel between the relation between classical logic and boolean algebras, and

constructive logic and Heyting algebras.

It is an easy inductive proof that any constructively valid statement is “true”

in all Heyting algebras, and the other direction is also true, but takes more work

to prove.

Note that in classical logic, we only ever use the two-element boolean algebra.

If we use the 4-element Boolean algebra, then we would detect the same set of

tautologies. However, in the case of constructive logic we genuinely need all

Heyting algebras.

Finally, we prove half the completeness theorem for possible world models.

Lemma.

Any formula with a natural deduction proof not using the rule for

classical negation is true in all possible world models.

For any frame

, every such formula is satisfied by every possible world

model on that frame. We say it is valid on

. Such formulae are valid on all

posets with a bottom element.

Proof.

The hard case is implication. Suppose we have a natural deduction

proof of

A → B

. The last rule is a

→

-introduction. So by the induction

hypothesis, every world that believes

also believes

. Now let

be a world

that believes all the other undischarged assumptions in

A → B

. By persistence,

every

≥ w

believes similarly. So any

≥ w

that believes

also believes

So w  A → B.

1.4 Negative interpretation

Suppose a constructive logician found a classical logician, and then they try to

talk to each other. Then they would find that they disagree on a lot of things.

The classical logician has no problems accepting

¬¬A → A

, but the constructive

logician thinks that is nonsense. Can they make sense of each other?

The classical logician can understand the constructive logician via possible

world semantics. He could think that when the constructive logician says

a proposition is false, they mean there are possible world models where the

proposition fails. This is all fine.

But can a constructive logician make sense of the classical logician? When the

constructivist sees

A ∨ B

, he thinks that either we can prove

, or we can prove

. However, the classical logician views this as a much weaker statement. It is

more a long the lines of “it is not true that

and

are both false”. In general,

we can try to interpret classical statements by putting in enough negations:

Classical proposition Constructive interpretation

A ∨ B ¬(¬A ∨ ¬B)

A ∧ B A ∧ B

∃

W (x) ¬∀

¬W (x)

A → B ¬(A ∧ ¬B)

This gives rise to the negative interpretation of classical logic into constructive

logic, due to G¨odel:

Definition

(Negative interpretation)

Given a proposition

, the interpretation

∗

is defined recursively by

– ⊥

∗

= ⊥.

– If ϕ is atomic, then ϕ

∗

= ¬¬ϕ.

– If ϕ is negatomic, then ϕ

∗

= ϕ.

– If ϕ = ψ ∧ θ, then ϕ

∗

= ψ

∗

∧ θ

∗

– If ϕ = ψ ∨ θ, then ϕ

∗

= ¬(¬ψ

∗

∧ ¬θ

∗

– If ϕ = ∀

ψ(x), then (∀

)(ψ

∗

(x)).

– If ϕ = ψ → θ, then ϕ

∗

= ¬(ψ

∗

∧ ¬θ

∗

– If ϕ = ∃

ψ(x), then ϕ

∗

= ¬∀

¬ψ

∗

(x).

One can check that we always have

` ϕ → ϕ

∗

. So the constructive version is

always stronger.

Definition (Stable formula). A formula is stable if

` ϕ

∗

→ ϕ.

Lemma.

Any formula built up from negated and doubly negated atomics by

∧ and ∀ is stable.

Proof.

By induction on formulae. The base case is immediate, using the fact

that ¬¬¬A → ¬A. This follows from the more general fact that

(((p → q) → q) → q) → p → q.

It is less confusing to prove this in two steps, and we will write

-terms for our

proofs. First note that if we have

A → B

, then we can obtain

: (

B →

q) → A → q for any q, using

= λg

B→q

.λa

.g(f(a)).

So it suffices to prove that

p → (p → q) → q,

and the corresponding term is

λx

.λg

p→q

.gx

We now proceed by induction.

–

Now assume that we have proofs of

¬¬p → p

and

¬¬q → q

. We want

to prove

p ∧ q

from

¬¬

(

p ∧ q

). It suffices to prove

¬¬p

and

¬¬q

form

¬¬(p ∧ q), and we will just do the first one by symmetry.

We suppose

¬p

. Then we know

(

p ∧ q

). But we know

¬¬

(

p ∧ q

). So we

obtain a contradiction. So we have proved that ¬¬p.

– Note that we have

` ∃

¬¬ϕ(x) → ¬¬∃

ϕ(x),

but not the other way round. For universal quantification, we have

¬¬∀

ϕ(x) → ∀

¬¬ϕ(x),

but not the other way round. We can construct a proof as follows:

[∀

ϕ(x)]

∀-elim

ϕ(a) [¬ϕ(x)]

→-elim

→

→-int (1)

¬∀

ϕ(x) [¬¬∀

ϕ(x)]

→-elim

⊥

→-int (2)

¬¬ϕ(a)

∀-int

¬¬∀

ϕ(x)

→-int (3)

¬¬∀

F (x) → ∀

¬¬F (x)

We now want to show that if

is stable, then

∀

(

) is stable. In other

words, we want

¬¬∀

∗

(x) → ∀

∗

(x).

But form ¬¬∀

∗

(x), we can deduce ∀

¬¬ϕ

∗

(x), which implies ∀

∗

(x).

So every formula in the range of the negative interpretation is stable. Every

stable formula is equivalent to its double negation (classically). Every formula is

classically equivalent to a stable formula.

So if a constructive logician meets a classical logician who is making some

bizarre assertions about first order logic. To make sense of this, the constructive

logician can translate it to its negative interpretation, which the classical logician

thinks is equivalent to the original one.

1.5 Constructive mathematics

We now end with some remarks about how we can do mathematics “construc-

tively”, and the subtleties we have to be careful of.

We first discuss the notion of finiteness. Even classically, if we do not have

the axiom of choice, then we have many distinct notions of “finite set”. When

we go constructive, things get more complicated.

We can produce (at least) two different definitions of finiteness.

Definition

(Kuratowski finite)

We define “finite” recursively:

∅

is Kuratowski

finite. If x is Kuratowski finite, then so is x ∪ {y}.

There is a separate definition of N-finiteness:

Definition

(

-finite)

. ∅

-finite. If

-finite, and

y 6∈ x

, then

x ∪ {y}

N-finite.

These two definitions are not the same! In a

-finite set, we know any

two things are either equal, or they are not, as this is built into the definition

of an

-finite set. However, in the case of Kuratowski finite, this is not true,

since we don’t have the law of excluded middle. We say

-finite sets have

decidable equality. If we want to do constructive arithmetic, the correct notion

is N -finiteness.

We can still define natural numbers as the smallest set containing

∅

and is

closed under successor. This gives us

-finite sets, but the least number principle

is dodgy. It turns out the least number principle implies excluded middle.

There are also subtleties involving the notion of a non-empty set!

Definition (Non-empty set). A set x is non-empty if ¬∀

y 6∈ x.

Definition (Inhabited set). A set x is inhabited if ∃

y ∈ x.

They are not the same! Being inhabited is stronger than being non-empty!

We shall not go into more details about constructive mathematics, because,

disappointingly, very few people care.

2 Model theory

2.1 Universal theories

Recall the following definition:

Definition

(Universal theory)

A universal theory is a theory that can be

axiomatized in a way such that all axioms are of the form

∀

···

(stuff not involving quantifiers) (∗)

For example, groups, rings, modules etc. are universal theories. It is easy to

see that if we have a model of a universal theory, then any substructure is also a

model.

It turns out the converse is also true! If a theory is such that every sub-

structure of a model is a model, then it is universal. This is a very nice result,

because it gives us a correspondence between syntax and semantics.

The proof isn’t very hard, and this is the first result we will prove about

model theory. We begin with some convenient definitions.

Definition

(Diagram)

Let

be a language and

a structure of this language.

The diagram of

is the theory obtained by adding a constant symbol

for

each

x ∈ M

, and then taking the axioms to be all quantifier-free sentences that

are true in M. We will write the diagram as D(M).

Lemma.

Let

be a consistent theory, and let

∀

be the set of all universal

consequences of

, i.e. all things provable from

that are of the form (

∗

). Let

M be a model of T

∀

. Then T ∪ D(M) is also consistent.

Proof.

Suppose

T ∪ D

(

) is not consistent. Then there is an inconsistency

that can be derived from finitely many of the new axioms. Call this finite

conjunction

. Then we have a proof of

¬ψ

from

. But

knows nothing

about the constants we added to

. So we know

T ` ∀

¬ψ

. This is a universal

consequence of T that M does not satisfy, and this is a contradiction.

Theorem.

A theory

is universal if and only if every substructure of a model

of T is a model of T .

Proof. ⇒

is easy. For

⇒

, suppose

is a theory such that every substructure of

a model of T is still a model of T .

Let

be an arbitrary model of

∀

. Then

T ∪ D

(

) is consistent. So it

must have a model, say

∗

, and this is in particular a model of

. Moreover,

M is a submodel of M

∗

. So M is a model of T .

So any model of

∀

is also a model of

, and the converse is clearly true. So

we know T

∀

is equivalent to T .

2.2 Products

In model theory, we would often like to produce new models of a theory from

old ones. One way to do so is via products.

We will use λ-notation to denote functions.

Definition

(Product of structures)

Suppose

}

i∈I

is a family of structures

of the same signature. Then the product

i∈I

has carrier set the set of all functions

α : I →

[

i∈I

such that α(i) ∈ A

Given an

-ary function

in the language, the interpretation in the product

is given pointwise by

f(α

, · · · , α

) = λi.f(α

(i), · · · , α

(i)).

Relations are defined by

ϕ(α

, · · · , α

) =

i∈I

ϕ(α

(i), · · · , α

(i)).

The natural question to ask is if we have a model of a theory, then is the

product a model again? We know this is not always true. For example, the

product of groups is a group, but the product of total orders is not a total order.

We say a product preserves a formula ϕ if

i∈I

 ϕ ⇐⇒ ∀

i∈I

, A

 ϕ

What sort of things do products preserve? As we know from undergraduate

mathematics, theories such as groups and rings are preserved.

Definition

(Equational theory)

An equational theory is a theory all of whose

axioms are of the form

∀

(x) = w

(x)),

where w

are some terms in x.

It is not hard to see that models of equational theories are preserved under

products.

It turns out actually the class of things preserved by products is much more

general.

Definition

(Horn clause)

A Horn clause is a disjunction of atomics and

negatomics of which at most one disjunct is atomic.

It is usually better to think of Horn clauses as formulae of the form





→ χ

where

and

are atomic formulae. Note that

⊥

is considered an atomic

formula.

A universal Horn clause is a universal quantifier followed by a Horn clause.

Example.

Transitivity, symmetry, antisymmetry, reflexivity are all universal

Horn clauses.

Proposition. Products preserve (universal) Horn formulae.

Proof. Suppose every factor

 ∀

(x) → χ(x).

We want to show that the product believes in the same statement. So let

(

, · · · , f

) be a tuple in the product of the right length satisfying the antecedent,

i.e. for each n ∈ I, we have

 ϕ

(n), · · · , f

(n))

for each i, But then by assumption,

 χ(f

(n), · · · , f

(n))

for all n. So the product also believes in ϕ

, · · · , f

). So we are done.

Reduced products

Unfortunately, it is rarely the case that products give us new interesting models.

However, reduced products do.

Reduced products arise from having a filter on the index set.

Definition

(Filter)

Let

be a set. A filter on

is a (non-empty) subset

F ⊆ P

(

) such that

is closed under intersection and superset. A proper filter

is a filter F 6= P (I).

The intuition to hang on to is that F captures the intuition of “largeness”.

Example.

Take

, and

the set of cofinite subsets of

, namely the sets

whose complement are finite, then F is a filter.

Similarly, we can take

to contain the set of naturals with asymptotic

density 1.

Example. Let

F = {x ⊆ N : 17 ∈ N}

Then this is a maximal proper filter.

This is a rather silly filter. We will see later that model-theoretically, these

give us uninteresting reduced products.

Definition (Principal filter). A principal filter is a filter of the form

F = {X ⊆ I : x 6∈ X}

for some x ∈ I.

We don’t tend to care about principal filter, as they are boring.

Definition

(Complete filter)

A filter

-complete if it is closed under

intersection of < κ many things.

Example. By definition, every filter is ℵ

-complete.

Example. Principal filters are κ-complete for all κ.

A natural question is, are there any non-principal ultrafilters that are

ℵ

complete? It turns out this is a deep question, and a lot of the study of set

theory originated from this question.

Filters on

form a complete poset under inclusion, with top element given

(

). Moreover, the proper filters are a chain complete poset. Thus, by

Zorn’s lemma, there are maximal proper filters.

Definition (Ultrafilter). An ultrafilter is a maximal filter.

We saw that principal filters are ultra. Are there non-principal ultrafilters?

Fortunately, by Zorn’s lemma, they do exist.

Example.

By Zorn’s lemma, there is a maximal filter extending the cofinite

filter on N, and this is non-principal.

It turns out it is impossible to explicitly construct a non-principal ultrafilter,

but Zorn’s lemma says they exist.

Now we can get to reduced products.

Definition

(Reduced product)

let

i ∈ I}

be a family of structures, and

F a filter on I. We define the reduced product

i∈I

as follows: the underlying set is the usual product

quotiented by the

equivalence relation

α ∼

β ⇐⇒ {i : α(i) = β(i)} ∈ F

Given a function symbol

, the interpretation of

in the reduced product is

induced by that on the product.

Given a relational symbol ϕ, we define

ϕ(α

, · · · , α

) ⇐⇒ {i : ϕ(α

(i), · · · , α

(i))} ∈ F.

is an ultrafilter, then we call it the ultraproduct. If all the factors in an

ultraproduct are the same, then we call it an ultrapower.

It is an easy exercise to show that these are all well-defined. If we view the

filter as telling us which subsets of

are “large” then, our definition of reduced

product says we regard two functions in the reduced products as equivalent if

they agree on a “large” set.

Note that if our filter is principal, say

{J ⊆ I

i ∈ J}

, then the reduced

product is just isomorphic to A

itself. So this is completely uninteresting.

Reduced products preserve various things, but not everything. For example,

the property of being a total order is in general not preserved by reduced products.

So we still have the same problem.

But if the filter F is an ultrafilter, then nice things happen.

Theorem

(

o´s theorem)

Let

i ∈ I}

be a family of structures of the same

(first-order) signature, and U ⊆ P (I) an ultrafilter. Then

i∈I

/U  ϕ ⇐⇒ {i : A

 ϕ} ∈ U.

In particular, if A

are all models of some theory, then so is

/U.

The key of the proof is the following lemma, which is a nice exercise:

Lemma. Let F be a filter on I. Then the following are equivalent:

(i) F is an ultrafilter.

(ii) For X ⊆ I, either X ∈ F or I \ X ∈ F (“F is prime”).

(iii) If X, Y ⊆ I and X ∪ Y ∈ I, then X ∈ I or Y ∈ I.

With this in mind, it is hard not to prove the theorem.

Let’s now look at some applications of Lo´s theorem.

Recall the compactness theorem of first order logic — if we have a theory

such that every finite subset of

has a model, then so does

. The way

we proved it was rather roundabout. We proved the completeness theorem of

first order logic. Then we notice that if every finite subset of

has a model,

then every finite subset of

is consistent. Since proofs are finite, we know

consistent. So by completeness, T has a model.

Now that we are equipped with ultraproducts, it is possible to prove the

compactness theorem directly!

Theorem

(Compactness theorem)

Let

be a theory in first order logic such

that every finite subset has a model. Then T has a model.

Proof.

Let ∆ be such a theory. Let

ℵ

(∆) be the set of all finite subsets

of ∆. For each s ∈ S, we pick a model M

of s.

Given s ∈ S, we define

= {t ∈ S : s ⊆ t}.

We notice that

s ∈ S}

generate a proper filter on

. We extend this to

ultrafilter U by Zorn’s lemma. Then we claim that

s∈S

/U  ∆.

Indeed, for any ϕ ∈ ∆, we have

{s : M

 ϕ} ⊇ X

{ϕ}

∈ U.

Example.

Let

be the theory consisting of all first-order statements true in

hR,

, ×i

. Add to

a constant

and the axioms

ε <

for all

n ∈ N

. Then

for any finite subset of this new theory,

is still a model, by picking

small

enough. So by the compactness theorem, we know this is consistent.

Using our proof of the compactness theorem, we now have a “concrete” model

of real numbers with infinitesimals — we just take the ultraproduct of infinitely

many copies of R.

2.3 Ehrenfeucht–Mostowski theorem

Definition

(Skolem function)

Skolem functions for a structure are functions

for each ϕ ∈ L such that if

M  ∀

∃

ϕ(x, y),

then

M  ∀

ϕ(x, f

(x)).

Definition

(Skolem hull)

The Skolem hull of a structure is obtained from the

constants term by closure under the Skolem functions.

Definition

(Elementary embedding)

Let Γ be a set of formulae. A function

→ M

is Γ-elementary iff for all

ϕ ∈

Γ we have

 ϕ

(

) implies

 ϕ(i(x)).

If Γ is the set of all formulae in the language, then we just say it is elementary.

Usually, we take Γ to be the set of all formulae, and

to be a substructure

of M

Example.

It is, at least intuitively, clear that the inclusion

hQ, ≤i → hR, ≤i

elementary.

However, Q as a field is not elementary as a substructure of R as a field.

Note that first order logic is not decidable. However, monadic first order

logic is.

Definition

(Monadic first order logic)

Monadic first-order logic is first order

logic with only one-place predicates, no equality and no function symbols.

Proposition. Monadic first-order logic is decidable.

Proof.

Consider any formula

. Suppose it involves the one-place predicates

, . . . , p

. Given any structure M, we consider the quotient of M by

x ∼ y ⇔ p

(x) = p

(y) for all i.

Then there are at most 2

things in the quotient.

Then given any transversal of the quotient, we only have to check if the

formula holds for this transversal, and this is finite. So we can decide.

Definition

(Set of indiscernibles)

We say

hI, ≤

is a set of indiscernibles for

and a structure

with

I ⊆ M

if for any

ϕ ∈ L

(

) of arity

, and for all

increasing tuples x, y ∈ I,

M  ϕ(x) ⇐⇒ M  ϕ(y)

This is weaker than just saying everything in there are the same.

Example. hQ, Ii is a set of indiscernibles for hR, ≤i.

Let

be a theory with infinite models. The general line of thought is — can

we get a model of T with special properties?

Given any set

, invent names for every member of

. Add these names to

the language of

. We add axioms to say all these constants are distinct. Since

T has infinite models, we know this theory has a model.

Let Ω be the set of countable ordinals, consider the language of

, and add

a name to this language for every countable ordinal. We further add axioms to

say that α < β in the ordering of R whenever α < β as ordinals.

Again by compactness, we find that this has a model. So we can embed the

set of countable ordinals into a model of reals, which we know is impossible for

the genuine

. However, in general, there is no reason to believe these elements

are a set of indiscernibles. What Ehrenfeucht–Mostowski says is that we can in

fact do so.

Theorem

(Ehrenfeucht–Mostowski theorem (1956))

Let

hI, ≤i

be a total order,

and let

be a theory with infinite models. Suppose we have a unary predicate

P and a 2-ary relation 4∈ L(T ) such that

T ` 4 is a total order on {x : P (x)}.

Then

has a model

with a copy of

as a sub-order of

, and the copy of

is a set of indiscernibles. Moreover, we can pick

such that every order-

automorphism of hI, ≤i extends to an automorphism of M.

We will give two proofs of this result. We first give the original proof of

Ehrenfeucht–Mostowski, which uses Ramsey theory.

Proof.

Let

and

hI, ≤i

be as in the statement of the theorem. We add to

(

)

names for every element of

, say

i ∈ I}

. We add axioms that says

(

)

and

4 c

whenever

i < j

. We will thereby confuse the orders

≤

and

, partly

because ≤ is much easier to type. We call this theory T

∗

Now we add to

∗

new axioms to say that the

form a set of indiscernibles.

So we are adding axioms like

ϕ(c

, c

) ⇔ ϕ(c

, c

) (∗)

for all

i < j

and

< j

. We do this simultaneously for all

ϕ ∈ L

(

) and all

tuples of the appropriate length. We call this theory

, and it will say that

hI, ≤i

forms a set of indiscernibles. The next stage is, of course, to prove that

is consistent.

Consider any finite fragment

. We want to show that

is consistent.

By finiteness,

only mentions finitely many constants, say

< · · · < c

, and

only involve finitely many axioms of the form (

∗

). Denote those predicates as

, · · · , ϕ

. We let N be the supremum of the arities of the ϕ

Pick an infinite model M of T . We write

[N]

= {A ⊆ M : |A| = N },

For each

, we partition

[N]

as follows — given any collection

}

k=1

, we

use the order relation

to order them, so we suppose

4 a

k+1

. If

has arity

m ≤ N

, then we can check whether

(

, · · · , a

) holds, and the truth value

gives us a partition of M

[N]

into 2 bits.

If we do this for all

, then we have finitely partitioned

[N]

. By Ramsey’s

theorem, this has an infinite monochromatic subset, i.e. a subset such that

any two collection of

members fall in the same partition. We pick elements

, · · · , c

K+N

, in increasing order (under

). We claim that picking the

, · · · , c

to be our constants satisfy the axioms of T

Indeed, given any

mentioned in

with arity

m < N

, and sequences

< · · · < c

and

< · · · < c

, we can extend these sequences on the right

by adding more of those c

. Then by our choice of the colouring, we know

, · · · , c

) ⇔ ϕ

, · · · ,

So we know

is consistent. So

is consistent. So we can just take a model of

, and the Skolem hull of the indiscernibles is the model desired.

There is another proof of Ehrenfeucht–Mostowski due to Gaifman, using

ultraproducts. We will sketch the proof here.

To do so, we need the notion of colimits.

Definition

(Colimit)

Let

i ∈ I}

be a family of structures index by a poset

hI, ≤i

, with a family of (structure-preserving) maps

{σ

→ A

| i ≤ j}

such

that whenever i ≤ j ≤ k, we have

= σ

In particular

is the identity. A colimit or direct limit of this family of

structures is a “minimal” structure

∞

with maps

→ A

∞

such that

whenever i ≤ j, then the maps

∞

commute.

By “minimal”, we mean if

∞

is another structure with this property, then

there is a unique inclusion map A

∞

→ A

∞

such that for any i ∈ I, the maps

∞

Example.

The colimit of a family of sets is obtained by taking the union of all

of them, and then identifying x ∼ σ

(x) for all x ∈ A

and i ≤ j.

The key observation is the following:

Every structure is a direct limit of its finitely generated substructures,

with the maps given by inclusion.

We will neither prove this nor make this more precise, but it will be true for the

different kinds of structures we are interested in. In particular, for a poset, a

finitely generated substructure is just a finite suborder, because there are no

function symbols to “generate” anything in the language of posets.

To prove Ehrenfeucht–Mostowski, we construct a model satisfying the state-

ment of Ehrenfeucht–Mostowski as the direct limit of structures indexed by finite

subset of I:

: s ∈ P(I)}

and when

s ⊆ t ∈ P

(

), we have an elementary embedding

→ M

. We

then note that if all the

are elementary, then the

→ A

∞

are also

elementary. In particular, if each M

are models of our theory, then so is A

∞

We start by picking any infinite model

. By standard compactness

arguments, we may wlog there is no last

-thing in

, and also that

(

)

}

is infinite. We pick

an ultrafilter on

that contains all terminal

segments of

. Since

does not have a last element, this is a non-principal

ultrafilter.

We now write

L(M) = M

|J|

/U.

We will define

= L

|s|

(M).

In particular, M

∅

= M.

To construct the embedding, we consider the following two classes of maps:

(i)

For any structure

, there is a map

(

) :

M → L

(

) given by

the constant embedding.

(ii) If i is an embedding from M into N , then there is an embedding

L(i) : L(M) → L(N ).

which is “compose with i on the right”.

It is easy to see that both of these maps are elementary embeddings, where

we require

to be elementary in the second case. Moreover, these maps are

compatible in the sense that the following diagram always commutes:

M N

L(M) L(N )

K(M) K(N )

L(i)

We now further consider the functions

head

and

tail

defined on finite linear orders

head(a

< · · · < a

) = a

tail(a

< · · · < a

) = a

< · · · < a

We can now define the embeddings recursively. We write

(

s, t

) for the desired

inclusion from M

into M

. We set

– If s = t, then I(s, t) is the identity.

– If head(s) = head(t), then I(s, t) is L(I(tail(s), tail(t))).

– Otherwise, I(s, t) is K ◦ I(s, tail(t)).

By our previous remark, we know these are all elementary embeddings, and that

these maps form a commuting set. Thus, we obtain a direct limit M

∞

Now we want to produce a copy of

in the direct limit. So we want to

produce a copy of

for each

, such that the maps preserve these copies.

It suffices to be able to do it in the case

|s|

= 2, and then we can bootstrap it

up by induction, since if

|s| >

3, then we find two sets of

of order

< |s|

whose

union gives

, and compatibility means we can glue them together. To ensure

compatibility, we pick a fixed element

x ∈ L

(

), and set this as the desired

element in

whenever

|s|

= 1. It then suffices to prove that, say, if 0

then

(

{

}, {

}

)(

)

< I

(

{

}, {

}

)(

). It is then a straightforward exercise

to check that

x = [λp.p] ∈ L(M) = M

|P |

works.

Once we have done this, it is immediate that the copy of

in the direct limit

is a family of indiscernibles. Indeed, we simply have to check that the copy of

is a family of indiscernibles, since every formula only involves finitely

many things. Then since the inclusino of

into

is elementary, the result

follows.

2.4 The omitting type theorem

Definition

(Type)

A type is a set of formulae all with the same number of free

variables. An n-type is a set of formulae each with n free variables.

What is the motivation of this? A 1-type is a set of formulae all with one

free variable. So if we have a model

and an element

x ∈ M

, then we can

consider all formulae that x satisfies, and we see that this gives as a 1-type.

Definition

(Realization of type)

A model

realizes an

-type Σ if there

exists x

, · · · , x

∈ M such that for all σ ∈ Σ, we have

M  σ(x

, · · · , x

Any fool can realize a type, but it takes a model theorist to omit one!

Definition

(Omit a type)

A model

omits an

-type Σ if for all

, · · · , x

there exists σ ∈ Σ such that

M 6 σ(x

, · · · , x

Example.

Consider the language of PA, and consider the type containing

(

)

saying x 6= s

(x). We really want to omit this type!

Of course, it is not hard in this case — the standard model of PA does.

Let’s look for a condition on

and Σ for

to have a model that omits this

type.

Definition (Locally realize). We say ϕ realizes Σ locally if

T ` ∀

(ϕ(x) → σ(x)).

Suppose there is some ϕ ∈ L(T ) such that

T ` ∃

ϕ(x)

and ϕ(x) locally realizes Σ, then we certainly cannot omit Σ.

Now if

T  ∀

¬ϕ

(

) whenever

locally realizes Σ, then we have a chance.

We say T locally omits Σ. It turns out this is sufficient!

Theorem

(Omitting type theorem)

Let

be a first-order theory, and Σ an

n-type. If

T ` ∀

¬ϕ(x)

whenever ϕ locally realizes Σ, then T has a model omitting Σ.

We first prove a special case for propositional logic.

Theorem.

Let

be a propositional theory, and Σ

⊆ L

(

) a type (with

= 0).

If T locally omits Σ, then there is a T -valuation that omits Σ.

Proof.

Suppose there is no

-valuation omitting Σ. By the completeness theorem,

we know everything in Σ is a theorem of T . So T can’t locally omit Σ.

That was pretty trivial. But we can do something more than that — the

extended omitting types theorem. It says we can omit countably many types

simultaneously.

Theorem.

Let

be a propositional theory, and for each

i ∈ N

, we let Σ

⊆ L

(

)

be types for each

i ∈ N

. If

locally omits each Σ

, then there is a

-valuation

omitting all Σ

Proof.

We will show that whenever

T ∪ {¬A

, · · · , ¬A

}

is consistent, where

∈ Σ

for n ≤ i, then we can find A

n+1

∈ Σ

n+1

such that

T ∪ {¬A

, · · · , ¬A

, ¬A

n+1

}

is consistent.

Suppose we can’t do this. Then we know that

T `





1≤j≤n

¬A





→ A

n+1

for every

n+1

∈

n+1

. But by assumption,

locally omits Σ

i+1

. This implies

T ` ¬





1≤j≤n

¬A





contradicting the inductive hypothesis that T ∪ {¬A

, · · · , ¬A

} is consistent.

Thus by compactness, we know

T ∪ {¬A

, ¬A

, · · · }

is consistent, and a

model of this would give us a model of T omitting all those types.

We now prove the general case.

Proof.

Let

be a first order theory (in a countable language) locally omitting

Σ. For simplicity, we suppose Σ is a 1-type. We want to find a model omitting

Σ. Suppose

locally omits Σ, and let

i ∈ N}

be a countable set of new

constant symbols. Let

hϕ

i ∈ Ni

be an enumeration of the sentences of

(

We will construct an increasing sequence

i ∈ N}

of finite extensions of

such that for each m ∈ N,

(0) T

m+1

is consistent.

(i) T

m+1

decides ϕ

for n ≤ m, i.e. T

m+1

` ϕ

or T

m+1

` ¬ϕ

(ii)

∃

(

) and

∈ T

m+1

, then

(

)

∈ T

m+1

, where

is the first

constant not occurring in T

or ϕ

(iii) There is a formula σ(x) ∈ Σ such that ¬σ(c

) ∈ T

m+1

We will construct this sequence by recursion. Given

, we construct

m+1

follows: think of T

as T ∪ {θ

, · · · , θ

}, and let

Θ =

j≤r

We let {c

, · · · , c

} be the constants that have appeared in Θ, and let

Θ(x)

be the result of replacing

with

. Then clearly, Θ(

) is consistent with

T . Since T locally omits Σ, we know there exists some σ(x) ∈ Σ such that

Θ ∧ ¬σ(x

)

is consistent with T . We put ¬σ(c

) into T

m+1

, and this takes care of (iii).

is consistent with

∪ {¬σ

(

)

}

, then put it into

m+1

. Otherwise,

put in ¬ϕ

. This takes care of (i).

∃

(

) and it’s consistent with

∪ {¬σ

(

)

}

, we put

(

) into

m+1

. This takes care of (ii).

Now consider

∗

[

n∈N

Then T

∗

is complete by construction, and is complete by compactness.

Consider an arbitrary countable model of

∗

, and the submodel generated

by the constants in

. This is a model of

∗

⊇ T

and condition (iii) ensures

that it omits Σ.

3 Computability theory

3.1 Computability

We begin by discussing some number theory. Consider the Diophantine equation

+ y

= z

. (∗)

We can certainly write down some solutions to this equation, e.g.

+ 4

= 5

But is it possible to find all solutions? It is a well-known result that we can. Of

course, to do so, it suffices to enumerate all solutions where

x, y, z

are coprime

It turns out every such solution of (∗) can be written as

x = m

− n

, y = 2mn, z = m

+ n

for some

m, n ∈ N

, and conversely, for any

m, n ∈ N

, the (

x, y, z

) given by this

formula is a solution. Thus, if we want to list all solutions to (

∗

), then it suffices

to plug in all possible values of

and

. In other words, we have an algorithm

that enumerates all solutions to (∗).

Given any Diophantine equation, it would be great to have some similar

algorithm that enumerates all solutions to the equation. But even before that, it

would be great to have some algorithms that tell us whether there is any solution

at all. This was Hilbert’s 10th problem — is there an algorithm that determines

whether or not a Diophantine equation has a solution?

If such an algorithm existed, then to prove this, we merely have to exhibit

such an algorithm, and prove that it works? But how could we prove that there

isn’t one. To do so, we must formulate an “algorithm” as a mathematical object.

Alternatively, we want a formal notion of what is a “computable” function.

One way to do so is to define “computable functions” recursively. We will be

concerned with functions

→ N

for different

and

, and the idea is to list

some functions that any sensible computer should be able to compute, and then

close it under operations that really should preserve computability. We then

hope that the set of functions produced this way is exactly the functions that

can be “computed” by an “algorithm”.

We start with a naive attempt, and what we obtain is known as “primitive

recursive” functions.

Definition

(Primitive recursive functions)

The class of primitive recursive

functions N

→ N

for all n, m is defined inductively as follows:

– The constantly zero function f (x) = 0, f : N

→ N is primitive recursive.

–

The successor function

succ

N → N

sending a natural number to its

successor (i.e. it “plus one”) is primitive recursive.

– The identity function id : N → N is primitive recursive.

– The projection functions

proj

(x) : N

, · · · , x

) x

are primitive recursive.

Moreover,

–

Let

→ N

and

, · · · , g

→ N

be primitive recursive. Then

the function

, · · · , x

) 7→ f(g

, · · · , x

), · · · , g

, · · · , x

)) : N

→ N

is primitive recursive.

Finally, we have closure under primitive recursion

–

→ N

and

m+k+1

→ N

are primitive recursive, then so is

the function h : N

k+1

→ N

defined by

h(0, x) = g(x)

h(succ n, x) = f(h(n, x), n, x).

Do these give us everything we want? We first notice that we can define

basic arithmetic using primitive recursion by

plus(0, n) = n

plus(succ m, n) = succ(plus(m, n))

mult(1, n) = n

mult(succ m, n) = plus(n, mult(m, n)),

Similarly, we can define exp etc.

What else can we do? We might want to define the Fibonacci function

Fib(0) = Fib(1) = 1, Fib(n + 1) = Fib(n) + Fib(n − 1)?

But the definition of primitive recursion only allows us to refer to

(

n, x

) when

computing

(

succ n, x

). This isn’t really a problem. The trick is to not define

Fib directly, but to define the function

Fib

(n) =



Fib(n)

Fib(n − 1)



instead, using primitive recursion, and then projecting back to the first compo-

nent. But this requires us to fix the number of things we are required to refer

back to. What if we want to refer back to everything less than n?

The solution to this is that we can actually encode pairs as natural numbers

themselves:

Proposition.

There exists primitive recursion functions

pair

→ N

and

unpair : N → N

such that

unpair(pair(x, y)) = (x, y)

for all x, y ∈ N.

Proof. We can define the pairing function by

pair(x, y) =



x + y + 1



+ y.

The unpairing function can be shown to be primitive recursive, but is more

messy.

The benefit of this is that we can now encode lists as naturals. What would

it mean to be a list? We will informally denote a list by square brackets, e.g.

, · · · , x

]. The idea is that we have functions head, tail, cons such that

head [x

, · · · , x

] = x

tail [x

, · · · , x

] = [x

, · · · , x

]

cons(x, [x

, · · · , x

]) = [x, x

, · · · , x

]

If we think a bit harder, then what we are really looking for is really the following

property:

Corollary.

There exists

cons

N → N, head

N → N

and

tail

N → N

such that

cons(head x, tail x) = x

Proof.

Take

cons

pair

;

head

to be the first projection of the unpairing and

tail

to be a second projection.

So we can encode lists as well. We can lock ourselves in a room with enough

pen and paper, and try to prove that most functions

N → N

we encounter “in

daily life” are indeed primitive recursive.

However, it turns out there are some contrived examples that really should

be computable, but not primitive recursive.

Definition (Ackermann function). The Ackermann function is defined to be

A(0, n) = n + 1

A(m, 0) = A(m − 1, 1)

A(m + 1, n + 1) = A(m, A(m + 1, n)).

Proposition. The Ackermann function is well-defined.

Proof.

To see this, note that computing

(

+ 1

, n

+ 1) requires knowledge of

the values of A(m + 1, n), and A(m, A(m + 1, n)).

Consider

N × N

ordered lexicographically. Then computing

(

+ 1

, n

+ 1)

requires knowledge of the values of

at pairs lying below

+ 1

, n

+ 1

in this

order. Since the lexicographic order of

N × N

is well-ordered (it has order type

ω × ω), by transfinite induction, we know A is well-defined.

We can indeed use this definition to compute

, and it is going to take

forever. But we can still do it.

However, this function is not primitive recursive! Intuitively, this is because

the definition of

does “transfinite recursion over

ω × ω

”, while the definition

of primitive recursion only allows us to do “transfinite recursion over ω”.

That is the idea, but obviously not a proof. To prove it properly, this is done

by proving that the Ackermann “grows too fast”.

Definition

(Dominating function)

Let

f, g

N → N

be functions. Then we

write

f < g

if for all sufficiently large integer

, we have

(

)

< g

(

). We say

dominates f .

It is a remarkable theorem that

Theorem.

The function

n 7→ A

(

n, n

) dominates all primitive recursive functions.

We will not prove this, as it is messy, and not really the point of the course.

The point of bringing up the Ackermann function is just to know that we are

not doing good enough!

The obvious fix would be to allow some more complicated form of recursion,

and hope that we will be safe. But this doesn’t really do the job.

It turns out the problem is that we are only defining total functions. If we

tried to compute any of the primitive recursive functions, we are guaranteed

that we can do it in finite time. In particular, our functions are always defined

everywhere. But anyone with programming experience knows very well that

most programs we write are not everywhere defined. Often, they are nowhere

defined, and just doesn’t run, or perhaps gets stuck in a loop for ever! But we

did write some code for this program, so surely this nowhere defined function

should be computable as well!

Our way of thinking about functions that are not everywhere defined is via

non-termination. We imagine we have a computer program trying to compute

a function

. If

(42) is undefined, instead of the program running for a while

and then saying “I give up”, the program keeps running forever,

It turns out we can “fix” the problem merely by introducing one operation —

inverses. This is slightly subtle. Imagine we are writing a compute program, and

suppose we have a (computable) function

N → N

. We want to compute an

inverse to

. Say we want to find

−1

(17). To compute this, we just compute

(0),

(1),

(2), etc., and if 17 is in the image, then we will eventually find

something that gets sent to 17.

There are two minor problems and one major problems with this:

– f

might not be injective, and there are multiple values that get sent to 17.

If this happens, we just agree that we return the first natural number that

gets sent to 17, which is what is going to happen if we perform the above

procedure.

– f

might not be surjective, and we can never find something that gets sent

to 17. But this is not a problem, since we just decided that our functions

don’t have to be total! If

is not surjective, we can just let the program

keep running forever.

But there is still a major problem. Since we decided functions need not be total,

what happens if

is total? We might have

(23) = 17, but

(11) never halts.

So we never learn that f(23) = 17.

We might think we can get around this problem by “diagonalizing”. We

assume our computers perform computations for discrete time steps. Then we

compute

(1) for one step; then

(1) and

(2) for two steps; then

(1),

(2)

and

(3) for three steps, etc. Then if

(23) = 17, and it takes 100 steps of

computation to compute this, then we will figure this out on the 100th iteration

of this process.

This doesn’t really solve our problem, though. If

is not injective, then this

doesn’t guarantee to return the minimum

such that

(

) = 17. It just returns

some arbitrary x such that f(x) = 17. This is bad.

So we make a compromise. The obvious compromise is to just allow computing

inverses of total functions, but since total functions are hard to invert, we shall

make a further compromise to just allow computing inverse of primitive recursive

functions.

Before we continue, we define some useful conventions:

Notation.

We write

(

)

↑

(

) is undefined, or alternatively, after we define

what this actually means, if the computation of

(

) doesn’t halt. We write

f(x) ↓ otherwise.

Definition

(Partial recursive function)

The class of partial recursive functions

is given inductively by

– Every primitive recursive function is partial recursive.

–

The inverse of every primitive recursive function is partial recursive, where if

k+n

→ N

, then the/an inverse of

is the function

−1

k+m

→ N

given by

−1

(x; y) =

(

min{z : f(x, z) = y} if exists

↑ otherwise

–

The set of partial recursive functions is closed under primitive recursion

and composition.

Of course, this allows us to compute the inverse of functions of several inputs

using pairing and unpairing functions.

It turns out this definition is “correct”. What do we mean by this?

What we have described so far is syntactic declarations of functions. We

describe how we can build up our functions. We can also define computability

semantically, by describing some rules for constructing machines that compute

functions. In other words, we actually define what a “computer” is, and then a

function is computable if it can be computed by a computer. What we want to

prove is that the partial recursive functions are exactly the functions that can

be computed by a machine.

We will not go into details about how we can define these machines. It is pos-

sible to do so; it doesn’t really matter; and the actual details are messy. However,

any sensible construction of machines will satisfy the following description:

–

A machine can be described by a finite string. In particular, it is possible

to number all the possible machines. Thus, each machine can be specified

by a G¨odel number uniquely. We write

{m}

for the machine corresponding

to the numbering m.

–

A machine has a countable (enumerated) number of possible states, and

one of them is called

HALT

. A machine also has a register that can store a

natural number.

–

A machine can take in a fixed number of inputs. Afterwards, in every

discrete time step, it (potentially) changes the content of its register, and

moves to a new state. However, if the current state is

HALT

, then it does

nothing.

For inputs x

, · · · , x

into the machine {m}, we write

{m}(x

, · · · , x

) =

(

value in the register after halting if machine halts

↑ otherwise

Thus, we will also write {m} for the function computed by {m}.

–

The above process is “computable” in the following precise sense — we

define Kleene’s

function

(

m, i, t

) :

→ N

whose value encodes (in a

primitive recursive way) the states and register contents during the first

steps in the evaluation of

{m}

(

). Then

is a primitive recursive function.

–

For any partial recursive function

, there is a machine

{m}

such that

{m}(x) = f(x) for all x ∈ N. In particular {m}(x) always halts.

Now of course, given our requirements for what a “machine” should be, it is

absolutely obvious that functions computable by programs in the above sense is

exactly the same as the functions that are partial recursive. The only perhaps

slightly non-trivial part is to show that every function computed by a machine

is partial recursive. To prove this, given a machine

{m}

, inversion gives us the

function

g(i) = min{t : T(m, i, t) has final state HALT},

and then we can define

(

) by computing

(

m, i, g

(

)) and then extracting the

process. Then h(i) = {m}(i), and so {m} is a partial recursive function.

3.2 Decidable and semi-decidable sets

So we’ve managed to come up with a notion of computability of a function. From

now on, we will assume anything that is “intuitively computable” is actually

computable.

We now want to try to talk about “computable subsets” of

. There are two

possible things we might be interested in.

Definition

(Decidable set)

A subset

X ⊆ N

is decidable if there is a total

computable function N → N such that

f(n) =

(

1 n ∈ X

0 n 6∈ X

This is a rather strong notion. We can always tell, in finite time, whether

an element is in

. Often, a weaker notion is desired, namely semi-decidability.

Informally, a subset

X ⊆ N

is semi-decidable if upon given some

n ∈ X

, if

n ∈ X

then we can learn this in finite time. If

n 6∈ X

, we might never figure this is the

case.

There are several ways to define semi-decidable sets properly.

Definition

(Semi-decidable set)

We say a subset

X ⊆ N

is semi-decidable if it

satisfies one of the following equivalent definitions:

(i) X is the image of some partial computable function f : N → N.

(ii) X is the image of some total computable function f : N → N.

(iii) There is some partial computable function f : N → N such that

X = {n ∈ N : f(n) ↓}

(iv) The function χ

: N → {0} given by

(

0 n ∈ X

↑ n 6∈ X

is computable.

There are obvious generalizations of these definitions to subsets of

for any

. It is an easy, and also important, exercise to prove that these are equivalent.

We can prove another equivalent characterization of semi-decidable sets.

Proposition.

A set

X ⊆ N

is semi-decidable iff it is a projection of a decidable

subset of N

k+1

Proof.

(

⇐

) Let

Y ⊆ N

k+1

be such that

proj

, where

proj

here denotes

the projection to the first

factors. Since

is decidable, there is a computable

function

N → N

k+1

such that

im f

. Then

(

proj

◦ f

) =

. So

the image of a computable function.

(

⇒

) Suppose

is semi-decidable. So

dom

(

{m}

) for some

m ∈ N

, i.e.

X = {n : {m}(n) ↓}. Then we can pick

Y = {(x, t) | {m}(x) halts in t steps}.

A crucial example of a semi-decidable set is the halting set.

Definition (Halting set). The halting set is

{hp, ii : {p}(i) ↓} ⊆ N

Some people prefer to define it as

{m : {m}(m) ↓} ⊆ N

instead.

These two definitions are “the same” in the sense that if we are given some

magic “oracle” that determines membership of one of these sets, then we can

use it to build a program that determines the other. (Exercise)

Theorem (Turing). The halting set is semi-decidable but not decidable.

The proof is just some version of Liar’s paradox.

Proof.

Suppose not, and

is a machine with two inputs such that for all

p, i

we have

M(p, i) =

(

yes {p}(i) ↓

no {p}(i) ↑

If there were such a machine, then we could do some “wrapping” — if the output

is “yes”, we intercept this, and pretend we are still running. If the output is

“no”, then we halt. Call this

. From this, we construct

(

) =

(

n, n

Suppose this machine is coded by m.

Now does

{m}

(

) halt? Suppose it does. Then

(

m, m

) =

yes

, and hence

(

m, m

) does not halt. This means

(

) doesn’t halt, which is a contradiction.

Conversely, if

(

) does not halt, then

(

m, m

) says

. Thus,

(

)

halts. This is again a contradiction!

So M cannot exist.

So the problem of deciding whether a program halts is not decidable. In fact,

we can prove something much stronger — any non-trivial question you can ask

about the behaviour of the function represented by a program is undecidable.

To prove this, we need to go through a few equally remarkable theorems.

Theorem

(

smn

theorem)

There is a total computable function

of two vari-

ables such that for all e, we have

{e}(b, a) = {s(e, b)}(a).

Similarly, we can find such an s for any tuples b and a.

This is called the

smn

theorem because the function is called

, and

and

n usually refers to the length of the tuples b and a.

Computer scientists call this process “currying”.

Proof. We can certainly write a program that does this.

Theorem

(Fixed point theorem)

Let

N → N

be total computable. Then

there is an n ∈ N such that {n} = {h(n)} (as functions).

Proof. Consider the map

he, xi 7→ {h(s(e, e))}(x).

This is clearly computable, and is computed by a machine numbered

, say. We

then pick n = s(a, a). Then we have

{n}(x) = {s(a, a)}(x) = {a}(a, x) = {h(s(a, a))}(x) = {h(n)}(x).

This is a piece of black magic. When we do lambda calculus later, we will see

that this is an example of the

combinator, which is what is used to produce

fixed points in general (and is also black magic).

Theorem

(Rice’s theorem)

Let

be a non-empty proper subset of the set of

all computable functions N → N. Then {n : {n} ∈ A} is not decidable.

This says it is impossible to decide what a program does based on what its

code is.

Proof.

We fix an

. Suppose not. We let

be the (total) characteristic function

{n} ∈ A}

. By assumption,

is computable. We find naturals

a, b

such

that

{a} ∈ A

and

{b} 6∈ A

. Then by the hypothesis, the following is computable:

g(n) =

(

b {n} ∈ A

a otherwise

The key point is that this is the wrong way round. We return something in

if the graph of

{n}

is not in

. Now by the fixed point theorem, there is some

n ∈ N such that

{n} = {g(n)}.

We now ask ourselves — do we have

{n} ∈ A

? If so, then we also have

(

)

} ∈ A

But these separately imply

(

) =

and

(

)) =

respectively. This implies

g(b) = b, which is not possible.

Similarly, if

{n} 6∈ A

, then

(

)

} 6∈ A

. These again separately imply

(

) =

and g(g(n)) = a. So we find g(a) = a, which is again a contradiction.

Corollary. It is impossible to grade programming homework.

3.3 Computability elsewhere

In all of mathematics, whenever the objects we talk about are countable, we can

try to sneak in computability theory. Examples include, for example, Ramsey

theory, or logic.

3.4 Logic

When we do logic, our theorems, proofs, rules etc. are all encoded as some finite

strings, which can certainly be encoded as numbers. Even though our theories

often have infinitely many axioms, the set of axioms is also decidable. It follows

from this that the set of theorems in our theory is semi-decidable, as to know if

something is a theorem, we can run through all possible strings and see if they

encode a proof of the statement. This also works if our set of axioms is just

semi-decidable, but we will need to use some diagonalization argument.

We start with a rather cute observation.

Theorem

(Craig’s theorem)

Every first-order theory with a semi-decidable set

of axioms has a decidable set of axioms.

Proof.

By assumption, there is a total computable function

such that the

axioms of the theory are exactly

(

) :

n ∈ N}

. We write the

th axiom as

The idea is to give an alternative axiom that says the same thing as

, but

from the form of the new axiom itself, we can deduce the value of

, and so we

can compute

(

) to see if they agree. There are many possible ways to do this.

For example, we can say that we add

many useless brackets around

and

take it as the new axiom.

Alternatively, without silly bracketing, we can take the nth axiom to be

i<n

→ ϕ

Then given any statement

, we can just keep computing

(1)

, f

(2)

, · · ·

, and

then if

, then we will figure in finite time. Otherwise, we will, in finite

time, see that

doesn’t look like

(1)

∧ f

(2)

∧ · · ·

, and thus deduce it is not an

axiom.

We can also do some other fun things. Suppose we take our favorite copy of

with the usual 0 and successor functions. We let

be the theory of all true

first-order statements in

. We call this the theory of true arithmetic. We add

a constant c to this theory, and add axioms to say

0 < c, 1 < c, 2 < c, · · · .

Then by compactness, this has a model, and by Lowenheim–Skolem, it has a

countable model, which is a non-standard model of

. We wlog assume the

carrier set of this model is in fact N. How bad can this be?

Theorem

(Tennenbaum’s theorem)

For any countable non-standard model of

true arithmetic, the graph of + and × cannot be decidable.

Interestingly, computability theory offers us an easy proof of G¨odel’s incom-

pleteness theorem. We first note the following result:

Theorem.

The set of G¨odel numbers of machines that compute total functions

is not semi-decidable.

Proof.

Suppose it were. Then there is a computable total function

N → N

such that every computable total function is

(

)

}

for some

. Now consider

the function

g(n) = {f(n)}(n) + 1.

This is a total computable function! But it is not

(

)

}

for any

, because it

differs from {f(n)} at n.

Note that this proof is very explicit and constructive. If we are presented

with a function

{e}

N → N

, then there is an algorithm that returns an

such

that {n} is total, and is not in the image of {e}.

Definition

(Productive set)

A set

X ⊆ N

is productive if there is a total

computable function

N → N

such that for all

n ∈ N

, we have

(

)

∈

X \ (im{n}).

Then the theorem says the set of all machines that compute total functions

is productive.

In some sense, productive sets are “decidably non-semi-decidable”.

If we work in, say, ZFC, then there exists a fixed, canonical copy of

. We

note the following definition:

Definition

(Sound theory)

A sound theory of arithmetic is one all of whose

axioms are true in N.

This is not to be confused with consistency, which does not refer to any

model. For example, the theory of Peano arithmetic plus the statement that PA

is inconsistent is actually a consistent theory, but is unsound (if PA is consistent).

Theorem

(G¨odel’s incompleteness theorem)

Let

be a recursively axiomatized

theory of arithmetic, i.e. the set of axioms is semi-decidable (hence decidable).

Suppose

is sufficiently strong to talk about computability of functions (e.g.

Peano arithmetic is). Then there is some proposition true in

that cannot be

proven in T .

Proof.

We know the set of theorems of

is semi-decidable. So

T `

{n} is a total function}

is semi-decidable. So there must be some total function

such that T does not prove it is total. In other words, T is not complete!

Ramsey theory

In Ramsey theory, we have the following famous result:

Theorem

(Ramsey’s theorem)

We write

(k)

for the set of all subsets of

size k. Suppose we partition N

(k)

in m many distinct pieces. Then there exists

some infinite

X ⊆ N

such that

is monochromatic, i.e.

(k)

⊆ N

(k)

lie entirely

within a partition.

The natural thing for us to do is to insert the word “decidable” everywhere

in the theorem, and see if it is true. Note that a partition of

(k)

into

pieces

is equivalently a function

(k)

→ {

, · · · , k −

}

, and it is not hard to encode

-subsets of

as natural numbers, e.g. by encoding them as an increasing

-tuple.

Thus, it makes sense for us to ask if the partition is decidable, then are we

guaranteed to have a decidable infinite monochromatic set?

One might expect not, because the proof of Ramsey’s theorem is rather

non-constructive. Indeed, we have the following theorem:

Theorem

(Jockusch)

There exists a decidable partition of

(3)

into two pieces

with no infinite decidable monochromatic set.

Proof.

We define a partition

(3)

→ {

}

as follows. Given

x < y < z

, we

define

(

{x, y, z}

) to be 0 if for any

p, d < x

, we have “

{p}

(

) halts in

steps iff

{p}

(

) halts in

steps”, and 1 otherwise. This is a rather weird colouring, but

let’s see what it gives us.

We first note that there can be no monochromatic set coloured 1, as given

any

, for sufficiently large

and

, we must have

{p}

(

) halts in

steps iff

{p}(d) halts in z steps.

We claim that an infinite decidable monochromatic set

for 0 will solve the

halting problem. Indeed, if we want to know if

{p}

halts on

, then we pick

some

x ∈ A

such that

p, d < x

. Then pick some

y ∈ A

such that

y > x

. Then

by construction of

, if

{p}

(

) halts at all, then it must halt in

steps, and we

can just check this.

Thus, there cannot be an infinite decidable monochromatic set for this

partition.

But what about partitioning

(2)

? The same trick doesn’t work, and the

situation is quite complex. We will not talk about this.

3.5 Computability by λ-calculus

We first introduced

-calculus in the setting of decorating natural deduction

proofs. Remarkably, it turns out

-calculus can be used to encode any program

at all! For the sake of concreteness, we will properly define λ calculus.

Definition (λ terms). The set Λ of λ terms is defined recursively as follows:

– If x is any variable, then x ∈ Λ.

– If x is a variable and g ∈ Λ, then λx. g ∈ Λ.

– If f, g ∈ Λ, then (f g) ∈ Λ.

As always, we use variables sensibly, so that we don’t write things like

(

λx. fx

). We also write

(

) to denote a

term that has a free variable

and then f(y) would mean taking f and replacing all occurrences of x with y.

We will often be lazy and write λxy.f instead of λx. λy. f .

We can define free variables and bound variables in the obvious way. If we

want to be really formal, we can define them as follows:

Definition (Free and bound variables).

– In the λ term x, we say x is a free variable.

– In the λ term λx. g, the free variables are all free variables of g except x.

–

In the

term (

f g

), the free variables is the union of the free variables of

f and those of g,

The variables that are not free are said to be bound.

The definitions become more subtle when we have, say, a variable that is

free in

and bound in

, but this is not a course on pedantry. We will just not

worry about these, and keep ourself disciplined and not wrote things like that.

We will gloss over subtleties like this in our upcoming definitions as well.

As mentioned previously, we want to think of

λx. f

(

) as some sort of

function. So we want things like

(λx. f(x)) y = f (y)

to be true. Of course, they are not actually equal as expressions, but they are

more-or-less equivalent.

Definition

(

-equivalence)

We say two

terms are

-equivalent if they are

the same up to renaming of bound variables.

Example. λx. x and λy. y are α-equivalent.

For any sane purposes at all in

calculus,

-equivalent terms are really

considered equal.

Definition

(

-reduction)

= (

λx. y

(

))

and

(

), then we say

obtained from f via β-reduction.

This captures the idea that λ expressions are really functions.

Definition

(

-reduction)

. η

-conversion is the conversion from

λx.

(

f x

) to

whenever x is not free in f .

This captures “function extensionality”, i.e. the idea that functions are the

same iff they give the same results for all inputs.

Definition

(

-normal form)

We say a term is in

-normal form if it has no

possible β-reduction.

Our model of computation with

calculus is as follows. A program is a

expression

, and an input is another

expression

. To run

, we take

f x

, and then try to

-reduce it as far as possible. If it reaches a

-normal form,

then we are done, and say that is the output. If we never get to a

-normal

form, then we say we don’t terminate.

To do so, we need a reduction strategy. It is possible that a

expression can

be β-reduced in many ways, e.g. if we have

(λx. (λy. f y x) x) z,

then we can reduce this to either (

λy. f y z

)

or (

λx f x x

)

. These are not

-equivalent, but of course performing one more

-reduction will yield the same

terms. A reduction strategy is a prescription of which

-reduction we should

perform when there is more than one choice.

There are some things that might worry us. The first is that if we reduce in

different ways, then we might end up in different normal forms. Even worse, it

could be that one reduction strategy would not terminate, while another might

halt in 3 steps!

For this to work, we need the following theorem:

Theorem.

Every

expression can be reduced to at most one

-normal form.

Moreover, there exists a reduction strategy such that whenever a

expression

can be reduced to a

-normal form, then it will be reduced to it via this reduction

strategy.

This magic reduction strategy is just to always perform

-reduction on the

leftmost thing possible.

Notation ( ). We write f g if f β-reduces to g.

There is another thing we should note, which is currying. If we want to

implement a function with, say, 2 inputs, say a function

N × N → N

, we can

instead implement it as a function

f : N → (N → N), where

f(x)(y) = f(x, y).

This allows us to always work with functions of one variable that can potentially

return functions. We will usually just write the type as N → N → N.

Note that in this setting, officially, our

terms do not have “types”. However,

it is often convenient to imagine they have types to make things less confusing,

and we will often do so.

With the formalities out of the way, we should start doing things. We claimed

that we can do programming with

terms. The first thing we might want to

try is to encode numbers. How can we do so?

Recalling that types don’t exist in

calculus, consider a “generic type”

We will imagine that natural numbers take in “functions”

A → A

and returns a

new function A → A. We write

N = (A → A) → (A → A)

for the “type” of natural numbers. The idea is that

represents the

term that

sends f to the nth iteration f

Formally, we define them as follows:

Definition (Church numerals). We define

0 = K(id) = λf. λx. x : N

succ = λn. λf. λx. f ((n f) x) : N → N

We write n = succ

(0).

We can define arithmetic as follows:

plus = λn. λm. λf. λx. (n f) ((m f) x) : N → N → N

mult = λn. λm. λf. λx. (n (m f)) x : N → N → N

exp = λn. λm. λf. λx. ((n m) f)) x : N → N → N

Here exp n m = n

Of course, our “typing” doesn’t really make sense, but helps us figure out

what we are trying to say.

Certain people might prefer to write mult instead as

mult = λn. λm. λf. n (m f).

We can define a lot of other things. It is possible to give motivation for these,

and if one thinks hard enough, these are “obviously” the right definitions to

make. However, we are not really concerned about this, because what we really

want to talk about is recursion. We can define

true = λxy. x

false = λxy. y

if b then x else y = λbxy. b x y

iszero = λn. n (λx. false) true

pair = λxyf. f x y

fst = λp. p true

snd = λp. p false

nil = λx. true

null = λp. p (λxy. false)

We can check that

fst (pair x y) = x

snd (pair x y) = y

Using

pair

fst

and

snd

, we can implement lists just as pairs of pair, with functions

head, tail, cons satisfying

cons (head x) (tail x) = x.

We by convention always end our lists with a

nil

, and

null

is a test of whether

something is nil.

Exercise.

Implement the predecessor function that sends

to (

n −

1), if

n >

and 0 otherwise. Also implement

and

, and test of equality for two general

church numerals.

We want to show that we can in fact represent all primitive recursive functions

with

-terms. We will in fact do something stronger — we will implement

general recursion. This is a version of recursion that doesn’t always guarantee

termination, but will be useful if we want to implement “infinite structures”.

We look at an example. We take our favorite example fact : N → N

fact n = if iszero n then 1 else n · fact (n − 1).

But this is not a valid definition, since it is circular, and doesn’t really give us a

expression. The trick is to introduce a new function

metafact

: (

N → N

)

→

(N → N), given by

metafact f n = if n = 0 then 1 else n · f (n − 1)

This is a genuinely legitimate definition of a function. Now suppose

is a fixed

point of metafact, i.e. f = metafact f . Then we must have

f(n) = (metafact f)(n) = if n = 0 then 1 else n · (f (n − 1)).

So f is in fact the factorial function!

Thus, we have reduced the problem of recursion to the problem of computing

fixed points. If we have a general fixed point operator, then we can implement

all recursive functions.

Definition (Y-combinator). The Y-combinator is

Y = λf.

(λx.f(x x))(λx.f(x x))

Theorem. For any g, we have

Y g g (Y g).

Proof.

Y g λf.

(λx.f(x x))(λx.f(x x))

(λx.(g(x x)))(λx.g(x x))



(λx.(g(x x)))(λx.g(x x))



We also need to implement the minimization operator. To compute

−1

(17),

our plan is as follows:

(i) Produce an “infinite list” (stream) of natural numbers.

(ii) Apply a function f to every element of the stream.

(iii) Find the first element in the stream that is sent to 17.

Note that infinite lists are genuinely allowed in

-calculus. These behave like

lists, but they never have an ending

nil

element. We can always extract more

and more terms out of it.

We do them in a slightly different order

(ii)

We implement a function

map

that takes in a list or stream, and applies a

function f to every element of the list. We simply define it as

map f x = if null x then x else cons (f (head x)) (map f (tail x)).

Of course, we need to use our Y combinator to actually achieve this.

(i)

We can produce a stream of natural numbers by asking for the fixed point

λ. cons 0(map succ ).

(iii)

We will actually apply

map

λn. pair n

(

f n

), so that we remember the

index. We then use the function

ﬁnd n x = if snd (head x) = n then fst (head x) else ﬁnd n (tail x).

3.6 Reducibility

We’ve alluded to the notion of reducibility several times previously. For example,

we had two versions of the halting set, and said that if we could decide membership

of one of them, then we can decide membership of the other. When we proved that

there is a decidable partition of

(3)

with no infinite decidable monochromatic

set, we said if we had such a thing, then we could use it to solve the halting

problem.

The general idea is that we can often “reduce” one problem to another one.

The most basic way of expressing this is via many-to-one reducibility.

Definition

(Many-to-one reducibility)

Let

A, B ∈ N

. We write

B ≤

there exists a total computable functions

N → N

such that for all

n ∈ N

, we

have n ∈ B ↔ f(n) ∈ A.

Thus, if we have such a set, then whenever we can correctly answer questions

about membership in

, then we can answer questions about membership in

It is easy to see that this is a quasi-order. If we take the intersection

≤

∩ ≥

of the relations, then we get an equivalence relation. The equivalence classes are

called many-one degrees. We think of these as “degree of unsolvability”.

Proposition. Let A ⊆ N. Then A ≤

K iff A is semi-decidable.

This is a technical exercise. Another useful fact is the following:

Proposition. (N \ K) ≤

A iff A is productive.

But it turns out this is a rather awkward definition of “relative computability”.

In our definition of many-one reducibility, we are only allowed to run

once,

and see if the answer is in A. But why once?

The idea is define reducibility semantically. We assume that when writing

programs, we can do all the usual stuff, but also have access to a magic function

, and we are allowed to call

as many times as we wish and use the results.

We call

an oracle. For our purposes,

is the (total) characteristic function of

For technical reasons, we will suppose we have a fixed programming language

that just refers to an opaque function

, and then we can later plug different

things as

. This allows us to compare G¨odel numberings for functions with

different oracles.

Definition

(Turing reducibility)

We say

B ≤

, or simply

B ≤ A

, if it is

possible to determine membership of

whenever we have access to an oracle

that computes χ

Again

≤

is a quasi-order, and intersecting with the reverse, we get an

equivalence relation, which we call Turing degree, or just degree.

Now the quasi-order gives us a relation on Turing degrees. A natural question

to ask is if this is actually a total order.

To begin with, Turing degrees certainly form an upper semi-lattice, as given

and

, we can form the disjoint union of

and

, and knowing about

membership of

means we know membership of

and

, and vice versa.

But are they linearly ordered by ≤? The answer is no!

Theorem

(Friedberg–Muchnik)

There exists two

A, B ⊆ N

such that

A 6≤ B 6≤

A. Moreover, A and B are both semi-decidable.

Proof. We will obtain the sets A and B as

A =

[

n<ω

, B =

[

n<ω

where

and

are finite (and in particular decidable) and nested, i.e.

i < j

implies A

⊆ A

and B

⊆ B

We introduce a bit of notation. As mentioned, if we allow our programs to

access the oracle

, then our “programming language” will allow consultation of

. Then in this new language, we can again assign G¨odel numbers to programs,

and we write

{e}

for the program whose G¨odel number is

. Instead of inventing

a new language for each

, we invent a language that allows calling an “oracle”,

without specifying what it is. We can then plug in different sets and run.

Our objective is to pick A, B such that

6= {e}

for all e. Here we are taking χ

to be the total version, i.e.

(x) =

(

1 x ∈ A

0 x 6∈ A

The idea is, of course, to choose

such that it prevents

from being

{m}

and vice versa. But this is tricky, because when we try to pick

, we don’t

know the whole of B yet.

It helps to spell out more explicitly what we want to achieve. We want to

find some n

, m

∈ N such that for each i, we have

(i)

) 6= {i}

(i)

)

(i)

) 6= {i}

(i)

)

These n

and m

are the witness to the functions being not equal.

We now begin by producing infinite lists

= {n

(i)

, n

(i)

, · · · }

= {m

(i)

, m

(i)

, · · · }

of “candidate witnesses”. We will assume all of these are disjoint. For reasons

that will become clear later, we assign some “priorities” to these sets, say

> M

> N

> M

> N

> · · · .

We begin by picking

= B

= ∅.

Suppose at the

th iteration of the process, we have managed to produce

t−1

and

t−1

. We now look at

, . . . , N

and

, · · · , M

. If they have managed to

found a witness, then we leave them alone. Otherwise, suppose

hasn’t found

a witness yet. Then we run

{i}

t−1

(

(i)

) for

many time steps. We consider

the various possibilities:

–

Let

be the first remaining element of

. If

{i}

t−1

(

) halts within

steps and returns 0, then we put

into

, and then pick this as the

desired witness. We now look at all members of

t−1

our computation of

{i}

t−1

has queried. We remove all of these from all sets of lower priority

than N

– Otherwise, do nothing, and move on with life.

Now the problem, of course, is that whenever we add things into

, it

might have changed the results of some previous computations. Suppose we

originally found that

{

}

(32) = 0, and therefore put 32 into

as a witness.

But later, we found that

{

}

(70) = 0, and so we put 70 into

. But if the

computation of

{

}

(32) relied on the fact that 70

6∈ B

, then we might have

{10}

(32) 6= 0,

and now our witness is “injured”. When this happens, we forget the fact that we

picked 32 as a witness, and then pretend

hasn’t managed to find a witness

after all.

Fortunately, this can only happen because

is of higher priority than

and thus 70 was not forbidden from being a witness of

. Since there are only

finitely many things of higher priorities, our witness can be injured only finitely

many times, and we will eventually be stabilized.

We are almost done. After all this process, we take the union and get

and

. If, say,

{i}

has a witness, then we are happy. What if it does not?

Suppose

{i}

is some characteristic function, and

is the first element in the

list of witnesses. Then since the lists of candidate witnesses are disjoint, we

know

m 6∈ B

. So it suffices to show that

{i}

(

)

= 0. But if

{i}

(

) = 0,

then since this computation only involves finitely many values of

, eventually,

the membership of these things in

would have stabilized. So we would have

{i}

(m) = 0 long ago, and made m a witness.

4 Well-quasi-orderings

We end by a discussion on well-quasi-orders. The beginning of all this is the

idea of well-foundedness. We begin by recalling the definition of a quasi-order.

Definition

(Quasi-order)

An order

hX, ≤i

is a quasi-order if it is transitive

and reflexive, i.e. for all a, b, c ∈ X,

– If a ≤ b and b ≤ c, then a ≤ c.

– We always have a ≤ a.

Note that unlike a poset, we do not require reflexivity, i.e. we can have distinct

a, b

satisfying

a ≤ b

and

b ≤ a

. There is a canonical way of turning a quasi-order

into a partial order, namely by actually identifying elements satisfying

a ≤ b

and

b ≤ a

, and so it appears we don’t lose much by restricting to partial orders.

However, it turns out a lot of the orders we construct are naturally quasi-orders,

not partial orders, and the extra hypothesis of being a partial order is not really

useful. So we will talk about quasi-orders.

The main idea behind the whole chapter is the notion of well-foundedness.

We probably first met this concept in set theory, where the axiom of foundation

said sets are well-founded. Chances are, it was motivated to get rid of scenarios

such as

x ∈ x

. However, as we do set theory, we figure that the real benefit it

brings us is the ability to do ε-induction.

In general, for an arbitrary relation

, we want to admit the following

R-induction principle:

∀

x∈X

((∀

y∈X

R(y, x) → ϕ(y)) → ϕ(x))

∀

x∈X

ϕ(x)

How can this fail? Suppose the top line holds, but the bottom line does not. We

set

A = {x : ¬ϕ(x)}.

Then we know this set is non-empty. Moreover, the top line says

∀

x∈A

∃

y∈A

R(y, x).

Thus, if we forbid such sets from existing, then we can admit the induction

principle.

Definition

(Well-founded relation)

A relation

is said to be well-founded

if for all

A ⊆ X

non-empty, there is some

x ∈ A

such that for any

y ∈ A

, we

have ¬R(y, x).

There is another way of expressing the well-founded relation, if we believe in

the axiom of dependent choice.

Axiom

(Axiom of dependent choice)

Let

be a set an

a relation on

The axiom of dependent choice says if for all

x ∈ X

, there exists

y ∈ X

such

that

(

x, y

), then we can find a sequence

, x

, · · ·

such that

(

, x

i+1

) for

all i ∈ N.

This is a rather weak form of the axiom of choice, which we will freely assume.

Assuming this, then being well-founded is equivalent to the non-existing of

infinite decreasing R-chains.

Why do we care about well-foundedness? There are many reasons, and one we

can provide comes from computer science. In general, it guarantees termination.

When writing programs, we often find ourselves writing loops like

while ( some condition holds ) {

do something;

}

This repeats the “do something” until the condition fails. In general, there is no

reason why such a program should eventually terminate.

Often, the loop is actually of the form

while ( i != BOT ) {

do something with i;

}

where

takes value in some partial order

, with

BOT

being the bottom element.

Suppose every time the loop runs, the value of

decreases. Then if

is well-

founded, then we are guaranteed that the program will indeed terminate. More

generally, if

doesn’t have a unique bottom element, but is still well-founded,

then loops of the form

while ( i is not a minimal element ) {

do something with i;

}

are still guaranteed to terminate.

Example.

Consider typed

calculus, which is the form of

calculus where we

require each term to have a type, and all terms have to be typed, just as we did

when we used it to decorate natural deduction proofs.

We let

be the set of all typed

terms up to

-equivalence, and for

f, g ∈ X

we write

f ≤ g

can be

-reduced to

. It is a remarkable theorem that

is well-founded. In other words, we can keep applying

-reduction in any

way we like, and we are guaranteed to always end up in a normal form. It is

also true that the normal form of a

term is unique, but this is not implied by

well-foundedness.

It is clear from definition that a quasi-order is never well-founded, since it

is reflexive. It is also clear that it fails for a silly reason, and we can fix it by

defining

Definition

(Well-ordered quasi-order)

A quasi-order is well-ordered if there is

no strictly decreasing infinite sequence.

Here we are not worried about choice issues anymore.

Once we are convinced that being well-founded is a good thing, it is natu-

ral ask what operations preserve well-foundedness. We begin by listing some

constructions we can do.

Definition

(

))

Let

hX, ≤

be a quasi-order. We define a quasi-order

≤

on P(X) by

≤

if ∀

∈X

∃

∈X

≤

Definition

(

<ω

)

Let

hX, ≤

be a quasi-order. We let

<ω

be the set of all

finite lists (i.e. sequences) in

. We define a quasi-order on

<ω

recursively by

– nil ≤ 

, where nil is the empty list.

– tail(

) ≤ 

– If tail(

) ≤ tail(

) and head(

) ≤

head(

), then 

≤ 

for all lists 

, 

Equivalently, sequences

}

i=1

≤

}

i=1

if there is a subsequence

}

k=1

of {y

} such that x

≤ y

for all k.

Finally, the construction we are really interested in is trees.

Definition

(Tree)

Let

hX, ≤

be a quasi-order. The set of all trees (in

) is

defined inductively as follows:

–

x ∈ X

and

is a list of trees, then (

x, L

) is a tree. We call

a node of

the tree. In particular (x, nil) is a tree. We write

root(x, L) = x, children(x, L) = L.

Haskell programmers would define this by

data Tree a = Branch a [Tree a]

We write Trees(X) for the set of all trees on X.

We define an order relation

≤

Trees

(

) as follows — let

, T

∈ Trees

(

Then T

≤ T

(i) T

≤ T

for some T

∈ children(T

(ii) root(T

) ≤ root(T

) and children(T

) ≤ children(T

) as lists.

It is an exercise to think hard and convince yourself this recursively-defined

relation is indeed well-defined.

Which of these operations preserve well-foundedness? It is not hard to see

the following:

Lemma.

hX, ≤

is a well-founded quasi-order, then so is

<ω

and

Trees

(

The reason why this works is that the lifts of the order on

<ω

and

Trees

(

) are “of finite character”. To compare two elements in

<ω

Trees

(

we only have to appeal to finitely many instances of ≤

On the other hand,

(

)

, ≤

is in general not well-founded. For example,

hN,

is a well-founded quasi-order, being the reflexive transitive closure of the

trivial relation. But

(

)

has an infinite decreasing sequence. Indeed, =

is just the subset relation.

Definition

(Well-quasi-order)

A well-quasi-order (WQO) is a well-founded

quasi-order hX, ≤

i such that hP(X), ≤

i is also well-founded.

Note that the power set of a well-quasi-order need not be a well-quasi-order.

The question we want to answer is whether the operations

<ω

and

Trees

(

)

preserve well-quasi-orders. To do so, we try to characterize well-quasi-orders in

a more concrete way.

When does a quasi-order fail to be a well quasi-order? Suppose

, x

, · · ·

is an infinite antichain, i.e. a collection of mutually unrelated elements. Then as

above

h{x

: i > n} : n ∈ Ni

is an infinite descending sequence in

(

)

, ≤

. Of course, to be well-founded,

we also need the non-existence of infinite decreasing sequences.

Proposition. Let hX, ≤ be a quasi-order. Then the following are equivalent:

(i) hX, ≤i is a well-quasi-order.

(ii) There is no infinite decreasing sequence and no infinite anti-chain.

(iii)

Whenever we have any sequence

∈ X

whatsoever, we can find

i < j

such that x

≤ x

In standard literature, (iii) is often taken as the definition of a well-quasi-order

instead.

Proof.

– (i) ⇒ (ii): We just showed this.

–

(iii)

⇒

(i): Suppose

, X

, · · ·

is a strictly decreasing chain in

(

Then by definition, we can pick

∈ X

such that

is not

≤

anything in

i+1

. Now for any

j > i

, we know

6≤ x

, because

≤

something in

i+1

. So we have found a sequence {x

} such that x

6≤ x

for all i < j.

– (iii) ⇒ (ii) is clear.

–

(ii)

⇒

(iii), we show that whenever

is a sequence without

i < j

such

that

≤ x

, then it either contains an infinite anti-chain, or an infinite

descending chain.

To do so, we two-colour [N] by

X({i < j}) =

(

0 x

> x

1 otherwise

By Ramsey’s theorem, this has an infinite monochromatic set. If we have

an infinite monochromatic set of colour 0, then this is a descending chain.

Otherwise, if it is of colour 1, then together with our hypothesis, we know

} is an anti-chain.

If we want to reason about well-founded relations, then the following notion

is often a useful one to have in mind:

Definition

(Well-founded part of a relation)

Let

hX, Ri

be a set with a relation.

Then the well-founded part of a relation is the

⊆

-least subset

A ⊆ X

satisfying

the property

– If x is such that all predecessors of x are in A, then x ∈ A.

So for example, any minimal element of

would be in the well-founded part.

One can prove that the well-founded part is indeed well-founded without much

difficulty.

Is there a good notion of the WQO-part of a quasi-order? It turns out not,

or at least there isn’t an obvious one that works. For our later discussion, it is

convenient to have the following definition:

Definition

(Bad sequence)

Let

hX, ≤i

be a well-founded quasi-order. A se-

quence {x

} is bad if for all i < j, we have f(i) 6≤ f(j).

Then a quasi-order is a WQO iff it has no bad sequences.

The idea is now to pick a bad sequence that is “minimal” in some sense, and

then looking at things below this minimal bad sequence will give is a well-quasi-

order. This “WQO-part” isn’t some object canonically associated to the order

hX, ≤i

, because it depends on the choice of the minimal bad sequence, but is

very useful in our upcoming proof.

Definition

(Minimal bad sequence)

Let

hX, ≤i

be a quasi-order. A minimal

bad sequence is a bad sequence

}

such that for each

k ∈ N

is a

≤

-minimal

element in

{x : there exists a bad sequence starting with x

, x

, · · · , x

k−1

, x}.

Lemma.

Let

hX, ≤i

be a well-founded quasi-order that is not a WQO. Then it

has a minimal bad sequence.

Proof.

Just pick each

according to the requirement in the definition. Such an

can always be found because hX, ≤i is well-founded.

What makes minimal bad sequences useful is the following promised lemma:

Lemma

(Minimal bad sequence lemma)

Let

hX, ≤i

be a quasi-order and

B = {b

} a minimal bad sequence. Let

= {x ∈ X : ∃

n∈N

x < b

Then hX

, ≤i is a WQO.

Proof.

Suppose

}

is a bad sequence in

. We prove by induction that nothing

in {s

} is below b

, which gives a contradiction.

Suppose

< b

for some

. Then

, s

i+1

, s

i+2

, · · ·

is a bad sequence, whose

first element is less than b

. This then contradicts the minimality of B.

For the induction step, suppose nothing in

is strictly less than

, · · · , b

Suppose, for contradiction, that s

< b

n+1

. Now consider the sequence

, · · · , b

, s

i+1

, s

i+2

, · · · .

By minimality of

, we know this cannot be a bad sequence. This implies there

is some

n, m

such that the

th element of the sequence is less than the

element. But they can’t be both from the

}

or both from then

}

. This

implies there is some b

≤ s

with j ≤ n and k ≥ i + 1.

Consider

. Since

∈ X

, it must be

≤ b

for some

. Moreover, by

induction hypothesis, we must have

m > n

. Then by transitivity, we have

≤ b

, contradicting the fact that B is bad.

We need one more technical lemma.

Lemma

(Perfect subsequence lemma)

Let

hX, ≤i

be a WQO. Then every

sequence

}

has a perfect subsequence, i.e. an infinite subset

A ⊆ N

such

that for all i < j ∈ A, we have f (i) ≤ f(j).

By definition of well-quasi order, we can always find some

i < j

such that

f(i) ≤ f(j). This says we can find an infinite subset such that this works.

Proof. We apply Ramsey’s theorem. We two-colour [N]

c(i < j) =

(

red f(i) ≤ f (j)

blue f(i) 6≤ f(j)

Then Ramsey’s theorem gives us an infinite monochromatic set. But we cannot

have an infinite monochromatic blue set, as this will give a bad sequence. So

there is a monochromatic red set, which is a perfect subsequence.

We are now in a position to prove that lists over a WQO are WQO’s.

Lemma

(Higman’s lemma)

Let

hX, ≤

be a WQO. Then

<ω

, ≤

is a

WQO.

Proof.

We already know that

<ω

is well-founded. Suppose it is not a WQO.

Then there is a minimal bad sequence {x

} of X-lists under ≤

Now consider the sequence

{head

(

)

}

. By the perfect subsequence lemma,

after removing some elements in the sequence, we may wlog this is a perfect

sequence.

Now consider the sequence

{tail

(

)

}

. Now note that

tail

(

)

< x

for each

(here it is crucial that lists are finite). Thus, using the notation in the minimal

bad sequence lemma, we know tail(x

) ∈ X

for all i.

But

is a well quasi-order. So we can find some

i < j

such that

tail

(

)

≤

tail

(

). But also

head

(

)

≤ head

(

) by assumption. So

≤ x

, and this is a

contradiction.

We can apply a similar juggle to prove Kruskal’s theorem. The twist is that

we will have to apply Higman’s lemma!

Theorem

(Kruskal’s theorem)

Let

hX, ≤

be a WQO. Then

hTrees

(

)

, ≤

is a WQO.

Proof.

We already know that

Trees

(

) is well-founded. Suppose it is not a

WQO. Then we can pick a minimal bad sequence {x

} of trees.

As before, we may wlog {root(x

)} is a perfect sequence. Consider

Y = {T : T ∈ children(x

) for some i}.

Then

Y ⊆ X

. So

is a WQO, and hence by Higman’s lemma,

<ω

is a

WQO. So there exists some

i, j

such that

children

(

)

≤ children

(

). But

by perfectness, we have

root

(

)

≤ root

(

). So by definition of the ordering,

≤ x

What good is this theorem? It is possible to use Kruskal’s theorem to prove

the termination of some algorithms, but this is not a computer science course, and

we will not go into that. Instead, we look at some proof-theoretic consequences

of Kruskal’s theorem.

We begin with the simplest possible WQO — the WQO with one element

only and the only possible order. Then a tree over this WQO is just a tree with

no labels on each node, i.e. a tree.

Consider the statement

P (k) =

there exists some

n ∈ N

such that there is no finite bad

sequence T

, · · · , T

where T

has k + i nodes

This is a finitary version of the statement of Kruskal’s theorem. We claim that

(

) is true for all

. Suppose not. Then for each

, we find a bad sequence

, · · · , T

, where

is a tree with

nodes. We can put these trees in a

grid:

We look at the first column. Each tree has

+ 1 nodes. But there are only

finitely many such trees. So there is some tree that appears infinitely often. Pick

one, call it

, and throw away all other rows. Similarly, we can then find some

tree that appears infinitely often in the second row, call it

, and throw the

other rows away. We keep going. Then we find a sequence

, T

, · · ·

such that every initial segment is a bad sequence. So this is a bad sequence,

which is impossible.

Proposition

(Friedman’s finite form)

For all

k ∈ N

, the statement

(

) is

true.

Why do we care about this? The statement

(

) is something we can write

out in Peano arithmetic, as it is finitary in nature. But our proof certainly

didn’t live in Peano arithmetic, as we had to talk about infinite sets. It turns

out for each

k ∈ N

, we can indeed prove

(

) in Peano arithmetic. However,

it is impossible to prove

∀

(

)! This in particular means we cannot “prove

Kruskal’s theorem” in Peano arithmetic (whatever that might mean).

We can also relate this to ordinal analysis. Given a WQO

hX, ≤i

, we can

construct a well-founded structure from it. The obvious thing to try is to take

(

)

, ≤

, but this is not what we want. The problem is that talking about

power sets involves talking about infinite sets, and this is not something we know

how to do in Peano arithmetic.

Instead, we let

be the set of all finite bad sequences in

, and order them

s ≤ t

is an end-extension of

. Note that this is “upside down”. Then by

definition of a WQO, this is a well-founded downward-branching tree whose top

element is the empty sequence. Then by taking the rank of this structure, we

obtain an ordinal.

Using Kruskal’s theorem, we can produce some really huge WQO’s, which in

turn gives us some really huge ordinals. So in some sense, being able to prove

Kruskal’s theorem means we can prove the well-foundedness of some really big

ordinal in our theory.

We end with some natural generalization of WQO’s. Suppose

hX, ≤i

a WQO. Then

(

)

, ≤

need not be a WQO. Suppose it is not, and let

, X

, · · ·

be a bad sequence of subsets. Then for each

i < j

, we can find

some x

∈ X

such that x

is not less than anything in X

Using dependent choice, we can find a function

f : {hi, ji : i < j ∈ N} → X

such that for all

i < j < k

, we have

(

i, j

)

6≤ f

(

j, k

). This is a strengthening of

bad sequence, where the sequence is now indexed by pairs (i < j).

Definition (ω

-good). A well-ordering hX, ≤i is ω

-good if for any

f : {hi, ji : i < j ∈ N} → X,

there is some i < j < k such that f(i, j) ≤ f(j, k).

There is an analogue of the perfect subsequence lemma for these, but there

is no analogous notion of a minimal bad sequence (or at least no obvious one).