Formalized mathematics is presented in terms of axiomatic theories and valid deductions therein. The best developed, most precise vehicle for handling axiomatic theories is predicate or first-order logic, as explicated by Frege more than a century ago. It is the baseline language for exact reasoning in mathematics, and the universally accepted currency for exchanging proofs between mathematicians.
Logic comes in various degrees of expressive and deductive strength, but in its most common form, classical predicate logic means
The principle of excluded middle is accepted;
Quantification takes place over elements of basic types (making it first-order; higher-order logic permits quantification over relations between basic types, relations between higher types of relations, etc.);
Provision is made for an equality predicate, which in the common currency is treated as “extensional equality” between terms.
This article aims to describe finitary classical first-order logic with equality from a conceptual point of view. Many textbooks, e.g., Hodges’ A Shorter Model Theory, tend to pass through this material in a hurry, perhaps under the (basically correct) assumption that mathematicians already know how to reason correctly. However, the mathematical structure of predicate logic is rich in categorical significance and has many pleasant offshoots, and shouldn’t be relegated to a hurried recitation of the basic vocabulary of terms, symbols, well-formed formulas and so on.
For those who like condensed summaries: much of what we describe below amounts to a detailed description of the free Boolean hyperdoctrine generated from a signature. We also outline Gentzen sequent calculus as a formal structuring of deductions.
There is a frequently voiced discomfort about logical foundations: that it calls upon the mathematics that it is supposed to be prior to. One is dealing with such mathematical items as the set of words or lists over an alphabet, sets of well-formed expressions, trees which exhibit the structure of a deduction, and so on. Thus, logical foundations appears to assume the prior existence of certain infinite sets, whose manipulation is the business of a set theory, or at the very least a complete formalization of logic would appear to involve some substrate of Peano arithmetic. But wouldn’t a background metatheory of sets or arithmetic then presuppose a logic to handle it correctly? Isn’t logical foundations hopelessly circular?
In spirit, this is like what might be called the “paradox of the dictionary”. A dictionary, real or ideal, defines words in terms of other words. So either a dictionary is hopelessly circular, or some words must be left undefined (permitting some inexactness to creep in). Similarly, it would seem that the structure of logic itself is circular, or needs “undefined” terms, or … perhaps calls on an infinite regress of metatheories?
Our own view is that logical foundations avoids this paradox ultimately by being relentlessly concrete. We may put it this way: logic at the primary level consists of instructions for dealing with formal linguistic items, but the concrete actions for executing those instructions (electrons moving through logic gates, a person unconsciously making an inference in response to a situation) are not themselves linguistic items, not of the language. They are nevertheless as precise and exact as one could wish.
We emphasize this point because in our descriptions below, we obviously must use language to describe logic, and some of this language indeed looks just like the formal mathematics that logic is supposed to be prior to. Nevertheless, the apparent circularity should be considered spurious: it is assumed that the programmer who reads an instruction such as “concatenate a list of lists into a master list” does not need full-blown mathematical language to formalize this, but will be able to translate this directly into actions performed in the real world. However, at the same time, the mathematically literate reader may like having a mathematical meta-layer in which to understand the instructions. The presence of this meta-level should not be a source of confusion, leading one to think we are pulling a circular “fast one”.
Overall, the most useful metaphor might be a computer that can recognize and construct valid deductions in a finitary set theory, such as ETCS. This doesn’t presuppose some infinite sets that somehow reside in the computer. The computer only deals with finite fragments of a theory at a time, using hardware and software of determinate size. The ultimate manifestation of logic is therefore machine-level and extra-linguistic (where the “machine” is a computer or a human who acquires the skill of exact reasoning). Our aim here is not to describe this ultimate concrete manifestation, but rather to present a conceptual overview for mathematically sophisticated readers.
Predicate logic comes in various layers, each layer given in terms of basic data or “generators”, and basic rules or operations for constructing more complicated expressions. There are four layers of data and constructions we deal with, each layer depending on the previous:
Sorts, and rules for constructing types;
Function symbols, and rules for constructing terms;
Relation symbols, and rules for constructing well-formed formulas or predicates;
Axioms, and rules for constructing valid deductions.
The stuff of logic per se is in these rules of construction. An axiomatic theory per se is given by a collection of basic data that generates it. In other words, a theory consists of:
An underlying language generated by a signature, i.e., a collection of sorts, function symbols, and (non-equality) relation symbols;
Axioms of the theory, i.e., a collection of sequents $\Gamma_i \vdash A_i$ where each $A_i$ is a formula in the language and $\Gamma_i$ is a finite list of formulas of the same type as $A_i$.
This follows a familiar generators-and-relations pattern: the signature itself can be viewed as generating a “free theory”, and then one mods out by the filter of deductions that can be derived from the axioms, which is like modding out by relations.
Sometimes theories are further classified. For example, a functional theory is one whose signature includes no basic (non-equality) predicates. An equational theory is a functional theory whose only axioms are of the form $\vdash B$ where $B$ is an equality predicate between terms. There are Horn theories, geometric theories, and so on, depending on the structure of formulas given in the axioms.
We now discuss each of these layers in turn. We suppose in what follows that the data of a theory (sorts, function symbols, relation symbols, and axioms) has been given.
A basic type is one of a (for us, finite) collection of items $s_1, \ldots, s_n$ called sorts. The basic rule of type construction is that given two types $T, T'$, we may construct a new type $T \times T'$, and there is a unit type $1$ (the empty product). (Thus, we are dealing only with product types.)
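As a concrete sketch (in Python, with hypothetical names of our own choosing), one can register product types as words in the sorts, anticipating the identification made below of product types with words:

```python
# A sketch: sorts are atoms (here, strings), and a product type is registered
# as a word in the sorts.  The two type constructors are the binary product
# (concatenation of words) and the unit type 1 (the empty word).
unit = ()                        # the unit type 1

def prod(T, Tp):
    """Form the product type T x T'."""
    return T + Tp

N = ('N',)                       # a basic sort, viewed as a one-letter word
```

Note that representing types as flat words silently imposes the associativity and unit identifications discussed below.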
The best way to consider types is that they are the objects of the free category with finite cartesian products generated from a discrete category of objects $s_1, \ldots, s_n$. This may be constructed quite syntactically (and fussily), but however it is constructed, it is equivalent to the category
$$(Fin/S)^{op}$$
where $Fin$ is the category of finite sets, and $S$ is the finite collection of sorts. For our purposes, it is convenient to take the objects of $Fin$ to be finite cardinals $\{1, \ldots, n\}$, so that an object of $Fin/S$ is a function $[n] = \{1, \ldots, n\} \to S$, or an element of the free monoid $S^\ast$. Thus, a product type $T$ is identified with a word in $S$. The category $(Fin/S)^{op}$ may be regarded as a monoid in the bicategory whose 1-cells are spans from $S^\ast$ to $S^\ast$, and we sometimes commit an abuse of language, writing
$$S^\ast \leftarrow Prod(S) \to S^\ast$$
for the underlying span (the apex is of course actually the set of morphisms of $(Fin/S)^{op}$), and at other times writing $Prod(S)$ for the category $(Fin/S)^{op}$.
A morphism $\phi: T \to T'$, where $T: [m] \to S$ and $T': [n] \to S$ are words in $S$, is described by a function $h: [n] \to [m]$ such that $T \circ h = T'$, i.e., making the evident triangle over $S$ commute. Alternatively, a morphism is effectively an $S$-indexed collection of functions between finite sets.
Two words are isomorphic if they are equal when regarded as commutative words, and in this sense we informally (and harmlessly) write $\prod_{s \in S} s^{n(s)}$ for any such product type together with its product structure (the projection maps to individual sorts). From this point of view, a morphism between product types, given by $h: [n] \to [m]$ as above, is a product of diagonal maps: for each single-sort factor $s = T(j)$ of the domain ($j \in [m]$), there is a diagonal map $s^!: s \to s^{h^{-1}(j)}$. (If $h^{-1}(j)$ is empty, this diagonal map is the projection map $s \to 1$.) In this way, each morphism in $Prod(S)$ is a “generalized diagonal map”. Such generalized diagonal maps may be considered as operations between types which come “for free”, by virtue of the cartesian structure of product types. These operations are basically what is needed to construct and manipulate equality predicates, as we will discuss below.
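A minimal sketch of this description, assuming the encoding of a morphism $T \to T'$ as a function $h: [n] \to [m]$ (function names are ours, purely illustrative):

```python
# A morphism T -> T' in Prod(S), where T is a word of length m and T' a word
# of length n, is encoded by a function h: [n] -> [m] with T(h(j)) == T'(j).
def check_morphism(T, Tp, h):
    """Type-check h as a morphism T -> T': sorts must match along h."""
    return all(T[h[j]] == Tp[j] for j in range(len(Tp)))

def apply_diagonal(h, xs):
    """Semantics as a generalized diagonal: duplicate, permute, or drop
    components of the input tuple according to h."""
    return tuple(xs[h[j]] for j in range(len(h)))

# The diagonal s -> s x s is h = [0, 0]; a projection s x t -> s is h = [0].
```

Duplication of inputs (diagonals) and discarded inputs (projections) are both visible here as non-injective or non-surjective choices of `h`.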
In whatever way such generalized diagonal maps (i.e., $S$-indexed collections of finite functions) may be registered in a machine, a good modern notation for visualizing them is in terms of string diagrams.
We now describe the terms generated by function symbols. Each function symbol has an arity $(T, s)$ where $T \in S^\ast$ is a type and $s \in S$ is a sort; we write $f: T \to s$ to indicate the arity. (The semantics in terms of sets is that an “element” of a product type $T$ is a tuple of multiple inputs of $f$, and an “element” of $s$ is a single output.) In other words, the collection of function symbols forms an $S$-multigraph:
Let $S$ be a set of sorts, and $S^\ast$ the free monoid on $S$. A multigraph over $S$ is a set $F$ together with a span $S^\ast \stackrel{d_0}{\leftarrow} F \stackrel{d_1}{\to} S$.
If the domain $d_0(f)$ of a function symbol is the identity $e \in S^\ast$ (the empty word), then we call $f$ a constant.
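For concreteness, a multigraph of function symbols can be registered as follows (a sketch; the sample signature is our own illustration, not drawn from the text):

```python
from dataclasses import dataclass

# A multigraph over S: each function symbol f carries d0(f), a word in S*,
# and d1(f), a single sort.
@dataclass(frozen=True)
class FunSym:
    name: str
    dom: tuple   # d0(f): a word in S*
    cod: str     # d1(f): a sort in S

# A sample signature on one sort 'm':
e   = FunSym('e', (), 'm')            # a constant: domain is the empty word
mul = FunSym('mul', ('m', 'm'), 'm')  # a binary function symbol
```

The constant `e` illustrates the convention just stated: its domain is the identity element of $S^\ast$, the empty word.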
Roughly speaking, the terms generated from a multigraph of function symbols are built up by formally plugging in outputs of function symbols as inputs of another, and iterating. Allowances are made so that some of the inputs used might be equal, or might not be used at all. We describe this more precisely in two ways, first from the more or less traditional approach of term syntax, and then from a more conceptual categorical point of view.
The usual syntax (which we will only sketch) is via terms $\tau$, each of a specified sort $s$, which we indicate by writing $\tau: s$. Intuitively, terms of sort $s$ are considered “elements of type $s$”. To get started, one supposes given for each sort $s$ an infinite stock of “variables” of that sort, which are to be used as input placeholders in terms. Then general terms are built by syntactically replacing variables with other terms, which themselves have their own variables.
In the traditional approach, terms (relative to a multigraph of function symbols over a set $S$ of sorts) are introduced in a recursive manner:
Each variable $x$ of sort $s$ is a term $x: s$.
Given a function symbol $f: s_1, \ldots, s_n \to s$, and given terms $t_i: s_i$ for $i = 1, \ldots, n$, there is a term $f(t_1, \ldots, t_n): s$.
The last rule includes the case where the domain of $f$ is empty. Thus, if $c$ is a constant function symbol with codomain $s$, there is a corresponding term $c: s$, called a constant term of type $s$.
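The two clauses translate directly into a recursive sort-checker. The following sketch makes them executable under a hypothetical encoding (a variable is a string, an application is a pair of a symbol name and a list of subterms):

```python
# sig maps each function symbol name to (domain word, codomain sort).
sig = {'mul': (('m', 'm'), 'm'), 'e': ((), 'm')}

def sort_of(term, var_sorts, sig):
    """Return the sort of a well-formed term, or raise TypeError.
    A term is a variable name (clause 1) or a pair (f, [subterms]) (clause 2)."""
    if isinstance(term, str):
        return var_sorts[term]   # each variable x of sort s is a term x: s
    f, args = term
    dom, cod = sig[f]
    arg_sorts = tuple(sort_of(t, var_sorts, sig) for t in args)
    if arg_sorts != dom:
        raise TypeError(f"{f} expects {dom}, got {arg_sorts}")
    return cod
```

A constant such as `('e', [])` is the special case of an empty domain word, as noted above.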
So a typical term $\tau$ might look like this:
where $x, y, u, v$ denote variables of appropriate sorts and $c$ is a constant. In particular, a variable $u$ might appear more than once in a term $\tau$. In the older logical approaches, one would read off the set of variables of $\tau$ as the set of variables that appear “inside”, in this example $x, y, u, v$. In more expressive typed theories, which allow for “function-space” types, one moreover distinguishes between free variables and bound variables inside terms, and one would analogously read off the set of free variables inside a term. Even though we won’t consider theories with function-types here, we will refer to “free” variables even in the “products-only” case.
In more recent approaches to term syntax, one considers that terms might have free variables which aren’t explicitly given within the body of the term. (Intuitively, variables which don’t appear can be considered to have been “projected out”, by means of precomposing a function with a projection map like $(x_1, \ldots, x_{n-1}, x_n) \mapsto (x_1, \ldots, x_{n-1})$.)
An approach that works for “products-only” typed theories and for theories with function-types alike proceeds by giving “terms-in-context”. A term-in-context is an expression
$$x_1: s_1, \ldots, x_n: s_n \vdash \tau: s$$
where the context, appearing before the turnstile, lists distinct variable terms and their types that are considered to be “free variables” of $\tau$. The context in effect prescribes a domain for the term: a term-in-context has an arity $s_1, \ldots, s_n \vdash s$; in a set-theoretic interpretation, a term-in-context as above gets interpreted as a function of the form
$$X(s_1) \times \cdots \times X(s_n) \to X(s)$$
(where $X(s)$ denotes the set interpreting the sort $s$).
Each variable appearing in $\tau$ appears within the context, but there may be extra variables which don’t appear in $\tau$ (ones that are “projected out”). Note that the context is a listing of the free variables in a specific order, so that
$$x: s, x': s' \vdash f(x, x')$$
is considered different from $x': s', x: s \vdash f(x, x')$ (but each is considered as valid as the other).
Basic to the enterprise is the notion of substituting terms (in context) for free variables. The idea is that given a term
$$x_1: s_1, \ldots, x_n: s_n \vdash \tau: s,$$
and given terms whose contexts are abbreviated by capital Greek letters, $\Gamma_i \vdash t_i: s_i$ for $i = 1, \ldots, n$, one may define by recursion a term
$$\Gamma_1, \ldots, \Gamma_n \vdash \tau[x_1/t_1, \ldots, x_n/t_n]: s,$$
where the idea is that one substitutes the term $t_i$ for each instance of $x_i$, and then concatenates the contexts of the $t_i$ into a single context.
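In the products-only fragment there are no variable binders, so the recursion is plain structural replacement. A sketch, using the same hypothetical term encoding as before:

```python
def substitute(term, repl):
    """Substitute: replace each variable x occurring in term by the term
    repl[x], recursing through function applications.  With no binders
    (products only), no variable capture can occur."""
    if isinstance(term, str):
        return repl.get(term, term)
    f, args = term
    return (f, [substitute(t, repl) for t in args])
```

For instance, substituting `('e', [])` for `'x'` in `('mul', ['x', 'y'])` yields `('mul', [('e', []), 'y'])`.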
Finally there is a global rule of so-called “$\alpha$-conversion”, by which we identify a term $\Gamma, x:s, \Delta \vdash \tau: s'$ with another term obtained by substituting a fresh variable in for $x$. (In more formal terms, by performing a substitution $[x/t]$ where the term $t$ (in context) is $x': s \vdash x': s$, provided that $x': s$ does not already occur in $\Gamma$ or $\Delta$.) This is reasonable, since it is intuitively clear that the semantics of both terms are exactly the same.
In summary, “normal terms” are $\alpha$-equivalence classes of terms in contexts as sketched above.
This concludes our informal description of the term syntax. It should be clear that although the rules of term formation, substitutions, etc. are not hard to understand, the fully formal set-up is a bit fussy, and somewhat heavy on syntax.
There are a number of ways of explaining the same material from a more categorical point of view. Whatever the explanation is, the point is to describe the free category with finite products generated from a multigraph $F$ over $S$. We denote this free category with products as $Term(S, F)$.
As before, the objects of $Term(S, F)$ are product types: elements of $S^\ast$. Morphisms from a type $T: [m] \to S$ to a type $T': [n] \to S$ are, in the language of the preceding section, $n$-tuples of normal terms $(t_1, \ldots, t_n)$ where each $t_i$ has arity $T \to T'(i)$. Composition is effected by term substitution. The cartesian product structure is essentially given by concatenating (juxtaposing) lists of sorts and terms.
It is hopefully more or less clear how all of this works in the term syntax approach. Despite this, we believe that the bureaucracy of handling variables in the term syntax is something of a hack; from one point of view (closely related to string diagrams), some of it is actually unnecessary.
For example, the input strings play the role of variables declared in the context, but the difference is that they do not need “variable names” – they only need to be labeled by an appropriate sort, for type-checking purposes. This trick effectively eliminates the need for rules of $\alpha$-conversion. As we will see, one can also effect a neat division of labor between the business of variable declarations and the business of “pure substitutions”; moreover, this division clarifies the precise entry point of the particular doctrine we are working in (the doctrine of finite product categories).
To begin, recall the following abstract definition:
Let $M$ be a cartesian monad acting on a finitely complete category $C$. An $M$-span in $C$ from $c$ to $d$ is a span of the form
$$M c \leftarrow E \to d.$$
$M$-spans are composed by consideration of a pullback
where $m: M M \to M$ is the multiplication on the monad $M$. Under this composition, the $M$-spans are 1-cells of a bicategory $M$-$Span$.
In the case where $M$ is the free monoid monad acting on $Set$, an $M$-span from a set $S$ to $S$ is the same as a multigraph over $S$. A monad on $S$ in the bicategory $M$-$Span$ is a multicategory over $S$. We are especially interested in the free multicategory generated from a multigraph over $S$.
The free multicategory construction has other names and descriptions. We could also call it the free nonpermutative $S$-sorted operad generated by a set of $S$-typed function symbols. The apex of its underlying span, together with its map to $S$, can also be referred to as the initial algebra for the polynomial endofunctor $P$ on $Set/S$ which takes an $S$-indexed set $X_s$ to
$$(P X)_s = \coprod_{f: s_1 \cdots s_n \to s} X_{s_1} \times \cdots \times X_{s_n}$$
($f \in F$). However it is named, the underlying multigraph of the free multicategory generated by a multigraph $F$ can be described as
$$S^\ast \stackrel{in}{\leftarrow} Tree(F) \stackrel{out}{\to} S$$
where $Tree(F)$ is the set of $F$-labeled planar trees. This means that
Nodes of a planar tree are labeled by elements $f: T \to s$ of $F$;
Edges of that planar tree are labeled by sorts $s$, such that the list of labels of incoming edges at a node $f: T \to s$ is the word $T$, and the outgoing edge is labeled $s$.
The list of labels of “leaves” of an $F$-labeled tree $t$ (edges that are not outgoing edges of any node) gives an element $in(t) \in S^\ast$, and the label of the “root” edge gives an element $out(t) \in S$. Notice that $F$-labeled trees have obvious string diagram representations.
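As a sketch, the functions $in$ and $out$ can be read off recursively from such a labeled tree (the encoding is hypothetical: a leaf records its edge label, a node records its function symbol and subtrees):

```python
sig = {'mul': (('m', 'm'), 'm')}   # symbol -> (domain word, codomain sort)

def tree_in(t):
    """in(t): the word in S* formed by the leaf labels, read left to right."""
    if t[0] == 'leaf':
        return (t[1],)
    _, subtrees = t
    return tuple(s for sub in subtrees for s in tree_in(sub))

def tree_out(t, sig):
    """out(t): the sort labeling the root edge."""
    if t[0] == 'leaf':
        return t[1]
    return sig[t[0]][1]

# mul applied to a leaf and another mul-node: a tree with three leaves.
t = ('mul', [('leaf', 'm'), ('mul', [('leaf', 'm'), ('leaf', 'm')])])
```

The left-to-right reading of the leaves is exactly what the planar structure of the tree records.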
Next, any multicategory generates a (strict monoidal) category. In the present instance, we denote this category as $Pro(S, F)$ (we write “Pro” by analogy with “prop” – whereas props are used for the doctrine of symmetric monoidal categories, pros are used for monoidal categories). The objects of $Pro(S, F)$ are elements of $S^\ast$. The morphisms of $Pro(S, F)$ could be described as “$F$-labeled forests”. Formally, the underlying span of $Pro(S, F)$ is
$$S^\ast \leftarrow Tree(F)^\ast \to S^\ast$$
whose apex is the set of $F$-labeled forests (lists of trees): the left leg concatenates the input words of the trees, and the right leg lists their root labels.
The monoidal product on forests is simply juxtaposition. The composition of forests is the obvious one, plugging in roots of one forest for leaves of another.
The construction $Pro(S, F)$ formalizes what we meant earlier by “pure substitutions of terms”, and lives in the doctrine of (strict) monoidal categories. To switch over to the doctrine of categories with finite cartesian products, we apply a simple trick. Let $Pro(S, F) \circ Prod(S)$ be the composite of the two spans
Then, in the first place, $Pro(S, F) \circ Prod(S)$ carries a canonical category structure. The reason is that, viewing $Prod(S)$ and $Pro(S, F)$ as monads in the bicategory of spans, there is a canonical distributive law
$$\theta: Prod(S) \circ Pro(S, F) \to Pro(S, F) \circ Prod(S)$$
(here $Pro(S, F)$ can be replaced by any pro over $S$). The idea is that the distributive law $\theta$ maps an element of $Prod(S) \circ Pro(S, F)$, which is a pair of arrows
$$T \stackrel{f}{\to} T' \stackrel{d}{\to} T''$$
with $f$ an arrow of $Pro(S, F)$ and $d$ an arrow of $Prod(S)$, to a pair of arrows in $Pro(S, F) \circ Prod(S)$ whose precise form is dictated by a naturality requirement in $d$. For example, if $d = \delta_{T'}: T' \to T' \otimes T'$, then
$$\theta: (T \stackrel{f}{\to} T' \stackrel{\delta_{T'}}{\to} T' \otimes T') \mapsto (T \stackrel{\delta_T}{\to} T \otimes T \stackrel{f \otimes f}{\to} T' \otimes T').$$
Using this distributive law, the composite of the monads $Prod(S)$ and $Pro(S, F)$ is another monad in the bicategory of spans, and therefore a category with set of objects $S^\ast$.
The same trick works for other doctrines over the doctrine of monoidal categories. For example, if $Perm(S)$ is the free symmetric (strict) monoidal category generated by $S$, regarded as a span from $S^\ast$ to $S^\ast$, then there is a distributive law $Perm(S) \circ Pro(S, F) \to Pro(S, F) \circ Perm(S)$. Similarly with “symmetric monoidal” replaced with “braided monoidal”.
$Term(S, F) \coloneqq Pro(S, F) \circ Prod(S)$, regarded as a cartesian category with product structure inherited from $Prod(S)$, meaning that
$$Prod(S) \stackrel{u \circ 1}{\to} Pro(S, F) \circ Prod(S) = Term(S, F)$$
is a product-preserving functor, where $u$ is the unit of the monad $Pro(S, F)$.
More explicitly, the tensor product $\bigotimes$ on $Term(S, F)$ is a cartesian product provided that there are natural transformations $\delta: id \to \otimes \Delta$, $\varepsilon: id \to e !$ which endow each object with a cocommutative comonoid structure. But the naturality follows from the definition of the distributive law, and the cocommutative comonoid axioms already hold in $Prod(S)$.
Note that this abstract description of $Term(S, F)$ is identical to that given by the syntax of normal terms. The “bureaucracy of variables” is here organized into two departments, $Prod(S)$ and $Pro(S, F)$, each having its own individual categorical structure, which interact via the distributive law. (This is the “division of labor” we were talking about, where each arrow of $Term(S, F)$ is factorized into a generalized diagonal map followed by an $F$-labeled forest.)
Another name for $Term(S, F)$ is “term algebra”, and yet another name for it is “the free $S$-sorted Lawvere theory generated by a set of $S$-sorted operation symbols $F$”.
$Term(S, F)$ is the free category with products generated by the multigraph $F$ over $S$.
$Tree(S, F)$ is the free multicategory generated by the multigraph $F$ over $S$, and $Pro(S, F)$ is the free monoidal category generated by the multicategory $Tree(S, F)$. Therefore $Pro(S, F)$ is the free monoidal category generated by the multigraph $F$. It remains to show that the monoidal inclusion
$$Pro(S, F) \hookrightarrow Term(S, F)$$
is universal among monoidal functors $X: Pro(S, F) \to C$ to cartesian monoidal categories $C$.
There is of course a product-preserving functor $Prod(S) \to C$ compatible with the restriction $S^\ast \hookrightarrow Pro(S, F) \stackrel{X}{\to} C$. At the level of spans, this gives a composable pair of span morphisms
$$Pro(S, F) \circ Prod(S) \to C_1 \circ C_1$$
(where $C_1$ denotes the underlying span of $C$),
which we then compose with the span morphism $m: C_1 \circ C_1 \to C_1$ given by composition in $C$:
$$Term(S, F) = Pro(S, F) \circ Prod(S) \to C_1 \circ C_1 \stackrel{m}{\to} C_1.$$
This gives a morphism between underlying spans, $Term(S, F) \to C$. This is functorial (i.e., is a morphism of monads in $Span$) because both $Prod(S) \to C$ and $Pro(S, F) \to C$ are functorial, and also because the compositional equalities enforced by the distributive law $\theta$ are preserved: they are taken to equalities expressed by naturality of diagonal maps and projection maps in $C$. The functor $Term(S, F) \to C$ is product-preserving, because $Prod(S) \to C$ is product-preserving. The uniqueness of the product-preserving extension $Term(S, F) \to C$ is clear since the subcategories $Prod(S)$ and $Pro(S, F)$ together generate $Term(S, F)$.
This section contains some technical material which will be important for the upcoming discussion of the Beck-Chevalley condition. The point is that the Beck-Chevalley condition is a very powerful principle, and should be applied only to the “right” sorts of pullbacks. The pullbacks we are interested in here can be roughly described as those which are “based on products”; more precisely, as pullbacks that are absolute with respect to the doctrine of finite-product categories, in the sense of the following definition.
Let $C$ be a category with finite products, and let $S$ be a pullback square in $C$. We say that $S$ is a (semantically) productive pullback if, for every category with finite products $D$ and every finite-product preserving functor $F: C \to D$, the square $F(S)$ is a pullback square in $D$.
It suffices to test this condition for the case $D = Set$:
For a pullback square $S$ in $C$ to be productive, it is necessary and sufficient that every product-preserving functor $F: C \to Set$ take $S$ to a pullback in $Set$.
Since the Yoneda embedding $y: D \to Set^{D^{op}}$ preserves and reflects pullbacks, and since limits in $Set^{D^{op}}$ are computed objectwise, the representable functors $\hom(d, -) = ev_d \circ y: D \to Set$ jointly preserve and reflect pullbacks in $D$.
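Since the test reduces to $D = Set$, productivity of a given square can in principle be checked model by model. The following sketch (finite sets as Python sets, functions as dicts; names are ours) decides whether a given square of finite sets is a pullback:

```python
def is_pullback(P, p, q, f, g, A, B):
    """Decide whether the square with apex P, legs q: P -> A and p: P -> B,
    over the cospan f: A -> C <- B :g, is a pullback in Set: the comparison
    map x |-> (q(x), p(x)) must be a bijection onto {(a, b) : f(a) == g(b)}."""
    image = [(q[x], p[x]) for x in P]
    target = {(a, b) for a in A for b in B if f[a] == g[b]}
    return len(set(image)) == len(P) and set(image) == target
```

A square is then (semantically) productive just in case this test passes for the image of the square under every product-preserving functor to $Set$, i.e., in every model.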
Example: Every pullback in $Prod(S) = (Fin/S)^{op}$ is productive. This is intuitively obvious since all morphisms in $Prod(S)$ are built from product data. More formally, every finite-product preserving functor
is isomorphic to one of the form $Set/S(i-, X)$, where $i: Fin/S \to Set/S$ is the inclusion and $X$ is the restriction
For there is, up to isomorphism, only one extension of $X: S \to Set$ to a product-preserving functor $(Fin/S)^{op} \to Set$, and both $F$ and $Set/S(i-, X)$ are such extensions. On the other hand, notice that both $i$ and $Set/S(-, X)$ preserve not just finite products but all finite limits. Therefore an arbitrary product-preserving functor $Set/S(i-, X)$ preserves any pullback square in $(Fin/S)^{op}$.
Example: On the other hand, not every pullback in a term algebra $Term(S, F)$ need be productive. For an especially simple example, take a single-sorted theory with a single unary function symbol $s$. The Lawvere theory for a single unary function, whose models are sets equipped with an $\mathbb{N}$-action (i.e., with an endofunction), is the category opposite to the full subcategory of $Set^{\mathbb{N}}$ whose objects are finite coproducts of copies of the representable $\hom(\mathbb{N}, -)$, or the finite coproduct cocompletion of the one-object category $\mathbb{N}$. Now, the following square is a pushout in the finite coproduct cocompletion:
For, any square in the finite coproduct cocompletion of the form
factors through one of the $m$ summands $\hom(\mathbb{N}, -)$ of the lower-right corner (by extensivity of $Set^{\mathbb{N}}$, say), so we are reduced to checking that the first square is a pushout in the image of the Yoneda embedding – and it is, because $s$ is cancellable. It follows that the dual of the first square – call this dual square $\Sigma$ – is a pullback in the Lawvere theory, i.e., in $Term(S, F)$. However, $\Sigma$ cannot be preserved by all product-preserving functors $X: Term(S, F) \to Set$, because there are plenty of models $X$ where $X(s)$ is non-injective, i.e., where the square
is not a pullback.
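The failure can be checked concretely in a small model. Assuming the square in question is the one exhibiting $s$ as monic (identity legs against $X(s)$), any model in which $X(s)$ identifies two elements breaks it; a sketch:

```python
# A model X of the single-unary-function theory where X(s) is non-injective.
X = [0, 1]
s = {0: 0, 1: 0}                 # constant endofunction: not injective

# The actual pullback of X(s) along itself in Set is the kernel pair:
kernel_pair = [(a, b) for a in X for b in X if s[a] == s[b]]
# The candidate square supplies only the diagonal pairs; the square is a
# pullback precisely when these coincide, i.e. when X(s) is injective.
diagonal = [(a, a) for a in X]
```

Here the kernel pair has four elements while the diagonal has two, so the comparison map is not a bijection and the square is not a pullback in this model.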
One way of thinking about the last example is that, externally speaking, we can see that $s: \mathbb{N} \to \mathbb{N}$ is cancellable (monic) in the term category, but the term category cannot “see” this fact “internally” in the doctrine of finite product categories, otherwise the monicity would be preserved under any interpretation (i.e., in any model $X$).
There is a more verbose (but straightforward) syntactic notion of productive pullback, i.e., a pullback which comes “for free” in any category with finite products. They include the following:
(Compare Seely, section 3.) Let $C$ be a category with finite products. A pullback square in $C$ is syntactically productive if it belongs to the smallest class $\Pi$ of squares containing squares of form (1)-(4) above and closed under the following operations:
If
belong to $\Pi$, then so does their composite;
If
belongs to $\Pi$, then so does the square obtained by applying a functor of the form $C \times -$ or $- \times C$ to this square.
It is clear by induction that if $F: C \to D$ is a finite-product preserving functor and $\Sigma$ is a syntactically productive pullback in $C$, then $F(\Sigma)$ is a pullback square in $D$. Thus, a syntactically productive square is semantically productive. This may be regarded as a “soundness theorem”.
We aim to prove a “completeness theorem” as well: that every semantically productive pullback square in $Term(S, F)$ is syntactically productive.
is clearly a semantically productive pullback. It is syntactically productive because it is a composite of squares of types (1) and (2):
where $!: A \to 1$ denotes the unique map, and we tacitly condense product expressions by erasing empty factors $1$.
Let $F \hookrightarrow Mor(Term(S, F))$ be the canonical inclusion. Then each morphism in the image of this inclusion is monic in $Term(S, F)$.
Suppose $f h_1 = f h_2$ for two morphisms $h_1, h_2$ of $Term(S, F)$. Each morphism $h_1$ of $Term(S, F)$ has a unique factorization as $g_1 d_1$, where $g_1$ belongs to $Pro(S, F)$ and $d_1$ to $Prod(S)$; similarly we uniquely factorize $h_2$ as $g_2 d_2$. From $f g_1 d_1 = f g_2 d_2$, we derive $f g_1 = f g_2$ and $d_1 = d_2$, again by the unique factorization. From $f g_1 = f g_2$ we derive $g_1 = g_2$, according to the recursive definition of $F$-labeled forest. Therefore $g_1 d_1 = g_2 d_2$, as desired.
Let $i: Pro(S, F) \to Term(S, F)$ be the canonical inclusion. Then each morphism in the image of $i$ is monic in $Term(S, F)$.
This follows easily by induction. Each morphism of the form $i(\phi)$ is a cartesian product $\prod_{k=1}^n i(\phi_k)$ of $F$-labeled trees, and each non-trivial tree $\phi_k$ is of the form $f_k \circ \psi_k$ where $f_k$ is a function symbol and $\psi_k$ is an $F$-labeled forest. Then $i(f_k)$ is monic by the preceding lemma, and $i(\psi_k)$ is monic by induction. Thus $i(f_k \psi_k) = i(f_k) i(\psi_k)$ is monic, and since a cartesian product of monos is monic, this completes the proof.