The key principle of compression lies in the creation and application of
abstractions. In their most primitive form, abstractions involve the creation of
some kind of symbol, to refer to a pattern that has been seen before. Instead
of repeating the pattern, which may be cumbersome, we can use the new symbol as
a more concise way to represent the same piece of data.
The rest of this chapter is devoted to a cursory overview of the simple
abstractions we find in everyday life. It may seem overindulgent to make such a
fuss over the use of symbols to stand for things but, as we will see in the
remaining chapters, once we have introduced a sound basis for abstraction, we
will not need anything else to understand every computation that we or our
machines can perform. Certainly, we will see how to create a decompressor that
can inflate every kind of data from an optimally compact form but, along the
way, weÕll also see that we can find abstractions for many things that we might
otherwise have considered primitive or axiomatic. WeÕll see later that logical values
(true and false), logical operators (and, or and not), numbers,
arithmetic operations like addition and multiplication, recursion and many
other mathematical ideas are not things that we have to assume to exist up
front but things we can represent and explain using a single simple model of
computation: the creation and application of abstractions.
One of the most familiar examples of compression is to be found in an
area normally considered outside the range of the mathematical sciences -
natural language. In the vast majority of cases (perhaps all), human languages
were first spoken and only later given a written form. Long before the
emergence of their written forms, languages evolved verbal shortcuts and
conventions in deference to guttural economics. So it is that the currency of
compression in natural language is not the bit, the byte or even the letter Ð but the
syllable.
In the phrase:
ÒTony Blair went to war with Saddam Hussain because he thought it was
rightÓ.
We have used the proper noun ÒheÓ in place of the proper noun ÒTony
BlairÓ Ð saving ourselves two syllables over the equivalent phrase:
ÒTony Blair went to war with Saddam Hussain because Tony Blair thought it was
rightÓ.
In English, the convention that pronouns refer to the subject or object
of the verb in the current sentence allows the abstraction to be introduced
implicitly - making the resulting form more concise. It is interesting, and
perhaps no coincidence, that the nominative pronouns in English
{I, you, he/she/it, we, they} are all one syllable in length. The same is true
of our accusative pronouns: {me, you,
him/her/it, us, them}. Nominative pronouns in French {je, tu, il/elle, nous,
vous, ils/elles} and even German pronouns{ich, du, er/sie/es, wir, ihr, sie}
share this trait though the longer equivalents in Spanish {yo, tœ/vos, el/ella, nosotros/nosotras, vosotros/vosotras, ellos/ellas} might be taken as a sign that the Spanish
were not in as much of a rush to finish their sentences.
The use of implicit abstractions works well in simple sentences but computer language aficionados can console themselves by noting how quickly this technique becomes ambiguous:
ÒTony Blair went to war with Saddam Hussain because he thought he was a tyrantÓ.
Who thought whom to be a tyrant? We donÕt know which pronoun refers to
which noun. Often though, we can deliberately choose a subject that requires a
different pronoun:
ÒThe British people went to war with Saddam Hussain because they thought he was a tyrantÓ.
The resulting statement, while false, is at least unambiguous. With six
distinct pronouns at play unavoidable ambiguities are rare and the common cases
can all be handled pretty well.
It is interesting to note the tension between simplicity and ambiguity in the pronouns of the languages rooted in Latin. English has the smallest number of pronouns, partly because it does not to assign gender to inanimate objects. In English however, as we have seen, it is the very generality of our smaller set of pronouns that leads to semantic ambiguities that might not otherwise arise. Perhaps this delicate balance of pros and cons is why the languages derived from Latin are so varied in their handling of the pronoun.
In what seems rare relief, computer languages adopt a much simpler
abstractive technique than any we find in the natural languages. Computer
languages simply give every variable its very own name. We now
have to use a special construction to formally baptize each variable before it
is used but that is the end of the matter. Whether the thing our variable
refers to is masculine, feminine, singular, plural, nominative, accusative or
dative - we can call it whatever we like.
The representation of the following function is much longer than it
needs to be:
f() {
return ((sin(theta) + cos(theta)) * (sin(theta) + cos(theta)) Ð
sin(theta) - cos(theta)
}
In modern programming languages, abstractions can be introduced by using
local variables which give a name to a something Ð but only
temporarily. Eventually the abstraction ceases to be needed and the modern
computer language will provide some syntactic element (the closing curly brace
here) to end the scope of the abstraction.
The function above can be written more concisely by introducing the
following variable:
f() {
let x = sin(theta) + cos(theta)
return (x * x Ð x)
}
Every-day mathematics has similar devices, though conventional mathematical
notation does not force us to define the scope of the introduced abstractions.
In computer science parlance, variables in basic mathematics are global.
Abstractions in mathematics normally use letters for their abstractions
and therefore avoid the ambiguities of the intricate world of pronouns in
natural language. When a nested abstraction is required, two different symbols
may have different values at the same time. Conventionally, in mathematical
functions, the first variable is denoted by the letter, ÒxÓ, the second by next
letter of the alphabet, ÒyÓ, and so on, all the way up to the letter ÒzÓ - at
which point, having run out of alphabet, this particular convention falls flat
on its face.
Perhaps the most intriguing part of all this lies, not our apparently
universal enthusiasm to eliminate waste in intangible things, but in the fact
that, from the moment human beings were able to communicate, our brains appear
to have been abstracting information from of the world around them Ð normally
without their hostÕs awareness or control. Our brains seem to be insatiable
abstraction machines, taking the continuous streams of information from their
available inputs and rewarding obscure pleasure centers when they find new
abstractions amongst them.
White-noise, a form of sound with no repeating patterns is not at
all pleasurable to listen to. Our mental apparatus has nothing to achieve in
monitoring noise and quickly loses interest in the task. When interesting
patterns appear in sound, however, we stop calling it noise and call it music instead. Listening
to music is an activity which we enjoy Ð but why do we enjoy it? What is the
point? We know that we would not enjoy it if it were just noise Ð is it the patterns
it contains that we delight in deciphering?
It is not just the existence of patterns that we find enchanting Ð our
fix seems to come from spotting new ones. When a simple pattern is repeated,
over and over again, the sound is almost as boring to us as noise. Conversely,
what seems to give us the greatest pleasure is the introduction of a subtle
variation that forces the brain to seek a new abstraction to explain it. In
music, a change of key, or the alteration of just a few notes in a phrase that
give the new phrase a completely different harmonic structure - both force the
deployment of new and sophisticated abstractive techniques for us to avoid
disorientation.
In jazz, a common technique for progressing from one chord to the next
is finding harmonic inversions that move each of the constituent notes through
the smallest distance. Jazz musicians are typically amongst the most
technically capable of musicians - this is not a device to hide inadequacies of
their technique. It is instead because, for many who appreciate jazz, the
greatest pleasure comes from finding elaborate musical ideas behind the
subtlest collection of hints.
The two symbols |: and :| are the braces of abstractions in written
(Western) music, indicating that the enclosed passage should be repeated Ð
allowing a shorter representation of a piece of music than would be possible
without them. Unlike the computer language, no name is typically given to the
abstraction and the ambiguities that can arise are frequently dealt with by
augmenting the musical notation with natural language clarifications, like: repeat
to coda.
It is not the fact that music with structure and repetition can be
written on a smaller piece of paper that makes us enjoy it Ð weÕd still enjoy
it even if it had never been written down. The potential savings in a written
representation merely seem to be related to the mysterious qualities in music
for which we seem to have an innate appreciation.
Physiology
The instructions used to make us are held in our DNA. The coding for
that scheme is a set of four symbols; the bases: adenine (A), thymine (T),
guanine (G), and cytosine (C). A strand of DNA with 2 billion such symbols on
it, when fully stretched out, is shorter in length than the width of a human
hair and yet has enough information to build a complete working copy of a human
being Ð hair and all. There isnÕt
enough information in the DNA to store the locations of the cells in our
bodies; let alone how to make them. How then can this inconceivably diminutive
of things constitute a complete blueprint for creating you, including, as you
do, trillions of complete copies of the blueprint itself?
There are very few inorganic processes in the real world that we can
turn to to visualize the remarkable process of self-replication. TodayÕs cars,
for all their impressive gadgetry, do not produce cars. Factories that produce
cars cannot produce factories.
There is, however, something inorganic that can produce a copy of
itself: a certain type of computer program. A popular puzzle for undergraduates
of Computer Science courses is writing such a program: one which when run,
produces a copy of itself as its output. ItÕs a difficult puzzle Ð but there
are many, many solutions and the shortest self-replicating programs are
normally less than a page long.
So, should we think of a DNA molecule as a self-replicating program: one
that happens to written in some weird language for which we have no
interpreter? If so, the human genome project has recently supplied the
techniques to write out a complete listing of this program - in terms of its
four symbols or instructions. In terms of their information content, such
programs are small enough that we could write any one of them onto a single CD.
Somehow, in the space we might set aside for an hour of music, nature has
figured out how to store a complete description of how to build itself a
musician.
As impressed as we must be with the level of compression that nature has
achieved, we neednÕt worry about unauthorized clones of our physical selves
being knocked out in the far-East, as no-one has yet built a decompressor for
this particular format. A decompressor for this format is, nevertheless,
perfectly easy to define. It would have to take as input the sequence of bases
in our genome and output a list of atoms, and where they should be put, so as to
make a human. Remarkably, that decompressor does exist, itÕs just that we
currently only have it in organic form, embodied in the complex processes that
make our cells grow, mutate and divide throughout our lives. We know precious
little about the busy metropolis of messengers, agents and factories that make
this wonderful machine work Ð only that if it didnÕt, we wouldnÕt be here to marvel
at it.
My DNA is a compressed form of me? Nonsense! ItÕs true that weÕre not, one day, going to zoom in on a image of a cell closely enough to discover a mitochondrial laptop running MSGenesis. Instead, this idea is itself an abstraction, and doesnÕt really exist other than in our heads. Many different ideas could be used to explain the same physical phenomena and generally each of us internalizes the world around us in a different way. While some ideas are more generally accepted than others, we can prove that some of our most universally accepted models are, in reality, deeply flawed. The discovery of wave/particle duality in Physics in the early part of the last century left physicists in the uncomfortable position of knowing that their two most popular models for thinking about matter both break down in certain situations Ð meaning that neither really offers a complete and reliable model of the physical world. Perfect or not, abstractions help us to avoid being overwhelmed by the sheer volume of data around us and see things we would simply be blind to without them.
Next time youÕre in the shower, have a think about why your foot has so
much in common with your hand.
Psychology
Sometimes it seems as if we use our brains to try to predict the
immediate future - based on events of the past. One, extraordinarily
satisfying, example seems to take place when a perfectly reasonable abstraction
turns out to be completely wrong, and the inclusion of one small scrap of extra
information reveals a quite different, but much more fitting, model of the
available information.
ÒTime flies like an arrow; fruit flies like a ÉÓ.
Unless weÕve seen this phrase before, our brains are in an interesting
state on reaching the penultimate word of the sentence. Most people assume they
are seeing a very common pattern in English language - the analogy: a concept
repeated with a different subject and object. There are many ambiguities in the
words ÒfliesÓ and ÒlikeÓ but we have already resolved them in the successful
comprehension of the first phrase. Our brains are used to assuming that if an
ambiguity has been resolved once, very recently, that same resolution will
apply now. ÒFliesÓ, was previously a verb, so we assume it is still a verb. Similarly, ÒlikeÓ
is a conjunction - surely?
Then we read the last word Ð ÒbananaÓ. No problem: ÒFruit flies like a bananaÓ. Sentence parsed. Structure understood Ð over to the higher functions to work out what it means. But then a surprise: grammatically correct as it may be, the phrase turns out to be meaningless. Our primitive abstraction heuristics have lead us up a blind alley.
ÒTime flies like an arrow; fruit-flies like a bananaÓ.
ÒFliesÓ is no longer a verb but now part of a noun. ÒLikeÓ has
changed from a conjunction
to a verb. The heuristics our brains use to economize
the processing of language are perfectly tuned to the efficient handling of the
common case Ð and this sort of sentence is sufficiently rare that good
heuristics are entitled to fail on first encountering it.
Perhaps the most extraordinary element of this, to those interested in
the algorithms involved, is not just realizing the complexity of what our
brains are doing, nor the turmoil our brains must be in in having their rules
of thumb thwarted by such a contrived special case. Nor is it even in the speed
with which the error is recovered and the coordinated backtracking of higher
and lower brain functions invoked to resolve the problem. No, while that is all
impressive enough the truly extraordinary part is that at the end of this
intellectual assault, we feel the inclination to laugh.
If only we could build machines that were as gracious in failure as the
brains that created them.
Higher forms of abstraction are, of course, not confined to comedy or
music; they are also evident in literature, science, art, drama, mathematics
and poetry Ð in short, they seem to be common to most forms of intellectual
stimulation. Why are we being
rewarded for exposing our senses to stimuli which require so much mental energy
for us to unravel?
In evolutionary terms, useful knowledge, the combination of facts and
abstractions about our surroundings, provided a competitive advantage over
ignorance. Knowledge of the relationship between the seasons and the immediate
food supply was decisive in the survival of early humans. The individual who
came to understand the temporal patterns of nature could store food from
plentiful times and so survive the potentially fatal famine of leaner pickings.
With knowledge and working abstractions so critical to our survival it seems
reasonable to assume that individuals who provided themselves with emotional
encouragement for the acquisition of knowledge would have an advantage over
those who did not.
In body and mind, we are merely the vehicles of our pedigree; it should come
as no surprise that we are pleasured by mental activities that have, throughout
our evolution, improved the survival chances of our ancestors.