Welcome to Joseph Schreiner’s web site for illustrating patterns of characters in west European languages. In particular, I illustrate the patterns and frequencies of digraphs, or 2-character combinations. On this site I illustrate the most common digraphs found in English, German, French, and Spanish. Not only do we see the most common digraphs, but we also see which digraphs are most likely to precede or follow each other. Half of the graphs are color-coded, so that we can see in which language the digraphs are most likely to occur.
This portion of my web site is based on the two Main graphs below: Color and Monochrome. By clicking on these graphs, you can see the Detail graphs. When you click on a section of a Main graph, a new window will open with a Detail graph, showing you the magnified section.
As you can see, the Main graphs are quite complex: each pixel has meaning. The Detail graphs show you the actual digraphs represented by each pixel.
The graphs show us the relationship among digraphs. They show us how the digraphs precede and follow each other. Consider the English text "red ball". The graphs break down the text into a series of digraphs, as preceding-digraphs and following-digraphs, as in:
| Preceding Digraph | Following Digraph |
| re | d_ |
| ed | _b |
| d_ | ba |
| _b | al |
| ba | ll |
For clarity, the space character has been converted to an underscore. So we see that the preceding-digraph re is followed by the following-digraph d_, ed is followed by _b, and so on.
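As a minimal sketch of this breakdown (the function name `digraph_pairs` is hypothetical, and Python is used only for illustration), each preceding-digraph starting at position i is paired with the following-digraph starting at position i + 2:

```python
def digraph_pairs(text):
    """Return (preceding, following) pairs of adjacent, non-overlapping digraphs."""
    s = text.replace(" ", "_")  # show spaces as underscores, as in the table
    # A preceding-digraph starts at i; its following-digraph starts at i + 2.
    return [(s[i:i + 2], s[i + 2:i + 4]) for i in range(len(s) - 3)]


for preceding, following in digraph_pairs("red ball"):
    print(preceding, following)
```

For "red ball" this prints the same five rows as the table: re d_, ed _b, d_ ba, _b al, ba ll.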
From the patterns of digraphs in these four west European languages
(all using the Roman alphabet), I wove the pattern of the Main graphs.
Monochrome Graph (Black & White)
Click on a section to see it magnified.
Magnified section will open in a separate window.
You may need to make the new window full screen to see all detail.
Color Graph
Click on a section to see it magnified.
Magnified section will open in a separate window.
You may need to make the new window full screen to see all detail.
Structure of the Main Graphs
The rows (A through I) represent the preceding-digraphs. The columns (1 through 9) represent the following-digraphs. re, as a preceding-digraph, is found in row G. d_, as a following-digraph, is found in column 8. So we could look at section G8 to see the re-d_ combination, and determine its frequency and the language in which it occurs most often.
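The section labels can be sketched as a simple lookup. This is only an illustration under several assumptions: the graph is 450 pixels on a side and divided into 9 equal sections, row A is at the top, column 1 is at the left, and pixel coordinates are counted from the top-left corner. The function name `section_label` is hypothetical.

```python
import string


def section_label(x, y, graph_size=450, sections=9):
    """Map a pixel (x, y) in a Main graph to a section label such as 'G8'."""
    cell = graph_size // sections              # 50 pixels per section
    row = string.ascii_uppercase[y // cell]    # rows A..I run down the vertical axis
    col = x // cell + 1                        # columns 1..9 run along the horizontal axis
    return f"{row}{col}"


print(section_label(x=375, y=310))  # G8
```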
I illustrate the 450 most common digraphs. It is too difficult to display more digraphs, given the constraints of browsers and pixel resolution. The overall graphs are 450 x 450 pixels, but there is no one-to-one correspondence between digraph and pixel. More common digraphs are displayed with pixel-lengths of greater than 1.00. Less common digraphs are displayed with pixel-lengths of less than 1.00.
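The exact scaling behind the pixel-lengths is not stated here. As a rough sketch only, one could assume that each digraph's length grows with a damped power of its count, normalised so that all lengths add up to the width of the graph. The function name `pixel_lengths`, the `exponent` parameter, and the toy counts are all assumptions made for illustration.

```python
def pixel_lengths(counts, total_pixels=450.0, exponent=0.5):
    """Give each digraph a fractional pixel-length that grows with its frequency."""
    # Assumed damping: weight = count ** exponent (0.5 is a square root).
    weights = {d: c ** exponent for d, c in counts.items()}
    scale = total_pixels / sum(weights.values())
    return {d: w * scale for d, w in weights.items()}


# Toy counts for three digraphs sharing a 3-pixel axis.
lengths = pixel_lengths({"e_": 1600, "th": 400, "zu": 100}, total_pixels=3.0)
for digraph, length in lengths.items():
    print(digraph, round(length, 2))
```

In this toy case the most common digraph gets more than one pixel and the least common gets less than one, while the total stays at 3 pixels.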
In the Monochrome graphs, digraph frequency is also represented by brightness. Black means that this digraph combination did not occur in the sample text. Bright white means that the digraph combination occurred frequently. Shades of gray represent intermediate frequencies. The Color graphs also use brightness and pixel-length to express frequency.
In the Color graphs, the color or hue indicates which language most
frequently has this digraph combination:
| Language | Color |
| Spanish | Red |
| French | Yellow |
| English | Green |
| German | Blue |
Strong hues indicate that the digraph combination shows a strong
preference for one of the languages. A weak hue, one tending
toward white or gray, indicates that the digraph combination occurs
nearly equally in all four languages.
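One possible way to turn these rules into a pixel color is sketched below. It is an illustration under assumptions, not the formula used for the graphs: the hue is taken from the language with the most occurrences, the saturation from how far that language's share exceeds an even split, and the brightness from a logarithmic scale of the total count. The names `HUES` and `cell_color` are hypothetical.

```python
import colorsys
import math

# Hue (0..1) for each language, matching the table above.
HUES = {"Spanish": 0.0, "French": 1 / 6, "English": 1 / 3, "German": 2 / 3}


def cell_color(counts, max_total):
    """Return an (r, g, b) tuple in 0..1 for one digraph combination.

    counts: occurrences per language (all four languages present as keys).
    max_total: the largest total count of any combination in the graph.
    """
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)  # black: the combination never occurred
    top = max(counts, key=counts.get)
    share = counts[top] / total
    n = len(counts)
    # 0 when all languages are equal, 1 when one language has every occurrence.
    saturation = (share - 1 / n) / (1 - 1 / n)
    # Assumed log scale, so that rare combinations remain visible.
    value = math.log1p(total) / math.log1p(max_total)
    return colorsys.hsv_to_rgb(HUES[top], saturation, value)


print(cell_color({"Spanish": 90, "French": 5, "English": 3, "German": 2}, 100))
```

A combination that is almost entirely Spanish comes out as a strong red, while one spread evenly over the four languages has zero saturation and comes out gray.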
I find the graphs fascinating, and I hope that you do also. The Main graphs give us a bird’s-eye view of the frequencies. They may be compared to spectroscopes, or to chemical electrophoresis. But they do not allow us to see information about the individual digraph. For that, we must examine the Detail graphs. Even these graphs are complex, but they show enough detail for us to examine individual digraphs.
Structure of the Detail Graphs
When you click on a section of a Main graph, you will see the corresponding Detail graph, which magnifies that cell. The pixel lengths are multiplied by 12. The Detail graphs show the same colors and brightness as the Main graphs. And they show us the digraphs along the axes.
The Detail graphs are just as complex as the Main graphs. Remember that these graphs are displaying (on average) 50 digraphs along the horizontal and vertical axes. The digraphs cannot be displayed along a single line. In order to list them all, the digraphs are displayed in 4 staggered lines. The example below shows how I do it. Along the vertical axis (the preceding-digraph) we see the digraphs co, ha, no, pa, so ... Along the horizontal axis (the following-digraph) we see the digraphs _p, _c, _l, _e, _d ... This is the order in which they occur on the axis. But I had to place these digraphs on different lines to squeeze them into a manageable space.
How Did I Do This?
For sample text, I used 1.8 megabytes of on-line text (evenly distributed among English, Spanish, French, and German). 60% of the text came from Wikipedia articles (discussing the USA, France, Mexico, Germany, computers, television, religion, the sun, and the moon). 10% came from Yahoo! news articles. And 30% came from children’s stories, fables, and fairy tales.
I edited the text to remove the square brackets [] of Wikipedia citations, the captions for pictures, and gratuitous line breaks. I also converted or translated the following characters:
- Upper case characters were converted to lower case characters.
- Double-spaces (double-blanks) were converted to single spaces.
- Spaces became underscores “_”.
- Numerals were converted to “9”.
- Punctuation that ends a sentence, or begins a sentence in Spanish, (period, question mark, exclamation point …) was converted to the exclamation point “!”.
- All other characters were translated to the crosshatch “#”.
- The beginning and end of paragraphs were converted to "¶¶".
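The conversions listed above can be sketched in a few lines. The original work was done in Visual Basic; the Python below is only an illustration, and it rests on assumptions: the set of sentence punctuation is taken to be `.`, `?`, `!`, `¿` and `¡`, paragraphs are taken to be separated by blank lines, single line breaks are folded into spaces, and every paragraph boundary (including the start and end of the text) is marked with "¶¶". The names `SENTENCE_MARKS` and `normalize` are hypothetical.

```python
import re

SENTENCE_MARKS = set(".?!¿¡")  # assumed set of sentence punctuation


def normalize(text):
    """Apply the character conversions listed above to raw text."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    out = []
    for p in paragraphs:
        # Lower-case, and collapse runs of whitespace into single spaces.
        p = re.sub(r"\s+", " ", p.strip().lower())
        chars = []
        for ch in p:
            if ch == " ":
                chars.append("_")
            elif ch.isdigit():
                chars.append("9")
            elif ch in SENTENCE_MARKS:
                chars.append("!")
            elif ch.isalpha():
                chars.append(ch)   # letters, including accented ones, are kept
            else:
                chars.append("#")  # everything else becomes the crosshatch
        out.append("".join(chars))
    # Assumption: each paragraph boundary is marked with "¶¶".
    return "¶¶" + "¶¶".join(out) + "¶¶"


print(normalize("¿Red ball?  It cost $25.\n\nThe end."))
# ¶¶!red_ball!_it_cost_#99!¶¶the_end!¶¶
```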
Using Visual Basic, I scanned all text and found all possible digraphs. I chose the 450 most frequent digraphs for further analysis. I created a separate 450 x 450 crosstabulation table for each of the four languages.
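A minimal sketch of these two steps follows, in Python rather than the Visual Basic actually used. It assumes that digraphs are counted over every overlapping 2-character window of the already converted text, and that a combination is counted whenever one selected digraph is immediately followed by another. The names `top_digraphs` and `crosstab` are hypothetical.

```python
from collections import Counter


def top_digraphs(text, n=450):
    """Return the n most frequent digraphs (overlapping 2-character windows)."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return [d for d, _ in counts.most_common(n)]


def crosstab(text, digraphs):
    """Count how often each preceding-digraph is followed by each following-digraph."""
    index = {d: k for k, d in enumerate(digraphs)}
    table = [[0] * len(digraphs) for _ in digraphs]
    for i in range(len(text) - 3):
        pre, fol = text[i:i + 2], text[i + 2:i + 4]
        if pre in index and fol in index:
            table[index[pre]][index[fol]] += 1  # row = preceding, column = following
    return table


sample = "red_ball_red_bell"
digraphs = top_digraphs(sample, n=5)
table = crosstab(sample, digraphs)
print(digraphs)
print(table[digraphs.index("re")][digraphs.index("d_")])  # 2
```

To get one table per language, `crosstab` would be called once on each language's text with the same list of digraphs.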
The most complex and subtle part of the process was determining the order of the digraphs on the horizontal and vertical axes. I followed the principle that digraphs with similar response patterns should be next to each other. For instance, as preceding-digraphs, ra and na are usually followed by the same digraphs. So these two digraphs are adjacent on the vertical axis, which defines the rows. As following-digraphs, zu and ko are usually preceded by the same digraphs, so they are adjacent on the horizontal axis, which defines the columns.
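To make the principle concrete, here is one simple way to place similar digraphs next to each other. It is explicitly not the algorithm used for these graphs, which is not described on this page. It is a swapped-in technique: cosine similarity between rows of counts, with a greedy chain in which each digraph is followed by its most similar unused neighbor. The function names and the toy counts are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_order(rows):
    """Order row labels so that each is followed by its most similar unused row.

    rows: dict mapping a digraph to its vector of following-digraph counts.
    """
    remaining = dict(rows)
    current = next(iter(remaining))  # start from the first digraph given
    order = [current]
    vec = remaining.pop(current)
    while remaining:
        current = max(remaining, key=lambda d: cosine(vec, remaining[d]))
        order.append(current)
        vec = remaining.pop(current)
    return order


# Toy counts: "ra" and "na" are followed by the same digraphs, "th" is not.
rows = {
    "ra": [9, 1, 0],
    "th": [0, 1, 8],
    "na": [8, 2, 0],
}
print(greedy_order(rows))  # ['ra', 'na', 'th']
```

A greedy chain like this is easy to follow but depends on the starting digraph and can leave poor matches at the end of the axis, which hints at why the ordering is hard to get right.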
The concept of similarity is easy to understand, but difficult to implement in a computational or statistical algorithm. I tried many customized methods until I settled on the algorithm that I finally used. If you want to know the details, feel free to send email to me. Otherwise, I will not dwell on my method here.
Some Results
The Main graphs are rich in detail. They are similar to fractals, in that they still show much detail even as you look more closely at smaller sections. I encourage you to examine the Detail graphs.
But let me provide some orientation. Let us first look at some of the major horizontal stripes, or rows.
Stripe crossing E & F – These are almost all digraphs whose second character is a space, or blank. (The Detail graphs show the blanks as underscores.)
Blue stripe within I – These are digraphs that tend to occur at the end of German words or syllables, such as eh, tz, or hn.
Red stripe within F – These are digraphs with accented vowels that occur in Spanish, such as rá, án, and ió.
Green stripe within A – These are digraphs, mostly with y or w as the second character, which often end English words, such as ly, ow, and sh.
The vertical stripes, or columns, are just as interesting:
Stripe around 2 – These are digraphs where the first character is a space.
Stripe crossing 5 & 6 – These are digraphs where the first character is r, n, or l, and the second character is a consonant.
Red stripe within 7 – These are digraphs where the first character is a or o, and the second character is a space or punctuation (the end of Spanish words or sentences).
And we have some interesting rectangles (the intersection of rows and columns):
Empty E2 & F2 – The text was edited to convert double-spaces to single spaces. The E/F rows contain digraphs ending with a space. Column 2 contains digraphs beginning with a space. So there were virtually no digraph combinations that contain two consecutive spaces.
Sparse A5 through F5 – With a few exceptions, the digraphs in rows A through F end in consonants. Column 5 (which crosses over into column 6) contains double-consonant digraphs. So this intersection contains three consecutive consonants, which are rare.
Bright E3 through F5 – Columns 3, 4, and half of 5 contain digraphs where the first character is a consonant, and the second character is a vowel. This combination often appears at the beginning of a word. So this rectangle represents:
letter-space-consonant-vowel
which is what we find in the transition from one word into another.
Red G2 & H2 – The G/H stripe is mostly digraphs where the first character is a consonant, and the second character is o or a. So this rectangle is:
consonant-o/a-space-letter
Many Spanish nouns end in o or a, so this represents the transition from one Spanish word into another.
Yellow H2 – Row H is mostly digraphs where the second character is é, u, or i. Many French past participles end in é, u, and i. So this rectangle represents:
consonant-é/u/i-space-letter
the transition from one French word into another.
Any comments? Questions? Suggestions?
Please email to:

Copyright © 2008, Joseph Schreiner