West European Language Digraphs
In 2-Dimensional Color


What Are These Graphs?

Welcome to Joseph Schreiner’s web site for illustrating patterns of characters in west European languages.  In particular, I illustrate the patterns and frequencies of digraphs, or 2-character combinations.  On this site I illustrate the most common digraphs found in English, German, French, and Spanish.  Not only do we see the most common digraphs, but we also see which digraphs are most likely to precede or follow each other.  Half of the graphs are color-coded, so that we can see in which language the digraphs are most likely to occur.

This portion of my web site is based on the two Main graphs below – Color and Monochrome.  By clicking on these graphs, you can see the Detail graphs.  When you click on a section of a Main graph, a new window will open with a Detail graph, showing you the magnified section.  As you can see, the Main graphs are quite complex - each pixel has meaning.  The Detail graphs show you the actual digraphs represented by each pixel.

The graphs show us the relationships among digraphs: how they precede and follow each other.  Consider the English text "red ball".  The graphs break the text down into pairs of adjacent digraphs, a preceding-digraph and the following-digraph that comes immediately after it, as in:

Preceding-Digraph    Following-Digraph
re                   d_
ed                   _b
d_                   ba
_b                   al
ba                   ll

For clarity, the space character has been converted to an underscore.  So we see that the preceding-digraph re is followed by the following-digraph d_, that ed is followed by _b, and so on.
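
The breakdown above can be reproduced with a few lines of Python.  This is only a minimal sketch of the pairing rule shown in the table; the function name digraph_pairs is hypothetical and is not taken from the original Visual Basic program.

```python
def digraph_pairs(text):
    """Return (preceding-digraph, following-digraph) pairs for a text."""
    # Show spaces as underscores, as in the table above.
    text = text.replace(" ", "_")
    # The preceding-digraph covers characters i and i+1;
    # the following-digraph covers characters i+2 and i+3.
    return [(text[i:i + 2], text[i + 2:i + 4]) for i in range(len(text) - 3)]


for preceding, following in digraph_pairs("red ball"):
    print(preceding, following)
```

Each position in the text contributes one pair, so an 8-character text yields 5 pairs.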

From the patterns of digraphs in these four west European languages (all using the Roman alphabet), I wove the pattern of the Main graphs.


Monochrome Graph
(Black & White)
Click on a section to see it magnified.
Magnified section will open in a separate window.
You may need to make the new window full screen to see all detail.

[Image: Monochrome Main graph]


Color Graph
Click on a section to see it magnified.
Magnified section will open in a separate window.
You may need to make the new window full screen to see all detail.

[Image: Color Main graph]


Structure of the Main Graphs

The rows (A through I) represent the preceding-digraphs.  The columns (1 through 9) represent the following-digraphs.  re, as a preceding-digraph, is found in row G.  d_, as a following-digraph, is found in column 8.  So we could look at section G8 to see the re-d_ combination, and determine its frequency and the language in which it occurs most often.

I illustrate the 450 most common digraphs.  It is too difficult to display more digraphs, given the constraints of browsers and pixel resolution.  The overall graphs are 450 x 450 pixels, but there is no one-to-one correspondence between digraph and pixel.  More common digraphs are displayed with pixel-lengths greater than 1.00.  Less common digraphs are displayed with pixel-lengths less than 1.00.
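
To make the section labels concrete, here is a small sketch that maps a pixel position on a Main graph to its section name.  It assumes that the nine rows and nine columns are equally sized (450 / 9 = 50 pixels each), that row A is at the top, and that column 1 is at the left; the function name section_of is hypothetical.

```python
def section_of(x, y, size=450, sections=9):
    """Map a pixel (x from the left, y from the top) to a label such as 'G8'."""
    if not (0 <= x < size and 0 <= y < size):
        raise ValueError("pixel lies outside the graph")
    cell = size // sections           # 50 pixels per section (assumed equal)
    row = "ABCDEFGHI"[y // cell]      # rows A through I: preceding-digraphs
    column = x // cell + 1            # columns 1 through 9: following-digraphs
    return f"{row}{column}"


print(section_of(375, 310))
```

Because pixel-lengths vary with frequency, a section holds about 50 digraphs on average rather than exactly 50.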

In the Monochrome graphs, digraph frequency is also represented by brightness.  Black means that this digraph combination did not occur in the sample text.  Bright white means that the digraph combination occurred frequently.  Shades of gray represent intermediate frequencies.  The Color graphs also use brightness and pixel-length to express frequency.

In the Color graphs, the color or hue indicates which language most frequently has this digraph combination:

Language Color
Spanish Red
French Yellow
English Green
German Blue

Strong hues indicate that the digraph combination shows a strong preference for one of the languages.  A weak hue, one tending toward white or gray, indicates that the digraph combination occurs nearly equally in all four languages.
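
One way to turn per-language counts into a pixel color is sketched below.  It is an illustration only, not the formula used for these graphs: it assumes that the hue comes from the language with the highest count, that saturation grows with that language's share of the total, and that brightness scales linearly with the total count.  The names LANGUAGE_HUE and cell_color, and the max_total parameter, are hypothetical.

```python
import colorsys

# Hue positions on the color wheel (0.0 to 1.0) for the table above.
LANGUAGE_HUE = {"Spanish": 0.0, "French": 1 / 6, "English": 1 / 3, "German": 2 / 3}


def cell_color(counts, max_total):
    """Return an (R, G, B) tuple for one digraph combination.

    counts maps each language to how often the combination occurred;
    max_total is the largest total count found in the whole table.
    """
    total = sum(counts.values())
    if total == 0:
        return (0, 0, 0)                      # never occurred: black
    language, top = max(counts.items(), key=lambda item: item[1])
    share = top / total
    n = len(counts)
    # 0.0 when all languages are equal, 1.0 when one language has everything.
    saturation = (share - 1 / n) / (1 - 1 / n)
    value = min(1.0, total / max_total)       # linear brightness (assumed)
    r, g, b = colorsys.hsv_to_rgb(LANGUAGE_HUE[language], saturation, value)
    return (round(r * 255), round(g * 255), round(b * 255))


print(cell_color({"Spanish": 10, "French": 0, "English": 0, "German": 0}, 10))
print(cell_color({"Spanish": 5, "French": 5, "English": 5, "German": 5}, 20))
```

With these assumptions, a combination found only in Spanish comes out pure red, and one shared equally by all four languages comes out white or gray, which matches the description above.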

I find the graphs fascinating, and I hope that you do also.  The Main graphs give us a bird’s-eye view of the frequencies.  They may be compared to spectroscopes, or to chemical electrophoresis.  But they do not allow us to see information about the individual digraph.  For that, we must examine the Detail graphs.  Even these graphs are complex, but they show enough detail for us to examine individual digraphs.


Structure of the Detail Graphs

When you click on a section of a Main graph, you will see the corresponding Detail graph, which magnifies that cell.  The pixel lengths are multiplied by 12.  The Detail graphs show the same colors and brightness as the Main graphs.  And they show us the digraphs along the axes.

The Detail graphs are just as complex as the Main graphs.  Remember that these graphs are displaying (on average) 50 digraphs along the horizontal and vertical axes.  The digraphs cannot be displayed along a single line.  In order to list them all, the digraphs are displayed in 4 staggered lines.  The example below shows how I do it.  Along the vertical axis (the preceding-digraph) we see the digraphs co, ha, no, pa, so ...  Along the horizontal axis (the following-digraph) we see the digraphs _p, _c, _l, _e, _d ...

[Image: Detail graph excerpt G2C, showing the staggered digraph labels]

This is the order in which they occur on the axis.  But I had to place these digraphs on different lines to squeeze them into a manageable space.
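
The staggering can be imitated in plain text.  The sketch below assumes that labels are assigned to the 4 lines in rotation (first label on line 1, second on line 2, and so on), which may differ from the exact layout in the Detail graphs; the function name stagger_labels and the slot_width parameter are hypothetical.

```python
def stagger_labels(labels, lines=4, slot_width=1):
    """Spread axis labels over several lines so neighbors do not collide."""
    if not labels:
        return [""] * lines
    width = slot_width * (len(labels) - 1) + max(len(label) for label in labels)
    rows = [[" "] * width for _ in range(lines)]
    for k, label in enumerate(labels):
        start = k * slot_width                # position along the axis
        rows[k % lines][start:start + len(label)] = label
    return ["".join(row).rstrip() for row in rows]


for line in stagger_labels(["_p", "_c", "_l", "_e", "_d"]):
    print(line)
```

Reading the lines from top to bottom and left to right recovers the original order along the axis.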

How Did I Do This?

For sample text, I used 1.8 megabytes of on-line text (evenly distributed among English, Spanish, French, and German).  60% of the text came from Wikipedia articles (discussing the USA, France, Mexico, Germany, computers, television, religion, the sun, and the moon).  10% came from Yahoo! news articles.  And 30% came from children’s stories, fables, and fairy tales.

I edited the text to remove the square brackets [] of Wikipedia citations, the captions for pictures, and gratuitous line breaks.  I also converted or translated the following characters:

  • Upper case characters were converted to lower case characters.
  • Double-spaces (double-blanks) were converted to single spaces.
  • Spaces became underscores “_”.
  • Numerals were converted to “9”.
  • Punctuation that ends a sentence, or begins a sentence in Spanish, (period, question mark, exclamation point …) was converted to the exclamation point “!”.
  • All other characters were translated to the crosshatch “#”.
  • The beginning and end of paragraphs were converted to "¶¶".
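
The conversions listed above can be sketched in Python as follows.  Several details are assumptions, because the list does not spell them out: the set of sentence punctuation is taken to be . ? ! ¿ ¡, runs of more than two spaces are also collapsed to one, accented letters are kept as letters, and each boundary between paragraphs becomes "¶¶".  The function names are hypothetical.

```python
import re

SENTENCE_MARKS = set(".?!¿¡")   # assumed set of sentence punctuation


def normalize_paragraph(text):
    """Apply the character conversions to one paragraph."""
    text = text.lower()
    text = re.sub(r" {2,}", " ", text)     # double-spaces become single spaces
    converted = []
    for ch in text:
        if ch == " ":
            converted.append("_")
        elif ch.isdigit():
            converted.append("9")
        elif ch in SENTENCE_MARKS:
            converted.append("!")
        elif ch.isalpha():
            converted.append(ch)           # letters, including accented ones
        else:
            converted.append("#")          # all other characters
    return "".join(converted)


def normalize_text(paragraphs):
    """Join converted paragraphs, marking each boundary with '¶¶' (assumed)."""
    return "¶¶".join(normalize_paragraph(p) for p in paragraphs)


print(normalize_text(["Red  Ball 42.", "¿Qué?"]))
```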

Using Visual Basic, I scanned all of the text and found every digraph that occurs in it.  I chose the 450 most frequent digraphs for further analysis.  I created a separate 450 x 450 cross-tabulation table for each of the four languages.
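
The counting step might look like the following in Python.  This is a sketch under stated assumptions, not a port of the Visual Basic program: the function names top_digraphs and crosstab are hypothetical, and the table is a plain list of lists with rows for preceding-digraphs and columns for following-digraphs.

```python
from collections import Counter


def top_digraphs(text, n):
    """Return the n most frequent digraphs in the text, most frequent first."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return [digraph for digraph, _ in counts.most_common(n)]


def crosstab(text, digraphs):
    """Count how often each preceding-digraph is followed by each following-digraph."""
    index = {digraph: k for k, digraph in enumerate(digraphs)}
    table = [[0] * len(digraphs) for _ in digraphs]
    for i in range(len(text) - 3):
        row = index.get(text[i:i + 2])         # preceding-digraph
        column = index.get(text[i + 2:i + 4])  # following-digraph
        if row is not None and column is not None:
            table[row][column] += 1
    return table


sample = "abababab"
chosen = top_digraphs(sample, 2)
print(chosen, crosstab(sample, chosen))
```

To get one table per language, crosstab would be called once on each language's text with the same list of 450 digraphs.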

The most complex and subtle part of the process was determining the order of the digraphs on the horizontal and vertical axes.  I followed the principle that digraphs with similar response patterns should be next to each other.  For instance, as preceding-digraphs, ra and na are usually followed by the same digraphs.  So these two digraphs are adjacent on the vertical axis, which defines the rows.  As following-digraphs, zu and ko are usually preceded by the same digraphs, so they are adjacent on the horizontal axis, which defines the columns.

The concept of similarity is easy to understand, but difficult to implement in a computational or statistical algorithm.  I tried many customized methods until I settled on the algorithm that I finally used.  If you want to know the details, feel free to send email to me.  Otherwise, I will not dwell on my method here.
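
As a rough illustration of the adjacency principle only (this is not the algorithm used for these graphs, which is not described on this page), one simple approach is a greedy ordering: start with one digraph, then repeatedly append the remaining digraph whose row of counts is most similar to the last one placed.  The sketch below uses cosine similarity; the function names and the sample counts are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two rows of counts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_order(profiles):
    """Order digraphs so that each one is followed by its most similar remaining neighbor."""
    names = list(profiles)
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = profiles[order[-1]]
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: cosine(last, profiles[name]))
        order.append(best)
        remaining.remove(best)
    return order


# Made-up counts: how often each preceding-digraph is followed by three digraphs.
rows = {"ra": [5, 1, 0], "d_": [0, 0, 4], "na": [4, 1, 0], "s_": [0, 1, 5]}
print(greedy_order(rows))
```

A greedy pass like this depends on the starting digraph and can leave poor matches at the end of the axis, which hints at why the real ordering took many attempts.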


Some Results

The Main graphs are rich in detail.  They are similar to fractals, in that they still show much detail even as you look more closely at smaller sections.  I encourage you to examine the Detail graphs.

But let me provide some orientation.  Let us first look at some of the major horizontal stripes, or rows.


Stripe crossing E & F – These are almost all digraphs whose second character is a space, or blank.  (The Detail graphs show the blanks as underscores).

Blue stripe within I – These are digraphs that tend to occur at the end of German words or syllables, such as eh, tz, or hn.

Red stripe within F – These are digraphs with accented vowels that occur in Spanish, such as án.

Green stripe within A – These are digraphs, mostly with y or w as the second character, which often end English words, such as ly, ow, and sh.

The vertical stripes, or columns, are just as interesting:

Stripe around 2 – These are digraphs where the first character is a space.

Stripe crossing 5 & 6 – These are digraphs where the first character is r, n, or l, and the second character is a consonant.

Red stripe within 7 – These are digraphs where the first character is a or o, and the second character is a space or punctuation (the end of Spanish words or sentences).

And we have some interesting rectangles (the intersection of rows and columns):

Empty E2 & F2 – The text was edited to convert double-spaces to single spaces.  The E/F rows contain digraphs ending with a space.  Column 2 contains digraphs beginning with a space.  A combination in this rectangle would therefore contain two consecutive spaces, so there were virtually none.

Sparse A5 through F5 – With a few exceptions, the digraphs in rows A through F end in consonants.  Column 5 (which crosses over into column 6) contains double-consonant digraphs.  So this intersection contains three consecutive consonants, which is rare.

Bright E3 through F5 – Columns-3, -4, and half of -5 contain digraphs where the first character is a consonant, and the second character is a vowel.  This combination often appears at the beginning of a word.  So this rectangle represents:
letter-space-consonant-vowel
which is what we find in the transition from one word into another.

Red G2 & H2 – The G/H stripe is mostly digraphs where the first character is a consonant, and the second character is o or a.  So this rectangle is:
consonant-o/a-space-letter
Many Spanish nouns end in o or a, so this represents the transition from one Spanish word into another.

Yellow H2 – Row H is mostly digraphs where the second character is é, u, or i.  Many French past participles end in é, u, or i.  So this rectangle represents:
consonant-é/u/i-space-letter
the transition from one French word into another.

Any comments?  Questions?  Suggestions?  Please email to:
 contact

Copyright © 2008, Joseph Schreiner