Archive for the ‘Recipes Folios’ Category

Are the Glyphs placed in specific folio locations?

June 6, 2016

A lot of circumstantial evidence related to the weirdness of the Voynich text (such as the odd repeating words, the curious faintness and boldness of some glyphs, and the sometimes curious positioning of text words and lines) suggests that the folios were perhaps not written left to right (or right to left) and top to bottom.

Instead, suppose the scribe started each folio with a prescription: for example “put an h-Gallows at the top left, then put a c in the middle of the folio, then a 9 at the end of the last line”, and so on. This would be sort of like filling out the answers to a bizarre crossword puzzle.

If there was such a prescription, might it explain some of the Voynich text features?

In the following selected charts I’m showing a virtual folio from the Recipes section. Each chart has lines and columns. Line 1 position 1 is the top left of the folio. Let’s look at the chart folio for Glyph “o”:

Recipes_o

Each disc indicates that the “o” appears at least twice in that location in the Recipes. The size of the disc indicates how many times it appears there: the bigger the disc, the more times it appeared. The random appearance of the chart suggests that “o” is not placed on the page in any particular pattern.
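A minimal sketch of how such a chart could be tallied. It assumes each folio is available as a list of transcribed text lines and that "position" means the character position within the line (spaces included); neither assumption is confirmed by the charts, and the function names and sample folios are made up for illustration.

```python
from collections import Counter

def glyph_position_counts(folios, glyph):
    """Count how often `glyph` occurs at each (line, position) over all folios.

    `folios` is a list of folios; each folio is a list of text lines (strings).
    Positions are 1-based, so (1, 1) is the top left of the virtual folio.
    """
    counts = Counter()
    for folio in folios:
        for line_no, line in enumerate(folio, start=1):
            for pos, ch in enumerate(line, start=1):
                if ch == glyph:
                    counts[(line_no, pos)] += 1
    return counts

def discs(counts, minimum=2):
    """Keep only the locations that would get a disc (count >= minimum)."""
    return {loc: n for loc, n in counts.items() if n >= minimum}

# Two tiny made-up "folios", purely for illustration.
folios = [
    ["soe 1oe", "sam 4oh9"],
    ["4oe 8am", "sae oe"],
]
counts = glyph_position_counts(folios, "s")
print(discs(counts))
```

The disc size in a chart would then simply be proportional to the count stored for each location.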

Let’s now look at the “s” glyph:

Recipes_s

Here it is clear that this glyph vastly prefers the first column, but not the first line. It is infrequently found elsewhere on the folio. In contrast, take a look at the rare glyphs (I just call them “?”):

Recipes_?

These abhor the early columns, and love the ends of the lines. They also seem to prefer the ends of the first lines (notice a little cluster there). Perhaps they hate the “s” glyphs…

The “4” glyph:

Recipes_4

The gap after the first column is explained by how “4” only appears at the start of a word.

Here are some more glyphs:

Recipes_y

Recipes_1

Recipes_2

Recipes_8

Recipes_9

No conclusions here, as usual!

Addendum: the distribution for “c”:

Recipes_c

The Relationship Between Currier Languages “A” and “B”

March 1, 2013

Captain Prescott Currier, a cryptographer, looked at the Voynich many moons ago, and made some very perceptive comments about it, which can be read on Rene Zandbergen’s site.

In particular, he noticed that the handwriting was different between some folios and others, and he also noticed (based on glyph/character counts) that there were two “languages” being used.

When I first looked at the manuscript, I was principally considering the initial (roughly) fifty folios, constituting the herbal section. The first twenty-five folios in the herbal section are obviously in one hand and one ‘‘language,’’ which I called ‘‘A.’’ (It could have been called anything at all; it was just the first one I came to.) The second twenty-five or so folios are in two hands, very obviously the work of at least two different men. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five of thirty folios) is in two ‘‘languages,’’ and each ‘‘language’’ is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own ‘‘language.’’ Now, I’m stretching a point a bit, I’m aware; my use of the word language is convenient, but it does not have the same connotations as it would have in normal use. Still, it is a convenient word, and I see no reason not to continue using it.

We can look at some statistics to see what he was referring to. Let’s compare the most common words in Folios 1 to 25 (in the Herbal section, Language A, written in Hand 1) and in Folios 107 to 116 (in the Recipes section, Language B, written in a different Hand):

Comparison between word frequencies in Languages A and B

So, for example, in Language A the most common word is “8am” and it occurs 192 times in the folios, whereas in Language B the most common word is “am”, occurring 137 times.

We might expect that these are the same word, enciphered differently. The question then is, how does one convert between words in Language A and words in Language B, and vice versa? In the case of “8am” and “am” it’s just a question of dropping the “8”, as if “8” is a null character in Language A. In the case of the next most popular words, “1oe”(A) and “1c89”(B), it looks like “oe”(A) converts to “c89”(B). And so on.

If we look at the most popular nGrams (substrings) in both Languages, perhaps there is a mapping that translates between the two. Perhaps the cipher machinery that was used to generate the text had different settings, that produced Language A in one configuration, and Language B in another. Perhaps, if we look at the nGram correspondence that results in the best match between the two Languages, a clue will be revealed as to how that machinery worked.

This involves some software (I’m using Python now, which is fun). The software first calculates the word frequencies for Language A and B in a set of folios (the table above is an output from this stage). It then calculates the nGram frequencies for each Language. Here are the top 10:

nGramFrequencies
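For readers who want to reproduce this stage, here is a rough sketch of the two counting steps. It assumes the transcription is a list of lines with space-separated words, and that nGrams are substrings of length 1 to 3 weighted by how often their word occurs; the nGram lengths, the function names and the sample lines are illustrative assumptions, not details from my actual software.

```python
from collections import Counter

def word_frequencies(lines):
    """Count whole words; `lines` is a list of space-separated strings."""
    return Counter(word for line in lines for word in line.split())

def ngram_frequencies(word_counts, n_min=1, n_max=3):
    """Count every substring of length n_min..n_max, weighted by word count."""
    grams = Counter()
    for word, count in word_counts.items():
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                grams[word[i:i + n]] += count
    return grams

# Made-up sample lines, purely for illustration.
lines_a = ["8am 1oe 8am", "1oe oe 8am"]
words_a = word_frequencies(lines_a)
grams_a = ngram_frequencies(words_a)
print(words_a.most_common(3))
print(grams_a.most_common(10))
```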

The software then runs a Genetic Algorithm to find the best mapping between the two sets of nGrams, so that when the mapping is applied to all words in Language B, it produces a set of words in Language A whose frequencies most closely match the frequencies of words observed in Language A (i.e. the frequencies shown in the first table above).
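The part of this that is easiest to show is the score a candidate mapping receives. The sketch below is not the Genetic Algorithm itself, only one plausible fitness function that such a search could minimise: it rewrites each Language B word with the mapping (longest nGram first, which is an assumption) and sums the absolute differences in relative word frequency. The names `apply_mapping` and `fitness` and the sample counts are hypothetical.

```python
from collections import Counter

def apply_mapping(word, mapping):
    """Rewrite a Language B word by replacing nGrams left to right,
    always trying the longest matching nGram first."""
    longest = max(map(len, mapping), default=1)
    out, i = [], 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            gram = word[i:i + n]
            if gram in mapping:
                out.append(mapping[gram])
                i += n
                break
        else:
            out.append(word[i])  # unmapped glyphs are copied unchanged
            i += 1
    return "".join(out)

def fitness(mapping, freq_a, freq_b):
    """Lower is better: total absolute difference between the relative word
    frequencies of Language A and of Language B after mapping."""
    mapped = Counter()
    for word, count in freq_b.items():
        mapped[apply_mapping(word, mapping)] += count
    total_a, total_m = sum(freq_a.values()), sum(mapped.values())
    return sum(abs(freq_a.get(w, 0) / total_a - mapped.get(w, 0) / total_m)
               for w in set(freq_a) | set(mapped))

# Made-up counts; the mapping mirrors the two correspondences guessed above.
freq_a = {"8am": 3, "1oe": 1}
freq_b = {"am": 3, "1c89": 1}
mapping = {"am": "8am", "c89": "oe"}
print(apply_mapping("1c89", mapping), fitness(mapping, freq_a, freq_b))
```

A search procedure (genetic or otherwise) would then mutate `mapping` and keep the variants with the lowest score.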

Here is an initial result. With the following mapping, you can take most common words in Language B, and convert them to Language A.

Table for converting between a Language B word and a Language A word

A couple of remarks. This is an early result and probably not the best match. There are some interesting correspondences:

  • “9” and “c” are immutable, and have the same function
  • Another interesting feature is that “4o” in Language B maps to “o” in Language A, and vice versa!
  • in Language B, “ha” maps to “h” in Language A, as if “a” is a null

In the Comments, Dave suggested looking at word pair frequencies between the Languages. Here is a table of the most common pairs in each Language.

Common word pairs in Languages A and B

For clarity, I am using what I call the “HerbalRecipesAB” folios for this study i.e.

Using folios for HerbalRecipeAB : [107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]

More results coming …

More on Consonants/Vowels in the Recipes

September 26, 2010

Here are some results from the Recipes folios for the verbose homophonic cipher idea proposed earlier.

Using the Recipes Folios, we find
1085 lines of VMs words
3150 different words on those lines

Looking for word sequences within a line that fit the pattern XYYZ (note that X=Y as well as X=Z is allowed):

50 XYYZ sequences
102 different words

(This is somewhat disappointing, as 102 is a small fraction of the total vocabulary.)
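The search itself is simple. A sketch, assuming each line is a string of space-separated VMs words (the sample lines are made up, though the first reuses the "2coe" run mentioned below):

```python
def find_xyyz(lines):
    """Return every 4-word window (X, Y, Y, Z) found within a single line.

    Only the two middle words must match; X == Y, Z == Y and X == Z
    are all allowed.
    """
    hits = []
    for line in lines:
        words = line.split()
        for i in range(len(words) - 3):
            x, y1, y2, z = words[i:i + 4]
            if y1 == y2:
                hits.append((x, y1, y2, z))
    return hits

lines = [
    "2oy 2coe 2coe 2coe 4oh1c89",
    "okae 1oe 1oe ay",
    "8am oe 1oe ay",
]
hits = find_xyyz(lines)
different_words = {w for seq in hits for w in seq}
print(len(hits), len(different_words))
```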

Two of the 50 sequences are of the form YYYZ or XYYY (“2oy 2coe 2coe 2coe” and “2coe 2coe 2coe 4oh1c89“) and so I remove “2coe” from further consideration as being ambiguously a vowel or a consonant or something else such as a number digit. This involves removing it wherever it appears in any of the 50 sequences.

Next I collect a list of all the different Y words (there are 31), and for each, a list of the X and Z words it appears with.

The hypothesis is that for each sequence, X and Z must code for vowels and Y for a consonant, or vice versa. (This holds for Latin, for example.)

At this point, the words can be categorised into two sets: Category 1 and Category 2. A Cat1 word cannot appear in the Cat2 list, and vice versa. The categorisation is done by first taking the initial Y word, assigning it to Cat1, and assigning its XZ words to Cat2:

Y=4ohii89 (Cat1)    X/Z=4oh29 1sk9 e1c89 4ohco 82coe 1c9 4ohcc89 4okc9 (Cat2)

The next Y word is then examined:

Y=4ohii9 X/Z=4okc9 okc8(

Since 4okc9 has already been categorised as Cat2 in the first step, it follows that 4ohii9 is Cat1, and okc8( is Cat2.

This procedure continues for several iterations over all the Y and X/Z words until all have either been allocated to Cat1 or Cat2 or cannot be allocated to either (16 words). One word cannot be unambiguously assigned: 4ohcc9
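The procedure amounts to two-colouring a graph of words. Here is a sketch of one way to do it, assuming the input is a dictionary from each Y word to its X/Z words; the function name and the handling of conflicts (a conflicting word keeps its first category and is flagged) are my assumptions for illustration. The sample data is just the two steps shown above.

```python
def categorise(neighbours):
    """Split words into Cat1/Cat2 so that each Y word lands in the opposite
    category from its X/Z words.

    `neighbours` maps each Y word to the X/Z words seen around it.
    Returns (cat1, cat2, ambiguous, unallocated).
    """
    cat = {}           # word -> 1 or 2
    ambiguous = set()  # words that were asked to join both categories

    def assign(word, value):
        """Record a category; return True if anything new was learned."""
        if word not in cat:
            cat[word] = value
            return True
        if cat[word] != value and word not in ambiguous:
            ambiguous.add(word)  # keeps its first category, but is flagged
            return True
        return False

    assign(next(iter(neighbours)), 1)  # seed: the first Y word is Cat1
    changed = True
    while changed:                     # iterate until nothing new is learned
        changed = False
        for y, xz in neighbours.items():
            if y in cat:
                other = 3 - cat[y]
                for w in xz:
                    changed |= assign(w, other)
            else:
                known = {cat[w] for w in xz if w in cat}
                if len(known) == 1:
                    changed |= assign(y, 3 - known.pop())
    cat1 = {w for w, c in cat.items() if c == 1}
    cat2 = {w for w, c in cat.items() if c == 2}
    words = set(neighbours) | {w for xz in neighbours.values() for w in xz}
    return cat1, cat2, ambiguous, words - set(cat)

neighbours = {
    "4ohii89": ["4oh29", "1sk9", "e1c89", "4ohco", "82coe", "1c9",
                "4ohcc89", "4okc9"],
    "4ohii9": ["4okc9", "okc8("],
}
cat1, cat2, ambiguous, unallocated = categorise(neighbours)
print(sorted(cat1), sorted(cat2))
```

Words that never connect to the seed word stay in `unallocated`, which is how some words end up in neither category.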

The contents of the two categories are:

Category 1 (28 words)
4ohii89 oe 1oe 4ohcc9 4ohii9 2cae 4ohc9 kii9 1cae okc8aiN 4okc8( 1ii9 1c8ae 1ae yae 4oh89 8ae 1c8 4okay ohaiN e 1c89kcahaiN ohciiN kcc89 hco8( hae okaiiN okay

Category 2 (18 words)
4oh29 1sk9 e1c89 4ohco 82coe 1c9 4ohcc89 4okc9 okae ohii89 4oh1c9 1c89 ohcokcc9 4o okc8( 1oy ay 4okaiN

How about a “Verbose Homophonic cipher”?

September 24, 2010

I’ve had a bit of a hiatus from the VMs, but it’s always popping up in my mind and niggling me, even when I haven’t got time to spend on it. The latest niggle was the idea that the VMs scribe used a set of simple tables that showed how to convert plaintext letters into codes. So, in an example table, letter “A” is written “4oh”, letter “B” is written “8am” and so on. Also, spaces in the plaintext have their own code. Veteran VMs researcher Philip Neal informed me that this is called a “verbose homophonic cipher”.

Elaborating on the idea:  the scribe uses one of the set of tables for each folio s/he is writing. To encipher the plaintext onto the folio, it’s simply a matter of writing down the VMs “word”  for each letter in the plaintext word. If there is more space on the line for the next plaintext word, the scribe writes down the code for space, and then the codes for the letters in the next word. Long spaces are written by writing the code for space more than once … The next line is used for the next word, and so on.

On the next folio, a different table may be used.

It’s hard to imagine the justification for such a scheme, but it does appear (at least initially) to fit some of the features of the VMs script (especially the repeating VMs words often seen).

I made a quick test that looks at VMs word frequencies on a single folio (in the Recipes section, which has the densest text). The word frequency distribution looks similar to the letter frequency distribution in Latin, apart from the most frequently occurring word, which is much more frequent and which, under this scheme, would code for a space.

However, on a typical folio, there are usually many more VMs words than there are plaintext letters. So the scheme has to be extended to allow the scribe a choice between several different VMs words to encode a single letter. Each table must have a set of words appearing in each plaintext letter column. Something like this:

Plaintext (space) a b
VMs words 8am ay okoe 4ohoe 2ay 1coe faiis 4ay oka
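To make the scheme concrete, here is a small sketch of enciphering with one such table. The table contents reuse the words above, but how they are grouped under each letter is a guess made for illustration, and the function names are hypothetical. It ignores the line layout rules and handles only a single table.

```python
import random

# Hypothetical table: the grouping of words under each letter is illustrative.
TABLE = {
    " ": ["8am", "ay", "okoe"],
    "a": ["4ohoe", "2ay", "1coe"],
    "b": ["faiis", "4ay", "oka"],
}

def encipher(plaintext, table, rng=random):
    """Replace each plaintext letter (and space) by one of its VMs words."""
    return " ".join(rng.choice(table[ch]) for ch in plaintext)

def decipher(ciphertext, table):
    """Invert the table; works because no VMs word sits under two letters."""
    reverse = {w: ch for ch, words in table.items() for w in words}
    return "".join(reverse[w] for w in ciphertext.split())

rng = random.Random(0)  # seeded so the run is repeatable
cipher = encipher("ab ba", TABLE, rng)
print(cipher)
print(decipher(cipher, TABLE))
```

Because each letter has several possible words, the same plaintext can produce many different cipher lines, while deciphering stays unambiguous.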

If this is indeed the scheme, one would expect to see patterns in the VMs word sequences that match patterns seen in the letter sequences of e.g. Latin words. Also, as Philip Neal pointed out, patterns like “word1 word2 word2 word1” would indicate a plaintext letter sequence of either “vowel consonant consonant vowel” or vice versa.

Looking through the whole of the VMs for sequence patterns (on the same line of text), I found the following:

  • There are no 4 word sequences that repeat at all
  • There are only four 3 word sequences that repeat, and each only twice
  • There are no sequences at all of the form “xyyx”

(all of which I find rather surprising, and thought provoking).
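A sketch of how these checks could be run, assuming the text is a list of lines of space-separated VMs words. The function names and sample lines are made up, and treating "xyyx" as requiring x and y to differ is my assumption.

```python
from collections import Counter

def repeated_sequences(lines, n):
    """Return the n-word sequences (within one line) occurring more than once."""
    seen = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            seen[tuple(words[i:i + n])] += 1
    return {seq: count for seq, count in seen.items() if count > 1}

def find_xyyx(lines):
    """Return every window of the form x y y x (x != y) within a line."""
    hits = []
    for line in lines:
        w = line.split()
        for i in range(len(w) - 3):
            if w[i] == w[i + 3] and w[i + 1] == w[i + 2] and w[i] != w[i + 1]:
                hits.append(tuple(w[i:i + 4]))
    return hits

# Made-up lines, purely for illustration.
lines = ["8am oe 1oe ay", "okae 8am oe 1oe", "ay oe oe ay"]
print(repeated_sequences(lines, 3))
print(repeated_sequences(lines, 4))
print(find_xyyx(lines))
```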

So it looks like this hypothesis is dead in the water, and can be ticked off that long list of “things it might have been but in fact don’t fit”!

(It turns out that Elmar Vogt has been working on a related, but more sophisticated, idea, which he describes on his blog and calls a “Stroke Theory”.)

Glyph Sequence Probabilities

February 26, 2010

In the following tables, the probability (0..1) of finding a glyph following another glyph is shown, for various parts of the Voynich manuscript and also for some other texts.

In the tables, the ‘ ‘ character (blank) signifies the start or end of a word. The “#” character signifies a rare character not listed in the tables.

For example, in the Recipes table below (generated from the “Recipes” section of the VMs), the probability of finding “o” as the first character of a VMs word can be found by looking up the row for the ” ” (blank) in the first column, then moving along to the “o” column and reading off the probability = 0.2.
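For anyone wanting to build a table like this from their own transcription, here is a minimal sketch. It assumes a flat list of words and pads each with a blank at both ends so that word starts and ends are counted as transitions; it does not lump rare glyphs into "#", and the function name and sample words are made up.

```python
from collections import Counter, defaultdict

def transition_table(words):
    """Return table[first][following] = probability, with ' ' marking the
    start and end of a word. Each row sums to 1."""
    counts = defaultdict(Counter)
    for word in words:
        padded = " " + word + " "
        for first, following in zip(padded, padded[1:]):
            counts[first][following] += 1
    return {
        first: {g: n / sum(row.values()) for g, n in row.items()}
        for first, row in counts.items()
    }

# Made-up word list, purely for illustration.
words = ["4oe", "oe", "1oe", "4ohoe", "oham"]
table = transition_table(words)
print(table[" "])  # probabilities for the first glyph of a word
print(table["4"])  # what follows "4"
```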

Some immediate features of note:


1) The most probable glyph to find at the start of a word in the Recipes: “o” (0.2 in the Recipes table below, against 0.13 for “1”)

2) In the Herbal: “1”

3) In the Labels: “o”

4) The glyph “4” is commonly found in the Recipes and Herbals text, and it is followed by “o” at least 90% of the time. It is very rarely found in the Labels.

5) The most probable glyph to find at the end of a word in the Recipes: “n”

6) In the Herbal: “m”

7) In the Labels: “p”

The most probable words

These tables allow us to generate the “most probable” words (i.e. just by taking the most probable transitions in turn)

1) Recipes: “oe”

2) Herbal: “1oe”

3) Labels: “oe”

4) Star names: “alalal….”

5) Thomas Hardy: “s”

6) Augustinus (Latin): “is”
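The "most probable word" walk can be sketched as follows. The function name and the `max_len` cut-off are my additions for illustration (the cut-off is needed because some tables loop forever, as the star names do); the numbers are a few entries copied from the StarNamesLatham.txt table further down.

```python
def most_probable_word(table, max_len=6):
    """Greedy walk: start at ' ', always take the likeliest next glyph,
    stop at ' ' (end of word) or after max_len glyphs."""
    word, current = "", " "
    while len(word) < max_len:
        row = table[current]
        current = max(row, key=row.get)
        if current == " ":
            break
        word += current
    return word

# A few entries copied from the StarNamesLatham.txt table.
star_names = {
    " ": {"a": 0.26, "s": 0.1, "m": 0.1},
    "a": {" ": 0.09, "l": 0.16, "r": 0.1},
    "l": {" ": 0.14, "a": 0.19, "i": 0.09},
}
print(most_probable_word(star_names))

# A toy table in which the walk ends by itself.
toy = {" ": {"x": 1.0}, "x": {" ": 1.0}}
print(most_probable_word(toy))
```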

Normalized Transition table: vertical = first character, horizontal = following character voyn_101Recipes_Sentences.txt

 

         ' '     'o'      'c'     'a'     '9'     '1'     'e'     '8'     'h'     'y'     'k'     '4'      '2'     '7'     'm'     'g'     's'     'n'     'p'     '#'
' ' :   0.0     0.2      0.0     0.05    0.05    0.13    0.07    0.03    0.03    0.01    0.04    0.12     0.04    0.0     0.0     0.04    0.01    0.0     0.0     0.06
'o' :   0.05    0.0      0.03    0.02    0.0     0.01    0.18    0.11    0.17    0.08    0.14    0.0      0.0     0.01    0.0     0.03    0.03    0.0     0.0     0.06
'c' :   0.01    0.22     0.13    0.05    0.12    0.02    0.0     0.21    0.02    0.0     0.01    0.0      0.0     0.08    0.0     0.0     0.01    0.0     0.0     0.03
'a' :   0.01    0.0      0.0     0.0     0.0     0.0     0.22    0.01    0.01    0.22    0.0     0.0      0.0     0.0     0.14    0.0     0.0     0.11    0.07    0.14
'9' :   0.81    0.0      0.0     0.0     0.0     0.03    0.01    0.0     0.05    0.0     0.02    0.0      0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.01
'1' :   0.0     0.14     0.5     0.05    0.07    0.0     0.01    0.08    0.02    0.0     0.01    0.0      0.0     0.02    0.0     0.0     0.02    0.0     0.0     0.03
'e' :   0.3     0.07     0.0     0.06    0.03    0.11    0.01    0.02    0.2     0.01    0.02    0.0      0.04    0.0     0.0     0.0     0.01    0.0     0.0     0.04
'8' :   0.09    0.03     0.03    0.3     0.38    0.02    0.02    0.0     0.0     0.0     0.0     0.0      0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.05
'h' :   0.01    0.04     0.42    0.22    0.05    0.11    0.01    0.0     0.0     0.0     0.0     0.0      0.02    0.0     0.0     0.0     0.0     0.0     0.0     0.06
'y' :   0.58    0.1      0.0     0.18    0.04    0.03    0.0     0.0     0.0     0.0     0.0     0.0      0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.01
'k' :   0.01    0.08     0.32    0.22    0.05    0.18    0.0     0.01    0.0     0.0     0.0     0.0      0.02    0.0     0.0     0.0     0.0     0.0     0.0     0.05
'4' :   0.0     0.91     0.04    0.0     0.0     0.0     0.0     0.0     0.01    0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.01
'2' :   0.0     0.13     0.58    0.03    0.07    0.0     0.0     0.05    0.02    0.0     0.01    0.0      0.0     0.01    0.0     0.0     0.01    0.0     0.0     0.02
'7' :   0.03    0.0      0.02    0.28    0.57    0.0     0.01    0.0     0.01    0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.04
'm' :   0.93    0.02     0.0     0.01    0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
'g' :   0.0     0.19     0.02    0.17    0.03    0.49    0.0     0.03    0.0     0.0     0.0     0.0      0.02    0.0     0.0     0.0     0.0     0.0     0.0     0.02
's' :   0.51    0.11     0.01    0.22    0.04    0.0     0.01    0.01    0.0     0.0     0.0     0.0      0.02    0.0     0.0     0.0     0.0     0.0     0.0     0.01
'n' :   0.95    0.0      0.0     0.01    0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
'p' :   0.94    0.01     0.0     0.0     0.0     0.0     0.0     0.01    0.0     0.0     0.0     0.0      0.0     0.01    0.0     0.0     0.0     0.0     0.0     0.0
'#' :   0.24    0.09     0.14    0.11    0.1     0.11    0.02    0.03    0.01    0.01    0.0     0.0      0.0     0.01    0.0     0.0     0.02    0.0     0.0     0.02

Normalized Transition table: vertical = first character, horizontal = following character Voyn101_First10Herbal_Sentences.txt

 

         ' '     'o'     '9'     '1'     'a'     'c'     '8'     'h'     'k'      'e'     'y'     's'     'm'     '2'     '4'     'g'     'p'     'n'     'j'      '#'
' '  :   0.0     0.15    0.07    0.17    0.0     0.0     0.08    0.08    0.05     0.01    0.01    0.04    0.0     0.05    0.08    0.03    0.0     0.0     0.01     0.09
'o'  :   0.06    0.01    0.02    0.02    0.02    0.02    0.13    0.11    0.13     0.19    0.1     0.02    0.02    0.0     0.0     0.01    0.02    0.0     0.0      0.03
'9'  :   0.73    0.0     0.0     0.03    0.0     0.0     0.05    0.05    0.08     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.01
'1'  :   0.0     0.43    0.14    0.0     0.1     0.21    0.0     0.02    0.02     0.01    0.0     0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.01
'a'  :   0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.12    0.19    0.0     0.27    0.0     0.0     0.0     0.08    0.09    0.0      0.13
'c'  :   0.02    0.28    0.31    0.0     0.09    0.13    0.0     0.03    0.01     0.0     0.0     0.06    0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'8'  :   0.08    0.09    0.29    0.04    0.35    0.02    0.0     0.01    0.0      0.01    0.0     0.0     0.0     0.02    0.0     0.0     0.0     0.0     0.0      0.02
'h'  :   0.0     0.23    0.12    0.22    0.19    0.12    0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.06    0.0     0.0     0.0     0.0     0.0      0.02
'k'  :   0.01    0.29    0.13    0.23    0.15    0.09    0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.03    0.0     0.0     0.0     0.0     0.0      0.02
'e'  :   0.46    0.09    0.06    0.04    0.01    0.0     0.09    0.02    0.03     0.0     0.0     0.06    0.0     0.03    0.0     0.0     0.0     0.0     0.0      0.03
'y'  :   0.68    0.05    0.08    0.08    0.06    0.0     0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
's'  :   0.48    0.12    0.07    0.08    0.14    0.04    0.0     0.01    0.0      0.0     0.0     0.0     0.0     0.01    0.0     0.0     0.0     0.0     0.0      0.0
'm'  :   0.96    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'2'  :   0.03    0.48    0.1     0.0     0.05    0.27    0.0     0.02    0.03     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'4'  :   0.0     0.97    0.0     0.0     0.01    0.0     0.0     0.01    0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'g'  :   0.02    0.21    0.13    0.43    0.06    0.04    0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.08    0.0     0.0     0.0     0.0     0.0      0.0
'p'  :   0.93    0.02    0.0     0.0     0.0     0.0     0.02    0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.02
'n'  :   0.93    0.0     0.03    0.0     0.0     0.0     0.03    0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'j'  :   0.0     0.33    0.09    0.42    0.09    0.04    0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'#'  :   0.23    0.22    0.12    0.04    0.08    0.09    0.01    0.02    0.01     0.0     0.0     0.03    0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.03

 

Normalized Transition table: vertical = first character, horizontal = following character LabelsAll.txt

          ' '     'o'     'a'     '9'     'e'     'y'     '8'     'c'     'h'      '1'     'k'     's'     '7'     'm'     '2'     'p'     'n'     'j'     'g'      '#'
' '  :   0.0     0.48    0.05    0.04    0.01    0.0     0.12    0.01    0.01     0.05    0.01    0.09    0.01    0.0     0.01    0.0     0.0     0.0     0.0      0.03
'o'  :   0.03    0.0     0.0     0.0     0.21    0.14    0.08    0.02    0.14     0.01    0.16    0.03    0.01    0.01    0.0     0.0     0.0     0.02    0.02     0.04
'a'  :   0.02    0.0     0.0     0.0     0.27    0.36    0.01    0.0     0.02     0.0     0.0     0.02    0.0     0.05    0.0     0.06    0.05    0.0     0.0      0.06
'9'  :   0.82    0.0     0.0     0.0     0.0     0.0     0.05    0.0     0.04     0.0     0.03    0.01    0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'e'  :   0.33    0.09    0.07    0.13    0.0     0.0     0.09    0.01    0.06     0.05    0.0     0.06    0.02    0.0     0.01    0.0     0.0     0.0     0.0      0.04
'y'  :   0.33    0.14    0.25    0.13    0.0     0.0     0.01    0.0     0.0      0.06    0.0     0.0     0.01    0.0     0.01    0.0     0.0     0.0     0.0      0.0
'8'  :   0.07    0.09    0.37    0.29    0.03    0.0     0.01    0.02    0.0      0.05    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.03
'c'  :   0.01    0.3     0.05    0.11    0.0     0.0     0.12    0.08    0.04     0.01    0.02    0.04    0.08    0.0     0.0     0.0     0.0     0.01    0.0      0.04
'h'  :   0.01    0.24    0.27    0.11    0.0     0.0     0.0     0.17    0.0      0.13    0.0     0.0     0.0     0.0     0.02    0.0     0.0     0.0     0.0      0.0
'1'  :   0.0     0.26    0.03    0.14    0.0     0.0     0.14    0.26    0.0      0.0     0.0     0.01    0.05    0.0     0.0     0.0     0.0     0.0     0.0      0.01
'k'  :   0.01    0.31    0.2     0.11    0.01    0.0     0.01    0.19    0.0      0.11    0.0     0.0     0.02    0.0     0.01    0.0     0.0     0.0     0.0      0.02
's'  :   0.22    0.21    0.38    0.08    0.0     0.0     0.01    0.02    0.02     0.01    0.0     0.01    0.0     0.0     0.01    0.0     0.0     0.0     0.0      0.01
'7'  :   0.0     0.07    0.39    0.48    0.0     0.0     0.0     0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.02    0.0     0.0     0.0     0.0      0.02
'm'  :   0.69    0.04    0.04    0.04    0.0     0.0     0.08    0.04    0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.04
'2'  :   0.04    0.34    0.0     0.08    0.0     0.0     0.17    0.21    0.0      0.04    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.08
'p'  :   0.9     0.0     0.0     0.04    0.0     0.0     0.0     0.0     0.0      0.0     0.0     0.0     0.04    0.0     0.0     0.0     0.0     0.0     0.0      0.0
'n'  :   0.89    0.0     0.0     0.05    0.0     0.0     0.05    0.0     0.0      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'j'  :   0.0     0.15    0.15    0.15    0.0     0.0     0.1     0.1     0.0      0.31    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'g'  :   0.0     0.38    0.23    0.0     0.0     0.0     0.0     0.07    0.0      0.3     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'#'  :   0.18    0.15    0.13    0.2     0.01    0.01    0.02    0.08    0.03     0.03    0.02    0.03    0.0     0.0     0.01    0.0     0.0     0.0     0.0      0.06

Normalized Transition table: vertical = first character, horizontal = following character StarNamesLatham.txt

          ' '     'a'     'i'     'h'     'e'     'l'     'r'     's'     'n'      'm'     'u'     't'     'b'     'k'     'd'     'c'     'z'     'f'     'o'      '#'
' '  :   0.0     0.26    0.01    0.04    0.03    0.0     0.04    0.1     0.03     0.1     0.01    0.04    0.03    0.06    0.04    0.01    0.02    0.02    0.0      0.07
'a'  :   0.09    0.01    0.05    0.07    0.0     0.16    0.1     0.04    0.06     0.03    0.02    0.07    0.05    0.04    0.03    0.02    0.02    0.01    0.0      0.04
'i'  :   0.1     0.06    0.0     0.02    0.02    0.05    0.09    0.07    0.11     0.06    0.0     0.03    0.07    0.04    0.05    0.03    0.01    0.02    0.01     0.07
'h'  :   0.29    0.35    0.09    0.0     0.11    0.0     0.01    0.0     0.0      0.0     0.04    0.0     0.0     0.0     0.0     0.0     0.0     0.01    0.03     0.0
'e'  :   0.12    0.01    0.03    0.01    0.02    0.15    0.09    0.05    0.13     0.04    0.03    0.06    0.04    0.02    0.05    0.03    0.02    0.0     0.0      0.04
'l'  :   0.14    0.19    0.09    0.04    0.07    0.02    0.0     0.04    0.0      0.04    0.02    0.03    0.03    0.02    0.01    0.04    0.0     0.03    0.0      0.1
'r'  :   0.21    0.31    0.09    0.0     0.06    0.0     0.02    0.02    0.03     0.0     0.05    0.02    0.0     0.03    0.02    0.0     0.01    0.02    0.01     0.01
's'  :   0.19    0.3     0.08    0.1     0.04    0.0     0.0     0.01    0.0      0.0     0.08    0.01    0.0     0.0     0.0     0.11    0.0     0.0     0.02     0.0
'n'  :   0.34    0.21    0.12    0.01    0.08    0.0     0.0     0.01    0.0      0.0     0.02    0.03    0.01    0.03    0.0     0.01    0.01    0.0     0.0      0.03
'm'  :   0.16    0.29    0.17    0.0     0.15    0.0     0.0     0.0     0.0      0.02    0.11    0.0     0.01    0.0     0.0     0.0     0.0     0.0     0.03     0.0
'u'  :   0.01    0.02    0.0     0.03    0.02    0.07    0.15    0.08    0.04     0.05    0.01    0.07    0.12    0.05    0.07    0.02    0.05    0.0     0.0      0.04
't'  :   0.28    0.28    0.05    0.18    0.04    0.0     0.0     0.0     0.01     0.0     0.03    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.04     0.0
'b'  :   0.29    0.29    0.08    0.02    0.13    0.0     0.01    0.01    0.0      0.0     0.03    0.0     0.04    0.0     0.0     0.0     0.0     0.0     0.01     0.0
'k'  :   0.22    0.4     0.11    0.01    0.04    0.01    0.04    0.0     0.0      0.0     0.04    0.0     0.02    0.03    0.01    0.0     0.0     0.0     0.01     0.0
'd'  :   0.27    0.22    0.11    0.12    0.1     0.0     0.02    0.02    0.0      0.0     0.04    0.0     0.0     0.0     0.02    0.0     0.0     0.01    0.02     0.01
'c'  :   0.07    0.22    0.02    0.37    0.06    0.04    0.02    0.0     0.0      0.0     0.02    0.02    0.01    0.01    0.01    0.03    0.0     0.0     0.02     0.0
'z'  :   0.09    0.32    0.15    0.01    0.16    0.01    0.0     0.0     0.04     0.01    0.14    0.0     0.0     0.0     0.0     0.0     0.02    0.0     0.0      0.02
'f'  :   0.19    0.42    0.15    0.0     0.05    0.0     0.03    0.0     0.0      0.0     0.02    0.0     0.0     0.0     0.02    0.0     0.01    0.05    0.02     0.0
'o'  :   0.14    0.01    0.0     0.02    0.0     0.08    0.18    0.03    0.17     0.12    0.04    0.02    0.01    0.02    0.01    0.02    0.01    0.0     0.0      0.04
'#'  :   0.05    0.33    0.07    0.14    0.14    0.01    0.01    0.0     0.0      0.01    0.07    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.03     0.06

Normalized Transition table: vertical = first character, horizontal = following character ThomasHardy.txt

          ' '     'e'     'i'     's'     'n'     'r'     'a'     't'     'o'      'l'     'd'     'c'     'g'     'u'     'h'     'p'     'm'     'y'     'f'      '#'
' '  :   0.0     0.03    0.03    0.11    0.02    0.04    0.05    0.05    0.02     0.03    0.05    0.08    0.02    0.01    0.04    0.06    0.04    0.0     0.04     0.16
'e'  :   0.15    0.03    0.0     0.1     0.09    0.14    0.06    0.02    0.0      0.04    0.14    0.03    0.0     0.0     0.0     0.01    0.02    0.0     0.0      0.06
'i'  :   0.0     0.04    0.0     0.08    0.31    0.04    0.02    0.08    0.07     0.05    0.04    0.05    0.03    0.0     0.0     0.01    0.03    0.0     0.01     0.06
's'  :   0.31    0.09    0.07    0.07    0.0     0.0     0.02    0.14    0.03     0.01    0.0     0.02    0.0     0.03    0.06    0.03    0.01    0.0     0.0      0.03
'n'  :   0.13    0.11    0.05    0.05    0.01    0.0     0.03    0.1     0.03     0.0     0.08    0.06    0.21    0.01    0.0     0.0     0.0     0.0     0.0      0.05
'r'  :   0.11    0.23    0.1     0.04    0.02    0.02    0.09    0.05    0.07     0.01    0.03    0.01    0.01    0.02    0.0     0.0     0.02    0.03    0.0      0.04
'a'  :   0.0     0.0     0.05    0.07    0.14    0.14    0.0     0.13    0.0      0.1     0.03    0.05    0.02    0.01    0.0     0.03    0.02    0.01    0.0      0.09
't'  :   0.17    0.17    0.17    0.03    0.0     0.06    0.07    0.03    0.03     0.02    0.0     0.01    0.0     0.03    0.09    0.0     0.0     0.03    0.0      0.02
'o'  :   0.01    0.0     0.01    0.04    0.21    0.12    0.01    0.04    0.04     0.04    0.02    0.02    0.01    0.11    0.0     0.03    0.06    0.0     0.01     0.12
'l'  :   0.1     0.19    0.11    0.01    0.0     0.0     0.1     0.02    0.08     0.1     0.03    0.0     0.0     0.02    0.0     0.0     0.0     0.14    0.0      0.03
'd'  :   0.45    0.15    0.12    0.03    0.0     0.03    0.02    0.0     0.02     0.02    0.02    0.0     0.0     0.02    0.0     0.0     0.0     0.01    0.0      0.02
'c'  :   0.02    0.16    0.05    0.0     0.0     0.04    0.1     0.09    0.18     0.04    0.0     0.02    0.0     0.05    0.11    0.0     0.0     0.01    0.0      0.07
'g'  :   0.42    0.11    0.05    0.02    0.01    0.06    0.05    0.0     0.03     0.04    0.0     0.0     0.02    0.02    0.08    0.0     0.0     0.0     0.0      0.0
'u'  :   0.0     0.03    0.03    0.14    0.14    0.17    0.03    0.07    0.0      0.1     0.02    0.03    0.04    0.0     0.0     0.03    0.04    0.0     0.0      0.03
'h'  :   0.11    0.28    0.13    0.01    0.0     0.02    0.14    0.06    0.13     0.0     0.0     0.0     0.0     0.03    0.0     0.0     0.0     0.01    0.0      0.01
'p'  :   0.04    0.21    0.08    0.01    0.0     0.14    0.13    0.03    0.1      0.08    0.0     0.0     0.0     0.02    0.04    0.06    0.0     0.0     0.0      0.0
'm'  :   0.06    0.23    0.13    0.02    0.0     0.0     0.17    0.0     0.11     0.0     0.0     0.0     0.0     0.02    0.0     0.1     0.03    0.01    0.0      0.05
'y'  :   0.75    0.05    0.03    0.03    0.0     0.0     0.01    0.0     0.02     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.01    0.0     0.0      0.04
'f'  :   0.04    0.13    0.13    0.0     0.0     0.06    0.14    0.03    0.14     0.08    0.0     0.0     0.0     0.1     0.0     0.0     0.0     0.0     0.07     0.0
'#'  :   0.15    0.2     0.11    0.06    0.02    0.02    0.09    0.03    0.07     0.05    0.0     0.0     0.0     0.03    0.02    0.0     0.0     0.0     0.0      0.03

Normalized Transition table: vertical = first character, horizontal = following character Augustinus_all.txt

          ' '     'e'     'i'     'a'     't'     'r'     's'     'u'     'n'      'm'     'o'     'c'     'l'     'd'     'p'     'b'     'v'     'f'     'g'      '#'
' '  :   0.0     0.04    0.09    0.09    0.04    0.03    0.09    0.01    0.03     0.05    0.02    0.08    0.03    0.07    0.09    0.0     0.05    0.05    0.01     0.03
'e'  :   0.13    0.0     0.0     0.01    0.06    0.2     0.08    0.0     0.14     0.07    0.0     0.05    0.02    0.03    0.01    0.05    0.0     0.0     0.01     0.03
'i'  :   0.08    0.04    0.0     0.05    0.12    0.02    0.14    0.04    0.12     0.04    0.05    0.05    0.03    0.03    0.01    0.04    0.01    0.0     0.02     0.01
'a'  :   0.1     0.07    0.0     0.0     0.15    0.09    0.05    0.02    0.11     0.12    0.0     0.04    0.04    0.02    0.01    0.06    0.01    0.0     0.01     0.0
't'  :   0.16    0.16    0.22    0.15    0.0     0.04    0.0     0.17    0.0      0.0     0.04    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.01
'r'  :   0.12    0.24    0.16    0.15    0.04    0.02    0.01    0.07    0.01     0.01    0.04    0.0     0.0     0.01    0.01    0.01    0.01    0.0     0.0      0.0
's'  :   0.43    0.09    0.08    0.05    0.09    0.0     0.05    0.05    0.0      0.0     0.02    0.05    0.0     0.0     0.02    0.0     0.0     0.0     0.0      0.02
'u'  :   0.01    0.05    0.05    0.04    0.03    0.12    0.18    0.0     0.06     0.19    0.01    0.02    0.06    0.04    0.02    0.01    0.0     0.0     0.01     0.0
'n'  :   0.0     0.1     0.11    0.06    0.27    0.0     0.08    0.05    0.0      0.0     0.05    0.02    0.0     0.11    0.01    0.0     0.01    0.01    0.02     0.0
'm'  :   0.46    0.08    0.1     0.09    0.0     0.0     0.0     0.06    0.01     0.01    0.06    0.0     0.0     0.0     0.03    0.01    0.0     0.0     0.0      0.02
'o'  :   0.14    0.01    0.0     0.0     0.03    0.23    0.08    0.0     0.18     0.04    0.0     0.05    0.06    0.02    0.02    0.02    0.01    0.01    0.01     0.02
'c'  :   0.01    0.18    0.16    0.12    0.1     0.05    0.0     0.1     0.0      0.0     0.18    0.03    0.01    0.0     0.0     0.0     0.0     0.0     0.0      0.01
'l'  :   0.0     0.16    0.28    0.18    0.03    0.0     0.01    0.09    0.0      0.0     0.07    0.01    0.09    0.0     0.0     0.0     0.01    0.0     0.0      0.0
'd'  :   0.0     0.28    0.36    0.1     0.0     0.0     0.0     0.09    0.0      0.0     0.1     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'p'  :   0.0     0.25    0.12    0.09    0.04    0.16    0.02    0.06    0.0      0.0     0.12    0.0     0.07    0.0     0.01    0.0     0.0     0.0     0.0      0.01
'b'  :   0.0     0.11    0.14    0.38    0.0     0.04    0.02    0.2     0.0      0.0     0.03    0.0     0.01    0.0     0.0     0.0     0.0     0.0     0.0      0.0
'v'  :   0.0     0.33    0.38    0.11    0.0     0.0     0.0     0.03    0.0      0.0     0.12    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0
'f'  :   0.0     0.17    0.21    0.19    0.0     0.04    0.0     0.13    0.0      0.0     0.08    0.0     0.07    0.0     0.0     0.0     0.0     0.07    0.0      0.0
'g'  :   0.0     0.25    0.24    0.13    0.0     0.12    0.0     0.07    0.1      0.0     0.03    0.0     0.02    0.0     0.0     0.0     0.0     0.0     0.0      0.0
'#'  :   0.22    0.05    0.05    0.09    0.01    0.01    0.0     0.34    0.0      0.0     0.05    0.02    0.0     0.0     0.01    0.0     0.0     0.0     0.0      0.07
Categories: 1, 4, Labels, Latin, o, Recipes Folios, Stars

Genetic Algorithm based Phrase Analysis

February 26, 2010 1 comment

Hypothesis

The following hypothesis occurred to me while I was investigating a cipher theory proposed by Rich Santa Coloma. (This is not a new idea amongst Voynich researchers, but it was new to me!)

The VMs “words” are codes for plaintext character groups, probably trigraphs, digraphs and single characters.

How does one use this system?

1) Take each word in the plaintext
2) Break it up into a sequence of one or more trigraphs, digraphs and single characters by referring to a code table
3) Write the code for each, separated by a space, and terminate the last tri/di-graph/character code with a VMs “9”.
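
The three steps above can be sketched in Python. This is only an illustration, not the scribe's actual procedure: the code table holds just the three pairings from the “quidem” example further down, the greedy longest-first split is an assumption (the scheme allows other splits), and `encode_word` is a hypothetical name.

```python
# Hypothetical code table: plaintext group -> VMs code.
# Only the three pairings from the "quidem" example are used; nothing else is known.
CODE_TABLE = {"qui": "h1cok", "de": "2oe", "m": "1c"}

def encode_word(word, table=CODE_TABLE, end_mark="9"):
    """Split a plaintext word into tri/di/mono-graphs and write their codes."""
    if not word:
        raise ValueError("cannot encode an empty word")
    codes = []
    i = 0
    while i < len(word):
        # Assumption: try a trigraph first, then a digraph, then a single character.
        for size in (3, 2, 1):
            group = word[i:i + size]
            if len(group) == size and group in table:
                codes.append(table[group])
                i += size
                break
        else:
            raise KeyError(f"no code for plaintext starting at {word[i:]!r}")
    # Step 3: the last code of the word is terminated by the end mark.
    codes[-1] += end_mark
    return " ".join(codes)

print(encode_word("quidem"))
```

With this toy table the call yields the first three “words” of the Herbal sentence quoted below.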

The labels are probably treated differently: there may well be a separate set of codes just for the labels.

As an example, take the following “sentence” of 33 “words” from the Herbal folios:

h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds 3o h1cc9 sam 1oh1oe 1oy Hos

Breaking the VMs “words” after each terminal “9” (and dropping the “9”) groups them into a sentence of 13 words:

h1cok 2oe 1c
4ohom 2oy 4ok1coe 1oyoy 2o82c
4okd
4okcc
8am 4okC
Kay o1c
1oe 1oe 4ok1c
8am 1okd
8ae s1
k1c
8am 8C
ko8 8an 4okds 3o h1cc
sam 1oh1oe 1oy Hos
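
The grouping above can be reproduced with a few lines of Python. This is a minimal sketch: `group_tokens` is a hypothetical name, and it assumes that the end mark is simply stripped from the token that carries it.

```python
SENTENCE = ("h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 "
            "Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds "
            "3o h1cc9 sam 1oh1oe 1oy Hos")

def group_tokens(sentence, end_mark="9"):
    """Group VMs tokens into words, closing a word at each token ending in end_mark."""
    words, current = [], []
    for token in sentence.split():
        if token.endswith(end_mark):
            stripped = token[:-len(end_mark)]
            if stripped:                 # a bare end mark adds no code of its own
                current.append(stripped)
            words.append(current)
            current = []
        else:
            current.append(token)
    if current:                          # trailing tokens with no end mark
        words.append(current)
    return words

words = group_tokens(SENTENCE)
for word in words:
    print(" ".join(word))
```

Passing a different `end_mark` makes it easy to try other candidate word-ending glyphs.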

Each of these words is built of one or more codes. E.g. the first word in the list above is “h1cok 2oe 1c” and may be deciphered as

h1cok = “qui”
2oe = “de”
1c = “m”

to make the Latin word “quidem”.
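
As a small illustration, deciphering is then just a table lookup followed by concatenation. The sketch below uses only the three pairings given above; `decipher_word` and the “?” placeholder for unknown codes are assumptions made for the example.

```python
# Code table built from the three pairings in the example (VMs code -> plaintext group).
CODE_TABLE = {"h1cok": "qui", "2oe": "de", "1c": "m"}

def decipher_word(codes, table=CODE_TABLE, unknown="?"):
    """Look up each code and join the plaintext groups into one word."""
    return "".join(table.get(code, unknown) for code in codes)

print(decipher_word(["h1cok", "2oe", "1c"]))
```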

An interesting feature of this cipher/code is that you may have several choices of how to split each plaintext word into tri/di/mono-graphs, but without ambiguity for the decipherer. This may be an explanation for the different frequency distributions between the VMs folios and Currier hands: they were written by different scribes who tended to split the plaintext words differently.

Does the Theory fit the Data, for Latin?

We first take a substantial body of text from the VMs, e.g. the Recipes folios, and feed it through an application that extracts all the VMs words and groups them according to the procedure described above, using one or more arbitrary characters as word-ending marks (typically the VMs “9”). Each sentence so derived is analysed: each of its tokens is analysed for n-gram content, and the frequencies are tallied.

At the end of the processing, the n-grams are sorted into frequency order: the most frequent n-grams appear first in the list.
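
A rough sketch of the tallying and sorting step follows. It assumes that an “n-gram” is any run of 1 to `max_n` consecutive glyphs inside a token; that definition, the `max_n` default and the name `tally_ngrams` are assumptions, not a description of the actual application.

```python
from collections import Counter

def tally_ngrams(words, max_n=3):
    """Count every glyph n-gram (length 1..max_n) in every token of every word."""
    counts = Counter()
    for word in words:
        for token in word:
            for n in range(1, max_n + 1):
                for i in range(len(token) - n + 1):
                    counts[token[i:i + n]] += 1
    return counts

# Three grouped words taken from the Herbal example sentence.
sample = [["8am", "4okC"], ["8am", "1okd"], ["8am", "8C"]]
counts = tally_ngrams(sample)

# most_common() returns the n-grams in frequency order, most frequent first.
ranked = counts.most_common()
print(ranked[:5])
```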

At this point the application moves to its second stage. It ingests a large list of Latin phrases generated by Knox (thanks, Knox!) and processes each word in each unique phrase for n-gram content, thereby extracting the n-gram frequencies for Latin. The phrases are placed in a sorted list, shortest first. The n-grams are sorted by frequency, most frequent first.

Here are the Latin phrase sizes used:

A total of 53834 different phrases of size >= 2 (each row below gives the phrase size, then the number of phrases of that size):
2 4405
3 28152
4 8524
5 3866
6 2227
7 1507
8 1085
9 813
10 633
11 513
12 424
13 356
14 300
15 252
16 209
17 177
18 150
19 130

The third stage of the application is to generate a set of Genetic Algorithm chromosomes. Each chromosome takes the Top N n-grams from the Voynich n-gram list and pairs them with a random selection of the n-grams from the Latin list.

For example, for a Chromosome of length 15 (in fact the GA uses much longer lengths, typically 200) the following table might be used:

V: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam  oy 1c7  e
L: ed gi  n  de   et ae  p     s  du    tu   nd    d tio rum te
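
Building one such chromosome can be sketched as below. The function name `make_chromosome` is hypothetical, and sampling the Latin n-grams without replacement is an assumption (the example table above happens to contain no repeats).

```python
import random

def make_chromosome(voynich_ranked, latin_ranked, top_n, rng):
    """Pair the top_n Voynich n-grams with a random selection of Latin n-grams."""
    top_voynich = voynich_ranked[:top_n]
    # Assumption: each Latin n-gram is used at most once per chromosome.
    latin_choice = rng.sample(latin_ranked, len(top_voynich))
    return dict(zip(top_voynich, latin_choice))

# The n-grams from the example table above, used here as stand-in input lists.
V = "am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e".split()
L = "ed gi n de et ae p s du tu nd d tio rum te".split()

chromosome = make_chromosome(V, L, top_n=15, rng=random.Random(0))
for v, l in chromosome.items():
    print(v, "->", l)
```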

The chromosomes are “scored” by having them translate/decipher a training set of sentences from the input VMs folios. To calculate the score of each chromosome for each sentence, the sentence’s word tokens are converted to Latin n-grams using the chromosome’s table. The converted tokens are then joined together to form the plaintext words. The plaintext words are looked up in the Latin dictionary: the chromosome’s score is increased for valid words, and decreased for invalid words. Once all the words in the sentence have been deciphered in this way, the deciphered sentence is compared with each of the Latin phrases: if a Latin phrase appears in the sentence, the score of the chromosome is increased substantially.
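
The scoring of a single sentence might look roughly like this. The weights (`word_reward`, `word_penalty`, `phrase_bonus`) are invented for illustration, as are the toy dictionary and phrase list; the sketch also assumes that each token is looked up whole in the chromosome's table.

```python
def score_sentence(sentence, table, dictionary, phrases,
                   word_reward=1.0, word_penalty=1.0, phrase_bonus=10.0):
    """Score one sentence (a list of words, each a list of VMs tokens)."""
    # Convert each token via the table and join the pieces into plaintext words.
    plaintext = ["".join(table.get(tok, "?") for tok in word) for word in sentence]
    score = 0.0
    for word in plaintext:
        score += word_reward if word in dictionary else -word_penalty
    # A known phrase appearing as consecutive words earns a substantial bonus.
    for phrase in phrases:
        n = len(phrase)
        if any(tuple(plaintext[i:i + n]) == tuple(phrase)
               for i in range(len(plaintext) - n + 1)):
            score += phrase_bonus
    return score, plaintext

table = {"h1cok": "qui", "2oe": "de", "1c": "m"}   # pairings from the example
dictionary = {"quidem", "de"}                      # toy dictionary
phrases = [("quidem", "de")]                       # toy phrase list
sentence = [["h1cok", "2oe", "1c"], ["2oe"], ["k1c"]]

print(score_sentence(sentence, table, dictionary, phrases))
```

Here two valid words, one invalid word and one phrase hit give 1 + 1 - 1 + 10 = 11.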

The best chromosome found by a Monte Carlo method (basically generating random chromosomes, and retaining the best scoring chromosome) is placed at the top of a list, and then the remaining chromosomes needed for the Genetic Algorithm are generated.

The GA phase now begins: the chromosomes are genetically altered, mated and selected to optimise the best chromosome’s score on the training sentences. This phase is compute intensive.
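
For readers unfamiliar with GAs, here is a compact sketch of the Monte Carlo seeding followed by the alter/mate/select loop. Everything in it is an assumption made to keep the example runnable: the population size, mutation rate, the uniform crossover in `mate`, the choice to allow duplicate Latin n-grams, and above all the toy fitness, which merely counts agreement with an arbitrary target table instead of scoring against a Latin dictionary and phrase list.

```python
import random

def random_chromosome(v_ngrams, l_ngrams, rng):
    return {v: rng.choice(l_ngrams) for v in v_ngrams}

def mutate(chrom, l_ngrams, rng):
    """Replace the Latin n-gram of one randomly chosen Voynich n-gram."""
    child = dict(chrom)
    child[rng.choice(list(child))] = rng.choice(l_ngrams)
    return child

def mate(a, b, rng):
    """Uniform crossover: each mapping comes from one parent or the other."""
    return {v: (a[v] if rng.random() < 0.5 else b[v]) for v in a}

def evolve(v_ngrams, l_ngrams, fitness, pop_size=30, epochs=50,
           monte_carlo_tries=200, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Monte Carlo stage: keep the best of many random chromosomes.
    best = max((random_chromosome(v_ngrams, l_ngrams, rng)
                for _ in range(monte_carlo_tries)), key=fitness)
    population = [best] + [random_chromosome(v_ngrams, l_ngrams, rng)
                           for _ in range(pop_size - 1)]
    history = [fitness(best)]
    for _ in range(epochs):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]      # selection; the best always survives
        children = []
        while len(survivors) + len(children) < pop_size - 1:
            child = mate(rng.choice(survivors), rng.choice(survivors), rng)
            if rng.random() < mutation_rate:
                child = mutate(child, l_ngrams, rng)
            children.append(child)
        fresh = random_chromosome(v_ngrams, l_ngrams, rng)  # one new chromosome per epoch
        population = survivors + children + [fresh]
        history.append(max(fitness(c) for c in population))
    return max(population, key=fitness), history

V = ["am", "ay", "ae", "1c8"]
L = ["ed", "gi", "n", "de"]
target = dict(zip(V, L))                            # arbitrary toy target
toy_fitness = lambda c: sum(c[v] == target[v] for v in target)

best, history = evolve(V, L, toy_fitness)
print(toy_fitness(best), history[:5])
```

Because the top half of each generation is carried over unchanged, the best score can never go down from one epoch to the next.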

Periodically, the GA will report on its progress:

Epoch 311 Cost/Ave 62.845588235294116/61.22993872549012 same 1 Mutated 21.608040201005025% New 1 MS 15
62.845588235294116 GAPhrases$Chromosome@41ec5a Good=128 / 408 = 31.37255% 40 phrases in 25 sentences
S: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam  oy 1c7  e
R: ed gi  n  de   et ae  p     s  du    tu   nd    d tio rum te
Sentence 189
S: 2o ok1c - 1coe hc1 - 1Kc - ohan ae e hC - 4ohan 1cH - 1c7ay ap e2c - 2c7ae ohcay e hc8 - 1coehC - ehc - ohC - 4ohC - 4ohc - 4ohan ap -
T: endve la' binteua tunti nis te' pi et' in'* tunis

In this report, the GA has been running for 311 “epochs” (each epoch is a new generation of chromosomes). The cost (score) of the best chromosome is 62.8, whereas the average score of all the chromosomes in the population is 61.2. In this epoch there has been no change to the best chromosome since the last epoch (“same 1”), about 21.6% of the chromosomes have been mutated, and a fresh chromosome (“New 1”) was inserted (to ensure diversity; this is not usually done in GAs, but I find it produces more reliable training). “MS 15” means that the maximum number of no-change epochs seen so far is 15: the larger this number, the more stagnant the chromosome pool, and the nearer to a solution we are.

The next line of the report shows in detail how the best chromosome has scored: its table produces 128 valid Latin words from a total of 408 translations, i.e. about 31%. In the 25 sentences being used in training, 40 common Latin phrases have been found.

The next two lines show the first 15 n-grams in the mapping that the chromosome is using.

Then the status report shows how the chromosome fared on translating a sentence picked at random from the VMs folios. Since the GA is being trained only on the first few sentences, the remainder are essentially “unseen”, and so a valid, sensible translation in a non-trained sentence is significant.

The sentence picked is number 189 (the training set is the first 25 sentences in this run, so number 189 is well outside that). The VMs source sentence is shown with hyphens “-” separating the tokens that make up words. E.g. “2o ok1c” is the first word. Beneath is the Latin translation. A Latin word followed by a single quote means that that word appears in the Latin dictionary, and is thus valid. A star appearing after a set of valid Latin words indicates that the Latin phrase made up by the words is common, or at least appears in Knox’s list.
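
The marking convention can be sketched as follows. The dictionary and phrase list here are not real data: they are inferred from the marks in the report's “T:” line purely to reproduce it, and `annotate` is a hypothetical name.

```python
def annotate(words, dictionary, phrases):
    """Append ' to dictionary words and * after the last word of each known phrase."""
    marked = [w + "'" if w in dictionary else w for w in words]
    star_at = set()
    for phrase in phrases:
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == tuple(phrase):
                star_at.add(i + n - 1)      # index of the phrase's last word
    for i in star_at:
        marked[i] += "*"
    return " ".join(marked)

words = "endve la binteua tunti nis te pi et in tunis".split()
dictionary = {"la", "te", "et", "in"}       # the words marked valid in the report
phrases = [("et", "in")]                    # the phrase starred in the report

print(annotate(words, dictionary, phrases))
```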