Are the Glyphs placed in specific folio locations?

June 6, 2016 12 comments

Based on a lot of circumstantial evidence related to the weirdness of the Voynich text (such as the odd repeating words, the varying faintness and boldness of some glyphs, and the sometimes curious positioning of words and lines), it appears that the folios were perhaps not written Left to Right (or Right to Left) and Top to Bottom.

Instead, suppose the scribe started each folio with a prescription: for example “put an h-Gallows at the top left, then put a c in the middle of the folio, then a 9 at the end of the last line”, and so on. This would be sort of like filling out the answers to a bizarre crossword puzzle.

If there was such a prescription, might it explain some of the Voynich text features?

In the following selected charts I’m showing a virtual folio from the Recipes section. Each chart has lines and columns. Line 1 position 1 is the top left of the folio. Let’s look at the chart folio for Glyph “o”:

Recipes_o

Each disc indicates that the “o” appears at least twice in that location in the Recipes. The size of the disc indicates how many times it appears there: the bigger the disc, the more times it appeared. The random appearance of the chart suggests that “o” is not placed on the page in any particular pattern.
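A minimal sketch of how such a position chart could be tallied. It assumes each folio is available as a list of text lines with one character per glyph, and that a "column" is simply the character index within the line; both are assumptions, as are the function names and the toy data.

```python
from collections import Counter

def glyph_position_counts(folios, glyph):
    """Count how often `glyph` occurs at each (line, column) position.

    `folios` is a list of folios; each folio is a list of strings (lines).
    Positions are 1-based, so (1, 1) is the top left of the virtual folio.
    """
    counts = Counter()
    for folio in folios:
        for line_no, line in enumerate(folio, start=1):
            for col_no, ch in enumerate(line, start=1):
                if ch == glyph:
                    counts[(line_no, col_no)] += 1
    return counts

def disc_positions(counts, minimum=2):
    """Keep only positions where the glyph appears at least `minimum` times."""
    return {pos: n for pos, n in counts.items() if n >= minimum}

# Toy data invented for illustration (not real transcription text).
toy_folios = [
    ["so89", "sam"],
    ["so9", "oam"],
]
counts = glyph_position_counts(toy_folios, "s")
print(disc_positions(counts))
```

The value stored for each position would set the disc size when plotting.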

Let’s now look at the “s” glyph:

Recipes_s

Here it is clear that this glyph vastly prefers the first column, but not the first line. It is infrequently found elsewhere on the folio. In contrast, take a look at the rare glyphs (I just call them “?”):

Recipes_?

These abhor the early columns, and love the ends of the lines. They also seem to prefer the ends of the first lines (notice a little cluster there). Perhaps they hate the “s” glyphs…

The “4” glyph:

Recipes_4

The gap after the first column is explained by how “4” only appears at the start of a word.

Here are some more glyphs:

Recipes_y

Recipes_1 Recipes_2 Recipes_8 Recipes_9

No conclusions here, as usual!

Addendum: the distribution for “c”:

Recipes_c

 

 

Entropy of the Voynich text

May 26, 2015 23 comments

The Shannon Entropy of a string of text measures the information content of the text. For text that is completely random, i.e. where the appearance of any character is as likely as the appearance of any other, the entropy (or “disorder”) is high. For a text which is a long string of identical characters, for example, the entropy is low (in fact zero).

Mathematically, the Shannon Entropy is defined as:

$$\mathrm{Entropy} = -\sum_{i=1}^{N} \mathrm{prob}_i \, \log(\mathrm{prob}_i)$$

where prob_i is the relative frequency of the i’th distinct character in the text, and the sum is over all N distinct characters.
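As a rough illustration, the formula can be computed in a few lines of Python. This sketch assumes a base-2 logarithm (entropy in bits per character); the post does not state which base was used for the table, so that choice and the function name are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy of `text`, using log base 2 (assumed)."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    # -sum(prob_i * log2(prob_i)) over the distinct characters
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("aaaaaaaa"))   # a string of identical characters
print(shannon_entropy("abcdabcd"))   # four equally likely characters
```

A string of identical characters gives zero, and a text in which N characters are equally likely gives log2(N).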


If the Voynich text is randomly created (by whatever means), we’d expect it to have high entropy (i.e. be very disordered). What we in fact find is that the text is ordered, with low Entropy, and is rather more ordered than English, for example. The result of comparing the Voynich text with several other texts in different languages is shown in the table below.

Language        Source                  Entropy
Voynich         GC’s Transcription      3.73
French          Text from 1367          3.97
Latin           Cantus Planus           4.05
Spanish         Medina 1543             4.09
German          Kochbuch 1553           4.15
English         Thomas Hardy            4.21
Early Italian   Divine Comedy 1300      4.23
None            Random characters       6.01

The last entry in the table shows the Entropy for a random text, which at 6.01 is about 1.6 times the Entropy of the Voynich text (3.73).

Common Words in Language A that are Rare in Language B

March 15, 2013 40 comments

The question was posed: which words are common in Language A but rare in Language B? And vice versa.

For this study I used the Herbal/Balneo folios that are Language A and B respectively (folios 1-25 and 75-84).

There are around 2900 unique words in total, with around 1600 being used in Language A, and 1630 in Language B.

Here are the results. The tables show the words in order of decreasing value of the frequency in A (B) divided by the frequency in B (A), and show the number of occurrences of each word in both Languages.
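The ordering described above could be sketched as follows. The add-one smoothing is an assumption made here to avoid dividing by zero when a word is absent from one Language; the post does not say how such words were handled. Raw counts are used in place of frequencies, which is reasonable only when the two word totals are similar.

```python
from collections import Counter

def rank_by_ratio(words_a, words_b, smoothing=1.0):
    """Rank words by (count in A) / (count in B), highest ratio first.

    `smoothing` is added to both counts (an assumption, see the note above).
    Returns rows of (word, count_in_A, count_in_B, ratio).
    """
    count_a = Counter(words_a)
    count_b = Counter(words_b)
    rows = []
    for word in set(count_a) | set(count_b):
        ratio = (count_a[word] + smoothing) / (count_b[word] + smoothing)
        rows.append((word, count_a[word], count_b[word], ratio))
    # Sort by decreasing ratio; ties are broken alphabetically.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows

# Toy word lists invented for illustration.
toy_a = ["8am"] * 4 + ["am"]
toy_b = ["am"] * 3 + ["8am"]
for row in rank_by_ratio(toy_a, toy_b):
    print(row)
```

Swapping the two arguments gives the "common in B, rare in A" ordering.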

Common in A, rare in B

 

Common in B, rare in A

Conclusion? I have no idea … for now.


Language A and B Again

March 13, 2013 12 comments

A tentative conclusion from comparing Language A and Language B is that the non-gallows glyphs are used in the same way in both Languages.

That is to say, they appear to mean the same thing. So the “o” in A means the same as the “o” in B.
There is some persistent “mixing” between the e/y glyphs, which is illustrated by the example result below:
ABMixing
There is also some doubt about the “8” glyph, which sometimes seems to mix with the gallows glyphs (e.g. in some cases, the “8” appears in A to function in the same way as a gallows glyph in B and vice versa). This may simply be an error in the comparison method, or it may be that the “8” is a null, or it may be due to some other effect.
The gallows glyphs are different – they don’t appear to mean the same in A and B. I’m focussing on those glyphs now.

Language “A” and “B” Conversions

March 5, 2013 12 comments

This is an update to my previous two posts on this topic.

I have been concentrating on searching for the correspondence between glyphs used in Language A, and glyphs used in Language B. As a reminder, the method is to take all words in, say, Language A, and “convert” them to words in Language B by changing the glyphs according to a candidate mapping table. The frequency of the converted Language B words is then compared with the original Language A words: the closer the frequencies, the better the mapping match.
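A minimal sketch of the convert-and-compare step. The scoring function here (sum of squared differences between relative word frequencies) is an assumption; the post does not give the exact measure used. All names and the toy data are hypothetical.

```python
from collections import Counter

def convert_word(word, mapping):
    """Replace each glyph using `mapping`; unmapped glyphs are left unchanged."""
    return "".join(mapping.get(glyph, glyph) for glyph in word)

def relative_frequencies(words):
    """Map each distinct word to its share of the (non-empty) word list."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def mismatch(source_words, target_words, mapping):
    """Lower is better: 0.0 means the converted words match the target exactly."""
    converted = [convert_word(w, mapping) for w in source_words]
    freq_c = relative_frequencies(converted)
    freq_t = relative_frequencies(target_words)
    vocab = set(freq_c) | set(freq_t)
    return sum((freq_c.get(w, 0.0) - freq_t.get(w, 0.0)) ** 2 for w in vocab)

# Toy example: the two word lists differ only by an e/y swap.
source = ["oe", "oe", "ay"]
target = ["oy", "oy", "ae"]
print(mismatch(source, target, {}))                      # identity mapping
print(mismatch(source, target, {"e": "y", "y": "e"}))    # candidate mapping
```

A search procedure would then look for the mapping that minimises this score.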

Method Check using only Language A words

As a check of the method, I took the Herbal folios 1-25 (all in Language A) and split them into two groups: 1-12 and 13-25, and I then artificially labelled the latter group as Language B. Then I ran the matching procedure, which produced the following result:

Epoch 62 Best chromosome 0 Value= 5.62272615159e-05
Chromosome ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'k', 'y', 'h', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']
ngramsA    ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'h', 'y', 'k', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']

This is good and reassuring, since it shows that the words in folios 13-25 have essentially the same frequency distribution when their glyphs are mapped to the same glyphs in folios 1-12. The only difference from the identity mapping in the result above is a swap of “k” and “h”.

Removal of Glyph Variants in Voyn_101

As the tests progressed, it became clear that some of the glyphs GC defined in Voyn_101 were in fact variants of more common glyphs. The most obvious were the “m”, “n”, “M” glyphs mentioned before – with these included, the conversions between Language B and Language A were of much poorer quality than if they were expanded to “iiN”, “iN” and “iiiN” respectively. After some time weeding out these variants, the following table was arrived at:

seek =  ["3", "5", "+", "%", "#", "6", "7", "A", "X", 
         "I", "C", "z", "Z", "j", "u", "d", "U", "P", 
         "Y", "$", "S", "t", "q",
         "m", "M", "n", "Y", "!", ")", "*", "b", "J", "E", "x", "B", "D", "T", "Q", "W", "w", "V", "(", "&"]
repl =  ["2", "2", "2", "2", "2", "8", "8", "a", "y", 
         "ii", "cc", "iy", "iiy", "g", "f", "ccc", "F", "ip",
         "y", "s", "cs", "s", "iip",
         "iiN", "iiiN", "iN", "y", "2", "9", "p", "y", "G", "c", "y", "cccN", "ccN", "s", "p", "h", "h", "K", "9", "8"]

I am very confident that the glyphs remaining after using the above conversion table are the base set.  The base set of glyphs is thus:

Language A frequency order: 'o', 'c', '9', '1', 'a', '8', 'e', 'i', 'h', 'y', 'k', 's', '2', 'N', '4', 'g', 'p', '?', 'K', 'H', 'f', 'G', 'F', 'L', 'l', 'v', 'r', 'R'
Language B frequency order: 'c', 'o', '9', 'a', '8', 'e', '1', 'h', 'i', 'y', 'k', '2', 'N', 's', '4', 'g', 'p', 'f', '?', 'H', 'K', 'G', 'F', 'l', 'L', 'R', 'r', 'v'

where “?” represents all very rare glyphs (such as the “picnic table” glyph). There are thus 27 glyphs (15 gallows and 12 regular) excluding the rare special glyphs like the picnic table.
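One way the seek/repl conversion table above could be applied is with Python's built-in str.translate, sketched here on a short excerpt of the table. A single translate pass does not re-substitute its own output; in the table as printed, no replacement string appears to contain a glyph from seek, so this should be equivalent to applying the replacements one after another. The helper name simplify is hypothetical.

```python
# Excerpt of the seek/repl table; the full lists would be used in practice.
seek = ["m", "M", "n", "C", "S", "&"]
repl = ["iiN", "iiiN", "iN", "cc", "cs", "8"]

# str.maketrans accepts a dict from single characters to replacement strings.
table = str.maketrans(dict(zip(seek, repl)))

def simplify(word):
    """Expand variant glyphs in a Voyn_101 word into base-set glyphs."""
    return word.translate(table)

print(simplify("8am"))   # -> 8aiiN
```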

Glyph Mixing Between A and B

I ran many trials using the base set of glyphs, comparing various sections of the VMs written in the different hands. In particular, the following folio collections were defined:

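# Folio numbers for each collection. In Python 3, range objects cannot be
# joined with "+", so each range is converted to a list first.
Special = {'HerbalRecipeAB': list(range(107, 117)) + list(range(1, 26)),
           'HerbalAB': list(range(1, 57)),
           'HerbalBalneoAB': list(range(1, 26)) + list(range(75, 85)),
           'HerbalAstroAB': list(range(1, 13)) + list(range(67, 75)),
           'PharmaRecipeAB': [88, 89, 99, 100, 101, 102] + list(range(103, 117)),
           'AllAB': list(range(1, 117))
           }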

The collection I used the most was the one called “HerbalBalneoAB”, which contains Herbal folios written in Language A, and Balneo folios written in Language B. The nice feature of this collection is that the number of words is around the same for both Languages, which makes comparing counts very easy:

Total words =  2846  Total Language A =  1581  Total Language B =  1584

As an example, here is a trial result for HerbalBalneoAB:

Language B ['o', '9', '1', 'a', 'i', 'f', 'c', 'y', 'h', 'e', 'K', 'N', '2', 's', '4', 'g', 'p', '8', 'k', 'H']
Language A ['o', '9', '1', 'a', 'i', '8', 'c', 'e', 'h', 'y', 'k', 'N', '2', 's', '4', 'g', 'p', 'K', '?', 'H']

In all the tests I ran, there were some common features in the results:

  • Mixing between “e” and “y” – when writing Language A, the use of “e” appears to be equivalent to the use of “y” in Language B, and vice versa
  • Mixing between 8,f,F,k,K,g,G,r,R,? and so on – the Gallows glyphs swap amongst themselves, and with “8”

Just about all trials showed the “e”/“y” mixing. Tony Gaffney pointed out that these two glyphs are quite similar in stroke construction. The appearance of “8” amongst the swapping Gallows glyphs is curious.

Single glyphs in Language A and Language B

March 2, 2013 4 comments
As a sanity check, I looked at single glyphs (rather than nGrams of length greater than 1), searching for the mapping that takes all the Language B glyphs and maps them to Language A glyphs, so that the Language B words converted with the mapping most closely match the frequency of Language A words. I found the following:
Chromosome  ['o', '9', '1', 'a', 'H', 'c', 'e', 'h', 'y', 'k', '2', 's', 'm', '4', 'i', '(', '8', 'p', 'g', 'n']
ngramsA     ['o', '9', '1', 'a', '8', 'c', 'e', 'h', 'y', 'k', '2', 's', 'm', '4', 'g', 'i', 'K', 'p', '?', 'n']

This shows that most Language B glyphs map to the same glyph in Language A. However, there is some mixing going on here between “H”, “8”, “i”, “g”, “(”, “K” and “?”.

It occurred to me that this may be due to GC’s choice of ascribing single glyphs where there should perhaps be several. In particular, he has:
“m” which looks like “iiN”
“n” which looks like “iN”
“M” which looks like “iiiN”
(I think EVA does a better job of recognizing these.) So I adjusted the GC transcription accordingly, replacing n,m,M with the i,N combinations above.
This resulted in a new mapping for B to A:
Chromosome  ['o', '9', '1', 'a', 'i', 'g', 'c', 'y', 'k', 'e', 'h', 'N', '2', 's', '4', '(', '8', 'p', 'f', 'H']
ngramsA     ['o', '9', '1', 'a', 'i', '8', 'c', 'e', 'h', 'y', 'k', 'N', '2', 's', '4', 'g', 'K', 'p', '?', 'H']
(There may be better mappings, but this is the best so far.) This has some interesting features:
  • e and y swap between languages
  • h and k gallows swap between languages
  • some mixing of g,8,(,K,f,? – some of these are relatively rare, so the statistics are poor, which may explain the mixing.
 Note that the simplification table I’m using for Voyn_101 is currently:
    seek = ["3",   "5",    "+",  "%",   "#", "6", "7",    "A", "X",  
            "I",   "C",    "z",  "Z",   "j", "u", "d",    "U", "P", 
            "Y",   "$",    "S",  "t",   "q",
            "m",   "M",    "n",  "Y",   "!"]
    repl = ["2",   "2",    "2",  "2",   "2", "8",  "8",   "a", "y",  
            "ii",  "cc",   "iy", "iiy", "g", "f",  "ccc", "F", "ip",
            "y",   "s",    "cs", "s",   "iip",
            "iiN", "iiiN", "iN",  "y",   "2"]
(Thanks to Tony Gaffney for spotting an error in the conversion for C in a previous version.)

The Relationship Between Currier Languages “A” and “B”

March 1, 2013 24 comments

Captain Prescott Currier, a cryptographer, looked at the Voynich many moons ago, and made some very perceptive comments about it, which can be seen here on Rene Zandbergen’s site.

In particular, he noticed that the handwriting was different between some folios and others, and he also noticed (based on glyph/character counts) that there were two “languages” being used.

When I first looked at the manuscript, I was principally considering the initial (roughly) fifty folios, constituting the herbal section. The first twenty-five folios in the herbal section are obviously in one hand and one “language,” which I called “A.” (It could have been called anything at all; it was just the first one I came to.) The second twenty-five or so folios are in two hands, very obviously the work of at least two different men. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five or thirty folios) is in two “languages,” and each “language” is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own “language.” Now, I’m stretching a point a bit, I’m aware; my use of the word language is convenient, but it does not have the same connotations as it would have in normal use. Still, it is a convenient word, and I see no reason not to continue using it.

We can look at some statistics to see what he was referring to. Let’s compare the most common words in Folios 1 to 25 (in the Herbal section, Language A, written in Hand 1) and in Folios 107 to 116 (in the Recipes section, Language B, written in a different Hand):

Comparison between word frequencies in Languages A and B

So, for example, in Language A the most common word is “8am” and it occurs 192 times in the folios, whereas in Language B the most common word is “am”, occurring 137 times.

We might expect that these are the same word, enciphered differently. The question then is, how does one convert between words in Language A and words in Language B, and vice versa? In the case of the “8am” to “am” it’s just a question of dropping the “8”, as if “8” is a null character in Language A. In the case of the next most popular words, “1oe” (A) and “1c89” (B), it looks like “oe” (A) converts to “c89” (B). And so on.

If we look at the most popular nGrams (substrings) in both Languages, perhaps there is a mapping that translates between the two. Perhaps the cipher machinery that was used to generate the text had different settings, that produced Language A in one configuration, and Language B in another. Perhaps, if we look at the nGram correspondence that results in the best match between the two Languages, a clue will be revealed as to how that machinery worked.

This involves some software (I’m using Python now, which is fun). The software first calculates the word frequencies for Language A and B in a set of folios (the table above is an output from this stage). It then calculates the nGram frequencies for each Language. Here are the top 10:

nGramFrequencies
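The nGram counting step might look something like the sketch below. It assumes an nGram is any contiguous run of glyphs inside a word, counted once per occurrence across all word tokens, up to an arbitrary maximum length; the actual limits and weighting used for the table are not stated, so these are assumptions.

```python
from collections import Counter

def ngram_counts(words, max_n=3):
    """Count all substrings of length 1..max_n across a list of words."""
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for start in range(len(word) - n + 1):
                counts[word[start:start + n]] += 1
    return counts

# Toy word list for illustration only.
toy_words = ["8am", "am", "1oe"]
counts = ngram_counts(toy_words, max_n=2)
print(counts.most_common(10))
```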

The software then runs a Genetic Algorithm to find the best mapping between the two sets of nGrams, so that when the mapping is applied to all words in Language B, it produces a set of words in Language A the frequencies of which most closely match the frequencies of words observed in Language A (i.e.  the frequencies shown in the first table above).
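To give a flavour of the search, here is a deliberately simplified sketch restricted to single glyphs. It is a mutation-only evolutionary search with elitist selection (no crossover), so it is a stand-in for a full Genetic Algorithm rather than the software described in this post. The fitness measure, the population settings, the seeding of the population with the identity mapping and all names are assumptions.

```python
import random
from collections import Counter

def relative_frequencies(words):
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def fitness(chromosome, glyphs, words_b, freq_a):
    """Lower is better. chromosome[i] (a B glyph) maps to glyphs[i] (an A glyph)."""
    mapping = dict(zip(chromosome, glyphs))
    converted = Counter("".join(mapping.get(g, g) for g in w) for w in words_b)
    total = sum(converted.values())
    vocab = set(converted) | set(freq_a)
    return sum((converted[w] / total - freq_a.get(w, 0.0)) ** 2 for w in vocab)

def evolve(glyphs, words_a, words_b, pop_size=30, epochs=200, seed=0):
    rng = random.Random(seed)
    freq_a = relative_frequencies(words_a)
    # Start from the identity mapping plus random permutations of the glyphs.
    population = [list(glyphs)]
    population += [rng.sample(glyphs, len(glyphs)) for _ in range(pop_size - 1)]
    for _ in range(epochs):
        population.sort(key=lambda c: fitness(c, glyphs, words_b, freq_a))
        survivors = population[: pop_size // 2]      # elitism: keep the best half
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.sample(range(len(child)), 2)  # swap mutation
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda c: fitness(c, glyphs, words_b, freq_a))
    return best, fitness(best, glyphs, words_b, freq_a)

# Toy data: Language B differs from Language A only by an e/y swap.
glyphs = ["o", "e", "y", "a"]
words_a = ["oe", "oe", "ay"]
words_b = ["oy", "oy", "ae"]
best, score = evolve(glyphs, words_a, words_b)
print(best, score)
```

Because the best half of the population always survives, the returned score can never be worse than that of the identity mapping the search starts with.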

Here is an initial result. With the following mapping, you can take most common words in Language B, and convert them to Language A.

Table for converting between a Language B word and a Language A word

A couple of remarks. This is an early result and probably not the best match. There are some interesting correspondences:

  • “9” and “c” are immutable, and have the same function in both Languages
  • “4o” in Language B maps to “o” in Language A, and vice versa!
  • “ha” in Language B maps to “h” in Language A, as if “a” is a null

In the Comments, Dave suggested looking at word pair frequencies between the Languages. Here is a table of the most common pairs in each Language.

Common word pairs in Languages A and B
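Counting word pairs is straightforward; a minimal sketch follows. It assumes the text is available as lines of whitespace-separated words and that pairs are not counted across line breaks. Both are assumptions about the transcription, and the function name is hypothetical.

```python
from collections import Counter

def word_pair_counts(lines):
    """Count adjacent word pairs within each line of text."""
    counts = Counter()
    for line in lines:
        words = line.split()
        counts.update(zip(words, words[1:]))
    return counts

# Toy lines invented for illustration.
toy_lines = ["8am 1oe 8am 1oe", "1oe 8am"]
pairs = word_pair_counts(toy_lines)
print(pairs.most_common(5))
```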

For clarity, I am using what I call the “HerbalRecipeAB” folios for this study, i.e.

Using folios for HerbalRecipeAB : [107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]

More results coming …