Archive
Language A and B Again
A tentative conclusion from comparing Language A and Language B is that the non-gallows glyphs are used in the same way in both Languages.
Language “A” and “B” Conversions
This is an update to my previous two posts on this topic.
I have been concentrating on searching for the correspondence between glyphs used in Language A, and glyphs used in Language B. As a reminder, the method is to take all words in, say, Language A, and “convert” them to words in Language B by changing the glyphs according to a candidate mapping table. The frequency of the converted Language B words is then compared with the original Language A words: the closer the frequencies, the better the mapping match.
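The comparison step can be sketched in a few lines of Python 3. This is only an illustration: the helper names (`convert`, `relative_freqs`, `mismatch`) are hypothetical, and the distance measure (a sum of squared differences between relative word frequencies) is an assumption, since the exact measure used in the trials is not stated here.

```python
from collections import Counter

def convert(word, mapping):
    """Convert one word glyph by glyph; glyphs without an entry pass through unchanged."""
    return "".join(mapping.get(glyph, glyph) for glyph in word)

def relative_freqs(words):
    """Relative frequency of each distinct word in an iterable of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def mismatch(words_a, words_b, mapping):
    """Assumed distance: sum of squared differences between the relative
    frequencies of the converted A words and the observed B words.
    Lower values mean a better candidate mapping."""
    freq_a = relative_freqs(convert(w, mapping) for w in words_a)
    freq_b = relative_freqs(words_b)
    return sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2
               for w in set(freq_a) | set(freq_b))

# Toy data only: a mapping that swaps "e" for "y" makes the two lists agree.
toy_a = ["8ae", "8ae", "1oe"]
toy_b = ["8ay", "8ay", "1oy"]
print(mismatch(toy_a, toy_b, {"e": "y"}))  # perfect agreement
print(mismatch(toy_a, toy_b, {}))          # identity mapping, worse
```

A search procedure such as the Genetic Algorithm then only has to propose mappings and keep those with the smallest mismatch.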
Method Check using only Language A words
As a check of the method, I took the Herbal folios 1-25 (all in Language A) and split them into two groups: 1-12 and 13-25, and I then artificially labelled the latter group as Language B. Then I ran the matching procedure, which produced the following result:
Epoch 62 Best chromosome 0 Value= 5.62272615159e-05
Chromosome ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'k', 'y', 'h', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']
ngramsA ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'h', 'y', 'k', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']
This is good and reassuring, since it shows that the words in folios 13-25 have essentially the same frequency distribution when their glyphs are mapped to the same glyphs in folios 1-12.
Removal of Glyph Variants in Voyn_101
As the tests progressed, it became clear that some of the glyphs GC defined in Voyn_101 were in fact variants of more common glyphs. The most obvious were the “m”, “n” and “M” glyphs mentioned before – with these included, the conversions between Language B and Language A were of much poorer quality than if they were expanded to “iiN”, “iN” and “iiiN” respectively. After some time weeding out these variants, the following table was arrived at:
seek = ["3", "5", "+", "%", "#", "6", "7", "A", "X",
"I", "C", "z", "Z", "j", "u", "d", "U", "P",
"Y", "$", "S", "t", "q",
"m", "M", "n", "Y", "!", ")", "*", "b", "J", "E", "x", "B", "D", "T", "Q", "W", "w", "V", "(", "&"]
repl = ["2", "2", "2", "2", "2", "8", "8", "a", "y",
"ii", "cc", "iy", "iiy", "g", "f", "ccc", "F", "ip",
"y", "s", "cs", "s", "iip",
"iiN", "iiiN", "iN", "y", "2", "9", "p", "y", "G", "c", "y", "cccN", "ccN", "s", "p", "h", "h", "K", "9", "8"]
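As a rough illustration of how such a table can be applied (Python 3), the sketch below expands variant glyphs one character at a time, so that the output of one replacement is never replaced again. Only a short excerpt of the table is repeated, and the names `VARIANTS` and `expand_variants` are hypothetical.

```python
# Excerpt of the seek/repl table above, paired position by position.
seek_excerpt = ["m", "M", "n", "I", "C", "6", "7"]
repl_excerpt = ["iiN", "iiiN", "iN", "ii", "cc", "8", "8"]

VARIANTS = dict(zip(seek_excerpt, repl_excerpt))

def expand_variants(word):
    """Replace each variant glyph by its base-glyph expansion.
    Glyphs not in the table are already base glyphs and are kept as they are."""
    return "".join(VARIANTS.get(glyph, glyph) for glyph in word)

print(expand_variants("8am"))   # "m" is expanded, "8" and "a" are untouched
print(expand_variants("okC9"))  # "C" is expanded to "cc"
```

With the full table, the same function would reduce every transcribed word to the base glyph set before any frequencies are counted.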
I am very confident that the glyphs remaining after using the above conversion table are the base set. The base set of glyphs is thus:
Language A frequency order: 'o', 'c', '9', '1', 'a', '8', 'e', 'i', 'h', 'y', 'k', 's', '2', 'N', '4', 'g', 'p', '?', 'K', 'H', 'f', 'G', 'F', 'L', 'l', 'v', 'r', 'R'
Language B frequency order: 'c', 'o', '9', 'a', '8', 'e', '1', 'h', 'i', 'y', 'k', '2', 'N', 's', '4', 'g', 'p', 'f', '?', 'H', 'K', 'G', 'F', 'l', 'L', 'R', 'r', 'v'
where “?” represents all very rare glyphs (such as the “picnic table” glyph). There are thus 27 glyphs (15 gallows and 12 regular) excluding the rare special glyphs like the picnic table.
Glyph Mixing Between A and B
I ran many trials using the base set of glyphs, comparing various sections of the VMs written in the different hands. In particular, the following folio collections were defined:
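# list(...) around each range so that the concatenations also work on Python 3
Special = {'HerbalRecipeAB': list(range(107,117)) + list(range(1,26)),
'HerbalAB': list(range(1,57)),
'HerbalBalneoAB': list(range(1,26)) + list(range(75,85)),
'HerbalAstroAB': list(range(1,13)) + list(range(67,75)),
'PharmaRecipeAB': [88,89,99,100,101,102] + list(range(103,117)),
'AllAB': list(range(1,117))
}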
The collection I used the most was the one called “HerbalBalneoAB”, which contains Herbal folios written in Language A, and Balneo folios written in Language B. The nice feature of this collection is that the number of words is around the same for both Languages, which makes comparing counts very easy:
Total words = 2846
Total Language A = 1581
Total Language B = 1584
As an example, here is a trial result for HerbalBalneoAB:
Language B ['o', '9', '1', 'a', 'i', 'f', 'c', 'y', 'h', 'e', 'K', 'N', '2', 's', '4', 'g', 'p', '8', 'k', 'H']
Language A ['o', '9', '1', 'a', 'i', '8', 'c', 'e', 'h', 'y', 'k', 'N', '2', 's', '4', 'g', 'p', 'K', '?', 'H']
In all the tests I ran, there were some common features in the results:
- Mixing between “e” and “y” – when writing Language A, the use of “e” appears to be equivalent to the use of “y” in Language B, and vice versa
- Mixing between 8,f,F,k,K,g,G,r,R,? and so on – the Gallows glyphs swap amongst themselves, and with “8”
Just about all trials showed the “e”/“y” mixing. Tony Gaffney pointed out that these two glyphs are quite similar in stroke construction. The appearance of “8” amongst the swapping Gallows glyphs is curious.
The Relationship Between Currier Languages “A” and “B”
Captain Prescott Currier, a cryptographer, looked at the Voynich many moons ago, and made some very perceptive comments about it, which can be seen here on Rene Zandbergen’s site.
In particular, he noticed that the handwriting was different between some folios and others, and he also noticed (based on glyph/character counts) that there were two “languages” being used.
When I first looked at the manuscript, I was principally considering the initial (roughly) fifty folios, constituting the herbal section. The first twenty-five folios in the herbal section are obviously in one hand and one ‘‘language,’’ which I called ‘‘A.’’ (It could have been called anything at all; it was just the first one I came to.) The second twenty-five or so folios are in two hands, very obviously the work of at least two different men. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five or thirty folios) is in two ‘‘languages,’’ and each ‘‘language’’ is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own ‘‘language.’’ Now, I’m stretching a point a bit, I’m aware; my use of the word language is convenient, but it does not have the same connotations as it would have in normal use. Still, it is a convenient word, and I see no reason not to continue using it.
We can look at some statistics to see what he was referring to. Let’s compare the most common words in Folios 1 to 25 (in the Herbal section, Language A, written in Hand 1) and in Folios 107 to 116 (in the Recipes section, Language B, written in a different Hand):
So, for example, in Language A the most common word is “8am” and it occurs 192 times in the folios, whereas in Language B the most common word is “am”, occurring 137 times.
We might expect that these are the same word, enciphered differently. The question then is, how does one convert between words in Language A and words in Language B, and vice versa? In the case of “8am” to “am” it’s just a question of dropping the “8”, as if “8” is a null character in Language A. In the case of the next most popular words, “1oe” (A) and “1c89” (B), it looks like “oe” (A) converts to “c89” (B). And so on.
If we look at the most popular nGrams (substrings) in both Languages, perhaps there is a mapping that translates between the two. Perhaps the cipher machinery that was used to generate the text had different settings, that produced Language A in one configuration, and Language B in another. Perhaps, if we look at the nGram correspondence that results in the best match between the two Languages, a clue will be revealed as to how that machinery worked.
This involves some software (I’m using Python now, which is fun). The software first calculates the word frequencies for Language A and B in a set of folios (the table above is an output from this stage). It then calculates the nGram frequencies for each Language. Here are the top 10:
The software then runs a Genetic Algorithm to find the best mapping between the two sets of nGrams, so that when the mapping is applied to all words in Language B, it produces a set of words in Language A the frequencies of which most closely match the frequencies of words observed in Language A (i.e. the frequencies shown in the first table above).
Here is an initial result. With the following mapping, you can take most common words in Language B, and convert them to Language A.
A couple of remarks. This is an early result and probably not the best match. There are some interesting correspondences:
- “9” and “c” are immutable, and have the same function
- Another interesting feature is that “4o” in Language B maps to “o” in Language A, and vice versa!
- in Language B, “ha” maps to “h” in Language A, as if “a” is a null
In the Comments, Dave suggested looking at word pair frequencies between the Languages. Here is a table of the most common pairs in each Language.
For clarity, I am using what I call the “HerbalRecipeAB” folios for this study, i.e.
Using folios for HerbalRecipeAB : [107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
More results coming …
f75r cures, pregnancy, life and death – Latin Plainchant
Here is a result obtained using a Genetic Algorithm to match the text on f75r to Latin. The training corpus I used was a large file of Latin plainchant (the idea being that repeated “words” in the VMs show similarities to chant).
First, here is the folio with the translated words overlain in red:
The genetic algorithm searched for a set of glyphs that each matched to a pair of Latin letters.
Most of the decrypted words are valid Latin and match words in the plaintext I used to train the GA. Some are Latin but do not appear in the plaintext. The other, invalid, words could be caused by errors in the pair matching.
Or the whole thing could well be nonsense! This is likely – I asked Joel Stevens to translate some of the Latin, and here is what he said:
On first inspection, it seems to be random non-sense. For example:
recita lugete vena dans veta ia debent lustrata lite
Would mean: Recite! Mourn! Blood-vessel giving. Forbid! Oh, they owe things that were purified by the lawsuit.
I’m not really sure how to make sense of it. I don’t see anything that stands out as an obvious sentence. Maybe some words are filler and need to be dropped… or maybe there is a hidden order that needs to be found (assuming these are the correct words).
Here is the Latin:
piraextita recita' lugete' idpirata lucrte vena' dans' veta' ia' debent' lustrata' vanagete vamirata lite' lugens' esnt levata' nuta' gens' veanta rochum' nogete le dato' uascie excita' curi gent na veta' le veta' luedicta veexti vata' arta' te' chum no' no' amicta' luedet luga' mori' edente' noga date' reri' lugens' feta' luedicta luga' morata' luaena uechum vana' lugete' ad' nt vana' luga' pate vata' lugertta audita' lugens' vita' curata' resona' lupina' feta' lumina' lugeum veon lugete' no' vana' lant aule lucrte ista' veta' lugens' vita' na ruri' mena' strata' luedicta lugens' nota' luedicta na lugent' reti' vena' date' vageas dedita' lugent' nota' veta' te' no' iret' veta' na no' vena' luga' morata' edicta' lumina' lumina' lumina' lumina' lugens' novena' lace' si' educta' lugens' na novena' lugens' vita' lugens' ruti' sita' lans' lugens' luuacinota lacium vana' pant' le vena' reista luga' aurata' lugete' verata nt ha' urgens' ad' revena lugens' vana' morati' vemota curatita dant' bunt' id' ncnt vena' lugeas pant' vesata lunt vena' lunt ruchum este' late' nota' luaerata lunt veta' luga' veta' ulta' lugens' ti iu' veta' lute vata' stri sschum veta' lumicium clrechum vata' poma' le sebete no' te' ut' sati' veta' lugens' vana' le acta' gete vana' tute' mori' aeri' luedet lugent' deri lunt vana' lugeas lugens' no' edet' luuatita lans' vato te' no' lans' orta' luedicta id' na luedicta strata' tuas' dans' no' veno dans' no' luia' date' muta' gnsiurte duno' luca' alti' vena' resina' date' ruti' rurata na sunt' errata' morata' luedet pant' no' dace' veto' lunt nt amicta' luedicta strata' dans' fisi' na uachns no' recina lunt dans' novata' luedicta vana' lucrum' lu' dans' irquta lumita lugens' vaga' lugent' dans' vana' resino vena' reti' nodo' aurata' lumita vita' lugete' vena' vata' lumini' lumina' dant' na locuta' dant' vena' lugens' date' pena' vena' lugete' vena' usta' luedet nt vena' lunt vana' lugens' ferata rorata' dalias pr' nomina' resi date' ruta reti' ruti' 
no' gens' nomina' lugens' ambita' lugens' date' date' vena' lugens' nona' pant' mirata luedicta luncta' reti' ruti' date' date' na lumina' na ambita' lumina' luedicta lumiista nt lumirata luedicta lumina' lumina' lumirata usta' serata' luedicta lumirata noedicta ruri' ncusta date' vena' lugens' vena' dant' edicta' te' vena' paedicta luedicta rucina luga' na edicta' ma aurata' lumina' luedicta luiget luigicta date' serata' vagete nona' lugens' fiti ruti' nona' sprata lans' rerata lumina' rurata renona ruti' ruut nochns ha' date' na derata orta' lumita rerata lalint ruta rurata lumena rurata rechns lumita lumina' veno lumina' dans' vena' eg plnt vana' noti'
Landini’s Challenge
An excerpt from Landini’s challenge text (text he generated using an undisclosed method, supposed to replicate the features of the VMs text):
qopchdy chckhy daiin ¬ ½shxam chor otechar okcharain ryly sheodykeyl
sheodykeyl daiin shd okaiin qokain qokal yteoldy otedy qokydy opchedy
otal oldar chor lkeedol eer ol dair chedy daiin ockhdar cpheol chedy
xar qokaiin y chedy kshdy ololdy aiin char y okeey oldar qokaiin lsho
daiin olsheam qoeey chedy dchos pshedaiin shedy d qol key sheol or
cpheeedol qokedy qokaiin daiin cthosy chedy ar aiir chedy teeol aiin
cheey y cheam oky qokaiin daldaiin loiii¯ ar shtchy chedy aldaiin
ydchedy daiin shd okaiin qokain daiin qotcho chedy daiin lchy olorol
otedy qockhor shol daiin paichy chedy ar shdair chedal chedy kchdaldy
chckhy otakar qokedy s qooko chor daiin otcholchy chedy daiin koroiin
qokain qokedy kosholdy ol kchedy kshdy qokaiin ar shaikhy olaldy seees
ar oteodar chedy oteeol shedy daiin key dain daiin keeokechy chedy
lchey ail lchedy sches ol dsheeo otol odaiin qokain daiin sheeod chshy
chedy qoekedy tair sain qocheey aiin cheey chaiin ols shedy sheolol
daiin lcheol chedy daiin pchoraiin oshaiin chedy lchey lor sal aiin
cheey y dsheom shedy todydy cheor saiin shdaldy daiin ofchtar daiin
Here are some thought-provoking results from analysing Landini’s text (as suggested by Knox) alongside the VM text, with comparisons against English, Latin, German, French and Spanish. These use a new form of the Genetic Algorithm, described below.
Summary
It looks to me as though Landini either generated his text from a transcription of the VM itself, or his algorithm for generating that text is a good emulation of the encoding process used in the VM. In other words, the Landini “language” is a better candidate as a plaintext language for the VM than the European languages tested.
Results
Here is a table which shows the GA’s efficiency at converting/translating between Voynich, Landini, and the other languages.

(In the table, the best possible score is 1.0 – see below for an explanation)
Asking the GA to translate English to English, or Latin to Latin, etc. results in a high efficiency score, as expected. Note that the Landini to Landini efficiency is 0.97 – almost perfect.
The GA performs moderately at converting between the languages and the Landini text. But what is most striking (to me) is the good efficiency for converting Voynich to Landini (0.74) and Landini to Voynich (0.89).
Some Notes on the table
To look at this I revised my GA code so that it was more flexible, and I jettisoned the use of separate dictionaries. Here is how the GA now functions. It can convert/translate between any language text samples.
1) Two text files are read in: the “source” text, and the “target” text. This could be, for example, a source file containing Landini’s text, and a target file containing Spanish text, if we want to convert from Landini to Spanish.
2) The text in each file is processed separately, producing two word lists, and two sets of n-Gram frequency tables.
3) The chromosomes are generated with random mappings between the source n-Grams and the target n-Grams
4) The GA evolves the chromosomes by trying to maximise their cost. The difference now is that when a target word is generated from the source text using the mappings, it is looked up in the target word list created in 2) above, rather than in a separate dictionary.
5) After training, the best chromosome can have a maximum cost value of 1.0, which would correspond to a perfect conversion between the source text and the target text (i.e. every word produced from the source text is found in the target text dictionary)
6) So we can feed the GA with two identical texts, and after training the score of the best chromosome should be 1.0, and indeed it approaches that (it doesn’t quite get there because only the top 100 n-Grams are translated, and so some characters in the source text cannot be translated).
7) The word and n-Gram frequency lists are made from the entirety of each text, but (for this exploratory study) the training takes place on only the first 50 “words” in the source text, and uses only the first 100 n-Grams for mapping. Thus if the 50 words of Voynich chosen contain several rare characters, then for those the mapping will fail because those rare characters do not appear in the n-Gram list, and this will result in a lower score.
8) In all cases the “X->X” score in the table (i.e. the diagonal) represents the best score possible for that language, and is a normalisation for the other numbers in the table. I should really revise the table and divide out the off-diagonal scores by the diagonal normalisations.
9) An improvement would be to configure the n-Gram list to be, say, 200 long, and use more source (Voynich) words for the training. The downside of this is mainly execution speed.
10) These runs were with n-Grams up to 3: it would be better to go to 4 at least.
11) I think Landini gets good scores because the character set he uses is very small. Knox comments: “A factor must be that the Landini Challenge has built-in frequency matches to any transcription of the VMs. Also, there is no meaningful correspondence in the letter sequence of one word to another in Landini. The difficulty fits what I said the VMs may be.”
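The scoring described in notes 4), 5) and 8) can be sketched as follows (Python 3). This is an illustration under assumptions, not the code that produced the table: `efficiency` and `normalise` are hypothetical names, `translate` stands for whatever the best chromosome does to a single word, and which diagonal entry to divide by is left to the caller because the notes do not settle it.

```python
def efficiency(source_words, target_words, translate):
    """Fraction of translated source words that occur in the target word list.
    A value of 1.0 means every translated word was found."""
    if not source_words:
        return 0.0
    vocabulary = set(target_words)
    hits = sum(1 for word in source_words if translate(word) in vocabulary)
    return hits / len(source_words)

def normalise(raw_score, diagonal_score):
    """Divide an off-diagonal score by an 'X->X' diagonal score (note 8)."""
    return raw_score / diagonal_score

# Toy example: the "translation" replaces a with x, so one of two words is found.
score = efficiency(["ab", "cd"], ["xb", "zz"], lambda w: w.replace("a", "x"))
print(score)
```

Looking words up in a set built from the target text is what lets the GA run without a separate dictionary.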
Genetic Algorithm
Basic Idea
In the plaintext, convert each group of 1, 2, 3 or 4 characters into a Voynich group of 1,2,3 or 4 characters. We call this a “mapping”. For example, when creating Voynich from Latin, a cipher mapping might be:
e => o, i => 9, …
er => 4o, is => ok, ti => 8a, …
ent => 9k, ant => A, …
… and so on. This can be encoded as two parallel lists, “seek” and “repl”: the string at each position in “seek” maps to the string at the same position in “repl”. For example:
String seek[] = {"4ok1",
"4oh","8am","1oe","4ok","ok1","o89","1oy","oh1","o8a","oha","ohc","c89","1co","k1o","1c9",
"c79","h1o","1o8","oko","oho","coe","8ae","co8","k19","h19","8ay","ham","hcc","koe","oka",
"hco",
"1o", "oe", "oh", "4o", "ok", "8a", "89", "am", "1c", "oy", "o8", "co", "ay",
"k1", "h1", "19", "hc", "c9", "ha", "ae", "79", "2o", "cc", "ko", "ho", "c8", "9h",
"9k", "c7", "2c", "ka", "kc", "1a", "an", "h9", "o,", "e8", "k9", "ap", "8o", "e,",
",1", "7a", "81",
"o", "9", "1", "a", "8", "c", "h", "e", "k", "y", "4", "m", ",",
"2", "7", "s", "K", "C", "p", "g", "n", "H", "j", "A"};
//Latin
String repl[] = {"un",
"ri", "on", "f", "es", "g", "em", "de", "se", "co", "ne", "ur", "si", "ic", "ui", "me",
"ere","eb", "la", "ma", "le", "id", "bu", "nti","no", "cu", "eba","qui","ie", "al", "ul",
"ns",
"c", "d", "l", "er", "is", "ti", "nt", "en", "re", "in", "um", "am", "us",
"te", "it", "v", "tu", "ta", "ra", "di", "an", "ni", "li", "et", "ba", "ae", "mi",
"ent","st", "h", "nd", "ci", "pe", "im", "ua", "io", "tur","il", "ve", "iu", "as",
"vi", "ita","ca",
"e", "i", "a", "t", "u", "s", "r", "n", "m", "o", "p", "b", "q",
"qu", "at", "or", "ia", "ar", "ce", "ib", "ec", "ab", "ru", "ant"};
Such an algorithm is used inside a Chromosome of the Genetic Algorithm. The Chromosome decodes Voynich into Latin by matching character groups in the Voynich word against each of the strings in the “seek” list in turn. If a match occurs, then the Voynich group is translated into the Latin group in the “repl” list at the same position. Thus “4ok1” in Voynich is translated into “un” in Latin.
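A minimal Python 3 sketch of this decoding step is shown below. The scanning order is an assumption: the word is read left to right and, at each position, the first “seek” string that matches (in list order, so longer groups win when the list is ordered longest first) is replaced. The function name `decode` and the `unknown` placeholder are hypothetical; the short lists are an excerpt of the table above.

```python
def decode(word, seek, repl, unknown="?"):
    """Translate a Voynich word by scanning left to right and replacing the
    first matching seek string at each position with its repl partner."""
    out = []
    i = 0
    while i < len(word):
        for s, r in zip(seek, repl):
            if word.startswith(s, i):
                out.append(r)
                i += len(s)
                break
        else:
            # No group matched at this position: mark it and move on one glyph.
            out.append(unknown)
            i += 1
    return "".join(out)

# Excerpt of the seek/repl table above, longest groups first.
seek = ["4ok1", "4oh", "8am", "1o", "o", "9"]
repl = ["un", "ri", "on", "c", "e", "i"]

print(decode("4ok1", seek, repl))   # the 4-glyph group maps to "un"
print(decode("4ok19", seek, repl))  # "un" followed by the mapping for "9"
```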
Once the Voynich word has been translated into Latin, the Latin word is looked up in a Latin dictionary. If the word is found, then the “cost” (or “quality”) of the Chromosome is increased … if the word is not found, then the cost is decreased. After all words in the Voynich text have been converted to Latin, and the aggregate cost of the Chromosome evaluated, it can be judged whether the mapping “seek” to “repl” is a good one or not.
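The cost calculation might look like this in Python 3. The size of the reward and penalty (one point each) is an assumption, as the actual amounts are not given; `chromosome_cost` is a hypothetical name.

```python
def chromosome_cost(decoded_words, dictionary, reward=1.0, penalty=1.0):
    """Aggregate cost of a chromosome: add `reward` for every decoded word
    found in the dictionary and subtract `penalty` for every word that is not."""
    cost = 0.0
    for word in decoded_words:
        if word in dictionary:
            cost += reward
        else:
            cost -= penalty
    return cost

latin = {"quidem", "vita", "est"}
print(chromosome_cost(["quidem", "vita"], latin))  # both words valid
print(chromosome_cost(["quidem", "xyz"], latin))   # one valid, one invalid
```

Holding the dictionary in a set keeps each lookup cheap, which matters because every chromosome is scored on every word at every epoch.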
Generating the Chromosome Population
We generate a large number of Chromosomes, each of which has a different, randomised, “seek” to “repl” mapping. We do this by simply shuffling the order of the “repl” strings in each Chromosome.
Thus, one Chromosome may map “4ok1” to “s” and another may map it to “qui”.
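A sketch of the population set-up in Python 3, assuming each Chromosome is represented simply by its shuffled “repl” list (the “seek” list stays fixed and is shared). The function name and the `seed` parameter are hypothetical.

```python
import random

def make_population(repl, size, seed=None):
    """Create `size` chromosomes, each a differently shuffled copy of repl."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        shuffled = list(repl)  # copy, so the original order is preserved
        rng.shuffle(shuffled)
        population.append(shuffled)
    return population

repl = ["un", "ri", "on", "s", "qui"]
for chromosome in make_population(repl, 3, seed=1):
    print(chromosome)
```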
This population of Chromosomes is then evaluated: each Chromosome converts the Voynich words to Latin, and each then gets a cost. The higher the cost, the better. The highest possible cost would be a Chromosome that had a seek-repl mapping that produced a valid Latin word for each Voynich word.
Training the Chromosomes
The Chromosomes are ordered in decreasing cost, and then the best of them (i.e. at the top of the list) are “mated” together to produce offspring Chromosomes. The mating process essentially involves taking sequences of the “repl” strings from both parents and combining them to form a new “repl” list.
Some of the offspring Chromosomes are then “mutated”. This involves replacing one of the “repl” strings with some randomly selected letters from the Latin character set.
The process repeats (ordering the Chromosomes, mating the best ones, mutating the offspring) until a predefined cost value is reached, or the population of Chromosomes refuses to improve itself.
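The training loop can be sketched as below (Python 3). Several details are assumptions made for the illustration: a single cut point for mating, keeping the better half of the population as parents, a fixed mutation rate, and a `patience` counter standing in for “refuses to improve itself”. All function and parameter names are hypothetical.

```python
import random

def mate(parent_a, parent_b, rng):
    """One-point crossover: the start of one parent's repl list, the rest of the other's."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(chromosome, alphabet, rng, max_len=3):
    """Replace one repl string with 1..max_len randomly chosen letters."""
    child = list(chromosome)
    pos = rng.randrange(len(child))
    length = rng.randint(1, max_len)
    child[pos] = "".join(rng.choice(alphabet) for _ in range(length))
    return child

def evolve(population, cost, alphabet, epochs=100, keep=0.5,
           mutation_rate=0.2, target=None, patience=20, seed=None):
    """Order, mate and mutate until the target cost, stagnation, or the epoch limit."""
    rng = random.Random(seed)
    best_cost, stale = None, 0
    for _ in range(epochs):
        population = sorted(population, key=cost, reverse=True)
        top = cost(population[0])
        if target is not None and top >= target:
            break
        if best_cost is not None and top <= best_cost:
            stale += 1
            if stale >= patience:
                break
        else:
            best_cost, stale = top, 0
        parents = population[:max(2, int(len(population) * keep))]
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            child = mate(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, alphabet, rng)
            children.append(child)
        population = parents + children
    return max(population, key=cost)

# Toy cost: number of positions that agree with a known answer.
answer = ["a", "b", "c", "d"]
toy_cost = lambda c: sum(x == y for x, y in zip(c, answer))
start = [["d", "c", "b", "a"], ["a", "c", "b", "d"], ["b", "a", "d", "c"], ["a", "b", "d", "c"]]
print(evolve(start, toy_cost, "abcd", epochs=20, seed=0))
```

Because the parents are carried over unchanged, the best cost in the population can never go down from one epoch to the next.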
In the end, the best, trained Chromosome will contain the optimal arrangement of “seek” to “repl” mappings for conversion of Voynich to Latin.
The same procedure can be used for a Voynich to English, to German, French or any other language, provided that a dictionary and substantial texts are available to process.
First Results – Voynich to Latin
This is a limited attack on the first five “sentences” of f1r, using 200 chromosomes and a Latin dictionary of around 15,000 words. The best chromosome scores 9.4 after 500 training epochs (cf a score of 20 for a one-to-one translation of Latin into Latin).
Here are the deciphered sentences:
1) Voynich: fa19s 9hae ay Akam 2oe !oy9 ²scs 9 hoy 2oe89 soy9 Hay oy9 hacy 1kam 2ay Ais Kay Kay 8aN s9aIy 2ch9 oy 9ham +o8 Koay9 Kcs 8ayam s9 8om okcc9 okcoy yoeok9 ?Aay 8am oham oy ohaN saz9 1cay Kam Jay Fam 98ayai29
Latin: ?ereieas vias is asasita meas ?ereis ?astuas is quinti mensis asereis vis ereis sttunti viasita viis as?as alis alis qui? asisere? nti quere ere viis ?ita alamisis altuas ereis asis quiita quantis querenti ntiviquis ?asis qui amita ere am? asere?is viis alis ?is ?is isereere?viis
Voynich Herbs
Edith Sherwood has a web site where she details compelling possible identifications for the plants depicted in the “herbal” pages of the VM.
Dana Scott’s page also has plausible identifications for the plants.
As has often been pointed out, if we look at the first Voynich “word” that appears on each page of the herbal part of the VM, we find that those words are unique, or appear elsewhere very rarely. It thus seems reasonable that the words may be the names of the plants depicted.
The GA was set up to find a set of n-Gram mappings that would convert a list of 111 Voynich first herbal words into Latin/English or Spanish. For this, dictionaries of Latin, English and Spanish herb/plant names were used.
The GA sought a mapping that would convert all the Voynich words for herbs/plants into as many valid plaintext (Spanish, English, Latin) words as possible. The best result was for a mixed English/Latin dictionary (see table): 31 of the 111 Voynich words were converted, about 30% success rate.
(One should never expect 100% success, due to missing names in the dictionary, transcription errors, missing n-Grams, incomplete n-Grams etc.)
The results are shown below in tabular form, together with Dana Scott’s and Edith Sherwood’s identification. The first column shows the folio in the VM, the second shows the first Voynich word on that folio. For the GA identification columns (3 and 4) the Voynich mapped word is shown, in quotation marks if not found in the associated dictionary, and in bold if found in the dictionary.
Note that, probably unsurprisingly, nowhere do the IDs from the GA in Spanish, English/Latin and Scott/Sherwood, agree! NOT YET, anyway 🙂
(What amuses me about this mapping technique is that it tends to produce words that sound plausible in the target language. E.g. for f4r the Latin/English word “paptise” sounds like a valid word.)
Genetic Algorithm – f27v

The text, in the Voyn_101 transcription, reads:
fo1ou 1of 1o3o soe9 2oe 9k1ay og1oy9 h1oX1oy 819 1hay ok19 29 29 &19 829 h19 1co 8ai89 819 h1c9 h19 &1oh19 82o 8▀y 1o829 ck1co89 2e8 oh1o 19 h1cc8 1e 1oe ho8 o oh2o 8o1ccsp 4oh9 2hccoy1o8ay 2hoe 1ok19 Ko8oe 82o h1sss ohC89 81s19 sok189 2o 2o9h1o 289 818 1s1s9 ok189 oh2cs Ah1oh29
There are a total of 50 “words” (groups of characters) in this text. The GA tries to maximise the score of a conversion of all 50 words into valid Latin words by checking that each converted word appears in a Latin Word Dictionary.
1) Allowing 1-1 mapping i.e. 1 Voynich character maps to 1 Latin character
0 0.2278094451386928 GA$Chromosome@1a07791 prob=1.5168475878361884 Good=16 / 50 = 32.0%
S: o 1 9 8 h c 2 s y k a e & f X p A g u i K C 3 4 ▀
R: i e o r n c u d l g p s b a f q h x y µ m t j v z
aieiy eia' eiji diso uis ogepl ixeilo neifeil reo enpl igeo uo uo beo' ruo' neo' eci rpµro reo neco' neo' beineo rui' rzl
eiruo cgeciro usr inei eo' neccr es' eis' nir i' inui rieccdq vino' unccileirpl unis' eigeo miris' rui' neddd intro' redeo' digero'
ui uionei uro rer ededo igero inucd hneinuo
2) 1-2 mapping i.e. each Voynich character can map into 1 or 2 Latin characters
0 0.1504950444038535 GA$Chromosome@8a2f6b prob=1.4969215520914312 Good=15 / 50 = 30.0%
S: o 1 9 8 h c 2 s y k a e & f X p A g u i K C 3 4 ▀
R: i e is c p n s l a o r t m us er ti b in re u d tu nt it en
usieire eius' einti litis' sit' isoera iineiais peiereia ceis epra ioeis sis' sis' meis' csis peis eni crucis' ceis penis'
peis meipeis csi cena' eicsis noenicis stc ipei eis' pennc et' eit pic i' ipsi' ciennlti itipis spnniaeicra spit eioeis dicit' csi
pelll iptucis celeis lioecis si' siispei scis' cec elelis ioecis ipsnl bpeipsis
3) 2-1 mapping i.e. each Voynich character or pair of characters maps to 1 Latin character
0 0.1600852807737927 GA$Chromosome@2ccccf prob=1.410155608070086 Good=15 / 50 = 30.0%
S: 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a o 1 9 8 h c 2 s y k a e
R: i d e l r m b f t g q x p y v c s j o z a u µ w h k £ + = n
|oi| i| i|o kca vn arx' o|i£a do|i£ ul zµx pl' s' s' |l us' da' to' j|b ul dwa da' |ida uv u|£ ius' wrqb hnu ei' l' dgu zn
in' µm o' ev uotwk| |ea hµgfij£ hµc' ira' |mc uv dkkk e|b uyl kpzb v' vado' hb uzu yya pzb ehwk |des
4) 2-2 mapping i.e. one or two Voynich characters maps to one or two Latin characters
0 0.6279610718846604 GA$Chromosome@47a prob=1.4749867245934525 Good=14 / 50 = 28.0%
S: 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a o 1 9 8 h c 2 s y k a e
R: i e l is at in s d n a tu re er u ti p um b o en m t r c g f us nt it v
|oi| i| i|o fpm tiv matre' o|iusm eo|ius tis enrre eris' um um |is tum' em no' b|s tis ecm em |iem tti t|us itum' cattus'
gvt li is' eat' env iv' rin o' lti toncf| |lm gradibus' grp iatm |inp tti efff l|s tuis' ferens' ti timeo' gs tent uum erens lgcf |e
lum
5) 3-2 mapping i.e. one, two or three Voynich characters maps to one or two Latin characters
0 0.40567669613807406 GA$Chromosome@1a9fa4c prob=1.467899375005982 Good=18 / 50 = 36.0%
S: h1o ok1 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe o 1 9 8 h c 2 s y k a e
R: i er e a s it l r us m o p at c in en is um b tu t v n re u f d nt g ti
|be| e| e|b fumt isti' tlc b|edt i|ed vit tunc' ert ut' ut' |it vut at' ob' vg|us vit aret at' |eat vis' v|d evut relatus'
utiv se' it' apv tuti' eti nr b' sis' vboref| |st unpmevc unum' elt |rum vis' afff s|us venit' ferus' is' isti' uus vtuv enent erus' suref |inut
Explanation:
A ' sign (apostrophe) following a word indicates that the word is valid Latin. A | (vertical bar) in place of a character indicates that the Voynich character(s) have no mapping defined into Latin – the Latin character could be anything.
The S: and R: lines show the Voynich characters (S) and their replacements (R) respectively.
Genetic Algorithm based Phrase Analysis
Hypothesis
The following hypothesis occurred to me while I was investigating a cipher theory proposed by Rich Santa Coloma. (This is not a new idea amongst Voynich researchers, but it was new to me!)
The VMs “words” are codes for plaintext character groups, probably trigraphs, digraphs and single characters.
How does one use this system?
1) Take each word in the plaintext
2) Break it up into a sequence of one or more trigraphs, digraphs and single characters by referring to a code table
3) Write the code for each, separated by a space, and terminate the last tri/di-graph/character code by a VMs “9”.
The labels are probably treated differently: there may well be a separate set of codes just for the labels.
As an example, take the following “sentence” of 33 “words” from the Herbal folios:
h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds 3o h1cc9 sam 1oh1oe 1oy Hos
Breaking the VMs “words” at each terminal “9”, this is deciphered to be a sentence of 13 words:
h1cok 2oe 1c
4ohom 2oy 4ok1coe 1oyoy 2o82c
4okd
4okcc
8am 4okC
Kay o1c
1oe 1oe 4ok1c
8am 1okd
8ae s1
k1c
8am 8C
ko8 8an 4okds 3o h1cc
sam 1oh1oe 1oy Hos
Each of these words is built of one or more codes. E.g. the first word in the list above is “h1cok 2oe 1c” and may be deciphered as
h1cok = “qui”,
2oe = “de”
1c = “m”
to make the Latin word “quidem”.
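The grouping and decoding steps can be sketched in Python 3 as follows. The function names are hypothetical, and the code table contains only the three entries from the “quidem” example; every other code is shown as “?”.

```python
def group_words(tokens, terminator="9"):
    """Collect VMs tokens into plaintext words, closing a word at each token
    that ends with the terminator glyph (which is stripped off)."""
    groups, current = [], []
    for token in tokens:
        if token.endswith(terminator):
            stem = token[:-len(terminator)]
            if stem:
                current.append(stem)
            groups.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens with no terminator form the last word
        groups.append(current)
    return groups

def decode_group(group, codes, unknown="?"):
    """Join the plaintext n-grams for each code in a word group."""
    return "".join(codes.get(code, unknown) for code in group)

sentence = ("h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 "
            "Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds "
            "3o h1cc9 sam 1oh1oe 1oy Hos")
codes = {"h1cok": "qui", "2oe": "de", "1c": "m"}  # from the example above

groups = group_words(sentence.split())
print(len(groups))                     # number of plaintext words
print(decode_group(groups[0], codes))  # first word of the sentence
```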
An interesting feature of this cipher/code is that you may have several choices of how to split each plaintext word into tri/di/mono-graphs, but without ambiguity for the decipherer. This may be an explanation for the different frequency distributions between the VMs folios and Currier hands: they were written by different scribes who tended to split the plaintext words differently.
Does the Theory fit the Data, for Latin?
We first take a substantial body of text from the VMs, e.g. the Recipes folios, and feed it through an application code that extracts all the VMs words, and groups them according to the procedure described above, using one or more arbitrary characters as word ending marks. Typically we use VMs “9”. Each sentence so derived is analysed: each of the tokens is analysed for n-gram content and frequencies are tallied.
At the end of the processing, the n-grams are sorted into frequency order: the most frequent n-grams appear first in the list.
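A small Python 3 sketch of the n-gram tally and frequency ordering, with hypothetical function names. The maximum n-gram length of 3 is an assumption carried over from the earlier runs.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count every substring of length 1..max_n in every token."""
    counts = Counter()
    for token in tokens:
        for n in range(1, max_n + 1):
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    return counts

def top_ngrams(tokens, limit, max_n=3):
    """The `limit` most frequent n-grams, most frequent first."""
    return [gram for gram, _ in ngram_counts(tokens, max_n).most_common(limit)]

counts = ngram_counts(["8am", "8ae", "1oe"], max_n=2)
print(counts["8a"], counts["am"], counts["e"])
```

The same two functions can be run over the Latin phrase list to obtain the Latin n-gram ranking.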
At this point the application moves to its second stage. It ingests a large list of Latin phrases, generated by Knox (thanks, Knox!) and processes each word in each unique phrase for n-gram content, so extracting the n-gram frequencies for Latin. The phrases are placed in a sorted list: shortest first. The n-grams are sorted by frequency, most frequent first.
Here are the Latin phrase sizes used:
A total of 53834 different phrases of size >= 2
Size Count
2 4405
3 28152
4 8524
5 3866
6 2227
7 1507
8 1085
9 813
10 633
11 513
12 424
13 356
14 300
15 252
16 209
17 177
18 150
19 130
The third stage of the application is to generate a set of Genetic Algorithm chromosomes. Each chromosome takes the Top N n-grams from the Voynich n-gram list and pairs them with a random selection of the n-grams from the Latin list.
For example, for a Chromosome of length 15 (in fact the GA uses much longer lengths, typically 200) the following table might be used:
V: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy  1c7 e
L: ed gi n  de  et   ae p  s     du  tu    nd   d    tio rum te
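A chromosome of this kind could be built along the following lines. This is a sketch only: it assumes the Latin n-grams are drawn without repetition (as they are in the example table), and the function name random_chromosome is invented for the illustration.

```python
import random

def random_chromosome(voynich_ngrams, latin_ngrams, top_n, rng):
    """Pair the top_n most frequent Voynich n-grams with randomly chosen Latin n-grams.

    Both lists are assumed to be sorted most frequent first. Latin n-grams are
    sampled without replacement, which is an assumption of this sketch.
    """
    top = voynich_ngrams[:top_n]
    picks = rng.sample(latin_ngrams, len(top))
    return dict(zip(top, picks))

# The 15 n-grams from the example table, used here as toy input lists.
voynich = "am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e".split()
latin = "ed gi n de et ae p s du tu nd d tio rum te".split()

chromosome = random_chromosome(voynich, latin, top_n=5, rng=random.Random(0))
print(chromosome)
```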
The chromosomes are “scored” by having them translate/decipher a training set of sentences from the input VMs folios. To calculate the score of each chromosome for each sentence, the sentence word tokens are converted to Latin n-grams using the chromosome’s table. Then the tokens are joined together to form the plaintext words. The plaintext words are looked up in the Latin dictionary: the chromosome’s score is increased for valid words, and decreased for invalid words. Once all the words in the sentence have been deciphered in this way, it is compared with each of the Latin phrases: if a Latin phrase appears in the sentence, the score of the chromosome is increased substantially.
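The scoring just described might look roughly like this in Python. The weights (word_bonus, word_penalty, phrase_bonus), the "?" used for tokens missing from the table, and the data layout (a sentence as a list of words, each word a list of VMs tokens) are all assumptions made for the sketch; the real application may weigh things differently.

```python
def decipher_sentence(sentence, table):
    """Turn each word (a list of VMs tokens) into a plaintext word via the table."""
    return ["".join(table.get(tok, "?") for tok in word) for word in sentence]

def score(sentences, table, dictionary, phrases,
          word_bonus=1.0, word_penalty=1.0, phrase_bonus=10.0):
    """Score a mapping table on a set of training sentences (hypothetical weights)."""
    total = 0.0
    for sentence in sentences:
        words = decipher_sentence(sentence, table)
        for w in words:
            # Reward dictionary words, penalise everything else.
            total += word_bonus if w in dictionary else -word_penalty
        # Pad with spaces so phrases only match on whole-word boundaries.
        text = " " + " ".join(words) + " "
        for phrase in phrases:
            if " " + phrase + " " in text:
                total += phrase_bonus
    return total

# Toy data, for illustration only.
table = {"a": "e", "b": "t", "c": "in", "d": "x"}
sentences = [[["a", "b"], ["c"], ["d"]]]  # deciphers to "et in x"
print(score(sentences, table, dictionary={"et", "in"}, phrases=["et in"]))  # 11.0
```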
The best chromosome found by a Monte Carlo method (basically generating random chromosomes, and retaining the best scoring chromosome) is placed at the top of a list, and then the remaining chromosomes needed for the Genetic Algorithm are generated.
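The Monte Carlo seeding amounts to a simple best-of-N random search. A minimal sketch follows, with make_candidate and fitness as stand-ins for the chromosome generator and the scoring function; the toy problem at the end exists only to make the block runnable.

```python
import random

def monte_carlo_best(make_candidate, fitness, trials, rng):
    """Generate `trials` random candidates and keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for _ in range(trials):
        candidate = make_candidate(rng)
        s = fitness(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

# Toy problem: find a number close to 7.
make_candidate = lambda rng: rng.randint(0, 9)
fitness = lambda x: -abs(x - 7)
best, best_score = monte_carlo_best(make_candidate, fitness, 200, random.Random(1))
print(best, best_score)
```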
The GA phase now begins: the chromosomes are genetically altered, mated and selected to optimise the best chromosome’s score on the training sentences. This phase is compute intensive.
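For readers unfamiliar with GAs, one epoch could be sketched as below. The particular operators (one-point crossover, per-entry mutation, keeping the best chromosome unchanged, breeding from the top half) are common textbook choices and are assumptions here, not a description of the actual GAPhrases code. Each table needs at least two entries, and the population at least two chromosomes.

```python
import random

def crossover(a, b, rng):
    """One-point crossover of two mapping tables that share the same keys."""
    keys = list(a)
    cut = rng.randrange(1, len(keys))
    return {k: (a[k] if i < cut else b[k]) for i, k in enumerate(keys)}

def mutate(table, latin_ngrams, rate, rng):
    """Replace each entry with a random Latin n-gram with probability `rate`."""
    return {k: (rng.choice(latin_ngrams) if rng.random() < rate else v)
            for k, v in table.items()}

def next_generation(population, fitness, latin_ngrams, rng,
                    mutation_rate=0.2, make_fresh=None):
    """Produce the next epoch's population (hypothetical operators and rates)."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:max(2, len(ranked) // 2)]
    children = [ranked[0]]  # keep the best chromosome unchanged
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)
        children.append(mutate(crossover(a, b, rng), latin_ngrams, mutation_rate, rng))
    if make_fresh is not None and len(children) > 1:
        children[-1] = make_fresh(rng)  # optional fresh chromosome for diversity
    return children

# Toy run: the "target" mapping is purely illustrative.
rng = random.Random(0)
latin = ["ed", "gi", "n", "de", "et"]
keys = ["am", "ay", "ae"]
target = {"am": "ed", "ay": "gi", "ae": "n"}
fitness = lambda t: sum(t[k] == target[k] for k in keys)
population = [{k: rng.choice(latin) for k in keys} for _ in range(6)]
for _ in range(20):
    population = next_generation(population, fitness, latin, rng)
print(fitness(population[0]), population[0])
```

Because the best chromosome is carried over unchanged, the best score in this sketch can never fall from one epoch to the next.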
Periodically, the GA will report on its progress:
Epoch 311 Cost/Ave 62.845588235294116/61.22993872549012 same 1 Mutated 21.608040201005025% New 1 MS 15
62.845588235294116 GAPhrases$Chromosome@41ec5a Good=128 / 408 = 31.37255% 40 phrases in 25 sentences
S: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e
R: ed gi n de et ae p s du tu nd d tio rum te
Sentence 189
S: 2o ok1c - 1coe hc1 - 1Kc - ohan ae e hC - 4ohan 1cH - 1c7ay ap e2c - 2c7ae ohcay e hc8 - 1coehC - ehc - ohC - 4ohC - 4ohc - 4ohan ap -
T: endve la' binteua tunti nis te' pi et' in'* tunis
In this report, the GA has been running for 311 “epochs” (each epoch is a new generation of chromosomes). The cost (score) of the best chromosome is 62.8, whereas the average score of all the chromosomes in the population is 61.2. In this epoch there has been no change to the best chromosome since the last epoch (“same 1”), about 21.6% of the chromosomes have been mutated, and a fresh chromosome (“New 1”) was inserted (to ensure diversity – this is not usually done in GAs, but I find it produces more reliable training). “MS 15” means that the longest run of no-change epochs seen so far is 15: the larger this number is, the more stagnant the chromosome pool is, and the nearer to a solution we are.
The following line shows in detail how the best chromosome has scored: its table produces 128 valid Latin words, from a total of 408 translations i.e. about 31%. In the 25 sentences being used in training, 40 common Latin phrases have been found.
The next two lines show the first 15 n-grams in the mapping that the chromosome is using.
Then the status report shows how the chromosome fared on translating a sentence picked at random from the VMs folios. Since the GA is being trained only on the first few sentences, the remainder are essentially “unseen”, and so a valid, sensible translation in a non-trained sentence is significant.
The sentence picked is number 189 (the training set is the first 25 sentences in this run, so number 189 is well outside that). The VMs source sentence is shown with hyphens “-” separating the groups of tokens that make up words. E.g. “2o ok1c” is the first word. Beneath is the Latin translation. A Latin word followed by a single quote appears in the Latin dictionary, and is thus valid. A star appearing after a run of valid Latin words indicates that the Latin phrase made up by those words is common, or at least appears in Knox’s list.
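For anyone wanting to post-process these reports, here is a small sketch that reads the two notations just described: the hyphen-separated source line and the quote/star markers on the translation line. The function names and the returned tuple layout are made up for the illustration.

```python
def parse_source(line):
    """Split an "S:" body into words, each a list of VMs tokens; "-" ends a word."""
    words, current = [], []
    for tok in line.split():
        if tok == "-":
            if current:
                words.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        words.append(current)
    return words

def parse_translation(line):
    """Split a "T:" body into (word, in_dictionary, ends_known_phrase) tuples."""
    result = []
    for tok in line.split():
        ends_phrase = tok.endswith("*")   # star: a known phrase ends here
        tok = tok.rstrip("*")
        valid = tok.endswith("'")         # single quote: word is in the dictionary
        result.append((tok.rstrip("'"), valid, ends_phrase))
    return result

print(parse_source("2o ok1c - 1coe hc1 - 1Kc -"))
print(parse_translation("endve la' et' in'*"))
```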





