Posts Tagged ‘Genetic Algorithm’

Language A and B Again

March 13, 2013

A tentative conclusion from comparing Language A and Language B  is that the non-gallows glyphs are used in the same way in both Languages.

That is to say, they appear to mean the same thing. So the “o” in A means the same as the “o” in B.
There is some persistent “mixing” between the e/y glyphs, which is illustrated by the example result below:
[Figure: ABMixing]
There is also some doubt about the “8” glyph, which sometimes seems to mix with the gallows glyphs (e.g. in some cases, the “8” appears in A to function in the same way as a gallows glyph in B and vice versa). This may simply be an error in the comparison method, or it may be that the “8” is a null, or it may be due to some other effect.
The gallows glyphs are different – they don’t appear to mean the same in A and B. I’m focussing on those glyphs now.

Language “A” and “B” Conversions

March 5, 2013

This is an update to my previous two posts on this topic.

I have been concentrating on searching for the correspondence between glyphs used in Language A and glyphs used in Language B. As a reminder, the method is to take all words in, say, Language A, and “convert” them to words in Language B by changing the glyphs according to a candidate mapping table. The frequency of each converted word in Language B is then compared with the frequency of the original word in Language A: the closer the frequencies, the better the mapping match.
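A minimal sketch of that comparison, in Python. The function names, the toy word lists and the squared-difference measure are all assumptions made for illustration; the actual scoring used in the trials may differ.

```python
from collections import Counter

def convert_word(word, mapping):
    """Replace each glyph using the candidate mapping; unmapped glyphs pass through."""
    return "".join(mapping.get(glyph, glyph) for glyph in word)

def mapping_mismatch(words_a, words_b, mapping):
    """Sum of squared differences between the relative frequency of each
    Language A word and the relative frequency of its converted form in
    Language B. Lower is better (assumed measure, not the author's)."""
    freq_a = Counter(words_a)
    freq_b = Counter(words_b)
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    mismatch = 0.0
    for word, count in freq_a.items():
        converted = convert_word(word, mapping)
        # Counter returns 0 for a converted word never seen in Language B
        mismatch += (count / total_a - freq_b[converted] / total_b) ** 2
    return mismatch

# Invented toy data: "e" in A behaves like "y" in B
toy_a = ["1oe", "1oe", "8am"]
toy_b = ["1oy", "1oy", "8am"]
identity_score = mapping_mismatch(toy_a, toy_b, {})
swap_score = mapping_mismatch(toy_a, toy_b, {"e": "y"})
```

On the toy data the e-to-y mapping gives a lower mismatch than the identity mapping, which is the kind of signal the search looks for.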

Method Check using only Language A words

As a check of the method, I took the Herbal folios 1-25 (all in Language A) and split them into two groups: 1-12 and 13-25, and I then artificially labelled the latter group as Language B. Then I ran the matching procedure, which produced the following result:

Epoch 62 Best chromosome 0 Value= 5.62272615159e-05
Chromosome ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'k', 'y', 'h', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']
ngramsA    ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'h', 'y', 'k', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']

This is good and reassuring, since it shows that the words in folios 13-25 have essentially the same frequency distribution when their glyphs are mapped to the same glyphs in folios 1-12.

Removal of Glyph Variants in Voyn_101

As the tests progressed, it became clear that some of the glyphs GC defined in Voyn_101 were in fact variants of more common glyphs. The most obvious were the “m”, “n”, “M” glyphs mentioned before – with these included, the conversions between Language B and Language A were of much poorer quality than if they were expanded to “iiN”, “iN” and “iiiN” respectively. After some time weeding out these variants, the following table was arrived at:

seek =  ["3", "5", "+", "%", "#", "6", "7", "A", "X", 
         "I", "C", "z", "Z", "j", "u", "d", "U", "P", 
         "Y", "$", "S", "t", "q",
         "m", "M", "n", "Y", "!", ")", "*", "b", "J", "E", "x", "B", "D", "T", "Q", "W", "w", "V", "(", "&"]
repl =  ["2", "2", "2", "2", "2", "8", "8", "a", "y", 
         "ii", "cc", "iy", "iiy", "g", "f", "ccc", "F", "ip",
         "y", "s", "cs", "s", "iip",
         "iiN", "iiiN", "iN", "y", "2", "9", "p", "y", "G", "c", "y", "cccN", "ccN", "s", "p", "h", "h", "K", "9", "8"]

I am very confident that the glyphs remaining after using the above conversion table are the base set.  The base set of glyphs is thus:

Language A frequency order: 'o', 'c', '9', '1', 'a', '8', 'e', 'i', 'h', 'y', 'k', 's', '2', 'N', '4', 'g', 'p', '?', 'K', 'H', 'f', 'G', 'F', 'L', 'l', 'v', 'r', 'R'
Language B frequency order: 'c', 'o', '9', 'a', '8', 'e', '1', 'h', 'i', 'y', 'k', '2', 'N', 's', '4', 'g', 'p', 'f', '?', 'H', 'K', 'G', 'F', 'l', 'L', 'R', 'r', 'v'

where “?” represents all very rare glyphs (such as the “picnic table” glyph). There are thus 27 glyphs (15 gallows and 12 regular) excluding the rare special glyphs like the picnic table.
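As an illustration of how the variant table above can be applied, here is a small Python sketch. It uses only five of the seek/repl pairs, and the helper name expand_variants is hypothetical. Because every entry in seek is a single glyph, a character translation table is sufficient.

```python
# A few (seek, repl) pairs copied from the conversion table above;
# the full lists can be substituted unchanged.
seek = ["m", "M", "n", "I", "C"]
repl = ["iiN", "iiiN", "iN", "ii", "cc"]

# str.maketrans accepts a dict of single characters to replacement strings
VARIANT_TABLE = str.maketrans(dict(zip(seek, repl)))

def expand_variants(word):
    """Rewrite a Voyn_101 word using only base-set glyphs."""
    return word.translate(VARIANT_TABLE)

expanded = expand_variants("8am")
```

With these pairs, “8am” is rewritten as “8aiiN”.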

Glyph Mixing Between A and B

I ran many trials using the base set of glyphs, comparing various sections of the VMs written in the different hands. In particular, the following folio collections were defined:

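# Folio collections. list(...) is needed in Python 3, where ranges cannot be joined with +
Special = {'HerbalRecipeAB': list(range(107, 117)) + list(range(1, 26)),
           'HerbalAB': list(range(1, 57)),
           'HerbalBalneoAB': list(range(1, 26)) + list(range(75, 85)),
           'HerbalAstroAB': list(range(1, 13)) + list(range(67, 75)),
           'PharmaRecipeAB': [88, 89, 99, 100, 101, 102] + list(range(103, 117)),
           'AllAB': list(range(1, 117))
 }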

The collection I used the most was the one called “HerbalBalneoAB”, which contains Herbal folios written in Language A, and Balneo folios written in Language B. The nice feature of this collection is that the number of words is around the same for both Languages, which makes comparing counts very easy:

Total words =  2846  Total Language A =  1581  Total Language B =  1584

As an example, here is a trial result for HerbalBalneoAB:

Language B ['o', '9', '1', 'a', 'i', 'f', 'c', 'y', 'h', 'e', 'K', 'N', '2', 's', '4', 'g', 'p', '8', 'k', 'H']
Language A ['o', '9', '1', 'a', 'i', '8', 'c', 'e', 'h', 'y', 'k', 'N', '2', 's', '4', 'g', 'p', 'K', '?', 'H']

In all the tests I ran, there were some common features in the results:

  • Mixing between “e” and “y” – when writing Language A, the use of “e” appears to be equivalent to the use of  “y” in Language B, and vice versa
  • Mixing between 8, f, F, k, K, g, G, r, R, ? and so on – the Gallows glyphs swap amongst themselves, and with “8”

Just about all trials showed the “e”/“y” mixing. Tony Gaffney pointed out that these two glyphs are quite similar in stroke construction. The appearance of “8” amongst the swapping Gallows glyphs is curious.

The Relationship Between Currier Languages “A” and “B”

March 1, 2013

Captain Prescott Currier, a cryptographer, looked at the Voynich many moons ago, and made some very perceptive comments about it, which can be seen here on Rene Zandbergen’s site.

In particular, he noticed that the handwriting was different between some folios and others, and he also noticed (based on glyph/character counts) that there were two “languages” being used.

When I first looked at the manuscript, I was principally considering the initial (roughly) fifty folios, constituting the herbal section. The first twenty-five folios in the herbal section are obviously in one hand and one ‘‘language,’’ which I called ‘‘A.’’ (It could have been called anything at all; it was just the first one I came to.) The second twenty-five or so folios are in two hands, very obviously the work of at least two different men. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five of thirty folios) is in two ‘‘languages,’’ and each ‘‘language’’ is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own ‘‘language.’’ Now, I’m stretching a point a bit, I’m aware; my use of the word language is convenient, but it does not have the same connotations as it would have in normal use. Still, it is a convenient word, and I see no reason not to continue using it.

We can look at some statistics to see what he was referring to. Let’s compare the most common words in Folios 1 to 25 (in the Herbal section, Language A, written in Hand 1) and in Folios 107 to 116 (in the Recipes section, Language B, written in a different Hand):

Comparison between word frequencies in Languages A and B

So, for example, in Language A the most common word is “8am” and it occurs 192 times in the folios, whereas in Language B the most common word is “am”, occurring 137 times.

We might expect that these are the same word, enciphered differently. The question then is, how does one convert between words in Language A and words in Language B, and vice versa? In the case of “8am” to “am” it’s just a question of dropping the “8”, as if “8” is a null character in Language A. In the case of the next most popular words, “1oe” (A) and “1c89” (B), it looks like “oe” (A) converts to “c89” (B). And so on.

If we look at the most popular nGrams (substrings) in both Languages, perhaps there is a mapping that translates between the two. Perhaps the cipher machinery that was used to generate the text had different settings, that produced Language A in one configuration, and Language B in another. Perhaps, if we look at the nGram correspondence that results in the best match between the two Languages, a clue will be revealed as to how that machinery worked.

This involves some software (I’m using Python now, which is fun). The software first calculates the word frequencies for Language A and B in a set of folios (the table above is an output from this stage). It then calculates the nGram frequencies for each Language. Here are the top 10:

[Figure: nGramFrequencies]
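For readers who want to reproduce this kind of count, here is a minimal Python sketch of nGram counting. It assumes nGrams are taken within words and up to length 3; both choices, and the function name, are assumptions rather than a description of the actual software.

```python
from collections import Counter

def ngram_frequencies(words, max_n=3):
    """Count every substring of length 1..max_n that occurs inside a word."""
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for start in range(len(word) - n + 1):
                counts[word[start:start + n]] += 1
    return counts

# The two most common words quoted above, used here as a tiny sample
counts = ngram_frequencies(["8am", "am"])
top_ten = counts.most_common(10)
```

Running the function separately over the Language A words and the Language B words gives the two nGram lists that the mapping search works with.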

The software then runs a Genetic Algorithm to find the best mapping between the two sets of nGrams, so that when the mapping is applied to all words in Language B, it produces a set of words in Language A the frequencies of which most closely match the frequencies of words observed in Language A (i.e.  the frequencies shown in the first table above).

Here is an initial result. With the following mapping, you can take most common words in Language B, and convert them to Language A.

Table for converting between a Language B word and a Language A word

A couple of remarks. This is an early result and probably not the best match. There are some interesting correspondences:

  • “9” and “c” are immutable, and have the same function in both Languages
  • “4o” in Language B maps to “o” in Language A, and vice versa!
  • “ha” in Language B maps to “h” in Language A, as if “a” is a null

In the Comments, Dave suggested looking at word pair frequencies between the Languages. Here is a table of the most common pairs in each Language.

Common word pairs in Languages A and B

For clarity, I am using what I call the “HerbalRecipeAB” folios for this study i.e.

Using folios for HerbalRecipeAB : [107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]

More results coming …

f75r cures, pregnancy, life and death – Latin Plainchant

May 15, 2012

Here is a result obtained using a Genetic Algorithm to match the text on f75r to Latin. The training corpus I used was a large file of Latin plainchant (the idea being that repeated “words” in the VMs show similarities to chant).

First, here is the folio with the translated words overlain in red:

Folio f75r decrypted as Latin plainchant

The genetic algorithm searched for a set of glyphs that each matched to a pair of Latin letters.

Most of the decrypted words are valid Latin and match words in the plaintext I used to train the GA. Some are Latin but do not appear in the plaintext. The other, invalid, words could be caused by errors in the pair matching.

Or the whole thing could well be nonsense! This is likely – I asked Joel Stevens to translate some of the Latin, and here is what he said:

On first inspection, it seems to be random non-sense. For example:
recita lugete vena dans veta ia debent lustrata lite

Would mean: Recite! Mourn! Blood-vessel giving. Forbid! Oh, they owe things that were purified by the lawsuit.

I’m not really sure how to make sense of it. I don’t see anything that stands out as an obvious sentence. Maybe some words are filler and need to be dropped… or maybe there is a hidden order that needs to be found (assuming these are the correct words).

Here is the Latin:

piraextita recita' lugete' idpirata lucrte vena'
dans' veta' ia' debent' lustrata' vanagete vamirata lite'
lugens' esnt levata' nuta' gens' veanta rochum' nogete le 
dato' uascie excita' curi gent na veta' le veta' luedicta 
veexti vata' arta' te' chum no' no' amicta' luedet luga' 
mori' edente' noga date' reri' lugens' feta' luedicta luga' 
morata' luaena uechum vana' lugete' ad' nt vana' luga'
pate vata' lugertta audita' lugens' vita' curata' resona'
lupina' feta' lumina' lugeum veon lugete' no' vana' lant 
aule lucrte ista' veta' lugens' vita' na ruri' mena' strata' 
luedicta lugens' nota' luedicta na lugent' reti' vena' date' vageas 
dedita' lugent' nota' veta' te' no' iret' veta' na no' vena' luga' 
morata' edicta' lumina' lumina' lumina' lumina' lugens' novena' 
lace' si' educta' lugens' na novena' lugens' vita' lugens' ruti' sita' 
lans' lugens' luuacinota lacium vana' pant' le vena' reista luga' 
aurata' lugete' verata nt ha' urgens' ad' revena lugens' vana'
morati' vemota curatita dant' bunt' id' ncnt vena' lugeas 
pant' vesata lunt vena' lunt ruchum este' late' nota' 
luaerata lunt veta' luga' veta' ulta' lugens' ti 
iu' veta' lute vata' stri sschum veta' lumicium 
clrechum vata' poma' le sebete no' te' ut' 
sati' veta' lugens' vana' le acta' gete vana' tute' 
mori' aeri' luedet lugent' deri lunt vana' lugeas 
lugens' no' edet' luuatita lans' vato te' no' 
lans' orta' luedicta id' na luedicta strata' tuas' 
dans' no' veno dans' no' luia' date' muta' 
gnsiurte duno' luca' alti' vena' resina' date' ruti' rurata na sunt' 
errata' morata' luedet pant' no' dace' veto' lunt nt amicta' luedicta strata' 
dans' fisi' na uachns no' recina lunt dans' novata' luedicta vana' lucrum' 
lu' dans' irquta lumita lugens' vaga' lugent' dans' vana' resino vena' reti' nodo' 
aurata' lumita vita' lugete' vena' vata' lumini' lumina' dant' na 
locuta' dant' vena' lugens' date' pena' vena' lugete' vena' usta' 
luedet nt vena' lunt vana' lugens' ferata rorata' dalias 
pr' nomina' resi date' ruta reti' ruti' no' gens' nomina' 
lugens' ambita' lugens' date' date' vena' lugens' nona' 
pant' mirata luedicta luncta' reti' ruti' date' date' na 
lumina' na ambita' lumina' luedicta lumiista nt 
lumirata luedicta lumina' lumina' lumirata usta' 
serata' luedicta lumirata noedicta ruri' ncusta 
date' vena' lugens' vena' dant' edicta' te' vena' 
paedicta luedicta rucina luga' na edicta' ma 
aurata' lumina' luedicta luiget luigicta date' 
serata' vagete nona' lugens' fiti ruti' nona' 
sprata lans' rerata lumina' rurata renona ruti' ruut nochns ha' date' na 
derata orta' lumita rerata lalint ruta rurata lumena rurata rechns 
lumita lumina' veno lumina' dans' vena' eg plnt vana' noti' 

Landini’s Challenge

February 26, 2010

An excerpt from Landini’s challenge text (text he generated using an undisclosed method, supposed to replicate the features of the VMs text):

qopchdy chckhy daiin ¬ ½shxam chor otechar okcharain ryly sheodykeyl
sheodykeyl daiin shd okaiin qokain qokal yteoldy otedy qokydy opchedy
otal oldar chor lkeedol eer ol dair chedy daiin ockhdar cpheol chedy
xar qokaiin y chedy kshdy ololdy aiin char y okeey oldar qokaiin lsho
daiin olsheam qoeey chedy dchos pshedaiin shedy d qol key sheol or
cpheeedol qokedy qokaiin daiin cthosy chedy ar aiir chedy teeol aiin
cheey y cheam oky qokaiin daldaiin loiii¯ ar shtchy chedy aldaiin
ydchedy daiin shd okaiin qokain daiin qotcho chedy daiin lchy olorol
otedy qockhor shol daiin paichy chedy ar shdair chedal chedy kchdaldy
chckhy otakar qokedy s qooko chor daiin otcholchy chedy daiin koroiin
qokain qokedy kosholdy ol kchedy kshdy qokaiin ar shaikhy olaldy seees
ar oteodar chedy oteeol shedy daiin key dain daiin keeokechy chedy
lchey ail lchedy sches ol dsheeo otol odaiin qokain daiin sheeod chshy
chedy qoekedy tair sain qocheey aiin cheey chaiin ols shedy sheolol
daiin lcheol chedy daiin pchoraiin oshaiin chedy lchey lor sal aiin
cheey y dsheom shedy todydy cheor saiin shdaldy daiin ofchtar daiin

Here are some thought-provoking results from analysing the text, as suggested by Knox, the VM text, and comparisons with English, Latin, German, French and Spanish. These use a new form of the Genetic Algorithm, described below.

Summary

It looks to me like Landini either generated his text from a transcription of the VM itself, or his algorithm for generating that text is a good emulation of the encoding process used in the VM. In other words the Landini “language” is a good candidate as a plaintext language for the VM, as opposed to the European languages tested.

Results

Here is a table which shows the GA’s efficiency at converting/translating between Voynich, Landini, and the other languages.

(In the table, the best possible score is 1.0 – see below for an explanation)

Asking the GA to translate English to English, or Latin to Latin, etc. results in a high efficiency score, as expected. Note that the Landini to Landini  efficiency is 0.97 – almost perfect.

The GA performs moderately at converting between the languages and the Landini text. But what is most striking (to me) is the good efficiency for converting Voynich to Landini (0.74) and Landini to Voynich (0.89).

Some Notes on the table

To look at this I revised my GA code so that it was more flexible, and I jettisoned the use of separate dictionaries. Here is how the GA now functions. It can convert/translate between any language text samples.

1) Two text files are read in: the “source” text, and the “target” text. This could be, for example, a source file containing Landini’s text, and a target file containing Spanish text, if we want to convert from Landini to Spanish.

2) The text in each file is processed separately, producing two word lists, and two sets of n-Gram frequency tables.

3) The chromosomes are generated with random mappings between the source n-Grams and the target n-Grams

4) The GA evolves the chromosomes by trying to maximise their cost. The difference now is that when a target word is generated from the source text using the mappings, it is looked up in the target word list created in 2) above, rather than in a separate dictionary.

5) After training, the best chromosome can have a maximum cost value of 1.0, which would correspond to a perfect conversion between the source text and the target text (i.e. every word produced from the source text is found in the target text dictionary)

6) So we can feed the GA with two identical texts, and after training the score of the best chromosome should be 1.0, and indeed it approaches that (it doesn’t quite get there because only the top 100 n-Grams are translated, and so some characters in the source text cannot be translated).

7) The word and n-Gram frequency lists are made from the entirety of each text, but (for this exploratory study) the training takes place on only the first 50 “words” in the source text, and uses only the first 100 n-Grams for mapping.  Thus if the 50 words of Voynich chosen contain several rare characters, then for those the mapping will fail because those rare characters do not appear in the n-Gram list, and this will result in a lower score.

8) In all cases the “X->X” score in the table (i.e. the diagonal)  represents the best score possible for that language, and is a normalisation for the other numbers in the table. I should really revise the table and divide out the off-diagonal scores by the diagonal normalisations.

9) An improvement would be to configure the n-Gram list to be, say, 200 long, and use more source (Voynich) words for the training. The downside of this is mainly execution speed.

10) These runs were with n-Grams up to 3: it would be better to go to 4 at least.

11) I think Landini gets good scores because the character set he uses is very small. Knox comments: “A factor must be that the Landini Challenge has built-in frequency matches to any transcription of the VMs. Also, there is no meaningful correspondence in the letter sequence of one word to another in Landini. The difficulty fits what I said the VMs may be.”
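The scoring described in steps 4) to 6) can be sketched in Python as follows. This is an illustration under stated assumptions, not the original code: the names translate and score are hypothetical, the mapping is a plain dictionary, and the conversion is a greedy longest-match-first scan.

```python
def translate(word, mapping, max_n=3):
    """Convert a source word chunk by chunk, trying the longest n-gram first.
    Returns None when some part of the word has no mapping."""
    out = []
    pos = 0
    while pos < len(word):
        for n in range(min(max_n, len(word) - pos), 0, -1):
            chunk = word[pos:pos + n]
            if chunk in mapping:
                out.append(mapping[chunk])
                pos += n
                break
        else:
            return None  # untranslatable character, as described in step 7)
    return "".join(out)

def score(source_words, target_vocabulary, mapping):
    """Fraction of source words whose translation appears in the target word list."""
    hits = sum(1 for w in source_words
               if translate(w, mapping) in target_vocabulary)
    return hits / len(source_words)

# Identical source and target text, with one character ("x") left unmapped
words = ["daiin", "chedy", "xar"]
vocabulary = set(words)
partial_identity = {c: c for c in "daiinchedy"}
partial_score = score(words, vocabulary, partial_identity)
full_identity = {c: c for c in "daiinchedyxr"}
full_score = score(words, vocabulary, full_identity)
```

The partial mapping cannot reach 1.0 even though the two texts are identical, which mirrors the remark in step 6) about characters that fall outside the mapped n-Gram list.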

Genetic Algorithm

February 26, 2010

Basic Idea

In the plaintext, convert each group of 1, 2, 3 or 4 characters into a Voynich group of 1,2,3 or 4 characters. We call this a “mapping”. For example, when creating Voynich from Latin, a cipher mapping might be:

e => o, i => 9, …

er => 4o, is => ok, ti => 8a, …

ent => 9k, ant => A, …

… and so on. This can be encoded as two parallel arrays, “seek” and “repl”, in which the string at a given position in “repl” maps to the string at the same position in “seek”. For example:

	String seek[] = {"4ok1",
			 "4oh","8am","1oe","4ok","ok1","o89","1oy","oh1","o8a","oha","ohc","c89","1co","k1o","1c9",
			 "c79","h1o","1o8","oko","oho","coe","8ae","co8","k19","h19","8ay","ham","hcc","koe","oka",
			 "hco",
			 "1o", "oe", "oh", "4o", "ok", "8a", "89", "am", "1c", "oy", "o8", "co", "ay",
			 "k1", "h1", "19", "hc", "c9", "ha", "ae", "79", "2o", "cc", "ko", "ho", "c8", "9h",
			 "9k", "c7", "2c", "ka", "kc", "1a", "an", "h9", "o,", "e8", "k9", "ap", "8o", "e,",
			 ",1", "7a", "81",
			 "o",  "9",  "1",  "a",  "8",  "c",  "h",  "e",  "k",  "y",  "4",  "m",  ",",
			 "2",  "7",  "s",  "K",  "C",  "p",  "g",  "n",  "H",  "j",  "A"};

	//Latin
	String repl[] = {"un",
			 "ri", "on", "f",  "es", "g",  "em", "de", "se", "co", "ne", "ur", "si", "ic", "ui", "me",
			 "ere","eb", "la", "ma", "le", "id", "bu", "nti","no", "cu", "eba","qui","ie", "al", "ul",
			 "ns",
			 "c",  "d",  "l",  "er", "is", "ti", "nt", "en", "re", "in", "um", "am", "us",
			 "te", "it", "v",  "tu", "ta", "ra", "di", "an", "ni", "li", "et", "ba", "ae", "mi",
			 "ent","st", "h",  "nd", "ci", "pe", "im", "ua", "io", "tur","il", "ve", "iu", "as",
			 "vi", "ita","ca",
			 "e",  "i",  "a",  "t",  "u",  "s",  "r",  "n",  "m",  "o",  "p",  "b", "q",
			 "qu", "at", "or", "ia", "ar", "ce", "ib", "ec", "ab", "ru", "ant"};

Such an algorithm is used inside a Chromosome of the Genetic Algorithm. The Chromosome decodes Voynich into Latin by  matching character groups in the Voynich word against each of the strings in the “seek” list in turn. If a match occurs, then the  Voynich group is translated into the Latin group in the “repl” list at the same position. Thus “4ok1” in Voynich is translated into “un” in Latin.

Once the Voynich word has been translated into Latin, the Latin word is looked up in a Latin dictionary. If the word is found, then the “cost” (or “quality”) of the Chromosome is increased … if the word is not found, then the cost is decreased. After all words in the Voynich text have been converted to Latin, and the aggregate cost of the Chromosome evaluated, it can be judged whether the mapping “seek” to “repl” is a good one or not.
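The decoding and costing just described can be sketched in Python. The sketch uses a small subset of the seek/repl pairs listed above and a two-word stand-in dictionary; the function names, the “?” marker for unmatched glyphs, and the +1/-1 cost increments are assumptions made for illustration.

```python
# Subset of the seek/repl arrays above, kept in the same order (longest first)
SEEK = ["4ok1", "4o", "ok", "8a", "o", "9", "1", "a"]
REPL = ["un",   "er", "is", "ti", "e", "i", "a", "t"]

def decode(word, seek=SEEK, repl=REPL):
    """Translate a Voynich word by trying each seek string in turn at the
    current position and emitting the repl string at the same index."""
    out = []
    pos = 0
    while pos < len(word):
        for s, r in zip(seek, repl):
            if word.startswith(s, pos):
                out.append(r)
                pos += len(s)
                break
        else:
            out.append("?")  # no seek string matches this glyph
            pos += 1
    return "".join(out)

def chromosome_cost(voynich_words, dictionary, seek=SEEK, repl=REPL):
    """Add 1 for each decoded word found in the dictionary, subtract 1 otherwise."""
    return sum(1 if decode(w, seek, repl) in dictionary else -1
               for w in voynich_words)

toy_dictionary = {"et", "un"}  # stand-in for a real Latin word list
cost = chromosome_cost(["oa", "4ok1", "8a9"], toy_dictionary)
```

Here “oa” decodes to “et” and “4ok1” to “un”, both in the toy dictionary, while “8a9” decodes to “tii”, which is not, so the aggregate cost is 1.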

Generating the Chromosome Population

We generate a large number of Chromosomes, each of which has a different, randomised, “seek” to “repl” mapping. We do this by simply shuffling the order of the “repl” strings in each Chromosome.

Thus, one Chromosome may map “4ok1” to “s” and another may map it to “qui”.
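A minimal Python sketch of this step, assuming each Chromosome is represented simply as a shuffled copy of the “repl” list (the function name and the seed parameter are illustrative):

```python
import random

def make_population(repl, size, seed=None):
    """Create `size` chromosomes, each a randomly shuffled copy of repl.
    The seek list stays fixed, so position i of a chromosome is the
    candidate translation of seek[i]."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        chromosome = list(repl)
        rng.shuffle(chromosome)
        population.append(chromosome)
    return population

base_repl = ["un", "ri", "on", "f", "es"]
population = make_population(base_repl, size=200, seed=1)
```

Shuffling keeps every “repl” string in each Chromosome exactly once, so the initial population explores different pairings without inventing new strings.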

This population of Chromosomes is then evaluated: each Chromosome converts the Voynich words to Latin, and each then gets a cost. The higher the cost, the better. The highest possible cost would be a Chromosome that had a seek-repl mapping that produced a valid Latin word for each Voynich word.

Training the Chromosomes

The Chromosomes are ordered in decreasing cost, and then the best of them (i.e. at the top of the list) are “mated” together to produce offspring Chromosomes. The mating process essentially involves taking sequences of the “repl” strings from both parents and combining them to form a new “repl” string.

Some of the offspring Chromosomes are then “mutated”. This involves replacing one of the “repl” strings with some randomly selected letters from the Latin character set.

The process repeats (ordering the Chromosomes, mating the best ones, mutating the offspring) until a predefined cost value is reached, or the population of Chromosomes refuses to improve itself.
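The loop above can be sketched in Python as follows. This is an illustration under assumptions: one-point crossover stands in for “taking sequences from both parents”, the run stops after a fixed number of epochs rather than on a cost threshold, and all names and parameter values are hypothetical.

```python
import random
import string

def mate(parent_a, parent_b, rng):
    """One-point crossover: head of one parent joined to the tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(chromosome, rng, alphabet=string.ascii_lowercase, max_len=3):
    """Replace one repl string with a few randomly chosen letters."""
    mutated = list(chromosome)
    pos = rng.randrange(len(mutated))
    length = rng.randint(1, max_len)
    mutated[pos] = "".join(rng.choice(alphabet) for _ in range(length))
    return mutated

def train(population, cost, epochs=50, elite=4, mutation_rate=0.2, seed=0):
    """Order by cost, keep the best, refill with (sometimes mutated) offspring."""
    rng = random.Random(seed)
    for _ in range(epochs):
        population = sorted(population, key=cost, reverse=True)
        parents = population[:elite]
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            child = mate(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        population = parents + children
    return max(population, key=cost)

# Toy cost: number of positions that agree with a known target arrangement
target = ["e", "i", "a", "t"]
def toy_cost(chromosome):
    return sum(1 for x, y in zip(chromosome, target) if x == y)

start = [["a", "t", "e", "i"], ["e", "t", "a", "i"], ["i", "e", "t", "a"],
         ["t", "a", "i", "e"], ["e", "i", "t", "a"], ["a", "i", "e", "t"]]
initial_best = max(toy_cost(c) for c in start)
best = train(start, toy_cost, epochs=20, elite=2)
```

Because the best parents are carried over unchanged into each new generation, the best cost in the population can never decrease from one epoch to the next.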

In the end, the best, trained Chromosome will contain the optimal arrangement of “seek” to “repl” mappings for conversion of Voynich to Latin.

The same procedure can be used for a Voynich to English, to German, French or any other language, provided that a dictionary and substantial texts are available to process.

First Results – Voynich to Latin

This is a limited attack on the first five “sentences” of f1r, using 200 chromosomes and a Latin dictionary of around 15,000 words. The best chromosome scores 9.4 after 500 training epochs (cf a score of 20 for a one-to-one translation of Latin into Latin).

Here are the deciphered sentences:

1) Voynich: fa19s 9hae ay Akam 2oe !oy9 ²scs 9 hoy 2oe89 soy9 Hay oy9 hacy 1kam 2ay Ais Kay Kay 8aN s9aIy 2ch9 oy 9ham +o8 Koay9 Kcs 8ayam s9 8om okcc9 okcoy yoeok9 ?Aay 8am oham oy ohaN saz9 1cay Kam Jay Fam 98ayai29

Latin: ?ereieas vias is asasita meas ?ereis ?astuas is quinti mensis asereis vis ereis sttunti viasita viis as?as alis alis qui? asisere? nti quere ere viis ?ita alamisis altuas ereis asis quiita quantis querenti ntiviquis ?asis qui amita ere am? asere?is viis alis ?is ?is isereere?viis

Voynich Herbs

February 26, 2010

Edith Sherwood has a web site where she details compelling possible identifications for the plants depicted in the “herbal” pages of the VM.

Dana Scott’s page also has plausible identifications for the plants.

As has often been pointed out, if we look at the first Voynich “word” that appears on each page of the herbal part of the VM, we find that those words are unique, or appear elsewhere very rarely. It thus seems reasonable that the words may be the names of the plants depicted.

The GA was set up to find a set of n-Gram mappings that would convert a list of 111 Voynich first herbal words into Latin/English or Spanish. For this, dictionaries of Latin, English and Spanish herb/plant names were used.

The GA sought a mapping that would convert all the Voynich words for herbs/plants into as many valid plaintext (Spanish, English, Latin) words as possible. The best result was for a mixed English/Latin dictionary (see table): 31 of the 111 Voynich words were converted, a success rate of about 28%.

(One should never expect 100% success, due to missing names in the dictionary, transcription errors, missing n-Grams, incomplete n-Grams etc.)

The results are shown below in tabular form, together with Dana Scott’s and Edith Sherwood’s identification. The first column shows the folio in the VM, the second shows the first Voynich word on that folio. For the GA identification columns (3 and 4) the Voynich mapped word is shown, in quotation marks if not found in the associated dictionary, and in bold if found in the dictionary.

Note that, probably unsurprisingly, nowhere do the IDs from the GA in Spanish, English/Latin and Scott/Sherwood, agree! NOT YET, anyway 🙂

(What amuses me about this mapping technique is that it tends to produce words that sound plausible in the target language. E.g. for f4r the Latin/English word “paptise” sounds like a valid word.)

Folio Voynich 1st Word Candidate GA ID, Spanish Candidate GA ID, Latin/English Dana Scott ID, English Dana Scott ID, Latin Sherwood ID, Latin Sherwood ID, English
f1r fa19s costa “greica”
f1v h1s9 rabo geum Deadly Nightshade Atropa belladonna Hyoscyamus niger Solanum nigrum Solanum dulcamara Atropa belladonna Deadly Nightshade
f2r h98an9 “jzba” “ariapha” Cornflower Centaurea cyanus Centaurea diffusa Diffuse Knapweed
f2v hoom “meic” “padi” Water Lily Nymphaea candida Nymphoides Nymphoides
f3r k2cos chinita (Impatiens) arnica Celosia argentea Feathery amaranth
f3v hoam menta (mint) paris Helleborus foetidus Dungwort
f4r ho8ae19 “mezirn” “paptise” Saxifraga cespitosa Alpine Saxifrage
f4v j1oom pastora (Poinsettia) “oigle” Campanula rapunculus Rampion
f5r h2o89 “piyn” “hicse” Arnica montana Wolfs Bane
f5v hA1coy malanga (Malanga) cirsium Tennis Racket Plant Agrimonia eupatoria Malva sylvestris Mallow
f6r foay “oote” “erk” Acanthus mollis Bear Breeches
f6v hoay9say1Chay “meotendoteisedh” “pakpikrtsst” Eryngium maritimum Sea Holly
f7r f1o8am “saynta” acris Trientalis europea Starflower
f7v joe29 “rden” anise Myrica gale Bog Myrtle
f8r g2oe “dno” “miv” Pisum sativum Green Pea
f8v Ko8 “anop” “amot” Symphytum officinale Comfrey
f9r k98eo “uardna” “cernur” Ricinus communis Casteroil
f9v fo1oy “oveh” “erut” Heartsease, Wild Pansy Viola tricolor Violaceae Viola
f10r g1oK9 “pohon” “apryse” Cichorium pumilum Chicory Endive
f10v gam tora (Tora Tree) gale Linnaea borealis Twinflower
f11r k2oe chino (Chinese Hat Plant) “arv” Rosmarinus officinalis Rosemary
f11v goe81o89 “albaveaca” “maadud” Curcuma longa Turmeric
f13r koy3oy “lenga” “mdoium” Banana Banana
f13v hoaiy “memh” “paft” Lonicera periclymenum Honeysuckles Woodbines
f14r g1o8am “poynta” “apcris” Scorzonera Black Salsify Vipers Grass
f14v g891om “uomic” “gesdi” Stachys monnieri Wood Betony Heal-all Sel-heal Woundwort
f15r k2oy “chiga” “arium” Sonchus oleraceus Sow Thistles
f15v gayoy “t8h” “gabt” Paris quadrifolia Herb Paris
f16r go1co89 “alblanyn” “marscse” Cannabis Cannabis
f16v g1yAm “potoora” “aptule” Chrysanthemum Chrysanthemum
f17r f2o89 “hayn” “ulcse” Catananche caerulea Cupids Dart
f17v g1o8oe “poyno” “apcv” Dioscorea Yams
f18r g8yaz89 “ullngn” “gmeagse” Aster alpinus Aster
f18v koe8 la (?) mad Telfairia Fluted pumpkin
f19r g1oy “poga” apium Polemonium coeruleum Greek Valerian
f19v go1am “albbora” mantle Draba nivalis Nailwort
f20r h81o89 “caveaca” woud Astragalus hypoglottis Milk vetch
f20v faIsay “crrote” greek Cynara cardunculus Cardoon
f21r g1oy “poga” apium Anagallis arvensis Pimpernel
f21v koe829 “laol” “madpe” Dictamnus albus Burning bush False Dittany White Dittany Gas Plant
f22r goe “albv” “maus” Verbena officinalis Common Vervain Holy Herb
f22v g9samoy “..dah” “hnshot” Tulip Tulip
f23r g9818op “.fhilo” “hsthlo” Pulsatilla vulgaris Pasque flower
f23v go8azoe “albzucv” “mapacus” Borago officinalis Borage Star Flower
f24r goyoy9 “alb..” “maby” Cucumis sativus Cucumber
f24v k1o8ay coyote (wild) rock Ficus religiosa Sacred Fig Bo Tree
f25r f1oe89 “sanoaca” “avd” Wild Thyme
f25v goCam “albcuora” “malile” Isatis tinctoria Woad
f26r g%coh9 “spnij” lunaria Prunella vulgaris Self heal
f26v g1c8ay pochote (Pochote) “apgok” Lens culinaris Lentil
f27r hsoy manga (Mango) “veium” Spinacia oleracea Spinach
f27v fo1ou oveja (?) eruca French Marigold Tagetes patula Dianthus superbus Dianthus
f28r g1o8ay “poyote” “apck” Aristolochia Smearwort Birthwort Pipevine
f28v h2oe pino (Pine) “hiv” Dahlia Dahlia imperialis Rhododendrons Rhododendrons
f29r gosam “alb.ora” “mansle” Lactuva sativa longifolia Romaine Cos Lettuce
f29v hoom “meic” “padi” Nigella sativa Roman coriander
f30r oh1cs9 “elanbo” “inrsum” Prunella vulgaris Healall
f30v Ks1an rubia (Madder) montana Cuscuta europaea Dodder
f31r hcc8c9 lichi (Lychee) “rgoio” Erigeron acris Fleabane
f31v go8az “albzon” “mapnn” Fernleaf yarrow Achillea filipendulina Valerian Valerian
f32r f1am santa (?) “aris” Veronica triphyllos Speedwell
f32v h1co8am “ranizora” “genple” Campanula rotundifolia Harebell
f33r k28ay “chizh” “arpt” Silene vulgaris Bladder Campion
f33v kayay “qllh” “opmet” Masterwort Astrantia major Tanacetum parthenium Feverfew
f34r g1cocj19 “ponianos” “apnbie” Anemone hortensis
f34v hs189 “mansn” “vewse” Lunaria annua Honesty Money Plant
f35r Koo anona (Custard Apple) amur Cichorium intybus Radicchio
f35v gay1oy “trtga” galium Ribes nigrum Blackcurrant
f36r j1af8aN “pa.nzti” “onupfl” Delphinium staphisagria Delphinium
f36v g1ayos9 “pooteesn” “apksise” Lamium amplexicaule Henbit
f37r koGoe “luiv” malus Mentha longifolia Mint
f37v h2o89 “piyn” “hicse” fedtschenkoi englerii Emilia fosbergii Tassel flower
f38r koeoy “lilh” “mmut”
f38v oh1oj “eveet” inula Euphorbia myrsinites Myrtle Spurge
f39r kc7o128 “goguadp” “gienmpot”
f39v g7aiy “inmh” “naft”
f40r g1c9 “poi” apio Erodium malacoides Storks bill
f40v j1c7an “pagmo” “oospo” Epiphyllum oxypetalum Crocus vernus Crocus
f41r j2c9hc8aecc9 “roilizrii” “ediorpcuio” Origanum vulgare Wild Marjoram
f41v hcSo8ae “lirbzv” “riupus” Coriandrum sativum Coriander Cilantro
f42r 2o “ah” st
f42v k1o˛ cola (?) rosa Aquilegia vulgaris Columbine Culverwort
f43r kayo8am “q.zora” “opbple” Stellaria media Chickweed
f43v g8saiy9 “u.lbn” “gnsicse” Elytrigia repens Couch grass
f44r k2o8g9 “chiy.” arch Mandragora officinarum Mandrake
f44v k2o china (Impatiens) “arur” Apium graveolens Celery
f45r g9h98ae “.jzv” “hariapus” Atriplex hortensis Orach Saltbush
f45v hosay9 “me..” pansy Lavandula angustifolia Lavender
f46r g1coJ9 “ponitr” “apnta” Leucanthemum vulgare Oxeye Daisy
f46v jo79e3c7 “rimvig” “andretos” Tanacetum parthenium, Chrysanthemum parthenium Inula conyza Ploughmans Spikenard Great Fleabane
f47r g1aiy “pomh” “apft” Lady’s Mantle, Lion’s Foot Alchemilla vulgaris Rosaceae Sempervivum tectorum Houseleek
f47v g2cok “dnier” minor Arnica montana Pulmonaria officinalis Lungwort
f48r g28am “dzora” “miple” Adonis Vernalis False Hellebore
f48v g1co819 “ponifn” “apnsse” Ruta graveolens Rue Herb of Grace
f49r gA2oe “ceahv” costus Nymphaea caerulea Blue Nile Lotus
f49v g he wort
f50r g2coy “dnih” mint Astrantia major Masterwort
f50v k19 con (?) rose Telopea speciosissima Gentiana frigida Stiff Gentian
f51r k2oe819 “chinofn” “arvsse” Cakile maritima Searocket
f51v go2o89 albahaca (Basil) “mastd” Salvia officinalis Sage
f52r k8oh1F9 “queacn” “toinnise” Anemone coronaria Poppy Anemone
f52v g1oy “poga” apium Polystichum setiferum Fern
f53r hA8ap “mazlo” “ciplo” Achillea Ptarmica Sneezewort
f53v k2oy3c9 “chigamin” “ariumocse” Hieracium aurantiacum Hawkweed
f54r go8am “albzora” maple Cirsium oleraceum Cabbage thistle
f54v g1co8ay “ponizh” “apnpt” Bittersweet Nightshade Solanum dulcamara Perovskia atriplicifolia Russian Sage
f55r go8am “albzora” maple Fumaria officinalis Fumitory
f55v h1C8189 “raecsn” “geriwse” Forest lily Veltheimia bracteata Broccoli Broccoli
f56r ok1ae “tebv” “trntus” Drosera Sundews
f56v h1cok “ranier” “genor” Cycas revoluta Sago Palm
f57r joccoHc9 “riopei” “anomiaio” Sherardia arvensis Blue Field Madder
f65r Alchemilla vulgaris Ladies Mantle
f65v Centaurea cyanus Cornflower
f66v Satureja montana Winter Savory
f87r Satureja hortensis Summer Savory
f87v Senecio Primula vulgaris Primrose
f87v Kleinia Pedicularis flammea Lousewort Wood Bettony
f89v Actaea spicata Baneberry
f90r Conyza bonariensis Fleabane
f90v Eruca vesicaria Arugula Rocket
f93r Cynara cardunculus Artichoke
f93v Lupinus Lupin
f94r Botrychium lunaria Botrychium lunaria Moonwort Moonfern
f94v Agrostemma Githago Corncockle Red Campion
f94v Glycyrrhiza glabra Liquorice
f94v Plantago lanceolata Ribwort Plantain Kemps
f95r Berberis Sambucus nigra Elderberry
f95v Althaea Rosea Hollyhock
f96r Angelica archangelica Garden Angelica
f96v Tamus communis Black Bryony

Genetic Algorithm – f27v

February 26, 2010 1 comment

The text, in the Voyn_101 transcription, reads:

fo1ou 1of 1o3o soe9 2oe 9k1ay og1oy9 h1oX1oy 819 1hay ok19 29 29 &19 829 h19 1co 8ai89 819 h1c9 h19 &1oh19 82o 8▀y 1o829 ck1co89 2e8 oh1o 19 h1cc8 1e 1oe ho8 o oh2o 8o1ccsp 4oh9 2hccoy1o8ay 2hoe 1ok19 Ko8oe 82o h1sss ohC89 81s19 sok189 2o 2o9h1o 289 818 1s1s9 ok189 oh2cs Ah1oh29

The text contains 54 “words” (groups of characters), of which 50 are distinct. The GA tries to maximise the score of a conversion of the 50 distinct words into Latin, checking whether each converted word appears in a Latin word dictionary.

1) Allowing 1-1 mapping i.e. 1 Voynich character maps to 1 Latin character

0 0.2278094451386928 GA$Chromosome@1a07791 prob=1.5168475878361884 Good=16 / 50 = 32.0%
S: o 1 9 8 h c 2 s y k a e & f X p A g u i K C 3 4 ▀
R: i e o r n c u d l g p s b a f q h x y µ m t j v z
aieiy eia' eiji diso uis ogepl ixeilo neifeil reo enpl igeo uo uo beo' ruo' neo' eci rpµro reo neco' neo' beineo rui' rzl
eiruo cgeciro usr inei eo' neccr es' eis' nir i' inui rieccdq vino' unccileirpl unis' eigeo miris' rui' neddd intro' redeo' digero'
ui uionei uro rer ededo igero inucd hneinuo
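
The 1-1 conversion above can be reproduced with a few lines of Python. This is a minimal sketch, not the GA code itself: the function names are illustrative, and `latin_words` is a tiny stand-in for the Latin dictionary, holding only two words that the run above marks as valid. The table is the S:/R: pair printed for this run.

```python
# Sketch only: names and the two-word dictionary are illustrative assumptions.
S = "o 1 9 8 h c 2 s y k a e & f X p A g u i K C 3 4 ▀".split()
R = "i e o r n c u d l g p s b a f q h x y µ m t j v z".split()
MAPPING = dict(zip(S, R))

def convert(word, mapping):
    """Replace each Voynich character by its Latin counterpart (1-1 mapping)."""
    # Characters without a mapping are shown as a vertical bar.
    return "".join(mapping.get(ch, "|") for ch in word)

def fraction_valid(words, mapping, latin_words):
    """Fraction of converted words that appear in the Latin word list."""
    converted = [convert(w, mapping) for w in words]
    good = sum(1 for w in converted if w in latin_words)
    return good / len(words)

latin_words = {"eia", "beo"}          # stand-in for the real dictionary
sample = ["fo1ou", "1of", "&19"]      # three words from the f27v text
print([convert(w, MAPPING) for w in sample])
print(fraction_valid(sample, MAPPING, latin_words))
```

The "Good=16 / 50" figure in the report is the same kind of ratio, taken over the distinct words of the folio rather than a three-word sample.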

2) 1-2 mapping i.e. each Voynich character can map into 1 or 2 Latin characters

0 0.1504950444038535 GA$Chromosome@8a2f6b prob=1.4969215520914312 Good=15 / 50 = 30.0%
S: o 1  9 8 h c 2 s y k a e &  f  X  p A  g  u i K  C  3  4  ▀
R: i e is c p n s l a o r t m us er ti b in re u d tu nt it en
usieire eius' einti litis' sit' isoera iineiais peiereia ceis epra ioeis sis' sis' meis' csis peis eni crucis' ceis penis'
peis meipeis csi cena' eicsis noenicis stc ipei eis' pennc et' eit pic i' ipsi' ciennlti itipis spnniaeicra spit eioeis dicit' csi
pelll iptucis celeis lioecis si' siispei scis' cec elelis ioecis ipsnl bpeipsis

3) 2-1 mapping i.e. each Voynich character or pair of characters maps to 1 Latin character

0 0.1600852807737927 GA$Chromosome@2ccccf prob=1.410155608070086 Good=15 / 50 = 30.0%
S: 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a o 1 9 8 h c 2 s y k a e
R:  i  d  e  l  r  m  b  f  t  g  q  x  p  y  v  c  s  j o z a u µ w h k £ + = n
|oi| i| i|o kca vn arx' o|i£a do|i£ ul zµx pl' s' s' |l us' da' to' j|b ul dwa da' |ida uv u|£ ius' wrqb hnu ei' l' dgu zn
in' µm o' ev uotwk| |ea hµgfij£ hµc' ira' |mc uv dkkk e|b uyl kpzb v' vado' hb uzu yya pzb ehwk |des

4) 2-2 mapping i.e. one or two Voynich characters map to one or two Latin characters

0 0.6279610718846604 GA$Chromosome@47a prob=1.4749867245934525 Good=14 / 50 = 28.0%
S: 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a o  1 9 8 h c 2 s  y k a e
R:  i  e  l is at in  s  d  n  a tu re er  u ti  p um  b o en m t r c g f us nt it v
|oi| i| i|o fpm tiv matre' o|iusm eo|ius tis enrre eris' um um |is tum' em no' b|s tis ecm em |iem tti t|us itum' cattus'
gvt li is' eat' env iv' rin o' lti toncf| |lm gradibus' grp iatm |inp tti efff l|s tuis' ferens' ti timeo' gs tent uum erens lgcf |e
lum

5) 3-2 mapping i.e. one, two or three Voynich characters map to one or two Latin characters

0 0.40567669613807406 GA$Chromosome@1a9fa4c prob=1.467899375005982 Good=18 / 50 = 36.0%
S: h1o ok1 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe o  1 9 8 h  c 2 s y k a e
R:   i  er  e  a  s it  l  r us  m  o  p at  c in en is um b tu t v n re u f d nt g ti
|be| e| e|b fumt isti' tlc b|edt i|ed vit tunc' ert ut' ut' |it vut at' ob' vg|us vit aret at' |eat vis' v|d evut relatus'
utiv se' it' apv tuti' eti nr b' sis' vboref| |st unpmevc unum' elt |rum vis' afff s|us venit' ferus' is' isti' uus vtuv enent erus' suref |inut

Explanation:

An apostrophe (') following a word indicates that the word is valid Latin. A | (vertical bar) in place of a character indicates that the Voynich character(s) have no mapping defined into Latin, so the Latin character could be anything.

The S: and R: lines show the Voynich characters (S) and their replacements (R) respectively.
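
For the runs where a pair of Voynich characters can map to one Latin character, the S: line mixes two-character and one-character entries, so the text has to be split before it is converted. The sketch below assumes a greedy, left-to-right, longest-match split; that is a guess at the procedure, but it reproduces the converted words from run 3 that are checked here ("|oi|", "kca", "arx", "j|b"). The one-word `latin_words` set is a stand-in for the real dictionary.

```python
# Sketch only: greedy longest-match splitting is an assumption about the method.
S = ("1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a "
     "o 1 9 8 h c 2 s y k a e").split()
R = "i d e l r m b f t g q x p y v c s j o z a u µ w h k £ + = n".split()
TABLE = dict(zip(S, R))

def decode(word, table):
    """Convert a Voynich word, always trying the longest table entry first."""
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            out.append("|")   # no mapping defined for this character
            i += 1
    return "".join(out)

def render(word, table, latin_words):
    """Decode a word and append an apostrophe if it is in the word list."""
    latin = decode(word, table)
    return latin + "'" if latin in latin_words else latin

latin_words = {"arx"}   # stand-in dictionary
print([render(w, TABLE, latin_words) for w in ["fo1ou", "soe9", "9k1ay"]])
```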

Genetic Algorithm based Phrase Analysis

February 26, 2010 1 comment

Hypothesis

The following hypothesis occurred to me while I was investigating a cipher theory proposed by Rich Santa Coloma. (This is not a new idea amongst Voynich researchers, but it was new to me!)

The VMs “words” are codes for plaintext character groups, probably trigraphs, digraphs and single characters.

How does one use this system?

1) Take each word in the plaintext
2) Break it up into a sequence of one or more trigraphs, digraphs and single characters by referring to a code table
3) Write the code for each, separated by a space, and terminate the code for the last trigraph, digraph or character with a VMs “9”.

The labels are probably treated differently: there may well be a separate set of codes just for the labels.

As an example, take the following “sentence” of 33 “words” from the Herbal folios:

h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds 3o h1cc9 sam 1oh1oe 1oy Hos

Breaking the VMs “words” at each terminal “9”, this is deciphered to be a sentence of 13 words:

h1cok 2oe 1c
4ohom 2oy 4ok1coe 1oyoy 2o82c
4okd
4okcc
8am 4okC
Kay o1c
1oe 1oe 4ok1c
8am 1okd
8ae s1
k1c
8am 8C
ko8 8an 4okds 3o h1cc
sam 1oh1oe 1oy Hos
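
The grouping step can be sketched as follows. The function name is illustrative, and the sketch handles a single end mark only. A trailing group that lacks the end mark, like the last line of the list above, is kept as a word of its own.

```python
# Sketch only: a single end mark is assumed; the function name is illustrative.
SENTENCE = ("h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 "
            "Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds "
            "3o h1cc9 sam 1oh1oe 1oy Hos")

def group_tokens(sentence, end_mark="9"):
    """Group VMs "words" into plaintext words, closing a group at each end mark."""
    groups, current = [], []
    for token in sentence.split():
        if token.endswith(end_mark):
            stem = token[:-len(end_mark)]   # drop the terminal mark
            if stem:
                current.append(stem)
            if current:
                groups.append(current)
            current = []
        else:
            current.append(token)
    if current:                             # trailing tokens with no end mark
        groups.append(current)
    return groups

groups = group_tokens(SENTENCE)
for group in groups:
    print(" ".join(group))
```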

Each of these words is built of one or more codes. E.g. the first word in the list above is “h1cok 2oe 1c” and may be deciphered as

h1cok = “qui”
2oe = “de”
1c = “m”

to make the Latin word “quidem”.
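
As a sketch of the encoding direction, the three codes above can be put in a small table and applied to “quidem”. The table holds only those three entries, the function names are illustrative, and the greedy "longest piece first" split is just one of the choices a scribe could make.

```python
# Sketch only: the code table is limited to the three codes quoted in the text.
CODE_TABLE = {"qui": "h1cok", "de": "2oe", "m": "1c"}

def split_plaintext(word, table, max_len=3):
    """Split a word into trigraphs, digraphs and single characters in the table."""
    pieces, i = [], 0
    while i < len(word):
        for size in range(max_len, 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                pieces.append(chunk)
                i += len(chunk)
                break
        else:
            raise KeyError(f"no code for {word[i]!r}")
    return pieces

def encode(word, table, end_mark="9"):
    """Write the code for each piece, space separated, ending with the mark."""
    codes = [table[piece] for piece in split_plaintext(word, table)]
    return " ".join(codes) + end_mark

def decode(group, table):
    """Reverse lookup: turn a group of codes back into the plaintext word."""
    reverse = {code: plain for plain, code in table.items()}
    return "".join(reverse[code] for code in group)

print(encode("quidem", CODE_TABLE))
print(decode(["h1cok", "2oe", "1c"], CODE_TABLE))
```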

An interesting feature of this cipher/code is that you may have several choices of how to split each plaintext word into tri/di/mono-graphs, but without ambiguity for the decipherer. This may be an explanation for the different frequency distributions between the VMs folios and Currier hands: they were written by different scribes who tended to split the plaintext words differently.

Does the Theory fit the Data, for Latin?

We first take a substantial body of text from the VMs, e.g. the Recipes folios, and feed it through an application that extracts all the VMs words and groups them according to the procedure described above, using one or more arbitrary characters as word-ending marks (typically VMs “9”). Each sentence so derived is then analysed: each of its tokens is broken into n-grams, and the n-gram frequencies are tallied.

At the end of the processing, the n-grams are sorted into frequency order: the most frequent n-grams appear first in the list.
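
A frequency tally of this kind might look like the sketch below. It assumes that "n-gram" means every contiguous run of 1 to `max_n` characters inside a token; the value of `max_n` and the function names are assumptions, not taken from the application.

```python
# Sketch only: the n-gram definition and max_n are assumptions.
from collections import Counter

def char_ngrams(token, max_n=5):
    """Yield every contiguous character n-gram of length 1..max_n in a token."""
    for n in range(1, max_n + 1):
        for i in range(len(token) - n + 1):
            yield token[i:i + n]

def ngram_frequencies(tokens, max_n=5):
    """Tally n-grams over all tokens and return them most frequent first."""
    counts = Counter()
    for token in tokens:
        counts.update(char_ngrams(token, max_n))
    return counts.most_common()

for ngram, count in ngram_frequencies(["8am", "8am", "8ae"]):
    print(ngram, count)
```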

At this point the application moves to its second stage. It ingests a large list of Latin phrases, generated by Knox (thanks, Knox!) and processes each word in each unique phrase for n-gram content, so extracting the n-gram frequencies for Latin. The phrases are placed in a sorted list: shortest first. The n-grams are sorted by frequency, most frequent first.

Here are the Latin phrase sizes used:

A total of 53834 different phrases of size >= 2
2 4405
3 28152
4 8524
5 3866
6 2227
7 1507
8 1085
9 813
10 633
11 513
12 424
13 356
14 300
15 252
16 209
17 177
18 150
19 130

The third stage of the application is to generate a set of Genetic Algorithm chromosomes. Each chromosome takes the Top N n-grams from the Voynich n-gram list and pairs them with a random selection of the n-grams from the Latin list.

For example, for a Chromosome of length 15 (in fact the GA uses much longer lengths, typically 200) the following table might be used:

V: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam  oy 1c7  e
L: ed gi  n  de   et ae  p     s  du    tu   nd    d tio rum te
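
A random chromosome like the one in the table could be built as in this sketch. It assumes the Latin n-grams are drawn without repetition (the example row has no repeats, but the actual rule is not stated), and the function name is illustrative.

```python
# Sketch only: sampling without replacement is an assumption.
import random

V_NGRAMS = "am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam oy 1c7 e".split()
L_NGRAMS = "ed gi n de et ae p s du tu nd d tio rum te".split()

def random_chromosome(voynich_ngrams, latin_ngrams, top_n, rng):
    """Pair the top-N Voynich n-grams with randomly chosen Latin n-grams."""
    top = voynich_ngrams[:top_n]
    picks = rng.sample(latin_ngrams, len(top))
    return dict(zip(top, picks))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
chromosome = random_chromosome(V_NGRAMS, L_NGRAMS, 15, rng)
for v, l in chromosome.items():
    print(v, "->", l)
```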

The chromosomes are “scored” by having them translate/decipher a training set of sentences from the input VMs folios. To calculate the score of each chromosome for each sentence, the sentence word tokens are converted to Latin n-grams using the chromosome’s table. Then the tokens are joined together to form the plaintext words. The plaintext words are looked up in the Latin dictionary: the chromosome’s score is increased for valid words, and decreased for invalid words. Once all the words in the sentence have been deciphered in this way, it is compared with each of the Latin phrases: if a Latin phrase appears in the sentence, the score of the chromosome is increased substantially.
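
The scoring of one sentence might be sketched like this. The weights (`hit`, `miss`, `phrase_bonus`) are invented for illustration, each token is assumed to be looked up whole in the chromosome's table, and the chromosome, word list and phrase below are toy values, not a proposed decipherment.

```python
# Sketch only: weights, whole-token lookup and all data below are assumptions.
def decipher_word(group, chromosome):
    """Convert each token to its Latin n-gram and join them into one word."""
    return "".join(chromosome.get(token, "|") for token in group)

def score_sentence(groups, chromosome, latin_words, latin_phrases,
                   hit=1.0, miss=-0.5, phrase_bonus=10.0):
    """Reward valid words, penalise invalid ones, add a bonus per phrase found."""
    words = [decipher_word(group, chromosome) for group in groups]
    score = sum(hit if word in latin_words else miss for word in words)
    text = " " + " ".join(words) + " "
    for phrase in latin_phrases:
        if " " + phrase + " " in text:   # match whole words only
            score += phrase_bonus
    return score, words

toy_chromosome = {"h1cok": "qui", "2oe": "de", "1c": "m", "8am": "et"}
toy_groups = [["h1cok", "2oe", "1c"], ["8am"], ["ko8"]]
score, words = score_sentence(toy_groups, toy_chromosome,
                              latin_words={"quidem", "et"},
                              latin_phrases=["quidem et"])
print(words, score)
```

The third toy word has no entry in the table, so it is rendered as a bar and counts as invalid.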

The best chromosome found by a Monte Carlo method (basically generating random chromosomes, and retaining the best scoring chromosome) is placed at the top of a list, and then the remaining chromosomes needed for the Genetic Algorithm are generated.

The GA phase now begins: the chromosomes are genetically altered, mated and selected to optimise the best chromosome’s score on the training sentences. This phase is compute intensive.
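
A generic version of this phase is sketched below. The operators are textbook choices (uniform crossover, per-gene mutation, keeping the better half as parents, carrying the best chromosome over unchanged) and are assumptions; the operators actually used are not described here. The fitness function in the demo is a toy that counts matches against a made-up target table.

```python
# Sketch only: operators, rates and the toy fitness are assumptions.
import random

def mutate(chrom, latin_ngrams, rate, rng):
    """Give each gene a chance of being replaced by a random Latin n-gram."""
    return {k: (rng.choice(latin_ngrams) if rng.random() < rate else v)
            for k, v in chrom.items()}

def mate(a, b, rng):
    """Uniform crossover: each gene comes from one parent or the other."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in a}

def evolve(population, fitness, latin_ngrams, epochs, rate=0.2, rng=None):
    """Run the GA for a number of epochs and return the best chromosome."""
    rng = rng or random.Random()
    for _ in range(epochs):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, len(ranked) // 2)]
        children = [ranked[0]]            # best chromosome is kept unchanged
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            children.append(mutate(mate(a, b, rng), latin_ngrams, rate, rng))
        population = children
    return max(population, key=fitness)

rng = random.Random(1)
keys = ["am", "ay", "ae"]
latin = ["ed", "gi", "n", "de"]
target = {"am": "ed", "ay": "gi", "ae": "n"}      # made-up target
fit = lambda c: sum(c[k] == target[k] for k in keys)
start_pop = [{k: rng.choice(latin) for k in keys} for _ in range(10)]
best = evolve(start_pop, fit, latin, epochs=30, rng=rng)
print(best, fit(best))
```

Because the best chromosome is always carried over, the best score can only stay the same or improve from one epoch to the next.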

Periodically, the GA will report on its progress:

Epoch 311 Cost/Ave 62.845588235294116/61.22993872549012 same 1 Mutated 21.608040201005025% New 1 MS 15
62.845588235294116 GAPhrases$Chromosome@41ec5a Good=128 / 408 = 31.37255% 40 phrases in 25 sentences
S: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam  oy 1c7  e
R: ed gi  n  de   et ae  p     s  du    tu   nd    d tio rum te
Sentence 189
S: 2o ok1c - 1coe hc1 - 1Kc - ohan ae e hC - 4ohan 1cH - 1c7ay ap e2c - 2c7ae ohcay e hc8 - 1coehC - ehc - ohC - 4ohC - 4ohc - 4ohan ap -
T: endve la' binteua tunti nis te' pi et' in'* tunis

In this report, the GA has been running for 311 “epochs” (each epoch is a new generation of chromosomes). The cost (score) of the best chromosome is 62.8, whereas the average score of all the chromosomes in the population is 61.2. In this epoch, the best chromosome is unchanged since the last epoch (“same 1”), about 21.6% of the chromosomes have been mutated, and a fresh chromosome (“New 1”) was inserted (to ensure diversity – this is not usually done in GAs, but I find it produces more reliable training). “MS 15” means that the longest run of no-change epochs seen so far is 15: the larger this number is, the more stagnant the chromosome pool is, and the nearer the GA is to converging on a solution.
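
One way to read the “same” and “MS” counters is sketched here: given the best score at each epoch, count the current run of unchanged epochs and the longest such run so far. This is an interpretation of the report, and the function name is illustrative.

```python
# Sketch only: an interpretation of the "same" and "MS" counters.
def stagnation_stats(best_scores):
    """Return (current run of no-change epochs, longest such run so far)."""
    same = 0
    max_same = 0
    for previous, current in zip(best_scores, best_scores[1:]):
        same = same + 1 if current == previous else 0
        max_same = max(max_same, same)
    return same, max_same

print(stagnation_stats([60.0, 60.0, 60.0, 62.8, 62.8]))
```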

The following line shows in detail how the best chromosome has scored: its table produces 128 valid Latin words, from a total of 408 translations i.e. about 31%. In the 25 sentences being used in training, 40 common Latin phrases have been found.

The next two lines show the first 15 n-grams in the mapping that the chromosome is using.

Then the status report shows how the chromosome fared on translating a sentence picked at random from the VMs folios. Since the GA is being trained only on the first few sentences, the remainder are essentially “unseen”, and so a valid, sensible translation in a non-trained sentence is significant.

The sentence picked is number 189 (the training set is the first 25 sentences in this run, so number 189 is well outside that). The VMs source sentence is shown with hyphens “-” separating the tokens that make up words. E.g. “2o ok1c” is the first word. Beneath is the Latin translation. A Latin word followed by a single quote means that that word appears in the Latin dictionary, and is thus valid. A star appearing after a set of valid Latin words indicates that the Latin phrase made up by the words is common, or at least appears in Knox’s list.