Archive for the ‘Algorithms’ Category

Edit Distance for Word Positions

October 17, 2016 9 comments

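The edit distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert one word into the other. For example, the edit distance between “banana” and “bahama” is 2 (two substitutions: n→h and n→m).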
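For readers who want to reproduce the measure, here is a minimal sketch of the standard dynamic-programming calculation of the Levenshtein distance. The function name `levenshtein` is my own choice for illustration, and this is not necessarily the code used to make the plots below.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("banana", "bahama"))  # 2
```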

I looked at the average edit distance (the Levenshtein measure) between words on each line of each folio in the Herbal A and Herbal B sections. Here are the results:

herbala_editdistance

herbalb_editdistance

How to interpret these plots

There is one square per word position and line position: the top left square corresponds to the average edit distance between word 1 and word 2 on all the folios. The next square in that row corresponds to the average edit distance between word 2 and word 3 on the folios.

Each square in the plot has a shade of gray: the darker the shade, the bigger the average edit distance.
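The grid of averages can be sketched as follows. This is only an illustration under assumptions: each folio is represented as a list of lines, each line as a list of words, and the names `average_adjacent_distance` and `toy_folios` are hypothetical. The toy words come from f9v, but their arrangement into lines and folios here is invented.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Standard Levenshtein edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def average_adjacent_distance(folios):
    """Map (line index, word index) to the mean edit distance between
    that word and the next word on the same line, over all folios."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for lines in folios:
        for line_no, words in enumerate(lines):
            for word_no in range(len(words) - 1):
                key = (line_no, word_no)
                totals[key] += levenshtein(words[word_no], words[word_no + 1])
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Invented layout, for illustration only
toy_folios = [
    [["fo1oy", "og9", "2oy"], ["81oy", "1oe"]],
    [["4o", "1oe", "1oe"], ["19", "kay"]],
]
for position, mean in sorted(average_adjacent_distance(toy_folios).items()):
    print(position, mean)
```

Each key of the result corresponds to one square of the plot: a darker square means a larger mean.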

One conclusion is that for both sets of folios, there is a big edit distance between the first and second words on the folios: the words are very dissimilar.

Another conclusion is that similar words (lighter shade of gray) tend not to occur in the first line, or as the first words.

 


Language A and B Again

March 13, 2013 12 comments

A tentative conclusion from comparing Language A and Language B is that the non-gallows glyphs are used in the same way in both Languages.

That is to say, they appear to mean the same thing. So the “o” in A means the same as the “o” in B.
There is some persistent “mixing” between the e/y glyphs, which is illustrated by the example result below:
ABMixing
There is also some doubt about the “8” glyph, which sometimes seems to mix with the gallows glyphs (e.g. in some cases, the “8” appears in A to function in the same way as a gallows glyph in B and vice versa). This may simply be an error in the comparison method, or it may be that the “8” is a null, or it may be due to some other effect.
The gallows glyphs are different – they don’t appear to mean the same in A and B. I’m focussing on those glyphs now.

Language “A” and “B” Conversions

March 5, 2013 12 comments

This is an update to my previous two posts on this topic.

I have been concentrating on searching for the correspondence between glyphs used in Language A, and glyphs used in Language B. As a reminder, the method is to take all words in, say, Language A, and “convert” them to words in Language B by changing the glyphs according to a candidate mapping table. The frequency of the converted Language B words is then compared with the original Language A words: the closer the frequencies, the better the mapping match.
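The conversion-and-compare step can be sketched like this. The post does not say exactly how the two sets of frequencies are compared, so the sum of squared differences between relative word frequencies used here is an assumption, and the names `convert`, `relative_frequencies` and `mismatch` are hypothetical.

```python
from collections import Counter


def convert(word, mapping):
    """Replace each glyph by its image under the candidate mapping.
    Glyphs that are not in the mapping are left unchanged."""
    return "".join(mapping.get(glyph, glyph) for glyph in word)


def relative_frequencies(words):
    """Map each distinct word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def mismatch(source_words, target_words, mapping):
    """Assumed measure: sum of squared differences between the relative
    frequencies of the converted source words and the target words.
    Lower values mean a better mapping."""
    converted = relative_frequencies(convert(w, mapping) for w in source_words)
    target = relative_frequencies(target_words)
    vocabulary = set(converted) | set(target)
    return sum((converted.get(w, 0.0) - target.get(w, 0.0)) ** 2
               for w in vocabulary)


# Toy data: the two word lists differ only by an e/y swap
words_a = ["1oe", "1oe", "8ay"]
words_b = ["1oy", "1oy", "8ae"]
print(mismatch(words_a, words_b, {"e": "y", "y": "e"}))  # 0.0
print(mismatch(words_a, words_b, {}))                    # larger than 0
```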

Method Check using only Language A words

As a check of the method, I took the Herbal folios 1-25 (all in Language A) and split them into two groups: 1-12 and 13-25, and I then artificially labelled the latter group as Language B. Then I ran the matching procedure, which produced the following result:

Epoch 62 Best chromosome 0 Value= 5.62272615159e-05
Chromosome ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'k', 'y', 'h', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']
ngramsA    ['o', '9', '1', 'i', '8', 'a', 'e', 'c', 'h', 'y', 'k', 'N', '2', '4', 's', 'g', 'p', '?', 'K', 'H']

This is good and reassuring: apart from a swap of “h” and “k”, every glyph maps to itself, which shows that the words in folios 13-25 have essentially the same frequency distribution when their glyphs are mapped to the same glyphs in folios 1-12.

Removal of Glyph Variants in Voyn_101

As the tests progressed, it became clear that some of the glyphs GC defined in Voyn_101 were in fact variants of more common glyphs. The most obvious were the “m”, “n”, “M” glyphs mentioned before – with these included, the conversions between Language B and Language A were of much poorer quality than if they were expanded to “iiN”, “iN” and “iiiN” respectively. After some time weeding out these variants, the following table was arrived at:

seek =  ["3", "5", "+", "%", "#", "6", "7", "A", "X", 
         "I", "C", "z", "Z", "j", "u", "d", "U", "P", 
         "Y", "$", "S", "t", "q",
         "m", "M", "n", "Y", "!", ")", "*", "b", "J", "E", "x", "B", "D", "T", "Q", "W", "w", "V", "(", "&"]
repl =  ["2", "2", "2", "2", "2", "8", "8", "a", "y", 
         "ii", "cc", "iy", "iiy", "g", "f", "ccc", "F", "ip",
         "y", "s", "cs", "s", "iip",
         "iiN", "iiiN", "iN", "y", "2", "9", "p", "y", "G", "c", "y", "cccN", "ccN", "s", "p", "h", "h", "K", "9", "8"]

I am very confident that the glyphs remaining after using the above conversion table are the base set.  The base set of glyphs is thus:

Language A frequency order: 'o', 'c', '9', '1', 'a', '8', 'e', 'i', 'h', 'y', 'k', 's', '2', 'N', '4', 'g', 'p', '?', 'K', 'H', 'f', 'G', 'F', 'L', 'l', 'v', 'r', 'R'
Language B frequency order: 'c', 'o', '9', 'a', '8', 'e', '1', 'h', 'i', 'y', 'k', '2', 'N', 's', '4', 'g', 'p', 'f', '?', 'H', 'K', 'G', 'F', 'l', 'L', 'R', 'r', 'v'

where “?” represents all very rare glyphs (such as the “picnic table” glyph). There are thus 27 glyphs (15 gallows and 12 regular) excluding the rare special glyphs like the picnic table.
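Applying the seek/repl conversion table above to a transcribed word can be sketched as below. Only a short excerpt of the table is repeated here, and the helper name `simplify` is hypothetical. Each glyph is looked up once, so the replacement text is never expanded a second time. For the full table this gives the same result as sequential string replacement, because no replacement string contains a glyph from the seek list.

```python
# Excerpt of the seek/repl table (the full table has more entries)
seek = ["m", "M", "n", "C", "A"]
repl = ["iiN", "iiiN", "iN", "cc", "a"]

# Pair each variant glyph with its expansion into base glyphs
table = dict(zip(seek, repl))


def simplify(word, table):
    """Expand variant glyphs into base glyphs, one lookup per glyph.
    Glyphs that are not in the table are kept as they are."""
    return "".join(table.get(glyph, glyph) for glyph in word)


print(simplify("8am", table))  # 8aiiN
```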

Glyph Mixing Between A and B

I ran many trials using the base set of glyphs, comparing various sections of the VMs written in the different hands. In particular, the following folio collections were defined:

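# list(...) lets the concatenations work on Python 3, where range() is not a list
Special = {'HerbalRecipeAB': list(range(107,117)) + list(range(1,26)),
           'HerbalAB': list(range(1,57)),
           'HerbalBalneoAB': list(range(1,26)) + list(range(75,85)),
           'HerbalAstroAB': list(range(1,13)) + list(range(67,75)),
           'PharmaRecipeAB': [88,89,99,100,101,102] + list(range(103,117)),
           'AllAB': list(range(1,117))
}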

The collection I used the most was the one called “HerbalBalneoAB”, which contains Herbal folios written in Language A, and Balneo folios written in Language B. The nice feature of this collection is that the number of words is around the same for both Languages, which makes comparing counts very easy:

Total words =  2846  Total Language A =  1581  Total Language B =  1584

As an example, here is a trial result for HerbalBalneoAB:

Language B ['o', '9', '1', 'a', 'i', 'f', 'c', 'y', 'h', 'e', 'K', 'N', '2', 's', '4', 'g', 'p', '8', 'k', 'H']
Language A ['o', '9', '1', 'a', 'i', '8', 'c', 'e', 'h', 'y', 'k', 'N', '2', 's', '4', 'g', 'p', 'K', '?', 'H']

In all the tests I ran, there were some common features in the results:

  • Mixing between “e” and “y” – when writing Language A, the use of “e” appears to be equivalent to the use of “y” in Language B, and vice versa
  • Mixing between 8,f,F,k,K,g,G,r,R,? and so on – the gallows glyphs swap amongst themselves and with “8”

Just about all trials showed the “e”/“y” mixing. Tony Gaffney pointed out that these two glyphs are quite similar in stroke construction. The appearance of “8” amongst the swapping gallows glyphs is curious.

Single glyphs in Language A and Language B

March 2, 2013 4 comments
As a sanity check, I looked at single glyphs (rather than nGrams of length > 1), searching for the mapping that takes all the Language B glyphs and maps them to Language A glyphs, so that the Language B words converted with the mapping most closely match the frequency of Language A words. I found the following:
Chromosome  ['o', '9', '1', 'a', 'H', 'c', 'e', 'h', 'y', 'k', '2', 's', 'm', '4', 'i', '(', '8', 'p', 'g', 'n']
ngramsA     ['o', '9', '1', 'a', '8', 'c', 'e', 'h', 'y', 'k', '2', 's', 'm', '4', 'g', 'i', 'K', 'p', '?', 'n']

This shows that most Language B glyphs map to the same glyph in Language A. However, there is some mixing going on here between “H”, “8”, “i”, “g”, “(”, “K” and “?”.

It occurred to me that this may be due to GC’s choice of ascribing single glyphs where there should perhaps be several. In particular, he has:
“m” which looks like “iiN”
“n” which looks like “iN”
“M” which looks like “iiiN”
(I think EVA does a better job of recognizing these.) So I adjusted the GC transcription accordingly, replacing n,m,M with the i,N combinations above.
This resulted in a new mapping for B to A:
Chromosome  ['o', '9', '1', 'a', 'i', 'g', 'c', 'y', 'k', 'e', 'h', 'N', '2', 's', '4', '(', '8', 'p', 'f', 'H']
ngramsA     ['o', '9', '1', 'a', 'i', '8', 'c', 'e', 'h', 'y', 'k', 'N', '2', 's', '4', 'g', 'K', 'p', '?', 'H']
(There may be better mappings, but this is the best so far.) This has some interesting features:
  • e and y swap between languages
  • h and k gallows swap between languages
  • some mixing of g,8,(,K,f,? – some of these are relatively rare, so the statistics are poor, which may explain the mixing.
 Note that the simplification table I’m using for Voyn_101 is currently:
    seek = ["3",   "5",    "+",  "%",   "#", "6", "7",    "A", "X",  
            "I",   "C",    "z",  "Z",   "j", "u", "d",    "U", "P", 
            "Y",   "$",    "S",  "t",   "q",
            "m",   "M",    "n",  "Y",   "!"]
    repl = ["2",   "2",    "2",  "2",   "2", "8",  "8",   "a", "y",  
            "ii",  "cc",   "iy", "iiy", "g", "f",  "ccc", "F", "ip",
            "y",   "s",    "cs", "s",   "iip",
            "iiN", "iiiN", "iN",  "y",   "2"]
(Thanks to Tony Gaffney for spotting an error in the conversion for C in a previous version.)
Categories: Algorithms, Languages

The Relationship Between Currier Languages “A” and “B”

March 1, 2013 24 comments

Captain Prescott Currier, a cryptographer, looked at the Voynich many moons ago, and made some very perceptive comments about it, which can be seen here on Rene Zandbergen’s site.

In particular, he noticed that the handwriting was different between some folios and others, and he also noticed (based on glyph/character counts) that there were two “languages” being used.

When I first looked at the manuscript, I was principally considering the initial (roughly) fifty folios, constituting the herbal section. The first twenty-five folios in the herbal section are obviously in one hand and one “language,” which I called “A.” (It could have been called anything at all; it was just the first one I came to.) The second twenty-five or so folios are in two hands, very obviously the work of at least two different men. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five or thirty folios) is in two “languages,” and each “language” is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own “language.” Now, I’m stretching a point a bit, I’m aware; my use of the word language is convenient, but it does not have the same connotations as it would have in normal use. Still, it is a convenient word, and I see no reason not to continue using it.

We can look at some statistics to see what he was referring to. Let’s compare the most common words in Folios 1 to 25 (in the Herbal section, Language A, written in Hand 1) and in Folios 107 to 116 (in the Recipes section, Language B, written in a different Hand):

Comparison between word frequencies in Languages A and B


So, for example, in Language A the most common word is “8am” and it occurs 192 times in the folios, whereas in Language B the most common word is “am”, occurring 137 times.

We might expect that these are the same word, enciphered differently. The question then is, how does one convert between words in Language A and words in Language B, and vice versa? In the case of the “8am” to “am” it’s just a question of dropping the “8”, as if “8” is a null character in Language A. In the case of the next most popular words, “1oe” (A) and “1c89” (B), it looks like “oe” (A) converts to “c89” (B). And so on.

If we look at the most popular nGrams (substrings) in both Languages, perhaps there is a mapping that translates between the two. Perhaps the cipher machinery that was used to generate the text had different settings, that produced Language A in one configuration, and Language B in another. Perhaps, if we look at the nGram correspondence that results in the best match between the two Languages, a clue will be revealed as to how that machinery worked.

This involves some software (I’m using Python now, which is fun). The software first calculates the word frequencies for Language A and B in a set of folios (the table above is an output from this stage). It then calculates the nGram frequencies for each Language. Here are the top 10:

nGramFrequencies
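Counting nGrams can be sketched as follows. This is an illustration only: it counts every occurrence of every substring of length n in every word token, which is an assumption about how the counts are made, and `ngram_counts` is a hypothetical name.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count all substrings of length n over a list of word tokens.
    Words shorter than n contribute nothing."""
    counts = Counter()
    for word in words:
        for start in range(len(word) - n + 1):
            counts[word[start:start + n]] += 1
    return counts


# Toy word list built from words mentioned in this post
words = ["8am", "1oe", "1oe", "1c89"]
for ngram, count in ngram_counts(words, 2).most_common(3):
    print(ngram, count)
```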

The software then runs a Genetic Algorithm to find the best mapping between the two sets of nGrams, so that when the mapping is applied to all words in Language B, it produces a set of words in Language A the frequencies of which most closely match the frequencies of words observed in Language A (i.e.  the frequencies shown in the first table above).
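To give a feel for the search, here is a deliberately simplified sketch. It is not the software described above. It works on single glyphs rather than nGrams, uses swap mutation only (no crossover), keeps the better half of the population each epoch, and scores a mapping by the sum of squared differences in relative word frequency. All of these choices, and all names in the code, are assumptions made for illustration.

```python
import random
from collections import Counter


def frequencies(words):
    """Map each distinct word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def fitness(chromosome, glyphs_a, words_b, freq_a):
    """Lower is better. chromosome[i] (a Language B glyph) maps to glyphs_a[i]."""
    mapping = dict(zip(chromosome, glyphs_a))
    converted = ("".join(mapping.get(g, g) for g in word) for word in words_b)
    freq_b = frequencies(converted)
    vocabulary = set(freq_a) | set(freq_b)
    return sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocabulary)


def mutate(chromosome, rng):
    """Return a copy of the chromosome with two positions swapped."""
    child = list(chromosome)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(glyphs_a, words_a, words_b, population_size=30, epochs=200, seed=0):
    """Search for the permutation of glyphs that best maps B words onto A words."""
    rng = random.Random(seed)
    freq_a = frequencies(words_a)

    def score(chromosome):
        return fitness(chromosome, glyphs_a, words_b, freq_a)

    # Start from the identity mapping plus random permutations
    population = [list(glyphs_a)] + [rng.sample(glyphs_a, len(glyphs_a))
                                     for _ in range(population_size - 1)]
    for _ in range(epochs):
        population.sort(key=score)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return min(population, key=score)


# Toy data: Language "B" here is Language "A" with e and y swapped
glyphs = ["o", "e", "y", "1", "8", "a"]
words_a = ["1oe", "1oe", "8ay", "oe"]
words_b = ["1oy", "1oy", "8ae", "oy"]
best = evolve(glyphs, words_a, words_b)
print("Chromosome", best)
print("glyphsA   ", glyphs)
print("Value =", fitness(best, glyphs, words_b, frequencies(words_a)))
```

Because the better half always survives, the best score can never get worse from one epoch to the next, but a search this simple can still get stuck on a mapping that is not the best one.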

Here is an initial result. With the following mapping, you can take most common words in Language B, and convert them to Language A.

Table for converting between a Language B word and a Language A word


A couple of remarks. This is an early result and probably not the best match. There are some interesting correspondences:

  • “9” and “c” are immutable, and have the same function
  • Another interesting feature is that “4o” in Language B maps to “o” in Language A, and vice versa!
  • in Language B, “ha” maps to “h” in Language A, as if “a” is a null

In the Comments, Dave suggested looking at word pair frequencies between the Languages. Here is a table of the most common pairs in each Language.

Common word pairs in Languages A and B


For clarity, I am using what I call the “HerbalRecipeAB” folios for this study, i.e.

Using folios for HerbalRecipeAB : [107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]

More results coming …

Frequency Distributions for Phonetic Codes

June 12, 2012 1 comment

Knox took the time to plot the frequency distributions from this post, where I looked at the theory that the VMs words are phonetic codes. Here are his results:

Where not included in the title, comparisons are to the Herbal Sections. VMs is in blue-black.

Comparison of phonetic code frequencies between VMs sections and various known texts.

With only 40 words to translate, there cannot be a meaningful series but it would be interesting to see the actual words in position, anyway. If this only shows the power of Genetic Algorithms to match something regardless of significance, why does the old Latin Herbal make the best matches to the Herbal and Astrological sections?

An abjad result from the Genetic Algorithm

June 17, 2011 3 comments

Here is one of the GA results. This is an attempt at deciphering the text on f9v (the Viola plant). The VMs words on that folio are:

"fo1oy","ogoyo89","og9","2oy","4og19j1o","4ofoe","2oe",
"81oy","1oe","1oy","89","ok9","89",
"9hc9","1oy","oh9","occcs",
"9kc9","k19","okoe","ok9","koe89",
"g1oy","9j1cc9","4okoy","9j19","kc","ay","1k9",
"o8oe","1o9","h2co89","1o89","ok19","9ha",
"4o","1oe","1oe","okae","8oy",
"4oh1o","yoh98","8ae9",
"19","kay","19k9","8ay9","9koe89",
"ok9","h1oe","1oe","19","h9k9",
"91oy","12ok9","1oy"

These are not all the words on the folio: I have removed those that contain unusual or problematic glyphs (e.g. the “m”).

The GA comes up with the following VMs->Latin character mapping:

Voynich: o    9    1    k    y    8    e    c    h    a    4    g    2    j    f    s
Plain:   r    s    d    p    m    b    t    n    f    l    -    q    c    x    v    g

(“-” means that the glyph “4” has no plain letter and is dropped.)

And here are the deciphered words. On each line you have the VMs word, the Latin consonants, then the possible Latin or English words in the dictionary that match the abjad.
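The lookup can be sketched as follows, using the mapping given above (with “4” mapped to nothing). The small `dictionary` list holds just a few words from the results below, and treating a, e, i, o, u and y as vowels is my reading of those results (for example “boys” matching “bs”), not a documented rule. The function names are hypothetical.

```python
VOWELS = set("aeiouy")  # assumption: y counts as a vowel

voynich = ["o", "9", "1", "k", "y", "8", "e", "c", "h", "a", "4", "g", "2", "j", "f", "s"]
plain = ["r", "s", "d", "p", "m", "b", "t", "n", "f", "l", "", "q", "c", "x", "v", "g"]
glyph_to_consonant = dict(zip(voynich, plain))


def decipher(vms_word):
    """Convert a VMs word to its string of plain consonants."""
    return "".join(glyph_to_consonant[glyph] for glyph in vms_word)


def skeleton(word):
    """Strip the vowels from a dictionary word, leaving its consonants."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def candidates(vms_word, dictionary):
    """Dictionary words whose consonant skeleton matches the deciphered word."""
    target = decipher(vms_word)
    return [word for word in dictionary if skeleton(word) == target]


# A handful of words taken from the results listed in this post
dictionary = ["viridarium", "requies", "arquus", "obdormio", "sifonis", "rufus"]
for vms_word in ["fo1oy", "og9", "81oy"]:
    print(vms_word, "=", decipher(vms_word), "=", " ".join(candidates(vms_word, dictionary)))
```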

fo1oy = vrdrm =  virdiarium viridarium viridiarium
ogoyo89 = rqrmrbs =  ?
og9 = rqs =  requies arquus
2oy = crm =  carum coram curam corium cremo cyrum curiam acerum acorum acroama acrum aecoreum careum cereum cerium ceroma coarmi coarmo crami cremii cremi croma cromae curium cream
4og19j1o = rqdsxdr =  ?
4ofoe = rvrt =  reverti reverto iuraverat
2oe = crt =  certa certe certo creta curatio curto creat coarto create cartae caret acerata careota careotae cariota cariotae carota carotae carta carti caryitae caryota caryotae ceratia ceratiae ceratii cerati cerata ceroti certi coertio coryti cratio creatio creati creata cretae cretea cretio crita critae croto curate curata curiatia curiata curito curta ocreata court courte curt cart
81oy = bdrm =  obdormio
1oe = drt =  audierat deerat oderit odorati aderat auderet durat diruat daret deaurata adaeratio adoratio deartuo deorata deratio diratio dirutio duratio duritia duritiae duritiei odoratio odorata darte dirty
1oy = drm =  audieram darem dierum dormio oderam odorem iudeorum deorum darium adoreum adorium dearmo diarium dirimo diremi dirum dormeo drama dromo durum edormio edurum odorum dram
89 = bs =  abs bis bos iubeas iubes basio uobis abusi ibis abies absi abusio baes bas basi bes bios bus ibos obesa obsuo obsui base abuse bees boys busy bays
ok9 = rps =  repsi rapis aeripes euripus reapse reposui rupes rupis ropes
89 = bs =  abs bis bos iubeas iubes basio uobis abusi ibis abies absi abusio baes bas basi bes bios bus ibos obesa obsuo obsui base abuse bees boys busy bays
9hc9 = sfns =  sifonis
1oy = drm =  audieram darem dierum dormio oderam odorem iudeorum deorum darium adoreum adorium dearmo diarium dirimo diremi dirum dormeo drama dromo durum edormio edurum odorum dram
oh9 = rfs =  rufus refuse
occcs = rnnng =  running runninge
9kc9 = spns =  sapiens spinas sponsi sponsa supinis spensa spinis yspanos sapineus sapinus saponis siponis sopionis spensae spineus spinosa spinus spons sponsae sponsio sponso supinus
k19 = pds =  pedes pedis apodis pods
okoe = rprt =  reperiet reparat eriperet reperta reperit reparatio reperti reporto reporte report
ok9 = rps =  repsi rapis aeripes euripus reapse reposui rupes rupis ropes
koe89 = prtbs =  partibus portabis parietibus
g1oy = qdrm =  quadrum quadrima
9j1cc9 = sxdnns =  ?
4okoy = rprm =  reprimi reprimo
9j19 = sxds =  ?
kc = pn =  opinio opino paene pene poena pono punio puny upon pane pena pone apiana apianae apina apinae paean paeon paeonia penae peni pinea pini poenae poenio open pen paine pain payne pyany pin pine pan peny peony
ay = lm =  aliam alium lama lamia lima limo olim almi oleum alme alma aulam alum aulaeum elimo ilum lamae lamiae lema limae limi ulmea ulmi elm
1k9 = dps =  dapes daps adeps adipis adipeus adips adipsi adposui dapis deposui depso depsui diapasi
o8oe = rbrt =  arboreti robert
1o9 = drs =  aderas derisui dorso durus odores duros dirus edurus odorus edrus durius diris duris derisio dares adoris adoreus adoriosa adrasi adrisi adrisio adrosi adursi derasi derisi derisa derosi derosa diarius dirasi dorsi odoris deirous dooers doores dryes dries drousie dyers
h2co89 = fcnrbs =  facinoribus
1o89 = drbs =  derbiosa
ok19 = rpds =  rapidus
9ha = sfl =  useful safly safely
4o = r =  aer ara aro aurae aure aurea auro ero eruo ira irae ire iuro or ore ori oro re rea rei rui ruo aera aerio ora iura aura era r uero uaria area auri iure iuri ere aeer aerae aerea aerei aeria aero arae areae areo arui aria ariae ari aureae aurei eiero eare erae erui eri euro euroa euri iro orae reae uro uri rai are oure yeare your our youre ear rue year yeer air rye ar
1oe = drt =  audierat deerat oderit odorati aderat auderet durat diruat daret deaurata adaeratio adoratio deartuo deorata deratio diratio dirutio duratio duritia duritiae duritiei odoratio odorata darte dirty
1oe = drt =  audierat deerat oderit odorati aderat auderet durat diruat daret deaurata adaeratio adoratio deartuo deorata deratio diratio dirutio duratio duritia duritiae duritiei odoratio odorata darte dirty
okae = rplt =  repleuit repleta repleat
8oy = brm =  baioarium barim baioariam brume bireme boarium boreum borium bromi bruma brumae eboreum ebrium ebureum obarmo broom
4oh1o = rfdr =  ?
yoh98 = mrfsb =  ?
8ae9 = blts =  oblitus balatus balteus ablatis ablutus abolitus ablatus belatus beluatus bliteus boletus bolites oblatus blites
19 = ds =  ades audias audis das deos deus dies duos odiosa dis adso iudeis ydus adesa adsuo adsui aedes aedis aedus dasea daseae dasia dasiae des desuo desui diis dius dos duis edius edus idos odiose udus dayes daies odyous dose ads daisie
kay = plm =  palam palma pluma pulmo puleium epulum pilum palmo apuliam palium apalum palmae palmea palmi palum paulum pileum plumae plumea plum polium polum palm
19k9 = dsps =  dasypus deseps disposui despise
8ay9 = blms =  bulimos bulimosa bulimus balms
9koe89 = sprtbs =  spiritibus
ok9 = rps =  repsi rapis aeripes euripus reapse reposui rupes rupis ropes
h1oe = fdrt =  foederata foederati
1oe = drt =  audierat deerat oderit odorati aderat auderet durat diruat daret deaurata adaeratio adoratio deartuo deorata deratio diratio dirutio duratio duritia duritiae duritiei odoratio odorata darte dirty
19 = ds =  ades audias audis das deos deus dies duos odiosa dis adso iudeis ydus adesa adsuo adsui aedes aedis aedus dasea daseae dasia dasiae des desuo desui diis dius dos duis edius edus idos odiose udus dayes daies odyous dose ads daisie
h9k9 = fsps =  ?
91oy = sdrm =  siderum sidereum sudarium
12ok9 = dcrps =  decerpsi decarpsi
1oy = drm =  audieram darem dierum dormio oderam odorem iudeorum deorum darium adoreum adorium dearmo diarium dirimo diremi dirum dormeo drama dromo durum edormio edurum odorum dram