Archive

Archive for the ‘Herbal Folios’ Category

Edit Distance for Word Positions

October 17, 2016 9 comments

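The edit distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert one word into the other. For example, the edit distance between “banana” and “bahama” is 2: two substitutions, n→h and n→m.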
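For readers who want to reproduce the measure, here is a minimal sketch of the Levenshtein distance in Python. It is the textbook dynamic-programming version, not necessarily the code used for the plots below; the function name `levenshtein` is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(levenshtein("banana", "bahama"))  # 2
```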

I looked at the average edit distance (the Levenshtein measure) between words on each line of each folio in the Herbal A and Herbal B sections. Here are the results:

[Figure: herbala_editdistance, average edit distance by word and line position, Herbal A folios]

[Figure: herbalb_editdistance, average edit distance by word and line position, Herbal B folios]

How to interpret these plots

There is one square per word and line position: the top left square corresponds to the average edit distance between word 1 and word 2 on the first line, taken over all the folios. The next square in that row corresponds to the average edit distance between word 2 and word 3.

Each square in the plot has a shade of gray: the darker the shade, the bigger the average edit distance.
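The grid of averages can be sketched as follows. This is an illustration under assumptions: each folio is taken to be a list of lines, each line a list of words, and the function name `average_adjacent_distances` and the toy folios are hypothetical (the words are real Voynich words, but their arrangement here is made up).

```python
from collections import defaultdict


def levenshtein(a, b):
    """Textbook Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def average_adjacent_distances(folios):
    """Map (line index, word index) to the mean edit distance between the
    word at that position and the next word on the same line."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for folio in folios:
        for line_no, line in enumerate(folio):
            for pos in range(len(line) - 1):
                totals[(line_no, pos)] += levenshtein(line[pos], line[pos + 1])
                counts[(line_no, pos)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy input: two "folios" of two lines each (arrangement is invented).
folios = [
    [["8am", "am", "1oe"], ["1oe", "1oy"]],
    [["8am", "8am", "oy"], ["oy", "8ay"]],
]
grid = average_adjacent_distances(folios)
print(grid[(0, 0)])  # square for line 1, words 1 and 2
```

Each value in `grid` would then be drawn as one square, darker for larger averages.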

One conclusion is that for both sets of folios, there is a big edit distance between the first and second words on the folios: the words are very dissimilar.

Another conclusion is that similar words (lighter shade of gray) tend not to occur in the first line, or as the first words.

 


The Relationship Between Currier Languages “A” and “B”

March 1, 2013 24 comments

Captain Prescott Currier, a cryptographer, looked at the Voynich many moons ago, and made some very perceptive comments about it, which can be seen here on Rene Zandbergen’s site.

In particular, he noticed that the handwriting was different between some folios and others, and he also noticed (based on glyph/character counts) that there were two “languages” being used.

When I first looked at the manuscript, I was principally considering the initial (roughly) fifty folios, constituting the herbal section. The first twenty-five folios in the herbal section are obviously in one hand and one ‘‘language,’’ which I called ‘‘A.’’ (It could have been called anything at all; it was just the first one I came to.) The second twenty-five or so folios are in two hands, very obviously the work of at least two different men. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five or thirty folios) is in two ‘‘languages,’’ and each ‘‘language’’ is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own ‘‘language.’’ Now, I’m stretching a point a bit, I’m aware; my use of the word language is convenient, but it does not have the same connotations as it would have in normal use. Still, it is a convenient word, and I see no reason not to continue using it.

We can look at some statistics to see what he was referring to. Let’s compare the most common words in Folios 1 to 25 (in the Herbal section, Language A, written in Hand 1) and in Folios 107 to 116 (in the Recipes section, Language B, written in a different Hand):

Comparison between word frequencies in Languages A and B


So, for example, in Language A the most common word is “8am” and it occurs 192 times in the folios, whereas in Language B the most common word is “am”, occurring 137 times.

We might expect that these are the same word, enciphered differently. The question then is, how does one convert between words in Language A and words in Language B, and vice versa? In the case of “8am” and “am”, it’s just a question of dropping the “8”, as if “8” is a null character in Language A. In the case of the next most popular words, “1oe” (A) and “1c89” (B), it looks like “oe” (A) converts to “c89” (B). And so on.

If we look at the most popular nGrams (substrings) in both Languages, perhaps there is a mapping that translates between the two. Perhaps the cipher machinery that was used to generate the text had different settings, that produced Language A in one configuration, and Language B in another. Perhaps, if we look at the nGram correspondence that results in the best match between the two Languages, a clue will be revealed as to how that machinery worked.

This involves some software (I’m using Python now, which is fun). The software first calculates the word frequencies for Language A and B in a set of folios (the table above is an output from this stage). It then calculates the nGram frequencies for each Language. Here are the top 10:

[Figure: nGramFrequencies, top 10 nGram frequencies for Languages A and B]
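The two counting stages can be sketched with `collections.Counter`. This is a guess at the approach rather than the actual program: I assume nGrams are contiguous substrings of 1 to `max_n` characters, counted once per occurrence in every word token. The function names and the toy word list are hypothetical.

```python
from collections import Counter


def word_frequencies(words):
    """Count how often each word occurs."""
    return Counter(words)


def ngram_frequencies(words, max_n=3):
    """Count every contiguous substring of length 1..max_n in every word."""
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for start in range(len(word) - n + 1):
                counts[word[start:start + n]] += 1
    return counts


# Toy word list, not real folio data.
words = ["8am", "8am", "am", "1oe"]
print(word_frequencies(words).most_common(1))   # [('8am', 2)]
print(ngram_frequencies(words, max_n=2)["am"])  # 3
```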

The software then runs a Genetic Algorithm to find the best mapping between the two sets of nGrams: when the mapping is applied to all words in Language B, it should produce a set of words whose frequencies most closely match the frequencies of words observed in Language A (i.e. the frequencies shown in the first table above).
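To make the objective concrete, here is one way the fitness of a candidate mapping could be scored. Everything in it is an assumption for illustration: the greedy longest-match rewriting in `apply_mapping`, the use of summed absolute differences between relative frequencies, and the toy counts. The GA itself (population, crossover, mutation) is not shown.

```python
from collections import Counter


def apply_mapping(word, mapping):
    """Rewrite a word left to right, preferring the longest nGram that has a
    mapping; characters with no mapping are copied unchanged."""
    longest = max((len(k) for k in mapping), default=1)
    out, i = [], 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += n
                break
        else:  # no nGram starting at position i is mapped
            out.append(word[i])
            i += 1
    return "".join(out)


def fitness(mapping, counts_b, counts_a):
    """0.0 for a perfect match of relative word frequencies, negative otherwise."""
    converted = Counter()
    for word, count in counts_b.items():
        converted[apply_mapping(word, mapping)] += count
    total_a = sum(counts_a.values())
    total_b = sum(converted.values())
    vocabulary = set(counts_a) | set(converted)
    distance = sum(abs(counts_a.get(w, 0) / total_a - converted.get(w, 0) / total_b)
                   for w in vocabulary)
    return 0.0 - distance


# Toy counts (hypothetical), using the two word pairs discussed above.
counts_a = {"8am": 2, "1oe": 1}
counts_b = {"am": 2, "1c89": 1}
b_to_a = {"am": "8am", "c89": "oe"}
print(apply_mapping("1c89", b_to_a))        # 1oe
print(fitness(b_to_a, counts_b, counts_a))  # 0.0
```

A GA would then keep the mappings with the highest fitness in each generation.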

Here is an initial result. With the following mapping, you can take most common words in Language B, and convert them to Language A.

Table for converting between a Language B word and a Language A word


A couple of remarks. This is an early result and probably not the best match. There are some interesting correspondences:

  • “9” and “c” are immutable, and have the same function
  • Another interesting feature is that “4o” in Language B maps to “o” in Language A, and vice versa!
  • in Language B, “ha” maps to “h” in Language A, as if “a” is a null

In the Comments, Dave suggested looking at word pair frequencies between the Languages. Here is a table of the most common pairs in each Language.

Common word pairs in Languages A and B
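Counting word pairs is a small extension of counting words. In this sketch I assume a "pair" means two adjacent words on the same line, and that lines are given as lists of words; both are my assumptions, and the toy lines are invented.

```python
from collections import Counter


def word_pair_frequencies(lines):
    """Count adjacent (left, right) word pairs within each line."""
    pairs = Counter()
    for line in lines:
        pairs.update(zip(line, line[1:]))
    return pairs


# Toy lines, not real folio data.
lines = [["8am", "1oe", "8am", "1oe"], ["am", "ay"]]
pairs = word_pair_frequencies(lines)
print(pairs.most_common(1))  # [(('8am', '1oe'), 2)]
```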


For clarity, I am using what I call the “HerbalRecipesAB” folios for this study i.e.

Using folios for HerbalRecipeAB : [107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]

More results coming …

Herbal Match

May 23, 2012 11 comments

These comparisons are between an old Italian Herbal kept at the University of Vermont dating from 1475, and folios in the Voynich Manuscript.

Categories: Features, Herbal Folios

Voynich Herbs

February 26, 2010 1 comment

Edith Sherwood has a web site where she details compelling possible identifications for the plants depicted in the “herbal” pages of the VM.

Dana Scott’s page also has plausible identifications for the plants.

As has often been pointed out, if we look at the first Voynich “word” that appears on each page of the herbal part of the VM, we find that those words are unique, or appear elsewhere very rarely. It thus seems reasonable that the words may be the names of the plants depicted.

The GA was set up to find a set of n-Gram mappings that would convert a list of 111 Voynich first herbal words into Latin/English or Spanish. For this, dictionaries of Latin, English and Spanish herb/plant names were used.

The GA sought a mapping that would convert all the Voynich words for herbs/plants into as many valid plaintext (Spanish, English, Latin) words as possible. The best result was for a mixed English/Latin dictionary (see table): 31 of the 111 Voynich words were converted, a success rate of about 28%.

(One should never expect 100% success, due to missing names in the dictionary, transcription errors, missing n-Grams, incomplete n-Grams, etc.)
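The score being maximised is simply a count of dictionary hits. A minimal sketch, with a three-entry stand-in dictionary that is purely hypothetical (the real plant-name dictionaries are much larger); the four converted words are taken from the Latin/English column of the table below.

```python
def success_rate(converted_words, dictionary):
    """Return (hits, fraction) of converted words that are dictionary entries."""
    hits = sum(1 for word in converted_words if word in dictionary)
    return hits, hits / len(converted_words)


# Toy data: "padi" is the one word here that was not found in the dictionary.
converted = ["geum", "padi", "arnica", "paris"]
toy_dictionary = {"geum", "arnica", "paris"}
hits, rate = success_rate(converted, toy_dictionary)
print(hits, f"{rate:.0%}")  # 3 75%

# The reported result: 31 hits out of 111 first words.
print(f"{31 / 111:.1%}")    # 27.9%
```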

The results are shown below in tabular form, together with Dana Scott’s and Edith Sherwood’s identification. The first column shows the folio in the VM, the second shows the first Voynich word on that folio. For the GA identification columns (3 and 4) the Voynich mapped word is shown, in quotation marks if not found in the associated dictionary, and in bold if found in the dictionary.

Note that, probably unsurprisingly, nowhere do the IDs from the GA in Spanish, English/Latin and Scott/Sherwood, agree! NOT YET, anyway 🙂

(What amuses me about this mapping technique is that it tends to produce words that sound plausible in the target language. E.g. for f4r the Latin/English word “paptise” sounds like a valid word.)

Folio Voynich 1st Word Candidate GA ID, Spanish Candidate GA ID, Latin/English Dana Scott ID, English Dana Scott ID, Latin Sherwood ID, Latin Sherwood ID, English
f1r fa19s costa “greica”
f1v h1s9 rabo geum Deadly Nightshade Atropa belladonna Hyoscyamus niger Solanum nigrum Solanum dulcamara Atropa belladonna Deadly Nightshade
f2r h98an9 “jzba” “ariapha” Cornflower Centaurea cyanus Centaurea diffusa Diffuse Knapweed
f2v hoom “meic” “padi” Water Lily Nymphaea candida Nymphoides Nymphoides
f3r k2cos chinita (Impatiens) arnica Celosia argentea Feathery amaranth
f3v hoam menta (mint) paris Helleborus foetidus Dungwort
f4r ho8ae19 “mezirn” “paptise” Saxifraga cespitosa Alpine Saxifrage
f4v j1oom pastora (Poinsettia) “oigle” Campanula rapunculus Rampion
f5r h2o89 “piyn” “hicse” Arnica montana Wolfs Bane
f5v hA1coy malanga (Malanga) cirsium Tennis Racket Plant Agrimonia eupatoria Malva sylvestris Mallow
f6r foay “oote” “erk” Acanthus mollis Bear Breeches
f6v hoay9say1Chay “meotendoteisedh” “pakpikrtsst” Eryngium maritimum Sea Holly
f7r f1o8am “saynta” acris Trientalis europea Starflower
f7v joe29 “rden” anise Myrica gale Bog Myrtle
f8r g2oe “dno” “miv” Pisum sativum Green Pea
f8v Ko8 “anop” “amot” Symphytum officinale Comfrey
f9r k98eo “uardna” “cernur” Ricinus communis Casteroil
f9v fo1oy “oveh” “erut” Heartsease, Wild Pansy Viola tricolor Violaceae Viola
f10r g1oK9 “pohon” “apryse” Cichorium pumilum Chicory Endive
f10v gam tora (Tora Tree) gale Linnaea borealis Twinflower
f11r k2oe chino (Chinese Hat Plant) “arv” Rosmarinus officinalis Rosemary
f11v goe81o89 “albaveaca” “maadud” Curcuma longa Turmeric
f13r koy3oy “lenga” “mdoium” Banana Banana
f13v hoaiy “memh” “paft” Lonicera periclymenum Honeysuckles Woodbines
f14r g1o8am “poynta” “apcris” Scorzonera Black Salsify Vipers Grass
f14v g891om “uomic” “gesdi” Stachys monnieri Wood Betony Heal-all Sel-heal Woundwort
f15r k2oy “chiga” “arium” Sonchus oleraceus Sow Thistles
f15v gayoy “t8h” “gabt” Paris quadrifolia Herb Paris
f16r go1co89 “alblanyn” “marscse” Cannabis Cannabis
f16v g1yAm “potoora” “aptule” Chrysanthemum Chrysanthemum
f17r f2o89 “hayn” “ulcse” Catananche caerulea Cupids Dart
f17v g1o8oe “poyno” “apcv” Dioscorea Yams
f18r g8yaz89 “ullngn” “gmeagse” Aster alpinus Aster
f18v koe8 la (?) mad Telfairia Fluted pumpkin
f19r g1oy “poga” apium Polemonium coeruleum Greek Valerian
f19v go1am “albbora” mantle Draba nivalis Nailwort
f20r h81o89 “caveaca” woud Astragalus hypoglottis Milk vetch
f20v faIsay “crrote” greek Cynara cardunculus Cardoon
f21r g1oy “poga” apium Anagallis arvensis Pimpernel
f21v koe829 “laol” “madpe” Dictamnus albus Burning bush False Dittany White Dittany Gas Plant
f22r goe “albv” “maus” Verbena officinalis Common Vervain Holy Herb
f22v g9samoy “..dah” “hnshot” Tulip Tulip
f23r g9818op “.fhilo” “hsthlo” Pulsatilla vulgaris Pasque flower
f23v go8azoe “albzucv” “mapacus” Borago officinalis Borage Star Flower
f24r goyoy9 “alb..” “maby” Cucumis sativus Cucumber
f24v k1o8ay coyote (wild) rock Ficus religiosa Sacred Fig Bo Tree
f25r f1oe89 “sanoaca” “avd” Wild Thyme
f25v goCam “albcuora” “malile” Isatis tinctoria Woad
f26r g%coh9 “spnij” lunaria Prunella vulgaris Self heal
f26v g1c8ay pochote (Pochote) “apgok” Lens culinaris Lentil
f27r hsoy manga (Mango) “veium” Spinacia oleracea Spinach
f27v fo1ou oveja (?) eruca French Marigold Tagetes patula Dianthus superbus Dianthus
f28r g1o8ay “poyote” “apck” Aristolochia Smearwort Birthwort Pipevine
f28v h2oe pino (Pine) “hiv” Dahlia Dahlia imperialis Rhododendrons Rhododendrons
f29r gosam “alb.ora” “mansle” Lactuca sativa longifolia Romaine Cos Lettuce
f29v hoom “meic” “padi” Nigella sativa Roman coriander
f30r oh1cs9 “elanbo” “inrsum” Prunella vulgaris Healall
f30v Ks1an rubia (Madder) montana Cuscuta europaea Dodder
f31r hcc8c9 lichi (Lychee) “rgoio” Erigeron acris Fleabane
f31v go8az “albzon” “mapnn” Fernleaf yarrow Achillea filipendulina Valerian Valerian
f32r f1am santa (?) “aris” Veronica triphyllos Speedwell
f32v h1co8am “ranizora” “genple” Campanula rotundifolia Harebell
f33r k28ay “chizh” “arpt” Silene vulgaris Bladder Campion
f33v kayay “qllh” “opmet” Masterwort Astrantia major Tanacetum parthenium Feverfew
f34r g1cocj19 “ponianos” “apnbie” Anemone hortensis
f34v hs189 “mansn” “vewse” Lunaria annua Honesty Money Plant
f35r Koo anona (Custard Apple) amur Cichorium intybus Radicchio
f35v gay1oy “trtga” galium Ribes nigrum Blackcurrant
f36r j1af8aN “pa.nzti” “onupfl” Delphinium staphisagria Delphinium
f36v g1ayos9 “pooteesn” “apksise” Lamium amplexicaule Henbit
f37r koGoe “luiv” malus Mentha longifolia Mint
f37v h2o89 “piyn” “hicse” fedtschenkoi englerii Emilia fosbergii Tassel flower
f38r koeoy “lilh” “mmut”
f38v oh1oj “eveet” inula Euphorbia myrsinites Myrtle Spurge
f39r kc7o128 “goguadp” “gienmpot”
f39v g7aiy “inmh” “naft”
f40r g1c9 “poi” apio Erodium malacoides Storks bill
f40v j1c7an “pagmo” “oospo” Epiphyllum oxypetalum Crocus vernus Crocus
f41r j2c9hc8aecc9 “roilizrii” “ediorpcuio” Origanum vulgare Wild Marjoram
f41v hcSo8ae “lirbzv” “riupus” Coriandrum sativum Coriander Cilantro
f42r 2o “ah” st
f42v k1o˛ cola (?) rosa Aquilegia vulgaris Columbine Culverwort
f43r kayo8am “q.zora” “opbple” Stellaria media Chickweed
f43v g8saiy9 “u.lbn” “gnsicse” Elytrigia repens Couch grass
f44r k2o8g9 “chiy.” arch Mandragora officinarum Mandrake
f44v k2o china (Impatiens) “arur” Apium graveolens Celery
f45r g9h98ae “.jzv” “hariapus” Atriplex hortensis Orach Saltbush
f45v hosay9 “me..” pansy Lavandula angustifolia Lavender
f46r g1coJ9 “ponitr” “apnta” Leucanthemum vulgare Oxeye Daisy
f46v jo79e3c7 “rimvig” “andretos” Tanacetum parthenium, Chrysanthemum parthenium Inula conyza Ploughmans Spikenard Great Fleabane
f47r g1aiy “pomh” “apft” Lady’s Mantle, Lion’s Foot Alchemilla vulgaris Rosaceae Sempervivum tectorum Houseleek
f47v g2cok “dnier” minor Arnica montana Pulmonaria officinalis Lungwort
f48r g28am “dzora” “miple” Adonis Vernalis False Hellebore
f48v g1co819 “ponifn” “apnsse” Ruta graveolens Rue Herb of Grace
f49r gA2oe “ceahv” costus Nymphaea caerulea Blue Nile Lotus
f49v g he wort
f50r g2coy “dnih” mint Astrantia major Masterwort
f50v k19 con (?) rose Telopea speciosissima Gentiana frigida Stiff Gentian
f51r k2oe819 “chinofn” “arvsse” Cakile maritima Searocket
f51v go2o89 albahaca (Basil) “mastd” Salvia officinalis Sage
f52r k8oh1F9 “queacn” “toinnise” Anemone coronaria Poppy Anemone
f52v g1oy “poga” apium Polystichum setiferum Fern
f53r hA8ap “mazlo” “ciplo” Achillea Ptarmica Sneezewort
f53v k2oy3c9 “chigamin” “ariumocse” Hieracium aurantiacum Hawkweed
f54r go8am “albzora” maple Cirsium oleraceum Cabbage thistle
f54v g1co8ay “ponizh” “apnpt” Bittersweet Nightshade Solanum dulcamara Perovskia atriplicifolia Russian Sage
f55r go8am “albzora” maple Fumaria officinalis Fumitory
f55v h1C8189 “raecsn” “geriwse” Forest lily Veltheima bracteata Broccoli Broccoli
f56r ok1ae “tebv” “trntus” Drosera Sundews
f56v h1cok “ranier” “genor” Cycas revoluta Sago Palm
f57r joccoHc9 “riopei” “anomiaio” Sherardia arvensis Blue Field Madder
f65r Alchemilla vulgaris Ladies Mantle
f65v Centaurea cyanus Cornflower
f66v Satureja montana Winter Savory
f87r Satureja hortensis Summer Savory
f87v Senecio Primula vulgaris Primrose
f87v Kleinia Pedicularis flammea Lousewort Wood Bettony
f89v Actaea spicata Baneberry
f90r Conyza bonariensis Fleabane
f90v Eruca vesicaria Arugula Rocket
f93r Cynara cardunculus Artichoke
f93v Lupinus Lupin
f94r Botrychium lunaria Botrychium lunaria Moonwort Moonfern
f94v Agrostemma Githago Corncockle Red Campion
f94v Glycyrrhiza glabra Liquorice
f94v Plantago lanceolata Ribwort Plantain Kemps
f95r Berberis Sambucus nigra Elderberry
f95v Althaea Rosea Hollyhock
f96r Angelica archangelica Garden Angelica
f96v Tamus communis Black Bryony

Genetic Algorithm – f27v

February 26, 2010 1 comment

The text, in the Voyn_101 transcription, reads:

fo1ou 1of 1o3o soe9 2oe 9k1ay og1oy9 h1oX1oy 819 1hay ok19 29 29 &19 829 h19 1co 8ai89 819 h1c9 h19 &1oh19 82o 8▀y 1o829 ck1co89 2e8 oh1o 19 h1cc8 1e 1oe ho8 o oh2o 8o1ccsp 4oh9 2hccoy1o8ay 2hoe 1ok19 Ko8oe 82o h1sss ohC89 81s19 sok189 2o 2o9h1o 289 818 1s1s9 ok189 oh2cs Ah1oh29

There are a total of 50 “words” (groups of characters) in this text. The GA tries to maximise the score of a conversion of all 50 words into valid Latin words, by checking whether each converted word appears in a Latin word dictionary.
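For readers unfamiliar with the idea, the loop below is a bare-bones genetic algorithm for the simplest (1-1) case. It is a sketch under assumptions, not the program that produced the output in this post: the population size, uniform crossover, single-glyph mutation and the tiny word list and dictionary are all placeholders, and unlike the results shown here it does not force each Latin letter to be used only once.

```python
import random


def convert(word, mapping):
    """Apply a 1-1 glyph-to-letter mapping to one word."""
    return "".join(mapping[glyph] for glyph in word)


def score(mapping, words, dictionary):
    """Number of converted words that are dictionary entries."""
    return sum(convert(w, mapping) in dictionary for w in words)


def evolve(words, dictionary, alphabet, pop_size=30, generations=100, seed=0):
    rng = random.Random(seed)  # seeded so runs are repeatable
    glyphs = sorted({g for w in words for g in w})

    def random_mapping():
        return {g: rng.choice(alphabet) for g in glyphs}

    def crossover(a, b):
        # Uniform crossover: each glyph inherits its letter from either parent.
        return {g: (a if rng.random() < 0.5 else b)[g] for g in glyphs}

    def mutate(mapping):
        child = dict(mapping)
        child[rng.choice(glyphs)] = rng.choice(alphabet)
        return child

    def fitness(mapping):
        return score(mapping, words, dictionary)

    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # the best half survives unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


# Toy problem (hypothetical): three Voynich words and a three-word dictionary.
toy_words = ["am", "ay", "ae"]
toy_dictionary = {"et", "es", "ex"}
best = evolve(toy_words, toy_dictionary, "abcdefghijklmnopqrstuvwxyz")
print(score(best, toy_words, toy_dictionary), "of", len(toy_words))
```

The 1-2, 2-1, 2-2 and 3-2 cases differ only in what counts as a source unit and a replacement unit.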

1) Allowing 1-1 mapping i.e. 1 Voynich character maps to 1 Latin character

0 0.2278094451386928 GA$Chromosome@1a07791 prob=1.5168475878361884 Good=16 / 50 = 32.0%
S: o 1 9 8 h c 2 s y k a e & f X p A g u i K C 3 4 ▀
R: i e o r n c u d l g p s b a f q h x y µ m t j v z
aieiy eia' eiji diso uis ogepl ixeilo neifeil reo enpl igeo uo uo beo' ruo' neo' eci rpµro reo neco' neo' beineo rui' rzl
eiruo cgeciro usr inei eo' neccr es' eis' nir i' inui rieccdq vino' unccileirpl unis' eigeo miris' rui' neddd intro' redeo' digero'
ui uionei uro rer ededo igero inucd hneinuo

2) 1-2 mapping i.e. each Voynich character can map into 1 or 2 Latin characters

0 0.1504950444038535 GA$Chromosome@8a2f6b prob=1.4969215520914312 Good=15 / 50 = 30.0%
S: o 1  9 8 h c 2 s y k a e &  f  X  p A  g  u i K  C  3  4  ▀
R: i e is c p n s l a o r t m us er ti b in re u d tu nt it en
usieire eius' einti litis' sit' isoera iineiais peiereia ceis epra ioeis sis' sis' meis' csis peis eni crucis' ceis penis'
peis meipeis csi cena' eicsis noenicis stc ipei eis' pennc et' eit pic i' ipsi' ciennlti itipis spnniaeicra spit eioeis dicit' csi
pelll iptucis celeis lioecis si' siispei scis' cec elelis ioecis ipsnl bpeipsis

3) 2-1 mapping i.e. each Voynich character or pair of characters maps to 1 Latin character

0 0.1600852807737927 GA$Chromosome@2ccccf prob=1.410155608070086 Good=15 / 50 = 30.0%
S: 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a o 1 9 8 h c 2 s y k a e
R:  i  d  e  l  r  m  b  f  t  g  q  x  p  y  v  c  s  j o z a u µ w h k £ + = n
|oi| i| i|o kca vn arx' o|i£a do|i£ ul zµx pl' s' s' |l us' da' to' j|b ul dwa da' |ida uv u|£ ius' wrqb hnu ei' l' dgu zn
in' µm o' ev uotwk| |ea hµgfij£ hµc' ira' |mc uv dkkk e|b uyl kpzb v' vado' hb uzu yya pzb ehwk |des

4) 2-2 mapping i.e. one or two Voynich characters maps to one or two Latin characters

0 0.6279610718846604 GA$Chromosome@47a prob=1.4749867245934525 Good=14 / 50 = 28.0%
S: 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe 29 8a o  1 9 8 h c 2 s  y k a e
R:  i  e  l is at in  s  d  n  a tu re er  u ti  p um  b o en m t r c g f us nt it v
|oi| i| i|o fpm tiv matre' o|iusm eo|ius tis enrre eris' um um |is tum' em no' b|s tis ecm em |iem tti t|us itum' cattus'
gvt li is' eat' env iv' rin o' lti toncf| |lm gradibus' grp iatm |inp tti efff l|s tuis' ferens' ti timeo' gs tent uum erens lgcf |e
lum

5) 3-2 mapping i.e. one, two or three Voynich characters maps to one or two Latin characters

0 0.40567669613807406 GA$Chromosome@1a9fa4c prob=1.467899375005982 Good=18 / 50 = 36.0%
S: h1o ok1 1o h1 oh 19 k1 o8 89 oy 1c cc co ay ok 1s 2o oe o  1 9 8 h  c 2 s y k a e
R:   i  er  e  a  s it  l  r us  m  o  p at  c in en is um b tu t v n re u f d nt g ti
|be| e| e|b fumt isti' tlc b|edt i|ed vit tunc' ert ut' ut' |it vut at' ob' vg|us vit aret at' |eat vis' v|d evut relatus'
utiv se' it' apv tuti' eti nr b' sis' vboref| |st unpmevc unum' elt |rum vis' afff s|us venit' ferus' is' isti' uus vtuv enent erus' suref |inut

Explanation:

A ‘ sign following a word indicates that the word is valid Latin. A | (vertical bar) in place of a character indicates that the Voynich character(s) have no mapping defined into Latin – the Latin character could be anything.

The S: and R: lines show the Voynich characters (S) and their replacements (R) respectively.
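As a check on how to read the output, the S: and R: lines can be turned into a lookup table and applied to a word. The sketch below uses the 1-1 mapping from result 1. The greedy longest-match strategy is my assumption about how multi-character sources would be handled, and `parse_mapping`, `translate` and the one-word dictionary are hypothetical names and data.

```python
S_LINE = "S: o 1 9 8 h c 2 s y k a e & f X p A g u i K C 3 4 ▀"
R_LINE = "R: i e o r n c u d l g p s b a f q h x y µ m t j v z"


def parse_mapping(s_line, r_line):
    """Pair each source token on the S: line with the token below it on R:."""
    sources = s_line.split()[1:]  # drop the "S:" label
    targets = r_line.split()[1:]  # drop the "R:" label
    return dict(zip(sources, targets))


def translate(word, mapping, dictionary=frozenset()):
    """Greedy longest-match conversion. Unmapped characters become '|', and a
    trailing apostrophe marks a result found in the dictionary."""
    longest = max(len(source) for source in mapping)
    out, i = [], 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            if word[i:i + n] in mapping:
                out.append(mapping[word[i:i + n]])
                i += n
                break
        else:
            out.append("|")
            i += 1
    result = "".join(out)
    return result + "'" if result in dictionary else result


mapping = parse_mapping(S_LINE, R_LINE)
print(translate("fo1ou", mapping))          # aieiy
print(translate("1of", mapping, {"eia"}))   # eia'
```

The first two words of f27v come out as “aieiy” and “eia'”, matching the start of the output for result 1.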

Prefix Stem and Suffix Analysis

February 26, 2010 2 comments

I grouped all the folios from f1v to f20v inclusive and labeled the group “Herbal folios”, and grouped folios f103r to f116r inclusive as “Recipe folios”. I ran each group through a program that extracts all the prefixes, suffixes and stems, validates each, and orders them by frequency. (The method used was described in an earlier email to the list.) My first question was: are the word frequencies and prefix/stem/suffix (PSS) frequencies similar between the Herbal and Recipe collections?

Here are the results. I’ll show only the suffix frequencies, because they are the most interesting.

Herbal: 1331 different words, top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy" 
Recipe: 1443 different words, top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89" 

Top 10 Herbal Suffixes  (Frequency) 

9       0.105580695 
89      0.065862246 
y       0.06435395 
e       0.058320764 
am      0.04524887 
m       0.03167421 
s       0.027652087 
19      0.025641026 
8       0.023629965 
oy      0.02212167 

Top 10 Recipe Suffixes 

9       0.11764706 
e       0.05882353 
89      0.05020284 
y       0.04817444 
am      0.036511157 
8       0.029411765 
ay      0.028904665 
ae      0.024340771 
oy      0.023326572 
oe      0.021805273 

Note: the two sets are similar (7 of the 10 suffixes are shared), with suffix “9” being approximately a factor of two more common than the next most common suffix. I’m not sure what conclusions can be drawn, if any, from this. For fun, I applied the same analysis to a similar number of words from Augustinus Latin. Here are the results, together with the VMs data:


(Augustinus: 1257 different words, top 10 words: "et te in non me mihi est domine ut enim") 

Top 10 Latin Suffixes 

m       0.118421055 
s       0.10526316 
e       0.047368422 
que     0.039473683 
i       0.034210525 
o       0.028947368 
t       0.028947368 
us      0.02631579 
rum     0.021052632 
a       0.021052632 

So, Latin does not have the same frequency pattern at all. Is there a language which does have a similar pattern? I looked at French from 1367, Spanish from 1527, German from 1553, and old English (Courtier):

Top 10 French Suffixes 

s    0.1199 
t    0.0736 
z    0.0708 
e    0.0654 
nt    0.0463 
es    0.0436 
l    0.0245 
r    0.0218 
re    0.0191 
er    0.0191 
tre    0.0163 

Top 10 Spanish Suffixes 

s    0.1874 
n    0.0519 
o    0.0464 
a    0.0445 
r    0.0297 
do    0.0297 
es    0.0241 
l    0.0223 
e    0.0223 
va    0.0204 
to    0.0148 

Top 10 German Suffixes 

en    0.1171 
t    0.1171 
s    0.1122 
n    0.0537 
er    0.0390 
ten    0.0341 
d    0.0341 
e    0.0293 
m    0.0293 
ts    0.0244 
r    0.0244 

Top 10 English Suffixes 

e    0.1404 
n    0.0449 
s    0.0421 
t    0.0393 
re    0.0337 
y    0.0281 
ne    0.0253 
l    0.0253 
r    0.0253 
ll    0.0253 
ed    0.0225 

The Spanish suffix “s” is three times more frequent than the next suffix: not a good match to the VMs. Similarly for the English “e”. The German suffix pattern is completely different to the VMs. The French pattern looks similar to the VMs. Let’s look at the French Stems, and compare with the VMs:


Top 10 Herbal Stems 

o       0.15171504 
9       0.058377307 
8       0.045184698 
k       0.04287599 
1o      0.040567283 
oe      0.036609497 
o8      0.028364116 
oy      0.026385223 
y       0.02176781 
2       0.02176781 

Top 10 French Stems 

a       0.0704 
d       0.0544 
es      0.0528 
en      0.0448 
le      0.0432 
se      0.032 
ent     0.0304 
de      0.0272 
ce      0.0272 
ne      0.0256 

A poor match.

Conclusion: the “9” suffix in the VMs appears too frequently for it to come from Latin, German, English or Spanish. Although French has a similarly frequent suffix “s”, the stem frequencies of French don’t match the VMs.

Hypothesis: the “9” suffix in the VMs is not a word suffix, but punctuation or some other annotation. Perhaps a key mark for deciphering purposes. Next step: re-analyse the PSS frequencies in the VMs after removing suffix “9” from words where it appears.
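The preprocessing for that next step is tiny. A sketch, assuming only a single final “9” is removed and that a word consisting of just “9” is left alone (both choices are mine); it is applied here to the Herbal top 10 words listed earlier.

```python
def strip_final_9(word):
    """Remove one trailing '9', unless the word is just '9'."""
    return word[:-1] if len(word) > 1 and word.endswith("9") else word


herbal_top10 = "8am 1oe 1oy K9 89 19 s 8ay 2oe oy".split()
stripped = [strip_final_9(w) for w in herbal_top10]
print(stripped)
# ['8am', '1oe', '1oy', 'K', '8', '1', 's', '8ay', '2oe', 'oy']
```

The stripped word list would then be fed back through the same prefix/stem/suffix program.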

Using the Biological and Astrological Folios

Astrological: folios 66v to 73v inclusive

Biological: folios 75r to 85r inclusive

Herbal: 1331 different words,       top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy"
Recipe: 1443 different words,       top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89" 
Astrological: 1771 different words, top 10 words: "ay am ae 8am s 8ay 8ae 89 okcos ohC9" 
Biological: 2135 different words,   top 10 words: "oe 4ohan 1c89 2c89 4ohc89 4oe 4ohae 1c9 4oham" 

Top 10 Herbal Suffixes  (Frequency) 

9       0.105580695 
89      0.065862246 
y       0.06435395 
e       0.058320764 
am      0.04524887 
m       0.03167421 
s       0.027652087 
19      0.025641026 
8       0.023629965 
oy      0.02212167 

Top 10 Recipe Suffixes 

9       0.11764706 
e       0.05882353 
89      0.05020284 
y       0.04817444 
am      0.036511157 
8       0.029411765 
ay      0.028904665 
ae      0.024340771 
oy      0.023326572 
oe      0.021805273 

Top 10 Astrological Suffixes 

9       0.120173536 
89      0.055531453 
am      0.046420824 
ay      0.04381779 
s       0.04295011 
ae      0.04251627 
e       0.040347073 
79      0.026898047 
y       0.022993492 
oe      0.022125814 

Top 10 Biological Suffixes 

9       0.11961975 
89      0.049643517 
e       0.038288884 
oe      0.031687353 
y       0.030102983 
c89     0.029838923 
ae      0.0293108 
c9      0.0293108 
oy      0.02719831 
ay      0.02508582 


The suffix frequency results for the different folio groups look reassuringly similar to me: the differences are what you would see if you compared two modestly sized texts in, say, English. Indeed, one can tentatively conclude that the language is the same in all four of the VMs sections. On the other hand, the top 10 word lists are quite different. Curious.

Regarding word stems: the definition of a word stem for this study is “any group of characters that spells a valid word by itself, and is also found following one or more other characters (a prefix) and/or followed by one or more other characters (a suffix).” So, single VMs characters can be stems. After all, it may be that a single VMs character equates to multiple plaintext characters, so we have to have the flexibility to assign single characters as stems.

To clarify, take for example the VMs word “8am”. The candidate stems are “8am”, “8a”, “am”, “8”, “a” and “m”. Those candidates that appear as single words in the VMs dictionary are classed as valid stems (in this case, I believe all six are valid stems).

Once we have a list of all the valid stems in the text, we can count how often each appears, and then order that list. This is what is done to obtain the lists above.

Because this method is fully general, we avoid any assumptions about how many characters a single VMs character maps to.
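The stem procedure just described can be sketched in a few lines. Assumptions on my part: the “VMs dictionary” is taken to be the set of distinct words in the text, every occurrence of a valid stem inside a word counts once, and frequencies are normalised by the total count. The function names and the three-word toy text are hypothetical.

```python
from collections import Counter


def candidate_stems(word):
    """All contiguous substrings of a word, including the word itself."""
    return {word[i:j] for i in range(len(word))
            for j in range(i + 1, len(word) + 1)}


def stem_frequencies(words):
    """Normalised frequency of every candidate stem that is itself a word."""
    vocabulary = set(words)
    counts = Counter()
    for word in words:
        for stem in candidate_stems(word):
            if stem in vocabulary:
                counts[stem] += 1
    total = sum(counts.values())
    return {stem: n / total for stem, n in counts.items()}


print(sorted(candidate_stems("8am")))
# ['8', '8a', '8am', 'a', 'am', 'm']

freq = stem_frequencies(["8am", "am", "m"])  # toy text of three words
print(freq["m"])  # 0.5
```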

Refinement

I changed the algorithm so that it only accumulates prefixes/stems/suffixes for unique words in the VMs (as opposed to accumulating them for all words). I think this is more sensible, because otherwise a very popular word skews the statistics. After doing this, the results for suffixes look similar between Latin and VMs (Recipes) – using 3800 words:

Top 20 Latin Suffixes (from a Latin dictionary)

s 0.08350305
o 0.042769857
t 0.03971487
m 0.034623217
is 0.029531568
e 0.02749491
us 0.026476579
a 0.022403259
es 0.020366598
rum 0.01934827
um 0.018329939
tum 0.017311608
mus 0.017311608
to 0.017311608
i 0.01629328
tus 0.01629328
tis 0.015274949
c 0.014256619
em 0.013238289
am 0.013238289

Top 20 Herbal Suffixes

9 0.094210714
89 0.045487236
e 0.040273283
ay 0.036857247
y 0.036857247
am 0.03613808
ae 0.029126214
an 0.028047465
oe 0.024631428
79 0.023552679
oy 0.023013305
8 0.023013305
o 0.020316433
ap 0.019417476
c89 0.018878102
c9 0.017979145
s 0.017799353
m 0.015462064
o89 0.014383315
19 0.01366415

This suggests the following (partial) cipher:

VMs Latin
=== =====
9 s
8 i
7 u
e m
a r
o a
y um
m is

1 t 
4 qu
c e
g f
k c
2 d
s p
h n
3 h


Top 20 VMs words translated

am -> ris
ay -> rum
ae -> rm
1c89 -> teis
4ohC9 -> quan?s
1c9 -> tes
oe -> am
4oham -> quanris
8am -> iris
4ohan -> quanr?
oham -> anris
okam -> acris
oy -> aum
an -> r?
ohan -> anr?
e -> m
2c89 -> dkis
1c79 -> tkus
ohC9 -> an?s
okay -> acrum
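Applying the table is a character-by-character lookup. A minimal sketch, with the table above typed in as a dictionary and any VMs character that has no entry rendered as “?”; the names `PARTIAL_CIPHER` and `translate` are mine.

```python
# The (partial) cipher table from above, VMs character -> Latin letters.
PARTIAL_CIPHER = {
    "9": "s", "8": "i", "7": "u", "e": "m", "a": "r", "o": "a",
    "y": "um", "m": "is", "1": "t", "4": "qu", "c": "e", "g": "f",
    "k": "c", "2": "d", "s": "p", "h": "n", "3": "h",
}


def translate(word, cipher=PARTIAL_CIPHER):
    """Replace each VMs character by its Latin value, or '?' if unknown."""
    return "".join(cipher.get(ch, "?") for ch in word)


for vms_word in ["am", "ay", "1c89", "4ohC9", "okay"]:
    print(vms_word, "->", translate(vms_word))
# am -> ris
# ay -> rum
# 1c89 -> teis
# 4ohC9 -> quan?s
# okay -> acrum
```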

Looking for longer repeating character sequences

In this analysis, the software looks in the text for all nGrams that appear (a) at least twice as a prefix, (b) at least twice as a suffix, or (c) at least once as a stem, and calculates their (normalised) frequencies. I’m not sure what to make of the results!
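The prefix and suffix part of this search can be sketched as below. It is an approximation under assumptions: only distinct words longer than N are considered, an nGram must occur at least `min_count` times to be kept, and frequencies are normalised over the kept nGrams. Stems are left out of the sketch. The function name and the toy word list are hypothetical.

```python
from collections import Counter


def ngram_affixes(words, n, min_count=2):
    """Normalised frequencies of length-n prefixes and suffixes that occur
    in at least min_count distinct words."""
    prefixes, suffixes = Counter(), Counter()
    for word in set(words):
        if len(word) > n:  # the nGram must leave part of the word over
            prefixes[word[:n]] += 1
            suffixes[word[-n:]] += 1

    def normalise(counter):
        kept = {gram: c for gram, c in counter.items() if c >= min_count}
        total = sum(kept.values())
        return {gram: c / total for gram, c in kept.items()}

    return normalise(prefixes), normalise(suffixes)


# Toy word list, not real folio data.
toy_words = ["4okam", "4okay", "4ohay", "4ohan", "1okay"]
prefixes, suffixes = ngram_affixes(toy_words, n=3)
print(prefixes)  # two prefixes kept, each with frequency 0.5
print(suffixes)  # {'kay': 1.0}
```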


For N=3, looking at the Herbal folios f1v-f20v inclusive, 1331 different words. 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     0.1010101               o89     0.05952381              o89     0.09009009 
4oh     0.07070707              1oe     0.055555556             8am     0.09009009 
1oe     0.060606062             4ok     0.055555556             1c9     0.054054055 
1oh     0.04040404              8am     0.04761905              1oy     0.054054055 
ok1     0.04040404              4oh     0.04761905              1oe     0.045045044 
8oe     0.030303031             1oy     0.03968254              coe     0.036036037 
1oy     0.030303031             1c9     0.031746034             cc9     0.027027028 
1co     0.030303031             1co     0.023809524             e89     0.027027028 
1ok     0.030303031             8oe     0.023809524             ham     0.027027028 
4oj     0.030303031             coe     0.01984127              2c9     0.027027028 

For N=3, processing the same number of different words from Thomas Hardy (English) 

Confirmed valid prefix/stem/suffix counts 87 160 67 
Prefix/Stem/Suffix frequency, normalised 
com     0.04597701              ely     0.025           ing     0.07462686 
par     0.022988506             ted     0.025           led     0.04477612 
rea     0.022988506             led     0.025           sed     0.04477612 
mot     0.022988506             sed     0.025           ely     0.04477612 
pla     0.022988506             ght     0.025           ted     0.029850746 
see     0.022988506             ing     0.01875         ter     0.029850746 
pas     0.022988506             ked     0.01875         son     0.029850746 
wai     0.022988506             per     0.01875         ned     0.029850746 
can     0.022988506             com     0.01875         ner     0.029850746 
smi     0.022988506             par     0.01875         mon     0.029850746 

For N=3, same number of words from Augustinus (Latin) 

Confirmed valid prefix/stem/suffix counts 102 197 83 
Prefix/Stem/Suffix frequency, normalised 
qua     0.039215688             ere     0.05076142              ere     0.04819277 
fac     0.029411765             qua     0.035532996             iat     0.04819277 
qui     0.029411765             fac     0.02538071              que     0.036144577 
dic     0.029411765             ita     0.02538071              ius     0.036144577 
pot     0.029411765             ius     0.02538071              ita     0.036144577 
ter     0.019607844             que     0.020304568             rum     0.024096385 
ali     0.019607844             dic     0.020304568             ent     0.024096385 
aud     0.019607844             ini     0.020304568             ram     0.024096385 
par     0.019607844             ans     0.015228426             unt     0.024096385 
cor     0.019607844             ent     0.015228426             ris     0.024096385 


For N=4 Voynich (statistics become poorer as N increases, of course) 

Confirmed valid prefix/stem/suffix counts 6 14 6 
Prefix/Stem/Suffix frequency, normalised 
4oko    0.16666667              o8ae    0.14285715              co89    0.16666667 
okam    0.16666667              okam    0.14285715              e8am    0.16666667 
oh2o    0.16666667              4ok1    0.071428575             o8an    0.16666667 
4okc    0.16666667              4oh1    0.071428575             e2oe    0.16666667 
k2co    0.16666667              co89    0.071428575             9koy    0.16666667 
4ohC    0.16666667              4oko    0.071428575             oKoy    0.16666667 
4ok1    0.0                     e8am    0.071428575             1o89    0.0 
4oh1    0.0                     oh2o    0.071428575             oe89    0.0 
ok1c    0.0                     o8an    0.071428575             o8ae    0.0 
ohoe    0.0                     4okc    0.071428575             ho89    0.0 

For N=4 English 

Confirmed valid prefix/stem/suffix counts 36 66 26 
Prefix/Stem/Suffix frequency, normalised 
pres    0.055555556             ined    0.045454547             sing    0.115384616 
dist    0.055555556             ring    0.045454547             ined    0.115384616 
weak    0.055555556             test    0.045454547             ally    0.07692308 
occa    0.055555556             ment    0.030303031             ring    0.03846154 
outl    0.027777778             pres    0.030303031             ence    0.03846154 
prob    0.027777778             sing    0.030303031             nded    0.03846154 
ment    0.027777778             weak    0.030303031             ding    0.03846154 
cons    0.027777778             prob    0.030303031             ning    0.03846154 
atte    0.027777778             hern    0.030303031             ness    0.03846154 
stan    0.027777778             sion    0.030303031             wing    0.03846154 

For N=4 Latin 

Confirmed valid prefix/stem/suffix counts 63 126 57 
Prefix/Stem/Suffix frequency, normalised 
faci    0.06349207              bant    0.03968254              ntes    0.0877193 
pecc    0.04761905              ntes    0.03968254              quam    0.05263158 
invo    0.031746034             faci    0.031746034             endo    0.05263158 
cred    0.031746034             pecc    0.031746034             ebam    0.03508772 
infa    0.031746034             endo    0.023809524             erem    0.03508772 
puer    0.031746034             ndis    0.023809524             iens    0.03508772 
habe    0.031746034             quam    0.023809524             ones    0.03508772 
form    0.031746034             quid    0.023809524             bant    0.01754386 
pare    0.031746034             rati    0.023809524             abam    0.01754386 
nesc    0.031746034             ibus    0.015873017             ndis    0.01754386 

For N=5 Voynich (no data satisfies selection) 

For N=5 English 

Confirmed valid prefix/stem/suffix counts 15 29 13 
Prefix/Stem/Suffix frequency, normalised 
consi   0.13333334              ation   0.06896552              ation   0.15384616 
ornam   0.13333334              consi   0.06896552              sting   0.15384616 
appea   0.06666667              ornam   0.06896552              dered   0.07692308 
dimen   0.06666667              sting   0.06896552              ality   0.07692308 
occup   0.06666667              still   0.06896552              ingly   0.07692308 
stand   0.06666667              dered   0.03448276              ental   0.07692308 
conce   0.06666667              ingly   0.03448276              rning   0.07692308 
sugge   0.06666667              dimen   0.03448276              ented   0.07692308 
diffe   0.06666667              occup   0.03448276              rence   0.07692308 
speci   0.06666667              ality   0.03448276              sions   0.07692308 

For N=5 Latin 

Confirmed valid prefix/stem/suffix counts 21 44 23 
Prefix/Stem/Suffix frequency, normalised 
volun   0.0952381               entes   0.06818182              entes   0.13043478 
pecca   0.0952381               batur   0.045454547             batur   0.08695652 
lauda   0.0952381               tibus   0.045454547             antur   0.08695652 
quaer   0.0952381               invoc   0.045454547             tibus   0.08695652 
metue   0.0952381               pecca   0.045454547             bamus   0.08695652 
invoc   0.04761905              lauda   0.045454547             torum   0.08695652 
infan   0.04761905              quaer   0.045454547             tatis   0.04347826 
inven   0.04761905              volun   0.045454547             itate   0.04347826 
nesci   0.04761905              metue   0.045454547             antes   0.04347826 
paren   0.04761905              bamus   0.045454547             bilis   0.04347826 
 
Here are the N=3 counts/frequency for the 1331 unique words in f1v-f20v of the Herbal: 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     10      0.1010101               o89     15      0.05952381              o89     10      0.09009009 
4oh     7       0.07070707              1oe     14      0.055555556             8am     10      0.09009009 
1oe     6       0.060606062             4ok     14      0.055555556             1c9     6       0.054054055 
1oh     4       0.04040404              8am     12      0.04761905              1oy     6       0.054054055 
ok1     4       0.04040404              4oh     12      0.04761905              1oe     5       0.045045044 
8oe     3       0.030303031             1oy     10      0.03968254              coe     4       0.036036037 
1oy     3       0.030303031             1c9     8       0.031746034             cc9     3       0.027027028 
1co     3       0.030303031             1co     6       0.023809524             e89     3       0.027027028 
1ok     3       0.030303031             8oe     6       0.023809524             ham     3       0.027027028 
4oj     3       0.030303031             coe     5       0.01984127              2c9     3       0.027027028 


(e.g. the sequence "4ok" appears 10 times at the start of a longer word (prefix)) 
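
As a rough illustration of how counts like these could be produced, here is a minimal Python sketch. It is not the code used for the tables above. The rule for what counts as "confirmed valid" is not restated here, so the sketch assumes the simplest reading: a prefix is the first N glyphs of a longer word, a suffix is the last N glyphs, and a "stem" is the sequence at any position. The function names `affix_counts` and `normalise` are made up for this example.

```python
from collections import Counter

def affix_counts(words, n=3):
    """Count n-glyph sequences at the start, anywhere, and at the end of longer words."""
    prefixes, stems, suffixes = Counter(), Counter(), Counter()
    for word in set(words):            # work on unique words only
        if len(word) <= n:             # assumption: only words longer than n qualify
            continue
        prefixes[word[:n]] += 1
        suffixes[word[-n:]] += 1
        for i in range(len(word) - n + 1):
            stems[word[i:i + n]] += 1  # assumption: "stem" means any position
    return prefixes, stems, suffixes

def normalise(counter):
    """Divide each count by the column total."""
    total = sum(counter.values())
    return {gram: count / total for gram, count in counter.items()}

prefixes, stems, suffixes = affix_counts(["4okam", "4oko89", "o89", "1o89"])
print(prefixes.most_common(3))
```

The normalisation matches the table: "4ok" occurs 10 times out of a prefix total of 99, and 10/99 is about 0.1010101.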

N=3 for 1331 unique words in the Astrological Section 

Confirmed valid prefix/stem/suffix counts 154 346 153 
Prefix/Stem/Suffix frequency, normalised 
okc     11      0.071428575             o89     16      0.046242774             o89     13      0.08496732 
ohc     8       0.051948052             okc     11      0.031791907             cos     6       0.039215688 
4oh     7       0.045454547             8ae     11      0.031791907             8am     6       0.039215688 
9hc     7       0.045454547             1co     10      0.028901733             8ae     6       0.039215688 
oko     6       0.038961038             oko     10      0.028901733             cc9     4       0.026143791 
oka     6       0.038961038             oho     9       0.02601156              coe     4       0.026143791 
oho     5       0.032467533             ohc     8       0.023121387             o79     4       0.026143791 
1ok     5       0.032467533             oka     8       0.023121387             oh9     4       0.026143791 
oh1     5       0.032467533             4oh     8       0.023121387             c79     4       0.026143791 
1co     4       0.025974026             9hc     7       0.020231213             c89     3       0.019607844 


N=3 for 1331 unique words in the Biological Section 

Confirmed valid prefix/stem/suffix counts 124 275 124 
Prefix/Stem/Suffix frequency, normalised 
4oh     13      0.10483871              c89     26      0.094545454             c89     17      0.13709678 
4ok     10      0.08064516              4oh     20      0.07272727              c79     13      0.10483871 
4oe     8       0.06451613              c79     13      0.047272727             1c9     9       0.07258064 
oeh     6       0.048387095             4ok     12      0.043636363             C89     7       0.05645161 
oe1     5       0.04032258              1c9     11      0.04                    2c9     7       0.05645161 
ohc     4       0.032258064             2c9     9       0.03272727              189     4       0.032258064 
soe     4       0.032258064             4oe     8       0.02909091              eoy     3       0.024193548 
oe2     3       0.024193548             oeh     7       0.025454545             cc9     3       0.024193548 
91c     3       0.024193548             8ae     7       0.025454545             hC9     3       0.024193548 
8ay     3       0.024193548             8ay     7       0.025454545             ae9     3       0.024193548 


N=3 for 1331 unique words in the Recipes Section 

Confirmed valid prefix/stem/suffix counts 135 303 143 
Prefix/Stem/Suffix frequency, normalised 
4oh     17      0.12592593              4oh     18      0.05940594              c89     13      0.09090909 
4ok     14      0.1037037               4ok     17      0.05610561              o89     13      0.09090909 
ohc     9       0.06666667              o89     16      0.052805282             189     8       0.055944055 
okc     8       0.05925926              c89     15      0.04950495              c79     7       0.04895105 
oeh     7       0.05185185              oeh     10      0.0330033               8am     7       0.04895105 
1co     5       0.037037037             1co     10      0.0330033               8ay     6       0.04195804 
g1c     4       0.02962963              ohc     9       0.02970297              coe     5       0.034965035 
4oj     4       0.02962963              c79     9       0.02970297              8ae     5       0.034965035 
ohC     4       0.02962963              8ae     9       0.02970297              1c9     4       0.027972028 
1oe     3       0.022222223             189     9       0.02970297              cc9     4       0.027972028 

Philip Neal’s Anagram Encryption

Notice how words tend to start with “4”, “o” and “1” and tend to end with “9”, “m” and “e”. This sort of feature has me excited about Philip Neal’s anagram encryption idea, explained here: http://voynichcentral.com/users/philipneal/language.html. He summarises it thus (quoting from that page):

  "1. Divide a plaintext into lines 
   2. Sort the words of each line into alphabetical order 
   3. Sort the letters of each word into alphabetical order 

   1. one thing led to another thing last night 
   2. another last led night one to thing thing 
   3. aehnort alst del ghint eno ot ghint ghint" 
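
The three quoted steps can be sketched in a few lines of Python. This is only my reading of the summary, with a made-up function name (`anagram_encrypt`) and plain lexicographic sorting assumed for both words and letters.

```python
def anagram_encrypt(line):
    """Sort the words of one line, then sort the letters inside each word."""
    words = sorted(line.lower().split())          # step 2: words in alphabetical order
    return " ".join("".join(sorted(word)) for word in words)  # step 3: letters in order

print(anagram_encrypt("one thing led to another thing last night"))
```

This prints "aehnort alst del ghint eno ghint ghint ot". A strict lexicographic sort puts "thing" before "to", so the last three words come out in a slightly different order from the quoted example.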


Right now I am repurposing my Genetic Algorithm to attack some lines of the VMs assuming such an encryption. The permutations are what kill me: the number of candidate orderings for a word grows as the factorial of its length.
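
To put numbers on the factorial growth, here is a small sketch that counts the distinct rearrangements of a single sorted word, allowing for repeated letters. The function name `distinct_anagrams` is made up for illustration and is not part of the Genetic Algorithm code.

```python
from collections import Counter
from math import factorial

def distinct_anagrams(word):
    """Number of distinct letter orderings: n! divided by k! for each repeated letter."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total

for word in ["ot", "del", "ghint", "aehnort"]:
    print(word, distinct_anagrams(word))
```

A seven-letter word with no repeated letters already has 5040 candidate plaintexts, and the candidates for a whole line multiply together word by word, before the word order within the line is even considered.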

Vowel/Consonant Group Encryption Scheme

February 26, 2010 1 comment

This idea is embryonic at the moment, so I would welcome comments – and also pointers to prior art if it exists.

In this method, to encrypt a word, we first separate the vowels from the consonants, and then sort each group of letters alphabetically. (This is similar to Philip Neal’s alphabetic anagrams idea.)

For example:

aldebaran => aaae bdlnr
pleiades  => aeei dlps
algol     => ao gll

Then we label each group of vowels with a shorthand. For example:


aaae => 4o
aeei => 1o
ao   => 8a

For the consonants, we perhaps use a simple monoalphabetic cipher:

bdlnr => cemos
dlps  => emqt
gll   => hmm

The encrypted word examples are then:


aldebaran => 4ocemos
pleiades  => 1oemqt
algol     => 8ahmm
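
The worked examples above can be reproduced with a short Python sketch. Two things in it are assumptions on my part: the vowel-group table holds only the three shorthand labels shown above (any other group would need its own label), and the consonant cipher is taken to be a shift of one letter forward, which is consistent with the three examples but is only one possible monoalphabetic cipher. All names are made up for illustration.

```python
VOWELS = set("aeiou")

# Only the three shorthand labels from the examples; anything else is undefined.
VOWEL_LABELS = {"aaae": "4o", "aeei": "1o", "ao": "8a"}

def shift_consonant(ch):
    """Assumed monoalphabetic cipher: move each letter one place forward (b -> c)."""
    return chr((ord(ch) - ord("a") + 1) % 26 + ord("a"))

def split_groups(word):
    """Return the sorted vowels and the sorted consonants of a word."""
    vowels = "".join(sorted(c for c in word if c in VOWELS))
    consonants = "".join(sorted(c for c in word if c not in VOWELS))
    return vowels, consonants

def encrypt(word):
    """Vowel-group label followed by the enciphered, sorted consonants."""
    vowels, consonants = split_groups(word.lower())
    return VOWEL_LABELS[vowels] + "".join(shift_consonant(c) for c in consonants)

for word in ["aldebaran", "pleiades", "algol"]:
    print(word, "=>", encrypt(word))
```

Run on the three star names, this gives 4ocemos, 1oemqt and 8ahmm, as above. A word whose vowel group is not in the table raises a KeyError, which is a reminder that the full scheme needs a label for every vowel group that occurs.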

The appeal (at first sight) of this scheme is that it might explain the strange prefix/stem/suffix or crust/mantle features of the VMs words, and also the repeated words.

To lend some support to it, we can look at the frequency distribution in the VMs of the first pair of glyphs in each word, and compare the distribution with the ordered vowel lists from words in a plaintext language.

For the Herbal, the distribution looks like this (top 10 only shown):

(order, glyph pair in Voyn101 encoding, number of occurrences, normalised frequency)


1: 4o 303 0.08426029
2: 1o 218 0.060622916
3: oh 159 0.044215795
4: 1c 153 0.042547274
5: ok 143 0.03976641
6: oe 103 0.028642936
7: 8a 94 0.026140155
8: 9h 93 0.02586207
9: 2o 84 0.023359288
10: 9k 75 0.020856507


Words in an English dictionary:


1: ae 779 0.08168187
2: ei 528 0.055363324
3: eo 512 0.053685647
4: ee 475 0.049806017
5: e 343 0.03596519
6: a 320 0.03355353
7: aei 309 0.032400128
8: aee 299 0.031351577
9: i 286 0.029988466
10: o 279 0.029254483

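The normalised frequency distributions look comparable: the top entry accounts for roughly 8% in both lists (0.084 for the Herbal, 0.082 for English), and both tail off to around 2 to 3% by the tenth entry.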
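
For anyone wanting to repeat the comparison, here is a minimal sketch of how both tables could be tallied from word lists. It is an illustration and not the code used above. It assumes the vowels are a, e, i, o, u, skips words with no vowels, and normalises each count by the total number of counted words. The function names are made up.

```python
from collections import Counter

VOWELS = set("aeiou")

def vowel_group(word):
    """Sorted vowels of a plaintext word, e.g. 'pleiades' -> 'aeei'."""
    return "".join(sorted(c for c in word.lower() if c in VOWELS))

def top_frequencies(keys, top=10):
    """Return (key, count, normalised frequency) for the most common keys."""
    counts = Counter(k for k in keys if k)   # empty keys are skipped
    total = sum(counts.values())
    return [(key, n, n / total) for key, n in counts.most_common(top)]

def vowel_group_table(words, top=10):
    """Table of ordered vowel lists for plaintext words."""
    return top_frequencies((vowel_group(w) for w in words), top)

def first_pair_table(words, top=10):
    """Table of the first two glyphs of each transcribed word."""
    return top_frequencies((w[:2] for w in words if len(w) >= 2), top)

print(vowel_group_table(["aldebaran", "pleiades", "algol", "mongol"]))
print(first_pair_table(["4okam", "4oh9", "1oe"]))
```

Feeding `first_pair_table` the transcribed Herbal words and `vowel_group_table` a dictionary word list should give tables in the same form as the two above.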

Frankly, I’m not sure how to proceed further with this, so would welcome ideas 🙂