Posts Tagged ‘English’

Entropy of the Voynich text

May 26, 2015

The Shannon Entropy of a string of text measures the information content of the text. For text that is completely random, i.e. where any character is as likely to appear as any other, the entropy (or “disorder”) is high. For a text that is a long string of identical characters, for example, the entropy is low.

Mathematically, the Shannon Entropy is defined as:

$$\text{Entropy} = -\sum_{i=1}^{N} \text{prob}_i \, \log(\text{prob}_i)$$

where $\text{prob}_i$ is the frequency of the i’th character in the text, and the sum is over all $N$ characters.
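As a minimal sketch of this definition, the entropy of a string can be computed from its character counts. The function name `shannon_entropy` is hypothetical, and the base-2 logarithm (entropy in bits per character) is an assumption, since the post writes only “Log”.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Sum of -p * log2(p) over the distinct characters of `text`.

    p is the relative frequency of each character. Base 2 is an
    assumption; the post does not state which logarithm it uses.
    """
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)  # character -> number of occurrences
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# A run of identical characters is perfectly ordered.
print(shannon_entropy("aaaaaaaa"))  # 0.0
# Four equally likely characters carry two bits each.
print(shannon_entropy("abcd"))      # 2.0
```

Whether spaces, punctuation and letter case count as characters changes the result, so texts are only comparable if they are prepared the same way.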

If the Voynich text is randomly created (by whatever means), we’d expect it to have high entropy (i.e. be very disordered). What we in fact find is that the text is ordered, with low Entropy, and is rather more ordered than English, for example. The result of comparing the Voynich text with several other texts in different languages is shown in the table below.

| Language | Source | Entropy |
|---|---|---|
| Voynich | GC’s Transcription | 3.73 |
| French | Text from 1367 | 3.97 |
| Latin | Cantus Planus | 4.05 |
| Spanish | Medina 1543 | 4.09 |
| German | Kochbuch 1553 | 4.15 |
| English | Thomas Hardy | 4.21 |
| Early Italian | Divine Comedy 1300 | 4.23 |
| None | Random characters | 6.01 |

The last entry in the table shows the Entropy for a random text: at 6.01 it is about 1.6 times the Entropy of the Voynich text (3.73).


Landini’s Challenge

February 26, 2010

An excerpt from Landini’s challenge text (text he generated using an undisclosed method, supposed to replicate the features of the VMs text):

qopchdy chckhy daiin ¬ ½shxam chor otechar okcharain ryly sheodykeyl
sheodykeyl daiin shd okaiin qokain qokal yteoldy otedy qokydy opchedy
otal oldar chor lkeedol eer ol dair chedy daiin ockhdar cpheol chedy
xar qokaiin y chedy kshdy ololdy aiin char y okeey oldar qokaiin lsho
daiin olsheam qoeey chedy dchos pshedaiin shedy d qol key sheol or
cpheeedol qokedy qokaiin daiin cthosy chedy ar aiir chedy teeol aiin
cheey y cheam oky qokaiin daldaiin loiii¯ ar shtchy chedy aldaiin
ydchedy daiin shd okaiin qokain daiin qotcho chedy daiin lchy olorol
otedy qockhor shol daiin paichy chedy ar shdair chedal chedy kchdaldy
chckhy otakar qokedy s qooko chor daiin otcholchy chedy daiin koroiin
qokain qokedy kosholdy ol kchedy kshdy qokaiin ar shaikhy olaldy seees
ar oteodar chedy oteeol shedy daiin key dain daiin keeokechy chedy
lchey ail lchedy sches ol dsheeo otol odaiin qokain daiin sheeod chshy
chedy qoekedy tair sain qocheey aiin cheey chaiin ols shedy sheolol
daiin lcheol chedy daiin pchoraiin oshaiin chedy lchey lor sal aiin
cheey y dsheom shedy todydy cheor saiin shdaldy daiin ofchtar daiin

Here are some thought-provoking results from analysing Landini’s text (as suggested by Knox) and the VM text, with comparisons against English, Latin, German, French and Spanish. These use a new form of the Genetic Algorithm (GA), described below.


It looks to me as though Landini either generated his text from a transcription of the VM itself, or used an algorithm that is a good emulation of the encoding process used in the VM. In other words, the Landini “language” is a good candidate as a plaintext language for the VM, as opposed to the European languages tested.


Here is a table which shows the GA’s efficiency at converting/translating between Voynich, Landini, and the other languages.

(In the table, the best possible score is 1.0 – see below for an explanation)

Asking the GA to translate English to English, or Latin to Latin, etc. results in a high efficiency score, as expected. Note that the Landini to Landini efficiency is 0.97, almost perfect.

The GA performs only moderately well at converting between the languages and the Landini text. But what is most striking (to me) is the good efficiency for converting Voynich to Landini (0.74) and Landini to Voynich (0.89).

Some Notes on the table

To look at this I revised my GA code so that it was more flexible, and I jettisoned the use of separate dictionaries. Here is how the GA now functions. It can convert/translate between any language text samples.

1) Two text files are read in: the “source” text, and the “target” text. This could be, for example, a source file containing Landini’s text, and a target file containing Spanish text, if we want to convert from Landini to Spanish.

2) The text in each file is processed separately, producing two word lists, and two sets of n-Gram frequency tables.

3) The chromosomes are generated with random mappings between the source n-Grams and the target n-Grams.

4) The GA evolves the chromosomes by trying to maximise their cost (used here as a score, so higher is better). The difference now is that when a target word is generated from the source text using the mappings, it is looked up in the target word list created in 2) above, rather than in a separate dictionary.

5) After training, the best chromosome can have a maximum cost value of 1.0, which would correspond to a perfect conversion between the source text and the target text (i.e. every word produced from the source text is found in the target text dictionary).

6) So we can feed the GA with two identical texts, and after training the score of the best chromosome should be 1.0, and indeed it approaches that (it doesn’t quite get there because only the top 100 n-Grams are translated, and so some characters in the source text cannot be translated).

7) The word and n-Gram frequency lists are made from the entirety of each text, but (for this exploratory study) the training takes place on only the first 50 “words” in the source text, and uses only the first 100 n-Grams for mapping. Thus if the 50 words of Voynich chosen contain several rare characters, then for those the mapping will fail because those rare characters do not appear in the n-Gram list, and this will result in a lower score.

8) In all cases the “X->X” score in the table (i.e. the diagonal) represents the best score possible for that language, and is a normalisation for the other numbers in the table. I should really revise the table and divide out the off-diagonal scores by the diagonal normalisations.

9) An improvement would be to configure the n-Gram list to be, say, 200 long, and use more source (Voynich) words for the training. The downside of this is mainly execution speed.

10) These runs were with n-Grams up to 3: it would be better to go to 4 at least.

11) I think Landini gets good scores because the character set he uses is very small. Knox comments: “A factor must be that the Landini Challenge has built-in frequency matches to any transcription of the VMs. Also, there is no meaningful correspondence in the letter sequence of one word to another in Landini. The difficulty fits what I said the VMs may be.”
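The steps in the notes above can be sketched roughly as follows. This is not the GA code used for these runs: the function names, the greedy longest-match segmentation, the rule that a word fails when it contains an unmapped character, and the selection, crossover and mutation scheme are all assumptions made for illustration. Only the overall shape comes from the notes: n-Gram lists up to length 3 capped at 100 entries, training on the first 50 source words, and a score equal to the fraction of converted words found in the target word list.

```python
import random
from collections import Counter


def top_ngrams(words, max_n=3, limit=100):
    """Most frequent character n-grams (n = 1..max_n) in a word list."""
    counts = Counter(
        w[i:i + n]
        for w in words
        for n in range(1, max_n + 1)
        for i in range(len(w) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(limit)]


def convert(word, mapping, max_n=3):
    """Rewrite a word left to right, preferring the longest mapped n-gram.

    Returns None if some character cannot be mapped (assumed behaviour).
    """
    out, i = [], 0
    while i < len(word):
        for n in range(min(max_n, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if piece in mapping:
                out.append(mapping[piece])
                i += n
                break
        else:
            return None
    return "".join(out)


def score(mapping, source_words, target_vocab):
    """Fraction of source words whose conversion is in the target word list."""
    hits = sum(convert(w, mapping) in target_vocab for w in source_words)
    return hits / len(source_words)


def evolve(source_words, target_words, generations=200, pop_size=30, seed=0):
    """Tiny GA: keep the better half, refill by crossover plus one mutation."""
    rng = random.Random(seed)
    src, tgt = top_ngrams(source_words), top_ngrams(target_words)
    vocab = set(target_words)
    train = source_words[:50]  # training uses the first 50 words only
    # A chromosome is a dict mapping each source n-gram to a target n-gram.
    population = [{g: rng.choice(tgt) for g in src} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: score(m, train, vocab), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            child = {g: rng.choice((a[g], b[g])) for g in src}
            child[rng.choice(src)] = rng.choice(tgt)  # point mutation
            children.append(child)
        population = parents + children
    best = max(population, key=lambda m: score(m, train, vocab))
    return best, score(best, train, vocab)


# Feeding the same short text as source and target, as in note 6.
words = "daiin chedy qokaiin shedy chor daiin chedy".split()
best, best_score = evolve(words, words)
print(round(best_score, 2))
```

With identical source and target texts a perfect score is reachable in principle, because the identity mapping is among the candidates. How close a run gets depends on the random seed and the number of generations.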

Voynich Herbs

February 26, 2010

Edith Sherwood has a web site where she details compelling possible identifications for the plants depicted in the “herbal” pages of the VM.

Dana Scott’s page also has plausible identifications for the plants.

As has often been pointed out, if we look at the first Voynich “word” that appears on each page of the herbal part of the VM, we find that those words are unique, or appear elsewhere very rarely. It thus seems reasonable that the words may be the names of the plants depicted.

The GA was set up to find a set of n-Gram mappings that would convert a list of 111 Voynich first herbal words into Latin/English or Spanish. For this, dictionaries of Latin, English and Spanish herb/plant names were used.

The GA sought a mapping that would convert all the Voynich words for herbs/plants into as many valid plaintext (Spanish, English, Latin) words as possible. The best result was for a mixed English/Latin dictionary (see table): 31 of the 111 Voynich words were converted, a success rate of about 28%.

(One should never expect 100% success, due to missing names in the dictionary, transcription errors, missing n-Grams, incomplete n-Grams, etc.)

The results are shown below in tabular form, together with Dana Scott’s and Edith Sherwood’s identification. The first column shows the folio in the VM, the second shows the first Voynich word on that folio. For the GA identification columns (3 and 4) the Voynich mapped word is shown, in quotation marks if not found in the associated dictionary, and in bold if found in the dictionary.
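The bookkeeping behind the table can be sketched in a few lines: a candidate is shown bare when it is a dictionary word and in quotation marks otherwise, and the success rate is the fraction of candidates found. The function names and the three-word dictionary are hypothetical; the six candidates are the Latin/English entries for f1v to f4r in the table.

```python
def render(candidate: str, dictionary: set[str]) -> str:
    """Show a GA candidate bare if it is a dictionary word, quoted otherwise."""
    return candidate if candidate in dictionary else f'"{candidate}"'


def success_rate(candidates: list[str], dictionary: set[str]) -> float:
    """Fraction of candidates that are dictionary words."""
    return sum(c in dictionary for c in candidates) / len(candidates)


# Hypothetical mini-dictionary standing in for the real plant-name lists.
herb_names = {"geum", "arnica", "paris"}
# Latin/English candidates for folios f1v, f2r, f2v, f3r, f3v and f4r.
candidates = ["geum", "ariapha", "padi", "arnica", "paris", "paptise"]

print([render(c, herb_names) for c in candidates])
print(f"{success_rate(candidates, herb_names):.0%}")  # 3 of 6 -> 50%
print(f"{31 / 111:.1%}")  # the reported 31 of 111 -> 27.9%
```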

Note that, probably unsurprisingly, nowhere do the IDs from the GA in Spanish, English/Latin and Scott/Sherwood, agree! NOT YET, anyway 🙂

(What amuses me about this mapping technique is that it tends to produce words that sound plausible in the target language. E.g. for f4r the Latin/English word “paptise” sounds like a valid word.)

Folio Voynich 1st Word Candidate GA ID, Spanish Candidate GA ID, Latin/English Dana Scott ID, English Dana Scott ID, Latin Sherwood ID, Latin Sherwood ID, English
f1r fa19s costa “greica”
f1v h1s9 rabo geum Deadly Nightshade Atropa belladonna Hyoscyamus niger Solanum nigrum Solanum dulcamara Atropa belladonna Deadly Nightshade
f2r h98an9 “jzba” “ariapha” Cornflower Centaurea cyanus Centaurea diffusa Diffuse Knapweed
f2v hoom “meic” “padi” Water Lily Nymphaea candida Nymphoides Nymphoides
f3r k2cos chinita (Impatiens) arnica Celosia argentea Feathery amaranth
f3v hoam menta (mint) paris Helleborus foetidus Dungwort
f4r ho8ae19 “mezirn” “paptise” Saxifraga cespitosa Alpine Saxifrage
f4v j1oom pastora (Poinsettia) “oigle” Campanula rapunculus Rampion
f5r h2o89 “piyn” “hicse” Arnica montana Wolfs Bane
f5v hA1coy malanga (Malanga) cirsium Tennis Racket Plant Agrimonia eupatoria Malva sylvestris Mallow
f6r foay “oote” “erk” Acanthus mollis Bear Breeches
f6v hoay9say1Chay “meotendoteisedh” “pakpikrtsst” Eryngium maritimum Sea Holly
f7r f1o8am “saynta” acris Trientalis europaea Starflower
f7v joe29 “rden” anise Myrica gale Bog Myrtle
f8r g2oe “dno” “miv” Pisum sativum Green Pea
f8v Ko8 “anop” “amot” Symphytum officinale Comfrey
f9r k98eo “uardna” “cernur” Ricinus communis Castor oil
f9v fo1oy “oveh” “erut” Heartsease, Wild Pansy Viola tricolor Violaceae Viola
f10r g1oK9 “pohon” “apryse” Cichorium pumilum Chicory Endive
f10v gam tora (Tora Tree) gale Linnaea borealis Twinflower
f11r k2oe chino (Chinese Hat Plant) “arv” Rosmarinus officinalis Rosemary
f11v goe81o89 “albaveaca” “maadud” Curcuma longa Turmeric
f13r koy3oy “lenga” “mdoium” Banana Banana
f13v hoaiy “memh” “paft” Lonicera periclymenum Honeysuckles Woodbines
f14r g1o8am “poynta” “apcris” Scorzonera Black Salsify Vipers Grass
f14v g891om “uomic” “gesdi” Stachys monnieri Wood Betony Heal-all Sel-heal Woundwort
f15r k2oy “chiga” “arium” Sonchus oleraceus Sow Thistles
f15v gayoy “t8h” “gabt” Paris quadrifolia Herb Paris
f16r go1co89 “alblanyn” “marscse” Cannabis Cannabis
f16v g1yAm “potoora” “aptule” Chrysanthemum Chrysanthemum
f17r f2o89 “hayn” “ulcse” Catananche caerulea Cupids Dart
f17v g1o8oe “poyno” “apcv” Dioscorea Yams
f18r g8yaz89 “ullngn” “gmeagse” Aster alpinus Aster
f18v koe8 la (?) mad Telfairia Fluted pumpkin
f19r g1oy “poga” apium Polemonium coeruleum Greek Valerian
f19v go1am “albbora” mantle Draba nivalis Nailwort
f20r h81o89 “caveaca” woud Astragalus hypoglottis Milk vetch
f20v faIsay “crrote” greek Cynara cardunculus Cardoon
f21r g1oy “poga” apium Anagallis arvensis Pimpernel
f21v koe829 “laol” “madpe” Dictamnus albus Burning bush False Dittany White Dittany Gas Plant
f22r goe “albv” “maus” Verbena officinalis Common Vervain Holy Herb
f22v g9samoy “..dah” “hnshot” Tulip Tulip
f23r g9818op “.fhilo” “hsthlo” Pulsatilla vulgaris Pasque flower
f23v go8azoe “albzucv” “mapacus” Borago officinalis Borage Star Flower
f24r goyoy9 “alb..” “maby” Cucumis sativus Cucumber
f24v k1o8ay coyote (wild) rock Ficus religiosa Sacred Fig Bo Tree
f25r f1oe89 “sanoaca” “avd” Wild Thyme
f25v goCam “albcuora” “malile” Isatis tinctoria Woad
f26r g%coh9 “spnij” lunaria Prunella vulgaris Self heal
f26v g1c8ay pochote (Pochote) “apgok” Lens culinaris Lentil
f27r hsoy manga (Mango) “veium” Spinacia oleracea Spinach
f27v fo1ou oveja (?) eruca French Marigold Tagetes patula Dianthus superbus Dianthus
f28r g1o8ay “poyote” “apck” Aristolochia Smearwort Birthwort Pipevine
f28v h2oe pino (Pine) “hiv” Dahlia Dahlia imperialis Rhododendrons Rhododendrons
f29r gosam “alb.ora” “mansle” Lactuca sativa longifolia Romaine Cos Lettuce
f29v hoom “meic” “padi” Nigella sativa Roman coriander
f30r oh1cs9 “elanbo” “inrsum” Prunella vulgaris Healall
f30v Ks1an rubia (Madder) montana Cuscuta europaea Dodder
f31r hcc8c9 lichi (Lychee) “rgoio” Erigeron acris Fleabane
f31v go8az “albzon” “mapnn” Fernleaf yarrow Achillea filipendulina Valerian Valerian
f32r f1am santa (?) “aris” Veronica triphyllos Speedwell
f32v h1co8am “ranizora” “genple” Campanula rotundifolia Harebell
f33r k28ay “chizh” “arpt” Silene vulgaris Bladder Campion
f33v kayay “qllh” “opmet” Masterwort Astrantia major Tanacetum parthenium Feverfew
f34r g1cocj19 “ponianos” “apnbie” Anemone hortensis
f34v hs189 “mansn” “vewse” Lunaria annua Honesty Money Plant
f35r Koo anona (Custard Apple) amur Cichorium intybus Radicchio
f35v gay1oy “trtga” galium Ribes nigrum Blackcurrant
f36r j1af8aN “pa.nzti” “onupfl” Delphinium staphisagria Delphinium
f36v g1ayos9 “pooteesn” “apksise” Lamium amplexicaule Henbit
f37r koGoe “luiv” malus Mentha longifolia Mint
f37v h2o89 “piyn” “hicse” fedtschenkoi englerii Emilia fosbergii Tassel flower
f38r koeoy “lilh” “mmut”
f38v oh1oj “eveet” inula Euphorbia myrsinites Myrtle Spurge
f39r kc7o128 “goguadp” “gienmpot”
f39v g7aiy “inmh” “naft”
f40r g1c9 “poi” apio Erodium malacoides Storks bill
f40v j1c7an “pagmo” “oospo” Epiphyllum oxypetalum Crocus vernus Crocus
f41r j2c9hc8aecc9 “roilizrii” “ediorpcuio” Origanum vulgare Wild Marjoram
f41v hcSo8ae “lirbzv” “riupus” Coriandrum sativum Coriander Cilantro
f42r 2o “ah” st
f42v k1o˛ cola (?) rosa Aquilegia vulgaris Columbine Culverwort
f43r kayo8am “q.zora” “opbple” Stellaria media Chickweed
f43v g8saiy9 “u.lbn” “gnsicse” Elytrigia repens Couch grass
f44r k2o8g9 “chiy.” arch Mandragora officinarum Mandrake
f44v k2o china (Impatiens) “arur” Apium graveolens Celery
f45r g9h98ae “.jzv” “hariapus” Atriplex hortensis Orach Saltbush
f45v hosay9 “me..” pansy Lavandula angustifolia Lavender
f46r g1coJ9 “ponitr” “apnta” Leucanthemum vulgare Oxeye Daisy
f46v jo79e3c7 “rimvig” “andretos” Tanacetum parthenium, Chrysanthemum parthenium Inula conyza Ploughmans Spikenard Great Fleabane
f47r g1aiy “pomh” “apft” Lady’s Mantle, Lion’s Foot Alchemilla vulgaris Rosaceae Sempervivum tectorum Houseleek
f47v g2cok “dnier” minor Arnica montana Pulmonaria officinalis Lungwort
f48r g28am “dzora” “miple” Adonis Vernalis False Hellebore
f48v g1co819 “ponifn” “apnsse” Ruta graveolens Rue Herb of Grace
f49r gA2oe “ceahv” costus Nymphaea caerulea Blue Nile Lotus
f49v g he wort
f50r g2coy “dnih” mint Astrantia major Masterwort
f50v k19 con (?) rose Telopea speciosissima Gentiana frigida Stiff Gentian
f51r k2oe819 “chinofn” “arvsse” Cakile maritima Searocket
f51v go2o89 albahaca (Basil) “mastd” Salvia officinalis Sage
f52r k8oh1F9 “queacn” “toinnise” Anemone coronaria Poppy Anemone
f52v g1oy “poga” apium Polystichum setiferum Fern
f53r hA8ap “mazlo” “ciplo” Achillea Ptarmica Sneezewort
f53v k2oy3c9 “chigamin” “ariumocse” Hieracium aurantiacum Hawkweed
f54r go8am “albzora” maple Cirsium oleraceum Cabbage thistle
f54v g1co8ay “ponizh” “apnpt” Bittersweet Nightshade Solanum dulcamara Perovskia atriplicifolia Russian Sage
f55r go8am “albzora” maple Fumaria officinalis Fumitory
f55v h1C8189 “raecsn” “geriwse” Forest lily Veltheimia bracteata Broccoli Broccoli
f56r ok1ae “tebv” “trntus” Drosera Sundews
f56v h1cok “ranier” “genor” Cycas revoluta Sago Palm
f57r joccoHc9 “riopei” “anomiaio” Sherardia arvensis Blue Field Madder
f65r Alchemilla vulgaris Ladies Mantle
f65v Centaurea cyanus Cornflower
f66v Satureja montana Winter Savory
f87r Satureja hortensis Summer Savory
f87v Senecio Primula vulgaris Primrose
f87v Kleinia Pedicularis flammea Lousewort Wood Betony
f89v Actaea spicata Baneberry
f90r Conyza bonariensis Fleabane
f90v Eruca vesicaria Arugula Rocket
f93r Cynara cardunculus Artichoke
f93v Lupinus Lupin
f94r Botrychium lunaria Botrychium lunaria Moonwort Moonfern
f94v Agrostemma Githago Corncockle Red Campion
f94v Glycyrrhiza glabra Liquorice
f94v Plantago lanceolata Ribwort Plantain Kemps
f95r Berberis Sambucus nigra Elderberry
f95v Althaea Rosea Hollyhock
f96r Angelica archangelica Garden Angelica
f96v Tamus communis Black Bryony