Archive for the ‘n-grams’ Category

Vowel-less plaintext

June 13, 2011

Suppose the VMs words have no vowels, and that a simple alphabetic substitution has been used to create the text from vowel-less plaintext.

I used a Genetic Algorithm (GA) to test this hypothesis on some of the naked lady labels in the Balneological section. Using a large Latin dictionary, I stripped the vowels “aeiou” from every Latin word, giving me a set of vowel-less Latin words. The GA then used this set to search for the best 1-1 mapping between VMs glyphs and Latin consonants.
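As a rough illustration of the vowel-stripping step (a sketch, not the code used for these runs), the Python below removes “aeiou” from each word and groups the words by the consonant skeleton that remains. The function names and the five-word sample list are my own stand-ins for the large Latin dictionary.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def strip_vowels(word):
    """Return the word in lower case with the vowels a, e, i, o, u removed."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)

def build_skeleton_index(words):
    """Map each vowel-less skeleton to the sorted list of words that produce it."""
    index = defaultdict(set)
    for word in words:
        skeleton = strip_vowels(word)
        if skeleton:  # skip words made only of vowels
            index[skeleton].add(word.lower())
    return {skeleton: sorted(group) for skeleton, group in index.items()}

# Tiny stand-in word list (assumption); the real run used a large Latin dictionary.
sample_words = ["inter", "intra", "natura", "narro", "fas"]
index = build_skeleton_index(sample_words)
print(index["ntr"])  # ['inter', 'intra', 'natura']
```

An index of this shape also gives, for any deciphered skeleton, the list of voweled words it could stand for.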

Here is a table of the starting statistics. The “Source” is the VMs (in the Voyn_101 encoding), the Target is Latin. The second and fifth columns show the total number of occurrences of each glyph and each Latin letter, respectively, and the following columns show that number as a fraction of the total. The rows are in order of glyph/letter frequency.

There are 16 VMs glyphs and 21 Latin letters; the table lists the 16 most frequent of each.

16 Voynich nGrams 21 plaintext nGrams
Top 16 1-grams in Voynich and 1-grams in plaintext
Source                        Target
Glyph  Count  Fraction        Letter  Count  Fraction
-----  -----  ------------    ------  -----  ------------
o         52  0.21311475      s        7666  0.14250925
e         35  0.14344262      r        7450  0.13849385
9         30  0.12295082      t        7053  0.13111371
8         27  0.11065574      n        5706  0.10607328
a         25  0.10245901      c        4386  0.08153477
h         25  0.10245901      m        4340  0.08067964
y         17  0.06967213      l        3707  0.06891231
2          6  0.024590164     p        3079  0.05723793
k          5  0.020491803     d        2790  0.051865485
c          5  0.020491803     b        1725  0.03206737
i          5  0.020491803     v        1424  0.026471846
1          4  0.016393442     f        1372  0.025505178
s          3  0.012295082     g        1347  0.025040433
N          2  0.008196721     q         600  0.0111538675
4          2  0.008196721     h         509  0.009462197
g          1  0.0040983604    x         499  0.0092763

To run the GA, I used a simple weighting function that added the square of the length of every label that was decoded into a valid plaintext word.
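A minimal sketch of such a weighting function in Python, under two assumptions of mine: the “length” is the number of glyphs in the VMs label, and a glyph with no letter assigned contributes nothing to the decoded word. The function names are hypothetical, and the handful of glyph-to-letter pairings are taken from the run reported below.

```python
def decipher(label, mapping):
    """Replace each glyph by its mapped letter; unmapped glyphs are dropped."""
    return "".join(mapping.get(glyph, "") for glyph in label)

def fitness(labels, mapping, skeletons):
    """Sum of squared label lengths over labels that decode to a known skeleton."""
    return sum(len(label) ** 2
               for label in labels
               if decipher(label, mapping) in skeletons)

# A few pairings from the run reported in this post; "o" maps to nothing.
mapping = {"8": "n", "a": "t", "e": "r", "o": "", "y": "f", "9": "s", "g": "b"}
skeletons = {"ntr", "fs"}  # stand-in for the set of vowel-less Latin words

# "8ae" -> "ntr" and "oy9" -> "fs" are valid (3*3 each); "ogoy" -> "bf" is not.
print(fitness(["8ae", "oy9", "ogoy"], mapping, skeletons))  # 18
```

Squaring the length rewards one long decoded label more than several short ones, which matters because short skeletons match dictionary words easily by chance.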

Here are the results of one run, in which just under half of the labels (25/53, about 47%) were converted. First, the derived mapping between VMs glyphs and Latin consonants:

Voynich: c    1    k    2    y    i    h    s    o    a    4    8    e    N    9    g    
Plain:   l    g    c    p    f    x    v    y        t    q    n    r    d    s    b

Note that the GA has assigned VMs “o” to a null …
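To make the table concrete, here is a short Python sketch that transcribes the two rows above into a dictionary (with the null for “o” written as an empty string) and applies it to three of the labels. The variable and function names are mine.

```python
VOYNICH = ["c", "1", "k", "2", "y", "i", "h", "s", "o", "a", "4", "8", "e", "N", "9", "g"]
PLAIN   = ["l", "g", "c", "p", "f", "x", "v", "y", "",  "t", "q", "n", "r", "d", "s", "b"]

# Pair the two rows column by column; "o" is paired with the empty string (null).
MAPPING = dict(zip(VOYNICH, PLAIN))

def decipher(label):
    """Apply the glyph-to-consonant mapping to one VMs label."""
    return "".join(MAPPING[glyph] for glyph in label)

for label in ["oeae9", "oe189", "8ae28"]:
    print(label, "->", decipher(label))
# oeae9 -> rtrs
# oe189 -> rgns
# 8ae28 -> ntrpn
```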

Now here are the deciphered labels, with the possible voweled Latin words each may correspond to:

Source  : oeae9
Decipher: rtrs' : oratorius
Source  : oe189
Decipher: rgns' : origines
Source  : oha89
Decipher: vtns
Source  : ohoeo
Decipher: vr' : varia varie ver vera vere veri vero vir viro voro avara
Source  : ohoy9
Decipher: vfs
Source  : ogoy
Decipher: bf
Source  : oeh9
Decipher: rvs' : rivos
Source  : ohaN
Decipher: vtd
Source  : ohay
Decipher: vtf
Source  : oh29
Decipher: vps
Source  : sayae
Decipher: ytftr
Source  : 8ohae
Decipher: nvtr' : invetero
Source  : 8ayoe
Decipher: ntfr
Source  : 8ae89
Decipher: ntrns' : nutriens internus
Source  : 8ae28
Decipher: ntrpn' : interpono
Source  : 8aehay
Decipher: ntrvtf
Source  : 4ohae
Decipher: qvtr
Source  : 8e9
Decipher: nrs' : inrisuo iners
Source  : oy9
Decipher: fs' : fas
Source  : ok9
Decipher: cs' : acies acsi causa causae cuius iaces iocus ocius casa casia cos
Source  : e19
Decipher: rgs' : erigis reges regius rgis rugas regis
Source  : 8ay9
Decipher: ntfs
Source  : 8ae
Decipher: ntr' : antra inter interea intereo intra intro intueor natura naturae nitor nutrio nitori enitor enutrio ianitor notoare
Source  : 8ae89
Decipher: ntrns' : nutriens internus
Source  : 4oko8
Decipher: qcn
Source  : yhae
Decipher: fvtr
Source  : 9hc89
Decipher: svlns
Source  : oeh19
Decipher: rvgs
Source  : oko89
Decipher: cns' : canis canos cinis consui consuo censeo cuneus
Source  : ohay
Decipher: vtf
Source  : ohae
Decipher: vtr' : vetera viatori vitrea veter viator
Source  : ohoe89
Decipher: vrns
Source  : ohaiya89
Decipher: vtxftns
Source  : oh1oy
Decipher: vgf
Source  : oeaiiN
Decipher: rtxxd
Source  : 8oeoe
Decipher: nrr' : narro
Source  : sohoe9
Decipher: yvrs
Source  : oeha
Decipher: rvt
Source  : h9
Decipher: vs' : evasi ovis vasa vias viis vis visa visu vos avus vas visio
Source  : soyoye
Decipher: yffr
Source  : oeoeae
Decipher: rrtr
Source  : oy
Decipher: f' : fio fui f of
Source  : 2chay
Decipher: plvtf
Source  : 989
Decipher: sns' : sanes sanies sanus senis sensa sensi sensu sonas sinus
Source  : ohc89
Decipher: vlns' : valens volans volens vulnus
Source  : eoe9
Decipher: rrs' : rarus ruris rarius
Source  : 8oiiy
Decipher: nxxf
Source  : oe29
Decipher: rps' : repsi
Source  : okc89
Decipher: clns' : colonus
Source  : ehoe
Decipher: rvr' : revera
Source  : ohoe29
Decipher: vrps
Source  : oko89
Decipher: cns' : canis canos cinis consui consuo censeo cuneus
Source  : 82c89
Decipher: nplns

Does the language of Dante fit the VMs?

October 4, 2010

I have spent many pleasurable hours checking various exotic cipher and code ideas with a GA, and none of them remotely fits, except one. My faith in the GA technique rests on the fact that it very quickly gives an idea of how well a code/cipher theory fits the VMs text.

The one combination of cipher idea and plaintext language that does notably better than all others is an nGram mapping with the language of Dante as the plaintext. This is a form of early Italian, and it produces results significantly better than all other languages tried with nGrams, including Latin, German, English, Spanish, Dutch, Chinese etc.

I’ll post some results from this nGram/Dante GA later.

There is a significant obstacle to applying computational techniques to the VMs: the machine transcriptions of the VMs text. They differ substantially, to the extent that statistics obtained with, say, EVA do not match well with statistics obtained with, say, Voyn_101.

A particular problem is glyph bloat: in my opinion, GC’s Voyn_101 transcription contains many more glyphs than the scribes were actually using. Little differences between the ways of writing “9”, for example, are classified as different glyphs. This plays havoc with statistical analysis. I therefore have a procedure that filters the Voyn_101 and remaps e.g. those multiple “9” glyphs to the same glyph. This allows a smaller, more realistic search space. But it still doesn’t address the question of which strokes make up a single glyph, which is often open to interpretation. Any nGram mapping procedure therefore has to allow for at least 1- to 3-grams in the Voynich to be reasonably sure of covering the glyph correspondences properly.
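The remapping filter can be pictured with a few lines of Python. This is only a sketch of the idea: the characters “(” and “)” are placeholders I have invented to stand for variant forms of “9”, not the real Voyn_101 codes, and the function names are hypothetical.

```python
def build_variant_table(variant_groups):
    """Build a str.translate table sending every variant to its canonical glyph."""
    table = {}
    for canonical, variants in variant_groups.items():
        for variant in variants:
            table[ord(variant)] = canonical
    return table

# Placeholder grouping (assumption): "(" and ")" stand in for variant "9" glyphs.
VARIANT_GROUPS = {"9": "()"}
TABLE = build_variant_table(VARIANT_GROUPS)

def merge_variants(text):
    """Collapse all listed variants in a transcription onto their canonical glyph."""
    return text.translate(TABLE)

print(merge_variants("oha8( 8ae8)"))  # oha89 8ae89
```

Glyphs that are not listed in any group pass through unchanged, so the filter only ever shrinks the glyph inventory.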

Here is an extract of the Dante Alighieri text that matches decently using nGrams to the VMs:

Cjant Prin

A metàt strada dal nustri lambicà
mi soj cjatàt ta un bosc cussì scur
chel troj just i no podevi pì cjatà.

A contàlu di nòuf a è propit dur:
stu post salvàdi al sgrifàva par dut
che al pensàighi al fa di nòuf timour!

Che colp amàr! Murì a lera puc pi brut!
Ma par tratà dal ben chiai cjatàt
i parlarài dal altri chiai jodùt.

I no saj propit coma chi soj entràt:
cun chel gran sùn che in chel moment i vèvi,
la strada justa i vèvi bandonàt.

Necuàrt che in riva in su i zèvi
propit la ca finiva la valàda
se tremaròla tal còu chi sintèvi

in alt jodùt iai la so spalàda
vistìda belzà dai rajs dal pianèta
cal mena i àltris dres pa la so strada.

(This is modified from a reply to Knox who commented on an earlier post.)

Current Status

March 3, 2010

This is my personal summary of where I am at the moment, in particular which theories I’ve rejected (for better or worse!)

  • Theory: VMs words are anagrams of a plaintext that has been enciphered into the VMs glyphs
    • Attempts to find solutions with many mappings (1-, 2- and 3-grams) and various languages/dictionaries fail to find even mediocre matches
    • Unusual prevalence of e.g. “8am 8am 8am” not explained by this theory
  • Theory: VMs words are in fact pieces of plaintext words, that need to be a) combined b) deciphered
    • Trials with delimiters like VMs “o” and “9” and with many mappings and languages/dictionaries fail to find good matches
    • But this would explain “8am 8am 8am” at a stretch
  • Theory: VMs words contain numeric codes, that use a Selenus type code table, with e.g. gallows characters used as multipliers
    • There are too many VMs characters for this to work: only, say, 4 gallows characters and ten digits are needed for a minimal implementation, so what are all the rest for?
    • Doesn’t explain “8am 8am 8am”
  • Theory: VMs words are phonetic codes for a reading of the manuscript
    • Mapping the words to Soundex or Double Metaphone and comparing with plaintexts produces a poor frequency match (but is this a good test? See e.g. Robert Firth’s notes)
    • This could explain “8am 8am 8am”
  • Theory: The text is produced by a polyalphabetic cipher with rotating/repeating sequences (à la Strong)
    • Multiple attempts to fit this theory using various alphabet lengths and sequence lengths fail to find a convincing match, although plausible results can be generated
    • Would explain “8am 8am 8am”
  • Procedure: since the cipher/code/whatever it is changes at least between sections, and possibly between folios (and maybe even within a folio), examining large quantities of VMs text for statistical properties is very misleading. Only text within a single side of a folio should be tackled for decryption.

Genetic Algorithm based Phrase Analysis

February 26, 2010


The following hypothesis occurred to me while I was investigating a cipher theory proposed by Rich Santa Coloma. (This is not a new idea amongst Voynich researchers, but it was new to me!)

The VMs “words” are codes for plaintext character groups, probably trigraphs, digraphs and single characters.

How does one use this system?

1) Take each word in the plaintext
2) Break it up into a sequence of one or more trigraphs, digraphs and single characters by referring to a code table
3) Write the code for each, separated by a space, and terminate the last tri/di-graph/character code with a VMs “9”.
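The three steps can be sketched in Python as follows. This is an illustration under assumptions, not a reconstruction of any real code table: the three entries are the pairings this post uses for its “quidem” example, the function names are mine, and the splitter simply takes the longest group the table knows at each position.

```python
# Illustrative code table: only the three pairings from the "quidem" example.
CODE_TABLE = {"qui": "h1cok", "de": "2oe", "m": "1c"}

def split_word(word, table, max_len=3):
    """Greedy split into known groups, trying trigraphs, then digraphs, then singles."""
    groups = []
    pos = 0
    while pos < len(word):
        for size in range(min(max_len, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in table:
                groups.append(chunk)
                pos += size
                break
        else:
            raise ValueError(f"no code for {word[pos:]!r}")
    return groups

def encode_word(word, table):
    """Encode one non-empty plaintext word; the last code gets the terminal '9'."""
    codes = [table[group] for group in split_word(word, table)]
    codes[-1] += "9"
    return " ".join(codes)

print(encode_word("quidem", CODE_TABLE))  # h1cok 2oe 1c9
```

A greedy split is only one possible policy. A scribe using such a table could split the same word in other ways, and a greedy splitter can fail on a word that a different split would cover.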

The labels are probably treated differently: there may well be a separate set of codes just for the labels.

As an example, take the following “sentence” of 33 “words” from the Herbal folios:

h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds 3o h1cc9 sam 1oh1oe 1oy Hos

Breaking the sentence after each VMs “word” that ends in a terminal “9” (and dropping that “9”) gives a sentence of 13 words:

h1cok 2oe 1c
4ohom 2oy 4ok1coe 1oyoy 2o82c
4okd
4okcc
8am 4okC
Kay o1c
1oe 1oe 4ok1c
8am 1okd
8ae s1
k1c
8am 8C
ko8 8an 4okds 3o h1cc
sam 1oh1oe 1oy Hos
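The splitting rule itself is easy to express in Python. In this sketch (function name mine) a token ending in the terminator closes the current word, the terminator is dropped, and any tokens left over at the end form a final word.

```python
SENTENCE = ("h1cok 2oe 1c9 4ohom 2oy 4ok1coe 1oyoy 2o82c9 4okd9 4okcc9 8am 4okC9 "
            "Kay o1c9 1oe 1oe 4ok1c9 8am 1okd9 8ae s19 k1c9 8am 8C9 ko8 8an 4okds "
            "3o h1cc9 sam 1oh1oe 1oy Hos")

def split_into_words(sentence, terminator="9"):
    """Group VMs tokens into words, closing a word at each token ending in the terminator."""
    words, current = [], []
    for token in sentence.split():
        if token.endswith(terminator):
            current.append(token[:-len(terminator)])  # drop the terminal mark
            words.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens with no terminator still form a word
        words.append(current)
    return words

words = split_into_words(SENTENCE)
print(len(SENTENCE.split()), len(words))  # 33 13
```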

Each of these words is built of one or more codes. E.g. the first word in the list above is “h1cok 2oe 1c” and may be deciphered as

h1cok = “qui”
2oe = “de”
1c = “m”

to make the Latin word “quidem”.

An interesting feature of this cipher/code is that you may have several choices of how to split each plaintext word into tri/di/mono-graphs, but without ambiguity for the decipherer. This may be an explanation for the different frequency distributions between the VMs folios and Currier hands: they were written by different scribes who tended to split the plaintext words differently.

Does the Theory fit the Data, for Latin?

We first take a substantial body of text from the VMs, e.g. the Recipes folios, and feed it through an application that extracts all the VMs words and groups them according to the procedure described above, using one or more arbitrary characters as word-ending marks (typically VMs “9”). Each sentence so derived is then analysed: each of its tokens is broken into n-grams, and the n-gram frequencies are tallied.

At the end of the processing, the n-grams are sorted into frequency order: the most frequent n-grams appear first in the list.

At this point the application moves to its second stage. It ingests a large list of Latin phrases, generated by Knox (thanks, Knox!) and processes each word in each unique phrase for n-gram content, so extracting the n-gram frequencies for Latin. The phrases are placed in a sorted list: shortest first. The n-grams are sorted by frequency, most frequent first.
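The tallying used in both stages can be sketched with Python's `collections.Counter`. The maximum n-gram length is an assumption on my part (it is a parameter here, defaulting to 3), and the function names are hypothetical.

```python
from collections import Counter

def ngrams(token, max_n=3):
    """Yield every contiguous n-gram of the token for n = 1 .. max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(token) - n + 1):
            yield token[start:start + n]

def ngram_frequencies(tokens, max_n=3):
    """Count n-grams over all tokens and return (n-gram, count) pairs, most frequent first."""
    counts = Counter()
    for token in tokens:
        counts.update(ngrams(token, max_n))
    return counts.most_common()

ranked = ngram_frequencies(["8am", "8am", "8ae"], max_n=2)
print(ranked[:3])
```

The same function works unchanged on Latin words, so the two frequency-sorted lists can be produced by one routine.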

Here are the Latin phrase sizes used:

A total of 53834 different phrases of size >= 2
2 4405
3 28152
4 8524
5 3866
6 2227
7 1507
8 1085
9 813
10 633
11 513
12 424
13 356
14 300
15 252
16 209
17 177
18 150
19 130

The third stage of the application is to generate a set of Genetic Algorithm chromosomes. Each chromosome takes the Top N n-grams from the Voynich n-gram list and pairs them with a random selection of the n-grams from the Latin list.

For example, for a Chromosome of length 15 (in fact the GA uses much longer lengths, typically 200) the following table might be used:

V: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam  oy 1c7  e
L: ed gi  n  de   et ae  p     s  du    tu   nd    d tio rum te
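A chromosome of this kind can be sketched as a Python dictionary from Voynich n-gram to Latin n-gram. Two things here are my assumptions: the Voynich list is already sorted by frequency, and the Latin n-grams are drawn without repetition (as in the example table, where no Latin n-gram appears twice). The function name is hypothetical.

```python
import random

def random_chromosome(voynich_ngrams, latin_ngrams, length, rng):
    """Pair the top `length` Voynich n-grams with distinct, randomly chosen Latin n-grams."""
    top = voynich_ngrams[:length]
    picks = rng.sample(latin_ngrams, len(top))  # needs len(latin_ngrams) >= len(top)
    return dict(zip(top, picks))

# Short stand-in lists taken from the example table above.
voynich = ["am", "ay", "ae", "1c8", "4ohC"]
latin = ["ed", "gi", "n", "de", "et", "ae", "p", "s"]

chromosome = random_chromosome(voynich, latin, 3, random.Random(1))
print(list(chromosome))  # ['am', 'ay', 'ae']
```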

The chromosomes are “scored” by having them translate/decipher a training set of sentences from the input VMs folios. To calculate the score of each chromosome for each sentence, the sentence word tokens are converted to Latin n-grams using the chromosome’s table. Then the tokens are joined together to form the plaintext words. The plaintext words are looked up in the Latin dictionary: the chromosome’s score is increased for valid words, and decreased for invalid words. Once all the words in the sentence have been deciphered in this way, it is compared with each of the Latin phrases: if a Latin phrase appears in the sentence, the score of the chromosome is increased substantially.
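Here is a sketch of that scoring for a single sentence, in Python. The reward, penalty and phrase bonus values are placeholders of mine, since the actual weights are not given here; the pairing of “8am” with “et” and the phrase “quidem et” are likewise invented purely to exercise the code. A token missing from the table is rendered as “?”, which makes its word invalid.

```python
def score_sentence(words, table, dictionary, phrases,
                   word_reward=1.0, word_penalty=1.0, phrase_bonus=10.0):
    """Score one sentence: `words` is a list of token lists, `table` maps token -> n-gram."""
    # Convert each token to its Latin n-gram and join the pieces into plaintext words.
    plain = ["".join(table.get(token, "?") for token in word) for word in words]
    score = 0.0
    for word in plain:
        score += word_reward if word in dictionary else -word_penalty
    # Pad with spaces so phrases only match on whole-word boundaries.
    text = " " + " ".join(plain) + " "
    for phrase in phrases:
        if " " + phrase + " " in text:
            score += phrase_bonus
    return score

table = {"h1cok": "qui", "2oe": "de", "1c": "m", "8am": "et"}  # "8am" pairing is invented
sentence = [["h1cok", "2oe", "1c"], ["8am"], ["Kay"]]
print(score_sentence(sentence, table, {"quidem", "et"}, ["quidem et"]))  # 11.0
```

In the example, two valid words score +1 each, the undecodable “Kay” scores -1, and the phrase match adds 10.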

The best chromosome found by a Monte Carlo method (basically generating random chromosomes, and retaining the best scoring chromosome) is placed at the top of a list, and then the remaining chromosomes needed for the Genetic Algorithm are generated.
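The seeding step might look like this in Python. It is a sketch with hypothetical names: `make_chromosome` and `score` are whatever chromosome generator and scoring function the GA uses, and the toy versions below (three n-grams, a made-up target) exist only so the example runs.

```python
import random

def monte_carlo_best(make_chromosome, score, trials, rng):
    """Generate `trials` random chromosomes and keep the best-scoring one."""
    best = make_chromosome(rng)
    best_score = score(best)
    for _ in range(trials - 1):
        candidate = make_chromosome(rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def initial_population(make_chromosome, score, size, trials, rng):
    """Put the Monte Carlo winner first, then fill up with fresh random chromosomes."""
    best, _ = monte_carlo_best(make_chromosome, score, trials, rng)
    return [best] + [make_chromosome(rng) for _ in range(size - 1)]

# Toy stand-ins (assumptions) so the sketch is runnable.
VOYNICH = ["am", "ay", "ae"]
LATIN = ["ed", "gi", "n", "de"]
TARGET = {"am": "ed", "ay": "gi", "ae": "n"}

def make_chromosome(rng):
    return {v: rng.choice(LATIN) for v in VOYNICH}

def score(chromosome):
    return sum(chromosome[v] == TARGET[v] for v in VOYNICH)

population = initial_population(make_chromosome, score, size=10, trials=200,
                                rng=random.Random(0))
print(len(population), score(population[0]))
```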

The GA phase now begins: the chromosomes are genetically altered, mated and selected to optimise the best chromosome’s score on the training sentences. This phase is compute-intensive.
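One generation of such a GA could be sketched as below. The specific operators are my assumptions, not a description of the application: uniform crossover, per-gene mutation, selection of parents from the better half of the population, and the best chromosome carried over unchanged. The toy target and scoring function are stand-ins.

```python
import random

def mutate(chromosome, latin_ngrams, rate, rng):
    """Reassign each Voynich n-gram to a random Latin n-gram with probability `rate`."""
    return {v: (rng.choice(latin_ngrams) if rng.random() < rate else l)
            for v, l in chromosome.items()}

def crossover(mother, father, rng):
    """Uniform crossover: each pairing is inherited from one parent at random."""
    return {v: (mother[v] if rng.random() < 0.5 else father[v]) for v in mother}

def next_generation(population, score, latin_ngrams, rate, rng):
    """Build a new population (needs at least 2 chromosomes); the best one survives."""
    ranked = sorted(population, key=score, reverse=True)
    parents = ranked[:max(2, len(ranked) // 2)]
    children = [ranked[0]]  # carry the best chromosome over unchanged
    while len(children) < len(population):
        mother, father = rng.sample(parents, 2)
        children.append(mutate(crossover(mother, father, rng), latin_ngrams, rate, rng))
    return children

# Toy stand-ins (assumptions) so the sketch is runnable.
LATIN = ["ed", "gi", "n", "de"]
TARGET = {"am": "ed", "ay": "gi", "ae": "n"}

def score(chromosome):
    return sum(chromosome[v] == TARGET[v] for v in TARGET)

rng = random.Random(0)
population = [{v: rng.choice(LATIN) for v in TARGET} for _ in range(8)]
for epoch in range(20):
    population = next_generation(population, score, LATIN, rate=0.2, rng=rng)
print(max(score(c) for c in population))
```

Because the best chromosome is always copied into the next generation, the best score can never go down from one epoch to the next.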

Periodically, the GA will report on its progress:

Epoch 311 Cost/Ave 62.845588235294116/61.22993872549012 same 1 Mutated 21.608040201005025% New 1 MS 15
62.845588235294116 GAPhrases$Chromosome@41ec5a Good=128 / 408 = 31.37255% 40 phrases in 25 sentences
S: am ay ae 1c8 4ohC oe 1c 4oham 8am 4ohan oham okam  oy 1c7  e
R: ed gi  n  de   et ae  p     s  du    tu   nd    d tio rum te
Sentence 189
S: 2o ok1c - 1coe hc1 - 1Kc - ohan ae e hC - 4ohan 1cH - 1c7ay ap e2c - 2c7ae ohcay e hc8 - 1coehC - ehc - ohC - 4ohC - 4ohc - 4ohan ap -
T: endve la' binteua tunti nis te' pi et' in'* tunis

In this report, the GA has been running for 311 “epochs” (each epoch is a new generation of chromosomes). The cost (score) of the best chromosome is 62.8, whereas the average score of all the chromosomes in the population is 61.2. In this epoch, there has been no change to the best chromosome since the last epoch (“same 1”), about 21.6% of the chromosomes have been mutated, and a fresh chromosome (“New 1”) was inserted (to ensure diversity; this is not usually done in a GA, but I find it produces more reliable training). “MS 15” means that the maximum number of no-change epochs seen so far has been 15 … the larger this number is, the more stagnant the chromosome pool is, and the nearer to a solution we are.

The following line shows in detail how the best chromosome has scored: its table produces 128 valid Latin words, from a total of 408 translations i.e. about 31%. In the 25 sentences being used in training, 40 common Latin phrases have been found.

The next two lines show the first 15 n-grams in the mapping that the chromosome is using.

Then the status report shows how the chromosome fared on translating a sentence picked at random from the VMs folios. Since the GA is being trained only on the first few sentences, the remainder are essentially “unseen”, and so a valid, sensible translation in a non-trained sentence is significant.

The sentence picked is number 189 (the training set is the first 25 sentences in this run, so number 189 is well outside that). The VMs source sentence is shown with hyphens “-” separating the tokens that make up words. E.g. “2o ok1c” is the first word. Beneath is the Latin translation. A Latin word followed by a single quote means that that word appears in the Latin dictionary, and is thus valid. A star appearing after a set of valid Latin words indicates that the Latin phrase made up by the words is common, or at least appears in Knox’s list.