Archive for the ‘German’ Category

Frequency Distributions for Phonetic Codes

June 12, 2012

Knox took the time to plot the frequency distributions from this post, where I looked at the theory that the VMs words are phonetic codes. Here are his results:

Where not included in the title, comparisons are to the Herbal Sections. VMs is in blue-black.

Comparison of phonetic code frequencies between VMs sections and various known texts.

With only 40 words to translate, there cannot be a meaningful series but it would be interesting to see the actual words in position, anyway. If this only shows the power of Genetic Algorithms to match something regardless of significance, why does the old Latin Herbal make the best matches to the Herbal and Astrological sections?

Does the language of Dante fit the VMs?

October 4, 2010

I have spent many pleasurable hours checking various exotic cipher and code ideas with a GA, and none of them remotely fits, except one. I have faith in the GA technique because it very quickly gives an idea of how well a code/cipher theory fits the VMs text.

The one cipher idea and plaintext language that does notably better than all others is an nGram mapping with the language of Dante as the plaintext. This is a form of early Italian, and it produces results significantly better than all other languages tried with nGrams, including Latin, German, English, Spanish, Dutch, Chinese, etc.

I’ll post some results from this nGram/Dante GA later.

There is a significant obstacle to applying computational techniques to the VMs, and that is the machine transcriptions of the VMs text. They differ substantially, to the extent that statistics obtained with, say, EVA do not match well with statistics obtained with, say, Voyn_101. A particular problem is glyph bloat: my opinion is that GC’s Voyn_101 transcription contains many more glyphs than the scribes were actually using. Little differences between the ways of writing “9”, for example, are classified as different glyphs. This plays havoc with statistical analysis. I therefore have a procedure that filters the Voyn_101 transcription and remaps, e.g., those multiple “9” glyphs to the same glyph. This allows a smaller, more realistic search space. But it still doesn’t address the question of which strokes make up a single glyph, which is often open to interpretation. Any nGram mapping procedure therefore has to allow for at least 1- to 3-Grams in the Voynich text to be reasonably sure of covering the glyph correspondences properly.

Here is an extract of the Dante Alighieri text that matches decently using nGrams to the VMs:


Cjant Prin

A metàt strada dal nustri lambicà
mi soj cjatàt ta un bosc cussì scur
chel troj just i no podevi pì cjatà.

A contàlu di nòuf a è propit dur:
stu post salvàdi al sgrifàva par dut
che al pensàighi al fa di nòuf timour!

Che colp amàr! Murì a lera puc pi brut!
Ma par tratà dal ben chiai cjatàt
i parlarài dal altri chiai jodùt.

I no saj propit coma chi soj entràt:
cun chel gran sùn che in chel moment i vèvi,
la strada justa i vèvi bandonàt.

Necuàrt che in riva in su i zèvi
propit la ca finiva la valàda
se tremaròla tal còu chi sintèvi

in alt jodùt iai la so spalàda
vistìda belzà dai rajs dal pianèta
cal mena i àltris dres pa la so strada.

(This is modified from a reply to Knox who commented on an earlier post.)

Landini’s Challenge

February 26, 2010

An excerpt from Landini’s challenge text (text he generated using an undisclosed method, supposed to replicate the features of the VMs text):

qopchdy chckhy daiin ¬ ½shxam chor otechar okcharain ryly sheodykeyl
sheodykeyl daiin shd okaiin qokain qokal yteoldy otedy qokydy opchedy
otal oldar chor lkeedol eer ol dair chedy daiin ockhdar cpheol chedy
xar qokaiin y chedy kshdy ololdy aiin char y okeey oldar qokaiin lsho
daiin olsheam qoeey chedy dchos pshedaiin shedy d qol key sheol or
cpheeedol qokedy qokaiin daiin cthosy chedy ar aiir chedy teeol aiin
cheey y cheam oky qokaiin daldaiin loiii¯ ar shtchy chedy aldaiin
ydchedy daiin shd okaiin qokain daiin qotcho chedy daiin lchy olorol
otedy qockhor shol daiin paichy chedy ar shdair chedal chedy kchdaldy
chckhy otakar qokedy s qooko chor daiin otcholchy chedy daiin koroiin
qokain qokedy kosholdy ol kchedy kshdy qokaiin ar shaikhy olaldy seees
ar oteodar chedy oteeol shedy daiin key dain daiin keeokechy chedy
lchey ail lchedy sches ol dsheeo otol odaiin qokain daiin sheeod chshy
chedy qoekedy tair sain qocheey aiin cheey chaiin ols shedy sheolol
daiin lcheol chedy daiin pchoraiin oshaiin chedy lchey lor sal aiin
cheey y dsheom shedy todydy cheor saiin shdaldy daiin ofchtar daiin

Here are some thought-provoking results from analysing Landini’s text (as suggested by Knox) and the VM text, with comparisons against English, Latin, German, French and Spanish. These use a new form of the Genetic Algorithm, described below.

Summary

It looks to me as though Landini either generated his text from a transcription of the VM itself, or his algorithm for generating that text is a good emulation of the encoding process used in the VM. In other words, the Landini “language” is a good candidate as a plaintext language for the VM, as opposed to the European languages tested.

Results

Here is a table which shows the GA’s efficiency at converting/translating between Voynich, Landini, and the other languages.

(In the table, the best possible score is 1.0 – see below for an explanation)

Asking the GA to translate English to English, or Latin to Latin, etc. results in a high efficiency score, as expected. Note that the Landini to Landini  efficiency is 0.97 – almost perfect.

The GA performs moderately at converting between the languages and the Landini text. But what is most striking (to me) is the good efficiency for converting Voynich to Landini (0.74) and Landini to Voynich (0.89).

Some Notes on the table

To look at this I revised my GA code so that it was more flexible, and I jettisoned the use of separate dictionaries. Here is how the GA now functions. It can convert/translate between any language text samples.

1) Two text files are read in: the “source” text, and the “target” text. This could be, for example, a source file containing Landini’s text, and a target file containing Spanish text, if we want to convert from Landini to Spanish.

2) The text in each file is processed separately, producing two word lists, and two sets of n-Gram frequency tables.

3) The chromosomes are generated with random mappings between the source n-Grams and the target n-Grams

4) The GA evolves the chromosomes by trying to maximise their cost. The difference now is that when a target word is generated from the source text using the mappings, it is looked up in the target word list created in 2) above, rather than in a separate dictionary.

5) After training, the best chromosome can have a maximum cost value of 1.0, which would correspond to a perfect conversion between the source text and the target text (i.e. every word produced from the source text is found in the target text dictionary)

6) So we can feed the GA with two identical texts, and after training the score of the best chromosome should be 1.0, and indeed it approaches that (it doesn’t quite get there because only the top 100 n-Grams are translated, and so some characters in the source text cannot be translated).

7) The word and n-Gram frequency lists are made from the entirety of each text, but (for this exploratory study) the training takes place on only the first 50 “words” in the source text, and uses only the first 100 n-Grams for mapping.  Thus if the 50 words of Voynich chosen contain several rare characters, then for those the mapping will fail because those rare characters do not appear in the n-Gram list, and this will result in a lower score.

8) In all cases the “X->X” score in the table (i.e. the diagonal)  represents the best score possible for that language, and is a normalisation for the other numbers in the table. I should really revise the table and divide out the off-diagonal scores by the diagonal normalisations.

9) An improvement would be to configure the n-Gram list to be, say, 200 long, and use more source (Voynich) words for the training. The downside of this is mainly execution speed.

10) These runs were with n-Grams up to 3: it would be better to go to 4 at least.

11) I think Landini gets good scores because the character set he uses is very small. Knox comments: “A factor must be that the Landini Challenge has built-in frequency matches to any transcription of the VMs. Also, there is no meaningful correspondence in the letter sequence of one word to another in Landini. The difficulty fits what I said the VMs may be.”
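The scoring idea in steps 2) to 6) can be sketched in a few lines of Python. This is only an illustration, not the GA code used for the table: the function names, the greedy longest-match translation, and the toy text are all assumptions made for the example, and the evolution loop (selection, crossover, mutation) is left out.

```python
import random
from collections import Counter

def top_ngrams(words, max_n=3, top_k=100):
    """Return the top_k most frequent n-grams (n = 1..max_n) in a word list."""
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def random_chromosome(source_grams, target_grams, rng):
    """One chromosome: a random mapping from source n-grams to target n-grams."""
    return {gram: rng.choice(target_grams) for gram in source_grams}

def translate(word, chromosome, max_n=3):
    """Greedy longest-match translation (an assumed strategy).

    Returns None when some part of the word has no mapping, e.g. a rare
    character that is missing from the n-gram list.
    """
    out, i = [], 0
    while i < len(word):
        for n in range(min(max_n, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if piece in chromosome:
                out.append(chromosome[piece])
                i += n
                break
        else:
            return None
    return "".join(out)

def score(chromosome, source_words, target_vocab, max_n=3):
    """Fraction of source words whose translation is in the target word list."""
    hits = sum(1 for w in source_words
               if translate(w, chromosome, max_n) in target_vocab)
    return hits / len(source_words)

# Toy check of step 6): identical source and target, identity mapping.
text = "the cat sat on the mat".split()
grams = top_ngrams(text)
identity = {gram: gram for gram in grams}
print(score(identity, text, set(text)))
```

A GA would start from many `random_chromosome` mappings and evolve them towards a higher `score`; with a real text, limiting the list to the top 100 n-grams is what keeps the X->X score slightly below 1.0.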

Prefix Stem and Suffix Analysis

February 26, 2010

I grouped all the folios from f1v to f20v inclusive and labeled the group “Herbal folios”, and grouped folios f103r to f116r inclusive as “Recipe folios”. I ran each group through a program that extracts all the prefixes, suffixes and stems, validates each, and orders them by frequency. (The method used was described in an earlier email to the list.) My first question was: are the word frequencies and prefix/stem/suffix (PSS) frequencies similar between the Herbal and Recipe collections?

Here are the results. I’ll show only the suffix frequencies, because they are the most interesting.

Herbal: 1331 different words, top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy" 
Recipe: 1443 different words, top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89" 

Top 10 Herbal Suffixes  (Frequency) 

9       0.105580695 
89      0.065862246 
y       0.06435395 
e       0.058320764 
am      0.04524887 
m       0.03167421 
s       0.027652087 
19      0.025641026 
8       0.023629965 
oy      0.02212167 

Top 10 Recipe Suffixes 

9       0.11764706 
e       0.05882353 
89      0.05020284 
y       0.04817444 
am      0.036511157 
8       0.029411765 
ay      0.028904665 
ae      0.024340771 
oy      0.023326572 
oe      0.021805273 
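The two lists above can be compared mechanically. The short Python sketch below (variable names are mine, the numbers are copied from the two tables) counts the suffixes the lists have in common and computes how much more frequent the top suffix is than the runner-up in each list.

```python
# Top 10 suffixes and frequencies, copied from the two tables above.
herbal = [("9", 0.105580695), ("89", 0.065862246), ("y", 0.06435395),
          ("e", 0.058320764), ("am", 0.04524887), ("m", 0.03167421),
          ("s", 0.027652087), ("19", 0.025641026), ("8", 0.023629965),
          ("oy", 0.02212167)]
recipe = [("9", 0.11764706), ("e", 0.05882353), ("89", 0.05020284),
          ("y", 0.04817444), ("am", 0.036511157), ("8", 0.029411765),
          ("ay", 0.028904665), ("ae", 0.024340771), ("oy", 0.023326572),
          ("oe", 0.021805273)]

# Suffixes that appear in both top-10 lists.
shared = sorted({s for s, _ in herbal} & {s for s, _ in recipe})

def top_ratio(table):
    """Ratio of the most frequent suffix to the second most frequent."""
    return table[0][1] / table[1][1]

print(len(shared), shared)
print(round(top_ratio(herbal), 2), round(top_ratio(recipe), 2))
```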

Note: the sets are similar (7 of 10 suffixes in common), with suffix “9” being roughly a factor of two more common than the next most common suffix. I’m not sure what conclusions can be drawn, if any, from this. For fun, I applied the same analysis to a similar number of words from Augustinus (Latin). Here are the results, together with the VMs data:


(Augustinus: 1257 different words, top 10 words: "et te in non me mihi est domine ut enim") 

Top 10 Latin Suffixes 

m       0.118421055 
s       0.10526316 
e       0.047368422 
que     0.039473683 
i       0.034210525 
o       0.028947368 
t       0.028947368 
us      0.02631579 
rum     0.021052632 
a       0.021052632 

So, Latin does not have the same frequency pattern at all. Is there a language which does have a similar pattern? I looked at French from 1367, Spanish from 1527, German from 1553, and old English (Courtier):

Top 10 French Suffixes 

s    0.1199 
t    0.0736 
z    0.0708 
e    0.0654 
nt    0.0463 
es    0.0436 
l    0.0245 
r    0.0218 
re    0.0191 
er    0.0191 
tre    0.0163 

Top 10 Spanish Suffixes 

s    0.1874 
n    0.0519 
o    0.0464 
a    0.0445 
r    0.0297 
do    0.0297 
es    0.0241 
l    0.0223 
e    0.0223 
va    0.0204 
to    0.0148 

Top 10 German Suffixes 

en    0.1171 
t    0.1171 
s    0.1122 
n    0.0537 
er    0.0390 
ten    0.0341 
d    0.0341 
e    0.0293 
m    0.0293 
ts    0.0244 
r    0.0244 

Top 10 English Suffixes 

e    0.1404 
n    0.0449 
s    0.0421 
t    0.0393 
re    0.0337 
y    0.0281 
ne    0.0253 
l    0.0253 
r    0.0253 
ll    0.0253 
ed    0.0225 

The Spanish suffix “s” is three times more frequent than the next suffix: not a good match to the VMs. Similarly for the English “e”. The German suffix pattern is completely different to the VMs. The French pattern looks similar to the VMs. Let’s look at the French Stems, and compare with the VMs:


Top 10 Herbal Stems 

o       0.15171504 
9       0.058377307 
8       0.045184698 
k       0.04287599 
1o      0.040567283 
oe      0.036609497 
o8      0.028364116 
oy      0.026385223 
y       0.02176781 
2       0.02176781 

Top 10 French Stems 

a       0.0704 
d       0.0544 
es      0.0528 
en      0.0448 
le      0.0432 
se      0.032 
ent     0.0304 
de      0.0272 
ce      0.0272 
ne      0.0256 

A poor match.

Conclusion: the “9” suffix in the VMs appears too frequently for it to come from Latin, German, English or Spanish. Although French has a similarly frequent suffix “s”, the stem frequencies of French don’t match the VMs.

Hypothesis: the “9” suffix in the VMs is not a word suffix, but punctuation or some other annotation. Perhaps a key mark for deciphering purposes. Next step: re-analyse the PSS frequencies in the VMs after removing suffix “9” from words where it appears.

Using the Biological and Astrological Folios

Astrological: folios 66v to 73v inclusive

Biological: folios 75r to 85r inclusive

Herbal: 1331 different words,       top 10 words: "8am 1oe 1oy K9 89 19 s 8ay 2oe oy"
Recipe: 1443 different words,       top 10 words: "am ay 1c89 oe 4ohC9 8am oy 4oham 1c9 2c89" 
Astrological: 1771 different words, top 10 words: "ay am ae 8am s 8ay 8ae 89 okcos ohC9" 
Biological: 2135 different words,   top 10 words: "oe 4ohan 1c89 2c89 4ohc89 4oe 4ohae 1c9 4oham" 

Top 10 Herbal Suffixes  (Frequency) 

9       0.105580695 
89      0.065862246 
y       0.06435395 
e       0.058320764 
am      0.04524887 
m       0.03167421 
s       0.027652087 
19      0.025641026 
8       0.023629965 
oy      0.02212167 

Top 10 Recipe Suffixes 

9       0.11764706 
e       0.05882353 
89      0.05020284 
y       0.04817444 
am      0.036511157 
8       0.029411765 
ay      0.028904665 
ae      0.024340771 
oy      0.023326572 
oe      0.021805273 

Top 10 Astrological Suffixes 

9       0.120173536 
89      0.055531453 
am      0.046420824 
ay      0.04381779 
s       0.04295011 
ae      0.04251627 
e       0.040347073 
79      0.026898047 
y       0.022993492 
oe      0.022125814 

Top 10 Biological Suffixes 

9       0.11961975 
89      0.049643517 
e       0.038288884 
oe      0.031687353 
y       0.030102983 
c89     0.029838923 
ae      0.0293108 
c9      0.0293108 
oy      0.02719831 
ay      0.02508582 


The suffix frequency results for the different folio groups look reassuringly similar to me: the differences are what you would see if you compared two modestly sized texts in, say, English. Indeed, one can tentatively conclude that the language is the same in all four of the VMs sections. On the other hand, the top 10 word lists are quite different. Curious.

Regarding word stems: the definition of a word stem for this study is “any group of characters that spells a valid word by itself, and is also found following one or more other characters (a prefix) and/or followed by one or more other characters (a suffix).” So, single VMs characters can be stems. After all, it may be that a single VMs character equates to multiple plaintext characters, so we have to have the flexibility to assign single characters as stems.

To clarify, take for example the VMs word “8am”. The candidate stems are “8am”, “8a”, “am”, “8”, “a” and “m”. Those candidates that appear as single words in the VMs dictionary are classed as valid stems (in this case, I believe all six are valid stems).

Once we have a list of all the valid stems in the text, we can count how often each appears, and then order that list. This is what is done to obtain the lists above.

Because this method is fully general, we avoid any assumptions about how many characters a single VMs character maps to.
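As a rough illustration of the stem procedure just described, here is a minimal Python sketch. It follows the “8am” example: every contiguous substring of a word is a candidate, and a candidate counts as a valid stem if it also occurs as a word in its own right. The function names and the tiny word list are assumptions for the example, not the program that produced the lists above.

```python
from collections import Counter

def candidate_stems(word):
    """All contiguous substrings of a word, including the word itself."""
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + 1, len(word) + 1)}

def valid_stems(word, vocabulary):
    """Candidates that also appear as standalone words."""
    return {s for s in candidate_stems(word) if s in vocabulary}

def stem_frequencies(words):
    """Count valid stems over a word list and normalise so they sum to 1."""
    vocabulary = set(words)
    counts = Counter()
    for word in words:
        counts.update(valid_stems(word, vocabulary))
    total = sum(counts.values())
    return [(stem, n / total) for stem, n in counts.most_common()]

# Toy word list (hypothetical, not real VMs statistics).
sample = ["8am", "am", "8", "oe", "o", "8am"]
print(sorted(candidate_stems("8am")))
print(stem_frequencies(sample))
```

Passing `set(words)` instead of `words` to `stem_frequencies` would count each distinct word only once.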

Refinement

I changed the algorithm so that it accumulates prefixes/stems/suffixes only for unique words in the VMs (as opposed to accumulating them for all words). I think this is more sensible; otherwise a very popular word ends up skewing the statistics. After doing this, the results for suffixes look similar between Latin and VMs (Recipes), using 3800 words:

Top 20 Latin Suffixes (from a Latin dictionary)

s 0.08350305
o 0.042769857
t 0.03971487
m 0.034623217
is 0.029531568
e 0.02749491
us 0.026476579
a 0.022403259
es 0.020366598
rum 0.01934827
um 0.018329939
tum 0.017311608
mus 0.017311608
to 0.017311608
i 0.01629328
tus 0.01629328
tis 0.015274949
c 0.014256619
em 0.013238289
am 0.013238289

Top 20 Herbal Suffixes

9 0.094210714
89 0.045487236
e 0.040273283
ay 0.036857247
y 0.036857247
am 0.03613808
ae 0.029126214
an 0.028047465
oe 0.024631428
79 0.023552679
oy 0.023013305
8 0.023013305
o 0.020316433
ap 0.019417476
c89 0.018878102
c9 0.017979145
s 0.017799353
m 0.015462064
o89 0.014383315
19 0.01366415

This suggests the following (partial) cipher:

VMs Latin
=== =====
9 s
8 i
7 u
e m
a r
o a
y um
m is

1 t 
4 qu
c e
g f
k c
2 d
s p
h n
3 h


Top 20 VMs words translated

am -> ris
ay -> rum
ae -> rm
1c89 -> teis
4ohC9 -> quan?s
1c9 -> tes
oe -> am
4oham -> quanris
8am -> iris
4ohan -> quanr?
oham -> anris
okam -> acris
oy -> aum
an -> r?
ohan -> anr?
e -> m
2c89 -> deis
1c79 -> teus
ohC9 -> an?s
okay -> acrum
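Applying the partial cipher is a simple character-by-character substitution. The Python sketch below copies the mapping from the table above and writes “?” for any glyph that has no entry (such as “C” or “n”); the function name `decode` is just a label chosen for this example.

```python
# Partial VMs -> Latin mapping, copied from the table above.
CIPHER = {
    "9": "s", "8": "i", "7": "u", "e": "m", "a": "r", "o": "a",
    "y": "um", "m": "is", "1": "t", "4": "qu", "c": "e", "g": "f",
    "k": "c", "2": "d", "s": "p", "h": "n", "3": "h",
}

def decode(word):
    """Substitute each glyph; unmapped glyphs become '?' (case-sensitive)."""
    return "".join(CIPHER.get(glyph, "?") for glyph in word)

for vms_word in ["am", "ay", "1c89", "4ohC9", "8am", "okay"]:
    print(vms_word, "->", decode(vms_word))
```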

Looking for longer repeating character sequences

In this analysis, the software looks in the text for all nGrams that appear at least twice as (a) a prefix or (b) a suffix, or at least once as a stem, and calculates their (normalised) frequencies. I’m not sure what to make of the results!


For N=3, looking at the Herbal folios f1v-f20v inclusive, 1331 different words. 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     0.1010101               o89     0.05952381              o89     0.09009009 
4oh     0.07070707              1oe     0.055555556             8am     0.09009009 
1oe     0.060606062             4ok     0.055555556             1c9     0.054054055 
1oh     0.04040404              8am     0.04761905              1oy     0.054054055 
ok1     0.04040404              4oh     0.04761905              1oe     0.045045044 
8oe     0.030303031             1oy     0.03968254              coe     0.036036037 
1oy     0.030303031             1c9     0.031746034             cc9     0.027027028 
1co     0.030303031             1co     0.023809524             e89     0.027027028 
1ok     0.030303031             8oe     0.023809524             ham     0.027027028 
4oj     0.030303031             coe     0.01984127              2c9     0.027027028 

For N=3, processing the same number of different words from Thomas Hardy (English) 

Confirmed valid prefix/stem/suffix counts 87 160 67 
Prefix/Stem/Suffix frequency, normalised 
com     0.04597701              ely     0.025           ing     0.07462686 
par     0.022988506             ted     0.025           led     0.04477612 
rea     0.022988506             led     0.025           sed     0.04477612 
mot     0.022988506             sed     0.025           ely     0.04477612 
pla     0.022988506             ght     0.025           ted     0.029850746 
see     0.022988506             ing     0.01875         ter     0.029850746 
pas     0.022988506             ked     0.01875         son     0.029850746 
wai     0.022988506             per     0.01875         ned     0.029850746 
can     0.022988506             com     0.01875         ner     0.029850746 
smi     0.022988506             par     0.01875         mon     0.029850746 

For N=3, same number of words from Augustinus (Latin) 

Confirmed valid prefix/stem/suffix counts 102 197 83 
Prefix/Stem/Suffix frequency, normalised 
qua     0.039215688             ere     0.05076142              ere     0.04819277 
fac     0.029411765             qua     0.035532996             iat     0.04819277 
qui     0.029411765             fac     0.02538071              que     0.036144577 
dic     0.029411765             ita     0.02538071              ius     0.036144577 
pot     0.029411765             ius     0.02538071              ita     0.036144577 
ter     0.019607844             que     0.020304568             rum     0.024096385 
ali     0.019607844             dic     0.020304568             ent     0.024096385 
aud     0.019607844             ini     0.020304568             ram     0.024096385 
par     0.019607844             ans     0.015228426             unt     0.024096385 
cor     0.019607844             ent     0.015228426             ris     0.024096385 


For N=4 Voynich (statistics become poorer as N increases, of course) 

Confirmed valid prefix/stem/suffix counts 6 14 6 
Prefix/Stem/Suffix frequency, normalised 
4oko    0.16666667              o8ae    0.14285715              co89    0.16666667 
okam    0.16666667              okam    0.14285715              e8am    0.16666667 
oh2o    0.16666667              4ok1    0.071428575             o8an    0.16666667 
4okc    0.16666667              4oh1    0.071428575             e2oe    0.16666667 
k2co    0.16666667              co89    0.071428575             9koy    0.16666667 
4ohC    0.16666667              4oko    0.071428575             oKoy    0.16666667 
4ok1    0.0                     e8am    0.071428575             1o89    0.0 
4oh1    0.0                     oh2o    0.071428575             oe89    0.0 
ok1c    0.0                     o8an    0.071428575             o8ae    0.0 
ohoe    0.0                     4okc    0.071428575             ho89    0.0 

For N=4 English 

Confirmed valid prefix/stem/suffix counts 36 66 26 
Prefix/Stem/Suffix frequency, normalised 
pres    0.055555556             ined    0.045454547             sing    0.115384616 
dist    0.055555556             ring    0.045454547             ined    0.115384616 
weak    0.055555556             test    0.045454547             ally    0.07692308 
occa    0.055555556             ment    0.030303031             ring    0.03846154 
outl    0.027777778             pres    0.030303031             ence    0.03846154 
prob    0.027777778             sing    0.030303031             nded    0.03846154 
ment    0.027777778             weak    0.030303031             ding    0.03846154 
cons    0.027777778             prob    0.030303031             ning    0.03846154 
atte    0.027777778             hern    0.030303031             ness    0.03846154 
stan    0.027777778             sion    0.030303031             wing    0.03846154 

For N=4 Latin 

Confirmed valid prefix/stem/suffix counts 63 126 57 
Prefix/Stem/Suffix frequency, normalised 
faci    0.06349207              bant    0.03968254              ntes    0.0877193 
pecc    0.04761905              ntes    0.03968254              quam    0.05263158 
invo    0.031746034             faci    0.031746034             endo    0.05263158 
cred    0.031746034             pecc    0.031746034             ebam    0.03508772 
infa    0.031746034             endo    0.023809524             erem    0.03508772 
puer    0.031746034             ndis    0.023809524             iens    0.03508772 
habe    0.031746034             quam    0.023809524             ones    0.03508772 
form    0.031746034             quid    0.023809524             bant    0.01754386 
pare    0.031746034             rati    0.023809524             abam    0.01754386 
nesc    0.031746034             ibus    0.015873017             ndis    0.01754386 

For N=5 Voynich (no data satisfies selection) 

For N=5 English 

Confirmed valid prefix/stem/suffix counts 15 29 13 
Prefix/Stem/Suffix frequency, normalised 
consi   0.13333334              ation   0.06896552              ation   0.15384616 
ornam   0.13333334              consi   0.06896552              sting   0.15384616 
appea   0.06666667              ornam   0.06896552              dered   0.07692308 
dimen   0.06666667              sting   0.06896552              ality   0.07692308 
occup   0.06666667              still   0.06896552              ingly   0.07692308 
stand   0.06666667              dered   0.03448276              ental   0.07692308 
conce   0.06666667              ingly   0.03448276              rning   0.07692308 
sugge   0.06666667              dimen   0.03448276              ented   0.07692308 
diffe   0.06666667              occup   0.03448276              rence   0.07692308 
speci   0.06666667              ality   0.03448276              sions   0.07692308 

For N=5 Latin 

Confirmed valid prefix/stem/suffix counts 21 44 23 
Prefix/Stem/Suffix frequency, normalised 
volun   0.0952381               entes   0.06818182              entes   0.13043478 
pecca   0.0952381               batur   0.045454547             batur   0.08695652 
lauda   0.0952381               tibus   0.045454547             antur   0.08695652 
quaer   0.0952381               invoc   0.045454547             tibus   0.08695652 
metue   0.0952381               pecca   0.045454547             bamus   0.08695652 
invoc   0.04761905              lauda   0.045454547             torum   0.08695652 
infan   0.04761905              quaer   0.045454547             tatis   0.04347826 
inven   0.04761905              volun   0.045454547             itate   0.04347826 
nesci   0.04761905              metue   0.045454547             antes   0.04347826 
paren   0.04761905              bamus   0.045454547             bilis   0.04347826 
 
Here are the N=3 counts/frequency for the 1331 unique words in f1v-f20v of the Herbal: 

Confirmed valid prefix/stem/suffix counts 99 252 111 
Prefix/Stem/Suffix frequency, normalised 
4ok     10      0.1010101               o89     15      0.05952381              o89     10      0.09009009 
4oh     7       0.07070707              1oe     14      0.055555556             8am     10      0.09009009 
1oe     6       0.060606062             4ok     14      0.055555556             1c9     6       0.054054055 
1oh     4       0.04040404              8am     12      0.04761905              1oy     6       0.054054055 
ok1     4       0.04040404              4oh     12      0.04761905              1oe     5       0.045045044 
8oe     3       0.030303031             1oy     10      0.03968254              coe     4       0.036036037 
1oy     3       0.030303031             1c9     8       0.031746034             cc9     3       0.027027028 
1co     3       0.030303031             1co     6       0.023809524             e89     3       0.027027028 
1ok     3       0.030303031             8oe     6       0.023809524             ham     3       0.027027028 
4oj     3       0.030303031             coe     5       0.01984127              2c9     3       0.027027028 


(e.g. the sequence "4ok" appears 10 times at the start of a longer word (prefix)) 
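For readers who want to reproduce this kind of table, here is a minimal Python sketch of counting N-grams at the start and end of longer words. It is an assumption-laden illustration, not the original software: the names, the `min_count` threshold of two and the choice to normalise by the total of the kept counts are mine, and the stem column is omitted. The final lines only check that the listed frequencies equal the listed count divided by the first number on the "counts" line (for example 10/99 for the prefix "4ok").

```python
from collections import Counter

def positional_ngrams(words, n, min_count=2):
    """Count n-grams that start or end a longer word, over unique words."""
    prefixes, suffixes = Counter(), Counter()
    for word in set(words):
        if len(word) > n:  # the n-gram must be part of a longer word
            prefixes[word[:n]] += 1
            suffixes[word[-n:]] += 1

    def keep(counter):
        return {g: c for g, c in counter.items() if c >= min_count}

    return keep(prefixes), keep(suffixes)

def normalise(counts):
    """Divide each count by the total so the frequencies sum to 1."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

# Toy word list (hypothetical, not the real Herbal vocabulary).
words = ["4okam", "4okay", "4oke", "oham", "okam"]
prefixes, suffixes = positional_ngrams(words, 3)
print(normalise(prefixes), normalise(suffixes))

# Arithmetic check against the first row of the Herbal table above.
print(10 / 99, 15 / 252, 10 / 111)
```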

N=3 for 1331 unique words in the Astrological Section 

Confirmed valid prefix/stem/suffix counts 154 346 153 
Prefix/Stem/Suffix frequency, normalised 
okc     11      0.071428575             o89     16      0.046242774             o89     13      0.08496732 
ohc     8       0.051948052             okc     11      0.031791907             cos     6       0.039215688 
4oh     7       0.045454547             8ae     11      0.031791907             8am     6       0.039215688 
9hc     7       0.045454547             1co     10      0.028901733             8ae     6       0.039215688 
oko     6       0.038961038             oko     10      0.028901733             cc9     4       0.026143791 
oka     6       0.038961038             oho     9       0.02601156              coe     4       0.026143791 
oho     5       0.032467533             ohc     8       0.023121387             o79     4       0.026143791 
1ok     5       0.032467533             oka     8       0.023121387             oh9     4       0.026143791 
oh1     5       0.032467533             4oh     8       0.023121387             c79     4       0.026143791 
1co     4       0.025974026             9hc     7       0.020231213             c89     3       0.019607844 


N=3 for 1331 unique words in the Biological Section 

Confirmed valid prefix/stem/suffix counts 124 275 124 
Prefix/Stem/Suffix frequency, normalised 
4oh     13      0.10483871              c89     26      0.094545454             c89     17      0.13709678 
4ok     10      0.08064516              4oh     20      0.07272727              c79     13      0.10483871 
4oe     8       0.06451613              c79     13      0.047272727             1c9     9       0.07258064 
oeh     6       0.048387095             4ok     12      0.043636363             C89     7       0.05645161 
oe1     5       0.04032258              1c9     11      0.04                    2c9     7       0.05645161 
ohc     4       0.032258064             2c9     9       0.03272727              189     4       0.032258064 
soe     4       0.032258064             4oe     8       0.02909091              eoy     3       0.024193548 
oe2     3       0.024193548             oeh     7       0.025454545             cc9     3       0.024193548 
91c     3       0.024193548             8ae     7       0.025454545             hC9     3       0.024193548 
8ay     3       0.024193548             8ay     7       0.025454545             ae9     3       0.024193548 


N=3 for 1331 unique words in the Recipes Section 

Confirmed valid prefix/stem/suffix counts 135 303 143 
Prefix/Stem/Suffix frequency, normalised 
4oh     17      0.12592593              4oh     18      0.05940594              c89     13      0.09090909 
4ok     14      0.1037037               4ok     17      0.05610561              o89     13      0.09090909 
ohc     9       0.06666667              o89     16      0.052805282             189     8       0.055944055 
okc     8       0.05925926              c89     15      0.04950495              c79     7       0.04895105 
oeh     7       0.05185185              oeh     10      0.0330033               8am     7       0.04895105 
1co     5       0.037037037             1co     10      0.0330033               8ay     6       0.04195804 
g1c     4       0.02962963              ohc     9       0.02970297              coe     5       0.034965035 
4oj     4       0.02962963              c79     9       0.02970297              8ae     5       0.034965035 
ohC     4       0.02962963              8ae     9       0.02970297              1c9     4       0.027972028 
1oe     3       0.022222223             189     9       0.02970297              cc9     4       0.027972028 

Philip Neal’s Anagram Encryption

Notice how words tend to start with “4”, “o” and “1” and tend to end with “9”, “m” and “e”. This sort of feature has me excited about Philip Neal’s anagram encryption idea explained here: http://voynichcentral.com/users/philipneal/language.html which is summarised thus (quoting from that page):

  "1. Divide a plaintext into lines 
   2. Sort the words of each line into alphabetical order 
   3. Sort the letters of each word into alphabetical order 

   1. one thing led to another thing last night 
   2. another last led night one to thing thing 
   3. aehnort alst del ghint eno ot ghint ghint" 
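The three quoted steps are easy to express in code. The Python sketch below is my own minimal rendering of them (the function name is an assumption). Note that a strict alphabetical sort places “thing” before “to”, so the output differs from the quoted example in the position of “ot”.

```python
def anagram_encrypt(line):
    """Sort the words of a line, then sort the letters within each word."""
    words = sorted(line.split())
    return " ".join("".join(sorted(word)) for word in words)

print(anagram_encrypt("one thing led to another thing last night"))
```

Because both sorts throw away ordering information, the encryption is not reversible without extra knowledge, such as a dictionary of the plaintext language.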


Right now I am repurposing my Genetic Algorithm to attack some lines of the VMs assuming such an encryption. I am being killed by the permutations (which grow as the factorial of the word length).
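To get a feel for the size of the problem, the number of distinct letter orderings of a word is n! divided by the factorial of each repeated letter's count. A small Python sketch (the function name is mine):

```python
from collections import Counter
from math import factorial

def distinct_orderings(word):
    """Number of distinct permutations of the letters in a word."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total

for word in ["ot", "del", "ghint", "aehnort"]:
    print(word, distinct_orderings(word))
```

A seven-letter word with no repeated letters already has 5040 orderings, and the word order within each line multiplies the search space further.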