How about a “Verbose Homophonic cipher”?

September 24, 2010

I’ve had a bit of a hiatus from the VMs, but it’s always popping up in my mind and niggling me, even when I haven’t got time to spend on it. The latest niggle was the idea that the VMs scribe used a set of simple tables that showed how to convert plaintext letters into codes. So, in an example table, the letter “A” is written “4oh”, the letter “B” is written “8am”, and so on. Spaces in the plaintext also have their own code. Veteran VMs researcher Philip Neal informed me that this is called a “verbose homophonic cipher”.

Elaborating on the idea: the scribe uses one table from the set for each folio s/he is writing. To encipher the plaintext onto the folio, it’s simply a matter of writing down the VMs “word” for each letter in the plaintext word. If there is room on the line for the next plaintext word, the scribe writes down the code for a space, and then the codes for the letters in the next word. Long spaces are written by writing the code for a space more than once … If there is no room, the next word goes on the next line, and so on.

On the next folio, a different table may be used.

It’s hard to imagine the justification for such a scheme, but it does appear (at least initially) to fit some of the features of the VMs script (especially the repeating VMs words often seen).

I made a quick test that looks at VMs word frequencies on a single folio (in the Recipes section, which has the densest text). It showed a word frequency distribution that looks similar to the letter frequency distribution in Latin, apart from the most frequently occurring word, which is much more frequent than any Latin letter and which, under this scheme, would code for a space.
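As a rough illustration of that kind of test (a minimal sketch in Python, not the code actually used for this post), the word counts for one folio could be tallied as follows. The function name and the sample string are made up for the example; a real run would load a transcribed folio.

```python
from collections import Counter

def word_frequencies(folio_text):
    """Return (word, count, share) tuples, most frequent word first."""
    words = folio_text.split()
    counts = Counter(words)
    total = len(words)
    return [(w, c, c / total) for w, c in counts.most_common()]

# Hypothetical fragment standing in for a transcribed folio.
sample = "8am ay 8am okoe 8am 4ohoe ay 8am 2ay"
ranked = word_frequencies(sample)
for word, count, share in ranked:
    print(f"{word:6s} {count:3d} {share:.2f}")
```

The resulting ranked shares can then be set beside a ranked letter frequency table for Latin (with space counted as a letter) to see whether the two curves have a similar shape.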

However, on a typical folio, there are usually many more VMs words than there are plaintext letters. So the scheme has to be extended to allow the scribe a choice between several different VMs words to encode a single letter. Each table must have a set of words appearing in each plaintext letter column. Something like this:

Plaintext:   (space)        a                 b
VMs words:   8am ay okoe    4ohoe 2ay 1coe    faiis 4ay oka
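To make the extended scheme concrete, here is a minimal Python sketch of enciphering with such a table. It assumes the nine example words divide three to a column (space, a, b); that grouping, the table name and the function names are assumptions for illustration only, not a reconstruction of any real VMs table.

```python
import random

# Hypothetical code table for one folio: each plaintext symbol has
# several alternative VMs words (assumed three per column).
TABLE = {
    " ": ["8am", "ay", "okoe"],
    "a": ["4ohoe", "2ay", "1coe"],
    "b": ["faiis", "4ay", "oka"],
}

def encipher(plaintext, table, rng=random):
    """Replace each plaintext symbol with one of its VMs words, chosen freely."""
    return [rng.choice(table[ch]) for ch in plaintext]

def decipher(vms_words, table):
    """Invert the table; this works because no VMs word sits in two columns."""
    reverse = {w: ch for ch, ws in table.items() for w in ws}
    return "".join(reverse[w] for w in vms_words)

cipher_words = encipher("ab ba", TABLE, random.Random(0))
print(" ".join(cipher_words))
print(decipher(cipher_words, TABLE))
```

Every plaintext letter becomes a whole VMs word, so the ciphertext has as many words as the plaintext has letters and spaces, which is why the free choice between alternatives is needed to account for the size of the vocabulary on a folio.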

If this is indeed the scheme, one would expect to see patterns in the VMs word sequences that match patterns seen in the letter sequences of e.g. Latin words. Also, as Philip Neal pointed out, patterns like “word1 word2 word2 word1” would indicate a plaintext letter sequence of either “vowel consonant consonant vowel” or vice versa.

Looking through the whole of the VMs for sequence patterns (on the same line of text), I found the following:

  • There are no 4-word sequences that repeat at all
  • There are only four 3-word sequences that repeat, and each occurs only twice
  • There are no sequences at all of the form “xyyx”

(all of which I find rather surprising, and thought-provoking).
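The three checks above can be sketched in Python roughly as follows. This is a minimal illustration, not the program actually used; it assumes the transcription is available as a list of text lines with VMs words separated by whitespace, and the sample lines are invented.

```python
from collections import Counter

def repeated_ngrams(lines, n):
    """Count n-word sequences (never crossing a line break) seen more than once."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c > 1}

def xyyx_patterns(lines):
    """Find 4-word runs of the form x y y x, with x different from y."""
    found = []
    for line in lines:
        words = line.split()
        for i in range(len(words) - 3):
            a, b, c, d = words[i:i + 4]
            if a == d and b == c and a != b:
                found.append((a, b, c, d))
    return found

# Hypothetical lines, chosen so that each check finds something.
lines = ["8am ay okoe 8am ay okoe", "oka 2ay 2ay oka faiis"]
print(repeated_ngrams(lines, 4))
print(repeated_ngrams(lines, 3))
print(xyyx_patterns(lines))
```

On the invented sample this reports one repeated 3-word sequence, no repeated 4-word sequence and one “xyyx” run; the finding reported here is that the real text yields almost nothing for any of the three.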

So it looks like this hypothesis is dead in the water, and can be ticked off that long list of “things it might have been but in fact don’t fit”!

(It turns out that Elmar Vogt has been working on a related, but more sophisticated, idea, which he calls the “Stroke Theory” and describes on his blog.)

  1. September 25, 2010 at 5:20 am

    Hi Julian,

    First, some quick definitions to mull over:-

    * A “verbose cipher”: *at least some* of the single plaintext letters get mapped to (what appear to be) multiple letters in the target alphabet.

    * A “pure verbose cipher”: *all* of the individual plaintext letters get mapped to (what appear to be) multiple letters in the target alphabet.

    * A “homophonic cipher”: a cipher where the encipherer has a choice of possible ciphertext shapes for each plaintext letter.

    * A “verbose homophonic cipher”: a cipher where the encipherer has a choice of possible ciphertext sequences (some of which appear to be multiple shapes) for each plaintext letter.

    To my eyes, common Voynichese sequences (such as qo / ol / al / or / ar / ee / eee / am / an / ain / aiin / air / aiir, and even o + gallows and y + gallows) do give every indication of being in verbose cipher: and I think the stats bear this out. But the plaintext can’t then be ‘simple’ language, because the average enciphered word length would be substantially longer than we see.

    Hence I think that there is an element of 15th century scribal shorthand going on: specifically, that mid-word ‘d’ (8) enciphers a shorthand ‘contractio’, while word-terminal ‘y’ (9) enciphers a shorthand ‘truncatio’. “qokedy” would then be “qo” + “k” + “e” + (omitted internal syllable) + (omitted terminal syllable).

    In this way, positing Voynichese as “verbose enciphered shorthand” kind of balances the overall equation. So on the one hand, its shorthand aspect is shortening the text (but introducing a few extra tokens); while on the other hand, its verbose aspect is bulking it back out again.

    However, even though this helps explain a great deal of how Voynichese letters form the patterns they do, there is – as you point out – very probably a yet further layer of obfuscation going on that functions to prevent long sequences being repeated. Now, I really don’t think that this extra layer will turn out to be anything as complex as a full-on polyalpha (because we seem to have many universal features of the cipher that do remain constant). However, there seems to be something added to the mix for some of the characters (probably gallows-related) that is ~just enough~ to disrupt our stats gathering.

    And that’s pretty much where the Voynichese verbose cipher reasoning chain currently halts. Just so you know there’s nothing new under the sun! 🙂

    Cheers, ….Nick Pelling….

    • JB
      September 27, 2010 at 8:00 am

      Thanks, Nick – useful comments, as usual. I am aware of your theories about nulls and the other letter features, as you have explained them before. However, using them with a GA I was not able to find a good match to the languages I tried, so there is probably an extra level of obfuscation in play, if indeed you are correct.

      (Did I ever send you those results?)

  2. October 3, 2010 at 11:13 am

    A certain text generated by substituting synthetic words for end-to-end character 2-grams has more than 2000 types and more than 20000 tokens. The ten most common 2-grams have more than one substitute. A link to the stats is in the “Website” box for comments to the blog.
    This is a simplified form of generated text in which:
    1) more than ten 2-grams have multiple substitutes
    2) there is transposition of fractionated strings
    Word series stats for a text generated in this manner will vary with different source languages.
    The length and number of repeated word series and/or the average number of times word series repeat can be reduced by controlling “1” and/or “2”, above.
    A simpler method would be to forget about “1” and “2” but use a simple transposition of 2-grams before testing with GA against the same source language from which the character 2-grams were obtained. I suggest beginning with no transposition.
    If this explanation is not clear and if you are interested, send me an E-mail.

    • JB
      October 4, 2010 at 9:01 am

      Nice! Looks like you have already been down a similar path, but using plaintext letter pairs.

      Did you use pairs simply because it gives a much larger number of possibilities (n^2)?

      Did you try to match any of the VMs text to this hypothesis?


  3. October 4, 2010 at 2:00 pm

    The code was an attempt to mimic language. It has more repeated word n-grams than the VMs. It does not accumulate as many unique words as the texts I have studied, including the VMs. The VMs is well within the ballpark of known writing. The unaltered code is not. Fractionation and/or a larger CT vocabulary can correct that. We can’t be sure the accumulation of uniques in the VMs isn’t distorted by alternate word forms, errors of writing, errors in reading, and illegible glyphs. Missing and out-of-order pages do not matter for this. If all that could be overcome, I believe the VMs would still be within the range of ordinary text.
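For readers who want to try the comparison, the accumulation of unique words mentioned in this comment can be measured with a simple running count. This is a minimal Python sketch, not the commenter's own tool; the function name and the sample words are hypothetical.

```python
def vocabulary_growth(words):
    """For each position in the text, the number of distinct words seen so far."""
    seen = set()
    curve = []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

# Hypothetical word list; a real comparison would use a full transcription.
curve = vocabulary_growth("8am ay 8am okoe ay".split())
print(curve)
```

Plotting this curve for the VMs, for generated text and for ordinary texts of similar length is one way to see whether a text accumulates unique words at a comparable rate.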

    In one sense, code eliminates the problem of the peculiar word structure. However, the question of vocabulary construction remains. Whether we have an enciphered code or a cipher only, there are not enough single letters to map to discrete (non-overlapping) n-grams in the VMs. Worse than that, we have not found a set of discrete n-grams. It’s obvious to me that the problem can only be overcome with the concept of unwritten glyphs. That strays from the topic here. Other problems in matching the VMs, which have been discussed, are more difficult. The best we can hope for with GA at this point is a significantly better than random match to a language. If that happens, we will be partially right about some of the VMs characteristics, if not about how they happened. This we can try without assuming a post-fifteenth-century mindset in the development of the VMs script.

    • JB
      October 4, 2010 at 2:28 pm

      This is good to hear – I think we are on the same page (folio)! Having spent many pleasurable hours checking various exotic cipher and code ideas, I have found that none of them remotely fits when using a GA (except, notably, an nGram mapping with the language of Dante as the plaintext, a form of early Italian, which produces results significantly better than all other languages tried, including Latin, German, English, Spanish, Dutch, Chinese etc. – see below).

      My faith in the GA technique rests on how quickly it gives an idea of how well a code/cipher theory fits the VMs text.

      A significant problem is the machine transcriptions we have of the VMs. Basically (as you and I have found out before) they differ substantially, to the extent that statistics obtained with, say, EVA do not match well with statistics obtained with, say, Voyn_101. A particular problem is glyph bloat … my opinion is that GC’s Voyn_101 transcription contains many more glyphs than the scribes were actually using. Little differences between the ways of writing “9” for example, are classified as different glyphs. This plays havoc with statistical analysis.

      So I have a procedure that filters the Voyn_101 and remaps e.g. those multiple “9” glyphs to the same glyph.
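A filter of this kind might look something like the Python sketch below. It is only an illustration of the idea, not the actual procedure: the variant characters “%” and “&” are placeholders, not real Voyn_101 codes, and the function names are invented.

```python
# Placeholder variant groups: canonical glyph -> characters to fold into it.
# "%" and "&" are NOT real Voyn_101 codes; they stand in for variant "9" forms.
VARIANT_GROUPS = {"9": "%&"}

def build_filter(groups):
    """Build a translation table that maps every variant to its canonical glyph."""
    mapping = {}
    for canonical, variants in groups.items():
        for v in variants:
            mapping[v] = canonical
    return str.maketrans(mapping)

def filter_transcription(text, groups):
    """Return the transcription with all variant glyphs remapped."""
    return text.translate(build_filter(groups))

print(filter_transcription("okoe% 8am& 9", VARIANT_GROUPS))
```

Keeping the groups in a single dictionary makes it easy to rerun the statistics with more or less aggressive merging of glyph variants.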

      Anyway, your idea of plaintext letter doublets mapping to VMs glyphs is excellent. We need something like this to account for multiple recurring VMs “words” like “8am 8am 8am” and to allow a sufficiently large vocabulary for the cipher/code system. It perhaps couples with a set of code pages one of which is selected for use at the start of each folio.

      But how would this fit with the labels? Most of the labels are single Voynich “words”. These would decipher as plaintext letters or letter pairs, which is an odd way of labeling things if those labels don’t appear in the surrounding text. (E.g. one can imagine writing “The herb marked as “A” is deadly nightshade”, and placing an “A” next to the drawing … but we don’t see VMs labels appear like this in the VMs text.)

      Here is an extract of the Dante Alighieri text that matches decently using nGrams to the VMs:

      Cjant Prin

      A metàt strada dal nustri lambicà
      mi soj cjatàt ta un bosc cussì scur
      chel troj just i no podevi pì cjatà.

      A contàlu di nòuf a è propit dur:
      stu post salvàdi al sgrifàva par dut
      che al pensàighi al fa di nòuf timour!

      Che colp amàr! Murì a lera puc pi brut!
      Ma par tratà dal ben chiai cjatàt
      i parlarài dal altri chiai jodùt.

      I no saj propit coma chi soj entràt:
      cun chel gran sùn che in chel moment i vèvi,
      la strada justa i vèvi bandonàt.

      Necuàrt che in riva in su i zèvi
      propit la ca finiva la valàda
      se tremaròla tal còu chi sintèvi

      in alt jodùt iai la so spalàda
      vistìda belzà dai rajs dal pianèta
      cal mena i àltris dres pa la so strada.

  4. October 4, 2010 at 7:16 pm

    I should have paid attention to labels all along. Verbosity definitely is a problem. Another set of code pages might contain whole word substitutions for nouns and labels are, perhaps, all nouns. That’s the best I can do in trying to plug holes. In the running text, a word re-location system could cause repeated words of high frequency.
