Entropy of the Voynich text

The Shannon entropy of a string of text measures the information content of the text. For text that is completely random, i.e. where every character is as likely to appear as any other, the entropy (or “disorder”) is high. For a text that is one long string of identical characters, for example, the entropy is low (in fact zero).

Mathematically, the Shannon Entropy is defined as:

$$H = -\sum_{i=1}^{N} p_i \log p_i$$

where $p_i$ is the relative frequency of the i’th character in the text (its count divided by the total number of characters), and the sum runs over all N distinct characters that appear.
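
As a rough illustration, the definition above can be computed over a whole text treated as one long character string. The sketch below is a minimal Python version, not the code used for the results on this page; the function name `shannon_entropy` and the choice of base-2 logarithms (which gives bits per character) are assumptions, since the base of the logarithm is not stated here.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of `text`, in bits per character.

    Assumes base-2 logarithms; every character, including spaces,
    counts as a symbol.
    """
    if not text:
        return 0.0
    counts = Counter(text)  # occurrences of each distinct character
    total = len(text)
    # -sum(p * log2(p)) written as sum(p * log2(1/p)), with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    print(shannon_entropy("aaaa"))  # a single repeated character
    print(shannon_entropy("abcd"))  # four equally frequent characters
```

A string of identical characters such as "aaaa" scores 0, while "abcd", in which each of four characters appears equally often, scores 2 bits per character.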

If the Voynich text were randomly created (by whatever means), we would expect it to have high entropy (i.e. to be very disordered). What we find instead is that the text is ordered, with low entropy, and is rather more ordered than English, for example. The table below compares the Voynich text with several other texts in different languages.

| Language      | Source               | Entropy |
|---------------|----------------------|---------|
| Voynich       | GC’s Transcription   | 3.73    |
| French        | Text from 1367       | 3.97    |
| Latin         | Cantus Planus        | 4.05    |
| Spanish       | Medina 1543          | 4.09    |
| German        | Kochbuch 1553        | 4.15    |
| English       | Thomas Hardy         | 4.21    |
| Early Italian | Divine Comedy 1300   | 4.23    |
| None          | Random characters    | 6.01    |

The last entry in the table shows the entropy for a random text: at 6.01, it is about 1.6 times the entropy of the Voynich text.
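
To put the random row in context: if every one of N characters is equally likely, the entropy formula reduces to log N, the highest value an N-character alphabet can reach. The sketch below checks this, along with the ratio between the random and Voynich rows. It assumes base-2 logarithms, which this page does not state, and the helper name `max_entropy_bits` is hypothetical.

```python
import math


def max_entropy_bits(alphabet_size: int) -> float:
    """Entropy in bits of a text whose characters are all equally likely."""
    return math.log2(alphabet_size)


# Values copied from the table above.
voynich_entropy = 3.73
random_entropy = 6.01

# How much higher the random text scores than the Voynich text.
ratio = random_entropy / voynich_entropy

# Alphabet size that would give the random row's entropy if its
# characters were uniformly distributed (assumes base-2 logarithms).
implied_alphabet_size = 2 ** random_entropy

if __name__ == "__main__":
    print(f"64 equally likely characters: {max_entropy_bits(64):.2f} bits")
    print(f"random / Voynich ratio: {ratio:.2f}")
    print(f"implied uniform alphabet size: {implied_alphabet_size:.1f}")
```

Under that assumption, an entropy of 6.01 would correspond to roughly 64 equally likely characters.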

  1. May 27, 2015 at 8:21 am

    This shows that some languages use words that are more similar to each other than do other languages. If there is a variety of words (however similar) and syntax, human-readable information can be transferred. In calculating entropy, the often repeated nGrams have more influence than the others. This is seen in Rank X Frequency charts, which are negatively correlated with entropy. Ignoring syntax, Voy is like phonetic Chinese in the first eleven most frequent 2Grams and again in ranks 32-40. The foregoing is about h1 entropy. The big difference is in h1 minus h2 entropy for which Voy scores high. That is because glyph beginning strokes are compatible with ending strokes of immediately previous glyphs. No such constraint applies in known text. If spaces are included in the calculation, there is even more difference in the h1-h2 scores because so many words have the same endings and beginnings. I think we need more text than is in the VMs sections to calculate word entropy.

    • JB
      May 28, 2015 at 5:25 pm

      I should also point out that this entropy calculation uses the *whole text*, as a long character string (spaces and all), for all the texts mentioned.

    • JB
      May 28, 2015 at 5:28 pm

      Another reply: my tests with Pinyin give a very high entropy score – much higher than Voynich and the highest of all the languages I tried. Is Pinyin what you were referring to as “phonetic Chinese”?

  2. JB
    May 27, 2015 at 8:26 am

    Yes, I deliberately made the calculation as simple as possible. The moment one starts assuming “words” are indeed words, or looking at n-grams, the result is open to interpretation.

  3. May 27, 2015 at 10:58 am

    I’d be interested to see what happens if the tests are run quire-by-quire. I wonder at our habit of treating the text as homogeneous-by-default when it is plainly diverse in both subject matter and style of imagery.

    Or did someone do that already?

    • JB
      May 27, 2015 at 12:56 pm

      The text is certainly not homogeneous: who told you that?! 🙂

      I’m sure there would be entropy differences between the Quires, and especially between Language A and Language B.

      But the point of this post was just to highlight the fact that the text contains information, and is not random gibberish.

  4. May 27, 2015 at 3:53 pm

    ‘But the point of this post was just to highlight the fact that the text contains information, and is not random gibberish.’

    The fact that it is not random gibberish does not imply that it must contain information – the only information you’ll find in it is how it was constructed.

    • JB
      May 27, 2015 at 3:59 pm

      Yes, perhaps you are right, Tony.

  5. May 28, 2015 at 10:45 am

    As a non-linguist and non-IT person, and non-cryptanalyst, I may be asking a foolish question – but has anyone tested the Vms’ written text against forms of language that are not ordinary forms of text – like Moby Dick and Shakespeare?

    It seems to me that it fits very easily into forms of technical expression. I’ve played with comparing its overall patterns to those of weaving, and its internal patterns to those of other “patterns” such as recipes for dyes, knitting pattern code, and I guess if you suppose those trivial because ‘female’ – alchemical recipes, or even portolan and ships’ logs records would belong in the same class – lots of short, seemingly meaningless strings which use abbreviation to convey what I suppose you could call esoteric information. Musical notation might be another.

    They might explain why the results seem to point to Chinese, or Jurchen, etc.

    Of course, the language might be Chinese, or Jurchen. 🙂

  6. May 28, 2015 at 11:03 am

    My pet theory is that the various bits were collated from sources which contained matter relevant to the chart-maker’s needs, the old cosmographers carefully correlating astronomical and geographical matter, and filling the labels to various places with snippets from various texts – not to mention trying to picture the plants appropriate to different regions. By 1375, it had certainly all come together in Cresques’ illuminated almanac, the so-called Atlas Catala. …. so the written text might have a whole lot of geographic and astronomical co-ordinates included, with book references for the snippets, and gd knows what else. Glad it’s not my problem.

  7. May 28, 2015 at 11:27 pm

    on the point of a last glyph in a string also being the first of the next string – I believe that this is a formal convention in some forms of poetry, which need not rhyme. It’s called “linked verse” in its English form, where the last-and-first unit is normally a whole word. In other traditions, it might be a word as a single character (I’m thinking of Chinese here), or it may be a single sound/syllable. I believe the Japanese form is of that sort, though I may be mistaken.

    • JB
      May 29, 2015 at 9:28 am

      I don’t think that’s what Knox meant – the last glyph of one word is very rarely the first glyph of the next. I think he meant that e.g. if the last glyph of one word is X then the probability of the next word beginning with Y is much higher than in known languages.

      • May 29, 2015 at 6:46 pm

        Yes. Sorry about the ambiguity.

  8. May 29, 2015 at 7:00 pm

    I hesitate to expand further but I think we are on the wrong track. It depends on whether I understand what I am about to write. In the ordinary sense, a substitution cipher (excepting simple substitution) can have no more “usable information” for our purposes than the plaintext from which it is derived but it has much more “Shannon Information”. In the cipher, we can’t predict any letter by a previous letter. So the VMs, by our transliterations, appears to be on or beyond one margin of written language according to h1 entropy. Even so, it has a sufficiently large lexicon (or lexicons) for a language. But where is the syntax? The best hope, so I think, is to try to unravel a word or nGram transposition. At the same time, we have to figure out some character alterations that, apparently, are designed to format paragraphs.

  9. May 29, 2015 at 10:51 pm

    I should have added an obvious example of the type of condensed technical note: the ‘de gradibus’ type where compounds or mixtures are described as hot, cold etc. in degrees up to .. I think.. four.

  10. June 4, 2015 at 11:58 am

    Many older ways of writing didn’t break the text into words. For reasons to do with the ms’ codicology, it occurs to me that the word-breaks may be no more than a convenience intended for people setting type. Not that we have at present anything remotely like a book printed in Voynichese, unless I missed it.

    More reasons to work quire-by-quire, I suppose.

    oh – and Knox, thank you.

  11. Jon p
    June 30, 2015 at 6:07 pm

    I’m posting here in hopes of sharing some information I have that seems to be somewhat less studied. It’s rather lengthy but quite valid. First off, I’m not a VM researcher, so keeping it to myself serves little purpose. Secondly, I don’t have the means to analyze it in a manner like what I see here. However, this type of analysis is what would be needed to make use of the things I’ve observed. Upon request I’d be happy to share them in detail. I’m primarily interested in anomalies of all types, and have of course heard of the VM. I don’t believe I have what it takes to unravel the text, but I do think I can help outline a method to cross-reference the art in a manner to allow researchers to “infer” the likelihood of what some text may say. In short, the layout of the ms, its order, and its drawings make sense to me. You can only obscure or encrypt a picture so much until it’s pointless. The authors did a fine job but they left plenty of clues. The overall ms (text aside) is parallel to many highly structured sympathetic “magical” systems with which I’m very familiar. Despite the contents or topics of these systems, they all share a common element. That is to tie together or map the correspondence of certain things within the structure for a specialized result. That is the point of the VM, in a nutshell. I’ve departmentalized the art and identified some new categories of repetition across the overall ms, and there is, without a doubt, a pattern. I don’t intend to go much further with it, but it seems unfortunate not to hand off the information to someone who will. I hope somebody is willing to give it a shot.

  12. Garrett
    July 31, 2015 at 9:22 pm

    Has anyone tried applying deep learning algorithms to this? I’d be really curious to see if it’s possible to train an autoencoder on Voynich data and examine what we can do with it. I know there’s some work out there that uses autoencoders to do cross-language learning- Lauly and LaRochelle have a few decent papers about using bag-of-words autoencoders to translate between languages, which might be worth exploring.

    • JB
      July 31, 2015 at 9:44 pm

      Interesting question, Garrett. I’m not familiar with autoencoders, so am not sure how they could be applied to the Voynich text in practice.

  13. SteveofCaley
    January 28, 2016 at 8:01 pm

    Have the individual chapters been analyzed for their entropy score? How might that vary? I would expect that chapters requiring a broader vocabulary or concept set would appear more entropic.

  14. kevin
    May 23, 2016 at 6:31 pm

    I have read that the Voynich Manuscript text was written from right to left. However, the curvature of the writing on a lot of lines of text appears to dip to the lower right, indicating the text was written by a right-handed person. Looking closely page by page, it becomes clear that a large percentage of the paragraphs end with space on the right-hand side. Also, on most pages containing a larger amount of text and less graphic artwork, the writing starts on the left side of the page with the image on the right. Line by line, the first letters on the left stay nicely vertical, while the last words at the end of each line (on the right) vary, showing the intelligence to start the writing in the top left corner of each paragraph and fit the text into the space left available, the images and decorations already being present on the pages before the writing began.

    • JB
      May 23, 2016 at 6:37 pm

      I agree with your observations, but I have not heard the theory that it was written right to left before. My opinion is that the text may have been written neither RtoL nor LtoR but in a way where each glyph was placed on the page following an unknown prescription that defined its location.
