Using t-distributed Stochastic Neighbor Embedding (t-SNE) to cluster folios

September 26, 2017

For this attack we’ll use the Takeshi EVA transcription to count the number of times each glyph appears on each folio. Dividing each count by the total number of glyphs on that folio gives us a vector of glyph probabilities for each folio. The vectors are 24 long, as there are 24 EVA glyphs in the alphabet.

For example, here is the probability vector for f1r:

1r 28 lines {'a': 0.08917835671342686, 'c': 0.08216432865731463, 'e': 0.05110220440881764, 'd': 0.06212424849699399, 'f': 0.00501002004008016, 'i': 0.08617234468937876, 'h': 0.12324649298597194, 'k': 0.045090180360721446, '*': 0.012024048096192385, 'm': 0.001002004008016032, 'l': 0.03507014028056112, 'o': 0.11923847695390781, 'n': 0.050100200400801605, 'p': 0.012024048096192385, 's': 0.06412825651302605, 'r': 0.04408817635270541, 't': 0.03907815631262525, 'y': 0.07915831663326653}

(This reads as glyph “a” appears 8.9% of the time on f1r, glyph “c” 8.2% of the time, and so on.)
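
As a rough illustration of how such a vector can be computed, here is a minimal Python sketch. It is not the code used for this post: the glyph list GLYPHS, the function name glyph_frequencies and the sample text are all assumptions made up for the example, and a real run would read each folio's text from the transcription file and use the full 24-glyph alphabet.

```python
from collections import Counter

# Hypothetical glyph set for illustration only (an assumption, not the
# 24-glyph list used in the post).
GLYPHS = "acdefhiklmnopqrsty"


def glyph_frequencies(text, glyphs=GLYPHS):
    """Return the relative frequency of each glyph in `text`, ordered as `glyphs`."""
    # Count only characters that belong to the glyph set; word separators
    # such as '.' are ignored.
    counts = Counter(ch for ch in text if ch in glyphs)
    total = sum(counts.values())
    if total == 0:
        # An empty folio gets an all-zero vector rather than a division error.
        return [0.0] * len(glyphs)
    return [counts[g] / total for g in glyphs]


# Made-up EVA-like text, not an actual line of the transcription.
sample = "daiin.chol.shol.daiin"
vector = glyph_frequencies(sample)
print(dict(zip(GLYPHS, vector)))
```

Every folio yields a vector of the same length and in the same glyph order, with zeros for glyphs that do not occur, so the vectors can be compared directly.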

The question is: how similar are these frequency distributions amongst all the folios? Using t-SNE (implemented in scikit-learn here: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) we can try to find a 3D arrangement of all the folios in which folios with similar glyph frequency vectors end up close together.
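
For readers who want to try this, here is a minimal sketch of the embedding step with scikit-learn's TSNE class. It is not the code behind the plots in this post. The input is synthetic stand-in data (two made-up glyph profiles with 20 "folios" each), and the perplexity, init and random_state values are assumptions picked only so that the example runs.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 "folios" x 24 glyph frequencies, drawn around two
# made-up Dirichlet profiles. These are NOT real Voynich frequencies.
profile_a = rng.dirichlet(np.ones(24))
profile_b = rng.dirichlet(np.ones(24))
folios_a = rng.dirichlet(profile_a * 200 + 1e-3, size=20)
folios_b = rng.dirichlet(profile_b * 200 + 1e-3, size=20)
X = np.vstack([folios_a, folios_b])  # shape (40, 24), each row sums to 1

# n_components=3 asks for a 3D embedding. perplexity must be smaller than the
# number of samples; 10 is an arbitrary choice for this small example.
tsne = TSNE(n_components=3, perplexity=10, init="random", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)
```

Each row of embedding is the 3D position of one folio. t-SNE is stochastic, so a different random_state gives a different layout, which is why separate runs produce differently oriented plots.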

Here’s a typical result: each folio appears as a point in 3D space …

The colour coding is: red dots are folios that Currier identified as “Language A”, blue are “Language B”, and the remaining black dots do not have an assignment.

The red and blue points are clearly well separated. Since the embedding is computed from glyph frequencies alone, without any knowledge of the labels, this is independent support for Currier’s assignments.

There are a couple of notable features:

  • f57r and f57v are labelled as Language A (red) – but it looks like they should be labelled as Language B (blue)
  • The unassigned folios (black dots) look like they are all Language B
  1. September 26, 2017 at 9:51 pm

    Thanks Julian,

    that’s quite a visible result. The separation is clear, but are the two clouds also ‘disconnected’? It seems not, but this representation may not be able to tell, due to the successive projections from 24-D to 3-D to 2-D….
    It is still the clearest result I’ve ever seen that’s just based on single-character distribution. (Or even sub-character given the nature of Eva).

    • JB
      September 27, 2017 at 8:56 am

      I wish I could insert the rotatable version of the graphic – which allows one to angle it in such a way as to see that the clouds are quite well separated. I was surprised that just glyph frequency was good enough to see separation by “Language”!

      • September 27, 2017 at 9:02 am

        I wonder – Do digraph vectors also show a similar separation?

      • JB
        September 27, 2017 at 10:28 am

        David, I wonder that too – it is not hard to compute, but the vectors would be 24^2 = 576 long, so it might take quite a bit more time. I’d expect the pairwise frequencies to show the same split into Language A and B clusters.

        I have also looked at the Quires using this method, and it shows clustering too, but nothing that would not be expected from the Language A/B folio assignments.

      • September 27, 2017 at 1:24 pm

        That’s a nice bit of Voynich candy, JB. So if we were to look at the thing in 3D, there would be a definite gap between the two groups and not a gradual transition?

      • JB
        September 27, 2017 at 1:37 pm

        Yes, I will try to find a better separation image. There are a couple of folios that always appear on the boundary (I forget which) – it may be a foldout that Rene mentions on his page.

      • JB
        September 27, 2017 at 1:47 pm

        Here’s a better orientation (from a different run):


        The two folios that always appear on the boundary are f51r/v. As noted, f57r/v appears to be wrongly assigned.

  2. September 27, 2017 at 8:35 am

    That’s a fascinating test – thanks for posting it!

  3. September 27, 2017 at 10:52 am

    Digraph vectors are exactly what I looked at here:

    • JB
      September 27, 2017 at 11:16 am

      I *knew* I’d seen similar scatter plots somewhere, Rene! Thanks for the link and reminder!

    • JB
      September 27, 2017 at 12:43 pm

      Rene, your analysis showing the bridging of the two languages on the foldout pages is a great find I haven’t seen mentioned anywhere else, or discussed (but please let me know if it’s been discussed somewhere in depth).

      You say “The ‘bridging’ between the two languages is located exclusively on the foldout pages. This is an important feature that equally still lacks an explanation, but which almost certainly must be related to the order in which the MS has been created.” – what is your reasoning about the order?

      My running theory has always been that Language A was a) written earlier than Language B, and b) uses a different encoding/ciphering/transcription scheme from Language B, and that this was probably because the Languages were written by different scribes using different rules. But, if you are correct and the foldouts are a mixture of Language A and B, then it implies that that theory is all wrong, that in fact the Languages are more related to the subject matter. In other words, the Language used on a folio is determined by what is being described on that folio, and the foldout folios have subjects that are a mixture of those elsewhere in the manuscript.

  4. September 27, 2017 at 12:00 pm

    Julian B.
    Manuscript 408 is written in the Czech language. The manuscript is not about alchemy, astrology, herbs, pharmacy, etc. The manuscript was written by Eliška of Rosenberg.
    MS 408 is encoded (encrypted) by a Jewish substitution.
    The EVA alphabet is bad.

    • JB
      September 27, 2017 at 12:11 pm

      I don’t think so, but I admire your confidence!

  5. September 28, 2017 at 12:15 pm

    In the “good old days” when I was of a more speculative mind, I imagined that the following could have happened.
    Two people agreed on some kind of a code or method to generate the text. This is the version “in the middle”. One concentrated on the plant pages and he gradually evolved into the A direction.
    The other concentrated on the bits with nymphs and stars and he evolved in the B direction. He wrote more cursive and much faster, so he was finished first, and at the end helped Mr. “A” to do the last herbal pages which we now know as herbal-B.

    What speaks against this wonderful scenario is that I can’t see why they would have started on the foldout pages and done the normal pages afterwards…

    • JB
      September 28, 2017 at 12:48 pm

      I love that scenario! Perhaps the foldout pages were the most important parts that needed to be got down first, and the rest was done later? In particular, the Rosettes are clearly special and significant …

  6. October 1, 2017 at 7:33 am

    Julian – just to be clear – does this idea re ‘bridging’ represent an extrapolation from the observations made by Currier, or is it not indicated at all by Currier’s observations and comments? I mean, is it an obvious implication of his ‘A’ and ‘B’ divisions or would it require some other information? I understood that Currier had already noted a distinction between some parts of a given fold-out and others. Would be glad if you could clarify.

    Related to this, but from my own field – I began pointing out about … I don’t know… as much as five years ago (though I can check).. that the imagery shows diagrams which are late-phase were set down on the back of fold-outs . The matter has become obscured by a specific, faulty, method of pagination which was not only employed on the first mailing list and then some web-sites but 5v and f.86r’ as if it were partly on one bifolium and partly on another…

    But the point, really is that the division between earlier and later phases of addition to the material in the Ms is reflected in these folios too, with the earlier being e.g. the map (folio 86v as was) and the later set down on what had been the fold-out’s blank reverse (formerly described as f.86r).

    There’s no difficulty explaining why the original ‘recto’ became the ‘verso’. As Pelling pointed out so clearly in 2006, citing the precedents, the folio had been so worn that it was re-bound into the volume as we now have it.

    So this division and coming together of the earlier and later phases – possibly first noted by Currier in connection with the script – is also reflected in the imagery, as I’ve been pointing out while explaining and dating the various diagrams – the map on one side (ci devant 86v) and the various others placed across the top/back of that sheet.

    In all this, I find it especially interesting that 57v would appear to be in the ‘wrong’ group.
    As I’ve tried to make clear, now, for almost as long, returning to the point over the past four or five years – the image on that side of that folio is all ‘wrong’ for any of the earlier phases in the ms’ imagery … it doesn’t meet the habits or employ the same customs and standards as the rest, e.g. employing instruments of ruler and compass, which are both assiduously avoided through most of the ms, even when we’d expect it, as in ruling out the page before inscription, or drawing a circle. 57v is anomalous in a great many ways – so much so that apart from passing on to readers that it is evident that this is among the very last additions to these drawings and diagrams – I have refrained from positing any date for first enunciation OR for current presentation.

    But what this suggests for those working on the text’s written part is that the maker of that diagram, late though it is, still knew enough to use ONE of the two (or more) orthographic/cryptographic conventions affecting the written text.. which ought to add some ray of hope, surely.

  7. October 1, 2017 at 7:37 am

    Julian – some glitch dropped a bit of the previous. For

    “method of pagination which was not only employed on the first mailing list and then some web-sites but 5v and f.86r’ …

    read: method of pagination which was not only employed on the first mailing list and then some web-sites but later, for some bizarre reason adopted by the Beinecke library… which is why we see the description of a single side of a single sheet, formerly foliated ’86v’ now described as ’85v-and-f.86r’ – as if it were partly on one bifolium and partly on another…


  8. NICOLAS Georges
    October 21, 2017 at 3:08 am

    The Voynich codex is quite simply a book of teachings of the Hebrew TORAH.
    The plant drawings are nothing but the NUMERAL Rosetta stones of the numerical singularities and of the concepts of the root ideas of the 22 LETTERS of the HEBREW alphabet… full stop.
    Its text is in NO way an ENCODING! It is simply Hebrew in numerical form.
    Reading example: the LETTER ALEPH, page 69 of the codex…8à 2…9
    The symbol 8… must be understood as Form 8… there are 22 Hebrew letters of form 8
    That is: ALEPH, the first letter of the Hebrew alphabet, is written, to form its NAME,
    ALEPH… Lamed… Fé, i.e. 1+30+80=111
    ALEPH has 3 letters in its NAME
    Lamed is made of 3 letters
    Fé is made of 2 letters
    So the LETTER ALEPH (8)… has in its NAME… two letter names of three letters
    That is, 8à. 2…9
    9 = the letter GIMEL = the idea of 3
    Quite simple… this is teaching for children on memorising the 22 letters
    Happy reading
    NICOLAS Georges, 67 years old
    62127 FRANCE

    • JB
      October 21, 2017 at 11:21 am

      I don’t think so, Georges, but thanks for your comment.

      • Nicolas georges
        September 4, 2018 at 6:10 am

        Hello JB
        Structure of the writing… Hebrew form…
        universal comprehension
        because it is numerical in form
        study the drawings and match them to the written text
        among the Jews… 70 explanatory forms in absolute terms!!
        HEBREW TORAH, Old Testament… AND… 😊
        no help to be expected from Jewish scholars in understanding this codex…
        good luck to the Canadians and Google 😊😊😊
        Nicolas georges
        68 years old, France

  9. D.N. O'Donovan
    December 1, 2017 at 1:59 am

    Not sure who else to ask, and to a specialist it’s probably a stupid question.. but anyway

    Has anyone tried to match Voynich word-length frequencies to a range of languages? I realise that abbreviations and so on would probably wreck the stats, but I am curious.

    Suppose you take four folios’ worth of text from Currier A, and find the frequencies of four- five- six- and seven-glyph ‘words’ as a percentage or proportion of the text – would it be feasible to compare that with the patterns of other languages and dialects? Has anyone ever done that – and if so who and what came of it?


  10. JB
    December 1, 2017 at 9:37 am

    Hi Diane,

    I feel like this must have been done, as it is so easy to do! However, I don’t recall seeing anything about it. Perhaps Rene would know? If it turns out that nobody has done it, I would be happy to have a go!

  11. Indrė
    February 6, 2018 at 3:51 am

    Hi Julian,

    Great research!

    In J. Stolfi’s transcriptions summary f57r and f57v are designated as “Language: B (Currier), Hand: 2 (Currier)”.
    Why do you state it “labelled as Language A (red)”?

    • JB
      February 6, 2018 at 9:28 am

      Hi Indre, Yes, this seems to be an error on my part – my code certainly specifies f57 as Language B, so how it ended up being coloured red, I’m not sure! Will investigate.
