Edit Distance for Word Positions

October 17, 2016

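The edit distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one word into the other. For example, the edit distance between “banana” and “bahama” is 2: substitute “h” for the first “n” and “m” for the second.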
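As an illustration of the definition above, here is a minimal sketch of the Levenshtein distance in Python. The function name `levenshtein` is my own choice for this example; it is a standard dynamic-programming formulation, not necessarily the code used to produce the plots.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("banana", "bahama"))  # 2
```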

I looked at the average edit distance (the Levenshtein measure) between words on each line of each folio in the Herbal A and Herbal B sections. Here are the results:

[Figure: herbala_editdistance (Herbal A)]

[Figure: herbalb_editdistance (Herbal B)]

How to interpret these plots

There is one square per word and line position: the top left square corresponds to the edit distance between word 1 and word 2 of the first line, averaged over all the folios. The next square in that row corresponds to the average edit distance between word 2 and word 3 on the folios.

Each square in the plot has a shade of gray: the darker the shade, the bigger the average edit distance.
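The grid of averages described above could be computed roughly as follows. This is a sketch under my own assumptions: each folio is represented as a list of lines, each line as a list of words, and rows of the grid are line positions while columns are adjacent word pairs. The function name `position_means` and the sample words are invented for the example and are not taken from any transcription.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Standard Levenshtein distance (unit cost per edit)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def position_means(folios):
    """Map (line index, word index) to the mean distance between the word
    at that index and the next word on the same line, over all folios."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for folio in folios:
        for line_idx, words in enumerate(folio):
            for word_idx in range(len(words) - 1):
                key = (line_idx, word_idx)
                totals[key] += levenshtein(words[word_idx], words[word_idx + 1])
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Invented toy data: two "folios" of two lines each.
toy_folios = [
    [["abc", "abd", "xyz"], ["aa", "ab"]],
    [["abc", "abc"], ["aa", "bb", "bb"]],
]
for key, mean in sorted(position_means(toy_folios).items()):
    print(key, mean)
```

Each value in the returned dictionary corresponds to one square; mapping larger values to darker grays gives the kind of plot shown above.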

One conclusion is that for both sets of folios, there is a big edit distance between the first and second words on the folios: the words are very dissimilar.

Another conclusion is that similar words (lighter shade of gray) tend not to occur in the first line, or as the first words.

 

  1. October 18, 2016 at 1:55 am

    I think you can (sort of) see the Neal key pattern emerging from your stats – these are very often about 2/3rds of the way along the top line of a page, sometimes the top line of a paragraph as well. (Dark blocks 5/6/7 on the top line of your Herbal B plot).

    • JB
      October 18, 2016 at 9:01 am

      I need to try something like the square of the edit distance to help accentuate such features where they exist. Is Philip Neal still following Voynich stuff – haven’t seen anything from him for a while?

  2. October 18, 2016 at 2:58 am

    Interesting! I wonder how this compares to various real world languages.

    • JB
      October 18, 2016 at 8:58 am

      The expectation is a pretty uniform edit distance across a page, I suppose.

      • October 18, 2016 at 11:39 am

        I also wonder how this would look if you filter out noise. For example, shuffle the words so they appear in random order (but perhaps preserve the # of words on each line of each page). Then calculate the average edit distances again, this time computing the standard deviation of all the means of the repeated shuffles. Then when you compute the means for the actual manuscripts, you can see how many standard deviations they are from the mean of shuffles. Instead of plotting the means, use the number of standard deviations instead. This might give you a more normalized look at which measurements are truly anomalous.
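The shuffle test suggested in this comment could be sketched as follows. Everything here is an assumption made for illustration: the data layout (a folio is a list of lines, a line is a list of words), the function names, the number of shuffles, and the choice to shuffle words within each folio while keeping the number of words per line fixed.

```python
import random
import statistics
from collections import defaultdict


def levenshtein(a, b):
    """Standard Levenshtein distance (unit cost per edit)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def position_means(folios):
    """Mean adjacent-word distance per (line index, word index)."""
    values = defaultdict(list)
    for folio in folios:
        for line_idx, words in enumerate(folio):
            for word_idx in range(len(words) - 1):
                values[(line_idx, word_idx)].append(
                    levenshtein(words[word_idx], words[word_idx + 1]))
    return {key: statistics.mean(v) for key, v in values.items()}


def shuffle_folio(folio, rng):
    """Randomly reorder the words of one folio, keeping each line's length."""
    words = [w for line in folio for w in line]
    rng.shuffle(words)
    it = iter(words)
    return [[next(it) for _ in line] for line in folio]


def z_scores(folios, n_shuffles=200, seed=0):
    """For each position, how many standard deviations the observed mean
    lies from the mean over shuffled copies of the same folios."""
    rng = random.Random(seed)
    observed = position_means(folios)
    samples = defaultdict(list)
    for _ in range(n_shuffles):
        shuffled = [shuffle_folio(f, rng) for f in folios]
        for key, value in position_means(shuffled).items():
            samples[key].append(value)
    result = {}
    for key, obs in observed.items():
        mu = statistics.mean(samples[key])
        sigma = statistics.pstdev(samples[key])
        # A zero spread means the shuffles carry no information here.
        result[key] = float((obs - mu) / sigma) if sigma > 0 else 0.0
    return result
```

Plotting the values returned by `z_scores` instead of the raw means would highlight positions that differ from what random word order produces.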

      • JB
        October 18, 2016 at 1:35 pm

        Excellent suggestions! I’ll try to take a shot at that this evening.

  3. October 18, 2016 at 8:35 am

    It’s good to see something new.
    -The dark first lines are puzzling.
    -The greater edit distance between first and second words is, at least partially, caused by the average length difference due to wordwrap and to altered first letters of first words.
    -In texts from known languages, I expect to see a fairly uniform and darker distribution.
    -Another text to compare could be artificial with random letters (and a skewed token length distribution to match both the VMs and known texts).
    -Assigning different weights to insertions, deletions and substitutions might help distinguish between different known languages.
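The last suggestion in the list above, different weights for insertions, deletions and substitutions, could look something like this sketch. The function name, parameter names and default weights are hypothetical, chosen only for illustration.

```python
def weighted_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Edit distance in which each insertion costs `ins`, each deletion
    costs `dele` and each substitution costs `sub` (all hypothetical
    parameters; with the defaults this is the plain Levenshtein distance)."""
    previous = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * dele]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + dele,                            # delete ca
                current[j - 1] + ins,                          # insert cb
                previous[j - 1] + (0.0 if ca == cb else sub),  # substitute
            ))
        previous = current
    return previous[-1]


# Cheaper substitutions pull words of equal length closer together.
print(weighted_edit_distance("banana", "bahama", sub=0.5))
```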

    • JB
      October 18, 2016 at 8:56 am

      Knox – thanks for the comment! Are you on Voynich Ninja? Plenty of new stuff there, daily.

      I tried a variety of colour maps in an attempt to pull out more detail, but settled on grey. I need to run a known language text through and see how that looks, as you suggest.

      • October 19, 2016 at 10:38 am

        I am now registered on Voynich Ninja. It’s well organized.
