Playbook for the Cult of Isis, herbal health instructions, details of the benefits of therapeutic bathing, or a written history of speaking in tongues. Confused? Probably, but not as much as most are when they come face to face with the Voynich Manuscript.
That each of the above is a proposed theme for its indecipherable scribbles indicates the level of confusion. A brief meditation on the sentence “a written history of speaking in tongues” should also help you get in the confusion ballpark.
Written in the early parts of the 15th century, the manuscript, a 240-page compendium of seemingly illegible and likely codified text, has amassed a proud track record of confounding scholars and eminent code breakers, including Alan Turing, alike.
To this day, the Voynich Manuscript remains shrouded in mystery and controversy. The latter has risen in recent years with several researchers, some wielding the power of AI, claiming to have finally unlocked its cryptic contents.
The debate, which is by no means over, illuminates how some parts of human communication still seem beyond the understanding of AI systems.
Ending a 500-Year Mystery?
In the spring of 2019, the Voynich Manuscript was a mystery no more. Its pages were full of information about herbal remedies, therapeutic bathing, and astrological readings. The manuscript was written in a “Proto-Romance language” by a Dominican nun as a personal reference book for Maria of Castile, Queen of Aragon in 15th-century Spain. In broad strokes, those were the claims put forward in a research paper by Dr. Gerard Cheshire from Bristol University in the UK.
Voynich experts were quick to point to cut corners, as well as leaps of logic and faith, on Cheshire’s part, and Bristol University quickly distanced itself from the research.
One of the critics was Greg Kondrak, who a few years before had turned to algorithms and AI in his aspirations to solve the manuscript. Kondrak’s starting point was that languages tend to have certain identifiable qualities, like how often certain letter combinations appear. Algorithms can be employed to pick out such metrics and create ‘fingerprints’ for different languages. Turning the algorithms loose on the Voynich Manuscript revealed that it had been written in Hebrew and then encoded into its current form.
However, there turned out to be several unsolved issues with the approach that left experts unconvinced. One, from a long list, was that the algorithms had been trained on the modern-day version of languages, which vary substantially from what they would have looked like in the 15th-century. A reading of the original version of The Canterbury Tales illustrates what can happen to a language in that kind of time span.
A Reality Check on AI and Language
For now, the Voynich Manuscript’s contents seem set to be beyond making sense. Perhaps its codification is exceedingly clever or it’s written in an incredibly complex unknown language. It could also simply be that it’s gibberish from pillar to post.
On its own, AI’s seeming failure to crack the manuscript may not mean much. However, it can be a chance to go ‘are we there yet?’ with regard to AI systems decoding and understanding human language.
One part of the problem for AI is the translation between different codified sets of structures and associated content. Yes, I made that up as a fancy way of saying languages. I’m part Danish, so I know the meaning of a sentence like ‘de der er dyre dyr’ and that æ, ø, and å are real letters. The latter two are also words, meaning ‘island’ and ‘stream’ respectively.
As for the sentence, Google Translate, which has been thrown to the wolves like this before, would say it means ‘those who are expensive animals.’ That is an accurate translation but not what the sentence means. It means ‘those are expensive animals.’ Pretty good going really, seeing as ‘dyr’ can mean both ‘animal’ and ‘expensive.’
Apart from letting me connect with my Viking side, the above is hopefully an illustration of how languages are confusing things that require both codification and de-codification. The same applies to translations between computer languages (code) and human languages (‘fuzzy’ code).
Even when people speak the same language things like cultural references, irony, tone of voice, and mimicry play pivotal roles in making sense of often only half-formed sentences. All of which contributes to language recognition, and in particular language understanding, likely being amongst AI’s final frontiers. Especially if the AI is supposed to get that that’s a Star Trek reference—not to mention understanding that it’s a poor one.
AI as a Key to Languages
These issues are far from the same as AI not being of massive assistance when it comes to both living and dead languages.
For example, researchers from MIT have trained an AI to match words from unknown languages with related words in others that share the same root. They hope to use the AI to crack some of the still undeciphered ancient languages.
Companies, NGOs, and individuals are also turning to AI-powered systems for help in keeping minority languages alive. New Zealander Jason Lovell helped develop a chatbot that works in both Māori and English. In neighboring Australia, children living in remote communities have come face-to-face with Opie, a low-cost robot that helps teach indigenous language skills.
Some have suggested that the end goal could involve creating a ‘Global Language Archive’ where an ‘AI language recreation engine’ will play the roles of head librarian and researcher.
As highlighted in the yearly State of AI Report, AI systems are progressing by leaps and bounds when it comes to truly understanding languages, often thanks to advances in natural language processing (NLP).
So, to answer the ‘are we there yet?’ from earlier: No, but doesn’t mean we’re not moving forward at warp speed.