• Steve Katz

An experiment in why it's hard to get humans out of the loop of data interoperability

Last time, I started talking about data interoperability, and I focused on the challenges that operate on the level of semantics: what information the data represent. Interoperability is the ability of data from one system to operate seamlessly when placed in another. We confront this issue every time we combine data from different sources (aka data federation), or, for example, when we exchange data across interacting computer systems through an application programming interface (aka API). If we are all using the same conceptual model of the data, down to the most granular or atomic level, the thinking goes, then combining data should be a simple technical problem. Indeed, much energy is currently going into innovations to make data discovery and data interoperability entirely automated.

My focus last time was on data dictionaries and ontologies. Data dictionaries attempt to define the meaning of the elements in a database, and ontologies extend data dictionaries to include some of the relationships among the data elements and try to express an internal logic of the real-world system that the data are trying to represent. These tools, data dictionaries and ontologies, are familiar to a lot of data scientists, and they are often viewed as the principal tools at hand to facilitate and ultimately automate data interoperability.

What we found out, however, was that even working at this most granular level, rigorously specifying the definitions of the data, was not sufficient to uniquely determine the information that the data are trying to capture.

This time, I want to do a little experiment to extend the point and start to build a case for why it's going to be very hard to get humans out of the business of getting data to talk to each other. The point of this experiment is that we can specify the definition of the data elements ad nauseam, and we can specify all kinds of rules for syntax and idiomatic aspects of speech, but we still fail to capture the concept behind the words.

The experiment is to get Google Translate to play a game of telephone with itself. I opened Google Translate in ten browser windows and daisy-chained them, so that I could put a piece of sample text in English in the first input window, and Google Translate then translated that text sample ten times, from one language to another, and finally back into English (i.e. an 11th step). The words are the data. Each data element has some meaning in the original language, and its placement within a sentence and paragraph creates a context upon which an ontology can act to specify the meaning. Each step is a test of data interoperability; if the data translation is 100% effective, then any text put in at the beginning will have the same meaning all the way through the daisy chain. When we look at the final output text, we are able to see the impact of all the accumulated problems along the way, i.e. the degree to which the system is less than 100% effective.
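To make the structure of the experiment concrete, here is a minimal toy sketch of a daisy chain of crosswalks. It is not how Google Translate works; the three vocabularies and every mapping in them are made up for illustration. Each dictionary stands in for one translation step, and each individual mapping looks reasonable on its own.

```python
from functools import reduce

# Hypothetical crosswalks between three made-up vocabularies (A -> B -> C -> A).
# Each dict plays the role of one translation step in the daisy chain.
A_TO_B = {"pebble": "stone", "boulder": "stone", "river": "stream", "dry": "arid"}
B_TO_C = {"stone": "rock", "stream": "water", "arid": "sunny"}
C_TO_A = {"rock": "boulder", "water": "river", "sunny": "sunset"}


def translate(tokens, crosswalk):
    """Map each token through one crosswalk; unknown tokens pass through unchanged."""
    return [crosswalk.get(token, token) for token in tokens]


def daisy_chain(tokens, crosswalks):
    """Apply every crosswalk in order, feeding each output into the next step."""
    return reduce(translate, crosswalks, tokens)


def survival_rate(original, round_tripped):
    """Fraction of positions whose token came back unchanged."""
    same = sum(a == b for a, b in zip(original, round_tripped))
    return same / len(original)


source = ["pebble", "boulder", "river", "dry"]
result = daisy_chain(source, [A_TO_B, B_TO_C, C_TO_A])

print(result)                         # ['boulder', 'boulder', 'river', 'sunset']
print(survival_rate(source, result))  # 0.5
```

No single step in this toy chain is obviously wrong, yet "pebble" and "boulder" collapse into one term and "dry" drifts into "sunset". The loss only becomes visible when the end of the chain is compared with the beginning.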

Google doesn’t share information on all aspects of the machinery behind Google Translate, but it undoubtedly has a data dictionary that allows a cross-walk of words from one language to another, and it uses an AI program to help recognize syntactical and idiomatic elements in the input text so that it can be effectively conveyed in the output language.

The starting input text I chose was the first couple of sentences of Ernest Hemingway’s A Farewell to Arms. The book is well known, and the opening is often cited as a classic piece of writing. The first couple of sentences are:

In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains. In the bed of the river there were pebbles and boulders, dry and white in the sun, and the water was clear and swiftly moving and blue in the channels.

I love this for many reasons. Among them is how this is “a classic” and no elementary school teacher of mine would have ever tolerated for a second that many prepositions in a single sentence from me; “in a house”, “in a village”, “across the river”, “to the mountains”. The subversiveness of it is so satisfying.

OK, back to the experiment. I took that text and ran it through the daisy-chain of Google Translate. The sequence of languages was:

English→ Basque→ Japanese→ Russian→ Bengali→ Vietnamese→ Norwegian→ Somali→ Hebrew→ Esperanto→ Zulu→ English

What came out at the end of the daisy chain was:

At the end of the summer of that year we lived in a beautiful house with a view of the river and nature. Along the river, there are rocky outcrops, sunsets, and clear, clear waters.

Some of the basic features of the input text are still very recognizable in the output text. It’s still two sentences. The first is about where the house was and its view. The second is about the river. Given the challenge of translation through several vastly different languages, the two text blocks are actually pretty close. If you were traveling in a foreign country where you could not read the language, and needed to find a laundry to clean your clothes, this level of accuracy might be good enough.

For full disclosure, I did this same experiment about ten years ago with a different text block and it came out much worse. So, I was surprised that it came out this well. It is worth acknowledging that a lot of development has occurred in the last ten years, and the translation tools should have been getting better, and clearly have.

However, a fair bit of meaning clearly didn’t survive to the output text block. For example, all those prepositions in the original first sentence, strung together, almost convey the rolling expanse of the plain and mountains. The translation lacks a similar cadence. In the original second sentence, an aesthetic property of the river bed substrate is conveyed by “dry and white in the sun”. In the translation, it is converted to a new property that has nothing to do with the river: “sunsets”. Particularly in the second sentence, there is a lot of meaning in the original that did not survive to the final translation. If I were hired to develop a complex data system for a client, having the data system silently convert properties of their product to properties of the marketplace would not be an acceptable level of performance.
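One crude way to surface this kind of silent drift is to compare the vocabulary of the input and output texts. The sketch below is only an illustration using the two text blocks above; the helper name `words` is my own, and word overlap is a blunt proxy that says nothing about cadence or connotation. At best it flags candidates for a human to review.

```python
import re

ORIGINAL = (
    "In the late summer of that year we lived in a house in a village that "
    "looked across the river and the plain to the mountains. In the bed of "
    "the river there were pebbles and boulders, dry and white in the sun, "
    "and the water was clear and swiftly moving and blue in the channels."
)

ROUND_TRIP = (
    "At the end of the summer of that year we lived in a beautiful house "
    "with a view of the river and nature. Along the river, there are rocky "
    "outcrops, sunsets, and clear, clear waters."
)


def words(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Words present in the original but absent after the round trip, and vice versa.
lost = words(ORIGINAL) - words(ROUND_TRIP)
gained = words(ROUND_TRIP) - words(ORIGINAL)

print("lost:  ", sorted(lost))
print("gained:", sorted(gained))
```

Here "pebbles", "boulders", "dry" and "white" show up among the lost words, and "sunsets" among the gained ones. The check can report that something changed, but deciding whether the change matters still takes someone who knows what the text was supposed to mean.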

I should make clear that I am not meaning to pick on Google Translate. It’s actually very clever, useful, and frankly it works pretty well. I would not be able to do this experiment as easily as I did were it not for how useful Google Translate is. It is also important to point out that people doing the job of translating languages vary in their effectiveness – some translators are better than others, and in many cases it’s not because they “got it wrong”. So I am not saying that automated systems are uniquely bad at this. Rather, my point is that using translation as a model for moving data around helps illustrate some of the limitations to the automation of the process of making data interoperable. Humans with expertise, especially where the data applies to technical domains, continue to play a critical role in effective data translation and interoperability.

“Hold on,” I hear you say, “you’ve already admitted that things have gotten better over time. Surely, it is just a matter of time before ever more clever software and AIs close the gap and make humans unnecessary.” Things may get closer, but there are good reasons that the gap won’t close entirely. Next time, I will go into a little more detail on the indeterminacy of translation and why technological development won’t make humans entirely dispensable.