• Steve Katz

Humans in the loop

For a lot of folks, getting utility from their data is a technological issue. We do our analyses on computers, we store data on digital storage devices, and increasingly the computers that store our data are virtual machines out on the web that we never actually touch. In parallel with the evolution of machines, our data itself evolves from notes in a notebook, to text files stored on a disk, to a diversity of spreadsheet formats, to an ecosystem of database schemas, with an associated industry of technologists evolving alongside to support data interoperability and machine readability of data.

Data interoperability is a special interest of mine. It’s often a harder and deeper problem than it looks. On its face, it’s about how to record information in an enduring format that can be accessed later; information like “what was the temperature yesterday?” Some people might just write the number down in a notebook, while others might have remote sensors logging the temperature with an automated data acquisition and storage protocol. Different researchers might be from different countries and speak different languages. And are they all using the metric system, or recording temperature in kelvins? There are lots of opportunities for data heterogeneity problems, which in turn limit the seamless sharing of data. One might reasonably think that this is just a simple problem of metadata: just tag the data with what system you’re using. And yet we still ended up crashing the 327-million-dollar Mars Climate Orbiter because of a failure to convert between English and metric units of measure.

Yes, one might reply, but that was a technical failure, not a technological one. The space agency admitted it was a failure of their error checking and process validation protocols, rather than a data problem per se.

Indeed, a first step to solving data interoperability problems is to develop the metadata for what all our data means in a data dictionary. All the data we would record must be defined: what the data elements mean, what all the possible values for those elements might mean, and their units and bounds. If we can just specify everything that might otherwise be uncertain about a data record, surely we can guarantee that everyone will be able to use that data in the manner in which it was intended? For simple data structures this can get close, but it has limits. Data dictionaries can often solve the technological problem of accessing data, and because one can grab data from a remote archive, then check and deploy that data on a local system, they give the impression of successful data interoperability. But even in these cases meaning can be elusive, and the impression an illusion.
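To make the idea concrete, here is a minimal sketch of a data dictionary in Python. Everything in it is my own illustration: the element name air_temperature, the FieldSpec fields, the bounds and the validate function are assumptions, not part of any real standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry: what the element means, its unit, its bounds."""
    description: str
    unit: str
    minimum: float
    maximum: float


# Hypothetical dictionary with a single element; the bounds are illustrative.
DATA_DICTIONARY = {
    "air_temperature": FieldSpec(
        description="Air temperature at the station",
        unit="degC",
        minimum=-90.0,
        maximum=60.0,
    ),
}


def validate(name, value, unit):
    """Return a list of problems; an empty list means the record conforms."""
    spec = DATA_DICTIONARY.get(name)
    if spec is None:
        return [f"unknown data element: {name}"]
    problems = []
    if unit != spec.unit:
        # A unit mismatch is reported first; bounds are only meaningful
        # once the unit agrees with the dictionary.
        problems.append(f"expected unit {spec.unit}, got {unit}")
    elif not (spec.minimum <= value <= spec.maximum):
        problems.append(f"value {value} outside [{spec.minimum}, {spec.maximum}]")
    return problems
```

A record such as validate("air_temperature", 21.5, "degC") passes, and one tagged "degF" is flagged. But note what the check cannot see: whether the reading was taken in the sun or the shade, or why it was taken at all. The dictionary checks form, not meaning.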

One reason is that the data dictionary is only a representation of a vocabulary; it is not the syntax, and so not the complete language, nor is it the real thing itself. One approach to capturing more of a complete language is the use of ontologies. Ontologies are more than a data dictionary: while they identify and define data elements, they also include the properties of data elements and the conceptual relationships among them. Sometimes ontologies are as simple as the specification of the topology of a database, but more often they are generated by specifying explicit relationships between data elements with properties such as “is-a” and “has-dimensions-of”. Ontologies try to capture the internal logic of the real system the data attempt to represent, and to provide sufficient vocabulary and syntax to ensure that data retain their veracity when passed from system to system. Ontologies are seen as a mechanism to increase, and by some perhaps to guarantee, data interoperability. And, like everything else, there is a rapid evolution toward automating ontology development and reducing the human role in data interpretation, i.e. getting humans “out of the loop”.
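As a rough sketch of what those explicit relationships look like, an ontology can be written down as a list of (subject, property, object) triples. The terms below are hypothetical examples of mine, and real ontology languages such as OWL are far richer than this, but the sketch shows how “is-a” lets a machine infer things the data never states directly.

```python
# Hypothetical triples: (subject, property, object). The terms are invented
# for illustration and do not come from any published ontology.
TRIPLES = [
    ("riparian_planting", "is-a", "habitat_restoration"),
    ("habitat_restoration", "is-a", "management_action"),
    ("stream_temperature", "has-dimensions-of", "temperature"),
]


def ancestors(term, triples=TRIPLES):
    """Follow is-a links transitively and return every broader concept."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, prop, obj in triples:
            if subject == current and prop == "is-a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found
```

Here ancestors("riparian_planting") returns both habitat_restoration and management_action, even though no single triple says that a riparian planting is a management action. That inference is only as good as the relationships someone chose to write down.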

Recently some colleagues and I published a paper on comparative ontologies. We had a large data set from natural resource science in the US Pacific Northwest, and three independently derived ontologies for the same data. Independent development of the ontologies was driven by different needs within different communities looking at different aspects of the problem. One community consisted of natural resource managers who were interested in asking if management actions, like habitat restoration, were working; these people needed to know what habitat action was taken and where. Another consisted of academics interested in the impacts of management actions on the ecosystems they were affecting; these folks were interested in the ecological processes affected by the management action. The third consisted of political decision makers who wanted to know where public money was going in managing natural resources; this group wanted to know what jurisdictions were being improved with habitat projects. All three groups wanted the same data on what happened where and when, but each had a different objective, a different internal logic to their world-view, and a different ontology for their data. And when asked, all three expressed satisfaction with how their ontology performed. However, when we put the exact same data through all three ontologies, we got different and, in some cases, very different information. There were several examples, but one in particular stands out. In the decision maker ontology, water quality improvement projects were the most common and least expensive variety of habitat restoration action, but in the academic ontology, water quality improvement projects were the least common and most expensive actions taken. If one had an interest in saying one was really helping to improve water quality in a particular jurisdiction, one could in principle go shopping for the best ontology.
The vocabulary was the same, the relationships and meaning were different, and as a result the inferences one draws from the data were different.
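A toy version of that effect is easy to reproduce. In the sketch below the project records and both category mappings are invented for illustration; they are not the data or the ontologies from our paper. The point is only that the same records, passed through two different classifications, yield different counts.

```python
from collections import Counter

# Invented project records; not data from the study.
PROJECTS = [
    {"id": 1, "action": "riparian planting"},
    {"id": 2, "action": "riparian planting"},
    {"id": 3, "action": "livestock fencing"},
    {"id": 4, "action": "wastewater treatment upgrade"},
    {"id": 5, "action": "culvert replacement"},
]

# Two hypothetical ontologies that share a vocabulary of actions but
# disagree on which broader category each action belongs to.
ONTOLOGY_A = {
    "riparian planting": "water quality improvement",
    "livestock fencing": "water quality improvement",
    "wastewater treatment upgrade": "water quality improvement",
    "culvert replacement": "fish passage",
}
ONTOLOGY_B = {
    "riparian planting": "riparian habitat",
    "livestock fencing": "riparian habitat",
    "wastewater treatment upgrade": "water quality improvement",
    "culvert replacement": "fish passage",
}


def summarize(projects, ontology):
    """Count projects per category under the given ontology."""
    return Counter(ontology[p["action"]] for p in projects)
```

Under ONTOLOGY_A, four of the five projects count as water quality improvement; under ONTOLOGY_B, only one does. Neither mapping is wrong on its own terms, which is exactly the problem.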

Which one was best? It turns out there is no universal answer. The signal-to-noise ratios were different when using different ontologies, and that is possibly one answer. But each of the communities in which the ontologies were developed reported that they were OK with the results they got. So, what is the basis for deciding? It turns out that perception, perspective and culture play roles in how we relate to the data, and this is going to be a continuing issue standing in the way of the automation of data interoperability. Spoiler: humans will always be in the loop, and there are good reasons for this.

That’s enough for now, but next time I will walk through a little experiment in automated data translation that is a fun demonstration of this point (I think it was fun, anyway).