CLAVAS - OpenSKOS CLARIN Vocabulary Service

Documentation

Summary

For efficient production and curation of high-quality metadata in CLARIN, it is important to have access to vocabularies from a range of software tools. This document describes a solution based on OpenSKOS (formerly called Vocabulary and Alignment Service, VAS), a repository service platform developed by the CATCHPlus project. This platform offers uniform and standardised ways to publish and retrieve vocabulary data in forms that suit many usage scenarios, e.g. metadata creation (pick lists with autocompletion) or metadata curation (searching for the preferred label that belongs to a given alternative label). The OpenSKOS service is based on the W3C SKOS recommendation.

The CLARIN-NL CLAVAS project extends CATCHPlus' OpenSKOS with CLARIN-specific components. First, it publishes three vocabularies required by the CLARIN community and offers tools to keep those vocabularies synchronised with their sources. Second, it offers an interactive web application (the OpenSKOS Editor) that can be used to curate existing vocabularies directly in OpenSKOS or to create new ones from scratch.

CLAVAS also meets the general CLARIN requirements, such as authentication using federated identities and the use of persistent identifiers or stable, resolvable HTTP URIs for vocabularies and their elements. CLAVAS components are or will be tested in realistic usage scenarios concerning metadata editing, metadata curation and vocabulary maintenance.

This OpenSKOS instance currently publishes SKOS versions of three vocabularies:

1. ISO-639-3 language codes, as published by the registration authority SIL.

2. Closed and simple Data Categories from the ISOcat metadata profile.

3. A manually constructed and curated list of Organisations, based on the CLARIN VLO.

Project results

Because of time restrictions, external interdependencies (CATCHPlus and OpenSKOS development schedules) and the outcomes of pilot experiments, the original CLAVAS proposal was amended twice. The final set of deliverables is listed below, organised into data deliverables, software deliverables and reports. Differences between the original and final lists of deliverables were discussed with and approved by the CLARIN-NL Executive Board.

The initial CLAVAS project document can be found here

ISO-639-3

ISO-639-3 language codes can be downloaded/harvested by the CLAVAS harvesting web application. The download location of the source files can be modified if it changes in the future. The source files are parsed and converted to a SKOS RDF/XML file that is published by the CLAVAS OpenSKOS instance. The core of the conversion module is a Perl script that can also be used on its own. The script is available as part of the CLAVAS open source distribution on GitHub.
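As an illustration of the kind of conversion described above, the following minimal Python sketch turns a tab-delimited code table into SKOS RDF/XML. It is not the CLAVAS Perl script. The base URI, the function name `iso639_to_skos` and the choice of SKOS properties are assumptions made for this example, and the two sample rows are only modelled on the layout of the SIL code table (the column names `Id` and `Ref_Name` are assumed).

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML = "http://www.w3.org/XML/1998/namespace"
# Hypothetical base URI, chosen for illustration only.
BASE = "http://example.org/iso639-3/"

# Sample rows modelled on the tab-delimited SIL code table (assumed layout).
SAMPLE = (
    "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
    "nld\tdut\tnld\tnl\tI\tL\tDutch\t\n"
    "fry\tfry\tfry\tfy\tI\tL\tWestern Frisian\t\n"
)


def iso639_to_skos(tsv_text):
    """Return SKOS RDF/XML with one skos:Concept per language code."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("skos", SKOS)
    root = ET.Element(f"{{{RDF}}}RDF")
    # One ConceptScheme that all language concepts belong to.
    ET.SubElement(root, f"{{{SKOS}}}ConceptScheme", {f"{{{RDF}}}about": BASE})
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        concept = ET.SubElement(
            root, f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": BASE + row["Id"]}
        )
        ET.SubElement(concept, f"{{{SKOS}}}inScheme", {f"{{{RDF}}}resource": BASE})
        # The three-letter code becomes the notation, the reference name the label.
        ET.SubElement(concept, f"{{{SKOS}}}notation").text = row["Id"]
        label = ET.SubElement(concept, f"{{{SKOS}}}prefLabel", {f"{{{XML}}}lang": "en"})
        label.text = row["Ref_Name"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(iso639_to_skos(SAMPLE))
```

Running the script prints an RDF/XML document with one concept scheme and one concept per input row. A real conversion would also have to decide how to represent the remaining columns.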

ISOcat

ISO-DCR closed and simple data categories can be harvested directly from ISOcat (thanks to a contribution by Menzo Windhouwer, MPI/TLA). The CLAVAS harvesting web application does some necessary post-processing on the ISOcat SKOS RDF/XML data (a combination of a Perl script and Java code, available on GitHub). The resulting SKOS is uploaded to and made available from the CLAVAS OpenSKOS site. The next chapter briefly discusses possible 'divisions of labour' between ISOcat and OpenSKOS.

Organisation Names

Our original proposal was to start with a short feasibility study to find out whether existing software could be adapted to automatically extract organisation names from a dump of metadata fields from the Virtual Language Observatory (VLO). Because this software turned out to be no longer available and no alternative was found, it was decided to start a manual curation project. This project was performed by the Data Curation Center, in close collaboration with CLAVAS. The necessary budget was transferred from CLAVAS to the DCS.

The process was as follows. Two manual passes bundled spelling variations of organisation names and identified the preferred spelling. A tabular text format was used for this. Remarks and editorial notes were also recorded. The intermediate results were automatically processed and converted to SKOS (as part of the CLAVAS project). The conversion process also extracted hierarchical structure where possible. Conversion errors were kept separate in the form of a set of SKOS Concepts that needed further curation. The converted result was evaluated using the search and browse facilities of the OpenSKOS Editor. In a final manual pass the OpenSKOS Editor was used to fix these Concepts. In the process, the OpenSKOS Editor was also evaluated with respect to its usability for CLARIN vocabulary curation tasks. Like the other two vocabularies, the Organisation Names vocabulary was published on the CLAVAS OpenSKOS site.
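The bundling step of this process can be sketched in Python as follows. This is a minimal illustration, not the CLAVAS converter. The three-column input layout (preferred spelling, variant, editorial note), the function name `bundle_variants` and the sample rows are all assumptions, since the actual tabular format is not specified here. Rows that cannot be converted are collected separately, mirroring the way conversion errors were set aside for further curation.

```python
# Hypothetical input layout, one spelling variant per line:
# "<preferred spelling>\t<variant found in metadata>\t<editorial note>"
SAMPLE = (
    "Max Planck Institute for Psycholinguistics\tMPI for Psycholinguistics\t\n"
    "Max Planck Institute for Psycholinguistics\t"
    "Max-Planck-Institut für Psycholinguistik\tGerman name\n"
    "\tMPI\tambiguous, needs curation\n"
    "Meertens Institute\tMeertens Instituut\t\n"
)


def bundle_variants(text):
    """Group variants under their preferred spelling.

    Returns (concepts, errors): concepts maps each preferred label to a dict
    with SKOS-like fields; errors lists (line number, raw line) for rows that
    could not be converted.
    """
    concepts = {}
    errors = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split("\t")
        # A usable row has exactly three columns and a preferred spelling.
        if len(fields) != 3 or not fields[0].strip():
            errors.append((line_no, line))
            continue
        preferred, variant, note = (field.strip() for field in fields)
        concept = concepts.setdefault(
            preferred,
            {"prefLabel": preferred, "altLabels": [], "editorialNotes": []},
        )
        # Record each distinct variant once; skip variants equal to the preferred form.
        if variant and variant != preferred and variant not in concept["altLabels"]:
            concept["altLabels"].append(variant)
        if note:
            concept["editorialNotes"].append(note)
    return concepts, errors


if __name__ == "__main__":
    concepts, errors = bundle_variants(SAMPLE)
    for concept in concepts.values():
        print(concept)
    print("needs curation:", errors)
```

Each resulting dictionary corresponds to one SKOS Concept with a preferred label, alternative labels and editorial notes. Extraction of hierarchical structure is left out of this sketch.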

Updating vocabularies

A simple web application was built and integrated into the CLAVAS OpenSKOS site. This web application has three tabs, one for each of the CLAVAS vocabularies. The two that can be periodically updated have a facility to modify download paths for the source information and a download button. The RDF file that is created and downloaded is suitable for upload to an OpenSKOS instance. Sources for the web application as well as for all converters are available from the CLAVAS GitHub repository.

OpenSKOS Editor

The original CLAVAS proposal contained a 'simple vocabulary curation' tool as one of its deliverables. 'Simple' meant: with no more functionality than needed by the curators of simple, unstructured vocabularies in CLARIN. During the execution of CLAVAS, the Netherlands Institute for Sound and Vision (NISV) decided to base maintenance and publication of their thesaurus on OpenSKOS. They designed a full-blown thesaurus editing environment, including support for simple workflows. This Editor was subsequently built by Picturae as an extension of the existing CATCHPlus OpenSKOS software and was also made available under an open source license via the CATCHPlus OpenSKOS GitHub repository. Hennie Brugman participated in the design and development process as an advisor on several occasions, on behalf of CATCHPlus and CLAVAS. It is clear that the current OpenSKOS Editor contains all functionality required for CLAVAS (and a lot more). However, it was designed for documentalists at NISV and not for curators in the CLARIN context. Therefore the Editor was evaluated by the CLARIN Data Curation Center in the context of the construction of the Organisation Names vocabulary.

The OpenSKOS platform

Some of the CLAVAS deliverables were directly built into the OpenSKOS software: support for login via federated identities using Shibboleth, and fuzzy search on concept labels. The OpenSKOS platform itself also had interesting developments that are presented in the next chapter.
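To illustrate what fuzzy search on concept labels means in practice, the sketch below finds the labels that are lexically closest to a possibly misspelled query. It uses Python's standard `difflib` module as a stand-in. It does not show how OpenSKOS implements fuzzy search, and the function name `closest_labels`, the sample labels and the cutoff value are assumptions made for this example.

```python
import difflib

# Illustrative labels only; a real service would search the labels of all concepts.
LABELS = ["Dutch", "German", "Western Frisian", "Danish"]


def closest_labels(query, labels, n=3, cutoff=0.6):
    """Return up to n labels that are lexically closest to query, best first."""
    # Compare case-insensitively, but return the labels in their original spelling.
    by_folded = {}
    for label in labels:
        by_folded.setdefault(label.casefold(), label)
    hits = difflib.get_close_matches(
        query.casefold(), list(by_folded), n=n, cutoff=cutoff
    )
    return [by_folded[hit] for hit in hits]


if __name__ == "__main__":
    print(closest_labels("Duch", LABELS))
```

The `cutoff` parameter controls how similar a label must be to count as a match. A lower value returns more, but less reliable, candidates.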

Discussion and conclusions

Applications of the OpenSKOS API

An important use case and test case in the project proposal was the integration of OpenSKOS API usage in the Arbil metadata editor. This integration was discussed on several occasions and has been technically possible for a substantial period. However, due to time limitations at MPI/TLA it was not possible to do this test within the time frame of CLAVAS. Therefore, we will do this test at a later stage. The usage scenario itself is daily practice in another collection description and management system (Memorix, by Picturae), so it has been proven to work and to be efficient.

To what extent did we meet our initial success criteria?

In the project document we formulated a number of success criteria for each of the intended user groups. Many criteria depend on application of CLAVAS. Therefore, for some criteria it is too early to make definite statements.

ISOcat and OpenSKOS

Although it turned out to be possible to provide OpenSKOS access to ISO-DCR data categories, this may not be the most useful division of labour between the two services. An alternative is that ISOcat functions as a client of OpenSKOS: for open vocabularies it could refer to ConceptSchemes in OpenSKOS. Tools that need term lists (e.g. for autocompletion) can then be redirected to OpenSKOS and its API.

Emerging OpenSKOS community

During the CLAVAS project OpenSKOS itself attracted substantial attention. At the time of writing, approximately 10 instances of OpenSKOS are installed, either in a production setting (NISV, CLAVAS, some customers of Picturae), in an experimental setting (ICLTT in Vienna, Cologne Center for eHumanities) or for testing purposes (Europeana). There were several presentations with substantial positive feedback, both in a national context (mainly in the Dutch Cultural Heritage domain) and internationally (presentations in Paris and at LREC 2012 in Istanbul). OpenSKOS was the main topic of a 'break out session' at the last DARIAH VCC meeting in Copenhagen. A consequence of this increasing interest could be that a number of interesting vocabularies from the HLT, eHumanities and Cultural Heritage domains become available to the CLARIN community as well (and vice versa).

Recommendations and next steps

OpenSKOS started off as an open source project and has the potential to be successful. It is actually used in production environments and there are many interested people, organisations and projects. Most outstanding tasks are small and relatively straightforward. There are many good developers around who could contribute. There are organisations willing to host an instance for some time. Our hope is that modest contributions from current and future projects will bring OpenSKOS further as an open source community product.

Deliverables

Deliverable | Description | Type
D1.1 | Base data for organisation names | data
D1.2 | Organisation names feasibility study | report
D1.3 | Name extraction and normalization tool | software
D1.4 | Organisation names vocabulary data set | data
D2.1 | Harvester for ISOcat REST service | software
D2.2 | ISO-DCR SKOS conversion specification | report
D2.3 | ISO-DCR SKOS conversion module | software
D3.1 | Harvester for ISO-639-3 | software
D3.2 | Language codes SKOS module | software
D3.3 | (optional) lexvo.org mix-in module | software
D4.1 | Module for harvesting control, configuration and monitoring | software
D4.2 | Wireframe document for harvesting GUI | report
D4.3 | Harvesting GUI | software
D5.1 | Wireframe document for Vocabulary curation GUI | report
D5.2 | Vocabulary curation GUI | software
D5.3 | (optional) Minimal Vocabulary curation GUI | software
D6.1 | VAS API, authentication extension (Shibboleth) | software
D6.2 | VAS API, find lexically closest term | software
D6.3 | Persistent URL/identifier strategy | report
D7.1 | Source code online on GitHub | software
D8.1 | Hosting and exploitation plan | report
D8.2 | Operational version of CLAVAS OpenSKOS instance | service
D8.3 | Operational version of CLAVAS | service
D9.1 | User documentation (online and/or built-in) | report
D9.2 | Technical documentation | report
D9.3 | Final report | report