CLAVAS - OpenSKOS CLARIN Vocabulary Service
For efficient production and curation of high-quality metadata in CLARIN it is important to have access to vocabularies from a range of software tools. This document describes a solution based on OpenSKOS (formerly called Vocabulary and Alignment Service, VAS), a repository service platform developed by the CATCHPlus project. This platform offers uniform and standardised ways to publish and retrieve vocabulary data in forms that suit many usage scenarios, e.g. during metadata creation (pick lists with autocompletion) or during metadata curation (searching for a preferred label, given some alternative label). The OpenSKOS service is based on the W3C SKOS recommendation.
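The curation scenario mentioned above (finding the preferred label for a known alternative label) can be illustrated with a minimal sketch. It assumes a tiny in-memory model of SKOS concepts; the `Concept` class, the `build_label_index` and `find_preferred` helpers and the example URIs are illustrative names, not part of OpenSKOS.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Minimal stand-in for a skos:Concept (illustrative, not the OpenSKOS model)."""
    uri: str
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)


def build_label_index(concepts):
    """Map every case-folded label (preferred or alternative) to its concept."""
    index = {}
    for concept in concepts:
        for label in [concept.pref_label, *concept.alt_labels]:
            # The first concept that claims a label keeps it.
            index.setdefault(label.casefold(), concept)
    return index


def find_preferred(index, label):
    """Return the preferred label for any known label, or None if unknown."""
    concept = index.get(label.casefold())
    return concept.pref_label if concept else None


# Hypothetical example data; the URIs are placeholders.
concepts = [
    Concept("http://example.org/lang/nld", "Dutch", ["Nederlands"]),
    Concept("http://example.org/lang/deu", "German", ["Deutsch"]),
]
index = build_label_index(concepts)
print(find_preferred(index, "nederlands"))  # Dutch
```

The lookup is case-insensitive by design, since curators typically meet alternative labels in inconsistent capitalisation.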
The CLARIN-NL CLAVAS project extends CATCHPlus' OpenSKOS with CLARIN-specific components. First, it publishes three vocabularies required by the CLARIN community and offers tools to keep those vocabularies synchronised with their sources. Second, it offers an interactive web application (the OpenSKOS Editor) that can be used to curate existing vocabularies directly in OpenSKOS or to create new ones from scratch.
CLAVAS also meets the general CLARIN requirements, such as authentication using federated identities and the use of persistent identifiers or stable, resolvable HTTP URIs for vocabularies and their elements. CLAVAS components are or will be tested in realistic usage scenarios concerning metadata editing, metadata curation and vocabulary maintenance.
This OpenSKOS instance currently publishes SKOS versions of three vocabularies:
1. ISO-639-3 language codes, as published by the registration authority SIL.
2. Closed and simple Data Categories from the ISOcat metadata profile.
3. A manually constructed and curated list of Organisations, based on the CLARIN VLO.
Because of time restrictions, external interdependencies (CATCHPlus and OpenSKOS development schedules) and the outcomes of pilot experiments, the original CLAVAS proposal was amended twice. The final set of deliverables is listed below, organised by data deliverables, software deliverables and reports. Differences between the original and final lists of deliverables were discussed with and approved by the CLARIN-NL Executive Board.
The initial CLAVAS project document can be found here
ISO-639-3 language codes can be downloaded (harvested) by the CLAVAS harvesting web application. The download location of the source files can be modified if it changes in the future. The source files are parsed and converted to a SKOS RDF/XML file that is published by the CLAVAS OpenSKOS instance. The core of the conversion module is a Perl script that can also be used on its own. The script is available as part of the CLAVAS open source distribution on GitHub.
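The kind of conversion described here can be sketched as follows. This is not the CLAVAS Perl script but a minimal Python illustration, assuming a tab-separated code table with a header row that contains `Id` and `Ref_Name` columns. The base URI, the function name `table_to_skos` and the choice to emit only `skos:notation` and an English `skos:prefLabel` are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BASE_URI = "http://example.org/iso639-3/"  # placeholder, not the CLAVAS URI scheme

ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)


def table_to_skos(tsv_text):
    """Convert a tab-separated code table to a SKOS RDF/XML string."""
    root = ET.Element(f"{{{RDF}}}RDF")
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        # One skos:Concept per language code, identified by base URI + code.
        concept = ET.SubElement(root, f"{{{SKOS}}}Concept",
                                {f"{{{RDF}}}about": BASE_URI + row["Id"]})
        ET.SubElement(concept, f"{{{SKOS}}}notation").text = row["Id"]
        label = ET.SubElement(concept, f"{{{SKOS}}}prefLabel")
        label.set(XML_LANG, "en")
        label.text = row["Ref_Name"]
    return ET.tostring(root, encoding="unicode")


# Two-column sample; extra columns in a real table are simply ignored.
sample = "Id\tRef_Name\nnld\tDutch\nfry\tWestern Frisian\n"
print(table_to_skos(sample))
```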
ISO-DCR closed and simple data categories can be harvested directly from ISOcat (thanks to a contribution of Menzo Windhouwer, MPI/TLA). The CLAVAS harvesting web application does some necessary post-processing on the ISOcat SKOS RDF/XML data (a combination of a Perl script and Java code, available on GitHub). The resulting SKOS is uploaded to and made available from the CLAVAS OpenSKOS site. In the next chapter there is a brief discussion of possible 'divisions of labour' between ISOcat and OpenSKOS.
Our original proposal was to start with a short feasibility study to find out whether existing software could be adapted to automatically extract organisation names from a dump of metadata fields from the Virtual Language Observatory (VLO). Because this software turned out to be no longer available and no alternative was found, it was decided to start a manual curation project. This project was performed by the Data Curation Center, in close collaboration with CLAVAS. The necessary budget was transferred from CLAVAS to the DCS. The process was as follows: two manual passes bundled spelling variations of organisation names and identified the preferred spelling. A tabular text format was used for this. Remarks and editorial notes were also recorded. The intermediate results were automatically processed and converted to SKOS (as part of the CLAVAS project). The conversion process also extracted hierarchical structure where possible. Conversion errors were kept separate in the form of a set of SKOS Concepts that needed further curation. The converted result was evaluated using the search and browse facilities of the OpenSKOS Editor. In a final manual pass the OpenSKOS Editor was used to fix these Concepts. In the process, the OpenSKOS Editor was also evaluated with respect to usability for CLARIN vocabulary curation tasks. Like the other two vocabularies, the Organisation Names vocabulary was published on the CLAVAS OpenSKOS site.
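The automatic step between the manual passes can be illustrated with a small sketch. The exact tabular format used in the project is not documented here, so the row layout below (preferred spelling, semicolon-separated variants, optional note, separated by tabs) is an assumption, as are the function name `convert_rows` and the dictionary keys. Rows that cannot be converted are kept apart for later manual curation, in the spirit of the process described above.

```python
def convert_rows(lines):
    """Split curated rows into concepts and rows that need further curation.

    Assumed (hypothetical) row layout, tab-separated:
    preferred spelling, semicolon-separated variants, optional editorial note.
    """
    concepts, needs_curation = {}, []
    for line_no, line in enumerate(lines, start=1):
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        preferred = fields[0]
        if not preferred or preferred in concepts:
            # Missing or duplicate preferred spelling: keep apart for manual fixing.
            needs_curation.append((line_no, line))
            continue
        variants = fields[1].split(";") if len(fields) > 1 and fields[1] else []
        concepts[preferred] = {
            "prefLabel": preferred,
            "altLabel": sorted({v.strip() for v in variants if v.strip()} - {preferred}),
            "editorialNote": fields[2] if len(fields) > 2 and fields[2] else None,
        }
    return concepts, needs_curation


# Hypothetical sample rows; the second one lacks a preferred spelling.
rows = [
    "Max Planck Institute for Psycholinguistics\tMPI for Psycholinguistics; MPI Nijmegen\tcheck abbreviation",
    "\tMeertens Inst.",
    "Meertens Institute\tMeertens Instituut",
]
concepts, needs_curation = convert_rows(rows)
print(len(concepts), len(needs_curation))  # 2 1
```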
A simple web application was built and integrated in the CLAVAS OpenSKOS site. This web application has three tabs, one for each of the CLAVAS vocabularies. The two that can be periodically updated have a facility to modify the download paths for the source information and a download button. The RDF file that is created and downloaded is suitable for upload to an OpenSKOS instance. Sources for the web application as well as for all converters are available from the CLAVAS GitHub repository.
The original CLAVAS proposal contained a 'simple vocabulary curation' tool as one of its deliverables. 'Simple' meant: with no more functionality than needed by the curators of simple, unstructured vocabularies in CLARIN. During CLAVAS execution the Netherlands Institute for Sound and Vision (NISV) decided to base maintenance and publication of their thesaurus on OpenSKOS. They designed a full-blown thesaurus editing environment, including support for simple workflows. This Editor was subsequently built by Picturae as an extension of the already existing CATCHPlus OpenSKOS software and was also made available under an open source license via the CATCHPlus OpenSKOS GitHub repository. Hennie Brugman participated on behalf of CATCHPlus and CLAVAS as an advisor on several occasions in the design and development process. It is clear that the current OpenSKOS Editor contains all functionality required for CLAVAS (and a lot more). However, it was designed for documentalists at NISV and not for curators in the CLARIN context. Therefore the Editor was evaluated by the CLARIN Data Curation Center in the context of the construction of the Organisation Names vocabulary.
The OpenSKOS platform
Some of the CLAVAS deliverables were directly built into the OpenSKOS software: support for login via federated identities using Shibboleth, and fuzzy search on concept labels. The OpenSKOS platform itself also had interesting developments that are presented in the next chapter.
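To give an impression of what fuzzy search on concept labels means in practice, the sketch below finds the labels that are lexically closest to a query. It does not reflect how OpenSKOS implements this feature; it uses `difflib.get_close_matches` from the Python standard library as a stand-in, and the function name `closest_labels` and the sample labels are assumptions made for the example.

```python
import difflib


def closest_labels(query, labels, limit=3, cutoff=0.6):
    """Return up to `limit` labels that are lexically closest to `query`."""
    # Compare case-folded forms, but report the labels as originally written.
    by_folded = {label.casefold(): label for label in labels}
    hits = difflib.get_close_matches(query.casefold(), list(by_folded),
                                     n=limit, cutoff=cutoff)
    return [by_folded[hit] for hit in hits]


labels = [
    "Meertens Institute",
    "Max Planck Institute for Psycholinguistics",
    "Utrecht University",
]
print(closest_labels("Meertens Instituut", labels, limit=1))  # ['Meertens Institute']
```

The `cutoff` parameter controls how dissimilar a label may be before it is dropped; a query that resembles none of the labels yields an empty list.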
Discussion and conclusions
Applications of the OpenSKOS API
An important use case and test case in the project proposal was the integration of OpenSKOS API usage in the Arbil metadata editor. It was discussed on several occasions and has been possible to implement for a substantial period. However, due to time limitations at MPI/TLA it was not possible to do this test within the time frame of CLAVAS. Therefore, we will do this test at a later stage. The usage scenario itself is daily practice in another collection description and management system (Memorix, by Picturae), so it has been proven to work and to be efficient.
To what extent did we meet our initial success criteria?
In the project document we formulated a number of success criteria for each of the intended user groups. Many criteria depend on application of CLAVAS. Therefore, for some criteria it is too early to make definite statements.
- Metadata editors: it is too early to make any claims about improved metadata quality. A necessary next step is integration of the use of CLAVAS/OpenSKOS in Arbil. In general, CLAVAS supports open vocabularies that are not suited for ISOcat, it offers autocompletion and easy, fine-grained search on the basis of all SKOS attributes, and it contains the basic functionality to allow metadata editors to suggest missing concepts.
- Collection users and metadata managers: (way) too early to make any claims here.
- Vocabulary curators: criteria are met. Curation with OpenSKOS is tested by the CLARIN-NL Data Curation Service.
- CLAVAS content manager: periodic updates of ISO-639-3 and ISOcat vocabularies are possible and easy to perform. The Organisation Names vocabulary is bootstrapped from the VLO and can be curated manually using the OpenSKOS Editor.
- Tool builders: the RESTful API is freely available for builders of tools and services.
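Autocompletion, mentioned above under the criteria for metadata editors, amounts to prefix search over concept labels. A tool would normally delegate this to the OpenSKOS service; the sketch below only illustrates the principle on a local list of labels, and the `Autocompleter` class and its method names are assumptions made for the example.

```python
import bisect


class Autocompleter:
    """Prefix search over a fixed list of labels (illustrative only)."""

    def __init__(self, labels):
        # Sort on the case-folded form so that prefix matches are contiguous.
        self._entries = sorted((label.casefold(), label) for label in labels)
        self._keys = [key for key, _ in self._entries]

    def complete(self, prefix, limit=10):
        """Return at most `limit` labels that start with `prefix`."""
        folded = prefix.casefold()
        # Binary search for the first key that is >= the prefix.
        start = bisect.bisect_left(self._keys, folded)
        matches = []
        for key, label in self._entries[start:start + limit]:
            if not key.startswith(folded):
                break
            matches.append(label)
        return matches


completer = Autocompleter(["Dutch", "Danish", "Dari", "German"])
print(completer.complete("da"))  # ['Danish', 'Dari']
```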
ISOcat and OpenSKOS
Although it turned out to be possible to provide OpenSKOS access to ISO-DCR data categories, this may not be the most useful division of labour between the two services. An alternative is that ISOcat functions as a client of OpenSKOS: for open vocabularies it could refer to ConceptSchemes in OpenSKOS. Tools that need term lists (e.g. for autocompletion) can then be redirected to OpenSKOS and its API.
Emerging OpenSKOS community
During the CLAVAS project OpenSKOS itself attracted substantial attention. At the time of writing, approximately 10 instances of OpenSKOS are installed, either in a production setting (NISV, CLAVAS, some customers of Picturae), in an experimental setting (ICLTT in Vienna, Cologne Center for eHumanities) or for testing purposes (Europeana). There were several presentations with substantial positive feedback, in a national context (mainly in the Dutch Cultural Heritage domain) and internationally (presentations in Paris and at LREC 2012 in Istanbul). It was the main topic of a 'break out session' at the last DARIAH VCC meeting in Copenhagen. A consequence of this increasing interest could be that a number of interesting vocabularies from the HLT, eHumanities and Cultural Heritage domains become available to the CLARIN community as well (and vice versa).
Recommendations and next steps
- Integration of CLAVAS in Arbil is an important next step.
- A number of issues concerning the distributed setup of OpenSKOS have to be addressed and agreed upon: discovery of each other's vocabularies, data provenance and persistent identification of concepts.
- A number of improvements for the OpenSKOS Editor user interface are identified, for example better support for multiple collections in one OpenSKOS instance.
- The OpenSKOS documentation should be extended and at some points updated.
- CLAVAS/OpenSKOS depends on useful content to become successful. So collecting, converting and importing new vocabularies (under open licenses) is important.
OpenSKOS started off as an Open Source project that is potentially successful. It is actually used in production environments and there are many interested people, organisations and projects. Most outstanding tasks are small and relatively simple and straightforward. There are many good developers around that could contribute. There are organisations willing to host an instance for some time. Our hope is that modest contributions from current and future projects will bring OpenSKOS further as an open source community product.
|D1.1|Base data for organisation names|data|
|D1.2|Organisation names feasibility study|report|
|D1.3|Name extraction and normalization tool|software|
|D1.4|Organisation names vocabulary data set|data|
|D2.1|Harvester for ISOcat REST service|software|
|D2.2|ISO-DCR SKOS conversion specification|report|
|D2.3|ISO-DCR SKOS conversion module|software|
|D3.1|Harvester for ISO-639-3|software|
|D3.2|Language codes SKOS module|software|
|D3.3|(optional) lexvo.org mix-in module|software|
|D4.1|Module for harvesting control, configuration and monitoring|software|
|D4.2|Wireframe document for harvesting GUI|report|
|D5.1|Wireframe document for Vocabulary curation GUI|report|
|D5.2|Vocabulary curation GUI|software|
|D5.3|(optional) Minimal Vocabulary curation GUI|software|
|D6.1|VAS API, authentication extension (Shibboleth)|software|
|D6.2|VAS API, find lexically closest term|software|
|D6.3|Persistent URL/identifier strategy|report|
|D7.1|Source code online on GitHub|software|
|D8.1|Hosting and exploitation plan|report|
|D8.2|Operational version of CLAVAS OpenSKOS instance|service|
|D8.3|Operational version of CLAVAS|service|
|D9.1|User documentation (online and/or built-in)|report|