Revisit: Reuse of Structured Data: Semantics, Linkage, and Realization (1)

Library & Information Science, 43.1 (2017): 7-46. / [[中文]] (PART II)

RESEARCH HIGHLIGHTS:

# An old record is no longer just a datum; it is now defined as a new semantic dataset. 
  i.e. its triples, graphs, links, file formats ...
  i.e. its revised, vocabulary-encoded versions ...
  ex. data:d2148340 a dcat:Dataset. #files: JSON-LD, TTL, XML
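
As a rough illustration of the highlight above, the following sketch writes out a Turtle description of one record as a DCAT dataset with one distribution per file format. The namespace behind the data: prefix is not given in the text, so the base IRI here is a placeholder, and reading "XML" as RDF/XML is likewise an assumption. It is not the serialization the authors actually publish.

```python
# Assumption: the text does not give the base IRI behind the "data:" prefix,
# so a placeholder namespace is used here.
DATA_NS = "http://example.org/data/"
DCAT_NS = "http://www.w3.org/ns/dcat#"
IANA = "https://www.iana.org/assignments/media-types/"

# File formats named in the highlight. Reading "XML" as RDF/XML is an assumption.
MEDIA_TYPES = {
    "jsonld": "application/ld+json",
    "ttl": "text/turtle",
    "rdf": "application/rdf+xml",
}


def describe_dataset(record_id: str) -> str:
    """Return a Turtle description of one record as a dcat:Dataset."""
    files = {ext: f"<{DATA_NS}{record_id}.{ext}>" for ext in MEDIA_TYPES}
    lines = [
        f"@prefix data: <{DATA_NS}> .",
        f"@prefix dcat: <{DCAT_NS}> .",
        "",
        f"data:{record_id} a dcat:Dataset ;",
        "    dcat:distribution " + ", ".join(files.values()) + " .",
    ]
    # One dcat:Distribution per downloadable file format.
    for ext, iri in files.items():
        lines += [
            "",
            f"{iri} a dcat:Distribution ;",
            f"    dcat:mediaType <{IANA}{MEDIA_TYPES[ext]}> .",
        ]
    return "\n".join(lines)


print(describe_dataset("d2148340"))
```

Note that the class is spelled dcat:Dataset; the lowercase dcat:dataset is a DCAT property that links a catalog to its datasets.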

# A new method to curate, publish & visualize LOD graphs via a CKAN portal. 
  i.e. two models for one dataset, published in two views.
  ex. data:d2148340 a dcat:Dataset.   # Dublin Core @schema1
  ex. data:d2148340 a data:Refined. # more semantics @schema2
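
The idea of two models for one dataset can be sketched as two functions that describe the same subject under different schemas. The record fields, their values, and the choice of a typed date as the "added semantics" are all hypothetical; the actual second schema is defined by the paper, not by this sketch.

```python
# Hypothetical record; the field names and values are for illustration only.
RECORD = {"id": "d2148340", "title": "Sample catalogue record", "date": "1935-04-12"}


def literal(value, datatype=None):
    """Format a plain or datatyped Turtle literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    text = '"' + escaped + '"'
    return f"{text}^^{datatype}" if datatype else text


def view_schema1(rec):
    """Schema 1: a flat Dublin Core description."""
    s = "data:" + rec["id"]
    return [
        (s, "a", "dcat:Dataset"),
        (s, "dct:title", literal(rec["title"])),
        (s, "dct:date", literal(rec["date"])),
    ]


def view_schema2(rec):
    """Schema 2: same subject, refined type, more explicit semantics."""
    s = "data:" + rec["id"]
    return [
        (s, "a", "data:Refined"),
        (s, "dct:title", literal(rec["title"])),
        # Assumption: typing the date is one example of added semantics.
        (s, "dct:date", literal(rec["date"], "xsd:date")),
    ]


for view in (view_schema1, view_schema2):
    for s, p, o in view(RECORD):
        print(s, p, o, ".")
    print()
```

Both views keep the same subject identifier, so a consumer can choose the simpler or the richer description without losing track of which record it describes.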

# Validation & Reproducibility: provenance and contexts are documented in detail.

Introduction

To enhance the reuse value of existing datasets, it is becoming general practice to add semantic links among the records in a dataset and to link these records to external resources. The enriched datasets are published on the Web for both humans and machines to consume and repurpose.

Open Data Web (data.odw.tw)

In the paper, we make use of publicly available structured records from a digital archive catalogue, and we demonstrate a principled approach to converting the records into semantically rich and interlinked resources for all to reuse.

While exploring the various issues involved in the process of reusing and re-purposing existing datasets, we review the recent progress in the field of Linked Open Data (LOD), and examine twelve well-known knowledge bases built with a Linked Data approach. We also discuss the general issues of data quality, metadata vocabularies, and data provenance.

Different Contexts in Different Data Curation Phases

The concrete outcomes of this research are the following:

a website/repository (Open Data Web) that hosts more than 840,000 semantically enriched catalogue records across multiple subject areas,
a lightweight ontology voc4odw for describing data reuse and provenance, among others, and
a set of open source software tools available to all to perform the kind of data conversion and enrichment we did in this research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a platform to host and publish Linked Data.
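
Because the records are hosted on CKAN, a client could in principle read a dataset's metadata through CKAN's standard Action API. The sketch below assumes that the portal exposes that API at its root and that d2148340 is a valid dataset identifier there; neither is confirmed by the text. The sample response is made up to show the shape CKAN returns.

```python
import json
from urllib.parse import urlencode


def package_show_url(portal: str, dataset_id: str) -> str:
    """Build the CKAN Action API URL that returns one dataset's metadata."""
    query = urlencode({"id": dataset_id})
    return portal.rstrip("/") + "/api/3/action/package_show?" + query


def resource_formats(response_text: str) -> dict:
    """Map each resource format in a package_show response to its download URL."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    return {r["format"]: r["url"] for r in payload["result"]["resources"]}


# Made-up response in the shape CKAN returns; not real portal output.
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "name": "d2148340",
        "resources": [
            {"format": "TTL", "url": "http://example.org/d2148340.ttl"},
            {"format": "JSON-LD", "url": "http://example.org/d2148340.jsonld"},
        ],
    },
})

print(package_show_url("https://data.odw.tw", "d2148340"))
print(resource_formats(SAMPLE))
```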

Our extensions to CKAN are open sourced as well. As the records we have drawn from the original catalogue are released under Creative Commons licenses, the semantically enriched resources we now republish on the Web are likewise free for all to reuse.

Review of Twelve Knowledge Bases

We begin by examining twelve knowledge bases built with a Linked Data approach.

Five of them are built by domain experts (OpenCyc, Getty Art and Architecture Thesaurus (AAT), Getty Thesaurus of Geographic Names (TGN), Ordnance Survey, and Open Names), six are collaborative databases (Freebase, YAGO, DBpedia, Wikidata, LinkedGeoData, and GeoNames), and the last one covers ecological observations based on expert and community collaboration (Encyclopedia of Life (EOL) TraitBank). We further compare datasets about geospatial entities with controlled vocabularies: Getty TGN, Open Names (Ordnance Survey), DBpediaPlace* (instances of dbo:Place), LinkedGeoData, and GeoNames.

To make good reuse of structured data, one first needs to deal with the problem of data quality. Different evaluation criteria currently exist, along with various techniques for measuring the quality of information, data, metadata, and Linked Data.

LOD Knowledge Bases/Graphs (SPARQL query results as of 2016/11/06)

approach	LOD knowledge graph	since	organization	domain	resources	triples	update frequency	data source
Expert Led (top down)	OpenCyc	2008	business	cross-domain	41,029	2,412,520	over one year	owner
	Getty AAT	2014	business	art &	45,327	13,259,890	3-5 times a year	owner
	Getty TGN	2014	business	place name	2,495,100	204,614,290	3-5 times a year	owner
	Ordnance Survey	2010	government	geography	2,938,707	58,377,209	depending	owner
	Open Names	2015	government	place name	925,157	21,360,688	twice a year	owner
Collective Collaboration (bottom up)	Freebase	2008	business	cross-domain	49,947,799	3,124,791,156	closed in 2015	Wikipedia
	YAGO	2007	university	cross-domain	5,130,031	1,001,461,786	over one year	Wikipedia
	DBpedia	2007	university	cross-domain	5,109,890	402,086,316	about one year/ some in Live.	Wikipedia
	DBpediaPlace*	2007	university	place (name)	816,252	53,895,946	about one year/ some in Live.	Wikipedia
	Wikidata	2012	NGO	cross-domain	19,367,201	1,371,170,022	real time	Wikipedia
	LinkedGeoData	2010	university	geography	> 3 billion	1,384,887,500	about one year	OpenStreetMap
	GeoNames	2010	NGO	place name	>6.2 million	93,896,732	real time	data collaboration/ partly integrated with others
Mixed Mode	EOL (TraitBank)	2014	association	biodiversity	10,753,384	359,292,712	statistic data/ a week	research databases integration/ partly collaborated
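
The table reports counts obtained through SPARQL queries. The exact queries are not given here, so the following is only a sketch of how such counts could be requested: one query counts all triples, and another counts instances of dbo:Place, which is how DBpediaPlace* is defined in the text. How "resource" was counted for the other knowledge bases is not stated. The code only builds the request URL in the form of a SPARQL 1.1 Protocol GET request and does not contact any endpoint.

```python
from urllib.parse import urlencode

# Counts every triple in the default graph of an endpoint.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

# Counts distinct instances of dbo:Place (the DBpediaPlace* row).
COUNT_PLACES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/> "
    "SELECT (COUNT(DISTINCT ?s) AS ?resources) WHERE { ?s a dbo:Place }"
)


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL; the caller decides whether to send it."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url("https://dbpedia.org/sparql", COUNT_PLACES)
print(url)
```

Full-graph counts can be expensive on large public endpoints, so results may be subject to timeouts or server-side limits.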

We review four papers on data quality and systematically compare their evaluation criteria. Moreover, data provenance (contextual metadata about the source and use of data) has proven to be fundamental for assessing authenticity, enabling trust, and allowing reproducibility. Thus, we examine key mechanisms of data provenance before we move on to discussing LOD applications.
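
To make the notion of provenance concrete, here is a minimal sketch that records where an enriched resource came from, using terms from the W3C PROV-O vocabulary. PROV-O is used here as a stand-in; the paper describes its own lightweight ontology (voc4odw), whose terms are not listed in this summary. All IRIs in the example are placeholders.

```python
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def provenance_triples(enriched, source, activity, ended_at):
    """Return N-Triples lines linking an enriched resource to its source.

    ended_at should be a timezone-aware datetime; it is written in UTC.
    """
    stamp = ended_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        f"<{enriched}> <{PROV}wasDerivedFrom> <{source}> .",
        f"<{enriched}> <{PROV}wasGeneratedBy> <{activity}> .",
        f"<{activity}> <{PROV}used> <{source}> .",
        f'<{activity}> <{PROV}endedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
    ]


# Placeholder IRIs for an enriched record, its source record, and the conversion run.
for line in provenance_triples(
    "http://example.org/enriched/d2148340",
    "http://example.org/source/d2148340",
    "http://example.org/activity/conversion-1",
    datetime(2016, 11, 6, tzinfo=timezone.utc),
):
    print(line)
```

Keeping the conversion run as its own node lets further context, such as the software or person responsible, be attached to it later without touching the record itself.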

Details or Fragments | 細碎再看

2018-09-05