2018-12-25

Why is provenance important for Linked Open Data (LOD)?







A basic reason lies in our dissatisfaction with metadata quality:

It ranges from simple field-value problems (garbled characters, null values, contradictions, duplicate records, ambiguous names, confused field definitions, inconsistent encodings) to semantic descriptions that are either too weak (insufficient information, missing required fields) or semantically overloaded (a single field carrying too many meanings).

Looking further at the current state of LOD, in the processes of converting, updating, and integrating digitised data, we see that:
  1. The integrity of the source data often cannot be preserved: for example, in conversions between different data models and databases, or in cross-platform distributed processing of heterogeneous data sources.
  2. International vocabulary standards are misused: for example, misuse of Classes and Properties (predicates) in a standard vocabulary, violations of the Domain (Type) and Range (Value) constraints defined in the data model, and contradictions in hierarchical semantics.
So when we cannot but agree with Van Hooland and Verborgh (2014) that there is no such thing as completely clean metadata, it is hard to hide our sense of loss. In practice, data cleaning may happen as pre-processing before LOD conversion, or as post-processing afterwards; time, budget, and manpower all affect data cleaning and quality.

The key point is that the earlier metadata quality is planned for, the more sustainable the value of the data can be.

Cultural-resource LOD in Taiwan is just beginning to sprout, for example Open Data Web, the Taichung Studies Database (台中學資料庫) linked open data platform, and the recently launched national Forward-Looking Infrastructure project, the Ministry of Culture's National Cultural Memory Bank (國家文化記憶庫). Compared with the international LODLAM (Linked Open Data in Libraries, Archives, and Museums) community, however, LOD development in Taiwan's libraries, archives, and museums still lags behind. Fortunately, if we can learn from past mistakes, abandon stale routines, and strike the right balance in metadata quality management at the very start of a project, then it is not hard to expect that "the last will be first".

Surprisingly, simply and faithfully describing the who, when, and where of each contextual stage already provides good quality management for metadata. This is also why scholars of historical research such as Meroño-Peñuela et al. (2014) proposed data provenance as one direction for a solution. Below we use "the story of Xiaofei (小飛): a butterfly dancing in Taichung Park at Christmas forty years ago" to show how such a minimal information structure of who, when, and where across different contextual stages can clearly and simply convey the basic concepts of the W3C's otherwise complex provenance ontology recommendation (PROV-O).
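To make the idea concrete, the following minimal Turtle sketch shows how PROV-O records the who, when, and where of one stage, the digitisation of a photograph of the butterfly. All names in the ex: namespace, the date, and the place are hypothetical, chosen only to match the story:

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <http://example.org/> .            # hypothetical namespace

  ex:photoOfXiaofei a prov:Entity ;                # the digitised photograph
      prov:wasAttributedTo  ex:photographer ;      # who
      prov:wasGeneratedBy   ex:digitisation .

  ex:digitisation a prov:Activity ;
      prov:startedAtTime     "1978-12-25T10:00:00Z"^^xsd:dateTime ;  # when
      prov:atLocation        ex:TaichungPark ;     # where
      prov:wasAssociatedWith ex:photographer .

  ex:photographer a prov:Agent .

Each later stage (cataloguing, format conversion, republication) can be described by another prov:Activity with its own agent, time, and location, so the full chain of custody stays machine-readable.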


The digitisation history of the butterfly Xiaofei guides us to treat the who, when, and where of digitised cultural heritage objects in the same way. Today we all hope to use LOD technology to let machines semantify data quickly and at scale, integrate distributed databases, and link into the global Semantic Web of knowledge, while also moving toward crowdsourced cultural memory. Supplying machines with the metadata provenance of every cultural object is therefore like an art auction: for every precious artwork, the lot's provenance must trace the object's origin and its previous owners, and the guarantee entry and the catalogue must state the artist or creator, the year of production, the record of owners and transfers, exhibition history, and related publications. In other words, metadata provenance is the quality certificate of digitised data.


References:
  1. Meroño-Peñuela, A., Ashkpour, A., Van Erp, M., Mandemakers, K., Breure, L., Scharnhorst, A., ... & Van Harmelen, F. (2014). Semantic technologies for historical research: A survey. Semantic Web, 6(6), 539-564.
  2. Van Hooland, S., & Verborgh, R. (2014). Linked Data for Libraries, Archives and Museums: How to clean, link and publish your metadata. London: Facet Publishing.
  3. Huang, A. W.-C. (黃韋菁), Lee, C.-J. (李承錱), & Chuang, T.-R. (莊庭瑞) (2017). Reuse of Structured Data: Semantics, Linkage, and Realization (結構資料的再次使用:語意、連結與實作). Journal of Library and Information Science (圖書館學與資訊科學), 43(1), 7-46. DOI: 10.6245/JLIS.2017.431/722
Citation Information: Huang, Andrea Wei-Ching (2018). Why is provenance important for Linked Open Data (LOD)? URL: http://andrea-index.blogspot.com/2018/12/provenance.html

2018-10-27

Revisit: Reuse of Structured Data: Semantics, Linkage, and Realization (2)




(continued from Part I) / Journal of Library and Information Science, 43.1 (2017): 7-46. / [[Chinese version]]


RESEARCH HIGHLIGHTS: 

# An old record is not just data; it is now defined as a new semantic dataset.
i.e. its triples, graphs, links, file formats ...
i.e. its revised, vocabulary-encoded versions ...
ex. data:d2148340 a dcat:Dataset.
# files: JSON-LD, Turtle, XML

# A new method to curate, publish & visualize LOD graphs via a CKAN portal.
i.e. two models for one dataset, published in two views (see the Turtle sketch after these highlights).
ex. data:d2148340 a dcat:Dataset.  # Dublin Core @schema1
ex. data:d2148340 a data:Refined.  # more semantics @schema2

# Validation & Reproducibility: Provenance and contexts are in the details.
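A minimal Turtle sketch of the "one dataset, two views" idea. The data: namespace URI and all property values here are assumptions for illustration; the two views are published side by side on the CKAN portal:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix dwc:  <http://rs.tdwg.org/dwc/terms/> .
  @prefix data: <http://data.odw.tw/data/> .           # assumed namespace URI

  # View 1 (@schema1): catalogue-level Dublin Core / DCAT description
  data:d2148340 a dcat:Dataset ;
      dct:title "Pleione formosana" .                  # hypothetical value

  # View 2 (@schema2): the refined view adds richer, domain-level semantics
  data:d2148340 a data:Refined ;
      dwc:scientificName "Pleione formosana Hayata" .  # hypothetical value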

Practices

Example: data:d2148340
We then make use of structured records (XML files) from a digital archive catalogue, and convert the records into semantically rich and interlinked resources on the Web. This is realized as a unified Linked Data catalogue for several digital archive collections. Our work results in a LOD catalogue (data.odw.tw) that is publicly available. The following five parts are involved in realizing this website.


A catalogue record about a species, Pleione formosana (data:d2148340), is used throughout the paper as an example to demonstrate the way we model, convert, and represent the semantics of a structured record.

R4R Ontology
Part 1: Exploring data reuse relations in a shared context -- We review our previous research on the Relation for Reuse Ontology (R4R). In particular, we provide mechanisms for reusing articles, data, and code, with some flexibility in encoding provenance and license information.

Part 2: Comparing two different data conversion approaches to providing LOD for an archive catalogue -- We show two different scenarios: (1) The LOD catalogue is converted directly from a relational database, and (2) the LOD catalogue is generated from a series of format conversions --- from XML to CSV, and then to RDF. 
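The paper does not tie the second scenario to a specific tool; as one possible way to make the CSV-to-RDF step declarative and reproducible, a CSV on the Web (CSVW) table description could drive the conversion. This is only a sketch: the file name, column names, and URI template below are assumptions:

  @prefix csvw: <http://www.w3.org/ns/csvw#> .

  <catalogue.csv-metadata> a csvw:Table ;
      csvw:url "catalogue.csv" ;                       # assumed file name
      csvw:tableSchema [
          csvw:column (
              [ csvw:name "identifier" ]
              [ csvw:name "title" ;
                csvw:propertyUrl "http://purl.org/dc/terms/title" ]
          ) ;
          # each CSV row becomes one subject resource (assumed URI pattern)
          csvw:aboutUrl "http://data.odw.tw/data/{identifier}"
      ] .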

KB Links Example
Part 3: Data profiling, cleaning and mapping -- We demonstrate format conversion processes, and we discuss the pros and cons of various ways of handling broken links in source datasets. In addition, we map and link catalogue records to three external knowledge bases: GeoNames, Wikidata, and the Encyclopedia of Life.
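Such links might be expressed as follows. Whether a strong owl:sameAs or a weaker SKOS mapping property is appropriate is a modeling choice; the sketch uses skos:closeMatch, and all target identifiers below are hypothetical:

  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
  @prefix data: <http://data.odw.tw/data/> .      # assumed namespace URI
  @prefix wde:  <http://www.wikidata.org/entity/> .
  @prefix gns:  <http://sws.geonames.org/> .      # assumed GeoNames entity base

  data:d2148340
      skos:closeMatch wde:Q1234567 ;   # hypothetical Wikidata item for the species
      skos:closeMatch gns:7654321 .    # hypothetical GeoNames place record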

Part 4: Using CKAN as a Linked Data platform -- We briefly introduce CKAN, an open source web-based data portal software package for curating and publishing datasets. CKAN provides data preview, search, and discovery, especially with regard to geospatial datasets. We built several extensions to CKAN in order to deposit, publish, browse, and search Linked Data. Various Linked Data representations of a catalogue record --- Turtle, RDF/XML, and JSON-LD --- can all be downloaded and reused.
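The multiple representations of one record can themselves be described as DCAT distributions. A sketch, where the download URLs follow an assumed naming pattern rather than the portal's actual URLs:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix data: <http://data.odw.tw/data/> .   # assumed namespace URI

  data:d2148340 dcat:distribution
      [ a dcat:Distribution ; dcat:mediaType "text/turtle" ;
        dcat:downloadURL <http://data.odw.tw/data/d2148340.ttl> ] ,
      [ a dcat:Distribution ; dcat:mediaType "application/rdf+xml" ;
        dcat:downloadURL <http://data.odw.tw/data/d2148340.rdf> ] ,
      [ a dcat:Distribution ; dcat:mediaType "application/ld+json" ;
        dcat:downloadURL <http://data.odw.tw/data/d2148340.jsonld> ] .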

Part 5: Designing an ontology for data representation and reuse -- We design an ontology, voc4odw, which includes the following three modules:

(1) The Core Model. It comprises a data model and a conceptual model.




The data model represents key data structures and relations. It is a framework to illustrate data sources, derivation, and provenance.

The voc4odw Data Model
The conceptual model incorporates Simple Knowledge Organization System (SKOS); it also connects to key event concepts. The conceptual model allows for data contextualization using common and domain knowledge vocabularies.
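A minimal sketch of the kind of SKOS-based contextualization the conceptual model enables. The concept names below are hypothetical illustrations, not actual voc4odw terms:

  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
  @prefix ex:   <http://example.org/> .    # hypothetical concept namespace

  ex:Orchidaceae a skos:Concept ;
      skos:prefLabel "Orchidaceae"@en , "蘭科"@zh ;
      skos:broader   ex:Plantae ;          # domain-knowledge hierarchy
      skos:related   ex:FieldCollecting .  # link to a key event concept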



(2) The Curation Model. It is responsible for disclosing the identification, classification, and publication of structured records at a curation platform, such as the classification of themes, the assignment of data identifiers, and the publication of datasets.

(3) A vocabulary voaf:Vocabulary. It is defined as "A vocabulary used in the Linked Data cloud" in the Vocabulary of a Friend (VOAF). This module relates the Core Model to external common vocabularies. Some hierarchical relations between different external vocabularies can be traced with this vocabulary.
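For illustration, declaring voc4odw itself as a voaf:Vocabulary might look like the sketch below; which external vocabularies it relies on or specializes is an assumption here, not a statement of the published ontology:

  @prefix voaf: <http://purl.org/vocommons/voaf#> .
  @prefix voc:  <http://voc.odw.tw/ontology#> .

  voc: a voaf:Vocabulary ;
      voaf:reliesOn    <http://www.w3.org/ns/prov#> ,
                       <http://www.w3.org/2004/02/skos/core#> ;
      voaf:specializes <http://www.w3.org/ns/dcat#> .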


voc4odw ontology namespaces

Common Knowledge
Prefix  | Namespace                                      | Description
cc      | http://creativecommons.org/ns#                 | Creative Commons Rights Expression Language
csvw    | http://www.w3.org/ns/csvw#                     | CSV on the Web Vocabulary
dc      | http://purl.org/dc/elements/1.1/               | Dublin Core Elements
dcat    | http://www.w3.org/ns/dcat#                     | Data Catalog Vocabulary
dct     | http://purl.org/dc/terms/                      | DCMI Metadata Terms
dctype  | http://purl.org/dc/dcmitype/                   | DCMI Type Vocabulary
event   | http://purl.org/NET/c4dm/event.owl#            | Event Ontology
foaf    | http://xmlns.com/foaf/0.1/                     | FOAF Vocabulary
geo     | http://www.w3.org/2003/01/geo/wgs84_pos#       | WGS84 Geo Positioning
gn      | http://www.geonames.org/ontology#              | GeoNames Ontology
gns     | http://sws.geonames.org/                       | GeoNames Entity
lcsh    | http://id.loc.gov/authorities/subjects         | Library of Congress Subject Headings
org     | http://www.w3.org/ns/org#                      | Organization Ontology
prov    | http://www.w3.org/ns/prov#                     | PROV Ontology
r4r     |                                                | Relation for Reuse Ontology (R4R)
schema  | http://schema.org/                             | Schema.org
skos    | http://www.w3.org/2004/02/skos/core#           | Simple Knowledge Organization System
time    | http://www.w3.org/2006/time#                   | W3C Time Ontology
voaf    | http://purl.org/vocommons/voaf#                | Vocabulary of a Friend
wde     | http://www.wikidata.org/entity/                | Wikidata Entity

Domain Knowledge
aat     | http://vocab.getty.edu/aat/                    | Getty Art & Architecture Thesaurus
dwc     | http://rs.tdwg.org/dwc/terms/                  | Darwin Core Terms
dwciri  | http://rs.tdwg.org/dwc/iri/                    | Darwin Core IRI Terms
eol     |                                                | The Encyclopaedia of Life (EOL)
txn     | http://lod.taxonconcept.org/ontology/txn.owl#  | TaxonConcept Ontology

Local Namespace
voc     | http://voc.odw.tw/ontology#                    | voc4odw ontology
agent   |                                                | Agent Entity in ODW
article |                                                | Article Entity in ODW
code    |                                                | Code Entity in ODW
data    |                                                | Linked Data for ODWeb
evt84   |                                                | Event Entity in ODW
project |                                                | Project Entity in ODW
r1 (n)  | http://data.odw.tw/r1/ (r2, r3…)               |
refined | http://data.odw.tw/refined/                    |
catdat  | http://catalog.digitalarchives.tw/             |

2018-09-05

Revisit: Reuse of Structured Data: Semantics, Linkage, and Realization (1)


RESEARCH HIGHLIGHTS: 

# An old record is not just data; it is now defined as a new semantic dataset.
i.e. its triples, graphs, links, file formats ...
i.e. its revised, vocabulary-encoded versions ...
ex. data:d2148340 a dcat:Dataset.
# files: JSON-LD, Turtle, XML

# A new method to curate, publish & visualize LOD graphs via a CKAN portal.
i.e. two models for one dataset, published in two views.
ex. data:d2148340 a dcat:Dataset.  # Dublin Core @schema1
ex. data:d2148340 a data:Refined.  # more semantics @schema2

# Validation & Reproducibility: Provenance and contexts are in the details.

Introduction

In order to enhance the reuse value of existing datasets, it is now becoming general practice to add semantic links among the records in a dataset, and to link these records to external resources. The enriched datasets are published on the Web for both humans and machines to consume and re-purpose.


Open Data Web (data.odw.tw)
In the paper, we make use of publicly available structured records from a digital archive catalogue, and we demonstrate a principled approach to converting the records into semantically rich and interlinked resources for all to reuse. 

While exploring the various issues involved in the process of reusing and re-purposing existing datasets, we review the recent progress in the field of Linked Open Data (LOD), and examine twelve well-known knowledge bases built with a Linked Data approach. We also discuss the general issues of data quality, metadata vocabularies, and data provenance.


Different Contexts in Different Data Curation Phases

The concrete outcome of this research work is the following: 
  1. a website/repository (Open Data Web) that hosts more than 840,000 semantically enriched catalogue records across multiple subject areas, 
  2. a lightweight ontology voc4odw for describing data reuse and provenance, among others, and 
  3. a set of open source software tools available to all to perform the kind of data conversion and enrichment we did in this research. We have used and extended CKAN (The Comprehensive Knowledge Archive Network) as a platform to host and publish Linked Data. 
Our extensions to CKAN are open-sourced as well. As the records we have drawn from the original catalogue are released under Creative Commons licenses, the semantically enriched resources we now re-publish on the Web are free for all to reuse as well.

Review of Twelve Knowledge Bases

We begin by examining twelve knowledge bases built with a Linked Data approach.

Five of them are built by domain knowledge experts (OpenCyc, the Getty Art and Architecture Thesaurus (AAT), the Getty Thesaurus of Geographic Names (TGN), Ordnance Survey, and Ordnance Survey Open Names), six of them are collaborative databases (Freebase, YAGO, DBpedia, Wikidata, LinkedGeoData, and GeoNames), and the last one is about ecological observations based on expert and community collaboration (Encyclopedia of Life/EOL/TraitBank). We further compare datasets about geospatial entities with controlled vocabularies: Getty TGN, Open Names (Ordnance Survey), DBpediaPlace* (instances of dbo:Place), LinkedGeoData, and GeoNames.

To make good reuse of structured data, one first needs to deal with the problem of data quality. Different evaluation criteria currently exist, with various techniques for measuring the quality of information, data, metadata, and Linked Data.


LOD Knowledge Bases/Graphs (2016/11/06 SPARQL query results)

Category                              | LOD Knowledge Graph | since | organization | domain             | resources   | triples       | update frequency                       | data source
Expert Lead (top down)                | OpenCyc             | 2008  | business     | cross-domain       | 41,029      | 2,412,520     | over one year                          | owner
Expert Lead (top down)                | Getty AAT           | 2014  | business     | art & architecture | 45,327      | 13,259,890    | 3-5 times a year                       | owner
Expert Lead (top down)                | Getty TGN           | 2014  | business     | place name         | 2,495,100   | 204,614,290   |                                        | owner
Expert Lead (top down)                | Ordnance Survey     | 2010  | government   | geography          | 2,938,707   | 58,377,209    | depending                              | owner
Expert Lead (top down)                | OS Open Names       | 2015  | government   | place name         | 925,157     | 21,360,688    | twice a year                           |
Collective Collaboration (bottom up)  | Freebase            | 2008  | business     | cross-domain       | 49,947,799  | 3,124,791,156 | closed in 2015                         |
Collective Collaboration (bottom up)  | YAGO                | 2007  | university   | cross-domain       | 5,130,031   | 1,001,461,786 | over one year                          |
Collective Collaboration (bottom up)  | DBpedia             | 2007  | university   | cross-domain       | 5,109,890   | 402,086,316   | about one year / some in DBpedia Live  |
Collective Collaboration (bottom up)  | DBpediaPlace*       | 2007  | university   | place (name)       | 816,252     | 53,895,946    |                                        |
Collective Collaboration (bottom up)  | Wikidata            | 2012  | NGO          | cross-domain       | 19,367,201  | 1,371,170,022 | real time                              |
Collective Collaboration (bottom up)  | LinkedGeoData       | 2010  | university   | geography          | > 3 billion | 1,384,887,500 | about one year                         |
Collective Collaboration (bottom up)  | GeoNames            | 2010  | NGO          | place name         | > 6.2 million | 93,896,732  | real time                              | data collaboration / partly integrated with others
Mix Mode                              | EOL TraitBank       | 2014  | association  | biodiversity       | 10,753,384  | 359,292,712   | statistic data / a week                | research databases integration / partly collaborated

We review four papers on data quality and systematically compare their evaluation criteria. Moreover, data provenance --- contextual metadata about the source and use of data --- has proven to be fundamental for assessing authenticity, enabling trust, and allowing reproducibility. Thus, we examine key mechanisms of data provenance before we move forward to discussing LOD applications.