An Objective Assessment Framework & Tool for Linked Data Quality
Enriching Dataset Profiles with Quality Indicators
In the last few years the Semantic Web has gained momentum, supported by the introduction of many related initiatives like the Linked Open Data Cloud (LOD Cloud). From 12 datasets cataloged in 2007, the Linked Open Data cloud has grown to nearly 1000 datasets containing more than 82 billion triples. Data is being published by both the public and private sectors and covers a diverse set of domains from life sciences to military. This success lies in the cooperation between data publishers and consumers, where users are empowered to find, share and combine information in their applications easily.
We are entering an era where open is the new default. Governments, universities, organizations and even individuals are publicly publishing huge amounts of open data. This openness should be accompanied by a certain level of trust or guarantees about the quality of data. Linked Open Data is a gold mine for those trying to leverage external data sources in order to make more informed business decisions. However, the heterogeneous nature of the sources reflects directly on data quality, as these sources often contain inconsistent, misinterpreted and incomplete information.
Traditional data quality is a thoroughly researched field with several benchmarks and frameworks to grasp its dimensions [1, 3, 4]. Data quality principles typically rely on many subjective indicators that are complex to measure automatically. The quality of data is indeed realized when it is used, and is thus directly related to the ability to satisfy users' continuous needs.
Web documents, which are by nature unstructured and interlinked, require different quality metrics and assessment techniques than traditional datasets. For example, the importance and quality of Web documents can be subjectively calculated via algorithms like PageRank. Ensuring data quality in Linked Open Data is a complex process, as it consists of structured information supported by models, ontologies and vocabularies and contains queryable endpoints and links. This makes data quality assurance a challenge. Despite the fact that Linked Open Data quality is a trending and highly demanded topic, very few efforts are currently trying to standardize, track and formalize frameworks to issue scores or certificates that will help data consumers in their integration tasks.
Data quality assessment is the process of evaluating whether a piece of data meets the consumer's needs in a specific use case. The dimensionality of data quality makes it dependent on the task and the users' requirements. For example, DBpedia and YAGO are knowledge bases containing data extracted from structured and semi-structured sources. They are used in a variety of applications, e.g., annotation systems, exploratory search and recommendation engines. However, their data is not integrated into critical systems, e.g., life-critical (medical applications) or safety-critical (aviation applications), as its quality is found to be insufficient. In this work, we first propose a comprehensive objective framework to evaluate the quality of Linked Data sources. Secondly, we present an extensible quality measurement tool that helps, on the one hand, data owners to rate the quality of their datasets and get hints on possible improvements and, on the other hand, data consumers to choose their data sources from a ranked set. The aim of this work is to provide researchers and practitioners with a comprehensive understanding of the objective issues surrounding Linked Data quality.
The framework we propose is based on a refinement of the data quality principles described in  and surveyed in . Some attributes have been grouped for more detailed quality assessments, and we have extended them by adding a set of objective indicators for each attribute. These indicators are measures that provide users with quality metrics measurable by tools regardless of the use case. For example, when measuring the quality of the DBpedia dataset, an objective metric would be the availability of human- or machine-readable license information rather than the trustworthiness of the publishers.
Furthermore, we surveyed the landscape of Linked Data quality tools to discover that they only cover a subset of the proposed objective quality indicators. As a result, we extend Roomba which is a framework to assess and build dataset profiles with an extensible quality measurement tool and evaluate it by measuring the quality of the LOD cloud group. The results demonstrate that the general quality of LOD cloud needs more attention as most of the datasets suffer from various quality issues.
In , the authors present a comprehensive systematic review of data quality assessment methodologies applied to LOD. They have extracted 26 quality dimensions and a total of 110 objective and subjective quality indicators. However, some of those objective indicators depend on the use case, so there is no clear separation of what can be measured automatically. For example, data completeness is generally a subjective dimension. However, the authors consider the degree to which all real-world objects are represented, the number of missing values for a specific property and the degree to which instances in the dataset are interlinked to be objective indicators, given the presence of a gold standard or the original data source to compare with. Moreover, many of the performance dimensions, like low latency, high throughput or scalability of a data source, were defined as objective but still depend on multiple subjective factors like network congestion. In addition, some objective indicators vital to the quality of LOD were missing, e.g., an indication of the openness of the dataset.
The ODI certificate provides a description of the published data quality in plain English. It aspires to act as a mark of approval that helps publishers understand how to publish good open data and users how to use it. It gives publishers the ability to provide assurance and support on their data while encouraging further improvements through an ascending scale.
ODI comes as an online and free questionnaire for data publishers focusing on certain characteristics about their data. The questions are classified into the following categories: general information (about dataset, publisher and type of release), legal information (e.g., rights to publish), licensing, privacy (e.g., whether individuals can be identified), practical information (e.g., how to reach the data), quality, reliability, technical information (e.g., format and type of data) and social information (e.g., contacts, communities, etc.). Based on the information provided by the data publisher, a certificate is created with one of four different ratings.
Although ODI is a great initiative, the issued certificates are self-certified. ODI does not verify or review submissions but retains the right to revoke a certificate at any time. At the time of writing this post, there were only 10,555 ODI certificates issued. The dynamicity of Linked Data also makes it very difficult to update the certificates manually, especially when changes are frequent and affect multiple categories. There is clearly a need for automatic certification, which can be supplemented with some manual input for categories that cannot be processed by machines.
The emerging critical need for large, distributed, heterogeneous, and complex structured datasets identified the necessity to establish industry cooperation between vendors of RDF and Graph database technologies in developing, endorsing, and publishing reliable and insightful benchmark results. The Linked Data Benchmark Council (LDBC) aims to bridge the gap between the industry and the new trending stack of semantic technologies and their vendors. LDBC aims at promoting graph and RDF data management systems to be an accepted industrial solution. LDBC is not focused around measuring or assessing quality. However, it focuses on creating benchmarks to measure progress in scalability, storage, indexing and query optimization techniques to become the de facto standard for publishing performance results.
In , the authors propose a methodology for assessing Linked Data quality. It consists of three main steps: (1) requirement analysis, (2) quality assessment and (3) quality improvement. Considering the multidimensionality of data quality, the methodology requires users to provide the details of a use case or a scenario that describes the intended usage of the data. Moreover, quality issues are identified with the help of a checklist. The user must have prior knowledge about the details of the data in order to fill in this list. Tools implementing the proposed methodology should be able to generate comprehensive quality measures. However, they will require heavy manual intervention and deep knowledge of the data examined. These requirements make it hard to detect quality issues at large scale.
Objective Linked Data Quality Classification
The basic idea behind Linked Data is that its usefulness increases when it is more interlinked with other datasets. Tim Berners-Lee defined four main principles for publishing data that can ensure a certain level of uniformity, directly reflecting the data's usability:
- Make the data available on the Web: assign URIs to identify things.
- Make the data machine readable: use HTTP URIs so that looking up these names is easy.
- Use publishing standards: when the lookup is done provide useful information using standards like RDF.
- Link your data: include links to other resources to enable users to discover more things.
Building on these principles, we group the quality attributes into four main categories:
- Quality of the entities: quality indicators that focus on the data at the instance level.
- Quality of the dataset: quality indicators at the dataset level.
- Quality of the semantic model: quality indicators that focus on the semantic models, vocabularies and ontologies.
- Quality of the linking process: quality indicators that focus on the inbound and outbound links between datasets.
In , the authors identified 24 different Linked Data quality attributes. These attributes are a mix of objective and subjective measures that may not be derived automatically. In this work, we refine these attributes into a condensed framework of 10 objective measures. Since these measures are rather abstract, we rely on quality indicators that reflect data quality and use them to automate the calculation of dataset quality.
The quality indicators are weighted. These weights give the flexibility to define multiple degrees of importance. For example, a dataset containing people can have more than one person with the same name thus it is not always true that two entities in a dataset should not have the same preferred label. As a result, the weight for that quality indicator will be set to zero and will not affect the overall quality score for the consistency measure.
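To make the weighting idea concrete, here is a minimal sketch of how a per-measure score could be computed from weighted indicators. The function name, the indicator names and the default weight of 1.0 are assumptions made for illustration; they are not taken from the tool itself.

```python
def weighted_score(indicators, weights):
    """Return the weighted share of satisfied indicators, in [0, 1].

    indicators: dict mapping indicator name -> True/False (satisfied or not)
    weights:    dict mapping indicator name -> non-negative weight
                (indicators without an explicit weight default to 1.0,
                which is an assumption of this sketch)
    """
    total = sum(weights.get(name, 1.0) for name in indicators)
    if total == 0:
        return 0.0
    passed = sum(weights.get(name, 1.0) for name, ok in indicators.items() if ok)
    return passed / total


# Hypothetical indicator results for the consistency measure of a dataset
# about people, where identical preferred labels are expected.
consistency = {
    "no_overlapping_labels": False,
    "no_disjoint_class_violations": True,
    "consistent_naming": True,
}

# A zero weight removes the indicator from the score entirely.
score = weighted_score(consistency, {"no_overlapping_labels": 0.0})
```

With the zero weight, the failed label check no longer lowers the consistency score; with default weights it would count as one failed indicator out of three.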
Independent indicators for entity quality are mainly subjective e.g., the degree to which all the real-world objects are represented, the scope and level of details, etc. However, since entities are governed by the underlying model, we have grouped their indicators with those of the modeling quality.
Table 1 in our paper lists the refined measures alongside their objective quality indicators. Those indicators have been gathered by:
- Transforming the objective quality indicators presented as a set of questions in  into more concrete quality indicator metrics.
- Surveying the landscape of data quality tools and frameworks.
- Examining the properties of the most prominent linked data models from the survey done in .
Data completeness can be judged in the presence of a task where the ideal set of attributes and objects is known. It is generally a subjective measure that depends highly on the scenario and use case at hand, as opposed to measures like availability, where one can measure whether a dataset is available regardless of the underlying use case. For example, an entity is considered to be complete if it contains all the attributes needed for a given task, has complete language coverage and has documentation properties [12, 15]. Dataset completeness has some objective indicators, which we include in our framework. A dataset is considered to be complete if it:
- Contains supporting structured metadata.
- Provides data in multiple serializations (N3, Turtle, etc.).
- Contains different data access points. These can either be a queryable endpoint (i.e. SPARQL endpoint, REST API, etc.) or a data dump file.
- Uses datasets description vocabularies like DCAT or VOID.
- Provides descriptions about its size e.g.,
- Provides descriptions about its format.
- Contains information about its organization and categorization e.g.,
- Contains information about the kind and number of used vocabularies.
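Several of the indicators listed above can be read straight off a dataset's metadata record. The sketch below assumes a simplified, hypothetical record layout (an `organization` field, a `vocabularies` list and a `resources` list whose entries carry `kind`, `format` and `size`); real portal metadata will differ and would need mapping onto it first.

```python
# Serialization names treated as RDF formats in this sketch (an assumption).
RDF_FORMATS = {"turtle", "n3", "rdf/xml", "n-triples", "json-ld"}


def completeness_indicators(meta):
    """Map a hypothetical dataset metadata dict to True/False indicators."""
    resources = meta.get("resources", [])
    formats = {r["format"].lower() for r in resources if r.get("format")}
    kinds = {r.get("kind") for r in resources}
    return {
        "multiple_serializations": len(formats & RDF_FORMATS) > 1,
        "has_access_point": bool(kinds & {"sparql", "api", "dump"}),
        "has_description_vocabulary": bool(kinds & {"void", "dcat"}),
        "has_size": any(r.get("size") for r in resources),
        "has_format": bool(resources) and all(r.get("format") for r in resources),
        "has_organization": bool(meta.get("organization")),
        "has_vocabulary_info": bool(meta.get("vocabularies")),
    }


# Made-up example record.
sample = {
    "organization": "example-org",
    "vocabularies": ["foaf", "dcterms"],
    "resources": [
        {"kind": "dump", "format": "Turtle", "size": 1024},
        {"kind": "dump", "format": "N3", "size": 2048},
        {"kind": "sparql", "format": "api/sparql"},
        {"kind": "void", "format": "Turtle"},
    ],
}
report = completeness_indicators(sample)
```

Each indicator is a plain boolean, so the result can be fed directly into a weighted scoring step.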
Links are considered to be complete if the dataset and all its resources have defined links [10, 11, 14]. Models are considered to be complete if they do not contain disconnected graph clusters. Disconnected graphs are the result of incomplete data acquisition or accidental deletion of terms that leads to deprecated terms. In addition, models are considered to be complete if they have complete language coverage (each concept labeled in each of the languages that are also used on the other concepts), do not contain omitted top concepts or unidirectional related concepts, and are not missing labels, equivalent properties, inverse relationships, or domain or range values in properties.
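Detecting disconnected graph clusters amounts to counting connected components once the model's relations are treated as undirected edges. The following is a small sketch of that idea using breadth-first search; the function name and the edge-list input format are assumptions for illustration.

```python
from collections import defaultdict, deque


def connected_clusters(edges, nodes=()):
    """Group terms into clusters, ignoring the direction of relations.

    edges: iterable of (term_a, term_b) pairs, one per relation
    nodes: optional extra terms that take part in no relation at all
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    for node in nodes:
        adjacency.setdefault(node, set())  # register isolated terms

    seen, clusters = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), set()
        while queue:
            node = queue.popleft()
            cluster.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(cluster)
    return clusters


# Made-up example: two linked groups plus one isolated term.
clusters = connected_clusters([("a", "b"), ("b", "c"), ("x", "y")], nodes=["z"])
```

A model that yields more than one cluster would fail this completeness indicator.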
A dataset is considered to be available if the publisher provides data dumps e.g., RDF dump, that can be downloaded by users [9, 11], its queryable endpoints e.g., SPARQL endpoint, are reachable and respond to direct queries and if all of its inbound and outbound links are dereferencable.
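A rough way to probe availability is to issue a lightweight HTTP request against every dump, endpoint and link and record which ones answer. The sketch below uses only Python's standard library; the dataset layout (`dumps`, `endpoints`, `links`) is a hypothetical one, and a HEAD request only shows that a URL is reachable, not that an endpoint actually answers queries. Some servers also reject HEAD, so a real checker would probably fall back to GET.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` answers with a 2xx/3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, network failures and malformed URLs.
        return False


def availability(dataset, check=is_reachable):
    """Check every dump, endpoint and link URL of a hypothetical dataset dict."""
    urls = (
        dataset.get("dumps", [])
        + dataset.get("endpoints", [])
        + dataset.get("links", [])
    )
    return {url: check(url) for url in urls}
```

The `check` parameter lets callers swap in a cached or rate-limited probe, which matters when a dataset has thousands of outbound links.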
A dataset is considered to be correct if it includes the correct MIME type and size for the content and doesn't contain syntactic errors. Links are considered to be correct if they lack syntactic errors and use the HTTP URI scheme (avoiding URNs or DOIs). Models are considered to be correct if the top concepts are marked and do not have broader concepts (for example having incoming hasTopConcept or outgoing topConceptOf relationships). Moreover, models are correct if they don't contain incorrect data types for typed literals, omitted or invalid language tags [15, 22], or "orphan terms" (terms without any associative or hierarchical relationships), and if their labels are not empty and do not contain unprintable characters [1, 16] or extra white spaces.
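The label-related correctness indicators are easy to check mechanically. Here is a minimal sketch; the function name is hypothetical, and treating only Unicode control characters (category `Cc`) as "unprintable" and doubled spaces as "extra white space" are assumptions of this example.

```python
import unicodedata


def label_problems(label):
    """Return a list of problems found in a single label string."""
    if not label.strip():
        return ["empty"]
    problems = []
    # Leading, trailing or doubled spaces count as extra white space here.
    if label != label.strip() or "  " in label:
        problems.append("extra_whitespace")
    # Control characters such as NUL or TAB are treated as unprintable.
    if any(unicodedata.category(ch) == "Cc" for ch in label):
        problems.append("unprintable")
    return problems
```

An empty list means the label passed all three checks.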
Consistency implies lack of contradictions and conflicts. The objective indicators are mainly associated with the modeling quality. A model is considered to be consistent if it does not contain overlapping labels (two concepts having the same preferred lexical label in a given language when they belong to the same scheme) [13, 17], inconsistent preferred labels per language tag [17, 24], atypical use of collections, containers and reification, wrong equivalent, symmetric or transitive relationships, inconsistent naming criteria in the model [15, 17] or membership violations for disjoint classes [12, 15].
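As an illustration of the overlapping-labels indicator, the sketch below groups concepts by scheme, language and preferred label and reports every group with more than one concept. The tuple-based input format is hypothetical, and comparing labels case-insensitively after trimming is an assumption of this example rather than part of the indicator's definition.

```python
from collections import defaultdict


def overlapping_labels(concepts):
    """Find concepts sharing a preferred label within one scheme and language.

    concepts: iterable of (concept_uri, scheme_uri, lang, pref_label) tuples
    """
    groups = defaultdict(set)
    for uri, scheme, lang, label in concepts:
        groups[(scheme, lang, label.strip().lower())].add(uri)
    return {key: sorted(uris) for key, uris in groups.items() if len(uris) > 1}


# Made-up example data.
concepts = [
    ("ex:c1", "ex:s", "en", "Bank"),
    ("ex:c2", "ex:s", "en", "bank "),
    ("ex:c3", "ex:s", "fr", "Bank"),      # other language: no overlap
    ("ex:c4", "ex:other", "en", "Bank"),  # other scheme: no overlap
]
overlaps = overlapping_labels(concepts)
```

For datasets where duplicate labels are legitimate, such as people sharing a name, the result can simply be ignored by giving the indicator a zero weight.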
Freshness is a measure for the recency of data. The basic assumption is that old information is more likely to be outdated and unreliable . Dataset freshness can be identified if the dataset contains timestamps that can keep track of its modifications. Data freshness could be considered as a subjective measure. However, our concern is the existence of temporal information allowing dataset consumers to subjectively decide its freshness for their scenario.
Provenance can be achieved at the dataset level by including metadata that describes its authoritative information (author, maintainer, creation date, etc.), versioning information and verifying if the dataset uses a provenance vocabulary like PROV .
Licensing is a quality attribute that is measured on the dataset level. It includes the availability of machine readable license information , human readable license information in the documentation of the dataset or its source  and the indication of permissions, copyrights and attributions specified by the author .
Dataset comprehensibility is identified if the publisher provides general information about the dataset (e.g., title, description, URI), indicates at least one exemplary RDF file and SPARQL query, and provides an active communication channel (mailing list, message board or e-mail). A model is considered to be comprehensible if there is no misuse of ontology annotations and all the concepts are documented and annotated [17, 20].
Coherence is the ability to interpret data as expected by the publisher or vocabulary maintainer. The objective coherence measures are mainly associated with the modeling quality. A model is considered to be coherent when it does not contain undefined classes and properties, blank nodes, deprecated classes or properties, relation and mapping clashes, invalid inverse-functional values, cyclic hierarchical relations [20, 26, 28], solely transitive related concepts, redefinitions of existing vocabularies or valueless associative relations.
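Of the coherence indicators, cyclic hierarchical relations lend themselves well to a short example. The sketch below runs a depth-first search over a mapping from each concept to its broader concepts and returns the first cycle it finds. The function name and input format are assumptions; the recursive form is kept for brevity and would need an explicit stack for very deep hierarchies.

```python
def find_cycle(broader):
    """Return one hierarchical cycle as a list of concepts, or None.

    broader: dict mapping a concept to an iterable of its broader concepts
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}

    def visit(node, path):
        color[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GREY:
                # Reached a concept that is already on the current path.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(broader):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

Returning the offending path, rather than just a boolean, gives the dataset owner a concrete hint about which relation to fix.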
Security is a quality attribute that is measured on the dataset level. It is identified if the publishers use login credentials, SSL or SSH to provide access to their dataset, or if they only grant access to specific users .
Linked Data Quality Tools
In this section, we present the results of our survey on the Linked Data quality tools. There exists a number of data quality frameworks and tools that are either standalone or implemented as modules in data integration tools. These approaches can be classified into automatic, semi-automatic, manual or crowdsourced approaches.
RDF is the standard to model information in the Semantic Web. Linked Data publishers can pick from a plethora of tools that can automatically check their RDF files for quality problems. Syntactic RDF checkers like the W3C RDF Validator, the RDF:about validator and converter and the Validating RDF Parser (VRP) are able to detect errors in RDF documents. The RDF Triple-Checker is an online tool that helps find typos and common errors in RDF data. Vapour is a validation service to check whether Semantic Web data is correctly published according to the current best practices.
ProLOD , ProLOD++ , Aether  and LODStats  are not purely quality assessment tools. They are Linked Data profiling tools providing clustering and labeling capabilities, schema discovery and statistics about data types and patterns. The statistics are about properties distribution, link-to-literal ratio, number of entities and RDF triples, average properties per entity and average error.
Reusing existing ontologies is a common practice that Linked Data publishers are always trying to adopt. However, ontologies and vocabularies development is often a long error-prone process especially when many contributors are working consecutively or collaboratively . This can introduce deficiencies such as redundant concepts or conflicting relationships . Getting to choose the right ontology or vocabulary is vital to ensure modeling correctness and consistency.
DL-Learner uses supervised machine learning techniques to learn concepts from user-provided examples. CROCUS applies a cluster-based approach for instance-level error detection. The identified errors are validated by non-expert users, and the process iterates to reach higher quality ontologies that can be safely used in industrial environments.
qSKOS  scans SKOS vocabularies to provide reports on vocabulary resources and relations that are problematic. PoolParty checker is an online service based on qSKOS. Skosify  supports OWL and RDFS ontologies by converting them into well-structured SKOS vocabularies. It includes automatic correction abilities for quality issues that have been observed by reviewing vocabularies on the Web. The OOPS! pitfall scanner  evaluates OWL ontologies against a rules catalog and provides the user with a set of guidelines to solve them. ASKOSI retrieves vocabularies from different sources, stores and displays the usage frequency of the different concepts used by different applications. It promotes reusing existing information systems by providing better management and presentation tools.
Some errors in RDF will only appear after reasoning (incorrect inferences). In [35, 40] the authors perform quality checking on OWL ontologies using integrity constraints involving the Unique Name Assumption (UNA) and the Closed World Assumption (CWA). Pellet provides reasoning services for OWL ontologies. It incorporates a number of heuristics to detect and repair quality issues among disjoint properties, negative property assertions and reflexive, irreflexive, symmetric, and anti-symmetric properties. Eyeball provides quality inspection for RDF models (including OWL). It provides checks for a variety of problems including the usage of unknown predicates, classes, poorly formed namespaces, literal syntax validation, type consistency and other heuristics. RDF:Alerts provides validation for many issues highlighted in  like misplaced, undefined or deprecated classes or properties.
Considering the large amount of available datasets in the Linked Open Data, users have a hard time trying to identify appropriate datasets that suit certain tasks. The most adopted approaches are based on link assessment. Provenance-based approaches and entity-based approaches are also used to compute not only dataset rankings, but also rankings on the entity level.
Manual Ranking Approaches
Sieve is a framework for expressing quality assessment and fusion methods. It is implemented as a component of the Linked Data Integration Framework (LDIF). Sieve leverages the LDIF provenance metadata as quality indicators to produce quality assessment scores. However, despite its nice features, it only targets data fusion based on user-configurable conflict resolution tasks. Moreover, since Sieve's main input is provenance metadata, it is limited to domains that can provide such metadata along with their data.
SWIQA  is a framework providing policies or formulas controlling information quality assessment. It is composed of three layers: data acquisition, query and ontology layers. It uses query templates based on the SPARQL Inferencing Notation (SPIN) to express quality requirements. The queries are built to compute weighted and unweighted quality scores. At the end of the assessment, it uses vocabulary elements to annotate important values of properties and classes, assigning inferred quality scores to ontology elements and classifying the identified data quality problems.
There are several quality issues that can be difficult to spot and fix automatically. In  the authors highlight the fact that the RDFification process of some data can be more challenging than others, leading to errors in the Linked Data provisioning process that needs manual intervention. This can be more visible in datasets that have been semi-automatically translated to RDF from their primary source (the best example for this case is DBpedia ). The authors introduce a methodology to adjust crowdsourcing input from two types of audience:
- Linked Data experts, researchers and enthusiasts through a contest to find and classify erroneous RDF triples
- Crowdsourcing through the Amazon Mechanical Turk.
TripleCheckMate is a crowdsourcing tool used by the authors to carry out their assessment, supported by semi-automatic quality verification metrics. The tool allows users to select resources and to identify and classify possible issues according to a pre-defined taxonomy of quality problems. It measures inter-rater agreement, meaning that the selected resources are checked multiple times. These features turn out to be extremely useful for analyzing the performance of users and allow better identification of potential quality problems. TripleCheckMate is used to identify accuracy issues in the object extraction (completeness of the extraction value for object values and data types), relevancy of the extracted information, representational consistency and interlinking with other datasets.
Luzzu  is a generic Linked Data quality assessment framework. It can be easily extended through a declarative interface to integrate domain specific quality measures. The framework consists of three stages closely corresponding to the methodology in . They believe that data quality cannot be tackled in isolation. As a result, they require domain experts to identify quality assessment metrics in a schema layer. Luzzu is ontology driven. The core vocabulary for the schema layer is the Dataset Quality Ontology (daQ) . Any additional quality metrics added to the framework should extend it.
RDFUnit is a tool centered around the definition of data quality integrity constraints . The input is a defined set of test cases (which can be generated manually or automatically) presented in SPARQL query templates. One of the main advantages for this approach is the ability to discover quality problems beyond conventional quality heuristics by encoding domain specific semantics in the test cases.
LiQuate  is based on probabilistic models to analyze the quality of data and links. It consists of two main components: A Bayesian Network builder and an ambiguity detector. They rely on data experts to represent probabilistic rules. LiQuate identifies redundancies (redundant label names for a given resource), incompleteness (incomplete links among a given set of resources) and inconsistencies (inconsistent links).
Quality Assessment of Data Sources (Flemming's Data Quality Assessment Tool) calculates data quality scores based on manual user input. The user should assign weights to the predefined quality metrics and answer a series of questions regarding the dataset. These include, for example, the use of obsolete classes and properties by defining the number of described entities that are assigned disjoint classes, the usage of stable URIs and whether the publisher provides a mailing list for the dataset. The main disadvantage for using this tool is the manual intervention which requires deep knowledge in the dataset examined. Moreover, the tool lacks support for several quality concerns like completeness or consistency.
LODGRefine  is the Open Refine of Linked Data. It does not act as a quality assessment tool, but it is powerful in cleaning and refining raw instance data. LODGRefine can help detect duplicates, empty values, spot inconsistencies, extract Named Entities, discover patterns and more. LODGRefine helps in improving the quality of the dataset by improving the quality of the data at the instance level.
Automatic Ranking Approaches
The Project Open Data Dashboard tracks and measures how US government websites implement the Open Data principles to understand the progress and current status of their public data listings. A validator analyzes machine readable files e.g., JSON files for automated metrics like the resolved URLs, HTTP status and content-type. However, deep schema information about the metadata is missing like description, license information or tags.
Similarly on the LOD cloud, the Data Hub LOD Validator gives an overview of Linked Data sources cataloged on the Data Hub. It offers a step-by-step validator guidance to check a dataset completeness level for inclusion in the LOD cloud. The results are divided into four different compliance levels from basic to reviewed and included in the LOD cloud. Although it is an excellent tool to monitor LOD compliance, it still lacks the ability to give detailed insights about the completeness of the metadata and overview on the state of the whole LOD cloud group and is very specific to the LOD cloud group rules and regulations.
The basic idea behind link assessment tools is to provide rankings for datasets based on the cardinality and types of their relationships with other datasets. Traditional link analysis has proven to be an effective way to measure the quality of Web documents in search. Algorithms like PageRank and HITS became successful based on the assumption that a Web document has higher importance or rank if it has more incoming links than other Web documents. However, the basic assumption that links are equivalent does not suit the heterogeneous nature of links in the Linked Open Data. Thus, the previous approaches fall short of providing reliable rankings, as the types of the links can have a direct impact on the ranking computation.
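For readers unfamiliar with link analysis, here is a compact sketch of plain PageRank computed by power iteration over a small link graph. It deliberately treats every link as equal, which is exactly the assumption criticized above for Linked Data, so it illustrates the baseline rather than any of the Linked Data adaptations. The damping factor of 0.85 and the fixed iteration count are conventional choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up graph: "a" and "b" both link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph "c" ends up with the highest rank because it has the most incoming links, while "b", which nothing links to, ends up with the lowest.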
The first adaptation of PageRank for Semantic Web resources was the Ontology Rank algorithm implemented in the Swoogle search engine. It uses a rational random surfing model that takes into account the different types of links between discovered sets and computes rankings based on three levels of granularity: documents, terms and RDF graphs. ReConRank rankings are computed at query time based on two levels of granularity: resources and context graphs. DING adapted PageRank to rank datasets based on their interconnections. DING can also automatically assign weights to different link types based on the nature of the predicate involved in the link.
Broken links are a major threat to Linked Data. They occur when resources are removed, moved or updated. DSNotify is a framework that informs data consumers about the various types of events that occur on data sources. Its approach is based on an indexing infrastructure that extracts feature vectors and stores them in an index. A monitoring module detects events on sources and writes them to a central event log, which pushes notifications to registered applications.
LinkQA is a fully automated approach that takes a set of RDF triples as input and analyzes it to extract topological measures (links quality). However, the authors depend on only five metrics to determine the quality of data (degree, clustering coefficient, centrality, sameAs chains and descriptive richness through sameAs).
Provenance-based assessment methods are an important step towards transparency of data quality in the Semantic Web. In  the authors use a provenance model as an assessment method to evaluate the timeliness of Web data. Their model identifies types of "provenance elements" and the relationships between them. Provenance elements are classified into three types: actors, executions and artifacts. The assessment procedure is divided into three steps:
- Creating provenance graph based on the defined model
- Annotating the graph with impact values
- Calculating the information provenance-based assessment metrics to support quality assessment and repair in Linked Open Data. They rely on both data and metadata and use indicators like the source reputation, freshness and plausibility.
In  the authors introduce the notion of naming authority which connects an identifier with the source to establish a connection to its provenance. They construct a naming authority graph that acts as input to derive PageRank scores for the data sources.
Sindice uses a set of techniques to rank Web data. It uses a combination of query-dependent and query-independent rankings implemented in the Semantic Information Retrieval Engine (SIREn) to produce a final entity rank. The query-dependent approach rates individual entities by aggregating the score of the matching terms with a term frequency - inverse subject frequency (tf-isf) algorithm. The query-independent ranking is done using hierarchical link analysis algorithms. The combination of these two approaches is used to generate a global weighted rank based on the dataset, entity and link ranks.
Queryable End-point Quality
The availability of Linked Data is highly dependent on the performance of its queryable end-points. The standard query language for Semantic Web resources is SPARQL. As a result, we focus on tools measuring the quality of SPARQL endpoints. One study measures the discoverability of SPARQL endpoints by analyzing how they are located and the metadata used to describe them. The authors also analyze endpoint interoperability by identifying which features of SPARQL 1.0 and SPARQL 1.1 are supported, and assess endpoint efficiency by testing the time taken to answer generic, content-agnostic SPARQL queries over HTTP.
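Timing a generic, content-agnostic query over HTTP can be sketched as follows. This is only an illustration of the idea, not the tooling used in the study above: the query text, the function names and the injectable `fetch` parameter are assumptions, and the endpoint URL is assumed to carry no query string of its own.

```python
import time
import urllib.parse
import urllib.request

# A content-agnostic query: it makes no assumption about the data served.
GENERIC_QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"


def build_query_url(endpoint, query=GENERIC_QUERY):
    """Encode the query as an HTTP GET URL using the 'query' parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def http_fetch(url, timeout=30):
    """Default fetcher: issue the GET request and return the response body."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def time_query(endpoint, query=GENERIC_QUERY, fetch=http_fetch):
    """Return (elapsed_seconds, ok) for one query against the endpoint."""
    url = build_query_url(endpoint, query)
    start = time.perf_counter()
    try:
        fetch(url)
        ok = True
    except Exception:
        # Any network or HTTP failure counts as an unavailable endpoint.
        ok = False
    return time.perf_counter() - start, ok
```

Passing a custom `fetch` makes it possible to measure an endpoint through a different HTTP client, or to exercise the timing logic without network access.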
An Extensible Objective Quality Assessment Framework
Looking at the list of objective quality indicators, we found that a large number of them can be examined automatically from the dataset metadata found in data portals. As a result, we have chosen to extend Roomba, a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. Roomba is built as a Command Line Interface (CLI) application using Node.js. Instructions on installing and running the framework are available on its public GitHub repository.
Roomba's main steps are the following: (i) data portal identification; (ii) metadata extraction; (iii) instance and resource extraction; (iv) profile validation; (v) profile and report generation. Roomba's advantages lie in being easy to extend, as it uses a modular pluggable approach, and in already performing several of the pre-processing steps needed to fetch, sample, cache and validate dataset metadata.
In our framework, we have presented 30 objective quality indicators related to dataset and link quality. The remaining 34 indicators are related to entity and model quality and cannot be checked through the attached metadata. We have also excluded security-related quality indicators, as they require special protocol checks that are outside the scope of our extension. The Roomba quality extension is able to assess and score 23 of the remaining indicators (82%).
We have extended Roomba with seven submodules that check the dataset quality indicators listed below. Some indicators have to be examined against a finite set. For example, to measure quality indicator no. 3 (having different data access points), we need a defined set of access points in order to calculate a quality score. Since Roomba runs on CKAN-based data portals, we built our quality extension to calculate the scores against the CKAN standard model.
- [QI.1] Check if there is a valid metadata file by issuing a package_show request to the CKAN API
- [QI.2] Check if the format field for the dataset resources is defined and valid
- [QI.3] Check the resource_type field with the following possible values: file, file.upload, api, visualization, code, documentation
- [QI.4] Check the resources
- [QI.5] Check the resources
- [QI.6] Check the mimetype fields for resources
- [QI.7] Check if the dataset has a topic tag and if it is part of a valid group in CKAN
- [QI.9] Check if the dataset and all its resources have a valid URI
- [QI.18] Check if there is a dereferenceable resource with a description containing the string dump
- [QI.19] Check if there is a dereferenceable resource with
- [QI.20] Check if all the links assigned to the dataset and its resources are dereferenceable
- [QI.21] Check if the dataset contains valid
- [QI.22] Check if the license_url is dereferenceable
- [QI.24] Check if the dataset and its resources contain the following metadata fields: metadata_created, metadata_modified, revision_timestamp, cache_last_updated
- [QI.25] Check if the content-type extracted from a valid HTTP request is equal to the corresponding
- [QI.26] Check if the content-length extracted from a valid HTTP request is equal to the corresponding
- [QI.28,29] Check that all the links are valid HTTP scheme URIs
- [QI.37] Check if there is at least one resource with a format value corresponding to one of: example/rdf+xml, example/turtle, example/ntriples, example/x-quads, example/rdfa, example/x-trig
- [QI.39] Check if the dataset and its tags and resources contain general metadata: id, name, type, title, description, URL, display_name, format
- [QI.40] Check if the dataset contains valid
- [QI.44] Check if the dataset and its resources contain provenance metadata: maintainer, owner_org, organization, author, maintainer_email, author_email
- [QI.46] Check if the dataset and its resources contain versioning information
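A few of the checks above can be sketched against a dataset dictionary shaped like a CKAN package. This is an illustrative sketch rather than Roomba's actual implementation (which is written in Node.js): the function names are hypothetical, QI.2 is reduced to "format is defined", and QI.24 and QI.44 are checked at the dataset level only.

```python
from urllib.parse import urlsplit

FRESHNESS_FIELDS = [
    "metadata_created", "metadata_modified",
    "revision_timestamp", "cache_last_updated",
]
PROVENANCE_FIELDS = [
    "maintainer", "owner_org", "organization",
    "author", "maintainer_email", "author_email",
]


def missing_fields(record, fields):
    """Return the fields that are absent or empty in a metadata record."""
    return [field for field in fields if not record.get(field)]


def is_http_uri(value):
    """True if value parses as an http(s) URI with a host part."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_dataset(dataset):
    """Collect the violations of a few indicators for one dataset."""
    resources = dataset.get("resources", [])
    return {
        "QI.2": [r.get("id") for r in resources if not r.get("format")],
        "QI.24": missing_fields(dataset, FRESHNESS_FIELDS),
        "QI.28,29": [
            r.get("url") for r in resources if not is_http_uri(r.get("url"))
        ],
        "QI.44": missing_fields(dataset, PROVENANCE_FIELDS),
    }


# Hypothetical dataset record used only to exercise the checks.
sample_dataset = {
    "metadata_created": "2014-01-01T00:00:00",
    "metadata_modified": "2014-06-01T00:00:00",
    "author": "Jane Doe",
    "maintainer": "",
    "resources": [
        {"id": "r1", "format": "RDF", "url": "http://example.org/dump.rdf"},
        {"id": "r2", "format": "", "url": "ftp://example.org/data"},
    ],
}
report = check_dataset(sample_dataset)
```

Each entry in the returned dictionary lists the offending resources or missing fields, which is the raw material a scoring step needs to count violations.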
Quality Score Calculation
A CKAN dataset model describes four main sections in addition to the core dataset's properties. These sections are:
- Resources: The distributable parts containing the actual raw data. They can come in various formats (JSON, XML, RDF, etc.) and can be downloaded or accessed directly (REST API, SPARQL endpoint).
- Tags: Provide descriptive knowledge on the dataset content and structure. They are used mainly to facilitate search and reuse.
- Groups: A dataset can belong to one or more groups that share common semantics. A group can be seen as a cluster or a curation of datasets based on shared categories or themes.
- Organizations: A dataset can belong to one or more organizations controlled by a set of users. Organizations are different from groups as they are not constructed by shared semantics or properties, but solely on their association to a specific administration party.
A CKAN portal contains a set of datasets $D = \{D_1, \dots, D_n\}$. For each dataset $D_i$, we denote its sets of resources, groups and tags by $R_i$, $G_i$ and $T_i$ respectively.
Our quality framework contains a set of measures $M = \{M_1, \dots, M_m\}$. We denote the set of quality indicators for a measure $M_j$ by $Q_j = \{q_1, \dots, q_k\}$. Each quality indicator has a weight, a context and a score. Each indicator is applied to one or more of the resources, tags or groups, so its context is drawn from $R_i \cup G_i \cup T_i$.
The quality indicator score is based on the ratio between the number of violations and the total number of instances where the rule applies, multiplied by the specified weight for that indicator. In some cases, the quality indicator score is a boolean value (0 or 1), for example when checking if there is a valid metadata file [QI.1] or if the license_url is dereferenceable [QI.22].
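The indicator score described above can be sketched in a few lines. The function names are hypothetical, and treating a failed boolean check as a score of 1 (a full violation) is an assumption that follows from reading the score as an error ratio.

```python
def indicator_score(violations, total, weight=1.0):
    """Weighted error ratio: (violations / total) * weight."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= violations <= total:
        raise ValueError("violations must be between 0 and total")
    return violations / total * weight


def boolean_indicator_score(passed, weight=1.0):
    """Boolean indicators are the special case of a single instance."""
    return indicator_score(0 if passed else 1, 1, weight)


# Hypothetical example: 3 of 12 resources violate the rule, weight 1.
score = indicator_score(3, 12)
```

With all weights set to 1, the score is simply the fraction of instances that violate the rule.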
The indicator score is an error ratio. A quality measure score should reflect the alignment of the dataset with the quality indicators. The quality measure score M is calculated by dividing the sum of the weighted quality indicator scores by the total number of instances in its context, as the following formula shows:
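Read literally, the two descriptions above correspond to something like the expressions below. This is a reconstruction from the prose only, and every symbol other than M is an assumption: $V_i$ is the number of violations of indicator $i$, $N_i$ the number of instances where its rule applies, $w_i$ its weight, $k$ the number of indicators in the measure, and $|C|$ the total number of instances in the measure's context.

```latex
q_i = \frac{V_i}{N_i} \cdot w_i
\qquad\qquad
M = \frac{\sum_{i=1}^{k} q_i}{|C|}
```

Since each $q_i$ is an error ratio, a report that presents a measure as a quality percentage would presumably show the complement of the aggregated error.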
Evaluation & Motivation
In our evaluation, similarly to Roomba we focused on two aspects: i) quality profiling correctness which manually assesses the validity of the errors generated in the report, and ii) quality profiling completeness which assesses if Roomba covers all the quality indicators above. The motivation behind these two metrics is to assess if Roomba's extension can generate accurate and reliable reports that reflect the objective quality of the examined dataset.
To measure profile correctness, we need to make sure that the issues reported by Roomba are valid. On the dataset level, we chose five datasets from the LOD Cloud.
After running Roomba and examining the results on the selected datasets and groups, we found that our framework provides 100% correct results at the individual dataset level. Roomba's aggregation has been evaluated previously, thus we can infer that the quality profiler at the group and portal level also produces correct profiles.
We analyzed the completeness of our framework by manually constructing a synthetic set of profiles. These profiles cover the indicators listed above. After running our framework on each of these profiles, we measured the completeness and correctness of the results. We found that our framework indeed covers all the quality problems discussed. This result is expected, as we have specifically tailored Roomba to cover all the previously mentioned indicators.
Experiments and Analysis
In this section, we present the experiments conducted using the proposed framework. The listing below shows an excerpt of the generated quality report. All the experiments are reproducible with Roomba and their results are available on its GitHub repository. We ran the framework on the LOD cloud, which contained 259 datasets at the time of writing this post. We ran the instance and resource extractor in order to cache the metadata files for these datasets locally, and then ran the quality assessment process, which took around two hours on a machine with a 2.6 GHz Intel Core i7 processor and 16 GB of DDR3 memory. In this experiment, we assumed that all the quality indicator weights are equal and set to 1.
We found that licensing, availability and comprehensibility had the worst quality measure scores: 19.59%, 26.22% and 31.62% respectively. On the other hand, the LOD cloud datasets have good quality scores for freshness, correctness and provenance, with an average of around 75% for each of those measures.
The error percentage is the inverse of quality. For example, 86.3% of the dataset resources do not have information about their size, which means that only 13.7% of the datasets are considered of good quality for this indicator. After examining the results, we notice that the worst quality indicator scores are for the comprehensibility measure, where 99.61% of the datasets did not have a valid exemplary RDF file [QI.37] and did not define a valid point of contact [QI.40]. Moreover, we noticed that 96.41% of the datasets' queryable endpoints (SPARQL endpoints) failed to respond to direct queries [QI.19]. After careful examination, we found that the cause was an incorrect assignment of metadata fields. Data publishers specified the resource format field as api instead of specifying the resource_type field.
=================================================================================
Dataset Quality Report
=================================================================================
completeness quality Score : 50.22%
availability quality Score : 26.22%
licensing quality Score : 19.59%
freshness quality Score : 79.49%
correctness quality Score : 72.06%
comprehensibility quality Score : 31.62%
provenance quality Score : 74.07%
Average total quality Score : 50.47%
=================================================================================
Quality Indicators Average Error %
=================================================================================
Quality Indicator : Supports multiple serializations: 11.35%
Quality Indicator : Has different data access points: 19.31%
Quality Indicator : Uses datasets description vocabularies: 88.80%
Quality Indicator : Existence of descriptions about its size: 86.30%
Quality Indicator : Existence of descriptions about its structure: 83.67%
To drill down into the availability issues, we generated a metadata profile assessment report using Roomba's metadata profiler. We found that 25% of the dataset access information (the dataset URL and any URL defined in its groups) has issues (missing or unreachable URLs). Three datasets (1.15%) did not have a URL defined, while the URLs defined for 45 datasets (17.3%) were not accessible at the time of writing this post. Out of the 1068 defined resources, 31.27% were not reachable. All these issues resulted in a 26.22% average availability score. This can severely affect the usability of those datasets, especially in an enterprise context.
We notice that there is a plethora of tools (syntactic checkers or statistical profilers) that automatically check the quality of information at the entity level. Moreover, various tools can automatically check the models against the objective quality indicators mentioned. OOPS! covers all of them, with additional support for other common modeling pitfalls. PoolParty also covers a wide set of those indicators, but it targets SKOS vocabularies only. However, we notice a lack of automatic tools to check dataset quality, especially for its completeness, licensing and provenance measures. Roomba covers most of the quality indicators, with a focus on completeness, correctness, provenance and licensing. Roomba is not able to check the existence of information about the kind and number of used vocabularies [QI.8], license permissions, copyrights and attributes [QI.23], an exemplary SPARQL query [QI.38] or the usage of a provenance vocabulary [QI.45], and it is not able to check the dataset for syntactic errors [QI.27].
These shortcomings are mainly due to the limitations of the CKAN dataset model. However, thanks to the modularity of Roomba, syntactic checkers and additional modules to examine vocabulary usage can easily be integrated into Roomba to address [QI.27], [QI.8] and [QI.45]. Roomba's metadata quality profiler can address [QI.23], as we have manually created a mapping file standardizing the set of possible license names and their information. We have also used the open source and knowledge license information to normalize license information and add extra metadata like the domain, maintainer and open data conformance.
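The license mapping idea can be sketched as a lookup from the free-text names found in metadata to a canonical record. The entries below are illustrative assumptions, not the contents of Roomba's actual mapping file, and the function name is hypothetical.

```python
import re

# Illustrative canonical records (assumed values, not Roomba's mapping file).
CANONICAL_LICENSES = {
    "cc-by": {
        "title": "Creative Commons Attribution",
        "open_data_conformance": "approved",
    },
    "odc-pddl": {
        "title": "Open Data Commons Public Domain Dedication and License",
        "open_data_conformance": "approved",
    },
}

# Free-text variants as they might appear in dataset metadata.
LICENSE_ALIASES = {
    "cc-by": "cc-by",
    "cc by": "cc-by",
    "creative commons attribution": "cc-by",
    "odc-pddl": "odc-pddl",
    "pddl": "odc-pddl",
}


def normalize_license(raw_name):
    """Map a free-text license name to a canonical record, or None."""
    # Lowercase, trim, and collapse whitespace/underscores before lookup.
    key = re.sub(r"[\s_]+", " ", (raw_name or "").strip().lower())
    license_id = LICENSE_ALIASES.get(key)
    if license_id is None:
        return None
    return {"id": license_id, **CANONICAL_LICENSES[license_id]}
```

Names that do not match any alias return None, which leaves the decision on how to report an unknown license to the caller.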
Table of Contents
- Related Work
- Objective Linked Data Quality Classification
- Linked Data Quality Tools
- Information Quality
- Modeling Quality
- Semi-automatic Approaches
- Automatic Approaches
- Dataset Quality
- Manual Ranking Approaches
- Crowd-sourcing Approaches
- Semi-automatic Approaches
- Automatic Ranking Approaches
- Queryable End-point Quality
- An Extensible Objective Quality Assessment Framework
- Quality Indicators
- Quality Score Calculation
- Evaluation & Motivation
- Profiling Correctness
- Profiling Completeness
- Experiments and Analysis