Semantic web

Introduction

With so much data being created and shared on the internet, one of the oldest challenges in building digital infrastructure has been how to consistently attach meaning and context to that data. The semantic web is a set of technologies whose goal is to make all data on the web machine-readable. Its usage allows for a shared understanding around data that enables a variety of real-world applications and use cases.

The challenges to address with the semantic web include:

  • vastness -- the internet contains billions of pages, and existing technology has not yet been able to eliminate all semantically duplicated terms

  • vagueness -- imprecise concepts like 'young' or 'tall' make it challenging to combine different knowledge bases with overlapping but subtly different concepts

  • uncertainty -- precise concepts with uncertain values can be hard to reason about; this mirrors the ambiguity and probabilistic nature of everyday life

  • inconsistency -- logical contradictions create situations where reasoning breaks down

  • deceit -- intentionally misleading information spread by bad actors; this can be mitigated with cryptographic techniques that establish information integrity

Linked data

Linked data is the theory behind much of the semantic web effort. It describes a general mechanism for publishing structured data on the internet using vocabularies like schema.org that can be connected together and interpreted by machines. Using linked data, statements encoded as triples (subject → predicate → object) can be spread across different websites in a standard way. These statements form a substrate of knowledge that spans the entire internet. The reality is that the bulk of useful information on the internet today is unstructured data: data that is not organised in a way that makes it useful to anyone beyond its creators. This is fine when data remains in a single context throughout its lifecycle, but it becomes problematic when trying to share data across contexts while retaining its semantic meaning. The vision for linked data is for the internet to become a kind of global database where all data can be represented and understood in a similar way.
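As a minimal sketch of what this looks like in practice, the snippet below builds a few such triples with Python's rdflib library. The person identifiers are hypothetical; only the schema.org terms come from a real vocabulary.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("https://schema.org/")

g = Graph()
alice = URIRef("https://example.org/people/alice")  # hypothetical identifier

# Each statement is one (subject, predicate, object) triple.
g.add((alice, RDF.type, SCHEMA.Person))
g.add((alice, SCHEMA.name, Literal("Alice")))
g.add((alice, SCHEMA.knows, URIRef("https://example.org/people/bob")))

# Serialise the graph as Turtle, one common RDF syntax.
print(g.serialize(format="turtle"))
```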

https://www.datocms-assets.com/38428/1620707068-linked-data-focus.svg

One of the biggest challenges to realising the vision of the internet as a global database is establishing a common set of underlying semantics that all of this data can share. A proliferation of data becomes much less useful if the data is redundant, unorganised, or otherwise messy and complicated. Ultimately, we need to double down on the use of common data vocabularies and common data schemas. Common data schemas, combined with the security features of verifiable data, will make fraud more difficult and make data easier to transmit and consume, so that trust-based decisions can be made. Moreover, the proliferation of common data vocabularies will help make data portability a reality, allowing data to be moved across contexts while retaining the semantics of its original context.

https://www.datocms-assets.com/38428/1620707105-linked-data-expand.svg

Semantic web technologies

Work on semantic web technology has been under way for decades. The vision for the semantic web has been remarkably consistent throughout its evolution, although the specifics of how to accomplish it, and at what layer, have developed over the years. The W3C's semantic web stack offers an overview of these foundational technologies and the function of each component in the stack.

The ultimate goal of the semantic web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The shared architecture defined by the W3C supports the internet becoming a global database based on linked data. Semantic web technologies enable people to create data stores on the web, build vocabularies, and write rules for handling data. Linked data is powered by technologies such as RDF, SPARQL, OWL, and SKOS.

  • RDF (Resource Description Framework) provides the foundation for publishing and linking your data. It is a standard data model for representing resources on the internet and describing the relationships between pieces of information in a graph format.

  • OWL (Web Ontology Language) is a language used to build data vocabularies, or “ontologies”, that represent rich knowledge or logic.

  • SKOS (Simple Knowledge Organization System) is a standard way to represent knowledge organisation systems, such as classification schemes, in RDF.

  • SPARQL is the query language for the semantic web; it can retrieve and manipulate data stored in an RDF graph. Query languages go hand-in-hand with databases: if the semantic web is viewed as a global database, it is easy to see why one would need a query language for that data. A sketch follows this list.
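To make this concrete, here is a minimal sketch (again using Python's rdflib; the graph and its contents are hypothetical, with schema.org supplying the vocabulary) of loading RDF data and querying it with SPARQL:

```python
from rdflib import Graph

# Hypothetical linked-data statements, expressed in Turtle.
g = Graph()
g.parse(data="""
    @prefix schema: <https://schema.org/> .
    <https://example.org/people/alice> a schema:Person ;
        schema:name "Alice" ;
        schema:knows <https://example.org/people/bob> .
    <https://example.org/people/bob> a schema:Person ;
        schema:name "Bob" .
""", format="turtle")

# SPARQL query: find the names of everyone Alice knows.
query = """
    PREFIX schema: <https://schema.org/>
    SELECT ?name WHERE {
        <https://example.org/people/alice> schema:knows ?person .
        ?person schema:name ?name .
    }
"""
for row in g.query(query):
    print(row.name)  # prints: Bob
```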

When data is enriched with additional context and meaning, more people (and machines) can understand and use it to greater effect.

JSON-LD

JSON-LD is a serialisation format that extends JSON to support linked data, enabling the sharing and discovery of data in web-based environments. It is designed to be isomorphic to RDF, which has broad usability across the web and supports additional technologies for querying and language classification. RDF has been used to manage industry ontologies for the last couple of decades, so a JSON representation of it is particularly useful in applications such as verifiable credentials (VCs).
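As a minimal sketch of that isomorphism (using Python's rdflib, which bundles a JSON-LD parser from version 6 onward; the document itself is hypothetical, with an inline @context so no network fetch is needed), a JSON-LD document can be loaded and re-serialised as ordinary RDF triples:

```python
import json
from rdflib import Graph

doc = {
    "@context": {
        "name": "https://schema.org/name",
        "knows": {"@id": "https://schema.org/knows", "@type": "@id"},
    },
    "@id": "https://example.org/people/alice",
    "name": "Alice",
    "knows": "https://example.org/people/bob",
}

g = Graph()
g.parse(data=json.dumps(doc), format="json-ld")

# The same statements, re-serialised as Turtle: the JSON-LD document
# and the RDF graph carry identical triples.
print(g.serialize(format="turtle"))
```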

The Linked Data Proofs representation of verifiable credentials makes use of a simple security protocol which is native to JSON-LD. The primary benefit of the JSON-LD format used by LD-Proofs is that it builds on a common set of semantics that allow for broader ecosystem interoperability of issued credentials. It provides a standard vocabulary that makes data in a credential more portable as well as easy to consume and understand across different contexts. In order to create a crawl-able web of verifiable data, it is important that we prioritise strong reuse of data schemas as a key driver of interoperability efforts. Without it, we risk building a system where many different data schemas are used to represent exactly the same information, creating the kinds of data silos that we see on the majority of the internet today. JSON-LD makes semantics a first-class principle and is therefore a solid basis for constructing VC implementations.
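For illustration only, here is a rough sketch of the shape of a JSON-LD credential, showing how a shared @context gives its fields common, reusable semantics. The @context URLs are the real W3C credentials context and a schema.org term; the issuer, subject, and claim values are hypothetical.

```python
import json

credential = {
    "@context": [
        "https://www.w3.org/2018/credentials/v1",  # W3C VC data model context
        {"name": "https://schema.org/name"},       # reused schema.org term
    ],
    "type": ["VerifiableCredential"],
    "issuer": "https://example.org/issuers/42",
    "issuanceDate": "2021-05-01T00:00:00Z",
    "credentialSubject": {
        "id": "https://example.org/people/alice",
        "name": "Alice",
    },
    # An LD-Proofs implementation would append a "proof" block here: a
    # signature computed over the canonicalised RDF form of the credential.
}

print(json.dumps(credential, indent=2))
```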

JSON-LD is also widely adopted on the web today, with the W3C reporting that it is used by 30% of the web and Google making it the de facto technology for search engine optimisation. When it comes to verifiable credentials, it is advantageous to extend and integrate the work around VCs with the burgeoning ecosystem of linked data already on the web.