An introduction to library linked data

We recently announced a comprehensive strategy to bring linked data into mainstream library cataloging workflows. It’s a long-term approach, recognizing that most libraries will move to linked data slowly and incrementally—and we’re committed to providing tools and resources to support the transition for everyone.

Working closely with libraries around the world, we know that staff at some libraries are already educating themselves on the topic, piloting linked data services, and taking part in ongoing research. But we also know that many others have a lot of questions. In addition to technical issues, librarians are also wondering how linked data will affect the work they are currently doing. To help, OCLC is rolling out linked data infrastructure and services that meet libraries where they are today and provide meaningful improvement to challenges facing libraries.


What is linked data?

At its simplest, linked data is about connections. It’s a way to organize and connect data on the web so it can be easily, automatically, and programmatically shared and used by various systems and services.

For a brief, more technical introduction, jump down to the end of this post. But the super-short version is that linked data is standardized, machine-readable code, often embedded in ordinary web pages, that computers use to link different concepts by their relationships to each other.

If you look at the “knowledge panel” in a Google result, you’ll often see information about a subject from many sources. That “info card” is populated with linked data from many other sites (including direct links to library resources, using information from WorldCat). Other related linked data sources, including VIAF, Wikidata, and DBpedia, are already being used to connect services and create new applications.
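To make the knowledge panel idea concrete, here is a minimal sketch of the kind of structured markup a site might embed so that search engines can recognize a book and its author. It uses the public schema.org vocabulary serialized as JSON-LD; the `sameAs` URI is a made-up placeholder, not a real identifier.

```python
import json

# A hypothetical schema.org description of a novel. A site would embed the
# serialized JSON inside a <script type="application/ld+json"> tag so that
# crawlers can read it and connect the book to related data elsewhere.
book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Parable of the Sower",
    "author": {
        "@type": "Person",
        "name": "Octavia E. Butler",
        # Illustrative placeholder URI; a real page would point at an
        # authority record such as a VIAF or Wikidata entity.
        "sameAs": "https://example.org/person/octavia-e-butler",
    },
}

print(json.dumps(book, indent=2))
```

The point is not the exact vocabulary but the shape: every value is either a literal or a link to another described “thing.”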

As more related linked data comes online, we’ll see more opportunities for additional library-focused applications. By breaking up the valuable, library-focused data locked in MARC records and publishing it using URIs (Uniform Resource Identifiers), library staff will be able to provide greater context for information and build rich connections across library resources, their communities, and beyond.

How is linked data different? Is it better?

Traditional, fixed data formats—like MARC records—have two major limitations: it’s hard to get useful data from other, nonlibrary sources into library workflows, and it’s hard for potential users of library information to get MARC data into theirs.

The first is a challenge because, as we know, there are many sources of information to help improve the discovery and use of library materials. That could be across campus—in another department or system that is more heavily used by students and researchers—or from experts around the world. The second is a lost opportunity, because library metadata is created by cataloging workers (at libraries and OCLC) who are among the most talented data specialists in the world. Many other industries and areas could benefit from the work they do.

Linked data helps address both challenges. For example, OCLC works with organizations like Google to insert library linked data into their services. These efforts make library materials more visible in places where people search online. And there are opportunities for partners to help do the same in reverse, getting their information into systems and services where library workers and users can connect. For example, linked data makes connecting works across languages much easier, meaning that publishers can direct inquiries in one language to materials available in others.

In both cases, it helps connect library work to the wider web, promoting libraries while improving efficiency.

What about MARC?

If we look at the history of metadata, there’s a consistent record of libraries moving to systems and services that let more people interact with that metadata in more ways.

  • Closed stacks were the ultimate data filter. When users had to ask library staff to fetch resources from a closed room, there was no chance for direct interaction.
  • Shelf browsing, using systems like Dewey and LCC (Library of Congress Classification), allowed users to interact with metadata themselves, making their own choices. Library workers moved from the position of data gatekeepers to being guides, educators, and advocates.
  • Centralized databases, such as WorldCat, connected library catalogs for cooperative record creation and improvement, as well as new discovery and resource sharing options within library-based services.
  • Online access to library databases, in places like WorldCat.org, meant that anyone with access to a web browser could find and use library metadata online. Early OCLC partnerships also meant that library data could—with some additional work—be shared in other online resources.

Linked data is the next step in this evolution. Until now, everything we’ve done has been aimed primarily at making library metadata more accessible to people. Now we’re putting library data out there in a way that’s more accessible to today’s online services, programs, machine learning systems, and artificial intelligence (AI) applications.

MARC will be with us for the foreseeable future. After all, it took nearly 50 years for many libraries to fully make the transition from printed cards to online cataloging. Our plan is to continue to support MARC-based functions while actively building powerful library linked data tools and resources.

Why should I care about linked data today?

As libraries continue to focus on new ways to facilitate the creation and sharing of knowledge, and as the volume and variety of information increases, metadata and metadata expertise are more important than ever. Evolving library data into linked data frees the knowledge in library collections and connects it to the knowledge streams that inform our everyday lives—on the web, through smart devices, and using technologies like AI.

Here are some of the reasons I think you should be excited about what’s happening with linked data today:

  • It allows us to harness the collective expertise of library workers at thousands of institutions. That’s exciting both in terms of partnerships and original research.
  • It synchronizes and enhances library data at scale. WorldCat Entities is a set of centralized data that establishes the context for bibliographic metadata curation. And we’re connecting it to existing systems like the DDC (Dewey Decimal Classification) and FAST (Faceted Application of Subject Technology) to integrate linked data into other library workflows.
  • It helps current systems and workflows through the transition to linked data by integrating data such as WorldCat Entities URIs into WorldCat.
  • We’re creating new tools that will let cataloging workers add linked data to existing records. This will allow for enhanced cataloging applications, record output with identifiers, and soon, the launch of OCLC Meridian, a WorldCat Entities linked data management tool.
  • We’ll also launch a bibliographic editing tool that works seamlessly between BIBFRAME and MARC data, helping to meet the needs of librarians as they transition to non-MARC formats.

There’s a lot to be excited about. And this will be a marathon, not a sprint. But for today? Know that OCLC is working toward a linked data future that supports all libraries as they transition at their own pace and in ways that provide value without impacting current processes.

This is the first of three posts about linked data. Keep an eye on this space, check out the main page for OCLC linked data strategy and news, and sign up for updates on this important subject.


Technical background for linked data

When Tim Berners-Lee and the team at CERN invented the basic protocols for the web in 1989, they proposed three basic technologies to connect people to resources:

  • Uniform Resource Identifiers (URIs) for anything that can be connected on the web; URLs (Uniform Resource Locators)—commonly known as “web page names”—are a type of URI
  • The Hypertext Markup Language (HTML) code used to format documents on the web
  • The Hypertext Transfer Protocol (HTTP), which is used to establish connections between web pages and related assets (pictures, sound, video, apps, and data)

When you—a human user of the web—click on a link that says, for example, “Boston Symphony Orchestra Archives,” you have an expectation that it will take you to another page with related information. The context for that journey is based on how people use documents and links to find and access related resources.

Later, Berners-Lee expanded this, outlining principles to link data between computers rather than people. He proposed that “conceptual things” should each have a URI that serves as an online name and returns data about that thing in a standard format, and that other related things should also be given URIs. In this way, similarly to how people use links, computer programs can move from page to page (URI to URI), using common technology to search for and utilize related information.

The URI for a “thing” (commonly called an “entity,” which could be any object, person, date, concept, place, etc.) is just a web page that has linked data code on it. That code contains information about the subject, and also links to other entities using something called “a triple,” which is just:

[Thing 1] <has this relationship> to [Thing 2]

So, for example:

[Octavia E. Butler] <is the author of> [Parable of the Sower]

That information would be found in a line of code on the page for both Butler and the novel. So, when a computer program finds either page, it will be able to “know” the relationship between those two entities. And when billions of pieces of linked data are published and connected all over the web, it becomes possible to build applications that utilize previously disconnected information in unique and powerful ways.
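The triple pattern above can be sketched in a few lines of code. This is a toy model, assuming triples are stored as plain subject–predicate–object tuples rather than using any particular RDF library:

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
triples = [
    ("Octavia E. Butler", "is the author of", "Parable of the Sower"),
]

def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects(predicate, obj):
    """Work the relationship in reverse: find subjects for a given object."""
    return [s for s, p, o in triples if p == predicate and o == obj]

# A program that reaches either "page" can recover the relationship:
print(objects("Octavia E. Butler", "is the author of"))
# → ['Parable of the Sower']
print(subjects("is the author of", "Parable of the Sower"))
# → ['Octavia E. Butler']
```

Because the same fact is navigable from either end, both the author’s page and the novel’s page can carry it, which is exactly what lets independent programs agree on the relationship.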

For example, another site might publish linked data about where famous people are born, and could have the following triple on the page for Pasadena, California, USA:

[Pasadena, California, USA] <is the birthplace of> [Octavia E. Butler]

And a third application might be pulling data from many sites in order to display interesting travel-related information for vacation planning. Its service could pull linked data from the birthplace site, then search for related, interesting links. So when you use its software to plan a trip to Pasadena, it could follow that linked data to library data and provide links to works by authors from that city.
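Continuing the toy-triple sketch, the travel app’s trick is just chaining relationships across data published by different sources. The source names and triples below are illustrative, not real datasets:

```python
# Linked data from two hypothetical publishers, merged into one pool of
# (subject, predicate, object) triples.
birthplace_site = [
    ("Pasadena, California, USA", "is the birthplace of", "Octavia E. Butler"),
]
library_site = [
    ("Octavia E. Butler", "is the author of", "Parable of the Sower"),
]
pool = birthplace_site + library_site

def objects(triples, subject, predicate):
    """All objects linked to `subject` by `predicate` in `triples`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def works_by_local_authors(city):
    """Chain two hops: city -> authors born there -> works they wrote."""
    works = []
    for author in objects(pool, city, "is the birthplace of"):
        works.extend(objects(pool, author, "is the author of"))
    return works

print(works_by_local_authors("Pasadena, California, USA"))
# → ['Parable of the Sower']
```

Neither publisher knows about the other; the connection emerges only because both describe the same entity (“Octavia E. Butler”) in a shared, machine-readable way.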

The main thing to keep in mind is that linked data is simply computer code on ordinary web pages that provides contextual information about things (“entities”). That data is then read by automated programs that put it together with linked data from other sources to create new applications and services.