Structured data and Rich Snippets
Structured data is an efficient way to embed information in web pages, that search engines can use to propose better search results to users: more precise results, which will better match the search of the user, and also more readable results, with an adapted presentation layer.
As we have recently launched our own structured data validator, we felt it was time to give some insight about schema.org structured data to our customers.
Some history - where does schema.org come from?
We will try to avoid the traditional introduction about schema.org structured data, which usually mixes thoughts about the Semantic Web, RDF and the concept of ontologies.
The World Wide Web, as we know it, was created in the early nineties. In a few words, one of the goals of its inventor, Tim Berners Lee, was to ease up the process of sharing knowledge among humans and machines. If the usual Web is unstructured by essence (the design, the words, the presentation layer plays a major role), Berners Lee soon pointed the need for a way to express information in a formal way, that would help process deeper searches and answer more complex questions than just "buy cheap books".
Today, when one person posts a notice on a Web site to sell, say, a yellow car, it is almost impossible for another person to find it. Searching for a "yellow car for sale in Massachusetts" results in a useless huge list of pages that happen to contain those words, when in fact the page I would want may be about a "Honda, good runner, any good offer" with a Boston phone number. The search engine doesn't understand the page, because it is written for a human reader with a knowledge of English and a lot of common sense.
This the concept at the roots of the so-called "Semantic Web", which Wikipedia presents as "an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable".
The W3C created several Web standards - called Recommendations - to set up a framework allowing to disambiguate knowledge concepts, and express complex information in the most precise way possible. During the last 20 years, the W3C worked on a ton of specifications in domains related to the Sematic Web - 301 as of today have reached the "Recommendation" level.
Some domains have adopted these standards to create professional tools (think of medical research, libraries, research laboratories, museums, etc.), but the huge part of the "public" Web does not use the concepts of the Semantic Web on a daily basis.
In 2010-2011, a consortium of leading Web players decided to come together to build a new vocabulary, with a very pragmatic objective, which would allow them to easily express concrete concepts from real life. This is how Google, Microsoft, Yahoo and Yandex created schema.org, a new initiative that started with a vocabulary containing ~300 concepts only, ranging from CreativeWorks (Books, Paintings, Webpages, etc.), Events (BusinessEvent, MusicEvent, TheaterEvent, etc.), Organizations (Corporations, LocalBuisinesses, etc.), Persons, Places (AdministrativeAreas, CivicStructures, etc.) or Products.
Since then, the schema.org vocabulary has grown a bit and now allows to express about 900 different concepts:
Expressing concepts with schema.org
Formal point of view
schema.org is a vocabulary, which means that it defines:
- data types (called
Class
) which allow to define objects, information of different nature. Classes are hierarchically organized, through multiple inheritance. For example, the classAudiobook
extends both the classesBook
andAudioObject
. - relations between these objects, through
Properties
. For example, anAudiobook
object can have aauthor
property, which can be of the typesOrganization
orPerson
.
The usual formal representation of schema.org structured data is done using oriented graphs, where the nodes are Objects (Classes instances) and the oriented links describe properties.
Syntaxes and formats
There are several ways to embed schema.org structured data in Web pages. The three most common serialization formats are:
- JSON-LD, which is the de-facto standard, recommended by Google. JSON-LD is a simple json serialization. It is easy to read, easy to produce, and can be included in any part of the webpage, without disturbing its design;
- Microdata is another popular way of integrating structured data within existing web pages. It uses HTML attributes to add properties and values to existing nodes of the DOM tree. Because the structured data are mixed up in HTML nodes, it is less easy to read (and to produce) than the JSON-LD serialization;
- RDFa is less used. It is a HTML5 extension, which adds HTML attributes that allow to express the schema.org metadata.
Some examples of real-world usage
At the time of writing, schema.org proposes almost 900 Classes and about 1400 Property names, and it evolves very frequently - the version 7.0 of schema.org, published on 2020/03/17, introduced for example the SpecialAnnouncement
class, as a contribution to the global response to the COVID-19 Coronavirus pandemic. In 2020, schema.org published 4 other new major versions.
Of course, most of the types described by schema.org are very specialized and should be used in very precise cases. You should, however, prefer specific types whenever possible instead of their generic equivalent (eg. if you Web page is about a train station, use the TrainStation
class instead of the generic Place
class.
Publishing structured data requires attention and effort, so you need to choose structured data that make sense for your business and for the main entities represented in the Web page. Adding dozens of structured data objects to a Web page is not necessarily a good idea, and in fact the real challenge, as it is often the case in SEO, is to remain relevant and measured in the choices you make.
In other words, just target your objectives to add useful structured data. If, for example, you want to facilitate inbound contacts, it is a good idea to add your company's contact information through an Organization
object, as this could allow a search engine to add action buttons to visit your Website, call you, get directions to your offices, etc.
CreativeWork
CreativeWork
is designed to expose all the intellectual or creative productions: books, movies, etc. Embedding such information in a Web page allows the consumers (a search engine, etc.) to nicely display this information in the Search Engine results which, of course, may encourage users to click on your link.
See for example here a search engine result to IMDB: the rich snippet could be generated thanks to the data extraction operated based on the structured data exposed in the Web page.
Events
schema.org includes an extensive support for many types of events: BroadcastEvent
, TheaterEvent
, SportsEvent
, MusicEvent
, Festival
, BusinessEvent
are some examples of the detailed classes that you could want to expose in your Web page, if it is about such an event.
In a schema.org Event
, you can configure its dates, its location, its maximal number of attendees, its status (scheduled, postponed, etc.), the type of performance associated to the event, tickets offers, etc. All these metadata allow structured data consumers to build attractive rich snippets, which could help increase the interest in taking part to the Event.
Products
Of course, exposing structured data for products is particularly interesting for online e-commerce stores, as it allows the product catalog to be shared with external websites and, potentially, generate more sales. Google Images embeds product image results, for example, as well as several other metadata extracted from structured data:
When including structured data in your Web pages, try to be comprehensive and accurate. Remember that you are doing it for your users, and they want to get the most useful information possible. For a product, indicate its availability, the price and its unit, the sku
(which will allow the user to compare with other offers for the same product), one or more photos of the product, etc.
Web page interactions
The schema.org Action
(and inherited) classes allow to define the availability of interactions with a subject.
For example, the following snippet can be safely and unambiguously interpreted by a consumer as "For the WebSite available at http://example.com/, it is possible to perform a Search: send a HTTP request to a URL of the form http://example.com/search?&q={query}
, where {query}
is a required parameter". With such an information, any consumer of this website can dynamically build links to internal search result pages for this website:
{
"@context": "https://schema.org",
"@type": "WebSite",
"url": "http://example.com/",
"potentialAction": {
"@type": "SearchAction",
"target": "http://example.com/search?&q={query}",
"query": "required"
}
}
schema.org describes several action types: ReviewAction
(the action of sending a review), BuyAction
(the action of buying an item, PlayAction
(the action of playing / inviting someone to play with), WatchAction
(the action of watching a movie, for example).
Structured data consumers have requirements
From a purely formal point of view, schema.org does not declare required properties. For example, it is possible to state that an Audiobook
exists, without declaring its title, its author, its encoding format or its duration. It is up to the structured data writer to give less or more details about the items that he wants to specify. Of course, the more detailed information you give about an object, the more complete, interesting or useful the uses of this information may be. This is the reason why schema.org structured data consumers (aka. Search Engines, but it could also be a Marketplace software, a Newsfeed aggregator, etc.) have started to publish a list of their requirements on the structured data objects that they understand and use.
Google maintains a Structured Data Search gallery which lists all the main datatype supported by Google, that may be displayed as rich snippets in search results. For each of the supported types, Google explains which properties are required or recommended.
These requirements change often (several times a year) and are not always consistent from one language to another in Google documentation.
While ignoring a property for a structured data object is not a formal error and will not penalize your SEO ranking, it could simply result in your annotations being ignored - which is obviously not what you want when adding structured data to your pages! Staying up to date, and verifying frequently that the structured data requirements are met for your target consumer(s) is the only way to be safe here. In other words: check frequently that your structured data is valid!
Please note that, while ignoring an attribute is not a serious mistake, adding wrong structured data on purpose is strongly discouraged by Google (forged Reviews, etc.). Google warns: "If your site violates one or more of these guidelines, then Google may take manual action against it."
Validate and test schema.org structured data
Testing and validating structured data is quite complicated, due to the changing specification and consumer requirements.
Several tools exist, that can extract and validate structured data from web pages. Here is a short list:
- Google Structured Data Testing Tool is a tool provided by Google, going to be deprecated in favor of the more simplistic tool "Rich Results Test";
- Bing Markup Validator is less advanced than the tools proposed by Google, but may help;
- redirection.io structured data validator: we provide a free and exhaustive Structured data validation tool, which tests structured data for their formal validity and against Google requirements. This validator is included in the website crawler that we propose to our paid customers.
There are other more academic tools, with less user-friendly interfaces, and they often do not indicate missing fields, wrong values, or miss some documentation.
Some subtleties and points of attention
Correctly writing valid schema.org structured data can somehow be complicated. Even more, there are some subtleties to know when you want to manipulate schema.org with an advanced level.
- one object can be described as a mix of several types (aka. types composition), on the fly. For example, you can mix the
Festival
and theTouristAttraction
classes to create an hybrid type, if you want for example to describe a very touristic music festival:{ "@context": "https://schema.org", "@type": ["Festival", "TouristAttraction"], "name": "Very touristic festival", "maximumAttendeeCapacity": 1000, // comes from Festival "touristType": [ // comes from TouristAttraction "Wine tourism", "Cultural tourism" ] }
- schema.org
Action
objects can accept custom properties, suffixed with-input
or-output
- read the schema.org documentation page about actions - every property value (even the ones documented as an object) can be a replaced with a URL
- every property can have multiple values. For example, a
CreativeWork
can have one or moreauthor
:{ "@context": "https://schema.org", "@type": "Book", "name": "Some book title", "author": [ { "@type": "Organization", "name": "Some company name" }, { "@type": "Person", "name": "John Doe" } ] }
- every property object can be replaced with
Role
objects that allow to type the relation. For example, if you want to explain that someone is a CEO of a company, you can replace themember
property, which should be an object of the classPerson
, with anOrganizationRole
object:{ "@context": "https://schema.org", "@type": "Organization", "name": "Tesla", "url": "https://www.tesla.com", "member": { "@type": "OrganizationRole", "member": { "@type": "Person", "name": "Elon Musk" }, "roleName": "CEO", "startDate": "2008" } }
- most of the examples displayed on schema.org are wrong When writing these lines, many examples provided by schema.org in the definition of the vocabulary are wrong (they do not respect the parameters case, their type, some extra parameters are misspelled, etc.). Or, what is not a real issue, these examples do not comply with Google requirements (and we forgive them, Google does not respect his own structured data guidelines either).
If you believe some major information misses in this page, please contact us!