Open Images videos enriched with Open Data

For Sound and Vision, in the context of the Dutch Open Data initiative “Nederland opent Data” (The Netherlands Opens Data), I created the basis for the demo that is described in this post. The demo shows how you can play a video in an enriched context, by linking open data sources to terms that are found in speech transcripts rendered from videos. For the Code Camping event, organized by Open Cultuur Data (Open Cultural Data) I extended the demo with newly linked data sets.

Basics
The starting point for this demo application was the reuse and linking of data sets to the Open Images collection, which contains more than 1,500 freely (re)usable videos, mostly old news items from the 1920s through the 1980s. All of these videos are published under Creative Commons licences.

The basis for the application lies in the use of the speech transcripts, which were generated by using automatic speech recognition (ASR) software (from X-MI) on these videos.

The main idea for the demonstration is to contextualise videos while they’re being watched, in order to provide the user with fun, interesting and unexpected background information about what is said in the video.

Demo-Open-Beelden-Open-Cultuur-Data

For example: when Philip Bloemendal (the presenter of the news items) – in a video titled ‘Large parts of Holland completely snowed in’ – says ‘(…) but in several places in Drenthe there (…)’, several blocks of information about Drenthe (a province in The Netherlands) are shown next to the video. Each of these information blocks gets its data from a specific open data source. For the first prototype the data sources used included Google Maps and Wikipedia. To illustrate further: in the example where ‘Drenthe’ was recognized as a concept, the Wikipedia block shows an article about Drenthe, and the Google Maps block shows a map zoomed in on the province of Drenthe.

For the Code Camping event, organized by ‘Hack de Overheid’ (Hack the government), I added two new data sets to the demo: the collections from the Rijksmuseum and the Amsterdam Museum.

How it all works
As mentioned, the main building blocks for this demo are the Open Images videos and the corresponding speech transcripts, which link each word spoken in the video to an exact time code. (Note: automatic speech recognition software is not perfect, which means that not every word in a speech transcript will exactly match the words that were actually spoken.)

Step 1
Because not every word in a sentence is particularly interesting, the first step is to filter out stop words from the speech transcript, such as: articles, prepositions and verb modifiers.
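A minimal sketch of this filtering step, assuming the transcript is available as a list of words with time codes. The `Word` type and the short stop-word list are made up for illustration and are not taken from the demo:

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str     # recognised word from the ASR transcript
    start: float  # time code in seconds at which the word is spoken


# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"de", "het", "een", "in", "op", "van", "maar", "er"}


def remove_stop_words(words: list[Word]) -> list[Word]:
    """Keep only words that are not in the stop-word list (case-insensitive)."""
    return [w for w in words if w.text.lower() not in STOP_WORDS]


transcript = [
    Word("maar", 12.0),
    Word("op", 12.3),
    Word("verschillende", 12.5),
    Word("plaatsen", 13.1),
    Word("in", 13.6),
    Word("Drenthe", 13.8),
]
content_words = remove_stop_words(transcript)
```

Because each remaining word keeps its time code, later steps can still tie results back to the moment the word was spoken.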

Step 2
In the second step, a script is run on the remaining words to sort them by ‘importance’. Importance here is calculated by combining a preset word score (coming from a special lexicon) with how often the word is spoken. In this way, words with a high score and a high frequency end up high in the list.
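The post does not say exactly how score and frequency are combined. The sketch below assumes a simple product; the lexicon values and the default score for unknown words are also assumptions:

```python
from collections import Counter

# Hypothetical lexicon: preset score per word (higher = more interesting).
LEXICON_SCORES = {"drenthe": 5.0, "sneeuw": 3.0, "plaatsen": 1.0}
DEFAULT_SCORE = 1.0  # assumed score for words missing from the lexicon


def rank_by_importance(words: list[str]) -> list[tuple[str, float]]:
    """Return unique words sorted by (lexicon score * frequency), highest first."""
    counts = Counter(w.lower() for w in words)
    scored = {
        word: LEXICON_SCORES.get(word, DEFAULT_SCORE) * freq
        for word, freq in counts.items()
    }
    # Sort by descending score; ties are broken alphabetically for stable output.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_importance(
    ["sneeuw", "Drenthe", "plaatsen", "sneeuw", "Drenthe", "sneeuw"]
)
```

Here ‘drenthe’ (score 5, spoken twice) ranks above ‘sneeuw’ (score 3, spoken three times), which shows how a high lexicon score can outweigh a higher frequency under this assumed formula.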

Step 3
After sorting, the words are used, in order of importance, as query input for the GTAA thesaurus (used by Sound and Vision) and also for Freebase. The latter is a Google service and offers a big collection of interrelated concepts, containing descriptions from a large variety of domains. Freebase can be seen as an extensive thesaurus containing information from a large number of areas of expertise.

When the GTAA or Freebase webservice returns a concept for a query, the concept is put in a list of candidates. After all the words have been processed, this list is filtered using a very simple disambiguation algorithm (i.e. whenever the returned concept consists of more than one word, it is taken out of the list).
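The filtering rule described above is small enough to sketch directly. The candidate labels are examples, not actual GTAA or Freebase output:

```python
def is_single_word(concept: str) -> bool:
    """True when the concept label consists of exactly one word."""
    return len(concept.split()) == 1


# Example candidate labels as they might come back from the thesaurus lookups.
candidates = ["Drenthe", "Amsterdam Museum", "Gorinchem", "Philip Bloemendal"]

# Multi-word concepts are simply dropped from the list of candidates.
filtered = [c for c in candidates if is_single_word(c)]
```

The rule is cheap, but it also discards legitimate multi-word concepts such as ‘Amsterdam Museum’.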

Step 4
In step 4, each of the GTAA and Freebase concepts from the list of candidates is used for querying the open data webservices, which are:

  1. Google Maps (only queried for location type concepts)
  2. Wikipedia
  3. Amsterdam Museum
  4. Rijksmuseum

Each returned result is linked to the time code of the (spoken) word from the speech transcript that was used to find it.
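One way to picture this step is a loop over concepts and sources that skips location-only sources for non-location concepts. In this sketch the lookup functions are stand-ins that return strings rather than calling any real webservice, and all type and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Concept:
    label: str
    is_location: bool
    start: float  # time code (seconds) of the spoken word that produced the concept


@dataclass
class ContextItem:
    source: str
    concept: str
    start: float
    payload: str


# (source name, lookup function, only query for location concepts?)
Source = tuple[str, Callable[[str], Optional[str]], bool]


def collect_context(concepts: list[Concept], sources: list[Source]) -> list[ContextItem]:
    """Query every applicable source per concept and keep the word's time code."""
    items = []
    for concept in concepts:
        for name, lookup, locations_only in sources:
            if locations_only and not concept.is_location:
                continue  # e.g. the map source is skipped for non-locations
            payload = lookup(concept.label)
            if payload is not None:
                items.append(ContextItem(name, concept.label, concept.start, payload))
    return items


# Stand-in lookups; real ones would call the respective webservices.
def fake_maps(label: str) -> Optional[str]:
    return f"map centred on {label}"


def fake_wikipedia(label: str) -> Optional[str]:
    return f"article about {label}"


SOURCES: list[Source] = [
    ("Google Maps", fake_maps, True),
    ("Wikipedia", fake_wikipedia, False),
]

items = collect_context(
    [Concept("Drenthe", True, 13.8), Concept("sneeuw", False, 20.1)], SOURCES
)
```

The museum collections would slot in as two more entries in `SOURCES`, each with its own lookup function.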

(For those interested: the collection from the Amsterdam Museum has three different end-points: Adlib, OAI-PMH and SPARQL. For this demo I used the SPARQL end-point because, unlike OAI-PMH, it does not need to be harvested and indexed before it can be queried. In any case I thought it was a good idea to play around with the Semantic Web again and refresh my SPARQL skills. For the Rijksmuseum, I first harvested the collection via OAI-PMH and then indexed it with Solr. This way the collection can be searched using Lucene queries.)

Step 5
The last step is to send the time-coded context data back to the browser. I do this with a JSON object, which in turn I use as input for Popcorn.js to generate events. These events are linked to an HTML5 video player and make sure the right (context) information is shown in the different blocks/panels in the user interface.
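The structure of the JSON object is not documented in this post. As a rough sketch, the server side could serialise the time-coded results like this; the field names (`start`, `end`, `target`, `text`) and the fixed display duration are assumptions, not the demo's actual format:

```python
import json


def to_timed_events(items: list[dict], display_seconds: float = 10.0) -> str:
    """Serialise context items as a JSON list of events ordered by start time."""
    events = [
        {
            "start": item["start"],                  # when the word is spoken
            "end": item["start"] + display_seconds,  # assumed display window
            "target": item["source"],                # panel that shows the result
            "text": item["payload"],
        }
        for item in sorted(items, key=lambda i: i["start"])
    ]
    return json.dumps({"events": events}, ensure_ascii=False)


payload = to_timed_events(
    [
        {"start": 12.5, "source": "wikipedia", "payload": "Article about Drenthe"},
        {"start": 3.0, "source": "maps", "payload": "Map of Gorinchem"},
    ]
)
```

In the browser, each event would then be registered with the player so that the matching panel is updated between `start` and `end`.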

Because the processing of these five steps takes around 15-20 seconds per video, I store all of the results in .json files. When opening the demo these files are loaded instead of fetching the data live from the web.
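A minimal sketch of such a file cache, assuming one `.json` file per video. The directory layout and the `build` callback, which stands in for the five processing steps, are hypothetical:

```python
import json
from pathlib import Path
from typing import Callable


def load_or_build(video_id: str, cache_dir: str, build: Callable[[str], dict]) -> dict:
    """Return cached context data for a video, running the slow pipeline only once."""
    cache_file = Path(cache_dir) / f"{video_id}.json"
    if cache_file.exists():
        # Cache hit: no webservices are queried.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = build(video_id)  # the slow part (roughly 15-20 seconds per video)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data
```

The trade-off is that cached results go stale when a data source changes; deleting a video's `.json` file forces it to be rebuilt.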

There is still a lot to do
The demo shows what can be done by using concept detection (a.k.a. Named Entity Recognition) in combination with open data sources. For several aspects however (significant) improvements can be made:

Better concept detection
The concept detection as described in this demo could be improved considerably. For instance, concepts that consist of more than one word are not recognized: ‘Amsterdam Museum’ now yields two concepts, ‘Amsterdam’ and ‘Museum’, but the actual concept ‘Amsterdam Museum’ is not found.
Moreover, dedicated Named Entity Recognition (NER) services like DBpedia Spotlight (which has good results for English) should be investigated in order to improve results. For Dutch, however, the search for a decent (open source) solution seems to be ongoing.

Selection of relevant sources for the user
As for the relevance of the ‘context information’ that is currently shown to the user, the selection of data sources still needs a lot of thought. For instance: why would somebody who is watching a video about ‘Holland’s oldest steam-powered pumping station’ be interested in ‘Hens chalice from the Company of Nine’ (found on the basis of the word ‘Gorinchem’, which is a town in The Netherlands)?

Optimizing Popcorn.js usage
The demo was made with an older version of Popcorn.js (v0.7) and therefore doesn’t make full use of the latest features and plugins Popcorn.js has to offer. Future releases of the demo will incorporate the newest version (currently v1.1.1).

In any case the demo does show how speech transcripts of videos can be combined with open data sources and how this can enable (mutual) contextualisation of these sources. The demo will be further enhanced for the ‘Nederland opent Data’ project. Any progress will be reported here!

Jaap Blom | Software engineer | R&D department, Netherlands Institute for Sound and Vision

Reach of Open Images content increased by reuse on Wikipedia

Access to the audiovisual content on Open Images is provided under Creative Commons licences. These licences facilitate the reuse of content in different ways. One of the ways media from Open Images can be reused is on Wikipedia. For this purpose the videos on Open Images are transferred to Wikimedia Commons, the online repository where freely licensed media files used for Wikimedia projects like Wikipedia are stored. In the beginning this was done manually, but the process has since been automated through the Open Images API. Currently, there are more than 1,500 media items from Open Images available on Wikimedia Commons. This means that Open Images is responsible for about 15% of the total number of videos there, which makes Open Images the largest supplier of videos on Wikimedia Commons.

The Wikipedia community uses the videos from Open Images to enrich entries on Wikipedia. For instance, the English article on the ‘Elfstedentocht’ has a video of the Elfstedentocht of 1954:

A video from Open Images on the Wikipedia lemma 'Elfstedentocht'

Besides the reuse of complete videos, derivative works (such as screenshots) are also used. These are used, for example, in articles on famous people, such as this article on Dutch politician Pieter Oud:

A screenshot used as a photo on the lemma 'Pieter Oud'

3 million views
The reach of Open Images content on Wikipedia turns out to be substantial. In May 2011 the Wikipedia articles with media items from Open Images were viewed more than 3 million times. This is almost three times as many as the number of views in December 2010. It is noteworthy that the majority of the views are not on the Dutch Wikipedia, even though most of the videos on Open Images have Dutch subjects and are in Dutch. Of the 3 million views a mere 880,000 were on the Dutch-language Wikipedia. The remaining 2.2 million views were on Wikipedias in other languages. The five Wikipedias where articles with Open Images content got the most views in May 2011 were:

  1. the English Wikipedia
  2. the Dutch Wikipedia
  3. the French Wikipedia
  4. the Portuguese Wikipedia
  5. the Japanese Wikipedia

More than 850 articles on the different Wikipedias make use of content from Open Images.

The article with the most views in May 2011 was Mother’s Day on the English Wikipedia, which was viewed almost 1.5 million times. The video used in this article is used on several Wikipedia sites. Besides the English and the Dutch Wikipedia, it is also used on, for example, the Tibetan and Persian Wikipedias. The Wikipedia articles containing Open Images media with the most views in May 2011 were:

  1. Mother’s Day (EN) 1,445,756 views
  2. AFC Ajax (EN) 121,322 views
  3. AFC Ajax (NL) 111,190 views
  4. Billy Graham (EN) 94,485 views
  5. Giro d’Italia (EN) 73,055 views

Conclusion
These statistics demonstrate that offering their material under a free licence certainly has added value for cultural heritage institutions. For the cultural heritage field it is a sound strategy for opening up collections to a large audience. It also gives the (internet) community a chance to enrich their projects with historic images. This reuse is of course not restricted to Wikipedia. By offering collections under a free licence, institutions turn them into a rich source for (re)use for a large number of cultural, educational and creative purposes.

Open Images prize for best Wiki Loves Monuments video

At the moment Wikipedia articles don’t contain a lot of videos (less than 0.1% of all files on Wikimedia Commons are video files). Open Images would like to change this. Therefore, most videos from Open Images are already automatically mirrored to Wikimedia Commons. To stimulate users to use more video on Wikipedia, Open Images will be handing out a special video prize. The maker of the best video uploaded as part of Wiki Loves Monuments will be awarded a two-year Premium subscription to Spotify, or alternatively an Amazon gift voucher.

Wiki Loves Monuments is a contest organised by Wikimedia, the movement behind Wikipedia. To be eligible for the video prize participants have to upload a video of one or more monuments to Wikipedia in September. The rules are:

  • Self made and self uploaded
  • Uploaded in September 2011
  • Freely licensed
  • Feature one or more monuments

So be creative and enter the contest! The people of Video on Wikipedia have a howto explaining how to post a video to Wikipedia. More information on Wiki Loves Monuments can be found on their website.

Open Images 2011: more content providers, more functionality and expanding reuse on Wikipedia

With this blog post we look back on the past year. How did Open Images contribute to an open collection of audiovisual material and stimulate the reuse of it?

Hundreds of Items Added to the Platform

In 2010 we uploaded hundreds of interesting items to the platform from the historical newsreel collection of the Netherlands Institute for Sound and Vision, reaching the milestone of a thousand items available on the platform on the UNESCO World Day for Audio Visual Heritage in October. In our selection procedure some themes received special attention: sports, performing arts, winter, technology, and Indonesia.

This year Sound and Vision was not the only contributor of content to the platform. Other wonderful additions to Open Images were made by the EYE Film Institute Netherlands, the Institute for Network Cultures and the Dutch National Committee May 4th and 5th.

API Launched

In September Open Images launched its open API. Items published on the platform and their descriptions (metadata) are accessible through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). This enables third parties to retrieve the stored metadata and media files in a structured way, making it easy to reuse material from the platform in their own applications (for example to create a mashup).
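For readers unfamiliar with OAI-PMH: harvesting comes down to plain HTTP requests with a `verb` parameter. The sketch below only builds request URLs. The `ListRecords` verb, the `oai_dc` metadata prefix and the `resumptionToken` argument are part of the OAI-PMH 2.0 specification, while the base URL is a placeholder and not the actual Open Images end-point:

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder end-point; substitute the base URL of the repository to harvest.
BASE_URL = "https://example.org/oai"


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build the first ListRecords request of a harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # e.g. "2010-09-01" for incremental harvesting
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url: str, token: str) -> str:
    """Build a follow-up request; with a resumptionToken no other arguments are allowed."""
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


first_page = list_records_url(BASE_URL, from_date="2010-09-01")
```

A harvester fetches `first_page`, stores the records from the XML response, and keeps requesting `resume_url(...)` for as long as the response contains a non-empty resumption token.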

Video on Wikipedia

Since the start of the project, Open Images has contributed its audiovisual content to Wikimedia Commons to enable reuse of video on Wikipedia, for instance to ‘illustrate’ an article.

At first the ‘donation’ to Wikimedia Commons was a manual process, but in 2010 – in collaboration with Wikimedia Netherlands – we were able to fully automate this process, thanks to the Open Images API. As a result Open Images is now responsible for almost 12% of the video content available on Wikimedia Commons, making it one of the biggest contributors of video that is reusable on Wikipedia.

We are getting more and more insight into the impact of the availability of Open Images material through Wikimedia Commons. We’ve learned that a large proportion is used to enrich over 550 entries on Wikipedia with related audiovisual content. In December 2010, these entries were viewed nearly 1.2 million times. This shows the great potential for the cultural heritage sector to collaborate with the Wikimedia Foundation to reach new and greater audiences within a meaningful context.

New Projects Reusing Open Images

When Open Images was launched in 2009 the material was almost immediately reused within several projects, including the OPEN CITY audiovisual archive of urban life from the Dutch public broadcaster VPRO and the ArtTube video platform about art and design from the Boijmans Van Beuningen Museum in Rotterdam.

In 2010, dozens of projects, small and large, were added to the list. Among them was Picture War Monuments, a location-aware iPhone app that enriches the on-site visit to war monuments with audiovisual heritage, including newsreel footage and oral history video material on the Second World War available through Open Images. Another notable initiative was Image on a Map (‘Beeld in kaart’), a Google Maps mashup for the educational sector in the Netherlands combining several (educational) video sources – including Open Images – within a map interface. With this interface users are able to filter results based on subject (geography or history), location and time period.

What’s Next?

In 2011 the Open Images platform will receive a major update, with both functional and visual improvements. Part of this update is the realisation of portal functionality, allowing third party content providers to build and manage their own entrance to the platform (think: http://partner.openimages.eu). This will, for example, allow them to highlight their own contributions to the platform and to design their presence on the platform according to their own wishes and branding.

The platform functionality is part of a larger campaign we are organising to attract more third party content keepers to contribute to an even larger and more diverse offering of open audiovisual content through Open Images. This campaign will focus on public broadcasters, regional and local archives and broadcasters, institutional archives and business archives.

Finally, we would proudly like to mention our nomination for the Museums and the Web – Best of the Web Award 2011 in the category ‘Innovative / Experimental’.

First EUscreen International Conference on Content Selection Policy and Contextualisation

EUscreen started in October 2009 as a three-year project funded by the European Commission’s eContentplus programme. Over the project’s duration more than 30,000 items representing Europe’s television heritage (videos, photographs, articles) will be made available online through a freely accessible multilingual portal. As part of the project Open Images will function as a platform for European broadcasters to experiment with open content distribution of television heritage.

The portal will be launched in 2011 and will be directly connected to Europeana. The EUscreen consortium is co-ordinated by University of Utrecht and consists of 28 partners (comprising audiovisual archives, research institutions, technology providers and Europeana) from 19 different European countries. In October the project will organize its first international conference:

Date: 7-8 October 2010.
Location: Casa del Cinema. Largo Marcello Mastroianni 1, Rome, Italy.

EUscreen has organized a two-day conference on content selection policies and contextualisation in the audiovisual domain, to be held in Rome on October 7 and 8 2010. The conference will focus on contextualisation of audiovisual material, especially in the academic field. The conference programme is still under construction, but the first day includes a plenary session focussing on contextualisation of audiovisual material with keynotes and presentations of use cases. The second day comprises two workshops: one on European IPR legislations in the audiovisual sector and the impact on the exploitation of audiovisual and television archives, and one on best practices and guidelines for digitising audiovisual heritage. Attendance at the conference is free but online registration is required.

See www.euscreen.eu for more information on the final programme and for registration.

Confirmed speakers

• Prof. Andrew Hoskins, Professor of Cultural Studies at Nottingham University on media, digitization and memory.
• Dr. Lilian Landes, scientific co-ordinator of the recensio.net project at Bavaria State Library on creating a European Open Access infrastructure for historical reviews.
• Dr. Alec Badenoch, from Utrecht University on Making Europe, virtual exhibits on European cultural heritage.
• Johan Söderberg, lecturer and filmmaker from Sweden on using and reusing archival material in his works, like the series “Read my lips”.
• Dr. Tibor Hirsch, from Film Studies at ELTE University on using digitized material in a creative way to help students understand the language of film and television.
• Dr. Andreas Fickers, from the Art and Social Sciences at Maastricht University on audiovisual source critique in the age of the web 2.0.
• Peter B. Kaufman, President and executive producer of Intelligent Television. He is also the author of “Marketing Culture in the Digital Age: A Report on New Business Collaborations between Libraries, Museums, Archives, and Commercial Companies”.
• Prof. John Ellis, Professor of Media at Royal Holloway – University of London.

Open Video Conference Report

The first Open Video Conference was held at NYU Law School on June 19-20. Eminent speakers and practitioners shared their thoughts on the emerging open video movement. The impressive line-up included: Matt Mason (author of The Pirate’s Dilemma), Yochai Benkler and Jonathan Zittrain (both Harvard Law School), Xeni Jardin (Boing Boing), Peter Kaufman (Intelligent Television), Mike Hudack (blip.tv) and Christopher Blizzard (Mozilla Corporation). The conference was put on by Kaltura, Yale Internet Society Project, Participatory Culture Foundation, iCommons and the Open Video Alliance, in partnership with Mozilla, Red Hat, Creative Commons, Level 3, Akamai and many more. Open Images was also actively involved, as Sound and Vision and Kennisland hosted a session, “Audiovisual Archives”, that investigated how memory institutions could provide access to their holdings in a way that enables creative reuse.

Nationaal Archief publishes photos on Flickr The Commons

The National Archive (Nationaal Archief), the largest Dutch archive, has put a selection of its collection on Flickr The Commons. It is the first Dutch heritage institution to join Flickr The Commons, a project initiated by the US Library of Congress and the international photo-sharing website Flickr.

Click here for the pictures on Flickr The Commons

Parts of the special collection ‘Labour Inspectorate’, digitized in the Images for the Future framework, have been placed on the Flickr website. Users are invited to add tags and comments to the photos. As a result of the new collaboration between the National Archive and Spaarnestad Photo, photographs from that archive have been added to the Flickr collection as well.

On 4 November there is a seminar about the value of social tagging with, among others, delegates from Flickr and the National Maritime Museum. In the first two days, the photos were viewed over 300,000 times and more than 400 comments were added.

The Nationaal Archief is proud to be a member of The Commons on Flickr. Photographs of the Nationaal Archief that are part of The Commons on Flickr have “no known copyright restrictions”. This means that there are no copyright restrictions on the designated works, either because the Nationaal Archief owns the copyright of the photographs and authorizes others to use the work without restrictions, or because the copyright may have expired.