Open Images videos enriched with Open Data

For Sound and Vision, in the context of the Dutch Open Data initiative “Nederland opent Data” (The Netherlands Opens Data), I created the basis for the demo that is described in this post. The demo shows how you can play a video in an enriched context, by linking open data sources to terms that are found in speech transcripts rendered from videos. For the Code Camping event, organized by Open Cultuur Data (Open Cultural Data) I extended the demo with newly linked data sets.

The starting point for this demo application was the reuse and linking of data sets to the Open Images collection, which contains more than 1,500 freely (re)usable videos, mostly old news items from the 1920s through the 1980s. All of these videos are published under Creative Commons licences.

The basis for the application lies in the use of the speech transcripts, which were generated by using automatic speech recognition (ASR) software (from X-MI) on these videos.

The main idea for the demonstration is to contextualise videos while they’re being watched, in order to provide the user with fun, interesting and unexpected background information about the things that are mentioned in the video.


For example: when Philip Bloemendal (the presenter of the news items), in a video titled ‘Large parts of Holland completely snowed in’, says ‘(…) but in several places in Drenthe there (…)’, several blocks of information about Drenthe (a province in The Netherlands) are shown next to the video. Each of these information blocks gets its data from a specific open data source. For the first prototype the data sources included Google Maps and Wikipedia. To illustrate: once ‘Drenthe’ has been recognized as a concept, the Wikipedia block shows an article about Drenthe and the Google Maps block shows a map zoomed in on the province of Drenthe.

For the Code Camping event, organized by ‘Hack de Overheid’ (Hack the government), I added two new data sets to the demo: the collections from the Rijksmuseum and the Amsterdam Museum.

How it all works
As mentioned, the main building blocks for this demo are the Open Images videos and the corresponding speech transcripts that are used to link the words that are spoken (in the video) to an exact time code. (Note: Automatic speech recognition software is not perfect, which means that not every word in a speech transcript will exactly match the actual words that were spoken).

Step 1
Because not every word in a sentence is particularly interesting, the first step is to filter out stop words from the speech transcript, such as: articles, prepositions and verb modifiers.
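
As a rough illustration of this filtering step (not the demo's actual code), the sketch below assumes the transcript is available as (word, time code) pairs. The transcript format and the tiny stop word list are assumptions made for the example.

```python
# Illustrative stop word list only; the list used in the demo is not known.
STOP_WORDS = {"de", "het", "een", "in", "op", "van", "maar", "er"}


def remove_stop_words(transcript):
    """Keep only the (word, time) pairs whose word is not a stop word."""
    return [(word, time) for word, time in transcript
            if word.lower() not in STOP_WORDS]


# Hypothetical transcript fragment: (word, start time in seconds).
transcript = [("maar", 12.0), ("op", 12.3), ("verschillende", 12.5),
              ("plaatsen", 13.1), ("in", 13.6), ("Drenthe", 13.8)]
filtered = remove_stop_words(transcript)
```

The time codes are carried along with each word, so that they are still available when results are linked back to the video later on.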

Step 2
In the second step, a script is run on the remaining words to sort them by ‘importance’. Importance is calculated here by combining a preset word score (coming from a special lexicon) with how often the word is spoken. In this way, words with a high score and a high frequency will end up high in the list.
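
The post does not say how the lexicon score and the frequency are combined, so the sketch below simply multiplies them. That choice, the lexicon values and the default score are all assumptions for the sake of the example.

```python
from collections import Counter

# Hypothetical lexicon scores; real values would come from the lexicon.
LEXICON_SCORES = {"drenthe": 5.0, "sneeuw": 3.0, "plaatsen": 1.0}
DEFAULT_SCORE = 1.0  # assumed score for words missing from the lexicon


def rank_by_importance(words):
    """Return the distinct words, most 'important' first."""
    counts = Counter(word.lower() for word in words)

    def importance(word):
        # Assumption: importance = lexicon score * spoken frequency.
        return LEXICON_SCORES.get(word, DEFAULT_SCORE) * counts[word]

    return sorted(counts, key=importance, reverse=True)


ranked = rank_by_importance(
    ["Drenthe", "sneeuw", "plaatsen", "sneeuw", "Drenthe", "weg"])
```

Here ‘drenthe’ (score 5, spoken twice) ends up above ‘sneeuw’ (score 3, spoken twice), and both rank above the words that were spoken only once.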

Step 3
After sorting, the words are used, in order of importance, as query input for the GTAA thesaurus (used by Sound and Vision) and also for Freebase. The latter is a Google service and offers a big collection of interrelated concepts, containing descriptions from a large variety of domains. Freebase can be seen as an extensive thesaurus containing information from a large number of areas of expertise.

When a query to the GTAA or Freebase webservice yields a concept, that concept is put in a list of candidates. After all the words have been processed, this list is filtered using a very simple disambiguation algorithm (i.e. whenever the yielded concept consists of more than one word, it is taken out of the list).
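
The candidate collection and the simple filter could look roughly like the sketch below. The in-memory lookup table is a hypothetical stand-in for the GTAA and Freebase webservices, whose actual APIs are not shown here.

```python
# Hypothetical stand-in for the thesaurus webservices: word -> concept label.
FAKE_THESAURUS = {
    "drenthe": "Drenthe",
    "amsterdam": "Amsterdam",
    "museum": "Amsterdam Museum",  # a multi-word concept label
}


def lookup_concept(word):
    """Return a concept label for the word, or None if nothing is found."""
    return FAKE_THESAURUS.get(word.lower())


def collect_candidates(ranked_words):
    """Query every word, then drop concepts made up of more than one word."""
    candidates = []
    for word in ranked_words:
        concept = lookup_concept(word)
        if concept is not None:
            candidates.append((word, concept))
    return [(word, concept) for word, concept in candidates
            if len(concept.split()) == 1]


candidates = collect_candidates(["drenthe", "museum", "sneeuw", "amsterdam"])
```

In this example ‘museum’ is dropped because its concept label has two words, and ‘sneeuw’ is dropped because the lookup finds nothing.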

Step 4
In step 4, each of the GTAA and Freebase concepts from the list of candidates is used for querying the open data webservices, which are:

  1. Google Maps (only queried for location type concepts)
  2. Wikipedia
  3. Amsterdam Museum
  4. Rijksmuseum

Each returned result is linked to the time code of the (spoken) word from the speech transcript that was used to find it.
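
A minimal sketch of this linking step is given below. The two query functions are hypothetical stubs standing in for the real webservices, and the concept fields (‘label’, ‘type’, ‘time’) are assumed names.

```python
# Hypothetical stubs; the real services would be called over HTTP.
def query_wikipedia(label):
    return {"source": "wikipedia", "title": label}


def query_google_maps(label):
    return {"source": "google_maps", "place": label}


def link_to_timecodes(concepts):
    """Attach each result to the time code of the word that produced it."""
    linked = []
    for concept in concepts:
        results = [query_wikipedia(concept["label"])]
        # Google Maps is only queried for location-type concepts.
        if concept["type"] == "location":
            results.append(query_google_maps(concept["label"]))
        for result in results:
            linked.append({"time": concept["time"], **result})
    return linked


linked = link_to_timecodes([
    {"label": "Drenthe", "type": "location", "time": 13.8},
    {"label": "sneeuw", "type": "subject", "time": 20.1},
])
```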

(For those interested: the collection from the Amsterdam Museum has three different end-points: Adlib, OAI-PMH and SPARQL. For this demo, I used the SPARQL end-point because, unlike OAI-PMH, it does not need to be harvested and indexed before it can be queried. In any case I thought it was a good idea to play around again with the Semantic Web and refresh my SPARQL skills. For the Rijksmuseum, I first harvested the collection from OAI-PMH and then indexed it with SOLR. This way the collection can be searched using Lucene queries.)
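
To give an idea of what the two kinds of queries might look like, the sketch below builds a Solr select URL and a generic SPARQL query for a concept. The Solr URL, the core name, the ‘title’ field and the use of rdfs:label are all assumptions; the actual schemas of both collections may well differ, and the CONTAINS filter requires a SPARQL 1.1 end-point.

```python
from urllib.parse import urlencode

# Hypothetical local Solr core holding the harvested collection.
SOLR_SELECT_URL = "http://localhost:8983/solr/rijksmuseum/select"


def build_solr_url(concept, rows=5):
    """Build a select URL with a Lucene phrase query on an assumed field."""
    params = {"q": f'title:"{concept}"', "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


def build_sparql_query(concept, limit=5):
    """Build a generic label search; the property choice is an assumption."""
    escaped = concept.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?object ?label WHERE {\n"
        "  ?object rdfs:label ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


solr_url = build_solr_url("Drenthe")
sparql_query = build_sparql_query("Drenthe")
```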

Step 5
The last step is to send the time-coded context data back to the browser. I do this with a JSON object, which in turn serves as input for Popcorn.js to generate events. These events are linked to an HTML5 video player and make sure the right (context) information is shown in the different blocks/panels in the user interface.
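
On the server side, producing such a JSON object could be sketched as follows. The field names (‘start’, ‘end’, ‘panel’, ‘data’) and the fixed display duration are assumptions; the actual structure used by the demo is not shown in this post.

```python
import json


def to_events(linked_results, duration=10.0):
    """Turn time-coded results into events with a start and end time."""
    events = [
        {
            "start": result["time"],
            # Assumption: every block stays visible for a fixed duration.
            "end": result["time"] + duration,
            "panel": result["source"],
            "data": result["data"],
        }
        for result in linked_results
    ]
    return sorted(events, key=lambda event: event["start"])


# Hypothetical time-coded results from the previous step.
linked_results = [
    {"time": 20.0, "source": "wikipedia", "data": {"title": "Sneeuw"}},
    {"time": 13.5, "source": "google_maps", "data": {"place": "Drenthe"}},
]
payload = json.dumps({"events": to_events(linked_results)})
```

In the browser, each event would then be handed to the video player so that the matching panel is updated between its start and end time.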

Because the processing of these five steps takes around 15-20 seconds per video, I store all of the results in .json files. When opening the demo these files are loaded instead of fetching the data live from the web.
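
This kind of caching can be sketched as below, assuming one .json file per video. The function name, the file naming scheme and the `compute` callback are hypothetical.

```python
import json
from pathlib import Path


def load_or_compute(video_id, compute, cache_dir):
    """Return cached results if present, otherwise compute and store them."""
    cache_file = Path(cache_dir) / f"{video_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    # Slow path: run the full pipeline once and keep the result on disk.
    result = compute(video_id)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

The first request for a video pays the processing cost; later requests only read the stored file.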

There is still a lot to do
The demo shows what can be done by using concept detection (a.k.a. Named Entity Recognition) in combination with open data sources. Several aspects, however, could be (significantly) improved:

Better concept detection
The concept detection as described in this demo leaves a lot of room for improvement. For instance, concepts that consist of more than one word are not recognized, e.g.: ‘Amsterdam Museum’ now yields two concepts, ‘Amsterdam’ and ‘Museum’, but the actual concept ‘Amsterdam Museum’ is not found.
Moreover, dedicated Named Entity Recognition (NER) services like DBpedia Spotlight (which gives good results for English) should be investigated in order to improve results. For Dutch, however, the search for a decent (open source) solution seems to be ongoing.
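
One possible way to pick up multi-word concepts, sketched here purely as an illustration and not something the demo does, is to try the longest word sequence first and fall back to shorter ones. The set of known concept labels and the maximum phrase length are assumptions.

```python
# Hypothetical set of known concept labels (lower-cased).
KNOWN_CONCEPTS = {"amsterdam museum", "amsterdam", "museum", "drenthe"}
MAX_WORDS = 3  # assumed upper bound on the number of words in a concept


def detect_concepts(words):
    """Greedy longest-match lookup of word sequences in KNOWN_CONCEPTS."""
    found = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n]).lower()
            if phrase in KNOWN_CONCEPTS:
                found.append(phrase)
                i += n  # skip the words consumed by this match
                break
        else:
            i += 1  # no concept starts here; move on to the next word
    return found


concepts = detect_concepts(["het", "Amsterdam", "Museum", "in", "Amsterdam"])
```

With this approach ‘Amsterdam Museum’ is matched as a single concept, while the later, stand-alone ‘Amsterdam’ is still found on its own.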

Selection of relevant sources for the user
As for the relevance of the ‘context information’ currently shown to the user, more thought is needed on how to select the best data sources. For instance, it is not obvious why somebody watching a video about ‘Holland’s oldest steam-powered pumping station’ would be interested in ‘Hens chalice from the Company of Nine’ (found on the basis of the word ‘Gorinchem’, a town in The Netherlands).

Optimizing Popcorn.js usage
The demo was made with an older version of Popcorn.js (v0.7) and therefore doesn’t make full use of the latest features and plugins Popcorn.js has to offer. Future releases of the demo will incorporate the newest version (currently v1.1.1).

In any case, the demo does show how speech transcripts of videos can be combined with open data sources and how this can enable (mutual) contextualisation of these sources. The demo will be further enhanced for the ‘Nederland opent Data’ project. Any progress will be reported here!

Jaap Blom | Software engineer | R&D department, Netherlands Institute for Sound and Vision

6 thoughts on “Open Images videos enriched with Open Data”

  1. Hi Evelien,
    great post. I find what you did with popcorn.js very innovative. Concerning the NER, can you explain why you had to use both the GTAA thesaurus and Freebase? I don’t know much about NER, yet what you did is giving me some ideas.

  2. Hi @paisible,

    that’s a good question. The initial idea was to be able to compare the results of both Freebase and GTAA (to see for instance how big a domain each service covers).
    Moreover I was curious to see how well Freebase works for Dutch.
    I also thought of the possibility of using the results from the GTAA to validate the results from Freebase (or vice versa).

    I guess the main idea I had was to use multiple services in order to be able to disambiguate and better filter out faulty results.


  3. Hi, we (still) have some problems with the server hosting the demo, which makes it necessary to run the demo on port 4444 (rather than the default port 80).

    Some (company) firewalls block content served over port 4444, so this is possibly the case for you.

    The best option is to try viewing the demo from a different location (e.g. at home, rather than at work).

    I hope this clarifies things.

