Networked Video in 10 Years : Networked Video == Parseable Video

Recently, I had a chance to discuss what online video might look like in the next 10 years with a group of very smart people at the “Video on the Net: Beyond YouTube?” breakout session at the Beyond Broadcast conference.

There are those who believe that the video internet is currently going through its growth spurt, much like the text internet did in the 1990s. In some respects, I very much agree. The phenomenal growth of activities such as video blogging, aggregation, playlisting and podcasting has gone far to make video a normal part of the web.

In other respects, I see a long road still ahead. Mike Lanza of Click.TV outlined a thought that is very pertinent. He stated that in the current iteration of online video, interaction, particularly social interaction, occurs around the video with tools that are firmly based in the world of the textual web: tagging, commenting, sharing and the like. This is evident all through the popular video aggregators and video blogs; a quick trip to YouTube should illustrate enough.

Of course, more is happening. People are remixing, starting to make comments in time with video, and creating videos in response to other videos, but these are certainly not the dominant forms.

It is obvious that online video must take, and is taking, a different form from the video that we have all experienced over the past 50 years (namely TV). It is on-demand, lean-forward and necessarily of limited quality and duration.

What is slightly less obvious is that current iterations of the popular online video formats are black boxes. They depend on the text around them to provide context and searchability. Metadata, which could provide some of this information, is non-standard if existent at all. In other words, we are moving from a pure text internet to a multimedia internet, but that multimedia, in order to be useful, needs to be described or put back into text in some manner.

Now, I am not saying this is a bad thing or a useless thing. We can scan text, pull out key points in a non-linear fashion and navigate through text. None of these things are easy with video in its current form. Video is rich and has tremendous emotional impact, but it also has a lot of baggage.

One of the things we discussed in our group was “What would a video wiki look like?” A wiki is a very successful example of many of the things that the web was originally designed for. Wikis are open platforms for anyone to write, edit, erase, converse and otherwise publish content online.

Unfortunately, no one really had an answer. There are thoughts that collaborative editing platforms are getting there, but editing is only one aspect of the language of video. There is also all of the production in the first place. Perhaps wikis just don’t translate into something where there is an infinite number of variables. In text, language adds some semblance of the finite; in video there isn’t a defined language with parseable portions.

My thesis here (and this is neither new nor original) is that for video on the net to reach the relevance of text on the net, to be truly searchable, scannable and sharable, it must be parseable at the very least. We must be able to hyperlink to portions of it, drill deeper within it, copy and paste it and search it.
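To make “hyperlink to portions” a little more concrete, here is a minimal sketch in Python of what a link into a stretch of video could look like. It is only an illustration under stated assumptions: the `#t=start,end` fragment borrows the temporal syntax of the W3C Media Fragments convention (which I did not discuss above) and handles plain seconds only, and the names `VideoSpan`, `parse_video_link`, `link_to_span` and `search_cues`, the example URL and the transcript cues are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class VideoSpan:
    """A stretch of a video: the unit we would like to link to, copy and search."""
    url: str               # address of the whole video, without any fragment
    start: float           # seconds from the beginning
    end: Optional[float]   # seconds from the beginning; None means "to the end"


def parse_video_link(link: str) -> VideoSpan:
    """Read a link such as 'http://example.com/talk.ogv#t=90,135' (seconds only)."""
    parts = urlsplit(link)
    base = parts._replace(fragment="").geturl()
    params = parse_qs(parts.fragment)
    if "t" not in params:
        # No time fragment: the link points at the whole video.
        return VideoSpan(base, 0.0, None)
    start_text, _, end_text = params["t"][0].partition(",")
    start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else None
    if end is not None and end <= start:
        raise ValueError("a span must end after it starts")
    return VideoSpan(base, start, end)


def link_to_span(url: str, start: float, end: Optional[float] = None) -> str:
    """Build a link that points at one portion of a video."""
    fragment = f"{start:g}" if end is None else f"{start:g},{end:g}"
    return f"{url}#t={fragment}"


def search_cues(url: str, cues: List[Tuple[float, float, str]], word: str) -> List[str]:
    """Return links to every cue whose text mentions the word.

    Each cue is (start, end, text). Note that this still leans on text
    describing the video, which is exactly the limitation discussed above.
    """
    needle = word.lower()
    return [link_to_span(url, start, end)
            for start, end, text in cues
            if needle in text.lower()]
```

With something like this, `search_cues` over a hand-made list of cues returns ordinary links that can be pasted into a blog post or a wiki page, each one opening a particular moment rather than the whole black box. The hard part, of course, is where the cues come from in the first place.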

What would a video wiki look like?

Last, a note: researchers, academics, cinematographers and practitioners who use video have been talking about these issues for as long as video has been around. This is not a new conversation, but it is certainly one that is becoming increasingly relevant. One place where you might find people discussing these very issues is the netvidtheory Yahoo Group.

5 thoughts on “Networked Video in 10 Years : Networked Video == Parseable Video”

  1. Certainly searchability, which requires ‘parseability’ and does not rely on related/linked text, is key to making a really useful video ‘encyclopedia’ out of the great pile of digital images gathering online. Would not something along the lines of ‘facial recognition software’ be the direction any such tool would have to take? Obviously the problem is considerably more complex than Western text, which can be broken down and completely represented in fewer than 128 pieces. However, I suspect that a study of shapes and their relationships could come up with a reasonable number of checks that would lead to near perfect matches. In other words, if you see a duck as an ‘oval with a triangle attached at one end and a curved tube at the other’, with a size relationship thrown in, you would in fact find way more actual images of ducks than anything else. Allowance would have to be made for ranking ‘nearness of match’, since while a ‘t’ is a ‘t’, an ‘oval’ can be many things, but that should be easily doable.

    A vocabulary of image templates could be built. For example, human faces have certain features within a very limited range of relationships, and so do cats, dogs, sailboats, chairs, houses, etc.

    If this scanning and indexing of images were done on an ongoing basis, as the search engines do with text, it would seem to me that a routine search for images of this or that could be quite rapid, and at least as accurate as text-based searches on a particular subject, without any reliance on linked text clues at all.

  2. Hi Dan,

    I agree, there could be software developed that implements search on video for things like faces and ducks and so on. Unfortunately, every attempt at this that I have seen does a pretty poor job as compared with humans doing the identification.

    My feeling is that the whole process of producing video needs to change. Instead of creating moving images with a dumb camera that just captures light, moving images should be captured with cameras that know the time and location, that can be told the context and that allow video to be marked up and tagged on the spot. Furthermore, the video shouldn’t be one continuous run: it should have scene detection, and it should record in software the settings used, such as exposure, white balance, iris, zoom level and so on.

    This would give us a running start and isn’t necessarily difficult.

