Ministry of Truth

Michel asked me to republish my comments on OpenCalais.

On the Kendra mailing list I read an interesting posting about the OpenCalais system, developed by Reuters. According to their website, the Calais initiative seeks to help make all the world’s content more accessible, interoperable and valuable via the automated generation of rich semantic metadata, the incorporation of user-defined metadata, the transportation of those metadata resources throughout the content ecosystem and the extension of its capabilities by user-contributed components. Let’s see how the ClearForest Web service works:

  • A User publishes an article on his/her blog (e.g. a WordPress blog)
  • A WordPress plugin creates a Semantic tag cloud of the article (this plugin has yet to be written)
  • The Semantic tags are sent to the Calais system
  • The Calais system stores the Semantic tags and returns a unique RDF identifier
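
The steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: ToyTagStore, extract_tags and the urn:example: identifier scheme are hypothetical names invented for illustration. They are not part of the real Calais API, and the real service returns RDF rather than a hashed string.

```python
import hashlib


class ToyTagStore:
    """Hypothetical in-memory stand-in for the Calais system (not its real API)."""

    def __init__(self):
        self._records = {}

    def submit(self, author, tags):
        # Normalize the tag cloud so the same tags always yield the same identifier.
        normalized = sorted({t.strip().lower() for t in tags})
        payload = "|".join([author] + normalized).encode("utf-8")
        digest = hashlib.sha1(payload).hexdigest()
        # Assumption: a made-up URN plays the role of the unique RDF identifier.
        identifier = f"urn:example:tagcloud:{digest[:12]}"
        self._records[identifier] = {"author": author, "tags": normalized}
        return identifier

    def lookup(self, identifier):
        return self._records[identifier]


def extract_tags(article, known_entities):
    # Naive stand-in for the blog plugin: keep the entities mentioned in the text.
    text = article.lower()
    return [entity for entity in known_entities if entity.lower() in text]


store = ToyTagStore()
tags = extract_tags(
    "Juergen Hambrecht appointed new BASF Chairman",
    ["BASF", "Juergen Hambrecht", "Reuters"],
)
identifier = store.submit("alice", tags)
print(identifier, store.lookup(identifier)["tags"])
```

Note that even in this toy version the blogger only gets an identifier back; the stored tag clouds live on the service side.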

The article that a User publishes is information. According to the DIKW model, information becomes knowledge when it is put into context. The Calais system creates knowledge by putting User Generated Content into context. The example on the Calais website illustrates how this works:

  • Many Users write in their blogs: person ‘z’ was appointed chairman of company ‘y’ on date ‘x’
  • As a result of natural language processing, machine learning and other methods, the Calais system “knows” that person ‘z’ IS chairman of company ‘y’
  • The Calais system issues and maintains the RDF identifiers pointing to the different creators of similar Semantic tag clouds
  • Since the information is confirmed by several different sources, the quality of the Calais system’s knowledge is good
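
To make the last point concrete, here is a minimal sketch of how confirmation by different sources could be turned into a number. It assumes that the facts have already been extracted as (source, person, role, company) tuples; the function names and the simple count of distinct sources are my own illustration, not the method Calais actually uses.

```python
from collections import defaultdict


def corroboration(assertions):
    """Count the distinct sources behind each (person, role, company) fact."""
    sources_by_fact = defaultdict(set)
    for source, person, role, company in assertions:
        # A set ignores repeated statements of the same fact by the same source.
        sources_by_fact[(person, role, company)].add(source)
    return {fact: len(sources) for fact, sources in sources_by_fact.items()}


def best_supported(assertions, role, company):
    """Return (person, source_count) for the most widely confirmed holder of a role."""
    counts = corroboration(assertions)
    candidates = {
        fact[0]: n
        for fact, n in counts.items()
        if fact[1] == role and fact[2] == company
    }
    if not candidates:
        return None
    return max(candidates.items(), key=lambda item: item[1])


assertions = [
    ("blog-a", "z", "chairman", "y"),
    ("blog-b", "z", "chairman", "y"),
    ("blog-a", "z", "chairman", "y"),  # same blog repeating itself
    ("blog-c", "w", "chairman", "y"),
]
print(best_supported(assertions, "chairman", "y"))
```

Counting sources says nothing about how recent each statement is, so a real measure of validity would also have to take dates into account.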

Information (content) is useless if it is out of context. Putting information in context creates value. Thus ClearForest adds value to User Generated Content. By comparing information from different sources and deriving a measure of its validity, the ClearForest service adds further value. What are the benefits of ClearForest for web users? Imagine a User wants to know who the current chairman of company ‘BASF’ is. He can search Google for BASF + chairman.

  • The first hit yields Juergen Hambrecht appointed new BASF Chairman.
  • But when the User looks at the URL Google gives as the source of this information (www.basf.com/corporate/news2002/newsinfo_board_071802.html), he will notice that the article dates from 2002, about six years ago.
  • So the User may ask himself whether this information is still valid in the context of the year 2008.

Once a search engine (let’s call it ClearSearch) is connected to the ClearForest service, a User may enter the (natural language) question: who is the chairman of BASF?

  • the Clearforest system “knows” that person ‘z’ IS chairman of company ‘y’ (Dr. Jürgen Hambrecht still is chairman of BASF).
  • This knowledge has been verified by Reuters correspondents and probably has been confirmed by hundreds of articles published by independent bloggers – so the chance is good that ClearSearch’s answer is correct and always up to date.
  • By analysing User Generated Content it’s even possible that Clearforest will “know” about events Reuters correspondents haven’t heard of.

This raises an interesting question: who owns, or has the right to control access to, the knowledge generated from User Generated Content? Publishers of web pages (e.g. bloggers) have the right to control access to their information, because they own the copyright to their articles. But do Web users also have the right to control access to their collective knowledge? And even if it is technically possible to control access to public knowledge, is it ethical to support technologies aiming at this goal? Wikipedia proves that Web users have developed working processes to create knowledge from User Generated Content. According to its official policy, Wikipedia’s knowledge is licensed under the GFDL, so everybody can use or modify this knowledge for free. The passage on Intellectual Property Rights in the ClearForest Terms of Use reflects quite a different conception of the ownership of public knowledge:

ClearForest’s Intellectual Property Rights. You acknowledge that ClearForest owns all right, title and interest in and to the Service, including without limitation all intellectual property rights (the “ClearForest Rights”), and such ClearForest Rights are protected by U.S. and international intellectual property laws. Accordingly, you agree that you will not copy, reproduce, alter, modify, or create derivative works from the Service. You also agree that you will not use any robot, spider, other automated device, or manual process to monitor or copy any content from the Service. The ClearForest Rights include rights to (i) the Service developed and provided by ClearForest; and (ii) all software associated with the Service. The ClearForest Rights do not include third-party content used as part of Service, including the content of communications appearing on the Service.

Your Intellectual Property Rights. ClearForest does not claim any ownership in any of the content, including any text, data, information, images, photographs, music, sound, video, or other material, that you upload, transmit or store in using ClearForest SWS.

In other words: Users still own the copyright to their User Generated Content, but ClearForest will own the Intellectual Property Rights to the Service that is built on that content. The knowledge that person ‘z’ IS chairman of company ‘y’ will be the intellectual property of Reuters. As a consequence, Reuters will have the (legal and technical) power to decide whether or not knowledge exists. The Calais initiative reminds me of George Orwell’s Ministry of Truth. On their portal they call for developers to incorporate their service into web applications. I am not in favour of privatizing public knowledge, and therefore I would not recommend that developers support Calais. Instead they should think about an open and decentralized alternative (e.g. OpenDover) based on Open Source technologies (e.g. UIMA).
