In the long run, open data may be more important than open source

An important argument by Ian Davis:

Data outlasts code … therefore open data is more important than open source.

First, it’s important to note what I did not say. I did not say that open source is not important. On the contrary, I said that open source was extremely important and that it has sounded the death knell for proprietary software. Later speakers at the conference referred to this statement as controversial too :). (What I actually meant to say was that open source has sounded the death knell for proprietary software models.) I also mentioned that open source and free software have a long history and that open data is where open source was 25 years ago (I am using the terms open source and free software interchangeably here).

I also did not say that code does not last, nor that algorithms do not last. Of course they last, but data lasts longer. My point was that code is tied to processes, usually embodied in hardware, whereas data is agnostic to the hardware it resides on. The audience at the conference understands this already: they are archivists and librarians, and they deal with data formats like MARC, which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have outlived multiple generations of the code written to parse and manipulate them.
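
To make that concrete, here is a minimal sketch of reading the fixed-length leader that opens every MARC (ISO 2709) record, using nothing but today’s Python standard library. The file name sample.mrc is hypothetical; the point is not these particular fields but that decades-old data needs only freshly written code, not the original code.

```python
# Minimal sketch: parse the 24-byte leader that opens every MARC
# (ISO 2709) record. "sample.mrc" is a hypothetical file name.
with open("sample.mrc", "rb") as f:
    leader = f.read(24).decode("ascii")

record_length = int(leader[0:5])   # leader positions 00-04: total record length
record_status = leader[5]          # position 05: record status, e.g. 'n' for new
base_address = int(leader[12:17])  # positions 12-16: offset where data fields begin
print(record_length, record_status, base_address)
```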

It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century. Any code capable of reading the data will do, even if it’s proprietary. You can also recreate the code whereas the effort involved in recreating the data could be prohibitively high. This is, of course, a strong argument for open data formats with simple data models: choosing CSV, XML or RDF is going to give you greater data longevity than PDF, XLS or PST because the cost of recreating the parsing code is so much lower.
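
To illustrate how low that cost is, here is a minimal sketch in Python. The file name records.csv and the column names title and year are hypothetical, but the entire parser is a handful of lines of standard library code; rewriting it from scratch would take minutes, where a reader for a binary format like PST or XLS would take considerably longer.

```python
import csv

# A complete reader for a simple open format, built entirely from the
# standard library. "records.csv" and its columns are hypothetical.
with open("records.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["title"], row["year"])
```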

Here’s the central asymmetry that leads me to conclude that open data is more important than open source: if you have data without code then you could write a program to extract information from the data, but if you have code without data then you have lost that information forever.

Consider also the rise of software as a service. It really doesn’t matter whether the code such services are built on is open source or not if you cannot access the data they manage for you. Even if you reproduce the service completely, using the same components, your data is buried away out of your reach. However, if you have access to the data then you can achieve continuity even if you don’t have access to the underlying source of the application. I’ll say it again: open data is more important than open source.

Of course we want open standards, open source and open data. But in one or two hundred years, which will still be relevant? Patents and copyrights on formats expire; hardware platforms and even their paradigms shift and change. Data persists; open data endures.

The problem we have today is that the open data movement is in its infancy compared to open source. We have so far to go, and there are many obstacles. One of the first steps to maturity is to give people the means to express how open their data is, how reusable it is. The Open Data Commons is an organisation explicitly set up to tackle the problem of open data licensing. If you are publishing data in any way, you ought to check out their licences and see if any meet your goals. If you license your data openly then it will be copied and reused, and it will have an even greater chance of persisting over the long term.

3 Comments

  1. Sepp Hasslberger

    http://www.primarilypublicdomain.org/ was my first contact with the concept. This has morphed into a Ning group:

    http://ethicalpublicdomain.ning.com/

    There is also an interesting proposal I just came across today:

    Economists say copyright and patent laws are killing innovation; hurting economy

    The authors argue that license fees, regulations and patents are now so misused that they drive up the cost of creation and slow down the rate of diffusion of new ideas. Levine explains, “Most patents are not acquired by innovators hoping to protect their innovations from competitors in order to get a short term edge over the rest of the market. Most patents are obtained by large corporations who have built portfolios of patents for defense purposes, to prevent other people from suing them over patent violations.”

    Boldrin and Levine promote a drastic reform of the patent system in their book. They propose the law should be restored to match the intent of the U.S. Constitution, which states that Congress may “promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”

  2. Robin Boast

    You are absolutely right to separate open data from open code. However, the analogy only goes so far. If by open data you mean XML, text, or delimited data, then it certainly can persist, but this does not mean that it lasts. Highly categorized data, such as RDF, may also lose all meaning over time, even though the “data” is still readable. MARC also may persist, but, as every librarian knows, even MARC data has changed its meanings several times over since Avram created it. In this sense, old MARC data persists, but does not endure.

  3. Sepp Hasslberger

    Robin,

    I believe what is being discussed is not open data formatting but the openness of data in the sense that it is not enclosed by copyright, which prevents it from being re-used and re-mixed or even, at times, copied.

    Open data, that is data free of copyright and similar encumbrances, has a better chance of persisting than the closed variety because open data will be available in many different places and in different forms, while the “protected” variety of data is normally only available in one form and often only in one place, which makes its disappearance that much more likely.
