Kaaaaaahhhhhnnnnnn!!!! No Semantics in OIDs

If you don't understand the title, look here. You Klingon bastard. I know that I took spelling liberties.

Bob Kahn gave the keynote speech at DARPA's Distributed Object Storage and Retrieval (DOSR) workshop. He put forth the notion that globally unique object identifiers should not have semantics. My initial response to this assertion was "Whatcha talkin' about, Willis?"

There are many great examples of adding semantics to object identifiers that would seem to contradict him. Some come to mind:

  • The Secure File System embeds public key information into file names in order to make names self-certifying. This was a great idea that improved the security of network file system access.
  • Content-addressable storage identifies portions of data using the hash of the content. The big advantage is that a system can keep one copy of content that is shared among files and then refer to the single copy many times. This has compression benefits of 70-95%. This was originally proposed in Venti, which applied the concept recursively so that files consisted of content hashes of the content hashes of the blocks that make up the file.
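The content-addressing idea in the second bullet can be sketched in a few lines. This is a toy, in-memory illustration only, not Venti's actual design: the `ContentStore` class, its method names, and the tiny block size are all assumptions made for the example, and it uses a single pointer block per file rather than a full tree of hashes.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: every block is keyed by its SHA-1 digest."""

    def __init__(self, block_size=4):
        self.block_size = block_size  # unrealistically small, for demonstration
        self.blocks = {}              # hex digest -> raw bytes

    def put_block(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.blocks.setdefault(key, data)
        return key

    def put_file(self, data: bytes) -> str:
        keys = [
            self.put_block(data[i:i + self.block_size])
            for i in range(0, len(data), self.block_size)
        ]
        # The file itself is a block listing the hashes of its data blocks,
        # and the file's identifier is the hash of that list.
        return self.put_block("\n".join(keys).encode("ascii"))

    def get_file(self, root: str) -> bytes:
        keys = self.blocks[root].decode("ascii").split()
        return b"".join(self.blocks[k] for k in keys)


store = ContentStore(block_size=4)
root = store.put_file(b"aaaabbbbaaaa")  # the block "aaaa" is stored once
```

Note that the identifier `root` is entirely determined by the hash function and the block layout, which is exactly the kind of semantics under discussion.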

So, what's up with Kaaaahhhhnnnn? He makes the point that in distributed storage systems names need to be persistent, long-lived, uniform, and global. Thus, they should be technology, location, and format independent. With respect to Venti, the 160-bit SHA-1 identifier is already starting to age. The semantics of using this hash function to uniquely identify content will become less valid over time, as the hash weakens and as stores grow. (We will never hold 2^160 objects, but the birthday paradox applies: collisions become likely at around 2^80 objects, long before the identifier space fills up; cf. Val Henson's argument.) He also points out that names should be language independent, because semantics useful to one community are not useful to another. The same applies to operating systems.
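To put rough numbers on the birthday-paradox remark, here is a small sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/2N) for n objects hashed uniformly into N = 2^bits values. The function name is made up for this example, and the estimate covers accidental collisions only; it says nothing about deliberate attacks on the hash.

```python
import math


def collision_probability(n_objects: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_objects share a hash value,
    assuming hash outputs are uniformly distributed."""
    space = 2 ** hash_bits
    exponent = n_objects * (n_objects - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


p_huge_store = collision_probability(10 ** 18, 160)  # a quintillion objects
p_birthday = collision_probability(2 ** 80, 160)     # near the birthday bound
```

Under these assumptions the probability is negligible for any store we could build today and climbs to roughly 40% at 2^80 objects, so the worry is less about today's stores and more about tying a long-lived name to one hash function's lifetime.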

There's also a security concern. Semantics leak unintended information. My brother once worked out a company's revenue by ordering products and inferring sales volume from the transaction IDs, which were assigned sequentially. He parlayed this into a big investment win, as his fund was able to take action prior to the earnings announcement.

The conclusion is that the best that we can do is to uniquely assign random numbers. These have no semantics. I hadn't understood these points and, consequently, gave a lecture in our course on digital preservation that argued for semantics in identifiers.
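The transaction-ID leak and the random-number alternative are easy to see side by side. The sketch below is purely illustrative: the function names and the starting counter value are invented for the example and do not describe any real ordering system.

```python
import itertools
import secrets

_counter = itertools.count(1000)  # hypothetical starting order number


def sequential_id() -> int:
    """Identifier with semantics: it encodes the order of assignment."""
    return next(_counter)


def random_id() -> str:
    """Opaque identifier: 128 random bits, rendered as 32 hex characters."""
    return secrets.token_hex(16)


def inferred_volume(first_id: int, later_id: int) -> int:
    """What an outsider learns from two of their own sequential IDs:
    the number of orders everyone else placed in between."""
    return later_id - first_id - 1
```

Place one order, wait a week, place another, and `inferred_volume` gives the week's sales. Two values from `random_id` reveal nothing of the sort, and at 128 bits the chance of an accidental duplicate is small enough to ignore in practice.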

So, what about the benefit of semantics? Well, they apply to specific data storage technologies and systems at particular places in time. A repository can and should use content-addressable storage, but the content-based identifiers should not be used as global identifiers. Rather, assign a global identifier and link this to semantic identifiers. You get the benefits of the semantics when accessing the data on the specific system, but do not tie the persistence and longevity of the object to those semantics. The Secure File System embraced this concept back in 1997: to avoid the complexity of public keys in paths, they allowed for soft links. I always thought that this was a cop-out and that it somehow (qualitatively) reduced the security of the system. I might need to rethink.
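A minimal sketch of that indirection, assuming a simple in-memory registry: the `Registry` class and the scheme labels `"sha1"` and `"sha256"` are hypothetical names chosen for the example, not part of any existing resolution system.

```python
import hashlib
import secrets


class Registry:
    """Maps opaque global identifiers to system-specific semantic identifiers."""

    def __init__(self):
        self._bindings = {}  # global id -> {scheme name: local identifier}

    def register(self) -> str:
        gid = secrets.token_hex(16)  # random, carries no semantics
        self._bindings[gid] = {}
        return gid

    def bind(self, gid: str, scheme: str, local_id: str) -> None:
        self._bindings[gid][scheme] = local_id

    def resolve(self, gid: str, scheme: str) -> str:
        return self._bindings[gid][scheme]


content = b"an object worth preserving"
registry = Registry()
gid = registry.register()

# Today's repository addresses the object by its SHA-1 content hash.
registry.bind(gid, "sha1", hashlib.sha1(content).hexdigest())

# A later repository can add a stronger hash; the global name is unchanged.
registry.bind(gid, "sha256", hashlib.sha256(content).hexdigest())
```

The design choice is that only the bindings age. When a hash function or storage technology is retired, its binding is dropped or replaced, and every reference that used `gid` keeps working.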

blog/khan_....khan_no_semantics_in_uu-oids.txt · Last modified: 2009/07/14 19:07 (external edit)
 