Randal Burns' Big Data Blog

smallbigrbface.jpghpim1156.jpg

RSS feed!

Thoughts related to the management of large data sets, storage systems, database systems, scientific computing, and anything else that catches my fancy.

Distributed Sensing

originally posted 02/24/2008

My student Ioannis has been trying to turn me on to compressive sensing for a while now. He loves it because it is so “radical”, “new”, and “cool”. These are important things, but systems researchers tend to dismiss such superlatives and ask mundane questions like, “how can this improve operational behavior?” The best innovations are those that have profoundly change the way that we assemble information systems.

I looked into it more deeply in the context of my Advanced class and am absolutely convinced that Ioannis is right. It is cool and radical. But, not for it’s elegance, but rather for its ability to do distributed sensing. (OK, I am not going to summarize the technique to much, so check out the tutorial.)

In a sensor network, data collection comprises computing a random matrix that defines what observations each sensor should make, e.g. for a time series of n values, each sensor will observe s of those n, s<n, compose the collection into a single value. These collections are shipped back to a host computer where the signal is reconstructed by inverting the randomized matrix.

The implications are many. Each sensor can act independently, without coordination overhead. The system as a whole collects only as many samples as is needed to capture the information of the signal (as determined by its sparsity). Essentially, the data are already compressed when they are sampled. This defies intuition, but it works. This is the killer app. The proponents like to talk about the “single pixel camera” as a way to explain the technique. But it doesn’t pack the same intellectual punch as distributed sensing. It makes the technique seem like a toy. From my experience, the single pixel camera kept my interest at Bay until I delved more deeply, because it did not capture the heart and mind.

As a systems guy, the distributed/uncoordinate nature of CS is the best property and has incredible implications on environmental sensing. In most applications, the technique is useless, because wavelet techniques have better fidelity at the same data size. They should, they address the sparsity directly rather than through randomness.

I am not sure that I will ever use these techniques, because they don’t tend to fit the kinds of applications that I am working on. But, for me personally, compressive sensing goes into the category of cool, useless things that I would like to use later. The other recent techniques in this category are manifold characterization and learning techniques, such as LLE and Isomap (see the Science issue of 22 December 2006) and LDPC codes (nice tutorial). I should probably write an entry on why I think LDPC codes are useless.

· %2008/%03/%14 %17:%Mar · Randal Burns · 0 Comments

Re-Thinking MSST

originally posted 12/22/2008

Steve Louis of Lawrence Livermore National Labs wrote me today to ask if I wanted to help out with a new workshop entitled “First Workshop on Computing with Massive and Persistent Data” (no Web presence yet). This would seem to be right up this blog’s alley. The idea is to do a small interactive thing with 10-12 people that actually have big data. Steve’s the man (in a good way not THE MAN who’s keeping us down) and the conference is in Baltimore so it’s kind of a no-brainer for me to say yes enthusiastically.

A side conversation in our exchange is that the executive committee is trying to reinvent the parent conference Mass Storage Systems and Technologies to target it more toward practitioners who have big data problems and away from academics who write about what it might be like to have big data problems. This is a great thing. There’s no real outlet for the former and MSST is inferior at the latter (when compared with FAST). One problem with MSST is that the academic nature of the program is not of interest to the practitioner crowd (who reasonably don’t think that one can write about big data without having big data). And, few academics come to the conference, because it’s a practitioner conference.

(My mea culpa). I was part of the problem in that I supported a more academic conference and I served on the program committee for a bunch of years.. In this, I was wrong. I think this has lead to the conference being less interesting to both communities. In the interim, I have also learned the value of actually having a lot of data and am striving to do so. (But it’s still puny by Nat’l Labs standards.)

So the question then is “How does one successfully execute this redirection?” One obstacle that needs to be overcome–and it has always been an obstacle–is that the academics tend to outperform on editorial content and on concept space and often in a superficial, rather than substantial, way. To solve this, the conference needs to judge the contributors and participants more on “what they do” and less on “what they submit”. Some solutions for MSST:

  • More masterworks
  • More panels
  • Only accept short (1 or 4 page) submissions
  • More glitz, *-athons, and challenges (a la SC)
  • Change editorial guidelines to favor practice

The good news is that the MSST exec. committee knows all this and I suspect that they have known it all along. The result will be a better conference and it will be more interesting to even me (an dyed in the wool academic without any large data).

· %2008/%03/%14 %17:%Mar · Randal Burns · 0 Comments

On Regular Decomposition

Update for 03/14 We just finished a VLDB demo submission on this topic and Misha points out that regular implies a regular division of space. So, we use the term data-independent decomposition, which renders the language (not the point) of this point moot.

originally posted on 01/08/2008

shapeimage_2.jpg

Just finished a pre-proposal to NSF’s Cyber-Enabled Discovery and Innovation (CDI). I don’t know what the NSF is thinking having three program deadlines all out of the Office of Cyberinfrastructure in the same week right after the holidays. But, I was more than happy to turn my schedule upside-down to accommodate them.

As I was writing this proposal it became clear to me that there is a close connection between distributed scientific computing and indexing based on the regular decomposition of space (data independent decomposition). While not a new concept, this whole issue seem to be somewhat under-realized. Joe Hellerstein mentioned is at CIDR last year, somewhat along the lines of

JH: “...you have the choice of whether to index
 the data itself or the space around the data...” 

in which indexing the space around the data produces a regular decomposition.However, our discussion never got to what were the implications of this choice. Similarly, Hanan Samet’s book deals with the issue only definitionally. Noting that there are regular decompositions (region trees and space filling curves) but now providing a compare/contrast treatment of when and why you would like to use these.

I have come to the conclusion that regular spatial decompositions are the only way to provide access paths and indexes across data federations. Data dependent decompositions mandate that all data are indexed together, e.g. R-trees, grid files, k-d trees, Voronoi diagrams, etc. This means that if one wants to add or remove data, one needs to manipulate or recalculate the index. If one wants to put data together in an ad-hoc fashion, data dependent indexes are useless. In contrast, with regular decompositions, you can take two data sets that are indexed, stored, and managed independently and the indexes compose automatically, with the caveat that they use the same decomposition.

The next observation is that there are very few options for regular decomposition. Most techniques are based on the recursive division of space, e.g. region trees, tesselations, and space filling curves. This is great as long as your objects map well to the division of space, i.e. are contained by relatively few spatial partitions. This is not true with Estuaries. Misha put it this way: “The problem arises because estuaries—and the Potomac in particular—exhibit a large number of winding, tendril-like tributaries. Because they are long and skinny, the tributaries are not likely to be wholly encapsulated by the cells of the partition. Because of their winding nature, the distribution of tributary fragments will not conform to the regular indexing of the partition and the data will be spread across non-contiguous regions of the disk.” Word!

This is really the motivation for our current project, which is to try and generate regular decompositions of space that reflect the structure of the embedding space from which data are drawn. For now, this means using the medial axis transform on the outline of the Chesapeake Bay to create a linearization of space the reflects the Bay structure.

· %2008/%03/%14 %17:%Mar · Randal Burns · 0 Comments

The 50% Rule

originally posted on 12/04/2007

I taught my yearly class of dynamic storage allocation processes out of Knuth Volume 1. This includes the 50% rule. (Joke of the day was that it indicated that 50% of my cousins have all of their teeth.) Rather, it states that when allocation N objects in storage using Next Fit allocation and a deallocator that merges adjacent free space, one can expect their to be ~N/2 regions of free space when objects are variably sized (see Knuth for a more correct description). It’s a steady state argument.

This has lots of implications. You can derive the average size of free regions based on knowing the number of allocated objects and the capacity. It means that as space fills, the free regions get smaller and smaller, etc.

The conclusion that I had not come to prior to today is that it also means that disk storage systems don’t age! As long as the size and number of objects stored on a disk does not grow, the system should have the same number and average size of free regions. Aging does not degrade the allocation process. Cool! This is why we don’t need disk de-fragmentation utilties.

Pretty straightforward stuff. My students got to hear me say this like it is the most obvious conclusion ever.

· %2008/%03/%14 %17:%Mar · Randal Burns

Outages. Every Year.

originally posted on 11/27/2007

Well, it happened again. This year it was Cyber Monday during which Yahoo! shops experienced a 4 hour outage affecting 45,000 stores. Last year it was is was Black Friday that saw Amazon and Walmart stumble. I hate to admit it, but I take some pleasure in watching this happen. Last year more than this year. Watching the big guys get hit for their own incompetence is one thing. Everyone likes to see Goliath take the stone in their head. However, Yahoo! shopping affected lots of little guys that rely on Yahoo! being up to stay financially viable. These stores have really gotten the short end of the stick.

So, I really have two thoughts about the situation.

(1) It is so disappointing that Web services don’t degrade gracefully. There has been a tremendous amount of research in this area, i.e. dealing with overload, most of it addressing TCP connections. None of this work seems to be implemented. One of my all time favorite papers is in this space: “P. Druschel and G. Banga. http://citeseer.ist.psu.edu/druschel96lazy.html. OSDI, 1996.” If I recall, the idea is to combine early receive processing, i.e. getting stuff out of the TCP queue into connection specific request queues, and then letting these queues overflow so that all overflows affect specific applications. And then to dynamically allocate memory to these queues in proportion to the priority of the application/process that will be receiving the data. In other words, drop specific packets from low priority applications, which allows some (high priority) applications to make progress. This is neither the first nor the last paper that explores this space, but it is the first one that I saw and I remember it well.

(2) I do some research that allows clients to inspect data when it is outsourced and I have gotten a lot of feedback that outsourcing sites are trustworthy, which would render my research moot. This premise is nonsense. The biggest companies and the smallest companies all suffer from availability issues. Relying on any service uniquely (trusting it) makes clients vulnerable to outages, data loss, etc. when that trust is misplaced. So, small businesses should really diversify their strategy, running shopping sites with many providers. Similarly, clients that outsource storage, serving, etc. with service providers need techniques to examine, audit, and manage their outsourced resources.

(Ach! Point 2 didn’t really come out the way I wanted it.)

Also, I really need to meet Peter Druschel. This occurred to me as I was describing LRP. The other paper he wrote that I think is the bomb-diggity (actually a word–kind of–and shouldn’t be hyphenated) is IO-lite. This is another wonderful idea that has not made its way into OS implementations. He just seems to work on stuff that is 10 years in the future. Bien fait!

· %2008/%03/%14 %17:%Mar · Randal Burns · 1 Comment
 
bigdatablog.txt · Last modified: 2009/07/14 19:07 (external edit)
 
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki