Randal Burns' Big Data Blog


Thoughts related to the management of large data sets, storage systems, database systems, scientific computing, and anything else that catches my fancy.

Balancing Workload and Architecture with Amdahl's Number

Alex Szalay sold me on the concept of Amdahl's number, which is the ratio of the amount of I/O to the number of operations. For example, an Amdahl number of 1.0 corresponds to a machine that can do 1 bit of I/O per second for each operation per second. So, the ratio is a comparison of rates that represents the I/O balance of a machine and, thus, is scale-invariant in some sense.
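To make the definition concrete, here is a minimal sketch of the arithmetic. The function name and the hardware figures are made up for illustration; they do not describe any real machine.

```python
def amdahl_number(io_bits_per_sec: float, ops_per_sec: float) -> float:
    """Bits of I/O per second divided by operations per second."""
    if ops_per_sec <= 0:
        raise ValueError("ops_per_sec must be positive")
    return io_bits_per_sec / ops_per_sec


# Hypothetical node: 1 GB/s of I/O (8e9 bits/s) and 10 billion operations/s.
one_node = amdahl_number(io_bits_per_sec=8e9, ops_per_sec=1e10)

# Ten identical nodes: both rates scale together, so the ratio is unchanged.
ten_nodes = amdahl_number(io_bits_per_sec=10 * 8e9, ops_per_sec=10 * 1e10)

print(one_node, ten_nodes)
```

The second call is the "scale-invariant in some sense" part: adding identical nodes grows both rates by the same factor and leaves the ratio alone.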

I have touted that the petabyte-capacity database here at Hopkins has a high Amdahl number, as if that were meritorious in itself. The number is about 0.6, which is an order of magnitude higher than typical HPC systems. This is great, but leads to questions along the lines of:

Why do I care about the Amdahl number?

A nice discussion of this issue was covered by Oliver Ratzenberger in the XtremeLargeMPP blog.

This question has become much easier to answer based on instrumenting applications. You can measure the Amdahl number of an application, which reflects the balance of how much I/O it does relative to the computation. The in-depth analysis and scans done by scientists on the Sloan Digital Sky Survey and the JHU Turbulence Database have very high Amdahl numbers (>0.5). Most computational and multi-scale simulations do a small amount of I/O, have low Amdahl numbers (<0.1) and, thus, run just fine on most parallel supercomputers.

If you match the Amdahl number of the machine and the application, there should be no wasted resources. This is true only in a time-averaged sense; I/O and computation do not always happen concurrently, and different portions of applications have different ratios. It's also worth noting that the Amdahl number of an application or workload is independent of the architecture on which it is run.
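A rough sketch of what "time-averaged" means here, assuming you have instrumented an application and recorded operation and I/O counts per phase. The phase numbers and both function names are hypothetical.

```python
def workload_amdahl_number(phases):
    """phases: list of (operations, io_bits) pairs, one per phase of a run."""
    total_ops = sum(ops for ops, _ in phases)
    total_bits = sum(bits for _, bits in phases)
    return total_bits / total_ops


def idle_resource(app_number, machine_number):
    """Which resource sits partly idle, on average, for this pairing."""
    if app_number > machine_number:
        return "cpu"  # the application wants more I/O per operation than the machine supplies
    if app_number < machine_number:
        return "io"   # the machine supplies more I/O per operation than the application uses
    return "none"


# Hypothetical run: a compute-heavy phase followed by an I/O-heavy scan.
phases = [(9e12, 1e11), (1e12, 5e12)]
per_phase = [bits / ops for ops, bits in phases]
overall = workload_amdahl_number(phases)

print(per_phase, overall, idle_resource(overall, machine_number=0.6))
```

The two phases have wildly different ratios, yet the run as a whole lands near the middle. A machine balanced for the average would still be mismatched during each individual phase, which is the caveat in the paragraph above.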

However, I want to put forth that supercomputing applications have co-evolved with HPC architectures and that the mix of I/O and computation represents the set of applications that will run well on the hardware deployed at HPC facilities, rather than the natural balance of I/O and computation. For example, it is typical in a multi-scale simulation that it might take 5 times as long to store a time-step as to compute it (solve the equations).

Blue Screen of Death

It turns out that I, much like a PC, can crash. Tonight, I gave a 5 minute extemporaneous talk at CIDR in the Gong Show. I wanted to try out some ideas about how to do big queries incrementally that Nolan and I spoke about last week. I liked the ideas. I'll blog about them tomorrow.

So, I thought I would go casual and lighthearted: no slides, little prep. Well, that's not me, and the experiment was less than successful.

I stood in front of the projector on the stage. Big blue screen bearing right down on me. Told my opening joke…no laughs. Then, forgot the title of my talk. 10 seconds pause maybe. Someone in the crowd gave me a bzzzzzzzzzzzz.

Then, I rebooted and got through my stuff. I was a little rattled so it wasn't great. I'm sure it was OK, but felt miserable.

To cool down, I took a long walk on the beach at 10 pm, where I encountered two yoots (that would be local youths) that decided that they would try and physically intimidate me.

Not Big Data's night. Hasta mañana.

· 2009/01/06 · Randal Burns · 1 Comment · 0 Linkbacks

Kaaaaaahhhhhnnnnnn!!!! No Semantics in OIDs

If you don't understand the title, look here. You Klingon bastard. I know that I took spelling liberties.

Bob Kahn gave the keynote speech at DARPA's Distributed Object Storage and Retrieval (DOSR) workshop. He put forth the notion that globally unique object identifiers should not have semantics. My initial response to this assertion was Whatcha talkin' about Willis?

There are many great examples of adding semantics to object identifiers that would seem to contradict him. Some come to mind:

  • The Secure File System embeds public key information into file names in order to make names self-certifying. This was a great idea that improved the security of network file system access.
  • Content-addressable storage identifies portions of data using the hash of the content. The big advantage is that a system can keep one copy of content that is shared among files and then refer to the single copy many times. This has compression benefits of 70-95%. This was originally proposed in Venti, which applied the concept recursively so that files consisted of content hashes of the content hashes of the blocks that make up the file.
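The content-addressable idea in the second bullet can be sketched in a few lines. This is a toy in-memory store, not Venti's actual design: the class, the helper functions, and the tiny block size are all assumptions for illustration. It uses SHA-1 only because that is the hash mentioned for Venti.

```python
import hashlib


class BlockStore:
    """Toy content-addressable store: each block is keyed by the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self._blocks.setdefault(key, data)  # identical content is stored only once
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]

    def __len__(self) -> int:
        return len(self._blocks)


def store_file(store: BlockStore, data: bytes, block_size: int = 4) -> list:
    """Split data into fixed-size blocks and return the list of block hashes."""
    return [store.put(data[i:i + block_size]) for i in range(0, len(data), block_size)]


def read_file(store: BlockStore, keys: list) -> bytes:
    """Reassemble a file from its list of block hashes."""
    return b"".join(store.get(k) for k in keys)


store = BlockStore()
keys = store_file(store, b"aaaabbbbaaaa")  # three blocks, two of them identical
print(len(keys), len(store))
```

The file is described by three hashes but the store holds only two blocks, which is where the space savings come from. A recursive scheme would go one step further and store the list of hashes itself as a block.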

So, what's up with Kaaaahhhhnnnn! He makes the point that in distributed storage systems names need to be persistent, long-lived, uniform, and global. Thus, they should be technology, location, and format independent. W.r.t. Venti, the 160-bit SHA-1 identifier is already starting to age. The semantics of using this hash function to uniquely identify content will become less valid over time, as the hash weakens and as stores grow (not to 2^160 objects, really; the birthday paradox applies and collisions become likely much sooner, at around 2^80 objects; cf. Val Henson's argument). He also points out that the semantics should be language independent, because semantics useful to one community are not useful to another. The same applies to operating systems.
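For a feel of the birthday-paradox point, here is a sketch using the standard approximation p ≈ 1 − exp(−n(n−1)/2^(b+1)) for n objects and a b-bit hash. It assumes the hash behaves like a uniform random function, which says nothing about deliberate attacks on SHA-1; the function name is made up.

```python
import math


def collision_probability(num_objects: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_objects share a hash value."""
    pairs_over_space = num_objects * (num_objects - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(-pairs_over_space)


for exponent in (40, 60, 80):
    print(exponent, collision_probability(2 ** exponent, 160))
```

Under this model, accidental collisions are negligible for any store you could build today and only become likely around 2^80 objects, far short of 2^160.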

There's also a security concern. Semantics leak unintended information. My brother once identified a company's revenue by ordering products and inferring their sales volume from the transaction ids, which were assigned sequentially. He parlayed this into a big investment win, as his fund was able to take action prior to the earnings announcement.

The conclusion is that the best that we can do is to uniquely assign random numbers. These have no semantics. I didn't understand these points and, consequently, gave a lecture that argued for semantics in identifiers in our course on digital preservation.

So, what about the benefit of semantics? Well, they apply to specific data storage technologies and systems at particular places in time. A repository can and should use content-addressable storage, but the content-based identifiers should not be used as global identifiers. Rather, assign a global identifier and link this to semantic identifiers. You get the benefits of the semantics when accessing the data on the specific system, but do not tie the persistence and longevity of the object to those semantics. The Secure File System embraced this concept back in 1997. To avoid the complexity of public keys in paths, they allowed for soft links. I always thought that this was a cop-out and that it somehow reduced (qualitatively) the security of the system. I might need to rethink.
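One way to picture "assign a global identifier and link this to semantic identifiers" is a small lookup table. This is only a sketch: the registry class and its methods are hypothetical, and a random UUID stands in for whatever semantics-free global identifier a real system would mint.

```python
import hashlib
import uuid


class IdentifierRegistry:
    """Maps a random, semantics-free global id to system-specific identifiers."""

    def __init__(self):
        self._links = {}

    def register(self, content: bytes) -> str:
        global_id = uuid.uuid4().hex  # random; says nothing about the content
        self._links[global_id] = {"sha1": hashlib.sha1(content).hexdigest()}
        return global_id

    def relink(self, global_id: str, scheme: str, local_id: str) -> None:
        """Attach another system-specific identifier to the same global id."""
        self._links[global_id][scheme] = local_id

    def resolve(self, global_id: str, scheme: str) -> str:
        return self._links[global_id][scheme]


registry = IdentifierRegistry()
content = b"some archived object"
gid = registry.register(content)

# Years later the repository moves to a stronger hash; the global id is unchanged.
registry.relink(gid, "sha256", hashlib.sha256(content).hexdigest())
print(gid, registry.resolve(gid, "sha256"))
```

The repository still gets content addressing internally, but anything that cites the object holds only the random id, so swapping out an aging hash function does not break references.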

What's the big I-Dee-Uh?

It turns out that I am a member of a new institute, the Johns Hopkins Institute for Data Intensive Science and Engineering (IDIES). This is alternately pronounced ideas or eye-dee-ess. I prefer the latter and will do everything in my power (admittedly limited) to influence the organization in this direction. As of now, the institute has exactly zero Web presence. Compare this with some of the other JHU institutes, e.g. inbt or jhuisi with popups of some of our favorite people. I am sure that this will be remedied as time passes. I thought that we would start this process in as small a way as possible... with my blog.

The good news is that IDIES has got the right mix of people and projects and has developed organically from existing projects. These include:

  1. PANSTARRS database development at JHU

This group amounts to my research friends (i.e. my brffs).

The mission of the Institute is to develop novel HPC architectures and the scientific data systems that support data intensive science, i.e. scientific discovery through data mining, feature extraction, correlation, and other forms of knowledge discovery through the search of petascale data sets. That may actually be my CS take on the mission; the scientists themselves may be more application focused. In either case, interdisciplinary teams work together to create scientific data systems according to Jim Gray's data laws.

  1. Need scale-out solution for analysis
  2. Take the analysis to the data!
  3. Start with “20 queries”
  4. Go from “working to working”

I'll write another entry on the laws themselves, maybe. But, Alex Szalay is promoting the “Jim Gray” process of making big data systems that work and codifying this as a best practice. It will be awesome when we speak of Gray's laws with the same breeziness with which we talk about Moore's or Amdahl's.

· 2008/07/11 · Randal Burns · 0 Comments

Scientific DB Panel @ ICDE

Joe Hellerstein invited Jignesh Patel and me to organize a panel on scientific databases at ICDE this year, which we did. This turned out to be one of the highlights of my semester. While organizing it, Jignesh steered the theme toward the social and community aspects of scientific databases and came up with the title “Scientific Data Management: An Orphan in the Database Community”, and indeed it is an orphan. The panel description is here: Panel #2. Panelists were:

  • Sue Davidson
  • Jignesh Patel
  • Yannis Ioannidis
  • Miron Livny
  • Randal Burns (moderator)

I need to get their slides.

I walked off of the beach into the room and there were approx. 14 chairs around a table and I thought to myself “you must be kidding me!” I did how much work for 14 chairs - 4 panelists = 10 people. No way! So I asked for more chairs. Well, we filled the room with 60-70 folks and it was lively. Half the room shouted BULLSHIT! to Miron at one point.

Jignesh kicked us off with a statistical analysis of ICDE/VLDB/SIGMOD papers showing that one is 4 times more likely to get a SkyLine paper than a Scientific Data Management paper.

The panel is probably best described via some of the notable disagreements.

Some of the panel members (Yannis) recommended that scientists interested in interdisciplinary work should wait until after tenure. This idea was rejected by others (Miron), stating that "6 or 7 years is too long to wait." While I agree with Miron, Yannis' statement resembles the truth in that much of the best interdisciplinary CS work is done after tenure. Tenure may provide the liberation and independence that encourages the risky strategy and extreme time investment of scientific DB work.

It was asserted that PostDocs were a good time to expand one's research skills, perhaps taking a PostDoc in a different field than one's PhD. Miron cited his son, who went from wet work to computer (simulation?) work in chemistry. I chimed in, saying that PostDocs seem to be short and narrow as a transition to faculty jobs. Susan disagreed, pointing out that at UPENN they have longer-term interdisciplinary postdocs specifically to address this need. Jignesh got my back, saying that most PostDocs seem to stay in the same "research silo". Conclusion: we need to recast the role of PostDocs.

A good time was had by all. Nancy (my wife) snuck in the back (without a badge) to watch. Her thought was that the whole panel was delightfully uncivil and confrontational. She loved this in that she is tired of watching people kowtow and pretend to be nice. In all, the disagreements were charming and love-spirited (that's the opposite of mean-spirited, right?).

· 2008/04/17 · Randal Burns · 0 Comments
