Woodchuck is a framework for scheduling the transmission of delay-tolerant data, such as RSS feeds, email and software updates. Woodchuck aims to maximize data availability (the probability that the data the user wants is accessible) while minimizing the incurred costs (in particular, data transfer charges and battery energy consumed). By scheduling data transfers when conditions are good, Woodchuck keeps data subscriptions up to date while saving battery power, reducing the impact of data caps and hiding spotty network coverage.
At the core of Woodchuck is a daemon. This centralized service reduces redundant work and facilitates coordination of shared resources. Redundant work is reduced because only a single entity needs to monitor network connectivity and system activity. Further, because the daemon starts applications when they should perform a transfer, applications do not need to wait in the background to perform automatic updates thereby freeing system resources. With respect to the coordination of shared resources: the cellular data transmission budget and the space allocated for prefetched data need to be allocated among the various programs.
Applications need to be modified to benefit from Woodchuck. Woodchuck needs to know about the streams that the user has subscribed to and the objects which they contain as well as related information such as an object's publication time. Woodchuck also needs to be able to trigger data transfers. Finally, Woodchuck's scheduler benefits from knowing when the user accesses objects. In my experience, the changes required are relatively non-invasive and not difficult. This largely depends, however, on the structure of the application.
I considered implementing Woodchuck without requiring application modifications. A promising approach is to use a transparent proxy that examines the request stream to identify data to prefetch. This requires implementing a proxy for each supported protocol. Supporting a handful of common protocols, such as RSS over HTTP, is fairly straightforward; however, many services, e.g., Twitter and Facebook, use their own APIs. Transparent proxying also only works when the data is transferred in the clear: if the data stream is encrypted, Woodchuck would have to perform a sort of man-in-the-middle attack. Another issue arises with prefetching: with a proxy, the application must still poll for updates, so the user only learns about new data when the application actually checks. In contrast, when a centralized service tells an application to transfer some data, the application can inform the user about the new data, e.g., that new email is available, as soon as it is transferred.
I designed Woodchuck's API to be easy to use. A major goal was to allow applications to progressively add support for Woodchuck: it should be possible to add minimal Woodchuck support and gain some benefit of the services that Woodchuck offers; more complete support results in higher-quality service.
To support Woodchuck, an application needs to do three things:
- register streams and objects;
- process upcalls: update a stream, transfer an object, and, optionally, delete an object's files; and,
- send feedback: report stream updates, object downloads and object use.
The rest of this document is written as a tutorial that assumes that you are using PyWoodchuck, the Python interface to Woodchuck. If you are using libgwoodchuck, a C interface, or the low-level DBus interface, this document is still a good starting point for understanding what your application needs to do.
To interface with Woodchuck, an application instantiates the PyWoodchuck class. PyWoodchuck's __init__ method requires two parameters: a human-readable name and the application's DBus service name. The human-readable name is the application's name, i.e., the name that normally appears in the title bar. Woodchuck uses this when it shows the user something about the application. The DBus service name serves two roles. First, Woodchuck uses it to start the application (for this to work, your application needs to install a DBus service file). Second, Woodchuck uses the DBus service name to uniquely identify the application and its associated streams and objects.
For a fictional podcast manager called Podchuck, we could initialize the connection to Woodchuck as follows:
from pywoodchuck import PyWoodchuck
import woodchuck

wc = PyWoodchuck(human_readable_name="Podchuck Podcast Manager",
                 dbus_service_name="org.podchuck")
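As noted above, Woodchuck can only start the application if a DBus service file is installed. The following is a minimal sketch that generates the contents of such a file. The executable path /usr/bin/podchuck and the install location mentioned in the comment are assumptions made for illustration; they vary by platform and packaging.

```python
def dbus_service_file(service_name, executable):
    """Return the text of a DBus session service file.

    service_name should be the same name that is passed to PyWoodchuck
    as dbus_service_name. The executable is whatever program should be
    started when a message is sent to that name.
    """
    return (
        "[D-BUS Service]\n"
        "Name=%s\n"
        "Exec=%s\n" % (service_name, executable)
    )


# Hypothetical executable path. The file is typically installed as
# something like /usr/share/dbus-1/services/org.podchuck.service.
contents = dbus_service_file("org.podchuck", "/usr/bin/podchuck")
print(contents)
```

Whatever name appears in the Name field must match the DBus service name that the application registers with Woodchuck, otherwise Woodchuck's attempt to start the application will not reach it.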
Typically, an application does not directly instantiate a PyWoodchuck object but subclasses it and implements the standard callbacks. If the application directly instantiates PyWoodchuck or sets the request_feedback parameter of PyWoodchuck's __init__ method to False, Woodchuck does not deliver any callbacks. This is useful if the application is split into a front-end GUI and a back-end daemon, in which case the front end reports object use, but only the back end processes the upcalls. How to subclass PyWoodchuck to receive upcalls is shown below in the section "Processing Upcalls."
Registering Streams and Objects
To schedule a transmission, Woodchuck needs to know about it. Woodchuck uses two abstractions for representing transmissions: streams and objects. A stream is updated periodically and contains information about new and updated objects; an object is the data that the user is actually interested in and is typically downloaded once. In the case of an RSS reader, the RSS file would be represented by a stream and the individual articles by objects. Oftentimes, a part of each object is delivered inline. For instance, an RSS feed for a blog usually includes the blog posts' text, but does not embed any referenced images.
When an application registers a stream or an object, Woodchuck assigns it an identifier. To ease application integration, an application can also associate an identifier of its choosing with the stream or object. The motivation for supporting application-assigned identifiers is that there is usually some existing mechanism for identifying the objects. If possible, we want to reuse that and not burden the application with having to maintain a mapping between its identifiers and Woodchuck's. Woodchuck provides an API to locate a stream or an object in a given stream with a particular application-assigned identifier. Possible application-assigned identifiers include a database record's primary key, and an RSS feed's URI. Whatever the application uses as the application-assigned identifier, it should be relatively stable.
For Podchuck, our example application, we use a feed or podcast's URL as the application-assigned identifier. To register a new feed, we use the stream_register method (we assume that there is an object called 'feed' with the attributes title and url):
if wc.available():
    # Register the feed. Indicate that the feed should be updated
    # approximately every six hours.
    wc.stream_register(stream_identifier=feed.url,
                       human_readable_name=feed.title,
                       freshness=6 * 60 * 60)
We first check that Woodchuck is really available: if Woodchuck is not available, the application should not fail; the application should treat Woodchuck as an optional service and continue to work if it is not present. If Woodchuck is present, we register the stream. We use the feed's URL as its application-assigned identifier and the feed's title as its human readable name. The freshness parameter is a hint to Woodchuck indicating approximately how often to update the stream. Woodchuck revises this value based on the user's actual use of objects in the stream.
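One way to keep Woodchuck optional is to isolate the dependency behind a small helper, so that the rest of the application never has to care whether PyWoodchuck is installed or running. The sketch below is one possible arrangement, not part of PyWoodchuck itself: connect_to_woodchuck and register_feed are hypothetical helper names, and catching every exception during construction is an assumption about how failures might surface.

```python
try:
    from pywoodchuck import PyWoodchuck
except ImportError:
    # PyWoodchuck is not installed: the application runs without it.
    PyWoodchuck = None


def connect_to_woodchuck(name, service_name):
    """Return a PyWoodchuck instance, or None if it cannot be created."""
    if PyWoodchuck is None:
        return None
    try:
        return PyWoodchuck(human_readable_name=name,
                           dbus_service_name=service_name)
    except Exception as e:
        # Assumption: any failure here just means "carry on without it".
        print("Woodchuck unavailable: %s" % (e,))
        return None


def register_feed(wc, url, title):
    """Register a feed if Woodchuck is usable; return whether we did."""
    if wc is None or not wc.available():
        return False
    wc.stream_register(stream_identifier=url,
                       human_readable_name=title,
                       freshness=6 * 60 * 60)
    return True
```

The application would call connect_to_woodchuck once at startup and pass the result (possibly None) to helpers such as register_feed, which silently do nothing when Woodchuck is absent.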
Registering an object is similar to registering a stream except instead of invoking a method on the PyWoodchuck object, we invoke a method on the stream that contains the object. To get the stream object, we simply index the PyWoodchuck object using the stream's application-assigned identifier: the Woodchuck object acts like a dictionary mapping streams' application-assigned identifiers to PyWoodchuck streams. (Similarly, stream objects act like dictionaries mapping objects' application-assigned identifiers to PyWoodchuck objects.)
if wc.available():
    wc[feed.url].object_register(
        object_identifier=podcast.url,
        human_readable_name=podcast.title,
        expected_size=podcast.size)
In addition to the application-assigned identifier and human-readable name, we also provide the object's expected size. This is the amount of disk space that the object requires after the transfer completes. If the object is to be uploaded and will be deleted after being uploaded, this value should be negative, indicating the amount of disk space that will be freed.
If an object is updated regularly, e.g., a weather report or stock information, use the transfer_frequency parameter to indicate to Woodchuck approximately how often it should download the object (in seconds).
You should register a stream whenever the user subscribes to a resource. Likewise, whenever a stream is updated, you should register any newly discovered objects. If, after updating a feed, an object has been updated, you can tell Woodchuck to fetch it again by setting its need_update property to True:
wc[feed.url][podcast.url].need_update = True
Woodchuck automatically clears this after the object has been marked as being successfully transferred.
When the user unsubscribes from a feed, the application should unregister the stream:
del wc[feed.url]
Similarly, when an object becomes unavailable, the application should unregister it:
del wc[feed.url][podcast.url]
The first time an application detects that Woodchuck is available, it should register all of its streams and objects with Woodchuck. In case some changes were made and not registered, it is advisable to synchronize the application's database with Woodchuck's every time the application starts.
The following example shows how to check that a series of podcast feeds and the podcasts they contain match what Woodchuck knows about. There are two cases we need to consider: a feed or podcast is not yet registered with Woodchuck; and, the user unsubscribed from a feed or a podcast is no longer available, but it is still registered with Woodchuck.
if wc.available():
    # Get the list of streams (feeds) that are already registered.
    registered_feeds = list(wc.keys())

    # Iterate over the list of feeds.
    for feed in get_list_of_feeds():
        if feed.url in registered_feeds:
            # The feed is already registered with Woodchuck.
            registered_feeds.remove(feed.url)
        else:
            # The feed is not yet registered with Woodchuck.
            try:
                # Indicate that the feed should be updated
                # approximately every six hours.
                wc.stream_register(stream_identifier=feed.url,
                                   human_readable_name=feed.title,
                                   freshness=6 * 60 * 60)
            except woodchuck.Error as e:
                print("Registering feed %s with Woodchuck: %s"
                      % (feed.title, str(e)))
                continue

        # Do the same for the stream's objects. The list of registered
        # podcasts is fetched once, before the loop, so that the
        # removals below are not lost.
        registered_podcasts = list(wc[feed.url].keys())
        for podcast in feed.get_list_of_podcasts():
            if podcast.url in registered_podcasts:
                # The podcast is already registered with Woodchuck.
                registered_podcasts.remove(podcast.url)
            else:
                try:
                    wc[feed.url].object_register(
                        object_identifier=podcast.url,
                        human_readable_name=podcast.title,
                        expected_size=podcast.size)
                except woodchuck.Error as e:
                    print("Registering podcast %s with Woodchuck: %s"
                          % (podcast.title, str(e)))
                    continue

                # We just registered the podcast with Woodchuck. If
                # it is already downloaded or was already listened
                # to, tell Woodchuck to not bother transferring it.
                if (podcast_downloaded(podcast)
                        or podcast_heard(podcast)):
                    wc[feed.url][podcast.url].dont_transfer = True

        # registered_podcasts contains podcasts registered with
        # Woodchuck, but which are no longer available. Remove
        # them.
        for podcast_url in registered_podcasts:
            del wc[feed.url][podcast_url]

    # registered_feeds contains feeds registered with Woodchuck, but
    # which the user no longer subscribes to. Remove them.
    for feed_url in registered_feeds:
        del wc[feed_url]
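The two cases handled by the synchronization example boil down to two set differences between the identifiers the application knows about and the identifiers registered with Woodchuck. The sketch below shows just that core computation on plain lists of identifiers; plan_sync is a hypothetical helper name and the URLs are made up.

```python
def plan_sync(local_ids, registered_ids):
    """Work out what must change to bring Woodchuck in sync.

    local_ids: identifiers the application currently knows about.
    registered_ids: identifiers currently registered with Woodchuck.
    Returns (to_register, to_unregister) as sorted lists.
    """
    local = set(local_ids)
    registered = set(registered_ids)
    return sorted(local - registered), sorted(registered - local)


to_register, to_unregister = plan_sync(
    ["http://a.example/feed", "http://b.example/feed"],
    ["http://b.example/feed", "http://c.example/feed"])
print(to_register)    # ['http://a.example/feed']
print(to_unregister)  # ['http://c.example/feed']
```

The same function can be applied once to the feeds and then once per feed to its podcasts.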
Processing Upcalls
To receive upcalls, instead of directly instantiating PyWoodchuck, the application subclasses PyWoodchuck and implements the desired callbacks. To avoid missing upcalls, the application should not block the thread running the main loop: DBus holds messages for up to 25 seconds. This restriction matters most for stream updates and object transfers: these operations involve the network and, thus, are potentially long running. There are a few strategies that the application can use, including event-driven programming, threading, and using another process. It doesn't matter to Woodchuck what strategy the application chooses; Woodchuck just wants to know if the operation succeeds or fails.
If the application does miss messages or does not respond, Woodchuck will invoke the upcall again. As such, when the application receives an upcall, it should not blindly enqueue the job: it should first check whether it is currently executing the job or whether the job is already enqueued.
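A minimal sketch of the threading strategy combined with the duplicate check might look as follows. It uses only the Python 3 standard library; JobQueue and its methods are hypothetical names and are not part of PyWoodchuck. Jobs are identified by a key chosen by the caller, and a request whose key is already enqueued or running is ignored.

```python
import queue
import threading


class JobQueue(object):
    """Run jobs on a worker thread, ignoring duplicate requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()  # Keys that are enqueued or running.
        self._jobs = queue.Queue()
        worker = threading.Thread(target=self._run)
        worker.daemon = True
        worker.start()

    def enqueue(self, key, func, *args):
        """Schedule func(*args) unless a job with this key is pending.

        Returns True if the job was enqueued and False if it was a
        duplicate. Never blocks on the job itself.
        """
        with self._lock:
            if key in self._active:
                return False
            self._active.add(key)
        self._jobs.put((key, func, args))
        return True

    def _run(self):
        while True:
            key, func, args = self._jobs.get()
            try:
                func(*args)
            except Exception as e:
                print("Job %r failed: %s" % (key, e))
            finally:
                # Forget the key so that the job can be requested again.
                with self._lock:
                    self._active.discard(key)
                self._jobs.task_done()

    def wait(self):
        """Block until all enqueued jobs have finished."""
        self._jobs.join()
```

A stream update callback could then return immediately after calling something like jobs.enqueue(("update", stream.identifier), feed_update, stream.identifier), leaving the main loop free to process further DBus messages. The job is still responsible for reporting success or failure to Woodchuck when it finishes.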
Because PyWoodchuck implements a DBus service, it needs to integrate with the main loop. This means that the application must call either DBusGMainLoop (defined in dbus.mainloop.glib) or DBusQtMainLoop (defined in dbus.mainloop.qt) before making any use of DBus.
To start an application, Woodchuck directs upcalls to the application's DBus service name. Once the application has started and registered with Woodchuck (or, rather, once some process has registered to receive upcalls for the application), subsequent messages will be sent to that process's DBus private name. Again, because DBus only queues messages for 25 seconds, to avoid missing any messages, the application should promptly register its DBus service name.
The following example shows how to receive the stream update and object transfer upcalls. Note: to ease extending the callback interface, the application should always include the kwargs parameter in the callbacks' signatures. The callback functions call either feed_update or podcast_download to do the actual work.
import sys

import dbus
import dbus.service
from dbus.mainloop.glib import DBusGMainLoop

from pywoodchuck import PyWoodchuck
import woodchuck

# Set up main loop integration before making any use of DBus.
DBusGMainLoop(set_as_default=True)

# The name claimed on the bus must be the same name that is passed
# to PyWoodchuck: Woodchuck uses it to start the application.
DBUS_SERVICE_NAME = "org.woodchuck.podchuck"

class Podchuck(PyWoodchuck):
    def __init__(self):
        try:
            self.bus_name = dbus.service.BusName(
                DBUS_SERVICE_NAME, bus=dbus.SessionBus(),
                do_not_queue=True)
        except dbus.exceptions.NameExistsException as e:
            print("Already running (Unable to claim %s: %s)."
                  % (DBUS_SERVICE_NAME, str(e)))
            sys.exit(1)

        PyWoodchuck.__init__(
            self, human_readable_name="Podchuck Podcast Manager",
            dbus_service_name=DBUS_SERVICE_NAME)

    def stream_update_cb(self, stream, **kwargs):
        print("stream update called on %s" % (str(stream.identifier),))
        feed_update(stream.identifier)

    def object_transfer_cb(self, stream, object, version, filename,
                           quality, **kwargs):
        print("object transfer called on stream %s, object %s"
              % (stream.identifier, object.identifier))
        podcast_download(stream.identifier, object.identifier)
There are two types of feedback that the application should provide: when a stream is updated or an object is transferred; and, when the user uses an object.
Stream Updates and Object Transfers
When a stream is updated, the application should indicate this to Woodchuck--whether it is due to an upcall or because the user clicked on the update now button. Further, the application should register any newly discovered objects and any objects for which an update is available.
if wc.available():
    # Register the update with Woodchuck.
    wc[feed.url].updated(
        # The ways in which the user was informed of the update, if any.
        indicator=...,
        # The number of bytes transferred and the time.
        transferred_down=..., transferred_up=...,
        transfer_time=..., transfer_duration=...,
        # The number of new objects and updated objects.
        new_objects=len(new_podcasts),
        updated_objects=len(updated_podcasts),
        # The number of objects that were transferred completely inline.
        objects_inline=...)

    # Register any new objects with Woodchuck. Note: if the object
    # was completely transferred inline, you need to also call
    # wc[feed.url][podcast.url].transferred(...).
    for podcast in new_podcasts:
        try:
            obj = wc[feed.url].object_register(
                object_identifier=podcast.url,
                human_readable_name=podcast.title,
                expected_size=podcast.size)
        except (KeyError, woodchuck.Error) as e:
            print("Registering podcast %s: %s" % (podcast.title, str(e)))
            # Registration failed, so there is no object to annotate.
            continue
        obj.publication_time = podcast.publication_time

    # Tell Woodchuck to (re)download any object for which an update
    # is available.
    for podcast in updated_podcasts:
        try:
            wc[feed.url][podcast.url].need_update = True
        except (KeyError, woodchuck.Error) as e:
            print("Marking podcast %s as having an update: %s"
                  % (podcast.title, str(e)))
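The example above assumes that the lists new_podcasts and updated_podcasts already exist. How an application computes them depends on its own data model. As one possible sketch, suppose the application remembers the publication time it last saw for each podcast URL; then a freshly parsed feed can be split into new and updated entries as below. The helper name classify_entries, the use of publication time to detect updates, and the sample data are all assumptions made for illustration.

```python
def classify_entries(known, entries):
    """Split feed entries into new and updated ones.

    known: dict mapping a podcast URL to the publication time that the
        application last saw for it.
    entries: iterable of (url, publication_time) pairs from the feed.
    Returns (new_urls, updated_urls).
    """
    new_urls, updated_urls = [], []
    for url, published in entries:
        if url not in known:
            new_urls.append(url)
        elif published > known[url]:
            # Assumption: a later publication time means new content.
            updated_urls.append(url)
    return new_urls, updated_urls


known = {"http://example.com/ep1.mp3": 100,
         "http://example.com/ep2.mp3": 200}
entries = [("http://example.com/ep1.mp3", 100),
           ("http://example.com/ep2.mp3", 250),
           ("http://example.com/ep3.mp3", 300)]
new_urls, updated_urls = classify_entries(known, entries)
print(new_urls)      # ['http://example.com/ep3.mp3']
print(updated_urls)  # ['http://example.com/ep2.mp3']
```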
The indicator parameter to the stream update method indicates how the user was told about the update (if at all). It is a bitwise OR of values drawn from woodchuck.Indicator. An email application might indicate new mail by vibrating the phone and blinking an LED whereas a podcast manager might only show a small in-application message.
When an object is transferred, you report this to Woodchuck using the transferred function:
if wc.available():
    wc[feed.url][podcast.url].transferred(
        indicator=...,
        transferred_up=..., transferred_down=...,
        transfer_time=..., transfer_duration=...,
        object_size=...,
        files=...)
You specify either the object_size parameter or the files parameter. object_size is the size of the object in bytes. files is an array of (filename, dedicated, deletion policy) tuples. The dedicated predicate indicates whether the file is used exclusively by the object. The deletion policy (drawn from woodchuck.DeletionPolicy) specifies whether Woodchuck can delete the file without telling the application (DeleteWithoutConsultation), may ask the application to delete the file (DeleteWithConsultation), or whether the file is precious and may not be considered for deletion (Precious).
Reporting object use is important as Woodchuck's scheduler adapts to user behavior. If a user typically accesses the objects in a stream shortly after they have been downloaded, Woodchuck can infer that frequent updates and prompt downloads are important. In such cases, Woodchuck may decide to update the stream over the cellular connection, if the data transfer budget is sufficient.
Although it is possible for Woodchuck to observe application behavior externally using something like inotify, oftentimes file accesses do not correspond to actual object uses: the application may just be reading the file to generate a thumbnail or some other preview. Reporting object use reduces the uncertainty.
To report object use, the application invokes the used method on the PyWoodchuck object as follows:
if wc.available():
    try:
        wc[feed.url][podcast.url].used(start, duration, use_mask)
    except (KeyError, woodchuck.Error) as e:
        print("Marking podcast %s as having been used: %s"
              % (podcast.title, str(e)))
The used method takes three optional parameters: start, duration and use_mask. If any of the parameters are unknown or difficult to get, it is reasonable to not supply the corresponding arguments. start is the time the user began using the object (in seconds since the epoch) and duration is how long the user used the object (in seconds). use_mask is a 64-bit mask indicating which parts of the object the user used when the object is viewed as a linear progression. The least-significant bit corresponds to the first 64th of the object, etc. A PDF could be linearized by considering the pages as a linear sequence. This linearization may be very different from one that considers the PDF as a byte stream: each page in a PDF file consists of links to objects, which can be located anywhere in the file.
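To make the use_mask encoding concrete, here is a small sketch that builds such a mask from the portions of an object that were used, expressed as fractions of its linearized length. The helper name use_mask_from_ranges and the fraction-based input are assumptions made for illustration; an application may track usage in some other unit (pages, seconds, bytes) and convert accordingly.

```python
import math


def use_mask_from_ranges(ranges):
    """Return a 64-bit use mask for the given (start, end) fractions.

    Each range is a pair of fractions with 0 <= start < end <= 1 that
    describes a used part of the linearized object. Bit 0 (the least
    significant bit) covers the first 64th of the object and bit 63
    covers the last.
    """
    mask = 0
    for start, end in ranges:
        if not 0.0 <= start < end <= 1.0:
            raise ValueError("invalid range: %r" % ((start, end),))
        first = int(start * 64)
        # The end is exclusive: a range that ends exactly on a
        # boundary does not touch the following 64th.
        last = min(63, int(math.ceil(end * 64)) - 1)
        for bit in range(first, last + 1):
            mask |= 1 << bit
    return mask


# The user listened to the first quarter of a podcast.
print(hex(use_mask_from_ranges([(0.0, 0.25)])))  # 0xffff
```

Any 64th that is touched at all is marked as used, so short skims still show up in the mask.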
That's all there is to it! For more detailed information please refer to the API documentation.