Downloading data …

One of the problems we faced in building a data archive is that some of the data, such as astronomical sky survey data, is very large: too large to be realistically kept on a disk farm given how rarely it will be accessed, however valuable it is.

In our model the data will sit on nearline storage in an HFS-like file system. This means that a user requesting data cannot expect an instant download, especially for large datasets, which first need to be retrieved and read into the cache.

To this end we decided to implement a dropbox-style mechanism: someone requests data and, once it has been retrieved and is available, they get an obfuscated link to download it.

Our inspiration for the mechanism was Zend.to, an open source equivalent to Yousendit, from the University of Southampton in the UK.

In the end we wrote our own module in Java, which was inspired by, rather than based on, the Zend.to source.

Rahul Khanna, who did the coding, explains:

While the general idea of allowing users to download a file has been adapted in the Collection Request module, most other functionality did not directly suit our needs. Here are some of the points that were considered while attempting to replicate the functionality of ZendTo:

1) I obtained the technical understanding of how Zend.to works from http://zend.to/technical.php . The database structure helped in knowing what data is needed to have an effective dropbox and pickup system.

2) The process of uploading files was already implemented in a separate module. ZendTo’s upload functionality was not replicated.

3) ZendTo uses two random strings, a claimID and a claimPasscode, so that the details of the dropoff can be sent by two different methods if required for security. I’ve implemented the same approach of using two random strings to access the dropbox; however, the actual random string generation is completely different.

4) ZendTo uses CAPTCHA to determine if the person requesting a dropbox is human. We’ll be authenticating users with their username and password. In situations where ZendTo uses authentication, we’re authenticating using CAS.

5) Just like ZendTo, the IP address and userID of the person performing an action (creating or accessing a dropbox) are logged. Also like Zend.to, relevant timestamps are stored on dropbox creation and access.

6) ZendTo uses virus scanning. The Collection Request module doesn’t. (Hopefully the infrastructure providers will take care of it.)

I think it’s fair to say that no code has been copied or adapted from ZendTo even though its broad functioning has been replicated to suit our requirements.
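
To make points 3 and 5 above a little more concrete, here is a minimal sketch in Java. It is not taken from the DataCommons repository and does not show how Rahul generates the strings. The class name DropboxSketch, its methods, the 16-byte token length and the choice of SecureRandom with URL-safe Base64 are all assumptions made purely for illustration. It shows one plausible way to create a dropbox with two random strings and to record who did what, from where, and when.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;

// Illustrative only: every name here is hypothetical, not from the DataCommons code.
class DropboxSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // One logged action: what happened, who did it, from which IP, and when.
    static final class AccessEvent {
        final String action;
        final String userId;
        final String ipAddress;
        final Instant timestamp = Instant.now();

        AccessEvent(String action, String userId, String ipAddress) {
            this.action = action;
            this.userId = userId;
            this.ipAddress = ipAddress;
        }
    }

    // Two independent random strings, so they can be sent by different channels.
    private final String claimId = randomToken(16);
    private final String claimPasscode = randomToken(16);
    private final List<AccessEvent> events = new ArrayList<>();

    DropboxSketch(String userId, String ipAddress) {
        events.add(new AccessEvent("create", userId, ipAddress));
    }

    // Encode numBytes of cryptographically strong randomness as URL-safe text.
    static String randomToken(int numBytes) {
        byte[] bytes = new byte[numBytes];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Both strings must match; every attempt is logged, successful or not.
    boolean access(String id, String passcode, String userId, String ipAddress) {
        boolean ok = sameText(claimId, id) & sameText(claimPasscode, passcode);
        events.add(new AccessEvent(ok ? "access" : "access-denied", userId, ipAddress));
        return ok;
    }

    // MessageDigest.isEqual is used instead of String.equals so that the
    // comparison time does not depend on where the two values first differ.
    private static boolean sameText(String a, String b) {
        return MessageDigest.isEqual(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    String getClaimId() { return claimId; }
    String getClaimPasscode() { return claimPasscode; }
    List<AccessEvent> getEvents() { return Collections.unmodifiableList(events); }
}
```

In this sketch a caller would construct a DropboxSketch when the requested data becomes available, send the claim ID and passcode to the requester (by two different channels if desired), and call access(...) when the link is followed. In a real module the user ID would come from the CAS login and the events would go to a database rather than an in-memory list.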

The code is available at https://github.com/anu-doi/DataCommons

About dgm

Former IT professional, previously a digital archiving and repository person, ex research psychologist, blogger, twitterer, and amateur classical medieval and nineteenth century historian ...