One of the problems we faced with a data archive is that some of the data, such as astronomical sky survey data, is very large: too large to be realistically kept on a disk farm given how rarely it will be accessed, however valuable it is.
In our model the data will sit on nearline storage in an HFS-like file system. This means that a user requesting data cannot expect an instant download, especially for large datasets, which must first be retrieved and read into the cache.
To this end we decided to implement a dropbox-style mechanism: someone requests data and, once it has been retrieved and is available, they get an obfuscated link to download it.
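As a rough illustration of that flow, here is a minimal Java sketch of a request that only yields a download link once the data has been staged. The class name `StagedRequest`, the status names, and the `example.org` URL are all hypothetical and are not taken from our module; the real code also has to deal with persistence and notifying the user.

```java
import java.util.Optional;

// Hypothetical sketch: a data request that exposes a link only when the data is ready.
class StagedRequest {
    enum Status { REQUESTED, RETRIEVING, READY }

    private final String token;               // obfuscated identifier used in the link
    private Status status = Status.REQUESTED; // every request starts out as merely requested

    StagedRequest(String token) {
        this.token = token;
    }

    // Called when retrieval from nearline storage begins.
    void startRetrieval() {
        status = Status.RETRIEVING;
    }

    // Called once the data has been read into the cache.
    void markReady() {
        status = Status.READY;
    }

    // Returns the download link only when the data is available, otherwise empty.
    Optional<String> downloadLink(String baseUrl) {
        return status == Status.READY
                ? Optional.of(baseUrl + "/" + token)
                : Optional.empty();
    }

    public static void main(String[] args) {
        StagedRequest request = new StagedRequest("abc123");
        System.out.println(request.downloadLink("https://example.org/pickup").isPresent());
        request.startRetrieval();
        request.markReady();
        System.out.println(request.downloadLink("https://example.org/pickup").orElse("not ready"));
    }
}
```

The point of the `Optional` return is simply that a caller cannot hand out a link before the retrieval has finished.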
Our inspiration for the mechanism was Zend.to, an open source equivalent to YouSendIt, from the University of Southampton in the UK.
In the end we wrote our own module in Java, which was inspired by, rather than based on, the Zend.to source.
Rahul Khanna, who did the coding, explains:
While the general idea of allowing users to download a file has been adapted in the Collection Request module, most other functionality did not directly suit our needs. Here are some of the points that were considered while attempting to replicate the functionality of ZendTo:
1) I obtained the technical understanding of how Zend.to works from http://zend.to/technical.php . The database structure helped in knowing what data is needed to have an effective dropbox and pickup system.
2) The process of uploading files was already implemented in a separate module. ZendTo’s upload functionality was not replicated.
3) ZendTo uses 2 random strings – a claimID and a claimPasscode, so that the details of the dropoff can be sent by 2 different methods if required for security. I’ve implemented the same method of using 2 random strings to access the dropbox; however, the actual random string generation is completely different.
4) ZendTo uses CAPTCHA to determine if the person requesting a dropbox is human. We’ll be authenticating users with their username and password. In situations where ZendTo uses authentication, we’re authenticating using CAS.
5) Just like ZendTo, the IP address and userID of the person performing an action (creating or accessing a dropbox) are logged. Also, like Zend.to, relevant timestamps are stored on dropbox creation and access.
6) ZendTo uses virus scanning. The Collection Request module doesn’t. (Hopefully the infrastructure providers will take care of it).
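To make point 3 a little more concrete, here is a minimal Java sketch of generating two independent random strings for a dropbox and checking a supplied value against a stored one. This is an assumption about one reasonable way to do it, not the module's actual generation scheme: the class `PickupCredentials`, the helper names, and the 12-byte length are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical sketch: two random strings in the spirit of claimID and claimPasscode.
class PickupCredentials {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a URL-safe random string built from numBytes cryptographically random bytes.
    static String randomToken(int numBytes) {
        byte[] bytes = new byte[numBytes];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Compares a stored token with a supplied one using MessageDigest.isEqual,
    // which is intended for comparing secrets.
    static boolean matches(String expected, String supplied) {
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                supplied.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The two strings are generated independently so they can be sent by different routes.
        String claimId = randomToken(12);
        String claimPasscode = randomToken(12);
        System.out.println("claimID:       " + claimId);
        System.out.println("claimPasscode: " + claimPasscode);
        System.out.println("passcode ok:   " + matches(claimPasscode, claimPasscode));
    }
}
```

Because each string comes from its own call to `randomToken`, knowing one tells an attacker nothing about the other, which is what makes sending them by two different methods worthwhile.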
I think it’s fair to say that no code has been copied or adapted from ZendTo even though its broad functioning has been replicated to suit our requirements.
The code is available at https://github.com/anu-doi/DataCommons