Our candidate programmatic ORCID updater

Back in March 2014 we made our prototype application for programmatic ORCID updates available. It was designed only as a prototype and was never intended for general use in a production environment.

As of 1 April 2014 ORCID is moving to release 1.2 of its schema, which may break our app.

When we say may we really do mean may: we don’t know, as we haven’t tested the app against the new version of the ORCID schema. Just now we’re working on some other projects and don’t have the bandwidth to test things properly or, more importantly, to update them if they break.

However, we will test and, if necessary, update our tool later this year – we just can’t say when …


Citing dynamic datasets

There’s an assumption in data citation that all the datasets in a repository are fixed and final, just as a research paper, once published, is in its fixed and final form.

In practice this isn’t quite true: there are many use cases where the data in a datastore is subject to change – the obvious examples being longitudinal or observational studies, which these days can be extended to cover datasets derived from automated sensors, with consequent errors due to technical failures, and subsequent fix-ups.

Most researchers actually work only on an extract of the data, and current best practice is to redeposit the extract, along with a description of how it was created, as if it were a new dataset.

This is fine for small datasets, but it does not scale well for big data, be it numerical, sensor-based or, even worse, multimedia – video recordings of language and their accompanying transcripts being one use case.

I have previously thought about using Bazaar, Subversion or Git as a version control system, creating a new object each time the dataset updates. That also suffers from scaling problems, but it at least has the benefit of being able to recreate an extract against the dataset at a known point in time.

Consequently I was interested to hear Andreas Rauber of TU Wien speak on the Research Data Alliance approach to dynamic dataset citation [PDF of my webinar notes].

Essentially their approach is to use standard journaling techniques to track changes to the database, and to store timestamped queries so that a query can be rerun against the dataset in a known state to recreate an extract. This approach is highly scalable as regards storage, and allows the flexibility of rerunning queries either against the dataset as it was at a prior point in time or against the current dataset.

As a solution it seems quite elegant; my only worry would be the computational power required to handle and rerun changes where the dataset has been very chatty, with many updates, as would be the case with some automated sensor-derived datasets.

I also wonder about its applicability to datasets which are essentially executable binary elements, such as IPython notebooks and other forms of mathematical models. Undoubtedly, though, this approach solves the problem of continually redepositing dataset extracts, with their consequent impact on storage – but then, this week at least, storage is cheap …


The Data Commons and GitHub

Over the past few months there’s been a growing interest in archiving software projects, such as can be done via Zenodo. This is part of a more general problem – as researchers increasingly use environments such as IPython notebooks for their research, there’s a growing need to archive these notebooks to allow the derivation of their results to be replayed.

Inspired by Stuart Lewis’s recent post on GitHub-to-repository deposit, we’ve recently added a mechanism to the Data Commons to allow the import of metadata from GitHub, allowing the creation of an object record for a GitHub project.

Rather than import the code, we create a referential entry for the project, although of course files could be downloaded and added manually if a local copy were required.
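The essence of such a mechanism is a simple mapping from GitHub’s repository API response to a descriptive record plus a pointer back to GitHub. The sketch below shows the shape of that mapping – the record field names are illustrative, not the actual Data Commons schema, and `fetch_repo` is a hypothetical helper requiring network access:

```python
# Sketch: build a referential metadata record from GitHub's repository API
# (GET https://api.github.com/repos/{owner}/{repo}). Field names are
# illustrative only.
import json
from urllib.request import urlopen

def fetch_repo(owner, repo):
    # Hypothetical helper: needs network access; the GitHub v3 API returns JSON
    with urlopen(f"https://api.github.com/repos/{owner}/{repo}") as resp:
        return json.load(resp)

def repo_to_record(repo_json):
    # Keep only descriptive metadata plus a link back to the project -
    # the code itself stays on GitHub
    return {
        "title": repo_json["full_name"],
        "description": repo_json.get("description") or "",
        "url": repo_json["html_url"],
        "created": repo_json.get("created_at"),
        "licence": (repo_json.get("license") or {}).get("name"),
    }
```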

We would hope to generalise this mechanism in the future to allow the import of dataset records from other repositories and stores, meaning that content need not always live in the same place as its metadata …


ANU Data Commons Documentation page updated

The documentation page, covering both technical and user documentation for the ANU Data Commons, has been updated.

Documentation covers:

  • The Data Commons itself
  • The ANU Data uploader
  • The Metadata stores project
  • Our prototype ORCID updater
  • Links to project source code

To check out any of these please go to https://itservices.anu.edu.au/research-computing/anu-data-commons/documentation/

The documentation is covered by a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Australia licence and the source code by a GPLv3 licence.


The Online Revolution: Education for Everyone [Webinar]

Webinar report: The Online Revolution – Education for Everyone

Andrew Ng, Coursera/Stanford

One of the advantages of the online revolution is that it makes it comparatively easy to follow presentations elsewhere, in this case an ACM presentation on the development of MOOCs by Andrew Ng, one of the founders of Coursera.

I won’t summarise the whole presentation – there’s a copy online. I don’t know if you have to be an ACM member to view it, but I’m sure it would be possible to negotiate access.

So rather than summarise the presentation, I’ll extract what to me are the key points:


  • Business and computer science courses predominate, but the single most popular course is one on social psychology
  • India and China lead signups, but engagement is global
  • Coursera has put significant work into infrastructure to improve content delivery to China, plus localisation of introductory material
  • Many students already have a significant qualification; MOOCs appear to provide an opportunity for ongoing development rather than a substitute for conventional study
    • implies many students have already learned the habits of study

Apps and access

  • providing Android and iOS apps for tablet computers has allowed students in areas with poor internet access to download content and then work offline
    • the Android app is extremely well used in China
  • the poorest and most disadvantaged students usually face the most difficulty in gaining internet access (cf the development of Linux-based internet access programs in Latin America)

Credentials and completion

  • most students who complete the first two weeks go on to complete the course
  • students who register (and pay) for a completion certificate usually complete the course
  • considerable work goes into making sure that the student submitting work is the same person as the registered student
  • credentials are increasingly accepted by recruiters as evidence of ongoing personal development
  • there is a suspicion that some of the non-completers who follow a significant part of a course are either using it as a substitute for background reading or as an experiment to see whether a particular subject area is for them

Flipped classrooms and MOOCs

  • Flipped classroom – the idea that classroom/lecture time is used for discussion and experimentation, while students review course material in their own time
    • the reverse of the nineteenth-century model of lecture delivery and work at home
    • the nineteenth-century model was a result of the technologies of the time
  • The flipped-classroom-versus-MOOC debate is sterile – MOOC material can be successfully reused in a flipped model (cf the reuse of OU material by traditional universities in the UK, or the exchange of online material to allow the provision of courses where it would otherwise not be possible)



Comprehensive data management at UCL

Yesterday, ANDS hosted a webinar by Max Wilkinson, head of research data services at UCL – clicking on the link will take you to a PDF of my notes.

I have my own views on what a research data service might look like but I like Max Wilkinson’s take for a number of reasons:

  • it’s incremental – rather than enforcing the use of a service, let it spread and be adopted gradually – this both gives the service time to grow, and avoids the inevitable problems when a one-size-fits-all big-bang approach fails to deliver
  • it leverages existing competences – rather than building a whole new storage infrastructure, it uses existing infrastructure to underpin the service – that way you don’t need to implement storage at the same time as the service
  • there’s an understanding that researchers need work-in-progress shared storage as well as archival storage, and that providing the former lays the ground for moving data to archival storage
  • using a project-based approach allows you to garner a sparse metadata record from day one – it also allows the tracking of a project’s life cycle
  • there’s an understanding that there’s a definite problem with data stored on legacy media, but that it’s a separate problem from managing born-digital data.

All in all an excellent webinar – I’d recommend reviewing the video once it’s online.

As always: my notes, my views, no one else’s.


Minting ORCID identifiers programmatically …


As part of our work to develop a solution to better manage metadata about researchers, publications and data, we set off down the road of linking publications to datasets (and by extension grey literature) deposited in various online collections, as well as linking to information on grants held, to try to build a picture of research activity.

This is useful in a lot of ways – for example, it potentially allows us to implement a service to automatically generate online profile or portfolio pages, highlighting a researcher’s most cited papers, their international collaborations, and so on.

One problem that we immediately hit was that of disambiguation – being able to accurately identify that the Fred Smith who deposited a set of photographs of Wari shroud wrappings is the same person as the Fredrick L Smith who published a monograph on the iconography of early Wari shroud wrappings.

Internally this is not really a problem as we have a universal identifier scheme in place, but it does tend to break down when tracking collaborators at other institutions. One could imagine a scenario where colleagues elsewhere publish an analysis of data held in our data repository, and vice versa. This is particularly a problem where you have a number of large scale international collaborations.

Granting external people temporary identifiers is not a viable long-term answer: people change institution, change their name and change their email address, meaning that over time the information held may become invalid.

The obvious answer would be to use some form of persistent, universal, global identifier – one in which either the owner of the identifier or its provider is invested in keeping up to date.

In fact there are already a number of global identifiers in use, many of which are proprietary. Examples include Scopus identifiers, ResearcherID, NLA/Trove identifiers, Google Scholar profiles and the like.

However, none of these identifiers are truly universal, as not everyone has the same set of identifiers, and some people, such as early career researchers, may have none at all.

Our approach was to build a lightweight database, keyed to the institutional identifier, to record these identifiers for each researcher.

At the same time, ORCID was clearly gaining widespread adoption worldwide, and is tied neither to a field of study nor to a scholarly information service, making it suitable for use as a universal, career-lifetime persistent identifier.

As part of feasibility testing we decided to develop an ORCID minting tool, to demonstrate that we could programmatically take the information we already knew about a researcher and either create an ORCID iD for them from existing data sources or update their existing ORCID record.

The solution has been tested against the ORCID sandbox. This is very much a proof-of-concept exercise, but we believe that the code is sufficiently robust that it could form the basis of a generic ORCID client.

The application allows the user to

  • create an ORCID profile
  • update an ORCID profile
  • link existing publications to an ORCID profile
  • view the publication records stored in an ORCID profile
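For a flavour of the first operation, here is a sketch of the kind of orcid-message document the 1.2 API expects when creating a profile. The element names follow my reading of the 1.2 schema – check ORCID’s published XSD (and our tool’s source) before relying on them; the function itself is illustrative, not code from the updater:

```python
# Sketch: build an orcid-message XML document for profile creation.
# Element names follow my reading of the ORCID 1.2 schema; verify
# against the official XSD.
import xml.etree.ElementTree as ET

NS = "http://www.orcid.org/ns/orcid"

def build_profile_message(given_names, family_name, email):
    ET.register_namespace("", NS)
    msg = ET.Element(f"{{{NS}}}orcid-message")
    ET.SubElement(msg, f"{{{NS}}}message-version").text = "1.2"
    profile = ET.SubElement(msg, f"{{{NS}}}orcid-profile")
    bio = ET.SubElement(profile, f"{{{NS}}}orcid-bio")
    details = ET.SubElement(bio, f"{{{NS}}}personal-details")
    ET.SubElement(details, f"{{{NS}}}given-names").text = given_names
    ET.SubElement(details, f"{{{NS}}}family-name").text = family_name
    contact = ET.SubElement(bio, f"{{{NS}}}contact-details")
    mail = ET.SubElement(contact, f"{{{NS}}}email")
    mail.set("primary", "true")
    mail.text = email
    return ET.tostring(msg, encoding="unicode")
```

The resulting document would then be POSTed to the ORCID sandbox API with a member credential; the tool itself handles that authentication step.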

The code was developed by Genevieve Turner of the ANU Data Commons team and is available for download from https://github.com/anu-doi/orcid-updater under a GPLv3 licence. As it is prototype code, no warranty is made as to its suitability or fitness.

