The Data Commons and GitHub

Over the past few months there’s been a growing interest in archiving software projects, as can be done via Zenodo. This is part of a more general problem – as researchers increasingly use environments such as iPython notebooks for their research, there’s a growing need to archive these notebooks so that the derivation of the results can be replayed.

Inspired by Stuart Lewis’s recent post on a GitHub to repository deposit we’ve recently added a mechanism to the Data Commons to allow the import of metadata from GitHub to the Data Commons, allowing the creation of an object record for a GitHub project.

Rather than import the code, we create a referential entry for the project, although of course files could be downloaded and added manually if a local copy was required.
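As a rough illustration of what a referential entry might hold, the sketch below maps the JSON that GitHub's REST API returns for a repository (GET /repos/{owner}/{repo}) onto a simple record. The record's field names, the github_to_record function and the sample repository are assumptions made up for this example, not the Data Commons schema or its import code.

```python
from typing import Any


def github_to_record(repo: dict[str, Any]) -> dict[str, Any]:
    """Map GitHub repository metadata to a referential object record.

    `repo` is assumed to be the JSON body of GET /repos/{owner}/{repo}.
    The output field names are hypothetical.
    """
    licence = repo.get("license") or {}  # GitHub returns null when no licence is detected
    return {
        "title": repo["full_name"],
        "description": repo.get("description") or "",
        "creator": repo["owner"]["login"],
        "location": repo["html_url"],  # a pointer to the project; no code is copied
        "licence": licence.get("spdx_id"),
        "created": repo.get("created_at"),
        "type": "software",
        "files": [],  # stays empty unless files are downloaded and added manually
    }


# Hypothetical sample, shaped like a GitHub API response.
sample = {
    "full_name": "example-org/example-project",
    "description": "Analysis notebooks",
    "html_url": "https://github.com/example-org/example-project",
    "owner": {"login": "example-org"},
    "license": {"spdx_id": "GPL-3.0"},
    "created_at": "2014-01-01T00:00:00Z",
}
record = github_to_record(sample)
print(record["title"], "->", record["location"])
```

Keeping the files list empty by default mirrors the choice described above: the record points at the project rather than duplicating it.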

We would hope to generalise this mechanism in the future to allow the import of dataset records from other repositories and stores, meaning that content need not always be in the same place as the metadata …

Posted in Uncategorized | Leave a comment

ANU Data Commons Documentation page updated

The documentation page, covering both technical and user documentation for the ANU Data Commons, has been updated.

Documentation covers:

  • The Data Commons itself
  • The ANU Data uploader
  • The Metadata stores project
  • Our prototype ORCID updater
  • Links to project source code

To check out any of these please go to

The documentation is covered by a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Australia License and the source code by a GPL3 licence.


The Online Revolution: Education for Everyone [Webinar]

Webinar report: The Online Revolution – Education for Everyone

Andrew Ng Coursera/Stanford

One of the advantages of the online revolution is that it makes it comparatively easy to follow presentations elsewhere, in this case an ACM presentation on the development of MOOCs by Andrew Ng, one of the founders of Coursera.

I won’t summarise the whole presentation – there’s a copy online. I don’t know if you have to be an ACM member to view it, but I’m sure it would be possible to negotiate access.

So rather than summarise the presentation, I’ll extract what to me are the key points:


  • Business and computer science courses predominate, but the single most popular course is one on social psychology
  • India and China lead signups but engagement is global
  • Coursera has put significant work into infrastructure to improve content delivery to China, plus localisation of introductory material
  • Many students already have a significant qualification; MOOCs appear to provide an opportunity for ongoing development rather than substituting for conventional study
    • implies many students already have learned the habits of study

Apps and access

  • providing Android and iOS apps for tablet computers has allowed students in areas with poor internet access to download content and then work offline
    • Android app is extremely well used in China
  • Poorest and most disadvantaged students usually face most difficulties in gaining internet access (cf development of linux based internet access programs in Latin America)

Credentials and completion

  • most students if they complete the first two weeks complete the course
  • students who register (and pay) for a completion certificate usually complete the course
  • considerable work goes in to make sure that the student submitting work is the same person as the registered student
  • credentials increasingly accepted by recruiters as evidence of ongoing personal development
  • suspicion that some of the non-completers who follow a significant part of the course are either using it as a substitute for background reading or as an experiment to see if a particular subject area is for them

Flipped classroom and MOOCs

  • Flipped classroom – idea that classroom/lecture time used for discussion and experimentation and students review course material in their own time
    • Reverse of C19 model of lecture delivery and work at home
    • C19 model result of the technologies of the time
  • Flipped versus MOOC model is a sterile debate – MOOC material can be successfully reused in the flipped model (cf reuse of OU material by traditional universities in the UK, or exchanging online material to allow provision of courses where otherwise not possible)



Comprehensive data management at UCL

Yesterday, ANDS hosted a webinar by Max Wilkinson, head of research data services at UCL – clicking on the link will take you to a pdf of my notes.

I have my own views on what a research data service might look like but I like Max Wilkinson’s take for a number of reasons:

  • it’s incremental – rather than enforcing the use of a service let it spread and be adopted gradually – this both gives the service time to grow, and avoids the inevitable problems when a one size fits all big bang approach fails to deliver
  • it leverages off existing competences – rather than build a whole new storage infrastructure it uses existing infrastructure to leverage the service – that way you don’t need to implement storage at the same time as a service
  • there’s an understanding that researchers need work in progress shared storage as well as archival storage, and providing the former lays the ground for moving data to archival storage
  • using a project based approach allows you to garner a sparse metadata record from day one – it also allows the tracking of the life cycle of the project
  • there’s an understanding that there’s a definite problem with data stored on legacy media but it’s a separate problem from managing born digital data.

All in all an excellent webinar – I’d recommend reviewing the video once it’s online.

As always, my notes, my views, no one else’s.


Minting ORCID identifiers programmatically …


As part of our work to develop a solution to better manage metadata about researchers, publications and data, we set off down the road of trying to link publications to datasets, and by extension grey literature, deposited in various online collections as well as linking to information on grants held to try and build a picture of research activity.

This is useful in a lot of ways – for example it potentially allows us to implement a service to automatically generate online profile or portfolio pages and highlight a researcher’s most cited papers, or who has had the most international collaborations, and so on.

One problem that we immediately hit was that of disambiguation – being able to accurately identify that the Fred Smith who deposited a set of photographs of Wari shroud wrappings was the same person as the Fredrick L Smith who published a monograph on the iconography of early Wari shroud wrappings.

Internally this is not really a problem as we have a universal identifier scheme in place, but it does tend to break down when tracking collaborators at other institutions. One could imagine a scenario where colleagues elsewhere publish an analysis of data held in our data repository, and vice versa. This is particularly a problem where you have a number of large scale international collaborations.

Granting external people temporary identifiers is not a viable long term answer as people change institution, change their name, change their email address, meaning that over time the information held may become invalid.

The obvious answer would be to use some form of persistent universal global identifier, and one in which either the owner of the identifier, or the provider of the identifier, is invested in keeping up to date.

In fact there are already a number of global identifiers in use, many of which are proprietary. Examples of possible global identifiers include Scopus identifiers, ResearcherID, NLA/Trove identifiers, Google Scholar profiles and the like.

However, none of these identifiers are truly universal, as not everyone has the same set of identifiers, and some people, such as early career researchers, may have none at all.

Our approach was to build a lightweight database keyed to the institutional identifier to record these attributes for each researcher.
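To make the idea concrete, here is a minimal sketch of such a store using an in-memory SQLite database. The table layout, the scheme names and the sample identifier values are all assumptions for illustration; they are not the schema we actually deployed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE researcher_identifier (
        institutional_id TEXT NOT NULL,  -- the internal universal identifier
        scheme           TEXT NOT NULL,  -- e.g. 'orcid', 'scopus', 'researcherid'
        value            TEXT NOT NULL,
        PRIMARY KEY (institutional_id, scheme)
    )
    """
)


def set_identifier(institutional_id: str, scheme: str, value: str) -> None:
    """Record, or overwrite, one external identifier for a researcher."""
    conn.execute(
        "INSERT OR REPLACE INTO researcher_identifier VALUES (?, ?, ?)",
        (institutional_id, scheme, value),
    )


def identifiers_for(institutional_id: str) -> dict[str, str]:
    """Return every external identifier known for one researcher."""
    rows = conn.execute(
        "SELECT scheme, value FROM researcher_identifier WHERE institutional_id = ?",
        (institutional_id,),
    )
    return dict(rows.fetchall())


# Hypothetical researcher with two external identifiers.
set_identifier("u0000001", "orcid", "0000-0002-1825-0097")
set_identifier("u0000001", "researcherid", "X-0000-2013")
print(identifiers_for("u0000001"))
```

A researcher with no external identifiers simply has no rows, which suits the early career case mentioned above.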

At the same time ORCID was clearly an identifier gaining widespread adoption worldwide and not tied to any field of study, or scholarly information service, making it suitable for use as a universal career lifetime persistent identifier.

As part of feasibility testing we decided to develop an ORCID minting tool, to demonstrate that we could programmatically take the information we already knew about a researcher and either create an ORCID id for them from existing data sources or update their existing ORCID record.

The solution has been tested against the ORCID sandbox. This is very much a proof of concept exercise but we believe that the code is sufficiently robust that it could form the basis of a generic ORCID client.
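One small check worth making before trying to update an existing record is that the stored ORCID iD is well formed. The final character of an ORCID iD is a check digit computed with the ISO 7064 MOD 11-2 algorithm, so a mistyped iD can be caught locally. The sketch below is an illustration only and is not taken from the minting tool; the function names are made up for the example.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if `orcid` has 16 characters and a matching check character."""
    compact = orcid.replace("-", "").upper()
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

This only confirms that the string is internally consistent; whether the iD belongs to the right person still has to be established against the ORCID registry.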

The application allows the user to

  • create an ORCID profile
  • update an ORCID profile
  • link existing publications to an ORCID profile
  • view the records for publications stored in an ORCID profile

The code was developed by Genevieve Turner of the ANU Data Commons team and is available for download from

under a GPL 3 license. As it is prototype code no warranty is made as to its suitability or fitness.



Webinar on changes to ARC funding rules

Following on from the recent ANDS webinar on the ARC changes vis à vis research data management, ANDS hosted an online discussion of the changes with Douglas Robertson (ANU), Joe Thubron (Intersect), Justin Withers (ARC) and Greg Laughlin (ANDS). The meeting was chaired – animé as they say in French – by Adrian Burton.

The discussion was pretty informative.

Rather than précis my notes of the discussion here I’ve posted them online as a pdf for download.

As always these notes represent my interpretation of the discussion, and not necessarily the views of the participants.


ARC research data management plans

The ARC has recently announced a requirement for future grant applications to include a statement as regards research data management.

ANDS recently organised a webinar with ARC participation on the changes.

My notes of the webinar are online as a pdf, but the executive summary of the changes would seem to be:

  • there is no mandate for open data – but researchers should consider how data could be made available
  • there is an understanding that ethical, commercial and intellectual property considerations may restrict the publication of data, and this should be explained in any data management statement
  • it is expected that the plan will reflect current best practice within the researcher’s discipline
  • the data management statement should only consist of a paragraph or two and should focus on publication and availability rather than on technological or infrastructure issues
  • it is expected that the statement will be more than a simple statement indicating compliance with institutional requirements
  • assessment of the data management component is part of the overall assessment process, and not a separate criterion
  • there is no requirement to use institutional data repositories as the ARC recognises differences in capability between institutions, and differences in the volume and nature of the data produced by different projects

As always, this assessment reflects my understanding at time of writing and does not constitute formal advice.

… is a service to shorten these long, full canonical DOI strings to something more memorable and easier to type. It’s effectively a link shortener, but for digital object identifiers.

Looking at the system it would be trivial to add a short doi option to our minting service – allowing a user to generate a short doi once they’ve got a doi minted by simply clicking a check box during the minting process.

Why would we do this?

In a word, convenience. Digital object identifier strings are extremely unfriendly for the user to type, and for people to note down in the course of a presentation, or to retype from a printed handout. At 15 hex characters they are just too damn long.

While the full doi remains the canonical form for citation and publication purposes, providing a short doi alternative makes it easier for a user to quote an unambiguous reference, and by building it into the minting process it simplifies the process for the user – they don’t need to go and cut and paste their newly minted doi into another website, and then store the short link somewhere …


Using BibTeX for dataset citation

As I’ve written before we chose to use BibTeX as our lowest common denominator citation export format.

Despite our focus on datasets the adoption of BibTeX came out of our researcher identification work and we were not really thinking very hard about BibTeX and data sets.

Obviously an oversight on our part. However at yesterday’s ANDS/Intersect meeting in Sydney there was some mention of how Evernote now supports dataset citation.

This reminded me that we had never actually resolved the question of dataset citation and BibTeX. However, as in all things, Google was my friend.

As with all things BibTeX there’s more than one way of finagling it. JabRef suggests the use of an @electronic type, while others suggest using an @online or @misc type.

As we are talking about using BibTeX as a data interchange format the use of an @misc type is perhaps the most applicable as we are making no special assumptions about the capabilities of the application.

Therefore we’d be looking at something like

  title = {{MS Windows NT} Kernel Description},
  howpublished = {\url{}},
  note = {Accessed: 2010-09-30}

and for a dataset something like

author = {Claire O'Brien},
title = {{Impact of Colonoscopy Bowel Preparation on Intestinal Microbiota}},
doi = {10.4225/13/511C71F8612C3},
howpublished= {\url{}} 

where we store the Digital Object Identifier as a url, so that the user can access the object, as well as citing it normally. Obviously we could refine it further by also expressing the researcher’s ORCID identifier as a url.

If we use JabRef to autogenerate an entry we end up with something very similar:

  author = {Claire O'Brien},
  year = {2013},
  title = {Impact of Colonoscopy Bowel Preparation on Intestinal Microbiota},
  language = {English},
  howpublished = {\url{}},
  doi = {10.4225/13/511C71F8612C3},
  owner = {dgm},
  timestamp = {2013.11.28}

which is very similar, especially if we use howpublished rather than url, given the lack of a standard form for url citation in BibTeX. As I said earlier, it may be preferable to use @misc rather than @electronic when creating a lowest common denominator entry for reuse.
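For export, a complete @misc entry along these lines could be generated from the dataset metadata with something like the sketch below. The function name and the citation key are hypothetical, and writing the DOI out through the https://doi.org/ resolver is an assumption of the example rather than what our export code necessarily does.

```python
def dataset_to_bibtex(key: str, author: str, title: str, year: int, doi: str) -> str:
    """Render a dataset as a lowest common denominator @misc BibTeX entry."""
    url = f"https://doi.org/{doi}"  # assumed resolver form of the DOI
    fields = [
        ("author", author),
        ("title", "{" + title + "}"),  # inner braces preserve capitalisation
        ("year", str(year)),
        ("doi", doi),
        ("howpublished", f"\\url{{{url}}}"),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


entry = dataset_to_bibtex(
    key="obrien2013",  # hypothetical citation key
    author="Claire O'Brien",
    title="Impact of Colonoscopy Bowel Preparation on Intestinal Microbiota",
    year=2013,
    doi="10.4225/13/511C71F8612C3",
)
print(entry)
```

Only fields that any BibTeX-aware application should accept are emitted, in keeping with the lowest common denominator aim.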

Reference: Guide BibTeX pour la création de bibliographies avec LaTeX

Written with StackEdit.


Using Bagit …

Pete Sefton, over at UWS, recently published an interesting post on their use of the bagit specification in their data capture applications.

The Data Commons also uses bagit in its ingest mechanism, but for slightly different reasons. The genesis of using bagit goes back a long way, in fact to when I was working for some other people entirely, building what was described as a digital asset management system but was actually a digital archive.

What we had then were recordings of spoken language, and transcriptions and translations of the same, both with annotations.

They were all related, and while they were separate objects they had a very close linkage – the whole was clearly greater than the sum of the parts.

My idea at the time was to zip the lot together and deposit the zip file – which was not perfect as it meant we lost access to the technical metadata, but at least we kept the related things together.

The zip file was not about compression, it was merely a way of keeping things together.

Fast forward to last year. We had the same problem with data capture, where quite often different sets of observational data were separate but related files. I had previously played with file formats like epub, which uses zip to keep all the component parts together, gives each part a defined role, and has the concept of a manifest. This isn’t unique to epub; the libre/open office document format does a similar trick – if you want to poke around, copy an odt file and open the copy with your favourite archive manager.

We actually thought about inventing our own format based on epub for all of five minutes until we discovered bagit.

Bagit was a good choice as (a) it had the Library of Congress behind it, making it likely to become a de facto standard, and (b) it addressed a problem we hadn’t thought of – bagit will let you have a holey archive, ie have files stored by reference elsewhere.

In a world of collaborative research this is potentially important, where one archive holds, say, the transcriptions of some medieval texts, but someone else entirely holds the high resolution images that the transcriptions were compiled from.
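To show roughly what a holey bag looks like on disk, here is a small sketch that writes one using only the Python standard library: a bagit.txt declaration, a data/ payload directory, a manifest of checksums, and a fetch.txt listing files held elsewhere. It is not our ingest code; the make_bag function, the file names and the example.org URL are made up for the example, and it declares BagIt version 1.0 where an older deployment would declare an earlier version.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload, remote=()):
    """Write a minimal BagIt bag.

    payload: mapping of file name -> bytes stored inside the bag.
    remote:  (url, length, file name, sha256) tuples for files held elsewhere.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    manifest, fetch = [], []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        manifest.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}")
    for url, length, name, checksum in remote:
        # Files stored by reference still appear in the manifest.
        fetch.append(f"{url} {length} data/{name}")
        manifest.append(f"{checksum}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest) + "\n", encoding="utf-8"
    )
    if fetch:
        (bag_dir / "fetch.txt").write_text("\n".join(fetch) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "bag"
    # Stand-in checksum for an image held by another archive.
    image_sha256 = hashlib.sha256(b"stand-in image bytes").hexdigest()
    make_bag(
        bag,
        payload={"transcription.txt": b"Transcribed text ..."},
        remote=[("https://example.org/images/folio-001.tif", 1024,
                 "folio-001.tif", image_sha256)],
    )
    files = sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file())
    fetch_text = (bag / "fetch.txt").read_text(encoding="utf-8")
    manifest_lines = (bag / "manifest-sha256.txt").read_text(encoding="utf-8").splitlines()

print(files)
```

The transcription travels inside the bag while the image is only referenced, which is the split between archives described above.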

So we chose bagit. To get round our problem with losing access to the technical metadata we feed files on ingest through Fido and Tika and store the technical metadata in an associated file inside the bag – the bagit archive.

We have also looked at adding normalization to the ingest process, even though we haven’t implemented this yet – essentially when we get a document file in a format we recognise we feed it through a standard conversion tool to save a second copy in a standard format to the bag, following the National Archives of Australia’s normalisation guidelines out of project Xena – that way we will always have a version of the document that we can read with software current at the time, and from which we can generate other formats.

Another idea we have looked at is feeding the file through topic modelling software to effectively do automatic keyword extraction – something that can potentially be used to improve the quality of the search results.
