UC DLFx 2018: Combined Session A Notes

Combined Session A: 

UC Data Network: A Systemwide Solution for Free Research Data Management - Stephen Abrams

Point: what can we do collectively to raise the systemwide capacity to deal with UC's rich research outputs? Technical platforms and a community for open data publication, preservation, sharing, and reuse of UC research data. A partnership between campus VCRs, CIOs, etc.

New imperative/justification: Research data are the raw material of scholarly inquiry and deserve to be recognized as first-class scholarly outputs. Effective management and sharing of research data are increasingly called for by funders, publishers, and institutions. Data stewardship is good for scholars and scholarship, and should be scholarly best practice. Responsibility for this rests with individual scholars, who may not have the time, inclination, expertise, or technical capacity to meet these obligations. Many resort to commercial alternatives (Google Drive, etc.) - a legitimate concern, since we don't want to end up 10 years down the road where we are with the published literature, where we give it away and then buy it back.

Pieces of the puzzle: Individual scholars get grants, which can be used to produce and publish data, but grant funds can't be used to preserve data over the long term (beyond the life of the grant). Campus VCRs have a strong interest in research data management for competitive research advantage, but no resources to do much about it. IT doesn't support data preservation or publication, just facilities for computation and working storage. We compete with other demands on scholars' time, with the perception that research data management is not aligned with scholars' primary interests, and with free commercial alternatives.

What a UC systemwide solution should look like:
- respond to technical and social barriers to adoption
- leverage prior investments in systemwide expertise and technical capabilities
- UC data should be managed by UC; we need to turn that ownership obligation into a stewardship obligation - incumbent on the university to fund resources
- drive down the cost to free (campus chargebacks for storage capacity)

UCDN: VCRs, CIOs, ULs collaborate as stakeholders. Each is asked to contribute to the common good.
- Scholars contribute publishable data
- VCRs contribute sustainability, outreach, resources, imprimatur
- CIOs contribute local storage capacity to a common pool
- ULs contribute data literacy training and curation
- CDL contributes core technology

When a dataset is in reasonably stable or publishable form, it can be submitted to the data publication platform.
*Dash system - submission modalities, offers outside curators the opportunity to do QA review, metadata management (submission, publication, search, analysis). Has the appearance of a repository, but is really just a submission and discovery layer on top of the Merritt repository. [Hmmm, something to think about once our migration to Samvera is done and we can look at its offerings]

Data publication made easy - use Dash: every dataset is assigned a DOI, registered with DataCite, and indexed for high-level discovery as well as for analytics. Integrates with ORCID IDs and FundRef to indicate which publications are related to the dataset. Once progress is made on open data, we will then deal with restricted data.
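
For context (my aside, not from the talk): DOI registration with DataCite typically happens over its REST API. Below is a minimal Python sketch of the kind of call a platform like Dash automates; the prefix, credentials, landing-page URL, and metadata values are placeholders, and it points at DataCite's test instance.

```python
# Hedged sketch: registering a dataset DOI via the DataCite REST API.
# The endpoint and JSON:API payload shape are DataCite's; the prefix,
# credentials, and metadata values below are placeholders only.
import requests

payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "event": "publish",                # mint the DOI and make it findable
            "prefix": "10.5072",               # placeholder/test prefix
            "creators": [{
                "name": "Example Researcher",
                "nameIdentifiers": [{          # ORCID linkage
                    "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
                    "nameIdentifierScheme": "ORCID",
                }],
            }],
            "titles": [{"title": "Example long-tail dataset"}],
            "publisher": "University of California",
            "publicationYear": 2018,
            "types": {"resourceTypeGeneral": "Dataset"},
            "url": "https://example.org/dataset/123",   # landing page (placeholder)
        },
    }
}

resp = requests.post(
    "https://api.test.datacite.org/dois",      # DataCite test instance
    json=payload,
    auth=("REPO_ID", "REPO_PASSWORD"),         # repository credentials (placeholders)
    headers={"Content-Type": "application/vnd.api+json"},
)
resp.raise_for_status()
print(resp.json()["data"]["id"])               # the newly minted DOI
```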

Conversations between stakeholders. If nothing else, all those UC folks are talking to each other. Moving forward with a pilot: initially 4 campuses instead of all 10, asking for 1 PB per campus; UCR has already said yes. Initial integration with UCR's RCloud storage - cnc.ucr.edu/rcloud

Things to deal with later:
- really big data (no matter what figure we use, we can always fill it up). But initially we are looking at long tail data in social sciences.
- software and workflows
- sensitive data

Summary
- UCDN is an effective platform for data publication and preservation
- a geographically distributed, heterogeneous network
- a strategic partnership and community of common concern, advocacy, and practice
- free for individual scholars to contribute to and retrieve from
- an investment in the future of UC research

Questions
Q: What is the role of curation in this?
A (from the Dash product manager): It used to be that the only individuals who could edit the descriptive metadata were contributors/coauthors. Now curatorial agents can be designated at the campus level with access to any dataset contributed by someone affiliated with that campus. They can review the metadata and the data itself (in case of a wrong format, for instance) and add additional descriptive metadata. Fairly new feature.

Q: What about while faculty are still working with their data - is this the tool used then, or just at the publishable stage?
A: Dash does support limited-time embargoes. It is not transactional working storage - that happens individually at the various campuses. This is for publishable-stage data.

Q: Should we ask faculty for previous datasets?
A: Great! Reuse is not just for others but also for the scholars themselves when they deal with retrospective as well as prospective data. [Curious as to what this is]


Beyond the Repository: Exploring Integration Between Local and Distributed Digital Preservation Systems - Sibyl Schaefer (UCSD)

Grant with Northwestern University. Goals were to investigate common problems in digital object curation, versioning, interoperability, etc. - ask for slides.
Oh, hey - Declan Fleming!  I remember him.

How does one curate objects to ingest into a long-term dark preservation system?
How does versioning of objects and metadata play out in long-term dark preservation?
How can those two systems be made interoperable?

Gathered info via a survey to get a sense of the breadth of local implementations and to identify potential preservation policies and rights issues, plus in-depth interviews with a range of interested parties.

Survey results: 170 valid responses. 65% had collected 10TB or more, more than 80% expected content to grow by more than 10TB in the coming year. Mostly academic libraries, 73 willing to further discuss.

Systems used: bepress, Archivematica, CONTENTdm, DSpace, DuraCloud, Fedora, Islandora, homegrown, etc.

Distributed storage and number of copies kept: 85% said they keep multiple copies in multiple locations (typically 3 locations; the 7+ cases are LOCKSS). If not, funding was the barrier. Where stored? Multiple locations onsite came first, then cloud (S3, Glacier, Box), then digital preservation services. How copies are tracked: automated tools (logs), don't keep track, homegrown tools, IT support, MetaArchive Conspectus, spreadsheet/database/other manual methods. Versioning of distributed copies: 85% reported they keep all the different versions, 20% only keep the newest version, 20% unsure.
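
As an aside (mine, not the speaker's): the "homegrown" and manual tracking approaches respondents describe often amount to checking replicas against a checksum manifest. A minimal sketch of that idea; the replica paths and manifest format are made up.

```python
# Minimal sketch of "homegrown" copy tracking: verify each replica against a
# stored SHA-256 manifest. Paths and manifest format are illustrative only.
import hashlib
import json
from pathlib import Path

REPLICA_ROOTS = [Path("/storage/onsite"), Path("/storage/offsite-mirror")]  # hypothetical
MANIFEST = Path("manifest.json")  # {"relative/path": "sha256-hex-digest", ...}

def sha256(path: Path) -> str:
    """Stream the file through SHA-256 so large objects never sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

expected = json.loads(MANIFEST.read_text())
for root in REPLICA_ROOTS:
    for rel_path, digest in expected.items():
        copy = root / rel_path
        if not copy.exists():
            print(f"MISSING  {copy}")
        elif sha256(copy) != digest:
            print(f"CORRUPT  {copy}")
        else:
            print(f"OK       {copy}")
```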

Curation: Sending materials to digital preservation services is not cheap. How do you select which of your 10TB to send to APTrust or other such services? 48% say they select a subset of materials to send to a distributed repository; the criteria were mandate and the intrinsic value of the materials.

Interviews: working toward a way to rank not just the material but the level of preservation it would receive. Local backups as a baseline, more unique material might also be backed up to the cloud, and born-digital special collections would ideally be placed in a distributed digital preservation system. The DPS is the top-level gold standard, cloud is intermediate, and an extra copy onsite suffices for something fairly easy to replace. Being in a DPS makes an item more valuable because of the investment made in the material.

No one has any idea about versioning- haphazard, not really documented, not sure what we're doing now.

Interoperability: many systems were homegrown. Problem of interoperability widely felt.

Brutal honesty: folks realize that what they're doing is insufficient in the face of disaster; the institution doesn't realize the value or want to spend the resources.

Conclusions
Mandate and intrinsic value are important. There's some consensus that there are different tiers of materials and different tiers of digital preservation. No consistent versioning practices locally, let alone in distributed systems. Interoperability: BagIt was mentioned in almost every interview as a common tool for data packaging.
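
Since BagIt came up so often, here is a minimal sketch of packaging an export with the Library of Congress bagit-python library. The directory path and bag-info fields - including a made-up field declaring the producing system, the sort of thing the first recommendation below is about - are illustrative only, not an existing specification.

```python
# Hedged sketch using bagit-python (pip install bagit). The path and the
# bag-info values, including the hypothetical "Bag-Producing-Application"
# field, are placeholders, not part of any existing spec.
import bagit

bag = bagit.make_bag(
    "export/collection-123",   # directory to package in place (placeholder path)
    {
        "Source-Organization": "Example University Library",
        "External-Description": "Born-digital special collection export",
        # Hypothetical field of the kind a shared spec might standardize:
        "Bag-Producing-Application": "Fedora exporter",
    },
    checksums=["sha256"],
)

bag.validate()                 # verify payload manifests and checksums
print(bag.info["Bag-Producing-Application"])
```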

Recommendations:
Create a BagIt specification for local repository systems. Fedora will export a bag; the specification would define metadata carried in the bag - a schema is needed so the different systems that create bags can declare themselves.
Create a standardized API for distributed digital preservation services (a hypothetical sketch of what that might look like follows this list).
Need for curation decision tools to codify what top-level digital preservation means and which materials fall into that category. For my particular institution, what are the things that matter?
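
Purely as a thought experiment on that second recommendation: a sketch of the common surface such a standardized API might expose. No existing service (APTrust, MetaArchive, etc.) implements this; every name here is hypothetical.

```python
# Hypothetical common interface for distributed digital preservation services.
# Nothing here corresponds to an existing service API; it only illustrates the
# kinds of operations a standard might cover.
from abc import ABC, abstractmethod
from pathlib import Path


class PreservationService(ABC):
    """Operations a local repository would call, regardless of which service backs them."""

    @abstractmethod
    def deposit(self, bag_path: Path) -> str:
        """Submit a BagIt bag; return a service-side identifier for the deposit."""

    @abstractmethod
    def status(self, deposit_id: str) -> str:
        """Report ingest state, e.g. 'received', 'replicated', 'failed'."""

    @abstractmethod
    def fixity_report(self, deposit_id: str) -> dict:
        """Return the latest fixity-check results for each distributed copy."""

    @abstractmethod
    def restore(self, deposit_id: str, target_dir: Path) -> None:
        """Retrieve a preserved bag back into local storage."""
```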

Final Report:  https://goo.gl/cnEgej

Questions

Q: What's next for the project?
A: An application for an implementation grant. The data is published alongside the report if you want to do a cross analysis.
Q: So, horrible but not surprising.


Q: Did anyone go into detail on how they determined intrinsic value?
A: That will be part of the implementation grant.