UC DLFx 2018: Demystifying Data Curation
Demystifying Data Curation - Vessela Ensberg (UCD), Emily Lin (UCM), Ho Jung Yoo (UCSD), Amy Neeser (UCB)
Purpose of data curation: FAIR data principles--Findable, Accessible, Interoperable, Reusable. Qualities of data that make them valuable.
Findable: DOI. Accessible: online, fixity-checked, backed up; these fall under bit curation (we depend on technology).
Interoperability: file formats still active, can be opened and read.
Reusability: metadata to put data in context so researchers can decide whether or not they can use it.
This is curation for long-term use:
When: pre-ingest or post-ingest.
How: in depth (documentation happens down to each variable) or limited in scope.
Tension between time, quality, and return on investment
Four case studies on how each campus approached curation, what skills are necessary, and what difficulties there are in providing these services.
Case 1
Emily Lin - UC Merced
NSF program: Critical Zone Observatories
Conversations around this started many years ago; had talked to the PI years ago, then had a more in-depth interview with him in 2011 about research data curation and general needs. He recognizes the importance of the issue: millions of dollars are spent acquiring data that will ultimately be lost. Funding has remained at the same levels, but work has increased exponentially. Even with his recognition, it was difficult to get traction in discussions about research data curation. He did not approach them until 2016. He had a data manager on his team, but what drove him to approach the Library was that the data project's funding ceased in 2015. Impetus: cyberinfrastructure project funding ended; desire for DOIs. The CZO's current list of data had no descriptive records, just a page of links.
In 2016, a Dash instance (the UC CDL service for depositing data) had already been spun up and was in place to address their desire for DOIs.
Curation Activities:
1. EZID Sponsored account
2. DataCite metadata
3. Readme
4. Merritt deposit & surfacing/publishing in Dash.
To do this, needed to create DataCite metadata; also guided them to create a Readme file for the set: the kind of documentation that is more explicit about what codes mean and what levels 0, 1, etc. cover. They did have a system for how to keep the data, but the Library helped them be more explicit. Coordinated with CDL to harvest the dataset into Dash.
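For context, a minimal sketch (in Python) of the kind of required fields a DataCite dataset record carries. All values are hypothetical placeholders, not the actual CZO metadata, and the real records were created through EZID/Dash rather than assembled by hand like this.

```python
import json

# Minimal sketch of required DataCite fields for a dataset record.
# Every value below is a placeholder (10.5072 is a test prefix; the ORCID is the public sample ID).
record = {
    "identifier": {"identifier": "10.5072/FK2EXAMPLE", "identifierType": "DOI"},
    "creators": [
        {"creatorName": "Example, Researcher",
         "nameIdentifier": {"nameIdentifier": "0000-0002-1825-0097",
                            "nameIdentifierScheme": "ORCID"}}
    ],
    "titles": [{"title": "Example Critical Zone Observatory sensor dataset"}],
    "publisher": "UC Merced Library",
    "publicationYear": "2016",
    "resourceType": {"resourceTypeGeneral": "Dataset", "resourceType": "Time series"},
}

print(json.dumps(record, indent=2))
```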
Follow up recommendations:
Get ORCID
Deposit software programs that they used
Populate related items that cite the data to create linkages
Over a year later: they came back
"Long time processing 10 year data" - took hem that long to process the data
Direct depositing Dash
Co-contributor management issues (problems revising old data because was under old PI and no user management)
Sorting out DOIs (Dash has its own shoulder for datasets, so not using the EZID; EZID can't manage DOIs because is under UCM DOI in Dash)
Still need to figure out why only one author has ORCID apparent in web lookup, and why link to documentation instead of bundling and depositing. Second instance, weren't as hands-on with first deposit.
Case 2
Vessela Ensberg
UCLA/Social Science Data Archive
Open Archival Information System (OAIS) reference model: key to interoperability.
Proprietary/restricted formats (spreadsheets, some statistical software formats).
DDI metadata creation using Colectica.
Format migration to ASCII.
Now at UC Davis; Nov 2017 implementation of Dash (trying to serve the whole campus). Can't do deep-level curation due to lack of staff. A limited curation plan lets them look at datasets in Dash and improve discovery, starting with the keywords proposed by the researcher. Most metadata schema fields are administrative, but researchers care about subjects, which are only available in the keywords.
She keeps a limited curation spreadsheet:
Submitter, email, Dataset title, DOI, N deposited files, N files tested for whether they open
List of files that don't open, format recommendations, keyword recommendations, controlled vocabulary for keywords
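A minimal sketch of how part of this tracking could be automated. The log columns and the "does it open" test (here just attempting to read each file) are assumptions for illustration, not UC Davis's actual workflow; real curation opens files in their native software.

```python
import csv
from pathlib import Path

def log_limited_curation(dataset_dir: str, log_path: str,
                         submitter: str, email: str, title: str, doi: str) -> None:
    """Append one row of limited-curation facts: file counts and which files fail a crude open test."""
    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    failed = []
    for p in files:
        try:
            with open(p, "rb") as fh:  # crude check only: can the bytes be read at all?
                fh.read(1024)
        except OSError:
            failed.append(str(p))
    with open(log_path, "a", newline="") as out:
        csv.writer(out).writerow(
            [submitter, email, title, doi, len(files), len(files), "; ".join(failed)]
        )

# Hypothetical usage:
# log_limited_curation("deposits/example", "limited_curation_log.csv",
#                      "J. Researcher", "researcher@ucdavis.edu",
#                      "Example dataset", "10.5072/FK2EXAMPLE")
```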
Other metadata recommendations:
Add a readme file
Explain acronyms
Provide links to scripts
More descriptive file names
Tempting to go into each file to make recommendations, but it's not scalable.
Limited curation difficulty level:
Easy: file download, N files tested for opening, list of files that did not open.
Medium: format and keyword recommendations, other metadata recommendations, controlled vocabulary keywords.
Difficult: anything requiring deep subject expertise, which is outside the scope of limited curation.
Motivation concepts:
work in progress: revisit and improve
Perfect is the enemy of the good
Data Curation Network (uni libraries coming together to do deep curation for each other in a distributed model)
Case 3
Ho Jung Yoo
Analyst in UCSD Library
Works to get datasets into the institutional data repository (mediated deposit).
Consults with researchers from any discipline on campus. Data range from zip files with hundreds of thousands of JSON files to video of 3D renderings of objects.
Repository stakeholders have various goals that fall along a spectrum of curation effort.
Increasing curation effort: meet mandates to share data openly; archive and showcase the scholarly output of the institution; increase researchers' scholarly impact...
Curation: balancing funding and time. Quantity vs quality.
Adapt to each project's needs. The Library has built tools to streamline the ingest workflow. Scaling up is an ongoing challenge.
Collection Ingest Workflow diagram (ask for slide)
After we receive data & metadata: wrangling (getting the contents ready for ingest); making the best use of the repository platform to meet the researcher's goals for sharing the data.
Collection structure
Distribution across more than one landing page? Think of the DOI issues with this.
What is the best way to bundle the data (conversation with the researcher): how will most end users want to use the data, and how will the way we bundle it impact discovery?
There is more than one right way to structure a collection. Comparison of landing pages: one item has components organized by experiment in the study; in the second item the researcher's core data was 600 GB in size (too onerous for the end user), so it was packaged in 6 zip files at 50 GB each to make it easier.
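A rough sketch of the size-capped bundling idea described above (splitting a large collection into several zip archives under a size threshold). The 50 GB cap and the greedy grouping are illustrative assumptions, not the repository's actual packaging tool; the cap is approximate since files are stored uncompressed here.

```python
import zipfile
from pathlib import Path

def bundle_files(src_dir: str, out_prefix: str, max_bytes: int = 50 * 10**9) -> None:
    """Greedily pack files into numbered zip archives, starting a new archive once the cap is reached."""
    files = sorted(p for p in Path(src_dir).rglob("*") if p.is_file())
    bundle_num, current_size, zf = 1, 0, None
    for p in files:
        size = p.stat().st_size
        if zf is None or (current_size + size > max_bytes and current_size > 0):
            if zf:
                zf.close()
            zf = zipfile.ZipFile(f"{out_prefix}_{bundle_num:02d}.zip", "w", zipfile.ZIP_STORED)
            bundle_num += 1
            current_size = 0
        zf.write(p, arcname=str(p.relative_to(src_dir)))
        current_size += size
    if zf:
        zf.close()

# Hypothetical usage: bundle_files("core_data/", "dataset_bundle")
```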
Enhance discoverability by encouraging detailed descriptive metadata and asking for resources in the collection we can link to.
Scripps Institution of Oceanography collects data on cruises. Not all teams use the same naming convention. Hoping to normalize cruise names to allow a user to click on a cruise facet and collect the data. Every time the ship goes out to sea, multiple research teams collect their own kinds of data; once the ship and data return to port, the data are put into separate silos with their own naming formats, so there is currently no one way to get all the data collected on a particular cruise.
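A toy sketch of the kind of cruise-name normalization being considered, so one facet value covers all spellings. The variant strings and the assumed "ship code + yyyymm + leg letter" pattern are invented for illustration, not SIO's actual conventions.

```python
import re

# Hypothetical variants of the same cruise identifier as they might appear across teams' metadata.
VARIANTS = ["SKQ201612S", "skq-2016-12-s", "Sikuliaq SKQ 201612S"]

def normalize_cruise_id(raw: str) -> str:
    """Collapse punctuation/whitespace, uppercase, then extract an assumed 3-letter-code + date pattern."""
    collapsed = re.sub(r"[\s\-_]+", "", raw).upper()
    match = re.search(r"[A-Z]{3}\d{6}[A-Z]?", collapsed)
    return match.group(0) if match else collapsed

print({v: normalize_cruise_id(v) for v in VARIANTS})
# All three variants normalize to "SKQ201612S" under these assumptions.
```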
Provide non-subject-expert review of files and metadata. Ask questions like: does the descriptive metadata tell me what files are in the collection? What software versions are needed to review the data? Do the files open? Proofread the metadata. Researchers make a lot of little errors in documentation, and many can be detected with little or no knowledge of the discipline: metadata spreadsheets, breaks in patterns, accounting-type errors. Researchers can then clarify and modify. This is feasible for a small data collection (plus she has some expertise in the field).
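A small illustration of the "breaks in patterns" idea: scanning a metadata spreadsheet for rows whose filename doesn't match the shape the rest of the rows follow. The manifest file, the column name, and the shape heuristic are assumptions for the example, not UCSD's actual review tooling.

```python
import csv
import re
from collections import Counter

def flag_pattern_breaks(manifest_csv: str, column: str = "filename") -> list:
    """Infer the dominant filename 'shape' (letters -> A, digits -> 9) and return values that deviate."""
    def shape(value: str) -> str:
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

    with open(manifest_csv, newline="") as fh:
        rows = list(csv.DictReader(fh))
    shapes = Counter(shape(r[column]) for r in rows)
    dominant, _ = shapes.most_common(1)[0]
    return [r[column] for r in rows if shape(r[column]) != dominant]

# Hypothetical usage:
# print(flag_pattern_breaks("file_manifest.csv"))
```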
Also try to keep the EZID records for DOIs populated and up to date, in the hope that link-harvesting systems (DataCite) will eventually generate metrics on active use.
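A minimal sketch of updating an existing EZID identifier, assuming EZID's ANVL-over-HTTP interface (basic auth, plain-text body posted to /id/); the DOI, credentials, and metadata values are placeholders, and production code should percent-encode special characters per the EZID documentation.

```python
import requests  # third-party: pip install requests

EZID_BASE = "https://ezid.cdlib.org"

def update_ezid_record(doi: str, target_url: str, username: str, password: str) -> str:
    """Push an updated target URL and a couple of DataCite-profile elements to an existing EZID DOI."""
    # Hypothetical values; a real record would carry the dataset's actual title, creators, etc.
    metadata = {
        "_target": target_url,
        "datacite.title": "Example dataset, version 2",
        "datacite.publicationyear": "2018",
    }
    anvl = "\n".join(f"{k}: {v}" for k, v in metadata.items())
    resp = requests.post(f"{EZID_BASE}/id/doi:{doi}",
                         data=anvl.encode("utf-8"),
                         auth=(username, password),
                         headers={"Content-Type": "text/plain; charset=UTF-8"})
    resp.raise_for_status()
    return resp.text  # EZID replies with a "success:" or "error:" line

# Hypothetical usage:
# print(update_ezid_record("10.5072/FK2EXAMPLE", "https://library.example.edu/collection/example",
#                          "apiuser", "apipass"))
```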
Timing can be challenging. For datasets associated with articles, ideally the dataset is released before the article is published, but usually it arrives after the article, with the DOI already established.
Pain Points
- Asynchronous datasets
- Reproducible/reusable/interoperable data = gold medal; comprehensible data = bronze.
- Versioning ongoing projects: no current system, so need to get creative about how they handle collections.
- Selective access for reviewers
- Some publishers want datasets available to reviewers while the data are still embargoed. The Library is the logical third party to release data to reviewers, but the Library is not in the business of mediating data access. (RDLShare?)
Curation value can be gained when someone familiar with researcher workflows and constraints reviews the deposit.
Domain expertise is beneficial, but they can survive without it; rely on researchers to improve usability by pushing the right amount of work back to them. Want to promote the growing culture of data sharing on campus as well as data management best practices. Juggling priorities but moving forward: "joggling" (haha).
Case 4
Amy Neeser, UCB (previously UMich)
UMich Research Data Curation Librarian, very focused on the data repository. Recruited datasets and prepared librarians for researcher data submission. Network for research data services: library liaisons (domain experts), a research data services core team, and functional specialists (people with expertise in particular areas who are not, say, the biology librarian, e.g., a digital preservation librarian, a metadata librarian). Research Data Services was an interplay among these three.
Many subject librarians felt they didn't have the skills necessary for data services. Spent a long time talking to all subject librarians and doing informal interviews: a get-to-know-you, plus how does the work you're already doing translate to research data services? Librarians discovered they had more skills to contribute than they thought. Very lengthy to talk to everyone, but it was worth it. Subject librarians articulated their expertise in relation to their domain on their library pages and added research data services items. Increased librarian confidence.
Data Curation Network: sites.google.com/site/datacurationnetwork
DCN CURATE checklist of steps and FAIRness Scorecard: DCN Planning Phase report 2017
A grad student would take the checklist and review the dataset; then she would go to the subject librarian and go through the checklist, then check for differences between the grad student's findings and her own/the librarian's.
UM Data Curation Checklist
1. Brief Summary of Findings
2. Content (the data itself) review: may need to run virtual machines to open files in unusual software. What file format? Is any data missing? How big is it? Does it need to be broken up? (See the sketch below.)
3. Context review: documentation. Is there a Readme, a data dictionary? How does the description of the data in the readme file read? Contact info for the people responsible for the dataset, the license, disciplinary standards.
4. Connections review: what other things are connected to this dataset (papers, etc.), relationships between the files, do the file names make sense?
5. References/additional resources
Stitch up a report from the grad student's review and all of the above, then send it to the researcher with recommendations. There is no real authority over it ("As data curator, I would recommend..."). Some researchers are interested in working with us, some are not, so there are varying levels of curation.
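A small sketch of the mechanical part of the content review (checklist item 2): tallying file formats, counting files, summing sizes, and flagging whether the deposit might need to be broken up. The 10 GB threshold is an arbitrary placeholder, not a UM policy.

```python
from collections import Counter
from pathlib import Path

def content_review_summary(dataset_dir: str, split_threshold_bytes: int = 10 * 10**9) -> dict:
    """Summarize file extensions, counts, and total size; flag large deposits for possible splitting."""
    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    return {
        "n_files": len(files),
        "formats": dict(Counter(p.suffix.lower() or "(no extension)" for p in files)),
        "total_bytes": total,
        "consider_splitting": total > split_threshold_bytes,
    }

# Hypothetical usage:
# print(content_review_summary("incoming/example_deposit"))
```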
Previously was a science librarian, so a great opportunity to get people together and make connections. How can we share expertise? DCN is a really cool model
All: Doing something is better than nothing. Early is better than later.
Some people are intimidated and don't go into curation because they have no training/expertise in data curation, but you have more than you think you do.
Questions
Q: Policies on what licenses researchers can add?
A: Default to CC0 or CC-BY, and usually researchers are fine with that. The Regents can delegate authority; check with the UL. But you can't own facts, so data is a weird situation. CC0 or CC-BY doesn't mean the data can be used without conditions. Plagiarism has nothing to do with copyright.
Q: Where does Dryad fit into this workflow? Because of NSF, researchers deposit into Dryad and stop. Need to convince researchers to deposit data for curation over the long term.
A: Researchers should put their data where users are most likely to find it. If Dryad is where they want to put their dataset, that's a perfectly fine practice. That said, a recent snapshot of the affiliations of Dryad depositors showed only about 50 UCSD submissions, so not sure how much it's actually used. We fill the same function they do. Researchers who can't afford Dryad did come to the library for a solution.