UCDLFx 2018 - Keynote (Notes)

UCDLFx - Digital Library Federation X Conference
February 27, 2018
Keynote Christine L. Borgman

Big Data, Little Data, or No Data? Scholarship and Stewardship to build the UC Digital Library - Dr. Christine L. Borgman

See http://knowledgeinfrastructures.gseis.ucla.edu
http://christineborgman.info
@scitechprof

Her book: Big data, little data, no data: Scholarship in the networked world (MIT Press, 2015)

DLF folks: Endangered Data Week is this week. The DLF Forum is Oct 15-17, with NDSA immediately after.

UCR University Librarian remarks. Historical framework for why this is important. Mistake in library school--mission statement too close to computer science department, so lost the SLIS. We lost nerve as librarians about bringing unique insights to the table. CS folks came to us to ask how to handle all this metadata and data. The library profession has an incredible role shaping the digital now and the digital future. Vital force in bringing unique and critical insights to the table, and in how we create value-added resources atop digital content as full partners in this digital ecosystem.

Dr. Borgman
1665 - first English language journal. 350th anniversary issue.
Data sharing policies are being pushed around the world (Europe, Asia, Australia; the US is behind). The push has been to move the idea of data from part of the research process to a product of research, which we should be sharing and reusing to create new knowledge. Funding agencies expect depositing/publishing work.

RDA - Research Data Alliance. Getting pubs hard enough, data deposited even harder. Big data: volume, variety, rate of change, veracity, speed of change, big in scale - longitudinal. data size, data sources.

What are data? Mouse is not data, it's what you *do* with it. Objects themselves aren't data, it's what you do with them. Phenomenological definition of data: it becomes data when you start to treat it as evidence. Malleable, mutable, mobile. See book for her definition of data.
Data fits into ecosystem and flow but not well explained or documented like usual research articles.

Sloan Digital Sky Survey as example - open telescope sky surveys. Streaming for a decade or more. Data have been archived in different ways. Astronomers take bits of it, not all.

Research Process
- models and theories
- research questions
- methods

Center for Embedded Networked Sensing (NSF & Tech Center, 10 years, 300 members, 5 universities). Scientists wanted new ways to get data at different levels of granularity.

Signal to noise differences. Engineers vs biologists about calibration and testing instrument for measurement. CS devices couldn't be used to publish in biology journals; working side by side, data kept separate and definitions different.

Sometimes little data becomes big data. veg.isti.cnr.it/griffin/

Publications.
Grey lit: reports, working papers, conference papers, preprints, patents, datasets, audio, video, slides, posters, codebooks, course syllabi, proposals, memos

See Berkeley Technology Law Journal, 33(2) - "Open data, grey data and stewardship: Universities at the privacy frontier." Borgman, 2018.

Taking us into a deeply data-centric world, a machine-readable, interoperable one.
Kelty, C. (2016). It's the data, stupid: What Elsevier's purchase of SSRN also means. At savageminds.org
Publications are not simply containers for data. Publications are arguments made by authors, and the data are the evidence used to support the arguments. (See Big Data book.)
Publications <-> Data

Publications <-> Mapping
Depositing an article and a dataset assumes a one-to-one relationship between publications and datasets. What gets released and when? We need to protect data until it is appropriate to release. Questions of autonomy and academic freedom.

Publications <-> Data: Attribution
Publications we think of as independent: we pick up a book/article, we read top to bottom. Authorship is negotiated. Data is much different. Ownership is rarely clear; people don't sit down and negotiate up front who owns the data. In UC the Regents own your data - a 1958 paragraph (own lab notebooks and all other original sets of data). Compound objects -- data sets, algorithms. Long-term responsibility for datasets usually rests with the PI. Expertise for interpretation rests with folks/grad students in the field. (Ask Matt about CSU's.)

Data citation and analytics: credit, attribution, and discovery. Why are we thinking about data citation? To give people credit? Different formats. Bluebook, MIT's own version, various citation formats - Zotero has over 1800 unique styles. We can't agree on how to cite a journal article, much less data. How do we turn it into data citation? Bibliometric data are messy bc they depend on authors using one of these styles.
"If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology" doi.org/10.1371/journal.pone.0067332

Bibliometrics by source. What are we measuring? These numbers are being used to hire and promote people. False precision. Difference in numbers and policy between Google Scholar, Web of Science, Scopus. H-index. What do these mean? Specious and unstable.
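
A minimal sketch of how an h-index is computed, to make the "false precision" point concrete: the same author gets a different number depending on which source counted the citations. The citation counts below are made up for illustration; they are not from the talk or from any real database.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-paper citation counts for one author, as two
# different (unnamed) sources might report them.
source_a = [10, 8, 5, 4, 3]
source_b = [25, 8, 5, 3, 3]

for name, counts in (("source A", source_a), ("source B", source_b)):
    print(name, "h-index:", h_index(counts))
```

Source B reports more citations in total, yet yields the lower h-index, because the metric depends on how the counts are distributed across papers and not only on their sum.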

How do we attribute responsibility? Undermines deeply embedded western infrastructure.
Concerns about credit and plagiarism. Scholarly credit: contributorship (trying to say what each person did). At what level of granularity?
(reminder to self: CNI)

In biology, authors have to be alive bc authors are required to sign off.
When we find data, we want to do more than read it. If you don't have the software, you don't have the data.

Identity  persistence. If you put a DOI on a dataset does it make it more persistent? Identifier is persistent and unique, not the thing it points to. What's worth keeping and for how long? Journals occasionally ask for your ORCID. Not being picked up as quickly in Asia.

Intellectual property - what can you do with it? Provenance around access to data is real challenge. What rights are associated? Tough to maintain in computer readable way.
Information and autonomy privacy (**Important, see UCOP Privacy and Information Security Initiative. 2013. http://ucop.edu/privacy-initiative). What does privacy mean in an intensely data-driven world, and how does it cross into intellectual property, academic freedom, etc.? Info privacy is narrower - keeping Social Security numbers, driver's licenses, credit cards secure. Autonomy privacy is the right to be left alone and the right to work without having your work revealed. Need to think about data and collection stewardship: we want data to be open when appropriate but also to protect privacy and security.

FAIR - Findable, Accessible, Interoperable, Reusable. Wilkinson et al. in Scientific Data, 3, 2016. Do our systems assume that a natural person is at the keyboard, or a robot, or no keyboard? Are we designing so a human can pick out things on a page? If we are going to aggregate in intelligent ways to make new knowledge, design has to go well beyond assuming a human is making decisions at a splash page. FAIR is a goal, a set of principles and guidelines.

Data stewardship reality:
datamartist.com
The people doing the work on the ground are grad students and post-docs. May have to find a student after they get a job, when you need their data expertise--they not only collect data but write the software. Need to move beyond "gradware" to stable systems.

If you can't protect it, don't collect it. If you collect it, you must protect it (privacy and autonomy).
democracies.eu/blog/open-by-design -
privcybydesign

Surely we wrote into contract that the data would be exported. Data security index 3 - UC - not really about data ownership.

Records retention and disposal cycle. DCC Curation Lifecycle model. Need more cross conversation. We are keeping things longer than required bc easier to keep than get rid of, and then are legally liable for that info.

Responsible data practices. Many stakeholders with competing interests. Need more joint governance processes. Hope for more discussion of what the governance process is. Why are humans moving documents to multiple places when they could be ingested by multiple systems once a student uploads them (e.g. theses)?

Scholarship and stewardship  - we want more mission driven stewardship. Research teaching and service.
Steward the scholarly record (integrated workflows, version of record, record of versions). Support discovery at scale (human/machine/lawyer readable). Sustain trust of community (privacy: information, autonomy; academic freedom; stewardship and governance).

Move to bring undergrads into special collections because we give them such neat packages that they don't understand where these things came from.

UC Leadership in Data Policy - UCOP Privacy and Information Security Initiative (2013), ucop.edu/privacy-initiative - see principles

UCLA Center for Knowledge Infrastructures.

Questions
Researchers say they're measuring something, think they've measured it, but often measurements don't capture what they think they do. What does it mean to have a document today? Wikipedia changes by the hour, so what does it mean to attribute authorship, and how do you refer to the article? Google Docs used as final papers. I don't even know what field I'm in. LinkedIn and FastCompany get the greatest views, but are changeable.

A: Libraries and archives tend to acquire things after they are finished, so we assume it's finished when most of this is streaming data and at best you get a slice in time. What will be the stability of documents under quantum computing? Our descriptive systems in our collections are not thinking about this adequately. Some systems started with continuous streams of data off the instrument, but then you write a paper and can't reproduce the data in that paper because you can't get back to that slice. So now we have data release 1, data release 2, capturing slices. Need to acknowledge it's inherently unstable. Memento - time travel for the web by moving back in timestamps. Provenance guidelines from W3C that try to recreate snapshots in time. They think more of info as flow and not static cutpoints in time. Our systems aren't prepared to think about data this way.
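
A rough sketch of the "slice in time" idea, assuming a streaming dataset that is frozen into numbered releases. Given the date a paper did its analysis, it finds the release that was current then. The release labels and dates are hypothetical, as are the helper names. The second helper builds the `Accept-Datetime` request header that the Memento protocol (RFC 7089) uses to ask a TimeGate for a past version of a web resource; no request is actually sent here.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical release dates for a streaming dataset (not from the talk).
# Must be sorted by date for the binary search below.
RELEASES = [
    ("DR1", datetime(2016, 1, 15, tzinfo=timezone.utc)),
    ("DR2", datetime(2017, 6, 1, tzinfo=timezone.utc)),
    ("DR3", datetime(2018, 2, 1, tzinfo=timezone.utc)),
]


def release_as_of(releases, when):
    """Return the label of the latest release published at or before `when`.

    Returns None if `when` is earlier than the first release.
    """
    dates = [published for _, published in releases]
    i = bisect_right(dates, when)
    return releases[i - 1][0] if i else None


def accept_datetime_header(when):
    """Build a Memento-style Accept-Datetime header for a timezone-aware datetime."""
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


analysis_date = datetime(2017, 12, 31, tzinfo=timezone.utc)
print(release_as_of(RELEASES, analysis_date))
print(accept_datetime_header(analysis_date))
```

Citing the release label along with the paper is what lets a later reader get back to the same slice, which a pointer to the live stream alone cannot do.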

Q: Are people thinking about liability around data privacy issues- if it's falsified, etc? Pitched by admin: "A unit is responsible for keeping systems up to date or financially responsible" - but what about adequate technical support to make sure unit is covered?

A: If UC has a contract with software company X, try to get liability built in. But what is happening is that Silicon Valley is coming to the university and wanting to make contracts with INDIVIDUALS (researchers, individual faculty members). Professor says yes, students are forced into terms and conditions to stay in class, info is distributed; if there's a breach, then who is liable, because the contract isn't with UC? Shadow economy of technology around. Only when it blows up will we find out who is liable. UC Regents claim ownership but not STEWARDSHIP. Materials transfer agreements. Really open territory and changing more rapidly than people recognize. We need to grapple with this, it's hitting the fan fast.

Q: Is the current SLIS curriculum adequately preparing librarians? Adequately adapted to the digital age?

A: Some schools are moving along. We're strong in archives and social justice, but there's no one behind me to teach these data courses. MAYBE someone to teach privacy courses. Data courses - ways to engage the university. Two-course sequence: students work with researchers to develop a data management plan, then in the second class I have them work with researchers to broker data into an archive. Big inroads with faculty, who now know that students and librarians can do this. Other iSchools are not doing as much to connect with faculty. We're the collectors, people with content. We need to take it into a richer design space.
