ASIST 2017 Panel: Organizational and Institutional Work in Data Infrastructures
1. Data Archive Sustainability: Science Policy, and Business Model Planning for Social Science Data Archives 1965-2001 (Kalpana Shankar, Kristin Eschenfelder, and Rachel Williams)
Still up to ears in data and trying to make sense of anything so no grand findings yet. More of a think piece on data archive sustainability. Comparative historical study of six social science data archives.
RQs:
How have social science data archives (SSDAs) changed how they "do business" over the long term?
What factors have encouraged/discouraged that change?
How have national level science funding approaches influenced that change?
Acquiring documents like memos, board minutes, annual reports, strategic plans, etc.
Business models. Lay recognition that data archiving is an important thing, can do cool new science in different ways; people starting up data archives but with little thought to the long term and how these things would maintain themselves over a 20-40 year period. Too many personal pet projects, based on grants, based on volunteer labor. So they searched for long-lived data archives.
Views of Data Archive Funding Sources
- Line item in a budget (unicorn)
Public good, so public funding. Pro: reliable and easier to plan. Con: may discourage innovation, no extra money for special projects.
- 5 year grant cycles from national science/educational agency
Public good/public funding argument
- Marketplace: subscriptions, memberships, contract work
Marketplace of ideas, prove value, attract support, cross-subsidize
Pro: pitch new ideas. Con: volatile, uncertainty.
Marketplace: ICPSR, Roper Center, LIS Cross-National Data Center
5 year cycle: UK Data Archive, EDINA, Minnesota Population Center
Line item: UC Berkeley, Irish Social Science Data Archive
Different ways orgs change themselves over the long term to remain robust as data archives providing data services to researchers.
Same Data, Differing Objectives: What happened when research libraries took on a large scientific dataset?
Sloan Digital Sky Survey (SDSS): big data on mapping the universe. 160TB dataset. At the time almost unprecedented for astronomy. Once SDSS operations ceased... Key leadership started to address curating data in long term for SDSS data transfer process. Five year process, MOU signed with two libraries.
Data archive server: A: Archive & Serve, B: Archive
Catalog archive server: same as above
Administrative archive: A: Preserve, B: Mirror
Help Desk: A: Assume responsibility
Raw Data
Software
Differing reflections. Library stakeholders pleased with process.
SDSS had mixed perspectives - slower than anticipated, help desk didn't offer scientific assistance.
Why differing perspectives?
Tensions during process (SDSS vs library perspectives)
a) Metatension - data vs infrastructure as primary legacy. Libraries need to constantly recruit people to services, and see research data services as a broader service to use and offer to other scientists with other datasets. SDSS wanted a helpdesk tailored to specialized astronomy knowledge, library wanted something easily adapted to other domains.
b) Curating living dataset vs curating closed dataset. SDSS wanted a usable dataset, live and available, whereas the library was concerned with preserving in a static format and preventing bit rot. Differing interpretations of what 'curation' meant and the terms around this. Here differences were resolved during the data process.
c) Infrastructure tailored to astronomy/SDSS vs
Main motivations differed.
Site Based Infrastructuring: Data Work at Launch and Termination
Karen S. Baker
3 concepts: Local collective data work, participation, infrastructuring.
Local collective data work: Local wrt data origin/where generated. Collective refers to a community assembling data from a shared sampling location. Data work (Baker, 2017) - defined as any effort [...]
Ethnography - studying data work practice
Participatory design
Infrastructuring (Star & Bowker, 2002; Karasti & Baker, 2004, 2014). Ongoing socio-technical process.
EcoPrairie - End of Mature Infrastructure.
Data manager as a data *ally* vs primary information.
How does local data collective management end?
Participation as a participatory designer. Co-creating site closing checklist.
Infrastructure happening as a process: 2 years to terminate, packaging datasets, migrating data to a second repository, partnering with university on digitized materials.
EcoRiver
Researchers as data allies.
How does local collective data management begin?
Adding data to agenda at meeting for field station, co-authoring, co-chairing. Decisions depend on local factors to be juggled; the researchers decide what to do. Data stewardship workshop to inform colleagues at various field stations. Here infrastructuring resulted in new instrumentation with streamed data, and the technician, not the data specialist, felt they were doing data management.
Local data collective and remote center data partnering. Designing, planning, making, developing, growing infrastructure.
Sustainability Under Construction: DataOne
Suzie Allard (UTK)
Researcher as practitioner.
DataOne - 2009 named and funded - infrastructure represented by 3 coordinating nodes and 40 member nodes. Coordinating nodes hold all the data, central holds the metadata. Enable new science by providing access to data. Member nodes from different communities - biological, earth science, geological data. From the beginning, DataOne had participatory design. Two different working groups: Cyber to deal with infrastructure, sustainability and governance, and outreach.
Sustainability - answer needs of the people we serve: scientists in different domains (academic, government), computer scientists, librarians.
Sustainability in 4 contexts: community engagement, long term planning, what kind of business models do we have to have (money sustains infrastructure), and change management (what is the core thing we offer that will be there in the long term?).
Lesson learned:
- Understand your stakeholders. Value proposition: what is value we bring to our scientists as stakeholders? People running repositories are crucial to sustainability, Data managers at partners are crucial into future. Making sure you get more downloads, prove value, however you need to report to be funded.
- Planning from inception: what is sustainability, what will it look like.
- Develop business case, think entrepreneurially, and how you will change
- Adjust the formula to respond to change. Assess where we are now vs where we are going.
What has changed: customers and market, SOPs to make sure infrastructure stays in place faster, cleaner. Thinking about cost structure and revenue stream, and what resources do you have that you can afford to keep into future and what will it look like into future.
Rubbing Shoulders: Data Sharing Approaches in Scientific Data Repositories
Sarika Sharma and Steve Sawyer
Data sharing: improve transparency, allow for reuse, and encourage open access. Sharing as a property of the scientific community; sharing beyond informal reciprocity requires institutions to exist.
A shared capacity for the future? Big infrastructure for science - complex landscape (publishers, academic societies, government agencies, host institutions, community/collectives) - both colleagues and competitors.
Governance: institutions go beyond informal practices - make decisions, how do you manage resources, allocate decision-making, enact mechanisms for coordination and resolution of differences.
Scientific repositories: Aligning data sharing practices, infrastructure and governance. Long Term Ecological Research (LTER), Global Lake Ecological Observatory Network (GLEON), the Digital Archaeological Record (tDAR). (Universities now building repositories as competitive advantage for their faculty). How does governance of scientific data repository
LTER - sites in network, federated shareable data, infrastructure localized; layered on top is a generalized search platform. GLEON - researcher groups, hundreds involved. Data standards: upon membership agreement, search data via individual GLEON researchers; infrastructure local but is a contact list, totally decentralized. tDAR - single portal gives access to multiple repositories, individual scholar can contact the local site they need. Governance there is making sure people stay connected; data belongs to individual subscribers and not network.
Scientific repositories from governance standpoint: federated access, balancing shared and local, human-based infrastructure. If you/lab leave, hard to replace. How do we know we can trust you, that your data will stay and is structured the way we need.
Models: Federated access, human-based access, directed access.
No such thing as a best practice that would work across communities.
Concept of governance is broader in terms of engaging community, but in what way and to what purpose?
Q&A
With no influx of money, the value of data repositories of info with living specimens is difficult to maintain. When funding dies or lab leaves, data leaves with it. Unreplicatable data (for example, see U of Louisiana at Monroe).
Also digital objects--as close down and leave space, all manner of photographs and digitized infrastructure, field notebooks, etc. once digitized, could be handled.
Politics and infrastructure.
What about arts and humanities as relying on data? This was all science repositories. Unprivileged and underserved population. We don't usually think of them as data driven. Answer (Steve) - scientific data infrastructure: what is a library, what is a museum, what are our artifacts? These have well founded governance processes, they've figured it out. In most of scientific infrastructure trying to catch up to hundreds of years of libraries and museums, so unsophisticated governance. When stakeholder driven, very short term focus, looking right now, and it does change way we think about these artifacts. Point of great tension because these scientific communities tend to be localized. Data governance for humanities and arts: while libs and museums are doing things, only certain communities [...] governance is different.
Emerging communities.
Need a dissertation on the politics behind the history of ARTstor and the one that got subsumed in ARTstor. Dance and theater are interesting because intellectual property issues get out of control. Multiple levels of IP that make it difficult to make things available in a networked sort of way. Digital Repository of Ireland has been working on digitizing humanities and making available - almost e
Transcription for choreography - Labanotation?
Differential access to resources for decision-making - why are some people given access or power over who has access to a dataset and others are not? [My thought: same reason resources are differentially distributed anywhere else]. Depends on incentivization within community in which scholars exist. If they get a benefit for sharing, see more of it than hoarding/not-sharing. Externalities and how people will act. Or sensitive data. Or business model - sometimes restrict sharing because need to charge because if no charge, organization wouldn't exist to even provide the sharing. Policy around information is difficult because people perceive 'information' in different ways, complex and multidimensional.
Allude to need to make infrastructuring more visible. As user of infrastructure, want it to be a black box that just works. We only realize importance of infrastructure when it fails. But here, infrastructure made visible in order to even make it happen. Ex: Apache user community is millions, but the number of people involved in upkeep is much smaller. Governance decides how transparent it needs to be.