Metadata
Improving scientists' ability to describe experimental context in bioinformatic research
Design
Duminda Aluthgamage (Lead Researcher)
Ljubomir Bradic (Director of Design)
Bioinformatics
Kara Woo
Kelsey Montgomery
Milen Nicolov
Nicole Kaur
Team
Planning
Research Operations (Recruitment, participant admin)
Interview Guides
Formative Interviews
Synthesis
Evaluative Research & Analysis
Reporting
Stakeholder Presentation
My Role
Metadata Formative Research Report
HTAN Curator User Testing Report
DCC Validator User Testing Report
Feature Use Case Prioritization
Formative Research Presentation
Deliverables
The Opportunity
Quality experimental metadata is required to maximize the value and usefulness of datasets generated from bioinformatic research. What are our users’ experiences and challenges around metadata?
Study Goals
Foundational: Understand the experiences and challenges of people contributing data
What are the end-to-end workflows when doing this work?
What types of projects? What tools and technologies are used during these processes, including complementary products?
What are the current pain points for users during this process, what services are they missing from their overall workflow?
What are the differences/similarities of needs among involved user segments?
Formative: Better understand the existing knowledge of metadata
What are users’ attitudes toward, and understanding of, the value of quality metadata?
Understand users’ perceptions of metadata and annotations, and Open/FAIR data
Formative: Review internally derived use cases
How do the assumed user stories compare with the requirements of those contributing data?
Where should we focus our immediate design and development efforts, and further research?
Evaluative: Understand User Experience with newly developed R Shiny tools
What is the UX and expectations (on functionality and value) of existing applications for upload, annotations and curation?
What other functionalities would users like to see?
I led the design, execution, and reporting of this research project in order to support a working group whose goal was to improve the end-to-end data contribution workflow for our users. I sought feedback from multiple stakeholders within the organization and performed all design research duties end-to-end: study planning, study guide writing and piloting, recruitment, interviewing 16 participants (semi-structured interviews, a card sort, and evaluation of R Shiny prototypes), data synthesis, reporting, and conducting multiple presentations to stakeholders.
Interviews & Synthesis
Interview participants can be grouped into one of the following categories:
Data Generators
Non-computational stakeholders generating data in clinical/experimental studies
Data Processors
Computational stakeholders involved in processing data using programmatic pipelines
Data Analysts
Stakeholders drawing insights from data collected through clinical/experimental studies
Data Curators
Internal stakeholders involved in curating data uploaded to Synapse
Programmers
Stakeholders building systems for collecting, processing or analyzing data
Interviews were conducted remotely and recorded to capture conversations. The data collected from the interview sessions included key anecdotes and experiences during work and video footage of participants experiencing the prototypes and card sort. Recordings were partially transcribed using Dovetail. Key observations and quotes from interviews were captured on digital post-it notes, which were then regrouped into categories/themes using a standard affinity diagramming approach. Once affinity diagramming was complete, themes gave rise to insights and further considerations.
Foundational Research Insights
The following 10 insights were generated from the themes.
1. Metadata impacts all, but in specific ways dependent on their roles and projects
All Open Science stakeholders are part of a generalized workflow spanning from data collection to publication. However, their consideration of metadata annotations can differ significantly based on their project type and roles.
2. There is a strong understanding of the value of metadata within the Open Science community
Contributors understand that metadata is vital for making data findable and reusable, is required for proper data processing and analysis, and provides direct benefits to the data contributors themselves.
3. A lack of experience can increase the burden for contributing to Open Science projects
Some contributors lack the necessary education and technical knowledge, resulting in difficulty contributing and managing metadata and a need for better documentation of tools and processes.
4. Issues endemic to Open Science can negatively impact Metadata creation
A lack of staffing and training, inconsistencies caused by the variance of data sources, difficulties tracking samples, and dealing with differing timelines can reduce the ability to provide metadata.
5. Process breakdowns can inhibit metadata quality and result in research inefficiencies
6. The complexity around schema design and maintenance is an ongoing challenge
Designing and refining a data model and schema requires many considerations, including how to set up a suitable schema, how to educate others in understanding it and how to effectively communicate and manage changes.
7. Current systems don’t resolve key metadata contribution problems at scale
A variety of tools and processes have been developed to resolve the issues of tracking, validation, and annotation; however, they are not scalable solutions.
8. There has been success with tools and process improvements within Sage
Recent developments and improvements to Synapse, validation apps built with R Shiny, and process refinements throughout the contribution workflow have been well received.
9. Metadata Standards are valued but many people have limited experience with them
Many users have limited knowledge of metadata standards within bioinformatics. Whilst it may be difficult to develop and adhere to standards, the immense value of doing so is recognized by the community.
10. Accreditation and Accountability are important factors within contribution
All acknowledge the importance of appropriate accreditation, and the task of metadata annotation is deemed a responsibility of those who generate the data.
Key Takeaways
I generated 26 key takeaways from this research; however, the most impactful for the design phase were:
Serve both programmatic and non-programmatic users with our upload and annotation features in the platform. The Web UI & the Command Line Interface must be equally prioritized.
Allow data to be uploaded both prior to and after metadata is uploaded.
Educate on how to manage metadata through documentation, structured onboarding, and best practice guides. Standardize existing docs/support and unify channels (e.g., Slack, GitHub).
Continue to understand where contributors may be struggling with sample tracking and how different sites compare with each other. Some sites are doing better due to internal systems.
Ensure that all stakeholders have a strong understanding of the schema by exposing data dictionaries as well as visual ways of representing the schema.
Consider how the contributor of the metadata can be connected to and attributed in both the upload feature (Synapse GUI) and within the published dataset (e.g., Data Portal).
Extend our provenance system to aid with tracking samples across the entire project, considering integrations with existing internal systems used by contributing labs.