Metadata

Improving scientists’ ability to describe experimental context in bioinformatic research

Team

Design
Duminda Aluthgamage (Lead Researcher)
Ljubomir Bradic (Director of Design)

Bioinformatics
Kara Woo
Kelsey Montgomery
Milen Nicolov
Nicole Kaur

My Role

Planning
Research Operations (Recruitment, participant admin)
Interview Guides
Formative Interviews
Synthesis
Evaluative Research & Analysis
Reporting
Stakeholder Presentation

Deliverables

Metadata Formative Research Report
HTAN Curator User Testing Report
DCC Validator User Testing Report
Feature Use Case Prioritization
Formative Research Presentation

The Opportunity

Quality experimental metadata is required to maximize the value and usefulness of datasets generated by bioinformatic research. What are our users’ experiences & challenges around metadata?

Study Goals

Foundational: Understand the experiences and challenges of people contributing data

  1. What do the end-to-end workflows for this work look like?

  2. What types of projects do they work on? What tools and technologies are used in these processes, including complementary products?

  3. What are the current pain points for users during this process, and what services are they missing from their overall workflow?

  4. What are the differences and similarities in needs among the user segments involved?

Formative: Better understand users’ existing knowledge of metadata

  1. What are users’ attitudes toward, and understanding of, the value of quality metadata?

  2. What are users’ perceptions of metadata, annotations, and Open/FAIR data?

Formative: Review internally derived use cases

  1. How do the assumed user stories compare with the requirements of those contributing data?

  2. Where should we focus our immediate design and development efforts, and where is further research needed?

Evaluative: Understand the user experience of the newly developed R Shiny tools

  1. What are users’ experiences with, and expectations of (in functionality and value), the existing applications for upload, annotation, and curation?

  2. What other functionalities would users like to see?

I led the design, execution, and reporting of this research project in support of a working group whose goal was to improve the end-to-end data contribution workflow for our users. I sought feedback from multiple stakeholders within the organization and performed all design research duties end-to-end: study planning, study guide writing & piloting, recruitment, interviewing 16 participants (semi-structured interviews, a card sort & evaluation of R Shiny prototypes), data synthesis, reporting, and multiple presentations to stakeholders.

Interviews & Synthesis

Interview participants can be grouped into one of the following categories:

Data Generators

Non-computational stakeholders generating data in clinical/experimental studies

Data Processors

Computational stakeholders involved in processing data using programmatic pipelines

Data Analysts

Stakeholders drawing insights from data collected through clinical/experimental studies

Data Curators

Internal stakeholders involved in curating data uploaded to Synapse

Programmers

Stakeholders building systems for collecting, processing or analyzing data

Interviews were conducted remotely and recorded. The data collected included key anecdotes and accounts of participants’ work, along with video footage of participants using the prototypes and completing the card sort. Recordings were partially transcribed using Dovetail. Key observations and quotes from the interviews were captured on digital post-it notes, which were then grouped into categories and themes using a standard affinity diagramming approach. Once affinity diagramming was complete, the themes gave rise to insights and further considerations.

Foundational Research Insights

The following 10 insights were generated from the themes.

1. Metadata impacts everyone, but in ways specific to their roles and projects

All Open Science stakeholders are part of a generalized workflow spanning from data collection to publication. However, their consideration of metadata annotations can differ significantly based on their project type and roles.

2. There is a strong understanding of the value of metadata within the Open Science community

Contributors understand that metadata is vital for making data findable and reusable, is required for proper data processing and analysis, and provides direct benefits to the data contributors themselves.

I mean if you don’t have that, you don’t have anything, right? If sample swabs are missing annotation [then] the data is garbage. Like I can’t use a sample that I don’t know if it’s from a tumor or not, or if it’s from a patient who received treatment A or treatment B. It’s a critical, critical connector piece. And if it’s not there, you’re totally pooped...
— P10 (Bioinformatician/PI)

3. A lack of experience can increase the burden of contributing to Open Science projects

Some contributors lack the necessary education and technical knowledge, resulting in difficulty contributing and managing metadata; this points to a need for better documentation of tools and processes.

In an ideal situation, I would have a much clearer understanding of what the expectations are for the data that I’m uploading. Because then that allows me to track stuff on my end, much more easily.
— P6 (Bioinformatician)

4. Issues endemic to Open Science can negatively impact Metadata creation

A lack of staffing and training, inconsistencies caused by the variance of data sources, difficulties tracking samples, and dealing with differing timelines can reduce the ability to provide metadata.

5. Process breakdowns can inhibit metadata quality and result in research inefficiencies

Hand-offs between data generators, curators, and analysts add extra steps at which errors can be introduced, degrading metadata quality and creating research inefficiencies.

Someone [needs] to do this curation, but you’re adding an extra layer in which something can go wrong, because first there is the lab curating the metadata. Then it gets uploaded to Synapse where another person again needs to read it, create the schema that can go on Synapse, and then I download it and then I create a new schema for the manifest. So there are a lot of points where error can happen.
— P8 (Bioinformatician)

6. The complexity around schema design and maintenance is an ongoing challenge

Designing and refining a data model and schema requires many considerations, including how to set up a suitable schema, how to educate others in understanding it, and how to effectively communicate and manage changes.

Ultimately it’s about scalability, right? [our consortium] has hundreds of potential metadata contributors and 30 different dataset types, even a little bit more. So if we have to try to keep track of the communication between us and all of these people and we are working with static templates that contain various attributes that evolve over time. We have to go back and forth with emails with these people. This will get unwieldy very fast.
— IP3 (Data Curator)
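To make the maintenance problem concrete, a data model of this kind typically reduces to a machine-readable data dictionary: each attribute, whether it is required, and the controlled vocabulary it accepts. The sketch below is hypothetical R code; the attribute names and valid values are invented for illustration, not drawn from any Sage or consortium schema.

```r
# Hypothetical data dictionary: each attribute declares whether it is
# required and which controlled-vocabulary values (if any) it accepts.
data_dictionary <- list(
  sample_id = list(required = TRUE,  values = NULL),  # free text, must be present
  assay     = list(required = TRUE,  values = c("RNA-seq", "scRNA-seq", "WGS")),
  tissue    = list(required = FALSE, values = c("tumor", "adjacent normal"))
)

# Check a manifest (a data.frame) against the dictionary and collect problems.
check_manifest <- function(manifest, dictionary) {
  problems <- character(0)
  for (attr in names(dictionary)) {
    rule <- dictionary[[attr]]
    if (!attr %in% names(manifest)) {
      if (isTRUE(rule$required)) {
        problems <- c(problems, paste0("Missing required column: ", attr))
      }
      next
    }
    if (!is.null(rule$values)) {
      bad <- setdiff(unique(manifest[[attr]]), rule$values)
      if (length(bad) > 0) {
        problems <- c(problems,
                      paste0(attr, ": invalid value(s): ", paste(bad, collapse = ", ")))
      }
    }
  }
  problems
}
```

Even in this toy form, any edit a curator makes to the `values` vocabularies silently invalidates manifests built against older templates, which is exactly the versioning and communication burden IP3 describes above.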

7. Current systems don’t resolve key metadata contribution problems at scale

A variety of tools and processes have been developed to address the issues of tracking, validation & annotation; however, they are not scalable solutions.

8. There has been success with tools and process improvements within Sage

Recent developments and improvements to Synapse, validation apps built with R Shiny, and process changes throughout the contribution workflow have been well received.

For the most part actually it works pretty well. I mean I think that the platform is intuitive enough that you can figure out how to do, at least the most basic stuff like uploading a few files and downloading a few files and organizing the folders.
— P5 (Neurologist)
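The validation apps evaluated in this study were Sage-built R Shiny prototypes. To illustrate the general pattern rather than the actual product, here is a deliberately minimal sketch of such an app, reusing the hypothetical check_manifest() and data_dictionary from the earlier sketch:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Manifest validator (sketch)"),
  fileInput("manifest", "Upload metadata manifest (CSV)"),
  verbatimTextOutput("report")
)

server <- function(input, output) {
  output$report <- renderPrint({
    req(input$manifest)                       # wait until a file is uploaded
    df <- read.csv(input$manifest$datapath)
    problems <- check_manifest(df, data_dictionary)
    if (length(problems) == 0) {
      cat("Manifest passes all checks.")
    } else {
      cat(problems, sep = "\n")
    }
  })
}

shinyApp(ui = ui, server = server)
```

The value participants reported maps onto this loop: contributors upload a manifest, see validation problems immediately, and can fix them before the file ever reaches a curator, removing one of the error-prone hand-offs described by P8.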

9. Metadata Standards are valued but many people have limited experience with them

Many users have limited knowledge of metadata standards within bioinformatics. Whilst it may be difficult to develop and adhere to standards, the community recognizes the immense value of doing so.

I haven’t heard of those...
— P1, P3, P4, P5, P6, P7, P8, P9, P10, P12

10. Accreditation and Accountability are important factors within contribution

All participants acknowledged the importance of appropriate accreditation, and the task of metadata annotation is also seen as a responsibility of those who generate the data.

It should be acknowledged somehow because [Metadata annotation] is a huge part of it...I feel like it’s that one thing in science that’s always overlooked completely... that there’s somebody going through this ... So there are some people who are able to do this work and other people who can’t. And the final product changes based on what you put into it basically. So yes, I would like to see people acknowledged!
— P7 (Bioinformatician)

Key Takeaways

I generated 26 key takeaways from this research; the most impactful for the design phase were:

  • Serve both programmatic and non-programmatic users with our upload and annotation features in the platform. The Web UI & the Command Line Interface must be equally prioritized.

  • Allow data to be uploaded both before and after its metadata is provided.

  • Educate on how to manage metadata through documentation, structured onboarding, and best practice guides. Standardize existing docs/support and unify channels (e.g. Slack, GitHub).

  • Continue to understand where contributors may be struggling with sample tracking and how different sites compare with each other. Some sites are doing better due to internal systems.

  • Ensure that all stakeholders have a strong understanding of the schema by exposing data dictionaries as well as visual ways of representing the schema.

  • Consider how the contributor of the metadata can be identified and attributed both in the upload feature (Synapse GUI) and within the published dataset (e.g. the Data Portal).

  • Extend our provenance system to aid with tracking samples across the entire project, considering integrations with the existing internal systems used by contributing labs.