By: Nicole Vasilevsky and Jane Lomax
Like most in-person gatherings over the past year, the annual International Society for Biocuration conference went virtual in 2021. At the inaugural session on April 13, 2021, a group of panelists discussed ‘the future of biocuration’. The panel was moderated by Rama Balakrishnan, who has served on the ISB Executive Committee since 2017 and is co-chair (along with Susan Bello from the Jackson Laboratory) of the Biocuration2021 conference. Rama was joined by four panelists, drawn from academia and industry, to discuss what is in store for our community. The recording is available here.
What is curation: Distilling knowledge from information
Rama opened the discussion with a fundamental question: ‘what does the word curation mean to you?’ Many biocurators can probably relate to this question, one frequently asked by people outside the field. The role of a curator at a museum, for example, may be familiar, but biocuration is far less well understood. Rama, who has held curation roles in both academia and industry, tried to get at how the day-to-day task of curation differs among us. Sandra Orchard, from EBI, shared the classical definition of ‘turning unstructured data into structured, searchable data’, but noted that this is not the whole story: while many curation tasks make data more structured, text-minable, and machine-readable, the outcome of curation is not always completely structured data. Carol Bult from MGI defined curation as “applying semantic standards to ensure data findability and aggregation.”
Coming from the industry perspective, both Kambiz Karimi (Myriad Women’s Health) and James Malone (SciBite) agreed: curation involves meaning-based capture and structuring of content using controlled vocabularies. Data curation can also include data cleaning, often a pre-curation task. Curation improves and enriches data interpretability and ultimately adds value, enabling enhanced search, querying, semantic integration, and meta-analysis.
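To make these definitions concrete, here is a minimal sketch (in Python) of turning an unstructured sentence into structured, searchable records using a controlled vocabulary. The mini-vocabulary, sentence, and record layout are invented for illustration; real curation pipelines map text to full ontologies such as GO or MONDO.

```python
# Hypothetical mini-vocabulary mapping free-text phrases to ontology terms.
# The IDs are real GO/MONDO identifiers, but this lookup table is illustrative.
VOCAB = {
    "apoptosis": ("GO:0006915", "apoptotic process"),
    "breast cancer": ("MONDO:0007254", "breast cancer"),
}

def curate(sentence: str) -> list[dict]:
    """Turn an unstructured sentence into structured, searchable records."""
    records = []
    for phrase, (term_id, label) in VOCAB.items():
        if phrase in sentence.lower():
            records.append({
                "evidence_text": sentence,  # the original unstructured source
                "term_id": term_id,         # stable, queryable identifier
                "term_label": label,        # human-readable ontology label
            })
    return records

print(curate("BRCA1 loss promotes breast cancer and suppresses apoptosis."))
```

Even this toy example shows where the value comes from: the ontology identifiers, not the free text, are what make the records searchable and integrable across resources.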
How can we ensure quality?
Given that the panelists agreed on a high-level definition of curation, Rama asked about ensuring data quality: what does good quality mean, and what metrics can assess it? Different quality control (QC) and quality assurance (QA) processes apply depending on the type of curation being done, whether you are curating tax forms (as James did in a summer job long ago) or the mouse biology literature. Processes discussed by Carol and others included inter-curator checks, crowdsourcing feedback from downstream users, practices that ensure collaboration, and regression testing for continuity and consistency across datasets. Sandra pointed out that curators cannot be experts in everything, and stressed the importance of specialist databases whose domain-expert curators take the first pass at curation and build re-processing pipelines or scoring mechanisms to export high-quality subsets to other data resources.
James and Rama noted how detecting outliers can assist with quality checks. However, spotting outliers is not always easy without expert knowledge of the specific area. For example, Rama curates patient data at Genentech and once came across a record reporting that a patient had a 100℃ fever (rather than 100℉), which was easy to spot as an error. In a more complicated clinical use case, erroneous data points may be far less obvious and require much more specialized knowledge to detect.
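As a toy illustration of the kind of plausibility rule that catches Rama’s example, consider the sketch below. The ranges, field names, and records are assumptions for illustration, not a clinical rule set.

```python
# Plausible body-temperature ranges by unit (illustrative bounds only).
PLAUSIBLE_TEMP = {"F": (95.0, 108.0), "C": (35.0, 42.5)}

def flag_implausible(records):
    """Yield records whose temperature falls outside the plausible range."""
    for rec in records:
        lo, hi = PLAUSIBLE_TEMP[rec["unit"]]
        if not lo <= rec["temp"] <= hi:
            yield rec

data = [
    {"patient": "A", "temp": 100.0, "unit": "F"},  # a plausible fever
    {"patient": "B", "temp": 100.0, "unit": "C"},  # likely a unit error
]
print(list(flag_implausible(data)))  # flags patient B only
```

The harder cases the panel alluded to are exactly those for which no simple range check exists, and a domain expert’s judgment is the only reliable filter.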
Kambiz shared that Myriad has several QC approaches, including a peer review process, a spot-checking program in which curators check each other’s work, and a quality check that compares their classifications against previous classifications from the community.
Sandra also noted the importance of researchers collaborating with curators prior to publication. She shared an anecdote in which an author published a paper with an erroneous dataset: a simple mistake, a row accidentally deleted from a spreadsheet, had produced nonsensical results. A curator picked this up and contacted the author, who was able to correct it. The episode speaks to the value of submitting data to a database ahead of publication, and to the important role a curator can play in the research community.
Opportunities with Machine Learning and Automation
While much biocuration is done manually, more and more processes and workflows are being automated with text mining, machine learning (ML), natural language processing (NLP), and AI. The panelists were asked how AI and ML will affect the work of biocurators. Sandra assured us that machine learning will enhance our work, but she is not concerned that it will replace human curation: data is too messy, the literature is too unstructured, and human review and curation will be needed for the foreseeable future. James echoed her sentiments: “[Machine learning] will become an assistant, it will not replace subject matter experts who are biologists, scientists, curators. It will play a role in helping us.” He sees it as an opportunity for biocuration, where we should work to exploit advances in deep learning, noting that the importance of biocuration is more pronounced now than ever. We can train AI to aid in biocuration, and we can work together. Moreover, quality ML/AI requires human-curated training sets, so advances in these technologies will require more curators; that is a new opportunity for this community.

Carol agreed, but raised the point that these technologies may be perceived as advanced enough to replace curators, a perception that is creating funding challenges for biocuration. While machine learning can help make biocuration scalable, we need to do better as a community at communicating how human curation and machine learning interrelate and feed off each other.
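One way to picture the human-in-the-loop arrangement the panelists describe is sketched below. Here `suggest_terms` is a stand-in for any real NER or text-mining model, and the workflow is a schematic assumption, not any specific resource’s pipeline.

```python
# Schematic human-in-the-loop curation: the model proposes, the curator
# decides, and every decision becomes training data for the next model.

def suggest_terms(abstract: str) -> list[str]:
    """Stand-in for an ML model proposing candidate ontology terms."""
    return ["GO:0006915"] if "apoptosis" in abstract.lower() else []

def review(abstract: str, curator_decision):
    """Route machine suggestions through a human curator."""
    accepted, training_examples = [], []
    for term in suggest_terms(abstract):
        keep = curator_decision(abstract, term)           # human judgment call
        training_examples.append((abstract, term, keep))  # fuels retraining
        if keep:
            accepted.append(term)
    return accepted, training_examples
```

The loop captures both points made on the panel: the machine accelerates the curator, and the curator’s verdicts are precisely the human-curated training data the machine needs.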
“Biocuration has never been more valuable than it is now, and yet it is underappreciated. It’s something the Society can help us tackle: countering this perception, and articulating how manual and machine-learning curation can go hand in hand.” – Carol Bult
Approaching authors
An audience member asked whether database curators approach authors for clarification about their published data, and whether authors respond. Kambiz shared that his team does contact authors when an article’s content or data is ambiguous. Sandra concurred, and noted a clear time dependency: for a recently published paper (within roughly 12–18 months), they frequently get a response; for a paper more than three years old, they are much less likely to get a reply, as the first author may have moved on and the PI may be unfamiliar with the details of the data.
This may point to an opportunity to better train researchers in curation methods and standards, so that reporting in their publications is unambiguous. Requirements to share data at the time of publication will also help address this need.
Getting the journals involved
This led to the next question: how can we work with journals to publish data in a more structured way? Carol has experience working with journals in the mouse community, which are careful to publish mouse names using the accepted terminology and nomenclature. She did mention that there is sometimes pushback over whether the recommended standard is truly the accepted standard, and whether it will evolve or change in the future. We may all be familiar with this situation.
This is an opportunity for a systematic community approach: the ISB should promote standards adoption to the journals.
Sandra pointed out that one challenge with asking journals to use our standards is the sheer number of journals; a more targeted approach may be more effective. For example, the proteomics community succeeded in getting a restricted number of journals in its field to require data deposition to ProteomeXchange (http://www.proteomexchange.org/) prior to publication.
Sandra also recommended that we first talk amongst ourselves as a community to define our needs and the standards we want to adopt and promote, and then approach the journals.
The elephant in the room: Funding
In recent years, NIH funding for various databases has decreased. How do we sustain our own careers and train the next generation of curators?
Kambiz felt it is easier to justify the need for curation in his industry because of its regulatory requirements. Even with NLP-based processes to extract gene–disease relationships, manual review will always be needed; he foresees automated processes assisting manual curation going forward.
Carol emphasized that we need to promote how important curation is to data science. Data science is recognized as an important field, so we should frame curation in terms of its role in data science. We have to get better at explaining the return on investment in curation: what can we do with curated data that we could not do if it were not curated? She pointed out the reality that biocuration is treated as infrastructure, which is largely ignored until it breaks. As a Society, can we demonstrate the impact that biocuration has on advancing data science?
Sandra reiterated that we need to make ourselves more visible; people outside the community need to understand what we do. We need to work together efficiently so that we do not duplicate effort: align on standards, use specialist databases for initial analysis and data cleaning, rely on baseline resources such as accession numbers, and showcase good examples of good curation.
Continue the conversation on Slack.
Do you have topics you’d like to discuss in a future panel, or suggested speakers? Please let us know (intsocbio@gmail.com).