BioMedBERT

Natural Language Processing and Understanding Tools for Biomedical Researchers

AI vs COVID-19 Initiative

Breakthroughs in machine learning make it possible for the first time to read and understand complex research at scale. Our goal is to make BioMedBERT a resource for biomedical researchers, doctors, and virologists that augments their ability to sift through existing biomedical research, extract novel insights, and accelerate new drug discoveries.

Who will benefit?

  1. Biomedical researchers - those who search for cures to illnesses. In particular, we aim to help researchers looking for treatments and vaccines for COVID-19, and more generally those researching novel drugs, vaccines, and treatment protocols.

  2. Virologists - those researching viruses: their mechanisms of reproduction, propagation, and infection of host organisms.

  3. Epidemiologists - those studying the frequency, distribution, causes, and effects of diseases in human populations.

Why build it?

  1. COVID-19. The virus is spreading rapidly and there is currently no proven therapy or vaccine. BioMedBERT will provide tools to surface novel insights from existing research.

  2. Research Volume & Velocity. Millions of research documents are spread across a variety of repositories, and the pace of publication is accelerating. The existing volume of research is already too vast for any single individual or small group to master, and it is only growing.

  3. Missed Latent Connections. Nuanced or weak connections between research items could be recognized by our language model, allowing researchers to approach existing problems with a broader set of tools and ideas.

Approach

Our approach starts with gathering one of the largest biomedical research datasets in the world: BREATHE (Biomedical Research Extensive Archive To Help Everyone), which contains more than 3.7 million machine-read medical and research publications. We then apply state-of-the-art machine learning techniques, specifically deep-neural-network language models built on emerging architectures such as BERT and T5, to identify latent insights in the literature.
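To illustrate the core idea behind BERT-style pretraining, the sketch below implements the masked-language-modeling objective in plain Python: a fraction of tokens is hidden, and the model's training task is to recover them from context. This is a minimal, self-contained illustration, not BioMedBERT's actual training pipeline; the function name, masking rate, and example sentence are our own choices for demonstration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """BERT-style masking: hide ~mask_prob of the tokens.

    Returns the masked sequence and, for each position, the original
    token the model would have to predict (None where nothing is hidden).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # training target: recover this token
        else:
            masked.append(tok)
            labels.append(None)  # no prediction needed at this position
    return masked, labels

# Example: a biomedical sentence with some tokens hidden
sentence = "the spike protein mediates viral entry into host cells".split()
masked, labels = mask_tokens(sentence)
print(" ".join(masked))
```

During pretraining, a transformer encoder sees the masked sequence and is optimized to predict the hidden tokens; on a domain corpus like BREATHE, this is how the model absorbs biomedical vocabulary and context without any manual labeling.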

Our infrastructure runs on Google Cloud Platform and takes advantage of the massive compute power available via the TensorFlow Research Cloud. A single Cloud TPU v3 Pod can deliver more than 100 petaflops (1 petaflop = 10^15, i.e., one thousand million million, floating-point operations per second). By utilizing the compute, storage, and networking of Google Cloud along with standard open source platforms and tools such as TensorFlow, we are able to train, refine, and iterate our models faster than ever before.

Team members and collaborators:

Our team feels a sense of urgency to bring unique tools to help with the current crisis, and to help the world become better prepared for the next. The team is composed of machine learning experts, computer scientists, and technology practitioners from across the globe, including AI experts who are Machine Learning Google Developer Experts (GDEs), software developers from 42 Silicon Valley, and partners who are experts from Google Cloud and the TensorFlow Research Cloud.

Dave Elliott

Dan Goncharov

Francesco Mosconi

Ivan Kozlov

Uliana Popov

Ekaba Bisong

Souradip Chakraborty

Shweta Bhatt

Antoine Delorme

Khloe Hou

Igor Popov

Gulnozai Khodizoda

Blaire Hunter

Simon Ewing

Suzanne Repellin

Christine Yang

Soonson Kwon

Partners and Sponsors:




Relevant Research

  • Mat2vec: “Unsupervised word embeddings capture latent knowledge from materials science literature”, Nature (2019)

  • BioBERT: “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”, Bioinformatics (2020)

  • BERT: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL (2019)

  • T5: “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, arXiv (2019)