(Biomedical Research Extensive Archive To Help Everyone)
Natural Language Understanding Tools for Biomedical Researchers
AI vs COVID-19 Initiative
Breakthroughs in machine learning make it possible, for the first time, to read and understand complex research at scale. Our goal is to make the BREATHE Dataset and BREATHE Deep Literature Search a resource for biomedical researchers, doctors, and virologists, augmenting their ability to sift through biomedical knowledge and existing research, extract novel insights, and make new drug discoveries.
Who will benefit?
Biomedical researchers - those who look for cures to illnesses. In particular, we aim to help researchers looking for treatments and vaccines for COVID-19 and, more generally, those researching novel drugs, vaccines, and treatment protocols.
Virologists - those doing research on viruses, their mechanism of reproduction, propagation and infection of host organisms.
Epidemiologists - those studying patterns of frequency and the causes and effects of diseases in the human population.
Why build it?
COVID-19. The virus is rapidly advancing and there is currently no proven therapy or vaccine. BioMedBERT will provide tools to discover novel insights into existing research.
Research Volume & Velocity. There are millions of research documents in a variety of repositories, and the pace of publication is accelerating. The existing volume of research is already too vast for any single individual or small group to master, and it is only growing.
Missed Latent Connections. Nuanced or weak connections between research items could be recognized by our language model, allowing researchers to approach existing problems with a broader set of tools and ideas.
Our approach starts with gathering one of the largest research datasets in the world, 'BREATHE'. The BREATHE (Biomedical Research Extensive Archive To Help Everyone) dataset contains more than 16 million machine-read medical and research publications. Our approach then uses state-of-the-art machine learning techniques, including deep neural network models for language understanding, to identify latent insights in the literature. We use emerging language model architectures (BERT, T5) to achieve these insights.
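As a rough illustration of what "latent insights" can mean in practice, one common technique is to represent terms or documents as vectors and compare them by cosine similarity, so that items never mentioned together can still be ranked as related. The sketch below is not the BREATHE pipeline: the function names, the placeholder term names, and the three-dimensional vectors are all made up for illustration, and a real model such as BERT or T5 would produce much higher-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, candidates):
    """Return candidate names sorted from most to least similar to the query.

    `candidates` maps a name to its embedding vector.
    """
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query_vector, candidates[name]),
        reverse=True,
    )


# Hypothetical toy embeddings (assumption: invented values, not model output).
toy_embeddings = {
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.1, 0.9, 0.0],
    "term_c": [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_embeddings)
print(ranking)
```

With embeddings learned from a large corpus, the same ranking step is what lets a weak or indirect relationship between two research items show up near the top of a search.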
Our infrastructure runs on Google Cloud Platform and takes advantage of the massive compute power available via the TensorFlow Research Cloud. A single Cloud TPU v3 Pod can deliver 100+ petaflops (1 petaflop = one thousand million million, or 10^15, floating-point operations per second). By utilizing the compute, storage, and networking of Google Cloud along with standard open source platforms and tools such as TensorFlow, we are able to train, refine, and iterate our models faster than ever before.
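To make the petaflop figure concrete, the short sketch below converts a throughput quoted in petaflops into raw floating-point operations per second and estimates how long a fixed amount of work would take at that rate. The helper names and the 10^20-operation workload are assumptions chosen for the arithmetic, not measurements from our training runs, and the estimate assumes the peak rate is sustained, which real workloads rarely achieve.

```python
PETA = 10**15  # 1 petaflop = 10^15 floating-point operations per second


def petaflops_to_flops(petaflops):
    """Convert a rate in petaflops to floating-point operations per second."""
    return petaflops * PETA


def seconds_for_workload(total_operations, petaflops):
    """Idealized time to finish a workload at a sustained peak rate."""
    return total_operations / petaflops_to_flops(petaflops)


# 100 petaflops, the lower bound quoted for a Cloud TPU v3 Pod.
pod_rate = petaflops_to_flops(100)

# Hypothetical workload of 10^20 operations (assumption for illustration).
estimated_seconds = seconds_for_workload(1e20, 100)

print(pod_rate, estimated_seconds)
```

At 100 petaflops the rate is 10^17 operations per second, so the hypothetical 10^20-operation job would take about 1,000 seconds under these idealized assumptions.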
Team members and collaborators:
Our team feels a sense of urgency to bring unique tools to bear on the current crisis, and to help the world become better prepared for the next. The team is composed of machine learning experts, computer scientists, and technology practitioners from across the globe. It includes AI experts who are Machine Learning Google Developer Experts (GDEs) and software developers from 42 Silicon Valley, and it consults with experts from Google Cloud and TensorFlow Research Cloud.
Scientific Advisory Board:
Partners and Sponsors:
Mat2vec: “Unsupervised word embeddings capture latent knowledge from materials science literature”, Nature (2019)
BioBERT: “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”, Bioinformatics (2020)
BERT: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL (2019)
T5: “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (2019)