Designing the BREATHE Dataset
The first step in designing BREATHE was to reach out to biomedical researchers to better understand their workflows, tools, challenges, and, most importantly, what ‘relevance’ means in medical literature. We found some common insights:
overwhelming amount of existing and new information
ambiguous and inconsistent sources of truth
limited information retrieval functionality in current tools
search based only on simple keywords
multiple scattered datasets
inability to understand the meaning of words in context
One of the pillars of the current AI revolution is the ability of these systems to improve as they analyze more data. Recent work (BERT, XLNet, T5, GPT-3) uses millions of documents to train state-of-the-art neural networks for NLP tasks.
Based on these insights, we determined the best way to help the research community was to create a single dataset containing a very large corpus of papers, and then to make that dataset available in machine-usable formats. Inspired by the Open Access movement and initiatives such as the Chan Zuckerberg Initiative’s Meta, we sought to find as many relevant, unique, and freely available publications as possible and collect them into one easily accessible dataset designed specifically to train AI systems.
The Biomedical Research Extensive Archive To Help Everyone (BREATHE) is a large-scale biomedical database containing entries from top biomedical research repositories. The dataset contains titles, abstracts, and full body texts (where licensing permits) for over 16 million biomedical articles published in English. We released the first version in June 2020, and expect to release new versions as our search crawlers continuously update the corpus. One idea for further improving the dataset, and the domain-specific knowledge it tries to capture, is to collect articles originally written in languages other than English.
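A BREATHE entry can be pictured as one JSON object per line (JSONL). The sketch below is illustrative only: the title/abstract/body fields reflect the description above, while the remaining field names are assumptions.

```python
import json

# A hypothetical BREATHE record. Only title, abstract, and body are
# confirmed by the description above; "source" and "language" are
# illustrative assumptions.
record = {
    "title": "An example article title",
    "abstract": "An example abstract.",
    "body": "Full text, included only where licensing permits.",
    "source": "example-repository",  # assumed metadata field
    "language": "en",                # the corpus is English-only
}

# JSONL stores one compact JSON object per line.
line = json.dumps(record)
print(line)
```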
While there are several COVID-19 specific datasets, BREATHE differs in that it is:
broad - contains many different sources
publicly accessible and free-to-use
hosted on a scalable, easy-to-analyze, cost-effective data warehouse - Google BigQuery
BREATHE Dataset Creation
[Figure: BREATHE dataset creation architecture]
The development and automation of the article download workflow was significantly accelerated by using Google Cloud infrastructure. This system, internally called the “ingestion pipeline”, follows the classic three-stage pattern: Extract, Transform, and Load (ETL).
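The three stages can be sketched as a minimal skeleton. The function bodies here are stand-ins: in the real pipeline, extraction is a Selenium scraper writing to Cloud Storage, transformation is a Dataflow job, and loading targets BigQuery.

```python
# Minimal ETL skeleton mirroring the ingestion pipeline's three stages.
# Each stage is a plain function for illustration only.

def extract():
    # Stand-in for scraping raw articles from a source repository.
    return [{"title": " An Example Article ", "abstract": "Text."}]

def transform(raw_records):
    # Stand-in for cleaning and normalizing into the general schema.
    return [
        {"title": r["title"].strip(), "abstract": r["abstract"]}
        for r in raw_records
        if r.get("title")  # drop records with missing/empty titles
    ]

def load(records):
    # Stand-in for writing JSONL to Cloud Storage / BigQuery.
    return len(records)

loaded = load(transform(extract()))
print(loaded)  # number of records that survived the pipeline
```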
To easily prototype the main logic of the scrapers, our interns and collaborators used Google Colaboratory notebooks (or ‘Colab’). Colab is a hosted Jupyter notebook service that lets users write and execute Python in the browser with no setup or configuration, and provides free, limited access to GPUs, making it a tool of choice for many machine learning practitioners. It also made it easy to share code amongst our interns and collaborators.
The scrapers are written using Selenium, a suite of tools for automating web browsers; we chose Chromium in headless mode (Chromium is the open source project on which the Google Chrome browser is based). All the raw data from the different sources is downloaded directly to our Google Cloud Storage bucket.
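A scraper along these lines might look like the sketch below. The bucket layout is an assumption, and the Selenium import is kept local to its function (it needs a Chromium/chromedriver install), so the pure helper can be used on its own.

```python
def gcs_blob_name(source: str, filename: str) -> str:
    # Pure helper: where a raw download lands in the Cloud Storage
    # bucket. The "raw/<source>/<filename>" layout is an assumption.
    return f"raw/{source}/{filename}"

def fetch_page_source(url: str) -> str:
    # Requires selenium plus a local Chromium/chromedriver, so the
    # imports are deliberately kept inside this function.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")   # run Chromium without a UI
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

print(gcs_blob_name("example-repo", "article_001.xml"))
```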
We ingested over 16 million articles from ten different sources, each with raw data formatted in CSV, JSON, or XML and its own unique schema. Our tool of choice to efficiently process this amount of data was Google Dataflow, a fully managed service for executing Apache Beam pipelines on Google Cloud. In the transform stage the pipeline processes every raw document, applying cleaning, normalization, and multiple heuristic rules to produce a final general schema, formatted in JSONL. The heuristics applied include checks for null values, invalid strings, and duplicate entries. We also verified consistency between fields that had different names in different tables but represented the same entity.
Documents going through these stages end up in three different sink buckets, based on the status of the operation:
Success: for documents correctly processed
Rejected: for documents that did not match one or more of our rules
Error: for documents that the pipeline failed to process
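The three-way routing above can be sketched in plain Python. The real pipeline implements it as an Apache Beam transform with multiple outputs, and the individual checks below are illustrative stand-ins, not the exact rule set.

```python
def route_document(doc):
    """Classify a raw document as 'success', 'rejected', or 'error',
    mirroring the three sink buckets."""
    try:
        title = doc.get("title")
        abstract = doc.get("abstract")
        # Heuristic: reject null values and invalid strings.
        if not title or not isinstance(title, str) or not title.strip():
            return "rejected"
        if abstract is not None and not isinstance(abstract, str):
            return "rejected"
        return "success"
    except AttributeError:
        # Documents the pipeline fails to process (e.g. not parsed
        # into a dict at all).
        return "error"

print(route_document({"title": "A valid title"}))      # success
print(route_document({"title": None}))                 # rejected
print(route_document("not a parsed document at all"))  # error
```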
Apache Beam lets us express logic that is not straightforward with an easy-to-read syntax, and Google Dataflow makes it easy to scale this process across many Google Cloud compute instances without changing any code. The pipeline was applied to the full raw data, distilling it to 16.7 million records totaling 100GB of JSONL text data.
Finally, the data was loaded into Google Cloud Storage buckets and Google BigQuery tables. BigQuery doesn’t require us to manage any infrastructure or employ a database administrator - making it ideal for our project, which is composed mainly of data science experts. We iterated several times on the ingestion process as we scaled the number of total documents processed. In the initial stages of data exploration, our data scientists were able to explore the contents of the data loaded into BigQuery simply by using standard Structured Query Language (SQL).
Basic data exploration reveals that, across all the abstracts in BREATHE, there are 3.3 billion total words and 2.8 million unique words. Using Python and Colab, it was also easy to do some exploratory data analysis. For example, here's a plot of the word frequencies:
[Figure: word frequencies across BREATHE abstracts]
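A Colab-style version of that word-frequency analysis can be sketched with the standard library. The abstracts here are placeholders; the real analysis ran over all abstracts in the BigQuery table.

```python
from collections import Counter
import re

# Placeholder abstracts standing in for the 16M+ real ones.
abstracts = [
    "The protein binds the receptor.",
    "The receptor response was measured.",
]

# Lowercase, tokenize on alphabetic runs, and count frequencies.
counts = Counter(
    token
    for text in abstracts
    for token in re.findall(r"[a-z]+", text.lower())
)

total_words = sum(counts.values())
unique_words = len(counts)
print(total_words, unique_words)      # → 10 7
print(counts.most_common(2))          # most frequent tokens first
```

Plotting `counts.most_common()` on log-log axes (e.g. with matplotlib) would reproduce a word-frequency figure like the one described above.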
Google Public Dataset Program
We believe other data scientists may find value in the dataset, so we chose to make it available via the Google Public Dataset Program. This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier: each user can process up to 1TB for free every month. This quota can be used by anyone to explore the BREATHE dataset using simple SQL commands. We also released a static, point-in-time dump via the program.
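Exploring the dataset amounts to running standard SQL against the public tables. The snippet below only builds a query string of the kind involved; the table path and column names are assumptions for illustration - check the BigQuery console for the actual ones.

```python
# Builds a standard-SQL query of the kind used to explore BREATHE.
# The table path and column names are illustrative assumptions.
table = "`some-project.breathe.some_table`"  # assumed path
query = f"""SELECT title
FROM {table}
WHERE LOWER(abstract) LIKE '%coronavirus%'
LIMIT 10"""

# In Colab this could be executed with the google-cloud-bigquery
# client or the %%bigquery cell magic (both require authentication);
# either way the bytes scanned count against the monthly free tier.
print(query)
```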