TiDE, Clinical Text De-identification pipeline
...is now open source software
Our privacy-preserving TiDE clinical text de-identification pipeline is now open source
Aug 31, 2021: In 2019, alongside the development of our new STARR-OMOP Clinical Data Warehouse, we developed a clinical text de-identification pipeline (TiDE) that uses "hiding in plain sight" to reduce the probability of re-identification. TiDE is now open source. The algorithm is containerized, and detailed documentation is provided for end users.
In TiDE, "hiding in plain sight" means that surrogates replace names and addresses in the clinical text, so any identifier the pipeline misses is hard to tell apart from the surrogates around it. Names and addresses are found using Named Entity Recognition (NER) as well as by pattern matching against known PHI for the patient. We also replace other HIPAA identifiers, such as telephone numbers and MRNs, with realistic surrogates using text processing approaches. The resulting clinical text looks near-real to a human or an AI. The open source version can be run on a laptop or a powerful server. Apache Beam is used for batch processing.
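As a rough illustration of the surrogate idea (not TiDE's actual implementation), the sketch below replaces known patient names and phone-like strings with consistent surrogates. The function names, the tiny surrogate pool, and the phone pattern are all assumptions made for this example; NER-based detection is omitted entirely.

```python
import random
import re

# Hypothetical surrogate pool; a real system would draw from far larger lists.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Taylor Reed"]

# Simplified US-style phone pattern (assumption): 3-3-4 digits with separators.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def fake_phone(rng):
    """Return a phone-like surrogate using the fictional 555-01xx range."""
    return f"{rng.randint(200, 999)}-555-01{rng.randint(0, 99):02d}"


def deidentify(text, known_names, rng):
    """Replace known names and phone-like strings with surrogates.

    The same original value always maps to the same surrogate within one
    call, so repeated mentions stay consistent in the output text.
    """
    mapping = {}

    def consistent(key, make):
        if key not in mapping:
            mapping[key] = make()
        return mapping[key]

    if known_names:
        # Longest names first so a full name wins over a shorter overlap.
        ordered = sorted(known_names, key=len, reverse=True)
        alternation = "|".join(re.escape(n) for n in ordered)
        name_re = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
        text = name_re.sub(
            lambda m: consistent(
                m.group(0).lower(), lambda: rng.choice(SURROGATE_NAMES)
            ),
            text,
        )

    text = PHONE_RE.sub(
        lambda m: consistent(m.group(0), lambda: fake_phone(rng)), text
    )
    return text


if __name__ == "__main__":
    note = "John Smith called from 650-123-4567. John Smith will return Monday."
    print(deidentify(note, ["John Smith"], random.Random(0)))
```

Because the output still contains a plausible name and phone number, a reader cannot tell from the text alone which tokens are surrogates and which, if any, are identifiers that slipped through.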
In the Research IT OMOP-deid Clinical Data Warehouse pipeline, TiDE runs in a distributed computing framework on GCP, where 800 worker nodes are booted up to de-identify 100 million clinical notes in under 6 hours. The open source version excludes the distributed computing framework used by Research IT, since most end users are unlikely to be GCP users at this time. Users can still parallelize by sharding their clinical text across multiple servers.
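One simple way to shard notes across servers is to hash a stable note identifier and take it modulo the number of servers. The sketch below is an assumption about how an end user might do this, not part of TiDE; `shard_for` and `shard_notes` are hypothetical helper names.

```python
import hashlib


def shard_for(note_id: str, num_shards: int) -> int:
    """Map a note ID to a shard index in [0, num_shards).

    A cryptographic hash is used only for its stable, even spread; Python's
    built-in hash() is randomized per process and would not be repeatable.
    """
    digest = hashlib.sha256(note_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_notes(note_ids, num_shards):
    """Group note IDs into num_shards lists, one list per server."""
    shards = [[] for _ in range(num_shards)]
    for note_id in note_ids:
        shards[shard_for(note_id, num_shards)].append(note_id)
    return shards


if __name__ == "__main__":
    ids = [f"note-{i}" for i in range(1000)]
    for index, shard in enumerate(shard_notes(ids, 4)):
        print(f"server {index}: {len(shard)} notes")
```

Each server can then run the pipeline independently on its own shard. For a sense of scale, and assuming notes are spread evenly, the figures above work out to 100,000,000 / 800 = 125,000 notes per worker, or at least roughly 5.8 notes per second per worker over 6 hours.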
- TiDE open source codebase (more)
- Research IT manuscript with algorithmic details, Supplementary Section 5 (more)
- Privacy preserving "Hiding in Plain Sight" (more)
- CoreNLP used for Named Entity Recognition (NER) in TiDE (more)
- Apache Beam (more)
We thank our partners, Vertisystems, who helped Research IT develop the open source package.