At the UMC Utrecht, large amounts of unstructured text are created in electronic healthcare record (EHR) systems every day. While there are efforts to improve how healthcare professionals structure data at the time of entry, there is also a need to make more efficient use of the unstructured data that is already in the EHR systems. Unstructured text contains a great deal of information that is not captured as structured data. Enabling the analysis of unstructured text data opens up new ways to do research on patient data, to make hospital processes more efficient and to improve patient care.
The Data Solutions & Research IT team of the UMC Utrecht is starting a multi-phase project to unlock the potential of this type of data. In the first phase we will build a data-processing pipeline that extracts data from our current systems and makes it accessible in a fast search application. This pipeline will include an entity-linking component to detect medical concepts in text, such as names of diseases, symptoms and medications. This will be challenging because of the nature of unstructured text, which often contains acronyms, typos, spelling mistakes, negations and expressions of uncertainty. We will address this with natural language processing (NLP) methods that capture the context in which concepts are used and distinguish the different ways in which they can be used.
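To make the challenge concrete, the sketch below shows a deliberately naive baseline: a dictionary lookup combined with a fixed-window negation check. It is an illustration only, not the method the project will use. The vocabulary, the negation cues and the function name `find_concepts` are all hypothetical and stand in for a real medical terminology and a curated list of Dutch negation triggers.

```python
import re

# Hypothetical toy vocabulary: surface form -> concept label (not a real ontology).
VOCABULARY = {
    "diabetes": "diabetes mellitus",
    "dm": "diabetes mellitus",
    "koorts": "fever",
    "paracetamol": "paracetamol",
}

# Hypothetical Dutch negation cues; a real system would need a curated list.
NEGATION_CUES = {"geen", "niet", "zonder"}


def find_concepts(text, window=2):
    """Return (surface form, concept, negated) for every vocabulary hit."""
    tokens = re.findall(r"\w+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        concept = VOCABULARY.get(token)
        if concept is None:
            continue
        # Treat the concept as negated if a cue occurs in the preceding window.
        preceding = tokens[max(0, i - window):i]
        negated = any(t in NEGATION_CUES for t in preceding)
        results.append((token, concept, negated))
    return results


hits = find_concepts("Geen koorts, wel DM. Start paracetamol.")
for surface, concept, negated in hits:
    print(surface, "->", concept, "(negated)" if negated else "")
```

A lookup like this cannot handle typos, cannot tell which meaning of an ambiguous acronym is intended, and depends heavily on the window size, which is why context-aware NLP methods are needed.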
The goal of this research project is to develop methods and models for entity linking in Dutch medical text. The project will consist of three major components:
• data engineering to handle the large datasets;
• data science, such as building NLP models for named-entity recognition and linking;
• medical informatics, such as working with medical ontologies and vocabularies.
Research questions
– How well do modern NLP methods built for the English medical domain perform in extracting Dutch medical terms from unstructured text in patient records?
– In Dutch medical text, can acronyms and abbreviations, which are often ambiguous, be linked to the correct concept based on context?
– Do modern named-entity recognition methods outperform traditional regular-expression-based approaches in the pseudonymization of privacy-sensitive content in medical text?
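As an illustration of the regular-expression baseline mentioned in the last question, a minimal sketch could look like the following. The patterns, the placeholder labels and the function name `pseudonymize` are hypothetical and cover only two simple formats; they are not the rules used in any existing UMC Utrecht system.

```python
import re

# Hypothetical patterns for illustration only; real notes need far broader coverage.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b")),  # e.g. 03-02-2021
    ("PHONE", re.compile(r"\b0\d{9}\b")),                # e.g. 0612345678
]


def pseudonymize(text):
    """Replace every pattern match with a placeholder such as <DATE>."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"<{label}>", text)
    return text


example = "Gezien op 03-02-2021, bereikbaar via 0612345678."
masked = pseudonymize(example)
print(masked)
```

Rules like these work for fixed formats but leave free-form identifiers such as surnames untouched, which is where a named-entity recognition model might do better.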
Requirements
– Enrollment in an MSc program in data science, artificial intelligence, medical informatics, bioinformatics, computer science or a comparable field.
– Experience with Python programming and command-line interfaces.
– An interest in working with healthcare data.
– Fluency in Dutch, since most of the data will be Dutch medical text.
Nice to have
– Experience with building data-processing pipelines and handling large data sets.
– Experience with natural language processing (NLP).
– Experience with software development tools, such as Docker, Git and CI/CD.
– Familiarity with databases and search engines, such as SQL databases and Elasticsearch.
– Knowledge of medical terminologies and data standards.
Sander Tan, UMC Utrecht, Data Solutions & Research IT, firstname.lastname@example.org
- UMC Utrecht
- 7 months