Scientists are broadly interested in generative artificial intelligence (genAI), but identifying suitable use cases for it within research and lab operations is proving to be a challenge. In this interview, Christian Baber, PhD, chief portfolio officer at the Pistoia Alliance, shares what the Alliance has learned about how life science researchers are leveraging genAI.
Q. Anecdotally, how are some scientists you’ve spoken to using genAI in life science research?
A. From speaking to our members, life sciences organizations are using genAI to support their work from both a data science perspective and a scientific research perspective. On the data side, we’ve seen companies using genAI to conduct natural language searches of their datasets and summarize the responses. We’ve also seen genAI being used to annotate datasets, to support metadata processing, and to generate computer code for specific purposes. On the research side, organizations are using genAI to generate chemical structures [along with] protein and peptide sequences with desired properties.
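[Editor's note: To make the natural-language search pattern concrete, below is a minimal sketch in Python. The openai client usage, model name, and assay-file example are illustrative assumptions, not a description of any member organization's implementation.]

```python
# Minimal sketch: natural-language search over a tabular dataset via an LLM.
# The model name and the assay CSV are placeholders for illustration only.
import csv

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_dataset(question: str, csv_path: str) -> str:
    """Pass a plain-English question plus a small dataset to the model
    and return a summarized answer."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    # Flatten the table into text; LLMs generally handle prose-like input best.
    table_text = "\n".join(", ".join(row) for row in rows)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You answer questions about the dataset provided, "
                        "citing only values that appear in it."},
            {"role": "user",
             "content": f"Dataset:\n{table_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Example usage: ask_dataset("Which compound had the lowest IC50?", "assays.csv")
```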
Q. What challenges does genAI address for them?
A. GenAI helps life sciences organizations by democratizing data access, reducing the pressure on a limited number of experts, and freeing them up to work on other tasks. Researchers can now use natural language search to query data and quickly extract insights, removing the delays associated with relying on overburdened data specialists. This accelerates workflows and enhances productivity. Additionally, genAI’s understanding of natural language empowers researchers to augment their creativity, making it particularly valuable for drafting scientific texts and papers.
Elsewhere, the technology’s ability to analyze large datasets is also supporting pattern recognition, particularly across sequences and chemical structures, as well as hypothesis generation.
Q. What challenges does genAI create for them?
A. Our members have reported several challenges resulting from adopting genAI into their workflows. In the following list, we speak mainly about large language models (LLMs), because most of the use cases we have seen recently involve them, but these challenges are not limited to this model type.
Some examples of these challenges include:
Hallucinations: Most genAI models carry some risk of hallucination, which is inherent in their ability to be generative. This creates risk for companies, especially in high-risk use cases that could directly impact patients. It is worth noting, however, that a degree of hallucination is required for truly generative use cases, such as producing novel structures in previously unexplored chemical space, because that is how the AI suggests new hypotheses.
Computer code issues: In general, LLM-generated code is not optimally efficient, but it is good enough for prototyping (a short comparison follows this list).
Lack of prompt engineering expertise: The ability to guide a genAI model to produce the desired outputs is far more demanding, and more critical to output accuracy, than we thought a few years ago. In fact, prompt engineering is so complex that it may become a competitive disadvantage for genAI tools compared with other search methods, such as structured query languages. LLM performance improves when a prompt reads like a story rather than a dry, skeletal instruction (see the prompt sketch after this list). Prompt engineering is becoming an expert skill in itself, which reduces the democratization benefits of LLMs.
Output differences caused by varying data structures: Not all data structures can be mined by LLMs with the same quality. LLMs operate well on text and have generally been trained on common vocabularies, so the closer a dataset’s structure is to natural-language text, the easier it is for an LLM to process.
Lack of benchmarks: Many genAI use cases in life sciences lack proper benchmarks for validating AI outputs and evaluating the claims of vendors that sell commercial AI tools. This lack of benchmarking makes it challenging to prove the accuracy and reliability of models to regulators, which is becoming increasingly important as new legislation emerges.
Copyright issues: [As of December 2024,] Pistoia Alliance research found 42 percent of life science professionals do not consider copyright before sharing or using third-party information with AI tools, and only 40 percent of organizations report having a dedicated team or expert focused on AI copyright compliance. This gap could lead to infringement risks, fines, and reputational damage. Specialized knowledge on data licensing, text-mining rights, copyright, and IP legislation is becoming a must, but these skills are hard to acquire, and competition for hiring such experts is fierce.
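[Editor's note: As an illustration of the code-efficiency point above, the invented example below contrasts the loop-heavy style typical of quick LLM-generated prototypes with a vectorized equivalent; both produce the same result.]

```python
# Invented example: normalizing assay readings. The loop version is the
# kind of straightforward code an LLM often produces; it works and is
# fine for prototyping, but the vectorized version scales far better.
import numpy as np

readings = np.random.default_rng(0).random(1_000_000)
mean, std = readings.mean(), readings.std()

# Prototype style: explicit Python loop, easy to read, slow at scale.
normalized_loop = []
for value in readings:
    normalized_loop.append((value - mean) / std)

# Optimized style: one vectorized NumPy expression, dramatically faster.
normalized_vec = (readings - mean) / std

assert np.allclose(normalized_loop, normalized_vec)
```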
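[Editor's note: To illustrate the prompt engineering point, the sketch below contrasts a skeletal prompt with a more narrative, story-like one for the same invented extraction task. Whether the narrative form helps will vary by model and task.]

```python
# Illustrative only: two prompt styles for the same invented extraction task.

skeletal_prompt = "Extract targets. Text: {text}"

narrative_prompt = (
    "You are a medicinal chemist reviewing a paper for a target-discovery "
    "team. Read the passage below and list every protein target the "
    "authors report modulating, one per line, using the names exactly as "
    "written. If no targets are mentioned, reply 'none found'.\n\n"
    "Passage: {text}"
)

# The narrative version supplies a role, context, an output format, and a
# fallback behavior: the kind of story-like framing the interview
# describes as improving LLM output quality.
```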
Q. Are these genAI solutions off-the-shelf or developed in-house? In either case, are they trained or fine-tuned on the lab’s data?
A. Currently, we are seeing companies use a mix of off-the-shelf solutions they have fine-tuned and models they have developed fully in-house. Each has its own pros and cons, and the choice depends on the level of expertise companies have access to and the resources they are able to invest.
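[Editor's note: For readers unfamiliar with what fine-tuning an off-the-shelf model on a lab's own data involves, below is a minimal sketch using the open-source Hugging Face transformers, datasets, and peft libraries. The model name, dataset file, and hyperparameters are placeholders for illustration, not a configuration any member organization has reported using.]

```python
# Minimal sketch: parameter-efficient fine-tuning (LoRA) of an open model
# on in-house text. All names and settings below are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the frozen base model with small trainable LoRA adapters,
# keeping compute and storage requirements modest.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         task_type="CAUSAL_LM"))

# "lab_notes.txt" stands in for whatever in-house corpus a lab holds.
data = load_dataset("text", data_files={"train": "lab_notes.txt"})
tokenized = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```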
Christian Baber, PhD, has led both R&D and technology divisions for global pharmaceutical organizations focused on informatics and predictive modeling for drug discovery. Baber has also worked with the Pistoia Alliance for more than 15 years, including four years as a board director.