AIPressRoom

Dr. Serafim Batzoglou, Chief Data Officer at Seer – Interview Series

Serafim Batzoglou is Chief Data Officer at Seer. Prior to joining Seer, Serafim served as Chief Data Officer at insitro, leading machine learning and data science in their approach to drug discovery. Prior to insitro, he served as VP of Applied and Computational Biology at Illumina, leading research and technology development of AI and molecular assays for making genomic data more interpretable in human health.

What initially attracted you to the field of genomics?

I became interested in the field of computational biology at the start of my PhD in computer science at MIT, when I took a class on the topic taught by Bonnie Berger, who became my PhD advisor, and David Gifford. The Human Genome Project was picking up pace during my PhD. Eric Lander, who was heading the Genome Center at MIT, became my PhD co-advisor and involved me in the project. Motivated by the Human Genome Project, I worked on whole-genome assembly and comparative genomics of human and mouse DNA.

I then moved to Stanford University as faculty in the Computer Science department, where I spent 15 years and was privileged to have advised about 30 extremely talented PhD students and many postdoctoral researchers and undergraduates. My team's focus has been the application of algorithms, machine learning and software tool building to the analysis of large-scale genomic and biomolecular data. I left Stanford in 2016 to lead a research and technology development team at Illumina. Since then, I have enjoyed leading R&D teams in industry. I find that teamwork, the business aspect, and a more direct impact on society are characteristic of industry compared to academia. I have worked at innovative companies throughout my career: DNAnexus, which I co-founded in 2009, Illumina, insitro and now Seer. Computation and machine learning are essential across the technology chain in biotech, from technology development, to data acquisition, to biological data interpretation and translation to human health.

Over the past 20 years, sequencing the human genome has become vastly cheaper and faster. This has led to dramatic growth in the genome sequencing market and broader adoption across the life sciences industry. We are now on the cusp of having population genomic, multi-omic and phenotypic data of sufficient size to meaningfully revolutionize healthcare, including prevention, diagnosis, treatment and drug discovery. We can increasingly uncover the molecular underpinnings of disease for individuals through computational analysis of genomic data, and patients have the chance to receive treatments that are personalized and targeted, especially in the areas of cancer and rare genetic disease. Beyond the obvious uses in medicine, machine learning coupled with genomic information allows us to gain insights into other areas of our lives, such as our genealogy and nutrition. The next several years will see the adoption of personalized, data-driven healthcare, first for select groups of people, such as rare disease patients, and increasingly for the broad public.

Prior to your current role you were Chief Data Officer at insitro, leading machine learning and data science in their approach to drug discovery. What were some of your key takeaways from that period about how machine learning can be used to accelerate drug discovery?

The conventional drug discovery and development "trial-and-error" paradigm is plagued with inefficiencies and extremely long timelines. For one drug to get to market, it can take upwards of $1 billion and over a decade. By incorporating machine learning into these efforts, we can dramatically reduce costs and timeframes at several steps along the way. One step is target identification, where a gene or set of genes that modulate a disease phenotype or revert a diseased cellular state to a healthier state can be identified through large-scale genetic and chemical perturbations, and phenotypic readouts such as imaging and functional genomics. Another step is compound identification and optimization, where a small molecule or other modality can be designed by machine learning-driven in silico prediction as well as in vitro screening, and moreover desired properties of a drug such as solubility, permeability, specificity and non-toxicity can be optimized. The hardest, as well as most important, aspect is perhaps translation to humans. Here, the choice of the right model (induced pluripotent stem cell-derived lines versus primary patient cell lines and tissue samples versus animal models) for the right disease poses an incredibly important set of tradeoffs that ultimately determine the ability of the resulting data plus machine learning to translate to patients.

Seer Bio is pioneering new ways to decode the secrets of the proteome to improve human health. For readers who are unfamiliar with this term, what is the proteome?

The proteome is the changing set of proteins produced or modified by an organism over time and in response to environment, nutrition and health state. Proteomics is the study of the proteome within a given cell type or tissue sample. The genome of a human or other organism is static: with the important exception of somatic mutations, the genome at birth is the genome one has their entire life, copied exactly in every cell of the body. The proteome is dynamic and changes over time spans of years, days and even minutes. As such, proteomes are vastly closer to phenotype, and ultimately to health status, than genomes are, and consequently far more informative for monitoring health and understanding disease.

At Seer, we have developed a new way to access the proteome that provides deeper insights into proteins and proteoforms in complex samples such as plasma, a highly accessible sample that unfortunately to date has posed a great challenge for conventional mass spectrometry proteomics.

What is Seer's Proteograph platform, and how does it offer a new view of the proteome?

Seer's Proteograph platform leverages a library of proprietary engineered nanoparticles, powered by a simple, rapid, and automated workflow, enabling deep and scalable interrogation of the proteome.

The Proteograph platform shines in interrogating plasma and other complex samples that exhibit large dynamic range (many orders of magnitude difference in the abundance of various proteins in the sample), where conventional mass spectrometry methods are unable to detect the low-abundance part of the proteome. Seer's nanoparticles are engineered with tunable physicochemical properties that gather proteins across the dynamic range in an unbiased manner. In typical plasma samples, our technology enables detection of 5x to 8x more proteins than when processing neat plasma without the Proteograph. As a result, from sample prep to instrumentation to data analysis, our Proteograph Product Suite helps scientists find proteome disease signatures that might otherwise be undetectable. We like to say that at Seer, we are opening up a new gateway to the proteome.

Furthermore, we are allowing scientists to easily perform large-scale proteogenomic studies. Proteogenomics is the combining of genomic data with proteomic data to identify and quantify protein variants, link genomic variants with protein abundance levels, and ultimately link the genome and the proteome to phenotype and disease, and start disentangling the causal and downstream genetic pathways associated with disease.

Can you discuss some of the machine learning technology that is currently used at Seer Bio?

Seer is leveraging machine learning at all steps from technology development to downstream data analysis. These steps include: (1) design of our proprietary nanoparticles, where machine learning helps us determine which physicochemical properties and combinations of nanoparticles will work with specific product lines and assays; (2) detection and quantification of peptides, proteins, variants and proteoforms from the readout data produced by the MS instruments; (3) downstream proteomic and proteogenomic analyses in large-scale population cohorts.

Last year, we published a paper in Advanced Materials combining proteomics methods, nanoengineering and machine learning to improve our understanding of the mechanisms of protein corona formation. This paper uncovered nano-bio interactions and is informing Seer in the creation of improved future nanoparticles and products.

Beyond nanoparticle development, we have been developing novel algorithms to identify variant peptides and post-translational modifications (PTMs). We recently developed a method for the detection of protein quantitative trait loci (pQTLs) that is robust to protein variants, a known confounder for affinity-based proteomics. We are extending this work to directly identify these peptides from the raw spectra using deep learning-based de novo sequencing methods, allowing search without inflating the size of spectral libraries.
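To make the pQTL idea concrete, a minimal sketch of the core quantity behind a pQTL scan follows. This is an illustration, not Seer's method: `pqtl_effect` and the toy cohort are hypothetical names and data, and a real analysis would add covariates and the variant-robust handling described above.

```python
def pqtl_effect(genotypes, abundances):
    """Estimate the additive effect of a genetic variant on protein abundance.

    genotypes: allele dosages (0, 1, or 2) per individual.
    abundances: matched protein abundance measurements.
    Returns the least-squares slope: the change in abundance per
    alternate allele, the basic association statistic in a pQTL scan.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_a = sum(abundances) / n
    # Covariance of genotype and abundance over variance of genotype.
    cov = sum((g - mean_g) * (a - mean_a) for g, a in zip(genotypes, abundances))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    return cov / var

# Toy cohort of six individuals: abundance rises ~0.5 units per alternate allele.
slope = pqtl_effect([0, 0, 1, 1, 2, 2], [1.0, 1.1, 1.5, 1.6, 2.0, 2.1])
```

The slope of 0.5 recovered here is what a genome-wide scan would test for significance at each variant-protein pair.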

Our team is also developing methods to enable scientists without deep expertise in machine learning to optimally tune and utilize machine learning models in their discovery work. This is accomplished through a Seer ML framework based on AutoML tooling, which allows efficient hyperparameter tuning through Bayesian optimization.
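The shape of such a tuning loop can be sketched as follows. This is not Seer's framework: `tune` is a hypothetical helper, and the shrinking-radius search below is a deliberately crude stand-in for a true Bayesian surrogate with an acquisition function.

```python
import random

def tune(objective, bounds, n_init=5, n_iter=20, seed=0):
    """Minimize `objective` over a one-dimensional hyperparameter range.

    Crude stand-in for Bayesian optimization: after a few random probes,
    new candidates are drawn around the best point so far with a
    shrinking radius, mimicking an acquisition function that trades
    exploration for exploitation over time.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    trials = [(x, objective(x))
              for x in (rng.uniform(lo, hi) for _ in range(n_init))]
    for i in range(n_iter):
        best_x, _ = min(trials, key=lambda t: t[1])
        radius = (hi - lo) * 0.5 ** (1 + 0.2 * i)  # shrink the search over time
        x = min(hi, max(lo, rng.gauss(best_x, radius)))
        trials.append((x, objective(x)))
    return min(trials, key=lambda t: t[1])

# Hypothetical objective: validation loss as a function of log10 learning rate.
best_lr, best_loss = tune(lambda lr: (lr + 3.0) ** 2 + 0.1, bounds=(-6.0, 0.0))
```

The point of the framework is that the scientist supplies only the objective; the search strategy, initialization and stopping are handled automatically.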

Finally, we are developing methods to reduce batch effects and increase the quantitative accuracy of the mass spec readout, by modeling the measured quantitative values to maximize expected metrics such as the correlation of intensity values across peptides within a protein group.
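The simplest form of batch correction can be illustrated with median centering of log intensities. This sketch is an assumption for illustration only, far simpler than the model-based approach described above; `correct_batch_effect` and the toy runs are made up.

```python
import statistics

def correct_batch_effect(runs):
    """Remove per-batch shifts from log-intensity measurements.

    `runs` maps batch id -> {peptide: log2 intensity}.  Each batch is
    centered on the global median, so a systematic per-batch offset
    (a common mass-spec batch effect) cancels out.
    """
    all_values = [v for batch in runs.values() for v in batch.values()]
    global_median = statistics.median(all_values)
    corrected = {}
    for batch_id, batch in runs.items():
        shift = statistics.median(batch.values()) - global_median
        corrected[batch_id] = {p: v - shift for p, v in batch.items()}
    return corrected

# Two batches measuring the same peptides; batch "B" ran 1.5 units hot.
runs = {"A": {"pep1": 10.0, "pep2": 12.0, "pep3": 14.0},
        "B": {"pep1": 11.5, "pep2": 13.5, "pep3": 15.5}}
fixed = correct_batch_effect(runs)
```

After correction the two batches agree peptide by peptide, which is exactly the kind of cross-run consistency the correlation metrics above are designed to maximize.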

Hallucinations are a common issue with LLMs. What are some of the solutions to prevent or mitigate this?

LLMs are generative methods that are given a large corpus and are trained to generate similar text. They capture the underlying statistical properties of the text they are trained on, from simple local properties such as how often certain combinations of words (or tokens) are found together, to higher-level properties that emulate understanding of context and meaning.

However, LLMs are not primarily trained to be correct. Reinforcement learning with human feedback (RLHF) and other techniques help train them for desirable properties, including correctness, but are not fully successful. Given a prompt, LLMs will generate text that most closely resembles the statistical properties of the training data. Often, this text will be correct. For example, if asked "when was Alexander the Great born," the correct answer is 356 BC (or BCE), and an LLM is likely to give that answer because within the training data Alexander the Great's birth frequently appears with this value. However, when asked "when was Empress Reginella born," a fictional character not present in the training corpus, the LLM is likely to hallucinate and create a story of her birth. Similarly, when asked a question that the LLM cannot retrieve a correct answer for (either because the right answer does not exist, or for other statistical reasons), it is likely to hallucinate and answer as if it knows. This creates hallucinations that are an obvious problem for serious applications, such as "how can such-and-such cancer be treated."

There are no perfect solutions yet for hallucinations; they are endemic to the design of LLMs. One partial solution is proper prompting, such as asking the LLM to "think carefully, step by step," and so on. This increases the LLM's likelihood of not concocting stories. A more sophisticated approach that is being developed is the use of knowledge graphs. Knowledge graphs provide structured data: entities in a knowledge graph are connected to other entities in a predefined, logical manner. Constructing a knowledge graph for a given domain is of course a challenging task, but doable with a combination of automated and statistical methods and curation. With a built-in knowledge graph, LLMs can cross-check the statements they generate against the structured set of known facts, and can be constrained not to generate a statement that contradicts or is not supported by the knowledge graph.
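The cross-checking step can be sketched with a tiny triple store standing in for a knowledge graph. The facts, relation names and `check_statement` helper below are illustrative assumptions, not any specific product's API.

```python
# A knowledge graph reduced to a set of (subject, relation, object) triples.
FACTS = {
    ("Alexander the Great", "born_in", "356 BC"),
    ("Alexander the Great", "student_of", "Aristotle"),
}

def check_statement(subject, relation, obj, facts=FACTS):
    """Classify a generated claim as supported, contradicted, or unknown."""
    if (subject, relation, obj) in facts:
        return "supported"
    # The same subject and relation mapped to a different object contradicts it.
    if any(s == subject and r == relation for s, r, _ in facts):
        return "contradicted"
    # e.g. fictional entities: the system should refuse rather than hallucinate.
    return "unknown"
```

A generation pipeline would run each candidate statement through such a check, keeping "supported" claims, rejecting "contradicted" ones, and flagging "unknown" ones for refusal or retrieval.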

Because of the fundamental issue of hallucinations, and arguably because of their lack of sufficient reasoning and judgment abilities, LLMs today are powerful for retrieving, connecting and distilling information, but cannot replace human experts in serious applications such as medical diagnosis or legal advice. Still, they can greatly enhance the efficiency and capability of human experts in these domains.

Can you share your vision for a future where biology is steered by data rather than hypotheses?

The traditional hypothesis-driven approach, which involves researchers finding patterns, developing hypotheses, performing experiments or studies to test them, and then refining theories based on the data, is being supplanted by a new paradigm based on data-driven modeling.

In this emerging paradigm, researchers start with hypothesis-free, large-scale data generation. Then, they train a machine learning model such as an LLM with the objective of accurate reconstruction of occluded data, or strong regression or classification performance in a number of downstream tasks. Once the machine learning model can accurately predict the data, and achieves fidelity comparable to the similarity between experimental replicates, researchers can interrogate the model to extract insight about the biological system and discern the underlying biological principles.
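The occluded-data objective can be sketched in a few lines. This is a toy illustration of masked reconstruction under stated assumptions: `mask_spans`, the made-up peptide string, and the accuracy helper are hypothetical, and a real system would train a large model to fill the masks.

```python
import random

def mask_spans(tokens, frac=0.15, seed=0):
    """Occlude a random fraction of tokens, returning (masked, targets).

    This is the self-supervised objective: a model sees `masked` and is
    trained to reconstruct `targets` at the occluded positions, with no
    up-front hypothesis about the data required.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * frac))
    idx = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for i in idx:
        targets[i] = masked[i]
        masked[i] = "<MASK>"
    return masked, targets

def reconstruction_accuracy(predictions, targets):
    """Fraction of occluded positions recovered correctly; in practice this
    is compared against replicate-to-replicate similarity."""
    hits = sum(predictions.get(i) == t for i, t in targets.items())
    return hits / len(targets)

# A made-up amino-acid sequence standing in for large-scale biomolecular data.
tokens = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
masked, targets = mask_spans(tokens)
```

Once reconstruction accuracy approaches the agreement between experimental replicates, the trained model, rather than a hand-crafted hypothesis, becomes the object researchers interrogate.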

LLMs are proving to be especially good at modeling biomolecular data, and are poised to fuel a shift from hypothesis-driven to data-driven biological discovery. This shift will become increasingly pronounced over the next 10 years, and will allow accurate modeling of biomolecular systems at a granularity that goes well beyond human capacity.

What is the potential impact for disease diagnosis and drug discovery?

I believe LLMs and generative AI will lead to significant changes in the life sciences industry. One area that will benefit greatly from LLMs is clinical diagnosis, especially for rare, difficult-to-diagnose diseases and cancer subtypes. There are vast amounts of comprehensive patient information that we can tap into – from genomic profiles, treatment responses, clinical records and family history – to drive accurate and timely diagnosis. If we can find a way to compile all this data such that it is easily accessible, and not siloed by individual health organizations, we can dramatically improve diagnostic precision. This is not to imply that machine learning models, including LLMs, will be able to operate autonomously in diagnosis. Due to their technical limitations, in the foreseeable future they will not be autonomous, but will instead augment human experts. They will be powerful tools to help the physician provide exquisitely informed assessments and diagnoses in a fraction of the time needed to date, and to properly document and communicate those diagnoses to the patient as well as to the entire network of health providers connected through the machine learning system.

The industry is already leveraging machine learning for drug discovery and development, touting its ability to reduce costs and timelines compared to the traditional paradigm. LLMs further add to the available toolbox, providing excellent frameworks for modeling large-scale biomolecular data including genomes, proteomes, functional genomic and epigenomic data, single-cell data, and more. In the foreseeable future, foundation LLMs will undoubtedly connect across all these data modalities and across large cohorts of individuals whose genomic, proteomic and health information has been collected. Such LLMs will aid in the generation of promising drug targets, identify likely pockets of activity of proteins associated with biological function and disease, or suggest pathways and more complex cellular functions that can be modulated in a specific way with small molecules or other drug modalities. We can also tap into LLMs to identify drug responders and non-responders based on genetic susceptibility, or to repurpose drugs for other disease indications. Many of the existing innovative AI-based drug discovery companies are undoubtedly already starting to think and develop in this direction, and we should expect to see the formation of more companies, as well as public efforts, aimed at the deployment of LLMs in human health and drug discovery.

Thank you for this detailed interview; readers who wish to learn more should visit Seer.