Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models.

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the biomedical knowledge encoded in pretrained LLMs and their emerging applications in genetics, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FreeForm, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluating it on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find that this framework outperforms several data-driven methods, particularly in low-data regimes. FreeForm is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.