Main

Twice a year, a panel of experts from the World Health Organization (WHO) gathers to recommend vaccine strains for the upcoming influenza seasons. To reduce the disease burden of influenza, the goal of the panel is to optimize ‘vaccine effectiveness’, which is defined as the reduction in the odds of influenza infection among vaccinated individuals compared to those who are not vaccinated1. After each season, influenza vaccine effectiveness is estimated by institutions such as the Centers for Disease Control and Prevention (CDC) through observational test-negative studies involving participants with medically attended outpatient illness2. If the vaccine strains that were selected turned out to be well aligned with the distribution of viral strains that were circulating during the influenza season, the effectiveness of an inactivated influenza vaccine may reach up to 40–60% in that season3. However, despite decades of research in prevention and surveillance, current influenza vaccines provide limited protection. CDC estimates from the US Influenza Vaccine Effectiveness Network reveal that overall vaccine effectiveness across subtypes and age groups dropped below 40% in five of the 10 years from 2012 to 2021 (refs. 4,5,6,7,8,9,10,11,12), with a concurrent rise in influenza-related hospitalization rates (Extended Data Fig. 1). For instance, during the 2014–2015 winter season, vaccine effectiveness was only 19%6.

Many factors influence vaccine effectiveness, including the population’s immune history and the platforms and formulations used in vaccine production13,14,15,16,17,18. However, the antigenic match of vaccine strains to circulating viral strains is particularly important. This work focuses on improving this antigenic match, a necessary condition for vaccine effectiveness11,15,19.

Antigenic match can be assessed using two types of data. First, surveillance datasets such as the Global Initiative on Sharing All Influenza Data (GISAID)20 gather the sequences of viral proteins along with their collection time, providing the distribution of viral genotypes in the season of interest. Based on these data, we can compute a viral strain’s ‘dominance’, that is, the frequency of its occurrence during a particular season. Second, ‘antigenicity’ data measure the inhibition capacity of antibodies produced by a vaccine against a specific viral strain. For example, in vitro assays such as hemagglutination inhibition (HI) tests21 using postinfection ferret antisera are conducted by WHO Collaborating Centres (WHO CCs) to quantitatively analyze the antigenicity of candidate vaccines against circulating viruses. Both data sources contribute to antigenic match. Specifically, we quantify a vaccine’s antigenic match as the average of its antigenicity across circulating viral strains, weighted by the dominance of each strain. In this paper, we refer to this measure of antigenic match as the ‘coverage score’.

Because inactivated influenza vaccines require a production period of 6–9 months22, we need to evaluate the candidate vaccine strains prospectively. However, viral strain distributions at the time of vaccine selection may differ from those in the upcoming influenza season10,23,24 (illustrated in Extended Data Fig. 1c). Moreover, it is prohibitively expensive to validate the antigenicity of all candidate vaccines against every virus potentially circulating in the future season, and there are limited viral specimens available for study. In vitro assays such as the HI test21 can be run on only a limited number of vaccine candidates, typically fewer than 10 (ref. 25).

We hypothesize that the coverage score can be prospectively predicted using machine learning models trained on the aforementioned data from past seasons. In this paper, we propose VaxSeer, which integrates predictors for two components of the coverage score: one for dominance and another for antigenicity. Because the hemagglutinin (HA) protein plays a critical role in viral infection and immune response, we represent vaccine and virus strains solely through their HA protein sequences26,27. Our dominance predictor estimates the probability of an input HA protein sequence occurring in a given season. Dominance changes over time due to the competition between viral strains, and the rate of its change depends on the combined effect of all mutations in the protein sequence. Traditional epidemiological studies estimate the rate of change as the sum of independent contributions from single amino acid mutations28,29,30,31,32,33,34,35,36,37. These approaches may not fully capture higher-level properties, such as protein stability or host interactions, which depend on complex interactions between residues across the protein sequence38. In this study, we leveraged protein language models and an ordinary differential equation (ODE) to automatically capture the relationship between protein sequences and their dominance. Although existing protein language models assume a static fitness landscape39,40,41,42,43, our approach accounts for dynamic dominance shifts, making it better suited for rapidly evolving viruses such as influenza. In addition to the dominance prediction model, we also developed an antigenicity prediction model, which takes as input a pair of protein sequences (virus and vaccine) and outputs their predicted HI test outcome. Our antigenicity prediction model enables the in silico prediction of HI test outcomes for any vaccine–virus pair, reducing the need for time-consuming antigenicity experiments.

Using our model, we conducted a 10-year retrospective evaluation on two influenza subtypes: A/H3N2 and A/H1N1. We examined the correlation between the coverage score and real-world vaccine effectiveness estimated by three independent institutions across the United States, Europe and Canada as well as the number of symptomatic illnesses and medical visits averted by influenza vaccines in the United States, as estimated by the CDC. We then studied the ability of VaxSeer to prospectively identify vaccine strains with high antigenic match. In particular, we evaluated our proposed vaccine strains against the WHO’s annual recommendations, based on their empirical coverage scores, computed retrospectively from observed dominance and experimental HI data. Overall, our evaluations highlight the potential of VaxSeer to aid vaccine selection by prioritizing high-antigenic-match candidates from vast viral pools for resource-intensive laboratory or clinical validation.

Results

Overview of VaxSeer

Selecting vaccine strains based on predicted coverage score

To rank vaccine strains during the selection process, we predict their coverage scores with an in silico approach (Fig. 1). Given a set of circulating viral proteins and a set of candidate vaccine proteins, the dominance predictor estimates the expected dominance of each viral protein in the upcoming season. In addition, the antigenicity predictor uses a pairwise sequence alignment of viral and vaccine proteins as input and predicts the corresponding HI test results. Together, these predictions provide the predicted coverage score for each candidate vaccine by averaging its predicted antigenicity across multiple circulating viruses, weighted by their predicted dominance.

Fig. 1: VaxSeer features a two-track model for predicting the coverage score.

For each circulating virus strain, its HA protein sequence is fed into the dominance predictor, which outputs the dominance of this protein at the forecasting time (for example, the next influenza season). For a candidate vaccine strain, its HA protein and that of a circulating virus strain are aligned and input into the antigenicity predictor to obtain the antigenicity (quantified by HI values). The predicted coverage score of a candidate vaccine strain is then calculated by averaging its antigenicity across multiple circulating viruses, weighted by their respective dominance.

Training dominance predictor

The dominance predictor learns the relationship between HA sequences and the change of their dominance over time, enabling more accurate predictions of the future viral landscape. We train our dominance predictor using the dataset of protein sequences collected before the vaccine selection time, along with their respective collection dates. For each protein sequence, two language models44 predict the initial dominance and its rate of change, which are then used in an ODE to derive its dominance at the collection time. These two models are trained by aligning the predicted dominance with the actual protein distributions.

Training antigenicity predictor

The antigenicity predictor learns to predict the HI test results from vaccine and virus proteins. We train the antigenicity predictor using HI test data21 for vaccine–virus pairs, with both vaccine and viral proteins collected before the vaccine selection time. The HI test results are inversely proportional to the amount of vaccine-induced antibodies that can inhibit a specific virus strain. Because antigenicity depends on both the vaccine and virus strains, the antigenicity predictor takes the pair of HA sequences as input. To model relationships between locations both within a single sequence and across the pair of sequences, the antigenicity predictor is based on neural network architectures used to encode protein multiple sequence alignments (MSAs)45. The model is trained by regressing the predicted antigenicity with the actual antigenicity. Additional details on the dominance and antigenicity predictors are provided in the Methods.

Experimental setup

Data construction

We conducted retrospective validation of VaxSeer on the A/H1N1 and A/H3N2 subtypes of influenza. To quantify viral dominance during each season, we downloaded influenza HA protein sequences with associated collection times from the GISAID20. For antigenicity measurements, we collected HI test results based on postinfection ferret antisera from the reports of a WHO CC (Francis Crick Institute) prepared for the annual consultations on the composition of influenza vaccines from 2003 to February 2023.

To validate the correlation between coverage scores and disease outcomes, we gathered influenza vaccine effectiveness estimates for historically used vaccines from annual CDC publications4,5,6,7,8,9,10,11. This vaccine effectiveness is estimated based on patients with medically attended outpatient acute respiratory illness (ARI) in the US Influenza Vaccine Effectiveness Network. We also gathered the annual number of influenza illnesses and medical visits averted by the influenza vaccine in the United States, as estimated by the CDC46,47. For the analysis of vaccine effectiveness, we excluded vaccine strains without subtype-specific effectiveness estimates. We also excluded the years 2020 and 2021 to avoid potential bias caused by SARS-CoV-2 vaccines48,49, resulting in 10 past recommended vaccines with available effectiveness data across the two subtypes. For additional comparisons, we also included vaccine effectiveness estimated from two other sources: Influenza Monitoring Vaccine Effectiveness (I-MOVE) in Europe50,51,52,53,54,55 and the Sentinel Practitioner Surveillance Network (SPSN) in Canada56,57,58,59,60,61,62,63. I-MOVE estimates influenza vaccine effectiveness based on patients with ARI or influenza-like illness (ILI) who present to a general practitioner across multiple centers in Europe. SPSN estimates vaccine effectiveness based on patients with ILI who present to community-based sentinel practitioners in Canada. Throughout the ‘Results’, vaccine effectiveness refers to CDC estimates, unless explicitly stated otherwise (for example, SPSN or I-MOVE). Further details on effectiveness estimates are available in the Methods and in Extended Data Table 1.

Evaluation metric

The main challenge in evaluating different vaccine selections is the lack of data on their real-world effectiveness. After all, we can only measure vaccine effectiveness and its impact on clinical endpoints for vaccines that were actually administered to the population. Instead, we propose to use a surrogate measure, the ‘empirical’ coverage score, to counterfactually evaluate vaccine strains that were not selected in past seasons. The empirical coverage score approximates the ground-truth coverage score and is computed retrospectively from observed frequencies of viral strains in surveillance data (GISAID) and experimental HI test results from WHO CCs. Although reliant on retrospective data, the empirical coverage score can be computed for any candidate vaccine strain that has been adequately tested with HI assays, which enables our evaluation. In addition, although antigenic match is only one of many factors, the empirical coverage score is highly correlated with vaccine effectiveness for strains with available data from the CDC, exhibiting a Pearson correlation of 0.895 and a Spearman rank correlation of 0.976, both with P < 0.001 (Fig. 2a, blue). The positive correlation between empirical coverage scores and vaccine effectiveness is also observed across independent European and Canadian effectiveness studies (Extended Data Fig. 2). The empirical coverage score also demonstrates a positive correlation with the disease burden (symptomatic illnesses and medical visits) prevented by the vaccines in the United States, as estimated by the CDC (Extended Data Fig. 3a,b). The definitions of vaccine effectiveness, empirical coverage score and predicted coverage score are distinguished in Extended Data Table 2.

Fig. 2: VaxSeer selecting vaccine strains with better antigenic match.

a, Both empirical and predicted coverage scores exhibit positive correlations with real-world vaccine effectiveness, as estimated by the CDC (two-sided Spearman rank correlation ρ with P = 1.46 × 10−6 for empirical and P = 0.0005 for predicted coverage scores, two-sided Pearson correlation r with P = 0.0005 for empirical and P = 0.0014 for predicted coverage scores and linear regression slope m). The empirical coverage scores serve as a surrogate metric for evaluating vaccine selection due to its strong correlation with vaccine effectiveness. b, Vaccine strains selected by VaxSeer have higher empirical coverage scores than those recommended by the WHO for A/H3N2. The statistical significance of this improvement is confirmed by a one-sided Wilcoxon signed-rank test (P = 4.1 × 10−5). VaxSeer frequently selects the strains with the highest empirical coverage scores. Violin plots depict the distribution of predicted coverage scores among all candidate vaccines, including those with low experimental coverage. c,d, The predicted antigenicity (HI values) of circulating A/H3N2 strains in the 2019 Northern Hemisphere season, with respect to the vaccine recommended by the WHO (c) and VaxSeer (d). Our recommended vaccine covers a larger variety of circulating viruses (3C.2a1b.1a/b and 3C.2a1b.2b/a) than the WHO’s recommendation (3C.3a1). The red circle encompasses viruses from clade 3C.3a1, whereas the yellow and orange circles represent viruses from clade 3C.2a1b.1a/b and clade 3C.2a1b.2b/a, respectively. e, The phylogenetic tree of A/H3N2 during and after the 2019 Northern Hemisphere season, constructed by nextflu78 (version dated 21 November 2024). Each dot represents a viral strain, and black crosses indicate vaccine strains selected by the WHO. The vaccine strain chosen for the 2019 winter season is marked with an arrow. VaxSeer’s recommended vaccine strain covers the dominant clades (3C.2a1b.2a/b and 3C.2a1b.1a/b). Furthermore, clade 3C.2a1b.2a continued to expand in subsequent seasons, as depicted by the upper right gray branches. All pred, predicted coverage scores for all candidate vaccine strains.

Retrospective evaluation settings

Our evaluation spans the 2012–2021 winter seasons, starting in October of each year and continuing to March of the next year. We conduct evaluations only during the winter seasons due to the availability of vaccine effectiveness data. In line with the WHO’s recommendation schedule, we train our models on strains collected and HI tests conducted up to 8 months prior to the start of the season (before 1 February). To construct the sets of candidate vaccine strains and circulating virus strains, we consider all virus strains that were isolated at least five times within the past 3 years, similar to strategies for vaccines recommended by the WHO. For example, to evaluate the 2021–2022 winter season (October 2021 to March 2022), we train our models on any data collected before 1 February 2021, and we predict coverage scores for candidate vaccine strains isolated between February 2018 and February 2021. To ensure the accurate assessment of empirical coverage scores for those candidate vaccine strains, we only evaluate vaccines with HI test results against at least 40% of circulating viral sequences during the season of interest. A total of 51 candidate vaccines for A/H3N2 and 50 candidate vaccines for A/H1N1 were subject to comparison. The predicted and empirical coverage scores, along with the percentages of viral samples included in the calculation of empirical coverage score, are presented in Supplementary Table 1. Further details on data construction and evaluation settings are provided in the Methods.
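
A minimal sketch of the candidate-construction and evaluation filters described above, assuming strain records carry collection dates and an HI-results table is available; the record formats, function names and the 2021 cutoff (matching the example in the text) are illustrative assumptions, not the actual VaxSeer pipeline.

```python
from datetime import date
from collections import Counter

def candidate_vaccine_strains(strain_records, hi_pairs, season_viruses,
                              cutoff=date(2021, 2, 1), min_isolates=5,
                              min_hi_coverage=0.4):
    """Filter candidate vaccine strains (illustrative only).

    strain_records: list of (ha_sequence, collection_date) tuples from surveillance data.
    hi_pairs:       set of (vaccine_seq, virus_seq) pairs with available HI results.
    season_viruses: list of HA sequences observed in the season being evaluated.
    """
    # Keep strains isolated at least `min_isolates` times in the 3 years before the cutoff.
    window_start = date(cutoff.year - 3, cutoff.month, cutoff.day)
    counts = Counter(seq for seq, d in strain_records if window_start <= d < cutoff)
    frequent = {seq for seq, n in counts.items() if n >= min_isolates}

    # For evaluation, keep only candidates whose HI results cover enough of the
    # season's circulating viruses (40% in the text).
    candidates = []
    for vac in frequent:
        covered = sum((vac, vir) in hi_pairs for vir in season_viruses)
        if covered / max(len(season_viruses), 1) >= min_hi_coverage:
            candidates.append(vac)
    return candidates
```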

Vaccine selection based on predicted coverage score

As illustrated in Fig. 2b and Extended Data Fig. 3c, the vaccine strains selected based on our models’ predicted coverage scores outperform the WHO’s recommendations in six out of 10 years for A/H1N1 and in nine out of 10 years for A/H3N2, when evaluated using empirical coverage scores. For the majority of remaining years, our recommendations are similar to those of the WHO. Overall (two subtypes, 10 years), our predictions achieve a statistically significant improvement in empirical coverage scores over the WHO’s recommendations (one-sided Wilcoxon signed-rank test, P = 4.1 × 10−5). In fact, VaxSeer successfully recommends the ‘best’ vaccine strain (top empirical coverage score) in seven out of 10 years for A/H1N1 and in five out of 10 years for A/H3N2, whereas the WHO’s recommendation matches the best vaccine strain in just three out of 10 years for A/H1N1 and in zero out of 10 years for A/H3N2. Due to experimental constraints of WHO CCs, only a subset of candidate vaccines is tested broadly (over 40% of influenza viruses), allowing for the assessment of their empirical coverage scores. Thus, the strains selected in Fig. 2b and Extended Data Fig. 3c only reflect these limited choices of candidate vaccines. If we consider the distributions of predicted coverage scores over all candidate strains circulating in the previous 3 years (violin plots in Fig. 2b and Extended Data Fig. 3c), there are strains that scored even higher but were not subjected to sufficient experimental validation by WHO CCs. This highlights the possibility that there may exist even more effective vaccine strains waiting to be discovered.

In certain instances, our model can anticipate suitable vaccine strains a year in advance of the WHO. For example, in the 2016 winter season for A/H1N1, although the WHO recommendation (A/California/07/2009) has a decent empirical coverage score, VaxSeer proposes an alternative with an even higher score (A/Michigan/45/2015, collected earliest on 7 September 2015). Promisingly, A/Michigan/45/2015 was recommended by the WHO for the subsequent winter season.

To visualize the breadth in coverage of different vaccine strains, we plot the predicted antigenicity of A/H3N2 strains circulating in the 2019 winter season. The strain recommended by the WHO aligns with one emerging clade, 3C.3a1 (Fig. 2c). By contrast, the strain recommended by VaxSeer offers a complementary antigenic profile, mainly covering the 3C.2a1b.1a/b and 3C.2a1b.2b/a clades (Fig. 2d). Although the WHO selects the vaccine covering the new clade54, our model tends to select the vaccine strain that is effective against the majority of circulating clades and the clade undergoing further expansion (3C.2a1b.2a), as depicted in the phylogenetic tree (Fig. 2e).

Correlation of predicted coverage score with vaccine effectiveness and clinical endpoints

Correlation with vaccine effectiveness

As shown in Fig. 2a (orange), the predicted coverage scores significantly correlate with vaccine effectiveness against outpatient ARI, as estimated by the CDC (r = 0.861, P = 0.0014 and ρ = 0.891, P = 0.0005). This correlation is consistently observed with vaccine effectiveness estimates from I-MOVE in Europe and SPSN in Canada, as shown in Extended Data Fig. 2. Further analyses in Fig. 3a and Extended Data Fig. 4a highlight the superior performance of our approach that combines antigenicity and dominance, along with accurate future dominance predictions, compared to alternative metrics based solely on viral dominance or antigenicity.

Fig. 3: The predicted coverage score is correlated with vaccine effectiveness, reduction of disease burden and empirical coverage score.

a, The two-sided Pearson correlation and associated P values between vaccine effectiveness and various scoring strategies for the vaccine strains selected by the WHO from 2012 to 2019 for A/H1N1 and A/H3N2. Evaluation was conducted on n = 10 data points, where each point represents a vaccine strain selected by the WHO for a specific year and subtype, with corresponding vaccine effectiveness estimates from the CDC. Our predicted coverage score shows the strongest correlation. b, The vaccines with higher effectiveness (>40) have higher predicted coverage scores than vaccines with lower effectiveness (≤40). The medians of the coverage scores are depicted by the center lines within the boxes. The box spans from the first to the third quartile, and the whiskers cover 1.5 times the interquartile range. Outliers beyond the whiskers are represented by individual marks. The P value, calculated by a one-sided independent t-test, is illustrated. c, Correlations between the predicted coverage scores and the estimated number of influenza medical visits averted by vaccination in the United States. Two-sided Pearson correlations (r) and corresponding P values (P) are illustrated in the text. d, Two-sided Spearman rank correlation and the associated P values between the empirical coverage scores and predicted coverage scores based on different dominance predictors39,42, evaluated across candidate vaccine strains of A/H1N1 (50 strains) and A/H3N2 (51 strains) from 2012 to 2021. Each data point represents a distinct strain for a specific year and subtype. Given our antigenicity predictor, our dominance model achieves the best performance. predict, predicted.

First, without accounting for antigenicity, the dominance of vaccine strains in the last season (‘Dominance of vaccine (last season)’ in Fig. 3a) is poorly correlated with vaccine effectiveness (Pearson correlation r = 0.4920 and P = 0.15), as further detailed in Extended Data Fig. 4b. Second, using antigenicity without considering the future dominance of viruses also yields suboptimal results. The baseline ‘average antigenicity’ method, which simply calculates the average HI value for a set of expert-selected viruses, demonstrates inferior correlation (Extended Data Fig. 4c). By contrast, our predicted coverage score achieves the highest correlation with vaccine effectiveness, emphasizing the need to model both antigenicity and future dominance.

Furthermore, ranking vaccines by their predicted coverage score allows us to distinguish between past vaccines with higher (≥40) and lower (<40) effectiveness (Fig. 3b). Vaccines with higher effectiveness exhibit significantly greater predicted coverage scores, as confirmed by a one-sided independent t-test (P = 0.026). Here, we set 40% as the threshold for vaccine effectiveness, as it is generally considered a lower bound of effectiveness when influenza vaccine strains are antigenically similar to circulating influenza viruses3.
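
The group comparison behind Fig. 3b reduces to a one-sided independent t-test; the sketch below uses scipy with made-up coverage scores rather than the study’s actual values.

```python
from scipy import stats

# Hypothetical predicted coverage scores, split at the 40% effectiveness threshold.
high_ve_scores = [-1.2, -0.9, -1.5, -1.0]        # vaccines with effectiveness >= 40
low_ve_scores = [-2.4, -1.9, -2.8, -2.1, -2.6]   # vaccines with effectiveness < 40

# One-sided test: is the mean coverage score of the high-effectiveness group greater?
result = stats.ttest_ind(high_ve_scores, low_ve_scores, alternative="greater")
print(result.statistic, result.pvalue)
```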

Correlation with reduction of disease burden

Beyond effectiveness, we further analyze the association between predicted coverage score and the reduction of disease burden to showcase the potential impact of our approach. The CDC estimates the reduction of disease burden using the number of medical visits (Fig. 3c) and influenza illnesses (Extended Data Fig. 3d) averted by vaccination in the United States46,64. We combined the predicted coverage scores from two subtypes, weighted by the ratio of infections for each subtype, because the reduction in disease burden data does not differentiate between subtypes. The year 2020 was excluded due to the lack of available information. As shown in Fig. 3c and Extended Data Fig. 3d, our predicted coverage scores show positive alignment with the number of averted symptomatic illnesses and medical visits (Pearson correlations of 0.6858 and 0.6993, respectively, P < 0.05).

Performance of the dominance predictor

We next evaluate our dominance prediction model in isolation. With the goal of optimizing the empirical coverage scores for vaccines, the dominance prediction model’s performance is evaluated based on its accuracy in predicting the empirical coverage scores, in combination with a fixed antigenicity model. To explore the effect of modeling changes in dominance in VaxSeer, we compare it with the following baselines, which model static viral dominance and are trained on the same data. We first compare against a baseline that defines dominance based on the empirical frequencies calculated from the previous season (‘Last’). This baseline assumes that the distribution of variants does not change between seasons. We further compare with three state-of-the-art machine learning models, LM40, CSCS39 and EVEscape42, which learn a static fitness landscape for proteins and have shown potential in predicting virus fitness and escapability39,41,42 (see Methods for details). By contrast, we define the evolution of dominance by a time-dependent ODE parameterized by language models44.

Our results show that our proposed dominance predictor achieves the highest correlation (Fig. 3d) and the lowest error (Extended Data Fig. 5a,b) with the empirical coverage score, in conjunction with our antigenicity predictor, highlighting the importance of using a temporal model. We also demonstrate the superior performance of our dominance predictor through its ability to identify future dominant sequences, as shown in Extended Data Fig. 5c. Extended Data Fig. 5d and Extended Data Table 3 also present a comparison of antigenicity models.

Discussion

Vaccines are an important defense against infectious diseases. However, in the presence of continual evolution, it is challenging to forecast the future landscape of viral strains and assess the antigenicity of candidate vaccines at scale. Here we propose VaxSeer, an in silico framework that ranks vaccine strains based on their predicted coverage score—a prospective estimation of a vaccine’s antigenic match with future viruses. We consider future dominance of viruses in the coverage score, which is predicted by a dominance predictor. This predictor expresses the change in dominance over time using an ODE, with parameters estimated by protein language models, enabling the prediction of dynamic dominance based on the entire protein sequence. Due to the observational nature of public health data, we validated VaxSeer through computational surrogates (empirical coverage score) rather than actual vaccine effectiveness from population-based trials. However, the strong correlation between the empirical coverage scores and real-world effectiveness suggests that VaxSeer has the potential to help select vaccine strains with improved effectiveness. Our model can contribute to influenza vaccine selection in two ways. First, it provides a complementary perspective to existing antigenic measurements by specifically considering future viral distributions. Thereby, it can be used as an additional information source during the selection process. Second, as a cost-effective and computationally efficient in silico tool, our model enables rapid screening of large viral strain pools to identify the most promising candidates. These candidates can then be prioritized for further resource-intensive validation through traditional laboratory techniques such as two-way tests or clinical trials.

Although various factors contribute to the ultimate effectiveness of a vaccine, our study focuses on antigenic match. We do not address other factors related to vaccine production or hosts that impact vaccine selection, such as dosage18,65, adjuvant choices16, vaccine platform17, egg adaptations in the production process66, vaccination timing67 or host immune history13,14,15. Despite not accounting for these factors, our predicted coverage score still demonstrates a reasonable correlation with vaccine effectiveness, highlighting its utility in improving antigenic match for vaccine selection. Moreover, our approach can be integrated with methods that consider these additional factors to offer a more holistic assessment of vaccine effectiveness.

When estimating the antigenic match, the sequencing data used in our study may be subject to sampling biases, leading to challenges in accurately determining strain distributions. For instance, the geographic distribution of sequences is uneven across the globe, and sequencing strategies might change over time (see Methods for details). In recent years, there has been an increasing preference for sequencing directly from original clinical specimens rather than from egg-passaged or cell-passaged viruses (Extended Data Fig. 6a), because viruses may mutate during the passage process68. To evaluate the impact of these biases on empirical coverage scores, we examined two factors: geographical location and passage history. As shown in Extended Data Fig. 7, the empirical coverage scores derived from viral samples across different locations and passages remain consistent, suggesting that vaccine antigenic profiles are relatively robust to sampling biases in viral sequences. Nevertheless, acquiring more unbiased sequencing data is beneficial for improving the accuracy of our predictions.

In the present study, the antigenicity is estimated using HI assays conducted by one WHO CC (Francis Crick Institute) due to data availability. These data may be biased by the protocols adopted by this institute, and the HI assay itself has several inherent limitations. First, instead of measuring neutralization, it detects the binding of antibodies to HA proteins by observing the inhibition of hemagglutination. Thus, applying HI assays to recent H3 variants is problematic, as these strains exhibit reduced hemagglutination activity69. Second, the antibodies are obtained from postinfection ferret antisera70, which may not align with the immune responses in human populations13,14,15. Third, the passage histories of vaccines and viruses can impact HI test results68,71. Despite these limitations, HI assays remain useful for approximating antigenicity, as empirical coverage scores derived from them demonstrate a significant correlation with vaccine effectiveness. Furthermore, our antigenicity predictor is able to learn meaningful antigenic features from noisy HI data, as evidenced by its accurate coverage score predictions. The accuracy of our method can be further improved by incorporating antigenicity data from more precise assays, such as standardized HI tests across multiple laboratories, advanced neutralization assays such as microneutralization72 or high-content imaging-based neutralization test (HINT)73 and passage strategies minimizing antigenic mutations74.

In our experiments, we limit candidate vaccine strains to those circulating in previous seasons, as the antigenicity predictor is trained on a limited number of existing vaccine strains. However, because our antigenicity predictor relies solely on protein sequences, it can, in principle, compute coverage scores for any vaccine, including those that have not yet been isolated or do not exist in nature. By expanding the diversity of vaccines in the antigenicity dataset, VaxSeer can explore a much larger vaccine space, making it a valuable tool for virtual screening, functional optimization and de novo vaccine design. Nonetheless, further exploration is needed to assess the ability of our model to generalize to vaccines that are substantially different from those in our training data.

We showcase the feasibility of our method in influenza owing to the extensive efforts in data construction for this virus. This approach can be applied to other pandemic viruses provided that sequence and antigenicity data are available. However, the neural network architecture of our model requires sufficient training data to perform effectively, so it may face limitations when applied to emergent or rare viruses. For example, validating this approach for SARS-CoV-2 is more challenging due to the limited public availability of antigenicity data. Nevertheless, the overall concept of computationally forecasting antigenicity and dominance remains viable with alternative architectures.

Multiple directions can further improve this work. First, the current implementation of VaxSeer only considers the influenza virus’s HA protein, whereas studies have shown that other proteins, such as neuraminidase, may also influence viral fitness and vaccine antigenicity75,76,77. Thus, we expect that modeling larger portions of a viral genome will provide a more complete picture for vaccine selection. Second, we only evaluated the performance of VaxSeer over vaccines with sufficient HI test data from WHO CCs. There exist vaccine strains with higher predicted coverage scores, but we lacked the experimental resources to validate their empirical coverage scores. Third, we predict antigenicity based solely on HA protein sequences. Integrating posttranslational modifications of HA proteins, such as glycosylation, could enhance the accuracy of antigenicity predictions71. Finally, our current approach only computes the predicted coverage score over observed viral sequences, without considering novel sequences that may appear. In fact, during a given influenza season, over 40% of HA sequences were unseen in the previous year (Extended Data Fig. 6b). To account for these emergent sequences, we could leverage our dominance predictor as a generative model and sample viral sequences given by the predicted future distribution. This would enable us to compute the predicted coverage score over both current and potential viral strains. Unfortunately, we do not have sufficient experimental data to validate the antigenicity of these novel viruses.

In summary, this study showcases the potential of machine learning to assist humans in the discovery of more effective vaccines.

Methods

Dataset description

Dominance

The training corpora of the dominance prediction model were obtained from the GISAID. We downloaded 394,090 HA sequences submitted before 2 March 2023. We retained only near-full-length HA amino acid sequences (minimum length of 553 amino acids) from human hosts. Among the sequences, approximately 67.5% contain gender information, with a nearly equal distribution between male and female. The geographical distribution of the samples is as follows: 34.8% from North America, 29.2% from Europe, 20.0% from Asia, 7.1% from Oceania, 4.9% from South America and 4.1% from Africa. The distribution of passage histories for the samples is as follows: 44.0% from original specimens, 31.8% from cell lines or eggs and 24.1% from unknown sources. The distribution of host ages is as follows: 2.7% are younger than 2 years, 13.3% are 2–8 years, 7.8% are 9–17 years, 20.9% are 18–49 years, 7.2% are 50–64 years, 9.7% are 65 years or older and 38.4% do not have available age information. We used HA sequences collected worldwide, consistent with the WHO’s global mandate to recommend a vaccine strain for each subtype. Additionally, we incorporated sequences from all passage histories, as a large portion of viruses collected in earlier years originated from cell line passages (Extended Data Fig. 6a).

Starting from October 2003, we discretized every 6 months into one season. Two subtypes, A/H1N1 and A/H3N2, are considered separately. For each subtype, seasons with fewer than 100 HA samples were discarded. After preprocessing, we obtained 28,546 unique HA sequences for A/H3N2 and 23,736 unique HA sequences for A/H1N1. The sequences were further split into training and testing sets based on collection time. Specifically, to predict the dominance of sequences in the winter season of a particular year (for example, a test set spanning October 2018 to April 2019), we train on sequence data collected before February of that year (for example, before February 2018).
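
A sketch of the 6-month season discretization and the time-based train/test split described above, using the 2018–2019 winter season from the example; the date handling and function names are illustrative assumptions.

```python
from datetime import date

SEASON_START = date(2003, 10, 1)  # seasons are 6-month windows starting October 2003

def season_index(collection_date):
    """Map a collection date to its 6-month season index (0 = Oct 2003-Mar 2004)."""
    months = (collection_date.year - SEASON_START.year) * 12 \
             + (collection_date.month - SEASON_START.month)
    return months // 6 if months >= 0 else None

def time_split(records, train_cutoff=date(2018, 2, 1)):
    """Split (sequence, collection_date) records by collection time for one test season."""
    train = [(s, d) for s, d in records if d < train_cutoff]
    test = [(s, d) for s, d in records
            if date(2018, 10, 1) <= d <= date(2019, 4, 30)]  # winter test window
    return train, test
```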

During training, we randomly split the sequences into training and validation sets with a 9:1 ratio. The best model checkpoint is selected based on the validation loss. The number of samples used for training the dominance prediction model is provided in Supplementary Table 2.

Antigenicity

The HI test data were extracted from the published reports prepared for the WHO annual consultations on the composition of influenza vaccines from 2003 to February 2023. The HI test data comprise the names of virus and vaccine strains, along with the dilution of antibodies leading to hemagglutination inhibition. We retrieved the HA sequences corresponding to the strain names from the GISAID. Strains for which HA sequences could not be found were excluded from the dataset. Twenty-six percent of A/H3N2 virus strains (1,700 out of 6,541) and 20% of A/H1N1 virus strains (1,398 out of 7,068) were excluded. Eleven percent of A/H3N2 vaccine strains (18 out of 169) and 5% of A/H1N1 vaccine strains (4 out of 77) were excluded. If multiple HA sequences corresponded to one strain name, we enumerated all of the possible HA sequences. If a pair of vaccine–virus sequences had multiple HI test values, their geometric mean was used. After processing, we obtained HI test results for 70,631 distinct vaccine–virus HA protein pairs for A/H1N1 and 63,299 distinct pairs for A/H3N2. For A/H1N1, a total of 3,068 distinct virus sequences and 109 distinct vaccine sequences were included. For A/H3N2, a total of 2,731 distinct virus sequences and 255 distinct vaccine sequences were included. The HA protein sequences of each vaccine and circulating test virus were pairwise aligned using MMseqs2 (ref. 79).
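
A minimal helper illustrating the geometric-mean aggregation of repeated measurements per vaccine–virus pair; the record format is an assumption made for the sketch.

```python
import math
from collections import defaultdict

def aggregate_hi_titers(hi_records):
    """Collapse repeated measurements per (vaccine, virus) pair to a geometric mean.

    hi_records: iterable of (vaccine_seq, virus_seq, hi_titer) with hi_titer > 0.
    Returns a dict mapping (vaccine_seq, virus_seq) -> geometric-mean titer.
    """
    grouped = defaultdict(list)
    for vac, vir, titer in hi_records:
        grouped[(vac, vir)].append(titer)
    return {pair: math.exp(sum(math.log(t) for t in titers) / len(titers))
            for pair, titers in grouped.items()}
```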

The HI assay determines the highest dilution of antibodies capable of inhibiting virus binding to red blood cells21. A higher dilution indicates stronger antibody binding to the viruses. Following previous work80, we quantify antigenic similarity h(v, x) by the relative (logarithmic) dilution:

$$h(v,x)=\min (\log {\rm{HI}}(v,x)-\log {\rm{HI}}(v,v),0),$$
(1)

where HI(v, x) is the highest dilution factor (in folds) of serum containing antibodies from vaccine v that can inhibit viral strain x. HI(v, v) is the dilution against the reference virus v used to produce vaccine v. A higher h(v, x) indicates a higher antigenic similarity between the vaccine strain v and the viral strain x.
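
Equation (1) translates directly into code. The logarithm base is not fixed above; the sketch below assumes base 2 (a common choice for 2-fold HI dilution series), which only rescales h.

```python
import math

def antigenic_similarity(hi_vx, hi_vv):
    """h(v, x) from equation (1): relative log HI titer, capped at 0.

    hi_vx: HI titer of vaccine-v antiserum against virus x.
    hi_vv: homologous HI titer of the same antiserum against the reference virus v.
    Log base 2 is an assumption; the base only changes the scale of h.
    """
    return min(math.log2(hi_vx) - math.log2(hi_vv), 0.0)

# Example: a 4-fold titer drop relative to the homologous titer gives h = -2.
print(antigenic_similarity(160, 640))  # -> -2.0
```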

When training each year’s HI predictor, we use only vaccine–virus pairs with sequences collected before 1 February. These pairs are further split into training, validation and test sets in an 8:1:1 ratio. The validation set is used to select the best model checkpoint, whereas the test set evaluates HI prediction error.

Evaluation

We provide details about the data in the evaluation tasks below.

1. Vaccine effectiveness. To investigate the relationship between vaccine effectiveness and coverage scores, we built a test set including past recommended vaccines from 2012 to 2021. Vaccine effectiveness estimates of these past vaccines were obtained from three independent sources:

   (a) CDC (United States): In the United States, influenza vaccine effectiveness is estimated through the US Influenza Vaccine Effectiveness Network. The network enrolls participants (aged at least 6 months) presenting with ARI (new or worsening cough) at outpatient facilities, including emergency departments, within 7 days of symptom onset4,5,6,7,8,9,10,11.

   (b) SPSN (Canada): The SPSN estimates vaccine effectiveness based on patients aged at least 1 year who present within 7 days of onset of ILI to community-based sentinel practitioners in Canada56,57,58,59,60,61,62,63. ILI is defined as the acute onset of fever and cough, accompanied by at least one of sore throat, myalgia, arthralgia or prostration.

   (c) I-MOVE (Europe): The I-MOVE network, which includes study sites across nine European countries, estimates vaccine effectiveness based on patients presenting with ILI or ARI to general practitioners or pediatricians who had a specimen taken within 8 days of symptom onset50,51,52,53,54.

   We excluded the years 2020 and 2021 to mitigate potential biases introduced by SARS-CoV-2 vaccines48,49. The vaccine effectiveness data used, along with references, are summarized in Extended Data Table 1.

2. Empirical coverage score. The empirical coverage score is used in two evaluation tasks: assessing the relationship between the predicted coverage score and the empirical coverage score and evaluating the antigenic match of our selected vaccines against the WHO’s recommended vaccines. We constructed a test set comprising candidate vaccines with available empirical coverage scores. For each year, we included vaccine strains whose protein sequences appeared at least five times in the past 3 years in the GISAID dataset and had HI test results covering at least 40% of the circulating viruses for that year. This results in 51 vaccine strains for A/H3N2 and 50 vaccine strains for A/H1N1.

3. Reduction of disease burden. To examine the correlation between coverage score and the reduction in disease burden, we constructed a test set comprising past recommended vaccines used from 2012 to 2022, excluding 2020 due to unavailable data. Data on the number of medical visits and influenza illnesses averted by these vaccinations were obtained from the CDC. Because the data did not differentiate between subtypes, we combined the coverage scores for A/H1N1 and A/H3N2, resulting in a total of nine data points.


Definition of coverage score

Let \({\mathcal{X}}\) denote the set of circulating viruses, and let v denote the vaccine in question. The mathematical definition of ‘coverage score (CS)’ for a vaccine v during a future season t is

$$C{S}_{t}(v)=\sum _{x\in {\mathcal{X}}}{p}_{t}(x)h(v,x),$$
(2)

where pt(x) is the dominance (probability) of virus x in season t, and h(v, x) measures the antigenicity between vaccine v and virus x. The probability is normalized over the set \({\mathcal{X}}\), meaning that \({\sum }_{x\in {\mathcal{X}}}{p}_{t}(x)=1\). Large values of pt(x) indicate that virus x has high dominance, and large values of h(v, x) indicate that vaccine v is effective against virus x. The vaccine candidates with the highest coverage scores are those recommended by our algorithm.
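
Equation (2) maps directly to code. The sketch below assumes the dominance values and antigenic similarities h(v, x) are already available as arrays; the array and function names are illustrative, not the actual pipeline.

```python
import numpy as np

def coverage_score(dominance, antigenicity_row):
    """CS_t(v) from equation (2) for one vaccine v.

    dominance:        array p_t(x) over the circulating virus set, shape (n_viruses,).
    antigenicity_row: array h(v, x) for the same viruses, shape (n_viruses,).
    """
    p = np.asarray(dominance, dtype=float)
    p = p / p.sum()  # normalize over the virus set X
    return float(p @ np.asarray(antigenicity_row, dtype=float))

def rank_vaccines(dominance, antigenicity_matrix):
    """Rank candidate vaccines by coverage score (highest first)."""
    scores = [coverage_score(dominance, row) for row in antigenicity_matrix]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```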

Dominance predictor

Our dominance predictor aims to model the time-resolved distributions of protein sequences. Given an HA protein sequence and a particular time as input, our dominance predictor outputs the probability (dominance) of this HA protein sequence occurring at that particular time.

Given an amino acid sequence \(x\in {V}^{L}\) with length L (where V is the set of all possible amino acids, with \(| V| =20\)), inspired by the SIR81 model, we define the change in frequency of occurrence (un-normalized dominance) of a viral protein x (annotated as nt(x)) by the following ODE:

$$\frac{d{n}_{t}(x)}{dt}=R(x){n}_{t}(x),\quad {n}_{0}(x)=\exp (b(x)),$$
(3)

in which \(R(x)\in {\mathbb{R}}\) represents the rate of change in the dominance of sequence x, and \(\exp (b(x))\in {\mathbb{R}}\) describes the initial condition. By solving this ODE, we can obtain the frequency nt(x) at time t as

$${n}_{t}(x)=\exp (R(x)t+b(x)).$$
(4)

To obtain the dominance (probability) pt(x) of sequence x, we need to normalize nt(x) over the entire protein sequence space. However, directly calculating probabilities is impractical due to the vast number of amino acid combinations. Instead, we express the probability in terms of ‘autoregressive’ conditional probabilities44. That is, we propose to model the conditional probability

$${p}_{t}({x}_{i}| {x}_{ < i};\theta )=\frac{\exp (R({x}_{i}| {x}_{ < i};\theta )t+b({x}_{i}| {x}_{ < i};\theta ))}{{\sum }_{{x}_{i}^{{\prime} }\in V}\exp \left(R\left({x}_{i}^{{\prime} }| {x}_{ < i};\theta \right)t+b\left({x}_{i}^{{\prime} }| {x}_{ < i};\theta \right)\right)},$$
(5)

in which \({x}_{i}\) is the i-th residue in the sequence x. Both \(R({x}_{i}| {x}_{ < i};\theta )\) and \(b({x}_{i}| {x}_{ < i};\theta )\) are parameterized by Transformer-based language models44. The probability of the complete protein sequence x can be represented as the product of conditional probabilities for each residue:

$${p}_{t}(x;\theta )=\mathop{\prod }\limits_{i=1}^{L}{p}_{t}({x}_{i}| {x}_{ < i};\theta ).$$
(6)

During the training process, we sample pairs of protein sequences and their collection times. The parameters θ of the language model are optimized based on the maximum likelihood estimation objective:

$${\mathcal{L}}(\theta)={{\mathbb{E}}}_{(x,t)\sim{p}_{train}}\log{p}_{t}(x;\theta)={{\mathbb{E}}}_{(x,t) \sim{p}_{train}}\mathop{\sum}\limits_{i = 1}^{L}\log{p}_{t}({x}_{i}| {x}_{< i};\theta ).$$
(7)
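
A minimal PyTorch sketch of equations (5)–(7), assuming the per-position logits for R(·∣x<i) and b(·∣x<i) have already been produced by the two language models; the tensor names and toy dimensions are illustrative, not the actual VaxSeer implementation.

```python
import torch
import torch.nn.functional as F

def dominance_log_prob(logits_R, logits_b, tokens, t):
    """log p_t(x) from equations (5)-(6).

    logits_R, logits_b: (L, V) per-position logits from the rate and initial-condition
                        language models, each row conditioned on the prefix x_{<i}.
    tokens:             (L,) integer-encoded amino acid sequence x.
    t:                  scalar time (e.g. seasons since the reference time).
    """
    logits_t = logits_R * t + logits_b           # R(.|x_<i) * t + b(.|x_<i)
    log_probs = F.log_softmax(logits_t, dim=-1)  # normalize over the amino acid vocabulary
    token_log_probs = log_probs.gather(1, tokens.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()                 # sum over positions = log of the product

def training_loss(batch_logits_R, batch_logits_b, batch_tokens, batch_times):
    """Negative of the maximum-likelihood objective in equation (7), averaged over a batch."""
    nll = [-dominance_log_prob(r, b, x, t)
           for r, b, x, t in zip(batch_logits_R, batch_logits_b, batch_tokens, batch_times)]
    return torch.stack(nll).mean()

# Toy example: a sequence of length 3 over a 20-letter alphabet, evaluated at t = 2.
L, V = 3, 20
toy_R, toy_b = torch.randn(L, V), torch.randn(L, V)
toy_x = torch.randint(0, V, (L,))
print(dominance_log_prob(toy_R, toy_b, toy_x, t=2.0))
```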

Architecture

We use a 12-layer GPT-2 model44 to parameterize \(R({x}_{i}| {x}_{ < i};\theta )\) and another 12-layer GPT-2 model to parameterize \(b({x}_{i}| {x}_{ < i};\theta )\). The dominance predictors are trained for 100 epochs with a batch size of 16 and a learning rate of 1 × 10−5.

Implementation details for dominance prediction baselines

The baseline annotated as ‘Last’ defines dominance based on the empirical frequencies calculated from the previous season. We incorporated sequence data collected during the 6-month period prior to 1 February.

CSCS39 adopts the mutation probability and functional dissimilarity calculated from a masked protein language model to estimate protein fitness and escapability from antibodies.

EVEscape42 scores for individual mutations are found by combining three sources of information: a deep generative model for fitness prediction, structural information about the HA protein to estimate antibody binding potential (weighted contact number for each residue position) and chemical distances in charge and hydrophobicity between mutated and wild-type residues. We used the EVEscape42 codebase to train models on the same dataset as ours. The structures of H3 and H1 proteins were obtained from the Protein Data Bank, using 1RVX for A/H1N1 and 4FNK for A/H3N2.

LM models the dominance of sequences in a static manner. Similarly to equation (5), the likelihood of amino acid sequence x is factorized autoregressively without using the collection time information:

$$p({x}_{i}| {x}_{ < i};\theta )=\frac{\exp (F({x}_{i}| {x}_{ < i};\theta ))}{{\sum }_{{x}_{i}^{{\prime} }\in V}\exp \left(F\left({x}_{i}^{{\prime} }| {x}_{ < i};\theta \right)\right)},$$
(8)

in which \(F({x}_{i}| {x}_{ < i};\theta )\) are the logits output by a GPT-2 model (ref. 44).

Antigenicity predictor

Our antigenicity predictor is constructed using a 12-layer MSA Transformer45, with a linear output layer to regress the HI values as defined in equation (1). The model is trained with a batch size of 32 and a learning rate of 1 × 10−5 for 150,000 steps.
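
The paper uses a 12-layer MSA Transformer with a linear regression head. The sketch below is not that architecture: it substitutes a small standard Transformer encoder over the concatenated, aligned vaccine and virus sequences purely to illustrate the pair-in, HI-value-out interface, and all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class PairHIRegressor(nn.Module):
    """Toy stand-in for the antigenicity predictor: aligned (vaccine, virus) pair -> HI value."""

    def __init__(self, vocab_size=22, d_model=128, nhead=4, num_layers=4, max_len=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # shared amino acid embedding
        self.chain_embed = nn.Embedding(2, d_model)      # 0 = vaccine row, 1 = virus row
        self.pos_embed = nn.Embedding(max_len, d_model)  # position in the pairwise alignment
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)                # regress the (log) HI value

    def forward(self, vaccine_tokens, virus_tokens):
        # Both inputs: (batch, L) integer-encoded sequences, already pairwise aligned.
        batch, L = vaccine_tokens.shape
        pos = torch.arange(L, device=vaccine_tokens.device).expand(batch, L)
        vac = self.embed(vaccine_tokens) + self.chain_embed(torch.zeros_like(vaccine_tokens)) + self.pos_embed(pos)
        vir = self.embed(virus_tokens) + self.chain_embed(torch.ones_like(virus_tokens)) + self.pos_embed(pos)
        x = torch.cat([vac, vir], dim=1)                 # (batch, 2L, d_model)
        h = self.encoder(x).mean(dim=1)                  # pool over all alignment positions
        return self.head(h).squeeze(-1)                  # (batch,) predicted HI values

model = PairHIRegressor()
pred = model(torch.randint(0, 22, (2, 100)), torch.randint(0, 22, (2, 100)))
print(pred.shape)  # torch.Size([2])
```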

In our ablation studies, we considered both experimentally derived and simple machine learning baselines. In the BLOSUM baseline, for vaccine–virus pairs collected before 1 February with available experimental HI results, we use the experimental values directly. For vaccine–virus pairs without available experimental HI results, we search for the most similar vaccine–virus pairs collected before 1 February that have measured HI values in the dataset and use their average as the estimated HI result. The similarity is calculated from the BLOSUM62 matrix82, and the similarity between two vaccine–virus pairs is the sum of the similarity between the virus sequences and the similarity between the vaccine sequences.
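
A sketch of the BLOSUM baseline’s nearest-pair lookup described above. The substitution matrix is passed in (for example, BLOSUM62 values keyed by amino acid pairs), the sequences are assumed to be pre-aligned to equal length, and the function names are illustrative.

```python
def pair_similarity(vac_a, vir_a, vac_b, vir_b, score):
    """Similarity between two vaccine-virus pairs: sum of position-wise substitution scores.

    `score` maps an (aa, aa) pair to a substitution score (e.g. BLOSUM62 entries).
    Sequences are assumed pre-aligned to equal length (an illustrative simplification).
    """
    vaccine_sim = sum(score[a, b] for a, b in zip(vac_a, vac_b))
    virus_sim = sum(score[a, b] for a, b in zip(vir_a, vir_b))
    return vaccine_sim + virus_sim

def blosum_baseline_hi(query_pair, measured, score):
    """Estimate HI for an untested (vaccine, virus) pair from the most similar tested pairs.

    measured: dict mapping (vaccine_seq, virus_seq) -> experimental HI value.
    Returns the mean HI over the pairs tied for the highest similarity.
    """
    qv, qx = query_pair
    sims = {pair: pair_similarity(qv, qx, pair[0], pair[1], score) for pair in measured}
    best = max(sims.values())
    closest = [measured[p] for p, s in sims.items() if s == best]
    return sum(closest) / len(closest)
```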

In addition, we also considered two machine learning baselines. LR+ (ref. 80) is a linear regression model whose input features are the amino acid substitutions between the virus and vaccine protein sequences as well as their amino acids at each position. Finally, CNN is a convolutional neural network that consists of two one-dimensional convolution layers with 64 channels and a kernel size of 15, interleaved with two one-dimensional max-pooling layers. The input is the concatenation of the two one-hot representations of the HA protein sequences from the vaccine and virus. We use the hyperparameters suggested in ref. 83.
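
A PyTorch sketch of such a CNN baseline, following the stated hyperparameters (two convolution layers, 64 channels, kernel size 15, two max-pooling layers, channel-wise concatenation of the one-hot pair); the pooling window, padding and final head are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class HICnnBaseline(nn.Module):
    """CNN baseline sketch: concatenated one-hot vaccine + virus HA sequences -> HI value."""

    def __init__(self, n_amino_acids=22):
        super().__init__()
        in_channels = 2 * n_amino_acids  # one-hot vaccine channels stacked with virus channels
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),     # collapse the length dimension
            nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, onehot_pair):
        # onehot_pair: (batch, 2 * n_amino_acids, L)
        return self.net(onehot_pair).squeeze(-1)

model = HICnnBaseline()
print(model(torch.randn(4, 44, 560)).shape)  # torch.Size([4])
```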

Evaluation metrics

We used Pearson correlation and/or Spearman rank correlation84 as the primary metrics to evaluate the relationships between coverage scores and vaccine effectiveness, between coverage scores and the reduction of disease burden and between predicted coverage scores and empirical coverage scores.

We used Pearson correlation when both data series being tested were normally distributed, as determined by the Shapiro–Wilk test (P > 0.01). When the data were not normally distributed, we used Spearman rank correlation. In the analysis of correlations between vaccine effectiveness and coverage scores, we also reported Spearman rank correlation to highlight their monotonic relationship. For evaluating the correlations between predicted coverage scores and empirical coverage scores, we additionally present results using mean absolute error (MAE) and root mean squared error (RMSE) for a more comprehensive assessment.
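
A small scipy sketch of this metric-selection rule; the data values below are made up, and the P > 0.01 normality threshold follows the text.

```python
from scipy import stats

def correlation_report(x, y, alpha=0.01):
    """Pick Pearson vs Spearman based on Shapiro-Wilk normality, as described above."""
    normal = stats.shapiro(x).pvalue > alpha and stats.shapiro(y).pvalue > alpha
    if normal:
        r, p = stats.pearsonr(x, y)
        return "pearson", r, p
    rho, p = stats.spearmanr(x, y)
    return "spearman", rho, p

# Toy example with made-up coverage scores and effectiveness values.
scores = [-2.1, -1.8, -1.2, -0.9, -2.5, -1.0, -1.6, -0.7]
effectiveness = [19, 29, 48, 52, 23, 47, 36, 60]
print(correlation_report(scores, effectiveness))
```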

We used the empirical coverage scores to evaluate the antigenic match of different vaccine strains. To assess the statistical significance, we performed a one-sided Wilcoxon signed-rank test. We chose one-sided over two-sided because our goal was to assess the advantage of VaxSeer over past methods according to the empirical coverage scores rather than merely identifying any difference.

As a complementary evaluation for dominance predictor, we compared the frequency of occurrence of the top 20% of sequences predicted by different dominance prediction methods. To assess the statistical significance of our model’s higher frequency, we performed a one-sided Wilcoxon signed-rank test.

As a complementary evaluation for the antigenicity predictor, we compared the MAE of HI values predicted by different antigenicity prediction methods. To assess the statistical significance of our model’s lower error, we conducted a one-sided Wilcoxon signed-rank test, adjusted for multiple comparisons using the Benjamini–Hochberg procedure (false discovery rate).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.