A comparison of the ability of the National Early Warning Score and the National Early Warning Score 2 to identify patients at risk of in-hospital mortality: A multi-centre database study

Aims To compare the ability of the National Early Warning Score (NEWS) and the National Early Warning Score 2 (NEWS2) to identify patients at risk of in-hospital mortality and other adverse outcomes. Methods We undertook a multi-centre retrospective observational study at five acute hospitals from two UK NHS Trusts. Data were obtained from completed adult admissions of patients who were not fit enough to be discharged alive on the day of admission. Diagnostic coding and oxygen prescriptions were used to identify patients with type II respiratory failure (T2RF). The primary outcome was in-hospital mortality within 24 h of a vital signs observation. Secondary outcomes included unanticipated intensive care unit admission or cardiac arrest within 24 h of a vital signs observation. Discrimination was assessed using the c-statistic. Results Among 251,266 adult admissions, 48,898 were identified to be at risk of T2RF by diagnostic coding. In this group, NEWS2 showed statistically significantly lower discrimination (c-statistic, 95% CI) for identifying in-hospital mortality within 24 h (0.860, 0.857–0.864) than NEWS (0.881, 0.878–0.884). For 1394 admissions with documented T2RF, discrimination was similar for both systems: NEWS2 (0.841, 0.827–0.855), NEWS (0.862, 0.848–0.875). For all secondary endpoints, NEWS2 showed no improvements in discrimination. Conclusions NEWS2 modifications to NEWS do not improve discrimination of adverse outcomes in patients with documented T2RF and decrease discrimination in patients at risk of T2RF. The relationship between SpO2 values, oxygen therapy and risk should be investigated further before wide-scale adoption of NEWS2.


Introduction
Vital-signs-based aggregate early warning score (EWS) systems, which assign weights to each vital sign according to its deviation from assumed normal values, are recommended for routine use in UK hospitals 1,2 . In 2012, the Royal College of Physicians of London (RCPL) published a proposed National EWS (NEWS) 3 , which has since undergone extensive validation [4][5][6] . In NEWS, oxygen saturations (SpO2) receive increasing weights for values of 95% or less, and oxygen therapy receives a flat weight. However, guidance for the management of patients with type II respiratory failure (T2RF) 7,8 , and those deemed at risk of T2RF before blood gas analysis 7 , suggests lower SpO2 values (88-92%) should be targeted. Consequently, it has been suggested that the NEWS SpO2 weighting system is inappropriate for patients with/at risk of T2RF [9][10][11] . Some authors suggest that this weighting risks inappropriate oxygen therapy for these patients, with potentially deleterious consequences 9,10 .
In December 2017, the RCPL published an update to NEWS, the National Early Warning Score 2 (NEWS2) 12 , which includes several modifications to the NEWS vital sign weightings. To account for concerns about NEWS and T2RF, NEWS2 includes a new SpO2 scoring scale for patients with/at risk of T2RF. This scale, termed SpO2 scale 2, assigns weights at lower SpO2 thresholds than NEWS and combines these lower thresholds with weights for the use of supplemental oxygen at higher SpO2 levels, reflecting the concern of hyperoxia-induced hypercapnic respiratory failure 12 (see appendix A1). Although the derivation of these thresholds is not presented, and NEWS2 is as yet unvalidated, NHS England has endorsed the use of NEWS2 in acute and ambulance settings 13 , and is considering the use of the Commissioning for Quality and Innovation (CQUIN) payment system 14,15 to encourage organisations to implement NEWS2 by March 2019.
In this study, we used a large multi-centre dataset of vital signs to compare retrospectively the performance of NEWS2 and NEWS. We studied the performance of NEWS and NEWS2 in three risk groups: those with documented T2RF; those at risk of T2RF; and patients in neither of these groups.

Methods
The database for this study was created with Health Research Authority (reference: 16/SC/0264 and 08/02/1394) approval. The study protocol is available online 16 ; we follow the TRIPOD statement for reporting 17 .  20 . The following data were recorded: date and time of observation (automatically by SEND/VitalPAC™); heart rate, systolic blood pressure, respiratory rate, body temperature, neurological status using the Alert-Voice-Pain-Unresponsive (AVPU) scale, SpO2; and the patient's inspired gas (air or supplemental oxygen) at the time of SpO2 measurement. The HAVEN database also contains administrative and patient demographic information, and information about the occurrence and timing of cardiac arrest, unanticipated intensive care unit (ICU) admission and hospital discharge status (dead/alive). Prescription data from the electronic patient record are also available within the database for OUH admissions.

Study sites
The study took place at five hospitals: the four hospitals in the OUH group [The John Radcliffe Hospital (large university hospital), The Horton General Hospital (small district general hospital), The Churchill Hospital (large university cancer centre) and The Nuffield Orthopaedic Hospital] and a single large district general hospital, PH.

Participants
All completed adult admissions to the four hospitals comprising the OUH group (January-December 2016) and to PH (January 2012-December 2016) with at least one complete set of vital signs observations recorded electronically were considered. These study periods represent times of full deployment of electronic vital signs documentation in these hospitals. Patients discharged alive from the hospital before midnight on the day of admission and those with no vital signs recorded in the 24 h prior to discharge (as a proxy for patients on end-of-life pathways) were excluded from the analysis. For the main analysis, we combined admissions from all hospitals, but we also analysed data from each hospital trust separately (see appendix A3).
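The two exclusion criteria above can be expressed as a simple per-admission check. The following is a minimal sketch only: the function name, argument names and data layout are hypothetical and are not taken from the study code, and the 24 h window is assumed to be closed at both ends.

```python
from datetime import datetime, timedelta


def is_excluded(admitted, discharged, died_in_hospital, observation_times):
    """Return True if an admission meets either exclusion criterion.

    All argument names are illustrative assumptions, not study variables.
    """
    # Criterion 1: discharged alive before midnight on the day of admission.
    same_day_alive = (not died_in_hospital) and discharged.date() == admitted.date()

    # Criterion 2: no vital signs recorded in the 24 h prior to discharge
    # (used in the paper as a proxy for end-of-life pathways).
    window_start = discharged - timedelta(hours=24)
    no_recent_obs = not any(
        window_start <= t <= discharged for t in observation_times
    )
    return same_day_alive or no_recent_obs
```

For example, an admission at 09:00 with a live discharge at 18:00 the same day would be excluded under this sketch, whereas a three-day stay with an observation on the morning of discharge would be retained.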

Early warning scores (see appendix A1)
The NEWS2 adjustment for patients with/at risk of T2RF differs from NEWS in the assignment of weights to measured SpO2 (NEWS weights SpO2 values below 96%; NEWS2 below 88%). Additionally, for patients with/at risk of T2RF, NEWS2 assigns weights for SpO2 values above 92% when receiving oxygen.
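To make the difference between the two SpO2 scales concrete, the sketch below encodes both as functions. The exact bands are not listed in this section (they are in appendix A1); the values used here are those of the published RCPL NEWS and NEWS2 charts as we understand them and should be checked against the appendix before any reuse. Function names are hypothetical, and SpO2 is assumed to be recorded as a whole-number percentage.

```python
def news_spo2_weight(spo2):
    """SpO2 weight under NEWS (assumed bands: >=96 -> 0, 94-95 -> 1, 92-93 -> 2, <=91 -> 3)."""
    if spo2 >= 96:
        return 0
    if spo2 >= 94:
        return 1
    if spo2 >= 92:
        return 2
    return 3


def news2_scale2_spo2_weight(spo2, on_oxygen):
    """SpO2 weight under NEWS2 SpO2 scale 2 (patients with/at risk of T2RF).

    Assumed bands: <=83 -> 3, 84-85 -> 2, 86-87 -> 1, 88-92 -> 0;
    above 92% the weight is 0 on air, and on oxygen 93-94 -> 1, 95-96 -> 2, >=97 -> 3.
    """
    if spo2 <= 83:
        return 3
    if spo2 <= 85:
        return 2
    if spo2 <= 87:
        return 1
    if spo2 <= 92 or not on_oxygen:
        return 0
    # Above 92% while receiving supplemental oxygen.
    if spo2 <= 94:
        return 1
    if spo2 <= 96:
        return 2
    return 3
```

Under these assumed bands, a reading of 90% scores 3 on the NEWS scale but 0 on scale 2, while a reading of 97% on oxygen scores 0 on the NEWS scale but 3 on scale 2. The separate flat weight for supplemental oxygen is not included in this sketch.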

Outcome
The primary outcome was in-hospital death within 24 h of an observation set, in line with previous studies 21,22 . Secondary outcomes include cardiac arrest, unanticipated ICU admission, and either cardiac arrest, unanticipated ICU admission, or death within 24 h of an observation set. We present the results for all secondary outcomes, flagging those where insufficient outcomes exist (< 100), due to sample size, as recommended in the TRIPOD guidelines 17 . All outcomes were obtained retrospectively from different clinical information systems, including the hospitals' patient administration systems, the ICU clinical information systems, and the hospitals' National Cardiac Arrest Audit (https://ncaa.icnarc.org) databases.
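Because outcomes are defined relative to each observation set, every observation needs its own outcome label. The following is a minimal sketch of that labelling step; the function names are hypothetical, and treating the 24 h window as inclusive at both ends is an assumption rather than something stated in the text.

```python
from datetime import datetime, timedelta


def outcome_within_24h(observation_time, event_time):
    """True if the event occurred in the 24 h following the observation.

    event_time is None when the event never occurred during the admission.
    """
    if event_time is None:
        return False
    delay = event_time - observation_time
    return timedelta(0) <= delay <= timedelta(hours=24)


def composite_within_24h(observation_time, event_times):
    """Composite outcome: any of cardiac arrest, unanticipated ICU admission or death."""
    return any(outcome_within_24h(observation_time, t) for t in event_times)
```

An observation taken at 08:00 followed by death at 20:00 the same day would be labelled positive for the primary outcome; the same observation would be labelled negative if death occurred two days later.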

Predictors
Vital sign sets (see above) were recorded using SEND/VitalPAC™.
Where the patient's conscious level had been assessed only using the Glasgow Coma Scale (GCS), we converted GCS to an AVPU equivalent 21 . Vital signs were then assigned weights for NEWS and NEWS2 scores (see appendix A1). The sum of the weights (aggregate score) was calculated for each vital sign set. As in previous NEWS publications 22,24 , and in line with previous vital-signs-based EWS research [25][26][27][28] , each vital sign set was analysed as independently associated with the outcome.

Missing data
For the analysis, we considered complete observation sets (i.e., sets with measurements of all vital signs), in line with previous NEWS studies 22,24 . The SEND system allows recording of incomplete vital sign sets, which is discouraged in the VitalPAC™ system. Missing values therefore occurred only in the OUH dataset. We did an a priori sub-analysis in which we handled these missing values using multiple imputation, a general-purpose and widely used approach 29 .
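The complete-case restriction amounts to keeping only observation sets in which every vital sign is present. A minimal sketch follows, assuming each observation set is held as a dictionary; the key names are hypothetical and simply mirror the vital signs listed in the Methods.

```python
# Hypothetical field names for the seven recorded items.
REQUIRED_FIELDS = (
    "heart_rate",
    "systolic_bp",
    "respiratory_rate",
    "temperature",
    "avpu",
    "spo2",
    "on_oxygen",
)


def is_complete(observation):
    """True if every required vital sign has a recorded (non-None) value."""
    return all(observation.get(field) is not None for field in REQUIRED_FIELDS)


def complete_cases(observations):
    """Keep only complete observation sets, as in the main analysis."""
    return [obs for obs in observations if is_complete(obs)]
```

Note that a value of `False` for `on_oxygen` (patient breathing air) counts as recorded; only an absent or `None` value makes the set incomplete.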

Statistical analysis
Performance of NEWS and NEWS2 was assessed by discrimination using receiver operating characteristic (ROC) curve analysis (calibration was not assessed, as the EWS systems do not give estimates of absolute risk). We also assessed the effect of suggested thresholds for patient review (aggregate NEWS/NEWS2 scores of 5 or above, or 7 or above 12 ) by reporting sensitivity, specificity and positive predictive values. We also show SpO 2 distributions for three different risk groups (see below). All analysis was performed using the R statistical software (v3.4.4) 30 and ROC curves were calculated using the pROC package 31 . Differences in the area under the ROC curve (AUROC), or c-statistic, between NEWS and NEWS2 were compared using bootstrapping (2000 samples) 31 . We did post-hoc sub-analyses of performance by institution (in light of the different patient numbers contributed). We also performed post-hoc efficiency curve analysis (as we were unable to conduct decision curve analysis as estimates of risk for a given score are not available).
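For readers less familiar with the c-statistic, the sketch below shows how it can be computed from aggregate scores and binary outcomes, and how a paired bootstrap can be used to compare two scoring systems on the same observations. This is an illustration only and is not the study code: the analysis itself used R with the pROC package, whose bootstrap procedure may differ in detail (for example in how resamples are drawn and how the comparison is summarised). Function names and the percentile interval are our assumptions.

```python
import random


def auroc(scores, labels):
    """c-statistic via the rank-sum formulation, with tied scores sharing midranks.

    labels must be 0/1 integers (1 = outcome occurred).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_difference(scores_a, scores_b, labels, n_boot=2000, seed=0):
    """Mean and 95% percentile interval of AUROC(a) - AUROC(b) over paired resamples."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("need at least one positive and one negative label")
    rng = random.Random(seed)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if not 0 < sum(lab) < n:
            continue  # resample contains a single class; draw again
        a = auroc([scores_a[i] for i in idx], lab)
        b = auroc([scores_b[i] for i in idx], lab)
        diffs.append(a - b)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return sum(diffs) / n_boot, lower, upper
```

Resampling the same observation indices for both scores keeps the comparison paired, which is appropriate here because NEWS and NEWS2 are calculated from the same vital sign sets.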

Risk groups
After exclusion criteria were applied, we categorised each admission according to the following risk groups:
1 Patients with recorded T2RF, identified using the Adult Oxygen Prescription form of the current admission (OUH only).
2 Patients at risk of T2RF, identified using diagnostic coding.
3 Patients not at risk of T2RF, i.e., not in groups 1 or 2 above.
We report the performance metrics of each scoring system for each of these risk groups. We report the results of the SpO2 scale 2 of NEWS2 in the third risk group (patients not at risk of T2RF) to demonstrate the effect of erroneous use of the scale in this population.

Development versus evaluation datasets
NEWS was originally developed using a dataset with admissions to PH's Medical Assessment Unit (MAU) 22 . The NEWS2 report does not identify a development dataset for NEWS2 12 . The study evaluation database (HAVEN) includes data from all admissions to OUH and PH for the periods stated above. Vital sign data for all sites are present from hospital admission to hospital discharge/death. NEWS2 is recommended for use in all the included settings.

Descriptive statistics
A total of 251,266 distinct admissions were included. Fig. 1 shows the application of inclusion/exclusion criteria, resulting in the final cohort of admissions. All patients in the final dataset had at least one complete vital sign set. A total of 48,898 admissions were associated with patients at risk of T2RF, and 1394 with patients with documented T2RF (80.3% of whom also belong to the group of patients at risk of T2RF). Table 1 summarises the admission demographic descriptors and other clinical information for the three risk groups. Patients in risk groups 1 (documented T2RF) and 2 (at risk of T2RF) both had higher mortality rates (and rates of other adverse outcomes) when compared to patients who were not at risk (i.e. risk group 3).
The distribution of SpO2 values for patients with documented T2RF was bell-shaped, whereas that for the group of patients who were not at risk was right skewed (Fig. 2). In patients with documented T2RF, 77.4% of admissions had at least one recorded SpO2 measurement above 92% on room air, compared with 98.7% in the non-risk group (Fig. 2).

Performance of early warning scores
Performance metrics for the three risk groups for in-hospital death are presented in Table 2, and the corresponding ROC curves are represented in Fig. 3. Those for the secondary outcomes are shown in Table 3.
Results of the sub-analyses by institution are shown in appendix A3. The effects of using multiple imputation to replace missing vital sign values are shown in appendix A4.
In patients with documented T2RF, the AUROCs for predicting inpatient mortality within 24 h for the two scoring systems were as follows: NEWS 0.862 (95% CI: 0.848 to 0.875); NEWS2 0.841 (0.827 to 0.855) (Table 2). Using a threshold of 5 points, positive predictive values for NEWS and NEWS2 were 2.5% and 3.0%, respectively. In patients at risk of T2RF, the AUROCs for predicting inpatient mortality within 24 h for the two scoring systems were as follows: NEWS 0.881 (0.878 to 0.884); NEWS2 0.860 (0.857 to 0.864). Using a threshold of 5 points, positive predictive values for NEWS and NEWS2 were 3.2% and 2.7%, respectively.
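The threshold-based metrics reported here follow directly from a two-by-two table of triggering (aggregate score at or above the threshold) against outcome. A minimal sketch is given below; the function name and return format are hypothetical, and the scores in the usage note are made-up illustrative values, not study data.

```python
def threshold_metrics(scores, labels, threshold):
    """Sensitivity, specificity and PPV when a score >= threshold triggers a review.

    labels are 0/1 (1 = outcome within 24 h). Undefined ratios are returned as NaN.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        triggered = score >= threshold
        if triggered and label:
            tp += 1
        elif triggered:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,
    }
```

With six illustrative observations scoring 2, 5, 7, 6, 1 and 8, of which only those scoring 7 and 8 were followed by the outcome, a threshold of 5 gives sensitivity 1.0, specificity 0.5 and PPV 0.5. The low PPVs reported in the study reflect how rare death within 24 h is at the level of individual observation sets.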
Our sub-analysis using multiple imputation to deal with missing values gave similar results (appendix A4).
We calculated efficiency curves (see appendix A2) to compare the efficiency of NEWS and NEWS2. The curves demonstrate that, for the few patients with documented T2RF, the use of NEWS2 at the suggested RCPL cut-offs of 5 and 7 points 12 reduces absolute staff workload by approximately 11% and 5% respectively, but at the expense of reduced sensitivity of approximately 10% and 14%, respectively. For patients at risk of T2RF, the use of NEWS2 at the suggested RCPL cut-offs of 5 and 7 points 13 does not significantly decrease staff workload, but reduces sensitivity by 5-6%. Finally, if used in error for patients not at risk of T2RF at the suggested RCPL cut-offs, NEWS2 is slightly more sensitive than NEWS but, to achieve this, risks doubling the workload.
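The efficiency curves themselves are defined in appendix A2, which is not reproduced here. As a rough illustration of the idea, the sketch below assumes that each point on the curve pairs the proportion of observation sets that would trigger at a given threshold (a proxy for staff workload) with the proportion of outcomes captured at that threshold (sensitivity). Both this definition and the function name are our assumptions.

```python
def efficiency_curve(scores, labels, thresholds=range(0, 21)):
    """Return (threshold, workload, sensitivity) for each candidate threshold.

    workload    = fraction of all observation sets with score >= threshold
    sensitivity = fraction of outcome-positive sets with score >= threshold
    The default range 0-20 assumes the maximum aggregate score is 20.
    """
    n = len(scores)
    n_events = sum(labels)
    if n == 0 or n_events == 0:
        raise ValueError("need at least one observation and one outcome")
    points = []
    for threshold in thresholds:
        triggered = [label for score, label in zip(scores, labels) if score >= threshold]
        workload = len(triggered) / n
        sensitivity = sum(triggered) / n_events
        points.append((threshold, workload, sensitivity))
    return points
```

Comparing two scoring systems at the same threshold then reduces to comparing their workload and sensitivity values: a system that lowers workload while also lowering sensitivity, as described above for NEWS2 in documented T2RF, moves down and to the left on such a curve rather than improving on it.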

Main findings
This is the first study to evaluate the performance of NEWS2 in hospitalised patients who have documented T2RF or are at risk of it. For the primary outcome (in-hospital death within 24 h of an observation), NEWS2 demonstrated no improvement in discrimination over NEWS for patients with documented T2RF, although at the suggested RCPL cut-offs of 5 and 7 points the positive predictive values (PPV) were higher for NEWS2 than for NEWS. However, for patients at risk of T2RF, NEWS had superior discrimination and higher PPV compared to NEWS2. When applied to patients not at risk of T2RF (to simulate the impact of using NEWS2 in error in such patients), NEWS2 discriminated less well than NEWS and had lower PPV. Finally, NEWS2 did not improve discrimination for any of the secondary outcomes compared to NEWS.
Modified scores have been suggested to account for chronically altered physiology in patients with respiratory-related conditions [10][11][12] . One of these, CREWS 11 , improved the positive predictive value compared to NEWS in patients with or at risk of T2RF (see appendix A5), but at the expense of decreasing sensitivity for events. However, such approaches challenge the premise that a universal EWS, with its attendant advantages, should be employed throughout hospitals. In NEWS2, assigning lower SpO 2 thresholds together with heuristic weights for the use of supplemental oxygen at higher SpO 2 values reflects the concern of hyperoxia-induced hypercapnic respiratory failure. However, encoding this concern as undertaken in NEWS2 does not improve discrimination in any of the three risk groups of admissions. Given the main purpose of EWS systems is to identify ill or deteriorating patients, the reduced sensitivity introduced by NEWS2 in patients with documented T2RF and those at risk of it is a disadvantage compared to NEWS. This reduced sensitivity could be ameliorated to an extent by reducing the trigger values for NEWS2, but this would increase staff workload, whilst also introducing further complexity.
The performance of NEWS in this study is similar to that of the original derivation study for NEWS (AUROC, 0.89) 22 supporting previous external evaluations of the scoring system 32,33 (see appendix A3 in Supplementary material, which describes the results considering admissions to each trust, separately).

Strengths
This study focuses on the patient groups for which the new SpO2 scoring "scale" in NEWS2 was intended. Robust electronic data capture allowed us to identify groups of patients admitted with/at risk of T2RF; this has not previously been undertaken. Unlike previous studies 32 , our study includes vital signs taken throughout the patient's hospital journey. The additional analyses, and our adherence to the TRIPOD statement, further strengthen the findings of our study, promoting both clarity and interpretability.

Limitations
Our study relies on diagnostic codes and records of oxygen prescription to categorise patients with/at risk of T2RF, so patients could have been missed or misclassified. However, diagnostic coding for COPD has been shown to be relatively reliable 34 , suggesting that this approach to identifying those at risk of T2RF may also be reliable. In the case of oxygen prescriptions, the prescribing clinician's assessment of whether or not the patient is a "carbon dioxide retainer" is recorded, and it seems likely that the same assessment would underlie the choice of SpO2 scale used. Our database does not include documentation of "new confusion", which is now recommended to be part of the assessment of consciousness for NEWS2 12 ; hence, we could not take account of this in our analysis.
Nevertheless, as new confusion was not part of NEWS, our study clearly demonstrates the effect of the differences in SpO2 scales between the two systems for patients with T2RF. Moreover, the absence of this component is unlikely to affect the risk groups differently. By analysing each vital sign set as independently associated with outcome (allowing comparison with previous NEWS publications 22,24 ) we run the risk of over-representation of some patient groups. However, previous work 35 suggests that allowing an outcome to be represented only once has little effect on assessed outcomes. Evaluation of the secondary outcomes (cardiac arrest and unanticipated ICU admission) in the documented T2RF group should be interpreted with caution given the small number of outcomes (<100).

Table 2 - Performance metrics of the two scoring systems (NEWS and NEWS2) for predicting the primary outcome in the three risk groups: the area under the receiver operating characteristic curve (AUROC), with 95% confidence interval (CI), and sensitivity, specificity and positive predictive value at thresholds of 5 and 7. The fourth column (NEWS - NEWS2) indicates the mean difference (95% CI) between the AUROCs of NEWS and NEWS2. T2RF denotes type II respiratory failure.

Implications
We could find no performance benefit of NEWS2 in any diagnostic group compared to NEWS. If used in error in patients not at risk of T2RF, NEWS2 generally reduces discrimination compared to NEWS. Using NEWS2 instead of NEWS for patients with or at risk of T2RF reduces sensitivity for detecting patients with adverse outcomes. Sensitivity could be improved by reducing the trigger values for NEWS2, but this would also increase staff workload. The recent endorsement by the RCPL and NHS England of the use of NEWS2 without underpinning evidence makes our study both important and urgent. Implementing NEWS2 requires additional staff training and new multi-coloured charts, both of which are likely to be costly. Introducing NEWS2 may also have unexpected clinical consequences, some of which may have a financial impact.
Applying the same "normal range" to patients with chronically abnormal physiology (e.g. COPD or heart failure) is a compelling criticism of using a single early warning score (EWS). It is certainly at odds with the interpretation of individual vital signs in clinical practice. However, the possible advantage of condition-specific scores needs to be balanced against the simplicity of a single system. Applying different scores also creates a more complex protocol and observation chart, potentially increasing staff workload 36,37 . Ultimately, increasing score complexity has to be shown to improve performance for it to be worthwhile.
Our study shows that the modifications made in NEWS2 (specifically, the alternative SpO2 scale), which increase chart complexity, are not likely to improve the detection of deterioration and/or reduce false alarms in patients with chronic respiratory disease.

Conclusion
For patients at risk of, or with documented, T2RF, the changes proposed in NEWS2 do not improve the detection of adverse outcomes, including in-hospital death, unanticipated ICU admission, and cardiac arrest. The intent to account for known physiological differences in patients with chronic respiratory failure is laudable, as are the recommended improvements in the chart for recording oxygen prescriptions. However, the relationship between SpO2 values, oxygen therapy and the risk of adverse outcomes should be studied further before wide-scale adoption of NEWS2. In the interim, a more appropriate alternative to changing the weighting system for NEWS might be to modify the clinical care escalation protocol and response to triggering 38 .

Funding
This publication presents independent research commissioned by the Health Innovation Challenge Fund (HICF-R9-524; WT-103703/Z/14/Z), a parallel funding partnership between the Department of Health and Wellcome Trust. The views expressed in this publication are those of the authors and not necessarily those of the Department of Health or Wellcome Trust. PJW is supported by the National Institute for Health

Contributors
Study design: PJW, GSC, SG, JM, PES, GBS, DP; data preparation: OR, MAFP, DP; data analysis: OR, MAFP, DP, SG; data interpretation and writing up of the protocol and paper: all authors contributed.