1 BMJ BMA House, Tavistock Square, London WC1H 9JR, UK
2 London School of Hygiene & Tropical Medicine London WC1E 7HT, UK
Correspondence to: Dr Sara Schroter sschroter{at}bmj.com
| SUMMARY |
|---|
Design 607 peer reviewers at the BMJ were randomized to two intervention groups receiving different types of training (face-to-face training or a self-taught package) and a control group. Each reviewer was sent the same three test papers over the study period, each of which had nine major and five minor methodological errors inserted.
Setting BMJ peer reviewers.
Main outcome measures The quality of review, assessed using a validated instrument, and the number and type of errors detected before and after training.
Results The number of major errors detected varied over the three papers. The interventions had small effects. At baseline (Paper 1) reviewers found an average of 2.58 of the nine major errors, with no notable difference between the groups. The mean number of errors reported was similar for the second and third papers, 2.71 and 3.0, respectively. Biased randomization was the error detected most frequently in all three papers: over 60% of the reviewers who rejected the papers identified this error. Reviewers who did not reject the papers found fewer errors, and the proportion of them finding biased randomization was less than 40% for each paper.
Conclusions Editors should not assume that reviewers will detect most major errors, particularly those concerned with the context of the study. Short training packages have only a slight impact on improving error detection.
| Introduction |
|---|
Three studies have reported on the rate of detection of errors by reviewers. The first used two fictitious reports submitted to all reviewers of a general medical journal and found that reviewers missed many of the deliberate errors in the manuscripts. A second study introduced 10 major and 13 minor errors in a manuscript and distributed it to 262 reviewers of the Annals of Emergency Medicine. Reviewers failed to identify two thirds of the major errors and about 7% recommended acceptance. The third study reported that, on average, reviewers detected only two out of eight areas of weakness in a modified paper.
We conducted a single-blind randomized controlled trial (RCT) on the effect of training on the performance of peer reviewers of a general medical journal, the BMJ. Reviewers were randomized to one of three groups (control, face-to-face training and self-taught) and invited to review three manuscripts during the study period. The training package focused on what editors want from reviewers and how to critically appraise RCTs. For all groups, we inserted nine major and five minor methodological errors into each manuscript before sending the papers out for review. The authors of the original manuscripts gave their consent for the insertion of errors and for the use of the manuscripts in the trial. The quality of review, assessed using a validated instrument, was the primary outcome measure and the number of major errors detected was secondary. The objective of this paper is to report the frequency with which the nine major and five minor errors were detected and the impact that training had on each of the 14 errors studied. As the methods of the trial and primary results have previously been reported, they are described only briefly in this paper. In this paper the data from the RCT are used as observational data.
| Methods |
|---|
Participants
We performed a power calculation (reported in the previous paper) based on our primary outcome measure, review quality, and estimated that 190 reviewers were needed in each group. All BMJ reviewers (n=1256) resident in the UK who had reviewed at least one paper between January 1999 and February 2001 were invited to take part. No exclusion criteria were applied, other than non-residence in the UK.
Consenting reviewers were randomized into three groups (two intervention groups and a control group) using a stratified permuted blocks randomization method. Previous studies identified several factors that affect review quality, so the stratification was based on age, current involvement as an investigator in medical research projects, postgraduate training in epidemiology, postgraduate training in statistics, and membership of an editorial board of a scientific or medical journal.
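The allocation step can be illustrated with a minimal Python sketch of stratified permuted-block randomization. The block size, the seed, the group labels and the representation of a stratum as a tuple of the stratification factors are assumptions made for illustration; these details are not reported here, and the sketch is not the allocation program used in the trial.

```python
import random
from collections import defaultdict

# Hypothetical labels for the three trial arms.
GROUPS = ("control", "face_to_face", "self_taught")


def stratified_block_randomize(reviewers, block_size=6, seed=0):
    """Allocate reviewers to GROUPS with permuted blocks inside each stratum.

    reviewers: list of (reviewer_id, stratum) pairs, where stratum is any
    sortable, hashable key (for example a tuple of the stratification factors).
    block_size: assumed value; must be a multiple of the number of groups.
    """
    if block_size % len(GROUPS) != 0:
        raise ValueError("block_size must be a multiple of the number of groups")
    rng = random.Random(seed)

    # Collect reviewer ids per stratum, preserving the order of arrival.
    by_stratum = defaultdict(list)
    for reviewer_id, stratum in reviewers:
        by_stratum[stratum].append(reviewer_id)

    allocation = {}
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        for start in range(0, len(ids), block_size):
            # Each block holds every group equally often, in random order.
            block = list(GROUPS) * (block_size // len(GROUPS))
            rng.shuffle(block)
            for reviewer_id, group in zip(ids[start:start + block_size], block):
                allocation[reviewer_id] = group
    return allocation
```

Within every complete block each group appears equally often, so group sizes stay balanced inside each stratum; only a final incomplete block can introduce a small imbalance.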
Assessments and procedures
Three previously published papers each describing an RCT on a general medical subject were selected for use in this study, and the authors and journal editors were contacted for permission to use them. The papers selected described studies evaluating the effects of discharge summaries, personalized computer-generated health records, and patients holding their own records (i.e. they were general articles). Papers describing RCTs were specifically chosen as they usually provide more structure for review than other research designs. The names of the original authors were removed and the titles of the manuscripts and references to study locations were changed.
Deliberate errors were introduced into the first test paper. To determine the level of difficulty of the errors inserted, we piloted the paper on a sample of three editors and two epidemiology postgraduate students. The paper was subsequently modified to exclude the errors not detected by any of the sample reviewers, and the remaining errors were classified individually as major (nine) or minor (five) by members of the research team (NB, SS, RS, FG). Where two people indicated major and two minor, the difficulty of the error was discussed as a group until a consensus was reached. Similar errors, in terms of type and level of difficulty, were then inserted in the other two test papers.
The nine major errors focused on methodological weaknesses, inaccurate reporting of data and unjustified conclusions, while the five minor errors focused on omissions and inaccurate reporting of data. As a result of extensive editing of the manuscripts to insert the deliberate errors, there were some additional unintended inconsistencies in the papers. Whilst these were reported by many reviewers, we considered only the identification of the 14 deliberate errors. One major error (unknown reliability and validity of outcome measure) was introduced into each paper to act as a control; that is, no training was provided on it and we did not expect to see improvement in the detection of this error.
Reviewers were sent the manuscripts in a style similar to the standard BMJ review process, but were told these papers were part of the study and were not paid for the reviews. Reviewers were asked to review the papers within three weeks and were sent the standard BMJ guidance for reviewers (see bmj.com for details) and a prepaid return envelope. Reminders were sent to increase response rates.
Outcome measure: number of deliberate errors detected
The number of major and minor errors reported in each review was assessed independently by two researchers (SS and LO) blind to the identity and study group of the reviewer. A strict marking scheme was used; an identification of error was only counted if there was a clear statement describing the error and explaining the problem, so that the review would be of practical use to the authors and the editor. One point was allocated for each error if the reviewer clearly identified the error and half a point was given if there was some evidence that the error had been identified. If the reviewer returned the manuscript, it was also checked for comments indicating the identification of an error. A point was only awarded if the error had been clearly identified – the underlining of text on the manuscript alone was not considered sufficient.
Statistical analysis
The intra-class correlation coefficient was used to assess the level of agreement between raters for each error and for the total error score. Generally, values >0.70 are considered acceptable for the intra-class correlation coefficient.
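As an illustration of the agreement statistic, the following sketch computes a one-way random-effects intra-class correlation, often written ICC(1,1), from the scores two raters gave to the same reviews. Which form of the coefficient was used in the trial is not stated here, so the choice of the one-way form, the function name and the input layout are all assumptions.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC for n rated items and k raters.

    ratings: list of rows, one row per review, each row holding the k scores
    given to that review (k must be the same for every row).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two rated items")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ratings have no variance; ICC is undefined")
    return (ms_between - ms_within) / denominator
```

With this form, identical scores from both raters on items that differ from one another give a coefficient of 1, and the result can be compared against the 0.70 threshold mentioned above.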
To calculate the percentage of reviewers reporting each error, half points were rounded to full points for each rater and an average of the two raters' scores was calculated. Scores were then rounded again (0=0, 0.5=1, 1=1) so that if at least one rater indicated that the reviewer had identified the error, the reviewer was given a mark.
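The two-stage rounding rule described above can be written out as a short Python sketch. The function names and the representation of each review as a pair of rater scores are assumptions for illustration; only the rounding rule itself comes from the text.

```python
import math

VALID_SCORES = (0, 0.5, 1)


def detected(rater_a, rater_b):
    """Return 1 if at least one rater credited the reviewer with the error."""
    for score in (rater_a, rater_b):
        if score not in VALID_SCORES:
            raise ValueError("scores must be 0, 0.5 or 1")
    # Step 1: round each rater's half point up to a full point.
    rounded = [math.ceil(score) for score in (rater_a, rater_b)]
    # Step 2: average the two raters, giving 0, 0.5 or 1.
    mean = sum(rounded) / 2
    # Step 3: round again (0 -> 0, 0.5 -> 1, 1 -> 1).
    return math.ceil(mean)


def percent_detecting(score_pairs):
    """Percentage of reviewers counted as having reported a given error."""
    if not score_pairs:
        raise ValueError("no reviews supplied")
    hits = sum(detected(a, b) for a, b in score_pairs)
    return 100.0 * hits / len(score_pairs)
```

For example, `percent_detecting([(0, 0), (0.5, 0), (1, 1), (0, 0)])` counts the second and third reviewers as having found the error and returns 50.0.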
| Results |
|---|
| Discussion |
|---|
The poor detection rate we observed was not due to over-demanding expectations of reviewers. For example, we classified inconsistencies between text and tables as only a minor error, despite this being an important issue that needs to be picked up somewhere in the review process as it has an impact on the readability of the manuscript and the understanding of the results.
The detection rate varied between the nine major errors. Those most likely to be detected (>50%) related to the sampling and randomization techniques. In contrast, those least likely to be detected (<30%) related to the analysis of data and inconsistencies in the reporting of results. Baxt et al. reported similar findings: 68% of the reviewers in their study did not realize that the conclusions of the work were not supported by the results. We found that whilst many reviewers acknowledged that the conclusions went beyond the results, about 40% failed to report that the authors had extrapolated their results to areas of care not studied.
Training led to some improvements in error detection. Broadly speaking, the errors that were detected more frequently after training were those to do with technical aspects such as the response rate, randomization procedure and sample size calculation. Areas in which little or no improvement occurred were to do with putting the study in context, both in terms of the pre-existing literature and in terms of the implications of the findings for policy or practice. The dramatic improvement in detection of the control error is partly explained by the low level of detection in the pre-intervention paper (18%). However, it also suggests that the Hawthorne effect may have contributed to improvements in detection of several errors.
Strengths and weaknesses of study
There are several limitations to our study to consider when drawing any lessons or implications:
Relationship to other studies
Two other studies looking at the effects of peer review have found limited improvements in manuscript quality. A recent study comparing the quantity and quality of data tables and figures in reports of RCTs submitted to the BMJ and subsequently published in peer-reviewed journals found peer review to be limited in improving the presentation of data. BMJ external peer reviewers seldom commented on the tables or figures, and the numbers of tables and figures did not change markedly between submission and publication. Goodman et al. found that manuscript quality improved after peer review and editing at Annals of Internal Medicine, but the improvement was modest and there was still substantial room for improvement. Aspects that showed the most improvement were discussion of study limitations, acknowledgement and justification of generalizations, appropriateness of the strength and tone of the conclusions, use of confidence intervals and description of the setting. However, due to the study design it was not possible to distinguish the effects of external peer review from those of internal editing.
Implications
The principal implication of our findings, when taken together with the previous studies cited above, is that journal editors should not assume that their reviewers will detect most major flaws in manuscripts. The study paints a rather bleak picture of the effectiveness of peer review. Improvements after training were minor despite our using the types of papers easiest to review for errors, our reviewers being better trained and qualified than those at many smaller journals, and our focusing on technical errors that are easier to detect than more fundamental errors involving flawed assumptions and theoretical models. Clearly, using more than one reviewer may increase the total number of errors detected, though some errors are likely to remain undetected. This may be of no immediate consequence if the major errors that have been detected lead to a decision to reject the manuscript. However, as a principal aim of peer review is to improve the quality of published papers, it seems that it is only partially successful. The improvements after training were trivial and were largest in the technical aspects of review, which could be identified by well-trained journal staff and probably do not require expert professional external reviewers. The shortcomings of peer reviewing that have been revealed in this and other studies cannot be easily resolved by short training interventions: the small effects observed were not worth the resources and time required for training. The question remains as to how best to improve the peer review process.
| Footnotes |
|---|
DECLARATIONS
Competing interests FG is the editor of the BMJ, SS is a senior researcher for the BMJ, RS is the former editor of the BMJ and NB, SE, and SS review for the BMJ
Funding This study was funded by the NHS London Regional Office Research & Development Directorate. The views and opinions expressed in this paper do not necessarily reflect those of NHSE (LRO) or the Department of Health
Ethical approval The ethics committee of the London School of Hygiene and Tropical Medicine approved the study
Guarantor SS
Contributorship NB and RS initiated the study; SS, NB, RS, and FG designed it; NB created the test papers; SS conducted the study; SS and SE did the data analysis; and SS, NB and SE interpreted the results. All authors assisted in writing the paper
| Acknowledgements |
|---|
| References |
|---|