JRSM

J R Soc Med 2008;101:507-514
doi:10.1258/jrsm.2008.080062
© 2008 Royal Society of Medicine


What errors do peer reviewers detect, and does training improve their ability to detect them?

Sara Schroter, Nick Black, Stephen Evans, Fiona Godlee, Lyda Osorio, Richard Smith

1 BMJ, BMA House, Tavistock Square, London WC1H 9JR, UK
2 London School of Hygiene & Tropical Medicine, London WC1E 7HT, UK

Correspondence to: Dr Sara Schroter sschroter{at}bmj.com


    SUMMARY
Objective To analyse data from a trial and report the frequencies with which major and minor errors are detected at a general medical journal, the types of errors missed and the impact of training on error detection.

Design 607 peer reviewers at the BMJ were randomized to two intervention groups receiving different types of training (face-to-face training or a self-taught package) and a control group. Each reviewer was sent the same three test papers over the study period, each of which had nine major and five minor methodological errors inserted.

Setting BMJ peer reviewers.

Main outcome measures The quality of review, assessed using a validated instrument, and the number and type of errors detected before and after training.

Results The number of major errors detected varied over the three papers. The interventions had small effects. At baseline (Paper 1) reviewers found an average of 2.58 of the nine major errors, with no notable difference between the groups. The mean number of errors reported was similar for the second and third papers, 2.71 and 3.05, respectively. Biased randomization was the error detected most frequently in all three papers, with over 60% of the reviewers who rejected the papers identifying this error. Reviewers who did not reject the papers found fewer errors, and the proportion finding biased randomization was less than 40% for each paper.

Conclusions Editors should not assume that reviewers will detect most major errors, particularly those concerned with the context of the study. Short training packages have only a slight impact on improving error detection.


    Introduction
Peer reviewers are responsible for improving the quality of manuscripts to be published and ‘should weed out serious methodological errors’. Despite the use of peer review, errors, inconsistencies and methodological weaknesses are commonly found in published medical research, and peer review has been criticized as being an ineffective and expensive procedure.

Three studies have reported on the rate of detection of errors by reviewers. The first used two fictitious reports submitted to all reviewers of a general medical journal and found that reviewers missed many of the deliberate errors in the manuscripts. A second study introduced 10 major and 13 minor errors in a manuscript and distributed it to 262 reviewers of the Annals of Emergency Medicine. Reviewers failed to identify two thirds of the major errors and about 7% recommended acceptance. The third study reported that, on average, reviewers detected only two out of eight areas of weakness in a modified paper.

We conducted a single-blind randomized controlled trial (RCT) on the effect of training on the performance of peer reviewers of a general medical journal, the BMJ. Reviewers were randomized to one of three groups (control, face-to-face training and self-taught) and invited to review three manuscripts during the study period. The training package focused on what editors want from reviewers and how to critically appraise RCTs. For all groups, we inserted nine major and five minor methodological errors into each manuscript before sending the papers out for review. The authors of the original manuscripts gave their consent for the insertion of errors and their use in the trial. The quality of review, assessed using a validated instrument, was the primary outcome measure and the number of major errors detected was secondary. The objective of this paper is to report the frequency with which the nine major and five minor errors were detected and the impact that training had on each of the 14 errors studied. As the methods of the trial and primary results have previously been reported, they are described only briefly in this paper. In this paper the data from the RCT are used as observational data.


    Methods
The trial was approved by the London School of Hygiene & Tropical Medicine ethics committee. On invitation to take part in the study, participants were asked to give written consent to review three papers for the study and to agree to attend a full day of training if selected to do so.

Participants
We performed a power calculation (reported in the previous paper) based on our primary outcome measure, review quality, and estimated that 190 reviewers were needed in each group. All BMJ reviewers (n=1256) resident in the UK who had reviewed at least one paper between January 1999 and February 2001 were invited to take part. No exclusion criteria were applied, other than non-residence in the UK.

Consenting reviewers were randomized into three groups – two intervention groups and a control group – using a stratified permuted blocks randomization method. Previous studies identified several factors that affect review quality and so the stratification was based on age, being a current investigator in medical research projects, postgraduate training in epidemiology, postgraduate training in statistics, and membership of an editorial board of a scientific or medical journal.
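As an illustration only, stratified permuted blocks randomization can be sketched as follows. The block size, the seed, the function name and the way strata are keyed are all hypothetical; the paper does not report these details, so this is not the allocation procedure actually used in the trial.

```python
import random
from collections import defaultdict

# The three study groups named in the paper.
ARMS = ("control", "self-taught", "face-to-face")

def stratified_permuted_blocks(reviewers, stratum_of, block_size=6, seed=0):
    """Assign each reviewer to an arm using permuted blocks within strata.

    reviewers  : iterable of hashable reviewer identifiers
    stratum_of : function mapping a reviewer to a hashable stratum key
    block_size : hypothetical value; must be a multiple of the number of arms
    """
    if block_size % len(ARMS) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    pending = defaultdict(list)  # stratum -> assignments left in the current block
    allocation = {}
    for reviewer in reviewers:
        key = stratum_of(reviewer)
        if not pending[key]:
            # Start a new block with equal numbers of each arm, in random order.
            block = list(ARMS) * (block_size // len(ARMS))
            rng.shuffle(block)
            pending[key] = block
        allocation[reviewer] = pending[key].pop()
    return allocation
```

Within each stratum the arms stay balanced after every completed block, which is the property that makes this design attractive when several prognostic factors are known in advance.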

Assessments and procedures
Three previously published papers each describing an RCT on a general medical subject were selected for use in this study, and the authors and journal editors were contacted for permission to use them. The papers selected described studies evaluating the effects of discharge summaries, personalized computer-generated health records, and patients holding their own records (i.e. they were general articles). Papers describing RCTs were specifically chosen as they usually provide more structure for review than other research designs. The names of the original authors were removed and the titles of the manuscripts and references to study locations were changed.

Deliberate errors were introduced into the first test paper. To determine the level of difficulty of the errors inserted, we piloted the paper on a sample of three editors and two epidemiology postgraduate students. The paper was subsequently modified to exclude the errors not detected by any of the sample reviewers, and the remaining errors were classified individually as major (nine) or minor (five) by members of the research team (NB, SS, RS, FG). Where two people indicated major and two minor, the difficulty of the error was discussed as a group until a consensus was reached. Similar errors, in terms of type and level of difficulty, were then inserted in the other two test papers.

The nine major errors focused on methodological weaknesses, inaccurate reporting of data and unjustified conclusions, while the five minor errors focused on omissions and inaccurate reporting of data (Table 1). As a result of severe editing of the manuscripts to insert the deliberate errors, there were some additional unintended inconsistencies in the papers. Whilst these were reported by many reviewers, we considered only the identification of the 14 deliberate errors. One major error (unknown reliability and validity of outcome measure) was introduced into each paper to act as a control – that is, no training was provided and we did not expect to see improvement in the detection of this error.


 
Table 1. Descriptions of 14 deliberate errors

 
All consenting reviewers were asked to review the first paper. After this baseline assessment, one intervention group received a full day of face-to-face training and the other intervention group was mailed a self-taught training package. Details of the training are described in a previous publication. Reviewers who completed the first review were sent the second paper to review two to three months after the intervention; the third paper was sent approximately six months later if they completed the second review.

Reviewers were sent the manuscripts in a style similar to the standard BMJ review process, but were told these papers were part of the study and were not paid for the reviews. Reviewers were asked to review the papers within three weeks and were sent the standard BMJ guidance for reviewers (see bmj.com for details) and a prepaid return envelope. Reminders were sent to increase response rates.

Outcome measure: number of deliberate errors detected
The number of major and minor errors reported in each review was assessed independently by two researchers (SS and LO) blind to the identity and study group of the reviewer. A strict marking scheme was used; an identification of error was only counted if there was a clear statement describing the error and explaining the problem, so that the review would be of practical use to the authors and the editor. One point was allocated for each error if the reviewer clearly identified the error and half a point was given if there was some evidence that the error had been identified. If the reviewer returned the manuscript, it was also checked for comments indicating the identification of an error. A point was only awarded if the error had been clearly identified – the underlining of text on the manuscript alone was not considered sufficient.

Statistical analysis
The intra-class correlation coefficient was used to assess the level of agreement between raters for each error and for the total error score. Generally, values >0.70 are considered acceptable for the intra-class correlation coefficient.
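The paper does not say which form of the intra-class correlation coefficient was computed. As a minimal sketch, assuming the simplest one-way random-effects form with one score per rater per review, the calculation could look like this (the function name and input layout are illustrative):

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings: list of equal-length sequences, one per review,
             holding one score per rater (e.g. 0, 0.5 or 1).
    Note: undefined (division by zero) if every score is identical.
    """
    n = len(ratings)      # number of reviews rated
    k = len(ratings[0])   # number of raters per review
    grand = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]
    # Between-review and within-review mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

When the two raters agree on every review the within-review mean square is zero and the coefficient equals 1; systematic disagreement pushes it towards zero or below.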

To calculate the percentage of reviewers reporting each error, half points were rounded to full points for each rater and an average of the two raters' scores was calculated. Scores were then rounded again (0=0, 0.5=1, 1=1) so that if at least one rater indicated that the reviewer had identified the error, the reviewer was given a mark.
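The two-stage rounding rule described above can be written out as a short sketch. The function names are illustrative; the logic follows the text: half points are rounded up for each rater, the two raters are averaged, and the average is rounded again so that credit from at least one rater counts as a detection.

```python
def error_detected(rater_a, rater_b):
    """Return 1 if at least one rater gave any credit (0.5 or 1), else 0."""
    allowed = (0, 0.5, 1)
    if rater_a not in allowed or rater_b not in allowed:
        raise ValueError("scores must be 0, 0.5 or 1")
    # Step 1: round each rater's half point up to a full point.
    a = 1 if rater_a >= 0.5 else 0
    b = 1 if rater_b >= 0.5 else 0
    # Step 2: average the two raters (gives 0, 0.5 or 1).
    mean = (a + b) / 2
    # Step 3: round again with 0 -> 0, 0.5 -> 1, 1 -> 1.
    return 1 if mean >= 0.5 else 0

def percent_reporting(pairs):
    """Percentage of reviews in which the error counts as detected."""
    marks = [error_detected(a, b) for a, b in pairs]
    return 100 * sum(marks) / len(marks)
```

Because a single rater's half point is enough, this rule is more lenient than the strict marking scheme used for the mean error scores, so the percentages are best read as an upper bound on detection.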


    Results
Reviewer characteristics
Five hundred and twenty-two (86%) of the 607 reviewers randomized completed a review of the first paper, 440 of 522 (84%) completed the second, and 418 of 440 (95%) completed the third. The self-reported characteristics of the reviewers in terms of age, sex, postgraduate experience in statistics and/or epidemiology, current research investigator and member of a journal editorial board are shown in Table 2. Characteristics were similar for reviewers completing each of the papers.
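The completion percentages above are stepwise: each is calculated against the number who completed the previous stage, not against the 607 randomized. A small sketch of that arithmetic, using only the counts reported in the text (the variable and function names are illustrative):

```python
randomized = 607
completed = {"Paper 1": 522, "Paper 2": 440, "Paper 3": 418}

def stepwise_completion(start, counts):
    """Return (paper, completed, eligible, percent) using the previous stage as denominator."""
    rows = []
    previous = start
    for paper, n in counts.items():
        rows.append((paper, n, previous, round(100 * n / previous)))
        previous = n
    return rows

for paper, n, eligible, pct in stepwise_completion(randomized, completed):
    print(f"{paper}: {n} of {eligible} ({pct}%)")
```

The same counts imply that 418 of the 607 randomized reviewers completed all three reviews; that cumulative figure is derived here and is not stated in the paper.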


 
Table 2. Characteristics of reviewers completing each review

 
Reliability of ratings
A good level of agreement was reached between the two independent raters for the assessment of the reporting of individual errors (Table 3). The intra-class correlation coefficients were >0.70 for each error when averaged across the three papers. An intra-class correlation coefficient >0.90 for the nine-item total major error score reflects excellent agreement.


 
Table 3. Agreement for assessment of error reporting in all three papers

 
Detection of errors
For all groups combined (control, self-taught, face-to-face) the average number of the nine major errors reported was 2.58 (standard deviation [SD] 1.9) in Paper 1, 2.71 (SD 1.6) in Paper 2 and 3.05 (SD 1.8) in Paper 3. The average number of the five minor errors reported was 0.91 (SD 0.8) in Paper 1, 0.85 (SD 0.8) in Paper 2 and 1.09 (SD 0.8) in Paper 3. Table 4 shows the data for the combined group and for each study group.


 
Table 4. Mean (SD) errors identified by group for each paper

 
Table 5 shows the proportion of reviewers reporting each error for each paper by study group. The detection of errors was relatively consistent across papers. Overall, the errors most frequently reported were biased randomization procedure and no explanations for ineligible or non-randomized cases. The least often reported errors were word reversal, no mention of a Hawthorne effect (a temporary change in behaviour – typically an improved response – in response to altered environmental conditions), and inconsistency between text and tables. There was consistency between the three groups in the errors detected (i.e. the interventions had little effect on the detection of errors).


 
Table 5. Proportion of reviewers identifying each error by group for the three papers

 
Figures 1A and 1B show the proportion of reviewers reporting each error, labelled with a number from 1 to 14 (the order based on frequency of reporting, shown separately for major and minor errors), for those who recommended rejection of Paper 1 and those who did not recommend rejection, respectively. Figures 1C and 1D show these proportions for Paper 2, and Figures 1E and 1F for Paper 3. The proportion of reviewers reporting each error in each paper was higher for reviewers recommending rejection than for those who did not. For each of the three papers, over 60% of the reviewers who recommended rejection reported that the randomization procedure was biased [error 1]. Other errors frequently reported by those rejecting the papers included inadequate reporting of ineligible or non-randomized cases [error 10] (58% averaged across the three papers), a poor response rate [error 4] (48%) and unjustified conclusions [error 3] (46%). Whilst these same errors were those most frequently reported by reviewers not recommending rejection, the proportions were considerably lower (34, 32, 29 and 35%, respectively).



 
Figure 1. Proportion of reviewers identifying each error for those who did and did not recommend rejection of each paper

A: Reviewers rejecting Paper 1 (n=335); B: Reviewers not rejecting Paper 1 (n=156); C: Reviewers rejecting Paper 2 (n=346); D: Reviewers not rejecting Paper 2 (n=71); E: Reviewers rejecting Paper 3 (n=325); F: Reviewers not rejecting Paper 3 (n=74)

Errors: 1, Biased randomization procedure; 2, Inconsistent denominator; 3, Unjustified conclusions; 4, Poor response rate; 5, Poor justification; 6, Discrepancy between abstract & results; 7, ITT would be appropriate; 8, No sample size calculation; 9, Unknown reliability & validity; 10, No explanation of drop outs; 11, No ethics approval; 12, Hawthorne effect; 13, Word reversal; 14, Inconsistency between text & tables

 

    Discussion
Principal findings
On average, reviewers reported only three out of nine major errors in their reviews, with almost a quarter of the reviewers reporting one or fewer. This is similar to the findings of two previously reported studies. Baxt et al. found reviewers failed to identify two thirds of the major errors in a manuscript and Godlee et al. found the mean number of weaknesses in design, analysis or interpretation commented on was only two out of eight, with only 10% of reviewers identifying four or more areas of weakness and 16% failing to identify any.

The poor detection rate we observed was not due to over-demanding expectations of reviewers. For example, we classified ‘inconsistencies between text and tables’ as only a minor error, despite this being an important issue that needs to be picked up somewhere in the review process as it has an impact on the readability of the manuscript and the understanding of the results.

The detection rate varied between the nine major errors. Those most likely to be detected (>50%) related to the sampling and randomization techniques. In contrast, those least likely to be detected (<30%) related to the analysis of data and inconsistencies in the reporting of results. Baxt et al. reported similar findings: 68% of the reviewers in their study did not realize that the conclusions of the work were not supported by the results. We found that whilst many reviewers acknowledged that the conclusions went beyond the results, about 40% failed to report that the authors had extrapolated their results to areas of care not studied.

Training led to some improvements in error detection. Broadly speaking, the errors that were detected more frequently after training were those to do with technical aspects such as the response rate, randomization procedure and sample size calculation. Areas in which little or no improvement occurred were to do with putting the study in context, both in terms of the pre-existing literature and in terms of the implications of the findings for policy or practice. The dramatic improvement in detection of the control error is partly explained by the low level of detection in the pre-intervention paper (18%). However, it also suggests that the Hawthorne effect may have contributed to improvements in detection of several errors.

Strengths and weaknesses of study
There are several limitations to our study to consider when drawing any lessons or implications:

Relationship to other studies
Two other studies looking at the effects of peer review have found limited improvements in manuscript quality. A recent study comparing the quantity and quality of data tables and figures in reports of RCTs submitted to the BMJ and subsequently published in peer-reviewed journals found peer review to be limited in improving the presentation of data. BMJ external peer reviewers seldom commented on the tables or figures and the numbers of tables and figures did not change markedly between submission and publication. Goodman et al. found manuscript quality improved after peer review and editing at Annals of Internal Medicine, but that improvement was modest and there was still substantial room for improvement. Aspects that showed the most improvement were discussion of study limitations, acknowledgement and justifications of generalizations, appropriateness of the strength and tone of the conclusions, use of confidence intervals and description of the setting. However, due to the study design it was not possible to distinguish the effects of external peer review from internal editing.

Implications
The principal implication of our findings, when taken together with the previous studies cited above, is that journal editors should not assume that their reviewers will detect most major flaws in manuscripts. The study paints a rather bleak picture of the effectiveness of peer review. Improvements after training were minor despite using the types of papers easiest to review for errors, our reviewers being better trained and qualified than those at many smaller journals, and despite focusing on technical errors that are easier to detect than more fundamental errors involving flawed assumptions and theoretical models. Clearly, using more than one reviewer may increase the total number of errors detected, though some errors are likely to remain undetected. This may be of no immediate consequence if the major errors which have been detected lead to a decision to reject the manuscript. However, as a principal aim of peer review is to improve the quality of published papers, it seems that it is only partially successful. The improvements after training were trivial and were largest in the technical aspects of review, which could be identified by well-trained journal staff and probably do not require expert professional external reviewers. The shortcomings of peer reviewing that have been revealed in this and other studies cannot be easily resolved by short training interventions: the small effects observed were not worth the resources and time required for training. The question remains as to how best to improve the peer review process.


    Footnotes
 

DECLARATIONS
Competing interests FG is the editor of the BMJ, SS is a senior researcher for the BMJ, RS is the former editor of the BMJ and NB, SE, and SS review for the BMJ

Funding This study was funded by the NHS London Regional Office Research & Development Directorate. The views and opinions expressed in this paper do not necessarily reflect those of NHSE (LRO) or the Department of Health

Ethical approval The ethics committee of the London School of Hygiene and Tropical Medicine approved the study

Guarantor SS

Contributorship NB and RS initiated the study; SS, NB, RS, and FG designed it; NB created the test papers; SS conducted the study; SS and SE did the data analysis; and SS, NB and SE interpreted the results. All authors assisted in writing the paper


    Acknowledgements
We thank all the reviewers who participated, the editors who assisted with the face-to-face training and the Critical Appraisal Skills Programme (CASP) team, the authors of the original manuscripts for allowing us to use them, and Joe Kim for his help with the graphics


    References

  1. Altman DG. Poor-quality medical research. What can journals do? JAMA 2002;287:2765–7

  2. Altman DG. Statistics in medical journals. Stat Med 1982;1:59–71

  3. Andersen B. Methodological errors in medical research. Oxford: Blackwell; 1990

  4. Altman DG. The scandal of poor medical research. BMJ 1994;308:283–4

  5. Smith R. Peer review: Reform or revolution? BMJ 1997;315:759–60

  6. Jefferson T, Alderson P, Wager E, Davidoff F. Effects of peer review: A systematic review. JAMA 2002;287:2784–6

  7. Nylenna M, Riis P, Karlsson Y. Multiple blinded reviews of the same two manuscripts. JAMA 1994;272:149–51

  8. Baxt WG, Waeckerle JF, Berlin JA, Callaham ML. Who reviews the reviewers? Feasibility of using a fictitious manuscript to evaluate peer reviewer performance. Ann Emerg Med 1998;32:310–7

  9. Godlee F, Gale CR, Martyn CN. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomised controlled trial. JAMA 1998;280:237–40

  10. Schroter S, Black N, Evans S, et al. Effects of training on the quality of peer review: A randomised controlled trial. BMJ 2004;328:657–8

  11. van Rooyen S, Black N, Godlee F. Development of the Review Quality Instrument (RQI) for assessing peer reviews of manuscripts. J Clin Epidemiol 1999;52:625–9

  12. Evans AT, McNutt RA, Fletcher SW, Fletcher RH. The characteristics of peer reviewers who produce good-quality reviews. J Gen Intern Med 1993;8:422–8

  13. Black N, van Rooyen S, Godlee F, Smith R, Evans S. What makes a good reviewer and a good review in a general medical journal. JAMA 1998;280:231–3

  14. Scientific Advisory Committee for the Medical Outcomes Trust. Assessing health status and quality-of-life instruments: Attributes and review criteria. Qual Life Res 2002;11:193–205

  15. Schriger DL, Sinha R, Schroter S, Liu PY, Altman DG. From submission to publication: a retrospective review of the tables and figures in a cohort of randomised controlled trials submitted to the British Medical Journal. Ann Emerg Med 2006;48:750–6

  16. Goodman SN, Berlin J, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med 1994;121:11–21

  17. Moher D, Schulz KF, Altman DG, for the CONSORT Group. The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomised trials. Ann Intern Med 2001;134:657–62

