The big case register


While introducing their review of case register research over the last 50 years, Munk-Jørgensen and colleagues [1] express possibly overpessimistic opinions about this methodology and its application. For example, it is stated that whilst case registers can demonstrate a phenomenon, such as raised mortality or changes in admission rates or duration, they cannot explain underlying reasons for this. However, case registers are no better or worse in this respect than any other observational research design. Indeed, and as acknowledged by Munk-Jørgensen and colleagues, the data they provide have overriding benefits for some questions and may be the only means of addressing these. It is true that limited data quality or availability may preclude some applications in some registers, but these limitations should not be seen as inevitable or as necessarily a ‘feature’ of the case register design.

Despite the methodological advantages of the randomized controlled trial, observational data remain fundamental to health research, and much of what we know (or assume we know) is derived from observation rather than experiment. For example, John Snow's meticulous study of the geographic distribution of cholera cases in 1854 demonstrated that this fitted a model of waterborne transmission more closely than that of airborne transmission, the dominant paradigm at the time. Through a simple piece of well-designed observational research, essentially a ‘case register’, the understanding of the disorder was significantly advanced, around 30 years before Vibrio cholerae was identified as the underlying cause.

Although they can contribute to aetiological research, case registers are particularly suited to the investigation of the course and outcome of a disorder, as well as allowing intervention response to be evaluated in large, naturalistic samples and settings. As the mainstay study design in these circumstances (because trials and conventional cohort studies are too limited in size and/or representativeness), there is no intrinsic reason why case registers may not also provide important mechanistic information. Taking mental disorders and mortality as an example, we know that much of the reduced life expectancy is not accounted for by violence and suicide, but it would be negligent to assume that the remainder is no more than an expected product of less healthy lifestyles (smoking, obesity etc.). Even if this were to be the case, there are still important questions on whether mortality is raised in all mental disorders or whether at-risk groups can be defined. Are there symptom profiles, medications, or patterns of health service use/engagement which predict these outcomes? Is premature mortality in severe mental illness distributed equally across geographic areas or does this exhibit the spatial patterning that is seen in the general population? Improvement in life expectancy is unlikely ever to be a feasible outcome for a randomized trial, and biological data are unlikely to be relevant to pathways involving environmentally determined health conditions and issues such as access to, or engagement with, healthcare services. The ‘big’ case register remains the only means to answer these sorts of questions and if we bemoan the limitations of the data, it is surely our task to find ways to improve this.

A negative association is often assumed between the size of a database and the depth of the data available. Whilst this is quite probably true at an aggregate level (administrative data tend to be ‘shallow’ and administrative databases tend to be large), it is not inevitable. As argued previously [2], electronic health records (EHRs) in mental healthcare represent data which are both large and deep because, in theory, they contain every piece of information that has been recorded in a clinical service about a person's presentation, symptoms and relevant background history, as well as interventions received and observed outcomes. The challenges lie in accessing this information in a way that is both scientifically useful and ethically and socially acceptable. The fact that these challenges are formidable does not mean that they are insurmountable.

The principal limitation of large case registers lies in the quality and scope of the data most typically available. The first and most widely applied means of addressing this is through data linkage. If a piece of information is not available in mental healthcare data, might that information lie elsewhere in another database? For example, linkages between mental health case registers and cancer registers have allowed important questions to be investigated concerning that particular outcome. We know that cancer mortality is generally higher in people with severe mental disorders, but findings on cancer incidence have been heterogeneous – so a higher risk of developing cancer does not appear to be the sole explanation for a higher overall risk of cancer mortality. Data from mental health and cancer care in Western Australia highlighted both delayed presentation of cancer (i.e. higher risk of metastatic spread at presentation) and reduced likelihood of treatment receipt across mental disorders as a whole [3]. Similar analyses using linked mental health and cancer registers in South London, investigating severe mental illness more specifically, found no difference in spread at presentation, but did indicate higher postdiagnosis mortality despite similar presentations [4]. Taken together, these findings challenge the assumption that improving cancer mortality in people with mental disorders is simply a matter of promoting smoking cessation and lifestyle improvement, worthwhile though these may be; instead they suggest that more effort needs to be put into achieving prompt diagnosis and equitable treatment. Similar disadvantages in treatment receipt have been demonstrated in people with schizophrenia and bipolar disorder following acute myocardial infarction, using linked administrative data from the Taiwanese National Health Insurance Database [5]. These examples at least indicate that case register data do not simply enable associations to be described, but can also be used to tease out underlying pathways.
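As an illustration of how such a linkage might be carried out without exchanging raw identifiers, the following is a minimal sketch in which both registers are joined on a keyed hash of a shared identifier. The field names (`national_id`, `diagnosis`, `stage`), the secret key and the normalization rule are all assumptions made for the example; they do not describe the procedures used in the studies cited above.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be held only by a trusted linkage service.
LINKAGE_KEY = b"example-secret-key"


def pseudonym(identifier: str, key: bytes = LINKAGE_KEY) -> str:
    """Return a keyed hash (HMAC-SHA256) of a normalized identifier."""
    # Normalize so that trivial formatting differences do not block a match.
    normalized = identifier.strip().upper().replace(" ", "")
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link(mental_health_rows, cancer_rows):
    """Join two lists of dicts on the pseudonymized identifier.

    Only research variables are carried into the linked output;
    the raw identifier is never copied across.
    """
    cancer_by_pid = {pseudonym(row["national_id"]): row for row in cancer_rows}
    linked = []
    for row in mental_health_rows:
        match = cancer_by_pid.get(pseudonym(row["national_id"]))
        if match is not None:
            linked.append(
                {"diagnosis": row["diagnosis"], "cancer_stage": match["stage"]}
            )
    return linked
```

Exact matching on a single identifier is the simplest case; where identifiers are missing or inconsistently recorded, probabilistic matching on several fields would be needed instead.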

A second approach to data enhancement lies in the mental health record itself. Although use of EHRs remains patchy internationally, and although there are important issues that need considering around data security and the entry of information in the clinical encounter, it is hard to believe that mental health services in 50 years' time will continue to be paper-based. If EHRs do become the norm, potentially supplemented by patient-reported outcomes in shared records systems, then enormous quantities of data will accrue. For example, in our EHR-sourced ‘CRIS’ case register at the South London and Maudsley NHS Foundation Trust [6], around 250 000 cases are represented from a catchment population of 1.2 million residents and numbers have increased consistently by around 20 000 per year since the case register's development in 2008. Thus, instead of national healthcare datasets containing little more than demographic measures, diagnosis and service contact, case registers of the future could in theory contain all elements of every clinical encounter. The technical challenges in this field are largely solvable: high-performance computing clusters can now process the volumes of data created with acceptable levels of efficiency, and the complex structuring of data fields in a typical EHR can be ‘translated’ into research databases through bespoke processing pipelines which can also strip out identifying information so that the databases are effectively pseudonymized. More challenging is the fact that the most valuable information lies in text rather than structured fields. However, as will be described, rapid advances in natural language processing have been made over the last few years which offer the potential to transform the depth of data available on a large scale and thus the nature of mental health case registers and their application.

Natural language processing describes a variety of techniques through which a computer makes sense of human language, including ‘information extraction’, which involves the automatic generation of predefined structured information from unstructured text. These techniques date back to the late 1970s, but their potential applicability for enhancing data derived from EHRs is only now being appreciated [7]. Recent studies across a range of disciplines have overcome such obstacles as ungrammatical sentence structures (e.g. telegraphic phrases), uses of shorthand, unstandardized abbreviations and misspellings [7-9], and applications in general healthcare have included large-scale extraction of information on treatment and treatment response [10, 11] or diagnoses [12]. To date, there have been relatively few applications in mental health clinical records beyond those used for de-identification purposes; however, early progress includes two recent US studies using these techniques to determine depression outcome [13] and adverse drug events [14], as well as the characterization of diagnostic profiles in a Danish psychiatric case register [15]. In the CRIS case register in south London, having begun with a relatively simple task of extracting routine Mini Mental State Examination scores recorded in text [16], we have extended natural language processing to ascertain characteristics such as smoking status [17] and have begun a programme of work to generate data on symptom profiles, beginning with negative symptoms of schizophrenia [18]. These applications have ‘unlocked’ information which could previously only be obtained by manual review of anonymized case notes in limited samples. The application of emerging techniques such as machine learning has, in our experience, also substantially increased the speed at which new applications can be developed and validated.
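To make the idea of information extraction concrete, the sketch below shows how a score for the Mini Mental State Examination might be pulled out of free text using a simple pattern-matching rule. It is a deliberately simplified illustration, not the method used in the CRIS application [16]: the pattern, the function name and the assumption that scores are written as a number followed by ‘/30’ or ‘out of 30’ are all hypothetical.

```python
import re

# Hypothetical pattern: "MMSE", up to 30 non-digit filler characters,
# then a score written as "24/30" or "24 out of 30".
MMSE_PATTERN = re.compile(
    r"\bMMSE\b[^0-9\n]{0,30}?(\d{1,2})\s*(?:/|out of)\s*30\b",
    re.IGNORECASE,
)


def extract_mmse_scores(note: str) -> list[int]:
    """Return every plausible MMSE score (0 to 30) found in a clinical note."""
    scores = []
    for match in MMSE_PATTERN.finditer(note):
        score = int(match.group(1))
        # Discard values outside the valid range of the instrument.
        if 0 <= score <= 30:
            scores.append(score)
    return scores


if __name__ == "__main__":
    example = "Seen in clinic. MMSE today was 24/30. Previously mmse scored 27 out of 30."
    print(extract_mmse_scores(example))
```

A rule of this kind would miss scores written in other ways (for example, without a denominator) and would need to be validated against manually reviewed notes before its output could be used in research.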

In many respects, therefore, once-formidable technical obstacles have now been overcome, allowing large and detailed case registers to be set up at relatively little cost. The most important challenge remaining is how to develop and use such information in a way that is acceptable to the general public and, most importantly, to the patients whose personal, and often highly sensitive, information forms the database. These considerations have been topical recently in the UK, following controversy over plans to create a national healthcare database (summarized in a recent Nature editorial), and in Europe, with proposed changes to EU data protection law which would have a major negative impact on the use of administrative data for research [19]. Such challenges encompass not only the derived datasets themselves but also procedures around linking databases, where use of identifiers is required to make the linkage but where this can be achieved in ways that effectively preserve anonymity. Legal frameworks around data protection vary internationally, but most do have some provision for the use of data without prior consent if these data are effectively anonymized and if important research cannot be carried out in any other way. It is important to bear in mind at the outset that few datasets can be claimed to be wholly anonymized. For example, even in the shallowest of administrative databases, a combination of age, gender and date/place of admission might well be sufficiently unique that it theoretically identifies a person. Technical solutions to anonymization are therefore never enough and need to be accompanied by a governance structure which evaluates database use for any risk of compromising anonymity, as well as monitoring the appropriateness of the research being carried out using sensitive data, and of the people and agencies having data access. Our own approach with the CRIS mental health case register has been to involve patients at the outset, both in designing the security model and in leading ongoing oversight of data use and dissemination [20].
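The point about combinations of variables can be illustrated with a simple check of how many records share each combination of potentially identifying fields. The sketch below flags records whose combination is shared by fewer than k records; the field names and the threshold are assumptions chosen for the example, and a check of this kind would complement, not replace, the governance arrangements described above.

```python
from collections import Counter


def risky_records(records, quasi_identifiers, k=5):
    """Return records whose combination of quasi-identifier values
    is shared by fewer than k records in the dataset."""

    def key(record):
        return tuple(record[field] for field in quasi_identifiers)

    # Count how many records fall into each combination of values.
    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] < k]


if __name__ == "__main__":
    # Hypothetical admissions data, invented for illustration only.
    admissions = [
        {"age": 40, "gender": "F", "admission_date": "2013-01-05"},
        {"age": 40, "gender": "F", "admission_date": "2013-01-05"},
        {"age": 82, "gender": "M", "admission_date": "2013-02-11"},
    ]
    fields = ["age", "gender", "admission_date"]
    print(risky_records(admissions, fields, k=2))
```

Records flagged in this way could be handled by coarsening the fields concerned, for example by replacing exact age with an age band or the admission date with the year of admission.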

In summary, the case register has had a long and impressive history in shaping the way mental disorders are understood today. However, it is an evolving research design and, whilst the review by Munk-Jørgensen and colleagues is a timely reminder of the way it has been used over the years, it is important that past limitations in data quality and availability are not seen as an inherent feature of the design. As with any dataset used for ‘secondary analysis’, the researcher is often faced with having less information available than if they had initiated de novo data collection. This is compounded by the fact that case register data are generally not collected with research in mind and that analysed samples comprise people who have presented to clinical services rather than screened community cases. However, in most instances, this is simply a matter of understanding what samples and measurements represent in the clinical environment and ensuring that inferences clearly take into account these characteristics. From our own experience, one of the most important elements of working with case register data has been for researchers to have a thorough appreciation of clinical populations and the way clinical information is recorded. Close collaboration between the university and healthcare sectors is a prerequisite, but this is no bad thing for any number of reasons.