Don’t take your data source at face value if your analysis is to be credible!

19 May 2023 Written by Sarah Thelwall

People are often surprised when we say ‘you can’t just take the data in its raw form!’. They are also surprised by just how much work we put into cleaning, tidying and, indeed, correcting data. This is because we bring in data from raw feeds such as the Charity Commission API, the 360Giving feed or the iXBRL data from Companies House.

These data sources behave very differently from, say, the Index of Multiple Deprivation (IMD), population data, postcode data or other national datasets produced by the Office for National Statistics (ONS). The ONS is responsible for producing a wide variety of datasets which are used widely by central and local government as well as by an assortment of consultancies, Arm’s Length Bodies (ALBs), agencies and membership organisations. The fact that they are produced for public sector purposes is the key point here: they have to be accurate if they are to inform national policy or track national metrics. There is a reason the national census takes a while to publish: it is not a raw feed of people’s responses but is reviewed by experts before being published as a static dataset.

In contrast, the data sources made publicly available by the likes of the Charity Commission are not checked, corrected or edited in the same way. The charities which submit the data have a legal responsibility to ensure it is accurate, but whole areas of the data rely on a set of recommendations for how to report, and there is no guarantee that these recommendations have been followed consistently by the hundreds of thousands of charities who submit data.

All data sources are not equal in their quality!

It matters, therefore, that if you are going to use one of these data sources you understand its flaws, limitations and sources of error. Establish processes to check for errors, record any corrections separately from the original data, and work up sets of necessary exclusions, along with your estimate of what difference these make in the way you use the data source. If you do not do this, what you will have, frankly, is junk.
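To make the idea of recording corrections separately from the original data concrete, here is a minimal sketch. All record IDs, field names and values are hypothetical; the point is simply that the raw feed is never edited in place, and every change carries a note so it can be audited and re-applied when the data is refreshed.

```python
# Raw feed as received -- never modified in place.
# Records and fields here are invented for illustration.
raw = {
    "1234567": {"name": "Exmple Trust", "income": 50000},
    "7654321": {"name": "Riverside CIC", "income": 120000},
}

# Each correction records which record and field it touches, the new
# value, and a note explaining why the change was made.
corrections = [
    {"id": "1234567", "field": "name", "value": "Example Trust",
     "note": "typo in registered name"},
]

def apply_corrections(raw, corrections):
    """Return a corrected copy of the raw data, leaving the original intact."""
    cleaned = {rid: dict(rec) for rid, rec in raw.items()}  # deep enough copy
    for c in corrections:
        cleaned[c["id"]][c["field"]] = c["value"]
    return cleaned

cleaned = apply_corrections(raw, corrections)
```

Because the corrections live in their own table, you can rerun `apply_corrections` against each fresh download of the feed and get the same fixes every time.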

As an example, if you are based in England or Wales and want to look at the size and scale of non-profits in your local authority area, there are three sources of lists of registered non-profits – Companies House, the Mutuals Register and the Charity Commission for England and Wales. Once you’ve combined and de-duplicated the lists from these sources you will need to consider whether to include or exclude the following:

  • Grant-making organisations – you’ll exclude them if you want total turnover, as to include them is effectively to double count the income
  • Academy schools, private schools, education trusts attached to either of these types of schools, universities and colleges – do you consider these to be part of the VCSE sector? The largest 400 private schools have a combined turnover of £7bn, and there are plenty of universities whose annual turnover exceeds £100m, so again, if it is the turnover of a sector or the number of employees you are interested in, these make a big difference to your answers.
  • Right to Manage (RTM) domestic property organisations, usually set up to manage ownership of the freehold of a building such as a block of flats where there are multiple leaseholders – do you consider these to be part of the VCSE sector? (usually not)
  • Large non-profits which are based at one registered address but which operate across a larger geographic area than a single local authority, e.g. the National Trust. If you include them in data on the turnover of the non-profit sector in your area you’ll have an inflated figure vs. the activity which is actually occurring in your area.
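The combine, de-duplicate and exclude steps above can be sketched roughly as follows. The register contents, field names and the shared matching key are all invented for illustration – in practice, matching organisations across registers is much harder, since names vary and the registers use different identifier schemes – but the shape of the process holds: merge, de-duplicate, then apply each exclusion decision while recording why each record dropped out.

```python
# Hypothetical extracts from two registers, already reduced to a common
# shape. The "key" field is an assumed shared identifier for matching.
companies_house = [
    {"key": "NT001", "name": "The National Trust", "income": 680_000_000,
     "grant_maker": False, "rtm": False, "national": True},
    {"key": "RTM01", "name": "Elm Court RTM Co", "income": 15_000,
     "grant_maker": False, "rtm": True, "national": False},
]
charity_commission = [
    {"key": "NT001", "name": "The National Trust", "income": 680_000_000,
     "grant_maker": False, "rtm": False, "national": True},
    {"key": "LOC01", "name": "Local Arts Charity", "income": 250_000,
     "grant_maker": False, "rtm": False, "national": False},
]

# Combine and de-duplicate on the shared key (keep the first occurrence).
combined = {}
for rec in companies_house + charity_commission:
    combined.setdefault(rec["key"], rec)

# Apply the exclusion decisions, recording the reason for each exclusion
# so the same decisions can be replicated on the next data refresh.
exclusions = []
included = []
for rec in combined.values():
    if rec["grant_maker"]:
        exclusions.append((rec["key"], "grant maker - avoid double counting"))
    elif rec["rtm"]:
        exclusions.append((rec["key"], "RTM company - not VCSE"))
    elif rec["national"]:
        exclusions.append((rec["key"], "national org - activity not local"))
    else:
        included.append(rec)

local_turnover = sum(r["income"] for r in included)
```

The `exclusions` list doubles as the record of decisions: each entry says which organisation was dropped and why, which is exactly what you need when someone asks how you arrived at your sector totals.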

There are decisions like this to be taken on all of the data points in any raw data feed, and of course, if you are going to use the data more than once, you need to keep a record of the decisions you have made so that you can replicate them when you update the data.

We suggest that it is not reasonable to expect a data publisher to ensure accuracy, completeness and consistency for all the data they publish, especially if you are using the data for purposes other than those for which it was collected. The onus is on the data user to learn the limits of the data. On the other hand, it is reasonable to expect to be able to talk to a data publisher about where they set the limits of their responsibility and what that means for you as a data user.

There is a great deal of valuable data in these data sources, and it is brilliant that they are publicly available. Let’s be clear, however: they are not very accessible by comparison with the sorts of datasets produced by the ONS, so if you are going to use them it will take skill and good processes, and this means you will incur both time costs and data management costs.

If you don’t make these efforts to improve the data before you rely on it, you risk creating a situation where people read your results, don’t believe them, do a little digging, point out the flaws in the data and then have a reason to disregard all the work that relies upon it. This turns people off data and its uses, and that’s not helpful – but if you have data that is little better than junk, it shouldn’t be relied on, should it?