Glenn Exton, head of data and analytics at RBS International, looks at how to assess the provenance of data to help you make data-informed decisions with confidence.
When making a decision, should you rely on intuition, experience, data or a mix of the three?
Think back to a moment when your manager asked you to produce a report or provide a forecast, and your heart was racing as you went to press send. Where does this uneasy lack of confidence come from? Is it your experience? Your intuition? Or perhaps it’s the data that you’re using. If it’s the latter, then the following should help to ease your nerves in the future.
More and more, across every industry in the world, we are seeing the effective use of data go hand in hand with capital to lead the market. Whether it’s sports teams acquiring new players or fintechs launching a new product, capital is required to complete the process, but it’s the reliance on data that ensures the investment will be a successful one.
Before thinking about using data to solve your issue, you must understand your responsibilities when using customer or colleague information and ensure that the data was collected ethically and stored securely. There is an ever-increasing focus on the ethical use of customer information; it is critical to act in a responsible manner to maintain customer trust and advocacy.
Is using data always better?
To answer this question we first need to understand the lifecycle of the data, also known as data provenance or data lineage. We must be aware of several factors: how was the data collected, why was the data collected, was permission granted to use the data, and what has happened to the data on its journey to your desk? To shed some light on these, we first need to distinguish between the different types of data.
Any data you’re handling will likely be one of the following:
- primary or secondary
- captured or exhaust
- structured or unstructured
- raw or processed
Primary or secondary data
Primary data is collected first-hand using methods like surveys, interviews, or experiments. It is collected with the research project in mind; in other words, it is obtained to answer your question. You should always aim to use primary data wherever possible, as you know why and how the data was collected, as well as all the subsequent operations performed on it. You understand its provenance. However, it’s not always possible to use primary data, owing to factors such as lack of time or money, so secondary data will sometimes have to be used instead.
Secondary data is obtained from somebody else, or from a system other than that which holds the primary data. So, unless you’re extremely lucky, the reason it was collected was to answer a different question from your own. As a result, you must be far more cautious when drawing conclusions from secondary data.
Captured or exhaust
Whether primary or secondary, your data will be either captured or exhaust. Captured data is data that was gathered for a specific purpose, so it is generally more useful for categorical decision-making. Exhaust data, on the other hand, is a by-product of other activity, such as activity logs, and is useful for spotting glitches and bugs or sparking new ideas.
Structured or unstructured
Now that you’ve identified whether your data is primary or secondary, and whether it is captured or exhaust (giving one of four possible combinations), you need to know if the data is structured or unstructured. Structured data is data that is primed and ready for analysis and formatted in an easily understandable manner, such as a table. As with primary data, you would generally prefer to work with structured data.
Unstructured data, on the other hand, is the opposite: the data is not well formatted and requires a lot of organising before it will be ready to use in analysis. It is important to understand that structured and unstructured data may be the exact same data; it is just the organisation and formatting that differs.
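To illustrate that the same information can exist in unstructured or structured form, here is a minimal Python sketch. The free-text notes, the field names and the `parse_note` helper are all invented for illustration; real unstructured data rarely follows a single pattern this neatly.

```python
import re

# Hypothetical free-text notes: the same facts a table would hold,
# just not yet organised into columns.
notes = [
    "Customer 1042 opened a savings account on 2021-03-04",
    "Customer 2210 opened a current account on 2021-03-09",
]

# Assumed pattern that fits these made-up notes only.
PATTERN = re.compile(
    r"Customer (?P<customer_id>\d+) opened a (?P<product>\w+) account "
    r"on (?P<date>\d{4}-\d{2}-\d{2})"
)


def parse_note(note):
    """Turn one free-text note into a row (dict), or None if it does not match."""
    match = PATTERN.search(note)
    return match.groupdict() if match else None


# Each row now has named columns, ready to load into a table.
rows = [parse_note(note) for note in notes]
for row in rows:
    print(row)
```

The content of `notes` and `rows` is identical; only the organisation differs. Notes that do not fit the pattern come back as `None`, which is a reminder that structuring data can silently drop information.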
Raw or processed
The final consideration when defining your data is whether it is raw or processed. Raw data has not been altered at all since collection, meaning you can fully dictate the manipulation and lifecycle of the data. However, raw data can often be riddled with inaccurate information or blank entries; it requires manipulation to produce meaningful information and to make it more accessible to the untrained eye. Once raw data has been manipulated into more meaningful information, it becomes processed data. The preferable option here is to obtain the raw data and then turn it into processed data yourself. That way you can fully track the lifecycle, ensuring that nothing vital has been lost.
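One way to track the lifecycle while turning raw data into processed data is to record every step as you apply it. The sketch below is only an illustration under stated assumptions: the survey records are made up, and `apply_step` and `lineage` are hypothetical names, not part of any particular tool.

```python
# Hypothetical raw survey responses; None marks a blank entry.
raw = [
    {"respondent": "A", "age": "34"},
    {"respondent": "B", "age": None},
    {"respondent": "C", "age": "-5"},
    {"respondent": "D", "age": "51"},
]

lineage = []  # one entry per processing step, in the order applied


def apply_step(records, description, keep):
    """Keep the records that pass `keep` and log the row counts for the step."""
    kept = [record for record in records if keep(record)]
    lineage.append(
        {"step": description, "rows_in": len(records), "rows_out": len(kept)}
    )
    return kept


# The raw list is never modified, so every step can be replayed or audited.
processed = apply_step(raw, "drop blank ages", lambda r: r["age"] is not None)
processed = apply_step(processed, "drop negative ages", lambda r: int(r["age"]) >= 0)

for entry in lineage:
    print(entry)
```

Because the raw records stay untouched and each step logs how many rows went in and came out, anyone reading the result can see what was removed and question whether something vital was lost.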
To return to our question, ‘Is using data always better?’, the answer is no. If you have secondary, exhaust, unstructured, raw data, it will probably look like nonsense and is unlikely to lead you to any logical conclusions. But it’s important to note that even when using seemingly sound data, you can still be led to illogical conclusions if the data is not used correctly and the data provenance isn’t understood.
Working with your own data
So in an ideal world, you have unlimited money and time and you’ve been able to obtain primary, captured, structured, raw data – but what pitfalls must you still look out for?
The first thing you must ask yourself is: does the data tell the full story? It’s very easy to be seduced into finding a data set that suits your conclusions, rather than the other way around, so you must tread carefully. Take Covid-19, for example, and assume you are trying to deduce which country has struggled most to deal with the pandemic. You have conducted research and have the exact number of deaths per country; it seems you can now safely conclude that the country at the top of the list has dealt with Covid-19 the worst. Then you receive more data that gives the number of deaths per capita, and you see that the country with the most deaths overall actually has considerably fewer deaths per capita than many other countries. Further data sets arrive regarding the socio-economics of different countries, their climate, the density of living spaces and so on, and you quickly realise that using just one data set was vastly inadequate to answer your question.
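The arithmetic behind this example can be sketched in a few lines of Python. The three countries and all of their figures are entirely made up for illustration; they are not real Covid-19 statistics.

```python
# Entirely made-up figures for three fictional countries.
countries = {
    "Country A": {"deaths": 50_000, "population": 100_000_000},
    "Country B": {"deaths": 20_000, "population": 10_000_000},
    "Country C": {"deaths": 5_000, "population": 50_000_000},
}


def deaths_per_100k(stats):
    """Deaths per 100,000 people, a common per-capita scale."""
    return stats["deaths"] * 100_000 / stats["population"]


# Rank the same countries by two different measures, worst first.
by_total = sorted(countries, key=lambda c: countries[c]["deaths"], reverse=True)
by_rate = sorted(countries, key=lambda c: deaths_per_100k(countries[c]), reverse=True)

print("Ranked by total deaths:", by_total)
print("Ranked by deaths per 100k:", by_rate)
```

With these invented numbers, Country A tops the list on total deaths, but Country B has four times its death rate per 100,000 people. Neither ranking settles the question on its own, which is the point: one measure rarely tells the full story.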
The second trap to be wary of is assuming that your opinion is fact because it is now supported by data: the data itself is unequivocal but your conclusions and opinion are not. To use the Covid-19 example, the death statistics are factual but the conclusion of which country handled the pandemic the worst will always be subjective, no matter how much data is used to underpin it.
Working with other people’s data
More often than not, you will find yourself working with secondary data. This is where knowledge of the data provenance is essential.
To avoid squeezing square pegs into round holes by using the wrong data set, you must consider the following when using secondary data.
- Rationale: was the data collected to answer the same or a similar question to yours?
- Competence: did the team who conducted the research collect data in a competent manner?
- Bias: is the data set skewed to fit a specific agenda (the inclusion or omission of data to fit a pre-existing agenda)?
- Clarity: is there clear documentation to prevent you from misinterpreting the contents of the data set?
- Processing: has the data already been manipulated, transformed or tampered with before it arrives on your desk?
Where possible, you should contact the team who originally conducted the research and ask them to clarify the five factors above. If you use the data to draw conclusions, you should note any potential gaps in the data or problems with how the research was conducted and, most importantly, avoid relying heavily on secondary data to underpin important decisions.
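If it helps to keep track of the five factors for each secondary data set, they can be recorded as a simple checklist. The sketch below is one possible way to do so; the class name `SecondaryDataReview` and its field names are hypothetical, chosen only to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SecondaryDataReview:
    """Answers to the five questions; None means 'not yet known'."""

    rationale_matches: Optional[bool] = None      # Rationale
    collected_competently: Optional[bool] = None  # Competence
    free_of_agenda_bias: Optional[bool] = None    # Bias
    clearly_documented: Optional[bool] = None     # Clarity
    processing_known: Optional[bool] = None       # Processing

    def open_questions(self):
        """Names of the factors that are unknown or answered 'no'."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not True]


# Example: only the rationale has been confirmed so far.
review = SecondaryDataReview(rationale_matches=True, clearly_documented=False)
print(review.open_questions())
```

Anything returned by `open_questions` is a gap worth raising with the original research team, or flagging alongside your conclusions if it cannot be resolved.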
The next time you are nervous about producing a report or making a critical decision, think about the type of data that you have used and allow this to dictate the confidence you have in your conclusion. If the data provenance indicates that your findings may be inconclusive, you should identify the elements that are decreasing the usefulness of the data and highlight these in your conclusion.