Organising Data For Analysis

Organising Data For Analysis

Introduction

The final result of a research project is the expression of the analyses of the data. For the analyses to be accurate and complete requires the data to be in a form that lends itself to proper analysis. Data analysis is most often done using computer programs. These programs dictate the structure in which the data must be presented.

Very often, little thought goes into this structural need. Data collection tools and the act of capturing the data generate data structures that must be altered prior to analysis. These alterations and corrections take time and effort, which could be spent elsewhere.

The term tidy data is used to describe data structured in such a way as to be amenable to analysis by computer program. The structure should also clearly convey information about a study, such as study design and research question(s).

Consider the table below and attempt to discern the design of the study for which the table was created and the data captured in.

Participant Placebo Intervention
1 2
2 2 3
3 2 4

There were clearly three participants in the study, with ID’s

,

, and

. It is not clear what the other numerical values in the table refer to. The dash, -, probably indicates that the data was not captured. Each participant had a value captured for both the Placebo and the Intervention columns. This may then have been a cross-over design study, where the participants received both a placebo and an intervention.

The table below, describes the same study and findings.

Group 1 2 3
Placebo 2 2
Intervention 2 3 4

The data has been transposed. However, neither of these tables are particularly conducive to understanding the study, nor are they fit for analysis.

These tables require data wrangling. This is the process of transforming data to convey the essence of the research project and to be amenable to analysis. Data wrangling also includes other tasks such correcting spelling errors and inconsistencies and dealing with missing information. Most data scientists have worked on datasets that contain entries such as 2020/12/03, 2020-03-12, 12-03-2020, 3-Dec-2020 and many other formats in the same table. Other frequent examples include entries such as Yes, yes, and Y, all used to indicate confirmation. Trailing spaces are a perennial cause of improper analysis. These are impossible to spot. Yes and Yes (the latter with a trailing space) are seen as different elements by analytical software.

For data tables to adhere to the mentioned requirements, in other words, for it to be tidy, requires certain rules.

Defintions

A dataset, presented as tables of data above, are collections of elements. These elements can be quantitative (numerical) or qualitative (categorical). Every element in the collection belongs to two entities. The first entity is a participant (human, object, or otherwise). When we capture a data point such as age, every age must belong to a participant. The second entity is a statistical variable. Age is just such a variable. A value collected for the temperature of a participant belongs to the variable temperature and not to the variable age. A statistical variable is something we can take a measurement for or capture a single value for.

A data table is a two-dimensional object where elements are arranged in rows and columns. One computational representation of this is a spreadsheet. Rows are listed one below the other and columns are arranged one next to another.

With these definitions, we can describe the properties of tidy data.

Tidy data

The three properties of tidy data are listed below.

  1. Each column represents a single variable
  2. Each row represents a participant (a participant can appear in more than one row)
  3. An observational unit forms a table

Property no. 3 refers to our ability to create more than one table. This is especially true when data is collected in a database as opposed to in spreadsheet software. In database software, there are multiple tables, that are linked. An obvious example of such a link would be the actual participants. One table might capture demographic data for each participant. The next captures their financial data. We can link all the tables by participant.

Both the two tables in the introduction do not adhere to the principles of tidy data. In the first table, the values in the Placebo column are not valid entries for such a variable. Data for another variable is captured in that column. In the second table, each row does not represent a participant.

The table below uses the principles of tidy data to transform the two prior tables.

Participant Group Measure
1 Placebo
2 Placebo 2
3 Placebo 2
1 Intervention 2
2 Intervention 3
3 Intervention 4

It is still clear that there are three participants. While they received a placebo, a measure was taken for some variable (for example a blood test) and when they received the intervention, these same variable was measured again.

This table is tidy and amenable to analysis.

Common mistakes

There are a variety of data table problems. We take a look at some of them.

Mutiple variables in one column

This is often done for categorical variables. One example is shown below for the variable Comorbid disease.

Participant Comorbidities
1 None
2 Diabetes and hypertension
3 Hypertension and diabetes
4 Diabetes
5 Hypertension
6 Chronic bronchitis and hypertension

This table is well-intentioned and convenient. It is well-intentioned as it reflects how our minds work in the clinical setting. It is convenient in the sense that it is easy to capture from a participant’s health records or while interviewing a participant. It is NOT amenable to analysis, though.

If we have thousands of participants and want to group them by whether they have hypertension, we would have to hand select each row (participant). Alternatively we would need to write formulas or computer code to extract the correct participants. The table below shows the same information, but in tidy format.

Participant Comorbidities Diabetes Hypertension Chronic bronchitis
1 N N N N
2 Y Y Y N
3 Y Y Y N
4 Y Y N N
5 Y N Y N
6 Y N Y Y

Now each column contains a variable. It is much simpler to select only those with hypertension or to count how many have comorbidities at all.

Inconsistent data entry

We have already mentioned this problem. It is illustrated in the table below.

Participant Diabetes
1 N
2 Yes
3 Y
4 1
5 0
6 No

This table shares a mistake with the previous example. Every variable must have a well defined sample space. A sample space is the set of elements from which a measurement value can be taken. We cannot capture someone’s age as 403 years. This is simply not in the sample space of values for a variable Age. It is clear what the researcher was attempting in the table above. It is just not possible to easily analyze this data.

The solution is to clearly define the sample space. In the case of diabetes, we should only have two elements in the sample space. Either yes on no. Express these in consistent form, either Yes and No or Y and N or

and

. Do NOT mix these.

Also, rather have a separate column (variable) to indicate whether this data is not known or have not been verified yet. Do not use special elements such as

to indicate that the patient is unsure or - if the records did not reflect the information. Leave entries blank if the data does not exist.

Columns are data values

Consider the table below.

Participant Measure 1 Measure 2
1 3
2 2 1
3 2 2
4 4

Some participants did not have a second measure of some unknown numerical variable. There is also no consistency in the missing data. Measure 1 and Measure 2 are not variables. The correct tidy form is shown in the table below.

Participant Measure Value
1 1 3
2 1 2
2 2 2
3 1 2
3 2 2
4 1 4

Variables in rows

Consider the table below for a study that requires the minimum and maximum value for a variable, i.e. a blood specimen result.

Participant Measure Value
1 min 3
1 max 5
2 min 2
2 max 7
3 min 2
3 max 2
4 min 2
4 max 4

There are four participants. While we can have more than a single row for a participant, we have two variables incorrectly assigned to the sample space of another variable. A tidy version of this table is shown below.

Participant Min Max
1 3 5
2 2 7
3 2 2
4 2 4

Tips

When designing a data collection tool, keep the following in mind.

  1. Keep it simple. Use a spreadsheet unless your project becomes too complex.
  2. Get help in designing your data collection tool, especially a database.
  3. Use tidy data principles AT ALL TIMES.
  4. Create only the variables required to answer your research question(s).* These may include descriptors of your participants (age, gender), their demographics, and their health data. This is useful to describe your participants so that readers of your findings can decide if they can infer your results to their populations. It also aides in subgroup analysis and comparisons. Then choose only the variables required for the specific research question. Collecting data on more variables is a waste of time and energy.

* When registering a larger electronic database, from which multiple projects will draw data, create as many variables as is practically possible.