
Research Data Management: Data Processing

Data Processing

Data processing refers to the operations performed on data to prepare them for analysis. Because data processing changes data values, it is important to track those changes at the level of the individual data value so that the research data remain traceable. Data processing can take several forms:

  1. Data cleaning
  2. Data transformation
  3. Data enhancement
  4. Data integration
  5. De-identification
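Tracking changes at the data value level can be sketched in a few lines of Python. This is only an illustration: the `ChangeLog` class, the field names and the example record are hypothetical and are not taken from Zozus (2017). The idea is that every change records the old value, the new value, the reason and a timestamp before the value is overwritten.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeLog:
    """Collects one entry per changed data value (hypothetical helper)."""
    entries: list = field(default_factory=list)

    def update(self, record, record_id, element, new_value, reason):
        """Change record[element] and log the old and new value."""
        old_value = record.get(element)
        if old_value == new_value:
            return  # nothing changed, so nothing is logged
        self.entries.append({
            "record_id": record_id,
            "element": element,
            "old": old_value,
            "new": new_value,
            "reason": reason,
            "changed_at": datetime.now(timezone.utc).isoformat(),
        })
        record[element] = new_value


# Hypothetical example record: height was captured in mm instead of cm.
log = ChangeLog()
participant = {"id": "P001", "height_cm": 1720}
log.update(participant, "P001", "height_cm", 172, "recorded in mm, converted to cm")

for entry in log.entries:
    print(entry["element"], entry["old"], "->", entry["new"], "|", entry["reason"])
```

Keeping the log separate from the data means the original value can always be recovered, which is the point of traceability.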

 

Sources:

Zozus, M. 2017. The Data Book : Collection and Management of Research Data. Boca Raton: CRC Press, Taylor & Francis Group.

Data Cleaning

Data cleaning is a process of identifying and resolving discrepant data values. 
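Identifying discrepant values is often done with simple rule-based checks. The sketch below assumes a plausible range for each data element; the function name, the `age` element and the 0 to 120 range are illustrative assumptions. It only identifies discrepancies. Resolving them usually means going back to the source of the data.

```python
def find_discrepancies(records, element, low, high):
    """Return (index, value) pairs where the value is missing or outside [low, high]."""
    flagged = []
    for index, record in enumerate(records):
        value = record.get(element)
        if value is None or not (low <= value <= high):
            flagged.append((index, value))
    return flagged


# Hypothetical records: one implausible age and one missing age.
records = [{"age": 34}, {"age": 340}, {"age": None}, {"age": 29}]
flagged = find_discrepancies(records, "age", 0, 120)
print(flagged)
```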

Sources:

Zozus, M. 2017. The Data Book : Collection and Management of Research Data. Boca Raton: CRC Press, Taylor & Francis Group.

Data Transformation

Data transformation is the process of converting data from one format or structure into another format or structure.

Data transformation may take various forms, for example imputation, standardisation, mapping and coding.

Imputation

Imputation is the process of systematically replacing missing values with an estimated value.
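As a minimal sketch, the example below replaces missing values with the mean of the observed values. Mean imputation is only one of many possible estimation methods and is chosen here purely for illustration; the function name and the sample values are assumptions. The function also returns a flag per value so that imputed values stay distinguishable from observed ones.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of the observed values.

    Returns the filled list and a parallel list of flags marking imputed values.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    estimate = mean(observed)
    filled = [estimate if v is None else v for v in values]
    imputed = [v is None for v in values]
    return filled, imputed


# Hypothetical measurements with one missing value.
filled, imputed = impute_mean([4.0, None, 6.0, 8.0])
print(filled)
print(imputed)
```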

Standardisation

Data standardisation is a process in which data values for a data element are transformed to a consistent representation. 
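For example, dates captured in several layouts can be standardised to a single ISO 8601 representation (YYYY-MM-DD). The list of accepted input formats below is an assumption for illustration; a real project would list the formats that actually occur in its data and decide how to treat ambiguous day/month orders.

```python
from datetime import datetime

# Input layouts assumed to occur in the data (illustrative only).
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardise_date(text):
    """Return the date as an ISO 8601 string, trying each known format in turn."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


raw_dates = ["2021-05-03", "03/05/2021", " 20210503 "]
standardised = [standardise_date(d) for d in raw_dates]
print(standardised)
```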

Mapping

Mapping is the process of associating data values with another set of data values, usually but not always using standard codes.
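A lookup table is the simplest way to sketch a mapping. The example below maps free-text responses to the ISO/IEC 5218 codes for sex (0 = not known, 1 = male, 2 = female, 9 = not applicable). The accepted spellings in the lookup table and the function name are assumptions for illustration.

```python
# ISO/IEC 5218 codes; the accepted input spellings are illustrative assumptions.
SEX_CODES = {"male": 1, "m": 1, "female": 2, "f": 2}
NOT_KNOWN = 0


def map_sex(value):
    """Map a free-text response to an ISO/IEC 5218 code, defaulting to 0 (not known)."""
    if value is None:
        return NOT_KNOWN
    return SEX_CODES.get(value.strip().lower(), NOT_KNOWN)


responses = ["Male", "F ", None, "unknown"]
codes = [map_sex(r) for r in responses]
print(codes)
```

Keeping the original response alongside the mapped code makes it possible to review values that fell through to the default.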

Coding

Coding is the process of adding standard codes to discrete data values.

Sources:

https://en.wikipedia.org/wiki/Data_transformation

Zozus, M. 2017. The Data Book : Collection and Management of Research Data. Boca Raton: CRC Press, Taylor & Francis Group.

Data Enhancement

Data enhancement is the association of data values with other, usually external, data values or knowledge. There are two types of data enhancement:

  1. Linkage with other data via referential integrity.
  2. Linkage of data elements or data values with external knowledge sources (vocabularies/dictionaries, taxonomies and ontologies).

To browse a range of taxonomies and ontologies, see FAIRsharing.org.
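Linkage via referential integrity can be sketched as a check that every key in the research data has a matching entry in the external reference data before the extra values are attached. The clinic lookup table, the field names and the helper function below are hypothetical.

```python
def orphaned_keys(records, key, reference):
    """Return the key values in records that have no entry in the reference data."""
    return sorted({record[key] for record in records} - set(reference))


# Hypothetical external reference data, keyed by clinic identifier.
clinics = {"C01": {"district": "North"}, "C02": {"district": "South"}}

# Hypothetical research data that refers to clinics by identifier.
visits = [{"visit": 1, "clinic_id": "C01"}, {"visit": 2, "clinic_id": "C09"}]

missing = orphaned_keys(visits, "clinic_id", clinics)

# Enhance only the records whose key resolves; the rest need follow-up.
enhanced = [
    {**visit, **clinics[visit["clinic_id"]]}
    for visit in visits
    if visit["clinic_id"] in clinics
]
print(missing)
print(enhanced)
```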

Sources:

Zozus, M. 2017. The Data Book : Collection and Management of Research Data. Boca Raton: CRC Press, Taylor & Francis Group.

Data Integration

Data integration entails using the relationships between data to join them.
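A join on a shared key is the most common case. The sketch below performs an inner join of two small datasets on a participant identifier; the datasets, the `pid` key and the function name are illustrative assumptions. Rows without a match on both sides are dropped, which is a design choice that should be made deliberately.

```python
def inner_join(left, right, key):
    """Join two lists of records on a shared key, keeping only matching rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical datasets sharing the key "pid".
demographics = [{"pid": "P1", "age": 34}, {"pid": "P2", "age": 29}]
lab_results = [
    {"pid": "P1", "hb": 13.2},
    {"pid": "P1", "hb": 12.8},
    {"pid": "P3", "hb": 14.0},
]

joined = inner_join(demographics, lab_results, "pid")
print(joined)
```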

Sources:

Zozus, M. 2017. The Data Book : Collection and Management of Research Data. Boca Raton: CRC Press, Taylor & Francis Group.

De-identification

According to the Protection of Personal Information Act, de-identification is a process that entails the deletion of any information that:

  • identifies a data subject;
  • can be used or manipulated by a reasonably foreseeable method to identify a data subject; or
  • can be linked by a reasonably foreseeable method to other information that identifies a data subject. 

 

Several methods of de-identification can be used, namely: pseudo-anonymisation, data aggregation, masking, generalisation, subsampling, removal of direct identifiers, restriction of upper or lower ranges, statistical disclosure control (input and output based) and editing (in the case of digital images, audio recordings or videos).
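Three of these methods (pseudo-anonymisation, generalisation and removal of direct identifiers) can be sketched as follows. The record layout, the secret key and the function names are hypothetical, and the keyed hash is one possible way to derive a pseudonym, not a prescribed one. Whether the result counts as de-identified under the Act depends on what other information could be linked to it, so treat this as an illustration only.

```python
import hashlib
import hmac

# Hypothetical key; in practice it must be stored securely and apart from the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonym(identifier):
    """Derive a stable pseudonym from an identifier using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened for readability; raises collision risk


def generalise_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def de_identify(record):
    """Keep only a pseudonym, an age band and the analysis variable."""
    return {
        "pseudo_id": pseudonym(record["national_id"]),
        "age_band": generalise_age(record["age"]),
        "diagnosis": record["diagnosis"],
    }


# Hypothetical record containing direct identifiers.
record = {"name": "A. Person", "national_id": "0000000000000", "age": 34, "diagnosis": "asthma"}
safe_record = de_identify(record)
print(safe_record["age_band"], sorted(safe_record))
```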

Data Cleaning

OpenRefine

A free, open source, power tool for working with messy data.

Trifacta

Trifacta is a platform for exploring and preparing data for analysis that allows you to discover, wrangle and visualise data quickly.

 

Data Transformation

Easymorph
A powerful, easy-to-use data preparation and ETL (Extract, Transform, and Load) tool. Great for non-technical users.

 

De-identification

ARX Data Anonymization Tool

ARX is a comprehensive open source software for anonymizing sensitive personal data. It supports a wide variety of (1) privacy and risk models, (2) methods for transforming data and (3) methods for analyzing the usefulness of output data.

 

Amnesia Data anonymization tool 

Amnesia is a flexible data anonymization tool that allows you to remove identifying information from data. Amnesia transforms relational and transactional data into anonymized datasets where formal privacy guarantees hold. It not only removes direct identifiers such as names and SSNs, but also transforms secondary identifiers such as birth date and zip code, so that individuals cannot be identified by linking the data to other sources of information.