Data processing refers to the operations performed on data to prepare it for analysis. Because processing entails changing data values, it is important to track those changes at the level of individual values so that the research data remain traceable. Data processing can be applied through different methods:
Sources:
Zozus, M. 2017. The Data Book : Collection and Management of Research Data. Boca Raton: CRC Press, Taylor & Francis Group.
Data transformation is the process of converting data from one format or structure into another format or structure.
Data transformation may take various forms, e.g. imputation, standardisation, mapping and coding.
Imputation is the process of systematically replacing missing values with an estimated value.
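As a minimal sketch of mean imputation (the values and the choice of the mean as the estimate are illustrative; other estimates, such as the median or a model-based value, are equally valid):

```python
from statistics import mean

# Hypothetical measurements with missing values recorded as None
readings = [4.2, None, 5.1, 4.8, None, 5.0]

# Mean imputation: replace each missing value with the mean of the
# observed (non-missing) values
observed = [r for r in readings if r is not None]
fill = mean(observed)  # (4.2 + 5.1 + 4.8 + 5.0) / 4 = 4.775
imputed = [r if r is not None else fill for r in readings]
print(imputed)  # [4.2, 4.775, 5.1, 4.8, 4.775, 5.0]
```

For traceability, the original values and the imputation rule should both be recorded, so the imputed values can be distinguished from observed ones.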
Data standardisation is a process in which data values for a data element are transformed to a consistent representation.
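A common example is standardising dates captured in inconsistent formats to a single representation. The sketch below (with made-up dates and an assumed target of ISO 8601) tries each known input format in turn:

```python
from datetime import datetime

# Hypothetical date values captured in inconsistent formats
raw_dates = ["12/03/2021", "2021-03-14", "15 Mar 2021"]
formats = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def standardise(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each known input format."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value}")

print([standardise(d) for d in raw_dates])
# ['2021-03-12', '2021-03-14', '2021-03-15']
```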
Mapping is the process of associating data values to another set of data values, usually but not always using standard codes.
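For instance, free-text responses can be mapped to a standard code list. The sketch below assumes a simple sex variable coded to one-letter codes (M/F/U, as used in HL7 administrative-sex tables); the collected values are hypothetical:

```python
# Hypothetical mapping of free-text sex values to standard one-letter codes
code_map = {"male": "M", "m": "M", "female": "F", "f": "F", "unknown": "U"}

collected = ["Male", "F", "female", "unknown"]

# Normalise case/whitespace before looking up the standard code
coded = [code_map[v.strip().lower()] for v in collected]
print(coded)  # ['M', 'F', 'F', 'U']
```

In practice the mapping table itself should be versioned and stored with the data, since it documents how the original values were transformed.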
Data enhancement is the association of data values with other, usually external, data values or knowledge. There are two types of data enhancement:
Several taxonomies and ontologies can be accessed via FAIRsharing.org.
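A simple form of enhancement is enriching records with values from an external reference table. The sketch below attaches country names from the ISO 3166-1 alpha-2 code list to hypothetical observations (the site and country values are illustrative):

```python
# External reference knowledge: ISO 3166-1 alpha-2 codes to country names
country_names = {"ZA": "South Africa", "KE": "Kenya", "NG": "Nigeria"}

# Hypothetical collected observations
observations = [
    {"site": "S1", "country_code": "ZA"},
    {"site": "S2", "country_code": "KE"},
]

# Enhancement: associate each record with the external country name
enhanced = [{**o, "country_name": country_names[o["country_code"]]}
            for o in observations]
print(enhanced)
```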
Data integration entails using relationships between data in order to join datasets.
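The relationship is typically a shared key. A minimal sketch of an inner join on a hypothetical `participant_id` key (the tables and field names are illustrative):

```python
# Two hypothetical tables sharing a participant_id key
participants = [
    {"participant_id": 1, "age": 34},
    {"participant_id": 2, "age": 51},
]
results = [
    {"participant_id": 1, "score": 88},
    {"participant_id": 2, "score": 75},
]

# Inner join: combine records that share the same key value
by_id = {r["participant_id"]: r for r in results}
joined = [{**p, **by_id[p["participant_id"]]}
          for p in participants if p["participant_id"] in by_id]
print(joined)
# [{'participant_id': 1, 'age': 34, 'score': 88},
#  {'participant_id': 2, 'age': 51, 'score': 75}]
```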
According to the Protection of Personal Information Act (POPIA), de-identification is a process that entails the deletion of any information that:
There are several methods of de-identification that can be used, namely: pseudonymisation, data aggregation, masking, generalisation, subsampling, removal of direct identifiers, restriction of upper or lower ranges, statistical disclosure control (input- and output-based) and editing (in the case of digital images, audio recordings or videos).
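Two of these methods can be sketched together: pseudonymisation (replacing a direct identifier with a stable pseudonym) and generalisation (coarsening a secondary identifier). The records, field names and salt below are hypothetical, and a salted hash is only one possible pseudonymisation scheme:

```python
import hashlib

# Hypothetical records with a direct identifier (name) and a secondary
# identifier (postal code); all values are illustrative
records = [
    {"name": "A. Smith", "postal_code": "7925", "diagnosis": "X"},
    {"name": "B. Jones", "postal_code": "7935", "diagnosis": "Y"},
]

def deidentify(record, salt="project-secret"):
    out = dict(record)
    # Pseudonymisation: replace the name with a salted hash, so the same
    # person always maps to the same pseudonym
    out["pid"] = hashlib.sha256((salt + record["name"]).encode()).hexdigest()[:8]
    del out["name"]
    # Generalisation: truncate the postal code to a coarser region
    out["postal_code"] = record["postal_code"][:2] + "XX"
    return out

deidentified = [deidentify(r) for r in records]
print(deidentified)
```

Note that the salt-to-pseudonym link must itself be protected (or destroyed), otherwise the data are only pseudonymised, not fully de-identified.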
OpenRefine
A free, open source, powerful tool for working with messy data.
Trifacta is a platform for exploring and preparing data for analysis that allows you to discover, wrangle and visualise data quickly.
Easymorph
A powerful, easy-to-use data preparation and ETL (Extract, Transform, and Load) tool. Great for non-technical users.
ARX is comprehensive open source software for anonymising sensitive personal data. It supports a wide variety of (1) privacy and risk models, (2) methods for transforming data and (3) methods for analysing the usefulness of output data.
Amnesia Data anonymization tool
Amnesia is a flexible data anonymisation tool that allows you to remove identifying information from data. Amnesia transforms relational and transactional data into anonymised datasets where formal privacy guarantees hold. It not only removes direct identifiers such as names and SSNs, but also transforms secondary identifiers such as birth date and zip code, so that individuals cannot be identified in the data by linking them to other sources of information.