Guidance on organising, documenting and formatting data.
Ensuring your data are well organised and documented allows them to be located and understood more easily. This will help you and your collaborators during the project, and others who may wish to replicate and verify the research in future.
Formats used to create and collect data may vary according to discipline and practical requirements, and may include proprietary formats readable only using specific software.
For sharing and long-term preservation, however, data should be stored using standard or open formats. This will help to ensure the data remain accessible as technology progresses and changes.
Planning what these formats will be at the beginning of a research project will reduce the risk of data being locked into a proprietary format, and the formats chosen should be detailed in your data management plan.
The UK Data Service provides advice on formatting your data.
Devising a folder structure and file naming convention at the start of a project makes it easier to manage and keep track of data.
Elements within file names should be informative and consistently ordered, to provide version clarity and reduce risk of errors. They should also be standardised in terms of vocabulary, punctuation and numerical format.
Useful elements to incorporate in file names include:
- Date (often at start in YYYYMMDD format)
- Descriptive identifier (Interview, Questionnaire, Budget, etc)
- Version number (eg v01)
- Name or pseudonym of participant/correspondent, if appropriate
- Name of last modifier
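As an illustration only, the elements above could be assembled by a small helper so that order, punctuation and number format stay consistent. The function name `build_file_name`, the underscore separator and the element order are assumptions made for this sketch, not a prescribed convention.

```python
import re
from datetime import date


def build_file_name(created, identifier, version, extension,
                    participant=None, modifier=None):
    """Assemble a file name from consistently ordered elements.

    Hypothetical order: date, identifier, participant, version, modifier.
    """
    parts = [created.strftime("%Y%m%d"), identifier]
    if participant:
        parts.append(participant)
    # Zero-padded version number, eg v01
    parts.append(f"v{version:02d}")
    if modifier:
        parts.append(modifier)
    # Replace spaces and other punctuation with hyphens so that
    # underscores only ever separate elements
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", p).strip("-") for p in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


print(build_file_name(date(2024, 3, 5), "Interview", 1, "txt",
                      participant="P07", modifier="JS"))
# prints 20240305_Interview_P07_v01_JS.txt
```

Putting the date first in YYYYMMDD format means that an alphabetical listing of the folder is also a chronological one.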
A hierarchical structure grouped by topic is recommended, with ongoing and completed work separated. Files that need to be more widely accessible may be copied or moved to higher-level folders, where permissions are easier to set.
Many versions of the same file may be created during the course of your research. It is important to differentiate between these using version control.
Your version control strategy should be consistent throughout the project, and files in different locations should be synchronised regularly.
Version control can take a number of forms using various tools, including:
- file naming system using dates and/or version numbers
- version control table or file history within a file
- versioning software, eg Subversion
- distributed version control system, eg Git
- file sharing services, eg University Google Drive (University login required).
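For the simplest of these approaches, a file naming system with version numbers, a short script can find the latest version and name the next one. This is a minimal sketch that assumes versions appear as `_vNN` in the file name; the function names `latest_version` and `with_version` are hypothetical.

```python
import re

# Matches "_v" followed by digits, when followed by "_" or "."
VERSION_PATTERN = re.compile(r"_v(\d+)(?=[_.])")


def latest_version(file_names):
    """Return the highest version number found, or 0 if none is found."""
    versions = [
        int(match.group(1))
        for name in file_names
        if (match := VERSION_PATTERN.search(name))
    ]
    return max(versions, default=0)


def with_version(file_name, version):
    """Return file_name with its version element replaced by the given number."""
    return VERSION_PATTERN.sub(f"_v{version:02d}", file_name, count=1)


names = ["20240305_Interview_v01.txt", "20240305_Interview_v02.txt"]
print(with_version(names[0], latest_version(names) + 1))
# prints 20240305_Interview_v03.txt
```

This only helps with naming. It does not record what changed between versions, which is where a version control table or a tool such as Git is more useful.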
Data documentation should give the information required by another researcher in your discipline to understand and reuse the data provided. Most often this researcher will be you in the future, so doing this helps you as well as others.
Documentation describes the content, context and structure of the data, along with the conditions and processes involved in their creation, collection and processing.
At the start of your research, you should decide what information you need to record, and create documentation as the research progresses.
Project-level documentation is normally stored with associated data files and is often in the form of a plain text file called a README, which is placed in the top level of the dataset. Its contents may include:
- basic description of the research
- inventory of files and relationship between them
- methodologies, protocols, sampling techniques
- equipment used, with settings and calibrations
- software, code and algorithms
- classification systems and abbreviations
- details of third-party data
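One way to make sure none of these items is forgotten is to generate an empty README at the start of the project and fill it in as the research progresses. The sketch below is illustrative only: the headings mirror the list above, while the function name `readme_skeleton`, the underlining style and the "TODO" placeholder are assumptions.

```python
README_SECTIONS = [
    "Description of the research",
    "Inventory of files and relationship between them",
    "Methodologies, protocols and sampling techniques",
    "Equipment used, with settings and calibrations",
    "Software, code and algorithms",
    "Classification systems and abbreviations",
    "Third-party data",
]


def readme_skeleton(title, sections=README_SECTIONS):
    """Return plain text for a README with one empty section per heading."""
    lines = [title, "=" * len(title), ""]
    for heading in sections:
        # Each section starts with a placeholder to be replaced later
        lines += [heading, "-" * len(heading), "TODO", ""]
    return "\n".join(lines)


# Write the skeleton to the top level of the dataset, eg:
# with open("README.txt", "w", encoding="utf-8") as handle:
#     handle.write(readme_skeleton("Example project"))
```

Plain text is used here because it remains readable without specific software.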
File or item-level documentation is usually included in individual files and contains details about variables within the file, including table headings, abbreviations, units of measurement and anomalies. This is commonly placed in:
- the worksheet of a spreadsheet
- the header section of an html document
- the metadata section of a .jpg file
Further documentation to enable data use may include:
- laboratory notebook records
- sample consent forms
NB The terms ‘data documentation’ and ‘metadata’ are sometimes used interchangeably.
‘Metadata’ generally refers to details in a repository record that enable data discovery and access, whereas ‘documentation’ usually refers to information within a dataset that enables data to be understood and reused.
For further information, contact email@example.com.