Planning will assist you in overcoming a number of obstacles in your research. For example, if you have a pre-defined naming schema for your variables and file names, it will be much easier to find the right data and the right version in the future. If you have estimated the amount of data you will collect, you will be able to request grant funds for curating and managing them.
Planning is particularly important in longitudinal studies, studies that involve surveys, projects that result in multiple data files, including images and video, and Big Data.
A number of agencies now request data management plans to accompany proposals.
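A pre-defined naming schema, as mentioned above, is easier to keep when file names can be checked automatically. The following is a minimal sketch, not a recommended standard: the pattern `project_YYYYMMDD_description_vNN.ext`, the name `NAME_PATTERN`, and the function `parse_name` are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical schema: project_YYYYMMDD_description_vNN.ext
# (lowercase letters and digits only, hyphens allowed in the description).
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<date>\d{8})_(?P<description>[a-z0-9-]+)"
    r"_v(?P<version>\d{2})\.(?P<extension>[a-z0-9]+)$"
)


def parse_name(filename):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject strings that look like dates but are not valid calendar dates.
        datetime.strptime(parts["date"], "%Y%m%d")
    except ValueError:
        return None
    return parts


for name in ["soilsurvey_20240315_site-a-ph_v02.csv", "final data (2).xlsx"]:
    print(name, "->", parse_name(name))
```

Running a check like this over a project folder before sharing or archiving it is one way to catch files that drifted from the agreed schema.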
How to write a Data Management Plan
The data management plan should address
- How the data will be collected
- The type or format of data collected
- The size of the data
- How the data will be described (e.g., will you be using codebooks, logs, specific metadata standards, or ontologies?)
- Where the data will be stored, backed up and secured if necessary
- How the data will be analyzed
- How the data will be shared and preserved, or reasons not to do so
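For the size of the data, a rough estimate is usually enough for a plan. The sketch below shows one way to do the arithmetic; the function name `estimate_storage_gb`, the number of backup copies, the growth margin, and all example figures are assumptions, not values from any funder or study.

```python
def estimate_storage_gb(files_per_session, mb_per_file, sessions,
                        copies=3, growth_margin=0.2):
    """Estimate total storage in GB (using 1 GB = 1000 MB).

    copies        -- assumed number of stored copies (original plus backups)
    growth_margin -- assumed extra fraction for derived and intermediate files
    """
    raw_mb = files_per_session * mb_per_file * sessions
    total_mb = raw_mb * copies * (1 + growth_margin)
    return total_mb / 1000


# Hypothetical project: 40 images of 25 MB each, collected weekly for a year.
estimate = estimate_storage_gb(files_per_session=40, mb_per_file=25, sessions=52)
print(f"Estimated storage: {estimate:.1f} GB")
```

With these example figures the raw data come to 52,000 MB, and three copies plus a 20% margin bring the estimate to about 187.2 GB.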
Several helpful resources
A Software Carpentry guide to data management, in particular metadata and version control.
UC Davis researchers have access to the DMPTool, a service of the California Digital Library, with their Kerberos login. The tool contains templates from multiple federal and private funders. It also lets the user create an editable document for submission to a funding agency, and can accommodate different versions as funding requirements change.
DMPTool also provides guidance for writing DMPs.
Most GIS programs will allow you to create basic metadata that will reside alongside the spatial and attribute data you create. Several government agencies and standards bodies have developed metadata standards for geospatial data. You should select a standard to follow based on what information you need to convey to potential users and who those users will likely be. Funding bodies may also set requirements that should be considered.
Description of Data Creation
Spatial analysis can generate a large number of intermediate files. Document the analysis workflow as you perform it, noting which files and processes were used to generate each subsequent file. Some researchers write out a list of steps, while others use a flowchart or a software tool such as ArcGIS ModelBuilder.
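If you prefer to keep the list of steps in a machine-readable form, a small script can record which inputs produced which outputs. This is only a sketch under stated assumptions: the functions `record_step` and `sources_of`, the log structure, and the file names are hypothetical and not tied to any particular GIS package.

```python
import json
from datetime import datetime, timezone


def record_step(log, description, inputs, outputs, tool):
    """Append one workflow step to the log and return the new entry."""
    entry = {
        "step": len(log) + 1,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "description": description,
        "tool": tool,
        "inputs": list(inputs),
        "outputs": list(outputs),
    }
    log.append(entry)
    return entry


def sources_of(log, filename):
    """Return every file that directly or indirectly contributed to filename."""
    found = set()
    pending = [filename]
    while pending:
        current = pending.pop()
        for entry in log:
            if current in entry["outputs"]:
                for source in entry["inputs"]:
                    if source not in found:
                        found.add(source)
                        pending.append(source)
    return found


# Hypothetical two-step workflow.
log = []
record_step(log, "Reproject parcels", ["parcels_raw.shp"],
            ["parcels_utm.gpkg"], tool="example GIS tool")
record_step(log, "Clip to study area", ["parcels_utm.gpkg", "study_area.gpkg"],
            ["parcels_clipped.gpkg"], tool="example GIS tool")

print(json.dumps(log, indent=2))
print(sorted(sources_of(log, "parcels_clipped.gpkg")))
```

Saving the log as JSON next to the data gives later users the same information a flowchart would, in a form that can be searched.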
Sharing & Preservation
The files we work with for analysis are not necessarily in the ideal format for sharing or storing geospatial data. For example, when sending a shapefile, it is easy to forget one of the multiple files required to use the data properly. Consider storing and sharing data in open formats (i.e., formats that do not need specific software to open them) to make your data accessible to the largest number of people.
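The shapefile problem described above can be checked before sending data. A shapefile needs at least the `.shp`, `.shx`, and `.dbf` files, and a `.prj` file is commonly included to carry the coordinate system. The sketch below looks for those companions; the function name `missing_shapefile_parts` and the demo file names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")  # mandatory parts of a shapefile
RECOMMENDED = (".prj",)              # coordinate system definition


def missing_shapefile_parts(shp_path):
    """List the companion files that are absent next to the given .shp path."""
    base = Path(shp_path)
    # Collect the extensions of all files sharing the same base name.
    present = {p.suffix.lower() for p in base.parent.iterdir()
               if p.stem == base.stem}
    return {
        "required": [ext for ext in REQUIRED if ext not in present],
        "recommended": [ext for ext in RECOMMENDED if ext not in present],
    }


# Demo in a temporary folder with a deliberately incomplete shapefile.
with tempfile.TemporaryDirectory() as folder:
    for ext in (".shp", ".dbf"):
        (Path(folder) / f"roads{ext}").touch()
    report = missing_shapefile_parts(Path(folder) / "roads.shp")

print(report)
```

In the demo only `roads.shp` and `roads.dbf` exist, so the report lists `.shx` as a missing required file and `.prj` as a missing recommended one.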
Metadata standards and controlled vocabularies
Metadata is information about a data set. Typically metadata is created to help potential users understand how the data was created and other important factors that cannot be determined by looking at the data itself. Various organizations have created metadata standards to guide data developers to provide key metadata and standardize how metadata is written within a given field of research. For example, if you are working with sequencing data, in many cases you will be required to submit data to the Sequence Read Archive. We can help you prepare to collect the right metadata, so that the submission process goes smoothly at the end of your research project.
A controlled vocabulary is a list of words or phrases that can be used in response to a question in a survey or field in a database. Reasons to use a controlled vocabulary include reducing variation in responses, preventing extraneous variants of the same term (such as spelling mistakes or plurals), or making it easier for participants to provide a response.
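As an illustration of how a controlled vocabulary reduces variation, the sketch below normalizes free-text responses against a fixed list. The vocabulary `ALLOWED_HABITATS`, the `VARIANTS` mapping, and the function `normalize` are hypothetical examples, not terms from any published standard.

```python
# Hypothetical controlled vocabulary for a "habitat" field.
ALLOWED_HABITATS = {"forest", "grassland", "wetland"}

# Known variants (plurals, spacing) mapped to the preferred term.
VARIANTS = {
    "forests": "forest",
    "grass land": "grassland",
    "wetlands": "wetland",
}


def normalize(response):
    """Return the preferred term for a response, or raise ValueError."""
    # Lowercase, trim, and collapse repeated whitespace.
    cleaned = " ".join(response.strip().lower().split())
    cleaned = VARIANTS.get(cleaned, cleaned)
    if cleaned not in ALLOWED_HABITATS:
        raise ValueError(f"{response!r} is not in the controlled vocabulary")
    return cleaned


accepted, rejected = [], []
for raw in ["  Forests ", "Grass  land", "wetland", "medow"]:
    try:
        accepted.append(normalize(raw))
    except ValueError:
        rejected.append(raw)

print("accepted:", accepted)
print("rejected:", rejected)
```

Rejected values, such as the misspelling in the example, can then be reviewed by hand instead of silently entering the dataset.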
Data collection and analysis need to be well documented for the data to be useful. Different disciplines have different conventions for recording them. If you do not have an established convention available to you, consider adopting one of the following:
A protocol or a standard operating procedure (SOP) documents the actions involved in sample processing and data collection.
A log documents actions taken to either collect data or analyze a dataset with specific software.
A codebook is a document that lists the codes and meanings assigned to each code used in a research project.
A README file describes the files present in a file collection, gives more information about a given file, or describes a piece of software or an analysis script. These two documents, structural_readme_and_naming_conventions and analysis_readme, are based on Georgia Tech Library and Stanford Library recommendations and will help you get started in organizing your research files.
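A codebook of the kind described above can also be kept in a machine-readable form, so that coded values are translated the same way every time. The following is a minimal sketch; the variables, codes, and the function `decode_record` are invented for illustration.

```python
# Hypothetical codebook: variable name -> {code: meaning}.
CODEBOOK = {
    "sex": {1: "female", 2: "male", 9: "not reported"},
    "site": {"A": "north plot", "B": "south plot"},
}


def decode_record(record, codebook):
    """Replace coded values with their meanings; leave uncoded variables as-is."""
    decoded = {}
    for variable, value in record.items():
        codes = codebook.get(variable)
        if codes is None:
            # Variable is not coded (e.g. a measurement), so keep the raw value.
            decoded[variable] = value
        elif value in codes:
            decoded[variable] = codes[value]
        else:
            raise KeyError(f"{variable}={value!r} is not defined in the codebook")
    return decoded


record = {"sex": 2, "site": "A", "height_cm": 171}
decoded = decode_record(record, CODEBOOK)
print(decoded)
```

Raising an error on an undefined code, rather than passing it through, is a design choice that surfaces data entry problems early.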