Data Addition Protocol

Adding new data

Please follow the following instructions for adding new data, either digitized from the literature, or raw data obtained from experiments.

Data sources

These will typically come from published papers, dissertations, and reports.

Overall workflow

The overall steps are:

Compile the sources/papers that are candidates for digitization
Check if they have already been digitized into VecTraits/BioTraits
Add them to the Biotraits Mendeley library
Digitize
Validate your filled-in template on www.vectorbyte.org, see the validator documentation for more detail
Re-classify them on the Biotraits Mendeley Library as being digitized

Before digitizing data

Before starting to digitize new data, and in order to avoid duplicating data already in the database, please check your citations against those already in the database. To do this you can use the searchcite tool, which you can either obtain from the development team or from this repository.

Using the VecTraits/BioTraits template

When you have new data to include in the dataset, is really important to firstly "map" it to the template. There are here a few guidelines:

It can be a good idea to put your new data in a new directory within the BioTraitsDB repository under Digitized-Subsets and work on them there, so all changes you make are stored and shared there with the rest of the VecTraits/BioTraits team.
If you do not have access to the BioTraitsDB repository then it is worth keeping your data in an organised manner on a local drive until such a time when you do get access to the repository.
Open both template and raw data and first try to match all those fields for which you have data (you can start matching column names, maybe some of them are the same).
Pay special attention to inputting data into the following columns:
- originalid
- originaltraitvalue
- originaltraitunit
- location
- locationdate
- locationdateprecision
- citation
- published
- embargorelease
- submittedby
- contributoremail

If you are not sure abut the meaning of the field (column) name, you can have a look at the Field Definitions. If you still have doubts, just ask!
Fill the submittedby column with your name (if you have digitized the data) or the name of the appropriate person and put the corresponding e-mail address in the contributoremail column.
A biotraits data template containing all field names is available here.

File format

The upload procedures only work on CSV files. These can be exported from excel, R, or most other data software.

Do NOT use xls/xlsx files as the validator should always reject these as unreadable.

Missing Data

All fields with missing values should be left EMPTY.

Whilst the validator can deal with many common representations of missing data, there are plenty of nonstandard variants which may break the upload either at the validation stage or (more problematically) at the final upload stage.

These nonstandard blank values can also arise if exporting from certain tools or using unusual export mechanisms. Thus it is best to check for inserted NA values in your csv using a basic text editor or viewer such as Notepad on Windows, Textedit on Mac, or Gedit/cat/nano on Linux.

Standardizing original data reference

It is important that each row have a complete reference for the data source (unless it is an unpublished dataset). There is a column called Citation that contains the full citation. Having the citation in full is really important for retrieving the Digital Objects Identifier (DOI) of that reference. To obtain the DOIs (if not already provided), save a file containing all the full citations for which you need to find the DOI. Then you can use the ref2doi tool, which can be downloaded from the dgkontopoulos/ref2doi repository on BitBucket. A help/protocol file for using this tool is in the repository.

Standardizing taxonomy

We are currently building a taxonomy standardizer tool. This is based on the R package taxize (A guide to use this package can be found here. In the meantime you can directly use this package to retrieve all the taxonomic information.

First, you should check the species name (or minimum taxonomic level you have). This tool queries the Global Names Resolver through R and then parses the results.

We call the gnr_resolve tool by submitting the unique names in the interactor1 (or interactor2) column. Once it has finished, the output returns four different columns: submitted_name, matched_name, score (grade of similitude between submitted and matched names) and source (for each entry, the tool queries different sources to check the name). Then, we will match the submmited_name against the matched_name (the one queried by the tool), both in the results output. We will find that for some names we will have an exact match but for others not. For those names that there isn’t an exact match (e.g. the matched_name includes author and year besides species name depending on source), we will compare the scores. If for an entry, we get the same scores for all the matches, we can use any matched_name (but just keeping the species name and removing the extra information). Otherwise, we will use the matched name with the highest score.

Once we have checked all the names, we will retrieve the taxonomy using the tax_name function. You will need to select the database of interest to search for the taxonomy (NCBI is being used in the dataset as a first option, but if this is not available you can also use ITIS). First we will try to query the species name, but in case that doesn’t work, we will use the genus name. Finally, we will fill the dataset with our results.

Special cases

There are some special cases where the metabolic traits were measured not for a whole species but for a part of it. For example, the database can accommodate measures for tissues, leaves, etc. In these cases is necessary to distinguish between whole organism or part. There are then specific columns to do this: interactor1part and interactor1parttype.

Please let us know if you find any specific case for which you have problems, as this could help us to improve the dataset and make the data template as comprehensive and general as possible.

Storing the raw data

Your raw data can be stored in both .Rdata and .csv files, but CSV is preferred. Please, fill all the rows in your file and try to not leave blank spaces. You can use NULL or NA if you don’t have data to fill in some fields. Once the data have been mapped let us know. Then we will review it and import it into Biotraits.