
Summer Research Fellowship Programme of India's Science Academies

Disambiguation of plant binomial names and essential oil composition profiles

Shruthi M

Ramaiah Institute of Technology, MSRIT Post, M S Ramaiah Nagar, Bengaluru, Karnataka 560054

Dr Gitanjali Yadav

National Institute of Plant Genome Research, Aruna Asaf Ali Marg, New Delhi, Delhi 110067

Dr. Peter Murray-Rust

Department of Chemistry, University of Cambridge, United Kingdom

Abstract

EssOilDB, the ESSential OIL DataBase (http://www.nipgr.ac.in/Essoildb/), is a continually updated knowledge resource that contains experimental records of essential oil composition data from published reports. It also contains information on related geo-morphological factors at the time of collection and extraction, in order to contextualize volatile profile patterns from a biological perspective. EssOilDB provides an opportunity for context-based scientific research through a multitude of queries on volatile profiles of native, invasive, normal or stressed plants, across taxonomic clades, geographical locations and several other biotic and abiotic influences. It contains records of emitted essential oils spanning a century of published reports on volatile profiles. Database normalization, or disambiguation, is the organization of the data in a database. It is a systematic, multi-step process that puts data into tabular form and removes duplicated records from the relation tables. Normalization serves two main purposes: eliminating redundant data and ensuring that data dependencies make sense, i.e., that data is logically stored. This project involves the normalization of plant names and profiles already existing in EssOilDB version 1.0. Currently, the data contains 1838 plant names and 7157 compound emission records. The inconsistencies in the data include typographical errors, duplications and the introduction of special characters. The tools used during this project include R packages such as taxize, wikitaxa, WikidataR and Taxonstand. Other public databases used during the course of this project are uBio NameBank, the National Biodiversity Network (NBN), the National Center for Biotechnology Information (NCBI), the Catalogue of Life (COL), The Plant List (TPL), the Encyclopedia of Life (EOL), the Global Biodiversity Information Facility (GBIF) and the Integrated Taxonomic Information System (ITIS).

Keywords: EssOilDB, volatile emissions, essential oils, geo-morphological factors

Abbreviations

EssOilDB: ESSential OIL DataBase
TPL: The Plant List
GBIF: Global Biodiversity Information Facility
COL: Catalogue of Life
NCBI: National Center for Biotechnology Information
NBN: National Biodiversity Network
ITIS: Integrated Taxonomic Information System
CRAN: Comprehensive R Archive Network
UTF: Unicode Transformation Format

INTRODUCTION

Background

The EssOilDB (the ESSential OIL DataBase) is a continually updated knowledge resource for plant volatile emissions, containing experimental records of essential oil composition data, from published reports. EssOilDB also contains information on related geo-morphological factors at the time of collection and extraction in order to appreciate volatile profile patterns from a global perspective. EssOilDB provides an opportunity for context based scientific research, through a multitude of queries on volatile profiles of native, invasive, normal or stressed plants, across taxonomic clades, geographical locations and several other biotic and abiotic influences. It contains 123041 essential oil records spanning a century of published reports on volatile profiles, with data from 92 plant taxonomic families, spread across diverse geographical locations all over the globe. [Kumari S, et al., 2014]

R Programming

The name “R” refers to the computational environment initially created by Ross Ihaka and Robert Gentleman, similar in nature to the “S” statistical environment developed at Bell Laboratories (http://www.r-project.org/about.html). It has since been developed and maintained by a strong team of core developers (R-core), who are renowned researchers in computational disciplines. R has gained wide acceptance as a reliable and powerful modern computational environment for statistical computing and visualisation, and is now used in many areas of scientific computation. R is free software, released under the GNU General Public License; this means anyone can see all its source code, and there are no restrictive, costly licensing arrangements. [Eglen, 2009]

The R language is widely used by biologists, and now has over 5,000 packages on the Comprehensive R Archive Network (CRAN) to extend R. R is great for manipulating, visualizing and fitting statistical models to data.

Disambiguation of a database

Database normalization, or disambiguation, is the organization of the data in a database. It is a systematic, multi-step process that puts data into tabular form and removes duplicated records from the relation tables. Normalization serves two main purposes: eliminating redundant data and ensuring that data dependencies make sense, i.e., that data is logically stored. [https://searchsqlserver.techtarget.com/definition/normalization]

The use of taxonomic names is, unfortunately, not straightforward. Taxonomic names often vary due to name revisions at the generic or specific levels, lumping or splitting lower taxa (genera, species) among higher taxa (families), and name spelling changes.

Statement of Problems

  • This project involves the normalization of plant names and chemical profiles in EssOilDB version 1.0. Currently, the data contains 1838 plant names and 7157 profiles.
  • The inconsistencies in the data include typographical errors, duplications, erroneous scientific names, the introduction of special characters, and the lack of synonyms and of a suitable database structure.
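The kinds of inconsistency listed above can be illustrated with a minimal base R sketch. The sample names and the helper `clean_name()` are hypothetical and are not taken from the project code. The sketch assumes bare binomials without author strings, since it lower-cases everything after the first letter and drops any character that is not a plain letter, space, dot or hyphen.

```r
# Hypothetical raw names showing stray whitespace, wrong case,
# a special character and an exact duplicate
raw_names <- c("Abies alba", "abies  alba ", "Acacia nuperrima",
               "Acacia nuperrima", "Achillea ageratum?")

clean_name <- function(x) {
  x <- gsub("[^A-Za-z .-]", "", x)     # drop stray special characters
  x <- trimws(gsub("\\s+", " ", x))    # collapse repeated whitespace
  # Capitalise the genus and lower-case the rest of the name
  paste0(toupper(substr(x, 1, 1)), tolower(substr(x, 2, nchar(x))))
}

cleaned <- clean_name(raw_names)
unique_names <- unique(cleaned)        # remove exact duplicates
print(unique_names)
```

Cleaning of this kind only removes surface noise. Misspelt or outdated names still have to be checked against a taxonomic database, which is the purpose of the tools described in the following sections.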

Scope

Essential oils have huge potential in pharmacology, both as preventive and as treatment agents for a range of health disorders. They are also used in aromatherapy, and they facilitate skin penetration and are used for the transdermal delivery of medicines. In addition to therapeutics, their commercial value in the food and cosmetic industries has increased tremendously. Apart from scientists, laypeople, entrepreneurs and farmers can also benefit from this database.

LITERATURE REVIEW

EssOilDB 1.0

Each EssOilDB record corresponds to the amount of emission of a particular compound in a specific oil profile. If a single journal article lists three different sets of volatile profiles, say for three different plant parts or under three independent stresses, the datasets are treated as three independent records. Currently, the database contains a total of 123,041 such records spanning a century of published reports of essential oil profiles, from the early 1900s to date. These records have been sourced from over 1520 citations, and the data include 1618 plant species, subspecies or varieties representing 92 distinct taxonomic families, encompassing the entire range from ancient and lower plants like chlorophytes and mosses to the gymnosperms and angiosperms. [Kumari S, et al., 2014] Fig 1 shows the various plant-specific and chemical-specific keys.

Fig. 1: Snapshot of the EssOilDB page showing the various plant-specific and chemical-specific keys

    R Studio

    RStudio provides popular open source and enterprise-ready professional software for the R statistical computing environment. It is an Integrated Development Environment (IDE) which aids in the development of R programs. [Allaire, 2012]

    Fig. 2: Screenshot of the RStudio window showing details of the R version being used

      R Package 'taxize'

      taxize is a taxonomic toolbelt for R. It provides a suite of R functions that interact with many taxonomic data sources on the web through their APIs (Table 1).

      Table 1: Examples of key functions in taxize, what they do, and their data sources

      Function name | What it does | Source
      eol_search | Search EOL taxon information | Encyclopedia of Life http://eol.org/
      get_tsn | Get ITIS TSN | Integrated Taxonomic Information System http://www.itis.gov/
      get_uid | Get NCBI UID | National Center for Biotechnology Information
      gnr_resolve | Resolve names using EOL's global names index | Global Names Resolver http://resolver.globalnames.org/
      iucn_status | IUCN status | IUCN Red List http://www.iucnredlist.org
      searchbycommonname | Search ITIS by common name | Integrated Taxonomic Information System http://www.itis.gov/
      searchbyscientificname | Search ITIS by scientific name | Integrated Taxonomic Information System http://www.itis.gov/
      tax_rank | Get rank of a taxonomic name | Various

      R Package 'Taxonstand'

      The Taxonstand package automates the standardization of taxonomic names and the removal of orthographic errors in plant species names using 'The Plant List' website (www.theplantlist.org). [Luis Cayuela & Anke Stein, 2017]

      The Plant List

      The Plant List (http://www.theplantlist.org/) is an on‐line database of plant names that aims to be comprehensive for all described plant species. Version 1 of The Plant List includes 1 040 426 plant name records, of which 298 900 are accepted names. The Plant List is the product of a consortium of the Royal Botanic Gardens, Kew, and the Missouri Botanical Garden. [Kalwij, 2012]

      Fig. 3: Screenshot showing the homepage of The Plant List

      Fig. 4: Snapshot of the result in the TPL page obtained after submitting 'Abies alba' as a query

          R Package 'wikitaxa' - Taxonomy data from Wikipedia

          The goal of wikitaxa is to allow search and taxonomic data retrieval from across many Wikimedia sites, including: Wikipedia, Wikicommons, and Wikispecies. There are lower level and higher level parts to the package API:

          1. Low level API: The low level API is meant for power users and gives you more control, but requires more knowledge.

          • wt_wiki_page()
          • wt_wiki_page_parse()
          • wt_wiki_url_build()
          • wt_wiki_url_parse()
          • wt_wikispecies_parse()
          • wt_wikicommons_parse()
          • wt_wikipedia_parse()

          2. High level API: The high level API is meant to be easier and faster to use.

          • wt_data()
          • wt_data_id()
          • wt_wikispecies()
          • wt_wikicommons()
          • wt_wikipedia()

          Search functions:

          • wt_wikicommons_search()
          • wt_wikispecies_search()
          • wt_wikipedia_search()

          [Scott Chamberlain, 2018]

          Wikidata

          Wikipedia has been collecting increasing amounts of structured data: numbers, dates, coordinates, and many types of relationships from family trees to the taxonomy of species. This data has become a resource of enormous value, with potential applications across all areas of science, technology, and culture. Actual uses of the data are rare and often restricted to very specific pieces of information, such as the geo-tags of Wikipedia articles used in Google Maps. The reason for this striking gap between vision and reality is that Wikipedia’s data is buried within 30 million Wikipedia articles in 287 languages, from where it is very difficult to extract. The same information often appears in articles in many languages and on many articles within a single language. [Vrandečić, et al, 2014]

          The goal of Wikidata is to overcome these problems by creating new ways for Wikipedia to manage its data on a global scale. It has the following features:

          • Open Editing: Like Wikipedia, Wikidata allows every user of the site to extend and edit the stored information, even without creating an account. A form-based interface makes editing very easy.
          • Community Control: Not only the actual data but also the schema of the data is controlled by the contributor community. Contributors edit the population number of Rome, but they also decide that there is such a number in the first place.
          • Plurality: It would be naive to expect global agreement on the ‘true’ data, since many facts are disputed or simply uncertain. Wikidata allows conflicting data to coexist and provides mechanisms to organize this plurality.
          • Secondary Data: Wikidata gathers facts published in primary sources, together with references to these sources. There is no ‘true population of Rome’, but a ‘population of Rome as published by the city of Rome in 2011’.
          • Multilingual Data: Most data is not tied to one language: numbers, dates, and coordinates have universal meaning; labels like Rome and population are translated into many languages. Wikidata is multi-lingual by design. While Wikipedia has independent editions for each language, there is only one Wikidata site.
          • Easy Access: Wikidata’s goal is to allow data to be used both in Wikipedia and in external applications. Data is exported through Web services in several formats, including JSON and RDF. Data is published under legal terms that allow the widest possible reuse.
          • Continuous Evolution: In the best tradition of Wikipedia, Wikidata grows with its community and tasks. Instead of developing a perfect system that is presented to the world in a couple of years, new features are deployed incrementally and as early as possible.

          [Vrandečić, et al, 2014]

          Fig. 5: Screenshot showing the homepage of Wikidata

            R Package 'WikidataR'

            WikidataR is an API client for the Wikidata store of semantic data. [Oliver Keyes, 2017]

            Global Biodiversity Information Facility (GBIF)

            GBIF, the Global Biodiversity Information Facility, is an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. [https://www.gbif.org/en/what-is-gbif]

            Fig. 6: Screenshot of the GBIF webpage

              WORK FLOW

              Concepts

              Resolution of binomial names using taxize

              The resolution of names is performed using the function gnr_resolve(), the syntax for which is given in Fig 7:

              Fig. 7: Usage of gnr_resolve()

                Arguments:

                • names - character; taxonomic names to be resolved. Doesn’t work for vernacular/common names.
                • data_source_ids - character; IDs to specify what data source is searched.
                • resolve_once - logical; Find the first available match instead of matches across all data sources with all possible renderings of a name. When TRUE, response is rapid but incomplete.
                • with_context - logical; Reduce the likelihood of matches to taxonomic homonyms. When TRUE a common taxonomic context is calculated for all supplied names from matches in data sources that have classification tree paths. Names out of determined context are penalized during score calculation.
                • canonical - logical; If FALSE (default), gives back names with taxonomic authorities. If TRUE, returns canonical names (without taxonomic authorities and abbreviations).
                • highestscore - logical; Return those names with the highest score for each searched name? Defunct
                • best_match_only - (logical) If TRUE, best match only returned. Default: FALSE
                • preferred_data_sources - (character) A vector of one or more data source IDs.
                • with_canonical_ranks - (logical) Returns names with infraspecific ranks, if present. If TRUE, we force canonical=TRUE, otherwise this parameter would have no effect. Default: FALSE
                • http - The HTTP method to use, one of "get" or "post". Default: "get". Use http="post" with large queries. Queries with > 300 records use "post" automatically because "get" would fail
                • cap_first - (logical) For each name, fix so that the first name part is capitalized, while others are not. This web service is sensitive to capitalization, so you’ll get different results depending on capitalization. First name capitalized is likely what you’ll want and is the default. If FALSE, names are not modified. Default: TRUE
                • fields - (character) One of minimal (default) or all. Minimal gives back just four fields, whereas all gives all fields back.
                • ... Curl options passed on to crul::HttpClient

                [Scott Chamberlain, 2017]
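A call using a few of the arguments listed above might look like the following sketch. The wrapper `resolve_names()` and its `resolver` parameter are assumptions introduced here for illustration; they are not part of taxize or of the project code. The names are passed positionally because the name of the first argument may differ between taxize releases, and gnr_resolve() may not be available in every release.

```r
# Sketch: resolve a vector of binomials, keeping only the best match per name.
# `resolver` defaults to taxize::gnr_resolve; it is a parameter only so that
# the function can be exercised without network access.
resolve_names <- function(plant_names, resolver = taxize::gnr_resolve) {
  res <- resolver(plant_names,
                  best_match_only = TRUE,  # return the best match only
                  canonical = TRUE)        # drop taxonomic authorities
  as.data.frame(res)
}

# Usage (requires the taxize package and an internet connection):
# resolved <- resolve_names(c("Abies alba", "Acacia caven"))
```

Setting best_match_only to TRUE keeps the result table close to one row per submitted name, which makes it easier to compare with the input list.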

                Resolution of binomial names using Taxonstand

                The resolution of names is performed using the function TPL(). Fig 8 shows the usage of TPL.

                Fig. 8: Usage of TPL()

                  Arguments:

                  • splist - A character vector specifying the input taxa, each element including genus and specific epithet and, potentially, author name and infraspecific abbreviation and epithet
                  • genus - A character vector containing the genera of plant taxon names. [Optional if taxa is submitted as input]
                  • species - A character vector containing the specific epithets of plant taxon names. [Optional if taxa is submitted as input]
                  • infrasp - A character vector containing the infraspecific epithets of plant taxon names. [Optional; required for specific queries only]
                  • infra - Logical. If TRUE (default), infraspecific epithets are used to match taxon names in TPL. [Optional; required for specific queries only]
                  • corr - Logical. If TRUE (default), spelling errors are corrected (only) in the specific and infraspecific epithets prior to taxonomic standardization. [Optional; required for specific queries only] [ Luis Cayuela & Anke Stein, 2017 ]

                  Retrieval of id from Wikidata using WikidataR

                  The ids of all the binomial names were retrieved from Wikidata. The function find_item() retrieves a set of Wikidata items whose aliases or descriptions match a particular search term. Its usage is shown below:

                  Fig. 9: Usage of find_item()

                    Arguments:

                    • search_term - a term to search for.
                    • language - the language to return the labels and descriptions in; this should consist of an ISO language code. Set to "en" by default.
                    • limit - the number of results to return; set to 10 by default.
                    • ... further arguments to pass to httr's GET.
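As a rough sketch of how the arguments above could be used to collect one id per plant name, consider the helper below. The function `first_wikidata_id()` and its `finder` parameter are hypothetical. The sketch also assumes that each search hit is a list carrying an `id` element, which should be checked against the installed version of WikidataR.

```r
# Sketch: return the first Wikidata id found for each name, or NA if none.
# `finder` defaults to WikidataR::find_item; it is a parameter only so that
# the function can be tried offline with a stand-in.
first_wikidata_id <- function(plant_names, finder = WikidataR::find_item) {
  vapply(plant_names, function(nm) {
    hits <- finder(nm, language = "en", limit = 1)
    # Assumption: each hit is a list with an `id` element
    if (length(hits) == 0) NA_character_ else hits[[1]]$id
  }, character(1))
}

# Usage (requires the WikidataR package and an internet connection):
# ids <- first_wikidata_id(c("Abies alba", "Acacia caven"))
```

Taking only the first hit is a simplification. A search term can match several items, so the returned ids would still need to be checked before being stored.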

                    Methods

                    Resolution of binomial names using the taxize package

                    This was the first step taken in the project. The names were obtained from EssOilDB v1.0 in csv format [See Appendix A]. This file was the input ("a") for this step.

                    Fig. 10: The code run for resolution of binomial names using the function gnr_resolve()

                      Resolution of binomial names using the Taxonstand package

                      The Taxonstand package uses The Plant List database, which is not used by the taxize package, so this step was carried out to resolve the names against this database as well. The input file was the same as in the previous step.

                      Fig. 11: The code run for resolution of binomial names using the TPL() function

                        Normalization of binomial names using GBIF web interface

                        1. A separate Microsoft Excel document containing only a single column of plant names was created.

                        2. The web page https://www.gbif.org/en/tools/species-lookup was used during this step.

                        3. A screenshot of the web page is shown below:

                        Fig. 12: Screenshot of the GBIF species name matching webpage

                          4. The newly created document was uploaded into the space provided (on the web page) after renaming the header as “scientificName”.

                          5. After submission, the following is displayed on screen:

                          Fig. 13: Screenshot of the page obtained after submitting the file

                            6. The kingdom “Plantae” was selected and the “MATCH TO GBIF BACKBONE” button was clicked.
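The upload file described in steps 1 and 4 could also be prepared from R rather than by hand. The sketch below is a minimal illustration that assumes the lookup tool accepts a CSV file; the file name is hypothetical, and the three names are taken from the sample tables in this report.

```r
# Build a single-column table whose header is the one the GBIF tool expects
plant_names <- c("Abies alba", "Abies cephalonica", "Acacia caven")
upload <- data.frame(scientificName = plant_names, stringsAsFactors = FALSE)

# Write it to a temporary location as UTF-8 encoded CSV, without row numbers
upload_path <- file.path(tempdir(), "gbif_species_lookup.csv")
write.csv(upload, upload_path, row.names = FALSE, fileEncoding = "UTF-8")
```

Writing the header directly as scientificName avoids the manual renaming mentioned in step 4.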

                            Retrieval of the synonyms of the plant names

                            The following procedure was followed in order to retrieve the synonyms:

                            1. The normalized plant names were used during this step.

                            2. The taxize package has functions which can obtain the id and the synonyms for the respective taxon.

                            3. The Catalogue of Life database was used in this step, as it contained most of the taxa.

                            4. The taxize functions which deal with this database are:

                            a) get_colid()

                            b) synonyms()

                            Fig 14 shows the code run for obtaining the synonyms of one of the plants, Abies alba:

                            Fig. 14: The code run for obtaining the synonyms of Abies alba

                              Retrieval of common names of plants

                              The following steps were followed in order to obtain the common names:

                              1. The normalized names were retrieved from the file and were encoded with UTF-8 (to ensure that the special characters are retained in the data during processing). The code used was as follows:

                              Fig. 15: The code used for extracting the required column (row 1) and for encoding the extracted column (row 2)

                                2. The wikitaxa package in R was used during this step. The code used is shown below:

                                Fig. 16: The code used to obtain the common names of plants
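The encoding part of step 1 can be sketched in base R as follows. This is an illustration only, not the code shown in the figures. The two names are taken from the sample tables in this report, and the non-ASCII letter is written as an escape so that the script itself stays plain ASCII.

```r
# Names with and without a non-ASCII character in the author string
sci_names <- c("Acalypha segetalis M\u00fcll.Arg.", "Abies alba Mill.")

# Convert to UTF-8 so that special characters survive later processing
utf8_names <- enc2utf8(sci_names)

# Check that every element is valid UTF-8 before passing it on
print(validUTF8(utf8_names))
```

A check of this kind is useful before sending names to a web service, since a wrongly encoded character can turn an otherwise correct name into a failed match.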

                                  Extraction of Wiki id for the plants

                                  The Wiki id corresponding to each taxon was retrieved using the R platform. The column containing scientific names was extracted from the GBIF output file and encoded in UTF-8 format. The code used is as follows:

                                  Fig. 17: The code run in order to extract the Wiki id

                                    Classification of binomial names based on their status

                                    A) The following code was run on the R platform to extract the information from the GBIF output file ("a"):

                                    Fig. 18: The code used for the extraction of columns from the input file

                                      B) The following code was used to determine whether the taxonomic name is a synonym or not:

                                      Fig. 19: The code used to determine whether the binomial names are synonyms or not

                                        C) The following code was used to determine whether the taxonomic name is accepted or not:

                                        Fig. 20: The code used to determine whether the binomial names are accepted or not
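Steps A to C can be approximated with the base R sketch below. It is not the code shown in the figures. The data frame holds a few rows from the GBIF sample output given in the results, and the column names `is_synonym` and `is_accepted` are introduced here for illustration.

```r
# A few rows from the GBIF output: submitted name and taxonomic status
gbif <- data.frame(
  verbatimScientificName = c("Abies alba", "Acacia caven",
                             "Achillea beibersteinii", "Achillea biebersteinii"),
  status = c("ACCEPTED", "SYNONYM", "DOUBTFUL", "SYNONYM"),
  stringsAsFactors = FALSE
)

# Flag each name as synonym or accepted
gbif$is_synonym  <- gbif$status == "SYNONYM"
gbif$is_accepted <- gbif$status == "ACCEPTED"

# Group the submitted names by status and count each class
by_status <- split(gbif$verbatimScientificName, gbif$status)
print(table(gbif$status))
```

Keeping the two flags as separate logical columns makes it clear that DOUBTFUL names are neither accepted nor synonyms and need a separate look.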

                                          RESULTS AND DISCUSSION

                                          Resolution of Binomial Names Using the Taxize Package

                                          A table containing the results was obtained after running the function gnr_resolve(); a sample of which is given below:

                                          Table 2: A sample of the results table obtained using gnr_resolve()

                                          row | user_supplied_name | submitted_name | matched_name | data_source_title | score
                                          1 | Abies alba | Abies alba | Abies alba | NCBI | 0.988
                                          2 | Abies borisii-regis | Abies borisii-regis | Abies borisii-regis | NCBI | 0.988
                                          3 | Abies cephalonica | Abies cephalonica | Abies cephalonica | NCBI | 0.988
                                          4 | Abies sachalinensis | Abies sachalinensis | Abies sachalinensis | NCBI | 0.988
                                          5 | Acacia caven | Acacia caven | Acacia caven | Freebase | 0.988
                                          6 | Acacia nuperrima | Acacia nuperrima | Acacia nuperrima | NCBI | 0.988
                                          7 | Acacia nuperrima | Acacia nuperrima | Acacia nuperrima | NCBI | 0.988
                                          8 | Acalypha segetalis | Acalypha segetalis | Acalypha segetalis | EOL | 0.988
                                          9 | Achillea abrotanoides | Achillea abrotanoides | Achillea abrotanoides | NCBI | 0.988
                                          10 | Achillea ageratum | Achillea ageratum | Achillea ageratum | NCBI | 0.988

                                          Resolution of Binomial Names Using the Taxonstand Package

                                          A table containing the results was obtained after running the function TPL(). The table contained the following columns:

                                          • Taxon
                                          • Genus
                                          • Hybrid.marker
                                          • Species
                                          • Abbrev
                                          • Infraspecific.rank
                                          • Infraspecific
                                          • Authority
                                          • ID
                                          • Plant.Name.Index
                                          • TPL.version
                                          • Taxonomic.status
                                          • Family
                                          • New.Genus
                                          • New.Hybrid.marker
                                          • New.Species
                                          • New.Infraspecific.rank
                                          • New.Infraspecific
                                          • New.Authority
                                          • New.ID
                                          • New.Taxonomic.status
                                          • Typo
                                          • WFormat
                                          • Higher.level
                                          • Date

                                          Some of the important columns are shown below:

                                          Table 3: A few columns of the results table obtained using TPL()

                                          Taxon | Taxonomic.status | New.Genus | New.Species | New.Authority | New.Taxonomic.status | Typo
                                          Abies alba | Accepted | Abies | alba | Mill. | Accepted | FALSE
                                          Abies borisii-regis | Accepted | Abies | borisii-regis | Mattf. | Accepted | FALSE
                                          Abies cephalonica | Accepted | Abies | cephalonica | Loudon | Accepted | FALSE
                                          Abies sachalinensis | Accepted | Abies | sachalinensis | (F.Schmidt) Mast. | Accepted | FALSE
                                          Acacia caven | Accepted | Acacia | caven | (Molina) Molina | Accepted | FALSE
                                          Acacia nuperrima | Accepted | Acacia | nuperrima | Baker f. | Accepted | FALSE
                                          Acacia nuperrima | Accepted | Acacia | nuperrima | Baker f. | Accepted | FALSE
                                          Acalypha segetalis | Accepted | Acalypha | segetalis | Müll.Arg. | Accepted | FALSE
                                          Achillea abrotanoides | Accepted | Achillea | abrotanoides | (Vis.) Vis. | Accepted | FALSE
                                          Achillea coarctata | Accepted | Achillea | coarctata | Poir. | Accepted | FALSE
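One way to use these columns is sketched below in base R. The standardized name is rebuilt from New.Genus and New.Species, and the Typo column picks out the names that were corrected. The first two rows come from the sample table; the third row is a hypothetical misspelling added only for illustration, since none of the sample rows has Typo set to TRUE. The column name `standardized` is also introduced here.

```r
# Rows 1-2 follow the sample table; row 3 is a HYPOTHETICAL misspelt entry
tpl <- data.frame(
  Taxon       = c("Abies alba", "Acacia caven", "Achillea coarctatta"),
  New.Genus   = c("Abies", "Acacia", "Achillea"),
  New.Species = c("alba", "caven", "coarctata"),
  Typo        = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rebuild the standardized binomial from the corrected genus and epithet
tpl$standardized <- paste(tpl$New.Genus, tpl$New.Species)

# List the names whose spelling was corrected
corrected <- tpl[tpl$Typo, c("Taxon", "standardized")]
print(corrected)
```

Listing the corrected names side by side with the originals makes it easy to review each spelling change before it is written back to the database.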

                                          Normalization of Binomial Names Using GBIF Web Interface

                                          The following image (Fig. 21) shows a sample of the data obtained after step (6) mentioned in Section 5.2.3:

                                          Fig. 21: Screenshot of the results page obtained after species lookup in GBIF

                                            1. A csv file containing results was obtained by selecting the option - “Generate CSV” which is displayed at the end of the results page.

                                            2. The resulting file contains the following columns:

                                            • occurrenceId
                                            • verbatimScientificName (user-submitted name)
                                            • scientificName (name existing in the database)
                                            • key (unique number assigned to the particular species on GBIF)
                                            • matchType (3 levels of result - EXACT, FUZZY, HIGHERRANK)
                                              • EXACT means the name exactly matches with the entry in the database
                                              • FUZZY indicates entries that may be mis-spelt
                                              • HIGHERRANK implies that the specific epithet of the entry is not being recognized (in other words, only genus is recognized)
                                            • confidence (expressed in terms of percentage)
                                            • status (3 levels of result - ACCEPTED, SYNONYM or DOUBTFUL)
                                              • ACCEPTED Treated as accepted
                                              • DOUBTFUL Treated as accepted, but doubtful whether this is correct.
                                              • SYNONYM A general synonym, the exact type is unknown.
                                            • rank (the highest rank recognized)
                                            • kingdom
                                            • phylum
                                            • class
                                            • order
                                            • family
                                            • genus
                                            • species

                                            Some of the important columns are shown below:

                                            Table 4: A few important columns of the table obtained from GBIF

                                            verbatimScientificName | scientificName | key | matchType | status
                                            Abies alba | Abies alba Mill. | 2685484 | EXACT | ACCEPTED
                                            Abies borisii-regis | Abies borisii-regis Mattf. | 2685519 | EXACT | ACCEPTED
                                            Abies cephalonica | Abies cephalonica Loudon | 2685326 | EXACT | ACCEPTED
                                            Abies sachalinensis | Abies sachalinensis Mast. | 2685437 | EXACT | ACCEPTED
                                            Acacia caven | Acacia caven (Molina) Molina | 2979244 | EXACT | SYNONYM
                                            Acacia nuperrima | Acacia nuperrima Baker f. | 2980107 | EXACT | ACCEPTED
                                            Acalypha segetalis | Acalypha segetalis Müll.Arg. | 3056915 | EXACT | ACCEPTED
                                            Achillea ageratum | Achillea ageratum L. | 3120391 | EXACT | ACCEPTED
                                            Achillea beibersteinii | Achillea beibersteinii Afan. | 7400456 | EXACT | DOUBTFUL
                                            Achillea biebersteinii | Achillea biebersteinii C.Afan. | 3120276 | EXACT | SYNONYM
                                            Ajuga austro-iranica | Ajuga austroiranica Rech.f. | 3888049 | FUZZY | ACCEPTED

                                            Table 4 made it possible to identify some duplicate entries and misspelt binomials. The different classes of 'status' were also analyzed.
                                            Names with a FUZZY matchType were corrected to match the GBIF entry. For example, the last entry in the table, Ajuga austro-iranica, is FUZZY because the verbatim name has an additional "-" in the middle of the specific epithet; GBIF lists it as Ajuga austroiranica.
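The correction of FUZZY matches described above can be sketched as follows. This is a minimal Python illustration, not the R workflow used in the project: the two rows are copied from the table, while the helper names `binomial` and `rectify` are hypothetical, and the sketch assumes that the first two words of scientificName form the binomial and that everything after them is the author citation.

```python
# Two rows copied from the GBIF output table (only the columns needed here).
rows = [
    {"verbatimScientificName": "Abies alba",
     "scientificName": "Abies alba Mill.",
     "matchType": "EXACT"},
    {"verbatimScientificName": "Ajuga austro-iranica",
     "scientificName": "Ajuga austroiranica Rech.f.",
     "matchType": "FUZZY"},
]


def binomial(scientific_name):
    """Return genus and specific epithet, dropping the author citation.

    Assumption: the name is a plain binomial, so the first two
    whitespace-separated words are the genus and the epithet.
    """
    return " ".join(scientific_name.split()[:2])


def rectify(row):
    """Return the name to keep for one row (hypothetical helper)."""
    if row["matchType"] == "FUZZY":
        # The verbatim spelling is suspect, so take the GBIF spelling.
        return binomial(row["scientificName"])
    return row["verbatimScientificName"]


corrected = [rectify(row) for row in rows]
print(corrected)
```

Names with an EXACT match are left untouched, so only the spelling of flagged entries changes. Hybrids and names with infraspecific ranks would need a different rule than "first two words".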

                                            [Refer to Appendix B for the GBIF output file and Appendix C for the file containing normalized names]

                                            Retrieval of the Synonyms of the Plant Names

                                            Synonyms were obtained for 1202 of the 1838 plants by following the steps described in Section 5.2.4. As an example, the synonyms returned for Abies alba are shown in Fig. 22:

                                            synonym_result.png
                                              Screenshot showing the synonyms of Abies alba

                                              [Refer to Appendix D for the complete file]

                                              Retrieval of Common Names of Plants

                                              Following the steps described in Section 5.2.5, common names were obtained for 378 of the 1838 plants. Fig. 23 shows the common names of ten plants.

                                              common_names_result_1.png
                                                Screenshot showing common names of the first 50 plant names

                                                [Refer to Appendix D for the complete file]

                                                Extraction of Wiki id for the Plants

                                                Following the steps described in Section 5.2.6, Wiki IDs were obtained for 1710 of the 1838 plants. Fig. 24 shows the first 50 plant names with their respective Wiki IDs:

                                                wikkid_result.png
                                                  Screenshot showing the first 50 plant names with their respective Wiki IDs

                                                  [Refer to Appendix D for the complete file]

                                                  Classification of Binomial Names Based on Their Status

                                                  A) A vector indicating whether each plant name is a synonym was obtained. A sample is shown below:

                                                  syn(y:n)_1.png
                                                    Sample results showing whether the plant name is a synonym or not

                                                    B) A vector indicating whether each plant name is accepted was obtained. A sample is shown below:

                                                    accepted(y:n).png
                                                      Sample results showing whether the plant name is accepted or not
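The two vectors described in A) and B) can be derived from the GBIF 'status' column along the following lines. This is a hedged Python sketch rather than the code used in the project: the "Y"/"N" encoding is an assumption, as is the choice to count DOUBTFUL names as not accepted. The names and statuses are taken from the GBIF table shown earlier.

```python
# (name, status) pairs taken from the GBIF output table.
records = [
    ("Abies alba", "ACCEPTED"),
    ("Acacia caven", "SYNONYM"),
    ("Achillea beibersteinii", "DOUBTFUL"),
    ("Achillea biebersteinii", "SYNONYM"),
]


def flag_vector(records, target_status):
    """Return one "Y"/"N" flag per record, in the original order.

    Assumption: a name gets "Y" only when its status equals
    target_status exactly, so DOUBTFUL is neither synonym nor accepted.
    """
    return ["Y" if status == target_status else "N"
            for _, status in records]


is_synonym = flag_vector(records, "SYNONYM")
is_accepted = flag_vector(records, "ACCEPTED")

for (name, _), syn, acc in zip(records, is_synonym, is_accepted):
    print(f"{name}: synonym={syn}, accepted={acc}")
```

Keeping the two vectors separate, instead of a single accepted/synonym flag, leaves room for DOUBTFUL entries, which fall into neither group under this assumption.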

                                                      CONCLUSIONS

                                                      Through this project, I gained a practical understanding of RStudio and R packages, as well as of database management. The errors identified in the data include duplications, special characters introduced into binomial names, hybrids, and typographical errors. After the inadvertent special characters were removed, 99 typographical errors were identified and rectified. In total, about 51 duplicate entries were identified. These duplicates have to be merged, and the profile keys associated with each set of duplicate entries should be mapped to a single plant. Some of the issues that are yet to be resolved are:

                                                      • There were 7 hybrids with incomplete information.
                                                      • Some plant names contain aff., which indicates that the plant has affinity to, but is not confirmed as, a particular species (for example, Mentha aff. rotundifolia has affinity to Mentha rotundifolia).
                                                      • Some entries, such as Cinnamomum fragrans and Serotinocarpum insignis, were not found in any of the databases consulted.
                                                      • The suffix spp. is appended to some genus names (for example, Xanthostemon spp.), so the entry does not identify a single species.
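Entries affected by the naming issues listed above could be flagged automatically before manual review. The following Python sketch is one possible approach, not part of the project: the regular expressions and category labels are assumptions, and "Mentha x piperita" is included only as an illustrative hybrid name, not as an entry known to be in the data.

```python
import re

# Hypothetical patterns for the kinds of names that need manual review.
PATTERNS = {
    "affinity": re.compile(r"\baff\.", re.IGNORECASE),
    "unspecified_species": re.compile(r"\bspp?\.(\s|$)", re.IGNORECASE),
    "hybrid": re.compile(r"(^|\s)[x×]\s", re.IGNORECASE),
}


def review_flags(name):
    """Return the sorted list of issue labels that apply to a plant name."""
    return sorted(label for label, pattern in PATTERNS.items()
                  if pattern.search(name))


names = [
    "Mentha aff. rotundifolia",
    "Xanthostemon spp.",
    "Mentha x piperita",  # illustrative hybrid, written with a plain "x"
    "Abies alba",
]

flags = {name: review_flags(name) for name in names}
for name, labels in flags.items():
    print(name, "->", labels or "no issue flagged")
```

A name with an empty list is not necessarily valid; entries that no database recognizes, such as those in the third bullet, can only be found by querying the databases themselves.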

                                                      ACKNOWLEDGEMENTS

                                                      The success and final outcome of this project required a lot of guidance and assistance from many people. I am grateful to the Indian Academy of Sciences, the Indian National Science Academy and The National Academy of Sciences, India, for providing me with this opportunity to carry out this project. I owe deep gratitude to my guide, Dr. Gitanjali Yadav, who accepted me and guided me through the project. I would also like to thank Dr. Peter Murray-Rust, emeritus professor at the Department of Chemistry, University of Cambridge, United Kingdom, for his continuous support in resolving the issues identified during the course of the project.

                                                      I would like to extend my gratitude to Mrs. Vineeta Lamba and Mr. Manish Kumar for introducing me to the project and for providing a strong foundation to work on. I am grateful to all the faculty members, research scholars and other employees at NIPGR for their assistance during the course of this project.

                                                      I am also very grateful to the Department of Biotechnology, Ramaiah Institute of Technology, for its guidance and encouragement during the application process for The Academies' Summer Research Fellowship Programme 2019.

                                                      Last but not least, I would like to thank my parents, other family members and friends, who consistently supported me during the course of this project.

                                                      APPENDICES

                                                      Appendix A:

                                                      Link to initial data: https://github.com/gilienv/EssOilDB/blob/master/v1.0/essoildb.plantdata.csv

                                                      Appendix B:

                                                      Link to the file containing normalized names and observed errors: https://github.com/gilienv/EssOilDB/blob/master/tables/plant/normalized_names.csv

                                                      Appendix C:

                                                      Link to the file containing normalized names and observed errors: https://github.com/gilienv/EssOilDB/blob/master/tables/plant/normalized_names.csv

                                                      Appendix D:

                                                      Link to the file containing the details such as Wiki ID, common names and synonyms: https://github.com/gilienv/EssOilDB/blob/master/tables/plant/details.txt

                                                      References

                                                      • https://searchsqlserver.techtarget.com/definition/normalization

                                                      • Kumari S, Pundhir S, Priya P, Jeena G, Punetha A, Chawla K, Jafaree Z, Mondal S and Yadav G (2014). EssOilDB: A database of essential oils reflecting terpene composition and variability in the plant kingdom. Database (DOI: 10.1093/database/bau120)

                                                      • Allaire, J. (2012). RStudio: integrated development environment for R. Boston, MA, 770.

                                                      • Luis Cayuela & Anke Stein (2017). https://CRAN.R-project.org/package=Taxonstand

                                                      • Kalwij, J. M. (2012). Review of ‘The Plant List, a working list of all plant species’. Journal of Vegetation Science, 23(5), 998-1002.

                                                      • Scott Chamberlain and Ethan Welty (2018). wikitaxa: Taxonomic Information from 'Wikipedia'. R package version 0.3.0. https://CRAN.R-project.org/package=wikitaxa

                                                      • Vrandečić, D., & Krötzsch, M. (2014). Wikidata: a free collaborative knowledge base.

                                                      • Oliver Keyes, Serena Signorelli, Christian Graul and Mikhail Popov (2017). WikidataR: API Client Library for 'Wikidata'. R package version 1.4.0. https://CRAN.R-project.org/package=WikidataR

                                                      • https://www.gbif.org/en/what-is-gbif

                                                      • Scott Chamberlain (2017). https://ropenscilabs.github.io/taxize-book/

                                                      Source

                                                      • Fig 1: nipgr.ac.in/Essoildb/
                                                      • Table 1: Chamberlain, S. A., & Szöcs, E. (2013). taxize: taxonomic search and retrieval in R. F1000Research, 2.
                                                      • Fig 3: http://www.theplantlist.org/
                                                      • Fig 5: https://www.wikidata.org/wiki/Wikidata:Main_Page
                                                      • Fig 6: https://www.gbif.org/
                                                      • Fig 7: Scott Chamberlain (2017). https://ropenscilabs.github.io/taxize-book/
                                                      • Fig 8: Luis Cayuela, Anke Stein and Jari Oksanen (2017). Taxonstand: Taxonomic Standardization of Plant Species Names. R package version
                                                      • Fig 9: Oliver Keyes, Serena Signorelli, Christian Graul and Mikhail Popov (2017). WikidataR: API Client Library for 'Wikidata'. R package version 1.4.0. https://CRAN.R-project.org/package=WikidataR