STUDY ON SIMILARITY OF MATERIAL PROPERTIES USING CHEMINFORMATICS APPROACH
Guided by:
Abstract
Cheminformatics can be described as the generation or retrieval of data from repositories in order to transform data into information and information into knowledge, with the aim of making better decisions faster in compound identification and optimization. As high-throughput computing proliferates in materials science, the gap between accumulated information and derived knowledge widens. We address the issue of discovery in materials science by introducing novel analytic approaches based on electronic materials fingerprints. The framework is employed to (i) query large databases of materials using the similarity concept, and (ii) map the connectivity of materials space (i.e., as a materials cartogram) to rapidly identify regions with unique organizations/properties. In this study, we have used only the “band structure symmetry-dependent fingerprint (B-fingerprint)” and the “density of states symmetry-independent fingerprint (D-fingerprint)” to study the similarity of materials. These materials fingerprinting and materials cartography approaches contribute to the emerging field of materials informatics by enabling effective computational tools to analyse, visualize, model, and design new materials.
A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of material representations. The “Tanimoto similarity index” is an appropriate choice for fingerprint-based similarity calculations, so we have used only this index to compare the similarity of a few materials and the diversity of material space. First, we chose ‘Gallium Arsenide’ (GaAs) as the reference material and compared the B-fingerprints and D-fingerprints of some elements and binary compounds with those of GaAs, using the Python script that we developed. We also searched the AFLOWLIB database for materials similar to ‘Ytterbium Selenide’ (YbSe).
Keywords: Cheminformatics, Materials Cartograms, B-fingerprint, D-fingerprint, Tanimoto Similarity Index, AFLOWLIB database.
Introduction
Quantifying the similarity of two materials is a key concept and a routine task in cheminformatics. The design of materials with desired physical and chemical properties is a vital challenge in the field of materials research [1] [2] [3]. Material properties depend directly on a large number of key variables, which often makes property prediction complex. These variables include constitutive elements, crystal forms, and geometrical and electronic characteristics, among others. The rapid growth of materials research has led to the accumulation of vast amounts of data. For example, the Inorganic Crystal Structure Database (ICSD) includes more than 170,000 entries [4]. Experimental data are also included in other databases, such as Matweb and Matbase. In addition, there are several large databases, such as AFLOWLIB, Materials Project, Nomad Repository, and Harvard Clean Energy Project, that contain thousands of unique materials and their theoretically calculated (DFT) properties [6]. These properties include electronic structure profiles estimated with quantum mechanical methods. The latter databases have great potential to serve as a source of novel functional materials. Promising candidates from these databases may in turn be selected for experimental confirmation using rational design approaches.
The rapidly growing compendium of experimental and theoretical materials data offers a unique opportunity for scientific discovery in materials databases. Specialized data mining and data visualization methods are being developed within the nascent field of materials informatics.
Similar approaches have been used extensively and successfully in cheminformatics. For example, in many cases these approaches have helped to identify and design small organic molecules with desired biological activity and acceptable environmental/human-health safety profiles. Applying cheminformatics approaches to materials science would allow researchers to (i) define, visualize, and navigate through the material space, (ii) analyze and model structural and electronic characteristics of materials with regard to a particular physical or chemical property, and (iii) employ predictive materials informatics models to forecast the experimental properties of untested materials. Thus, rational design approaches in materials science constitute a rapidly growing field.
Herein, we use a novel materials fingerprinting approach recently proposed in the literature [4]. We use fingerprints that encode information about the band structure and density of states (DOS), i.e., electronic structure of the materials. We show that known materials with similar properties turn out to have high similarity in their electronic fingerprints, thus suggesting that this method can be used to scout for materials with desired properties in existing materials databases.
Methods
Materials Fingerprints
It is well known that material properties depend on geometrical and electronic structure. In comparing the properties of materials, two important assumptions in the present approach are that (i) properties of materials are direct functions of ‘structures’, (ii) materials with similar ‘structures’ (as determined by constitutional, topological, spatial and electronic structures) are likely to have similar physical and chemical properties.
Thus, encoding material characteristics in the form of numerical arrays of descriptors, or fingerprints, enables the use of classical cheminformatics and machine-learning approaches to mine, visualize, and model any set of materials. We have encoded the electronic structure diagram for each material as two distinct types of arrays: a symmetry-dependent fingerprint (the band structure based B-fingerprint) and a symmetry-independent fingerprint (the density of states based D-fingerprint).
B-Fingerprint: At every special high-symmetry point of the Brillouin zone (BZ), the band energies in the range −10 eV to 10 eV around the band gap (or the Fermi energy for metals) are discretized into 32 bins to serve as our fingerprint array. The set of high-symmetry k-points in a BZ depends on the crystal symmetry. For example, the BZ of a simple cubic crystal has four high-symmetry points (Γ, M, R, X) and gives a B-fingerprint array of length 4 × 32 = 128. The body-centered orthorhombic lattice, on the other hand, has 13 high-symmetry k-points (Γ, L, L_{1}, L_{2}, R, S, T, W, X, X_{1}, Y, Y_{1}, Z) and leads to a B-fingerprint array of length 13 × 32 = 416. The Γ-point is common to all lattice types and does not depend on the symmetry of the crystal structure. Therefore, to keep the analysis simple, we have calculated and compared B-fingerprints of materials only at the Γ-point, as in [4]. The construction of the B-fingerprint of a band structure (Figure 1) is shown by a histogram plot (Figure 2).
D-Fingerprint: A similar idea can be implemented for the DOS of materials, which is sampled in 256 bins (from −10 eV to 10 eV). Each bin contains the average value of the DOS in the energy interval of that bin. Because of the complexity and limitations of the symmetry-dependent B-fingerprints, symmetry-independent D-fingerprints are suggested as an alternative. The length of these fingerprints is adjustable depending on the objects, applications, and other factors. The domain and length of these fingerprints have been chosen carefully to avoid enhancing boundary effects or discarding important features. The construction of the D-fingerprint is shown in Figure 3.
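The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in this study: the function names `bin_index`, `b_fingerprint` and `d_fingerprint` are hypothetical, the energies are assumed to be already shifted so that zero lies at the valence band maximum (or Fermi energy), and empty DOS bins are assumed to take the value 0.

```python
E_MIN, E_MAX = -10.0, 10.0  # energy window in eV, relative to the zero level


def bin_index(energy, n_bins):
    """Return the index of the bin holding `energy`, or None if outside the window."""
    if not (E_MIN <= energy < E_MAX):
        return None
    width = (E_MAX - E_MIN) / n_bins
    # min() guards against a floating-point result landing on n_bins
    return min(int((energy - E_MIN) // width), n_bins - 1)


def b_fingerprint(band_energies, n_bins=32):
    """Count the bands that fall in each energy bin at one k-point."""
    counts = [0] * n_bins
    for e in band_energies:
        i = bin_index(e, n_bins)
        if i is not None:
            counts[i] += 1
    return counts


def d_fingerprint(energies, dos, n_bins=256):
    """Average the DOS samples that fall in each energy bin (assumed 0.0 if empty)."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    for e, g in zip(energies, dos):
        i = bin_index(e, n_bins)
        if i is not None:
            sums[i] += g
            hits[i] += 1
    return [s / h if h else 0.0 for s, h in zip(sums, hits)]


# Toy band energies at a single k-point (illustrative values, not AFLOWLIB data)
toy_bands = [-12.0, -5.1, -0.3, 0.0, 1.4, 1.5, 9.9]
toy_b = b_fingerprint(toy_bands)
```

With 32 bins per k-point, concatenating the arrays for the four high-symmetry points of a simple cubic lattice would give the length of 128 quoted above; using only the Γ-point keeps the array at 32 entries.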
Theory of Similarity and Distance Measures
There are various similarity and distance metrics. A distance can be converted into a similarity using equation (1):
$S = \frac{1}{1+d}$, where $d$ is the distance    (1)
That is, every distance metric has a corresponding similarity metric and vice versa. Since distances are always non-negative ($d \in [0, +\infty)$), similarity values calculated with this equation always lie in the interval (0, 1], with 1 corresponding to identical objects (distance 0).
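The conversion in equation (1) and its inverse can be written as a short sketch. The function names are hypothetical and chosen only for illustration; the inverse, d = 1/S − 1, follows directly from rearranging equation (1).

```python
def similarity_from_distance(d):
    """Equation (1): map a non-negative distance to a similarity in (0, 1]."""
    if d < 0:
        raise ValueError("a distance must be non-negative")
    return 1.0 / (1.0 + d)


def distance_from_similarity(s):
    """Inverse of equation (1): d = 1/S - 1, valid for 0 < S <= 1."""
    if not 0.0 < s <= 1.0:
        raise ValueError("similarity must lie in (0, 1]")
    return 1.0 / s - 1.0
```

For example, a distance of 0 maps to a similarity of 1, and a distance of 1 maps to a similarity of 0.5.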
Some of the similarity/distance metrics are: Manhattan distance, Euclidean distance, Cosine coefficient, Dice coefficient, Tanimoto index, and Soergel distance [5]. For our purpose, we will only discuss the “Tanimoto Similarity Index”.
Tanimoto Similarity Index: The ‘Tanimoto Similarity Index’ measures the similarity between two finite sample sets and is generally defined as the size of the intersection divided by the size of the union of the sample sets. Suppose ‘X’ and ‘Y’ are two sample sets, each having the same number of elements (say N). The sample sets are defined as:
X = {x_{1}, x_{2}, x_{3}, …, x_{N}} and Y = {y_{1}, y_{2}, y_{3}, …, y_{N}} with all real x_{j}, y_{j} ≥ 0
We can consider the sample sets X and Y as two distinct N-dimensional vectors with non-negative real-valued components. Then the ‘Tanimoto Similarity Index’ can be represented as:
$S(X,Y) = \frac{X \cdot Y}{|X|^{2} + |Y|^{2} - X \cdot Y}$    (2)
In terms of the components of the vectors, the ‘Tanimoto Similarity Index’ takes the form:
$S(X,Y) = \frac{\sum_{j=1}^{N} x_j y_j}{\sum_{j=1}^{N} x_j^{2} + \sum_{j=1}^{N} y_j^{2} - \sum_{j=1}^{N} x_j y_j}$    (3)
where ‘x_{j}’ is the value of the j-th component of X and
‘y_{j}’ is the value of the j-th component of Y.
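Equation (3) translates directly into code. The following is a minimal sketch with a hypothetical function name `tanimoto`; the handling of two all-zero vectors (returning 1.0) is an assumption made here to avoid division by zero and is not specified in the text.

```python
def tanimoto(x, y):
    """Tanimoto similarity of two equal-length, non-negative vectors (equation 3)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denom == 0:
        # Both vectors are all zeros; treated as identical (assumption)
        return 1.0
    return dot / denom
```

As a worked check, for X = (1, 2, 3) and Y = (2, 2, 1) the numerator is 2 + 4 + 3 = 9 and the denominator is 14 + 9 − 9 = 14, so S = 9/14.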
Now, if the vectors X and Y are bit vectors (where the value of each dimension is a binary digit, i.e. either 0 or 1), then the ‘Tanimoto Similarity Index’ takes the simple form:
$S(X,Y) = \frac{c}{a + b - c}$    (4)
where ‘S’ denotes the similarity between the two bit vectors X and Y,
‘a’ is the number of on bits (i.e. 1) in X,
‘b’ is the number of on bits (i.e. 1) in Y, and
‘c’ is the number of bits that are on in both X and Y.
In our case, the threshold value of the ‘Tanimoto Similarity Index’ is S_{T} = 0.7, i.e., if the electronic fingerprint similarity between any two chosen compounds is greater than or equal to 0.7, the two compounds are considered to have similar properties. Otherwise, they are considered to have dissimilar properties.
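The bit-vector form in equation (4) and the threshold test can be sketched as follows. The names `tanimoto_bits` and `is_similar` are hypothetical, and returning 1.0 for two all-zero bit vectors is an assumption made here for illustration.

```python
S_T = 0.7  # similarity threshold used in this study


def tanimoto_bits(x, y):
    """Tanimoto similarity of two equal-length bit vectors (equation 4)."""
    if len(x) != len(y):
        raise ValueError("bit vectors must have the same length")
    a = sum(x)                                # on bits in X
    b = sum(y)                                # on bits in Y
    c = sum(p & q for p, q in zip(x, y))      # bits on in both X and Y
    if a + b - c == 0:
        return 1.0                            # both empty: treated as identical (assumption)
    return c / (a + b - c)


def is_similar(s, threshold=S_T):
    """Apply the S >= S_T criterion."""
    return s >= threshold


# Toy 8-bit vectors (illustrative only)
x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 0, 1, 1, 0]
s_xy = tanimoto_bits(x, y)
```

In this toy case a = 4, b = 4 and c = 3, so S = 3/5 = 0.6, which falls below the threshold. For bit vectors, equation (3) gives the same value as equation (4), because the square of each component equals the component itself.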
Similarity Search in the Material Space
Band structures and densities of states available in the AFLOWLIB consortium databases were converted into fingerprints, or arrays of numbers.
We encoded the electronic structure diagram for each material as two distinct types of fingerprints (Figure 2 and Figure 3): band structure symmetry-dependent fingerprints (B-fingerprints) and density of states symmetry-independent fingerprints (D-fingerprints). The B-fingerprint is a string containing 32 integers, each giving the number of bands sampled at the high-symmetry reciprocal point (here the Γ-point) in one of the 32 bins dividing the [−10, 10] eV interval around the valence band maximum. The D-fingerprint is a string containing 256 real numbers, each characterizing the strength of the DOS in one of the 256 bins dividing the [−10, 10] eV interval around the valence band maximum (which is taken as the zero of energy).
This representation of materials enables the use of cheminformatics approaches, such as similarity searches, to retrieve materials with similar properties but different compositions from the AFLOWLIB repository. As an added benefit, a similarity search can also quickly find duplicate records. For example, we have identified several BaTiO_{3} records with identical fingerprints (ICSD Nos. 15453, 27970, 6102, and 27965 in the AFLOWLIB database). Thus, the fingerprint representation affords rapid identification of duplicates, which is the standard first step in our cheminformatics data curation workflow.
Standard DFT has severe limitations in the description of excited states, and it should be replaced by more advanced approaches to characterize semiconductors and insulators [4]. However, DFT errors tend to be comparable within similar classes of systems. These errors are considered to be “systematic” and are immaterial when one seeks only similarities between materials.
Our test case is Gallium Arsenide, GaAs (ICSD No. 41674 in the AFLOWLIB database), a very important semiconductor material for electronics. GaAs is taken as the reference material, and some elements and binary materials from the AFLOWLIB database are taken as the virtual screening library. The pairwise similarity between GaAs and each material, represented by its D-fingerprint, is computed using the Tanimoto similarity coefficient (S). The top seven materials retrieved (GaP, InTe, Si, GaSb, SnP, InP, GeAs) show very high similarity to GaAs (S > 0.75), and all seven are known semiconductor materials.
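A similarity search of this kind can be sketched as a loop over a screening library. The sketch below is an illustration under stated assumptions: `similarity_search` and `find_duplicates` are hypothetical names, the five-element fingerprints and the labels "A" to "D" are toy values rather than AFLOWLIB data, and duplicates are taken to be records whose similarity to the reference is exactly 1.

```python
def tanimoto(x, y):
    """Tanimoto similarity for non-negative vectors (equation 3)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return 1.0 if denom == 0 else dot / denom


def similarity_search(reference, library, threshold=0.7):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    hits = []
    for name, fingerprint in library.items():
        s = tanimoto(reference, fingerprint)
        if s >= threshold:
            hits.append((name, s))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def find_duplicates(reference, library):
    """Names of records whose fingerprint is identical to the reference."""
    return [name for name, fp in library.items() if list(fp) == list(reference)]


# Toy reference and library (illustrative values only)
reference = [0, 2, 4, 2, 0]
library = {
    "A": [0, 2, 4, 2, 0],   # identical fingerprint: a duplicate record
    "B": [0, 2, 3, 2, 0],
    "C": [4, 0, 0, 0, 4],
    "D": [1, 2, 2, 2, 1],
}
hits = similarity_search(reference, library)
duplicates = find_duplicates(reference, library)
```

Here the search keeps "A", "B" and "D" in that order and discards "C", whose fingerprint has no overlap with the reference.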
AFLOWLIB Material Repository and Data
AFLOWLIB is a material repository of density functional theory (DFT) calculations managed by the software package AFLOW. At the time of the study, the AFLOWLIB.org database contained nearly 1.8 million compounds, each characterized by about 100 different properties. Of the characterized systems, roughly half are metallic and half are insulating. AFLOW uses the VASP package to calculate the total energy of a given crystal structure with PAW pseudopotentials and the PBE exchange-correlation functional.
For our purpose, we need the OUTCAR file which contains the information about band energies of a particular compound and the DOSCAR file which contains the information about density of states of a particular compound, from the AFLOWLIB material repository.
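As an illustration of how the DOS data might be read, the sketch below parses the total DOS from DOSCAR-style text. It rests on assumptions that should be checked against the VASP documentation and the actual files: a non-spin-polarized calculation, five header lines followed by a sixth line holding E_max, E_min, the number of DOS points and the Fermi energy, and then one line per point with the energy in the first column and the total DOS in the second. The function name `parse_doscar_total` and the sample text are hypothetical.

```python
def parse_doscar_total(text):
    """Parse energies and total DOS from DOSCAR-style text (assumed layout)."""
    lines = text.splitlines()
    header = lines[5].split()          # assumed: E_max, E_min, NEDOS, E_Fermi, weight
    n_points = int(header[2])
    e_fermi = float(header[3])
    energies, dos = [], []
    for line in lines[6:6 + n_points]:
        fields = line.split()
        energies.append(float(fields[0]))   # energy (eV)
        dos.append(float(fields[1]))        # total DOS (assumed non-spin-polarized)
    return energies, dos, e_fermi


# Toy text in the assumed layout (not taken from a real calculation)
SAMPLE_DOSCAR = "\n".join([
    "   2   2   1   0",
    "  0.2E+02  0.4E-09  0.4E-09  0.4E-09  0.5E-15",
    "  1.0E-04",
    "  CAR",
    " toy system",
    "     10.0   -10.0   3   1.50   1.0",
    "  -10.0  0.0  0.0",
    "    0.0  2.5  1.0",
    "   10.0  0.5  4.0",
])
energies, dos, e_fermi = parse_doscar_total(SAMPLE_DOSCAR)
```

In this study the energies are shifted by the valence band maximum obtained from the OUTCAR file rather than by the Fermi energy in the header, so `e_fermi` is returned here only for completeness.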
Result and Discussion
Results of Similarity Determination
Using B-fingerprint and D-fingerprint similarity searches, we found some duplicate records (i.e. records for which both the B-fingerprint similarity and the D-fingerprint similarity are 1.0).
For example, we identified several GaAs records with identical fingerprints (ICSD Nos. 41674, 610536, 610537, 610538, 610539, 610540, 610541, 610543 in the AFLOWLIB database).
Other duplicate records with identical fingerprints are: GaP (ICSD Nos. 635030, 635032, 635033, 635034, 635035, 635036), Si_{2} (ICSD Nos. 60386, 60387, 60388, 60389), InTe (ICSD Nos. 169419, 169422, 169425, 169428), InP (ICSD Nos. 53105, 188691), etc.
Our first test case consists of pairwise similarity searches between GaAs and some semiconductor materials (GaP, GaSb, Si, SnP, GeAs, InTe, InP, InAs, Ge).
The B-fingerprint and D-fingerprint similarity values are tabulated below:
Table I. Reference compound: GaAs (ICSD No. 41674) [FCC]
Serial No.  Compound Name  ICSD No.  Structure*  B-Fingerprint Similarity  D-Fingerprint Similarity
1.  GaP  635030  FCC  0.833  0.894
2.  GaSb  44979  FCC  0.390  0.790
3.  Si  60388  FCC  0.263  0.799
3.  Si  52460  BCT  0.057  0.750
3.  Si  41392  ORCI  0.057  0.748
3.  Si  52459  HEX  0.182  0.708
4.  SnP  77786  FCC  0.000  0.774
5.  GeAs  17033  BCT  0.022  0.760
6.  InTe  169425  FCC  0.056  0.842
6.  InTe  640610  CUB  0.162  0.769
7.  InP  53104  FCC  0.188  0.768
8.  InAs  41444  FCC  0.655  0.703
9.  Ge  53788  FCC  0.231  0.742
9.  Ge  181073  HEX  0.029  0.702
* The structures are given in the short form used by AFLOWLIB.org. The list of structures and their short forms is given in the Appendix.
Next, we took ‘GaAs’ (ICSD No. 41674) as the reference material and binary compounds with the chemical formula A_{1}B_{1} as test materials.
Here, A = alkali metals, alkaline earth metals, transition metals (3d and 4d only), or Group 3A elements (excluding boron); B = Group 5A elements, Group 6A elements (chalcogens), or halogens. We found 157 different compounds of this type in the AFLOWLIB database. We ran our script for pairwise B- and D-fingerprint similarity searches between GaAs and these A_{1}B_{1}-type compounds. As a result, we found 16 different compounds with high similarity values with respect to GaAs.
The compounds that have high similarity values with respect to GaAs are tabulated below:
Table II. A_{1}B_{1}-type compounds with high similarity to GaAs (ICSD No. 41674)
Serial No.  Compound Name  ICSD No.  Structure  B-Fingerprint Similarity  D-Fingerprint Similarity
1.  GaP  635030  FCC  0.833  0.894
2.  TlP  184576  FCC  0.846  0.782
3.  TlAs  184574  FCC  0.846  0.720
4.  YP  185495  CUB  0.767  0.599
5.  NaAs  182160  TET  0.032  0.711
6.  NaBi  58816  TET  0.103  0.768
7.  BeTe  290008  FCC  0.333  0.717
8.  AlAs  67784  FCC  0.571  0.710
9.  GaSb  44979  FCC  0.390  0.790
10.  GaBi  167768  FCC  0.188  0.812
11.  GaS  40824  RHL  0.081  0.701
11.  GaS  53590  HEX  0.082  0.775
12.  InP  53104  FCC  0.188  0.768
13.  InAs  41444  FCC  0.655  0.703
14.  InS  409645  MCL  0.212  0.821
14.  InS  660105  ORC  0.212  0.771
15.  InTe  169425  FCC  0.056  0.842
15.  InTe  640610  CUB  0.162  0.769
16.  TlBi  53967  CUB  0.132  0.743
In a different exercise, we took ‘YbSe’ (ICSD No. 33675) as our reference compound and searched the AFLOWLIB database for A_{1}B_{1}-type materials similar to YbSe.
Here, A = elements of the lanthanide series; B = chalcogens (S, Se, Te).
We found 40 different compounds of this type. Among them, the two materials most similar to YbSe (based on the B-fingerprint), with S > 0.7, were EuS (ICSD No. 631599) and YbS (ICSD No. 651441).
Discussion
Both the B-fingerprint and D-fingerprint similarity values between GaAs and GaP are very high (Table I).
Likewise, the pairwise similarity values (based on D-fingerprints) between GaAs and each of the semiconductor materials (GaP, GaSb, Si, SnP, GeAs, InTe, InP, InAs, Ge) are high (S > 0.7).
So, when a material with a semiconductor-like property (e.g. a band gap) is needed, we can initially select test materials whose D-fingerprint has a high similarity value (S > 0.7) with that of a known semiconductor (e.g. GaAs).
The band structures, B-fingerprint histogram plots and density of states vs. energy plots for GaAs and GaP are shown in Figures 4, 5, 6 and 7.
From Table I, we see that across the different crystal structures of a given compound, the B-fingerprint similarity varies strongly, whereas the D-fingerprint similarity varies very little. So, it is clear that the B-fingerprint is highly symmetry dependent, while the D-fingerprint is almost independent of the symmetry of the crystal structure.
From Table I and Table II, there are 20 compounds very similar to GaAs. Out of these 20 materials, 17 are used as semiconductors. We did not find experimental band gap values for TlP, TlAs and TlBi in our search of the literature.
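The contrast between the two fingerprints can be made concrete with the four Si entries of Table I. The sketch below uses a relative spread, (max − min) / mean, which is a measure chosen here purely for illustration and is not part of the original analysis; the function name `relative_spread` is hypothetical.

```python
# Similarity of Si to GaAs for four crystal structures (values from Table I)
si_b = {"FCC": 0.263, "BCT": 0.057, "ORCI": 0.057, "HEX": 0.182}
si_d = {"FCC": 0.799, "BCT": 0.750, "ORCI": 0.748, "HEX": 0.708}


def relative_spread(values):
    """(max - min) / mean of a collection of similarity values."""
    vals = list(values)
    mean = sum(vals) / len(vals)
    return (max(vals) - min(vals)) / mean


b_spread = relative_spread(si_b.values())
d_spread = relative_spread(si_d.values())
```

For these values the B-fingerprint similarities range over more than their own mean (a relative spread of about 1.5), while the D-fingerprint similarities stay within roughly 12% of theirs, consistent with the symmetry dependence described above.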
The experimental band gaps of some of these compounds are given below:
Serial No.  Compound Name  Structure  Band Gap at 300 K (eV) 
1.  BeTe  FCC  3.0 
2.  AlAs  FCC  2.16 
3.  GaP  FCC  2.26 
4.  GaSb  FCC  0.72 
5.  InP  FCC  1.35 
6.  InAs  FCC  0.36 
7.  InS  ORC  2.0 
8.  InTe  FCC  0.6 
9.  Si  FCC  1.12 
10.  GeAs  BCT  1.64 
11.  Ge  FCC  0.66 
BaTiO_{3} is widely used as a ferroelectric or piezoelectric ceramic [8]. Out of the six materials most similar to BaTiO_{3}, five (BiOBr, SrZrO_{3}, BaZrO_{3}, KTaO_{3}, KNbO_{3}) are well-known piezoelectrics [9]. All of these have a B-fingerprint similarity S > 0.8. The sixth material is cubic YbSe (ICSD No. 33675) [4]. So the question arises whether YbSe is a ferroelectric or piezoelectric compound. We have found a high B-fingerprint similarity (S > 0.7) between cubic YbSe and both EuS (ICSD No. 631599) and YbS (ICSD No. 651441). One can therefore formulate a testable hypothesis that these two materials may also be ferroelectric or piezoelectric.
Programming Algorithm
We developed a script in Python to create the B-fingerprint and D-fingerprint, and to calculate the B-fingerprint similarity and D-fingerprint similarity between any two chosen materials, using the AFLOWLIB data. The algorithm is described below:
STEP 1: Download the OUTCAR and DOSCAR files of the two chosen materials from the AFLOWLIB.org database.
STEP 2: Read the OUTCAR files of both materials.
STEP 3: Store all the band energies at the Γ-point (k-point 1) from the OUTCAR file of each material.
STEP 4: Find the maximum valence band energy for each material and store it.
STEP 5: Set the maximum valence band energy of each material as the zero of energy, and shift all the Γ-point (k-point 1) band energies of each material by subtracting the respective maximum valence band energy.
STEP 6: Choose the band energy range from −10.0 eV to 10.0 eV.
STEP 7: Divide the band energy range (−10.0 eV, 10.0 eV) into 32 bins, so that the energy interval of each bin is 0.625 eV.
STEP 8: Find the number of bands in each bin using the band energy data (after the shift in Step 5) for each material and store it.
STEP 9: Convert the number of bands in each bin from decimal to an 8-bit binary number. For the 32 bins of each material, this gives a 256-bit binary number, which is the B-fingerprint of the material. Store the 256-bit binary number (B-fingerprint) in an array. Do this step for both materials.
STEP 10: Use the definition of the Tanimoto Similarity Index (equation 3) to find the similarity between the two fingerprints. This gives the B-fingerprint similarity between the two chosen materials.
STEP 11: Plot the histogram of ‘Number of bands vs. Bin energy’ for both materials.
STEP 12: Read the DOSCAR files of both materials.
STEP 13: Store all the energies and the corresponding densities of states from the DOSCAR file. Do this step for each material.
STEP 14: Shift the energies by subtracting the maximum valence band energy (obtained in Step 4) for each material.
STEP 15: Choose the energy range from −10.0 eV to 10.0 eV.
STEP 16: Divide the energy range (−10.0 eV, 10.0 eV) into 256 bins, so that the energy interval of each bin is 0.078125 eV.
STEP 17: Find the density of states in each bin. Do this for each material and store it.
STEP 18: Convert the density of states of each bin from decimal to a 16-bit binary number. For the 256 bins of each material, this gives a 4096-bit binary number, which is the D-fingerprint of the material. Store the 4096-bit binary number (D-fingerprint) in an array. Do this step for both materials.
STEP 19: Use the definition of the Tanimoto Similarity Index (equation 3) to find the similarity between the two fingerprints. This gives the D-fingerprint similarity between the two chosen materials.
STEP 20: Plot ‘Energy vs. Density of states’ for each material in a single frame.
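Steps 5 to 10 of the procedure above can be sketched as follows. This is a minimal illustration and not the script developed for this study: all function names are hypothetical, the Γ-point band energies and valence band maxima are toy values supplied by hand rather than read from an OUTCAR file, and the D-fingerprint path is omitted because the quantization of a real-valued DOS into 16 bits is not specified in the text.

```python
def shift_to_vbm(band_energies, vbm):
    """Step 5: take the valence band maximum as the zero of energy."""
    return [e - vbm for e in band_energies]


def band_counts(energies, e_min=-10.0, e_max=10.0, n_bins=32):
    """Steps 6-8: count the bands in each of the 32 bins (0.625 eV wide)."""
    width = (e_max - e_min) / n_bins
    counts = [0] * n_bins
    for e in energies:
        if e_min <= e < e_max:
            counts[min(int((e - e_min) // width), n_bins - 1)] += 1
    return counts


def to_bits(counts, bits_per_bin=8):
    """Step 9: write each count as an 8-bit binary number and concatenate."""
    bits = []
    for c in counts:
        if c >= 2 ** bits_per_bin:
            raise ValueError("count does not fit in the chosen number of bits")
        bits.extend(int(ch) for ch in format(c, f"0{bits_per_bin}b"))
    return bits


def tanimoto(x, y):
    """Step 10: Tanimoto similarity (equation 3)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return 1.0 if denom == 0 else dot / denom


def b_fingerprint(band_energies, vbm):
    """Chain steps 5-9 into a 256-bit B-fingerprint."""
    return to_bits(band_counts(shift_to_vbm(band_energies, vbm)))


# Toy Gamma-point band energies in eV (illustrative, not AFLOWLIB data)
fp_a = b_fingerprint([-6.2, 3.1, 3.1, 3.1, 5.9, 5.9], vbm=3.1)
fp_b = b_fingerprint([-5.0, 2.0, 2.0, 2.0, 4.8], vbm=2.0)
similarity = tanimoto(fp_a, fp_b)
```

In this toy case each fingerprint has four on bits and the two share two of them, so S = 2 / (4 + 4 − 2) = 1/3. Because the counts are binary encoded, the bins holding two bands (00000010) and one band (00000001) at the same energy share no on bits.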
Acknowledgement
This project would not have been what it is today without the support and advice of others.
First of all, I would like to express my deep gratitude to my project supervisor, Prof. Prasenjit Sen, for the opportunity he has given me to work with him. His ideas and guidance have been invaluable. Without his constant motivation, I believe the work could not have taken its proper shape.
I am also indebted to Dr. Rudra Banerjee for his help, moral support and encouragement. I am also thankful to Arijit Dutta, Ph.D. student at HRI, for his cooperation in code development throughout the project.
I would also like to thank AuthorCafe for providing me a great platform to present my project report.
APPENDIX
SERIAL NO.  BRAVAIS LATTICE  LATTICE SYMBOL 
1  Simple Cubic  CUB 
2  Body Centred (‘I’ Centred) Cubic  BCC 
3  Face Centred (‘F’ Centred) Cubic  FCC 
4  Simple Tetragonal  TET 
5  Body Centred (‘I’ Centred) Tetragonal  BCT 
6  Simple Orthorhombic  ORC 
7  Body Centred (‘I’ Centred) Orthorhombic  ORCI 
8  Base Centred (‘C’ Centred) Orthorhombic  ORCC 
9  Face Centred (‘F’ Centred) Orthorhombic  ORCF 
10  Simple Rhombohedral (or Trigonal)  RHL 
11  Simple Hexagonal  HEX 
12  Simple Monoclinic  MCL 
13  Base Centred (‘C’ Centred) Monoclinic  MCLC 
14  Simple Triclinic  TRI 
References

Rajan, K. Mater. Today 2005; 8: 38−45.
Curtarolo, S.; Hart, G. L. W.; Buongiorno Nardelli, M.; Mingo, N.; Sanvito, S.; Levy, O. Nat. Mater. 2013; 12: 191−201.
Potyrailo, R.; Rajan, K.; Takeuchi, I.; Chisholm, B.; Lam, H. ACS Comb. Sci. 2011; 13: 579−633.
O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha and S. Curtarolo. Materials Cartography: Representing and Mining Materials Space Using Structural and Electronic Fingerprints. Chem. Mater. 2015; 27: 735−743.
C. Oses, C. Toher and S. Curtarolo. Autonomous data-driven design of inorganic materials with AFLOW, submitted, arXiv:1803.05035v1 [cond-mat.mtrl-sci], 2018.
D. Bajusz, A. Rácz and K. Héberger. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics. 2015; 7: 20.
Bhalla, A. S.; Guo, R.; Roy, R. Mater. Res. Innovat. 2000; 4: 3−26.
Rabe, K. M.; Ahn, C. H.; Triscone, J. M. Physics of Ferroelectrics: A Modern Perspective; Springer: New York, 2010.