# STUDY ON SIMILARITY OF MATERIAL PROPERTIES USING CHEMINFORMATICS APPROACH

Bireshewar Roy

M.Sc. (1st year), Physics, Indian Institute of Technology, Kanpur 208016

Guided by:

Prof. Prasenjit Sen

Harish-Chandra Research Institute, Allahabad – 211019

## Abstract

Cheminformatics can be described as the generation or retrieval of data from repositories in order to transform data into information and information into knowledge, with the aim of making better decisions faster in compound identification and optimization. As high-throughput computing proliferates in materials science, the gap between accumulated information and derived knowledge widens. We address the issue of discovery in materials science using analytic approaches based on electronic materials fingerprints. The framework is employed to (i) query large databases of materials using the concept of similarity, and (ii) map the connectivity of materials space (i.e., as a materials cartogram) to rapidly identify regions with unique organizations/properties. In this study, we have used only the “band structure symmetry dependent fingerprint (B-fingerprint)” and the “density of states symmetry independent fingerprint (D-fingerprint)” to study the similarity of materials. These materials fingerprinting and materials cartography approaches contribute to the emerging field of materials informatics by enabling effective computational tools to analyse, visualize, model, and design new materials.

A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of material representations. The “Tanimoto similarity index” is an appropriate choice for fingerprint-based similarity calculations, so we have used only this index to compare the similarity of a few materials and the diversity of material space. First, we chose ‘Gallium Arsenide’ (GaAs) as the reference material and compared the B-fingerprints and D-fingerprints of some elements and binary compounds with those of GaAs using the Python script that we developed. We also searched the AFLOWLIB database for materials similar to ‘Ytterbium Selenide’ (YbSe).

Keywords: Cheminformatics, Materials Cartograms, B-fingerprint, D-fingerprint, Tanimoto Similarity Index, AFLOWLIB database.

## Introduction

Quantifying the similarity of two materials is a key concept and routine task in cheminformatics. The design of materials with desired physical and chemical properties is a vital challenge in the field of materials research [1] [2] [3]. Material properties depend directly on a large number of key variables, which often makes property prediction complex. These variables include constitutive elements, crystal forms, and geometrical and electronic characteristics, among others. The rapid growth of materials research has led to the accumulation of vast amounts of data. For example, the Inorganic Crystal Structure Database (ICSD) includes more than 170,000 entries [4]. Experimental data are also included in other databases, such as Matweb and Matbase. In addition, there are several large databases, such as AFLOWLIB, Materials Project, Nomad Repository, and Harvard Clean Energy Project, that contain thousands of unique materials and their theoretically calculated (using DFT) properties [6]. These properties include electronic structure profiles estimated with quantum mechanical methods. The latter databases have great potential to serve as a source of novel functional materials. Promising candidates from these databases may in turn be selected for experimental confirmation using rational design approaches.

The rapidly growing compendium of experimental and theoretical materials data offers a unique opportunity for scientific discovery in materials databases. Specialized data mining and data visualization methods are being developed within the nascent field of materials informatics.

Similar approaches have been used extensively in cheminformatics with resounding success. For example, in many cases these approaches have helped to identify and design small organic molecules with desired biological activity and acceptable environmental/human-health safety profiles. Applying cheminformatics approaches to materials science would allow researchers to (i) define, visualize, and navigate through the material space, (ii) analyze and model structural and electronic characteristics of materials with regard to a particular physical or chemical property, and (iii) employ predictive materials informatics models to forecast the experimental properties of untested materials. Thus, rational design approaches in materials science constitute a rapidly growing field.

Herein, we use a novel materials fingerprinting approach recently proposed in the literature ​[4]​. We use fingerprints that encode information about the band structure and density of states (DOS), i.e., electronic structure of the materials. We show that known materials with similar properties turn out to have high similarity in their electronic fingerprints, thus suggesting that this method can be used to scout for materials with desired properties in existing materials databases.

## Materials Fingerprints

It is well known that material properties depend on geometrical and electronic structure. In comparing the properties of materials, the present approach makes two important assumptions: (i) the properties of materials are direct functions of their ‘structures’, and (ii) materials with similar ‘structures’ (as determined by constitutional, topological, spatial and electronic structures) are likely to have similar physical and chemical properties.

Thus, encoding material characteristics in the form of numerical arrays of descriptors, or fingerprints, enables the use of classical cheminformatics and machine-learning approaches to mine, visualize, and model any set of materials. We have encoded the electronic structure diagram for each material as two distinct types of arrays: a symmetry dependent fingerprint (band structure based B-fingerprint) and a symmetry independent fingerprint (density of state based D-fingerprint).

B-Fingerprint: At every special high-symmetry point of the Brillouin zone (BZ), the band energies in the range -10 eV to 10 eV around the band gap (or the Fermi energy for metals) are discretized into 32 bins to serve as our fingerprint array. The set of high-symmetry k-points in a BZ depends on the crystal symmetry. For example, the BZ of a simple cubic crystal has four high-symmetry points (Γ, M, R, X) and gives a B-fingerprint array of length 4 × 32 = 128. The body centered orthorhombic lattice, on the other hand, has 13 high-symmetry k-points (Γ, L, L1, L2, R, S, T, W, X, X1, Y, Y1, Z) and leads to a B-fingerprint array of length 13 × 32 = 416. The Γ-point is common to all lattice types and does not depend on the symmetry of the crystal structure. Therefore, to keep the analysis simple, we have calculated and compared B-fingerprints of materials only at the Γ-point, as in [4]. The construction of the B-fingerprint from a band structure (Figure 1) is shown by a histogram plot (Figure 2).

D-Fingerprint: A similar idea can be implemented for the DOS of materials, which is sampled in 256 bins (from -10 eV to 10 eV). Each bin contains the average value of the DOS in the energy interval of that bin. Because of the complexity and limitations of the symmetry-dependent B-fingerprints, it is suggested to use symmetry-independent D-fingerprints as well. The length of these fingerprints is adjustable depending on the objects, applications, and other factors. The domain space and length of these fingerprints have been chosen carefully to avoid amplifying boundary effects or discarding important features. The construction of the D-fingerprint is shown in Figure 3.

Figure 1: Band structure of Sb2Te3 (ICSD No. 262171) (taken from www.aflow.org).

Figure 2: Construction of the B-fingerprint (at the Γ point) from the band structure of Sb2Te3 (ICSD No. 262171). We illustrate the idea of the B-fingerprint with 32 bins.

Figure 3: Construction of the D-fingerprint (shown by colour diagram) from the density of states of Bi1I1Te (ICSD No. 10500). We illustrate the idea of the D-fingerprint with 256 bins (taken from [4]).
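The two constructions described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not the script used in this study: the function names `b_fingerprint` and `d_fingerprint`, the argument `e_ref` (the valence band maximum, or the Fermi energy for metals) and the choice of 0.0 for empty DOS bins are assumptions made for the example.

```python
import numpy as np

E_MIN, E_MAX = -10.0, 10.0  # energy window in eV, measured from e_ref


def b_fingerprint(gamma_energies, e_ref, n_bins=32):
    """Count the Gamma-point bands that fall into each of n_bins energy bins."""
    shifted = np.asarray(gamma_energies, dtype=float) - e_ref
    # Bands outside [E_MIN, E_MAX] are ignored by np.histogram.
    counts, _ = np.histogram(shifted, bins=n_bins, range=(E_MIN, E_MAX))
    return counts


def d_fingerprint(energies, dos, e_ref, n_bins=256):
    """Average the DOS inside each of n_bins energy bins (assumed 0.0 if a bin is empty)."""
    shifted = np.asarray(energies, dtype=float) - e_ref
    dos = np.asarray(dos, dtype=float)
    edges = np.linspace(E_MIN, E_MAX, n_bins + 1)
    sums, _ = np.histogram(shifted, bins=edges, weights=dos)  # sum of DOS per bin
    hits, _ = np.histogram(shifted, bins=edges)               # samples per bin
    return np.divide(sums, hits, out=np.zeros(n_bins), where=hits > 0)
```

With 32 bins the bin width is 20/32 = 0.625 eV, and with 256 bins it is 20/256 = 0.078125 eV, so the B-fingerprint is a vector of integer band counts and the D-fingerprint a vector of real bin-averaged DOS values.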

## Theory of Similarity and Distance Measures

There are various similarity and distance metrics. Similarities and distances can be interconverted using the following equation (1):

$S = \frac{1}{1+d}$, where $d$ is the distance ............................. (1)

That is, every distance metric can be mapped to a similarity measure and vice versa. Since distances are always non-negative ($d \in [0, +\infty)$), similarity values calculated with this equation always lie between 0 and 1, with 1 corresponding to identical objects (distance 0) and values approaching 0 as the distance grows.
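As a small illustration of equation (1), the conversion and its inverse can be written as follows. The function names are chosen here for the example only.

```python
def similarity_from_distance(d):
    """Equation (1): map a non-negative distance to a similarity in (0, 1]."""
    if d < 0:
        raise ValueError("a distance must be non-negative")
    return 1.0 / (1.0 + d)


def distance_from_similarity(s):
    """Inverse of equation (1): d = 1/S - 1, valid for 0 < S <= 1."""
    if not 0.0 < s <= 1.0:
        raise ValueError("the similarity must lie in (0, 1]")
    return 1.0 / s - 1.0
```

Identical objects (d = 0) give S = 1, and a distance of 1 gives S = 0.5.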

Some of the similarity/distance metrics are: Manhattan distance, Euclidean distance, Cosine coefficient, Dice coefficient, Tanimoto index, and Soergel distance [5]. For our purpose, we will discuss only the “Tanimoto Similarity Index”.

Tanimoto Similarity Index: ‘Tanimoto Similarity Index’ measures the similarity between two finite sample sets and generally it is defined as the size of the intersection divided by the size of the union of the sample sets. Suppose ‘X’ and ‘Y’ are two sample sets each having same number of elements (say N). The sample sets are defined as:

$X = \{x_1, x_2, x_3, \ldots, x_N\}$ and $Y = \{y_1, y_2, y_3, \ldots, y_N\}$, with all $x_j, y_j$ real and $x_j, y_j \geq 0$.

We can consider the sample sets X and Y as two N-dimensional vectors with non-negative real-valued components. Then the ‘Tanimoto Similarity Index’ can be represented as:

$S(X,Y) = \frac{X \cdot Y}{\left|X\right|^{2} + \left|Y\right|^{2} - X \cdot Y}$ ............................. (2)

In terms of the components of the vectors, the ‘Tanimoto Similarity Index’ takes the form:

$S(X,Y) = \frac{\sum_{j=1}^{N} x_j y_j}{\sum_{j=1}^{N} x_j^{2} + \sum_{j=1}^{N} y_j^{2} - \sum_{j=1}^{N} x_j y_j}$ ............................. (3)

where $x_j$ is the value of the j-th component of X and $y_j$ is the value of the j-th component of Y.
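Equation (3) translates directly into code. The sketch below assumes NumPy and treats two all-zero vectors as identical (S = 1); that convention is an assumption of this example, since equation (3) is undefined in that case.

```python
import numpy as np


def tanimoto(x, y):
    """Tanimoto similarity of two equal-length, non-negative vectors (equation 3)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("fingerprints must have the same length")
    dot = float(np.dot(x, y))
    denom = float(np.dot(x, x) + np.dot(y, y) - dot)
    if denom == 0.0:
        return 1.0  # both vectors are all zeros (assumed convention)
    return dot / denom
```

For example, X = (1, 2, 0) and Y = (2, 1, 1) give X·Y = 4, |X|² = 5 and |Y|² = 6, so S = 4 / (5 + 6 - 4) = 4/7.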

Now, if the vectors X and Y are bit-vectors (i.e., the value in each dimension is a binary digit, either 0 or 1), then $x_j^2 = x_j$ and $y_j^2 = y_j$, and the ‘Tanimoto Similarity Index’ takes the simple form:

$S(X,Y) = \frac{c}{a+b-c}$ ............................. (4)

Where, ‘S’ denotes the similarity between two bit-vectors X and Y.

‘a’ is the number of on bits (i.e. 1) in X.

‘b’ is the number of on bits (i.e. 1) in Y.

‘c’ is the number of bits that are on in both X and Y.
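For bit-vectors, equation (4) only requires counting bits. A minimal sketch, with a function name chosen for this example and the same assumed convention S = 1 for two all-zero vectors:

```python
def tanimoto_bits(x_bits, y_bits):
    """Tanimoto similarity of two equal-length sequences of 0/1 values (equation 4)."""
    if len(x_bits) != len(y_bits):
        raise ValueError("bit-vectors must have the same length")
    a = sum(x_bits)                                      # on bits in X
    b = sum(y_bits)                                      # on bits in Y
    c = sum(xb & yb for xb, yb in zip(x_bits, y_bits))   # bits on in both
    if a + b - c == 0:
        return 1.0  # both vectors are all zeros (assumed convention)
    return c / (a + b - c)


x = [1, 1, 0, 1, 0, 0, 1, 0]  # a = 4
y = [1, 0, 0, 1, 1, 0, 1, 1]  # b = 5, and c = 3 bits are shared with x
print(tanimoto_bits(x, y))    # 3 / (4 + 5 - 3) = 0.5
```

Because the square of a bit equals the bit itself, equation (3) applied to the same two vectors gives the same value.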

In our case, the threshold value of the ‘Tanimoto Similarity Index’ is $S_T = 0.7$: if the electronic fingerprint similarity between two chosen compounds is greater than or equal to 0.7, the two compounds are considered to have similar properties; otherwise, they are considered to have dissimilar properties.

## Similarity Search in the Material Space

Band structures and densities of states available in the AFLOWLIB consortium databases were converted into fingerprints, or arrays of numbers.

We encoded the electronic structure diagram for each material as two distinct types of fingerprints (Figure 2 and Figure 3): band structure symmetry-dependent fingerprints (B-fingerprints), and density of states symmetry-independent fingerprints (D-fingerprints). The B-fingerprint is a string of 32 integers, each giving the number of bands sampled at the high-symmetry reciprocal point (here the Γ point) in one of the 32 bins dividing the [-10, 10] eV interval around the valence band maximum. The D-fingerprint is a string of 256 real numbers, each giving the strength of the DOS in one of the 256 bins dividing the [-10, 10] eV interval around the valence band maximum (which is taken as the zero of energy).

This representation of materials enabled the use of cheminformatics approaches, such as similarity searches, to retrieve materials with similar properties but different compositions from the AFLOWLIB repository. As an added benefit, a similarity search can also quickly find duplicate records. For example, we identified several BaTiO3 records with identical fingerprints (ICSD Nos. 15453, 27970, 6102, and 27965 in the AFLOWLIB database). Fingerprint representation thus affords rapid identification of duplicates, which is the standard first step in our cheminformatics data curation workflow.

Standard DFT has severe limitations in the description of excited states, and more advanced approaches are needed to characterize semiconductors and insulators accurately [4]. However, DFT errors tend to be comparable within similar classes of systems. These errors are considered to be “systematic”, and are immaterial when one seeks only similarities between materials.

Our test case is Gallium Arsenide, GaAs (ICSD No. 41674 in the AFLOWLIB database), a very important semiconductor material for electronics. GaAs is taken as the reference material, and some elements and binary materials from the AFLOWLIB database are taken as the virtual screening library. The pairwise similarity between GaAs and each of the materials represented by our D-fingerprints is computed using the Tanimoto similarity coefficient (S). The top seven materials retrieved (GaP, InTe, Si, GaSb, SnP, InP, GeAs) show very high similarity (S > 0.75) to GaAs, and all seven are known to be semiconductor materials.
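The search and the duplicate check described above can be sketched as follows. The record names and fingerprint values in the example are made up for illustration and are not AFLOWLIB data; the function names and the default threshold argument (set to the 0.7 used in this study) are likewise choices of this sketch.

```python
import numpy as np


def tanimoto(x, y):
    """Tanimoto similarity of two equal-length, non-negative vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = float(np.dot(x, y))
    denom = float(np.dot(x, x) + np.dot(y, y) - dot)
    return 1.0 if denom == 0.0 else dot / denom


def similarity_search(reference, library, threshold=0.7):
    """Return (name, S) pairs with S >= threshold, most similar first."""
    scores = [(name, tanimoto(reference, fp)) for name, fp in library.items()]
    hits = [(name, s) for name, s in scores if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def find_duplicates(library):
    """Group the names of records whose fingerprints are exactly identical."""
    groups = {}
    for name, fp in library.items():
        groups.setdefault(tuple(fp), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical 4-bin fingerprints, for illustration only.
reference = [0, 2, 3, 1]
library = {
    "rec_A": [0, 2, 3, 1],
    "rec_B": [0, 2, 3, 1],
    "rec_C": [1, 2, 2, 1],
    "rec_D": [3, 0, 0, 2],
}
for name, s in similarity_search(reference, library):
    print(f"{name}: S = {s:.3f}")
print("duplicates:", find_duplicates(library))
```

Grouping by exact equality works for integer B-fingerprints; for real-valued D-fingerprints a tolerance-based comparison may be more appropriate.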

## AFLOWLIB Material Repository and Data

AFLOWLIB is a material repository of density functional theory (DFT) calculations managed by the software package AFLOW. At the time of the study, the AFLOWLIB.org database contained nearly 1.8 million compounds, each characterized by about 100 different properties. Of the characterized systems, roughly half are metallic and half are insulating. AFLOW leverages the VASP package to calculate the total energy of a given crystal structure with PAW pseudopotentials and the PBE exchange-correlation functional.

For our purpose, we need the OUTCAR file which contains the information about band energies of a particular compound and the DOSCAR file which contains the information about density of states of a particular compound, from the AFLOWLIB material repository.
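As an illustration of how these two files might be read, the sketch below parses text in the layout we assume for non-spin-polarized VASP output: in OUTCAR, a `k-point 1 :` line followed by a `band No.` header and rows of band index, energy and occupation; in DOSCAR, five header lines, a sixth line whose third field is the number of DOS points, and then rows of energy and DOS. The function names are hypothetical and the layout should be checked against the actual files before use.

```python
def read_gamma_bands(outcar_text):
    """Return (energies, occupations) for the bands listed under k-point 1."""
    energies, occupations = [], []
    in_block = False
    for line in outcar_text.splitlines():
        parts = line.split()
        if parts[:2] == ["k-point", "1"]:
            # Start of a k-point 1 block; restart so the last block in the file wins.
            energies, occupations = [], []
            in_block = True
            continue
        if not in_block:
            continue
        if parts[:2] == ["band", "No."]:
            continue  # column header
        if len(parts) == 3:
            try:
                energy, occupation = float(parts[1]), float(parts[2])
            except ValueError:
                in_block = False
                continue
            energies.append(energy)
            occupations.append(occupation)
        else:
            in_block = False  # a blank line or the next k-point ends the block
    return energies, occupations


def read_dos(doscar_text):
    """Return (energies, dos) from DOSCAR-style text (assumed layout, see above)."""
    lines = doscar_text.splitlines()
    n_points = int(lines[5].split()[2])  # third field of line 6: number of DOS points
    energies, dos = [], []
    for line in lines[6:6 + n_points]:
        parts = line.split()
        energies.append(float(parts[0]))
        dos.append(float(parts[1]))
    return energies, dos


# Made-up miniature inputs in the assumed layout, for illustration only.
sample_outcar = """
 k-point     1 :       0.0000    0.0000    0.0000
  band No.  band energies     occupation
      1      -8.5000      2.00000
      2       3.2500      2.00000
      3       4.1000      0.00000

 k-point     2 :       0.5000    0.0000    0.0000
  band No.  band energies     occupation
      1      -7.9000      2.00000
"""
sample_doscar = "\n".join([
    "header 1", "header 2", "header 3", "header 4", "header 5",
    "  10.000  -10.000  3  3.250  1.000",
    "  -10.000  0.000  0.000",
    "    0.000  1.500  3.000",
    "   10.000  0.500  4.000",
])
print(read_gamma_bands(sample_outcar))
print(read_dos(sample_doscar))
```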

## Results of Similarity Determination

By B-fingerprint and D-fingerprint similarity searches, we found some duplicate records (i.e. both the value of B-fingerprint similarity and D-fingerprint similarity are 1.0).

For example, we identified several GaAs records with identical fingerprints (ICSD Nos. 41674, 610536, 610537, 610538, 610539, 610540, 610541, 610543 in the AFLOWLIB database).

Other duplicate records with identical fingerprints are: GaP (ICSD Nos. 635030, 635032, 635033, 635034, 635035, 635036), Si2 (ICSD Nos. 60386, 60387, 60388, 60389), InTe (ICSD Nos. 169419, 169422, 169425, 169428), InP (ICSD Nos. 53105, 188691) etc.

Our first test case was a set of pairwise similarity searches between GaAs and some semiconductor materials (GaP, GaSb, Si, SnP, GeAs, InTe, InP, InAs, Ge).

The B-fingerprint and D-fingerprint similarity values are tabulated below:

Reference Compound: GaAs (ICSD No. 41674) [FCC]

TABLE-I

| Serial No. | Compound Name | ICSD No. | Structure* | B-Fingerprint Similarity | D-Fingerprint Similarity |
|---|---|---|---|---|---|
| 1. | GaP | 635030 | FCC | 0.833 | 0.894 |
| 2. | GaSb | 44979 | FCC | 0.390 | 0.790 |
| 3. | Si | 60388 | FCC | 0.263 | 0.799 |
| | | 52460 | BCT | 0.057 | 0.750 |
| | | 41392 | ORCI | 0.057 | 0.748 |
| | | 52459 | HEX | 0.182 | 0.708 |
| 4. | SnP | 77786 | FCC | 0.000 | 0.774 |
| 5. | GeAs | 17033 | BCT | 0.022 | 0.760 |
| 6. | InTe | 169425 | FCC | 0.056 | 0.842 |
| | | 640610 | CUB | 0.162 | 0.769 |
| 7. | InP | 53104 | FCC | 0.188 | 0.768 |
| 8. | InAs | 41444 | FCC | 0.655 | 0.703 |
| 9. | Ge | 53788 | FCC | 0.231 | 0.742 |
| | | 181073 | HEX | 0.029 | 0.702 |

* The structures are given in short form as mentioned in AFLOWLIB.org. The list of structures and corresponding short forms are given in APPENDIX.

Next, we took GaAs (ICSD No. 41674) as the reference material and binary compounds with chemical formula A1B1 as test materials.

Here, A = alkali metals, alkaline earth metals, transition metals (3d and 4d only), Group 3A (excluding boron); B = Group 5A, Group 6A (chalcogens), halogens. We found 157 different compounds of this type in the AFLOWLIB database. We ran our script for pairwise B- and D-fingerprint similarity searches between GaAs and these A1B1 type compounds. As a result, we found 16 different compounds with high similarity values to GaAs.

The compounds which have high similarity values with GaAs are tabulated below:

TABLE-II

| Serial No. | Compound Name | ICSD No. | Structure | B-Fingerprint Similarity | D-Fingerprint Similarity |
|---|---|---|---|---|---|
| 1. | GaP | 635030 | FCC | 0.833 | 0.894 |
| 2. | TlP | 184576 | FCC | 0.846 | 0.782 |
| 3. | TlAs | 184574 | FCC | 0.846 | 0.720 |
| 4. | YP | 185495 | CUB | 0.767 | 0.599 |
| 5. | NaAs | 182160 | TET | 0.032 | 0.711 |
| 6. | NaBi | 58816 | TET | 0.103 | 0.768 |
| 7. | BeTe | 290008 | FCC | 0.333 | 0.717 |
| 8. | AlAs | 67784 | FCC | 0.571 | 0.710 |
| 9. | GaSb | 44979 | FCC | 0.390 | 0.790 |
| 10. | GaBi | 167768 | FCC | 0.188 | 0.812 |
| 11. | GaS | 40824 | RHL | 0.081 | 0.701 |
| | | 53590 | HEX | 0.082 | 0.775 |
| 12. | InP | 53104 | FCC | 0.188 | 0.768 |
| 13. | InAs | 41444 | FCC | 0.655 | 0.703 |
| 14. | InS | 409645 | MCL | 0.212 | 0.821 |
| | | 660105 | ORC | 0.212 | 0.771 |
| 15. | InTe | 169425 | FCC | 0.056 | 0.842 |
| | | 640610 | CUB | 0.162 | 0.769 |
| 16. | TlBi | 53967 | CUB | 0.132 | 0.743 |

In a different exercise, we took ‘YbSe’ (ICSD No. 33675) as our reference compound and searched the AFLOWLIB database for A1B1 type materials similar to YbSe.

Here, A= Elements of Lanthanide series; B= Chalcogens (S, Se, Te)

We found 40 different compounds of that type. Among them, the two materials most similar to YbSe (based on B-fingerprint) with S > 0.7 were EuS (ICSD No. 631599) and YbS (ICSD No. 651441).

## Discussion

Both the B-fingerprint and D-fingerprint similarity values are very high between GaAs and GaP (from Table-I).

Moreover, the pairwise similarity values based on D-fingerprints between GaAs and each of the semiconductor materials (GaP, GaSb, Si, SnP, GeAs, InTe, InP, InAs, Ge) are high (S > 0.7).

So, when a material with semiconductor-like properties (e.g. a band gap) is needed, candidate materials can be screened initially by requiring that their D-fingerprints have a high similarity value (S > 0.7) with those of known semiconductors (e.g. GaAs).

The band structures, B-fingerprint histogram plots and density of states vs. energy plots for GaAs and GaP are shown in Figures 4, 5, 6 and 7.

Figure 4: Band structure of GaAs (ICSD No. 41674) (taken from www.aflow.org).

Figure 5: Band structure of GaP (ICSD No. 635030) (taken from www.aflow.org).

Figure 6: Construction of B-fingerprints (at the Γ point) from the band structures of GaAs (left histogram) and GaP (right histogram). The B-fingerprint similarity between these two materials is 0.833.

Figure 7: The “DOS vs. Energy” plot for GaAs (red curve) and GaP (green curve) using the AFLOWLIB data. The D-fingerprint similarity between these two materials is 0.894.

From Table-I, we see that, across different crystal structures of the same compound, the B-fingerprint similarity varies strongly whereas the D-fingerprint similarity varies only slightly. This indicates that the B-fingerprint is highly symmetry dependent, while the D-fingerprint is almost independent of the symmetry of the crystal structure.
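One simple way to quantify this observation is the relative spread, (max - min) / max, of the similarity values over the crystal structures of a given compound. The sketch below uses the Table-I values for the three materials that appear with more than one structure; the choice of this particular spread measure is ours and is only illustrative.

```python
# B- and D-fingerprint similarities to GaAs, copied from Table-I.
table_i = {
    "Si":   {"B": [0.263, 0.057, 0.057, 0.182], "D": [0.799, 0.750, 0.748, 0.708]},
    "InTe": {"B": [0.056, 0.162],               "D": [0.842, 0.769]},
    "Ge":   {"B": [0.231, 0.029],               "D": [0.742, 0.702]},
}


def relative_spread(values):
    """(max - min) / max: 0 means no variation across structures."""
    return (max(values) - min(values)) / max(values)


for name, row in table_i.items():
    print(f"{name}: B spread = {relative_spread(row['B']):.2f}, "
          f"D spread = {relative_spread(row['D']):.2f}")
```

For each of the three materials the B-fingerprint spread comes out several times larger than the D-fingerprint spread.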

Taken together, Table-I and Table-II list 20 compounds that are very similar to GaAs. Of these 20 materials, 17 are used as semiconductors. We did not find experimental band gap values for TlP, TlAs and TlBi in our search of the literature.

The experimental band gaps of some of the compounds are given below:

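TABLE-III

| Serial No. | Compound Name | Structure | Band Gap at 300 K (eV) |
|---|---|---|---|
| 1. | BeTe | FCC | 3.0 |
| 2. | AlAs | FCC | 2.16 |
| 3. | GaP | FCC | 2.26 |
| 4. | GaSb | FCC | 0.72 |
| 5. | InP | FCC | 1.35 |
| 6. | InAs | FCC | 0.36 |
| 7. | InS | ORC | 2.0 |
| 8. | InTe | FCC | 0.6 |
| 9. | Si | FCC | 1.12 |
| 10. | GeAs | BCT | 1.64 |
| 11. | Ge | FCC | 0.66 |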

BaTiO3 is widely used as a ferroelectric or piezoelectric ceramic [8]. Of the six materials most similar to BaTiO3, five (BiOBr, SrZrO3, BaZrO3, KTaO3, KNbO3) are well-known piezoelectrics [9]. All of these have a B-fingerprint similarity S > 0.8. The sixth material is cubic YbSe (ICSD No. 33675) [4]. So, the question arises whether YbSe is a ferroelectric or piezoelectric compound. We have found a high B-fingerprint similarity (S > 0.7) with cubic YbSe for EuS (ICSD No. 631599) and YbS (ICSD No. 651441). One can therefore formulate a testable hypothesis that these two materials may also be ferroelectric or piezoelectric.

## Programming Algorithm

We developed a Python script to create the B-fingerprint and D-fingerprint, and to calculate the B-fingerprint similarity and D-fingerprint similarity between any two chosen materials, using the AFLOWLIB data. The algorithm is described below:

STEP 1: Download the OUTCAR and DOSCAR files of two chosen materials from the AFLOWLIB.org database.

STEP 2: Read the OUTCAR files of both materials.

STEP 3: Store all the Band Energies at the Γ-point (K-point 1) from the OUTCAR file of each material.

STEP 4: Find the Maximum Valence Band Energy for each material and store it.

STEP 5: Set the Maximum Valence Band Energy of each material as the zero energy level and shift all the Γ-point (K-point 1) Band Energies of each material by subtracting the respective Maximum Valence Band Energy.

STEP 6: Choose the Band Energy range from -10.0 eV to 10.0 eV.

STEP 7: Divide the Band Energy range (-10.0 eV, 10.0 eV) into 32 bins, so that the energy interval of each bin becomes 0.625 eV.

STEP 8: Find the number of bands in each bin using the Band Energy data (after shifting in Step-5) for each material and store it.

STEP 9: Convert the number of bands in each bin from decimal to an 8-bit binary number. For the 32 bins of each material, this gives a 256-bit binary number, which is the B-Fingerprint of the material. Store the 256-bit binary number (B-Fingerprint) in an array. Do this step for both materials.

STEP 10: Use the definition of Tanimoto Similarity Index (given by equation-3) to find the similarity between two fingerprints. It will give the B-Fingerprint Similarity between two chosen materials.

STEP 11: Plot the Histogram of ‘Number of bands vs. Bin energy’ for both the materials.

STEP 12: Read the DOSCAR files of both materials.

STEP 13: Store all the energies and the corresponding density of states from the DOSCAR file. Do this step for each material.

STEP 14: Shift the energies by subtracting the Maximum Valence Band Energy (obtained in Step-4) for each material.

STEP 15: Choose the Energy range from -10.0 eV to 10.0 eV.

STEP 16: Divide the Energy range (-10.0 eV, 10.0 eV) into 256 bins, so that the energy interval of each bin becomes 0.078125 eV.

STEP 17: Find the Density of States in each bin. Do this for each material and store it.

STEP 18: Convert the Density of States of each bin from decimal to a 16-bit binary number. For the 256 bins of each material, this gives a 4096-bit binary number, which is the D-Fingerprint of the material. Store the 4096-bit binary number (D-Fingerprint) in an array. Do this step for both materials.

STEP 19: Use the definition of Tanimoto Similarity Index (given by equation-3) to find the similarity between two fingerprints. It will give the D-Fingerprint Similarity between two chosen materials.

STEP 20: Plot the ‘Energy vs. Density of states’ for each material in a single frame.
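Steps 4 to 10 can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the script developed in this project: the valence band maximum is taken as the highest Γ-point band with occupation above 0.5 (the steps above do not state the criterion), the band energies and occupations are made-up numbers, and the helper names are hypothetical. The D-fingerprint (Steps 12 to 19) follows the same pattern with 256 bins and 16 bits per bin, but the steps do not say how a real-valued DOS is quantized to an integer before encoding, so that part is left out.

```python
import numpy as np


def to_bits(values, width):
    """Concatenate the width-bit binary form of each non-negative integer."""
    bits = []
    for v in values:
        v = int(v)
        if not 0 <= v < 2 ** width:
            raise ValueError(f"{v} does not fit in {width} bits")
        bits.extend(int(ch) for ch in format(v, f"0{width}b"))
    return np.array(bits, dtype=np.uint8)


def tanimoto_bits(x, y):
    """Equation (4) for two equal-length 0/1 arrays."""
    a, b = int(x.sum()), int(y.sum())
    c = int(np.sum(x & y))
    return 1.0 if a + b - c == 0 else c / (a + b - c)


def b_fingerprint_bits(gamma_energies, occupations, n_bins=32, width=8):
    energies = np.asarray(gamma_energies, dtype=float)
    occupations = np.asarray(occupations, dtype=float)
    vbm = energies[occupations > 0.5].max()            # Step 4 (assumed criterion)
    counts, _ = np.histogram(energies - vbm,           # Steps 5 to 8
                             bins=n_bins, range=(-10.0, 10.0))
    return to_bits(counts, width)                      # Step 9


# Made-up Gamma-point data for two hypothetical materials.
fp1 = b_fingerprint_bits([-8.0, -3.0, 0.0, 1.5], [2, 2, 2, 0])
fp2 = b_fingerprint_bits([2.0, 7.0, 10.0, 12.5], [2, 2, 2, 0])
print(len(fp1), tanimoto_bits(fp1, fp2))               # Step 10
```

Note that the similarity computed on the binary-encoded counts can differ from the value obtained by applying equation (3) directly to the integer counts.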

## Acknowledgement

This project would not have been what it is today without the support and advice of others.

First of all, I would like to express my deep gratitude to my project supervisor, Prof. Prasenjit Sen, for the opportunity he gave me to work with him. His ideas and guidance have been invaluable. Without his constant motivation, I believe the work could not have taken its proper shape.

I am also grateful for the help, moral support and encouragement provided by Dr. Rudra Banerjee. I am also thankful to Arijit Dutta, Ph.D. student at HRI, for his cooperation in code development throughout the project.

I would also like to thank AuthorCafe for providing me a great platform to present my project report.

## APPENDIX

Bravais lattice names and symbols (as given in AFLOWLIB.org)

| Serial No. | Bravais Lattice | Lattice Symbol |
|---|---|---|
| 1 | Simple Cubic | CUB |
| 2 | Body Centred (‘I’ Centred) Cubic | BCC |
| 3 | Face Centred (‘F’ Centred) Cubic | FCC |
| 4 | Simple Tetragonal | TET |
| 5 | Body Centred (‘I’ Centred) Tetragonal | BCT |
| 6 | Simple Orthorhombic | ORC |
| 7 | Body Centred (‘I’ Centred) Orthorhombic | ORCI |
| 8 | Base Centred (‘C’ Centred) Orthorhombic | ORCC |
| 9 | Face Centred (‘F’ Centred) Orthorhombic | ORCF |
| 10 | Simple Rhombohedral (or Trigonal) | RHL |
| 11 | Simple Hexagonal | HEX |
| 12 | Simple Monoclinic | MCL |
| 13 | Base Centred (‘C’ Centred) Monoclinic | MCLC |
| 14 | Simple Triclinic | TRI |

#### References

• Rajan, K. Mater. Today 2005; 8: 38−45.

• Curtarolo, S.; Hart, G. L. W.; Buongiorno Nardelli, M.; Mingo, N.; Sanvito, S.; Levy, O. Nat. Mater. 2013; 12: 191−201.

• Potyrailo, R.; Rajan, K.; Takeuchi, I.; Chisholm, B.; Lam, H. ACS Comb. Sci. 2011; 13: 579−633.

• O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha and S. Curtarolo. Materials Cartography: Representing and Mining Materials Space Using Structural and Electronic Fingerprints. Chem. Mater. 2015; 27: 735−743.

• C. Oses, C. Toher and S. Curtarolo. Autonomous data-driven design of inorganic materials with AFLOW, submitted, arXiv:1803.05035v1 [cond-mat.mtrl-sci], 2018.

• D. Bajusz, A. Rácz and K. Héberger. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?. Journal of Cheminformatics. 2015; 7:20.

• Bhalla, A. S.; Guo, R.; Roy, R. Mater. Res. Innovat. 2000; 4: 3−26.

• Rabe, K. M.; Ahn, C. H.; Triscone, J. M. Physics of Ferroelectrics: A Modern Perspective; Springer: New York, 2010.
