Loading...

Summer Research Fellowship Programme of India's Science Academies

Data storage and cleansing techniques for clinical research

K P Aarthy

College of Engineering, Guindy, Anna University, Chennai 600025

Dr Uma S. Ranjan

Associate Professor, Centre for Brain Research, Indian Institute of Science, Bangalore, Karnataka 560012

Professor Y Narahari

Department of Computer Science and Automation, Indian Institute of Science, Bengaluru, Karnataka 560012

Abstract

The project deals with creating an efficient digital data storage with functionalities like status tracking and flexible querying to assist the research on multiple types of dementia. Dementia is a condition that describes a group of symptoms associated with a decline in memory or other thinking skills that greatly reduces a person's ability to perform everyday activities. The data required for the study on dementia is obtained from various sources under categories namely psychometric, imaging, clinical, nursing, anthropometric and genomic data. Since the project deals with huge amounts of semi-structured data that cannot be confined in a fixed schema, the data is represented using a wide column store database, a type of NoSQL database. Wide column stores, also called extensible record stores, store data in records with an ability to hold very large numbers of dynamic columns. Column-oriented databases are typically useful to store denormalized data. They are a good choice in a query-heavy environment, thus making them very useful in the context of data analytics. With these considerations, Apache Cassandra, a column-store database was chosen for the data storage. Cassandra accommodates all possible data formats including structured, semi-structured, and unstructured. It can dynamically accommodate changes to the data structures as and when modifications are needed. The data about the study are availed in the form of CSV (Comma-separated values) and JSON files. In order to script the interface needed for data preprocessing and cleansing, python scripts are used. These scripts are used to make the necessary modifications to the data such as datatype conversions so as to meet the requirements of the Cassandra environment, managing erroneous values and anomalies prevailing in the data. A Progressive Web Application (PWA) is also developed as the presentation layer to enable efficient data entry and querying.

Keywords: NoSQL, progressive web application, anthropometric, genomic, denormalized data

Abbreviations

 Abbreviations
PWAProgressive Web Application
RDBMSRelational Database Management System
ACIDAtomicity, Consistency, Isolation and Durability properties
CAP Consistency, Availability and Partition Tolerance
DOMDocument Object Model
WSGI Web Server Gateway Interface 
CSSCascading Style Sheet
HTMLHypertext Markup Language
SANSCOGSrinivaspura Aging Neuro Senescence and COGnition study
HTTPSHypertext Transfer Protocol Secure

INTRODUCTION

Background

Dementia is a medical condition that involves a decline in mental ability severe enough to hurdle activities of daily life. Alzheimer's is the most common type of dementia. Alzheimer's and other forms of dementia are a major health concern among the elderly in developing countries like India. The aim of the project is to create a digital data storage with status tracking and flexible querying to assist the research on various types of dementia. Data collection is performed at various settings - by field workers in home visits, where internet connections could be flaky or non-existent, by trained clinicians in semi-rural office settings as well as imaging studies done at referral hospitals. All the data collected for an individual must be aggregated together for analysis without any mix-up. Further, no unauthorized person should be allowed to modify the data. For instance, a clinician should not be allowed to modify the registration data. This makes field data collection a complicated process prone to errors when collected manually. Most data collection endeavors rely on a unique identifier per subject to collate data corresponding to the subject. This unique identifier is generated centrally, which involves field workers reporting to a central location before leaving for their home-to-home data collection. This also introduces inefficiencies in the process. One of the issues in data collection is also that the field data workers do not sync up the data they collect regularly with the servers. This results in valuable data lying in field devices. If the device is lost or develops issues, the data may be lost for ever. Further, this gives a poor indication of the data collection status. The project deals with replacing traditional field data collection methods and providing an efficient means to collect field data and ease future analysis as well. This can reduce errors due to incorrect data entry and other manual errors. A front-end for data collectors to interact with the database is also necessary for ease of entering, and reducing manual processes to the extent possible.

Statement of the Problems

Data crowd sourcing is an integral part of data collection for clinical research. Primitive methods like handwritten forms or spreadsheets make way to humongous errors, whilst also making analysis difficult in the future. Also, the data collection process cannot be completely made online as a strong internet connection can not always be guaranteed in the sites of data collection. The structure of the collected data might not be rigidly structured, hence it can not be confined to a regular relational database.

However, the data does correspond to a loose structure, with most of the data types remaining constant for many years.

Hence, the problem consisted of

  • Designing a database structure for efficient representation of “mostly structured” data
  • Designing a responsive user interface for entering data
  • Enable viewing of stored data by authorized personnel on multiple devices, including mobile devices
  • Designing a system which can provide real-time insights into the data collection process.

Data Collection

The clinical research data are collected for the SANSCOG study, an initiative the Centre for Brain Research and other eminent institutions like NIMHANS, Sri Devaraj Urs Medical College, Kolar and other departments of IISc. The study is envisioned as a community based cohort study with long term follow-up over many years for comprehensive evaluation of risk and protective factors associated with cognitive changes due to normal ageing, Alzheimer's disease and other related disorders. The study cohort comprises nearly 10000 cognitive healthy individuals without dementia in the age group of 45 years and above. The cohort will be evaluated with detailed clinical, neurocognitive, lifestyle, anthropometric, biochemical, genetic and multi-nodal neuroimaging measures in the baseline and during follow up. The Srinivasapura taluk in Kolar district in Karnataka is the site of study.

DESIGN

A data warehouse is the central place of a business Intelligence system. When we deal with huge volumes of data, new approaches are required to make the analytical system more efficient. An IDC study examined emerging trends in DBMS technology and considers that in the next 5 years, most data warehouses will be stored in a columnar fashion. The size of data warehouses have constantly increased with years. The challenge is to maintain the performance of these repositories that are built as relational structures in row-oriented manner. A new approach called column-oriented database was as proposed as an alternative to row-oriented databases as size and complexity of data warehouses increased. In a column-oriented database, the data storage layer contains columns.

Usually the data collected by organisations or researchers alike, is mostly unstructured data, images and other files that need a greater storage space unlike structured data. Row-oriented databases are designed for transactional processing. All the attributes of an entity are stored in a single row. This is not very useful in an analytical system where a lot of read operations are executed in order to access a small number of attributes from a huge collection of data. This poses as a disadvantage in row oriented databases and makes it unsuitable for large scale processing. Hence, column-oriented databases are best suited for information retrieval from data warehouses by aggregating large volumes of data, and for faster read operations​.[1]

In traditional mobile application development, re-usability of code between native applications, web and mobile platforms had been inherently non-existent owing to their non-inter operable code bases. Hence, cross-platform development has become an alternative to reduce the resources spent on specialised applications for every target platform.[2]

Progressive Web applications combine the best of web and mobile applications giving a rich experience just like the native applications. The technology was designed and developed by Google. PWAs are always up-to-date as the service workers are updated often. They can be launched like native mobile applications without the problem of making them available in an application store. They can load with a comparatively lower latency, as once the service worker installs in the background, the website will be very fast to a web visitor.[3]

With the knowledge obtained from the works of various authors and researchers, column-oriented databases were chosen for data storage and manipulation. In order to facilitate efficient data entry and querying, a progressive web application was decided to be developed so that they can be used for field data collection in places with weak or no internet access.

Database Design

Column-oriented databases

Column-oriented database, otherwise known as Wide column stores organize the data into columns, and each columns are divided into any groups. It is a type of NoSQL database. NoSQL-based systems are very useful in large databases that are prone to performance problems primarily caused by the limitations of SQL and the relational model of databases. They do not use a fixed schema and SQL joins.

Advantages of column-oriented database over row-based databases

  • Structure and flexibility

It has a dynamic schema for unstructured data and semi-structured data. This provides flexibility. Hence, addition of fields at a later time to the database is feasible.

  • Scalability

SQL databases are vertically scalable (ie) the load on a server can be increased by increasing CPU, RAM or SSD. NoSQL databases, on the other hand are horizontally scalable (ie) By sharding or adding more servers, the traffic can be handled.

  • Speed

It is best suited for analytical queries that perform aggregate operations. As storage is optimised for block access, by storing the data beside each other, locality of reference can be exploited. This improves performance as it provides optimal performance for sequential access.

rowvscolumn.png
    Comparison of Row-oriented and column-oriented databases

    User Interface Design

    User interface on mobile devices is mostly in the form of native applications. However, native applications have the disadvantage that they must be developed independently for each platform. Moreover, change of interface involves re-distribution and update on all mobile devices. In order to avoid these, we considered the development of a progressive web application (PWA) for the user interface.

    Progressive web application

    PWAs are identified as mobile applications that are built using technologies like HTML, CSS and Javascript. They are delivered through the web. PWAs provide a range of functionalities that are provided by native applications on mobile devices, including working offline and push notifications. They need not be distributed via digital distribution systems as fundamentally, they are a type of web applications. Precisely, a PWA is a web application that delivers an application-like experience to users, using modern web-capabilities.

    Benefits of PWA

    • Cost of development and distribution:

    As PWAs target the web, it can work on all platforms. Hence, the cost of maintaining different teams for different platforms is minimised. Hence, PWAs and hybrid applications are increasingly being used by companies.

    • Poor conversion and friction:

    The effort of convincing a user to install an application is called friction. The native applications are distributed through centralized application stores. Each step in installing an application including finding it in the application store, installing, authenticating, accepting and allowing permissions, waiting for the download and finally opening it, increases the conversion by about 20% between a potential user and the application.

    PWAs have an advantage in terms of accessibility as they are easily installable. They can be easily accessed like a website and quickly and conveniently used like a native application.

    • Network resilience:

    Reliability is another factor that brings PWAs to popularity. Offline features are an additional advantage to an application. The web was not initially designed to deal with offline situations. On the other hand, Native applications are built using tools and practices that were specifically designed to deal with devices that are often exposed to poor or no network access.

    Service workers helped the web applications to have a way of controlling network access themselves. This allows the developers to cache the needed assets and how to deal with disconnected or poor network situations.

    • User engagement:

    Web applications have been very poor at engaging users. Native and hybrid applications have the benefit of being in the home screen and application list. This makes them easily accessible. These features can be now availed using progressive web applications, thus the web can inculcate these features as well.

    DETAILED SOLUTION DESIGN

    The user interactions take place through the progressive web application. The requests are handled by the service worker, which fetches the required from the cache or the server, whatever suits the need. In the absence of network, the data gets stored in the IndexDB in the web browser, or it sends the data straight to the server otherwise. Periodically, whenever network is available, the data from the IndexDB is sent to the server database.

    An overview of the application and its working is described below.

    pwa_workflow_1.jpeg
      Overview of the working of the application

      Database Design

      Data modeling is the first step of database design. Since, row-oriented databases do not support semi-structured data, column-oriented database, a type of NoSQL database was chosen.

      Apache Cassandra, an open-source, distributed, wide column store database was chosen to model the data.

      • Data is automatically replicated to multiple nodes for fault tolerance.
      • Every node in the cluster is identical. There is no single point of failure or network bottlenecks.
      • Cassandra is suitable for this application as it demands durability. It can't afford to lose data even if the entire data center goes down.

      The research data collected is modelled into the required model. The partition keys are chosen based on the requirement of the data and the application.

      The data obtained from the HTML forms or other means could differ from the requirements of the Cassandra environment. Minor changes like these were handled using python scripts that convert the data from incompatible types to compatible ones. For instance, python scripts were created that change the dates to the formats relevant to the Cassandra datatypes. Any other modifications to change the incoherent data to cleansed data using data cleansing scripts were written in python.

      The following code converts the irrelevant date formats in the CSV file to a pertinent format.

      1. import cassandra
      2. from cassandra.cluster import Cluster
      3. from cassandra.auth import PlainTextAuthProvider
      4. from flask import Flask
      5. from flask import Flask, render_template,request
      6. #defining the flask web application
      7. app = Flask(__name__)
      8. #routes to this application when the url ends with /insertion
      9. @app.route('/ insertion',methods=['GET', 'POST'])
      10. def entry()
      11. if (request.method == 'POST'):
      12. val1 = request.form['originalid']
      13. val2 = request.form['platelet']
      14. credential=PlainTextAuthProvider(username ='cassandra', password = 'cassandra')
      15. cluster = Cluster(auth_provider = credential )
      16. session = cluster.connect()
      17. #executing the CQL commands to insert the data into the database
      18. session.execute('USE sample')
      19. session.execute("""INSERT INTO blooddata(originalid,platelet) values (%s,%s)""",(val1,int(val2)))
      20. return "Success"
      21. #routes to the application when the url ends with /display
      22. @app.route('/ display',methods=['GET','POST'])
      23. def result():
      24. credential=PlainTextAuthProvider(username = 'cassandra', password = 'cassandra')
      25. cluster = Cluster(auth_provider = credential)
      26. session = cluster.connect()
      27. #sample is the keyspace to connect to
      28. session.execute('USE sample')
      29. rows = session.execute('SELECT originalid,platelet from blooddata')
      30. details=dict(rows)
      31. return render_template('display.html', result = details)
      32. if __name__ == '__main__':
      33. app.run(debug = True)
      Flask application codes to insert and display sample data to and from the database

      The PWA

      Components of a PWA

      The three important components of a PWA are

      A web manifest file:

      The Web manifest file contains the meta data about the PWA. This file describes the PWA to the browser like the icons that represent the application on the home screen and desktop. The web manifest is a JSON file that contains a list of properties that describes the meta data. The platform locates it with the reference made in the Head. The primary properties include name, short_name and icons. These properties compose the home screen icon and title. The remaining properties control the PWA launch process.

      Service Worker:

      Service workers intercept all network requests and provide the way to determine how the request is handled. It provides a persistence medium to keep network requests like other pages, images, scripts, CSS files etc,. in a controllable cache. When a network request is made it passed through the service worker where it is decided if the cached response is returned or make the network round trip. The cached response enables the website to be available offline.

      HTTPS:

      HTTPS encrypts network requests to protect the data and enables modern web platform features. PWA requires the application to be secured via HTTPS to prevent snooping and ensure content has not been tampered with.

      Working of the PWA

      The front-end of the application is the progressive web application. It serves as the presentation layer to the application and its functionalities. The data from the form is collected in an HTML file, the data is sent to the Database present in the server, using a flask application in the availability of network. In case of absence of network, the data gets stored into the IndexDB. IndexDB is a low level API for client-side storage. It stores significant amount of data including files and blobs.

      The service worker is made to cache the required files. In this application, the HTML, CSS, Javascipt and other static files that make up the application shell were cached. Hence, even when the application is offline, the data is fetched from the cached contents.

      When the application is installed in the device, the event handler in the service worker caches the static assets in the browser. Once the application is activated, the service worker sets itself as a controller for all the clients within its scope. The service workers fetches the data from the cache whenever there is an absence of internet. Else, it fetches from the server and updates the cache so that the cache always has fresh and updated data.

      In order to enable offline viewing, the static assets that are to be displayed even in the absence of network access are cached using the service worker. In this application, the home page, the javascript file that registers the service worker and the CSS file that describes how the HTML elements are to displayed are cached. Javascript can access and change all elements of an HTML document using DOM. The Javascript file that registers the service worker is also made to modify the HTML elements using DOM to insert dynamic cached data.

      fromt_end.jpeg
        Working of the front end of the application

        Back End

        The back end of the application is built using python. The data collected from the form written in HTML in the front-end, is interpreted using Flask, a micro web framework written in Python, and based on Werkzeug and Jinja 2. The Flask codes are used to obtain the data from the HTML files and to put the read data back to an HTML file.

        The Python file interacts with the Cassandra database using Python driver for Cassandra. The data read from the HTML form elements are written to the database by instantiating a Cluster and executing a session by connecting to the cluster. The retrieved data from the database is rendered to HTML in the user's browser, using Flask that uses the Jinja template library to render templates.

        The Python files are hosted on the server using WSGI. WSGI is a specification of a generic API for mapping between an underlying web server and a Python web application. It provides a common mechanism for hosting Python applications on servers supporting python programming language. An Apache HTTP Server module called mod_wsgi provides a WSGI compliant interface for hosting Python based web applications under Apache.

        A WSGI application has to be mounted against a specific URL. This is done by explicitly declaring in the main Apache configuration file. The directive called WSGIScriptAnalysis is used for mod_wsgi to host the Flask applications.

        1. import cassandra
        2. from cassandra.cluster import Cluster
        3. from cassandra.auth import PlainTextAuthProvider
        4. from flask import Flask
        5. from flask import Flask, render_template,request
        6. #defining the flask web application
        7. app = Flask(__name__)
        8. #routes to this application when the url ends with /insertion
        9. @app.route('/ insertion',methods=['GET', 'POST'])
        10. def entry()
        11. if (request.method == 'POST'):
        12. val1 = request.form['originalid']
        13. val2 = request.form['platelet']
        14. credential=PlainTextAuthProvider(username ='cassandra', password = 'cassandra')
        15. cluster = Cluster(auth_provider = credential )
        16. session = cluster.connect()
        17. #executing the CQL commands to insert the data into the database
        18. session.execute('USE sample')
        19. session.execute("""INSERT INTO blooddata(originalid,platelet) values (%s,%s)""",(val1,int(val2)))
        20. return "Success"
        21. #routes to the application when the url ends with /display
        22. @app.route('/ display',methods=['GET','POST'])
        23. def result():
        24. credential=PlainTextAuthProvider(username = 'cassandra', password = 'cassandra')
        25. cluster = Cluster(auth_provider = credential)
        26. session = cluster.connect()
        27. #sample is the keyspace to connect to
        28. session.execute('USE sample')
        29. rows = session.execute('SELECT originalid,platelet from blooddata')
        30. details=dict(rows)
        31. return render_template('display.html', result = details)
        32. if __name__ == '__main__':
        33. app.run(debug = True)
        Flask application codes to insert and display sample data to and from the database

        RESULTS AND FUTURE SCOPE

        Result

        The progressive web application was developed with an user friendly interface to ease data entry. The cached data storage enables the application to work offline so that the field data collection can be made feasible in rural areas with a little or no access to internet. The column-oriented database enables storing semi-structured data as the data collected might need modifications in the future. The nodes in the Cassandra data center are replicated based on the criticality of the data. Thus, the problem of losing data is minimised and durability is increased.

        Future Scope

        The data modelled in the database is well structured and can be used as a data warehouse for future data analysis. The data stored in the database is tuned according to the changes to meet the requirements of the Cassandra environment. The modifications include the changes in the format of date, time stamp. Since the research on dementia aims to analyse the likeliness of the cognitively healthy individuals developing dementia, an extended data analysis can help in identifying the potential chances of people developing dementia.

        ACKNOWLEDGEMENTS

        I would like to render my gratitude and thanks to my guide Professor Y Narahari, who guided me to a project of my interest at the Centre for Brain Research. It is a pleasure to have worked under Dr Uma Satya Ranjan, Associate Professor at the Centre for Brain Research, who guided and lead us, all through the internship period. I am grateful to Dr Uma, for having provided us with the required technical and moral support. I would like to thank the Indian Academy of Sciences for providing me with the opportunity of the fellowship and Indian Institute of Sciences, Bengaluru for a platform to work on my project.

        I thank my co-interns and my friends at the Indian Academy of Sciences Fellows Residency, for being a moral support during my internship period and making this period a memorable one.

        I thank my family for their unconditional love and belief in me, with which I was able to successfully complete my fellowship.

        APPENDIX

        1. //describes the name of the cache and assets to be cached
        2. const cacheName = 'cbr-cache-v1';
        3. const staticAssets = ['./', './index.html', './app.js', './styles.css'];
        4. //describes the actions under various events
        5. self.addEventListener('install', async event => {
        6. const cache = await caches.open(cacheName);
        7. await cache.addAll(staticAssets);
        8. });
        9. self.addEventListener('activate', event => {
        10. event.waitUntil(self.clients.claim());
        11. });
        12. //dynamic caching is handled here
        13. self.addEventListener('fetch', event => {
        14. const req = event.request;
        15. if (/.*(json)$/.test(req.url)) {
        16. event.respondWith(networkFirst(req));
        17. } else {
        18. event.respondWith(cacheFirst(req));
        19. }
        20. });
        21. async function cacheFirst(req) {
        22. const cache = await caches.open(cacheName);
        23. const cachedResponse = await cache.match(req);
        24. return cachedResponse || networkFirst(req);
        25. }
        26. async function networkFirst(req) {
        27. const cache = await caches.open(cacheName);
        28. try {
        29. const fresh = await fetch(req);
        30. cache.put(req, fresh.clone());
        31. return fresh;
        32. } catch (e) {
        33. const cachedResponse = await cache.match(req);
        34. return cachedResponse;
        35. }
        36. }
        The service worker

        References

        • Column-Oriented Databases, an Alternative for Analytical Environment Gheorghe MATEI Romanian Commercial Bank, Bucharest, ROMANIA

        • Progressive Web Apps: The Possible Web-native Unifier for Mobile Development Andreas Biørn-Hansen , Tim A. Majchrzak and Tor-Morten Grønli.

        • Progressive Web App for Educational System Amit Mhaske, Aditya Bhattad, Priyanka Khamkar, Radhika More

        More
        Written, reviewed, revised, proofed and published with