About

Education

Experience

Projects & Research

Contact

Pras who?

Knowledge is Power

Things I do for love

Build Succeeded.

You’ve got a friend in me

Hi. I am Prasanna
CS Grad @ UCSD | B.E. CSE @ SSNCE | Computer Science Enthusiast

Welcome aboard!

About Me

Hello World! I'm Prasannakumaran, a computer science graduate student pursuing a master's degree in CS at the University of California, San Diego. My academic journey is powered by a deep-seated passion for computer science, with a particular fascination for machine intelligence and modelling the intricacies of human behavior.

Beyond the classroom, I'm driven by a burning desire to make a tangible impact on our community by designing solutions to real-world challenges. My role as a leader in research and application projects has been instrumental in bringing our work to the global stage, with publications at prestigious conferences in Australia and Poland.

When I'm not engrossed in my laptop, you'll often find me in the midst of late-night DotA skirmishes, honing my military prowess in Age of Empires, or immersing myself in the enchanting universe of The Witcher. My lifelong love affair with computers and gaming began at the tender age of five, culminating in the exhilarating experience of building my own custom PC during my early teens—an ode to my enduring tech passion.

Beyond the digital realm, my interests encompass marathons of TV shows, movies, and anime featuring captivating characters with whom I resonate, including the freedom-seeking Eren Yeager, the brave and bold Batman, the morally complex Jaime Lannister, and the hilariously irreverent Michael Scott.

In recent times, my professional endeavours have centred on the captivating domains of explainable machine learning, computer vision, and AI for health. These pursuits underscore my unwavering commitment to pushing the boundaries of technology to enhance the well-being of individuals and communities.



If you want to know more about me or wish to collaborate, feel free to ping me. Let's talk.

Resume

Programming languages covered

Python

C

Java

JavaScript

SQL

Cypher

and others.

Education

Kon'nichiwa!! I'm currently on the exciting path of pursuing a Master's degree in Computer Science and Engineering at the University of California, San Diego. My adventure in computer science began with a memorable two-year stint at the Solarillion Foundation in Chennai, where I first dipped my toes into the fascinating world of machine learning research. Let me tell you, that experience ignited a real passion in me. UCSD is the real deal: top-notch professors, cutting-edge research facilities, a seriously competitive peer group, and a network of accomplished alumni that's pretty darn impressive. And here's the scoop: I'm all set to don the cap and gown and graduate from UCSD in June 2024. So stick around, because there's a lot more to come on this journey, and I'm stoked for the years ahead.

Chinmaya Vidyalaya

Chennai, India

School


2004 - 2018

High School Courses

  • Computer Science
  • Mathematics
  • Physics
  • Chemistry

SSN College of Engineering

Chennai, India

Bachelor of Engineering in Computer Science and Engineering

Coursework

2018 - 2022


  • Problem Solving and Programming in Python
  • Algebra and Calculus
  • Engineering Physics
  • Engineering Chemistry
  • Engineering Graphics
  • Communicative English
  • Programming in C
  • Complex Functions and Laplace Transforms
  • Physics for Information Science
  • Environmental Science
  • Basic Electrical, Electronic and Measurement Engineering
  • Technical English
  • Discrete Mathematics
  • Digital Principles and System Design
  • Data Structures
  • Object Oriented Programming in Java
  • UNIX and Shell Programming
  • Principles of Communication Engineering
  • Probability and Statistics
  • Computer Organization and Architecture
  • Operating Systems
  • Design and Analysis of Algorithms
  • Database Management Systems
  • Software Engineering
  • Computer Networks
  • Microprocessors and Interfacing
  • Theory of Computation
  • Artificial Intelligence
  • Introduction to Cryptographic Techniques
  • Logic Programming
  • Internet Programming
  • Compiler Design
  • Introduction to Machine Learning
  • Object Oriented Analysis and Design
  • Foundations of Data Science
  • Brain Machine Interface
  • Distributed Systems
  • Mobile Computing
  • Graphics and Multimedia
  • Management and Ethical Practices
  • Social Network Analysis
  • Data Warehousing and Data Mining
  • Renewable Energy Sources
  • Big Data Analytics
  • Final Year Project

University of California San Diego

San Diego, California, United States

Master's in Computer Science

Coursework

2022 - 2024


  • CSE 258 - Web Mining and Recommender Systems (Prof. Julian McAuley)
  • CSE 250A - Principles of Artificial Intelligence: Probabilistic Reasoning and Decision-Making (Prof. Taylor Berg-Kirkpatrick)
  • MED 277 - Introduction to Biomedical NLP (Prof. Shamim Nemati and Prof. Michael Hogarth)
  • CSE 202 - Algorithm Design and Analysis (Prof. Russell Impagliazzo)
  • CSE 251B - Neural Networks & Pattern Recognition (Prof. Garrison W. Cottrell)
  • CSE 293 - Independent Research (Nemati Lab)
  • CSE 256 - Statistical Natural Language Processing (Prof. Ndapa Nakashole)
  • CSE 291 - Unsupervised Learning (Prof. Sanjoy Dasgupta)
  • ECE 285 - Deep Generative Models (Prof. Pengtao Xie)
  • CSE 293 - Shtrahman Lab (Department of Neuroscience)
  • CSE 500 - Teaching Assistantship

Experiences

2019 – 2022

Research & Teaching Assistant, Server Maintenance and Development

My journey at the Solarillion Foundation began in December 2019, when I started as a research assistant. During this time, I led a research team of five where we worked on fake news detection, which resulted in publications at a renowned international conference in Australia (AusDM). Additionally, I contributed to the field of malware analysis, where I held the role of second author, and our work earned recognition at a conference in Poland (ICCS).

In March 2020, I took on the role of teaching assistant, where I had the privilege of mentoring over 15 students during their orientation phase. I taught Python and machine learning, guiding students through data wrangling, modelling, and result analysis—laying the foundation for solving complex machine learning problems. As a dedicated member of the project review committee, I reviewed and provided valuable insights for over 20 students' project reports. My involvement didn't stop there; I played a crucial role in the Server Team, where I developed modules and maintained the compute server at the Solarillion Foundation. Throughout these experiences, I've had the opportunity to contribute significantly to both research and education while honing my skills and expertise in the field of computer science.

2023 – Present

Research & Teaching Assistant

My academic journey has seamlessly converged with my graduate school experience, fostering collaborations with esteemed research labs. During the Winter 2023 quarter, I worked at the Nemati lab, where I confronted a pivotal challenge in medical data analysis. Due to the inherent variability of data collection practices across healthcare institutions, the data available for modelling is very sparse. Such data results in a large number of uncertain predictions where the data points are clustered around the decision boundary. These potentially ambiguous predictions can significantly hinder the reliability of the predictive models. To address this, I implemented variational auto-encoders to derive a latent representation space, generating augmented data that, when integrated with the original training data, substantially enhanced the performance of a sepsis predictor.
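The latent-space augmentation step can be sketched as follows; the linear encoder and decoder below are illustrative stand-ins for the trained variational auto-encoder, and all shapes, names, and noise levels are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained VAE: in the real pipeline the encoder and
# decoder are learned neural networks; here they are fixed linear maps.
W_enc = rng.normal(size=(4, 2))   # maps 4-d records to a 2-d latent space
W_dec = np.linalg.pinv(W_enc)     # decoder approximately inverts the encoder

def augment(X, n_samples, noise=0.1):
    """Encode real records, perturb them in latent space, decode back."""
    z = X @ W_enc                                  # latent representations
    idx = rng.integers(0, len(z), size=n_samples)  # resample real latents
    z_aug = z[idx] + noise * rng.normal(size=(n_samples, z.shape[1]))
    return z_aug @ W_dec                           # synthetic records

X_real = rng.normal(size=(20, 4))     # 20 sparse patient records (toy data)
X_aug = augment(X_real, n_samples=50)
X_train = np.vstack([X_real, X_aug])  # original + augmented training set
print(X_train.shape)                  # (70, 4)
```

Training the sepsis predictor then proceeds on the combined set rather than the sparse originals alone; with a real VAE, the perturbation comes from sampling the learned latent distribution rather than adding fixed Gaussian noise.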

Fueled by an insatiable curiosity about the intricacies of the human brain, I joined Prof. Shtrahman's lab in the Department of Neuroscience. My current focus revolves around exploring hyperexcitability and synchrony in neuronal activity within an epilepsy mouse model, utilizing calcium imaging data from the dentate gyrus and CA1 region. My responsibilities include establishing a robust data pipeline, automating simulations with MATLAB scripts, and conducting statistical analyses. These analyses aim to uncover distinctive neuron firing patterns associated with normal and epileptic brain activities, considering both sparse and dense neural code data. Also, I employ various statistical tests to identify correlations in neuron firing behavior.

Beyond my research pursuits, I had the privilege of serving as the Teaching Assistant for CSE 21: Mathematics for Algorithms and Systems Analysis. In this role, I managed a class of 35+ students, formulated examination question papers, organised weekly discussion sessions, and held office hours. I am excited to continue in this role for the Fall 2023 quarter, with an expected class size of 200+ students.

Projects that I have worked on

A Non-Intrusive Machine Learning Solution for Malware Detection and Data Theft Classification in Smartphones

Machine Learning Research

In our research endeavour, we introduced a non-intrusive machine learning pipeline designed for malware detection and the identification of stolen data categories. As the second author of this work, my contributions encompassed several key aspects. I played a pivotal role in formulating the problem statement, architecting the machine learning pipeline, performing data preprocessing, and conducting additional experimentation, particularly in the realm of progressive learning. Moreover, I diligently analyzed and rigorously validated our research findings.


Our research presented a comprehensive evaluation of the proposed model's performance, employing the publicly available data collection framework Sherlock. This framework, renowned for being the largest open-source real-world mobile dataset at the time, houses a vast repository of data, totaling 4 terabytes. Notably, our machine learning architecture demonstrated exceptional accuracy, with an error rate of less than 9% in malware detection. Furthermore, our model showcased the ability to confidently classify the type of data being stolen, achieving an accuracy rate of 83%.


It is noteworthy that our research garnered recognition in the academic community and was published at ICCS'2021, a prestigious conference classified with an A ranking in the CORE classification system. This accomplishment underscores the significance and impact of our contributions to the fields of machine learning and cybersecurity.

Code link

SOMPS-Net : Attention based social graph framework for early detection of fake health news

Deep Learning Research (Graph Machine Learning)

In this research work, we introduced an innovative social-graph-based framework tailored for the early detection of fake news. Notably, our approach stands out for not relying on the content of the articles, making it versatile and applicable across diverse domains. As the project lead, I had the privilege of spearheading a diverse team of five researchers, ranging from junior to senior members. My contributions encompassed shaping the problem statement, crafting the architectural framework, handling data preprocessing, and conducting an in-depth analysis of our research findings.


Our methodology revolves around the utilisation of Twitter engagements, including tweets, retweets, and replies, coupled with essential meta-information associated with the articles. This collective data is harnessed to classify articles as either genuine or counterfeit. Our innovative architecture incorporates cutting-edge technologies, specifically Graph Convolutional Neural Networks (GCNN) and Multi-Head Attention (MHA). We rigorously evaluated our model's performance using the FakeHealth dataset, which encompasses fraudulent health news across various topics, including cancer, Alzheimer's, and stroke. Impressively, our model outperformed existing state-of-the-art fake health news detection models by a substantial margin of 17.1% on similar training configurations. Notably, our model exhibits remarkable proficiency and is capable of identifying fake news with a high degree of certainty, achieving an accuracy rate of 79% within a mere 8 hours of the article's broadcast.
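As a rough sketch of the Multi-Head Attention component, the toy pass below attends over a batch of engagement embeddings; the random projection weights and all dimensions are illustrative stand-ins, not the trained SOMPS-Net parameters:

```python
import numpy as np

rng = np.random.default_rng(5)

def multi_head_attention(X, n_heads=2):
    """Scaled dot-product attention with randomly initialised heads."""
    d = X.shape[1]
    dh = d // n_heads
    outs = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d, dh)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(dh)                 # pairwise logits
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # softmax per row
        outs.append(weights @ V)                       # weighted combination
    return np.concatenate(outs, axis=1)                # concatenate the heads

engagements = rng.normal(size=(6, 8))  # 6 tweet/retweet/reply embeddings
out = multi_head_attention(engagements)
print(out.shape)                       # (6, 8)
```

Each head attends to the engagement sequence independently, letting different heads specialise in different interaction patterns before the outputs are fused.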


This research breakthrough underscores the potential of our approach to significantly enhance the detection of fake news across various domains while also offering a valuable contribution to the field of early fake news detection.

Code link

ECMAG - Ensemble of CNN and Multi-Head Attention with Bi-GRU for Sentiment Analysis in Code-Mixed Data

Deep Learning Research (Natural Language Processing)

In this work, we introduced an innovative ensemble framework tailored for the nuanced task of sentiment analysis in code-mixed text. This work held a significant position within the "Sentiment Analysis for Dravidian Languages in Code-Mixed Text" task at FIRE 2021. I had the privilege of leading our dedicated team under the expert guidance of Dr. D. Thenmozhi. My contributions to this project encompassed designing the architectural framework and the crucial step of data preprocessing.


The proposed model was evaluated on the code-mixed YouTube comments dataset. Our architectural approach drew strength from the utilisation of XLM-R sub-word embeddings, which served as the foundation for our methodology. Notably, our model featured two pivotal components: the Convolutional Neural Network for Texts (CNNT) and the Multi-Head Attention Pipelined to Bi-GRU (MHGRU). These components adeptly harnessed the power of XLM-R sub-word embeddings and generated insightful outputs. The fusion of these outputs formed the basis of our final sentiment predictions.

Code link

Department Graph Database System

Backend Development, Database Management

I had the opportunity to develop the department's crucial backend database system, a project with the clear goal of simplifying faculty access to complex database records. In collaboration with our team, we conceived and designed a dynamic graph-based database system tailored for storing and querying college data. This innovative proof-of-concept system was meticulously constructed using Neo4j and Node.js.


My contributions to this project encompassed multiple facets. Firstly, I played a pivotal role in designing the schema, a foundational blueprint that proved to be adaptable across various use cases. This schema formed the backbone of our system, ensuring its versatility and scalability. Additionally, I led the design and development of two essential sub-databases within the system. One of these sub-databases was dedicated to handling Create, Read, Update, and Delete (CRUD) operations, while the other was designed to manage permissions for diverse entities or nodes within the graph database. These entities ranged from faculty members to event nodes, and the meticulous management of permissions added a layer of security and control to our system.


Furthermore, I handled the development of the system's connector, a critical component that interfaced with the Neo4j API. This connector efficiently executed database queries and delivered restructured data to the frontend, ensuring a seamless and responsive user experience. This project proved to be an invaluable learning experience for me, especially in the realms of backend development and database management. I gained insights into writing clean, efficient, and bug-free code, an essential skill set for any developer. Overall, this project not only contributed to the department's data accessibility but also enriched my expertise in backend systems and database management.
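To give a flavour of how a connector talks to the graph database, here is a minimal sketch of building a parameterized Cypher statement in Python; the `Faculty`/`Department` labels and the `MEMBER_OF` relationship are hypothetical stand-ins for the actual schema:

```python
def merge_faculty_query(name, dept):
    """Build a parameterized Cypher MERGE for a faculty sub-graph.

    In a real connector this (query, params) pair would be executed via
    the Neo4j driver's session.run(); the labels and relationship type
    here are illustrative, not the deployed schema.
    """
    query = (
        "MERGE (f:Faculty {name: $name}) "
        "MERGE (d:Department {name: $dept}) "
        "MERGE (f)-[:MEMBER_OF]->(d)"
    )
    return query, {"name": name, "dept": dept}

q, params = merge_faculty_query("Dr. Rao", "CSE")
print(params["dept"])  # CSE
```

Using `MERGE` rather than `CREATE` keeps the write idempotent, so repeated requests from the frontend do not duplicate nodes.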

Languages and technologies used
  • Node.js
  • JavaScript
  • Neo4j
  • Cypher
  • Google Auth

HEalthiMAN

Web Development

I initiated this project as a sincere gesture to give back to the community, particularly during the trying times of the COVID-19 pandemic, which left many people grappling with both mental and physical challenges.


I engineered a dynamic Flask web application with a noble purpose: to enhance the overall well-being of its users. This application serves as a valuable resource by generating personalised meal plans based on the user's dietary preferences, whether they're keto, vegetarian, paleo, or any other. Furthermore, it offers a dedicated personal blog section for users to express themselves and share their experiences, as well as a news feed section that exclusively provides the latest health-related updates.


It's worth mentioning that the application is fully operational and accessible to users, as it's deployed on the Heroku web server. This project represents our commitment to contributing positively to the well-being of individuals, especially during challenging times.

Languages and technologies used
  • Python
  • Flask
  • SQL
  • Heroku
  • HTML
  • CSS

TARS: Workplace Automation Bot for Solarillion Foundation

Development

During my tenure with the Server Maintenance and Development Team at the Solarillion Foundation, I undertook a significant project aimed at streamlining the scheduling of meetings. This innovative module automates the entire meeting booking process, from notifying participants to updating their Google calendars with the Hangouts Meet link. The impact of this project has been substantial, evident in the time it has saved and the fact that it was utilised by over 30 individuals.


In the course of this project, I gained valuable experience working with APIs and successfully bridging the gap between backend systems and cloud databases. This newfound expertise not only contributed to the project's success but also expanded my skill set as a developer.

Languages and technologies used
  • Flask
  • Firebase
  • Google App Script
  • Heroku
  • Slack APIs
  • JavaScript

Flight Delay Prediction using Machine Learning

Machine Learning

During my orientation phase at the Solarillion Foundation, I had the opportunity to contribute to a project focused on machine learning methods for flight delay classification and arrival delay prediction. This project tackled the critical task of determining whether a flight would be delayed and, if so, predicting the extent of the delay in minutes. To achieve these objectives, we leveraged historical weather records and flight information, harnessing the power of machine learning.


Throughout my involvement in this project, I delved into the intricacies of handling large datasets, mastering the art of data transformation and modelling, and conducting in-depth result analysis. This hands-on experience not only broadened my skill set but also provided valuable insights into the complexities of real-world data analysis.


My time spent on this project was an enriching learning experience, offering me a glimpse into the world of data-driven decision-making and the immense potential of machine learning in addressing practical challenges.

Languages and technologies used
  • Python
  • Numpy
  • Pandas
  • Matplotlib
  • Scikit-learn
  • Imblearn

Detecting Hate Speech in Tweets using Machine Learning

Machine Learning

In this project, we dedicated our efforts to combating the issue of online harassment and the dissemination of fake news by Russian hackers on platforms like Twitter. Our primary objective was to develop an effective solution to identify and address hateful speech trolls engaging in harmful online activities.


To achieve this, we adopted Convolutional Neural Networks (CNNs), a potent tool known for its ability to capture spatial information effectively and extract valuable textual features. Our approach involved training a straightforward CNN model to detect various forms of hate speech based on word patterns. We curated our training data from Kaggle, ensuring a diverse and representative dataset.


Subsequently, we applied our trained CNN model to analyze tweets associated with the bots operated by the Internet Research Agency (IRA) of Russia. This endeavour aimed to characterize the behavior of these bots and ascertain whether their tweets exhibited patterns of hate speech. The ultimate goal was to shed light on the potential correlation between hate speech and the activities of these bots, offering valuable insights into their online behavior. This project was part of the Social Networks course at SSNCE.
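The word-pattern idea behind the text CNN can be sketched with a single 1-D convolution layer in NumPy; the embeddings and filter weights here are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each filter scores a 3-word window of the tweet, the way a text CNN
# picks up short phrases; max-pooling keeps the strongest match per filter.
embed_dim, n_filters, width = 8, 4, 3
tokens = rng.normal(size=(10, embed_dim))           # a 10-word "tweet"
filters = rng.normal(size=(n_filters, width, embed_dim))

def conv1d_features(x, f):
    windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
    scores = np.einsum('nwe,fwe->nf', windows, f)   # response per window/filter
    return scores.max(axis=0)                       # max-pool over positions

features = conv1d_features(tokens, filters)
print(features.shape)  # (4,) — one pooled score per filter
```

The pooled feature vector would then feed a small classifier head that outputs the hate-speech probability.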

Languages and technologies used
  • Python
  • Numpy
  • Pandas
  • Matplotlib
  • Scikit-learn
  • Imblearn

TwiCatch

Graph Machine Learning

In this project, I worked on a graph-based machine learning solution geared towards the detection of mature content on online streaming platforms. The objective was to create a robust system capable of identifying and mitigating the presence of inappropriate content, fostering a more wholesome online environment.


To achieve this, I employed a multifaceted approach, including the extraction and preprocessing of text messages from user reactions and comments across various languages, including English, Spanish, French, and Russian.


The core of our solution revolved around constructing graph networks for the Twitch platform, mapping the connections and relationships between followers of different streams. These networks were represented as adjacency matrices, providing a structured foundation for our analysis.


The work encompassed the utilization of various graph neural networks, including Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), FastRGCNConv, and Gated Graph Convolutional Networks (GGCNs). These advanced neural network architectures were employed at the node level to scrutinise streamers promoting mature content. The aim was to identify and flag such content creators efficiently.
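A single graph-convolution step over such an adjacency matrix looks like this in NumPy; the follower graph, node features, and layer weights are toy stand-ins for the trained networks:

```python
import numpy as np

rng = np.random.default_rng(2)

# One graph-convolution step on a toy follower graph: node features are
# averaged over neighbours via the normalized adjacency matrix, then
# linearly transformed. Weights stand in for the trained GCN layer.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # follower links, 4 streamers
A_hat = A + np.eye(4)                      # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # degree normalization
X = rng.normal(size=(4, 5))                # per-streamer features
W = rng.normal(size=(5, 2))                # layer weights (2 output classes)

H = np.maximum(0, D_inv @ A_hat @ X @ W)   # ReLU(D^-1 · Â · X · W)
print(H.shape)                             # (4, 2): a score pair per node
```

Stacking such layers lets each streamer's representation absorb information from followers several hops away before node-level classification flags mature content.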

Languages and technologies used
  • PyTorch
  • Python
  • Numpy
  • Pandas

Detecting Brain Abnormalities using Self-Supervised Contrastive Learning

Deep Learning Research (Self-Supervised Learning - Medical Domain)

In my undergraduate thesis project, I led a team to develop a self-supervised contrastive learning-based solution for the detection and classification of brain abnormalities, with a primary focus on identifying various types of haemorrhages within the brain.


Our project entailed the exploration of multiple datasets, including a meticulously curated dataset containing over 40 scans sourced from scan centres located across Chennai. We owe special thanks to Dr. U.S. Srinivasan, Senior Consultant Neurosurgeon at Sri Balaji Hospital, Guindy, Chennai, for providing invaluable data and annotations for this collection.


Our approach revolved around leveraging robust Convolutional Neural Network (CNN) architectures to extract essential image features from unlabeled images as part of a pretext task. This initial phase aimed to impart knowledge to our system, which was subsequently applied to the core task of haemorrhage detection and classification.
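The pretext objective can be illustrated with a minimal NT-Xent-style contrastive loss over two augmented views of each (toy) scan embedding; this is a sketch of the idea, not our exact training code:

```python
import numpy as np

rng = np.random.default_rng(3)

def ntxent_pair_loss(z1, z2, temperature=0.5):
    """Contrastive (NT-Xent style) loss over a batch of positive pairs.

    z1, z2: L2-normalized embeddings of two augmented views of each scan.
    """
    z = np.vstack([z1, z2])
    sim = z @ z.T / temperature        # cosine similarities (unit-norm rows)
    n = len(z1)
    losses = []
    for i in range(n):
        pos = sim[i, i + n]            # similarity with the other view
        mask = np.ones(2 * n, dtype=bool)
        mask[i] = False                # exclude self-similarity
        losses.append(-pos + np.log(np.exp(sim[i, mask]).sum()))
    return float(np.mean(losses))

z1 = rng.normal(size=(4, 8))
z1 /= np.linalg.norm(z1, axis=1, keepdims=True)   # normalize embeddings
z2 = z1 + 0.01 * rng.normal(size=z1.shape)        # slightly perturbed views
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
loss = ntxent_pair_loss(z1, z2)
print(loss)
```

Minimising this loss pulls the two views of the same scan together in embedding space while pushing apart views of different scans, which is what makes the learned features transferable to the downstream haemorrhage classifier.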


What sets our work apart is its efficiency in harnessing valuable features during the pretext task, ultimately leading to superior performance in comparison to models based on supervised learning, all while using the same amount of testing data. This achievement underscored the effectiveness of our proposed architecture. It is currently in the process of being submitted for publication at a prestigious conference and has already been submitted as an abstract at NSICON 2022, the 70th annual conference of the Neurological Society of India, highlighting its potential to make a substantial impact in the field of medical imaging and diagnostics.

Code link

Understanding the interplay between rating and category for restaurants in Google Local Reviews

Recommender Systems

In today's fast-paced digital landscape, the demand for rapid and accurate query responses from users has never been more critical. The internet, which powers much of our modern world, impacts people's daily lives in various ways, from online searches to personalised product recommendations on e-commerce platforms. As a result, online review forums and blogs play a pivotal role in influencing user decisions.


In recent times, consumers often rely on Google reviews to assess businesses, making it essential to provide personalised recommendations and determine business ratings and quality based on user engagement. In this project, we embark on an experiment to explore how information gathered from various points of interest, including restaurant details, GPS data, and user sentiment in reviews, can contribute to the prediction of business ratings and offer tailored cuisine suggestions.


Our experiments are conducted using the Google local reviews dataset, and our approach is designed to minimise prediction inaccuracies, achieving less than 11% error. This outcome underscores the effectiveness of our feature extraction strategy and the potential for leveraging diverse data sources to enhance the quality of predictions and recommendations in the context of user reviews and business ratings.

Languages and technologies used
  • Python
  • Numpy
  • Pandas
  • Matplotlib
  • Scikit-learn

Generating scene graphs using Transformers with Knowledge infusion and Residual Connections

Deep Learning Research (Scene-Graphs)

Scene graph generation is a dynamic field of research that revolves around representing visual scenes using nodes and edges. When presented with an image, the primary objective is to identify the entities or objects within the image and establish relationships between them. In this context, the nodes in a scene graph represent potential objects, while the edges signify the connections or relationships between these nodes. For example, when analysing an image containing both a car and a person, the model must discern if there is an action or relationship connecting the car and the person.


This project builds upon prior research conducted in the realm of Relation Transformers (RelTR). Our innovative approach involves introducing residual connections between modules and infusing pre-existing knowledge about objects into the system. The underlying hypothesis is that these enhancements will empower the model to identify objects and their relationships with greater speed and accuracy.


To evaluate our customised architecture's performance, we conducted a comprehensive comparison against the baseline RelTR model. Our assessments were carried out on the Visual Genome dataset, utilising both 5,000 and 7,500 training samples. The results of this comparative analysis provide a nuanced understanding of the strengths and weaknesses of our proposed model, shedding light on its potential benefits and limitations.


This project represents an important step forward in the pursuit of more efficient and accurate scene graph generation, with practical implications in various fields, including computer vision and image analysis.

Languages and technologies used
  • Python
  • PyTorch

Analysis of Explainability Techniques on BERT for the Medical Domain

Model Interpretability and Explainability

In this project, our objective was to explore a range of interpretability approaches applicable to deep learning models employed for medical classification tasks. We recognised the potential significance of such endeavours in bolstering AI for the healthcare industry and instilling trust in the deployment of these models in high-stakes domains.


Our primary focus was on the post-hoc analysis of these models, and we extend our gratitude to the survey paper titled "Post-hoc Interpretability for Neural NLP: A Survey" for providing valuable insights into this complex problem domain.


Interpretability approaches can be broadly categorised into two main branches: intrinsic and extrinsic. Intrinsic methods involve the generation of explanations by the model's own architecture, while extrinsic or post-hoc methods analyse the model's outputs to generate explanations. Our work exclusively delved into post-hoc approaches within the context of medical natural language processing (NLP). These approaches are often preferred due to their model-agnostic nature, treating the underlying model as a black box and relying on its outputs to generate explanations. It's worth noting that post-hoc methods, while practical, can sometimes be critiqued for potentially offering misleading explanations for models that are inherently complex and difficult to explain.


Throughout this study, we leveraged a fine-tuned pre-trained BERT-based model and implemented a diverse array of post-hoc methods. Our aim was to assess the strengths and weaknesses of each approach, shedding light on their applicability and reliability within the context of medical NLP. This project served as our submission for the final assessment in the course CSE 256: Statistical NLP at the University of California, San Diego.
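Among the simplest post-hoc methods is occlusion: delete one token at a time and measure how the prediction changes. The toy bag-of-words scorer below stands in for the fine-tuned BERT classifier; the vocabulary and weights are invented for illustration:

```python
# Occlusion-style post-hoc attribution: remove each token in turn and
# record how much the model's score drops. The "model" is a toy
# bag-of-words scorer standing in for the fine-tuned BERT classifier.
WEIGHTS = {"fever": 0.9, "cough": 0.7, "the": 0.0, "patient": 0.1, "has": 0.0}

def model_score(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

def occlusion_attributions(tokens):
    base = model_score(tokens)
    return {t: base - model_score(tokens[:i] + tokens[i + 1:])
            for i, t in enumerate(tokens)}

attr = occlusion_attributions(["the", "patient", "has", "fever"])
print(max(attr, key=attr.get))  # fever — the most influential token
```

Like SHAP and LIME, this treats the model as a black box and queries only its outputs, which is exactly why post-hoc explanations can mislead when the underlying model is highly non-linear.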

Languages and technologies used
  • Python
  • Numpy
  • Pandas
  • SHAP
  • LIME
  • Integrated Gradients
  • Bertology

Digital Behavioral Activity Monitor System

Full Stack Machine Learning

The Digital Footprint Monitor is a personalized web application meticulously designed to cater to the user's digital activity, with a primary focus on monitoring their Chrome browser usage. It operates by continuously tracking and securely storing data related to Chrome downloads, bookmarks, and browsing history on a remote database. This data serves as the foundation for in-depth behavioural analysis, allowing users to gain insights into their online activities and the content they engage with.


In addition to monitoring Chrome activity, the application also offers a feature for fetching and storing Reddit upvoted data in the cloud. This functionality enhances the user's ability to manage and review their Reddit interactions conveniently.


What sets this project apart is its integration with OpenAI's GPT model, which enables the retrieval of relevant Reddit posts based on user queries using vector databases. This intelligent feature empowers users to access valuable information tailored to their interests and inquiries.
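Under the hood, that retrieval step reduces to nearest-neighbour search over embeddings. A minimal cosine-similarity sketch follows, with toy vectors standing in for real embeddings and for the Atlas Vector Search index:

```python
import numpy as np

rng = np.random.default_rng(4)

# Five stored "post embeddings" (orthonormal toy vectors) and a query
# embedding close to post #2; real embeddings would come from a language
# model, and the index would live in the vector database.
doc_vecs = np.eye(16)[:5]
query = doc_vecs[2] + 0.05 * rng.normal(size=16)

def top_k(q, docs, k=1):
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q)
    sims = docs_n @ q_n                  # cosine similarity to each document
    return np.argsort(sims)[::-1][:k]    # indices of the k nearest documents

print(top_k(query, doc_vecs))  # [2] — the stored post nearest the query
```

The retrieved posts are then handed to the GPT model as context, so answers stay grounded in the user's own saved content.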


Ultimately, the overarching goal of this project is to empower users to navigate the digital landscape, particularly within the Chrome browser, with a heightened sense of awareness. By monitoring their activities and providing insights, it encourages users to make informed decisions about their online behaviour, fostering a more responsible and mindful digital presence.

Languages and technologies used
  • Django
  • Python
  • Pandas
  • MongoDB
  • Atlas Vector Search
  • Chrome, Reddit APIs
  • Vega-Altair
  • GPT

We tried to color images, and you will not believe what happened next!

Deep Learning Research (Self-Supervised Learning - Computer Vision)

Image restoration encompasses a category of problems focused on rectifying abnormalities and enhancing the quality of images. Among these challenges, image denoising and colorization are particularly noteworthy. The literature on this subject reveals an abundance of research, employing both Computer Vision and Deep Learning techniques.


Deep Learning algorithms tailored for these tasks typically adopt an encoder-decoder architecture. The encoding phase reduces the input image to a lower-dimensional representation, while the decoder phase amplifies this representation to reconstruct the output image. Therefore, the quality and robustness of the latent representations are crucial, as they must contain sufficient information to faithfully reconstruct the output.


In this project, we introduce innovative Self-Supervised Contrastive Learning approaches to acquire robust representations of the dataset. These representations serve as the foundation for training a decoder module responsible for executing image restoration tasks, such as denoising and colorization. Furthermore, we conduct a comprehensive performance evaluation, comparing the effectiveness of our two-stage training module against traditional end-to-end training, using ResNet50, EfficientNet, and InceptionNet as backbone networks. This work was part of the final project for CSE 251B - Neural Networks & Pattern Recognition.

Languages and technologies used
  • Python
  • Pandas
  • PyTorch