Clustering Graph Theory Algorithms and Applications to Computational Vision and Big Data

Mentor: Dr. Sos Agaian (Computer Science Department - CSI)
Description: Many real-life problems deal with collections of high-dimensional data, such as images consisting of billions of pixels, text, and videos composed of millions of frames. To process such data, the computational time and memory requirements of the algorithms must be considered. The primary objective of segmentation is to divide an image into distinct segments that share features or attributes. One potential approach is to represent an image or dataset as a smaller edge-weighted graph, in which vertices denote individual pixels, edges encode neighboring relationships, and edge weights convey the similarity between pixel characteristics. In this project, students will implement similarity-based clustering algorithms using graph-theoretical techniques such as random walks and spectral graph theory. Big-graph clustering techniques will also be covered, and students will implement the corresponding algorithms. Clustering and segmentation of images or videos are fundamental to computer vision, big data, and pattern recognition.
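The spectral side of this pipeline can be sketched in a few lines of NumPy: build a Gaussian similarity graph over pixel intensities, form the graph Laplacian, and split on the sign of the Fiedler vector. This is a minimal two-cluster illustration on a toy 1-D "image", not a production segmenter; the function names and the sigma value are illustrative choices.

```python
import numpy as np

def similarity_graph(pixels, sigma=0.1):
    # Gaussian similarity: w_ij = exp(-(f_i - f_j)^2 / (2*sigma^2))
    d2 = (pixels[:, None] - pixels[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def spectral_bipartition(W):
    # Unnormalized Laplacian L = D - W; the eigenvector of the
    # second-smallest eigenvalue (the Fiedler vector) splits the graph.
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    return vecs[:, 1] >= 0         # boolean cluster labels

# Toy 1-D "image": two intensity populations
pixels = np.array([0.10, 0.15, 0.12, 0.90, 0.95, 0.88])
labels = spectral_bipartition(similarity_graph(pixels))
```

On a real image the similarity matrix would be sparse (neighbors only) and the clustering would use several eigenvectors with k-means, but the Laplacian construction is the same.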
Research objectives: Students will learn both computer vision and graph-theoretical clustering techniques. They will evaluate the learned methodologies to solve different real-life clustering problems, with emphasis put on cancer imaging. Moreover, time permitting, parallel and distributed approaches for big data analysis will be covered (e.g., parallel probabilistic latent semantic analysis and parallel spectral clustering). Furthermore, algorithms designed to address large-scale graph clustering problems will be implemented in a parallel environment based on MPI libraries.
What the students will learn: Students will acquire an understanding of graph-theoretical concepts (e.g., adjacency and Laplacian matrix representations of images, Markov algorithms, etc.) and review pertinent graph-theoretical algorithms and algebraic graph analysis (e.g., spectral graph theory, similarity matrices and nearest-neighbor graphs, random walks).
Student’s required background: The applicant must show proficiency in C++, Java, and/or other high-level languages. A successfully completed formal course in Data Structures/Algorithms and another course in Linear Algebra are preferable.


Mixture of Experts for Deep Neural Networks of Molecular Structures

Mentor: Dr. Lei Xie – Co-PI (Computer Science Department – Hunter College)
Description: Machine learning, especially deep learning, has made a profound impact on chemistry [CoPa19]. Deep learning for chemistry has a broad range of applications in drug discovery, material science, environmental toxicity, and many others. Specifically, this project will develop machine learning models to predict the molecular properties of small-molecule organic compounds from their structures. A chemical structure can be represented as a SMILES string (1D), as a graph where each atom is a node and each bond is an edge (2D), or as a point (atom) cloud in 3D space. A plethora of deep learning models have been proposed based on these representations. However, each model introduces different inductive biases, so an ensemble model that combines the collective wisdom of all models may provide the best practical value. In this project, students will implement multiple deep neural network architectures, including the transformer, the convolutional neural network (CNN), and the graph neural network (GNN), for the representation of 1D, 2D, and 3D chemical structures. Moreover, students will implement a parallelized Mixture of Experts for Deep Neural Networks (MeDNN) method that trains an ensemble of multiple diverse neural networks in a parallel computing environment.
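The ensemble idea can be sketched in a framework-agnostic way: a gating network assigns each molecule a convex combination of per-expert predictions. The NumPy stand-in below is only a sketch of the principle, not the PyTorch MeDNN the project will build, and the expert and gate values are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_experts(expert_preds, gate_logits):
    """Combine per-expert property predictions with gate weights.

    expert_preds: (n_samples, n_experts) predictions from, e.g., a
                  transformer on SMILES (1D), a GNN on the molecular
                  graph (2D), and a point-cloud network (3D).
    gate_logits:  (n_samples, n_experts) unnormalized gating scores,
                  which in a trained model come from a gating network.
    """
    gates = softmax(gate_logits)          # convex combination weights
    return (gates * expert_preds).sum(axis=1)

preds = np.array([[1.0, 3.0, 2.0]])       # three experts, one molecule
logits = np.array([[0.0, 0.0, 0.0]])      # uniform gate -> plain average
avg = mixture_of_experts(preds, logits)   # -> array([2.])
```

In the parallel setting each expert trains on its own GPU, and only the low-dimensional predictions and gate scores need to be gathered for the combination step.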
Research objectives: Students will get familiar with multiple state-of-the-art deep learning architectures (CNN, transformer, GNN, etc.). They will be able to evaluate the learning methodologies to solve real-life deep learning problems, with emphasis put on chemoinformatics. They will be comfortable running and implementing machine learning models in a parallel computing environment (GPUs, CUDA).
What the students will learn: Students will learn concepts of deep neural networks (e.g., activation functions, loss functions, optimization, etc.), review state-of-the-art neural network architectures (e.g., transformer, CNN, GNN, etc.), and work with deep learning platforms (e.g., PyTorch). Moreover, students will learn how to train deep learning models in parallel.
Student’s required background: The applicant must show proficiency in C++, Java, and/or other high-level languages. A successfully completed formal course in Data Structures/Algorithms is preferable, as is high school AP-level biology or chemistry knowledge (genomics).

DNAGPT - A Scalable DNA Language Model

Mentor: Dr. Lei Xie – Co-PI (Computer Science Department – Hunter College)
Description: Advances in next-generation sequencing and genome-wide association studies (GWAS) have generated vast amounts of DNA sequence data. However, it is challenging to identify the causal variants underlying observed genotype-phenotype associations. This project will take advantage of recent remarkable progress in natural language processing (NLP) to model the DNA sequence. The project consists of three major steps. First, students will implement a distributed data processing pipeline using Spark to process terabytes of DNA sequencing and GWAS data. Second, students will test several existing language models (e.g., BigBird, Perceiver IO, etc.) to perform unsupervised DNA sequence pre-training. Finally, students will apply the pre-trained DNA sequence embeddings to supervised machine learning tasks.
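Before any pre-training, raw sequences must be tokenized; a common choice for DNA language models is overlapping k-mer tokenization, sketched below in plain Python. The function name and parameters are illustrative, not part of a specific library, and in the actual pipeline this step would run inside a Spark job over many sequence files.

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mer tokens, a common
    tokenization scheme for DNA language-model pre-training."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ACGTACGT", k=4, stride=2)
# -> ['ACGT', 'GTAC', 'ACGT']
```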
Research objectives: Students will get familiar with Spark for big data processing and obtain basic knowledge of genomics. They will be able to apply state-of-the-art NLP techniques to solve real-world problems, with emphasis put on bioinformatics. They will be comfortable running and implementing machine learning models in a distributed computing environment.
What the students will learn: Students will learn the basics of NLP and genomics, review state-of-the-art NLP algorithms (e.g., ChatGPT, etc.), and work with deep learning platforms (e.g., PyTorch). Moreover, students will learn how to process big data in a distributed computing environment using Spark.
Student’s required background: The applicant must show proficiency in C++, Java, and/or other high-level languages. A successfully completed formal course in Data Structures/Algorithms is preferable, as is high school AP-level biology or chemistry knowledge (genomics).

Graph Theory (Computational Biology) - RNA Classification and Partitioning

Mentor: Dr. Louis Petingi - PI (Computer Science Department - CSI)
Description: In this project, our objective is to build upon the existing research by further investigating the functionality of RNA (Ribonucleic acid) secondary structures in the form of undirected dual graphs. RNA secondary structures consist of a foundational sequence of successive pairs of bases (i.e., adenine, guanine, cytosine, and uracil). These primary base pairs are connected by secondary base pairs, forming stems that link non-consecutive bases. Pseudoknots are complex RNA structures that hold significant biological implications, as they involve the intertwining of two or more non-consecutive base pairs.
Previous endeavors have yielded the implementation of a partitioning algorithm [PS1, P2], a vital tool that divides RNA structures into distinct fragments or subgraphs. The primary purpose of this algorithm is to discern between regions with pseudoknots and those that are pseudoknot-free. Through the application of this technique, a systematic analysis and classification of RNAs can be accomplished, contributing to a deeper understanding of their properties.
The mentor is currently collaborating with researchers from both the Department of Chemistry and the Courant Institute at New York University. This research team is at the forefront of advancing the analysis of RNA structure and function. The project represents a seamless extension of an existing research trajectory that originated with the publication [KPS] and continued with subsequent papers such as [PS1] and [P2]. The employed technique led to recent publications in journals such as "Genes" [SCPS] and "Methods" [PS19]. Many RNA viruses use pseudoknots to control viral RNA translation, replication, and the switch between the two processes. One technique used to destroy viruses is to inhibit the pseudoknotted region of an RNA. For both COVID-19 and SARS, the precise structures of the pseudoknots within their RNAs have not yet been empirically observed; instead, these structures have been deduced through predictive algorithms.
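One graph-theoretic building block behind this kind of partitioning is finding cut vertices, whose removal disconnects a graph into fragments. The sketch below is a standard Tarjan-style depth-first search, shown purely as an illustration of the machinery; it is not the published partitioning algorithm of [PS1, P2], and it assumes a simple graph, whereas RNA dual graphs can contain parallel edges.

```python
def articulation_points(adj):
    """Find the cut vertices of an undirected simple graph via DFS.

    adj: dict mapping each vertex to a list of neighbors.
    """
    disc, low, cuts = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                     # back edge
                low[u] = min(low[u], disc[v])
            else:                             # tree edge
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)
        if parent is None and children > 1:   # root with 2+ subtrees
            cuts.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return cuts

# Two triangles sharing vertex 2: vertex 2 is the only cut vertex
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
cuts = articulation_points(g)  # -> {2}
```

Splitting at such vertices yields the biconnected fragments on which pseudoknotted and pseudoknot-free regions can be analyzed separately.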
Research objectives: During the 2018 summer extension of the REU Site, a participant and the mentor successfully expanded the algorithm's capabilities to differentiate between various types of pseudoknots, specifically recursive and non-recursive pseudoknots, a result that culminated in a publication. We have recently introduced a novel graph-theoretic representation of RNAs based upon directed dual graphs [Pet4], which yields a more accurate and systematic analysis of RNA structure and prediction. Based on these topologies, and using OpenSHMEM (Open Shared Memory) libraries, students will analyze several families of viral mRNAs (messenger RNAs); analyzing their distinctions and similarities can provide a clearer comprehension of viral replication.
What the students will learn: Students will learn basic graph-theoretic algorithms (e.g., bi-connectivity, etc.), and concepts (e.g., connectivity, and cut-sets), alongside fundamental biological concepts, and parallel processing techniques applied on distributed architectures.
Student’s required background: The applicant must show proficiency in C++, Java, and/or other high-level languages. A successfully completed formal course in Data Structures/Algorithms is preferable, as is high school AP-level biology or chemistry knowledge (genomics).


Real-time Robust Two-dimensional Phase Unwrapping with Deep Learning and GPU Acceleration

Mentor: Dr. Shuqun Zhang (Computer Science Department - CSI)
Description: Phase unwrapping is a critical procedure required by many imaging techniques based on the interferometric principle, such as magnetic resonance imaging (MRI), synthetic aperture radar (SAR), 3D imaging, and optical and microwave interferometry. The task involves reconstructing the absolute phase from the measured "wrapped" phase. It is a time-intensive procedure that entails numerous calculations. While numerous phase unwrapping algorithms have been proposed over the years, generating accurate real-time true phase information remains a challenging frontier in research. In the presence of strong noise, aliasing, and inconsistencies, traditional spatial and temporal phase unwrapping methods are limited in performance and computationally intensive, leading to high execution times even when GPU- and FPGA-based hardware is used to accelerate the process. Inspired by the great successes of deep learning in computer vision and image processing, deep learning methods have recently been proposed as an alternative approach to phase unwrapping that enhances speed and accuracy. The potential of the deep learning approach lies in the fact that a network can be trained to learn the phase unwrapping process, enabling inference of the true phase that is often rapid, especially with hardware acceleration. This project will examine the problems in existing neural networks for phase unwrapping and develop a new end-to-end network that integrates a noise reduction module under a real-time constraint. The phase unwrapping operation will be cast as an image segmentation problem, and a new deep learning segmentation network will be developed for real-time phase unwrapping. The developed network architecture will be computationally efficient, with fewer parameters to tune. One of the critical challenges in deep learning is the availability of a large set of labeled training data.
This issue will be circumvented by creating a program in MATLAB to simulate various cases in phase unwrapping and generate the corresponding labeled data. Performance evaluation and benchmarking results will be studied. All the network training and inference will be performed on an HPC machine.
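The wrapping operation itself is easy to simulate, which is what makes synthetic labeled data generation feasible. The project plans to do this in MATLAB; the NumPy sketch below shows the same idea for illustration: wrap a known true phase, then recover the integer "wrap count" label that a segmentation-style unwrapping network would be trained to predict.

```python
import numpy as np

def wrap(phi):
    """Wrap absolute phase into (-pi, pi], mimicking what an
    interferometric sensor actually measures."""
    return np.angle(np.exp(1j * phi))

# Synthetic ramp: the true phase exceeds 2*pi, so wrapping causes jumps
x = np.linspace(0, 4 * np.pi, 100)   # ground-truth absolute phase
wrapped = wrap(x)

# Labeled training pair: the wrap count k with x = wrapped + 2*pi*k,
# i.e., the per-pixel class a segmentation network learns to predict.
k = np.round((x - wrapped) / (2 * np.pi)).astype(int)
assert np.allclose(wrapped + 2 * np.pi * k, x)
```

Varying the synthetic phase surfaces and adding noise before wrapping yields arbitrarily many (wrapped, label) pairs for training.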
From 2015 to 2017, REU student projects covering techniques relevant to this new project, such as GPU acceleration, image segmentation using GPUs, and convolutional neural networks, were successfully completed by the participants. These projects also led to publications and presentations.
Research objectives: This project aims to explore the use of a machine learning approach to phase unwrapping to address poor robustness against noise/aliasing and low speed in phase retrieval. The goal is to develop a deep learning network for phase unwrapping that is robust against noise and phase residues and achieves real-time phase reconstruction.
What the students will learn: Students will learn basic image processing operations, phase unwrapping algorithms, convolutional neural networks and software frameworks, and parameter tuning skills. Students will write/modify code to generate and process image training data and build and test deep neural networks for phase unwrapping based on the PyTorch framework on an HPC machine. They will gain an understanding of the benefits of parallel processing and GPU acceleration in the training and inference of neural networks.
Student’s required background: The applicant must show proficiency in C++, Java, and/or other high-level languages. A successfully completed formal course in Data Structures/Algorithms is preferable.

Multidimensional-Flexibility-Based Electric Vehicle Charging and Discharging Scheduling

Mentor: Dr. Yu-Wen Chen (Computer Systems Technology Department - NYCCT)
Description: The US power grid is undergoing nationwide modernization to become more environmentally friendly and to achieve a higher level of reliability and energy efficiency through recent advances in smart grid technologies, such as distributed energy resources and electric vehicles (EVs). However, without proper management and scheduling, the fluctuating electricity requests and production from EVs and distributed energy resources will significantly damage the grid's stability and reliability. Therefore, to manage the fluctuating high penetration of EVs and distributed energy resources, many EV charging scheduling and demand response programs propose optimal operation scheduling based on heterogeneous data from various components. Most existing demand response and EV charging scheduling programs mainly utilize flexibility in the time dimension, shifting deferrable electricity demand to a time frame with less overall demand and lower electricity pricing. However, most programs still face many challenges due to limited flexibility and limited customer participation in the management and scheduling schemes.
Moreover, the exponentially growing number of EVs will significantly impact the grid with its mobility and vehicle-to-grid (V2G) capability. Each EV is a movable resource with three different roles: consumer (requesting electricity from the grid), producer (exporting power back to the grid with V2G), and storage unit. It is essential to have the management and scheduling schemes to efficiently coordinate and facilitate all the EVs with the consideration of their mobility and V2G capability, which could potentially elicit flexibility in other dimensions. This research will investigate the potential flexibility in other dimensions from the EVs’ mobility and V2G capability. We will design an EV charging and discharging scheduling that enables multidimensional flexibility to elicit EV owners' participation and enhance the grid's stability and reliability.
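The time-dimension flexibility described above can be sketched as a toy greedy scheduler that fills the cheapest hours of an EV's availability window first. The prices, rates, and function names are hypothetical, and the project's multidimensional scheme (mobility, V2G discharging, storage) would go well beyond this single-vehicle, single-dimension sketch.

```python
def schedule_charging(prices, energy_needed, max_rate, arrival, deadline):
    """Cost-minimizing time-shifted charging for one EV: allocate the
    deferrable demand (kWh) to the cheapest hours in [arrival, deadline),
    subject to a per-hour charging rate limit (kW)."""
    window = list(range(arrival, deadline))
    plan = {h: 0.0 for h in window}
    for h in sorted(window, key=lambda h: prices[h]):  # cheapest first
        charge = min(max_rate, energy_needed)
        plan[h] = charge
        energy_needed -= charge
        if energy_needed <= 0:
            break
    return plan

prices = [0.30, 0.10, 0.05, 0.20, 0.25]   # $/kWh for hours 0..4
plan = schedule_charging(prices, energy_needed=12.0, max_rate=7.0,
                         arrival=0, deadline=5)
# hour 2 (cheapest) takes 7 kWh, hour 1 takes the remaining 5 kWh
```

A centralized scheduler would solve a joint optimization over many such vehicles, and a distributed (e.g., MPI-based) version would partition the fleet across processes.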
Research objectives: 1) We will first investigate and identify the factors of multidimensional flexibility that can be used in the scheduling algorithm. 2) Students will then design scheduling schemes in a centralized approach and explore possible distributed approaches to handle large-scale data efficiently. Finally, 3) students will implement and simulate the proposed algorithm using the MPI parallel programming library on the CUNY HPCC high-performance computing machines to validate and compare the results.
What the students will learn: This research provides students with opportunities to focus on the smart grid and the needs for EV charging and discharging operations. Students will explore the existing demand response and EV charging scheduling programs. Students will investigate the factors that can enable the multidimensional flexibility for the EV charging and discharging scheduling and learn how to model the problem in both centralized and distributed approaches. With the CUNY HPCC, students will have the opportunity to implement and test the proposed scheduling algorithm and learn how to validate and compare the performance.


Steganography-as-a-Service: Privacy Technologies from Practical Steganography

Mentor: Tushar Jois, Assistant Professor, Electrical Engineering, City College of New York
Background. The deployment of encrypted communication technologies, such as TLS (for protecting web traffic) and Signal (for protecting messages), has been a boon to the privacy of everyday users on the Internet. Encrypted communications, however, are easily identifiable, so authoritarian regimes that wish to limit free communication can simply block any encrypted data they cannot decrypt. This scenario is not hypothetical; there is evidence, for example, that nation-states are starting to block TLS 1.3 connections due to their strong cryptographic protections.
To overcome extreme censorship, there is a need for steganography: the hiding of sensitive messages in mundane ones. With steganography, an innocuous message (like a cake recipe) could contain a secret one (like protest information). The censor, seeing only the innocuous message, would allow this communication, inadvertently allowing free communication past its filters.
Dr. Jois has previously developed Meteor, the first practical, provably secure steganography scheme. Meteor hides messages into the output of generative text models, such as (Chat)GPT. A censor cannot distinguish between the regular text output of these models, and text that contains a Meteor-encoded secret message. Building on this, Dr. Jois is in the process of developing the first such scheme for image synthesis, Pulsar, which would similarly allow hiding of messages in generated images.
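The core idea can be sketched as a toy scheme that hides one bit per generation step by selecting between two near-equiprobable continuations of a text model. This is only an illustration of the principle: Meteor's actual construction additionally encrypts the message and embeds it via coding over the full model distribution, and the token pairs below are made-up stand-ins for model output.

```python
def hide_bits(bits, top_tokens_per_step):
    """Toy model-based steganography: at each generation step, choose
    the next token according to the secret bit instead of sampling.
    If both tokens are (near-)equiprobable under the model, the output
    is distributed like ordinary generated text."""
    return [pair[bit] for bit, pair in zip(bits, top_tokens_per_step)]

def recover_bits(tokens, top_tokens_per_step):
    """The receiver, running the same model, re-derives each bit."""
    return [1 if tok == pair[1] else 0
            for tok, pair in zip(tokens, top_tokens_per_step)]

# Two near-equiprobable continuations per step (model-output stand-ins)
steps = [("cake", "pie"), ("is", "was"), ("great", "fine")]
msg = [1, 0, 1]
cover = hide_bits(msg, steps)            # -> ['pie', 'is', 'fine']
assert recover_bits(cover, steps) == msg
```

The security argument in the real scheme is that the covertext distribution is computationally indistinguishable from honest model output, so a censor gains nothing by inspecting it.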
Research objectives. Meteor and Pulsar have been shown to be secure and practical systems. To deploy them widely, we must take these techniques and build them into a part of services that end users can employ in their daily life. In this project, we will investigate how practical steganography can be used to provide censorship-resistant online services, such as messaging, social media, web browsing, and/or file storage. This will require novel architectures to properly integrate steganographic technologies into applications, as well as efficient implementation to evaluate real-world performance. This research will help ensure that we maintain a free and open Internet in the face of pervasive attempts at censorship.
Learning objectives. Students will learn the theory behind cryptography and steganography and apply this knowledge to develop steganographic implementations of privacy-enhancing technologies. Students will also get exposure to programming using machine learning frameworks, like PyTorch and CoreML. The result of this project is a research publication; Dr. Jois’s previous work with undergraduate students and machine learning was published in a major security conference.
Parallel resources required. We will use CUNY HPCC GPU resources to accelerate the inference of the machine learning models that underlie practical steganography.

Flood Modeling and Simulation with Flood Sensor Data and Machine Learning for New York City

Mentor: Dr. Zhanyang Zhang (Computer Science Department - CSI)
Description: New York City is increasingly vulnerable to floods from extreme weather and rising sea levels. Many coastal neighborhoods suffer monthly high-tide flooding that destroys infrastructure and property value. In 2021, hurricanes Henri and Ida struck NYC within weeks of each other with unprecedented levels of rainfall that caused deaths and damage. As part of a citywide study, we focus on developing a local compound-flooding (rain and storm surge) model and simulation with machine learning using multimodal weather and flood sensors. As part of the NYC FloodNet initiative, we have deployed a handful of flood sensors and gateways on Staten Island. Our project integrates a simulation model and machine learning to take advantage of the huge amount of data from multimodal data sources to improve the model's accuracy.
To model floods, we need to consider many physical inputs and parameters, such as rainfall, meteorological conditions, storms, elevation, land use, ground infiltration, and sewers, to name a few. A model is like a system: it takes input data and computes output results based on certain mathematical and physical theories using the values assigned to the system parameters. Due to measurement errors, or in the case of difficult-to-measure parameters, we have to estimate parameter values, so the model often produces results that do not match reality accurately.
Research objectives: Our project objective is to improve flood model accuracy. It involves (1) developing and deploying networked flood sensors to collect flood data from many key locations over time, and (2) using machine learning algorithms to train the model with large quantities of flood data. Eventually, the model will adjust its parameters to improve outcomes. Our major data sources include NYC FloodNet, the NYC Open Data Project, and NYS MesoNet. CSI has a unique advantage in that it has a MesoNet weather station, a FloodNet gateway, and FloodNet sensors deployed on campus. Conducting simulation studies with numeric solutions over large volumes of sensor data is computationally intensive; CUNY's HPCC computing resources are critical to the project.
What the students will learn: Students will learn how to collect FloodNet data, from flood sensor nodes and gateways to Internet cloud storage and data representations. Flood data are inherently sparse and noisy. Students will learn how to process and clean the data with Kalman filters and other statistical methods. Finally, students will conduct experiments with machine learning algorithms to train the flood model for better performance. We will use an existing flood model, SWMM, developed by the US EPA and widely used in the field.
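A scalar Kalman filter of the kind mentioned above can be sketched in a few lines, using a random-walk state model for the water level. The noise parameters and sensor readings below are illustrative only, not calibrated to FloodNet data.

```python
def kalman_1d(measurements, q=1e-3, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter for denoising a flood-depth sensor stream.

    q: process noise variance (how fast the true level can drift)
    r: measurement noise variance (sensor error)
    x0, p0: initial state estimate and its variance
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                # predict: state follows a random walk
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update with the new reading
        p = (1 - k) * p
        estimates.append(x)
    return estimates

noisy = [0.0, 0.9, 1.1, 0.95, 1.05, 1.0]   # meters, noisy around ~1.0
smooth = kalman_1d(noisy)                   # drifts toward ~1.0
```

Smoothed series like this would then feed the SWMM-based simulation and the machine learning training loop.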
Student’s required background: The applicants must show proficiency in programming using C++, Python, and/or MATLAB. A successfully completed formal course in Data Structures/Algorithms and in Probability and Statistics is preferable.