The exploration and analysis of high-dimensional data is a central problem in data science, requiring the construction of low-dimensional, interpretable representations of data through dimensionality reduction (DR). DR is particularly useful, for example, in genomics for visualizing and interpreting gene interactions.
Many questions remain open in this field: what guarantees (statistical, approximation-theoretic) exist for such low-dimensional representations? Which algorithms can efficiently compute low-dimensional projections? How can data structure, e.g. graphs or sequences, be incorporated into these representations?
This one-day workshop will address the latest advances in the field, from both theoretical and practical perspectives.
The workshop will take place at ENS Lyon (Condorcet room, Monod site). ENS is accessible from the Debourg station, on metro line B and on tramway line T1.
Camille Marchet: Sketching in sequence bioinformatics: methods and applications
DNA and RNA sequences can be computationally processed as finite texts over a four-letter alphabet (A, C, G, T), one letter per nucleotide. Such sequences can be obtained with high throughput at a low cost, thanks to current sequencing technologies, and data tends to accumulate faster than our capacity to analyze it. For example, the raw sequencing datasets available on the SRA, one of the main public repositories, amount to around 50 petanucleotides.
Whether it is to analyze small components, such as genes or variants, or larger ones, such as complete genomes, a core task is to be able to compare a pair of sequences to spot differences and similarities. Pairwise sequence comparison can be performed exactly using quadratic algorithms such as Smith-Waterman or Needleman-Wunsch. However, given the orders of magnitude of data to be dealt with, these exact solutions are usually reserved for very small problems. To scale, heuristics based on sketching have been developed to perform a broad spectrum of tasks.
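To illustrate the quadratic dynamic programming involved, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The scoring scheme (match +1, mismatch -1, gap -1) is an arbitrary choice for illustration, not taken from the talk.

```python
# Minimal Needleman-Wunsch global alignment score: O(n*m) time and space.
# Scoring scheme (match=+1, mismatch=-1, gap=-1) is illustrative only.
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap           # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap           # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # (mis)match
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[n][m]

print(needleman_wunsch_score("ACGT", "ACCT"))
```

The quadratic table is exactly what becomes prohibitive at petanucleotide scale, motivating the sketching heuristics below.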
The common principle in all sketching techniques for DNA/RNA sequences is the consistent selection of representative subsequences from longer sequences, used to index these sequences in data structures or algorithms. Sketching schemes have been designed to achieve different goals, such as maximizing the density of the scheme (roughly, the fraction of the original sequence covered by the scheme), improving cache locality, or reducing construction time and space. In this talk, we will review these techniques and present applications to current bioinformatics problems, such as assembly, sequence alignment, genome compositional analysis, and indexing.
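The "consistent selection" idea can be sketched with the classical minimizer scheme: in every window of w consecutive k-mers, keep the smallest one under some fixed order. Lexicographic order is used here purely for illustration; practical schemes typically order k-mers by a hash.

```python
# Toy minimizer sketch: from each window of w consecutive k-mers, keep the
# smallest one. Lexicographic order is for illustration; real schemes
# usually rank k-mers by a (random) hash to improve density.
def minimizer_sketch(seq, k, w):
    sketch = []
    last = None
    for start in range(len(seq) - (k + w - 1) + 1):
        window = [seq[i:i + k] for i in range(start, start + w)]
        m = min(window)
        if m != last:          # record each selected k-mer once per run
            sketch.append(m)
            last = m
    return sketch

print(minimizer_sketch("ACGTACGTAC", k=3, w=3))
```

Because the same k-mer wins in overlapping windows of two similar sequences, both sequences end up with near-identical sketches, which is what makes the comparison scalable.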
Eddie Aamari: Minimax Boundary Estimation and Estimation with Boundary
In this talk, we study the non-asymptotic minimax rates for the Hausdorff estimation of 𝑑-dimensional manifolds 𝑀 with (possibly) non-empty boundary 𝜕𝑀. The class of target sets that we consider unifies and extends the most prevalent C²-type models: manifolds without boundary, and full-dimensional domains. We will consider both the estimation of the manifold 𝑀 itself and that of its boundary 𝜕𝑀 if non-empty. In the process, we will present a Voronoi-based procedure that identifies enough points close to 𝜕𝑀 to reconstruct it. Explicit constant derivations are given, showing that these rates do not depend on the ambient dimension. If time permits, we will talk about possible extensions of the estimation procedure to smoother manifolds with corners. Joint work with Catherine Aaron and Clément Levrard.
Julien Tierny: Principal Geodesic Analysis of Merge Trees (and Persistence Diagrams)
This work presents a computational framework for the Principal Geodesic Analysis of merge trees (MT-PGA), a novel adaptation of the celebrated Principal Component Analysis (PCA) framework to the Wasserstein metric space of merge trees. We formulate MT-PGA computation as a constrained optimization problem, aiming at adjusting a basis of orthogonal geodesic axes while minimizing a fitting energy. We introduce an efficient, iterative algorithm which exploits shared-memory parallelism, as well as an analytic expression of the fitting energy gradient, to ensure fast iterations. Our approach also trivially extends to extremum persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our approach, with MT-PGA computations taking on the order of minutes for the largest examples. We show the utility of our contributions by extending two typical PCA applications to merge trees. First, we apply MT-PGA to data reduction and reliably compress merge trees by concisely representing them via their first coordinates in the MT-PGA basis. Second, we present a dimensionality reduction framework exploiting the first two directions of the MT-PGA basis to generate two-dimensional layouts of the ensemble. We augment these layouts with persistence correlation views, enabling global and local visual inspections of the feature variability in the ensemble. In both applications, quantitative experiments assess the relevance of our framework.
Giulia Marchello: Unsupervised statistical learning with latent block models on dynamic discrete data
In our interconnected world, we constantly generate vast and diverse data, necessitating automated methods for detection, synthesis, and understanding. Over the years, statistical learning has evolved to address these data challenges, with unsupervised learning being pivotal in uncovering patterns without predefined labels. In particular, clustering is a valuable technique for summarizing high-dimensional data by grouping similar observations based on shared characteristics. Extending this approach to simultaneously group observations and features, known as co-clustering, reveals intricate relationships within data. Within this framework, we explore two dynamic co-clustering models specifically designed for count data. These models integrate systems of ordinary differential equations, enabling them to capture sudden shifts in group membership and data sparsity as time progresses; the second model can moreover operate in real time on data streams. These innovations have practical implications in pharmacovigilance, a critical domain dedicated to the continuous monitoring and assessment of the safety of medical products.
The workshop will start at 9 h and end at 17 h. Lunch will be provided on site.
9 h: Welcome of participants
9 h 15: Julien Tierny, Principal Geodesic Analysis of Merge Trees (and Persistence Diagrams)
10 h 00: Elodie Maignant, Barycentric subspace analysis of sets of graphs
10 h 20: coffee break
10 h 35: Giulia Marchello, Unsupervised statistical learning with latent block models on dynamic discrete data
11 h 20: Tom Szwagier, Rethinking dimension reduction with flag manifolds
11 h 45: Lunch break (across from the Condorcet room)
13 h 30: Camille Marchet, Sketching in sequence bioinformatics: methods and applications
14 h 20: Igor Martayan, A compressed, dynamic and locality-preserving representation of k-mer sets for genomic analysis
14 h 40: Hugues Van Assel, SNEkhorn: Dimension Reduction with Symmetric Entropic Affinities
15 h: coffee break
15 h 30: Eddie Aamari, Minimax Boundary Estimation and Estimation with Boundary
16 h 20: Etienne Lasalle, Compressive Recovery of Sparse Precision Matrices
16 h 40: Maxime Guillaud, Channel Charting: Dimensionality Reduction for Radio Signals
17 h: concluding remarks
Abstracts of short talks
Barycentric subspace analysis of sets of graphs, Elodie Maignant
Unlabeled graphs are naturally modeled as equivalence classes under the action of permutations on nodes, or equivalently under the action by conjugation of permutation matrices on adjacency matrices. This model is, however, computationally limited, especially because it relies on the combinatorial problem of graph matching. As a relaxation of this action, we introduce in this talk a new framework where graphs are modeled in the quotient space resulting from the action of rotations on the set of adjacency matrices. Beyond the idea of relaxation, our approach has a natural interpretation from the point of view of spectral graph theory, that is, the study of graphs through their spectrum, a descriptor known to encode numerous properties. Indeed, the action of rotations by conjugation preserves the eigenvalues of a given graph (represented by its adjacency matrix), and the resulting equivalence classes are precisely the sets of cospectral graphs, that is, graphs sharing the same set of eigenvalues. In this framework, we explore non-Euclidean dimensionality reduction, and in particular a recent method named Barycentric Subspace Analysis (BSA). We demonstrate that our framework carries a very nice geometric structure that greatly simplifies the method. Through several examples, we then illustrate that BSA is a powerful and highly interpretable dimensionality reduction tool, particularly suited to the analysis of sets of graphs.
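The spectrum-preservation property underlying this quotient is easy to check numerically. The following sketch (numpy, with an arbitrary small graph and a random orthogonal matrix, not code from the talk) verifies that conjugation by a rotation leaves the eigenvalues of an adjacency matrix unchanged.

```python
import numpy as np

# Check that conjugating a symmetric adjacency matrix A by an orthogonal
# matrix Q (a "rotation") preserves its spectrum.
rng = np.random.default_rng(0)

# Adjacency matrix of a small undirected graph (the 4-cycle).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# Random orthogonal matrix obtained via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

eig_A = np.sort(np.linalg.eigvalsh(A))
eig_conj = np.sort(np.linalg.eigvalsh(Q @ A @ Q.T))
print(np.allclose(eig_A, eig_conj))  # spectra coincide
```

Since Q A Q.T is generally not an adjacency matrix (entries are no longer 0/1), the orbit of A under rotations is strictly larger than its permutation orbit, which is exactly the relaxation the talk exploits.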
Rethinking dimension reduction with flag manifolds, Tom Szwagier
Principal component analysis (PCA) can be rethought as the search for a sequence of nested linear subspaces of increasing dimension (i.e., a flag) that fit the data increasingly well. Recent advances in geometry and optimization on flag manifolds raise interesting multiscale perspectives on classical dimension reduction methods, which usually work at a fixed dimension. In this talk, I will present a first step in that direction, whose goal is to lay statistical foundations for these multiscale subspace methods. Stratified PCA (SPCA) is a generative covariance model with repeated eigenvalues which generalizes Probabilistic PCA (PPCA). A geometric study of SPCA models shows that they are parameterized by flag manifolds and that they stratify the space of covariance matrices according to the sequence of eigenvalue multiplicities. This sheds light on PPCA. We notably show that treating the top eigenvalues and eigenvectors in PPCA as identifiable (as assumed by the model) is rarely justified in practice, and that covariance models with repeated eigenvalues and multidimensional eigenspaces are more likely.
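The nesting property that makes PCA a flag can be checked directly: the span of the top-k principal directions is contained in the span of the top-(k+1). A minimal numerical sketch (numpy, synthetic data with distinct eigenvalues; not code from the talk):

```python
import numpy as np

# PCA yields a flag: the span of the top-k principal directions is contained
# in the span of the top-(k+1). We check containment by verifying that
# projecting the top-k basis onto the larger subspace changes nothing.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

# Principal directions = right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

for k in range(1, 5):
    U_k = Vt[:k].T          # basis of the k-dimensional principal subspace
    U_k1 = Vt[:k + 1].T     # basis of the (k+1)-dimensional one
    P = U_k1 @ U_k1.T       # orthogonal projector onto the larger subspace
    assert np.allclose(P @ U_k, U_k)  # nesting: U_k lies inside U_{k+1}
print("nested flag of subspaces verified")
```

When eigenvalues are repeated, the individual eigenvectors inside an eigenspace are no longer identifiable, but the flag of well-separated eigenspaces still is, which is the starting point of the SPCA model.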
A compressed, dynamic and locality-preserving representation of k-mer sets for genomic analysis, Igor Martayan
With the unprecedented amount of genomic data available, the development of space-efficient data structures for indexing DNA sequences has never been more important. Since it is often too expensive to analyze long sequences all at once, these sequences are usually broken down into words of fixed size k, called k-mers. In particular, sets of k-mers play a central role in bioinformatics, whether for searching a sequence in a dataset, assembling a genome from smaller sequences or comparing different genomes. One way to obtain a space-efficient representation of k-mer sets is to consider them as a set of integers and perform compression on this set. We can then augment this representation to support fast membership queries and dynamic insertion/deletion with little additional space. On top of this representation, we can use an alternative encoding of k-mers, based on Lyndon words, which improves locality so that overlapping k-mers are mapped close to each other in memory. By improving locality, we achieve a better compression and speed up the queries by avoiding cache misses for consecutive k-mers.
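The integer view of k-mer sets can be sketched with the standard 2-bit encoding (A=0, C=1, G=2, T=3), updated in rolling fashion along the sequence; the Lyndon-word-based, locality-preserving encoding from the talk is not shown here.

```python
# Sketch: encode each k-mer as an integer with 2 bits per nucleotide
# (A=0, C=1, G=2, T=3), so a set of k-mers becomes a set of integers
# amenable to compressed integer-set data structures.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    x = 0
    for c in kmer:
        x = (x << 2) | CODE[c]
    return x

def kmer_set(seq, k):
    mask = (1 << (2 * k)) - 1
    x, out = 0, set()
    for i, c in enumerate(seq):
        x = ((x << 2) | CODE[c]) & mask  # rolling 2-bit update
        if i >= k - 1:
            out.add(x)
    return out

print(sorted(kmer_set("ACGTAC", 3)))
```

Note that under this plain encoding, overlapping k-mers can map to very distant integers; the alternative encoding discussed in the talk is designed precisely to keep them close in memory.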
SNEkhorn: Dimension Reduction with Symmetric Entropic Affinities, Hugues Van Assel
Many approaches in machine learning rely on a weighted graph to encode the similarities between samples in a dataset. Entropic affinities (EAs), which are notably used in the popular Dimensionality Reduction (DR) algorithm t-SNE, are particular instances of such graphs. To ensure robustness to heterogeneous sampling densities, EAs assign a kernel bandwidth parameter to every sample in such a way that the entropy of each row in the affinity matrix is kept constant at a specific value, whose exponential is known as perplexity. EAs are inherently asymmetric and row-wise stochastic, but they are used in DR approaches after undergoing heuristic symmetrization methods that violate both the row-wise constant entropy and stochasticity properties. In this work, we uncover a novel characterization of EAs as an optimal transport problem, allowing a natural symmetrization that can be computed efficiently using dual ascent. The corresponding novel affinity matrix derives advantages from symmetric doubly stochastic normalization in terms of clustering performance, while also effectively controlling the entropy of each row, thus making it particularly robust to varying noise levels. Building on this, we present a new DR algorithm, SNEkhorn, that leverages this new affinity matrix. We demonstrate its clear superiority over state-of-the-art approaches on several indicators, on both synthetic and real-world datasets.
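The bandwidth calibration behind entropic affinities can be sketched as follows: for each sample, bisect on the Gaussian precision until the row entropy matches log(perplexity). This toy version (numpy) reproduces the row-stochastic but asymmetric matrix whose symmetrization SNEkhorn addresses; it is not the SNEkhorn algorithm itself.

```python
import numpy as np

# Toy entropic affinities, as used in t-SNE: for each sample, a Gaussian
# bandwidth is tuned by bisection so that the entropy of its affinity row
# equals log(perplexity). The result is row-stochastic but asymmetric.
def entropic_affinities(X, perplexity=5.0, tol=1e-5, n_iter=100):
    n = X.shape[0]
    D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)  # squared distances
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        d = np.delete(D[i], i)       # distances to the other points
        d = d - d.min()              # shift for numerical stability
        lo, hi = 1e-10, 1e10         # bisection bounds on the precision beta
        for _ in range(n_iter):
            beta = (lo + hi) / 2
            p = np.exp(-beta * d)
            p /= p.sum()
            ent = -np.sum(p * np.log(np.maximum(p, 1e-300)))
            if abs(ent - target) < tol:
                break
            if ent > target:
                lo = beta            # row too spread out: sharpen the kernel
            else:
                hi = beta            # row too peaked: flatten the kernel
        P[i] = np.insert(p, i, 0.0)  # zero self-affinity
    return P

rng = np.random.default_rng(0)
P = entropic_affinities(rng.standard_normal((20, 2)), perplexity=5.0)
print(np.allclose(P.sum(axis=1), 1.0), np.allclose(P, P.T))
```

Because each row gets its own bandwidth, P[i, j] generally differs from P[j, i]; heuristic fixes such as averaging P with its transpose destroy the per-row entropy constraint, which is the gap SNEkhorn closes.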
Compressive Recovery of Sparse Precision Matrices, Etienne Lasalle
We consider the problem of learning a graph modeling the statistical relations of the d variables of a dataset with n samples. Standard approaches amount to searching for a precision matrix representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the d^2 values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this talk, we adopt a "compressive" viewpoint and aim to estimate a sparse precision matrix from a sketch of the data, i.e., a low-dimensional vector of size m ≪ d^2 carefully designed from the data using nonlinear random features (e.g., rank-one projections). Under certain spectral assumptions, we show that it is possible to recover the precision matrix from a sketch of size m = O((d + 2k) ln d), where k is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and the graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed. Joint work with Titouan Vayer, Rémi Gribonval, and Paulo Gonçalves.
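The sketching idea can be illustrated with rank-one projections: instead of storing the d^2 entries of the empirical covariance, keep m quadratic measurements a_j' Sigma_hat a_j, computable in one pass over the data. This is only a sketch of the measurement step under arbitrary dimensions; the recovery step (the graphical-lasso-based iterative algorithm from the talk) is not shown.

```python
import numpy as np

# Toy "compressive" sketch of the empirical covariance: instead of storing
# all d^2 entries of Sigma_hat = X.T @ X / n, keep m rank-one measurements
# y_j = a_j^T Sigma_hat a_j, computed directly from the data in one pass.
rng = np.random.default_rng(0)
n, d, m = 1000, 10, 40                    # note m = 40 << d^2 = 100
X = rng.standard_normal((n, d))           # zero-mean samples
A = rng.standard_normal((m, d))           # random measurement vectors a_j

# Sketch computed without ever forming the d x d covariance matrix:
# mean over samples of (a_j^T x)^2 equals a_j^T Sigma_hat a_j.
sketch = np.mean((X @ A.T) ** 2, axis=0)  # shape (m,)

# Sanity check against the explicit quadratic form.
Sigma_hat = X.T @ X / n
print(np.allclose(sketch, np.einsum('md,de,me->m', A, Sigma_hat, A)))
```

The point of the talk's guarantees is that, for sparse precision matrices, a sketch of size roughly (d + 2k) ln d already determines the graph, even though it is far smaller than the d^2 numbers it summarizes.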
Channel Charting: Dimensionality Reduction for Radio Signals, Maxime Guillaud
Channel charting is a recently proposed framework that applies dimensionality reduction to electromagnetic propagation data in wireless communication systems (such as 5G and 6G). It associates, in real time, a pseudo-position in a low-dimensional space with each mobile user, enabling a range of applications tied to user location. I will introduce the theoretical underpinnings of channel charting and present an overview of algorithmic developments and experimental results obtained in the field. I will discuss the role of channel charting in next-generation wireless networks, and provide a perspective on future developments and challenges. This presentation will be based on a recent overview article: https://ieeexplore.ieee.org/document/10155724
Registration is closed.
We have limited funding to sponsor PhD students who would like to attend (and, in particular, to present their work) but have no funding of their own. Please let us know in the submission form if you are in this situation.