Novel algorithm facilitates insights into diverse types of problems, from healthcare to finance | Usher Institute

A new feature-selection algorithm can be applied to any dataset, yielding faster data analysis.

Although large datasets are crucial in data analytics, having too many variables in a dataset often harms performance, because the data become complicated and time-consuming to process and interpret. Dr Thanasis Tsanas, Senior Lecturer in Data Science at the Usher Institute, has led the development of the Relevance, Redundancy and Complementarity Trade-Off (RRCT) algorithm, a universal feature-selection tool that any data analyst can use to identify a small, robust set of key features from a potentially very large number of features, many of which may be noise or redundant to the study.

A simple tool for any data-driven field

By robustly reducing the number of features in a dataset, the RRCT algorithm helps reveal the underlying characteristics of a problem, making a system more interpretable and usable by experts.

In medicine, this could mean insights into, for example, the key genes contributing to rare genetic disorders, or faster diagnosis of illness. In other industries, such as finance, software, navigation and even weather forecasting, it could mean better-informed decision-making.

What makes Dr Tsanas’ algorithm unique is that it intrinsically accounts for different types of variables in a dataset, considers combinations of variables when estimating the outcome of interest, and can be readily deployed whether that outcome is discrete or continuous.

Making complex datasets workable

The RRCT algorithm ranks features in descending order of importance for estimating the outcome of interest, taking their joint characteristics into account. It accounts for the three key properties of robust feature selection: relevance (how strongly a feature is associated with the outcome), redundancy (how much a feature overlaps with other features), and complementarity (how much features add when considered together). It balances the three using a principled information-theoretic trade-off.
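To make the idea of trading relevance off against redundancy concrete, here is a minimal Python sketch of a greedy ranking. It is not the published RRCT implementation: it leaves out the complementarity term entirely, it assumes Pearson correlation as the measure of association, and it converts correlations to mutual information using the standard formula for jointly Gaussian variables. The function names `gaussian_mi` and `rank_features`, and the synthetic data, are hypothetical and exist only for illustration.

```python
import numpy as np


def gaussian_mi(r):
    """Mutual information (in nats) of two jointly Gaussian variables
    with correlation r: I = -0.5 * ln(1 - r^2)."""
    r = np.clip(np.abs(r), 0.0, 0.999999)  # avoid log(0) when |r| == 1
    return -0.5 * np.log(1.0 - r ** 2)


def rank_features(X, y):
    """Greedily rank the columns of X: at each step pick the feature with the
    highest relevance to y minus its mean redundancy with the features
    already picked. Illustrative only; complementarity is not modelled."""
    n_features = X.shape[1]
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    relevance = gaussian_mi(corr[:n_features, -1])             # feature vs outcome
    redundancy = gaussian_mi(corr[:n_features, :n_features])   # feature vs feature

    selected = []
    remaining = list(range(n_features))
    while remaining:
        if selected:
            penalty = redundancy[np.ix_(remaining, selected)].mean(axis=1)
        else:
            penalty = np.zeros(len(remaining))
        scores = relevance[remaining] - penalty
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic example (assumed data, not from the paper):
# column 0 and column 2 drive the outcome, column 1 is a near-copy of
# column 0, and column 3 is pure noise.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x2 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = x0 + 0.8 * x2 + 0.1 * rng.normal(size=n)

ranking = rank_features(X, y)
print(ranking)
```

In this toy setting, whichever of the two near-identical columns is picked first, the other should be pushed to the end of the ranking, because its redundancy penalty outweighs its relevance. A ranking based on relevance alone would place the duplicate second.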

In a paper published in the journal Patterns, RRCT was compared against 19 known feature-selection algorithms across 12 synthetic and real-world datasets. RRCT outperformed its competitors in the vast majority of cases in terms of correctly identifying the key features contributing to estimating the outcome of interest, proving its worth as a robust and powerful feature selection tool that can be used in virtually any field.

As a ‘generalisable’ algorithm, RRCT can be used off the shelf for any data-related application where analysts want to identify key variables. It has been made open source and can be accessed via GitHub for use in MATLAB or Python. While large datasets are vital to improving research, versatile algorithms like RRCT will make it easier to digest and use the colossal amounts of data collected on a daily basis, helping to supercharge the AI systems of the future.

‘The RRCT algorithm is a mathematically robust way to identify the key underlying characteristics of any dataset and can be applied to any combination of variable types across any field or domain.’
Dr Thanasis Tsanas

Read the blog post about RRCT at the Alan Turing Institute

Access the paper published in Patterns

Watch the HDR UK webinar on RRCT

Download the RRCT algorithm from GitHub

This article was published on Thursday 21 July 2022