Artificial Intelligence and Life Sciences (Why is Big Data not enough to capture biological systems?)

by HansNauj, 15th Jan 2020


Biology, Machine Learning, AI


Artificial intelligence algorithms, and deep learning in particular, are becoming increasingly relevant to biology research. These algorithms make it possible to analyze volumes of data that other techniques cannot handle, and to detect characteristics that might otherwise go unrecognized. This automatic feature detection makes it possible to classify cellular images, make genomic connections, advance drug discovery, and even find links between different types of data, from genomics and images to electronic medical records.[1]

However, as S. Webb pointed out in a technology feature published in Nature, as with any computational-biology technique, the results that arise from artificial intelligence are only as good as the data that go in. Overfitting a model to its training data is also a concern. In addition, for deep learning, the criteria for data quantity and quality are often more rigorous than some experimental biologists might expect[2]. Moreover, if organisms behave more like observers who make decisions depending on a constantly changing environment, then a purely data-driven strategy is not useful at all, because individual decisions introduce constant irregularities into the collected data that cannot be captured by a simple model or principle[3].

This perspective is not only a challenge to research in biology; it also calls for an alternative approach to our understanding of intelligence. The goal is not only to understand and classify individual complex structures and organisms, but also to understand how these individual intelligences distribute (heterogeneously) into spaces and (eco)systems and generate coherence[4] from an apparently incoherent world. It also implies that research on intelligence, and the technologies derived from it, cannot be treated as a phenomenon isolated from the rest of nature.

Big data as a golden pathway in biology?

We currently live in a golden age of biology, in part due to the idea that biological systems consist of a relationship between the parts (e.g. molecules) and the whole (e.g. organisms or ecosystems). Thanks to advances in technology (for instance in microscopy), mathematics and informatics, it is now possible to measure and model complex systems by interconnecting different components at different scales in space and time[5]. To this end, large amounts of data are gathered to identify patterns[6] that allow the identification of dynamic models.

This apparently successful working method has been described as the "microarray" paradigm[7]. The name comes from the microarray technique in biology, which visualizes relevant intracellular biochemical reactions by labeling proteins with phosphorescent markers. A prominent example of this trend can be seen in the biosciences, where drug efficacy or disease detection is, in practice, addressed through the study of groups, similarities and other structured characteristics of huge sets of chemical compounds (microarrays) derived from genetic, protein or even tissue samples. The traditional study of biological systems requires reductive methods in which data are collected by category. Computers are used to analyze and model these data[8], in part with machine learning tools such as deep learning[9], in order to create accurate, real-time models of a system's response to environmental and internal stimuli, for example in the development of therapies for cancer treatment[10].

However, a significant challenge for the advancement of this systemic approach is that the data needed for modeling are incomplete, contain biases and imbalances, and are continually evolving. Usually the fault is assumed to lie with the modeler and with the methods and technologies used to design and build the model. The modeler then has no choice but to increase the amount and diversity of the data, perform various rounds of model optimization and training, and even resort to sloppy parameterization[11].

If we assume that organisms are like robots that operate automatically through a set of well-defined internal mechanisms, then bias in the data can only originate from factors outside the organism.

But even when data problems stem from the measurement methodology, we should not forget that biological systems and organisms are also observers, with complex organized interior states that allow them to observe their environment. As organisms respond to their inner states, they are also obliged to move within their environment[12]. This implies a constant adaptation that continuously undermines the use of historical data to train models. Thus, the flexible responses of organisms to their environment imply that bias can also originate from the organism itself, and that organisms can adapt, or restrict their adaptation, to experimental conditions. Therefore, in biology too, the experimenter can influence the experimental outputs.
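The point about historical data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the "organism" is a hypothetical linear responder whose gain changes from 2.0 to 0.5 after it adapts, and the model is an ordinary least-squares line. It is not taken from any of the cited works.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(a, b, xs, ys):
    """Average squared difference between the line's predictions and ys."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate_response(stimuli, gain, rng, noise=0.05):
    """Hypothetical organism: response = gain * stimulus + Gaussian noise."""
    return [gain * s + rng.gauss(0.0, noise) for s in stimuli]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
stimuli = [i / 50 for i in range(1, 51)]

# Historical measurements: the organism responds with gain 2.0 (assumed).
historical = simulate_response(stimuli, 2.0, rng)
a, b = fit_line(stimuli, historical)

# Fresh measurements from the same, not yet adapted, organism.
error_before = mean_squared_error(a, b, stimuli, simulate_response(stimuli, 2.0, rng))

# The organism adapts: its gain drops to 0.5 (assumed), the model does not change.
error_after = mean_squared_error(a, b, stimuli, simulate_response(stimuli, 0.5, rng))

print(f"fitted gain: {a:.2f}")
print(f"error before adaptation: {error_before:.4f}")
print(f"error after adaptation:  {error_after:.4f}")
```

In this toy setting the model fitted to historical data stays accurate only while the organism keeps responding as before; once the response changes, collecting more of the old data would not help.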

This implies that living entities do not react blindly to their environment, but also subjectively decide what their response should be. For instance, Denis Noble argues that “life is a symphony that connects organisms and environment across several scales and is much more than a blind mechanism”[13]. This constant sensing across several scales implies that mechanisms and pathways[14] are also adapting and changing in a constant flow[15], and that they are inherently incomplete, which challenges explicit or implicit causal explanations based on mechanisms and pathways.

Therefore, cognition is not a characteristic of single organisms, but involves the participation of all the organisms subject to any analysis. This kind of natural distributed cognition poses a challenge to the way we understand the world and process information about it, whether to build mechanistic models or to use entropy reduction methods in machine learning to build black-box models. Once we recognize that "biological systems" are a set of observers who make decisions, we are dealing with systems with "ears and eyes" that are also able to model their environment and actions, including the experimental conditions to which they are subject. Because organisms, such as cells, are continually interacting with their environment and making decisions about it[16], they can also develop intrinsic biases, for example in their responsiveness to growth factors[17][18].

Consequently, instead of accumulating ever more data in an attempt to better approximate a biological system, it is preferable to better understand how an organism senses and “computes”[19] its environment[20]. To this end, we argue that concepts from cognitive computation will be relevant for better analyzing small amounts of data. As noted in an overview of this topic, cognitively inspired mechanisms should be investigated in order to make algorithms more intelligent, powerful, and effective in extracting insightful knowledge from huge amounts of heterogeneous Big data[21], with topics that span from cognitively inspired neural networks to semantic knowledge understanding. Our focus in this work is oriented in part toward such approaches, while recognizing the possibility of a natural distributed cognition in biological systems.

Finally, we want to point out that acknowledging this kind of distributed cognition in biological systems is relevant not only from a philosophical but also from a practical point of view, since it points toward minimizing the amount of data required to train models. For instance, Ghanem et al. combined two metaheuristics, the artificial bee colony algorithm and the dragonfly algorithm, into a metaheuristic optimizer that trains a multilayer perceptron to reach a set of weights and biases yielding high performance compared to traditional learning algorithms[22]. Therefore, a clever combination of natural bias with a recognition of heterogeneities in the data used to train models in biology can contribute to the development of efficient modelling methodologies that use few datasets.
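To make the idea of training a perceptron with a metaheuristic more concrete, here is a minimal sketch. It is not the method of Ghanem et al., and it implements neither the artificial bee colony nor the dragonfly algorithm. It swaps in a much simpler population-based search (keep the best candidates, mutate them with Gaussian noise) and applies it to a tiny multilayer perceptron on the XOR toy problem. The network size, population size, mutation scale and all function names are assumptions chosen for illustration.

```python
import math
import random

# Toy dataset (XOR), chosen only because it is tiny and not linearly separable.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

N_HIDDEN = 3
# Each hidden unit has 2 input weights + 1 bias; the output unit has
# N_HIDDEN weights + 1 bias.
N_WEIGHTS = 3 * N_HIDDEN + N_HIDDEN + 1


def sigmoid(z):
    """Logistic function written with tanh, which cannot overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def predict(w, x):
    """Forward pass of a 2-N_HIDDEN-1 perceptron with flat weight vector w."""
    hidden = []
    for h in range(N_HIDDEN):
        base = 3 * h
        hidden.append(sigmoid(w[base] * x[0] + w[base + 1] * x[1] + w[base + 2]))
    out = 3 * N_HIDDEN
    z = sum(w[out + h] * hidden[h] for h in range(N_HIDDEN)) + w[out + N_HIDDEN]
    return sigmoid(z)


def loss(w):
    """Mean squared error of the network on the toy dataset."""
    return sum((predict(w, x) - y) ** 2 for x, y in DATA) / len(DATA)


def train(pop_size=30, generations=200, sigma=0.5, seed=0):
    """Elitist mutation search over weight vectors; no gradients are used."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
                  for _ in range(pop_size)]
    initial_best = min(loss(w) for w in population)
    for _ in range(generations):
        population.sort(key=loss)
        elite = population[: pop_size // 5]  # the best fifth survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            parent = rng.choice(elite)
            children.append([wi + rng.gauss(0.0, sigma) for wi in parent])
        population = elite + children
    best = min(population, key=loss)
    return best, initial_best, loss(best)


best_weights, initial_loss, final_loss = train()
print(f"best loss in initial population: {initial_loss:.4f}")
print(f"best loss after search:          {final_loss:.4f}")
```

Because the best candidates are always carried over unchanged, the best loss can never get worse from one generation to the next. Real metaheuristics such as those cited above use more elaborate rules for exploring the weight space, but the overall loop (propose candidate weights, score them, keep the promising ones) has the same shape.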


Currently, deep learning algorithms used for biology research require extremely large and well-annotated datasets to learn how to distinguish features and categorize patterns. Larger, clearly labeled datasets, with millions of data points representing different experimental and physiological conditions, give researchers the most flexibility to train an algorithm. But such well-annotated examples, which optimal training requires in large numbers, can be exceptionally difficult to obtain[23].

This implies that a safe use of this technology for modelling biological systems requires looking at the model through the lens of the training data[24]. Our central point is that the safe use of modelling methods in biology and medicine must recognize that organisms are also active and intelligent observers[25], and that modelling technologies can be used in a “safe” way only when organisms or agents behave more or less “mechanistically”.

On the other hand, we argue that the use of small amounts of data, for instance based on a clever use of cognitive bias[26], can be a promising way to better understand and eventually model biological systems in a fairer (and safer) way.

These arguments also imply that, from a biological perspective, no form of intelligence should be investigated as an isolated phenomenon, and that any form of intelligence (in any organism, including humans) should be considered in relation to its environment and its interactions with other organisms, since nature is a loop in which each organism models other organisms as well as the common environment.

For example, constant reciprocal modeling between organisms and the environment is part of the central theme of the film Jurassic Park. In this story, biotechnological advances make it possible to reproduce and control a population of dinosaurs produced de novo from DNA fragments in blood found in the stomachs of fossilized insects preserved in amber. The film portrays how, from a human perspective, this population of dinosaurs was under control, but no one in the park was aware that the dinosaurs were at the same time adapting and changing their behavior, and that they were also modeling humans. In the end, the theme park ends in a catastrophe in which humans lose control.

This was a sci-fi film that was perhaps far from reality, but it helps illustrate that living beings in biology are much more than a piece of hardware and biotech software.

In nature there are several examples that illustrate this fact, i.e. that life is much more than a set of modellable biotechnical systems: for instance, the observed ability of octopuses to develop empathy and friendships with humans[27], or horses that model human behavior and even demonstrate high cognitive abilities (e.g. the Clever Hans effect[28]).

Thus, to see the world and model it, we must go beyond the data, since each part of nature is continuously relearning. Following Maurice Merleau-Ponty[29], to see the world (and in this case biological systems) we must break with our familiar acceptance of it.




[4] In this context, coherence has a physical meaning.

[6] Such as space-based forms of organization or fluctuations in time.

[11] This notion also applies to other models, such as those in climate science and economics (Freedman, 2011).

[14] In what follows we distinguish mechanisms, which are biological phenomena that can be understood from fundamental and invariant physicochemical principles, from pathways, which are sequences of causal steps that string together an upstream cause, a set of causal intermediates, and some downstream outcome.

[18] Intrinsic biases in biology are similar but not identical to the concept of internal bias of agents in psychology and economics, which is applicable only to human cognition.

[19] In this context, “computing” means sensing rather than defining algorithms to perform a given operation.












Interestingly, this notion is the central topic of the film Jurasic Park:

Which one?

As with many film franchises, the first Jurassic Park movie is actually titled "Jurassic Park."

Thanks for the feedback. I implemented small changes in the text in response to these comments.