Monday, July 19, 2010

Data mining talks

As a molecular biophysicist I often hear talks (and see posters) given by bioinformaticists.* I am struck by how these are almost uniformly abysmal. I'm not necessarily referring to the data, but rather the presentation as a whole. This has reached the point where I don't think I can bring myself to sit through another bioinformatics talk (or poster presentation) for at least the next three months.

Why has the quality of the now dozens of such talks I've suffered through been so low?

In the majority of cases I posit it's a combination, in varying degrees, of a lack of imagination and a disconnection from the underlying biology. Too many of these presenters regale their audiences with interminable laundry lists of how property X is over-represented in sequences of class A, and under-represented in sequences of class B. Ummmm... So what? Why should I care? Often such presenters either don't know the relevant biology or are too lazy to spend the time connecting their data with it. As an example, I recently sat through a talk where the speaker made a big deal about the prevalence of glutamine-rich sequences in proteins involved in transcription. Not once did he refer to the fairly substantial body of experimental data on these very same sequences. In fact, when asked, he couldn't offer up any explanation for this observation.** Major fail.

I can't explain why this happens. Obviously it shouldn't. Perhaps it's a function of the relatively immature nature of bioinformatics as a field. It's still at a stage where method development trumps method application. Application of the intelligent kind.

I remember when macromolecular crystallography talks suffered from similar issues. They would often be these long detailed descriptions of the structure(s) just solved by the crystallographer. No connection to the biology, just the details of the structure. Listen, I don't give a rat's arse that there's a type VIIb turn between helices 7 and 8. What I want to know is what the structure tells us about the biology. Nowadays most crystallographers do make the connections. One can't get a grant for simply solving structures any more.***

I've heard through the grapevine that getting a grant to do bioinformatics has become increasingly difficult. More so than would be expected from the downturn in science funding. Perhaps we'll see the field forced to mature more rapidly and presentations improve.

* By "bioinformatics" I mean the data-mining thing. A colleague once defined it thusly: "Bioinformatics is the mining of biological databases for profit (not necessarily of the monetary kind)." This is distinct from computational biology which, at least at the molecular level, tends to employ an energy function of sorts.

** Glutamine-rich regions can be involved in DNA binding - the glutamine side chain is quite good at making hydrogen bonds with nucleic acids.

*** Not when I'm reviewing the grant. :-)


Comrade PhysioProf said...

Preach on, brother!

Prof-like Substance said...

My impression is that there are very few individuals who have been "brought up" at the intersection of biology and computer science / math in the way that allows for the really meaningful integration of them. That is changing somewhat as bioinformatics becomes more recognized as a field, but typically you have biologists struggling with the tools or computer scientists struggling with the context. Having one foot firmly planted on both sides of the divide is rare and very difficult to get to as a trainee.

Odyssey said...

Agreed. However, people who become so conversant with mining databases shouldn't find the mining of PubMed, WoS and/or Google Scholar all that hard. Understanding what they find is certainly more difficult, but that's when you march yourself down the corridor/to the next building/across campus to find yourself a biologist willing to help you try to make sense of what you have. Granted one can't always make sense of these things, but the onus is on the researcher to at least try.

As the field of bioinformatics grows and evolves things will improve, but it's so freaking frustrating right now to sit through a bioinformatics talk devoid of biology.

Prof-like Substance said...

I agree. I'm not excusing the behavior, just reporting how it looks from my seat. It's lazy to just report the findings and not try and place them in a broader context, but the shit still gets published.

Odyssey said...

Alas the literature is full of shit...

ucsf egan said...

This choir member agrees - the same sentiment has fueled our development of EGAN. Too many machine learning algorithms produce black-box classifiers that, while highly accurate with respect to their training data, fail to provide any insight into the underlying biology.

I worry that computational biology is becoming an exclusive scientific caste with no motivation to collaborate with outsiders.

Odyssey said...

You are absolutely right about machine learning algorithms.

Re a computational biology caste, I was careful to distinguish bioinformatics from the rest of computational biology. The molecular simulation field appears to have matured to the point where many within the field are trying to tackle biologically relevant problems in collaboration with experimentalists. Of course there are also those that don't, and some of those that do, do so very badly.

tideliar said...

I know exactly what you mean. I beat my staff and students constantly to force them to learn some bloody biology. They are just so focused on the tool, they completely miss the damned *point*!

I was fortunate to get into this area after already having been a bench biologist for some years. Everything I look at is seen in terms of functions, not necessarily form.

rork said...

Bioinformatics: anything remotely mathematical that the other people don't want to do or don't know how to do.

I think part of the problem when working on real problems is that the biologists aren't able to articulate what they really want, partly because they are not familiar with what can be done. Also, the number of possible domains of bioinformatics is so great that it's very hard to be familiar with, much less good at, all of them. Sure, nerds like me often will not understand the biology well enough. But it goes both ways - the biologists do not understand the math nearly well enough. So perhaps I want to plead (with Prof-like) that we will need future docs and biologists to learn more math-related stuff - far more. In biology in particular, I think many people self-selected into the (squishy) field because they did not want to study the sciences where equations dominate.

For example, if you didn't want a dark machine-learning classifier (I don't), did you ask your nerd to try a diagonal linear discriminant function instead? Know what a linear discriminant function is about? Hint: invented by one of the greatest geneticists of the 20th century, R.A. Fisher. I hear he was a fairly good mathematician too. And that was when, 1936?

I admit there is some problem with it being a somewhat "hot field", but one must sometimes permit turning aside to research and improve methods without focusing so much on immediate goals, in order to better achieve our goals in the long run (I am nearly quoting de Tocqueville).

Odyssey said...

I'm a biophysicist. I get the math.

And I agree with both PlS and you that there is a dire shortage of people trained at the interface. But ultimately none of that is relevant to my gripe. What it all boils down to is this: know your audience. If you're a bioinformatician (or any other kind of scientist) presenting your work to a more general audience you should be working hard to ensure your presentation is relevant to that audience. Maybe biologists should be more conversant in math. But right now many are not, and giving an unintelligible, math-laden presentation isn't likely to inspire them to go out and learn some. Presenting some whizz-bang new software/algorithm/tool, no matter how transparent the technology, won't impress anyone if you can't show how it's useful to the audience. In language the audience understands.

As alluded to above, it goes both ways. A biologist/biochemist/microbiologist/physiologist/biophysicist/neurobiologist etc. giving a talk to bioinformaticists should work to make the biology more accessible.

It's all about communication.

rork said...

I can't argue with any of that.

Anonymous said...

The problem with training more people at the interface is that it's hard for them to find employment as independent researchers afterwards. Math/CS departments ask where the theorems/algorithms/methods are, and biology departments want someone trained to teach biology and run a lab. Easy to find yet another postdoc, but there are limited research career paths for people who actually take the time to understand both.