Practical Statistics for Data Scientists
This post is written for readers who have data science skills but little or no biology background, and who are interested in working on problems in the biological domain.
My goal is for you to understand how to tackle such a problem without reading a textbook or spending 20 hours studying on Khan Academy.
Keep in mind that while many problems are neatly pre-digested into training and testing data sets, this is not always the case. This article is very much about the ideas that help you before you get to that point.
The Checklist:
- Pick your task. Understand it.
- Understand the data that solves your problem (via the Data Types of Modern Biology).
- Contextualize your project to understand your goals.
- Lastly, how and why you should collaborate.
Pick your Task. Understand it.
Biology is the study of living organisms. It’s huge. Live with it. Luckily for us, though, your problem is much more specific.
Whether this reflects a human preference for categorisation is an interesting question, but if you go to the Wikipedia list of unsolved problems in biology, you will see every problem neatly laid out in categories such as “biochemistry”, “neurophysiology” and “ecology”.
First things first. Don’t think you have to study everything up front; this is often quite unnecessary. Learn more as you need it. Your time is valuable.
The rest of this article is about how to break down the literature around your problem into useful insights that will help with your solution.
If you don’t already have a problem in mind, you could go here and pick a problem.
Learning general skills is best done in the context of an immediate and real problem for which specific answers can be found.
The Data Types of Modern Biology
It is often tempting to spend most of our time training and evaluating models. On any data science project, however, there are likely to be payoffs from thinking about your data sources and data-generating processes.
In Biology, there are huge returns to understanding your data.
Comparing Raw, Annotated and Research Data
Raw data is new. It hasn’t been processed.
Take DNA sequences, for example. DNA is an information-carrying molecule that provides the blueprint for all life. We represent its sequences as ordered combinations of four letters: “A”, “T”, “C” and “G”.
These letters are not raw data, though. In reality, we deduce them via experimental techniques that probe DNA molecules in solution, and that process has some error. The true raw data may therefore be the wave-like spectra produced by that experiment.
So the “rawness” of data is quite subjective, but it is worth asking about in many contexts.
We can contrast raw data with annotated data, which has had some level of editing performed on it.
To elaborate on the previous example, we can translate DNA sequences into protein sequences (made up of 20 different letters instead) using a dictionary-like key.
If you saw a protein sequence derived from a DNA sequence, it would be a form of annotation on the DNA sequence, and not the same data as a protein sequenced directly (which results from other analytical processes).
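The “dictionary-like key” mentioned above is the genetic code, in which each three-letter codon maps to one amino acid. The following is a minimal sketch, assuming the standard genetic code, a sequence that is already in the correct reading frame, and only the letters A, T, C and G. The names `CODON_TABLE` and `translate` are ours, chosen for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTAA"))  # ATG -> M, GCC -> A, TAA -> stop
```

A protein string produced this way is an annotation of the DNA: it inherits any sequencing errors in the input and any wrong assumption about the reading frame.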
Annotations, as in other areas, can become more detailed over time.
If we define research as the process of using data to evaluate whether hypotheses are supported, then research itself can be thought of as an extreme form of annotation.
This year, a call was made to annotate research papers that might be relevant to COVID-19 using Natural Language Processing (NLP), to enable us to mine existing literature for clues to aid in the global coronavirus response.
The raw data for these projects then becomes the text inside journal articles, which itself describes other types of data. An example paper is here.
The use of data types as a framework and language for describing biological data can be incredibly powerful, and these details can make or break the interpretation and validity of your final model.
Observational vs Experimental
Where your data comes from determines the type of analysis you can do and the reach of your insights. In science, we do experiments. But just as often, we don’t.
How would Darwin have proven evolution without observational data (going out and observing similarities between animals)?
Conversely, are we to ignore natural experiments when they occur (that is, when nature randomly treats otherwise identical entities differently)?
Experimental data originates from intentionally changing one or more variables and observing how the world changes. This data is often created with a particular analysis method in mind and you will need to understand the logic behind this if you are to analyze this type of data.
Experimental data is awesome because it can often facilitate causal inference. It often lends itself well to supervised learning tasks.
Observational data is much more common. This includes any data where we haven’t systematically varied any variables of interest.
Causal inference is much harder, if not impossible, with observational data. Darwin may have succeeded because of natural experiments (see above).
Supervised learning still works for observational data, but the language surrounding model interpretation becomes treacherous territory. This is the source of much public confusion about whether X or Y is good for your health.
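To see why interpretation gets treacherous, consider a toy observational dataset in which sicker patients were more likely to be treated. The counts below are invented purely for illustration, and the helper names are ours. Within each severity group the treated patients do better, yet the pooled numbers suggest the opposite.

```python
# Toy counts, invented for illustration: (recovered, total) per group.
# Sicker patients were more likely to receive the treatment.
counts = {
    "mild":   {"treated": (9, 10),  "untreated": (72, 90)},
    "severe": {"treated": (45, 90), "untreated": (4, 10)},
}

def rate(recovered, total):
    """Recovery rate as a fraction between 0 and 1."""
    return recovered / total

def pooled_rate(arm):
    """Recovery rate for one arm, ignoring disease severity."""
    recovered = sum(counts[s][arm][0] for s in counts)
    total = sum(counts[s][arm][1] for s in counts)
    return rate(recovered, total)

# Per-group comparison: treatment looks helpful in both groups.
for severity, arms in counts.items():
    print(severity, rate(*arms["treated"]), rate(*arms["untreated"]))

# Pooled comparison: treatment looks harmful, because severity is a confounder.
print("pooled", pooled_rate("treated"), pooled_rate("untreated"))
```

A model trained on the pooled data would happily learn that treatment predicts worse outcomes. That prediction can be accurate while the causal reading of it is wrong.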
Global vs Local
This distinction contrasts publicly accessible data with data collected by a specific group, such as a lab or a company. I suspect thinking about this distinction is useful beyond just biological applications.
Global data comprises a vast array of online databases about molecules, species, medical conditions and more that have been collected and shared publicly. These datasets are critical repositories for public research and can be awesome ways to enrich local datasets through linking.
Learning how to leverage global data resources may well be one of the most exciting opportunities currently available for data scientists to work with large non-commercial datasets.
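Enriching a local dataset through linking often comes down to a join on a shared identifier. Here is a minimal sketch in plain Python. The identifiers, field names and the `enrich` helper are all hypothetical, and a real public database will have its own identifier scheme and download format.

```python
# Hypothetical local measurements, keyed by a shared identifier.
local_measurements = [
    {"protein_id": "PROT_A", "intensity": 12.5},
    {"protein_id": "PROT_B", "intensity": 3.1},
    {"protein_id": "PROT_X", "intensity": 7.8},  # not in the public table
]

# Hypothetical extract of a public annotation table.
global_annotations = {
    "PROT_A": {"name": "example kinase", "pathway": "signalling"},
    "PROT_B": {"name": "example transporter", "pathway": "transport"},
}

def enrich(rows, annotations, key="protein_id"):
    """Left join: keep every local row, add public fields where the key matches."""
    enriched = []
    for row in rows:
        extra = annotations.get(row[key], {})
        enriched.append({**row, **extra, "annotated": row[key] in annotations})
    return enriched

for row in enrich(local_measurements, global_annotations):
    print(row)
```

Keeping unmatched rows and flagging them, rather than silently dropping them, makes it easier to see how much of the local data the public resource actually covers.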
Local data, by contrast, is data that only you or your organization has. It may or may not be structured in similar ways to other datasets or be produced using the same methods.
This data can be very valuable if you are trying to compete with others academically or in business, but it might be less valuable if you are trying to create a general solution that needs to apply in many contexts.
The danger with building models on local data is that your model might be very hard for other people to use appropriately unless they can reconstruct your data collection process.
A further note: local data doesn’t become global data just because it’s available online, although the details probably matter here. Global data repositories usually maintain relationships between data points, which means there is a formal process for submitting additions to them.
Context is Key
The criteria by which a problem counts as solved change with every application of data science, and the biological sciences are no exception.
Evaluation Metrics are important considerations for any project, but the important metrics for biological projects will often be domain specific (which is why you should learn about the domain!).
For example, in many biomedical applications of data science, the false discovery rate will form part of your evaluation criteria. If your algorithm says that a molecule is present when in fact it isn’t, and a researcher believes you, they might build their next study around that result!
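The false discovery rate is the fraction of positive calls that turn out to be wrong. A minimal sketch, assuming you have ground-truth labels to compare against (in practice these may come from a validation experiment). The labels below are invented, and the function name is ours.

```python
def false_discovery_rate(predicted, actual):
    """FDR = false positives / all positive calls (0.0 if nothing was called)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    calls = true_pos + false_pos
    return false_pos / calls if calls else 0.0

# Toy labels, invented for illustration: 1 = molecule present, 0 = absent.
predicted = [1, 1, 1, 0, 1, 0]
actual    = [1, 0, 1, 0, 1, 1]
print(false_discovery_rate(predicted, actual))  # 1 false call out of 4 -> 0.25
```

Note that the missed molecule in the last position does not affect this metric at all. The FDR only looks at what the model claimed to find, which is exactly what a researcher acting on those claims cares about.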
Model Interpretation is not trivial in biological problems (is it ever?). Depending on how you construct your model, you may or may not be able to explain why it makes the decisions that it makes. Sometimes, this is ok, and sometimes it isn’t.
Model interpretation can be as simple as keeping track of feature importances (in tree-based models) or coefficients (in regression models). Often these values can be directly compared to existing theories or other models, thereby placing your results in context.
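As a small illustration of comparing a coefficient with an existing theory, the sketch below fits a one-variable least-squares line in plain Python and sets the fitted slope next to an expected value. The data, the expected slope of 2.0 and the `fit_line` helper are all hypothetical. In a real project you would more likely read coefficients off a fitted model from your usual library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data, invented for illustration: dose vs measured response.
dose = [0.0, 1.0, 2.0, 3.0, 4.0]
response = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_line(dose, response)

expected_slope = 2.0  # hypothetical value predicted by an existing theory
print(f"fitted slope {slope:.2f} vs expected {expected_slope:.2f}")
```

A fitted value close to the expected one places the model in agreement with prior knowledge. A large gap is a prompt to look harder at the data or the theory before trusting either.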
In biological domains, understanding and communicating the limitations of a model is very important.
Data science is fundamentally constrained by data, so I’d start there when considering what you can or can’t say about your model’s conclusions.
Was your data local? Or did you analyze a larger global dataset? The latter will allow greater generalizability. Was the data experimental? If so, causal interpretations may be on the table.
How annotated was the data you used to train your model? Could that bias your model?
Understanding the domain and your data is not an accessory to your data science; it’s fundamental to knowing what your model can and can’t do.
You must Collaborate
This goes without saying.
If you want to improve and be challenged, the best way to do that is to work with lots of different people. This is especially true if those people have subject matter expertise in the problems you are interested in.
I don’t think it’s just wishful thinking when I say that lots of people out there are willing to share their subject matter expertise.
Bear in mind, a person who could help you doesn’t need a PhD, only more experience, knowledge or education than you currently have.
Take a risk, ask for help, share a problem. It is hardly an indictment of your character to ask for help, so you really have nothing to lose.
Next steps:
Hopefully, when you have put the effort into understanding these aspects of your problem, you will be in a much better place to solve it.
In my personal experience at Mass Dynamics, thoroughly pinning down both the problem and the solution criteria is immensely useful. Whether it’s for your collaborators, customers or the general public, making an effort to understand your problem domain will pay real dividends and drive success for all involved.
So do what you’d usually do: crunch the data, engineer and select features, find the latent space, create embeddings, train your models, evaluate and tune. Do all this knowing what your goal is and what the true solution criteria are, and with a helping hand to guide you through.
**Special thanks to Ben Harper and others for their help editing this article.
Translated from: https://towardsdatascience.com/data-scientists-think-like-biologists-b681a9795627