Data Science, Generating Value From Data
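1. What is modern data science?
Data science turns data into insights and even actions, and it is the foundation of empirical research. It is dynamic and requires multiple rounds of analysis and evaluation. A data science team needs domain expertise to make sure its insights are sound.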
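2. Why is data science the key to getting value out of data, and where does the growing interest in it come from?
1.As data sources proliferate, the ability to acquire data grows
2.The influx of data has driven demand for the ability to process big data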
3.List a recommended set of skills for a data scientist.
Why Python for Data Science
1.Some of the traits of modern data scientists
2. Why Python?
3. Four major Python modules useful for data analysis
1.Jupyter Notebooks, followed by NumPy and Pandas, to ingest and analyze data efficiently.
2.Matplotlib as a visualization library.
3.Machine learning libraries such as Scikit-Learn to create models.
4.BeautifulSoup to easily read XML and HTML-type data.
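As a rough illustration of how these libraries fit together, the sketch below reads a small HTML table with BeautifulSoup and hands it to Pandas. The HTML fragment, the class names and the player values are made up for the example and are not part of the course material.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment; the tags, class names and values are assumptions.
html = """
<table>
  <tr><td class="name">Player A</td><td class="rating">82</td></tr>
  <tr><td class="name">Player B</td><td class="rating">75</td></tr>
</table>
"""

# "html.parser" is the parser that ships with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Turn every table row into a plain dictionary.
rows = [
    {
        "name": tr.find("td", class_="name").get_text(),
        "rating": int(tr.find("td", class_="rating").get_text()),
    }
    for tr in soup.find_all("tr")
]

# Load the parsed rows into a DataFrame for analysis.
players = pd.DataFrame(rows)
mean_rating = players["rating"].mean()
print(players)
print("mean rating:", mean_rating)
```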
Case Study, Soccer Data Analysis
1.Big Picture of data science through a soccer case study
2.Generate statistics about a soccer data set
e.g. xx.describe().transpose()
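A minimal sketch of that call, assuming a small made-up DataFrame in place of the soccer data set (the column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the soccer data; column names are assumptions.
df = pd.DataFrame({
    "overall_rating": [82, 75, 68, 90],
    "potential": [85, 80, 70, 92],
    "stamina": [70, 65, 80, 75],
})

# describe() returns one column per attribute and one row per statistic
# (count, mean, std, min, quartiles, max). transpose() flips the result so
# that each attribute becomes a row, which is easier to read when the data
# set has many columns.
summary = df.describe().transpose()
print(summary)
```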
3.Summarize how data cleaning and correlations were applied to an existing data set
Cleaning can also build on the statistical analysis, for example by removing outliers and missing values or, more generally, weeding out unwanted data.
Where fields are missing, you can instead impute them with known aggregate values such as the mean of the column.
Python offers data cleaning functions to help with general data cleaning tasks like finding and removing null values.
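The cleaning options above can be sketched with Pandas as follows. The DataFrame is a made-up example, and whether to drop rows or impute them depends on the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values; names and values are assumptions.
df = pd.DataFrame({
    "overall_rating": [82.0, np.nan, 68.0, 90.0],
    "stamina": [70.0, 65.0, np.nan, 75.0],
})

# Find nulls: count the missing values in each column.
null_counts = df.isnull().sum()

# Option 1: remove every row that contains at least one null.
dropped = df.dropna()

# Option 2: impute each null with the mean of its column.
imputed = df.fillna(df.mean())

print(null_counts)
print(dropped)
print(imputed)
```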
4.Explain how clustering similar groups and plotting these clusters help the case study
Feature selection is about selecting the attributes that have the greatest impact on the problem you are solving. It requires some domain knowledge to narrow down the number of features.
Narrowing the features has several benefits: the models are easier to interpret, they train much faster, and they are more likely to generalize well to new scenarios.
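One possible way to combine feature selection with clustering is sketched below, assuming synthetic data. The column names, the 0.3 correlation threshold and the choice of three clusters are all assumptions made for illustration, not values from the case study.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic, hypothetical player data; none of it comes from the case study.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "speed": rng.normal(60, 10, n),
    "shooting": rng.normal(60, 10, n),
    "shirt_number": rng.integers(1, 100, n).astype(float),  # unrelated attribute
})
# The target depends on speed and shooting only.
df["overall_rating"] = 0.5 * df["speed"] + 0.5 * df["shooting"] + rng.normal(0, 2, n)

# Feature selection: keep attributes whose correlation with the target is
# stronger than an (assumed) threshold of 0.3 in absolute value.
correlations = df.drop(columns="overall_rating").corrwith(df["overall_rating"])
selected = correlations[correlations.abs() > 0.3].index.tolist()

# Clustering: group similar players using only the selected features.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[selected])
df["cluster"] = kmeans.labels_

# Per-cluster averages summarize what distinguishes each group.
cluster_means = df.groupby("cluster")[selected].mean()
print(correlations)
print(cluster_means)
```

The per-cluster means in `cluster_means` are the kind of summary that could then be plotted with Matplotlib to compare the groups.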
5.Recall what was used to draw conclusions based on this data analysis
The five steps of data analysis:
The process involves acquiring data, preparing it, analyzing it, and reporting the results, which can then be used for data-driven actions.