Data Warehouse Modeling Theory

The data field has reached a point where keeping up with the different technologies, and with the different steps of using and processing data, has become a job in itself; applying them in practice even more so. There is the collection of the data, the storage, the cleaning, the analysis and the use: each of these steps has multiple tools and programming languages to choose from and many different ways to handle the data, in the cloud or on corporate servers, all aiming for the same goal. Recently I had the chance to get my hands dirty with data warehouses.

Introduction for dummies

One simple definition of a data warehouse (DWH) is a database dedicated to data analysis and reporting that maintains the history of the data. DWH applications are often called OLAP (Online Analytical Processing) systems; they are not optimized for transaction processing, which is the domain of OLTP (Online Transaction Processing) systems. The data is usually loaded through an ETL (Extract, Transform, Load) process from different sources such as OLTP applications, external data providers or mainframe applications. Users can typically read the data but not write it, and they often perform time-related analysis.

A data warehouse is typically described as a collection of different data sources, transformed so that they are easy to query and maintain and adapted to the analysis needs of the business; it is the foundation of a business intelligence environment.

The advantages of building a DWH are many, but some of the most important ones are:

  • consolidation of different data sources
  • easy querying of and access to the data
  • historical data preserved through the staging step
  • profiling of the data for different business units

The general schema

As with any other process, there is a source and there is a destination. In the case of data warehouses the data sources can be just about anything: the idea is to bring together many different sources, be they flat Excel files or operational systems, feed them into the warehouse and so prepare them for analysis by the end users. There are, however, some intermediate steps in the process of building the warehouse.

Image by Author

Between the data sources and the warehouse sits the staging area. Its purpose is to act as a bridge between the sources and the warehouse: it collects the data from the different sources, copies them and stores them together with some system data. The staging tables are often loaded daily but can be loaded even more frequently. The staging data also brings some perks:

  • data history: data is stored over time, capturing its history

  • isolation: the data warehouse is separated from the data sources
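
The staging step described above can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module: the table name `stg_customer`, the function `load_to_staging` and the two system columns (`source_system`, `loaded_at`) are assumptions made for the example, not something prescribed by the text.

```python
import sqlite3
from datetime import datetime, timezone


def load_to_staging(conn, source_name, rows):
    """Copy source rows into staging unchanged, adding system columns."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS stg_customer (
               customer_id   INTEGER,
               name          TEXT,
               source_system TEXT,  -- assumed system column: origin of the row
               loaded_at     TEXT   -- assumed system column: load timestamp
           )"""
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO stg_customer VALUES (?, ?, ?, ?)",
        [(r["customer_id"], r["name"], source_name, loaded_at) for r in rows],
    )
    conn.commit()
    return loaded_at


# Two consecutive loads of the same (hypothetical) source record.
conn = sqlite3.connect(":memory:")
load_to_staging(conn, "crm", [{"customer_id": 1, "name": "Ada"}])
load_to_staging(conn, "crm", [{"customer_id": 1, "name": "Ada Lovelace"}])

# Rows are appended, never overwritten, so both versions are kept.
history = conn.execute(
    "SELECT name, source_system FROM stg_customer "
    "WHERE customer_id = 1 ORDER BY rowid"
).fetchall()
```

Because each load only appends rows, the staging table keeps every version of a record, which is where the data history perk comes from. The warehouse reads from staging rather than from the sources, which gives the isolation.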

Right after comes the actual warehouse, which consists of a dimensional model. A DWH has a broad view, as it holds data from all the different units within a company and is thus the central point for structured data. For this reason there is another possible step before reaching the end user: the data mart.

A data mart is a subset of the DWH that has pre-processed data and is aimed at a business unit for analysis.

A data mart separates the data of the different units, which reduces the amount of data to consider, makes the data easier to access and isolates each unit's data from the rest. The presence of data marts is, however, optional.

With or without data marts, the processed and well-organized data is used for its original purpose: analyzing the data to find useful facts and insights, building reports on top of it to visualize the extracted information, or doing more complicated things such as data mining.

The dimensional model

The data model of a data warehouse is called a dimensional model. The name comes from the fact that the model consists of dimensions and facts.

Fact tables are the tables that measure the performance of the business and how it changes over time.

These tables contain two types of columns: facts and foreign/surrogate keys to dimension tables. They can contain additive, semi-additive and non-additive measures, and usually take up as much as 90% of the space in the DWH. Fact tables come in three types: transactional, periodic snapshot and accumulating snapshot. Examples of fact tables are all the tables that trace periodic (daily, monthly, quarterly, etc.) activities within the company, such as employees' daily activities, sales, ticket signing and so on.
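
To make the difference between additive and semi-additive measures more concrete, here is a small sketch in plain Python. The rows, the column layout and the account example are invented for illustration; the assumption is a periodic snapshot fact table holding a sales amount (additive) and an account balance (semi-additive).

```python
# Hypothetical periodic snapshot rows: (date, account, sales_amount, balance)
rows = [
    ("2024-01-01", "A", 100.0, 500.0),
    ("2024-01-02", "A", 50.0, 550.0),
    ("2024-01-01", "B", 30.0, 200.0),
    ("2024-01-02", "B", 20.0, 220.0),
]

# Additive measure: sales_amount can be summed across every dimension,
# including time.
total_sales = sum(sales for _, _, sales, _ in rows)

# Semi-additive measure: balance can be summed across accounts but not
# across dates. Here we keep the latest balance per account, then sum.
latest_balance = {}
for _, account, _, balance in sorted(rows):  # sorted by date first
    latest_balance[account] = balance
closing_balance = sum(latest_balance.values())

# Summing balances over time would count the same money twice.
naive_balance_sum = sum(balance for _, _, _, balance in rows)
```

A non-additive measure, such as a ratio or a unit price, cannot be meaningfully summed along any dimension and is usually recomputed from its additive components.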

Dimension tables are tables that provide the descriptive context for fact tables; they have descriptive attributes for filtering records, grouping records and labeling reports.

Dimension tables provide well-structured information to the fact tables, which means that their main purpose is to support filtering, grouping and labeling of the data. Examples of dimensions are people, hierarchical organizations of people, products, places, etc.

The first step in designing a dimensional model is to identify the two types of tables, their grain and the relationships that exist among them. Both types of table are advised to have surrogate keys and to follow a clean naming convention. One important aspect to consider when designing these tables is Slowly Changing Dimensions (SCD): dimensions that keep track of attributes that change over time. For example, in a table that stores employees' positions it is important to track how positions change over time, so as to know the start and end date of each one. There are 6 different types of SCD, and the choice of which one to adopt for your model depends on the business needs and on the analysis that will later be done on the data.
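
As a rough illustration of the employee position example, the sketch below follows the commonly used "Type 2" approach, where every change closes the current row and adds a new one. The class `EmployeePositionRow`, the function `apply_position_change` and the column names are hypothetical, and an in-memory list stands in for the dimension table.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class EmployeePositionRow:
    surrogate_key: int                # warehouse-generated key
    employee_id: str                  # natural (business) key
    position: str
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current row


def apply_position_change(
    rows: List[EmployeePositionRow],
    employee_id: str,
    new_position: str,
    change_date: date,
) -> List[EmployeePositionRow]:
    """Close the current row (if any) and append a new current row."""
    current = next(
        (r for r in rows if r.employee_id == employee_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.position == new_position:
            return rows  # nothing changed, keep history as it is
        current.valid_to = change_date
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    rows.append(
        EmployeePositionRow(next_key, employee_id, new_position, change_date)
    )
    return rows


history: List[EmployeePositionRow] = []
apply_position_change(history, "E1", "Analyst", date(2020, 1, 1))
apply_position_change(history, "E1", "Manager", date(2022, 6, 1))
```

After the two calls the list holds one closed row for the Analyst period and one open row for the Manager period, so the start and end date of each position can be read directly. Note that the surrogate key changes with every version while the natural key stays the same, which is one reason surrogate keys are recommended.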

Dummy star schema example (Image by Author)

The star schema is the most basic and effective schema for a data model: the fact tables sit in the center, connected to the surrounding dimensions.
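
A star schema along these lines can be sketched with Python's built-in sqlite3 module. All table names, column names and values below are invented for the example: one fact table (`fact_sales`) in the center and two dimensions (`dim_product`, `dim_date`) around it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,  -- surrogate key
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,     -- surrogate key
        full_date TEXT,
        month     TEXT
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER,               -- fact
        amount      REAL                   -- fact
    );
    INSERT INTO dim_product VALUES
        (1, 'Laptop', 'Hardware'),
        (2, 'Mouse', 'Hardware'),
        (3, 'Antivirus', 'Software');
    INSERT INTO dim_date VALUES
        (20240101, '2024-01-01', '2024-01'),
        (20240201, '2024-02-01', '2024-02');
    INSERT INTO fact_sales VALUES
        (1, 20240101, 1, 1000.0),
        (2, 20240101, 2, 40.0),
        (3, 20240201, 1, 60.0),
        (1, 20240201, 1, 1000.0);
    """
)

# The dimension supplies the grouping label, the fact table the measure.
sales_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
```

Each analysis question needs only one join per dimension involved, which is what keeps queries on a star schema simple.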

There is another schema for the model, called the snowflake schema.

Dummy snowflake example (Image by Author)

The snowflake schema is quite similar to the star schema, with the difference that dimensions are also connected to one another.

The star schema is a special case of the snowflake schema. The snowflake schema affects only the dimensions; the fact tables do not change with respect to the star schema. Dimensions are kept in a normalized form to reduce redundancy, which makes them easier to maintain and reduces storage space, but it also increases the number of joins needed when querying the data.
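
The trade-off can be seen in a small sqlite3 sketch. The tables and values are hypothetical: the product category, which a star schema would store as a plain column of the product dimension, is moved into its own normalized table `dim_category`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized dimension: each category name is stored only once.
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    -- The fact table looks the same as it would in a star schema.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        amount      REAL
    );
    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product VALUES
        (1, 'Laptop', 1), (2, 'Mouse', 1), (3, 'Antivirus', 2);
    INSERT INTO fact_sales VALUES (1, 1000.0), (2, 40.0), (3, 60.0);
    """
)

# Grouping by category now takes two joins instead of one.
sales_by_category = conn.execute(
    """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
    """
).fetchall()
```

Renaming a category now means updating a single row in `dim_category`, at the cost of the extra join in every query that groups by category.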

Dummy bus matrix example (Image by Author)

When all is said and done, as per Kimball's advice, and as almost any kind of documentation requires, a bus matrix can be produced. It once again displays the relationships between the tables, in this case emphasizing the distinction between facts and dimensions.
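
A bus matrix is simple enough to generate from a plain mapping of fact tables to the dimensions they reference. The sketch below uses made-up table names and only shows one possible layout: fact tables as rows, dimensions as columns and an "X" wherever a fact table uses a dimension.

```python
# Hypothetical model: which dimensions each fact table references.
fact_dimensions = {
    "fact_sales": ["date", "product", "customer"],
    "fact_employee_activity": ["date", "employee"],
    "fact_tickets": ["date", "customer", "employee"],
}

# Column order of the matrix: every dimension once, sorted by name.
dimensions = sorted({d for dims in fact_dimensions.values() for d in dims})


def bus_matrix(fact_dimensions, dimensions):
    """Return one row per fact table with an 'X' for each dimension used."""
    return {
        fact: ["X" if d in dims else "" for d in dimensions]
        for fact, dims in fact_dimensions.items()
    }


matrix = bus_matrix(fact_dimensions, dimensions)

# Print the matrix as fixed-width text.
width = max(len(name) for name in fact_dimensions)
print(" " * width, *(d.ljust(9) for d in dimensions))
for fact, marks in matrix.items():
    print(fact.ljust(width), *(m.ljust(9) for m in marks))
```

Dimensions shared by several fact tables, such as `date` here, show up as columns with many marks, which makes them easy to spot when reviewing the model.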

To keep in mind

As with anything else in the data world, there is no golden rule that works in all cases, so very specific guidelines cannot be given: the design depends on the data at hand as much as on the business need. What to take into account when building a data warehouse is that a large part of the time before building the model is spent understanding the available data and what you and the business units aim to obtain from it. Only then can you think of a model, how to structure each table and how to connect them. There are, however, two crucial aspects to keep in mind:

  • Adaptation to change: when building a DWH there can be specific requirements from the business side, but as a data engineer you have to consider possible evolution and different future scenarios, which affects the decisions taken when designing the model.

  • Dependence on business approval: the end users of the DWH or of the data marts are also the ones who provide the requirements. The model is considered validated only when it meets the needs of the business units involved.

Sources:

  • Kimball, Ross: The Data Warehouse Toolkit
  • O’Reilly Learning: Agile Data Warehouse Design

Translated from: https://towardsdatascience.com/data-warehouses-the-theory-2f0481eb5af8
