dim_marketing_channel
和 dim_marketing_source
维度表时,可能存在多个记录表示相同的营销渠道信息。DISTINCT
关键字去除重复记录。例如,在 SQL 中:CREATE TABLE dim_marketing_info AS
SELECT DISTINCT
mc.channel_id AS id,
mc.channel_name,
ms.source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id;
CREATE TABLE dim_marketing_info (
id INT PRIMARY KEY,
channel_name VARCHAR(100),
source_name VARCHAR(100)
);
这样在插入数据时,如果出现重复的 id
,数据库会抛出异常,可根据异常处理机制解决。
CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS id,
UPPER(mc.channel_name) AS channel_name,
UPPER(ms.source_name) AS source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id;
dim_marketing_channel
和 dim_marketing_source
中对同一渠道的描述不同,可根据业务规则选择其中一个或组合信息。CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS id,
CASE
WHEN mc.channel_name IS NOT NULL THEN mc.channel_name
ELSE ms.channel_name
END AS channel_name,
CASE
WHEN ms.source_name IS NOT NULL THEN ms.source_name
ELSE mc.source_name
END AS source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id;
id
字段,但代表不同含义。CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS channel_id,
mc.channel_name,
ms.id AS source_id,
ms.source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id;
dim_marketing_channel
的 id
字段,而不使用 dim_marketing_source
的 id
字段:CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS id,
mc.channel_name,
ms.source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id;
MAX
函数选择最新时间戳:CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS id,
mc.channel_name,
ms.source_name,
MAX(mc.update_timestamp) AS update_timestamp
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id
GROUP BY
mc.channel_id, mc.channel_name, ms.source_name;
-- 创建历史表存储旧数据
CREATE TABLE dim_marketing_info_historical (
id INT,
channel_name VARCHAR(100),
source_name VARCHAR(100),
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN
);
-- 新数据存储在主表
CREATE TABLE dim_marketing_info (
id INT,
channel_name VARCHAR(100),
source_name VARCHAR(100),
valid_from TIMESTAMP,
is_current BOOLEAN
);
COALESCE
或 IFNULL
函数补全缺失数据。例如:CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS id,
mc.channel_name,
COALESCE(ms.source_name, 'Unknown') AS source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id;
ods_marketing_data
表补充信息:CREATE TABLE dim_marketing_info AS
SELECT
mc.channel_id AS id,
mc.channel_name,
COALESCE(ms.source_name, od.source_name) AS source_name
FROM
dim_marketing_channel mc
JOIN
dim_marketing_source ms ON mc.channel_id = ms.channel_id
LEFT JOIN
ods_marketing_data od ON mc.channel_id = od.channel_id;
通过以上方法,可以有效地处理维度表合并时可能出现的数据冲突,确保合并后的数据完整性、一致性和准确性,满足业务需求和数据仓库的性能要求。在合并过程中,需要根据实际情况选择合适的处理方法,并进行充分的测试和验证。
如果您觉得本文还不错,对你有帮助,那么不妨可以关注一下我的数字化建设实践之路专栏,这里的内容会更精彩。。
(1)一次收费持续更新。
(2)实战中总结的SQL技巧,帮助SQLBOY 在SQL语言上有质的飞越,无论你应对业务难题及面试都会游刃有余【全网唯一讲SQL实战技巧,方法独特】
(3)实战中技巧总结,让你认识不一样的数仓。【数据建模+业务建模,不一样的认知体系】(如果只懂数据建模而不懂业务建模,数仓体系认知是不全面的)
(4)数字化建设当中遇到难题解决思路及问题思考。
我的专栏具体链接如下: