数据库结果行转列实践

March 06, 2016

在使用数据库进行数据分析的过程中, 我们时常会遇到将数据做行转列的操作. 假设我们有以下数据:

SELECT * FROM foo;

+----+-------+------------+-------+
| id | name  | date       | score |
+----+-------+------------+-------+
| 11 | jacky | 2016-03-06 |    60 |
| 12 | jacky | 2016-03-07 |    43 |
| 13 | jacky | 2016-03-08 |    66 |
| 14 | jacky | 2016-03-09 |    34 |
| 15 | jacky | 2016-03-10 |    20 |
| 16 | mary  | 2016-03-07 |    50 |
| 17 | mary  | 2016-03-08 |    52 |
| 18 | mary  | 2016-03-09 |    54 |
| 19 | mary  | 2016-03-10 |    56 |
| 20 | mary  | 2016-03-06 |    54 |
| 21 | tom   | 2016-03-10 |    99 |
| 22 | tom   | 2016-03-07 |   100 |
+----+-------+------------+-------+

我们需要将其转换为以日期为列的结构, 我们应该如何实现呢?

使用MySQL的GROUP_CONCAT函数

mysql > SELECT name, GROUP_CONCAT(score ORDER BY date SEPARATOR ' ') AS score FROM foo GROUP BY name;

+-------+----------------+
| name  | score          |
+-------+----------------+
| jacky | 60 43 66 34 20 |
| mary  | 54 50 52 54 56 |
| tom   | 100 99         |
+-------+----------------+

GROUP_CONCAT具体用法参考 官方手册

这是一种简单粗暴的方式, 同时存在一些缺陷:

  • 平铺的结果放在一个字段中, 可能仍然需要进行文本加工 (通常使用文本编辑器+Excel加工).
  • 要求原始数据分组后的列数相等, 这样查询的结果可以使用排序 (保证每一列的值正确) 并且结果是列对齐的状态 (name=tom的数据就是错误的状态).

使用MySQL的WHEN CASE语句

1.在SELECT的字段中将作为列的结果的字段作为条件, 并将作为行结果的字段作为值查询出临时结果.

mysql > SELECT foo.*,
        CASE WHEN `date` = '2016-03-06' THEN `score` END AS '2016-03-06',
        CASE WHEN `date` = '2016-03-07' THEN `score` END AS '2016-03-07',
        CASE WHEN `date` = '2016-03-08' THEN `score` END AS '2016-03-08',
        CASE WHEN `date` = '2016-03-09' THEN `score` END AS '2016-03-09',
        CASE WHEN `date` = '2016-03-10' THEN `score` END AS '2016-03-10'
    FROM foo;

+----+-------+------------+-------+------------+------------+------------+------------+------------+
| id | name  | date       | score | 2016-03-06 | 2016-03-07 | 2016-03-08 | 2016-03-09 | 2016-03-10 |
+----+-------+------------+-------+------------+------------+------------+------------+------------+
| 11 | jacky | 2016-03-06 |    60 |         60 |       NULL |       NULL |       NULL |       NULL |
| 12 | jacky | 2016-03-07 |    43 |       NULL |         43 |       NULL |       NULL |       NULL |
| 13 | jacky | 2016-03-08 |    66 |       NULL |       NULL |         66 |       NULL |       NULL |
| 14 | jacky | 2016-03-09 |    34 |       NULL |       NULL |       NULL |         34 |       NULL |
| 15 | jacky | 2016-03-10 |    20 |       NULL |       NULL |       NULL |       NULL |         20 |
| 16 | mary  | 2016-03-07 |    50 |       NULL |         50 |       NULL |       NULL |       NULL |
| 17 | mary  | 2016-03-08 |    52 |       NULL |       NULL |         52 |       NULL |       NULL |
| 18 | mary  | 2016-03-09 |    54 |       NULL |       NULL |       NULL |         54 |       NULL |
| 19 | mary  | 2016-03-10 |    56 |       NULL |       NULL |       NULL |       NULL |         56 |
| 20 | mary  | 2016-03-06 |    54 |         54 |       NULL |       NULL |       NULL |       NULL |
| 21 | tom   | 2016-03-10 |    99 |       NULL |       NULL |       NULL |       NULL |         99 |
| 22 | tom   | 2016-03-07 |   100 |       NULL |        100 |       NULL |       NULL |       NULL |
+----+-------+------------+-------+------------+------------+------------+------------+------------

2.使用上一步的结果按照需要使用的聚合字段聚合, 并且将没有值的字段填充默认值0.

mysql > SELECT
    name,
    COALESCE(SUM(`2016-03-06`), 0) AS '2016-03-06',
    COALESCE(SUM(`2016-03-07`), 0) AS '2016-03-07',
    COALESCE(SUM(`2016-03-08`), 0) AS '2016-03-08',
    COALESCE(SUM(`2016-03-09`), 0) AS '2016-03-09',
    COALESCE(SUM(`2016-03-10`), 0) AS '2016-03-10'
FROM
    (SELECT foo.*,
        CASE WHEN `date` = '2016-03-06' THEN `score` END AS '2016-03-06',
        CASE WHEN `date` = '2016-03-07' THEN `score` END AS '2016-03-07',
        CASE WHEN `date` = '2016-03-08' THEN `score` END AS '2016-03-08',
        CASE WHEN `date` = '2016-03-09' THEN `score` END AS '2016-03-09',
        CASE WHEN `date` = '2016-03-10' THEN `score` END AS '2016-03-10'
    FROM foo) foo_extend
GROUP BY name;

+-------+------------+------------+------------+------------+------------+
| name  | 2016-03-06 | 2016-03-07 | 2016-03-08 | 2016-03-09 | 2016-03-10 |
+-------+------------+------------+------------+------------+------------+
| jacky |         60 |         43 |         66 |         34 |         20 |
| mary  |         54 |         50 |         52 |         54 |         56 |
| tom   |          0 |        100 |          0 |          0 |         99 |
+-------+------------+------------+------------+------------+------------+

这样的做法也存在一些不足之处:

  • 统计的列需要事先指定并且状态值是有限的.

总结

以上是我通过使用数据库直接将结果的行转列的一些方法, 但方法终究只是工具, 重要的是否结果能否满足数据分析的需求, 不要在追求方法的过程中忘记了最终的目标. 很多时候, 将数据导入Excel分析依然是最有效的方法 (笑).

留下评论