在数据库中经常会碰到一些表的列是稀疏列,只有很少的值,例如性别字段,一般就只有2种不同的值。
但是当我们求这些稀疏列的唯一值时,如果表的数据量很大,速度还是会很慢。
例如:
创建测试表
bill=# create table t_sex (sex char(1), otherinfo text); create table bill=# insert into t_sex select 'm', generate_series(1,10000000)||'this is test'; insert 0 10000000 bill=# insert into t_sex select 'w', generate_series(1,10000000)||'this is test'; insert 0 10000000
查询:
可以看到下面的查询速度很慢。
bill=# select count(distinct sex) from t_sex; count ------- 2 (1 row) time: 8803.505 ms (00:08.804) bill=# select sex from t_sex t group by sex; sex ----- m w (2 rows) time: 1026.464 ms (00:01.026)
那么我们对该字段加上索引又是什么情况呢?
速度依然没有明显
bill=# create index idx_sex_1 on t_sex(sex); create index bill=# select count(distinct sex) from t_sex; count ------- 2 (1 row) time: 8502.460 ms (00:08.502) bill=# select sex from t_sex t group by sex; sex ----- m w (2 rows) time: 572.353 ms
的变化,可以看到执行计划已经使用index only scan了。
bill=# explain select count(distinct sex) from t_sex;
query plan
----------------------------------------------------------------------------------------------
aggregate (cost=371996.44..371996.45 rows=1 width=8)
-> index only scan using idx_sex_1 on t_sex (cost=0.44..321996.44 rows=20000000 width=2)
(2 rows)
同样的sql我们看看在oracle中性能如何?
创建测试表:
sql> create table t_sex (sex char(1), otherinfo varchar2(100)); table created. sql> insert into t_sex select 'm', rownum||'this is test' from dual connect by level <=10000000; 10000000 rows created. sql> commit; commit complete. sql> insert into t_sex select 'w', rownum||'this is test' from dual connect by level <=10000000; 10000000 rows created. sql> commit; commit complete.
性能测试:
sql> set lines 1000 pages 2000
sql> set autotrace on
sql> set timing on
sql> select count(distinct sex) from t_sex;
count(distinctsex)
------------------
2
elapsed: 00:00:01.58
execution plan
----------------------------------------------------------
plan hash value: 3915432945
----------------------------------------------------------------------------
| id | operation | name | rows | bytes | cost (%cpu)| time |
----------------------------------------------------------------------------
| 0 | select statement | | 1 | 3 | 20132 (1)| 00:00:01 |
| 1 | sort group by | | 1 | 3 | | |
| 2 | table access full| t_sex | 14m| 42m| 20132 (1)| 00:00:01 |
----------------------------------------------------------------------------
note
-----
- dynamic statistics used: dynamic sampling (level=2)
statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
74074 consistent gets
0 physical reads
0 redo size
552 bytes sent via sql*net to client
608 bytes received via sql*net from client
2 sql*net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1 rows processed
sql> select sex from t_sex t group by sex;
se
--
m
w
elapsed: 00:00:01.08
execution plan
----------------------------------------------------------
plan hash value: 3915432945
----------------------------------------------------------------------------
| id | operation | name | rows | bytes | cost (%cpu)| time |
----------------------------------------------------------------------------
| 0 | select statement | | 14m| 42m| 20558 (3)| 00:00:01 |
| 1 | sort group by | | 14m| 42m| 20558 (3)| 00:00:01 |
| 2 | table access full| t_sex | 14m| 42m| 20132 (1)| 00:00:01 |
----------------------------------------------------------------------------
note
-----
- dynamic statistics used: dynamic sampling (level=2)
statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
74074 consistent gets
0 physical reads
0 redo size
589 bytes sent via sql*net to client
608 bytes received via sql*net from client
2 sql*net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2 rows processed
可以看到oracle的性能即使不加索引也明显比postgresql中要好。
那么我们在postgresql中是不是没办法继续优化了呢?这种情况我们利用pg中的递归语句结合索引可以大幅提升性能。
sql改写:
bill=# with recursive tmp as ( bill(# ( bill(# select min(t.sex) as sex from t_sex t where t.sex is not null bill(# ) bill(# union all bill(# ( bill(# select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null) bill(# from tmp s where s.sex is not null bill(# ) bill(# ) bill-# select count(distinct sex) from tmp; count ------- 2 (1 row) time: 2.711 ms
查看执行计划:
bill=# explain with recursive tmp as (
bill(# (
bill(# select min(t.sex) as sex from t_sex t where t.sex is not null
bill(# )
bill(# union all
bill(# (
bill(# select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)
bill(# from tmp s where s.sex is not null
bill(# )
bill(# )
bill-# select count(distinct sex) from tmp;
query plan
----------------------------------------------------------------------------------------------------------------------
aggregate (cost=53.62..53.63 rows=1 width=8)
cte tmp
-> recursive union (cost=0.46..51.35 rows=101 width=32)
-> result (cost=0.46..0.47 rows=1 width=32)
initplan 3 (returns $1)
-> limit (cost=0.44..0.46 rows=1 width=2)
-> index only scan using idx_sex_1 on t_sex t (cost=0.44..371996.44 rows=20000000 width=2)
index cond: (sex is not null)
-> worktable scan on tmp s (cost=0.00..4.89 rows=10 width=32)
filter: (sex is not null)
-> cte scan on tmp (cost=0.00..2.02 rows=101 width=32)
(11 rows)
time: 1.371 ms
可以看到执行时间从原先的8000ms降低到了2ms,提升了几千倍!
甚至对比oracle,性能也是提升了很多。
但是需要注意的是:这种写法仅仅是针对稀疏列,换成数据分布广泛的字段,显然性能是下降的, 所以使用递归sql不适合数据分布广泛的字段的group by或者count(distinct)操作。
到此这篇关于postgresql利用递归优化求稀疏列唯一值的文章就介绍到这了,更多相关postgresql递归优化内容请搜索www.887551.com以前的文章或继续浏览下面的相关文章希望大家以后多多支持www.887551.com!