character sort

oracle通常有两种排序方法,binary sort和linguistic sort。binary根据字符集的二进制编码去排序,比如英文字母根据ASCII的编码标准去排序;多字节字符会根据其字符的内部编码去排序,像汉字根据汉字内码来排序;而linguistic排序分两种情况,a) 单字符环境,也就是排序字段只包含一种字符集的数据,这个时候会根据字符集的默认排序标准排列数据,比如汉字默认采用拼音码来排序;b)多字符环境, 也就是排序字段包含了两种或以上字符集的数据,这个时候oracle会采用一种叫做分级的排序方法,什么意思呢?就是Oracle会根据第一层的方法去排序,如果第一层方法排序之后的顺序相同会采用第二种方法,一次类推,最多只能到第三层。各语言都提供了自己的排序方法 。

各字符集排序方法http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#i637232

binary sort、linguistic sort http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/ch5lingsort.htm#i1008187

现在我们只单独考虑汉字排序问题,影响汉字排序有以下两个因素:


  • 字符集

  • NLS_SORT

如果数据库字符集采用了中文字符集,比如(ZHS16GBK,ZHS16CGB231201等),默认情况下中文排序会按各汉字的拼音字母顺序排序,但同时还受NLS_SORT参数的限制。NLS_SORT可以设置排序的类别和具体的排序方法,默认为BINARY。我们常用的简体中文如果根据linguistic来排序,可用的方法有(9i后新增):

SCHINESE_RADICAL_M
Simplified Chinese sort based on radical as primary order and number of strokes order as secondary order
SCHINESE_STROKE_M
Simplified Chinese sort uses number of strokes as primary order and radical as secondary order
SCHINESE_PINYIN_M
Simplified Chinese PinYin sorting order
TCHINESE_RADICAL_M
Traditional Chinese sort based on radical as primary order and number of strokes order as secondary order
TCHINESE_STROKE_M
Traditional Chinese sort uses number of strokes as primary order and radical as secondary order. It supports supplementary characters

有以下测试数据。

create table chars(id number,region varchar2(20));
insert into chars values(1,'陕西');
insert into chars values(2,'北京');
insert into chars values(3,'北京西');
insert into chars values(4,'上海');
commit;

我们希望region列按拼音码的顺序来排序 :

SQL> select * from chars order by region;

        ID REGION
---------- ----------------------------------------
         4 上海
         2 北京
         3 北京西
         1 陕西

这个排序结果是不正确的。当前session的nls参数:

SQL> col parameter format a30
SQL> col value format a20
SQL> select * from nls_session_parameters;

PARAMETER                      VALUE
------------------------------ --------------------
NLS_LANGUAGE                   AMERICAN
NLS_TERRITORY                  AMERICA
NLS_CURRENCY                   $
NLS_ISO_CURRENCY               AMERICA
NLS_NUMERIC_CHARACTERS         .,
NLS_CALENDAR                   GREGORIAN
NLS_DATE_FORMAT                yyyy-mm-dd hh24:mi:s
                               s

NLS_DATE_LANGUAGE              AMERICAN
NLS_SORT                       BINARY

PARAMETER                      VALUE
------------------------------ --------------------
NLS_TIME_FORMAT                HH.MI.SSXFF AM
NLS_TIMESTAMP_FORMAT           DD-MON-RR HH.MI.SSXF
                               F AM

NLS_TIME_TZ_FORMAT             HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT        DD-MON-RR HH.MI.SSXF
                               F AM TZR

NLS_DUAL_CURRENCY              $
NLS_COMP                       BINARY
NLS_LENGTH_SEMANTICS           BYTE

PARAMETER                      VALUE
------------------------------ --------------------
NLS_NCHAR_CONV_EXCP            FALSE

17 rows selected.

这里默认的排序方法为BINARY。LANGUAGE为AMERICAN,数据库字符集为AL32UTF8:

SQL> select * from nls_database_parameters;

PARAMETER                      VALUE
------------------------------ --------------------
NLS_LANGUAGE                   AMERICAN
NLS_TERRITORY                  AMERICA
NLS_CURRENCY                   $
NLS_ISO_CURRENCY               AMERICA
NLS_NUMERIC_CHARACTERS         .,
NLS_CHARACTERSET               AL32UTF8

修改当前session的NLS_SORT参数为拼音排序发:

SQL> alter session set nls_sort=SCHINESE_PINYIN_M;

Session altered.

SQL> select * from chars order by region;

        ID REGION
---------- ----------------------------------------
         2 北京
         3 北京西
         1 陕西
         4 上海

SQL>
SQL> insert into chars values(5,'陕北');     

1 row created.

SQL> commit;

Commit complete.

SQL> select * from chars order by region;

        ID REGION
---------- ----------------------------------------
         2 北京
         3 北京西
         5 陕北
         1 陕西
         4 上海

alter session修改了整个session的排序方法,这样可能会影响其他排序,通过NLSSORT函数指定只对当前SQL排序的方法 :

SQL> select * from chars order by nlssort(region,'NLS_SORT=SCHINESE_PINYIN_M');

        ID REGION
---------- ----------------------------------------
         2 北京
         3 北京西
         5 陕北
         1 陕西
         4 上海

在接着看下面的的排序

SQL> insert into chars values(6,'How Are You');

1 row created.

SQL> insert into chars values(7,null);

1 row created.

SQL> insert into chars values(8,'Are you ok');

1 row created.

SQL> commit;

Commit complete.

SQL> select * from chars order by region;

        ID REGION
---------- ----------------------------------------
         8 Are you ok
         6 How Are You
         2 北京
         3 北京西
         5 陕北
         1 陕西
         4 上海
         7

8 rows selected.

混合数据中,首先按单字节的字符排序;然后排序多字节字符,多字节字符根据设定的排序方法来排序。

随机日志

Tags:

Leave a Reply