#310 open
pluskid

regexp should consider $KCODE

Reported by pluskid | February 7th, 2008 @ 03:46 PM | in 1.0 preview

The second optional argument of Regexp.new (lang, the third parameter overall) can be used to indicate the language/encoding.

The default behavior:

re = Regexp.new(".")
str = "中文" # or "\344\270\255\346\226\207" in utf-8 encoding
re.match(str)[0] # => "\344"

When specifying the language/encoding:

re = Regexp.new(".", nil, 'u')
str = "中文" # or "\344\270\255\346\226\207" in utf-8 encoding
re.match(str)[0] # => "\344\270\255"

However, there's a global variable $KCODE that indicates the current language/encoding. When it is set, the regexp should behave according to it, thus:

$KCODE = 'u'
re = Regexp.new(".")
str = "中文" # or "\344\270\255\346\226\207" in utf-8 encoding
re.match(str)[0] # => "中" or "\344\270\255" in utf-8 encoding

This is all Ruby 1.8 behavior. Since Ruby 1.9 gains full Unicode support, the global variable $KCODE is no longer used there. I think Rubinius currently aims mainly at compatibility with Ruby 1.8, so this should be considered.

One way to fix this, I think, is to change the default value for the second optional argument (lang) of Regexp.new from "nil" to "$KCODE".
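To make the proposal concrete, here is a rough sketch of a lang argument that falls back to a global setting. It is not Rubinius code: KcodeSketch, its kcode accessor (a stand-in for $KCODE) and first_match are hypothetical names, and the 1.8 semantics are only emulated on Ruby 1.9+ by relabeling the string's encoding before matching.

```ruby
# Hypothetical emulation of Ruby 1.8's lang/$KCODE handling on Ruby 1.9+.
# KcodeSketch, .kcode and .first_match are illustrative names, not Rubinius APIs.
module KcodeSketch
  # Maps the first letter of a 1.8-style kcode to an encoding.
  ENCODINGS = {
    "n" => Encoding::BINARY,   # none: byte-wise matching
    "u" => Encoding::UTF_8,
    "e" => Encoding::EUC_JP,
    "s" => Encoding::Shift_JIS
  }

  class << self
    attr_accessor :kcode       # stand-in for the global $KCODE
  end
  self.kcode = "NONE"

  # lang defaults to the global setting instead of nil, as proposed above.
  def self.first_match(source, bytes, lang = kcode)
    key = lang.to_s[0, 1].downcase
    encoding = ENCODINGS.fetch(key, Encoding::BINARY)
    subject = bytes.dup.force_encoding(encoding)
    match = Regexp.new(source).match(subject)
    match && match[0].b        # raw bytes, so results are easy to compare
  end
end

bytes = "\xE4\xB8\xAD\xE6\x96\x87".b                     # "中文" encoded as UTF-8
puts KcodeSketch.first_match(".", bytes).bytesize        # prints 1
KcodeSketch.kcode = "u"
puts KcodeSketch.first_match(".", bytes).bytesize        # prints 3
puts KcodeSketch.first_match(".", bytes, "n").bytesize   # prints 1
```

An explicit lang still wins over the global, which matches how Regexp.new(".", nil, 'u') behaves in the examples above.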

I don't know whether Rubinius and Ruby 1.8 use the same regexp engine. But it seems that in Ruby 1.8, even when I set $KCODE to 'u', the code

Regexp.new(".").inspect

will return "/./" but

Regexp.new(".", nil, "u").inspect

returns "/./u". However, with $KCODE set to 'u', the "/./" regexp still matches a multi-byte character in Ruby 1.8, but fails to do so in Rubinius. So I think a better way may be to patch the regexp engine to take the global variable into account, instead of patching the Regexp.new method.
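The difference between the two approaches can be sketched roughly as follows: the encoding is resolved when matching rather than when the regexp is constructed. This is only an illustration under assumptions. LazyKcodeRegexp and global_kcode (a stand-in for $KCODE) are hypothetical names, only 'u' versus byte-wise matching is handled, and the code emulates the 1.8 behavior on Ruby 1.9+ rather than showing how either engine is implemented.

```ruby
# Hypothetical sketch: resolve the encoding when matching, not when constructing.
# LazyKcodeRegexp and global_kcode are illustrative names only.
class LazyKcodeRegexp
  class << self
    attr_accessor :global_kcode   # stand-in for $KCODE
  end
  self.global_kcode = "NONE"

  def initialize(source, lang = nil)
    @source = source
    @lang = lang                  # nil means "no explicit flag", as in /./
  end

  # The flag appears only when it was given explicitly.
  def inspect
    flag = @lang ? @lang.to_s[0, 1].downcase : ""
    "/#{@source}/#{flag}"
  end

  # The global is read here, so later changes to it affect existing objects.
  def first_match(bytes)
    lang = @lang || self.class.global_kcode
    utf8 = lang.to_s[0, 1].downcase == "u"
    subject = bytes.dup.force_encoding(utf8 ? Encoding::UTF_8 : Encoding::BINARY)
    match = Regexp.new(@source).match(subject)
    match && match[0].b           # raw bytes, so results are easy to compare
  end
end

re = LazyKcodeRegexp.new(".")
bytes = "\xE4\xB8\xAD\xE6\x96\x87".b        # "中文" encoded as UTF-8
LazyKcodeRegexp.global_kcode = "u"          # set after the regexp was created
puts re.inspect                             # prints /./
puts re.first_match(bytes).bytesize         # prints 3
```

With this approach inspect keeps returning "/./" while matching still follows the global, which is the Ruby 1.8 behavior described above.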
