Unicode Bugs (Chinese)

This topic contains 0 voices and has 7 replies.

Viewing 8 posts - 1 through 8 (of 8 total)
Author Posts
October 10, 2010 at 8:15 am #5106

guipeng
Member

First, thank you for providing PN, such a powerful text editor!

When I used the previous major version, I was puzzled by its Unicode support in display and editing, especially for Chinese. Although Simon said that PN2 would be much better at supporting Unicode, I found that its Chinese support is still not perfect.

For example, save the following text as a txt file: “9日,正影响新疆的冷空气将自西北向东南影响中国中东部地区。受较强冷空气影响,未来三天,中国将出现大范围雨雪和大风降温天气,部分降温可达12℃。新疆、西藏、青海、内蒙古、东北等地部分地区将飘起雪花,广东、海南、云南、西藏、四川、贵州、湖南、广西、江西、福建、内蒙古等地部分地区有大雨或暴雨。” When you open this txt file with PN2, you will find that the program treats each Chinese character as two non-Unicode characters. You can place the cursor inside a Chinese character, delete “half” of it, and find that the modified text looks like gibberish. You will also find that when you select some words in the txt file, unrelated and dissimilar regions of text are highlighted in light green. This interesting behavior does not occur in Chinese HTML or XML files, however.

I don’t know whether a similar problem occurs in files in other Asian languages; maybe some volunteers can give it a try.

Thanks, Simon, for such a gorgeous tool all the same. Thank you for your dedication to PN and for your attention to this post!

October 14, 2010 at 9:06 pm #17385

simon
Key Master

Hi,

Thanks for this. Can you please tell me which version you are using? Is it 2.1.5?

Also, please send me sample files that show these problems, or attach them to a bug report, so that I can use them to reproduce the problems.

Thanks.

October 15, 2010 at 6:09 am #17386

guipeng
Member

Hi, Simon. Thank you for your attention.

I’m using PN 2.1.5.2222-devel. You can download the txt sample file here: http://sn.im/1b5m2o or http://cid-9544c68861be512d.office.live.com/self.aspx/.Public/cn^_sample.txt

October 15, 2010 at 3:17 pm #17387

horus
Member

I can confirm this bug. However, to be fair, it’s not related to Unicode. Your sample file is in an ANSI encoding, GB2312 to be precise. You don’t have this problem when the text file is in a real Unicode encoding (UTF-8 or UTF-16).
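To make the distinction concrete, here is a minimal sketch (not from the original posts) that encodes the first two characters of the sample text with Python’s standard codecs. It only illustrates that GB2312 stores a Chinese character as two bytes, while the Unicode encodings use a different layout; the variable names are made up for the example.

```python
# Compare how the same text is stored in a legacy code page and in Unicode encodings.
sample = "9日"  # first two characters of the sample text in this thread

gb_bytes = sample.encode("gb2312")        # legacy double-byte ("ANSI") code page
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte order mark

for name, data in [("gb2312", gb_bytes), ("utf-8", utf8_bytes), ("utf-16-le", utf16_bytes)]:
    # Print the byte count and a hex dump for each encoding.
    print(f"{name:10} {len(data)} bytes  {data.hex(' ')}")
```

An editor that walks a GB2312 buffer one byte at a time, without knowing the code page, sees the Chinese character as two separate units.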

Back to the bug:

In order to reproduce the problem, first, Windows has to be set to the “Chinese (PRC)” locale.

Secondly, put the cursor between two characters (e.g. 9日 at the very beginning of the file), press the right arrow key once and enter a letter (e.g. a). In most situations, this inserts the letter in a “half-character” position, demonstrating guipeng’s problem. However, in a few situations (between some other characters), this won’t break the Chinese characters. Since this is a bug, I don’t think it really matters to know exactly between which characters this would break them :)
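The “half-character” insert can be imitated outside the editor. The sketch below is an illustration under assumptions, not PN2’s or Scintilla’s actual code: it treats the buffer as raw GB2312 bytes and inserts a letter at a byte offset that falls inside a double-byte character, then compares that with inserting at a character boundary.

```python
# Imitate a cursor that moves byte by byte through GB2312 data.
text = "9日"
raw = text.encode("gb2312")   # b"9" followed by the two bytes of 日

byte_pos = 2                  # one byte past "9" plus one more: inside 日
broken = raw[:byte_pos] + b"a" + raw[byte_pos:]

try:
    broken.decode("gb2312")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# Decoding leniently shows what a user would see: replacement characters.
garbled = broken.decode("gb2312", errors="replace")

# Inserting at a character boundary keeps the text intact.
intact = text[:1] + "a" + text[1:]

print(decodes_cleanly)
print(repr(garbled))
print(repr(intact))
```

Working on characters rather than bytes (the last insert) is what “treating double-byte chars as a single entity” amounts to.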

IMO, PN2 (or Scintilla) should treat double-byte chars as a single entity. Or, a much better approach (Notepad’s approach): when a file is opened, everything is converted to Unicode internally, so PN2 manipulates the text natively using the Unicode API. Then, at the moment of saving, strings are converted back to ANSI if needed. That way, users don’t really need to know whether their files are in an ANSI or a Unicode encoding.
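The “convert on open, convert back on save” idea can be sketched as follows. This is a hypothetical illustration in Python, not how Notepad or PN2 is implemented; `open_buffer` and `save_buffer` are made-up helper names, and the encoding is passed in explicitly because the bytes alone do not say which code page they use.

```python
def open_buffer(raw: bytes, encoding: str) -> str:
    """Decode file bytes into a Unicode string for editing."""
    return raw.decode(encoding)


def save_buffer(text: str, encoding: str) -> bytes:
    """Encode the edited Unicode string back to the file's encoding."""
    return text.encode(encoding)


file_bytes = "9日冷空气".encode("gb2312")   # stand-in for an ANSI file on disk

buffer = open_buffer(file_bytes, "gb2312")
buffer = buffer[:2] + "a" + buffer[2:]      # edit by character index, never by byte
saved = save_buffer(buffer, "gb2312")

print(repr(saved.decode("gb2312")))
```

Because every edit happens on the decoded string, the cursor can never land inside a double-byte character.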

October 15, 2010 at 3:54 pm #17388

simon
Key Master

Thanks horus.

guipeng, in Options you might like to set Character Set to “GB 2312” and then try again. This tells Programmer’s Notepad that when you open ANSI files it should support GB 2312 for multi-byte characters. This will then work correctly, treating double-byte chars as a single item.

Even if PN converted everything to Unicode for editing, it would still need to be told which encoding an ANSI file was in as there’s no easy way to guess.
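A small sketch of why guessing is hard, using only Python’s standard codecs (the choice of candidate encodings is an assumption made for the example): the same bytes decode without error under several single-byte and double-byte code pages, so a successful decode does not identify the encoding.

```python
# The same bytes are valid in more than one legacy encoding.
raw = "冷空气".encode("gb2312")

candidates = ["gb2312", "cp1252", "latin-1"]
decoded = {enc: raw.decode(enc) for enc in candidates}  # none of these raises

for enc, text in decoded.items():
    print(f"{enc:8} {len(text)} characters  {text!r}")
```

Only one of the results is the intended Chinese text; the others are equally “valid” strings of accented Latin letters.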

October 15, 2010 at 4:18 pm #17389

Tang
Member

Dear Simon,

I changed the “Character Set”, with no luck!

Nothing changed except the display font.

However, once I changed the “Code Page” setting, everything worked correctly!

Is it possible to add an option like “System Default” to make PN adapt to the Windows locale setting automatically?

I think that would be preferable behavior for DBCS users.
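As a rough sketch of what a “System Default” choice could look like, the following Python snippet resolves a setting to a codec name and falls back to the operating system’s preferred encoding. The option value `system-default` and the function `resolve_code_page` are hypothetical names for illustration; only `locale.getpreferredencoding` and `codecs.lookup` are real standard-library calls, and nothing here reflects PN’s actual options code.

```python
import codecs
import locale

SYSTEM_DEFAULT = "system-default"  # hypothetical option value


def resolve_code_page(setting: str) -> str:
    """Return a normalized codec name for an explicit or system-default setting."""
    if setting == SYSTEM_DEFAULT:
        # Ask the OS which encoding the current locale prefers.
        name = locale.getpreferredencoding(False)
    else:
        name = setting
    # codecs.lookup validates the name and returns its canonical form.
    return codecs.lookup(name).name


print(resolve_code_page("GB2312"))
print(resolve_code_page(SYSTEM_DEFAULT))
```

On a Windows system set to the “Chinese (PRC)” locale, the second call would be expected to resolve to the system’s Chinese code page, though the exact name depends on the platform.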

Besides, the “Code Page” options “Simple Chinese GBK” and “Simple Chinese Big5” should be corrected to “Simplified Chinese GBK” and “Traditional Chinese Big5” respectively.

Anyway, thank you for providing such a wonderful program!

October 15, 2010 at 4:48 pm #17390

simon
Key Master

Thank you for the update. I’ve cleared these options up somewhat for the next release. I’ll also try to make the corrections you’ve detailed.

October 16, 2010 at 8:56 am #17391

guipeng
Member

@horus Thank you. It works correctly now.

@simon After setting the “File-Encoding” from ANSI to UTF-8, everything is ok. Thanks. I didn’t notice this before.

@tang The same to you. “Code Page” is a much more useful option than “Character Set”. It would be better if PN could detect the encoding by itself.
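For what automatic detection might involve, here is a minimal heuristic sketch in Python. It is an assumption-laden illustration, not a feature of PN: it checks for a byte order mark, then tries strict UTF-8, and otherwise returns a fallback code page that the caller must supply. The function name `guess_encoding` and the default fallback are made up for the example.

```python
import codecs


def guess_encoding(raw: bytes, fallback: str = "gb2312") -> str:
    """Guess an encoding from a BOM or UTF-8 validity, else use the fallback."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        raw.decode("utf-8")  # strict: raises on byte sequences that are not UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        # The bytes cannot say which ANSI code page they use, so rely on a setting.
        return fallback


print(guess_encoding("9日".encode("utf-8")))
print(guess_encoding("9日".encode("gb2312")))
print(guess_encoding(codecs.BOM_UTF8 + b"abc"))
```

A heuristic like this can still be wrong: some short double-byte sequences happen to be valid UTF-8, and the fallback remains a guess, which is why an explicit “Code Page” setting stays useful.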
