Where to contribute Unicode General Category encoding/decoding

Pander Musubi · Dec 13, 2012

Hi all,

I have written some handy code to encode and decode Unicode General Categories. To which Python package should I contribute it?

Regards,

Pander

Bruno Dupuis · Dec 13, 2012

Hi,

As said in a recent thread (a graph data structure IIRC), talking about
new features is far easier if we can see the code, so anyone can figure out
what the code really does.

Can you provide a public repository URI or something?

Standard library inclusion is not trivial; it mostly happens for well-known,
mature PyPI packages, or for battle-tested code patterns. Therefore, it's
often better to publish a package on PyPI or, if the code is too short, to
submit your handy chunks on ActiveState. If it earns general approval, it
may be included in the Python stdlib.

Cheers

Pander Musubi · Dec 13, 2012

I was expecting PyPI. Here is the code, please advise on where to submit it:
http://pastebin.com/dbzeasyq

Steven D'Aprano · Dec 14, 2012

I was expecting PyPI. Here is the code, please advise on where to submit it:
http://pastebin.com/dbzeasyq

If anywhere, either a third-party module, or the unicodedata standard
library module.

Some unanswered questions:

- when would somebody need this function?

- why is it called "decodeUnicodeGeneralCategory" when it
doesn't seem to have anything to do with decoding?

- why is the parameter "sortable" called sortable, when it
doesn't seem to have anything to do with sorting?

If this is useful at all, it would be more useful to just expose the data
as a dict, and forget about an unnecessary wrapper function:

from collections import namedtuple
r = namedtuple("record", "other name desc")  # better field names needed!

GC = {
    'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
    'Cc': r('Control', 'Control',
            'a C0 or C1 control code'),  # a.k.a. cntrl
    'Cf': r('Format', 'Format', 'a format control character'),
    'Cn': r('Unassigned', 'Unassigned',
            'a reserved unassigned code point or a noncharacter'),
    'Co': r('Private Use', 'Private_Use', 'a private-use character'),
    'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
    'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
    'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
    'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
            'a lowercase letter'),
    'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
    'Lo': r('Letter, Other', 'Other_Letter',
            'other letters, including syllables and ideographs'),
    'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
            'a digraphic character, with first part uppercase'),
    'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
            'an uppercase letter'),
    'M' : r('Mark', 'Mark', 'Mc | Me | Mn'),  # a.k.a. Combining_Mark
    'Mc': r('Mark, Spacing', 'Spacing_Mark',
            'a spacing combining mark (positive advance width)'),
    'Me': r('Mark, Enclosing', 'Enclosing_Mark',
            'an enclosing combining mark'),
    'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
            'a nonspacing combining mark (zero advance width)'),
    'N' : r('Number', 'Number', 'Nd | Nl | No'),
    'Nd': r('Number, Decimal', 'Decimal_Number',
            'a decimal digit'),  # a.k.a. digit
    'Nl': r('Number, Letter', 'Letter_Number',
            'a letterlike numeric character'),
    'No': r('Number, Other', 'Other_Number',
            'a numeric character of other type'),
    'P' : r('Punctuation', 'Punctuation',
            'Pc | Pd | Pe | Pf | Pi | Po | Ps'),  # a.k.a. punct
    'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
            'a connecting punctuation mark, like a tie'),
    'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
            'a dash or hyphen punctuation mark'),
    'Pe': r('Punctuation, Close', 'Close_Punctuation',
            'a closing punctuation mark (of a pair)'),
    'Pf': r('Punctuation, Final', 'Final_Punctuation',
            'a final quotation mark'),
    'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
            'an initial quotation mark'),
    'Po': r('Punctuation, Other', 'Other_Punctuation',
            'a punctuation mark of other type'),
    'Ps': r('Punctuation, Open', 'Open_Punctuation',
            'an opening punctuation mark (of a pair)'),
    'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
    'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
    'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
            'a non-letterlike modifier symbol'),
    'Sm': r('Symbol, Math', 'Math_Symbol',
            'a symbol of mathematical use'),
    'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
    'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
    'Zl': r('Separator, Line', 'Line_Separator',
            'U+2028 LINE SEPARATOR only'),
    'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
            'U+2029 PARAGRAPH SEPARATOR only'),
    'Zs': r('Separator, Space', 'Space_Separator',
            'a space character (of various non-zero widths)'),
}

del r

Usage is then trivially the same as normal dict and attribute access:

py> GC['Ps'].desc
'an opening punctuation mark (of a pair)'
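
A table like this pairs naturally with the standard library's unicodedata.category(), which returns the two-letter code for a character. The following is only a minimal sketch: GC_DESC is a five-entry excerpt of the descriptions above, and the describe() helper is a hypothetical name made up for illustration.

```python
import unicodedata

# Hypothetical excerpt of the table above: category code -> description.
GC_DESC = {
    'Lu': 'an uppercase letter',
    'Ll': 'a lowercase letter',
    'Nd': 'a decimal digit',
    'Ps': 'an opening punctuation mark (of a pair)',
    'Zs': 'a space character (of various non-zero widths)',
}


def describe(char):
    """Return (category code, description) for a single character."""
    code = unicodedata.category(char)
    # Codes outside the excerpt get a placeholder instead of a KeyError.
    return code, GC_DESC.get(code, 'not in this excerpt')


for ch in ('A', 'a', '7', '(', ' '):
    print(repr(ch), *describe(ch))
```

Using dict.get() with a fallback keeps the lookup safe for categories that the excerpt leaves out; with the full table a plain GC[code] would do.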

Pander Musubi · Dec 14, 2012

If anywhere, either a third-party module, or the unicodedata standard library module.

Some unanswered questions:

- when would somebody need this function?

When working with Unicode metadata, see below.

- why is it called "decodeUnicodeGeneralCategory" when it doesn't seem to have anything to do with decoding?

It is actually a simple LUT (lookup table). I like your improvements below.

- why is the parameter "sortable" called sortable, when it doesn't seem to have anything to do with sorting?

The values returned are alphabetically sortable.

Thank you for the improvements. I have a few more dicts along these lines, for example for
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
where this general category is being used. This information is useful when handling Unicode metadata.
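
For context, each line of UnicodeData.txt is a semicolon-separated record in which the first field is the code point in hex and the third field is the General Category code. A rough sketch of pulling those two fields out of one line follows; the general_category() helper is a hypothetical name, and the sample is the file's entry for U+0041.

```python
def general_category(line):
    """Return (code point, general category) from one UnicodeData.txt line."""
    fields = line.rstrip('\n').split(';')
    # Field 0 is the code point in hex, field 2 the General Category code.
    return int(fields[0], 16), fields[2]


# The UnicodeData.txt entry for LATIN CAPITAL LETTER A.
sample = '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'
codepoint, category = general_category(sample)
print(hex(codepoint), category)
```

The resulting code can then be used as a key into a table such as GC.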

I think I will approach both
http://pypi.python.org/pypi/unicodeblocks/
and
http://pypi.python.org/pypi/unicodescript/
to see who will adopt this.

Perhaps it would be in their mutual interest to merge their packages into, e.g., unicodemetadata or something similar. Further ideas on this are still welcome.

Thanks for all your help,

Pander

Pander Musubi · Dec 14, 2012

Ah, it will become a feature request for http://docs.python.org/3/library/unicodedata.html

Pander Musubi · Dec 14, 2012

Please see:
http://bugs.python.org/issue16684
