Where to contribute Unicode General Category encoding/decoding

Pander Musubi

Hi all,

I have created some handy code to encode and decode Unicode General Categories. To which Python Package should I contribute this?

Regards,

Pander
 
Bruno Dupuis

Hi,

As said in a recent thread (about a graph data structure, IIRC), discussing
new features works far better when we can see the code, so anyone can figure
out what the code really does.

Can you provide a public repository URI or something?

Standard library inclusions are not trivial; they mostly happen for well-known,
mature PyPI packages or battle-tested code patterns. Therefore, it's
often better to make a package on PyPI or, if the code is too short, to submit
your handy chunks to ActiveState. If it earns general approval, it
may be included in the Python stdlib.

Cheers
 
Pander Musubi

I was expecting PyPI. Here is the code; please advise on where to submit it:
http://pastebin.com/dbzeasyq
 
Steven D'Aprano

If anywhere, either a third-party module, or the unicodedata standard
library module.


Some unanswered questions:

- when would somebody need this function?

- why is it called "decodeUnicodeGeneralCategory" when it
doesn't seem to have anything to do with decoding?

- why is the parameter "sortable" called sortable, when it
doesn't seem to have anything to do with sorting?


If this is useful at all, it would be more useful to just expose the data
as a dict, and forget about an unnecessary wrapper function:


from collections import namedtuple
r = namedtuple("record", "other name desc")  # better field names needed!

GC = {
    'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
    'Cc': r('Control', 'Control',
            'a C0 or C1 control code'),  # a.k.a. cntrl
    'Cf': r('Format', 'Format', 'a format control character'),
    'Cn': r('Unassigned', 'Unassigned',
            'a reserved unassigned code point or a noncharacter'),
    'Co': r('Private Use', 'Private_Use', 'a private-use character'),
    'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
    'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
    'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
    'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
            'a lowercase letter'),
    'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
    'Lo': r('Letter, Other', 'Other_Letter',
            'other letters, including syllables and ideographs'),
    'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
            'a digraphic character, with first part uppercase'),
    'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
            'an uppercase letter'),
    'M' : r('Mark', 'Mark', 'Mc | Me | Mn '),  # a.k.a. Combining_Mark
    'Mc': r('Mark, Spacing', 'Spacing_Mark',
            'a spacing combining mark (positive advance width)'),
    'Me': r('Mark, Enclosing', 'Enclosing_Mark',
            'an enclosing combining mark'),
    'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
            'a nonspacing combining mark (zero advance width)'),
    'N' : r('Number', 'Number', 'Nd | Nl | No'),
    'Nd': r('Number, Decimal', 'Decimal_Number',
            'a decimal digit'),  # a.k.a. digit
    'Nl': r('Number, Letter', 'Letter_Number',
            'a letterlike numeric character'),
    'No': r('Number, Other', 'Other_Number',
            'a numeric character of other type'),
    'P' : r('Punctuation', 'Punctuation',
            'Pc | Pd | Pe | Pf | Pi | Po | Ps'),  # a.k.a. punct
    'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
            'a connecting punctuation mark, like a tie'),
    'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
            'a dash or hyphen punctuation mark'),
    'Pe': r('Punctuation, Close', 'Close_Punctuation',
            'a closing punctuation mark (of a pair)'),
    'Pf': r('Punctuation, Final', 'Final_Punctuation',
            'a final quotation mark'),
    'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
            'an initial quotation mark'),
    'Po': r('Punctuation, Other', 'Other_Punctuation',
            'a punctuation mark of other type'),
    'Ps': r('Punctuation, Open', 'Open_Punctuation',
            'an opening punctuation mark (of a pair)'),
    'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
    'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
    'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
            'a non-letterlike modifier symbol'),
    'Sm': r('Symbol, Math', 'Math_Symbol',
            'a symbol of mathematical use'),
    'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
    'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
    'Zl': r('Separator, Line', 'Line_Separator',
            'U+2028 LINE SEPARATOR only'),
    'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
            'U+2029 PARAGRAPH SEPARATOR only'),
    'Zs': r('Separator, Space', 'Space_Separator',
            'a space character (of various non-zero widths)'),
}

del r


Usage is then trivially the same as normal dict and attribute access:

py> GC['Ps'].desc
'an opening punctuation mark (of a pair)'
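
To connect such a table to actual text, the category code of a character can be obtained from the standard library's unicodedata.category() and then used as the dict key. What follows is only a minimal sketch: it assumes a table shaped like the GC dict above, reproduces just three of its entries to stay short, and uses a hypothetical helper name (describe) that is not part of any library.

```python
import unicodedata
from collections import namedtuple

# Same record shape as the GC table in the post; only a small subset of entries.
Record = namedtuple("Record", "other name desc")

GC_SUBSET = {
    'Lu': Record('Letter, Uppercase', 'Uppercase_Letter', 'an uppercase letter'),
    'Nd': Record('Number, Decimal', 'Decimal_Number', 'a decimal digit'),
    'Ps': Record('Punctuation, Open', 'Open_Punctuation',
                 'an opening punctuation mark (of a pair)'),
}


def describe(char, table=GC_SUBSET):
    """Return (category code, description) for a single character.

    Hypothetical helper: the description is None when the code is
    missing from the (deliberately incomplete) table.
    """
    code = unicodedata.category(char)  # two-letter General Category code
    record = table.get(code)
    return code, (record.desc if record else None)


for ch in "A7(":
    print(ch, *describe(ch))
```

With the full table in place of GC_SUBSET, the None fallback would only matter for codes the table does not list.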
 
Pander Musubi

> If anywhere, either a third-party module, or the unicodedata standard
> library module.
>
> Some unanswered questions:
>
> - when would somebody need this function?

When working with Unicode metadata, see below.

> - why is it called "decodeUnicodeGeneralCategory" when it
> doesn't seem to have anything to do with decoding?

It is actually a simple lookup table (LUT). I like your improvements.

> - why is the parameter "sortable" called sortable, when it
> doesn't seem to have anything to do with sorting?

The values returned are alphabetically sortable.

Thank you for the improvements. I have some more dicts of this kind, for data such as
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
where this general category is being used. This information is useful when handling Unicode metadata.
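
For context, each line of UnicodeData.txt is a semicolon-separated record: the first field is the code point in hexadecimal, the second the character name, and the third the two-letter General Category code. A rough sketch of pulling that field out, using a sample record for U+0041 instead of downloading the file; parse_line is a hypothetical helper name chosen for illustration.

```python
def parse_line(line):
    """Split one UnicodeData.txt record into (code point, name, category)."""
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)  # field 0: hexadecimal code point
    name = fields[1]                 # field 1: character name
    category = fields[2]             # field 2: General Category, e.g. 'Lu'
    return code_point, name, category


# Sample record for U+0041 in the UnicodeData.txt layout.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

print(parse_line(SAMPLE))
```

The category returned this way is the same kind of key a General Category lookup table expects.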

I think I will approach both
http://pypi.python.org/pypi/unicodeblocks/
and
http://pypi.python.org/pypi/unicodescript/
to see who will adopt this.

Perhaps it might be in their mutual interest to merge their packages into, e.g., unicodemetadata or something similar. Further ideas on this are still welcome.

Thanks for all your help,

Pander
 
Pander Musubi

Ah, it will become a feature request for http://docs.python.org/3/library/unicodedata.html
 
Pander Musubi

Please see:
http://bugs.python.org/issue16684
 
