regex - Perl: How to match FULLWIDTH LATIN SMALL -


I am using listadmin to manage many Mailman-based mailing lists. I have a long list of subjects and addresses set up to block spam. Recently, I have received smarter spam, in the sense that it uses nice-looking Unicode characters, e.g.:

subject: Al l the ad ult mov ies you' ve see n r e nothing c ompari- ng t o our exx xci t ng compilation of 13' 000 mov ies in hd t hat are v ailable y ou now!

or

subject: hd qua lit y vi d eos d pho graph s o f ho t c hic ks
here u

Now I want to use a smart Perl regex to block that. Piping these subjects through hexdump revealed that many of the characters are FULLWIDTH LATIN SMALL LETTER. However, \p{FULLWIDTH LATIN SMALL LETTER} doesn't work: Can't find Unicode property definition "FULLWIDTH LATIN SMALL LETTER"

So my question is: is there a \p{something} that matches the fullwidth characters? Alternatively: is there another way to match these characters?

The page perlunicode documents the available Unicode character classes. I found this reference in perlrebackslash, which documents special character classes and backslash sequences like \p{...} in regexes.

According to the summary of common properties, a property class can be given as a property type and a property value, separated by : or =. However, there does not seem to be any mention of fullwidth characters as a predefined property.

But there is the Block/Blk property, which can have Halfwidth and Fullwidth Forms (U+FF00-U+FFEF) as a value:

/\p{Block=Halfwidth and Fullwidth Forms}/

This will match on your input (tested on v16.3).
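As a minimal sketch of how that regex could be applied to a subject line: the helper name has_fullwidth and the sample subjects below are made up for illustration, and the code assumes the subject has to be decoded from UTF-8 bytes into characters before matching, since \p{...} works on characters, not bytes. It does not deal with MIME-encoded headers.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the (already decoded) subject contains
# any character from the Halfwidth and Fullwidth Forms block, else 0.
sub has_fullwidth {
    my ($subject) = @_;
    return $subject =~ /\p{Block=Halfwidth_And_Fullwidth_Forms}/ ? 1 : 0;
}

# Made-up sample subjects as raw UTF-8 bytes; \xEF\xBD\x81 encodes U+FF41.
my @raw_subjects = (
    "All the \xEF\xBD\x81dult movies",
    "Weekly meeting agenda",
);

binmode STDOUT, ':encoding(UTF-8)';
for my $raw (@raw_subjects) {
    my $subject = decode('UTF-8', $raw);   # bytes -> characters
    printf "%-5s %s\n", (has_fullwidth($subject) ? 'block' : 'keep'), $subject;
}
```

Note that the block also covers halfwidth Katakana and Hangul, so lists with legitimate Japanese or Korean traffic may need a narrower pattern.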


A useful tool is uniprops.

$ uniprops U+FF41
U+FF41 ‹ａ› \N{FULLWIDTH LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    Alnum Alpha Alphabetic Assigned InHalfwidthAndFullwidthForms
    Cased Cased_Letter LC Changes_When_Casemapped CWCM
    Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT
    Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase
    Halfwidth_And_Fullwidth_Forms Hex XDigit Hex_Digit ID_Continue IDC
    ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase
    Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
    X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    X_POSIX_XDigit
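Since uniprops lists both the block and Ll (lowercase letter) for this character, the two properties could be combined to target only the fullwidth small letters rather than the whole block. The following is a sketch of one way to do that with a lookahead; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Characters that are in the Halfwidth and Fullwidth Forms block AND are
# lowercase letters (Ll). The lookahead checks the block, \p{Ll} consumes.
my $fullwidth_lower = qr/(?=\p{Block=Halfwidth_And_Fullwidth_Forms})\p{Ll}/;

# Count how many code points in the block satisfy both conditions.
my $count = grep { chr($_) =~ $fullwidth_lower } 0xFF00 .. 0xFFEF;
print "fullwidth lowercase letters in block: $count\n";
```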

As you can see, \p{Block=Halfwidth and Fullwidth Forms} can also be written as \p{In Halfwidth and Fullwidth Forms}.
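A quick way to convince yourself that the two spellings agree is to compare them over every code point in the block. This is only a sanity-check sketch; it uses the underscore and run-together spellings from the uniprops output, and the variable names are made up.

```perl
use strict;
use warnings;

# Count code points in U+FF00..U+FFEF on which the two spellings disagree.
my $mismatches = 0;
for my $cp (0xFF00 .. 0xFFEF) {
    my $ch       = chr $cp;
    my $by_block = $ch =~ /\p{Block=Halfwidth_And_Fullwidth_Forms}/ ? 1 : 0;
    my $by_in    = $ch =~ /\p{InHalfwidthAndFullwidthForms}/        ? 1 : 0;
    $mismatches++ if $by_block != $by_in;
}
print "mismatches: $mismatches\n";
```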

