regex - Perl: How to match FULLWIDTH LATIN SMALL -
I am using listadmin to manage many Mailman-based mailing lists. I have a long list of subjects and addresses set up to block spam. Recently, I received smarter spam, in the sense that it uses nice-looking Unicode characters, e.g.:
subject: Al l the ad ult mov ies you' ve see n r e nothing c ompari- ng t o our exx xci t ng compilation of 13' 000 mov ies in hd t hat are v ailable y ou now!
or
subject: hd qua lit y vi d eos d pho graph s o f ho t c hic ks
here u
Now I want to use a smart Perl regex to block that. Piping these subjects through hexdump revealed that many of the characters are FULLWIDTH LATIN SMALL LETTER characters. However, \p{FULLWIDTH LATIN SMALL LETTER} doesn't work: Can't find Unicode property definition "FULLWIDTH LATIN SMALL LETTER"
So the question is: is there a \p{something} that matches the fullwidth characters? Alternatively: is there another way to match these characters?
The page perlunicode documents the available Unicode character classes. I found the reference in perlrebackslash, which documents special character classes and backslash sequences such as \p{...} in regexes.
As the summary explains, property classes are written with a property type and a property value, separated by : or =. However, there does not seem to be any mention of fullwidth characters as a predefined property.
But there is the Block/Blk property, which can have Halfwidth and Fullwidth Forms (U+FF00–U+FFEF) as its value:
/\p{Block=Halfwidth and Fullwidth Forms}/
This matches your input (tested on Perl v5.16.3).
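The block test can be sketched as a small subject filter. This is only an illustration: the helper name has_fullwidth and the sample subjects are made up, and it assumes the subject has already been decoded into Perl characters (on raw UTF-8 bytes the property would not match). The block name is written with underscores here; Perl matches property names loosely, so it is the same property as the spelling with spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of listadmin): returns 1 if the decoded
# string contains any character from the Halfwidth and Fullwidth Forms
# block (U+FF00-U+FFEF), otherwise 0.
sub has_fullwidth {
    my ($subject) = @_;
    return $subject =~ /\p{Block=Halfwidth_And_Fullwidth_Forms}/ ? 1 : 0;
}

# Made-up samples: the first contains FULLWIDTH LATIN SMALL LETTERs
# v, i, d (U+FF56, U+FF49, U+FF44); the second is plain ASCII.
my $spam = "hd \x{FF56}\x{FF49}\x{FF44}eos";
my $ham  = "hd videos";

print has_fullwidth($spam) ? "spam: blocked\n" : "spam: passed\n";
print has_fullwidth($ham)  ? "ham: blocked\n"  : "ham: passed\n";
```

With these two samples, the first subject is reported as blocked and the second as passed. Note that the block also covers halfwidth Katakana and fullwidth punctuation, so a list with legitimate East Asian traffic may need a narrower test.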
A useful tool is uniprops.
$ uniprops U+FF41
U+FF41 ‹ａ› \N{FULLWIDTH LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    Alnum Alpha Alphabetic Assigned InHalfwidthAndFullwidthForms Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase Halfwidth_And_Fullwidth_Forms Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word X_POSIX_XDigit
As you can see, \p{Block=Halfwidth and Fullwidth Forms} can also be written \p{In Halfwidth and Fullwidth Forms}.
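A quick way to convince yourself that the spellings are interchangeable is to try each one against U+FF41. The sketch below also adds a plain character range for the fullwidth small letters only (U+FF41 to U+FF5A); that range is an alternative not taken from the answer above, and the variable names are arbitrary.

```perl
use strict;
use warnings;

my $char = "\x{FF41}";    # FULLWIDTH LATIN SMALL LETTER A

# Three spellings of the same block test.
my %spelling = (
    'Block=' => qr/\p{Block=Halfwidth_And_Fullwidth_Forms}/,
    'Blk='   => qr/\p{Blk=Halfwidth_And_Fullwidth_Forms}/,
    'In'     => qr/\p{InHalfwidthAndFullwidthForms}/,
);

for my $name (sort keys %spelling) {
    my $result = $char =~ $spelling{$name} ? "matches" : "no match";
    printf "%-7s %s\n", $name, $result;
}

# Assumption for illustration: a narrower class covering only the
# fullwidth Latin small letters a-z (U+FF41..U+FF5A).
my $fullwidth_small = qr/[\x{FF41}-\x{FF5A}]/;
print "range   ", ($char =~ $fullwidth_small ? "matches" : "no match"), "\n";
```

The range form is stricter than the block property: it ignores fullwidth capitals, digits, and punctuation, which may or may not be what a spam filter wants.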