characterCategories

Unicode character categories

Since R2021a

    Description

    ucats = characterCategories(str32) returns the major Unicode character categories for the characters in the UTF32 object str32.

    ucats = characterCategories(str32,'Granularity',granularity) also specifies the granularity of the returned categories. For example, characterCategories(str32,'Granularity','detailed') returns detailed Unicode character categories.

    Examples

    Convert the string "Hello! 😀" to its Unicode UTF-32 string representation using the textanalytics.unicode.UTF32 function.

    str = "Hello! 😀";
    str32 = textanalytics.unicode.UTF32(str)
    str32 = 
      UTF32 with properties:
    
        Data: [72 101 108 108 111 33 32 128512]
    
    

    Get the Unicode character categories of str32 using the characterCategories function.

    ucats = characterCategories(str32)
    ucats = 1x1 cell array
        {[L    L    L    L    L    P    Z    S]}
    
    

    The Unicode character categories "L", "P", "Z", and "S" correspond to "letter", "punctuation", "separator", and "symbol", respectively.
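Because each cell of the output holds a categorical vector, the result can be compared against a category name to build a logical mask. The following is a minimal sketch rather than part of the original example: the variable names cats, isLetter, numLetters, and letterCodes are illustrative, and it assumes that the Data property of the UTF32 object can be read and indexed as displayed above.

```matlab
% Sketch: count and extract the letters of a string by major category.
str = "Hello! 😀";
str32 = textanalytics.unicode.UTF32(str);

% The output is a cell array; the first cell holds the categorical
% vector for the single input string.
ucats = characterCategories(str32);
cats = ucats{1};

% Comparing a categorical vector with a category name yields a logical mask.
isLetter = (cats == "L");
numLetters = nnz(isLetter);

% Assumption: Data is a readable numeric row vector of code points.
letterCodes = str32.Data(isLetter);
```

For this input the mask selects the five code points of "Hello", which matches the five "L" entries shown in ucats.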

    Convert the string "Hello! 😀" to its Unicode UTF-32 string representation using the textanalytics.unicode.UTF32 function.

    str = "Hello! 😀";
    str32 = textanalytics.unicode.UTF32(str)
    str32 = 
      UTF32 with properties:
    
        Data: [72 101 108 108 111 33 32 128512]
    
    

    Get the Unicode character categories of str32 using the characterCategories function. To return detailed Unicode character categories, set the 'Granularity' option to 'detailed'.

    ucats = characterCategories(str32,'Granularity','detailed')
    ucats = 1x1 cell array
        {[Lu    Ll    Ll    Ll    Ll    Po    Zs    So]}
    
    

    The Unicode character categories "Lu", "Ll", "Po", "Zs", and "So" correspond to "uppercase letter", "lowercase letter", "other punctuation", "space separator", and "other symbol", respectively.

    Input Arguments

    str32: UTF-32 string representation, specified as a UTF32 array.

    'Granularity': Granularity of the returned Unicode character categories, specified as one of the following:

    • 'major' – Return the major Unicode character category. This is the first letter of the category code only (for example, "L").

    • 'detailed' – Return the detailed Unicode character category. This is the full two-letter category code (for example, "Lu").
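The relationship between the two granularity levels can be checked directly: taking the first letter of each detailed category should reproduce the major category. This is a small sketch under that reading of the option descriptions; the variable names detailedNames, majorNames, firstLetters, and isConsistent are illustrative.

```matlab
% Sketch: compare 'major' and 'detailed' output for the same input.
str32 = textanalytics.unicode.UTF32("Hello! 😀");

major = characterCategories(str32,'Granularity','major');
detailed = characterCategories(str32,'Granularity','detailed');

% Convert the categorical vectors to string arrays of category names.
majorNames = string(major{1});
detailedNames = string(detailed{1});

% Keep only the first letter of each two-letter detailed category.
firstLetters = extractBefore(detailedNames,2);

% True if every detailed category starts with its major category.
isConsistent = isequal(firstLetters,majorNames);
```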

    Output Arguments

    ucats: Unicode character categories, returned as a cell array of categorical vectors.

    This table shows the major and detailed Unicode character categories. To specify which granularity of Unicode character categories to return, use the Granularity option.

    Major Category   Description    Detailed Category   Description
    L                Letter         Lu                  Uppercase letter
                                    Ll                  Lowercase letter
                                    Lt                  Titlecase letter
                                    Lm                  Modifier letter
                                    Lo                  Other letter
    M                Mark           Mn                  Nonspacing mark
                                    Mc                  Spacing mark
                                    Me                  Enclosing mark
    N                Number         Nd                  Decimal number
                                    Nl                  Letter number
                                    No                  Other number
    P                Punctuation    Pc                  Connector punctuation
                                    Pd                  Dash punctuation
                                    Ps                  Open punctuation
                                    Pe                  Close punctuation
                                    Pi                  Initial punctuation
                                    Pf                  Final punctuation
                                    Po                  Other punctuation
    S                Symbol         Sm                  Math symbol
                                    Sc                  Currency symbol
                                    Sk                  Modifier symbol
                                    So                  Other symbol
    Z                Separator      Zs                  Space separator
                                    Zl                  Line separator
                                    Zp                  Paragraph separator
    C                Other          Cc                  Control
                                    Cf                  Format
                                    Cs                  Surrogate
                                    Co                  Private use
                                    Cn                  Unassigned
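The table can also be used as a lookup to turn category codes into readable descriptions. The sketch below copies the major categories from the table into two parallel string arrays. The names majorCodes, majorDescriptions, codes, and descriptions are illustrative and are not part of the toolbox.

```matlab
% Sketch: map major category codes to the descriptions in the table.
majorCodes = ["L" "M" "N" "P" "S" "Z" "C"];
majorDescriptions = ["Letter" "Mark" "Number" "Punctuation" ...
    "Symbol" "Separator" "Other"];

str32 = textanalytics.unicode.UTF32("Hello! 😀");
ucats = characterCategories(str32);
codes = string(ucats{1});

% Locate each returned code in the lookup list.
[found,idx] = ismember(codes,majorCodes);

% Fill in descriptions only where a matching code was found.
descriptions = strings(size(codes));
descriptions(found) = majorDescriptions(idx(found));
```

For this input, descriptions contains "Letter" five times, followed by "Punctuation", "Separator", and "Symbol".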

    References

    [1] Unicode® Standard Annex #44: Unicode Character Database. https://www.unicode.org/reports/tr44/

    Version History

    Introduced in R2021a