Skip to content

RTIInternational/rota

Repository files navigation

language widget
en
text
theft 3
text
forgery
text
unlawful possession short-barreled shotgun
text
criminal trespass 2nd degree
text
eluding a police vehicle
text
upcs synthetic narcotic

ROTA

Rapid Offense Text Autocoder

HuggingFace Models GitHub Model Release DOI

Criminal justice research often requires conversion of free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense of "eluding a police vehicle" would be coded to a charge category of "Obstruction - Law Enforcement". Since free-text offense descriptions aren't standardized and often need to be categorized in large volumes, this can result in a manual and time intensive process for researchers. ROTA is a machine learning model for converting offense text into offense codes.

Currently ROTA predicts the Charge Category of a given offense text. A charge category is one of the headings for offense codes in the 2009 NCRP Codebook: Appendix F.

The model was trained on publicly available data from a crosswalk containing offenses from all 50 states combined with three additional hand-labeled offense text datasets.

Charge Category Example

Data Preprocessing

The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand the text to a more full, correct English text. Some data-specific prefixes and suffixes are then removed from the text -- e.g. some states included a statute as a part of the text. Finally, punctuation (excluding dollar signs) are removed from the input, multiple spaces between words are removed, and the text is lowercased.

Cross-Validation Performance

This model was evaluated using 3-fold cross validation. Except where noted, numbers presented below are the mean value across the 3 folds.

The model in this repository is trained on all available data. Because of this, you can typically expect production performance to be (unknowably) better than the numbers presented below.

Overall Metrics

Metric Value
Accuracy 0.934
MCC 0.931
Metric precision recall f1-score
macro avg 0.811 0.786 0.794

Note: These are the average of the values per fold, so macro avg is the average of the macro average of all categories per fold.

Per-Category Metrics

Category precision recall f1-score support
AGGRAVATED ASSAULT 0.954 0.954 0.954 4085
ARMED ROBBERY 0.961 0.955 0.958 1021
ARSON 0.946 0.954 0.95 344
ASSAULTING PUBLIC OFFICER 0.914 0.905 0.909 588
AUTO THEFT 0.962 0.962 0.962 1660
BLACKMAIL/EXTORTION/INTIMIDATION 0.872 0.871 0.872 627
BRIBERY AND CONFLICT OF INTEREST 0.784 0.796 0.79 216
BURGLARY 0.979 0.981 0.98 2214
CHILD ABUSE 0.805 0.78 0.792 139
COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED 0.827 0.815 0.821 47
COMMERCIALIZED VICE 0.818 0.788 0.802 666
CONTEMPT OF COURT 0.982 0.987 0.984 2952
CONTRIBUTING TO DELINQUENCY OF A MINOR 0.544 0.333 0.392 50
CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED 0.864 0.791 0.826 280
COUNTERFEITING (FEDERAL ONLY) 0 0 0 2
DESTRUCTION OF PROPERTY 0.97 0.968 0.969 2560
DRIVING UNDER INFLUENCE - DRUGS 0.567 0.603 0.581 34
DRIVING UNDER THE INFLUENCE 0.951 0.946 0.949 2195
DRIVING WHILE INTOXICATED 0.986 0.981 0.984 2391
DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED 0.903 0.911 0.907 3100
DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT 0.856 0.861 0.858 380
EMBEZZLEMENT 0.865 0.759 0.809 100
EMBEZZLEMENT (FEDERAL ONLY) 0 0 0 1
ESCAPE FROM CUSTODY 0.988 0.991 0.989 4035
FAMILY RELATED OFFENSES 0.739 0.773 0.755 442
FELONY - UNSPECIFIED 0.692 0.735 0.712 122
FLIGHT TO AVOID PROSECUTION 0.46 0.407 0.425 38
FORCIBLE SODOMY 0.82 0.8 0.809 76
FORGERY (FEDERAL ONLY) 0 0 0 2
FORGERY/FRAUD 0.911 0.928 0.919 4687
FRAUD (FEDERAL ONLY) 0 0 0 2
GRAND LARCENY - THEFT OVER $200 0.957 0.973 0.965 2412
HABITUAL OFFENDER 0.742 0.627 0.679 53
HEROIN VIOLATION - OFFENSE UNSPECIFIED 0.879 0.811 0.843 24
HIT AND RUN DRIVING 0.922 0.94 0.931 303
HIT/RUN DRIVING - PROPERTY DAMAGE 0.929 0.918 0.923 362
IMMIGRATION VIOLATIONS 0.84 0.609 0.697 19
INVASION OF PRIVACY 0.927 0.923 0.925 1235
JUVENILE OFFENSES 0.928 0.866 0.895 144
KIDNAPPING 0.937 0.93 0.933 553
LARCENY/THEFT - VALUE UNKNOWN 0.955 0.945 0.95 3175
LEWD ACT WITH CHILDREN 0.775 0.85 0.811 596
LIQUOR LAW VIOLATIONS 0.741 0.768 0.755 214
MANSLAUGHTER - NON-VEHICULAR 0.626 0.802 0.701 139
MANSLAUGHTER - VEHICULAR 0.79 0.853 0.819 117
MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED 0.741 0.662 0.699 62
MISDEMEANOR UNSPECIFIED 0.63 0.243 0.347 57
MORALS/DECENCY - OFFENSE 0.774 0.764 0.769 412
MURDER 0.965 0.915 0.939 621
OBSTRUCTION - LAW ENFORCEMENT 0.939 0.947 0.943 4220
OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS 0.881 0.895 0.888 1965
PAROLE VIOLATION 0.97 0.953 0.962 946
PETTY LARCENY - THEFT UNDER $200 0.965 0.761 0.85 139
POSSESSION/USE - COCAINE OR CRACK 0.893 0.928 0.908 68
POSSESSION/USE - DRUG UNSPECIFIED 0.624 0.535 0.572 189
POSSESSION/USE - HEROIN 0.884 0.852 0.866 25
POSSESSION/USE - MARIJUANA/HASHISH 0.977 0.97 0.973 556
POSSESSION/USE - OTHER CONTROLLED SUBSTANCES 0.975 0.965 0.97 3271
PROBATION VIOLATION 0.963 0.953 0.958 1158
PROPERTY OFFENSES - OTHER 0.901 0.87 0.885 446
PUBLIC ORDER OFFENSES - OTHER 0.7 0.721 0.71 1871
RACKETEERING/EXTORTION (FEDERAL ONLY) 0 0 0 2
RAPE - FORCE 0.842 0.873 0.857 641
RAPE - STATUTORY - NO FORCE 0.707 0.55 0.611 140
REGULATORY OFFENSES (FEDERAL ONLY) 0.847 0.567 0.674 70
RIOTING 0.784 0.605 0.68 119
SEXUAL ASSAULT - OTHER 0.836 0.836 0.836 971
SIMPLE ASSAULT 0.976 0.967 0.972 4577
STOLEN PROPERTY - RECEIVING 0.959 0.957 0.958 1193
STOLEN PROPERTY - TRAFFICKING 0.902 0.888 0.895 491
TAX LAW (FEDERAL ONLY) 0.373 0.233 0.286 30
TRAFFIC OFFENSES - MINOR 0.974 0.977 0.976 8699
TRAFFICKING - COCAINE OR CRACK 0.896 0.951 0.922 185
TRAFFICKING - DRUG UNSPECIFIED 0.709 0.795 0.749 516
TRAFFICKING - HEROIN 0.871 0.92 0.894 54
TRAFFICKING - OTHER CONTROLLED SUBSTANCES 0.963 0.954 0.959 2832
TRAFFICKING MARIJUANA/HASHISH 0.921 0.943 0.932 255
TRESPASSING 0.974 0.98 0.977 1916
UNARMED ROBBERY 0.941 0.939 0.94 377
UNAUTHORIZED USE OF VEHICLE 0.94 0.908 0.924 304
UNSPECIFIED HOMICIDE 0.61 0.554 0.577 60
VIOLENT OFFENSES - OTHER 0.827 0.817 0.822 606
VOLUNTARY/NONNEGLIGENT MANSLAUGHTER 0.619 0.513 0.542 54
WEAPON OFFENSE 0.943 0.949 0.946 2466

Note: support is the average number of observations predicted on per fold, so the total number of observations per class is roughly 3x support.

Using Confidence Scores

If we interpret the classification probability as a confidence score, we can use it to filter out predictions that the model isn't as confident about. We applied this process in 3-fold cross validation. The numbers presented below indicate how much of the prediction data is retained given a confidence score cutoff of p. We present the overall accuracy and MCC metrics as if the model was only evaluated on this subset of confident predictions.

cutoff percent retained mcc acc
0 0.85 0.952 0.96 0.961
1 0.9 0.943 0.964 0.965
2 0.95 0.928 0.97 0.971
3 0.975 0.912 0.975 0.976
4 0.99 0.886 0.982 0.983
5 0.999 0.733 0.995 0.996