Foundational/Online models

Transaction categorization benchmark table

Dataset: categorization_golden_v1
| Model | Quantisation | json_mode | batch_size | easy | medium | hard | edge_case | accuracy | avg_latency_ms | cost_per_sample |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-5.2 | full | strict | 25 | 100.0% (33/33) | 97.06% (33/34) | 84.21% (16/19) | 100.0% (14/14) | 96.0 | 9090 | 0.0 |
| gpt-5 | full | strict | 25 | 100.0% (33/33) | 100.0% (34/34) | 100.0% (19/19) | 100.0% (14/14) | 100.0 | 50339 | 0.0 |
| gpt-4o | full | strict | 25 | 100.0% (33/33) | 100.0% (34/34) | 94.74% (18/19) | 100.0% (14/14) | 99.0 | 8302 | 0.0 |

Open-weight models

Transaction categorization benchmark table

Dataset: categorization_golden_v1
| Model | Quantisation | json_mode | batch_size | easy | medium | hard | edge_case | accuracy | avg_latency_ms | cost_per_sample | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| qwen/qwen3.5-35b-a3b | full | strict | 25 | 100.0% (33/33) | 100.0% (34/34) | 100.0% (19/19) | 100.0% (14/14) | 100.0 | 54698 | 0.0 | |
| qwen3-30b-a3b-instruct-2507 | Q3_K_XL | strict | 25 | 100.0% (33/33) | 100.0% (34/34) | 94.74% (18/19) | 100.0% (14/14) | 99.0 | 25357 | 0.0 | |
| gpt-oss-20b | Q3_K_XL | none, strict | 25 | 100.0% (33/33) | 100.0% (34/34) | 52.63% (10/19) | 100.0% (14/14) | 91.0 | 43214 | 0.0 | Needs at least 8k context window due to not supporting strict JSON mode. |
| qwen3-14b | Q5_K_XL | strict | 25 | 93.94% (31/33) | 88.24% (30/34) | 84.21% (16/19) | 100.0% (14/14) | 91.0 | 67455 | 0.0 | |
| google/gemma-3-12b | 4bit | strict | 25 | 100.0% (33/33) | 91.18% (31/34) | 89.47% (17/19) | 100.0% (14/14) | 95.0 | 58454 | 0.0 | |
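
The accuracy column appears to be the pooled share of correct answers across all four difficulty buckets (100 samples in total), not the mean of the per-bucket percentages. That reading is inferred from the numbers, not stated by the source. A minimal sketch that reproduces the column for three rows, with the counts copied from the tables and `overall_accuracy` as a hypothetical helper name:

```python
# Per-bucket (correct, total) counts in the order easy, medium, hard, edge_case,
# copied from the benchmark tables.
buckets = {
    "gpt-5.2": [(33, 33), (33, 34), (16, 19), (14, 14)],
    "gpt-oss-20b": [(33, 33), (34, 34), (10, 19), (14, 14)],
    "qwen3-14b": [(31, 33), (30, 34), (16, 19), (14, 14)],
}


def overall_accuracy(counts):
    """Pooled accuracy in percent: total correct divided by total samples.

    Assumption: this is how the accuracy column is derived; the source
    does not say so explicitly.
    """
    correct = sum(c for c, _ in counts)
    total = sum(t for _, t in counts)
    return round(100.0 * correct / total, 2)


for model, counts in buckets.items():
    print(model, overall_accuracy(counts))
```

Under this assumption the pooled figures match the reported 96.0, 91.0, and 91.0. Note that gpt-oss-20b and qwen3-14b share the same overall score while failing in different places: the former loses all nine of its samples in the hard bucket, while the latter spreads its errors across easy, medium, and hard.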