This document chronicles my experimental journey building an Italian NLP system to identify which entities matter most in documents. The core challenge spans entity recognition and coreference resolution; I tested multiple approaches and traced why each fell short.

The central question

The problem sounds straightforward: determine which entities a document fundamentally discusses, rather than simply counting mention frequency. This requires solving two distinct challenges simultaneously.

NER attempts

spaCy's limitations

I initially chose spaCy for named entity recognition but found it inadequate. spaCy is an English-first system extended to other languages as an afterthought. Custom entity types and fine-tuning proved cumbersome, and the system lacked genuine language-agnostic architecture by design.

Transformer-based models

XLM-RoBERTa and mBERT offered genuine multilingual support from the ground up. However, they came with pre-trained label sets (PER, ORG, LOC, MISC) incompatible with my custom taxonomy for domain-specific entities like ingredients and legal instruments.
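To make the mismatch concrete, here is a minimal sketch (the custom taxonomy and mapping are hypothetical, for illustration only) of projecting the four pretrained labels onto a domain taxonomy. Domain types like INGREDIENT simply have no source label to map from, so no post-hoc relabelling can recover them; only fine-tuning could.

```python
# The fixed label set these pretrained NER heads emit.
PRETRAINED_LABELS = {"PER", "ORG", "LOC", "MISC"}

# Hypothetical custom taxonomy mapped back to pretrained labels.
# INGREDIENT and LEGAL_INSTRUMENT have no counterpart in the pretrained set.
CUSTOM_TO_PRETRAINED = {
    "PERSON": "PER",
    "COMPANY": "ORG",
    "PLACE": "LOC",
    "INGREDIENT": None,        # unreachable: the model never predicts it
    "LEGAL_INSTRUMENT": None,  # unreachable
}

def reachable_types(mapping):
    """Return the custom types a fixed-label model can ever produce."""
    return {custom for custom, src in mapping.items() if src in PRETRAINED_LABELS}

print(sorted(reachable_types(CUSTOM_TO_PRETRAINED)))
```

The absent types are exactly the domain-specific ones the project needed, which is why label remapping was a dead end here.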

Local language models via Ollama

Testing llama3.2:3b, qwen2.5:7b, qwen2.5:14b, and mistral:7b, I found that LLMs understand context well but scale poorly: processing time becomes prohibitive on large corpora, output is non-deterministic, and models invent unsolicited entity types or hallucinate entities outright.
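A sketch of the schema problem: even with an explicit prompt, the JSON an LLM returns has to be validated defensively. The function below (entity types and sample output are hypothetical) keeps only entities whose type belongs to the requested set and whose text actually occurs in the document, discarding both invented types and hallucinated spans.

```python
import json

# Hypothetical custom taxonomy requested in the prompt.
ALLOWED_TYPES = {"PERSON", "ORG", "INGREDIENT"}

def validate_llm_entities(raw_json: str, document: str) -> list[dict]:
    """Filter an LLM's entity list: drop invented types and hallucinated spans."""
    try:
        entities = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # non-deterministic output is sometimes not valid JSON at all
    return [
        e for e in entities
        if isinstance(e, dict)
        and e.get("type") in ALLOWED_TYPES          # reject invented types
        and isinstance(e.get("text"), str)
        and e["text"] in document                   # reject spans absent from the text
    ]

doc = "Il basilico fresco viene aggiunto alla fine."
raw = '[{"text": "basilico", "type": "INGREDIENT"}, {"text": "Genova", "type": "CITY"}]'
print(validate_llm_entities(raw, doc))
```

Validation salvages usable output, but it cannot fix the underlying cost: every document still needs a full LLM call, and every rejected entity is wasted inference.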

Coreference resolution challenges

coreferee's obsolescence

Built on spaCy, coreferee suffered from abandonment. The last release was tested against Python 3.11 and spaCy 3.5, and the project's last meaningful update dates to 2022. By November 2025, compatibility problems had rendered it unreliable for production use.

AllenNLP's language constraint

SpanBERT showed promise but natively supports only English, Chinese, and Arabic. Italian support would require retraining from scratch on labelled coreference data that was unavailable.

fastcoref's performance issues

This model demonstrated modern architecture and better multilingual accuracy but revealed Italian-specific problems. On CPU, each document in a 1,000-document batch took about 8 seconds to process, prohibitively slow for real workloads. More critically, it struggled with Italian's linguistic features: pro-drop structures, clitics, and null subjects, all of which carry coreference information in authentic Italian text.
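The throughput problem is easy to quantify. A back-of-the-envelope sketch (the 8 s/doc figure comes from my CPU runs; batching and different hardware would shift it):

```python
def corpus_hours(n_docs: int, seconds_per_doc: float) -> float:
    """Total wall-clock hours to process a corpus sequentially."""
    return n_docs * seconds_per_doc / 3600

print(round(corpus_hours(1_000, 8.0), 2))    # ~2.22 hours for the test corpus
print(round(corpus_hours(100_000, 8.0), 1))  # ~222.2 hours at a larger scale
```

Over two hours for a modest test corpus, and more than nine days at a hypothetical 100k-document scale, rules out sequential CPU inference regardless of accuracy.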

Conclusion

General-purpose LLMs cannot serve as NER solutions; they lack schema control, determinism, and scalability. The way forward is a purpose-built transformer that performs span detection with labels supplied in context, allowing custom entity types at inference time rather than baking them into model weights.
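A toy sketch of the label-in-context idea (purely illustrative, not a real model): labels arrive as part of the input at inference time, candidate spans are enumerated from the text, and a scoring function, here a stand-in gazetteer lookup with made-up entries, matches spans against whichever labels the caller provided. A real system would replace the lookup with a trained span-label similarity score.

```python
def enumerate_spans(tokens, max_len=3):
    """All candidate token spans up to max_len tokens, as (start, end) pairs."""
    return [(i, j) for i in range(len(tokens))
            for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1))]

# Stand-in for a learned span-label scorer (hypothetical entries).
GAZETTEER = {"basilico": "INGREDIENT", "roma": "CITY"}

def detect(text, labels):
    """Return (span_text, label) pairs for labels supplied at inference time."""
    tokens = text.lower().split()
    hits = []
    for i, j in enumerate_spans(tokens):
        span = " ".join(tokens[i:j])
        label = GAZETTEER.get(span)
        if label in labels:  # only labels the caller asked for are returned
            hits.append((span, label))
    return hits

print(detect("Aggiungi il basilico a Roma", ["INGREDIENT"]))
```

The key property is that the label set lives in the function call, not in the weights: asking for ["CITY"] instead would return a different result from the same text with no retraining.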