3 min readfrom Machine Learning

20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]

been working on structuring India's legal corpus for the past 2 years and wanted to share what I've built and hear from people working on legal NLP or low-resource Indian language models.

dataset is 20M+ Indian court cases from the Supreme Court, all 25 High Courts, and 14 Tribunals. each case has structured metadata (court, bench, date, parties, judges, sections cited, acts referenced, case type). there's a citation graph across the full corpus where I've classified relationships as followed, distinguished, overruled, or mentioned.

every case is embedded with Voyage AI (1024d dense) plus BM25 sparse vectors. I have also cross-referenced 23,122 Acts and Statutes with the cases that interpret them.

Some things that might be interesting to this community:

citation network thing across 20M+ cases is, as far as I know, the first machine-readable one for Indian law.

could be useful for graph neural network research, legal outcome prediction, or influence analysis on which judgments are most cited and which are being overruled.

most Indian language NLP corpora are conversational or news text. Legal text is a completely different register. formal, precise, domain-specific. the bilingual pairs from the translation service could be useful for fine-tuning Indian language models on formal and legal domains.

the metadata extraction pipeline identifies judges, advocates, parties, sections, acts, and dates from unstructured judgment text. built with a mix of regex, heuristics, and LLM-based extraction. the structured outputs could serve as training data for legal NER models.

Indian court judgments are long. Median around 3,000 words, some exceed 50,000 words.

if anyone is benchmarking retrieval-augmented generation on legal domains, this corpus plus the citation graph could work as an evaluation bed. Ground truth exists in the citation relationships: if Case A cites Case B, a good retriever should show B when asked about the legal question in A.

data is available via API and bulk export in JSON and Parquet. Indian court judgments are public domain under Indian law so no copyright issues for research use.

being upfront about limitations: coverage is primarily English text (except Supreme court one, they have 3-4 translated language copies ) since Indian HCs issue orders in English, the regional language data comes from our translation service not from original regional language judgments.

metadata extraction accuracy varies by court, SC and major HCs are cleaner while smaller tribunals have messier inputs. The citation graph is extracted heuristically plus LLM-assisted, I estimate around 90-95% precision on citation extraction and lower on treatment classification. Not all 20M cases have complete metadata, coverage is best for post-2007 judgments.

would love to hear from anyone working on legal NLP, Indian language models, or graph-based legal analysis. What would be most useful to you from a dataset like this?

deets at vaquill

submitted by /u/zriyansh
[link] [comments]

Want to read more?

Check out the full article on the original site

View original article

Tagged with

#natural language processing for spreadsheets
#generative AI for data analysis
#Excel alternatives for data analysis
#natural language processing
#conversational data analysis
#data analysis tools
#financial modeling with spreadsheets
#big data management in spreadsheets
#real-time data collaboration
#intelligent data visualization
#data visualization tools
#enterprise data management
#big data performance
#data cleaning solutions
#self-service analytics tools
#large dataset processing
#cloud-based spreadsheet applications
#self-service analytics
#rows.com
#AI formula generation techniques
20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]