From Machine Learning

I built an open-source Knowledge Graph pipeline with hybrid retrieval to improve LLM multi-hop reasoning [P]

Hey everyone,

I built an open-source full-stack pipeline (Django + React) that constructs a Knowledge Graph from raw text, detects thematic communities, and uses hybrid search to handle multi-hop questions and reduce the "lost in the middle" problem that standard vector retrieval tends to run into.

The Pipeline:

  1. Ingestion & Chunking: Raw text is cleaned, parsed, and split into overlapping chunks to preserve local context.
  2. Graph Construction: spaCy extracts named entities from each chunk. A weighted co-occurrence graph is built using NetworkX, mapping which entities appear together and linking them to their source chunks.
  3. Community Detection: The graph is partitioned into thematic clusters using greedy_modularity_communities. For each cluster, text chunks are sampled at random and sent to an LLM to generate a high-level summary. Sampling at random keeps the summary from being dominated by chunks attached to a few "hub" entities.
  4. Indexing: All chunks are embedded into a dense vector store, and a sparse BM25 index is built over the same corpus.
  5. Hybrid Retrieval: On query, the system performs a dual search (Dense Vector + BM25). Simultaneously, it extracts entities from the prompt, traverses the graph for 1st-degree neighbors, and retrieves their associated chunks.
  6. Fusion & Reranking: Local (chunk-level) and global (community summary) results are merged, deduplicated, and scored using Reciprocal Rank Fusion (RRF). The top-K candidates are then re-scored by a Cross-Encoder to improve precision.
  7. LLM Synthesis: The final curated context is passed to the LLM with strict prompting to generate a concise, well-structured, and cited answer.
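
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over three ranked lists (dense, BM25, graph). It is not taken from the repo: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are all assumptions for illustration, and the Cross-Encoder rerank is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not the repo's.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for determinism.
    # Using a dict keyed by chunk ID also deduplicates the results.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical retriever outputs, best match first.
dense = ["c3", "c1", "c7"]
bm25 = ["c1", "c9", "c3"]
graph = ["c1", "c4"]

fused = reciprocal_rank_fusion([dense, bm25, graph])
print(fused)  # ['c1', 'c3', 'c4', 'c9', 'c7']
```

A chunk that shows up in several lists (c1 here) rises to the top even when no single retriever ranked it first, which is the main reason to fuse by rank instead of by raw scores that live on different scales.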

Why it works:

Standard vector search often struggles with multi-hop queries like:

Who ordered the execution of Sansa's father, and how did that person eventually die?

By traversing the graph (Sansa -> Ned -> Joffrey -> Poisoning), the system bridges the gap between disconnected text chunks and synthesizes the correct answer.
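
A toy version of that traversal, as a rough sketch: it uses plain Python dicts as a stand-in for NetworkX and hand-labeled entities as a stand-in for spaCy NER, so every name and chunk below is hypothetical. The `hops` parameter is also my addition; the pipeline as described uses 1st-degree neighbors.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus: chunk ID -> (text, entities). Entities are hand-labeled here;
# in the pipeline described above they would come from spaCy.
CHUNKS = {
    "c0": ("Sansa is the daughter of Ned.", {"Sansa", "Ned"}),
    "c1": ("Joffrey ordered the execution of Ned.", {"Joffrey", "Ned"}),
    "c2": ("Joffrey was poisoned at his own wedding.", {"Joffrey"}),
}


def build_graph(chunks):
    """Return (co-occurrence weights, adjacency, entity -> chunk IDs)."""
    weights = defaultdict(int)
    adjacency = defaultdict(set)
    entity_chunks = defaultdict(set)
    for chunk_id, (_text, entities) in chunks.items():
        for entity in entities:
            entity_chunks[entity].add(chunk_id)
        # Every pair of entities in the same chunk gets an edge;
        # the weight counts how many chunks they share.
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
            adjacency[a].add(b)
            adjacency[b].add(a)
    return weights, adjacency, entity_chunks


def expand(seeds, adjacency, hops=1):
    """Breadth-first expansion of the seed entities by `hops` levels."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for e in frontier for n in adjacency.get(e, ())} - seen
        seen |= frontier
    return seen


def graph_retrieve(query_entities, adjacency, entity_chunks, hops=1):
    """Collect the chunks linked to the query entities and their neighbors."""
    entities = expand(query_entities, adjacency, hops)
    return sorted({c for e in entities for c in entity_chunks.get(e, ())})


weights, adjacency, entity_chunks = build_graph(CHUNKS)
one_hop = graph_retrieve({"Sansa"}, adjacency, entity_chunks, hops=1)
two_hop = graph_retrieve({"Sansa"}, adjacency, entity_chunks, hops=2)
print(one_hop)  # ['c0', 'c1']
print(two_hop)  # ['c0', 'c1', 'c2']
```

In this three-chunk corpus the poisoning chunk sits two hops from Sansa, so one hop only reaches the execution. In a real corpus Sansa and Joffrey would likely co-occur in some chunk, which would make Joffrey a 1st-degree neighbor and bring his chunks in directly.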

GitHub: https://github.com/mohammad-majoony/graphrag-studio

Would love feedback! Thanks.

submitted by /u/Future_Caregiver_643
