22 Aug 2025

Building Together: Arindam's Journey Advancing Open-Source Conversation Intelligence

After seven months of dedicated work at Dembrane, Arindam is moving on to pursue his own agentic graph project. His contributions have helped us tackle some of the toughest challenges in making conversation intelligence accessible and practical for communities everywhere. As he transitions to his next adventure, we want to share the technical journey we've been on together and document the open-source contributions that might help others facing similar challenges.

Production-ready knowledge graphs

When we first encountered Microsoft's GraphRAG, we were excited by its promise: understanding connections between ideas in ways that traditional RAG couldn't. But we quickly hit a wall that many startups face—the academic implementation wasn't ready for the messy reality of continuous community conversations.

The problem was simple but devastating: every time you wanted to add new conversations to the graph, you had to rebuild the graph from scratch.

A friend of the company pointed us towards LightRAG from the University of Hong Kong. With LightRAG, instead of rebuilding graphs daily at prohibitive cost, we could update them incrementally as new conversations flowed in.

It was more efficient but it still wasn't quite right for our needs. Working with the open-source community, Arindam helped add metadata support that preserves conversation context without compromising the graph structure, allowing us to use LightRAG in production. We've contributed these improvements back to the LightRAG project.
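To make the idea of incremental updates with metadata concrete, here is a minimal sketch in plain Python. It is not LightRAG's API and not the code we contributed: the ConversationGraph class, its method names and the metadata fields are all hypothetical, chosen only to show how new conversations can be merged into an existing graph while each edge keeps a record of where it came from.

```python
from collections import defaultdict


class ConversationGraph:
    """Toy knowledge graph that grows incrementally instead of being rebuilt.

    Hypothetical illustration only; not LightRAG's data model.
    """

    def __init__(self):
        # (source, relation, target) -> list of metadata dicts, one per mention
        self.edges = defaultdict(list)
        self.nodes = set()

    def add_conversation(self, triples, metadata):
        """Merge one conversation's triples into the graph.

        Returns the number of edges that were new to the graph.
        """
        new_edges = 0
        for source, relation, target in triples:
            key = (source, relation, target)
            if key not in self.edges:
                new_edges += 1
            self.nodes.update((source, target))
            # Metadata lives next to the edge, so the graph structure does not
            # change when another conversation mentions the same relationship.
            self.edges[key].append(dict(metadata))
        return new_edges

    def sources_for(self, source, relation, target):
        """Return the ids of the conversations that support an edge."""
        mentions = self.edges.get((source, relation, target), [])
        return [m["conversation_id"] for m in mentions]


graph = ConversationGraph()
first = graph.add_conversation(
    [("bike lanes", "raises", "safety concerns")],
    {"conversation_id": "c1", "language": "nl"},
)
second = graph.add_conversation(
    [
        ("bike lanes", "raises", "safety concerns"),
        ("bike lanes", "supports", "cleaner air"),
    ],
    {"conversation_id": "c2", "language": "en"},
)
```

The second call touches only the edges that the new conversation mentions, and the shared edge now carries the context of both conversations.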

Listening at scale with BERTopic

Our core challenge is helping communities understand themselves. When you have hundreds of hours of conversation, how do you surface the themes that matter without losing the nuance? Simply adding all the context to an AI prompt might work well for small focus groups, but it fails catastrophically when you scale to city-wide consultations.

Implementing RAG helps, but it has a critical flaw. When you ask a traditional RAG system to “summarise the content” of a large dataset, it looks for content similar to the query “summarise the content.” Because no content in the dataset resembles that query, the system returns seemingly random context, which leads to a sub-par output.
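The failure is easy to reproduce with a toy retriever. The sketch below assumes a bag-of-words "embedding" and cosine similarity, which is far cruder than the dense embeddings a real RAG system would use; the chunks and function names are made up for illustration. The effect is the same in spirit: a specific query finds its chunk, while a broad query only matches on incidental words.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query, as a plain RAG retriever would."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "I cycle to work and the new bike lanes feel much safer",
    "Parking near the market is impossible on Saturdays",
    "Our street floods every autumn",
]

specific = retrieve("bike lanes safer", chunks)
overview = retrieve("summarise the content", chunks)
```

Here the specific query ranks the bike-lane chunk first with a clear margin. The overview query ranks the parking chunk first, and only because both contain the word "the", which is exactly the kind of arbitrary context described above.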

Arindam had the brainwave to implement a traditional machine learning technique (BERTopic) to overcome this limitation.

Instead of simply looking for content similar to the query, BERTopic creates hierarchical topic trees where themes nest within broader concepts. This allows an LLM to navigate from general ideas to specific discussions, like zooming in and out on a map.

The magic happens when someone asks, "What are people saying about the new bike lanes?" The system doesn't search through millions of conversation chunks to find the most relevant chunks (which might be chunks where people literally ask about bike lanes). Instead, it traverses the topic tree, identifying relevant branches (transportation concerns, safety feedback, environmental benefits, bike lanes), then provides all the relevant context to the LLM answering the query.
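The traversal can be sketched with a small hand-built tree. This is an illustration under stated assumptions, not BERTopic's API and not our production code: the TopicNode class, the keyword sets and the example chunks are hypothetical, and a real system would match the query against topic representations with embeddings rather than exact keywords.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    label: str
    keywords: set
    chunks: list = field(default_factory=list)
    children: list = field(default_factory=list)


def all_chunks(node):
    """Collect every chunk in a node's subtree."""
    chunks = list(node.chunks)
    for child in node.children:
        chunks.extend(all_chunks(child))
    return chunks


def collect_context(node, query_terms):
    """Walk the tree top-down and keep every branch whose keywords match."""
    if node.keywords & query_terms:
        # A matching branch contributes everything beneath it.
        return all_chunks(node)
    context = []
    for child in node.children:
        context.extend(collect_context(child, query_terms))
    return context


tree = TopicNode("all conversations", set(), children=[
    TopicNode("transportation", {"traffic", "transport"}, children=[
        TopicNode("bike lanes", {"bike", "lanes", "cycling"},
                  chunks=["The new lanes cut my commute in half."]),
        TopicNode("parking", {"parking"},
                  chunks=["There is nowhere to park near the market."]),
    ]),
    TopicNode("safety", {"safety", "accidents"}, children=[
        TopicNode("cyclist safety", {"bike", "junction"},
                  chunks=["The junction by the school is dangerous for cyclists."]),
        TopicNode("street lighting", {"lighting"},
                  chunks=["Half the lamps on our street are broken."]),
    ]),
])

context = collect_context(tree, {"bike", "lanes"})
```

The query reaches two different parts of the tree (transportation and safety) and skips parking and street lighting, so the LLM receives context from every relevant branch rather than only the chunks that happen to mention the query words.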

This is especially useful for the kinds of questions that come up when trying to understand broad concerns in large sets of conversations that touch on many specific topics. Working together with our CTO Sameer, we brought this method into production and we’re happy to share this new technique with the world!

Multilingual finetuning

Working with European municipalities taught us that privacy isn't optional. Voice recordings contain biometric data, making them personally identifiable under GDPR. We couldn't use cloud transcription services that might store or process this data outside the EU. We encountered many issues getting this to work, as most of the effort in the field seems to go into optimising English transcription behind easy-to-use but privacy-infringing APIs.

To compensate, we experimented with building our own system on top of the open-source transcription model Whisper, which we host ourselves. Whisper is trained on 90+ languages but often defaults to English, the main language in its training set. Arindam built pipelines that make it relatively easy for us to finetune custom Whisper models on different languages. The system finetunes the model's weights for new languages on demand, learning local terminology and accents without compromising its broad capabilities. We also incorporated many of the optimisations coming out of research into this stack. Where standard Whisper might take 40 minutes to process an hour of audio, our implementation does it in about a minute (depending on GPU availability), making real-time community sessions feasible.
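As a quick worked example of what those figures mean, the snippet below computes the real-time factor (processing time divided by audio duration) for both cases. The function name is ours for illustration, and the inputs are simply the approximate numbers quoted above, not benchmark results.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time divided by audio duration; lower is faster."""
    return processing_minutes / audio_minutes


baseline = real_time_factor(40, 60)   # roughly 0.67
optimised = real_time_factor(1, 60)   # roughly 0.017
speedup = baseline / optimised        # roughly 40x
```

At a factor of roughly 0.017, one minute of speech takes about a second to transcribe, which is what keeps the delay short enough for a live session.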

For now, we have started finetuning Whisper for Dutch, but we look forward to improving secure, accurate and multilingual transcription in production for all European languages, and then the rest of the world.

Speaker Diarisation

The speaker diarisation project also deserves special mention. We needed to track who said what within a session (for coherent transcripts), but we only have one audio track per conversation, unlike video calling solutions, which can separate speakers at the audio source. Arindam implemented a prototype that lets us split a multi-speaker transcript into distinct speakers coherently, without making those speakers explicitly identifiable, which is crucial for the Chatham House Rule and GDPR.
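One common way to combine a transcript with diarisation output while keeping speakers pseudonymous is sketched below. It is not a description of Arindam's prototype: it assumes a diarisation model has already produced speaker turns with opaque cluster ids, and the function names, tuple layouts and example data are hypothetical. Each transcript segment is assigned to the turn it overlaps most, and cluster ids are replaced by neutral labels in order of first appearance.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_transcript(transcript_segments, speaker_turns):
    """Attach a pseudonymous speaker label to each transcript segment.

    transcript_segments: list of (start, end, text)
    speaker_turns: list of (start, end, cluster_id) from a diarisation model
    """
    labels = {}  # cluster_id -> "Speaker N", numbered by first appearance
    labelled = []
    for start, end, text in transcript_segments:
        best = max(
            speaker_turns,
            key=lambda turn: overlap(start, end, turn[0], turn[1]),
            default=None,
        )
        if best is None or overlap(start, end, best[0], best[1]) == 0.0:
            speaker = "Unknown"
        else:
            cluster_id = best[2]
            if cluster_id not in labels:
                labels[cluster_id] = f"Speaker {len(labels) + 1}"
            speaker = labels[cluster_id]
        labelled.append((speaker, text))
    return labelled


segments = [
    (0.0, 4.0, "I think the lanes are too narrow."),
    (4.2, 7.5, "They feel safer than before, though."),
    (7.6, 10.0, "Only at the junctions."),
]
turns = [
    (0.0, 4.1, "cluster_7"),
    (4.1, 7.5, "cluster_2"),
    (7.5, 10.0, "cluster_7"),
]

labelled = label_transcript(segments, turns)
```

The output only ever contains labels such as "Speaker 1" and "Speaker 2", so the transcript stays coherent without carrying a name or a voice profile for anyone in the room.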

Bringing It All Together

The real innovation isn't any single component: it's how they work together. When someone asks, "How do engineering concerns relate to customer feedback?", the system understands the topical landscape through BERTopic, navigates entity relationships through LightRAG, and separates speakers without breaching privacy, across languages.

This synthesis enables us to take steps towards our dream: making large-scale community dialogue a commodity. City councils can understand thousands of citizen voices. Companies can genuinely listen to all their employees. Communities can find consensus even in complexity.

Source Available: Building the Commons Together

We believe that tools for democratic participation should be common resources that lift all communities. That’s why all our improvements to the open source libraries are also open source. When we solve a problem—like maintaining graph consistency during updates or handling code-switching in multilingual communities—everyone benefits. And when others improve these tools, our platform gets better too. It's a virtuous cycle that moves faster than any company could alone.

Thank You, Arindam

As Arindam embarks on his next adventure with agentic graphs, we're grateful for the foundation he's helped build. His work exemplifies what we value at Dembrane: technical excellence in service of human connection, open collaboration over proprietary advantage, and the patience to solve hard problems properly.

The challenges he tackled are the unglamorous but essential work of making AI serve communities rather than the other way around.

We look forward to seeing where Arindam's journey leads next. The door remains open for future collaboration, especially as we work to bring more of his powerful innovations to production.

At Dembrane, we believe people know how. Our tools simply help them share that knowledge. Learn more about our mission at dembrane.com.

