“I believe that there are many ethical uses for AI, it just requires care and a clear understanding of the problem you are trying to solve.” – Interview with EM Lewis-Jong
EM Lewis-Jong is the Founder of Mozilla Data Collective. They were previously the Director at Common Voice, the world’s largest open crowdsourced speech corpus, spanning 300+ languages and 750,000 open community contributors. EM has led language data research and community governance programs backed by the US National Science Foundation, NVIDIA, Gates Foundation, and GIZ. Before Mozilla, they were a founding executive in CivTech, working on sharing information, data and knowledge across borders. In this interview, we explored the Mozilla Data Collective, the challenges of data sharing, the importance of building infrastructure for less-resourced languages, their use of AI, and their doctoral research.
The name Mozilla instantly brings to mind the Firefox web browser. But you’re the Vice President of the Mozilla Data Collective. What exactly is that?
The Mozilla Data Collective is a brand-new organisation being incubated within the Mozilla Foundation. Its beginnings can be traced back to 2017, when Common Voice was started – a very different era for technology, when people were solving quite different problems with data. Recently, as various forms of AI evolved and large language models rose to prominence, many language communities began pushing for greater control and sovereignty over their data sharing. Some people started to feel that extractive practices were emerging in the AI industry and began questioning whether releasing data under CC0 without restrictions truly served their interests anymore.
While there was still a strong commitment to open data sharing, some communities wanted more transparency, accountability, and assurance that their contributions would help build a fairer world and a more equitable tech ecosystem. To address these concerns, we spent nearly 18 months conducting listening exercises to understand what the right solutions might look like, from technological and legal frameworks to community infrastructure that would give people more control over their datasets.
We tried to find a platform that could serve the needs of all our communities – and it didn’t exist! Ultimately, we decided to build it ourselves. Mozilla Data Collective is a platform for data agency and fair value exchange. It is about having the choice to share in line with your values, and being able to define what benefit means to you and your community. It enables customisable data sharing and collective governance, and helps users strike a balance between openness and sovereignty. It is designed for communities, organisations, and individuals to share their data under a range of different terms and conditions. What makes it truly special is that communities retain 100% ownership and control over their datasets.
What does it actually mean for people to have full ownership and control over their datasets? And how is that principle reflected in the platform’s design?
It means that people can choose to share their data completely openly, with full transparency over the downstream supply chain, since we keep logs of who downloads each dataset. We also allow people to customise their sharing preferences: they can decide whether to share data locally, regionally, or only within specific jurisdictions.
They might choose to share their data only for certain use cases – for instance, for accessibility or education, but not for commercial purposes. Others might prefer to share it solely for research, excluding commercialisation, or they may opt into commercial use but specify how they wish to be compensated. For example, organisations accessing their data would pay, and the resulting funds could be channelled into community projects or local initiatives. The current version enables this mostly through legal and governance frameworks, but we’re very excited about the coming release, which gives more technical ‘teeth’ to the restrictions.
That is the central premise of the Mozilla Data Collective: to build a system in which people are empowered to share their data and to be the primary beneficiaries of that sharing.
Can you share some examples of the kinds of projects currently being hosted on the platform?
It all started with Common Voice, which is the first user of the Mozilla Data Collective. Common Voice enables people to collect sentences, record audio, and transcribe it in their language. We then package this data into a usable training set for machine learning purposes, most often for speech recognition, but it can also be applied to other areas. It provides a way for communities to create datasets that do not yet exist in the world, with the core focus on improving diversity in language data. The project is entirely community-led: communities decide whether they want to add their language to the platform, select their sentences, and create audio clips. Our role is simply to provide the infrastructure and community support.
Once we started talking to people about what we were doing for Common Voice – everyone wanted to get involved! It was clear we wouldn’t be the only ones with this need. Hundreds of incredible, exclusive third-party datasets are being uploaded as we speak! German TTS from the Open Home Foundation, an Urdu Corpus from Kaleem Magazine, Sindhi News, Punjabi Literature by Chishti Sons publishing agency, a French–Ewondo translation corpus – the list goes on! It’s been humbling to be joined by so many fellow travellers.
Mozilla Common Voice has grown into an incredible global community. How did you manage to build such a vibrant group of people?
It is really a combination of several factors. First, the project addresses a genuine pain point, as many products simply don’t support minority languages, and that resonates deeply with people across the world. Importantly, our work isn’t just about data; it’s also about recognition, respect, and honouring people and their communities and cultures. Second, we brought together a diverse mix of people: language activists, computational linguists, developers, and engineers, all united by the belief that technology should be accessible to everyone, in every language. And third, these individuals were deeply rooted in their local communities, which became a powerful driver of the project’s growth.
We asked this because many researchers at our European Centre of Excellence in AI for Digital Humanities are working on an open-source LLM for Slovene, a language spoken by around two million people. As part of this, we launched a national data collection campaign, but it faced some backlash, particularly from authors whose work is copyrighted, such as translators and book writers. How can we respond to and address this kind of criticism?
This is all very valid criticism, and we should be grateful to have members in our communities who are willing to engage in these kinds of conversations, because there are no easy answers. A radically open approach to data curation and sharing makes it easier for more actors to create tools in these languages – there is little doubt about that.
The challenge, however, is that “open” licenses mean open to everyone. It’s not a bug, it’s a feature. But this feature can make people very uncomfortable, and it isn’t right for everyone. Why should a large technology company be able to use your data for free when the community has invested its precious time in creating it? That concern is a core motivation behind the Mozilla Data Collective. We want our communities to have the power to decide what is right for them. We do not accept people uploading copyrighted content for which they do not hold the rights.
In my experience, communities tend to respond to data sharing in two main ways. Some communities feel that releasing data under CC0 is the right approach for them. Small (language) communities often recognise that large tech companies won’t support their language unless it is made very easy for them. In these cases, the community decides that the best way to ensure their language remains supported is to make the data freely available.
Other communities take a very different stance. They ask, “Why should a large corporation have free access to our data? Who benefits when that happens?” They may be willing to share it for free with certain, close-to-context organisations, but if a major tech company wants to use it, they expect proper compensation. For example, contributions to a community fund or a programme that revitalises their culture. This is really about fair value exchange.
These are two distinct approaches, but the key principle is the same: the communities themselves should decide. Not us. Our role is simply to provide the infrastructure that makes exercising that choice easier.
In recent months, we have also seen significant public backlash against AI, from very real fears of job losses to concerns about poor-quality AI outputs (“AI slop”). In your view, how can AI be used for good?
AI is not useful for everything. It is effective for specific tasks and should not be deployed indiscriminately in any context or for any problem. AI also comes at a cost, both environmental and labour-related.
At the same time, AI is an incredibly powerful tool. Some use cases from Common Voice are particularly inspiring to me. For example, a Swedish health tech company, Mabel AI, aims to prevent miscommunication between doctors and patients. It is a problem that can sometimes be a matter of life and death. When doctors and patients do not share a language, misunderstandings can occur, which is especially critical for migrants. Mabel AI addresses this by building secure, local translation services for doctor–patient interactions, powered by Common Voice.
Another project we worked on involved building a chatbot in Swahili to provide information about land rights. This was especially useful for women whose husbands had died, allowing them to understand their rights regarding land ownership and use. Accessing information about rights, legislation, and policy is crucial but can often be difficult. By enabling natural, spoken question-and-answer interactions, this technology opened new ways for people to understand their civic context – and the AI behind it made that possible.
I believe that there are many ethical uses for AI, but it requires care and a clear understanding of the problem you are trying to solve.

How do you employ AI in your day-to-day work?
We have a running joke in my team that you should treat AI the way you might engage with an intern who is very young and very new. Only use it for work that you’re capable of checking and evaluating yourself!
I sometimes use AI for research tasks, such as indexing exercises that would take a very long time to do manually. Of course, we still have to check the results, because AI is not accurate in every context. The further you get from American English and web-based content, the more inaccurate it becomes, which is a particularly significant issue for projects like Mozilla Data Collective, where much of our work involves other languages and cultures.
We also use it for creating templates that we repeatedly use across the team. Since we are a remote team with flexible working hours, it is rare that everyone is online at the same time. Standard templates make it easy for team members to self-serve information.
The things that LLMs are better at, like writing, are exactly the things I do not want to outsource, because I like doing them myself! Deep reading and writing are the two things I enjoy most in life, and I am not giving them up to AI.
Since you mentioned reading and writing, you are writing a doctoral dissertation at the University of Sussex. What is your research focused on?
My research explores the extent to which AI agents can be configured – by which I mean optimised for specific use cases. For example, if you have an iPhone, how easy is it to customise Siri for your context using the settings available today? My work asks to what extent we are truly in control of the AIs we use, and what the available settings and configurations actually allow us to do, and what they don’t.
I focus exclusively on young people because they are fascinating and creative users. Most of my work is with children aged 12 to 14 and explores new modalities for exerting control, beyond graphical interfaces.
What kinds of methods do you use?
I generally prefer mixed methods, as they offer an interesting way to work with children on complex topics. On the one hand, you want to be able to draw comparisons between different groups, and quantitative methods are helpful for this. But because the samples are not large, you need to supplement them with qualitative methods. Simple metrics, like task completion rates, can obscure more nuanced behavioural insights.