AI applications surged in 2025 as businesses rapidly integrated generative technology into their platforms, aiming to boost productivity and, ultimately, revenue. A new study examined exactly how much data these AI apps collect, and the results are revealing: a substantial portion of user input ends up in the hands of the apps themselves. The rapid pace of AI advancement has outstripped regulatory guidance, creating a complex privacy landscape where data harvesting practices can be surprising to many users. The findings underscore the growing tension between innovation and privacy, highlighting the need for clearer transparency and safeguards as AI becomes more embedded in everyday tools.
The Landscape of AI Data Collection in 2025
The Ecommerce Platforms analysis delves into how AI-powered apps handle user data, extending beyond what is collected to how that data is used and who has access to it. The study spans a wide range of consumer-facing AI apps, including voice assistants, language learning tools, writing and design apps, photography and image editing tools, and multi-purpose platforms that incorporate generative capabilities. The central finding across these diverse categories is striking: the majority of the top-rated AI apps collect more than half of the data users input into the service. This high rate of data capture reflects the underlying business models of many AI products, where data serves as a critical asset for training, refinement, and monetization.
To illustrate the scope, the study enumerates a list of AI apps and the percentage of input data that is collected. At the very top of the spectrum sits Amazon Alexa, which collects a staggering 93% of input data. Close behind is Google Assistant, which gathers 86% of user input data. These two long-standing, widely used assistants have set a high benchmark for data collection within the AI ecosystem, underscoring how pervasive data capture is across core interfaces many users interact with daily. The third position is taken by Duolingo, which collects 79% of the data input into its services. This is notable given Duolingo’s role as a popular language learning platform where personal progress and interaction data are central to the user experience and the platform’s adaptive features.
Following these leaders, Canva, an app primarily known for design and content creation, collects 64% of input data, demonstrating that even tools focused on productivity and creativity are substantial data sources in the current AI landscape. Otter, a widely used tool that transcribes and analyzes conversations for note-taking and meeting summaries, collects 57% of user input. Poe, a text-based AI chat interface, also collects 57%, illustrating how heavily conversational interfaces can lean on the data users provide during interactions.
Facetune, a mobile photo-editing app, collects 50% of input data, and the same percentage is reported for both Bing and DeepSeek, highlighting that search engines and general-purpose AI chatbots are also significant data collectors. Mem, a note-taking and knowledge-management app, collects 43% of input data, as do ELSA Speak (43%) and PhotoRoom (43%), showing substantial data collection across very different use cases, from language-learning speech coaching to photo editing.
The list continues with Trint (43%), reinforcing how transcription and video-editing platforms capture substantial user data. ChatGPT itself is reported at 36%, a figure it shares with Perplexity AI, Lensa, StarryAI, Wombo, Youper, FaceApp, and Luma AI, demonstrating that a broad spectrum of AI-enabled tools collects similar levels of user input. Speechify registers 29%, Pixai sits at 21%, and Clipchamp, a video editing platform, records the lowest rate in the group at 14%.
This spread across the dataset shows a clear pattern: the rise of newer generative AI platforms has not automatically made privacy practices worse across the board; the heaviest collectors remain the mature voice assistants, Amazon Alexa and Google Assistant. Yet the more consequential question concerns not only how much data is collected but what happens to that data after it is gathered. The study highlights that differences in data handling (what is collected, how it is used, and who it is shared with) are central to understanding overall privacy implications. The raw percentages reveal a baseline of extensive data capture, but the downstream processing choices determine the real privacy impact for users.
In-depth analysis of the data collection landscape reveals that the sheer volume of data required to train and improve AI models is a major driver behind these practices. The reliance on large-scale data sets means that many apps justify broad collection by pointing to performance enhancements, personalization, and improved predictive capabilities. However, the broader implication is that user privacy is increasingly tethered to the success of these models, and many users may not be fully aware of the breadth of data being captured or the purposes for which it is used.
Top AI Apps by Data Collection Rate
Ranked by the percentage of input data collected, the following list captures the extent of data capture across widely used AI apps in 2025. The numbers reflect the proportion of user-provided data that each app collects and processes as part of its regular operation, including training, improvement, analytics, and feature enhancement. The ranking reveals a wide spectrum, from near-total capture by some voice assistants to comparatively modest capture by certain video and image editing tools; a short sketch after the list shows one way to work with these figures.
- Amazon Alexa – 93%
- Google Assistant – 86%
- Duolingo – 79%
- Canva – 64%
- Otter – 57%
- Poe – 57%
- Facetune – 50%
- Bing – 50%
- DeepSeek – 50%
- Mem – 43%
- ELSA Speak – 43%
- PhotoRoom – 43%
- Trint – 43%
- ChatGPT – 36%
- Perplexity AI – 36%
- Lensa – 36%
- StarryAI – 36%
- Wombo – 36%
- Youper – 36%
- FaceApp – 36%
- Luma AI – 36%
- Speechify – 29%
- Pixai – 21%
- Clipchamp – 14%
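For readers who want to explore the ranking programmatically, here is a minimal Python sketch that encodes the reported percentages and buckets the apps into rough collection tiers. The tier thresholds are an illustrative choice of this article, not categories defined by the study.

```python
# Reported share of user input data collected by each app
# (Ecommerce Platforms study, 2025).
COLLECTION_RATES = {
    "Amazon Alexa": 93, "Google Assistant": 86, "Duolingo": 79,
    "Canva": 64, "Otter": 57, "Poe": 57,
    "Facetune": 50, "Bing": 50, "DeepSeek": 50,
    "Mem": 43, "ELSA Speak": 43, "PhotoRoom": 43, "Trint": 43,
    "ChatGPT": 36, "Perplexity AI": 36, "Lensa": 36, "StarryAI": 36,
    "Wombo": 36, "Youper": 36, "FaceApp": 36, "Luma AI": 36,
    "Speechify": 29, "Pixai": 21, "Clipchamp": 14,
}

def tier(rate: int) -> str:
    """Bucket an app into an illustrative tier; thresholds are arbitrary."""
    if rate >= 75:
        return "very high"
    if rate >= 50:
        return "high"
    if rate >= 30:
        return "moderate"
    return "lower"

# Print apps from heaviest to lightest collector, with their tier.
for app, rate in sorted(COLLECTION_RATES.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{app:18s} {rate:3d}%  ({tier(rate)})")
```

Run as-is, the script reproduces the ordering above and makes it easy to re-slice the data, for example by filtering to a single tier or category.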
The distribution across these apps shows a consistent trend: major, widely used AI platforms tend to collect significant portions of input data, with voice assistants leading the way. The data highlights how essential voice and conversational interfaces have become in the AI economy, where conversational data often holds unique value for model training and service refinement. At the same time, the presence of design, photo-editing, transcription, and image-generation tools on the list demonstrates that data collection permeates a broad swath of AI-enabled services used for daily tasks, creativity, communication, and productivity.
A deeper takeaway from this ranking is the persistent emphasis on data as a fundamental asset driving AI performance. While the most aggressive data collection appears with ubiquitous, always-on assistants, many other popular tools also rely on substantial input data to deliver personalized experiences, accurate results, and rapid improvements. This reality underscores the importance of user awareness and consent, because even seemingly benign tools can accumulate large data footprints that influence not only the immediate service but broader model capabilities.
Moreover, the presence of both long-standing platforms (like Google Assistant) and newer or specialized apps (like StarryAI or DeepSeek) indicates that the data collection phenomenon is not confined to a particular category of AI product. Instead, it reflects a shared industry practice: to extract meaningful signals from user inputs to drive engagement, retention, and monetization, which often includes data aggregation, cross-service sharing, and model refinement. With this context, users should be aware that the value proposition of AI tools frequently includes a trade-off between enhanced functionality and privacy considerations that may evolve as services update their data practices.
What Is User Data Being Used For?
Beyond simply identifying what data is collected, the Ecommerce Platforms analysis also examines how that data is utilized by ranked AI apps. The study reveals meaningful differences in data usage patterns, including how much information is shared with third-party advertisers versus how much is retained for the company’s own benefit. These usage patterns have direct implications for user privacy, as they determine the potential exposure of personal information to external entities and the likelihood of data-driven advertising or other monetization strategies.
For example, the study highlights Canva as a notable case not because of the magnitude of data collection but because of the purposes for which it uses data. Specifically, Canva reportedly shares 36% of the collected data with third-party advertisers, while an additional 43% is used for the company’s own benefit. This combination means that a majority of Canva’s collected data is channeled into internal value creation or external marketing, potentially increasing exposure to third-party partners and advertisers. The dual use case reveals how a single app can drive a combination of self-enhancement and external monetization through data sharing, with privacy trade-offs that users may not anticipate when they sign up for a seemingly straightforward design tool.
In contrast, Google presents a different data-use profile even though it captures a large share of user input. The study indicates that Google Assistant collects 86% of input data, but only 21% is shared with third-party advertisers, and 36% is used for the company's own benefit. This gap between data collection and data sharing highlights a privacy dynamic in which internal use is more prominent than external sharing, even as the data is still widely captured. While neither approach is ideal from a privacy standpoint, the difference suggests that the way data is used (internal enhancement versus external monetization) can significantly alter user risk exposure. These contrasts emphasize that the raw quantity of data collected is only part of the privacy equation; the downstream flow and purpose of that data are equally critical.
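To make the contrast concrete, the short Python sketch below restates the two reported data-use profiles and flags which use dominates each one. The figures are as reported above; note that the study's summary does not make fully clear whether the percentages are shares of all input data or of the collected subset, so treat this as an illustrative comparison only.

```python
# Reported data-use figures (percentages as stated in the write-up).
apps = {
    "Canva":            {"collected": 64, "to_advertisers": 36, "own_benefit": 43},
    "Google Assistant": {"collected": 86, "to_advertisers": 21, "own_benefit": 36},
}

for name, d in apps.items():
    print(f"{name}: collects {d['collected']}% of input; "
          f"{d['to_advertisers']}% tied to third-party advertising, "
          f"{d['own_benefit']}% used for the company's own benefit")
    if d["to_advertisers"] >= d["own_benefit"]:
        print("  -> external sharing dominates this profile")
    else:
        print("  -> internal use dominates this profile")
```

The comparison shows the pattern described above: Canva's reported advertiser share sits close to its internal-use share, while Google Assistant's profile tilts clearly toward internal use despite its much higher raw collection rate.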
The broader takeaway is that data collection and data use are multifaceted issues within AI ecosystems. Some services accumulate data heavily but restrict third-party sharing, while others monetize data more aggressively through advertising networks and external partnerships. Users should consider both the volume of data collected and the potential end use of that data when evaluating the privacy implications of AI apps. This dual focus helps clarify why a platform with high data collection rates does not automatically equate to proportionally higher external data exposure; conversely, even lower collection rates can entail significant privacy risks if a large portion is used for targeted advertising or cross-service profiling.
AI Terms and Conditions
In addition to examining data collection and usage, the Ecommerce Platforms study scrutinized the terms and conditions (T&Cs) that govern these AI platforms to understand readability, accessibility, and the time investment required to parse the agreements. The results reveal a stark reality: the average consumer’s experience with T&Cs for AI services is long, complex, and difficult to navigate. The length and complexity of these documents can influence users’ ability to understand what they are agreeing to, including what data is collected, how it is used, and with whom it may be shared.
Among the apps analyzed, Clipchamp stood out as the worst offender in reading time and complexity. Reading the full terms from start to finish reportedly took 3 hours and 16 minutes, largely due to jargon-laden language that made comprehension challenging. Such a time commitment implies that many users would simply skip the detailed terms, relying instead on summaries or implicit consent, which raises concerns about informed consent and awareness of data practices. The next-longest reading time belonged to Bing, at 2 hours and 20 minutes, a substantial but still shorter commitment than Clipchamp's. In contrast, the terms and conditions for Google Assistant can be completed in 56 minutes, still a lengthy document for most users but a far more manageable one.
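For a rough sense of how long these documents must be, the sketch below converts the reported reading times into approximate word counts, assuming an average adult silent-reading speed of about 238 words per minute. Both the assumed speed and the resulting counts are back-of-the-envelope estimates of this article, not figures from the study.

```python
# Convert reported T&C reading times into rough word-count estimates.
# WPM is an assumed average adult silent-reading speed, not from the study.
WPM = 238

reading_times_min = {
    "Clipchamp": 3 * 60 + 16,   # 3 h 16 min
    "Bing": 2 * 60 + 20,        # 2 h 20 min
    "Google Assistant": 56,     # 56 min
}

for app, minutes in reading_times_min.items():
    words = minutes * WPM
    print(f"{app}: {minutes} min of reading ≈ {words:,} words of terms")
# Under these assumptions, Clipchamp's terms come out near 47,000 words,
# roughly the length of a short novel.
```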
These findings underscore a broader issue: even as AI advances deliver powerful capabilities, the accompanying legal texts remain dense and opaque to the average user. The gap between the sophistication of the technology and the accessibility of its governing terms presents a barrier to true informed consent. Consumers may sign up for access to AI services with only a partial understanding of the data practices that will govern their interactions. For regulators and consumer advocates, these readability obstacles highlight the need for plain-language summaries, standardized disclosures, and clearer signaling of user rights and data-use practices within AI platforms. The overall implication is clear: technology is advancing faster than the governance around it, emphasizing the ongoing importance of transparency and user empowerment in data handling.
Suffice it to say, the study suggests that AI has not meaningfully improved the baseline privacy posture of big-tech data collection. If anything, the sheer volume of data required to train and refine sophisticated models means that collection is likely to intensify in the near term. The combination of high data-capture rates, complex terms, and varied data-use patterns points to a privacy landscape that will increasingly demand attention from policymakers, corporate transparency initiatives, and informed consumers who want to understand how their personal information is being leveraged by AI systems.
Implications for Consumers and Regulators
The findings carry important implications for both consumers and regulators. For users, the practical takeaway is that many everyday AI apps collect substantial portions of the data they input, and this data may be used for purposes beyond the immediate functionality of the service. The distinct patterns across apps—ranging from high data collection in voice assistants to substantial usage for the company’s own benefit in design and productivity tools—mean that travelers, students, professionals, and casual users alike may be encountering different privacy risk profiles depending on the app they choose. In a world where personalization and efficiency are increasingly valued, users must balance the appeal of AI-powered features with the understanding that their data can be recycled into training data, analytics, targeted advertising, or other monetization strategies.
From a regulatory perspective, the study underscores a persistent gap between user expectations and the realities of data handling in AI services. The broad variance in data-use practices, including the proportion shared with third-party advertisers and the extent to which data serves internal corporate objectives, points to a need for more consistent and transparent disclosure standards. Regulators may consider enhanced reporting requirements around how data is used, the specific purposes for which data is shared with third parties, and the safeguards in place to protect user privacy. Furthermore, the readability concerns highlighted by the T&Cs analysis suggest a push toward standardized, user-friendly disclosures that enable informed consent and empower users to opt out of non-essential data processing where feasible.
The broader arc of AI development implies that data collection will become even more central to model performance and product differentiation in the coming years. As models grow more capable and training data needs escalate, developers are incentivized to collect richer datasets, often spanning across multiple apps and services. This reality raises questions about data portability, cross-service privacy, and the potential for cumulative privacy impact when multiple apps are used in tandem. Policymakers and industry stakeholders must grapple with how to preserve innovation and user benefits while ensuring that data practices remain aligned with evolving privacy expectations and legal frameworks.
Conclusion
The 2025 study by Ecommerce Platforms spotlights a central truth about privacy in the AI era: the more capable these tools become, the more data they tend to require, and the more complex their governance becomes. The data reveals a clear hierarchy of data collection, with leading voice assistants capturing the highest shares of user input and a broad array of apps across design, photo editing, transcription, and AI-generated content following behind. This landscape underscores that the AI revolution is as much a data story as a technology story. The practical reality for users is that daily interactions with AI apps can involve substantial data capture, often without straightforward visibility into how that data is used, shared, or monetized.
When considering what users should do in light of these findings, several actionable steps emerge:
- Users should actively review privacy settings and data-sharing options within each app, seeking out granular controls where available to limit third-party sharing and model-training inputs.
- There is value in seeking out and supporting platforms that provide clear, plain-language explanations of data practices, along with accessible summaries that distill key points about data collection, usage, and sharing.
- Policymakers and industry groups should advocate for standardized disclosures that make it easier to compare how different AI apps collect and use data, including explicit percentages where possible and practical examples of common data-use scenarios.
- Developers should consider adopting privacy-by-design principles, implementing data minimization where feasible, and offering opt-out mechanisms for non-essential data collection without sacrificing core functionality or user experience.
- Researchers and privacy advocates can push for ongoing audits, transparency reports, and independent assessments that evaluate real-world data practices against stated policies, ensuring accountability in a rapidly evolving AI ecosystem.
In sum, as AI technologies become increasingly integrated into everyday tools and experiences, the imperative for transparent, user-centric data practices grows ever stronger. The study’s results serve as a clear reminder that data collection remains a foundational concern in AI adoption, warranting continued attention from users, developers, and regulators as the digital landscape evolves. A balanced approach—one that prioritizes innovation while upholding robust privacy safeguards—will be essential to sustaining trust and enabling responsible, long-term growth in the AI economy.