Apple’s AI efforts don’t have to be hampered by its commitment to user privacy. A blog post published Monday explains how the company can generate the data needed to train its large language models without the privacy violations caused by Apple itself reading people’s emails or messages.
It’s an indirect, opt-in system that takes advantage of the small AI models Apple builds into millions of users’ devices.
AI done wrong can be a privacy nightmare
A large language model (LLM) is trained through a process in which a neural network learns to predict the next word in a sentence by analyzing text. The process requires vast amounts of data. OpenAI, for instance, trained ChatGPT by scraping billions of words from the internet without paying anyone for access to their work.
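To make that idea concrete, here is a toy Python sketch of the next-word-prediction setup. The tiny corpus and the frequency-counting “model” are stand-ins for the neural networks and billions of words involved in real LLM training; only the shape of the task, turning raw text into (context, next word) pairs and predicting the next word, is the point.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the billions of words a real LLM trains on.
corpus = "would you like to play tennis tomorrow at eleven".split()

# Build (context, next word) training pairs -- the core of next-word prediction.
context_size = 2
pairs = [
    (tuple(corpus[i:i + context_size]), corpus[i + context_size])
    for i in range(len(corpus) - context_size)
]

# A real LLM feeds such pairs to a neural network; this toy model just
# counts how often each next word follows each context.
model = defaultdict(Counter)
for context, next_word in pairs:
    model[context][next_word] += 1

def predict(context):
    """Return the most frequently seen next word for this context."""
    return model[tuple(context)].most_common(1)[0][0]

print(predict(["like", "to"]))  # -> "play"
```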
Apple must go through a similar process to train the LLMs that power Apple Intelligence. An unethical company would simply feed the emails sent and received on iPhones and Macs into its training data, but Apple will not. As the company says over and over, “At Apple, we believe privacy is a fundamental human right.”
Apple Intelligence sticks with company’s privacy commitment
Instead, Apple will train its LLMs with what it calls “synthetic data”: messages “created to mimic the format and important properties of user data” that “do not contain any actual user generated content.” The problem with this method should be obvious: how can Apple be sure the synthetic data matches the way real people actually write?
A blog post from Apple’s Machine Learning Research describes a method for getting around this problem. It starts with the company generating many variations on a plausible message. The example it gives is, “Would you like to play tennis tomorrow at 11:30AM?”
It then sends these variations to a selection of Macs and iPhones running Apple Intelligence and asks whether any of them resemble messages already on each device. The device picks whichever variant is closest to an email or text it has access to, and reports that choice back to Apple.
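In rough terms, the device-side step looks something like the Python sketch below: compare each synthetic variant against messages held locally and pick the nearest. The embedding function, similarity measure, and sample data here are all invented placeholders, not Apple’s actual on-device machinery; only the overall shape of the comparison comes from Apple’s description.

```python
import math

def embed(text):
    # Placeholder embedding: real systems use a learned sentence encoder.
    # Here we just hash words into a small fixed-size vector.
    vec = [0.0] * 16
    for word in text.lower().split():
        vec[hash(word) % 16] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def closest_variant(synthetic_variants, local_messages):
    """Pick the synthetic variant most similar to any message on this device.
    The messages themselves never leave the device -- only the index of
    the winning variant would be reported back."""
    local_embeddings = [embed(m) for m in local_messages]
    scores = [
        max(cosine(embed(v), e) for e in local_embeddings)
        for v in synthetic_variants
    ]
    return scores.index(max(scores))

variants = [
    "Would you like to play tennis tomorrow at 11:30AM?",
    "Want to grab lunch on Friday?",
]
inbox = ["Tennis tomorrow at 11:30 works for me!"]
print(closest_variant(variants, inbox))  # -> 0
```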
“As a result of these protections, Apple can construct synthetic data that is reflective of aggregate trends, without ever collecting or reading any user email content,” the Mac-maker points out.
More privacy protections
While Apple promises that “the contents of the sampled emails never leave the device and are never shared with Apple,” some might be uncomfortable with even this indirect use of their emails to refine Apple’s synthetic data. The company points out that the process will run only on devices whose users have opted in to sending Device Analytics, so no one is forced to participate.
Also, Apple promises it will only give itself access to aggregate data. It will learn which of the message variants it generated resemble real ones on the largest number of devices, not the results from any specific device. So, for example, Apple might learn from this system that 937 iPhones have messages very similar to “Would you like to play tennis tomorrow at 11:30AM?” but its researchers won’t know which 937 those are out of the billion or so iPhones in use.
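As a rough illustration of how aggregate-only reporting can work, the Python sketch below uses randomized response, a standard privacy technique in which each device sometimes reports a random answer instead of its real one. Individual reports become deniable, while the overall tally stays accurate enough to be useful. This is a textbook mechanism chosen for illustration, not necessarily the one Apple ships.

```python
import random

NUM_VARIANTS = 4
FLIP_PROB = 0.25  # chance a device reports a random variant instead of its real choice

def private_report(true_choice):
    """Randomized response: usually report the truth, sometimes report noise.
    The server can never be sure whether any single report was genuine."""
    if random.random() < FLIP_PROB:
        return random.randrange(NUM_VARIANTS)
    return true_choice

# Simulate a fleet of devices: 937 whose closest match really is variant 0,
# plus thousands whose true choices are spread randomly across all variants.
true_choices = [0] * 937 + [random.randrange(NUM_VARIANTS) for _ in range(9000)]
reports = [private_report(c) for c in true_choices]

tally = [reports.count(v) for v in range(NUM_VARIANTS)]
print(tally)  # variant 0 still clearly leads in aggregate
```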
The blog post from Apple’s Machine Learning Research didn’t reveal when the iPhone maker intends to start using this system, but Bloomberg reported Monday that “the company will roll out the new system in an upcoming beta version of iOS and iPadOS 18.5 and macOS 15.5.”