I'm trying to implement data anonymization in MySQL and PHP.
At the moment I'm separating the data by encrypting the foreign key/ID with the user's password and saving it in the 'user' account table. But I quickly realized that when a user is first created and I insert their first rows into the other tables, I can still match them together by row count.
What I thought of doing is randomly swapping the user account details each time a new account is created, but this feels very inefficient.
I cannot find anything online that gives a basic explanation of how to properly separate user data so that it is completely anonymized. Can anyone explain what goes into achieving data anonymization in an RDBMS architecture?
Thanks a lot in advance!
EDIT:
To be clearer, let's imagine I have two tables: one holding the user email and an encrypted unique foreign key (account-table), the other holding user preferences/info (this table always holds one row per user).
Now let's say I add a new user to account-table and their data to the user-preferences/info table. Just by counting the rows in each table, I can still tell that this info belongs to that user.
I can't encrypt all of this data, because some of it might be shown publicly in anonymous form. Even so, making the rows unrelated to each other would make it harder for anyone who gets hold of the encrypted data to match it to a user.
I'm looking for complete anonymity and privacy not just by encryption, but by separation of user-data. I want data to be completely private to the user - possibly without duplicating any of it in multiple places.
Would the random swap be the best approach in this case? (Copy a randomly picked user's data into the new row, and overwrite their original row with the new data.)
You need to look at differential privacy. The idea here is to preserve the original data in one record, but add carefully randomised data that looks very similar to it.
For example, imagine you were storing each user's year of birth. If you add a single user record and an unrelated, separate single birth year record, it's very likely (as you say) that you will be able to reverse the relationship and reassociate the two. However, you could add multiple records with randomised values clustered around the real value (but not exactly centred on it, as that's statistically reversible too). So user1, born in 1970, could have records for 1968, 1969, 1970, and 1971, and user2, born in 1980, could have values of 1979, 1980, 1981, and 1982. You then can't tell which record is exactly correct, but on average the values are reasonably accurate. Note that this even works for a single record.
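As a rough illustration of the decoy idea in that example, here is a minimal PHP sketch. The function name `decoyYears` and the window size of four are assumptions made for the example, and this is only the "cluster of plausible values" trick described above; it does not by itself give a formal differential-privacy guarantee.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: return $count consecutive years that include $realYear
// at a randomly chosen position, so the real value is not always in the centre.
function decoyYears(int $realYear, int $count = 4): array
{
    // Where in the window the real year falls (0 = first, $count - 1 = last).
    $offset = random_int(0, $count - 1);
    $start = $realYear - $offset;
    $years = range($start, $start + $count - 1);

    // Shuffle so that storage order does not hint at which value is real.
    shuffle($years);

    return $years;
}

// Each call returns four consecutive years, one of which is 1970.
print_r(decoyYears(1970));
```

Each returned value would be stored as its own row, so someone reading the table sees several equally plausible birth years rather than one.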
But there is a further concern here: exactly how anonymous do you want records to be? The degree of anonymity you need may depend on the nature of the data you're processing. This simple example only looks at a single field - one that may indeed not allow reidentification when used alone, but might provide sufficient information when combined with other fields, even if they use a similar approach.
As you may gather, this is a difficult and subtle thing to design effectively; the algorithm for figuring out how much noise you need to add is something that has won mathematics medals!
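To give a feel for what "how much noise" means in practice, one standard building block in differential privacy is the Laplace mechanism: for an aggregate such as a count, you add Laplace-distributed noise whose scale is the query's sensitivity divided by a privacy parameter epsilon. The sketch below is only an illustration under those assumptions. The function names are made up, and a real deployment would need more care around floating-point behaviour and privacy budgeting.

```php
<?php
declare(strict_types=1);

// Draw one sample from a Laplace distribution centred on 0 with the given scale.
function laplaceNoise(float $scale): float
{
    // Uniform value strictly inside (-0.5, 0.5), so log() never receives 0.
    $u = (random_int(0, 2 ** 31 - 1) + 0.5) / 2 ** 31 - 0.5;
    $sign = $u < 0 ? -1.0 : 1.0;

    // Inverse-CDF sampling for the Laplace distribution.
    return -$scale * $sign * log(1.0 - 2.0 * abs($u));
}

// Hypothetical helper: release a count with noise of scale sensitivity / epsilon.
// Adding or removing one user changes a count by at most 1, hence sensitivity 1.
function noisyCount(int $trueCount, float $epsilon, float $sensitivity = 1.0): float
{
    return $trueCount + laplaceNoise($sensitivity / $epsilon);
}

// Example: publish "how many users were born in 1970" without the exact figure.
echo noisyCount(42, 0.5), PHP_EOL;
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon means more accurate but less private results.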
Another approach is to store the real data encrypted, so that the server never sees it in the clear, using homomorphic or searchable encryption. This still lets you do things like searching without being able to see the underlying data.
Since you're in PHP, you might find CipherSweet a useful toolkit: it provides searchable field-level encryption built on blind indexes.
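Searching over encrypted fields is commonly done with a "blind index": a truncated keyed hash of the plaintext, stored next to the ciphertext and compared on lookup. The sketch below shows only that idea using PHP's built-in `hash_hmac`. It does not use CipherSweet's API, and the function name `blindIndex`, the demo key and the 8-byte truncation are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: derive a truncated keyed hash ("blind index") of a value.
// The index can be stored and matched without revealing the plaintext.
function blindIndex(string $value, string $indexKey, int $bytes = 8): string
{
    // Normalise so that trivially different inputs map to the same index.
    $normalized = strtolower(trim($value));
    $mac = hash_hmac('sha256', $normalized, $indexKey, true);

    // Truncating leaks less about the value, at the cost of occasional collisions.
    return bin2hex(substr($mac, 0, $bytes));
}

// Demo key only; in practice this would be a random secret kept outside the database.
$indexKey = hash('sha256', 'demo-only index key', true);

$stored = blindIndex('Alice@example.com', $indexKey);   // saved alongside the ciphertext
$lookup = blindIndex(' alice@example.com ', $indexKey); // computed at query time

var_dump(hash_equals($stored, $lookup)); // bool(true)
```

In a query, you would compare the stored index column against the lookup value, then decrypt the matching rows in the application to confirm the match.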