php - How do I properly separate user data when implementing data anonymization in an RDBMS?

I'm trying to implement data anonymization in MySQL and PHP.

At the moment I'm separating the data by encrypting the foreign key/ID with the user's password and saving it in the 'user' account table. But I quickly realized that when a user is first created and I insert their first rows into the other tables, I can match them together by row count.

What I thought of doing is to randomly swap the user account details each time a new account is created - but this feels very inefficient.

I cannot find anything relevant online, such as a basic explanation of how to properly separate user data so that it is completely anonymized. Can anyone explain what goes into achieving data anonymization in an RDBMS architecture?

Thanks a lot in advance!

EDIT:

To be clearer, let's imagine I have two tables: one holding the user email and an encrypted unique foreign key (account-table), the other holding user preferences/info (this table will always hold 1 row per user).

Now let's say I add a new user to account-table and their data to the user-preferences/info table. In practice, I can still tell by counting the table rows that this info belongs to that user.

I can't encrypt all of this data, because some of it might be shown publicly in anonymous form. And even if I could, making the rows unrelated to each other would make it harder for anyone who gets hold of the encrypted data to match it to a user.

I'm looking for complete anonymity and privacy not just by encryption, but by separation of user-data. I want data to be completely private to the user - possibly without duplicating any of it in multiple places.

Would the random swap be the best approach in this case? (Pick a random existing user, copy their data into the new row, and overwrite their original row with the new user's data.)

Answer

Solution:

You need to look at differential privacy. The idea here is to preserve the original data in one record, but add carefully randomised data that looks very similar to it.

For example, imagine you were storing each user's year of birth. If you add a single user record and a single, separate, unlinked birth year record, it's very likely (as you say) that you will be able to reverse the relationship and reassociate the two. However, you could add multiple records with randomised values clustered around the real value (but not exactly centred on it, as that's statistically reversible too). So for user1, born in 1970, you could add records for 1968, 1969, 1970, and 1971; for user2, born in 1980, you could add 1979, 1980, 1981, and 1982. You then can't tell which record is the correct one, but on average the values are reasonably accurate. Note that this works even for a single record.
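The decoy idea above could be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: the function name `birthYearWithDecoys` and the parameters `$count` and `$spread` are hypothetical, and the approach gives no formal privacy guarantee.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the real year hidden among $count - 1 decoys.
 * The window of candidate years is shifted by a random offset so the real
 * value does not always sit in the middle of the cluster.
 */
function birthYearWithDecoys(int $realYear, int $count = 4, int $spread = 3): array
{
    // The window holds 2 * $spread + 1 distinct years, so $count cannot exceed that.
    if ($count < 2 || $count > 2 * $spread + 1) {
        throw new InvalidArgumentException('count must be between 2 and 2 * spread + 1');
    }

    // Random shift in [-$spread, $spread]; the window always still contains $realYear.
    $shift = random_int(-$spread, $spread);
    $low   = $realYear - $spread + $shift;
    $high  = $realYear + $spread + $shift;

    // Use array keys as a set so every stored year is distinct.
    $years = [$realYear => true];
    while (count($years) < $count) {
        $years[random_int($low, $high)] = true;
    }

    // Sort so that insertion order does not reveal which year was the real one.
    $years = array_keys($years);
    sort($years);

    return $years;
}

// Each returned year would be stored as its own row, with no link to the user.
print_r(birthYearWithDecoys(1970));
```

Sorting (or otherwise shuffling) before storage matters here: if the real value were always inserted first, row order alone would give it away, which is the same row-count problem the question describes.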

But there is a further concern here: exactly how anonymous do you want records to be? The degree of anonymity you need may depend on the nature of the data you're processing. This simple example only looks at a single field, one that may indeed not allow reidentification when used alone, but might provide sufficient information when combined with other fields, even if they use a similar approach.

As you may gather, this is a difficult and subtle thing to design effectively; the algorithm for figuring out how much noise you need to add is something that has won mathematics medals!
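To give a feel for how the amount of noise is usually tied to a privacy parameter, here is a minimal sketch of the textbook Laplace mechanism applied to a count query. The function names `laplaceNoise` and `noisyCount` are hypothetical, a sensitivity of 1 is an assumption (one user changes the count by at most 1), and naive floating-point sampling like this is known to have weaknesses, so treat it as an illustration rather than something to deploy.

```php
<?php
declare(strict_types=1);

/** Draws one sample from Laplace(0, $scale) by inverse transform sampling. */
function laplaceNoise(float $scale): float
{
    // Uniform value strictly inside (-0.5, 0.5); 2**53 keeps the division exact in a float.
    $u = random_int(1, 2 ** 53 - 1) / (2 ** 53) - 0.5;

    // Inverse CDF of the Laplace distribution; ($u <=> 0) is the sign of $u.
    return -$scale * ($u <=> 0) * log(1 - 2 * abs($u));
}

/**
 * Hypothetical example: releases a count with Laplace noise of scale 1 / $epsilon.
 * Assumes sensitivity 1. Smaller $epsilon means more noise and more privacy.
 */
function noisyCount(int $trueCount, float $epsilon): float
{
    if ($epsilon <= 0) {
        throw new InvalidArgumentException('epsilon must be positive');
    }

    return $trueCount + laplaceNoise(1.0 / $epsilon);
}

// Example: how many users were born in 1970, released with epsilon = 0.5.
echo noisyCount(42, 0.5), PHP_EOL;
```

Note that this protects aggregate answers (counts, averages) rather than individual rows, which is a different shape of problem from hiding which row belongs to which user.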

Another approach is homomorphic encryption: you store the real data without being able to read it yourself, yet you can still do things like searching without ever seeing the underlying values.

Since you're in PHP, you might find that CipherSweet, a library for searchable field-level encryption, provides a useful toolkit.
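To show the general idea of searching on data you cannot read, here is a minimal keyed-hash lookup index using only PHP core functions. This is not CipherSweet's API and it is not homomorphic encryption; the function name `blindIndex` and the choice of an 8-byte truncation are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: derives a short, keyed lookup value for an email.
 * Without $indexKey, the stored value cannot be recomputed from a guessed email.
 */
function blindIndex(string $email, string $indexKey, int $bytes = 8): string
{
    // Normalise so that equivalent inputs map to the same index value.
    $normalised = strtolower(trim($email));

    // Keyed hash (HMAC-SHA256), truncated to $bytes bytes and hex-encoded.
    $mac = hash_hmac('sha256', $normalised, $indexKey, true);

    return bin2hex(substr($mac, 0, $bytes));
}

// Assumption: in a real system the key is loaded from a secrets store,
// kept outside the database, and not generated on each request.
$indexKey = random_bytes(32);

$stored = blindIndex('Alice@Example.com', $indexKey);   // saved next to the encrypted email
$lookup = blindIndex(' alice@example.com ', $indexKey); // computed at search time

var_dump(hash_equals($stored, $lookup)); // bool(true)
```

The email itself would be stored encrypted in a separate column, and queries would use `WHERE email_index = ?` against the derived value. Truncating the MAC makes collisions possible, which trades some lookup precision for less information leaked by the index.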
