Privacy-preserving technology


Traditional methods such as randomization, data shuffling, and data transformation protect privacy to a certain extent, but the risk of data disclosure remains.

Anonymization is a privacy-preserving technology that overcomes the limitations of these traditional methods.

Anonymization effectively makes a person "disappear into the crowd". So, how many people make up the "crowd"?

This is the idea behind the k-anonymization technique, which makes k records in the dataset look similar, so that each person's private data is hidden among k similar records.

If a person's information is indistinguishable from that of k-1 other individuals who also appear in the data, the published data satisfies k-anonymity.
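As a minimal sketch (the function and field names are illustrative), k-anonymity can be verified by counting how many records share each combination of quasi-identifier values:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True when every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Three records sharing the same generalized zip code and age range
# form one group of size 3: 3-anonymous, but not 4-anonymous.
rows = [
    {"zip": "021**", "age": "30-39", "disease": "flu"},
    {"zip": "021**", "age": "30-39", "disease": "cold"},
    {"zip": "021**", "age": "30-39", "disease": "asthma"},
]
print(is_k_anonymous(rows, ["zip", "age"], 3))  # True
print(is_k_anonymous(rows, ["zip", "age"], 4))  # False
```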

k-anonymization mitigates the risk of linkage attacks.

Identifiers can be transformed by techniques such as generalization and suppression.

With suppression, part or all of an attribute's value is replaced with "*"; with generalization, a single attribute value is replaced by a value representing a broader range or category.

For example, many web applications replace the middle four digits of a user's mobile phone number with "*" when displaying it.
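A minimal sketch of both operations (the helper names and the 11-digit phone format are assumptions for illustration):

```python
def suppress_phone(number):
    """Suppression: replace the middle four digits of an 11-digit number with '*'."""
    return number[:3] + "****" + number[7:]

def generalize_age(age):
    """Generalization: replace an exact age with a ten-year range."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

print(suppress_phone("13812345678"))  # 138****5678
print(generalize_age(34))             # 30-39
```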

Higher generalization allows more records to map to the same value, enabling a higher level of privacy, although this can significantly reduce data utility.

Moreover, generalizing an attribute with a single strategy across all records may not be the best approach.

This privacy-preserving data transformation is called recoding.

In global recoding, a given detailed value must map to the same generalized value in every record; local recoding allows the same detailed value to map to different generalized values in different anonymous groups.

Although k-anonymization prevents linkage attacks, so that attackers cannot link records to other databases with high certainty, it can still reveal sensitive information.

If all k individuals in a group share the same sensitive value, an attacker learns that value without re-identifying anyone; this is called a homogeneity attack.

Similarly, if an attacker has additional information about a person, the record may be re-identified with high probability, leading to a background knowledge attack.

k-anonymity therefore provides no formal guarantee against such attacks.

Can optimal k-anonymity be achieved by modifying a minimal amount of data?

For multidimensional data, achieving optimal k-anonymity is an NP-hard problem.

Furthermore, choosing a value of k that gives an acceptable level of anonymity presents another challenge.

Achieving k-anonymity loses information through the generalization or suppression of records: the higher the generalization, the lower the utility.

To overcome these shortcomings, several variants of k-anonymity have been proposed.

l-diversity is one such variant: it requires every sensitive attribute to take at least l distinct values within each group.

This ensures that sensitive attributes are well represented, but it may involve suppressing or adding records, which can alter the distribution of the data and raises concerns about the validity of statistical conclusions drawn from the dataset.
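The simplest ("distinct") form of l-diversity can be checked directly; the names below are illustrative:

```python
def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every quasi-identifier group must contain
    at least l distinct values of the sensitive attribute."""
    groups = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# A 3-anonymous group whose sensitive values are all identical is
# vulnerable to a homogeneity attack and fails even 2-diversity.
rows = [
    {"zip": "021**", "age": "30-39", "disease": "flu"},
    {"zip": "021**", "age": "30-39", "disease": "flu"},
    {"zip": "021**", "age": "30-39", "disease": "flu"},
]
print(is_l_diverse(rows, ["zip", "age"], "disease", 2))  # False
```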

A stricter requirement, known as t-closeness, demands that the distribution of the sensitive attribute in any group of k records be not only l-diverse but also close to the distribution of that attribute in the entire dataset, with the distance between the two distributions bounded by a threshold t.
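As a sketch of measuring that distance: the total variation distance between the group's distribution and the overall distribution is used below to keep the example short (the t-closeness literature typically uses the Earth Mover's Distance); all names are illustrative:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def variation_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

overall = distribution(["flu", "flu", "cancer", "flu"])  # flu 0.75, cancer 0.25
group = distribution(["flu", "cancer"])                  # flu 0.50, cancer 0.50
print(variation_distance(overall, group))  # 0.25 -- compare against threshold t
```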

The dimensionality of the data remains a challenge: for high-dimensional data such as time series, it is quite difficult to provide the same privacy protection as for low-dimensional data.

Anonymization has been applied in many scenarios involving the release of sensitive data, and its applications have expanded from relational databases to composite structures such as graphs.

This section discusses the choice of k, some practical issues in publishing anonymized data, quasi-identifiers, the ideal amount of generalization to achieve the desired anonymity, and how to k-anonymize effectively.

4.1.1 Correct choice of k

In the United States, the Health Insurance Portability and Accountability Act (HIPAA) sets the standard for protecting sensitive patient data and effectively defines 20,000 as the standard value of k for k-anonymity.

The Family Educational Rights and Privacy Act (FERPA) sets standards for protecting the personal information of students and their families; a k of 5 or 10 is recommended to prevent disclosure.

This shows how much the choice of k can vary.

For applications covered by such regulations, the choice of k is predefined by the relevant authority.

However, for applications without regulatory requirements, choosing a k that provides the right privacy-versus-utility trade-off is a challenge.

One way to choose k is to vary its value over a range and measure the resulting change in the dataset's generalized information loss (a measure of utility); the value of k that yields an acceptable information loss is then an appropriate choice.
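A toy version of this sweep, with a deliberately simplified information-loss metric (the function name and metric are assumptions for illustration):

```python
def generalized_info_loss(ages, k):
    """Toy generalized information loss: partition sorted ages into runs of k
    and average each run's value range, normalized by the full range.
    (Illustration only; the final run may hold fewer than k values.)"""
    ages = sorted(ages)
    total_range = (max(ages) - min(ages)) or 1
    loss = 0.0
    for i in range(0, len(ages), k):
        group = ages[i:i + k]
        loss += (group[-1] - group[0]) / total_range * len(group)
    return loss / len(ages)

ages = [21, 22, 25, 30, 41, 42, 45, 60]
# Loss grows with k: more generalization means less utility, so one can
# pick the largest k whose loss is still acceptable.
for k in (2, 4, 8):
    print(k, round(generalized_info_loss(ages, k), 3))
```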

Nevertheless, finding the optimal value of k remains an open problem; current research includes probabilistic models and multi-objective optimization models.

Approximation algorithms can achieve k-anonymity but do not scale well; probabilistic k-anonymity techniques, on the other hand, provide time-optimal k-anonymization algorithms based on dynamic programming.

Heuristics can also produce valid results.

The current focus is on AI-driven analytics, where the definitions of privacy and data protection have changed significantly; this highlights the need for stronger guarantees that cover a wider range of applications.

4.1.2 Identification of Quasi-Identifiers

The identification of quasi-identifiers is a major issue as it directly affects the effectiveness of k-anonymity techniques.

If a set of attributes can single out individual records, that attribute set is a potential quasi-identifier.

As more information is added, a large number of records may become identifiable.

As the dimensionality of the data increases, the selection of quasi-identifiers becomes more complex.

The question becomes even more challenging because of uncertainty about additional data published by others; in such cases, some published attributes must be treated as quasi-identifiers.

4.1.3 The ideal amount of generalization to achieve the desired anonymization

The ideal amount of generalization depends on what information is publicly available.

Some organizations publish information in the public domain for greater transparency and to make it easier for people to obtain their data.

In doing so, these organizations may inadvertently release information that should not have been provided, creating opportunities for adversaries to misuse it.

Therefore, organizations publishing personal data must apply a sufficiently high degree of generalization to prevent re-identification through linkage attacks.

Linkage attacks show that simply removing identifiers does not protect privacy, which is why k-anonymity has become a prominent privacy-preserving technique.

Here, generalization is performed on real values rather than fabricated ones, which makes it more acceptable than many other strategies.

Furthermore, k-anonymity and its variants can limit linkage, homogeneity, and background knowledge attacks.

From an industry perspective, k-anonymity has gained wider popularity.

Anonymization techniques do have some drawbacks, such as information loss.

Furthermore, generalization requires building a generalization hierarchy (taxonomy tree) for each quasi-identifier in the dataset, which requires input from domain experts even when the hierarchy is generated automatically.

The appropriate generalization level for each attribute may also vary with the use case.

With increasing computing power and the growing availability of digital datasets, the risk of personal data being re-identified remains.