Standards for Privacy of Individually Identifiable Health Information
G. Section 164.514--Other Requirements Relating to Uses and Disclosures
of Protected Health Information
1. De-Identification of Protected Health Information
December 2000 Privacy Rule
At Sec. 164.514(a)-(c), the Privacy Rule permits a covered entity
to de-identify protected health information so that such information
may be used and disclosed freely, without being subject to the Privacy
Rule's protections. Health information is de-identified, or not
individually identifiable, under the Privacy Rule, if it does not
identify an individual and if the covered entity has no reasonable
basis to believe that the information can be used to identify an
individual. In order to meet this standard, the Privacy Rule provides
two alternative methods for covered entities to de-identify protected
health information.
First, a covered entity may demonstrate that it has met the standard
if a person with appropriate knowledge and experience applying generally
acceptable statistical and scientific principles and methods for
rendering information not individually identifiable makes and documents
a determination that there is a very small risk that the information
could be used by others to identify a subject of the information.
The preamble to the Privacy Rule refers to two government reports
that provide guidance for applying these principles and methods,
including describing types of techniques intended to reduce the
risk of disclosure that should be considered by a professional when
de-identifying health information. These techniques include removing
all direct identifiers, reducing the number of variables on which
a match might be made, and limiting the distribution of records
through a "data use agreement" or "restricted access
agreement" in which the recipient agrees to limits on who can
use or receive the data.
Alternatively, covered entities may choose to use the Privacy Rule's
safe harbor method for de-identification. Under the safe harbor
method, covered entities must remove all 18 enumerated
identifiers and have no actual knowledge that the information remaining
could be used, alone or in combination, to identify a subject of
the information. The identifiers that must be removed include direct
identifiers, such as name, street address, and social security number,
as well as other identifiers, such as birth date, admission and
discharge dates, and five-digit zip code. The safe harbor requires
removal of geographic subdivisions smaller than a State, except
for the initial three digits of a zip code if the geographic unit
formed by combining all zip codes with the same initial three digits
contains more than 20,000 people. In addition, age, if less than
90, gender, ethnicity, and other demographic information not listed
may remain in the information. The safe harbor is intended to provide
covered entities with a simple, definitive method that does not
require much judgment by the covered entity to determine if the
information is adequately de-identified.
The Privacy Rule also allows for the covered entity to assign a
code or other means of record identification to allow de-identified
information to be re-identified by the covered entity, if the code
is not derived from, or related to, information about the subject
of the information. For example, the code cannot be a derivation
of the individual's social security number, nor can it be otherwise
capable of being translated so as to identify the individual. The
covered entity also may not use or disclose the code for any other
purpose, and may not disclose the mechanism (e.g., algorithm or
other tool) for re-identification.
The Department is cognizant of the increasing capabilities and
sophistication of electronic data matching used to link data elements
from various sources and from which, therefore, individuals may
be identified. Given this increasing risk to individuals' privacy,
the Department included in the Privacy Rule the above stringent
standards for determining when information may flow unprotected.
The Department also wanted the standards to be flexible enough so
the Privacy Rule would not be a disincentive for covered entities
to use or disclose de-identified information wherever possible.
The Privacy Rule, therefore, strives to balance the need to protect
individuals' identities with the need to allow de-identified databases
to be useful.
March 2002 NPRM
The Department heard a number of concerns regarding the de-identification
standard in the Privacy Rule. These concerns generally were raised
in the context of using and disclosing information for research,
public health purposes, or for certain health care operations. In
particular, concerns were expressed that the safe harbor method
for de-identifying protected health information was so stringent
that it required removal of many of the data elements that were
essential to analyses for research and these other purposes. The
comments, however, demonstrated little consensus as to which data
elements were needed for such analyses and were largely silent regarding
the feasibility of using the Privacy Rule's alternative statistical
method to de-identify information.
Based on the comments received, the Department was not convinced
of the need to modify the safe harbor standard for de-identified
information. However, the Department was aware that a number of
entities were confused by potentially conflicting provisions within
the de-identification standard. These entities argued that, on the
one hand, the Privacy Rule treats information as de-identified if
all listed identifiers on the information are stripped, including
any unique, identifying number, characteristic, or code. Yet, the
Privacy Rule permits a covered entity to assign a code or other
record identification to the information so that it may be re-identified
by the covered entity at some later date.
The Department did not intend such a re-identification code to
be considered one of the unique, identifying numbers or codes that
prevented the information from being de-identified. Therefore, the
Department proposed a technical modification to the safe harbor
provisions explicitly to except the re-identification code or other
means of record identification permitted by Sec. 164.514(c) from
the listed identifiers (Sec. 164.514(b)(2)(i)(R)).
Overview of Public Comments
The following provides an overview of the public comments received
on this proposal. Additional comments received on this issue are
discussed below in the section entitled, "Response to Other
Public Comments."
All commenters who addressed the proposed clarification that the safe
harbor re-identification code is not an enumerated identifier supported
it.
Final Modifications
Based on the Department's intent that the re-identification code
not be considered one of the enumerated identifiers that must be
excluded under the safe harbor for de-identification, and the public
comment supporting this clarification, the Department adopts the
provision as proposed. The re-identification code or other means
of record identification permitted by Sec. 164.514(c) is expressly
excepted from the listed safe harbor identifiers at Sec. 164.514(b)(2)(i)(R).
Response to Other Public Comments
Comment: One commenter asked if data can be linked inside
the covered entity and a dummy identifier substituted for the actual
identifier when the data is disclosed to the external researcher,
with control of the dummy identifier remaining with the covered
entity.
Response: The Privacy Rule does not restrict linkage of
protected health information inside a covered entity. The model
that the commenter describes for the dummy identifier is consistent
with the re-identification code allowed under the Rule's safe harbor
so long as the covered entity does not generate the dummy identifier
using any individually identifiable information. For example, the
dummy identifier cannot be derived from the individual's social
security number, birth date, or hospital record number.
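As an illustration of the dummy-identifier model described in this
response, the following minimal Python sketch assigns each record a random
code that is not computed from any patient information and keeps the
crosswalk inside the covered entity. The class name, method names, and
sample record keys are hypothetical and are not drawn from the Rule; the
record key is used only to look up the internal table, never as input to
the code itself.

```python
import secrets


class ReidentificationRegistry:
    """Hypothetical in-house crosswalk between records and dummy codes.

    Assumption: this object (the re-identification mechanism) stays inside
    the covered entity and is never disclosed with the data.
    """

    def __init__(self):
        self._code_by_record = {}
        self._record_by_code = {}

    def code_for(self, record_key: str) -> str:
        """Return the dummy code for a record, creating one if needed."""
        if record_key not in self._code_by_record:
            # The code is random: it is not derived from the record key,
            # social security number, birth date, or any other identifier.
            code = secrets.token_hex(16)
            while code in self._record_by_code:
                code = secrets.token_hex(16)
            self._code_by_record[record_key] = code
            self._record_by_code[code] = record_key
        return self._code_by_record[record_key]

    def reidentify(self, code: str) -> str:
        """Map a dummy code back to the internal record key."""
        return self._record_by_code[code]


registry = ReidentificationRegistry()
dummy = registry.code_for("internal-record-A")  # hypothetical record key
```

Only the value of `dummy` would accompany the disclosed data; the registry
that maps codes back to records remains with the covered entity.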
Comment: Several commenters who supported the creation of
de-identified data for research based on removal of facial identifiers
asked if a keyed-hash message authentication code (HMAC) can be
used as a re-identification code even though it is derived from
patient information, because it is not intended to re-identify the
patient and it is not possible to identify the patient from the
code. The commenters stated that use of the keyed-hash message authentication
code would be valuable for research, public health and bio-terrorism
detection purposes where there is a need to link clinical events
on the same person occurring in different health care settings (e.g.,
to avoid double counting of cases or to observe long-term outcomes).
These commenters referenced Federal Information Processing Standard
(FIPS) 198: "The Keyed-Hash Message Authentication Code."
This standard describes a keyed-hash message authentication code
(HMAC) as a mechanism for message authentication using cryptographic
hash functions. The HMAC can be used with any iterative approved
cryptographic hash function, in combination with a shared secret
key. A hash function is an approved mathematical function that maps
a string of arbitrary length (up to a pre-determined maximum size)
to a fixed length string. It may be used to produce a checksum,
called a hash value or message digest, for a potentially long string
or message.
According to the commenters, the HMAC can only be breached when
the key and the identifier from which the HMAC is derived and the
de-identified information attached to this code are known to the
public. It is common practice that the key is limited in time and
scope (e.g., only for the purpose of a single research query) and
that data not be accumulated with such codes (with the code needed
for joining records being discarded after the de-identified data
has been joined).
Response: The HMAC does not meet the conditions for use
as a re-identification code for de-identified information. It is
derived from individually identified information and it appears
the key is shared with or provided by the recipient of the data
in order for that recipient to be able to link information about
the individual from multiple entities or over time. Since the HMAC
allows identification of individuals by the recipient, disclosure
of the HMAC violates the Rule. It is not solely the public's access
to the key that matters for these purposes; the covered entity may
not share the key to the re-identification code with anyone, including
the recipient of the data, regardless of whether the intent is to
facilitate re-identification or not.
The HMAC methodology, however, may be used in the context of the
limited data set, discussed below. The limited data set contains
individually identifiable health information and is not a de-identified
data set. Creation of a limited data set for research with a data
use agreement, as specified in Sec. 164.514(e), would not preclude
inclusion of the keyed-hash message authentication code in the limited
data set. The Department encourages inclusion of the additional
safeguards mentioned by the commenters as part of the data use agreement
whenever the HMAC is used.
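For readers unfamiliar with the mechanism, a keyed-hash linkage code of
the kind the commenters describe might be computed as in the following
minimal Python sketch. The choice of SHA-256 as the underlying hash, the
function name, and the sample key and identifier are assumptions made for
illustration only; FIPS 198 allows any approved iterative hash function.
Because the output is derived from patient information, such a code, per
the response above, does not qualify as a re-identification code for
de-identified data and is relevant only to a limited data set under a
data use agreement.

```python
import hashlib
import hmac


def linkage_code(secret_key: bytes, patient_identifier: str) -> str:
    """Compute an HMAC over a patient identifier.

    SHA-256 is an assumed choice of hash function for this sketch.
    """
    message = patient_identifier.encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical key and identifier, for illustration only.
key = b"example-key-for-one-research-query"
code_a = linkage_code(key, "patient-0001")
code_b = linkage_code(key, "patient-0001")
```

The same key and identifier always yield the same code, which is what
allows records about one person to be joined across settings, and also
why anyone holding the key and a candidate identifier can test for a
match.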
Comment: One commenter requested that HHS update the safe
harbor de-identification standard with prohibited 3-digit zip codes
based on 2000 Census data.
Response: The Department stated in the preamble to the December
2000 Privacy Rule that it would monitor such data and the associated
re-identification risks and adjust the safe harbor as necessary.
Accordingly, the Department provides such updated information in
response to the above comment. The Department notes that these three-digit
zip codes are based on the five-digit Zip Code Tabulation
Areas created by the Census Bureau for the 2000 Census. This new
methodology also is briefly described below, as it will likely be
of interest to all users of data tabulated by zip code.
The Census Bureau will not be producing data files containing U.S.
Postal Service zip codes either as part of the Census 2000 product
series or as a post Census 2000 product. However, due to the public's
interest in having statistics tabulated by zip code, the Census
Bureau has created a new statistical area called the Zip Code Tabulation
Area (ZCTA) for Census 2000. The ZCTAs were designed to overcome
the operational difficulties of creating a well-defined zip code
area by using Census blocks (and the addresses found in them) as
the basis for the ZCTAs. In the past, there has been no correlation
between zip codes and Census Bureau geography. Zip codes can cross
State, place, county, census tract, block group and census block
boundaries. The geographic entities the Census Bureau uses to tabulate
data are relatively stable over time. For instance, census tracts
are only defined every ten years. In contrast, zip codes can change
more frequently. Because of the ill-defined nature of zip code boundaries,
the Census Bureau has no file (crosswalk) showing the relationship
between U.S. Census Bureau geography and U.S. Postal Service zip codes.
ZCTAs are generalized area representations of U.S. Postal Service
(USPS) zip code service areas. Simply put, each ZCTA is built by
aggregating the Census 2000 blocks whose addresses use a given
zip code, and that zip code is assigned as its ZCTA code. Each
ZCTA thus represents the majority USPS five-digit zip code found
in a given area. For those areas where it is difficult to determine
the prevailing five-digit zip code, the higher-level three-digit
zip code is used for the ZCTA code. For further information, go
to: http://frwebgate.access.gpo.gov/cgi-bin/leaving.cgi?from=leavingFR.html&log=linklog&to=http://www.census.gov/geo/www/gazetteer/places2k.html.
Utilizing 2000 Census data, the following three-digit ZCTAs have
a population of 20,000 or fewer persons. To produce
a de-identified data set utilizing the safe harbor method,
all records with three-digit zip codes corresponding
to these three-digit ZCTAs must have the zip code changed
to 000. The 17 restricted zip codes are: 036, 059, 063,
102, 203, 556, 692, 790, 821, 823, 830, 831, 878, 879,
884, 890, and 893.
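The zip code rule above can be illustrated with a short sketch. The
following Python is a minimal illustration and not part of the Rule: the
function name `generalize_zip` and the treatment of malformed input are
assumptions, while the restricted prefixes are copied from the list above.

```python
# Three-digit ZCTAs listed above (2000 Census, population of 20,000 or fewer).
RESTRICTED_ZIP3 = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})


def generalize_zip(zip_code: str) -> str:
    """Return the three-digit value to retain for a zip code.

    Zip codes are handled as strings so that leading zeros are preserved.
    """
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 3:
        # Assumption: input too short to yield a prefix is treated as restricted.
        return "000"
    prefix = digits[:3]
    # Restricted prefixes are replaced with 000; all others keep three digits.
    return "000" if prefix in RESTRICTED_ZIP3 else prefix
```

For example, a record with a zip code beginning 036 would carry 000 in the
de-identified data set, while one beginning 208 would carry 208.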