Here's an overview of the flow of data through the Integrated Data Infrastructure (IDI) and a brief explanation of how we link different datasets together.
The IDI is a large SQL database holding linked longitudinal microdata about people and households. It is a constantly growing source of information – we update the data every quarter as well as integrating new datasets. Figure 1 shows how data flows through the IDI. Each step is explained in more detail below.
1. Data collected from all sources
Data is supplied securely to the IDI from a range of government agencies, Statistics NZ surveys, and non-government organisations. Access to data containing personal identifying information such as names is restricted to essential IDI staff preparing the data.
The IDI is updated (or ‘refreshed’) quarterly. Most of the data is updated at each refresh, but some data is updated less frequently.
See Data in the IDI.
2. Process and link the data
We link records by using the variables they have in common. IDI linking software performs millions of comparisons to identify which records are likely to belong to the same identity (see figure 2).
In this example, record 1 in the birth dataset is linked with record 52 in the tax dataset because they contain identity data that is the same or very similar. This means the two records, from two different sources, are likely to belong to the same person.
When we receive data from source agencies, we ensure it complies with data supply agreements (ie we have everything we should have received and nothing extra), and prepare the data for linking. We ensure that the linking variables are presented in a standardised format and apply text treatments to remove invalid characters.
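As an illustration only, the standardisation step above could be sketched as follows. The function names, the set of characters treated as valid, and the accepted date formats are all assumptions made for this example; they are not taken from the IDI's own preparation code.

```python
import re
import unicodedata
from datetime import datetime


def standardise_name(raw: str) -> str:
    """Return an upper-case, ASCII-only name with invalid characters removed.

    Hypothetical text treatment: which characters count as 'invalid'
    is an assumption made for this sketch.
    """
    # Decompose accented letters so the accent marks can be dropped.
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep letters, apostrophes, spaces and hyphens; blank out the rest.
    cleaned = re.sub(r"[^A-Za-z' -]", " ", ascii_only)
    # Collapse runs of whitespace and standardise the case.
    return " ".join(cleaned.split()).upper()


def standardise_date(raw: str) -> str:
    """Return a date in ISO format (YYYY-MM-DD).

    The accepted input formats are assumptions; day-first order is
    assumed for slash- and hyphen-separated dates.
    """
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")
```

With these assumed rules, `standardise_name("  o'Brien,  José ")` returns `"O'BRIEN JOSE"` and `standardise_date("21/07/2016")` returns `"2016-07-21"`, so the same person is presented identically regardless of how each source agency formatted the values.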
We also compile and standardise address information at this step in preparation for geocoding. Geocoding allows us to make location information available to researchers as small geographic units called meshblocks – this keeps specific addresses anonymous. See meshblocks definition.
We link together information about the same identity across multiple data sources in different ways, depending on the variables available within each dataset. Both ‘probabilistic’ and ‘deterministic’ linking methods are used in the IDI.
Records with personal identifiers in common, such as National Health Index (NHI) numbers or passport codes, allow for exact ‘deterministic’ links to be made. ‘Probabilistic’ linking uses demographic variables such as name, date of birth, and sex, to infer that two records belong to the same person.
Our linking methodology is designed so that two records are linked only when they have a high probability of belonging to the same person. However, some records can be linked incorrectly, and some links can be missed. To minimise these errors, we perform link quality checks with every update.
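The difference between the two methods can be sketched in a few lines. This is a simplified weighted-agreement score for illustration, not the method used by the IDI linking software; the field names, weights, and threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"name": 0.5, "date_of_birth": 0.4, "sex": 0.1}
THRESHOLD = 0.85


def deterministic_match(a: dict, b: dict, key: str) -> bool:
    """Link exactly when both records carry the same non-empty identifier."""
    return bool(a.get(key)) and a.get(key) == b.get(key)


def probabilistic_score(a: dict, b: dict) -> float:
    """Score between 0 and 1: higher means more likely the same person."""
    # Names may differ slightly between sources, so compare by similarity.
    name_similarity = SequenceMatcher(None, a["name"], b["name"]).ratio()
    dob_agrees = 1.0 if a["date_of_birth"] == b["date_of_birth"] else 0.0
    sex_agrees = 1.0 if a["sex"] == b["sex"] else 0.0
    return (
        WEIGHTS["name"] * name_similarity
        + WEIGHTS["date_of_birth"] * dob_agrees
        + WEIGHTS["sex"] * sex_agrees
    )


def is_probable_link(a: dict, b: dict) -> bool:
    """Accept the link when the score reaches the (assumed) threshold."""
    return probabilistic_score(a, b) >= THRESHOLD


birth = {"name": "JOHN SMITH", "date_of_birth": "1980-02-01", "sex": "M"}
tax = {"name": "JON SMITH", "date_of_birth": "1980-02-01", "sex": "M"}
print(is_probable_link(birth, tax))
```

Here the two records share no identifier, so no deterministic link is possible, but the near-identical name combined with the matching date of birth and sex pushes the score above the assumed threshold. Raising the threshold reduces incorrect links at the cost of more missed links, and lowering it does the opposite.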
See Resources for IDI users for more information on our linking methodology.
3. De-identified data available for research
We remove personal identifying information such as names and addresses, and encrypt (ie replace with another number) identifiers such as IRD numbers and NHI numbers. The IDI is not about locating individuals. It’s about understanding groups of people with similar characteristics. See How we keep IDI data safe.
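A minimal sketch of this de-identification step is shown below. It assumes a keyed hash (HMAC-SHA256) as the way to replace an identifier with another number; the actual encryption method used in the IDI is not described here, and the field names and key are hypothetical.

```python
import hashlib
import hmac

# Hypothetical field names, used only for this example.
DIRECT_IDENTIFIERS = {"name", "address"}          # removed entirely
ENCRYPTED_FIELDS = {"ird_number", "nhi_number"}   # replaced with another number


def replace_identifier(value: str, secret_key: bytes) -> int:
    """Map an identifier to another number using a keyed hash (an assumption)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).digest()
    # Use the first 8 bytes of the digest as the replacement number.
    return int.from_bytes(digest[:8], "big")


def de_identify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record that is safe to release for research."""
    released = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # names and addresses are dropped
        if field in ENCRYPTED_FIELDS:
            released[field] = replace_identifier(str(value), secret_key)
        else:
            released[field] = value
    return released
```

Because the same input and key always give the same replacement number, records for one person can still be joined across tables after the original identifier has been removed.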
Three SQL databases are made available for research – the clean database, the metadata database, and the sandpit database. Figure 3 shows how the IDI looks in Microsoft SQL Server Management Studio.
- The clean database contains all of the data tables, usually organised by the name of the agency that supplied the data (eg IRD, MSD, MBIE). We also provide a number of derived tables that combine information from various sources. Researchers do not get access to all of the datasets – only those needed for their research. See Data in the IDI.
- The metadata database contains lookup codes and classifications about specific collections. Other collaboration spaces containing metadata and coding resources, including the IDI Wiki, are also available in the Data Lab.
- The sandpit database is a space where researchers can put tables, datasets, and programming code to be shared with members of their project team.
See Resources for IDI users for more information about using the IDI.
To gain access to the IDI, applicants must successfully prove their projects are for bona fide research purposes that are in the public interest. Researchers will only be given access to data that is essential for their research project, and this data can only be accessed in secure Data Lab locations.
See Using the Data Lab for information on applying to use the IDI.
Confidentiality rules must be applied by researchers to all output before data can be taken out of the secure Data Lab environment. This ensures that information is grouped in a way that makes it impossible to identify individual people.
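One common way of grouping output so that individuals cannot be identified is to suppress counts for very small groups. The sketch below illustrates that general technique only. The specific confidentiality rules that apply to IDI output are not set out here, and the minimum cell size used is a hypothetical value.

```python
from collections import Counter

# Hypothetical threshold, not taken from the IDI confidentiality rules.
MIN_CELL_SIZE = 5


def safe_counts(groups, min_cell_size=MIN_CELL_SIZE):
    """Count people per group, suppressing any count below the threshold.

    Suppressed counts are returned as None so they cannot be released.
    """
    counts = Counter(groups)
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }
```

For example, `safe_counts(["A"] * 7 + ["B"] * 2)` returns `{"A": 7, "B": None}`: the group of seven can be reported, while the group of two is withheld because a count that small could point to specific people.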
History of the IDI
Statistics NZ has been undertaking data integration for over 10 years – a 1997 Cabinet directive stated Statistics NZ was the right organisation to complete cross-agency data integration. The early integration projects were for specific purposes, often linking two or three datasets, and each project was kept in a separate environment. Complex linking work and research investment could not be easily replicated.
In 2011, Cabinet agreed to a proposal to integrate Department of Labour migration data with the integrated datasets managed by Statistics NZ. As a result, Statistics NZ created the IDI prototype – consolidating the previously separate integration projects with the migration data. At this point Statistics NZ moved from one-off data integration to providing a data integration service.
In 2013, Cabinet agreed that better public services could be delivered if government improved its capability to share data using existing datasets, and that this required a cross-agency data-sharing solution. Cabinet agreed that the IDI be expanded to facilitate this work.
Updated 21 July 2016