On 21 October 2018, the Big Data United Nations Global Working Group held an Open Day in Dubai, UAE. The purpose was to demonstrate the UN Global Platform on Big Data for Official Statistics with presentations and demos. The Big Data UN Global Platform is a collaborative research and development environment for trusted data, services and applications. The platform is already in use for several projects between technology companies, data providers and academia.

The event, hosted at Burj Khalifa, featured an introduction to the UN Global Platform on Big Data, followed by four sessions about its uses in practice. First, satellite data for agricultural and land cover statistics were discussed. These data include flight data, ship location and position data, and satellite imagery. The latter can be used to analyse, for instance, the state of crops, water or deforestation. Many of these data are freely available, but due to their volume (flight data often amount to 3 million records per second), computation can be expensive.

The next session focused on the measurement of human mobility using mobile phone data. These data are currently being used to analyse domestic and cross-border tourism in areas with open borders much more efficiently than is often possible by traditional means. A practical study of mobility in Indonesia was introduced, which made it easy to understand how mobile data have provided more accurate information than was otherwise available. However, it was also noted that an even more accurate picture of human mobility can be obtained if data from social media and advertisement platforms are used in addition to mobile data.

Next, scanner data were introduced. Scanner data are data about sales of consumer goods obtained by scanning the bar codes of individual products in shops. These data can be used to analyse price fluctuations of certain products in a country. A demonstrator was presented that computed the FEWS index on scanner data. The FEWS index is a quality-adjusted price index that can be used on big data when there is no information on product characteristics for explicit quality adjustment. Instead, the existing longitudinal information is used to implicitly adjust the indices for quality.

Finally, in the privacy-preserving techniques session, Cybernetica's team introduced secure multi-party computation and its uses for privacy-preserving data analysis. Our team presented prototypes showcasing privacy-preserving mobility studies (our partnership with Positium) and a privacy-preserving version of the FEWS index.

How does the FEWS index work?

As mentioned, the FEWS (fixed-effects window-splice) index outputs quality-adjusted price indices for scanner data when these data include longitudinal price information at a detailed product-specification level. Data are often incomplete, and some products are seasonal and therefore lack some of the data points in the longitudinal information, so steps have to be taken to account for these missing values.

Usually, hedonic regression is used to control for the effect of changes in product quality. Hedonic theory treats an item as a set of constituent characteristics and, using regression, computes the importance of each characteristic. The simplest way to construct a hedonic price index using only information from the hedonic regression is the time dummy variable method. This method assumes the coefficients (the implicit prices of the characteristics) to be constant over adjacent time periods.
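To make the time dummy variable method concrete, one common textbook formulation is written out below. The notation is an assumption introduced here for illustration and is not taken from the event materials: p_{it} is the price of item i in period t, z_{ik} is the value of characteristic k for item i, beta_k is the coefficient of that characteristic (held constant over time), and delta_t is the coefficient of the dummy for period t, with period 0 as the base.

```latex
% Time dummy hedonic model (log-linear form), with \delta_0 = 0 for the base period
\ln p_{it} = \alpha + \delta_t + \sum_{k=1}^{K} \beta_k z_{ik} + \varepsilon_{it}

% Quality-adjusted price index from base period 0 to period t
P^{0,t} = \exp\bigl(\hat{\delta}_t\bigr)
```

Because the characteristics enter the regression explicitly, the estimated time effects measure the price change that remains once differences in quality have been accounted for.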

The FEWS index makes its quality adjustment in a way that is equivalent to a time dummy hedonic index. However, window-splicing takes into account the price movement across the entire estimation window, rather than only the most recent period. This helps the method to account for the price movements associated with the introduction of new products in the period after their introduction as well. The FEWS index is more challenging to use in areas where product characteristics or consumer preferences change rapidly, e.g., with seasonal variation or technological development. You can read more about the FEWS index in the paper.
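The idea can be illustrated with a minimal, non-private sketch in Python. It is not the code received from Stats NZ, nor the SecreC implementation. It assumes an unweighted regression of log price on product dummies (the fixed effects) and period dummies within each window, and one common formulation of the window splice in which the latest index value is multiplied by the ratio of the new window's movement to the old window's movement over their overlapping periods. The function names fe_index and fews are hypothetical, and real implementations typically weight observations, which is omitted here.

```python
import numpy as np

def fe_index(obs, periods):
    """Fixed-effects index for one window.

    obs     -- iterable of (product_id, period, price) tuples
    periods -- ordered list of the periods in the window
    Returns {period: index}, with the first period of the window set to 1.0.
    """
    wanted = set(periods)
    rows = [(i, t, p) for (i, t, p) in obs if t in wanted]
    products = sorted({i for i, _, _ in rows})
    pcol = {i: k for k, i in enumerate(products)}
    # The first period is the base, so it gets no dummy column.
    tcol = {t: len(products) + k for k, t in enumerate(periods[1:])}
    X = np.zeros((len(rows), len(products) + len(periods) - 1))
    y = np.empty(len(rows))
    for r, (i, t, p) in enumerate(rows):
        X[r, pcol[i]] = 1.0          # product fixed effect
        if t in tcol:
            X[r, tcol[t]] = 1.0      # period dummy
        y[r] = np.log(p)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    index = {periods[0]: 1.0}
    for t, c in tcol.items():
        index[t] = float(np.exp(coef[c]))
    return index

def fews(obs, periods, window):
    """Roll a window of `window` periods forward and splice the results."""
    old = fe_index(obs, periods[:window])
    series = dict(old)               # the first window is used as is
    for end in range(window, len(periods)):
        win = periods[end - window + 1 : end + 1]
        new = fe_index(obs, win)
        t, prev, base = periods[end], periods[end - 1], win[0]
        # Window splice: compare the movement from `base` measured in the
        # new window with the movement over the overlap in the old window.
        series[t] = series[prev] * (new[t] / new[base]) / (old[prev] / old[base])
        old = new
    return series

# Illustrative data only: product "B" is missing in some periods.
example = [("A", 0, 2.00), ("A", 1, 2.10), ("A", 2, 2.05), ("A", 3, 2.20),
           ("B", 0, 3.50), ("B", 2, 3.60), ("B", 3, 3.80)]
print(fews(example, periods=[0, 1, 2, 3], window=3))
```

Products that are missing in some periods simply contribute fewer rows to the regression, which is how the fixed-effects approach copes with the incomplete longitudinal data described above.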

Privacy-preserving scanner data analysis using the FEWS index

Scanner data are sensitive, as large companies are reluctant to share their pricing strategies and price fluctuations with their competitors. Therefore, we were interested in whether this problem can be feasibly solved using secure multi-party computation. Our team implemented the FEWS index using the SecreC programming language and deployed the result in the cloud.

We used three different cloud providers as hosts for the three computing parties: Amazon Web Services, Google Cloud Platform and IBM Cloud. All of the cloud servers were in the Frankfurt region, with a two-millisecond latency for communication between the nodes. The test data were secret-shared and uploaded to the computing parties.
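For readers unfamiliar with secret sharing, the following Python sketch shows the basic idea with three parties. It assumes a simple additive scheme over 32-bit integers and prices expressed in whole cents. The function names share, reconstruct and add_shared are hypothetical, and the sketch does not reproduce the actual protocols or data formats used in the deployment.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size for this illustration

def share(value, parties=3):
    """Split an integer into random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recombine all shares; any incomplete subset reveals nothing about the value."""
    return sum(shares) % MODULUS

def add_shared(a, b):
    """Each party adds its own two shares locally, without any communication."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]

# Illustrative values only: two prices in cents, one share per computing party.
price_a = share(249)
price_b = share(1001)
total = add_shared(price_a, price_b)
print(reconstruct(total))
```

In this scheme each cloud host would hold one share of every value, so no single provider sees an actual price. Addition is local, whereas operations such as the multiplications needed for regression require dedicated protocols in which the parties exchange messages.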

As with all secure multi-party algorithms, the FEWS algorithm benefits from parallelisation in terms of performance. Therefore, we ran the algorithm on two parallel Sharemind instances: each server hosted two instances of Sharemind, the linear regressions were run in parallel, and the results were combined afterwards. We first ran the experiment for 230 products for 26 months (3843 rows of data), and then we doubled the data to run the experiment on 460 products for 26 months (7654 rows of data). The corresponding benchmark results can be seen in Table 1.

Table 1. Benchmark results

Data profile                | No. of rows | No. of Sharemind instances  | Time spent | Cloud cost (bandwidth)
230 products for 26 months  | 3843        | 2 (11 parallel regressions) | 8m 57s     | $31.50
230 products for 26 months  | 3843        | 4 (~5 parallel regressions) | 7m 32s     | $31.50
460 products for 26 months  | 7654        | 4 (~5 parallel regressions) | 22m 24s    | $165

We received the original algorithm code and test data from Stats NZ. The hosting costs of the Amazon Web Services and Google Cloud Platform were graciously covered by the Big Data United Nations Global Working Group. The privacy-preserving FEWS index will be available as part of the UN Big Data Global Platform.