Unique ID: 2015046

Division: Methodology Department
Issue Date: February 13th 2019
Last modified: February 22nd 2019
Collaborative

Web scraping data from retailers' websites for the CPI calculation

Using web scraping data for compilation of price indices

The Hungarian Central Statistical Office (HCSO) uses web scraping to collect price data from a general retailer's website. The project is still exploratory, but the HCSO plans to integrate this data source into its price statistics business processes in the future. Price information from sources other than the general retailer (e.g. flight prices) is already being considered for use in official statistics. The search for suitable IT tools is ongoing. The HCSO is also investigating the use of road sensor data (from the National Toll Payment Services Plc) and scanner data (from the National Tax and Customs Administration of Hungary); however, because the HCSO receives these as monthly aggregated datasets for its current methodological studies, they are not considered 'traditional' Big Data uses.
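As a rough illustration of what collecting prices from a retailer's page can involve, the sketch below extracts product names and prices from a small HTML fragment using only the Python standard library. It is not the HCSO's implementation (the record states that Excel macros and SAS are used); the markup, the CSS class names `product-name` and `product-price`, the "Ft" price format and the helper `extract_prices` are all assumptions made for the example.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect (product name, price) pairs from a hypothetical product listing."""

    def __init__(self):
        super().__init__()
        self._field = None   # class of the <span> currently being read, if any
        self._name = None    # most recent product name seen
        self.prices = []     # list of (name, price) tuples

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        # Assumed markup: name and price each sit in their own <span>
        if tag == "span" and css_class in ("product-name", "product-price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "product-name":
            self._name = data.strip()
        elif self._field == "product-price":
            # Assumed format: "1 049 Ft" (space as thousands separator)
            digits = "".join(ch for ch in data if ch.isdigit())
            if digits and self._name is not None:
                self.prices.append((self._name, int(digits)))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_prices(html):
    """Return the (name, price) pairs found in an HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# Hypothetical page fragment; a real retailer's markup will differ.
SAMPLE_HTML = """
<ul>
  <li><span class="product-name">Milk 1 l</span> <span class="product-price">299 Ft</span></li>
  <li><span class="product-name">Bread 1 kg</span> <span class="product-price">1 049 Ft</span></li>
</ul>
"""

if __name__ == "__main__":
    for name, price in extract_prices(SAMPLE_HTML):
        print(name, price)
```

In practice the page would first have to be downloaded, and the extraction rules would need to be maintained whenever the retailer changes its page layout.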

Project Objective:

Exploration, Scientific / research

Project Outcomes:

Successful pilot with promising outcomes on the integrated use of web scraping techniques in the acquisition of prices used for the production of the CPI.

Statistical Area

Price

Project Sources
Type Of Institution: National statistical office
Big Data Source: Web scraping data
Region: Europe & Central Asia
Country Area: Hungary
Id Country Regional: country
Partnerships
Other Partners: Other
Partnership Comments: Partnerships are not being considered at the moment, as the main focus of the project is on the actual acquisition of the information. No partnerships are foreseen for statistical production, as the new Big Data sources are planned to be integrated into the already existing production environment (after a successful pilot). The use of a cloud server might be considered in the future.
Accessing Data
Data Access Rights: Broader access rights
Intermediary Comments: The HCSO scrapes the data directly from the designated websites. No pre-treatment of the datasets is necessary at this stage of the work.
Data Access Comments: Website information is generally open for use without limiting the potential purposes.
Data Coverage
Data Coverage: Only a portion of all data
Coverage Geo Pop: Part of country / high % of market
Cost Implication: Free
Cost Comments: Access to website information is free.
Coverage Geo Comments: For the time being, the research focuses on a few actors in the market. Potentially, the use should not be limited to only a few actors.
Coverage Period: It depends on the needs of the subject matter department concerned and of the research (daily, weekly data).
Project Details
Frequency Comments: The HCSO is currently web scraping only a portion of the available information for statistical purposes.
Data Quality
Quality Framework: Quality of processing/throughput
Quality Aspects Evaluated: Completeness, Usability, Time Factors, Accessibility, Relevance, Coherence, including linkability to other sources
Quality Framework Comments: No specific frameworks are applied to the source itself. Quality frameworks apply to the business process in which data from this Big Data source is to be used in the future.
Data Quality Concerns Comments: No quality concerns at the current stage of the research.
Methodology
Methods Used: Traditional statistical methods
Technologies
Technologies: Spreadsheet, Other
Technologies Comments: The HCSO is using Excel macros and SAS solutions for the research.
Other
Income Level: High-income
Iso: HU
Timeframe To Produce Indicator: NA