Unique ID: 2015046
Division: | Methodology Department |
---|---|
Issue Date: | February 13th 2019 |
Last modified: | February 22nd 2019 |
Web scraping data from retailers' websites for the CPI calculation
Using web scraping data for compilation of price indices
The Hungarian Central Statistical Office (HCSO) uses web scraping to collect price data from a general retailer's website. The project is at the exploration stage, but the HCSO plans to integrate this data source into its price statistics business processes in the future. Price information other than the general retailer's data (e.g. flight prices) is already being considered for use in official statistics. The search for suitable IT tools is ongoing. The HCSO is also investigating the use of road sensor data (from the National Toll Payment Services Plc) and scanner data (from the National Tax and Customs Administration of Hungary), but because it receives these as monthly aggregated datasets for the current methodological studies, they are not considered 'traditional' Big Data uses.
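The record does not describe how the scraped prices are extracted or aggregated (it mentions only Excel macros and SAS). Purely as an illustrative sketch, the Python below shows one way scraped product prices could feed an elementary price index. The HTML markup (`<span class="price" data-product="...">`), the helper names `scrape_prices` and `jevons_index`, and the choice of the Jevons formula (geometric mean of matched price relatives) are all assumptions made for this example, not the HCSO's actual method.

```python
import math
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect product prices from hypothetical markup such as
    <span class="price" data-product="ID">1 299 Ft</span>."""

    def __init__(self):
        super().__init__()
        self.prices = {}
        self._current = None  # product id while inside a price <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "price":
            self._current = attrs.get("data-product")

    def handle_data(self, data):
        if self._current is not None:
            # Keep digits only, so "1 299 Ft" becomes 1299 (whole forints assumed).
            digits = "".join(ch for ch in data if ch.isdecimal())
            if digits:
                self.prices[self._current] = float(digits)

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


def scrape_prices(html):
    """Return {product id: price} parsed from one page of HTML text."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


def jevons_index(base, current):
    """Jevons index (base = 100) over products priced in both periods."""
    matched = [p for p in base if p in current]
    if not matched:
        raise ValueError("no products matched between the two periods")
    log_sum = sum(math.log(current[p] / base[p]) for p in matched)
    return 100.0 * math.exp(log_sum / len(matched))
```

In use, `scrape_prices` would be called once per collection period on the downloaded page text, and `jevons_index(base_prices, current_prices)` would compare the two periods. Products that appear in only one period are ignored, which is a simplification; a production process would need rules for replacements and missing prices.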
Project Objective:
Exploration, Scientific / research
Project Outcomes:
Successful pilot with promising outcomes on the integrated use of web scraping techniques in the acquisition of prices used for the production of the CPI.
Statistical Area
Project Sources
Type Of Institution: | National statistical office |
---|---|
Big Data Source: | Web scraping data |
Region: | Europe & Central Asia |
Country Area: | Hungary |
Id Country Regional: | country |
Partnerships
Other Partners: | Other |
---|---|
Partnership Comments: | Partnerships are not being considered at the moment, as the main focus of the project is on the actual acquisition of the information. No partnerships are foreseen for statistical production, as the new Big Data sources are planned to be integrated into the already existing production environment (after a successful pilot). The use of a cloud server might be considered in the future. |
Accessing Data
Data Access Rights: | Broader access rights |
---|---|
Intermediary Comments: | The HCSO is web scraping the data directly from the designated websites. No pre-treatment of the datasets is necessary at this stage of the work. |
Data Access Comments: | Website information is generally open for use without limiting the potential purposes. |
Data Coverage
Data Coverage: | Only a portion of all data |
---|---|
Coverage Geo Pop: | Part of country / high % of market |
Cost Implication: | Free |
Cost Comments: | Access to website information is free. |
Coverage Geo Comments: | For the time being the research is focusing on a few actors on the market. Potentially, the use should not be limited to only a few actors. |
Coverage Period: | Depends on the needs of the subject-matter department concerned and of the research (daily or weekly data). |
Project Details
Frequency Comments: | The HCSO is currently web scraping only a portion of the available information for statistical purposes. |
---|
Data Quality
Quality Framework: | Quality of processing/throughput |
---|---|
Quality Aspects Evaluated: | Completeness, Usability, Time Factors, Accessibility, Relevance, Coherence, including linkability to other sources |
Quality Framework Comments: | No specific frameworks are applied to the source itself. Quality frameworks apply to the business process in which data from this Big Data source is to be used in the future. |
Data Quality Concerns Comments: | No quality concerns at the current state of the research. |
Methodology
Methods Used: | Traditional statistical methods |
---|
Technologies
Technologies: | Spreadsheet, Other |
---|---|
Technologies Comments: | The HCSO is using Excel macro and SAS solutions for the research. |
Other
Income Level: | High-income |
---|---|
Iso: | HU |
Timeframe To Produce Indicator: | NA |