Unique ID: 2015056
Division: | DIQR Department |
---|---|
Issue Date: | February 13th 2019 |
Last modified: | February 22nd 2019 |
Internet as a data source on ICT in enterprises
Using web scraping as a data source on the use of the Internet in businesses
Explore the possibility of using web scraping techniques, together with text and data mining algorithms, in the estimation phase, with the aim of replacing traditional instruments of data collection and estimation or combining them in an integrated approach. Aim: produce information, in particular on the use of the Internet and other networks for various purposes (e-commerce, e-skills, e-business, social media, e-government, etc.).
Project Objective:
Pilot intended to go to production to supplement existing data; pilot intended to go to production to replace existing data
Project Outcomes:
Replace part of traditional survey. Identify new statistics. Collect auxiliary information to be used in the traditional survey. Reduce response burden.
Publications Comments:
A paper describing the process has been published in a journal and in proceedings.
Statistical Area
Project Sources
Type Of Institution: | National statistical office |
---|---|
Big Data Source: | Web scraping data |
Region: | Europe & Central Asia |
Country Area: | Italy |
Id Country Regional: | country |
Partnerships
Other Partners: | Technology partner, Research or academic institute |
---|
Accessing Data
Data Access Rights: | Broader access rights |
---|
Data Coverage
Data Coverage: | Only a portion of all data |
---|---|
Coverage Geo Pop: | Part of country / low % of market |
Cost Implication: | Free |
Coverage Geo Comments: | Due to few available URLs |
Project Details
Frequency Comments: | The URLs are known for only part of the enterprise population. |
---|
Data Quality
Quality Framework: | Quality of source/input |
---|---|
Quality Aspects Evaluated: | Completeness, Usability, Time Factors, Accessibility, Relevance, Institutional/Business Environment, Validity, Accuracy, including selectivity, Coherence, including linkability to other sources |
Validation Comments: | Comparison with the official estimates from the sample survey "ICT Usage by Enterprises". |
Quality Framework Comments: | Quantitative. After web scraping, a supervised classification is performed, and a confusion matrix is computed to evaluate the classification task. Output quality is evaluated by comparing the resulting estimates with the official estimates. |
Data Quality Concerns Comments: | Based on the confusion matrix, the scraping and text mining phases must be improved. |
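
The confusion-matrix evaluation mentioned above can be illustrated with a minimal sketch. It assumes a binary label (for example, whether an enterprise website offers e-commerce); the labels, the function names `confusion_matrix` and `summarize`, and the sample data are hypothetical and are not taken from the project itself.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary label."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(True, True)],    # correctly flagged
        "fn": counts[(True, False)],   # missed by the classifier
        "fp": counts[(False, True)],   # wrongly flagged
        "tn": counts[(False, False)],  # correctly not flagged
    }


def summarize(cm):
    """Derive accuracy, precision and recall from the four cell counts."""
    total = sum(cm.values())
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }


# Hypothetical data: survey answers (actual) vs. classifier output on scraped text.
actual = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

cm = confusion_matrix(actual, predicted)
metrics = summarize(cm)
print(cm)
print(metrics)
```

In a setting like the one described, the "actual" labels would plausibly come from survey responses for enterprises whose URLs are known, which is an assumption here rather than something the record states.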
Methodology
Methods Used: | Supervised learning, Decision Trees, Data visualization methods, Machine learning (Random forest, etc.) |
---|
Technologies
Technologies: | Data mining tools, Hadoop Clusters |
---|
Other
Income Level: | High-income |
---|---|
Iso: | IT |
Timeframe To Produce Indicator: | NA |