Unique ID: 2015056

Division: DIQR Department
Issue Date: February 13th 2019
Last modified: February 22nd 2019
Collaborative

Internet as a data source on ICT in enterprises

Using web scraping as a data source on the use of the internet in businesses

Explore the possibility of using web scraping techniques in the estimation phase (applying text and data mining algorithms), with the aim of replacing traditional instruments of data collection and estimation, or of combining them in an integrated approach. Aim: produce information, in particular on the use of the Internet and other networks for various purposes (e-commerce, e-skills, e-business, social media, e-government, etc.).

Project Objective:

Pilot intended to go to production to supplement existing data, Pilot intended to go to production to replace existing data

Project Outcomes:

Replace part of traditional survey. Identify new statistics. Collect auxiliary information to be used in the traditional survey. Reduce response burden.

Publications Comments:

A paper describing the process has been published in a journal and in proceedings.

Statistical Area

Business

Project Sources
Type Of Institution: National statistical office
Big Data Source: Web scraping data
Region: Europe & Central Asia
Country Area: Italy
Id Country Regional: country
Partnerships
Other Partners: Technology partner, Research or academic institute
Accessing Data
Data Access Rights: Broader access rights
Data Coverage
Data Coverage: Only a portion of all data
Coverage Geo Pop: Part of country / low % of market
Cost Implication: Free
Coverage Geo Comments: Due to few available URLs
Project Details
Frequency Comments: The URLs are known for only a part of the enterprise population.
Data Quality
Quality Framework: Quality of source/input
Quality Aspects Evaluated: Completeness, Usability, Time Factors, Accessibility, Relevance, Institutional/Business Environment, Validity, Accuracy, including selectivity, Coherence, including linkability to other sources
Validation Comments: Comparison with the official estimates coming from the sample survey "ICT Usage by Enterprises".
Quality Framework Comments: Quantitative. After web scraping, a supervised classification is performed. A confusion matrix is used to evaluate the classification task. The quality of the output is evaluated by comparing the estimates with the official estimates.
Data Quality Concerns Comments: Based on the confusion matrix, the scraping and text mining phases must be improved.
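The confusion-matrix evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the binary label (1 = the enterprise website offers e-commerce, 0 = it does not), the sample values, and the function names `confusion_matrix` and `summarize` are all assumptions made for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary 0/1 label."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # true positives
        "fn": counts[(1, 0)],  # false negatives
        "fp": counts[(0, 1)],  # false positives
        "tn": counts[(0, 0)],  # true negatives
    }


def summarize(cm):
    """Derive accuracy, precision and recall from the four cell counts."""
    total = sum(cm.values())
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }


# Hypothetical data: labels taken from survey answers versus labels
# assigned by a classifier trained on scraped website text.
survey_labels = [1, 0, 1, 1, 0, 0, 1, 0]
classifier_labels = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(survey_labels, classifier_labels)
metrics = summarize(cm)
print(cm)
print(metrics)
```

A high count of false negatives or false positives in such a matrix is the kind of signal that would point to weaknesses in the scraping or text mining phases.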
Methodology
Methods Used: Supervised learning, Decision Trees, Data visualization methods, Machine learning (Random forest, etc.)
Technologies
Technologies: Data mining tools, Hadoop Clusters
Other
Income Level: High-income
Iso: IT
Timeframe To Produce Indicator: NA