How to improve your data acquisition process?

It may seem obvious that data acquisition is a critical step in data science projects, since data is the “gasoline” of any data science engine. Nonetheless, in our experience, data acquisition is where most problems arise. After raising hypotheses and defining what information is needed to test them, the collection process has to perform well enough to reach the desired sample size and quality.

Inefficiencies in the data collection process will affect the entire outcome of the experiment, whether in academic or business settings, including the end-to-end production of the solution designed. Therefore, it is important to establish best practices for data collection and data quality controls.

The steps below can significantly improve the quality of the data collection pipeline:

1. Understand the format of the data

When collecting data, it is important to use as many safeguards as possible to prevent corrupted data from entering the database. When the data type is known, it should be possible to define validation rules. A useful way to define such rules and avoid errors in digital collection is to use regular expressions (regex). A regular expression describes the pattern that valid values must follow, so entries that deviate from it can be flagged or rejected. A simple example is the validation of e-mail addresses, for which the expression ^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$ can be used.

Rules like these can be created for any type of data. Embedding regular expressions in forms helps keep invalid data out and supports the implementation of automated data quality controls.
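
As a minimal sketch of such a rule, the snippet below applies the e-mail expression to a batch of form entries. The function names are illustrative, and the hyphen has been moved to the end of each character class because Python's `re` module does not accept `\w-\.` as written.

```python
import re

# Pattern from the text, with the hyphen moved to the end of the character
# class so that Python's `re` module does not try to read `\w-\.` as a range.
EMAIL_PATTERN = re.compile(r"^[\w.-]+@([\w-]+\.)+[\w-]{2,4}$")


def is_valid_email(value: str) -> bool:
    """Return True when the value matches the e-mail format rule."""
    return EMAIL_PATTERN.fullmatch(value.strip()) is not None


def split_by_validity(values):
    """Separate form entries into accepted and rejected lists."""
    accepted, rejected = [], []
    for value in values:
        (accepted if is_valid_email(value) else rejected).append(value)
    return accepted, rejected
```

Calling `split_by_validity(["ana@example.com", "not an email"])` puts the first entry in the accepted list and the second in the rejected list. Note that the `{2,4}` bound in this particular expression also rejects addresses whose last domain label is longer than four characters, so the rule may need loosening depending on the audience.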


2. Understand the audience and channels

It is important to adapt the communication channel to the target audience. For example, healthcare companies that send surveys to the elderly via email reportedly tend to get lower response rates than those that use channels like SMS and WhatsApp. Therefore, the behavior of the target audience has to be analyzed before establishing the collection channel. One way of doing that is to ask the audience directly which channels they prefer.


3. Analyze response behavior

In addition to "how to collect", "when to collect" is also extremely important. Research from HubSpot revealed that the times of day when most people use their smartphones are right before bed and as soon as they wake up. These moments can be a great opportunity to send promotions and surveys.
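
One way to check this against your own audience is to compute the response rate per sending hour from past campaigns. The sketch below assumes a hypothetical send log made of `(sent_at, responded)` pairs; the function names and the log structure are illustrative only.

```python
from collections import defaultdict
from datetime import datetime


def response_rate_by_hour(send_log):
    """Map each sending hour (0-23) to the share of messages that got a reply.

    `send_log` is an iterable of (sent_at: datetime, responded: bool) pairs.
    """
    sent = defaultdict(int)
    answered = defaultdict(int)
    for sent_at, responded in send_log:
        sent[sent_at.hour] += 1
        if responded:
            answered[sent_at.hour] += 1
    return {hour: answered[hour] / sent[hour] for hour in sorted(sent)}


def best_hours(send_log, top=2):
    """Return the `top` sending hours with the highest response rate."""
    rates = response_rate_by_hour(send_log)
    return sorted(rates, key=rates.get, reverse=True)[:top]
```

With enough history, `best_hours(log)` gives a first indication of when to schedule the next survey. Hours with very few sends should be read with caution, since a single reply can dominate the rate.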


4. Standardization of data

There are countless ways of writing the same value, which, if left untreated, will introduce major biases into the analysis. If possible, offer autocomplete or checkbox options when collecting text data to make it easier for the respondent, and use data standardization platforms once the data has been tabulated. If that is not possible, or if the data was already acquired with poor standardization, fuzzy matching techniques can be used to detect string similarity and map each entry to its standard form.
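
A minimal sketch of fuzzy matching, using only `difflib` from the Python standard library, is shown below. The list of canonical cities and the 0.8 similarity cutoff are assumptions chosen for illustration; dedicated fuzzy matching libraries offer other similarity measures.

```python
from difflib import get_close_matches

# Hypothetical list of accepted spellings for a "city" field.
CANONICAL_CITIES = ["New York", "Los Angeles", "San Francisco"]


def standardize(raw, canonical, cutoff=0.8):
    """Return the canonical value closest to `raw`, or None if nothing is close.

    Comparison ignores case and surrounding whitespace; `cutoff` is the
    minimum similarity ratio (0 to 1) required to accept a match.
    """
    lookup = {value.casefold(): value for value in canonical}
    matches = get_close_matches(
        raw.strip().casefold(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None
```

Here `standardize("new yrok", CANONICAL_CITIES)` maps the typo to `"New York"`, while an unrelated entry such as `"Tokyo"` returns `None` and can be sent for manual review instead of being silently forced into a category.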


5. Data validation

In most cases, checking the data format alone is not enough. Deeper data validation rules are needed to make the collection and analysis of information more accurate.

To improve collection performance, external resources are commonly used to validate email addresses and telephone numbers. The goal is to eliminate as much invalid data as possible, so that surveys are sent only to legitimate contacts without harming return and reputation metrics. For instance, APIs can be used to check the validity of addresses, phone numbers, IDs and tax identification numbers.
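
The two layers can be combined as in the sketch below: a cheap format check runs first, and only well-formed entries are passed to a deeper verification step. The `verifier` argument is a hypothetical stand-in for whatever validation API is used; `stub_verifier` and its list of known bad addresses exist only to make the example runnable and do not represent any real service.

```python
import re

# Format rule from step 1 (hyphen moved to the end of the character class).
EMAIL_FORMAT = re.compile(r"^[\w.-]+@([\w-]+\.)+[\w-]{2,4}$")

# Hypothetical data standing in for the answer of an external service.
KNOWN_BAD = {"ghost@example.com"}


def stub_verifier(email):
    """Placeholder for a real API call; returns True if the address exists."""
    return email not in KNOWN_BAD


def validate_contacts(emails, verifier):
    """Return (deliverable, rejected), where rejected maps email -> reason."""
    deliverable, rejected = [], {}
    for email in emails:
        if not EMAIL_FORMAT.fullmatch(email):
            rejected[email] = "bad format"  # never reaches the verifier
        elif not verifier(email):
            rejected[email] = "failed verification"
        else:
            deliverable.append(email)
    return deliverable, rejected
```

Running the format check first keeps obviously malformed entries from consuming calls to the external service, and recording a reason for each rejection makes it easier to monitor data quality over time.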