Data Collection – Best Practices

Introduction

Today we will focus on data collection, an important aspect of the work of a data analyst or machine learning engineer. The models we can train and, most importantly, the correctness and usefulness of the entire solution depend on the type and amount of data we collect, so this topic deserves a lot of attention.

How to Approach It?

The appropriate approach to data collection depends on the stage at which you become involved in the process. We can distinguish the three most common situations here:

1. You already have ready-made data collected by someone else.

2. Data are (or are to be) collected by someone else, but you have the opportunity to provide certain requirements or suggestions that will be taken into account.

3. You are responsible for collecting the data yourself.

I will discuss the first situation below, and in the next section – the second and third.

Receiving Collected Data

It may seem that this article should not address this case since I have indicated that we will focus on the data collection stage. However, for me, the data collection stage ends when you have data ready for analysis or further processing. When you receive data (in the context of real-world applications – I am not referring here to training sets, competitions, or hackathons), there is usually one more step before you proceed to process them – understanding the data. And that’s what we will focus on now.

How Were the Data Collected?

First of all, you need to consider how these data were obtained. It will be crucial to talk to the people who collected and provided the data. Were they generated automatically by systems, or were they entered manually? If the latter, it is worth investigating who exactly was responsible for entering this data and whether there are any rules or standards regarding this process. Another important aspect is determining the time period covered by the collected data. This will allow you to understand any changes or trends in the data over time. It is also worth asking about any changes in the data collection methods over time. Was the information gathering process consistent, or were there significant changes in the collection methods?

In many systems, the order in which data arrive is crucial. If our algorithm is supposed to work in real time, we need to pay close attention to the problem of data leakage, which occurs when the model uses, at inference time, data that were not available at the time of prediction. We can only protect against this problem when each piece of data also includes information about when it appeared.
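To make this concrete, here is a minimal sketch (in Python with pandas) of how such a timestamp can be used to build a point-in-time view of the data. The column names `event_time` and `customer_id` and the cutoff date are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical example: each record carries an "event_time" column telling us
# when the information became available.
def available_at(df: pd.DataFrame, cutoff: pd.Timestamp,
                 time_col: str = "event_time") -> pd.DataFrame:
    """Return only the rows that were already known at the prediction time."""
    return df[df[time_col] <= cutoff]

events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-02-20", "2024-01-15"]),
    "amount": [100.0, 250.0, 80.0],
})

# Features for a prediction made on 2024-02-01 must ignore the record from 2024-02-20.
snapshot = available_at(events, pd.Timestamp("2024-02-01"))
print(snapshot)
```

Without the `event_time` column, there would be no way to reconstruct what was actually known at prediction time, and leakage could go unnoticed.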

What Is in the Data?

Next, we need to understand what is in the data. For example, in tabular data, do we know the meaning of each column? If there are multiple tables, we need to determine which keys join them and whether the columns are subject to any defined rules or constraints.
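As an illustration, the sketch below checks whether a documented join key actually holds before relying on it; the `customers` and `orders` tables and the `customer_id` key are hypothetical examples, not taken from any specific project.

```python
import pandas as pd

# Hypothetical tables: "orders" references "customers" through the customer_id key.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "segment": ["retail", "retail", "wholesale"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 2, 4],  # id 4 has no match -> worth investigating
                       "value": [120.0, 80.0, 45.0]})

# The indicator column shows which rows failed to join, i.e. where the
# documented key relationship does not actually hold.
joined = orders.merge(customers, on="customer_id", how="left", indicator=True)
print(joined["_merge"].value_counts())
```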

Another issue is assessing the quality of the data. Were there any gaps in the dataset that were later filled? If so, how were they filled? Were these manual actions, or were any automatic data filling techniques used?
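A quick audit along these lines might look as follows; the column names and the set of placeholder values are assumptions made for the sake of the example.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset where missing values may hide behind several encodings.
df = pd.DataFrame({
    "temperature": [21.5, np.nan, 19.0, 19.0],
    "operator": ["Anna", "", "N/A", "Jan"],
})

# Count true NaNs per column...
print(df.isna().sum())

# ...and also look for suspicious placeholder values that often stand in for
# "missing" when data were entered or filled in manually.
placeholders = {"", "N/A", "NA", "unknown", "-"}
print(df.apply(lambda col: col.astype(str).isin(placeholders).sum()))
```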

Finally, it is worth asking whether data collection is still ongoing. Are there plans to continue the data collection process? This will help you understand whether, for example, a lot more data may appear a few months into the project. Existing data is one thing, but it is also worth finding out what potential there is for gathering additional data that is not being collected today. What other data could be collected in the future, and how much time would this take?

So, you can see that even when you already have some data, there is still a lot you can do to make working with them easier and to improve the process in the future.

Independent Data Collection or Guideline Creation

Here you are one step earlier than in the situation described in the previous point: the data have not yet been collected, so you can still ensure their quality early enough. Here are the top 10 best practices in this area:

1. Collect as much data as possible: Above all, focus on collecting as much data as possible, both in terms of the number of records (rows, in the case of tabular data) and features (columns). The more diverse the data, the greater the chance of discovering valuable patterns and information. Even if at this point you think certain data will not be useful, you may come up with new ideas later and regret it if crucial data are missing.

2. Use different sources: Similar information is often collected at different levels, by different systems, and so on. It is worth ensuring access to these different sources: firstly, you will potentially have more information, and secondly, you can verify correctness by comparing sources that contain the same information.

3. Track the history of changes: If data evolve over time, ensure that you track the history of changes. This will allow you to reconstruct the data state at any time, which is crucial if you plan to experiment with the model on historical data.

4. Add data arrival time: Adding a timestamp to the data allows you to monitor when specific information appeared, which is necessary to avoid the data leakage problem described earlier.

5. Establish a clear format: You must ensure data format consistency (especially within columns). This applies, for example, to features such as dates or the way missing values are encoded (NULL in the table, empty string, etc.).

6. Establish clear naming conventions: Consistent and clear naming of categories, variables, and data objects makes it easier to merge data from different sources and immediately provides greater clarity. For example, I once encountered production data in which the same machines were named differently depending on the department; this caused significant problems when creating a general system using data from multiple departments at once.

7. Implement a data integrity system: Proper tools and automation systems that assist in data collection can reduce the risk of human errors. Avoid collecting data manually, especially in complicated projects.

8. Plan and document the data collection process: Prepare a data collection plan that includes goals, sources, methods, and a schedule. Documentation of the data collection process facilitates subsequent analysis and enables other people to understand the data. Always imagine that you will not be the person responsible for the later use of this data – this will give you the right perspective.

9. Secure sensitive data: If you collect personal or sensitive information, make sure you apply appropriate security measures and comply with personal data protection regulations. I wrote about the risks of improper data use in my post.

10. Regularly check data quality: Monitor data quality during collection and processing. Establish quality metrics and indicators that will help identify errors and data problems (see the sketch below).
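Below is a minimal sketch of the kind of automated checks meant in points 5 and 10: it counts duplicates and missing values and verifies that a date column follows one agreed format. The `measurement_date` column and the `%Y-%m-%d` convention are assumptions used only for this example.

```python
import pandas as pd

# Hypothetical quality checks for an incoming batch of tabular data.
def quality_report(df: pd.DataFrame) -> dict:
    report = {
        "n_rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
    }
    # Example of a format-consistency rule: dates must parse with one agreed format.
    if "measurement_date" in df.columns:
        parsed = pd.to_datetime(df["measurement_date"], format="%Y-%m-%d", errors="coerce")
        report["unparseable_dates"] = int(parsed.isna().sum())
    return report

batch = pd.DataFrame({
    "measurement_date": ["2024-03-01", "01/03/2024", "2024-03-02"],
    "value": [1.2, 1.3, None],
})
print(quality_report(batch))
```

Running such a report on every new batch makes it much easier to spot when the format or quality of incoming data silently changes.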

How Much Data Is Needed?

I wrote earlier that we should collect as much data as possible. In practice, however, we face various financial constraints (collecting data requires the involvement of people, maintaining various systems, etc.), and we do not want to wait indefinitely; we would rather quickly create a dataset from which we can draw important conclusions or train a model.

Unfortunately, it is difficult to give a simple answer to the question posed in this section. The amount of data affects the quality and reliability of both the analysis and the model results, so here are a few criteria that will help you estimate how much is needed:

1. Model complexity level: The more complex the model, the more data you generally need for training. Simpler models, such as linear regression, can work effectively with smaller datasets, even just dozens of examples, while neural networks typically require thousands of records. On the other hand, if you use a Transfer Learning approach (typical in natural language processing or image processing), you start from a large pre-trained model, and then tens of labeled examples (or even a few, in the case of few-shot learning) may be enough to fine-tune it for your application.

2. Number of features: If you have many variables or features in the analyzed dataset, you usually need more data to train the model and avoid overfitting. In the case of tabular data, there should generally be at least several times more rows than columns (unless the dataset contains duplicates or other noise that can easily be filtered out).

3. Data diversity: If the data come from many different sources or contain various cases (e.g., different types of customers or products), it may require more data to ensure that each case occurs a sufficient number of times.

4. Level of imbalance: The easiest way to illustrate this problem is binary classification. If the probability of class “1” is 0.05%, it is clear that to collect a sufficient number of positive cases you need at least tens of thousands of records. Note that in the model training process you will split the data into at least two sets, each of which will need a sufficient number of positive cases to achieve statistical significance (see the rough calculation after this list).

5. Distribution of collected data: If the collected data are strongly clustered around certain values or areas, it may be necessary to collect more data to represent the diversity of the population.

6. Level of noise in the data: If the data contain a lot of noise or errors, a larger dataset may help the model distinguish true patterns from disturbances.

7. Model objectives: Determine what you want to achieve with the model. Do you need precise forecasts or rather general trends? This will affect the amount of data needed.
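For point 4, a rough back-of-the-envelope estimate can be written down explicitly. The target of 50 positive cases per split is an arbitrary assumption used only to illustrate the calculation, not a universal rule.

```python
# Rough estimate: how many records are needed so that each data split
# still contains a reasonable number of positive cases.
positive_rate = 0.0005          # class "1" appears in 0.05% of records
min_positives_per_split = 50    # assumed minimum per split (illustrative)
n_splits = 2                    # e.g. a simple train/test split

records_needed = n_splits * min_positives_per_split / positive_rate
print(f"Roughly {records_needed:,.0f} records needed")  # -> 200,000
```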

Note that you will only know the answers to many of these questions after performing the first analyses. This is typical of Data Science projects, which is why an agile approach is the right philosophy here. Often it is enough to collect an initial sample of data, try to use it, and then decide whether more data are needed.

Conclusion

If you are looking for more knowledge in Data Science or Machine Learning or want to discuss your ideas, do not hesitate to contact us!
