Data is everywhere today. Every sector draws on it to shape its development plans. As we saw in our last blog, statistics is indispensable to scientific research, innovation, manufacturing, production, sales, revenue, forecasting, and prediction. In our healthcare example, we saw how statistical measures computed from a sample of data supported critical decisions. The quality of the data determines how valid a statistical result is.

**Introduction to Hypotheses from Sample vs. Population**

Let’s say we must manufacture a Covid-19 vaccine for the whole nation. To do that, we first need to assume that the vaccine will be able to fight Covid-19. That assumption is a hypothesis.

Stated formally, the hypothesis can be “the vaccine will work in fighting Covid-19”. The alternative hypothesis in this example is that “the vaccine will fail to combat Covid-19”.

**Intro to Population**

To test whether the hypothesis holds, we need to analyze the data statistically and decide whether to accept it.

So, we first need to understand the characteristics of everyone the vaccine is meant for, i.e., the entire population. It is essential to consider all the attributes in order to manufacture a Covid-19 vaccine that works for all sections of the population. For example, the drug manufacturer would need to collect data on the entire population across all features, such as age, gender, and morbidity conditions, before any conclusion about the hypothesis can be drawn.

However, is all of that data accessible? Is it possible to find data on every person living in a nation? It’s nearly impossible, and collecting data on an entire population poses further challenges.

**Challenges of Population**

Let’s try answering a few questions here:

**Necessity**: Is collecting data on all the people necessary for vaccine manufacturing? Yes. Leaving any section of the population out of the project will bias the results, and the vaccine might prove ineffective for the left-out sections of society. However, is it possible to collect every data point of the population?

**Practicality**: We understand that every single data point needs to be incorporated for a robust Covid-19 vaccine. But is it practical to reach out to every person to gather the required data?

**Cost-effectiveness**: Can you gauge the cost of accessing every single data point of a country? Collecting data from the entire population could cost a fortune before the analysis even starts. So, is there a better way?

**Manageability**: Is it possible to manage data at that scale? Data of such magnitude cannot be held on a small server or in an Excel/CSV file. Storing it takes substantial resources, since it runs to an enormous number of rows and columns. After the data is collected, it still has to be cleaned, structured, and manipulated before the analysis can start. Data is often tabulated, charted, and processed, all of which is difficult at population scale.

**Accuracy**: To ensure that the vaccine works for every person, the analysis would need to include all possible attributes and data points. But is it possible to access every single data point of the entire population? Assuming billions of records for the Indian population, analyzing all of that data statistically is practically impossible.

**Time efficiency**: The vaccine needs to be rolled out quickly to restrict the further spread of the infection. How practical is it to reach every member of the population in such a scenario? Gathering data of such magnitude would take an enormous amount of time.

So, considering all of the above, is the population a convenient and practical dataset for the statistical analysis behind vaccine manufacturing? This is what gives rise to the concept of the sample.

**Why do we need Samples from Population?**

We have understood the challenges of dealing with a population in any real-life scenario. So, what is the best way out here?

Instead, we can consider a subset of the population (i.e., a sample) for statistical analysis. It’s like collecting a small piece of a larger whole. However, the sample needs to be representative of the population, or the study will be biased and skewed. The sample needs to have the same mix of attributes, such as age, gender, and morbidity, as the population. This helps ensure that conclusions drawn from the sample hold for the population, so that a vaccine developed from the sample data works for the entire population. With a manageable sample, researchers and statisticians can also answer more questions.
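One way to keep the same mix of attributes is to draw the same fraction from every group, often called proportional stratified sampling. The sketch below is only an illustration under stated assumptions: the population is randomly generated, and the `age_band` field, the three bands, and the `stratified_sample` helper are hypothetical names, not part of any real dataset or library.

```python
import random
from collections import Counter

def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction from every group so the sample keeps the population's mix."""
    rng = random.Random(seed)
    groups = {}
    for person in population:
        groups.setdefault(key(person), []).append(person)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one person per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 10,000 people spread over three made-up age bands.
rng = random.Random(42)
age_bands = ["18-39", "40-59", "60+"]
population = [
    {"id": i, "age_band": rng.choices(age_bands, weights=[5, 3, 2])[0]}
    for i in range(10_000)
]

# Take roughly 1% of each age band.
sample = stratified_sample(population, key=lambda p: p["age_band"], fraction=0.01)

# Compare the share of each band in the population and in the sample.
pop_counts = Counter(p["age_band"] for p in population)
sample_counts = Counter(p["age_band"] for p in sample)
pop_share = {b: pop_counts[b] / len(population) for b in age_bands}
sample_share = {b: sample_counts[b] / len(sample) for b in age_bands}

for band in age_bands:
    print(band, round(pop_share[band], 3), round(sample_share[band], 3))
```

The two columns printed for each band should be close to each other, which is what it means for the sample to mirror the population on that attribute. A real study would stratify on several attributes at once (age, gender, morbidity), not on one.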

**Reasons for sample**

**Necessity**: A sample still meets the need to gather the different attributes of people for vaccine development. With a well-chosen sample, it becomes feasible to collect every data point the statistical analysis requires.

**Practicality**: Collecting data from a sample is more practical and efficient. Choosing the most representative sample is the best practice. Through the sample, you can include a few data points from rural areas and tribal villages. These areas might otherwise be inaccessible to data-collecting agents, leaving marginalized communities out.

**Cost-effectiveness**: Collecting a sample is far cheaper. With fewer participants, the costs of laboratories, equipment, and researchers are lower.

**Manageability**: Storing smaller datasets and running statistical analyses on them is easier. For example, you can store sample data in multiple formats, such as an Excel file or a server database.

**Speed of data collection**: Collecting sample data is much faster and smoother than collecting population data.

**Accuracy/Precision**: A statistician must conduct multiple trials to predict accurately. In vaccine manufacturing, several dosage trials on different sections of the population are necessary to reach the required precision. It is far more convenient to run multiple vaccine trials on a small sample dataset.

**Features of a Sample**

True, the complete set of data is the best basis for unbiased results, but it might take ages to arrive at a single conclusion. Hence, it becomes crucial to ensure that the data collected (the sample) is well suited to the problem.

The best sample should have the following properties:

- It should be relevant to the problem being analyzed.
- Its mean should be close to the population’s mean. Let’s say the average age of the Indian population is 45, while the average age of the sample is 28. Will a vaccine developed from the sample data work for the population? It might fail for older sections of the nation.
- Its variability should be similar to the population’s. For example, population data might have a maximum age of 75 and a minimum age of 18 (a range of 57), whereas sample data might have a maximum age of 45 and a minimum age of 25 (a range of 20). If the sample does not reflect the spread of the population, the vaccine might not be effective for everyone.
- Overall, the sample should represent all classes of the population, which can be achieved with a proper sampling technique.
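The mean and range checks in the list above can be sketched in a few lines. This is a rough illustration with made-up data: the ages are randomly generated between 18 and 75, the "skewed" sample is deliberately restricted to ages 25 to 45 to mimic the example, and `summarize` is a hypothetical helper.

```python
import random
import statistics

def summarize(ages):
    """Return the mean and the range (max - min) of a list of ages."""
    return {"mean": statistics.mean(ages), "range": max(ages) - min(ages)}

rng = random.Random(7)

# Hypothetical population: 50,000 adults aged 18 to 75.
population_ages = [rng.randint(18, 75) for _ in range(50_000)]

# A skewed sample that only reaches people aged 25 to 45.
reachable = [age for age in population_ages if 25 <= age <= 45]
skewed_sample = rng.sample(reachable, 500)

# A simple random sample drawn from the whole population.
random_sample = rng.sample(population_ages, 500)

for name, ages in [
    ("population", population_ages),
    ("skewed", skewed_sample),
    ("random", random_sample),
]:
    summary = summarize(ages)
    print(f"{name:10s} mean={summary['mean']:.1f} range={summary['range']}")
```

The skewed sample reports a lower mean and a much narrower range than the population, while the simple random sample stays close on both. Range is used here only because the text uses it; the standard deviation is a more common measure of variability.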

**Statistic vs. Parameter**

**Statistic**: A metric computed from a sample, such as the sample mean or the sample standard deviation. It is a measurable quantity.

**Parameter**: The counterpart metric for the population. We estimate parameters from statistics. Examples are the population mean and the population standard deviation.
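To make the distinction concrete, here is a minimal sketch using Python's standard `statistics` module. The population is simulated, which is an assumption made purely for illustration, since in practice the full population is rarely available. The variable names `mu`, `sigma`, `x_bar`, and `s` follow common textbook notation and do not come from the original text.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of 20,000 ages between 18 and 75.
population = [rng.triangular(18, 75, 40) for _ in range(20_000)]

# Parameters: computed from every member of the population.
mu = statistics.mean(population)
sigma = statistics.pstdev(population)  # population standard deviation, divides by N

# Statistics: computed from a sample and used to estimate the parameters.
sample = rng.sample(population, 200)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation, divides by n - 1

print(f"parameter  mu={mu:.2f}  sigma={sigma:.2f}")
print(f"statistic  x_bar={x_bar:.2f}  s={s:.2f}")
```

The sample statistics will not match the parameters exactly, but with a representative sample they should land close to them.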

**Conclusion**

We have looked at the sample and the population, the need for each, and their challenges. While the population is the entire universe of the data, the sample is a subset of it. The best sample has a distribution similar to that of the population. We have seen the challenges a vaccine manufacturer faces in collecting data from the population: it is impractical and inconvenient, much of the data is inaccessible, and the costs and time involved are unmanageable. The best way forward is to draw a smaller sample from the same population.

### About Post Author

#### Yogesh Kothiya

Avid Content Creator, Love to train and consult in Data Analytics, Data Science, AI in Cloud, Chatbots, and MLOps.