 SAMPLING AND DATA ANALYSIS
2.1 Introduction
Analysis of the properties of a food material depends on the successful completion of a number of different steps: planning (identifying the most appropriate analytical procedure), sample selection, sample preparation, performance of the analytical procedure, statistical analysis of measurements, and data reporting. Most of the subsequent chapters describe the various analytical procedures developed to provide information about food properties, whereas this chapter focuses on the other aspects of food analysis.
2.2 Sample Selection and Sampling Plans
A food analyst often has to determine the characteristics of a large quantity of food material, such as the contents of a truck arriving at a factory, a day's worth of production, or the products stored in a warehouse. Ideally, the analyst would like to analyze every part of the material to obtain an accurate measure of the property of interest, but in most cases this is practically impossible. Many analytical techniques destroy the food, and so there would be nothing left to sell if it were all analyzed. Another problem is that many analytical techniques are time-consuming, expensive or labor-intensive, and so it is not economically feasible to analyze large amounts of material. It is therefore normal practice to select a fraction of the whole material for analysis, and to assume that its properties are representative of the whole material. Selection of an appropriate fraction of the whole material is one of the most important stages of food analysis procedures, and can lead to large errors when not carried out correctly.
Populations, Samples and Laboratory Samples. It is convenient to define some terms used to describe the characteristics of a material whose properties are going to be analyzed.
• Population. The whole of the material whose properties we are trying to estimate is usually referred to as the "population".
• Sample. Only a fraction of the population is usually selected for analysis, which is referred to as the "sample". The sample may comprise one or more subsamples selected from different regions within the population.
• Laboratory Sample. The sample may be too large to analyze conveniently using a laboratory procedure, and so only a fraction of it is actually used in the final laboratory analysis. This fraction is usually referred to as the "laboratory sample".
The primary objective of sample selection is to ensure that the properties of the laboratory sample are representative of the properties of the population; otherwise erroneous results will be obtained. Selection of a limited number of samples for analysis is of great benefit because it allows a reduction in the time, expense and personnel required to carry out the analytical procedure, while still providing useful information about the properties of the population. Nevertheless, one must always be aware that analysis of a limited number of samples can only give an estimate of the true value of the whole population.
Sampling Plans. To ensure that the estimated value obtained from the laboratory sample is a good representation of the true value of the population it is necessary to develop a "sampling plan". A sampling plan should be a clearly written document that contains the precise details an analyst uses to decide the sample size, the locations from which the sample should be selected, the method used to collect the sample, and the method used to preserve it prior to analysis. It should also stipulate the required documentation of procedures carried out during the sampling process. The choice of a particular sampling plan depends on the purpose of the analysis, the property to be measured, the nature of the total population and of the individual samples, and the type of analytical technique used to characterize the samples. For certain products and types of populations, sampling plans have already been developed and documented by various organizations which authorize official methods, e.g., the Association of Official Analytical Chemists (AOAC). Some of the most important considerations when developing or selecting an appropriate sampling plan are discussed below.
2.2.1 Purpose of Analysis
The first thing to decide when choosing a suitable sampling plan is the purpose of the analysis. Samples are analyzed for a number of different reasons in the food industry and this affects the type of sampling plan used:
• Official samples. Samples may be selected for official or legal requirements by government laboratories. These samples are analyzed to ensure that manufacturers are supplying safe foods that meet legal and labeling requirements. An officially sanctioned sampling plan and analytical protocol is often required for this type of analysis.
• Raw materials. Raw materials are often analyzed before acceptance by a factory, or before use in a particular manufacturing process, to ensure that they are of an appropriate quality.
• Process control samples. A food is often analyzed during processing to ensure that the process is operating in an efficient manner. Thus if a problem develops during processing it can be quickly detected and the process adjusted so that the properties of the product are not adversely affected. Techniques used to monitor process control must be capable of producing precise results in a short time. Manufacturers can either use analytical techniques that measure the properties of foods online, or they can select and remove samples and test them in a quality assurance laboratory.
• Finished products. Samples of the final product are usually selected and tested to ensure that the food is safe, meets legal and labeling requirements, and is of a high and consistent quality. Officially sanctioned methods are often used for determining nutritional labeling.
• Research and Development. Samples are analyzed by food scientists involved in fundamental research or in product development. In many situations it is not necessary to use a sampling plan in R&D because only small amounts of materials with well-defined properties are analyzed.
2.2.2 Nature of Measured Property
Once the reason for carrying out the analysis has been established it is necessary to clearly specify the particular property that is going to be measured, e.g., color, weight, presence of extraneous matter, fat content or microbial count. The properties of foods can usually be classified as either attributes or variables. An attribute is something that a product either does or does not have, e.g., it does or does not contain a piece of glass, or it is or is not spoilt. On the other hand, a variable is some property that can be measured on a continuous scale, such as the weight, fat content or moisture content of a material. Variable sampling usually requires fewer samples than attribute sampling.
The type of property measured also determines the seriousness of the outcome if the properties of the laboratory sample do not represent those of the population. For example, if the property measured is the presence of a harmful substance (such as bacteria, glass or toxic chemicals), then the seriousness of the outcome if a mistake is made in the sampling is much greater than if the property measured is a quality parameter (such as color or texture). Consequently, the sampling plan has to be much more rigorous for detection of potentially harmful substances than for quantification of quality parameters.
2.2.3 Nature of Population
It is extremely important to clearly define the nature of the population from which samples are to be selected when deciding which type of sampling plan to use. Some of the important points to consider are listed below:
• A population may be either finite or infinite. A finite population is one that has a definite size, e.g., a truckload of apples, a tanker full of milk, or a vat full of oil. An infinite population is one that has no definite size, e.g., a conveyor belt that operates continuously, from which foods are selected periodically. Analysis of a finite population usually provides information about the properties of the population, whereas analysis of an infinite population usually provides information about the properties of the process. To facilitate the development of a sampling plan it is usually convenient to divide an "infinite" population into a number of finite populations, e.g., all the products produced by one shift of workers, or all the samples produced in one day.
• A population may be either continuous or compartmentalized. A continuous population is one in which there is no physical separation between the different parts of the material, e.g., liquid milk or oil stored in a tanker. A compartmentalized population is one that is split into a number of separate subunits, e.g., boxes of potato chips in a truck, or bottles of tomato ketchup moving along a conveyor belt. The number and size of the individual subunits determine the choice of a particular sampling plan.
• A population may be either homogeneous or heterogeneous. A homogeneous population is one in which the properties of the individual samples are the same at every location within the material (e.g., a tanker of well-stirred liquid oil), whereas a heterogeneous population is one in which the properties of the individual samples vary with location (e.g., a truck full of potatoes, some of which are bad). If a population were homogeneous then there would be no problem in selecting a sampling plan because every individual sample would be representative of the whole population. In practice, most populations are heterogeneous and so we must carefully select a number of individual samples from different locations within the population to obtain an indication of the properties of the total population.
2.2.4 Nature of Test Procedure
The nature of the procedure used to analyze the food may also determine the choice of a particular sampling plan, e.g., the speed, precision, accuracy and cost per analysis, or whether the technique is destructive or nondestructive. Obviously, it is more convenient to analyze the properties of many samples if the analytical technique used is capable of rapid, low cost, nondestructive and accurate measurements.
2.2.5. Developing a Sampling Plan
After considering the above factors one should be able to select or develop a sampling plan which is most suitable for a particular application. Different sampling plans have been designed to take into account differences in the types of samples and populations encountered, the information required and the analytical techniques used. Some of the features that are commonly specified in official sampling plans are listed below.
Sample size. The size of the sample selected for analysis largely depends on the expected variations in properties within a population, the seriousness of the outcome if a bad sample is not detected, the cost of analysis, and the type of analytical technique used. Given this information it is often possible to use statistical techniques to design a sampling plan that specifies the minimum number of subsamples that need to be analyzed to obtain an accurate representation of the population. Often the size of the sample is impractically large, and so a process known as sequential sampling is used. Here subsamples selected from the population are examined sequentially until the results are sufficiently definite from a statistical viewpoint. For example, subsamples are analyzed until the ratio of good ones to bad ones falls within some statistically predefined value that enables one to confidently reject or accept the population.
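The idea of examining subsamples one at a time until a decision can be made can be sketched in Python. This is only an illustration of the stopping logic: the function name, the reject limit of 3 bad subsamples and the maximum of 20 inspected subsamples are hypothetical values chosen for the example, not limits taken from any official sampling plan. A real plan would derive its limits from statistical tables.

```python
def sequential_inspection(results, reject_limit, max_inspected):
    """Inspect subsamples one at a time and stop as soon as a decision is possible.

    results       -- iterable of booleans, True meaning the subsample is bad
    reject_limit  -- hypothetical number of bad subsamples that triggers rejection
    max_inspected -- hypothetical number of subsamples after which the lot is accepted
    Returns a tuple (decision, number_inspected).
    """
    bad = 0
    inspected = 0
    for is_bad in results:
        inspected += 1
        if is_bad:
            bad += 1
        if bad >= reject_limit:
            return "reject", inspected      # enough bad subsamples found
        if inspected >= max_inspected:
            return "accept", inspected      # limit reached without rejection
    return "undecided", inspected           # ran out of subsamples first


# Inspection results made up for this example (True = bad subsample).
poor_lot = [False, True, False, True, True, False, False]
good_lot = [False] * 25

decision_poor = sequential_inspection(poor_lot, reject_limit=3, max_inspected=20)
decision_good = sequential_inspection(good_lot, reject_limit=3, max_inspected=20)
```

With these made-up inputs the poor lot is rejected after the fifth subsample, so the remaining subsamples never need to be analyzed, which is the practical benefit of sequential sampling.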
Sample location. In homogeneous populations it does not matter where the sample is taken from because all the subsamples have the same properties. In heterogeneous populations the location from which the subsamples are selected is extremely important. In random sampling the subsamples are chosen randomly from any location within the material being tested. Random sampling is often preferred because it avoids human bias in selecting samples and because it facilitates the application of statistics. In systematic sampling the samples are drawn systematically with location or time, e.g., every 10th box in a truck may be analyzed, or a sample may be chosen from a conveyor belt every minute. This type of sampling is often easy to implement, but it is important to be sure that there is no correlation between the sampling rate and the subsample properties. In judgment sampling the subsamples are drawn from the whole population using the judgment and experience of the analyst. These could be the easiest subsamples to get to, such as the boxes of product nearest the door of a truck. Alternatively, the person who selects the subsamples may have some experience about where the worst subsamples are usually found, e.g., near the doors of a warehouse where the temperature control is not so good. It is not usually possible to apply proper statistical analysis to this type of sampling, since the subsamples selected are not usually a good representation of the population.
Sample collection. Sample selection may be carried out either manually by a human being or by specialized mechanical sampling devices. Manual sampling may involve simply picking a sample from a conveyor belt or a truck, or using special cups or containers to collect samples from a tank or sack. The manner in which samples are selected is usually specified in sampling plans.
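The difference between random and systematic selection can be illustrated with a short Python sketch. The truck of 100 numbered boxes, the function names and the fixed seed are assumptions made purely for the example.

```python
import random


def random_sample(unit_ids, n, seed=None):
    """Random sampling: choose n units without replacement, free of human bias."""
    rng = random.Random(seed)
    return sorted(rng.sample(unit_ids, n))


def systematic_sample(unit_ids, interval, start=0):
    """Systematic sampling: take every `interval`-th unit, beginning at index `start`."""
    return unit_ids[start::interval]


# Hypothetical truck holding boxes numbered 1 to 100.
boxes = list(range(1, 101))

random_boxes = random_sample(boxes, 10, seed=1)    # 10 boxes chosen at random
every_tenth = systematic_sample(boxes, 10, start=9)  # boxes 10, 20, ..., 100
```

Passing a seed makes the random selection reproducible, which can be convenient for documenting which units were chosen. Judgment sampling has no equivalent sketch because the choice rests on the analyst rather than on a rule.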
2.3 Preparation of Laboratory Samples
Once we have selected a sample that represents the properties of the whole population, we must prepare it for analysis in the laboratory. The preparation of a sample for analysis must be done very carefully in order to make accurate and precise measurements.
2.3.1 Making Samples Homogeneous
The food material within the sample selected from the population is usually heterogeneous, i.e., its properties vary from one location to another. Sample heterogeneity may be caused by variations in the properties of different units within the sample (interunit variation) and/or by variations within the individual units in the sample (intraunit variation). The units in the sample could be apples, potatoes, bottles of ketchup, containers of milk, etc. An example of interunit variation would be a box of oranges, some of good quality and some of bad quality. An example of intraunit variation would be an individual orange, whose skin has different properties from its flesh. For this reason it is usually necessary to make samples homogeneous before they are analyzed; otherwise it would be difficult to select a representative laboratory sample from the sample. A number of methods have been developed for homogenizing foods, and the type used depends on the properties of the food being analyzed (e.g., solid, semisolid, liquid). Homogenization can be achieved using mechanical devices (e.g., grinders, mixers, slicers, blenders), enzymatic methods (e.g., proteases, cellulases, lipases) or chemical methods (e.g., strong acids, strong bases, detergents).
2.3.2. Reducing Sample Size
Once the sample has been made homogeneous, a smaller, more manageable portion is selected for analysis. This is usually referred to as a laboratory sample, and ideally it will have properties which are representative of the population from which it was originally selected. Sampling plans often define the method for reducing the size of a sample in order to obtain reliable and repeatable results.
2.3.3. Preventing Changes in Sample
Once we have selected our sample we have to ensure that it does not undergo any significant changes in its properties from the moment of sampling to the time when the actual analysis is carried out, e.g., enzymatic, chemical, microbial or physical changes. There are a number of ways these changes can be prevented.
• Enzymatic Inactivation. Many foods contain active enzymes that can cause changes in the properties of the food prior to analysis, e.g., proteases, cellulases, lipases, etc. If the action of one of these enzymes alters the characteristics of the compound being analyzed then it will lead to erroneous data, and the enzyme should therefore be inactivated or eliminated. Freezing, drying, heat treatment and chemical preservatives (or a combination) are often used to control enzyme activity, with the method used depending on the type of food being analyzed and the purpose of the analysis.
• Lipid Protection. Unsaturated lipids may be altered by various oxidation reactions. Exposure to light, elevated temperatures, oxygen or pro-oxidants can increase the rate at which these reactions proceed. Consequently, it is usually necessary to store samples that have high unsaturated lipid contents under nitrogen or some other inert gas, in dark rooms or covered bottles, and at refrigerated temperatures. Provided that they do not interfere with the analysis, antioxidants may be added to retard oxidation.
• Microbial Growth and Contamination. Microorganisms are present naturally in many foods and if they are not controlled they can alter the composition of the sample to be analyzed. Freezing, drying, heat treatment and chemical preservatives (or a combination) are often used to control the growth of microbes in foods.
• Physical Changes. A number of physical changes may occur in a sample, e.g., water may be lost due to evaporation or gained due to condensation; fat or ice may melt or crystallize; structural properties may be disturbed. Physical changes can be minimized by controlling the temperature of the sample and the forces that it experiences.
2.3.4. Sample Identification
Laboratory samples should always be labeled carefully so that if any problem develops its origin can easily be identified. The information used to identify a sample includes: a) sample description, b) time the sample was taken, c) location the sample was taken from, d) person who took the sample, and e) method used to select the sample. The analyst should always keep a detailed notebook clearly documenting the sample selection and preparation procedures performed and recording the results of any analytical procedures carried out on each sample. Each sample should be marked with a code on its label that can be correlated to the notebook, so that the source of any problem can be traced.
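The identifying information listed above maps naturally onto a small record keyed by the sample code. The following Python sketch is only one possible layout: the class name, field names and all of the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SampleLabel:
    """Identifying information for one laboratory sample (hypothetical layout)."""
    code: str               # code written on the label, cross-referenced to the notebook
    description: str        # a) sample description
    time_taken: datetime    # b) time the sample was taken
    location: str           # c) location the sample was taken from
    taken_by: str           # d) person who took the sample
    selection_method: str   # e) method used to select the sample


# All values below are invented for illustration.
label = SampleLabel(
    code="S-0001",
    description="Whole milk from tanker",
    time_taken=datetime(2000, 1, 10, 9, 30),
    location="Receiving bay, top of tank",
    taken_by="Analyst A",
    selection_method="Random sampling",
)

# The notebook is modeled as a dictionary so that a label code leads back to its record.
notebook = {label.code: label}
```

Making the record immutable (`frozen=True`) reflects the idea that identifying information should not be altered after the sample has been taken.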
2.4. Data Analysis and Reporting
Food analysis usually involves making a number of repeated measurements on the same sample to provide confidence that the analysis was carried out correctly and to obtain a best estimate of the value being measured and a statistical indication of the reliability of the value. A variety of statistical techniques are available that enable us to obtain this information about the laboratory sample from multiple measurements.
2.4.1. Measure of Central Tendency of Data
The most commonly used parameter for representing the overall properties of a number of measurements is the mean:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}     (1)
Here n is the total number of measurements, x_{i} are the individually measured values and \bar{x} is the mean value.
The mean is the best experimental estimate of the value that can be obtained from the measurements. It does not necessarily correspond to the true value of the parameter one is trying to measure. There may be some form of systematic error in our analytical method that means that the measured value is not the same as the true value (see below). Accuracy refers to how closely the measured value agrees with the true value. The problem with determining the accuracy is that the true value of the parameter being measured is often not known. Nevertheless, it is sometimes possible to purchase or prepare standards that have known properties and analyze these standards using the same analytical technique as used for the unknown food samples. The absolute error E_{abs}, which is the difference between the measured value (x_{i}) and the true value (x_{true}), can then be determined: E_{abs} = (x_{i} - x_{true}). For these reasons, analytical instruments should be carefully maintained and frequently calibrated to ensure that they are operating correctly.
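As a small worked illustration, the mean and the absolute error can be computed in Python as follows. The five replicate readings and the known standard value of 10.0 are made-up numbers, and applying the absolute error to the mean as well as to the individual readings is a choice made for this example.

```python
def mean(values):
    """Arithmetic mean: the sum of the measurements divided by their number."""
    return sum(values) / len(values)


def absolute_error(measured, true_value):
    """Absolute error: measured value minus true value."""
    return measured - true_value


# Made-up replicate measurements of a standard whose true value is taken as 10.0.
measurements = [10.2, 10.4, 10.3, 10.5, 10.1]
x_true = 10.0

x_bar = mean(measurements)
individual_errors = [absolute_error(x, x_true) for x in measurements]
error_of_mean = absolute_error(x_bar, x_true)
```

Here every individual error is positive, which hints at a systematic rather than a purely random deviation from the standard.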
2.4.2. Measure of Spread of Data
The spread of the data is a measure of how close repeated measurements are to each other. The standard deviation is the most commonly used measure of the spread of experimental measurements. This is determined by assuming that the experimental measurements vary randomly about the mean, so that they can be represented by a normal distribution. The standard deviation SD of a set of experimental measurements is given by the following equation:
SD = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}}     (2)
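For normally distributed data, the fraction of measured values within a specified range of the mean \bar{x} is:
• ±1 SD: about 68% of values lie within the range (\bar{x} - SD) to (\bar{x} + SD)
• ±2 SD: about 95% of values lie within the range (\bar{x} - 2SD) to (\bar{x} + 2SD)
• ±3 SD: more than 99% of values lie within the range (\bar{x} - 3SD) to (\bar{x} + 3SD)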
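The percentages quoted in this list follow from the normal distribution and can be checked with the Python standard library. The function name below is an assumption for the sketch; `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Coverage for 1, 2 and 3 standard deviations either side of the mean.
coverage = {k: fraction_within(k) for k in (1, 2, 3)}
```

The three values come out close to 0.68, 0.95 and 0.997, in line with the rounded percentages given in the list.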
Another parameter that is commonly used to provide an indication of the relative spread of the data around the mean is the coefficient of variation, CV = [SD / \bar{x}] × 100%.
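A short Python sketch of these two measures of spread is given below, using made-up replicate measurements. Note that `statistics.stdev` computes the sample standard deviation, i.e., it divides by n - 1.

```python
from statistics import mean, stdev

# Made-up replicate measurements, for illustration only.
measurements = [10.2, 10.4, 10.3, 10.5, 10.1]

x_bar = mean(measurements)
sd = stdev(measurements)      # sample standard deviation (n - 1 in the denominator)
cv = 100 * sd / x_bar         # coefficient of variation, in percent
```

For these numbers the standard deviation is about 0.16 and the coefficient of variation about 1.5%, so the spread is small relative to the mean.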
2.4.3. Sources of Error
There are three common sources of error in any analytical technique:
• Personal Errors (Blunders). These occur when the analytical test is not carried out correctly: the wrong chemical reagent or equipment might have been used; some of the sample may have been spilt; a volume or mass may have been recorded incorrectly; etc. It is partly for this reason that analytical measurements should be repeated a number of times using freshly prepared laboratory samples. Blunders are usually easy to identify and can be eliminated by carrying out the analytical method again more carefully.
• Random Errors. These produce data that vary in a nonreproducible fashion from one measurement to the next, e.g., instrumental noise. This type of error determines the standard deviation of a measurement. There may be a number of different sources of random error and these are cumulative (see "Propagation of Errors").
• Systematic Errors. A systematic error produces results that consistently deviate from the true answer in some systematic way, e.g., measurements may always be 10% too high. This type of error would occur if the volume of a pipette was different from the stipulated value. For example, a nominally 100 cm^{3} pipette may always deliver 101 cm^{3} instead of the correct value.
To make accurate and precise measurements it is important when designing and setting up an analytical procedure to identify the various sources of error and to minimize their effects. Often, one particular step will be the largest source of error, and the best improvement in accuracy or precision can be achieved by minimizing the error in this step.
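The different effects of random and systematic errors can be made concrete with a small simulation. The sketch below assumes a nominally 100 cm^{3} pipette that delivers 1 cm^{3} too much on average (a systematic error) with normally distributed scatter of 0.2 cm^{3} (a random error). The scatter value, the number of deliveries and the seed are arbitrary choices for the example.

```python
import random
from statistics import mean, stdev


def simulate_pipette(n, nominal_volume, bias, noise_sd, seed=0):
    """Simulate n deliveries: nominal volume + constant bias + random noise."""
    rng = random.Random(seed)
    return [nominal_volume + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


deliveries = simulate_pipette(n=1000, nominal_volume=100.0, bias=1.0, noise_sd=0.2)

# The systematic error shifts the mean away from the nominal value ...
observed_bias = mean(deliveries) - 100.0
# ... whereas the random error shows up as the standard deviation.
observed_spread = stdev(deliveries)
```

Repeating the measurement many times reduces the uncertainty in the mean, but it does not remove the bias: the average delivery stays close to 101 cm^{3} however many deliveries are simulated.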
2.4.4. Propagation of Errors
Most analytical procedures involve a number of steps (e.g., weighing, volume measurement, reading dials), and there will be an error associated with each step. These individual errors accumulate to determine the overall error in the final result. For random errors there are a number of simple rules that can be followed to calculate the error in the final result:
Addition (Z = X + Y) and Subtraction (Z = X - Y): \Delta Z = \sqrt{(\Delta X)^{2} + (\Delta Y)^{2}}     (3)
Multiplication (Z = XY) and Division (Z = X/Y): \frac{\Delta Z}{Z} = \sqrt{\left(\frac{\Delta X}{X}\right)^{2} + \left(\frac{\Delta Y}{Y}\right)^{2}}     (4)
Here, ΔX is the standard deviation of the mean value X, ΔY is the standard deviation of the mean value Y, and ΔZ is the standard deviation of the mean value Z. These simple rules should be learnt and used when calculating the overall error in a final result.
As an example, let us assume that we want to determine the fat content of a food and that we have previously measured the mass of fat extracted from the food (M_{E}) and the initial mass of the food (M_{I}):
M_{E} = 3.1 ± 0.3 g
M_{I} = 10.5 ± 0.7 g
% Fat Content = 100 × M_{E} / M_{I}
To calculate the mean and standard deviation of the fat content we need to use the multiplication/division rule (Z = X/Y) given by Equation 4. Initially, we assign values to the various parameters in the appropriate propagation of error equation:
X = 3.1; ΔX = 0.3
Y = 10.5; ΔY = 0.7
% Fat Content = Z = 100 × X/Y = 100 × 3.1/10.5 = 29.5%
ΔZ = Z × √[(ΔX/X)^{2} + (ΔY/Y)^{2}] = 29.5% × √[(0.3/3.1)^{2} + (0.7/10.5)^{2}] = 3.5%
Hence, the fat content of the food is 29.5 ± 3.5%. In reality, it may be necessary to carry out a number of different steps in a calculation, some that involve addition/subtraction and some that involve multiplication/division. When carrying out multiplication/division calculations it is necessary to ensure that all appropriate addition/subtraction calculations have been completed first.
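The two propagation rules and the fat content example can be reproduced with a few lines of Python. The function names are assumptions for this sketch. The second part, in which the extracted fat mass is obtained by difference from two weighings, uses invented weighings (53.1 ± 0.2 g and 50.0 ± 0.2 g) simply to show an addition/subtraction step being completed before the division step.

```python
import math


def propagate_add_sub(dx, dy):
    """Standard deviation of Z = X + Y or Z = X - Y."""
    return math.sqrt(dx ** 2 + dy ** 2)


def propagate_mul_div(z, x, dx, y, dy):
    """Standard deviation of Z = X * Y or Z = X / Y, given the value of Z."""
    return abs(z) * math.sqrt((dx / x) ** 2 + (dy / y) ** 2)


# Values from the worked example in the text.
m_e, d_m_e = 3.1, 0.3     # mass of extracted fat (g)
m_i, d_m_i = 10.5, 0.7    # initial mass of food (g)

fat = 100 * m_e / m_i                                   # % fat content
d_fat = propagate_mul_div(fat, m_e, d_m_e, m_i, d_m_i)

# Hypothetical two-step case: fat mass found by difference (flask + fat) - (flask).
flask_full, d_full = 53.1, 0.2     # invented weighing (g)
flask_empty, d_empty = 50.0, 0.2   # invented weighing (g)

m_e2 = flask_full - flask_empty                 # subtraction step first ...
d_m_e2 = propagate_add_sub(d_full, d_empty)
fat2 = 100 * m_e2 / m_i                         # ... then the division step
d_fat2 = propagate_mul_div(fat2, m_e2, d_m_e2, m_i, d_m_i)
```

The factor of 100 is an exact constant, so it scales both the result and its standard deviation without adding any uncertainty of its own.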
2.4.5. Significant Figures and Rounding
The number of significant figures used in reporting a final result is determined by the standard deviation of the measurements. A final result is reported to the correct number of significant figures when it contains all the digits that are known to be correct, plus a final one that is known to be uncertain. For example, a reported value of 12.13 means that the 12.1 is known to be correct but the 3 at the end is uncertain: it could be either a 2 or a 4 instead.
For multiplication (Z = X × Y) and division (Z = X/Y), the number of significant figures in the final result (Z) should be equal to that of the number from which it was calculated (X or Y) that has the fewest significant figures. For example, 12.312 (5 significant figures) × 31.1 (3 significant figures) = 383 (3 significant figures). For addition (Z = X + Y) and subtraction (Z = X - Y), the significant figures in the final result (Z) are determined by the number from which it was calculated (X or Y) that has its last significant figure in the highest decimal column. For example, 123.4567 (last significant figure in the "0.0001" decimal column) + 0.31 (last significant figure in the "0.01" decimal column) = 123.77 (last significant figure in the "0.01" decimal column). Or, 1310 (last significant figure in the "10" decimal column) + 12.1 (last significant figure in the "0.1" decimal column) = 1320 (last significant figure in the "10" decimal column).
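The examples in this paragraph can be reproduced with two small helper functions. This is a sketch: the function names are assumptions, and the decimal column is passed as a power of ten (-2 for the "0.01" column, 1 for the "10" column).

```python
import math


def round_sig(value, sig):
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))   # position of the leading digit
    return round(value, sig - 1 - exponent)


def round_to_column(value, power_of_ten):
    """Round value so that its last digit falls in the 10**power_of_ten column."""
    return round(value, -power_of_ten)


# Multiplication: keep the fewest significant figures of the inputs (3 here).
product = round_sig(12.312 * 31.1, 3)

# Addition: keep the highest last-significant-figure column of the inputs.
total_a = round_to_column(123.4567 + 0.31, -2)   # "0.01" column
total_b = round_to_column(1310 + 12.1, 1)        # "10" column
```

The built-in `round` is adequate for these particular numbers because none of them lies exactly halfway between two candidate results.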
When rounding numbers: always round any number with a final digit less than 5 downwards, and 5 or more upwards, e.g. 23.453 becomes 23.45; 23.455 becomes 23.46; 23.458 becomes 23.46. It is usually desirable to carry extra digits throughout the calculations and then round off the final result.
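When this rounding rule is applied in software, it is worth knowing that many languages do not round halves upwards by default. In Python, for instance, the built-in `round` works on binary floating-point numbers and rounds exact halves to the nearest even digit, so it can disagree with the rule above. A sketch that follows the rule exactly, using the standard `decimal` module (the function name is an assumption), is:

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(text, places):
    """Round a decimal number (given as a string) to `places` decimal places,
    rounding a final 5 upwards as described in the text."""
    quantum = Decimal(10) ** -places          # e.g. places=2 gives Decimal('0.01')
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)


# The three examples from the text, rounded to two decimal places.
rounded = [str(round_half_up(v, 2)) for v in ("23.453", "23.455", "23.458")]
```

Passing the numbers as strings avoids any binary floating-point representation error before the rounding is applied.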
2.4.6. Standard Curves: Regression Analysis
When carrying out certain analytical procedures it is necessary to prepare standard curves that are used to determine some property of an unknown material. A series of calibration experiments is carried out using samples with known properties and a standard curve is plotted from these data. For example, a series of protein solutions with known concentrations of protein could be prepared and their absorbance of electromagnetic radiation at 280 nm could be measured using a UV-visible spectrophotometer. For dilute protein solutions there is a linear relationship between absorbance and protein concentration:
y = ax + b
A best-fit line, which has a gradient of a and a y-intercept of b, is drawn through the data using regression analysis. The concentration of protein in an unknown sample can then be determined by measuring its absorbance: x = (y - b)/a, where in this example x is the protein concentration and y is the absorbance. How well the straight line fits the experimental data is expressed by the coefficient of determination r^{2} (the square of the correlation coefficient r), which has a value between 0 and 1. The closer the value is to 1 the better the fit between the straight line and the experimental values: r^{2} = 1 is a perfect fit. Most modern calculators and spreadsheet programs have routines that can be used to automatically determine r^{2}, the slope and the intercept of a set of data.
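The regression calculation performed by a calculator or spreadsheet can be sketched in plain Python using the ordinary least-squares formulas. The calibration data below (concentrations in mg/mL and their absorbances) are invented for illustration, and the function names are assumptions.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b, r_squared)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx                       # gradient
    b = y_mean - a * x_mean             # y-intercept
    r_squared = sxy ** 2 / (sxx * syy)  # square of the correlation coefficient
    return a, b, r_squared


def concentration_from_absorbance(y, a, b):
    """Invert the standard curve: x = (y - b) / a."""
    return (y - b) / a


# Invented calibration data: protein concentration (mg/mL) and absorbance at 280 nm.
concentrations = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
absorbances = [0.01, 0.21, 0.40, 0.62, 0.80, 1.01]

a, b, r_squared = linear_fit(concentrations, absorbances)
unknown = concentration_from_absorbance(0.50, a, b)   # sample with absorbance 0.50
```

The standard curve should only be used within the concentration range covered by the calibration samples, since the linear relationship is stated to hold for dilute solutions.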
2.4.7. Rejecting Data
When carrying out an experimental analytical procedure it will sometimes be observed that one of the measured values is very different from all of the other values, e.g., as the result of a "blunder" in the analytical procedure. Occasionally, this value may be treated as being incorrect and rejected. There are certain rules based on statistics that allow us to decide whether a particular point can be rejected or not. A test called the Q-test is commonly used for this purpose.
Q = \frac{|X_{BAD} - X_{NEXT}|}{X_{HIGH} - X_{LOW}}
Here X_{BAD} is the questionable value, X_{NEXT} is the next closest value to X_{BAD}, X_{HIGH} is the highest value of the data set and X_{LOW} is the lowest value of the data set. If the Q-value is higher than the value given in a Q-test table for the number of samples being analyzed then the questionable value can be rejected:
Number of Observations    Q-value for Data Rejection (90% confidence level)
3                         0.94
4                         0.76
5                         0.64
6                         0.56
7                         0.51
8                         0.47
9                         0.44
10                        0.41
For example, if five measurements were carried out and one measurement was very different from the rest (e.g., 20, 22, 25, 50, 21), giving a Q-value of (50 - 25)/(50 - 20) = 0.83, then it could be safely rejected (because this is higher than the value of 0.64 given in the Q-test table for five observations).
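The Q-test is simple to automate. The following Python sketch uses the 90% confidence values from the table above; the function and variable names are assumptions, and the sketch presumes that the data set has between 3 and 10 observations that are not all identical.

```python
# Critical Q-values at the 90% confidence level, keyed by number of observations.
Q_CRITICAL_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56,
                 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}


def q_value(values, suspect):
    """Q = |suspect - nearest other value| / (highest value - lowest value)."""
    others = list(values)
    others.remove(suspect)                              # drop one copy of the suspect
    nearest = min(others, key=lambda v: abs(v - suspect))
    return abs(suspect - nearest) / (max(values) - min(values))


def can_reject(values, suspect):
    """True if the suspect value may be rejected at the 90% confidence level."""
    return q_value(values, suspect) > Q_CRITICAL_90[len(values)]


data = [20, 22, 25, 50, 21]     # the example data set from the text
q = q_value(data, 50)           # (50 - 25) / (50 - 20)
reject_50 = can_reject(data, 50)
reject_25 = can_reject(data, 25)
```

For this data set the value 50 can be rejected, whereas an unremarkable value such as 25 cannot.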
References
Nielsen, S.S. (1998). Food Analysis, 2nd Edition. Aspen Publication, Gaithersburg, Maryland.
Procter, A. and Meullenet, J.F. (1998). Sampling and Sample Preparation. In: Food Analysis, 2nd Edition. Aspen Publication, Gaithersburg, Maryland.