
This paper proposes an optimization model for selecting a larger subsample that improves the representativeness of a simple random sample previously obtained from a population larger than the population of interest. The problem is formulated as a convex mixed-integer nonlinear program (convex MINLP) and is therefore NP-hard. The solution is nevertheless found by maximizing the size of the subsample taken from a stratified random sample with proportional allocation, subject to a p-value large enough to indicate a good fit to the population of interest under Pearson’s chi-square goodness-of-fit test. The paper also applies the model to the Continuous Sample of Working Lives (CSWL), a set of anonymized microdata containing information on individuals from Spanish Social Security records. The results show that it is possible to obtain a larger subsample from the CSWL that represents the pensioner population far better for each of the waves analyzed.
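
The idea of maximizing the subsample size under a goodness-of-fit constraint can be illustrated with a toy sketch. It is not the paper's formulation or solver: it replaces the convex MINLP with brute-force enumeration, and the three strata, the available counts, the population proportions, and the p-value threshold of 0.9 are all hypothetical values chosen for illustration. With three strata the test has two degrees of freedom, so the p-value has the closed form exp(-x/2).

```python
import math
from itertools import product


def chi2_stat(counts, proportions):
    """Pearson chi-square statistic of observed counts against target proportions."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, proportions))


def p_value_df2(stat):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    Valid only for three categories (df = 3 - 1 = 2), where the survival
    function reduces to exp(-stat / 2).
    """
    return math.exp(-stat / 2.0)


def largest_subsample(available, proportions, min_p):
    """Enumerate all per-stratum counts and keep the largest feasible subsample.

    Brute force stands in for a MINLP solver here; it is only practical for
    tiny instances such as this illustration.
    """
    best = None
    for counts in product(*(range(a + 1) for a in available)):
        n = sum(counts)
        if n == 0:
            continue  # an empty subsample has no defined statistic
        if p_value_df2(chi2_stat(counts, proportions)) >= min_p:
            if best is None or n > sum(best):
                best = counts
    return best


# Hypothetical data: units available per stratum in the original sample,
# and the stratum shares in the population of interest.
available = (50, 20, 10)
proportions = (0.5, 0.3, 0.2)
min_p = 0.9

full_p = p_value_df2(chi2_stat(available, proportions))
best = largest_subsample(available, proportions, min_p)
best_p = p_value_df2(chi2_stat(best, proportions))

print("full sample:", available, "p-value:", round(full_p, 4))
print("best subsample:", best, "size:", sum(best), "p-value:", round(best_p, 4))
```

In this toy instance the full sample fits the target proportions poorly, while the selected subsample gives up some units in exchange for a p-value at or above the threshold. A realistic instance would need a proper MINLP solver and the general chi-square distribution rather than the two-degree-of-freedom shortcut.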
