Evaluating a variable using the IN() Function

I’m trying to resolve a datastep variable in the in() function. I have a dataset that looks like the following:

|Run|Sample Level|Samples Tested|
| 1 |     1      |      1-5     |
| 1 |     2      |      1-5     |
...etc
| 1 |     5      |      1-5     |
---------------------------------
| 2 |     1      |      1-4     |
| 2 |     2      |      1-4     |

​x
 
|Run|Sample Level|Samples Tested|| 1 |     1      |      1-5     || 1 |     2      |      1-5     |...etc| 1 |     5      |      1-5     |---------------------------------| 2 |     1      |      1-4     || 2 |     2      |      1-4     |​

The samples tested vary by run. Normally the only sample levels in the dataset are the ones in the range provided by “Samples Tested”. However occasionally this is not the case, and it can get messy. For example the last one I worked on looked like this:

|Run|Sample Level|Samples Tested|
| 1 |      1     |2-9, 12-35, 37-40|

 
|Run|Sample Level|Samples Tested|| 1 |      1     |2-9, 12-35, 37-40|​

In this case I’d want to drop all rows with sample levels that were not included in Samples Tested, which I did by manually adding the code:

Data Want;
set Have;
if sample_level not in (2:9, 12:35, 37:40) then delete;
run;

 
Data Want;set Have;if sample_level not in (2:9, 12:35, 37:40) then delete;run;​

But what I want to do is have this done automatically by looking at the samples tested column. It’s easy enough to turn a “-” into a “:”, but where I’m stuck is getting the IN() function to recognize or resolve a variable. I would like code that looks like this: if sample_level not in(Samples_Tested) then delete; where samples_tested has been transformed to be something that the IN() function can handle. I’m also not opposed to using proc sql; if anyone has a solution that they think will work. I know you can do things like

Proc sql; Create table want as select * from HAVE where Sample_Level in (Select Samples_Tested from Have); Quit;

 
Proc sql; Create table want as select * from HAVE where Sample_Level in (Select Samples_Tested from Have); Quit;​

But the problem is that the samples tested varies by run and there could be 16 different runs. Hopefully I’ve explained the challenge clearly enough. Thanks for taking the time to read this and thanks in advance for your help!

Answer

Assuming the values of SAMPLES_TESTED is constant for each value of RUN you could use it to generate the selection criteria. For example you could use a data _null_ step to write a WHERE statement to a file and then %include that code into another data step.

filename code temp;
data _null_;
  file code;
  if eof then put ';';
  set have end=eof;
  by run;
  if first.run;
  if _n_=1 then put 'where ' @ ;
  else put '   or ' @ ;
  samples_tested=translate(samples_tested,':','-');
  put '(' run= 'and sample_level in (' samples_tested '))';
run;
data want;
  set have;
  %include code;
run;

 
filename code temp;data _null_;  file code;  if eof then put ';';  set have end=eof;  by run;  if first.run;  if _n_=1 then put 'where ' @ ;  else put '   or ' @ ;  samples_tested=translate(samples_tested,':','-');  put '(' run= 'and sample_level in (' samples_tested '))';run;data want;  set have;  %include code;run;​

Note: IN is an operator and not a function.

Advertisement

Answer