Suppose that you have two files that consist of records, one per line, and you want to ensure that none of the records in the second file appear in the first. How do you do it with only the text-processing command-line tools commonly available on *nix systems?

Comment from Paul Brown @ 2009-09-25T22:22:15Z # permalink
You could use diff, but that's not in the spirit of the question.
My solution is:
Now, the lines prefixed by a count of 1 are those only in file2, the lines prefixed by a 2 are only in file1, and those prefixed by a 3 are in both.
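The command itself is missing from the comment above. One pipeline that yields counts matching that description, offered as a guess and not as Paul's actual solution, lists file1 twice and file2 once before counting. The function name is hypothetical, and the sketch assumes each record appears at most once within each file:

```shell
# Hypothetical helper: tag every distinct record with a count showing where it occurs.
# Assumes each record appears at most once within each file.
#   1 = only in file2, 2 = only in file1, 3 = in both
classify_records() {
    # $1 is file1, $2 is file2. file1 is listed twice so its records
    # contribute 2 to the count, while file2's records contribute 1.
    cat "$1" "$1" "$2" | sort | uniq -c
}
```

Records prefixed by 3 are the ones that violate the requirement, so if `classify_records file1 file2 | awk '$1 == 3'` prints nothing, no record of file2 appears in file1.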
Comment from Alexander @ 2009-09-26T19:32:00Z # permalink
You need to be sure that the "intersection" of file1 and file2 is empty. If the records within each file are unique, something like this can be useful:
if [ `cat file1 file2 | sort | uniq -d | wc -l` -eq 0 ]
then
    echo No file2 in file1
else
    echo Files intersect
fi
Comment from Hen @ 2009-09-29T05:35:14Z # permalink
grep -Ff file2 file1
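One caveat with fixed-string matching: `-F` alone matches a record from file2 anywhere inside a line of file1, so a record `foo` would flag the line `foobar`. A minimal sketch that adds `-x` to require whole-line matches follows; the function names are hypothetical and not from the thread:

```shell
# Hypothetical helper: print the records of file1 ($1) that also appear,
# as whole lines, in file2 ($2).
#   -F  treat each pattern as a fixed string, not a regular expression
#   -x  require the pattern to match the entire line
#   -f  read the patterns, one per line, from the named file
common_records() {
    grep -Fxf "$2" "$1"
}

# Succeeds only when no record of file2 appears in file1.
files_disjoint() {
    common_records "$1" "$2" > /dev/null
    # grep exits 1 when nothing matched and 2 on an error such as a missing file.
    [ $? -eq 1 ]
}
```

With these, `files_disjoint file1 file2 && echo "No file2 in file1"` answers the original question, and `common_records file1 file2` lists the offending records when the check fails.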
Comment from Don Morrison @ 2010-05-14T15:34:09Z # permalink
Is it cheating to write an awk or perl script?
Comment from Paul Brown @ 2010-05-14T19:49:18Z # permalink
Fair question. I'm at a loss to think of a current *nix that ships without both perl and awk, so either is fair game.
Comment from Don Morrison @ 2010-05-15T18:28:29Z # permalink
#!/usr/bin/env perl -w --
# Usage: pod2usage subtractlines.pl
# Docs: pod2text subtractlines.pl
# Boss: pod2pdf subtractlines.pl > subtractlines.pdf
use 5.006;
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;
use English;
my %opt = ();
GetOptions(\%opt, ("overwrite")) or pod2usage(2);

# Return the elements of @$aref that are not keys of %$bref.
sub aminusb (\@\%) {
    my $aref = shift;
    my $bref = shift;
    my @result = ();
    foreach (@$aref) {
        if (not exists $$bref{"$_"}) { push @result, "$_"; }
    }
    return @result;
}

if (scalar(@ARGV) != 2) {
    print "\nERROR(subtractlines.pl): Two files must be specified.\n\n";
    pod2usage(2);
}

# Open the first file read-write only when it is going to be overwritten.
open(my $fha, ($opt{"overwrite"}?'+<':'<'), "$ARGV[0]") or die $ERRNO;
open(my $fhb, '<', "$ARGV[1]") or die $ERRNO;
my @a = (); my %b = ();
while (<$fhb>) { $b{"$_"} = 1; }
while (<$fha>) { push @a, "$_"; }
my @result = aminusb(@a,%b);
if (not exists $opt{"overwrite"}) {
    foreach(@result) { print "$_"; }
} else {
    # Rewind before truncating; otherwise the output is written at the old
    # end-of-file offset and the file starts with a run of NUL bytes.
    seek $fha, 0, 0;
    truncate $fha, 0;
    foreach(@result) { print $fha "$_"; }
}
__END__

=pod

=head1 NAME

subtractlines.pl - changes file1 by subtracting lines from file2

=head1 USAGE

C<chmod a+x subtractlines.pl> #just once

C<./subtractlines.pl [-o|--overwrite] fremovelines fimmutablelines>

I<or if your *nix system doesn't respect env on the shebang line,>

C<perl subtractlines.pl [-o|--overwrite] fremovelines fimmutablelines>

B<Example 1: perl subtractlines.pl --overwrite file1 file2>

B<Example 2: perl subtractlines.pl file1 file2 E<gt> resultfile>

=head1 DESCRIPTION

A solution to the Mult.Ifario.Us CommandLine-Puzzler.

L<http://mult.ifario.us/p/commandline-puzzler>

=head1 AUTHOR

B<Donald Alan Morrison> E<lt>DonMorrison _ a _ t _ gmail.comE<gt>

=cut
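For comparison, the subtraction that the script above performs without --overwrite can be sketched as an awk one-liner wrapped in a shell function. This is a minimal sketch, not a drop-in replacement: the function name is hypothetical and the result is only written to standard output.

```shell
# Hypothetical helper: print the lines of file1 ($1) that do not appear in file2 ($2).
subtract_lines() {
    # awk reads file2 first; ARGV[1] is the first file named on its command line.
    # While reading that file, remember every line and print nothing.
    # For the remaining file (file1), print only the lines that were not remembered.
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }
         !($0 in seen)' "$2" "$1"
}
```

Usage mirrors the second example in the script's documentation: `subtract_lines file1 file2 > resultfile`. Comparing FILENAME with ARGV[1], instead of the more common NR == FNR test, keeps the result correct when file2 is empty.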
Comment from Donald Alan Morrison @ 2010-05-24T18:46:39Z # permalink
It's surprising that Ubuntu (10.04) rejects arguments on the shebang line, unlike Mac OS X and other unices.
For example, the "-w --" portion breaks on Ubuntu for the following shebang, but removing it fixes the problem. Methinks Ubuntu is missing the point of /usr/bin/env(!) =)
#!/usr/bin/env perl -w --
It's not a bash issue: it fails under all shells, because it's an exec() issue.
Here's a map of the shebang (#!) fragmentation (not POSIX specified):
http://www.in-ulm.de/~mascheck/various/shebang/
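The difference comes down to how the text after the interpreter path is handed over. Linux passes it to the interpreter as one single argument, so env is asked to find a program literally named "perl -w --", while systems that split the line pass three separate words. The sketch below only imitates the two cases with a hypothetical helper that makes the argument boundaries visible; it does not run env or perl:

```shell
# Hypothetical helper: print each argument on its own line, wrapped in
# brackets, so that the boundaries between arguments are visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Imitates Linux: the whole tail of the shebang line arrives as one argument.
show_args "perl -w --"

# Imitates a system that splits the shebang line into separate arguments.
show_args perl -w --
```

Newer env implementations (GNU coreutils 8.30 and later, and FreeBSD) accept a -S option that makes env split such a string itself, as in `#!/usr/bin/env -S perl -w --`; the Ubuntu release discussed here predates it.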