Commandline Puzzler

Paul R. Brown @ 2009-09-25T19:39:39Z

Suppose that you have two files that consist of records, one per line, and you want to ensure that none of the records in the second file appear in the first. How do you do it with only the text-processing command-line tools commonly available on *nix systems?

Comment from Paul Brown @ 2009-09-25T22:22:15Z # permalink

You could use diff, but that's not in the spirit of the question.

My solution is:

 cat file1 file1 file2 | sort | uniq -c > out

Now the lines prefixed by a count of 1 are those only in file2, the lines prefixed by a 2 are only in file1, and those prefixed by a 3 are in both. (This assumes that no record is repeated within either file.)
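
To pull out just one class of records, the count column can be filtered. This is a minimal sketch, assuming each record appears at most once per file; the function name `records_in_both` is made up for illustration.

```shell
# Hypothetical helper: print the records that occur in both files.
# $1 = file1, $2 = file2. Assumes no record repeats within a file.
records_in_both() {
    # file1 is fed in twice, so after uniq -c:
    #   count 1 = only in file2, count 2 = only in file1, count 3 = in both.
    cat "$1" "$1" "$2" | sort | uniq -c |
        awk '$1 == 3 { sub(/^ *[0-9]+ /, ""); print }'
}
```

Calling `records_in_both file1 file2` and getting no output would mean the two files share no records. Changing the `3` to `1` or `2` selects the other two classes.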

Comment from Alexander @ 2009-09-26T19:32:00Z # permalink

You need to be sure that the "intersection" of file1 and file2 is empty. If the records within each file are unique, something like this can be useful:

 if [ `cat file1 file2 | sort | uniq -d | wc -l` -eq 0 ]
 then
     echo No file2 in file1
 else
     echo Files intersect
 fi
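
If records may repeat within a file, `uniq -d` would also report those in-file repeats. One way around that, sketched here under the assumption that a temporary directory from `mktemp -d` is acceptable, is to de-duplicate each file first. The function name `no_overlap` is hypothetical.

```shell
# Hypothetical helper: exit 0 when no record of file2 ($2) appears in file1 ($1).
no_overlap() {
    tmp=$(mktemp -d) || return 2
    # Remove repeats inside each file so they are not mistaken for overlap.
    sort -u "$1" > "$tmp/a"
    sort -u "$2" > "$tmp/b"
    # Any line uniq -d reports now must be present in both files.
    dups=$(cat "$tmp/a" "$tmp/b" | sort | uniq -d | wc -l | tr -d ' ')
    rm -rf "$tmp"
    [ "$dups" -eq 0 ]
}
```

It can be used in the same way as the test above: `if no_overlap file1 file2; then echo No file2 in file1; else echo Files intersect; fi`.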

Comment from Hen @ 2009-09-29T05:35:14Z # permalink

grep -Ff file2 file1
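
Note that `-F` alone matches fixed strings anywhere in a line, so a record `ab` in file2 would also flag `cab` in file1. A small sketch that adds `-x` to require whole-line matches (the function names are invented for illustration):

```shell
# Print the records of file1 ($1) that also appear, as whole lines, in file2 ($2).
shared_records() {
    grep -Fxf "$2" "$1"
}

# Print the records of file1 ($1) that do not appear in file2 ($2).
file1_minus_file2() {
    grep -Fxvf "$2" "$1"
}
```

An empty result from `shared_records file1 file2` would indicate that none of file2's records are in file1.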

Comment from Don Morrison @ 2010-05-14T15:34:09Z # permalink

Is it cheating to write an awk or perl script?

Comment from Paul Brown @ 2010-05-14T19:49:18Z # permalink

Fair question — I'm at a loss to think of a current *nix that ships without perl and awk both present, meaning either is fair game.
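
For what it's worth, an awk version can be quite short. This is only a sketch, wrapped in a shell function with a made-up name (`subtract_lines`), and it assumes file2 fits in memory.

```shell
# Print the lines of file1 ($1) that are absent from file2 ($2),
# keeping file1's original order.
subtract_lines() {
    # awk reads file2 first and remembers each line, then prints only the
    # unseen lines of file1. Comparing FILENAME with ARGV[1] (instead of the
    # common NR == FNR idiom) keeps this correct when file2 is empty.
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$2" "$1"
}
```

Unlike the sort-based pipelines, this does not reorder the records.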

Comment from Don Morrison @ 2010-05-15T18:28:29Z # permalink

#!/usr/bin/env perl -w --
# Usage: pod2usage subtractlines.pl
# Docs: pod2text subtractlines.pl
# Boss: pod2pdf subtractlines.pl > subtractlines.pdf
use 5.006;
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;
use English;

my %opt = ();
GetOptions(\%opt, ("overwrite")) or pod2usage(2);

# Return the elements of @$aref that are not keys of %$bref.
sub aminusb (\@\%) {
    my $aref = shift;
    my $bref = shift;
    my @result = ();
    foreach (@$aref) {
        if (not exists $$bref{"$_"}) { push @result, "$_"; }
    }
    return @result;
}

if (scalar(@ARGV) != 2) {
    print "\nERROR(subtractlines.pl): Two files must be specified.\n\n";
    pod2usage(2);
}

open(my $fha, ($opt{"overwrite"}?'+<':'<'), "$ARGV[0]") or die $ERRNO;
open(my $fhb, '<', "$ARGV[1]") or die $ERRNO;

my @a = (); my %b = ();
while (<$fhb>) { $b{"$_"} = 1; }
while (<$fha>) { push @a, "$_"; }

my @result = aminusb(@a,%b);

if (not exists $opt{"overwrite"}) {
    foreach(@result) { print "$_"; }
} else {
    # The read loop left the file position at EOF; rewind before truncating
    # so the result is written from offset 0.
    seek $fha, 0, 0;
    truncate $fha, 0;
    foreach(@result) { print $fha "$_"; }
}

__END__

=pod

=head1 NAME

subtractlines.pl - changes file1 by subtracting lines from file2

=head1 USAGE

C<chmod a+x subtractlines.pl> #just once

C<./subtractlines.pl [-o|--overwrite] fremovelines fimmutablelines>

I<or if your *nix system doesn't respect env on the shebang line,>

C<perl subtractlines.pl [-o|--overwrite] fremovelines fimmutablelines>

B<Example 1: perl subtractlines.pl --overwrite file1 file2>

B<Example 2: perl subtractlines.pl file1 file2 E<gt> resultfile>

=head1 DESCRIPTION

A solution to the Mult.Ifario.Us CommandLine-Puzzler.

L<http://mult.ifario.us/p/commandline-puzzler>

=head1 AUTHOR

B<Donald Alan Morrison> E<lt>DonMorrison _ a _ t _ gmail.comE<gt>

=cut

Comment from Donald Alan Morrison @ 2010-05-24T18:46:39Z # permalink

It's surprising that Ubuntu (10.04) rejects arguments on the shebang line, unlike Mac OS X and some other Unixes.

For example, the "-w --" portion breaks on Ubuntu for the following shebang, but removing it fixes the problem. Methinks Ubuntu is missing the point of /usr/bin/env(!) =)

#!/usr/bin/env perl -w --

It's not a bash issue; it fails under all shells because it's an exec() issue: on Linux, everything after the interpreter path is handed to env as a single argument.

Here's a map of the shebang (#!) fragmentation (not POSIX specified):

http://www.in-ulm.de/~mascheck/various/shebang/
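
One workaround, sketched here rather than offered as the definitive fix, is to keep the shebang down to a single argument and turn warnings on inside the script with `use warnings;`, which the script above already does (it is lexically scoped, so it is close to but not identical to `-w`). The function name `write_portable_script` and the file name in the usage note are hypothetical.

```shell
# Hypothetical helper: write a tiny Perl script to the path given in $1,
# with a shebang that passes only one argument ("perl") to env.
write_portable_script() {
    cat > "$1" <<'EOF'
#!/usr/bin/env perl
use strict;
use warnings;   # enabled in the script instead of via -w on the shebang line
print "ok\n";
EOF
    chmod +x "$1"
}
```

On a system with perl on the PATH, `write_portable_script hello.pl && ./hello.pl` should print `ok` regardless of how the kernel splits shebang arguments.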