Cross reference tool

N

News groups

Hi!

For the development of web pages it is important to know the word
frequency of a page. Together with the GNU diction program you can check
and maintain easily the readability of a page.

So i need a cross referencer. I could not find a simple program for
win32. Somewhere around 1982 I developed a PASCAL cross reference
program, which is some time ago ;). Never done anything in C though.

I wanted to do this in C, mainly because of this newsgroup, which I find
to have a lot experts, providing usually good comments.

So I Googled and borrowed code from a lot of places, putting it together
in one file. The code is here (360 lines):

http://requirements-management.nl/files/download/xref-sources-0.99.zip

The header looks like this.

/****************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file>*/
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 */
/* input file. */
/* 02. By default xref neglects frequent */
/* English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as */
/* feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe xref.c*/
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, */
/* 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
/*****************************************************************/

If appropriate I am more than willing to add it to this message. Please
let me know.

All comments are welcomed. I would appreciate it if they not only
signaled issues, but also possible solutions.

Kind regards, and thanks in advance,

Hans Lodder
 
B

Ben Bacarisse

News groups said:
So i need a cross referencer. I could not find a simple program for
win32. Somewhere around 1982 I developed a PASCAL cross reference
program, which is some time ago ;). Never done anything in C though.

I wanted to do this in C, mainly because of this newsgroup, which I
find to have a lot experts, providing usually good comments.

So I Googled and borrowed code from a lot of places, putting it
together in one file. The code is here (360 lines):

http://requirements-management.nl/files/download/xref-sources-0.99.zip

It's hard to be motivated to do a review of code not posted here so
you may get only a few replies. I'll made two suggestions after
commenting that the code looks good overall.

(1) An unbalanced tree is a good first approximation, but your list of
common words (I'd not call them keywords -- too confusing) is sorted
so you get, in essence, a linked list search not a tree-based one!

(2) If you include \r in the excluded characters, your files will work
on non Windows systems. Alternatively, find a way to distribute a
plain text version of the common word list so that it has native
line endings wherever it is used.

<snip>
 
N

News groups

Ben Bacarisse schreef:
It's hard to be motivated to do a review of code not posted here so
you may get only a few replies. I'll made two suggestions after
commenting that the code looks good overall.

(1) An unbalanced tree is a good first approximation, but your list of
common words (I'd not call them keywords -- too confusing) is sorted
so you get, in essence, a linked list search not a tree-based one!

(2) If you include \r in the excluded characters, your files will work
on non Windows systems. Alternatively, find a way to distribute a
plain text version of the common word list so that it has native
line endings wherever it is used.

<snip>

Thanks Ben! The complete code of xref.c

/********************************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file> */
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 input file. */
/* 02. By default xref neglects frequent English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe
xref.c */
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
/********************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stdbool.h>

#define WORD_WIDTH 15
#define NUM_WIDTH 5

#define LINE_BUF_SIZE 65536

#ifndef NDEBUG
#define NDEBUG
#undef NDEBUG
#endif

/* Node structure for each list of line numbers */
struct List
{
int lnum; /* line number of word */
struct List *next;
};
typedef struct List List;

/* Node structure for tree of distinct words */
struct Tree
{
char word[WORD_WIDTH];
int cnt_word; /* number of instances of word */
List *lstptr;
struct Tree *left, *right;
};
typedef struct Tree Tree;

/* Tree and list functions */
static Tree *addword (Tree *, const char *);
static List *addline (List *, int);
static Tree *find_node (Tree *, const char *);
static void print_header (void);
static void print_tree (Tree *);
static void print_list (List *);
static Tree *IsKeyword (Tree *, const char *);

static int ndistinct = 0;

int
main (int argc, char *argv[])
{
char linebuf[LINE_BUF_SIZE];
Tree *words = NULL;
Tree *keywords = NULL;
int lineno = 0;
FILE *fp;
char lang_excl_keywords[BUFSIZ] = "";
char *s;

/* command line processing */

while (--argc > 0 && (*++argv)[0] == '-')
{
for (s = argv[0] + 1; *s != '\0'; s++)
{
switch (*s)
{
case 'L':
--argc;
++argv;
strcpy (lang_excl_keywords, argv[0]);
break;
default:
fprintf (stderr, "xref: -I- version xref-0.99\n");
fprintf (stderr,
"xref: -I- create a document cross reference; optional:
exclude words\n");
fprintf (stderr,
"xref: -I- default: exclude common en-US words\n");
fprintf (stderr,
"xref: usage: xref -L <excluded words file> <stdin>
<stdout>\n");
fprintf (stderr, "xref: valid option is '-L'\n");
fprintf (stderr,
" L : letter code for the excluded keywords [C, de,
en, nl]\n");
fprintf (stderr, "\n");
fprintf (stderr,
"xref: mail requests and bugs to hans.lodder at
requirements-management.nl\n");
break;
}
}
}

/* Do optional input redirection */

if (argc == 1)
{
assert (strlen (argv[0]) != 0);
if (freopen (argv[0], "r", stdin) == NULL)
{
fprintf (stderr, "cannot open file '%s'", argv[0]);
exit (EXIT_FAILURE);
}
}

/* read in the keywords, excepted from the cross reference */
if (strlen (lang_excl_keywords) == 0)
strcat (lang_excl_keywords, "en.xrf");
else
strcat (lang_excl_keywords, ".xrf");

if ((fp = fopen (lang_excl_keywords, "r")) == NULL)
{
fprintf (stderr, "cannot open file '%s'", lang_excl_keywords);
exit (EXIT_FAILURE);
}

/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, fp) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";

/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';

/* assert (keywords != NULL); */
keywords = addword (keywords, wordptr);

lineptr = NULL;
}
}

/* terminate keyword processing */
if (fp != NULL)
(void) fclose (fp);

/* initialize the processing of words */

ndistinct = 0; /* re-initialize counter */
/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, stdin) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";

assert (lineptr != NULL);
++lineno;

/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
Tree *node;

assert (wordptr != NULL);
if (IsKeyword (keywords, wordptr) == NULL)
{
#ifdef NDEBUG
fprintf (stderr, "File: '%s', line '%d': wordptr= '%s'\n",
__FILE__, __LINE__, wordptr);
#endif
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';

assert (wordptr != NULL);
words = addword (words, wordptr);
node = find_node (words, wordptr);
(node->cnt_word)++; /* increment the number of instances of word */

node->lstptr = addline (node->lstptr, lineno);
}

lineptr = NULL;
}
}

/* Print results */

if (ndistinct > 0)
printf ("No. of distinct words: %d.\n\n", ndistinct);

if (words != NULL)
print_header ();
print_tree (words);

/* gcc guarantees that all space is returned to the OS, so it is
unnecessary to free allocated memory and file pointers. */

return 0;
}

Tree *
addword (Tree * tp, const char *word)
{
if (tp == NULL)
{
/* Add new entry */
assert (tp == NULL);

++ndistinct;
if ((tp = malloc (sizeof (Tree))) == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (tp != NULL);
strncpy (tp->word, word, WORD_WIDTH);
tp->cnt_word = 0;

tp->left = tp->right = NULL;
tp->lstptr = NULL;
}
else if (strcmp (tp->word, word) < 0)
tp->right = addword (tp->right, word);
else if (strcmp (tp->word, word) > 0)
tp->left = addword (tp->left, word);

return tp;
}

List *
addline (List * lp, int n)
{
if (lp == NULL)
{
/* Append new line number to list */
List *newp = malloc (sizeof (List));

if (newp == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (newp != NULL);
newp->lnum = n;
newp->next = NULL;
return newp;
}
else if (lp->lnum != n)
lp->next = addline (lp->next, n);

return lp;
}

void
print_header (void) /* print column headers */
{
const char *txt_keyword = "Keyword";
const char *txt_count = "Count";
const char *txt_perc = "( % )";
const char *txt_lineno = "Line number(s)";
const char *txt_hyphen = "-----------";

printf ("%-*.*s %-*s %-*s %-s\n", WORD_WIDTH, WORD_WIDTH, txt_keyword,
NUM_WIDTH, txt_count, NUM_WIDTH + NUM_WIDTH - 1, txt_perc,
txt_lineno);
printf ("%-*.*s %-*.5s %-*.8s %-s\n", WORD_WIDTH, WORD_WIDTH - 8,
txt_hyphen, NUM_WIDTH, txt_hyphen, NUM_WIDTH + NUM_WIDTH - 1,
txt_hyphen, txt_hyphen);
}

void
print_tree (Tree * tp)
{
if (tp != NULL)
{
/* In order traversal */
print_tree (tp->left);
printf ("%-*.*s: %*d (%5.1f%%)", WORD_WIDTH, WORD_WIDTH, tp->word,
NUM_WIDTH, tp->cnt_word,
(float) (tp->cnt_word * 100.0 / ndistinct));
print_list (tp->lstptr);
(void) putchar ('\n');
print_tree (tp->right);
}
}

void
print_list (List * lp)
{
const int INDENT = WORD_WIDTH + WORD_WIDTH + 1;
const int NUMS_PER_LINE = 8;
int count;

for (count = 0; lp != NULL; lp = lp->next, ++count)
{
printf ("%*d", NUM_WIDTH, lp->lnum);
if ((count + 1) % NUMS_PER_LINE == 0 && lp->next != NULL)
printf ("\n%*c", INDENT, ' ');
}
}

Tree *
find_node (Tree * tp, const char *s)
{
/* Binary search for word */
if (strcmp (tp->word, s) > 0)
return find_node (tp->left, s);
else if (strcmp (tp->word, s) < 0)
return find_node (tp->right, s);
else
return tp;
}

/* new function */

Tree *
IsKeyword (Tree * kwp, const char *s) /* keywords not to be counted */
{
int comp, search = 1;

while ((kwp != NULL) && (bool) search)
{
if ((comp = strcmp (kwp->word, s)) == 0)
search = 0;
else if (comp > 0)
kwp = kwp->left;
else if (comp < 0)
kwp = kwp->right;
else
assert (0);
}

if (!(bool) search) /* found a keyword */
return kwp;
else
return NULL;

}

/* End of file xref.c */
 
H

Hans Lodder

Ben Bacarisse schreef:
It's hard to be motivated to do a review of code not posted here so
you may get only a few replies. I'll made two suggestions after
commenting that the code looks good overall.

(1) An unbalanced tree is a good first approximation, but your list of
common words (I'd not call them keywords -- too confusing) is sorted
so you get, in essence, a linked list search not a tree-based one!

(2) If you include \r in the excluded characters, your files will work
on non Windows systems. Alternatively, find a way to distribute a
plain text version of the common word list so that it has native
line endings wherever it is used.

<snip>

Thanks Ben! I will include the complete code. Here its is:

/********************************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file> */
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 input file. */
/* 02. By default xref neglects frequent English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe
xref.c */
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
/********************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stdbool.h>

#define WORD_WIDTH 15
#define NUM_WIDTH 5

#define LINE_BUF_SIZE 65536

#ifndef NDEBUG
#define NDEBUG
#undef NDEBUG
#endif

/* Node structure for each list of line numbers */
struct List
{
int lnum; /* line number of word */
struct List *next;
};
typedef struct List List;

/* Node structure for tree of distinct words */
struct Tree
{
char word[WORD_WIDTH];
int cnt_word; /* number of instances of word */
List *lstptr;
struct Tree *left, *right;
};
typedef struct Tree Tree;

/* Tree and list functions */
static Tree *addword (Tree *, const char *);
static List *addline (List *, int);
static Tree *find_node (Tree *, const char *);
static void print_header (void);
static void print_tree (Tree *);
static void print_list (List *);
static Tree *IsKeyword (Tree *, const char *);

static int ndistinct = 0;

int
main (int argc, char *argv[])
{
char linebuf[LINE_BUF_SIZE];
Tree *words = NULL;
Tree *keywords = NULL;
int lineno = 0;
FILE *fp;
char lang_excl_keywords[BUFSIZ] = "";
char *s;

/* command line processing */

while (--argc > 0 && (*++argv)[0] == '-')
{
for (s = argv[0] + 1; *s != '\0'; s++)
{
switch (*s)
{
case 'L':
--argc;
++argv;
strcpy (lang_excl_keywords, argv[0]);
break;
default:
fprintf (stderr, "xref: -I- version xref-0.99\n");
fprintf (stderr,
"xref: -I- create a document cross reference; optional:
exclude words\n");
fprintf (stderr,
"xref: -I- default: exclude common en-US words\n");
fprintf (stderr,
"xref: usage: xref -L <excluded words file> <stdin>
<stdout>\n");
fprintf (stderr, "xref: valid option is '-L'\n");
fprintf (stderr,
" L : letter code for the excluded keywords [C, de,
en, nl]\n");
fprintf (stderr, "\n");
fprintf (stderr,
"xref: mail requests and bugs to hans.lodder at
requirements-management.nl\n");
break;
}
}
}

/* Do optional input redirection */

if (argc == 1)
{
assert (strlen (argv[0]) != 0);
if (freopen (argv[0], "r", stdin) == NULL)
{
fprintf (stderr, "cannot open file '%s'", argv[0]);
exit (EXIT_FAILURE);
}
}

/* read in the keywords, excepted from the cross reference */
if (strlen (lang_excl_keywords) == 0)
strcat (lang_excl_keywords, "en.xrf");
else
strcat (lang_excl_keywords, ".xrf");

if ((fp = fopen (lang_excl_keywords, "r")) == NULL)
{
fprintf (stderr, "cannot open file '%s'", lang_excl_keywords);
exit (EXIT_FAILURE);
}

/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, fp) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";

/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';

/* assert (keywords != NULL); */
keywords = addword (keywords, wordptr);

lineptr = NULL;
}
}

/* terminate keyword processing */
if (fp != NULL)
(void) fclose (fp);

/* initialize the processing of words */

ndistinct = 0; /* re-initialize counter */
/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, stdin) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";

assert (lineptr != NULL);
++lineno;

/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
Tree *node;

assert (wordptr != NULL);
if (IsKeyword (keywords, wordptr) == NULL)
{
#ifdef NDEBUG
fprintf (stderr, "File: '%s', line '%d': wordptr= '%s'\n",
__FILE__, __LINE__, wordptr);
#endif
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';

assert (wordptr != NULL);
words = addword (words, wordptr);
node = find_node (words, wordptr);
(node->cnt_word)++; /* increment the number of instances of word */

node->lstptr = addline (node->lstptr, lineno);
}

lineptr = NULL;
}
}

/* Print results */

if (ndistinct > 0)
printf ("No. of distinct words: %d.\n\n", ndistinct);

if (words != NULL)
print_header ();
print_tree (words);

/* gcc guarantees that all space is returned to the OS, so it is
unnecessary to free allocated memory and file pointers. */

return 0;
}

Tree *
addword (Tree * tp, const char *word)
{
if (tp == NULL)
{
/* Add new entry */
assert (tp == NULL);

++ndistinct;
if ((tp = malloc (sizeof (Tree))) == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (tp != NULL);
strncpy (tp->word, word, WORD_WIDTH);
tp->cnt_word = 0;

tp->left = tp->right = NULL;
tp->lstptr = NULL;
}
else if (strcmp (tp->word, word) < 0)
tp->right = addword (tp->right, word);
else if (strcmp (tp->word, word) > 0)
tp->left = addword (tp->left, word);

return tp;
}

List *
addline (List * lp, int n)
{
if (lp == NULL)
{
/* Append new line number to list */
List *newp = malloc (sizeof (List));

if (newp == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (newp != NULL);
newp->lnum = n;
newp->next = NULL;
return newp;
}
else if (lp->lnum != n)
lp->next = addline (lp->next, n);

return lp;
}

void
print_header (void) /* print column headers */
{
const char *txt_keyword = "Keyword";
const char *txt_count = "Count";
const char *txt_perc = "( % )";
const char *txt_lineno = "Line number(s)";
const char *txt_hyphen = "-----------";

printf ("%-*.*s %-*s %-*s %-s\n", WORD_WIDTH, WORD_WIDTH, txt_keyword,
NUM_WIDTH, txt_count, NUM_WIDTH + NUM_WIDTH - 1, txt_perc,
txt_lineno);
printf ("%-*.*s %-*.5s %-*.8s %-s\n", WORD_WIDTH, WORD_WIDTH - 8,
txt_hyphen, NUM_WIDTH, txt_hyphen, NUM_WIDTH + NUM_WIDTH - 1,
txt_hyphen, txt_hyphen);
}

void
print_tree (Tree * tp)
{
if (tp != NULL)
{
/* In order traversal */
print_tree (tp->left);
printf ("%-*.*s: %*d (%5.1f%%)", WORD_WIDTH, WORD_WIDTH, tp->word,
NUM_WIDTH, tp->cnt_word,
(float) (tp->cnt_word * 100.0 / ndistinct));
print_list (tp->lstptr);
(void) putchar ('\n');
print_tree (tp->right);
}
}

void
print_list (List * lp)
{
const int INDENT = WORD_WIDTH + WORD_WIDTH + 1;
const int NUMS_PER_LINE = 8;
int count;

for (count = 0; lp != NULL; lp = lp->next, ++count)
{
printf ("%*d", NUM_WIDTH, lp->lnum);
if ((count + 1) % NUMS_PER_LINE == 0 && lp->next != NULL)
printf ("\n%*c", INDENT, ' ');
}
}

Tree *
find_node (Tree * tp, const char *s)
{
/* Binary search for word */
if (strcmp (tp->word, s) > 0)
return find_node (tp->left, s);
else if (strcmp (tp->word, s) < 0)
return find_node (tp->right, s);
else
return tp;
}

/* new function */

Tree *
IsKeyword (Tree * kwp, const char *s) /* keywords not to be counted */
{
int comp, search = 1;

while ((kwp != NULL) && (bool) search)
{
if ((comp = strcmp (kwp->word, s)) == 0)
search = 0;
else if (comp > 0)
kwp = kwp->left;
else if (comp < 0)
kwp = kwp->right;
else
assert (0);
}

if (!(bool) search) /* found a keyword */
return kwp;
else
return NULL;

}

/* End of file xref.c */
 
B

Barry Schwarz

On Tue, 16 Feb 2010 08:36:13 +0100, Hans Lodder

snip
Thanks Ben! I will include the complete code. Here its is:

snip 370+ lines

Changing your name and reposting almost 400 lines after five days
doesn't win many friends.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,235
Members
46,821
Latest member
AleidaSchi

Latest Threads

Top