MOD_HILITEM DESCRIPTION
=======================

An Apache2-only DSO module to colorize/hilite terms within HTML version 2.0
and above documents, using the libxml2 HTML parser module.

MOD_HILITEM REQUIREMENTS
========================

Apache2 with DSO support (mod_so).
libxml2 2.6.8 (the only version tested to date).

MOD_HILITEM DEVELOPMENT STATE
=============================

This module is currently in alpha state.

MOD_HILITEM OPERATION OVERVIEW
==============================

Mod_hilitem, when configured to do so, intercepts a request for an HTML
document. If the request has a correctly formatted QUERY_STRING, mod_hilitem
colorizes/hilites the indicated terms within the document and returns the
modified HTML document instead of the original HTML document.

MOD_HILITEM CONFIGURATION
=========================

The Apache2 AddHandler configuration directive may be used to associate
html files with the mod-hilitem handler.

    AddHandler mod-hilitem .html

OR

    AddHandler mod-hilitem .html

MOD_HILITEM QUERY_STRING
========================

Subject to change:

The QUERY_STRING portion of the URL contains an '&' delimited list of
search term regular expressions (REs). Matching characters are
colorized/hilited.

Upon reception, any '+' characters are changed to ' ' characters, and any
escaped codes are unescaped. Empty and single character terms are ignored.
The first RE is assigned color0, the second RE is assigned color1, etc.

Example:

    http://www.mydomain.com/colorize/this/file.html?RE1&RE2&RE3

Example Result:

    The colorized/hilited version of file.html will be returned. All
    matches for RE1, RE2, and RE3 will be colorized/hilited, using color0,
    color1, and color2 respectively.

The RE is in the form:

    (^|[[:space:]])([^[:space:]]*TERM[^[:space:]]*)([[:space:]]|$)

where TERM is the actual single swish word/term being matched. The RE must
contain exactly three subexpressions. The second (middle) subexpression is
used as the characters to hilite.
In almost all cases, most of the RE will be encoded in hex triples (%XX).
Search phrases are handled in a similar manner, within one RE.

Future considerations are for only the terms/phrases to be listed in the
URL, with mod_hilitem constructing the actual RE. This will reduce the
length of the URL being passed to mod_hilitem.

MOD_HILITEM HTML DOCUMENT ADJUSTMENTS
=====================================

Documents using a <!DOCTYPE> statement indicating HTML version 3.2 and
above use CSS1 and HTML tags to implement colorization. Mod_hilitem adds a
<style> element for each hilite term, each distinguished by a unique
positive integer N. These elements are added within, and as the first
elements of, the required <head> element. By adding them at the start of
the <head> section, subsequent user supplied CSS1 sheets may be used to
customize the hilite data on a per document basis. When a term is found, a
<span> element is used to hilite the matching portion of the text.

Documents missing a <!DOCTYPE> statement or indicating an HTML version
below 3.2 use the <font> tag to implement colorization. No per-document
overrides are currently possible.

Terms are hilited within specific HTML tags. For now, the
hilitem_colorable_element function is used to determine which tags will be
checked for matching terms to hilite.

The color and background schemes for colorizing/hiliting matched text may
be altered from within mod_hilitem.c. See the const global hilitem_clist
and hilitem_blist string lists for more information.

Technically, the <span> tag was not added until HTML 4.0, and HTML 3.2
browsers are not required to support it. Hopefully all current browsers
support the <span> tag and this will not be an issue.

MOD_HILITEM PCRE USAGE
======================

The Apache2 distribution includes a Perl compatible regular expression
library (srclib/pcre), accessible via the Apache2 interface functions.
Mod_hilitem uses pcre to match the search terms given on the command/query
line.
Specifically, each given term is used to create a regular expression:

    "(term1|term2|termN)"

As a general rule, longer hilite terms with multiple words, and additional
terms, increase processing time and requirements exponentially.

MOD_HILITEM LIMITATIONS
=======================

Currently mod_hilitem has numerous limitations and shortcomings. Hopefully,
as time progresses, a search expert will volunteer to rewrite the search
match routines.

The search functions do not span tags.
    A term broken up by markup, for example "Any" with a tag between its
    letters, will not match "Any".

The search functions do not distinguish case.
    ANY will match any.

Only the HTTP/1.1 GET method is supported.

The module "works" best, at this time anyway, on simple HTML pages using
the default white background with default black text.

The module "works" best with CSS1 HTML 3.2+ style sheets.

A temporary file must be created for each request. See the HILITEM_TEMPDIR
and HILITEM_TFNAME compile time directives for more information.

MOD_HILITEM FUTURE
==================

Adjust the module to be an output filter, instead of a handler.

Use hashing in the hilitem_colorable_element function.

CREDITS
=======

Byron Young 2005
Part of the sourceforge.net searchm project.