Data Mining the MIRC Index

From MircWiki
Revision as of 15:12, 13 August 2006 by Johnperry (talk | contribs)

This article describes several ways to mine the data in the MIRC index. To perform the procedures listed here, you must be an administrator of the storage service whose index is to be accessed.

In releases T10 through T29, the index of a storage service is a memory-resident XML DOM object that is constructed when Tomcat starts and maintained dynamically while MIRC is running. To save this object to the root directory of the storage service, click the Save Index button in the Storage Service column on the storage service's admin page. The saved file is named saved-index-file.xml.

The saved index file can be accessed from a browser through the URL:

  • [siteurl]/[storage service name]/saved-index-file.xml

To get the entire contents of the file, you must select the browser's View Source menu item. This will launch the configured text editor and display all the text of the object. At that point, you can select the text editor's Save as... menu item to save it to the local disk.

Mild caution: The index contains information that is preprocessed to make index searches reasonably efficient, but because it is intended for programmatic access, it may be difficult to read.

There are several options for processing the file:

  • You can use a text editor to reduce the file to just the elements of interest. In a large index, this is likely to be tedious.
  • Since the index file is a well-formed XML string, you can insert it into an XML database and access it using the database's methods.
  • You can write a program in Java or some other language that accesses the data programmatically using the XML DOM.
  • You can create an XSL program to pre-filter the index on the MIRC site as described below.
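As an illustration of the programmatic option, here is a minimal Python sketch that lists the MIRCdocuments containing non-empty code elements. It uses the standard library's ElementTree API rather than a full DOM, and it assumes the index layout implied by the XSL program later in this article: a MIRCdocs root, MIRCdocument children carrying a filename attribute, a title child, and code descendants. The SAMPLE_INDEX string and the function name are hypothetical stand-ins, not part of MIRC.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for saved-index-file.xml; a real index holds far more data.
SAMPLE_INDEX = """<MIRCdocs>
  <MIRCdocument filename="documents/a/MIRCdocument.xml">
    <title>First</title>
    <section><code>abc</code><code>   </code></section>
  </MIRCdocument>
  <MIRCdocument filename="documents/b/MIRCdocument.xml">
    <title>Second</title>
    <code> </code>
  </MIRCdocument>
</MIRCdocs>"""


def documents_with_codes(root):
    """Return one dict per MIRCdocument that has at least one non-whitespace code."""
    results = []
    for doc in root.findall("MIRCdocument"):
        # Collapse runs of whitespace in each code element, at any depth.
        codes = [" ".join("".join(c.itertext()).split()) for c in doc.iter("code")]
        codes = [c for c in codes if c]  # drop codes that were only whitespace
        if codes:
            results.append({
                "filename": doc.get("filename", ""),
                "title": doc.findtext("title", default="").strip(),
                "codes": codes,
            })
    return results


# For a saved copy on disk, use ET.parse("saved-index-file.xml").getroot() instead.
for entry in documents_with_codes(ET.fromstring(SAMPLE_INDEX)):
    print(entry["filename"], entry["title"], entry["codes"])
```

The same loop can be adapted to other elements of interest by changing the tag name passed to iter().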

Using the XSL Capability of the XML Server

The XML Server in the MIRC software allows authorized users to select any pre-stored XSL program to apply to an XML file when it is accessed. By default, the XSL program used to process an XML file is named for the file's root element. For example, because the root element of every MIRCdocument is MIRCdocument, the default program used to process MIRCdocuments for viewing is MIRCdocument.xsl. The XML Server looks for the program file first in the directory containing the XML file and, if it is not found there, in the root directory of the storage service. Authorized users can select a different XSL program by including the xsl parameter in the query string, as these three examples show:

  • filename.xml is processed with the default XSL program for the XML file.
  • filename.xml?xsl=xyz.xsl is processed with the xyz.xsl program.
  • filename.xml?xsl= returns the original, unprocessed XML text.
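The three forms above can be generated with a small helper. The following Python sketch only builds the URLs; it does not log in or fetch anything, since the authentication mechanism is not described here. The host example.org and the storage service name "mirc" are hypothetical placeholders, and the function name is an illustration rather than part of MIRC.

```python
from urllib.parse import quote


def xml_url(site_url, service, filename, xsl=None):
    """Build the URL of an XML file on a storage service.

    xsl=None       -> no query string; the server applies the default XSL program
    xsl="xyz.xsl"  -> the server applies the named XSL program
    xsl=""         -> "?xsl=" with no value; the server returns the unprocessed XML
    """
    url = f"{site_url.rstrip('/')}/{quote(service)}/{quote(filename)}"
    if xsl is not None:
        url += "?xsl=" + quote(xsl)
    return url


base = "http://example.org"  # hypothetical site URL
print(xml_url(base, "mirc", "saved-index-file.xml"))
print(xml_url(base, "mirc", "saved-index-file.xml", xsl="xyz.xsl"))
print(xml_url(base, "mirc", "saved-index-file.xml", xsl=""))
```

Distinguishing None from the empty string keeps the "default program" and "raw XML" cases separate, which mirrors the difference between omitting the xsl parameter and supplying it with no value.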

You can take advantage of this capability to preprocess saved-index-file.xml so that it returns only the data of interest, formatted in a more readable way. For example, suppose you want a list of all the MIRCdocuments that contain code elements with non-whitespace content, along with their titles, file paths, and the codes themselves. The following XSL program produces it:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.1">

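<!-- Root template: select only the MIRCdocuments that contain at least one non-empty code element. -->
<xsl:template match="/MIRCdocs">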
<xsl:text>
</xsl:text><MIRCdocs><xsl:text>

</xsl:text>
	<xsl:apply-templates select="MIRCdocument[.//code[string-length(normalize-space(.))!=0]]"/>
</MIRCdocs>
</xsl:template>

<!-- Per-document template: output the file path, the title, and each non-empty code. -->
<xsl:template match="MIRCdocument">
<MIRCdocument><xsl:text>
	</xsl:text><filename><xsl:value-of select="@filename"/></filename><xsl:text>
	</xsl:text><title><xsl:value-of select="title"/></title>

	<xsl:for-each select=".//code">
		<xsl:if test="string-length(normalize-space(.))!=0">
			<xsl:text>
	</xsl:text>
			<code><xsl:value-of select="normalize-space(.)"/></code>
		</xsl:if>
	</xsl:for-each><xsl:text>
</xsl:text>
</MIRCdocument><xsl:text>

</xsl:text>
</xsl:template>
</xsl:stylesheet>

Because the root element of the index is MIRCdocs, saving this program with the name MIRCdocs.xsl in the root directory of a storage service makes it the default XSL program for the index. If saved-index-file.xml is then accessed without any query string, the result will look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<MIRCdocs>

<MIRCdocument>
	<filename>documents/20060526133625296/SPR2006.ppt.xml</filename>
	<title>SPR 2006</title>
	<code>abc</code>
</MIRCdocument>

<MIRCdocument>
	<filename>documents/20060713134420906/MIRCdocument.xml</filename>
	<title>Title</title>
	<code>xyz</code>
</MIRCdocument>

</MIRCdocs>

Remember that when the browser displays the result, it shows only the text values of the elements, not the tags. To capture the complete output, view the source of the page.
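Once the filtered output has been saved to disk, it can be tabulated. As a minimal sketch, the Python code below turns output shaped like the sample above into one CSV row per code. The FILTERED string is a copy of that sample standing in for a saved file, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Copy of the sample output above, standing in for a file saved from the browser.
FILTERED = """<?xml version="1.0" encoding="UTF-8"?>
<MIRCdocs>
<MIRCdocument>
	<filename>documents/20060526133625296/SPR2006.ppt.xml</filename>
	<title>SPR 2006</title>
	<code>abc</code>
</MIRCdocument>
<MIRCdocument>
	<filename>documents/20060713134420906/MIRCdocument.xml</filename>
	<title>Title</title>
	<code>xyz</code>
</MIRCdocument>
</MIRCdocs>"""


def code_rows(xml_bytes):
    """Return (filename, title, code) tuples, one per code element."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for doc in root.findall("MIRCdocument"):
        filename = doc.findtext("filename", default="")
        title = doc.findtext("title", default="")
        for code in doc.findall("code"):
            rows.append((filename, title, code.text or ""))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["filename", "title", "code"])
    writer.writerows(rows)
    return buf.getvalue()


# Parsing bytes lets the parser honor the encoding named in the XML declaration.
print(to_csv(code_rows(FILTERED.encode("utf-8"))))
```

A document with several codes yields several rows that share the same filename and title, which keeps the table flat for use in a spreadsheet.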

You can create XSL programs for various purposes and access them through the xsl parameter in the query string. Note that when processing saved-index-file.xml, which is stored in the root of the storage service, the XSL program must be located there as well.
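The lookup order described in this section can be summarized in a short Python sketch. It models the rules as stated in this article and is not the XML Server's actual code. In particular, it assumes that a program named through the xsl parameter is searched for in the same two places as the default program. The paths and the function name are hypothetical.

```python
from pathlib import PurePosixPath


def xsl_candidates(xml_path, service_root, root_element, xsl=None):
    """List the places an XSL program would be looked for, in order.

    xsl=None uses the default name (root element + ".xsl"); xsl="" means the
    XML is returned unprocessed, so there is nothing to look up.
    """
    if xsl == "":
        return []
    name = xsl if xsl else root_element + ".xsl"
    xml_dir = PurePosixPath(xml_path).parent
    root = PurePosixPath(service_root)
    candidates = [xml_dir / name]      # first: the directory holding the XML file
    if root != xml_dir:
        candidates.append(root / name)  # then: the root of the storage service
    return [str(p) for p in candidates]


# A MIRCdocument in its own directory has two candidate locations.
print(xsl_candidates("/mirc/documents/123/MIRCdocument.xml", "/mirc", "MIRCdocument"))
# The saved index lives in the service root, so only one location applies.
print(xsl_candidates("/mirc/saved-index-file.xml", "/mirc", "MIRCdocs"))
```

For saved-index-file.xml the two locations coincide, which is why an XSL program for the index has to be stored in the root of the storage service.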