Siilihai Help

Parser technology

Parsers are based on paths and patterns. Paths are used to generate the URLs from which data is downloaded, and patterns extract the actual data from the HTML source code of those pages. Paths are defined in Parser Maker's Basics tab. Patterns are defined in the Group List, Thread List and Message List tabs.

Forums in Siilihai contain Groups, Threads, and Messages, all of which must have unique identifiers.

Most fields in a parser definition are optional, although some are always needed to make the parser usable. Required fields are marked in bold. Writing a single parser takes about 10-20 minutes, depending on which features are supported.

Data extraction patterns

These patterns extract the actual forum data from the HTML pages that the forum serves. Writing them takes a little practice but is fairly straightforward: it is much like writing regular expressions, only simpler.

A pattern contains literal HTML code to be searched for in the page source, together with tags that match values within that HTML. Tags can be, for example, %a, %b and %c. Each tag is mapped to a piece of information about the item being searched for, such as an ID number, text or a date.

Ignoring input

Use %i to ignore a section of text. %i can be used in all patterns.

Integer values

To force a value to be an integer (number), use an uppercase tag name. For example, if %a matches the group ID and you know that group IDs are always numbers, you can use %A instead.
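
As a rough mental model (an assumption for illustration, not a description of how Siilihai's matcher is actually implemented), a lowercase tag can be thought of as capturing any text and an uppercase tag as capturing digits only. The Python sketch below hand-translates a hypothetical group list pattern, <a href="forum.php?id=%A">%b</a>, into a regular expression. The URL scheme and the HTML fragment are made up.

```python
import re

# Hypothetical HTML fragment from a group list page (not from a real forum).
html = (
    '<a href="forum.php?id=42">General</a>'
    '<a href="forum.php?id=new">Unread</a>'
)

# Hand-written regex equivalent of the hypothetical pattern
#   <a href="forum.php?id=%A">%b</a>
# Assumption: %A accepts digits only, %b accepts any text.
regex = re.compile(
    r'<a href="forum\.php\?id=(?P<A>\d+)">(?P<b>.*?)</a>',
    re.DOTALL,
)

# The second link is skipped because "new" is not a number.
groups = [(int(m.group("A")), m.group("b")) for m in regex.finditer(html)]
print(groups)
```

Under this model, using %A instead of %a would also keep non-numeric links, such as the "Unread" link here, from being mistaken for groups.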

Last change

All patterns contain an optional last change tag. It is used to decide whether the content of a group or thread has changed since the last download. If last change is not used, the Siilihai client has to re-download all data in the forum every time the user wants to update the messages. The actual content of last change does not affect this decision; the client is only interested in whether the value has changed.
It is highly recommended to include last change in all patterns!
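
The decision described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the client's real code; the function name needs_update and the use of None for a missing value are assumptions.

```python
def needs_update(stored_last_change, fetched_last_change):
    """Return True when a group or thread should be downloaded again."""
    # Without a last change value there is nothing to compare,
    # so the content has to be downloaded every time.
    if stored_last_change is None or fetched_last_change is None:
        return True
    # Only equality matters: the value is never interpreted as a
    # date or a number, it is just compared with the previous one.
    return stored_last_change != fetched_last_change


# Same value as last time: nothing to download.
print(needs_update("12.06.2009 at 12.43", "12.06.2009 at 12.43"))
# Value changed: download again.
print(needs_update("12.06.2009 at 12.43", "13.06.2009 at 08.15"))
# Pattern has no last change tag: always download.
print(needs_update(None, None))
```

Because only equality is checked, anything that changes whenever the content changes, such as a post count or the date of the latest message, could serve as the last change value.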

Example Pattern
This is a fictional section of a web page:
<table>
<tr foo="123" bar="xyz">
<td>Message author: James Bond, date: 12.06.2009 at 12.43</td>
<td>Message:</td>
Hello, just testing!
</table>
To extract author (%a), date (%b) and message body (%c) from the HTML code, you could use the following pattern:
<table>%iMessage author: %a, date: %b</td>%i<td>Message:</td>%c</table>
Notice the use of %i (ignore): the values of foo and bar may change, so they can't be written as part of the pattern. For more example patterns, download a working parser with Parser Maker and see how it works.
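
To see how the example pattern lines up with the HTML, the matching can be modeled in Python by translating the pattern into a regular expression. This is a minimal sketch under stated assumptions, not Siilihai's actual matcher: the helper pattern_to_regex is hypothetical, %i is treated as "skip any text", lowercase tags as "capture any text" and uppercase tags as "capture digits". It also assumes each tag appears only once in a pattern.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Siilihai-style pattern into a compiled regular expression."""
    parts = []
    # Split on %x tags while keeping the tags themselves in the result.
    for piece in re.split(r"(%[A-Za-z])", pattern):
        if re.fullmatch(r"%[A-Za-z]", piece):
            name = piece[1]
            if name == "i":
                parts.append(r".*?")  # %i: skip text without capturing it
            elif name.isupper():
                parts.append(rf"(?P<{name}>\d+)")  # uppercase tag: digits only
            else:
                parts.append(rf"(?P<{name}>.*?)")  # lowercase tag: any text
        else:
            parts.append(re.escape(piece))  # literal HTML must match exactly
    return re.compile("".join(parts), re.DOTALL)


html = """<table>
<tr foo="123" bar="xyz">
<td>Message author: James Bond, date: 12.06.2009 at 12.43</td>
<td>Message:</td>
Hello, just testing!
</table>"""

pattern = "<table>%iMessage author: %a, date: %b</td>%i<td>Message:</td>%c</table>"

match = pattern_to_regex(pattern).search(html)
# Strip surrounding whitespace, since %c also captures the line breaks.
fields = {name: value.strip() for name, value in match.groupdict().items()}
print(fields)
```

In this model %a captures "James Bond", %b captures "12.06.2009 at 12.43" and %c captures "Hello, just testing!", while the two %i tags absorb the <tr> attributes and the line break between the cells.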

Parser limitations

Sometimes you may face a forum that is impossible to parse. An example is ubuntuforums.org, which uses JavaScript to generate group lists dynamically. If you have ideas on how to support forums that currently cannot be parsed, don't hesitate to contact the Siilihai developers.


Contact Siilihai.com at siilihai@siilihai.com