Resources used in Files Example

One other note about the file example used earlier.  As the number of asserted individuals rises, the time and memory used by Pellet for classification and rule processing increase, as can be expected.  Given the current case, with four SWRL rules and the individuals read from an RDF/XML file (on a 1 GB laptop), the numbers look like this:

# of Individuals    Inference Time (sec)    Memory (bytes)
8                   0.375                   5,177,344
100                 2.7655                  42,745,856
230                 12.406                  187,015,168
500                 61.015                  780,402,688
1000                N/A                     Out of Memory

Since it is difficult to know in advance how many files will be imported in a batch in this example, scalability becomes an important issue.  If the number of files in a batch (for whatever reason) just happens to get near 1000, the inference will fail.
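To make the growth concrete, here is a rough Python sketch (not part of the original application) that fits a power law to the three largest successful runs in the table and projects memory use at 1000 individuals. The power-law shape is an assumption; three data points can suggest it but not confirm it.

```python
import math

# (individuals, memory in bytes) for the three largest successful runs above.
RUNS = [(100, 42_745_856), (230, 187_015_168), (500, 780_402_688)]


def fit_power_law(points):
    """Least-squares fit of log(y) = log(a) + k * log(x); returns (a, k)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    k = sxy / sxx
    a = math.exp(mean_y - k * mean_x)
    return a, k


def projected_memory(individuals, points=RUNS):
    """Extrapolate the fitted curve to a given number of individuals."""
    a, k = fit_power_law(points)
    return a * individuals ** k


a, k = fit_power_law(RUNS)
print(f"growth exponent: {k:.2f}")
print(f"projected bytes at 1000 individuals: {projected_memory(1000):,.0f}")
```

On these numbers the exponent comes out a little under 2, so doubling the batch more than triples the memory, and the projection for 1000 individuals is well beyond what a 1 GB laptop can offer.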

Some common suggestions:

  • Increase Memory – “Memory is cheap” is a very common response when this issue is raised.  It can be a fast fix in a pinch. However, most industry people (system architects, for instance) will shoot this down immediately for a number of reasons.
    • No matter how much memory you throw at a solution, if the input is unbounded, eventually there is a risk that the new limit will be reached (unexpectedly).  Normally this will happen during a demonstration to upper management …
    • “Real” applications (enterprise, commercial) have to share resources in an infrastructure and are expected to behave nicely. Resources like memory are frequently shared with other virtual servers (VMs) in the same way as disk space is on a SAN, and processor speed is throttled by server.  Even if an application has “full access” to a server of its own, when it is deployed to production, there may be new limits on what it can use.
    • While this application is mostly dealing with single-thread batch processing, most rule applications in an infrastructure are dealing with any number of concurrent threads.  If all of those threads have unbounded memory, no amount of memory would be safe.
  • Tune the Engine – In any rule (or knowledge) base, there are features of the engine that can be turned off to conserve resources.  (Try the information in the Pellet FAQ for instance.) Optimization is good in any application, especially if the gains are good. However, no matter how much you tune the engine, if the number of instances coming into the application is unbounded, eventually a spike in the number of input instances will hit the magic limit.
  • Process a Fixed Number of Files – Typically, a rules application will look at a single case at a time and process the results.

It really depends on the application, of course. Research applications (and heavy-AI applications in general) are frequently given more resources than typical enterprise applications.  Tuning and a set limit on input instances are usually possible.
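A set limit on input instances can be as simple as splitting the incoming batch into fixed-size chunks before anything reaches the reasoner. A minimal Python sketch; the function name and the chunk size are illustrative, not taken from the application.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Example: a batch of 7 (made-up) file names processed 3 at a time.
for chunk in chunked([f"file{i}.txt" for i in range(7)], 3):
    print(chunk)
```

Each chunk can then be handed to the reasoner separately, so memory use depends on the chunk size rather than on the size of the whole batch.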

In this case, the third option is the one chosen.  One way to do this is to merge the file scanner part of the application into the classification step, load a copy of the file ontology (OWL and SWRL rules) as a base ontology, and for each file or directory found:

  1. Assert the file information (name and so on).
  2. Run classification.
  3. Extract the results and act on them.
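The steps above can be sketched as a simple loop. This is only an outline in Python under stated assumptions: `classify` and `act` are hypothetical callables supplied by the caller. `classify` stands in for "load a fresh copy of the base ontology, assert one individual, run the reasoner", and no real Pellet API is shown here.

```python
import os


def file_facts(path):
    """Collect the facts that would be asserted for one file or directory."""
    return {
        "name": os.path.basename(path),
        "path": path,
        "is_directory": os.path.isdir(path),
        "extension": os.path.splitext(path)[1].lower(),
    }


def scan(root):
    """Yield every directory and file path under `root`, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def process(paths, classify, act):
    """Run assert -> classify -> act once per path; return how many were handled."""
    handled = 0
    for path in paths:
        facts = file_facts(path)      # 1. assert the file information
        inferred = classify(facts)    # 2. run classification (hypothetical hook)
        act(facts, inferred)          # 3. extract the results and act on them
        handled += 1
    return handled
```

A call such as `process(scan(some_directory), classify, act)` keeps only one file's facts in play at a time, so the size of the batch no longer determines how much the reasoner has to hold.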

This kind of issue pops up frequently, so we will be dealing with it again.
