Resources used in Files Example
2009/08/22
One other note about the file example used earlier. As the number of asserted individuals rises, the time and memory Pellet needs for classification and rule processing increase, as can be expected. Given the current case, with 4 SWRL rules and reading the individuals from an RDF/XML file (on a 1 GB laptop), the numbers look like this:
| # of Individuals | Inference Time (sec) | Memory (B) |
| 8 | 0.375 | 5,177,344 |
| 100 | 2.7655 | 42,745,856 |
| 230 | 12.406 | 187,015,168 |
| 500 | 61.015 | 780,402,688 |
| 1000 | N/A | Out of Memory |
Since it is difficult in this example to know how many files will be imported in a batch, scalability becomes an important issue. If the number of files in a batch (for whatever reason) just happens to get near 1000, the inference will fail.
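To get a rough feel for how quickly the cost grows, the table can be turned into growth exponents between neighbouring measurements. This is only a back-of-the-envelope sketch: it assumes the first memory value is meant to be 5,177,344 bytes, it treats each interval as a simple power law, and the helper name `growth_exponent` is made up for the illustration.

```python
import math

# (individuals, inference seconds, memory bytes) taken from the table.
# Assumption: the first memory value is read as 5,177,344 bytes.
measurements = [
    (8, 0.375, 5_177_344),
    (100, 2.7655, 42_745_856),
    (230, 12.406, 187_015_168),
    (500, 61.015, 780_402_688),
]

def growth_exponent(x0, y0, x1, y1):
    """Exponent k such that y scales like x**k between two measurements."""
    return math.log(y1 / y0) / math.log(x1 / x0)

# Compare each measurement with the next one.
for (n0, t0, m0), (n1, t1, m1) in zip(measurements, measurements[1:]):
    time_k = growth_exponent(n0, t0, n1, t1)
    mem_k = growth_exponent(n0, m0, n1, m1)
    print(f"{n0:>4} -> {n1:<4} time exponent {time_k:.2f}, memory exponent {mem_k:.2f}")

# Naive extrapolation of memory to 1000 individuals, reusing the exponent
# of the last interval. This is a guess, not a measurement.
(n0, _, m0), (n1, _, m1) = measurements[-2:]
k = growth_exponent(n0, m0, n1, m1)
estimate = m1 * (1000 / n1) ** k
print(f"estimated memory at 1000 individuals: {estimate / 2**30:.1f} GiB")
```

Under these assumptions the exponents for the larger batches come out well above 1, so both time and memory grow faster than linearly, and the extrapolated memory for 1000 individuals is more than the laptop has. That is consistent with the out-of-memory result, but it is an estimate only.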
Some common suggestions:
- Increase Memory – “Memory is cheap” is a very common response when this issue is raised. It can be a fast fix in a pinch. However, most industry people (system architects, for instance) will shoot this down immediately for a number of reasons.
  - No matter how much memory you throw at a solution, if the input is unbounded, eventually there is a risk that the new limit will be reached (unexpectedly). Normally this will happen during a demonstration to upper management …
  - “Real” applications (enterprise, commercial) have to share resources in an infrastructure and are expected to behave nicely. Memory is frequently shared with other virtual machines (VMs), in the same way that disk space is shared on a SAN, and processor speed is throttled per server. Even if an application has “full access” to a server of its own, there may be new limits on what it can use once it is deployed to production.
  - While this application is mostly dealing with single-thread batch processing, most rule applications in an infrastructure are dealing with any number of concurrent threads. If all of those threads have unbounded memory, no amount of memory would be safe.
- Tune the Engine – In any rule (or knowledge) base, there are features of the engine that can be turned off to conserve resources. (Try the information in the Pellet FAQ for instance.) Optimization is worthwhile in any application, especially when the gains are large. However, no matter how much you tune the engine, if the number of instances coming into the application is unbounded, eventually a spike in the number of input instances will hit the magic limit.
- Process a Fixed Number of Files – Typically, a rules application will look at a single case at a time and process the results.
It really depends on the application, of course. Research applications (and heavy-AI applications in general) are frequently given more resources than typical enterprise applications. Tuning and a set limit on input instances are usually possible.
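A set limit on input instances can be as simple as cutting the incoming stream into batches of bounded size before anything reaches the reasoner. The sketch below is a generic illustration, not code from the file example; the function name `fixed_batches` and the file names are invented.

```python
from itertools import islice

def fixed_batches(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # Pull the next batch_size items; the last batch may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Example: 7 hypothetical file names, processed at most 3 at a time.
for batch in fixed_batches([f"file{i}.txt" for i in range(7)], 3):
    print(len(batch), batch)
```

With a batch size of 1 this reduces to the "single case at a time" style that rules applications typically use.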
In this case, the third option is the one chosen. One way to do this is to merge the file scanner part of the application into the classification step, load a copy of the file ontology (OWL and SWRL rules) as a base ontology, and for each file or directory found:
- Assert the file information (name and so on).
- Run classification.
- Extract the results and act on them.
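The three steps above can be sketched roughly as follows. This is a shape-of-the-loop illustration only: `classify` is a hypothetical stand-in for the reasoner call (it is not Pellet's API), and the `base_facts` dictionary, the `large_file_bytes` threshold and the labels are all invented for the example.

```python
import os
import tempfile

def classify(base_facts, file_facts):
    """Hypothetical stand-in for the reasoner call (not Pellet's API).

    Combines the shared base facts with the facts for one entry and
    returns a set of inferred labels.
    """
    if file_facts["is_dir"]:
        return {"Directory"}
    if file_facts["size"] > base_facts["large_file_bytes"]:
        return {"LargeFile"}
    return {"File"}

def process_tree(root, base_facts, act):
    """Walk root; assert, classify and act on one entry at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # 1. Assert the information for this entry only.
            file_facts = {
                "name": name,
                "is_dir": os.path.isdir(path),
                "size": os.path.getsize(path),
            }
            # 2. Run classification on the base facts plus this entry.
            labels = classify(base_facts, file_facts)
            # 3. Extract the results and act on them. file_facts is then
            #    dropped, so memory does not grow with the number of files.
            act(path, labels)

# Usage on a throwaway directory with one subdirectory and one file.
results = {}
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "a.txt"), "w") as handle:
        handle.write("x" * 10)
    process_tree(
        root,
        {"large_file_bytes": 5},
        lambda path, labels: results.update({os.path.basename(path): labels}),
    )
print(results)
```

The point of the design is that the reasoner only ever sees the base ontology plus one file's assertions, so the cost per step stays roughly constant however many files the scanner finds.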
This kind of issue pops up frequently, so we will be dealing with it again.