Too Many Open Files

I ran into an interesting problem with a customer late last week. A monitor we maintain on the alert log suddenly paged us with ORA-1116, ORA-1110 and ORA-27041 (Can’t open a database file, too many open files). Nothing had changed on this system in months.

After running: lsof | awk '{print $9}' | sort | uniq -c | sort -n | tail -5 (as with all code blocks, your mileage may vary; this was in the bash shell on Red Hat Linux), the problem turned out to be with $ORACLE_HOME/nls/data/9idata/lx40030.nlb. A little poking around on Metalink led us to bug 5257698. This is a Generic bug for 10.2. The patch replaces lx40030.nlb and lx40003.nlb in $ORACLE_HOME/nls/data/old. Rerunning appears to have fixed the open files issue.
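For anyone who wants to reuse that pipeline, here is a minimal sketch of it wrapped in a function. The function name top_open_files is my own, and it assumes the file name lands in the ninth whitespace-separated field of lsof's output, which is usually but not always the case (some rows have blank or extra columns). It reads lsof-style text on stdin, so it can be tried against saved output as well as a live system.

```shell
#!/bin/bash
# Sketch only: top_open_files is a hypothetical helper name.
# Counts how often each value in field 9 (usually lsof's NAME column)
# appears and prints the N most frequent, with the busiest one last.
top_open_files() {
    local n="${1:-5}"   # how many entries to show; defaults to 5
    awk '{print $9}' | sort | uniq -c | sort -n | tail -n "$n"
}

# Typical use (left commented out so the sketch has no side effects):
# lsof | top_open_files 5
```

A file that shows up thousands of times at the bottom of that list is the one to go looking for on Metalink.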

The worrisome part is that they had been live on 10.2 with E-Business Suite for months, and we could find no patches or other changes that would have triggered this. In spite of this, we had to reboot Thursday and were right on the edge of having to shut down production again on Friday. Luckily, we were able to apply the patch this weekend.

If you are using 10.2, especially with the E-Business Suite, please make sure that you have some form of simple open file monitor running (lsof | wc -l if nothing else).

We discovered that the number of open files was fluctuating somewhat, but it was basically increasing. (This was after a reboot; the problem all started when we got a page from the alert.log that some datafiles appeared to be missing.)
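Because the count bounces around, a single reading does not tell you much; the trend is what matters. Here is a rough sketch of one way to record it. The function names and the LOGFILE path are my own assumptions, not part of anything we ran at the customer. It appends a timestamped count to a log and reports the difference between the first and last readings.

```shell
#!/bin/bash
# Sketch only: LOGFILE, log_open_files and open_files_growth are
# hypothetical names chosen for illustration.
LOGFILE="${LOGFILE:-/tmp/open_files.log}"

# Reads lsof-style output on stdin and appends "date time count" to the log.
log_open_files() {
    local count
    count=$(wc -l | tr -d ' ')
    echo "$(date '+%Y-%m-%d %H:%M:%S') $count" >> "$LOGFILE"
}

# Prints the last logged count minus the first logged count.
open_files_growth() {
    awk 'NR==1 {first=$3} {last=$3} END {print last-first}' "$LOGFILE"
}

# Typical use from cron (left commented out so the sketch has no side effects):
# lsof | log_open_files
```

A growth figure that keeps climbing between reboots, even with some fluctuation, is the pattern we saw.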

Here is a simple script that can be used to do the monitoring (just put it in cron):

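#!/bin/bash
# The paths, threshold, subject and address below are placeholder assumptions; adjust them for your system.
LSOF=/usr/sbin/lsof
WC=/usr/bin/wc
MAIL=/bin/mail
THRESHOLD=10000
SUBJ="Open file alert:"
ALERT=dba@example.com

COUNT=$($LSOF | $WC -l)
echo "There are $COUNT open files"
if [ "$COUNT" -gt "$THRESHOLD" ]; then
   $MAIL -s "$SUBJ $COUNT open files" "$ALERT" <<EOF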

On $(date), the number of
open files on $(hostname)
was $COUNT.  This is
greater than the alerting threshold