What’s in an index?

Splunk Enterprise stores all of the data it processes in indexes. An index is a collection of databases, which are subdirectories located in $SPLUNK_HOME/var/lib/splunk. Indexes consist of two types of files: rawdata files and index files.

Default set of indexes

Splunk Enterprise comes with a number of preconfigured indexes, including:

  • main: This is the default Splunk Enterprise index.
  • _internal: Stores Splunk Enterprise internal logs and processing metrics.
  • _audit: Contains events related to the file system change monitor, auditing, and all user search history.

When you index a data source, Splunk assigns metadata values.

  • The metadata is applied to the entire source
  • Splunk applies defaults if values are not specified
  • You can override them on a per-event basis (during the parsing phase)
Metadata     Default
Source       Path of the input file, network hostname:port, or script name
Host         Splunk hostname of the inputting instance (forwarder)
Sourcetype   Uses the source filename if Splunk cannot automatically determine it
Index        Defaults to main
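
A minimal sketch of overriding these metadata defaults at the input level, in inputs.conf (the monitored path and values here are hypothetical; per-event overrides during parsing are configured separately in props.conf and transforms.conf):

    [monitor:///var/log/messages]
    # send this input's events to a non-default index
    index = oslog
    # explicit sourcetype and host instead of the defaults
    sourcetype = syslog
    host = myhost01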
  • Splunk stores events in indexes
  • Splunk users can specify which index to search

(index=main sourcetype=access_combined_wcookie action=purchase)

  • All new inputs to Splunk are stored in the main index.
  • The default location is /opt/splunk/var/lib/splunk.

 

Index            Purpose
main             All processed data is stored here unless otherwise specified
summary          For the summary indexing system
_internal        Splunk indexes its own logs and processing metrics here
_audit           For audit trails
_introspection   Tracks system performance and resource usage data of Splunk
_thefishbucket   Contains checkpoint information for file monitoring inputs
  • It is good practice to create separate indexes for access control and segregation of duties.
  • By using multiple indexes, you can set granular retention times, as in the example below.
Log source          Daily volume  Retention (days)  Index name  Access control
Palo Alto firewall  10 GB         30                fwlog       Firewall Team, Security Team
Linux syslog        20 GB         60                oslog       Admin Team, Security Team
Windows logs        10 GB         60                oslog       Admin Team, Security Team
Proxy logs          15 GB         90                weblog      Web Team, Security Team, Audit Team
Application logs    10 GB         90                applog      App Team
Web logs            35 GB         90                weblog      Web Team, Security Team, Audit Team
Daily total         100 GB
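
A minimal sketch of how per-index retention such as the above could be set in indexes.conf (paths are hypothetical; frozenTimePeriodInSecs is the retention window in seconds, so 30 days = 2592000):

    [fwlog]
    homePath = $SPLUNK_DB/fwlog/db
    coldPath = $SPLUNK_DB/fwlog/colddb
    thawedPath = $SPLUNK_DB/fwlog/thaweddb
    # events older than 30 days roll to frozen (archived or deleted)
    frozenTimePeriodInSecs = 2592000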

 

  • An index stores events in units called buckets
  • A bucket is a directory containing a set of raw data and index files
  • Buckets have a time span and a maximum data size (see the sketch below)
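
A minimal sketch of the indexes.conf settings that control these limits, with hypothetical values (maxHotSpanSecs caps a hot bucket's time span; maxDataSize its size):

    [fwlog]
    # roll a hot bucket once it spans one day of event time
    maxHotSpanSecs = 86400
    # let Splunk size buckets for a high-volume index (~10 GB per bucket)
    maxDataSize = auto_high_volume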
How Data Flows Through an Index

Inputs -> Hot -> Warm -> Cold -> Frozen (archive or delete)
 

Bucket stage  Description                                                 AWS storage (where you put your buckets)
Hot           The newest buckets, open for write                          EBS General Purpose SSD (gp2)
Warm          Recently indexed data; buckets are closed (read only)       EBS General Purpose SSD (gp2)
Cold          Oldest data still in the index (read only)                  EBS Throughput Optimized HDD (st1)
Frozen        Data ready for archive or deletion (no longer searchable)   Glacier

EBS Provisioned IOPS SSD (io1) volume types provide the highest performance and are ideal for special use cases.

 

Hot Buckets

Data is read and parsed, and it goes through the license meter. Each event is written into a hot bucket. A hot bucket is closed when it reaches its time span or maximum size, and is then converted to warm status.

Hot buckets live in the index's db directory, with names beginning with "hot_".

Buckets are renamed when they roll from hot to warm.

When Splunk rolls a bucket, it moves the entire bucket subdirectory.

Hot and warm buckets are searched first and should be on the fastest disks.

 

To create or edit an index in Splunk Web, select Settings > Indexes.

Example

Index Settings

Setting                  Value
Index name               fwlog
Home path                /opt/dataidx/fw/db
Cold path                /opt/dataidx/fw/colddb
Thawed path              /opt/dataidx/fw/thaweddb
Max size                 500000 MB
Max size hot/warm/cold   10000 MB
Frozen archive path      /opt/dataidx/frozen

OR

Edit the stanza in indexes.conf for more advanced options.
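
A minimal sketch of the equivalent indexes.conf stanza for the example above (the mapping of the two "max size" settings to the per-path attributes is an assumption):

    [fwlog]
    homePath = /opt/dataidx/fw/db
    coldPath = /opt/dataidx/fw/colddb
    thawedPath = /opt/dataidx/fw/thaweddb
    # total size cap for the whole index, in MB
    maxTotalDataSizeMB = 500000
    # size caps for the hot/warm and cold portions, in MB
    homePath.maxDataSizeMB = 10000
    coldPath.maxDataSizeMB = 10000
    # roll frozen buckets to the archive path instead of deleting them
    coldToFrozenDir = /opt/dataidx/frozen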

Indexing Activity

Search and Reporting > Reports > License Usage Data Cube

How to Inspect Buckets

Search:  | dbinspect index=<name>  with an optional span or timeformat argument

Display a chart with the span size of 1 day, using the command line interface (CLI):

  • | dbinspect index=_internal span=1d

Default dbinspect output for a local _internal index.

  • | dbinspect index=_internal

Check for corrupt buckets

Use the corruptonly argument to display information about corrupted buckets, instead of information about all buckets.

The output fields that display are the same with or without the corruptonly argument.

  • | dbinspect index=_internal corruptonly=true

 Count the number of buckets for each Splunk server

Use this search to verify that the Splunk servers in your distributed environment are included in the dbinspect results. It counts the number of buckets for each server.

  • | dbinspect index=_internal | stats count by splunk_server

Find the index size of buckets in GB

Use dbinspect to find the index size of buckets in GB.

For current numbers, run this search over a recent time range.

  • | dbinspect index=_internal | eval GB=sizeOnDiskMB/1024 | stats sum(GB)

Deleting Events

To delete events, you need the can_delete role.

  • Use the delete command to make unwanted data stop showing up in searches (events are not removed from disk)
  • index=web host=myhost sourcetype=access_combined_wcookie | delete

splunk clean eventdata -index <indexname> wipes out all data from the index. Run it while Splunk is stopped.
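
A minimal sketch of the full sequence, run from $SPLUNK_HOME/bin (the index name fwlog is hypothetical; cleaning requires Splunk to be stopped):

    ./splunk stop
    # remove all events from the fwlog index only
    ./splunk clean eventdata -index fwlog
    ./splunk start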

How can you tell your indexer is working?

index=[your_index_name]

To check the license usage, search:

index=_internal LicenseUsage idx=[your_index_name]

To check indexing throughput, search:

index=_internal Metrics series=[your_index_name] | stats sum(kbps)

index=_internal Metrics group="per_sourcetype_thruput" series=access* | timechart span=1h sum(kb) by series

index=_internal Metrics group="per_sourcetype_thruput" series=access* | timechart span=1h sum(kb) by series | sort - sum(kb)

Determine how many active sources are being indexed.

Search:  | dbinspect index=main  (or index=[your_index_name])

How to calculate the data compression rate of a bucket

Search:

| dbinspect index=[your_index]
| where eventCount > 10000
| fields index, id, state, eventCount, rawSize, sizeOnDiskMB, sourceTypeCount
| eval TotalRawMB=(rawSize / 1024 / 1024)
| eval compression=tostring(round(sizeOnDiskMB / TotalRawMB * 100, 2)) + "%"
| table index, id, state, sourceTypeCount, TotalRawMB, sizeOnDiskMB, compression