Class InformationTheory

java.lang.Object
org.tribuo.util.infotheory.InformationTheory

public final class InformationTheory extends Object
A class of (discrete) information theoretic functions. Gives warnings if there are insufficient samples to estimate the quantities accurately.

Defaults to log_2, so returns values in bits.

All functions expect that the element types have well-defined equals and hashCode methods, and that equals is consistent with hashCode. The behaviour is undefined if this is not true.
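
For example, a minimal usage sketch (the sample data is illustrative, and Tribuo must be on the classpath):

    import java.util.Arrays;
    import java.util.List;
    import org.tribuo.util.infotheory.InformationTheory;

    public class EntropyExample {
        public static void main(String[] args) {
            // A balanced binary sample: two equally likely symbols.
            List<String> coin = Arrays.asList("H","T","H","T","H","T","H","T","H","T");
            // With the default log base 2 this prints 1.0 (one bit).
            System.out.println(InformationTheory.entropy(coin));
        }
    }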

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static final class
    InformationTheory.GTestStatistics
    An immutable named tuple containing the statistics from a G test.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    DEFAULT_MAP_SIZE
    The initial size of the various maps.
    static final double
    LOG_2
    Log base 2.
    static double
    LOG_BASE
    Sets the base of the logarithm used in the information theoretic calculations.
    static final double
    LOG_E
    Log base e.
    static final double
    SAMPLES_RATIO
    The ratio of samples to symbols before emitting a warning.
  • Method Summary

    Modifier and Type
    Method
    Description
    static <T> Map<T,Long>
    calculateCountDist(List<T> vector)
    Generate the counts for a single vector.
    static double
    calculateEntropy(DoubleStream vector)
    Calculates the discrete Shannon entropy of a stream, assuming each element of the stream is an element of the same probability distribution.
    static double
    calculateEntropy(Stream<Double> vector)
    Calculates the discrete Shannon entropy of a stream, assuming each element of the stream is an element of the same probability distribution.
    static <T1, T2, T3> double
    cmi(List<T1> first, List<T2> second, Set<List<T3>> condition)
    Calculates the conditional mutual information between first and second conditioned on the set.
    static <T1, T2> double
    conditionalEntropy(List<T1> vector, List<T2> condition)
    Calculates the discrete Shannon conditional entropy of two arrays, using histogram probability estimators.
    static <T1, T2, T3> double
    conditionalMI(List<T1> first, List<T2> second, List<T3> condition)
    Calculates the discrete Shannon conditional mutual information, using histogram probability estimators.
    static <T1, T2, T3> double
    conditionalMI(TripleDistribution<T1,T2,T3> rv)
    Calculates the discrete Shannon conditional mutual information, using histogram probability estimators.
    static <T1, T2, T3> double
    conditionalMIFlipped(TripleDistribution<T1,T2,T3> rv)
    Calculates the discrete Shannon conditional mutual information, using histogram probability estimators.
    static <T> double
    entropy(List<T> vector)
    Calculates the discrete Shannon entropy, using histogram probability estimators.
    static <T> double
    expectedMI(List<T> first, List<T> second)
    Computes the expected mutual information assuming randomized inputs.
    static <T1, T2, T3> InformationTheory.GTestStatistics
    gTest(List<T1> first, List<T2> second, Set<List<T3>> condition)
    Calculates the GTest statistics for the input variables conditioned on the set.
    static <T1, T2> double
    jointEntropy(List<T1> first, List<T2> second)
    Calculates the Shannon joint entropy of two arrays, using histogram probability estimators.
    static <T1, T2, T3> double
    jointMI(List<T1> first, List<T2> second, List<T3> target)
    Calculates the discrete Shannon joint mutual information, using histogram probability estimators.
    static <T1, T2, T3> double
    jointMI(TripleDistribution<T1,T2,T3> rv)
    Calculates the discrete Shannon joint mutual information, using histogram probability estimators.
    static <T1, T2> double
    mi(List<T1> first, List<T2> second)
    Calculates the discrete Shannon mutual information, using histogram probability estimators.
    static <T1, T2> double
    mi(Set<List<T1>> first, Set<List<T2>> second)
    Calculates the mutual information between the two sets of random variables.
    static <T1, T2> double
    mi(PairDistribution<T1,T2> pairDist)
    Calculates the discrete Shannon mutual information, using histogram probability estimators.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • SAMPLES_RATIO

      public static final double SAMPLES_RATIO
      The ratio of samples to symbols before emitting a warning.
    • DEFAULT_MAP_SIZE

      public static final int DEFAULT_MAP_SIZE
      The initial size of the various maps.
    • LOG_2

      public static final double LOG_2
      Log base 2.
    • LOG_E

      public static final double LOG_E
      Log base e.
    • LOG_BASE

      public static double LOG_BASE
      Sets the base of the logarithm used in the information theoretic calculations. For LOG_2 the unit is "bit", for LOG_E the unit is "nat".
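
      Since LOG_BASE is a mutable static field, callers can change the unit globally. A minimal sketch (note the assignment affects every subsequent calculation in the JVM, so it is best done once at startup):

        import org.tribuo.util.infotheory.InformationTheory;

        public class NatsExample {
            public static void main(String[] args) {
                // Switch from bits (the default, LOG_2) to nats.
                InformationTheory.LOG_BASE = InformationTheory.LOG_E;
            }
        }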
  • Method Details

    • mi

      public static <T1, T2> double mi(Set<List<T1>> first, Set<List<T2>> second)
      Calculates the mutual information between the two sets of random variables.
      Type Parameters:
      T1 - The first type.
      T2 - The second type.
      Parameters:
      first - The first set of random variables.
      second - The second set of random variables.
      Returns:
      The mutual information I(first;second).
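
      A minimal sketch, assuming each List in a set is one variable's sample vector and all vectors have the same length (the data is illustrative):

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;
        import org.tribuo.util.infotheory.InformationTheory;

        public class SetMIExample {
            public static void main(String[] args) {
                // Two variables on the left, treated as a single joint variable.
                Set<List<Integer>> xs = new HashSet<>(Arrays.asList(
                    Arrays.asList(0,0,1,1,0,1,0,1),
                    Arrays.asList(0,1,0,1,1,0,1,0)));
                Set<List<Integer>> ys = new HashSet<>();
                ys.add(Arrays.asList(0,1,1,0,1,1,1,1));
                System.out.println(InformationTheory.mi(xs, ys));
            }
        }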
    • cmi

      public static <T1, T2, T3> double cmi(List<T1> first, List<T2> second, Set<List<T3>> condition)
      Calculates the conditional mutual information between first and second conditioned on the set.
      Type Parameters:
      T1 - The first type.
      T2 - The second type.
      T3 - The third type.
      Parameters:
      first - A sample from the first random variable.
      second - A sample from the second random variable.
      condition - A sample from the conditioning set of random variables.
      Returns:
      The conditional mutual information I(first;second|condition).
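
      A minimal sketch with a single conditioning variable (illustrative data; the set may hold several conditioning vectors):

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;
        import org.tribuo.util.infotheory.InformationTheory;

        public class CMIExample {
            public static void main(String[] args) {
                List<Integer> first  = Arrays.asList(0,0,1,1,0,1,0,1);
                List<Integer> second = Arrays.asList(0,1,0,1,1,0,1,0);
                Set<List<Integer>> condition = new HashSet<>();
                condition.add(Arrays.asList(0,0,0,0,1,1,1,1));
                System.out.println(InformationTheory.cmi(first, second, condition));
            }
        }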
    • gTest

      public static <T1, T2, T3> InformationTheory.GTestStatistics gTest(List<T1> first, List<T2> second, Set<List<T3>> condition)
      Calculates the GTest statistics for the input variables conditioned on the set.
      Type Parameters:
      T1 - The first type.
      T2 - The second type.
      T3 - The third type.
      Parameters:
      first - A sample from the first random variable.
      second - A sample from the second random variable.
      condition - A sample from the conditioning set of random variables.
      Returns:
      The GTest statistics.
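
      A minimal sketch (illustrative data; the example prints the returned tuple via its toString rather than assuming accessor names):

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;
        import org.tribuo.util.infotheory.InformationTheory;

        public class GTestExample {
            public static void main(String[] args) {
                List<Integer> first  = Arrays.asList(0,0,1,1,0,1,0,1);
                List<Integer> second = Arrays.asList(0,1,0,1,1,0,1,0);
                Set<List<Integer>> condition = new HashSet<>();
                condition.add(Arrays.asList(0,0,0,0,1,1,1,1));
                InformationTheory.GTestStatistics stats =
                        InformationTheory.gTest(first, second, condition);
                // GTestStatistics is an immutable named tuple.
                System.out.println(stats);
            }
        }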
    • jointMI

      public static <T1, T2, T3> double jointMI(List<T1> first, List<T2> second, List<T3> target)
      Calculates the discrete Shannon joint mutual information, using histogram probability estimators. Arrays must be the same length.
      Type Parameters:
      T1 - Type contained in the first array.
      T2 - Type contained in the second array.
      T3 - Type contained in the target array.
      Parameters:
      first - An array of values.
      second - Another array of values.
      target - Target array of values.
      Returns:
      The joint mutual information I((first,second);target)
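
      XOR is the classic case where the joint MI exceeds either single MI; a minimal sketch (illustrative data):

        import java.util.Arrays;
        import java.util.List;
        import org.tribuo.util.infotheory.InformationTheory;

        public class JointMIExample {
            public static void main(String[] args) {
                List<Integer> first  = Arrays.asList(0,0,1,1,0,0,1,1);
                List<Integer> second = Arrays.asList(0,1,0,1,0,1,0,1);
                List<Integer> target = Arrays.asList(0,1,1,0,0,1,1,0); // XOR of the inputs
                // Jointly the inputs determine the target (1 bit), but
                // each input alone carries no information about it (0 bits).
                System.out.println(InformationTheory.jointMI(first, second, target));
                System.out.println(InformationTheory.mi(first, target));
            }
        }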
    • jointMI

      public static <T1, T2, T3> double jointMI(TripleDistribution<T1,T2,T3> rv)
      Calculates the discrete Shannon joint mutual information, using histogram probability estimators.
      Type Parameters:
      T1 - Type contained in the first array.
      T2 - Type contained in the second array.
      T3 - Type contained in the target array.
      Parameters:
      rv - The random variable to calculate the joint MI of.
      Returns:
      The joint mutual information I((first,second);third)
    • conditionalMI

      public static <T1, T2, T3> double conditionalMI(List<T1> first, List<T2> second, List<T3> condition)
      Calculates the discrete Shannon conditional mutual information, using histogram probability estimators. Arrays must be the same length.
      Type Parameters:
      T1 - Type contained in the first array.
      T2 - Type contained in the second array.
      T3 - Type contained in the condition array.
      Parameters:
      first - An array of values.
      second - Another array of values.
      condition - Array to condition upon.
      Returns:
      The conditional mutual information I(first;second|condition)
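
      Continuing the XOR example above, conditioning on one input makes the other fully informative; a minimal sketch (illustrative data):

        import java.util.Arrays;
        import java.util.List;
        import org.tribuo.util.infotheory.InformationTheory;

        public class ConditionalMIExample {
            public static void main(String[] args) {
                List<Integer> first  = Arrays.asList(0,0,1,1,0,0,1,1);
                List<Integer> second = Arrays.asList(0,1,0,1,0,1,0,1);
                List<Integer> target = Arrays.asList(0,1,1,0,0,1,1,0); // XOR of the inputs
                // I(first;target|second) = 1 bit, although I(first;target) = 0.
                System.out.println(InformationTheory.conditionalMI(first, target, second));
            }
        }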
    • conditionalMI

      public static <T1, T2, T3> double conditionalMI(TripleDistribution<T1,T2,T3> rv)
      Calculates the discrete Shannon conditional mutual information, using histogram probability estimators. Note this calculates I(T1;T2|T3).
      Type Parameters:
      T1 - Type of the first variable.
      T2 - Type of the second variable.
      T3 - Type of the condition variable.
      Parameters:
      rv - The triple random variable of the three inputs.
      Returns:
      The conditional mutual information I(first;second|condition)
    • conditionalMIFlipped

      public static <T1, T2, T3> double conditionalMIFlipped(TripleDistribution<T1,T2,T3> rv)
      Calculates the discrete Shannon conditional mutual information, using histogram probability estimators. Note this calculates I(T1;T3|T2).
      Type Parameters:
      T1 - Type of the first variable.
      T2 - Type of the condition variable.
      T3 - Type of the second variable.
      Parameters:
      rv - The triple random variable of the three inputs.
      Returns:
      The conditional mutual information I(T1;T3|T2)
    • mi

      public static <T1, T2> double mi(List<T1> first, List<T2> second)
      Calculates the discrete Shannon mutual information, using histogram probability estimators. Arrays must be the same length.
      Type Parameters:
      T1 - Type of the first array
      T2 - Type of the second array
      Parameters:
      first - An array of values
      second - Another array of values
      Returns:
      The mutual information I(first;second)
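
      A minimal sketch (illustrative data): for identical balanced binary vectors the MI equals the entropy, one bit.

        import java.util.Arrays;
        import java.util.List;
        import org.tribuo.util.infotheory.InformationTheory;

        public class MIExample {
            public static void main(String[] args) {
                List<Integer> x = Arrays.asList(0,1,0,1,0,1,0,1);
                // I(X;X) = H(X), so this prints 1.0 under the default log base 2.
                System.out.println(InformationTheory.mi(x, x));
            }
        }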
    • mi

      public static <T1, T2> double mi(PairDistribution<T1,T2> pairDist)
      Calculates the discrete Shannon mutual information, using histogram probability estimators.
      Type Parameters:
      T1 - Type of the first variable
      T2 - Type of the second variable
      Parameters:
      pairDist - PairDistribution for the two variables.
      Returns:
      The mutual information I(first;second)
    • jointEntropy

      public static <T1, T2> double jointEntropy(List<T1> first, List<T2> second)
      Calculates the Shannon joint entropy of two arrays, using histogram probability estimators. Arrays must be the same length.
      Type Parameters:
      T1 - Type of the first array.
      T2 - Type of the second array.
      Parameters:
      first - An array of values.
      second - Another array of values.
      Returns:
      The entropy H(first,second)
    • conditionalEntropy

      public static <T1, T2> double conditionalEntropy(List<T1> vector, List<T2> condition)
      Calculates the discrete Shannon conditional entropy of two arrays, using histogram probability estimators. Arrays must be the same length.
      Type Parameters:
      T1 - Type of the first array.
      T2 - Type of the second array.
      Parameters:
      vector - The main array of values.
      condition - The array to condition on.
      Returns:
      The conditional entropy H(vector|condition).
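
      These plug-in estimators obey the chain rule H(X,Y) = H(Y) + H(X|Y); a minimal sketch checking it on illustrative data:

        import java.util.Arrays;
        import java.util.List;
        import org.tribuo.util.infotheory.InformationTheory;

        public class ChainRuleExample {
            public static void main(String[] args) {
                List<Integer> x = Arrays.asList(0,0,1,1,0,1,0,1);
                List<Integer> y = Arrays.asList(0,1,0,1,1,0,1,1);
                double joint = InformationTheory.jointEntropy(x, y);
                double chain = InformationTheory.entropy(y)
                             + InformationTheory.conditionalEntropy(x, y);
                // The two agree up to floating point rounding.
                System.out.println(joint + " == " + chain);
            }
        }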
    • entropy

      public static <T> double entropy(List<T> vector)
      Calculates the discrete Shannon entropy, using histogram probability estimators.
      Type Parameters:
      T - Type of the array.
      Parameters:
      vector - The array of values.
      Returns:
      The entropy H(vector).
    • calculateCountDist

      public static <T> Map<T,Long> calculateCountDist(List<T> vector)
      Generate the counts for a single vector.
      Type Parameters:
      T - The type inside the vector.
      Parameters:
      vector - An array of values.
      Returns:
      A HashMap from states of T to counts.
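
      A minimal sketch (illustrative data; map iteration order is not guaranteed):

        import java.util.Arrays;
        import java.util.List;
        import java.util.Map;
        import org.tribuo.util.infotheory.InformationTheory;

        public class CountDistExample {
            public static void main(String[] args) {
                List<String> sample = Arrays.asList("a","b","a","c","a","b");
                Map<String,Long> counts = InformationTheory.calculateCountDist(sample);
                System.out.println(counts); // e.g. {a=3, b=2, c=1}
            }
        }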
    • calculateEntropy

      public static double calculateEntropy(Stream<Double> vector)
      Calculates the discrete Shannon entropy of a stream, assuming each element of the stream is an element of the same probability distribution.
      Parameters:
      vector - The probability distribution.
      Returns:
      The entropy.
    • calculateEntropy

      public static double calculateEntropy(DoubleStream vector)
      Calculates the discrete Shannon entropy of a stream, assuming each element of the stream is an element of the same probability distribution.
      Parameters:
      vector - The probability distribution.
      Returns:
      The entropy.
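
      Unlike entropy(List), these overloads take the probabilities themselves rather than raw samples; a minimal sketch:

        import java.util.stream.DoubleStream;
        import org.tribuo.util.infotheory.InformationTheory;

        public class StreamEntropyExample {
            public static void main(String[] args) {
                // A uniform distribution over four outcomes has entropy 2 bits.
                double h = InformationTheory.calculateEntropy(
                        DoubleStream.of(0.25, 0.25, 0.25, 0.25));
                System.out.println(h); // 2.0 under the default log base 2
            }
        }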
    • expectedMI

      public static <T> double expectedMI(List<T> first, List<T> second)
      Computes the expected mutual information assuming randomized inputs.
      Type Parameters:
      T - The type inside the list. Must define equals and hashCode.
      Parameters:
      first - The first vector.
      second - The second vector.
      Returns:
      The expected mutual information under a hypergeometric distribution.
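
      A minimal sketch (illustrative data): the result is the chance-level MI between two labelings with these marginal counts, a useful baseline when judging an observed mi value.

        import java.util.Arrays;
        import java.util.List;
        import org.tribuo.util.infotheory.InformationTheory;

        public class ExpectedMIExample {
            public static void main(String[] args) {
                List<Integer> first  = Arrays.asList(0,0,1,1,0,1,0,1);
                List<Integer> second = Arrays.asList(0,1,0,1,1,0,1,0);
                // Expected MI under the hypergeometric null model.
                System.out.println(InformationTheory.expectedMI(first, second));
            }
        }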