bpo.miners module

Functions:

mine_problem(log[, task_type_filter, ...])

Mines a problem and returns it as a problems.Problem that can be simulated.

bpo.miners.mine_problem(log, task_type_filter=None, datetime_format='%Y/%m/%d %H:%M:%S', earliest_start=None, latest_completion=None, min_resource_count=2, resource_schedule_timeunit=datetime.timedelta(seconds=3600), resource_schedule_repeat=168, datafields={}, max_error_std=None)

Mines a problem and returns it as a problems.Problem that can be simulated. The log from which the problem is mined must have at least the columns Case ID, Activity, Resource, Start Timestamp, and Complete Timestamp, which identify the corresponding event log information. Activity labels are treated as Task Types for the purposes of the problem definition. All timing distributions associated with the problem are expressed in hours. Only cases that start on or after earliest_start and complete on or before latest_completion are taken into account. datafields specifies the columns in the log for which data fields will be learned. It is a dictionary that maps column names to probability distribution types. For each listed log column, a data field is learned according to the specified distribution, and the simulator can then draw samples from that distribution.
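As an illustration of the log layout described above, the following minimal sketch builds a small pandas dataframe with the five required columns and checks that its timestamps parse with the default datetime_format. The case, activity, and resource values are made up for the example, and the commented-out call at the end assumes the bpo package is installed; it is not a tested invocation.

```python
import pandas as pd

# Column names required by mine_problem, as listed in the description above.
REQUIRED_COLUMNS = [
    "Case ID", "Activity", "Resource", "Start Timestamp", "Complete Timestamp",
]
# Default value of the datetime_format parameter.
DATETIME_FORMAT = "%Y/%m/%d %H:%M:%S"

# Hypothetical event log: every value below is invented for illustration.
log = pd.DataFrame(
    [
        ("case_1", "Register claim", "alice", "2023/01/02 09:00:00", "2023/01/02 09:30:00"),
        ("case_1", "Assess claim", "bob", "2023/01/02 10:00:00", "2023/01/02 12:00:00"),
        ("case_2", "Register claim", "alice", "2023/01/03 09:15:00", "2023/01/03 09:45:00"),
    ],
    columns=REQUIRED_COLUMNS,
)

# Check that no required column is missing before mining.
missing = [column for column in REQUIRED_COLUMNS if column not in log.columns]

# Parse both timestamp columns with the same format string and express the
# processing time of each event in hours, the unit used by the mined problem.
start = pd.to_datetime(log["Start Timestamp"], format=DATETIME_FORMAT)
complete = pd.to_datetime(log["Complete Timestamp"], format=DATETIME_FORMAT)
durations_hours = (complete - start).dt.total_seconds() / 3600

# Assumed usage, only valid if the bpo package is available:
# from bpo import miners
# problem = miners.mine_problem(log)
```

Parsing the timestamps up front is optional, but it surfaces format mismatches before the log is handed to the miner.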

Parameters
  • log – a pandas dataframe from which the problem must be mined.

  • task_type_filter – a function that takes the name of a task type (activity) and returns whether it should be included, or None to include all task types.

  • datetime_format – the datetime format the Start Timestamp and Complete Timestamp columns use.

  • earliest_start – a datetime object that, if not None, indicates that only cases that start on or after this datetime should be included. earliest_start and latest_completion should either both be None or both have a value.

  • latest_completion – a datetime object that, if not None, indicates that only cases that complete on or before this datetime should be included. earliest_start and latest_completion should either both be None or both have a value.

  • min_resource_count – the minimum number of times a resource must have executed a task of a particular type, for it to be considered in the pool of resources for the task type. This must be greater than 1, otherwise the standard deviation of the processing time cannot be computed.

  • resource_schedule_timeunit – the timeunit in which resource schedules should be represented. Default is 1 hour.

  • resource_schedule_repeat – the number of time units after which the resource schedule is expected to repeat itself. Default is 168, which with the default time unit of 1 hour corresponds to one week.

  • datafields – a mapping of string to DistributionType, where each string must be the name of one of the columns of the log.

  • max_error_std – a distribution that describes the maximum standard deviation of the error that the processing times can have. It must be specified as a fraction of the mean processing time. The actual fraction will be sampled from this probability distribution. By default, no maximum is set.
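The parameter descriptions above can be sketched with standard-library values only. The filter rule, the dates, and the in_window helper below are hypothetical: in_window restates the documented inclusion rule for illustration and is not the library's implementation.

```python
from datetime import datetime, timedelta


def keep_task_type(name):
    """Hypothetical task_type_filter: exclude activities labelled 'System ...'."""
    return not name.startswith("System")


# Both bounds are set together, as the parameter descriptions require.
earliest_start = datetime(2023, 1, 1)
latest_completion = datetime(2023, 3, 31, 23, 59, 59)


def in_window(case_start, case_complete):
    """Illustrative restatement of the documented rule: a case counts if it
    starts on or after earliest_start and completes on or before
    latest_completion."""
    return earliest_start <= case_start and case_complete <= latest_completion


# Default schedule settings: 168 time units of 1 hour each.
resource_schedule_timeunit = timedelta(seconds=3600)
resource_schedule_repeat = 168
schedule_period = resource_schedule_timeunit * resource_schedule_repeat

# Keyword arguments that could be passed on as mine_problem(log, **mining_kwargs).
mining_kwargs = {
    "task_type_filter": keep_task_type,
    "earliest_start": earliest_start,
    "latest_completion": latest_completion,
    "min_resource_count": 2,
    "resource_schedule_timeunit": resource_schedule_timeunit,
    "resource_schedule_repeat": resource_schedule_repeat,
}
```

Multiplying the time unit by the repeat count gives the length of one schedule cycle, which is a quick way to confirm that a non-default combination still covers the intended period.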

Returns

a problems.Problem.