With the details covered so far, you should be able to conduct complicated simulation scenarios with various models and population sets, and to produce reports to help analyze the simulations. Yet sometimes there is a need to go beyond the results of a single simulation, or to take the data outside the GUI. To support such manipulation, the system ships with several Python utilities. The following text explores these utilities and their uses by subject.
The utilities are Python scripts that allow the user to perform advanced tasks. Here is a brief list of these scripts:
All these scripts are invoked in a similar manner. Therefore, for explanation purposes, we will refer to a generic script name, including the .py extension, as PythonScript.py. Whenever the name PythonScript.py is encountered, it should be replaced with the script name of interest.
All the above scripts are started from a command prompt / terminal window. On Linux, open a terminal window. On Windows, click the Start button at the lower left corner of the screen and select Command Prompt under the Accessories program group; alternatively, select Run from the Start menu, type cmd, and press Enter to launch the command prompt.
Once you have opened the terminal, change to your working directory by typing:
Recall that your working directory is the directory where you installed IEST, and WorkingDirectoryFullPath stands for its full path name. To write the full directory name you can use the tab completion feature, or, on Windows, drag and drop a file into the command prompt window and correct the name that appears. Note that the directory separator on Windows is the backslash character \ while on Linux it is the slash character / .
Once you are in the correct directory, you can invoke the script PythonScript.py by typing:
Note that if you added the Python installation directory to the Windows path, you can just write the script name PythonScript.py in the Windows console. For further information on how to invoke Python on Windows, see the following link.
For purposes of this tutorial we will always use the Linux form of invoking the program:
The invoked script will show usage information and list its input variables. The program will then ask you to enter input through the console, prompting for a single input at a time. You can then follow the prompts to run the script.
It is also possible to invoke the scripts with all their inputs on the command line, so that the script does not ask the user for additional input. To do this, just add the input values after the script name:
python PythonScript.py InputVariable1 InputVariable2 ...
Note that each script requests different input variables, and that in many cases there are defaults for some variables, making them optional. Optional variables are displayed in brackets in the usage information when the script is invoked with no variables.
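As an illustration, the typical input handling of such a script can be sketched as follows. This is a simplified, hypothetical sketch, not the actual implementation of any of the utilities; the variable names are made up for the example.

```python
import sys

def read_inputs(argv, names, defaults):
    """Collect script inputs: take them from the command line when
    provided, apply defaults for optional variables (those shown in
    brackets in the usage text), and otherwise prompt the user."""
    values = {}
    for i, name in enumerate(names):
        if i + 1 < len(argv):            # value supplied on the command line
            values[name] = argv[i + 1]
        elif name in defaults:           # optional variable: use its default
            values[name] = defaults[name]
        else:                            # mandatory variable: ask the user
            values[name] = input(name + ': ')
    return values

# Command-line form: python PythonScript.py Testing.zip 0
vals = read_inputs(['PythonScript.py', 'Testing.zip', '0'],
                   ['FileName', 'ProjectIndex', 'Repetitions'],
                   {'Repetitions': '100'})
print(vals)
```

In this sketch, Repetitions was omitted from the command line, so its default of 100 is used, mirroring the behavior described above.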
We will now continue to discuss each utility script separately.
The script in focus for this topic is ConvertDataToCode.py.
If you think about the way you work with the system, entities are created in a certain order and reference each other. The order of entity creation is important to enable certain dependencies. For example, you need to define a state before you include it in a process, you need to create a model before you use it in a project, and you need to create a parameter before you use it in an expression. It is somewhat similar to building a house: you first build the foundations, then the main body, and only then the roof, in that order. And just like in a house, after it is built it is sometimes difficult to make a correction in the foundations. This analogy of building a house will be helpful later on; for now we return to our system and the GUI.
Each time you create a new entity, the system adds it to the database. This database can be saved and loaded by the system as a zip file. This file is often referred to as the data definitions file, since it holds the entire database of entities and enables us to save and load our work. It can also contain simulation results on top of the project that created them. Think of adding entities to the system as analogous to adding bricks to the house, and think of the database as a snapshot of the entire house.
Now imagine that instead of clicking your way through the system forms and entering data in a certain order, you could write down sentences that describe what you are doing in the form of instructions. Such a set of instructions can be used to create the database from scratch; it constitutes a program that can reconstruct the database. In the house analogy, think of this as a plan with detailed instructions telling a quick builder how to build the house.
Now suppose you already have a database zip file and want the system to figure out the set of instructions that created it. The system can do just that using the utility called ConvertDataToCode.py.
This utility takes a database zip file as input and creates Python code that reconstructs this database. With analogy to the house, think about it as looking at a snapshot of a house and automatically deriving the plans for the house as instructions to the builder.
The main input parameter to the reconstruction program is the database zip file, which we will denote as
DataDefinitionsFileName.zip. Typically the script will be invoked in the following way:
python ConvertDataToCode.py DataDefinitionsFileName.zip
This avoids asking the user questions and just performs the conversion with default values, which are recommended for most cases. By default, the set of instructions that creates the database is saved as a reconstruction Python program under the file name TheGeneratedDataCode.py.
With regards to our analogy, think about this file as the plan containing instructions for the builder to build the house.
If you open this file, you will find instructions that create your database in the following order:
Unless you specifically request it, simulation results are not converted by default; otherwise they will appear at the end.
At the very end of the code, you will find a line that creates a new zip file from the code under the default file name TheGeneratedDataCode_out.zip. So if you run the Python reconstruction program TheGeneratedDataCode.py, it will create a new database zip file under the name TheGeneratedDataCode_out.zip that is equivalent to the database file you converted to code.
To run the conversion from code back to data use the command:
If there are no changes to the Python reconstruction program, this allows a circular path between code and data that can be followed in either direction. In other words, data definitions can be transferred to code and vice versa, so code and data definitions are interchangeable. In the house analogy, think about it as having the ability to build a house from a plan containing building instructions, and the ability to take a snapshot of an existing house and convert it back into building plans. This is a powerful mechanism that allows the user to make complicated changes easily.
The most useful task that can be performed through code is making changes while avoiding dependencies. For example, if the user wants to change the name of a parameter such as
Diabetes while Diabetes is used elsewhere, the system will not allow the user to perform this change through the Graphical User Interface (GUI), since it would violate dependencies. Yet it is possible to make the change using a find and replace operation in the code file and then reconstructing a new data file. Note that the user should be careful to make the changes in all places and to avoid name clashes and changes to other variable names with the word
Diabetes in them. If the changes the user made in the code are reasonable, a new database file will be created once the code is executed. If the changes create conflicts or are otherwise invalid, the system will not be able to reconstruct the data file. In the house analogy, think about it as taking a snapshot of an existing house, converting it into plans, changing the plans of the foundations, and then rebuilding the entire house from the revised plan.
Note that this type of operation is intended for the advanced user, and the user is responsible for making intelligent changes in the code. However, the system will make validity checks when converting the code back to data. In the house analogy, it is up to the designer to make a proper change to the foundations in the plan; otherwise the builders will either be unable to build the house, or, if the house is built, it may be faulty due to a bad change in the foundations.
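A careful whole-word find and replace over the generated code file could be sketched in Python as follows. This is a hypothetical sketch, not part of the toolset; the sample code line is made up for illustration. Word boundaries keep names that merely contain the word, such as DiabetesType, untouched.

```python
import re

def rename_parameter(code_text, old_name, new_name):
    # \b word boundaries prevent partial matches inside longer names
    return re.sub(r'\b%s\b' % re.escape(old_name), new_name, code_text)

# A made-up fragment of generated code for demonstration purposes
code = "CreateParameter('Diabetes'); Expr('Diabetes + DiabetesType')"
renamed = rename_parameter(code, 'Diabetes', 'Glucose')
print(renamed)  # DiabetesType keeps its original name
```

Running the reconstruction program on a file edited this way would then rebuild the database under the new name, provided the change does not create conflicts.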
Other uses of this powerful capability include:
There may be additional uses of this powerful capability. Yet again, it is important to understand that it is not recommended for non-advanced users; if not used properly, it can cause much confusion. Nevertheless, it is a very useful tool.
As an example, it is recommended to run the following command:
python ConvertDataToCode.py Testing.zip
This will convert the testing data definitions to code in the file TheGeneratedDataCode.py that can be inspected by the user or executed to regenerate the data definitions.
The script in focus for this topic is MultiRunSimulation.py.
Using the GUI it is possible to define and run a simulation by pressing the Run Simulation Button in the simulation screen. Each time a simulation is launched there is a need to wait for it to finish. Once done, simulation results are accessible.
However, since we typically run a Monte-Carlo simulation, we expect different results each time we run the simulation. If we want a good understanding of the distribution of results, we need to run many repetitions of the same simulation. This can be done by defining a large number of repetitions for a project; however, for practical reasons it may not be the most efficient approach. These reasons include: 1) Running the simulation for a very large number of population repetitions, such as 100,000 or more, may be required for some models to get stable results, yet it may take a long time to wait for results. 2) Keeping simulation results in memory may not be practical, as it may require larger machines and is prone to interruptions of the simulation. 3) We sometimes want the population size to match the study size to allow better comparison of results. 4) Sometimes the user may want to run the simulations outside the GUI, perhaps as a batch job.
To resolve these issues and offer further flexibility, the system provides a mechanism to run simulations outside the GUI using the MultiRunSimulation.py script. When the script is invoked, it will ask for the following parameters in this order:
FileName: The data definitions zip file name that holds the project to be simulated.
ProjectIndex: The number of the project to be simulated within the data definitions zip file. Note that project number zero means the first project in the list displayed in the GUI main screen. However, for advanced users it is possible to use the internal ID, which can be seen when the data is converted to code, by enclosing it in brackets. If this input is omitted, the system chooses the first project by default.
Repetitions: This is an optional integer that defines the number of times to repeat the entire simulation. For each repetition, the system will create a new output database file with simulation results for the requested project. Each new file will have the same file name as the original data definitions zip file with an extension of _#, where
# is the number of the simulation. Each such file will be a copy of the original database with a single result set. If
Repetitions is omitted, the default is 100 repetitions.
StartIndex: This is an optional integer that indicates the first suffix counter added to the files generated with the results. By default this number is 0, meaning that the results are saved in a filename consisting of the database name followed by an underscore and 0 for the first file, with subsequent files continuing to count from this number. This number is useful if N simulations have already been generated by MultiRunSimulation.py and we want to run additional simulations whose filename indices start after N. This way we can save time by running the simulations on different machines in parallel.
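The interplay between Repetitions and StartIndex can be sketched as follows, assuming the underscore-and-number naming scheme described above (the helper function is hypothetical, written only to illustrate the naming):

```python
def output_file_names(base_name, repetitions, start_index=0):
    """File names that runs of this kind would produce: the base name
    followed by an underscore and a running index, one per repetition."""
    stem = base_name[:-len('.zip')]
    return [stem + '_' + str(start_index + i) + '.zip'
            for i in range(repetitions)]

# Machine A runs the first 3 repetitions, machine B continues from index 3
print(output_file_names('Testing.zip', 3))      # Testing_0.zip .. Testing_2.zip
print(output_file_names('Testing.zip', 3, 3))   # Testing_3.zip .. Testing_5.zip
```

Because the two machines use different StartIndex values, their output file names never clash and all six result files can later be processed together.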
PopulationRepetitionsOverride: This is an optional integer that overrides the population repetitions defined in the Project form. If the word
None is used, the system performs no override; this is the default. Otherwise the number of repetitions of each population individual is overridden. Note that Repetitions and
PopulationRepetitionsOverride are related, yet define different things. In a sense, these two numbers multiply the number of individuals defined in the population set when all simulations are examined. For example, if 100 Repetitions are requested as input to MultiRunSimulation.py and
PopulationRepetitionsOverride is 2000 for a distribution-based population, then 100 new files will be generated, each with 2000 individuals simulated. Counting all files, overall there will be 200,000 simulated individuals from which statistics can be derived.
ModelOverrideID: This is an optional integer. The number indicates the index of a model to override the model defined in the project to be simulated. The first model is indexed as 0, and the model numbers follow the order in which the models appear in the GUI model form when it opens. It is also possible to specify the number in brackets, in which case the internal index of the model is used; this internal ID can be found by converting the database file to code. This option is useful if the user wants to compare the results of a project across multiple model versions without redefining the data definitions zip file. It is the user's responsibility to make sure the model is compatible with the other definitions of the project.
PopulationOverrideID: This is an optional integer. The number indicates the index of a population set to override the population set defined in the project to be simulated. The first population set is indexed as 0, and the population set numbers follow the order in which the population sets appear in the GUI population form when it opens. It is also possible to specify the number in brackets, in which case the internal index of the population set is used; this internal ID can be found by converting the database file to code. This option is useful if the user wants to compare the results of a project across multiple population sets. It is the user's responsibility to make sure the population set is compatible with the other definitions of the project.
RuleValueOverrides: One or more optional numbers that, if specified, override project initialization rule values. This is intended to allow the user to override initialization values defined in stage 0 of the simulation. The first number overrides the value provided for the affected parameter in the first rule, the second number the value in the second rule, and so on. To use this ability, the user has to define the project rules in stage 0 of the simulation in a known order beforehand, since this order is used to place the override values. This allows a batch program outside the GUI system to interface with project initialization before simulation and to manipulate simulation parameters. In a sense, this ability transforms the project into a function whose input parameters are defined by the override.
As an example, it is recommended to run the following command:
python MultiRunSimulation.py Testing.zip 0 3
This will run the first project in the file 3 times and will generate the files Testing_0.zip, Testing_1.zip, Testing_2.zip, each holding simulation results for the first project. You can then load these files through the GUI and inspect the results in each file.
Note that the simulations are conducted sequentially, one after the other, on the same machine and the same CPU core, so using the MultiRunSimulation.py script in this form does not save simulation time. However, this script avoids memory limit violations. It allows practical flexibility in conducting simulations by manipulating the simulation defaults and scaling the simulation result sizes after definition. These capabilities can be used manually by the user, but they are best utilized by the system to provide parallel computing capabilities, as will be discussed later.
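Although each invocation runs serially, several invocations can be launched side by side. A hedged sketch using Python's subprocess module is shown below; the real script is replaced here by a trivial stand-in command so the sketch is self-contained, and the commented-out command lines are only examples of how the actual runs might look.

```python
import subprocess
import sys

def launch(commands):
    """Start all commands at once, then wait for every one to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]   # list of exit codes

# In practice each command could be, for example:
#   [sys.executable, 'MultiRunSimulation.py', 'Testing.zip', '0', '50', '0']
#   [sys.executable, 'MultiRunSimulation.py', 'Testing.zip', '0', '50', '50']
# with different StartIndex values so the output file names do not clash.
codes = launch([[sys.executable, '-c', 'pass'] for _ in range(2)])
print(codes)
```

Each process writes its own result files, so no coordination beyond the distinct StartIndex values is needed.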
The script in focus for this topic is MultiRunCombinedReport.py.
Using the GUI it is possible to generate a report for a single simulation result set. Even within the GUI it is possible to run several simulations for the same project, each creating a new result set, while the report is per simulation result set, not per project. Moreover, if simulations for the same project were generated using MultiRunSimulation.py, the results exist in multiple files and it is hard to compose a report covering all of them together.
The MultiRunCombinedReport.py script allows pulling together several result sets from multiple files and creating a single report combining them together. It is up to the user to make sure that the result sets are compatible.
When this script is invoked, it will ask the user a few questions as input. It is possible to answer the questions by hand, or prepare a file with the answers and run the script with this file as input as depicted in the usage.
The inputs requested are:
OptionValue. A blank line indicates the end of the format options list. Note that an easy way to obtain this list is to save the format options from the GUI results form into an .opt file and copy the contents of this file.
As an example that demonstrates the capabilities of this utility, we will build upon the results from the previous example created by MultiRunSimulation.py. In this example, invoke the program in the following manner:
Then provide the following answers, where (Press Enter for Blank Line) stands for an empty line:
Testing_0.zip
Testing_1.zip
Testing_2.zip
(Press Enter for Blank Line)
1
(Press Enter for Blank Line)
DetailLevel 1
(Press Enter for Blank Line)
(Press Enter for Blank Line)
These inputs can be also saved into a file that will be provided as a parameter to the script in the command line when it is invoked.
Once the script has finished running, you can open the file Report.txt and find a detailed report combining the results from all 3 simulations in the 3 files created previously. Note that the record count is 3000 rather than 1000. Also note that the filenames are presented at the top of the report.
The MultiRunCombinedReport.py script, in combination with the previous MultiRunSimulation.py script, allows overcoming memory limitations by chopping a large simulation into smaller chunks. This is one way to get better statistics while running a report. However, processing the report may be very time consuming, especially if there are many files, since this is done sequentially. Moreover, the report combines all individuals into a single report, so the number of individuals in the report may not match the study size. Finally, the report is textual. The system provides other tools that offer further flexibility in reporting results, discussed next.
The script in focus for this topic is MultiRunSimulationStatisticsAsCSV.py.
Previous reports were textual with fixed-width tables, yet since most reports in the system are tabular, it makes sense to create the report as a spreadsheet. A common method of representing such reports textually is the CSV format, which stands for Comma Separated Values. In this format, each cell in a row is separated from its neighbors by commas, and a new line indicates a new row in the spreadsheet. Spreadsheet applications can open this file, and the user can then manipulate it further if needed.
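For instance, Python's standard csv module writes and reads this format directly; the column names below are invented purely for the demonstration:

```python
import csv
import io

# Write two rows: commas separate neighboring cells, newlines separate rows
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Parameter', 'Mean', 'STD'])
writer.writerow(['Age', '42.5', '3.1'])

# Read the same text back into rows of cells
text = buffer.getvalue()
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

A spreadsheet application opening the same text would show a 2x3 grid with the header row on top.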
The script in focus is able to generate such a CSV report from a data definitions zip file with results. Moreover, this script can do this for multiple files generated by MultiRunSimulation.py and generate additional statistics in a summary report. Furthermore, this script allows processing this information in parallel and cutting down computation time significantly if computing power is available.
It is possible to invoke the script without input parameters in the command line and enter them manually. Yet it is usually invoked from the command line as follows:
python MultiRunSimulationStatisticsAsCSV.py FilePattern ResultsID OptFile OutPrefix
Note that the last three command line parameters are optional and can be omitted. Here is a description of these inputs:
FilePattern: The file pattern that describes the file or files to be processed. Note that this input defines the processing tasks the script will undertake.
If FilePattern is a single zip file, such as Model.zip, the system will generate a single CSV file with the same name, replacing the suffix to indicate a CSV report. This is useful for running many such reports in parallel on different CPU cores.
If FilePattern includes wildcards that expand to multiple zip files, such as "Model_*.zip", the system will generate a CSV report for each file that matches the pattern and then an additional CSV report that summarizes these CSV files, providing statistics over all files. Note that the double quotes around the file pattern are important to prevent the Linux operating system from expanding the pattern before passing it to the program. Note that computations are performed serially for each zip file before the report is created; this is much more time consuming than the parallel form.
If FilePattern includes wildcards that expand to multiple CSV files, such as
"Model_*.CSV", the system will generate only the statistics report that summarizes these CSV files, providing statistics over all files. Again, note that the double quotes are important on Linux. This form is useful in a parallel computing environment when each CSV report was already computed from its zip file in parallel, as described previously.
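The wildcard matching itself follows the usual shell-style rules, which can be illustrated with Python's fnmatch module (the file names are just examples):

```python
from fnmatch import fnmatch

# When the pattern is quoted on Linux, the shell does not expand it,
# so the script sees the pattern itself and performs matching like this.
files = ['Testing_0.zip', 'Testing_1.zip', 'Testing_0.csv', 'Other.zip']
matched = [f for f in files if fnmatch(f, 'Testing_*.zip')]
print(matched)
```

Only the two zip files with the Testing_ prefix match; the CSV file and the differently named zip file are excluded.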
ResultsID: This parameter defines the simulation result set ID to process in each file in
FilePattern. Note that the system assumes that the results were generated by MultiRunSimulation.py and that all results are for the same project and therefore have the same
ResultsID in each file. Typically, the
ResultsID will be 1 for a data definitions file without previous results. The default value is
None, meaning the first result set is selected, typically result set 1.
OptFile: This is the report options file that can be generated and saved through the report form in the GUI. It contains report parameters of interest and calculation methods; it also contains information about stratification. Note that
DetailLevel and other report options, such as number format, are ignored, since these are not relevant for CSV reports. It is recommended to create such a file after running a small simulation in the GUI and compiling the report. If this file is not specified, all parameters will be used, with the system trying to determine the calculation method automatically without stratification.
OutPrefix: This defines the prefix of the output summary statistics filenames. If no prefix is defined, the system will use the common prefix of the input file names for the summary output. Five files are generated as summary output, all having the same prefix and ending with: Mean.csv, STD.csv, Median.csv, Min.csv, Max.csv. Each such file reports the corresponding statistic over all the files that fit the
FilePattern specified. Note that
OutPrefix does not influence the individual CSV file generated for each zip file, which will have the same name as the zip file.
This script enables processing reports for multiple result files and can be invoked on a machine with a single CPU or in a parallel processing environment. Here are examples that build upon the results from the previous example created by MultiRunSimulation.py:
Example for running simulation statistics in serial:
python MultiRunSimulationStatisticsAsCSV.py "Testing_*.zip"
This will generate 8 CSV files: Testing_0.csv, Testing_1.csv, Testing_2.csv, Testing_Max.csv, Testing_Mean.csv, Testing_Median.csv, Testing_Min.csv, Testing_STD.csv. The first 3 files contain a report of the results from the corresponding zip file. The last 5 files gather information from these 3 files and calculate a specific statistic over them using the functions Max, Mean, Median, Min, STD. Note that a CSV report will look rotated compared to a textual report, since columns become rows and vice versa. In the generated CSV reports, each row represents a different parameter and calculation, and each column represents a different time step within a stratification cell. If there are several stratification cells, these appear as column blocks, each starting with a mostly blank column defining the stratification. Note that the first few columns/rows contain headers. The statistics files also contain an additional row at the end giving the number of repetitions from which the information was extracted; this is helpful for figuring out how much information was available to construct the statistics. Note that, in contrast to MultiRunCombinedReport.py, which combines all results and then generates a textual report on the combined population size, MultiRunSimulationStatisticsAsCSV.py generates a CSV report for each result set using the originally specified population size and provides statistics on what happens when the simulation is repeated several times.
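The summary step can be sketched with Python's statistics module. This is a hedged illustration, not the script's actual code: one cell is read from the same position in each per-file report, and the five statistics are computed across the files. The exact standard deviation flavor (population vs. sample) used by the script is an assumption here.

```python
from statistics import mean, median, pstdev

# Hypothetical values of one report cell taken from Testing_0/1/2.csv
cell_values = [10.0, 12.0, 11.0]

summary = {
    'Mean':   mean(cell_values),
    'STD':    pstdev(cell_values),   # population STD; an assumption
    'Median': median(cell_values),
    'Min':    min(cell_values),
    'Max':    max(cell_values),
}
print(summary)
```

Repeating this per cell yields the five summary CSV files, each holding one statistic for every cell position.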
The same example as above can be repeated by running the script several times in parallel with different input parameters. To do this, the following commands should be run in parallel; this can be simulated by running the commands from multiple console/terminal windows:
python MultiRunSimulationStatisticsAsCSV.py Testing_0.zip
python MultiRunSimulationStatisticsAsCSV.py Testing_1.zip
python MultiRunSimulationStatisticsAsCSV.py Testing_2.zip
Once all the above scripts have finished, run the collection script:
python MultiRunSimulationStatisticsAsCSV.py "Testing_*.csv"
The first 3 commands will create a single CSV report for each zip file, while the last command will create the 5 summary statistics CSV files from those single-file CSV reports. The results are similar to running the computation serially, while gaining the advantage of utilizing computing power to cut down overall computation time. This advantage is significant in a High Performance Computing (HPC) environment where this script is executed on a cluster, as will be shown later on.
The script in focus for this topic is AssembleReportCSV.py.
Typically, simulations reproduce a few different scenarios that should be compared. For example, the results of a control group need to be compared to the results of an intervention group in a simulated clinical study. Once results are available, the user will want to see the results side by side in the same report, using similar terminology. Alternatively, a user may want to compare simulation results to the actual results obtained from a clinical trial. Also, the user may just want to narrow down the amount of information from a single CSV report file to compare specific time frames and stratifications in a certain order from a much larger list.
The system provides some support to accommodate such comparison and visualization through the AssembleReportCSV.py utility.
The AssembleReportCSV.py utility assumes that MultiRunSimulationStatisticsAsCSV.py created summary simulation reports as CSV files, and that these files are to be combined into a single file that compares specific columns from those CSV files, possibly including reference columns from other files with a similar format.
The script is always invoked from the command line in the following format:
python AssembleReportCSV.py AssemblySequence OutputFileName
AssemblySequence is an elaborate structure that allows the user to select specific columns from specific input files in a specific order. The assembly sequence is of the form
[ ColumnTuple1, ColumnTuple2, ...]. The user can specify this sequence within double quotes on the command line, or place it in a text file and pass the filename as a command line parameter instead. Each member of the assembly sequence is a tuple enclosed in parentheses of the form
(FileName, Key1, Key2, Stratification, Title) where:
FileName is the CSV filename from which to extract the column, enclosed in quotes.
Key1: The start step of the interval of interest. This information is required and should be enclosed in quotes.
Key2: The end step of the interval of interest. This information is required and should be enclosed in quotes.
Stratification: This is an optional parameter that can be skipped or omitted by specifying an empty string. Otherwise, it allows specifying a stratification cell of interest by string. The string should exactly match the stratification string in the CSV report that starts with
'Stratification -' and should be enclosed in quotes. This information allows the system to select a specific column by stratification cell. If skipped, the time intervals from the first stratification cell encountered are used.
Title: An optional parameter that can be omitted. If specified as a string in quotes, this string is used as the column title. This allows the user to specify a title that distinguishes columns textually and gives a meaningful explanation of the column, and is therefore recommended.
OutputFileName is the name of the output CSV file where the collected columns will be placed.
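Because the assembly sequence is written as a valid Python literal, its structure can be parsed with the standard library. The sketch below is an illustration, not the utility's actual code; the file names and titles are examples only.

```python
import ast

# A small assembly sequence in the same textual form as on the command line
sequence_text = ("[('Testing_Mean.csv','',''),"
                 " ('Testing_0.csv','0','0','','Simulation 1 result')]")

# literal_eval safely parses lists, tuples, and strings (no code execution)
sequence = ast.literal_eval(sequence_text)

for item in sequence:
    file_name, key1, key2 = item[0], item[1], item[2]
    title = item[4] if len(item) > 4 else ''   # Stratification/Title optional
    print(file_name, key1, key2, title)
```

Placing the same text in a file and passing the filename instead, as described above, would parse identically.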
The report generated is very similar to the previous CSV reports, with the difference that it can extract columns from multiple files and provides a title for each such column. The output file contains the following information for each column: the user-specified title, the file name from which the column was extracted for reference, the stratification requested by the user, the project name that generated the results, the model name used in the project, the population set name used in the project, the start step of the interval, the end step of the interval, many rows with parameter statistics, and the repetitions count.
To make the report readable, it is recommended to extract the first two header columns by including the following tuples at the beginning of the sequence:
('FileName','',''), ('FileName','Start Step','End Step'). Note that this assumes that
<Header> was selected as the first parameter in the report options file, which is the default.
Here is an example that builds again on the simulations we conducted using MultiRunSimulation.py and on reports we created using MultiRunSimulationStatisticsAsCSV.py beforehand.
Type in the following command:
python AssembleReportCSV.py "[('Testing_Mean.csv','',''), ('Testing_Mean.csv','Start Step','End Step'), ('Testing_0.csv','0','0','','Simulation 1 result'), ('Testing_1.csv','0','0','','Simulation 2 result'), ('Testing_2.csv','0','0','','Simulation 3 result'), ('Testing_Mean.csv','0','0','','Mean of 3 simulations') , ('Testing_STD.csv','0','0','','STD of 3 simulations'), ('Testing_0.csv','1','1','','Simulation 1 result'), ('Testing_1.csv','1','1','','Simulation 2 result'), ('Testing_2.csv','1','1','','Simulation 3 result'), ('Testing_Mean.csv','1','1','','Mean of 3 simulations') , ('Testing_STD.csv','1','1','','STD of 3 simulations'), ('Testing_0.csv','2','2','','Simulation 1 result'), ('Testing_1.csv','2','2','','Simulation 2 result'), ('Testing_2.csv','2','2','','Simulation 3 result'), ('Testing_Mean.csv','2','2','','Mean of 3 simulations') , ('Testing_STD.csv','2','2','','STD of 3 simulations'), ('Testing_0.csv','3','3','','Simulation 1 result'), ('Testing_1.csv','3','3','','Simulation 2 result'), ('Testing_2.csv','3','3','','Simulation 3 result'), ('Testing_Mean.csv','3','3','','Mean of 3 simulations') , ('Testing_STD.csv','3','3','','STD of 3 simulations')]" Testing_Out.csv
This example demonstrates the use of this script to place the results from each of the 3 simulations at all 3 years side by side. It also compares those results to the Mean and STD statistics extracted for those 3 simulations.
Note that the user can specify a reference CSV file from which specific columns can be included. Also note that the system does not check whether the rows match; it simply selects columns from multiple files and assembles them together. It is up to the user to make sure the columns and their definitions match between files. With good organization of the data, CSV reports can now be read by a human or reused to create graphical plots as described hereafter.
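Conceptually, the assembly step can be sketched in a few lines of Python. This is an illustrative stand-in, not the AssembleReportCSV.py implementation; the function name and its in-memory inputs are hypothetical, but it mirrors the key behavior described above: columns are picked from several files, each gets a title, and no row consistency check is made.

```python
import csv
import io

def assemble_columns(sources, picks):
    """Assemble selected columns from several in-memory CSV tables side by side.

    sources: dict mapping a file name to its CSV text.
    picks:   list of (file_name, column_index, title) tuples; the title
             becomes the first row of the assembled column.
    """
    columns = []
    for name, col, title in picks:
        rows = list(csv.reader(io.StringIO(sources[name])))
        columns.append([title] + [row[col] for row in rows])
    # Transpose the columns into output rows; row matching between files
    # is left to the user, just as with the real script.
    return [list(row) for row in zip(*columns)]
```

Used on two toy files, the first pick supplies the header column and the others supply data columns, exactly as in the Testing_Mean.csv example above.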
The script in focus for this topic is CreatePlotsFromCSV.py.
Once a CSV report is assembled, it is possible to plot graphs from it using a spreadsheet or other external tools. However, in many cases there is a need to create the same plot repeatedly in an automated way, without manipulating the CSV file after its creation. To support this, the system provides the utility CreatePlotsFromCSV.py.
This utility relies on the format produced by AssembleReportCSV.py, since it expects the first row to contain a title and the first two columns to contain header columns with the parameter and calculation method. The script produces a plot where the X and Y axis values are selected by the user by specifying a parameter and a calculation method. The script is sensitive to the titles provided in the first row and defines these as different series with different legends in the plots. It can also generate several plots together.
This script is invoked with the following command line:
python CreatePlotsFromCSV.py InputFileName OutputFileName PlotSequence
InputFileName is the name of the file generated by AssembleReportCSV.py that contains the data to display.
OutputFileName is the name of the PDF document file that will contain the plots, each plot on a different page.
PlotSequence is a file name or a string representing the graphs to be made, of the form [ParamList, LegendList, StyleList] where:
ParamList defines which parameters are of interest in the plot,
LegendList defines the titles of interest,
StyleList defines the color, line type and marker to use in the plot for different legends.
ParamList is of the form [ParamDataX, ParamDataY1, ParamDataY2, ...] where each element defines a specific row in the input CSV file from which data will be extracted. The first element is considered the series for the X axis values in the plot and is therefore referred to as ParamDataX. Each successive ParamData# defines the Y axis values for a new plot and is therefore named ParamDataY1, ParamDataY2, and so on. The definition of each element ParamData# is the same and is normally a tuple: (ParamName, ParamCalcMethod, AxisTitle):
ParamName is the name of the parameter to display; it should be enclosed in quotes and corresponds to the names that appear in the first column of the input file.
ParamCalcMethod is the short name for the calculation method and should correspond to the value in the second column of the CSV file. It is a required identifier for the plot series: since each parameter can be calculated several times using several calculation methods, both ParamName and ParamCalcMethod are needed to define the correct row of values in the input CSV file.
AxisTitle is a string enclosed in quotes that the user can specify to give a new name to the set of numbers in the row, to appear on the axis or in the legend. Alternatively, the user can specify an empty string, in which case the system will use the combined ParamName and ParamCalcMethod as the default axis title.
The first ParamData# parameter stands for the X axis. This X axis will be used for all the plots that follow it, and each subsequent ParamData# will define a new plot with a new Y axis. However, it is possible to bundle several parameters together in a single plot, or to specify a separate X axis for each plot, by creating a nested ParamData# list. If this is done, the system will treat the nested list differently: the first ParamData# element will define the new X axis, and all subsequent ParamData# elements will be plotted against that new X axis, all on the same plot on the same page. So nesting allows comparing different parameters, or calculation methods, or changing the X axis for a plot. Note that nesting is possible for 1 level only.
LegendList is composed of strings, enclosed in quotes and separated by commas. The system will extract information for plots only from elements in the LegendList. Each such element will be displayed as a different series in the same plot with the matching legend. Note that this defines which columns will be chosen from the input CSV file, yet the order in which those columns appear in the series is not changed from the CSV file. Also note that in case of a nested ParamList, the name of the title will be added to the legend, to separate series by legend as well as by parameter and calculation method. In all cases, different series will look different according to the sequence specified in StyleList.
StyleList is a list of strings enclosed in quotes and separated by commas. These strings determine the appearance of the line type, the color and the marker for each series in a plot. Each string is a format string where line and marker type are defined by characters from the list '-','--','-.',':','.',',','o','v','^','<','>','1','2','3','4','s','p','*','h','H','+','x','D','d','|','_', and color is defined by a character from the list 'b','g','r','c','m','y','k','w'. Combining these together creates a specific format for the line. If this list is not defined or is too short, the system will use an internal sequence of format strings. Additional information is available on this web site.
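These format strings follow the matplotlib convention of packing a color character together with line-style and marker characters into one short string. As an illustrative sketch (this parser is hypothetical, not part of the system), here is how a string such as 'r--' decomposes into its parts:

```python
COLOR_CHARS = set('bgrcmykw')
LINE_STYLES = ('--', '-.', '-', ':')  # two-character styles must be checked first
MARKER_CHARS = set('.,ov^<>1234sp*hH+xDd|_')

def parse_fmt(fmt):
    """Split a format string such as 'r--' or 'ko' into (color, line, markers)."""
    color, line, markers = '', '', ''
    rest = fmt
    while rest:
        if rest[0] in COLOR_CHARS and not color:
            color, rest = rest[0], rest[1:]
            continue
        for style in LINE_STYLES:
            if rest.startswith(style) and not line:
                line, rest = style, rest[len(style):]
                break
        else:
            if rest[0] in MARKER_CHARS:
                markers, rest = markers + rest[0], rest[1:]
            else:
                raise ValueError('unrecognized format character: ' + rest[0])
    return color, line, markers
```

So 'r--' means a red dashed line, 'ko' means black circle markers with no connecting line, and so on for the styles used in the example below.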
The next example will demonstrate plot generation from the CSV file previously created by the AssembleReportCSV.py example.
python CreatePlotsFromCSV.py Testing_Out.csv Testing_Out.pdf "[ [('','Start Step',''), ('Alive','Sum All',''),[ ('Age','Avg All',''), ('Alive','Sum All',''), ('Dead','Sum All','')]] , ['Simulation 1 result', 'Simulation 2 result', 'Simulation 3 result', 'Mean of 3 simulations'] , ['r-','g-','b-','k-', 'r--','g--','b--','k--'] ]"
This command will create a PDF file with two plots. The first will show the number of alive people per year for each simulation and for the average of the 3 simulations. The second plot will also show the number of deaths, on the same plot, where the X axis is age.
This plot script can be included in other scripts to build elaborate graphical reports as will be demonstrated later.
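Since the PlotSequence argument is a Python-style literal, one convenient pattern when driving this utility from another script is to build the sequence as real Python lists and serialize it with repr(). The sketch below uses the file names from the example above; the launch call is shown commented out since it requires the script and input file to be present.

```python
# Build the three PlotSequence components as ordinary Python objects.
param_list = [('', 'Start Step', ''), ('Alive', 'Sum All', '')]
legend_list = ['Simulation 1 result', 'Mean of 3 simulations']
style_list = ['r-', 'k-']

# repr() keeps the nesting and quoting correct, which is far less error-prone
# than hand-editing the long quoted string on the command line.
plot_sequence = repr([param_list, legend_list, style_list])

# The utility can then be launched with the assembled argument:
# import subprocess, sys
# subprocess.run([sys.executable, 'CreatePlotsFromCSV.py',
#                 'Testing_Out.csv', 'Testing_Out.pdf', plot_sequence])
```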
The script in focus for this topic is SlurmRun.py.
The utility scripts above can be used to conduct simulations, generate reports, and even create graphical plots. These utilities run on both Linux and Windows. They can also work in a High Performance Computing (HPC) environment, where they can be executed on a cluster of computers. Although the system can potentially run on several HPC environments, the HPC environment of choice for the system is SLURM. More information on SLURM can be found here.
If you have SLURM installed on a computing cluster that also has all required packages installed on it, the system provides the SlurmRun.py script that executes a complete simulation and reporting mechanism.
Note, however, that contrary to other scripts that receive input parameters when run and should not be changed, this script is a Python program that should be changed by the user to adapt it to their needs. So it is assumed that the user has at least a basic understanding of Python and programming. This tutorial may be helpful for getting acquainted with Python.
SlurmRun.py starts with a set of definitions that are intended to be changed by the user. After these are defined, the system will run the simulation in parallel in 3 main phases. These phases include several sub-phases that will be described hereafter:
Using these phases, the system can run many simulations in parallel and collect results from many scenario variations. To control the simulation, the user changes parameters in the scenario definition section at the top of the script. These parameters are:
Scenario: The name for the simulation job you are running.
FileNamePrefix: The name of the zip file that holds the data definitions of the projects to run.
MailFinalResultsTo: The email address you want the results to be sent to.
Phase3Environemnt: The SLURM environment parameters for the SLURM sbatch command that you want the simulations to run with. This includes time, memory, machine allocation and many other parameters that should be determined together with the cluster administrator.
RunPhase3C: These are Boolean parameters that allow the user to control which phases to run. These should normally all be set to True. However, in some cases it is useful to have this control, especially in cases where recovery is needed.
Repetitions: The number of times to repeat each scenario variation simulation. Note that there may be several scenario variations, so the number of simulations in Phase 1 is controlled by this number and by the number of scenario variations.
SimulationTimeOverride: This parameter can be used to override the number of simulation steps defined in the project to be run. Use 'None' to avoid changes.
PopulationRepetitionsOverride: This parameter can be used to override the size of the population generated in the simulation by overriding the project definition of population repetitions. Use 'None' to avoid changes.
OptionsTreatment: These are lists of option categories that are used to define a scenario variation. Generally, elements in these lists are tuples of the form (ParameterOverrideString, TitleComponent). ParameterOverrideString provides a subset of parameter values to use with MultiRunSimulation.py to override initialization values of coefficients in the project in simulation stage 0. This requires the project definition to accept these overrides. Note that these option groups are later merged to create all possible combinations of the options entered when scenario variations are determined. These options are also used during report creation to define the title from the components defined in TitleComponent. For example, if the project accepts a single coefficient parameter, existing in stage 0 of the simulation, that defines whether biomarkers change during simulation, we can define [('0','NoBioChange'), ('1','WithBioChange')]. With this example, SlurmRun.py will run both scenario options and combine them with other possible scenario options to create scenario variations. Each scenario variation created will have a title that contains either the title component NoBioChange or WithBioChange. Such titles will appear at the top of reports. Note that if there is no variation in a specific option, then the system will not include a title component for it, to avoid unnecessarily long title strings. As an extended example, if the project also accepts a coefficient that defines whether the simulation should be run with or without treatment, then the user can run both options in parallel using SlurmRun.py if OptionsTreatment is defined as [('0','NoTreatment'), ('1','WithTreatment')]. Note that OptionsTreatment will be combined with OptionsBioMarker so that 2x2=4 scenario variations will be created. These scenario variations will have the following titles: 'NoBioChange NoTreatment', 'NoBioChange WithTreatment', 'WithBioChange NoTreatment', and 'WithBioChange WithTreatment'. These scenario variations may be combined further with other options to create even more scenario variations.
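The merging of option groups into scenario variations can be sketched with itertools.product. This is an illustrative sketch only; the function name is hypothetical, and SlurmRun.py additionally drops title components for options with no variation, which this sketch omits.

```python
from itertools import product

# The biomarker and treatment option groups from the example above.
bio = [('0', 'NoBioChange'), ('1', 'WithBioChange')]
treat = [('0', 'NoTreatment'), ('1', 'WithTreatment')]

def scenario_variations(*option_groups):
    """Merge option groups into every scenario variation.

    Each group is a list of (override_string, title_component) tuples;
    each variation is returned as (override list, combined title).
    """
    variations = []
    for combo in product(*option_groups):
        overrides = [override for override, _ in combo]
        title = ' '.join(component for _, component in combo)
        variations.append((overrides, title))
    return variations
```

For the two groups above this yields the 2x2=4 variations whose titles are listed in the text.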
StratifyBy: If stratification of the results is required in the report, this string will hold the stratification table for report generation.
Stratifications: A list of stratifications of interest, in the form (StratificationString, TitleComponent), where StratificationString is the title string generated in the report that corresponds to a specific cell table specified in StratifyBy, and TitleComponent is the stratification title to combine with other title components if this stratification is used in the final report. Note that this does not increase the number of simulations, yet it increases the size of the report.
ModelsToUse: The population override and model override for the project. This option allows running the same simulation with multiple population sets and multiple model overrides without changing the project. The populations are defined as tuples of the form (OverrideNumberAsString, TitleComponent). OverrideNumberAsString holds the population/model number to override, as a string enclosed in quotes. If OverrideNumberAsString is provided in brackets, the internal code of the population set/model is used; otherwise the sort order in the GUI is used. TitleComponent is a string to use when the report title is assembled by the system from all options. It is up to the user to make sure the override projects/models are reasonable.
ProjectsToUse: Allows the user to define the project number to run in different scenario variations. Again, a tuple of the form (OverrideNumberAsString, TitleComponent) is used. OverrideNumberAsString holds the project number to run from the model zip file, where the first project is indexed as 0. TitleComponent determines the part of the title for the scenario variation that uses this option. If several projects are used, it is up to the user to set them up so they return results in the same format and are compatible for combining in a report.
Inclusions and Exclusions: These are lists of tuples of strings that indicate which options should be included together and which options should be excluded when building the scenario variations. For example, to include only scenario variations where both biomarkers and treatment, or neither, are simulated, while disallowing other scenarios, we can define this by using Inclusions = [('NoBioChange', 'NoTreatment'), ('WithBioChange', 'WithTreatment')] or by defining Exclusions = [('NoBioChange', 'WithTreatment'), ('WithBioChange', 'NoTreatment')]. For each tuple in Inclusions, the system will make sure each scenario variation title that will be executed includes all the tuple components. For each tuple in Exclusions, the system will make sure each scenario variation title that will be executed does not include all the tuple components. By using Exclusions it is possible to reduce the number of scenario variations and keep only combinations of options that may interest the user; otherwise the number of scenario variations may be very large and impractical to simulate and visualize.
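One plausible reading of these rules can be sketched as a hypothetical filter over variation titles: a variation survives if it matches all components of at least one Inclusions tuple (or Inclusions is empty) and does not match all components of any Exclusions tuple.

```python
def passes_filters(title, inclusions, exclusions):
    """Decide whether a scenario variation title survives the filters.

    inclusions: if non-empty, the title must contain all components of at
                least one tuple (this keeps exactly two of the four
                biomarker/treatment variations in the example above).
    exclusions: the title must not contain all components of any tuple.
    """
    words = title.split()
    if inclusions and not any(all(c in words for c in t) for t in inclusions):
        return False
    return not any(all(c in words for c in t) for t in exclusions)
```

With the example lists above, both filters keep 'NoBioChange NoTreatment' and 'WithBioChange WithTreatment' and drop the two mixed variations.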
MaxDimensionsToAllowVariation: This parameter allows limiting the number of scenario variations by allowing only a certain number of changes in options from the first scenario variation defined. For example, if we set MaxDimensionsToAllowVariation = 1 with the biomarker and treatment example, without Exclusions defined, then we will get only 3 variations: 'NoBioChange NoTreatment', 'NoBioChange WithTreatment', and 'WithBioChange NoTreatment', since these change at most one dimension from the original scenario variation. Note that 'WithBioChange WithTreatment' will not be included since it changes both parameters from the original scenario variation. Note that the original scenario variation is determined by the first tuple defined for each option.
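The dimension limit amounts to counting option slots that differ from the baseline (first) variation; a hypothetical helper, not the script's actual code:

```python
def within_max_dimensions(variation_titles, baseline_titles, max_dims):
    """Count the option slots where the variation differs from the first
    (baseline) scenario variation and test the count against the limit."""
    changes = sum(1 for v, b in zip(variation_titles, baseline_titles) if v != b)
    return changes <= max_dims

# Baseline: the first tuple of each option group in the example above.
baseline = ('NoBioChange', 'NoTreatment')
```

With max_dims = 1 this admits exactly the 3 variations named in the text and rejects 'WithBioChange WithTreatment'.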
ReportReferenceFileName: This parameter allows the user to define a reference CSV file name from which the first two columns and reference values can be extracted and combined into the final report. It is useful to show known study results together with simulation results. However, the reference file should have the same format as a CSV report created by the system to allow its assembly. If an empty string is used then the system will select the first two title columns from the Mean CSV file of the first scenario.
ReportReferenceColumnTuple: This parameter allows the user to define which columns to extract from the reference file. The tuple will be of the form (Key1, Key2), where Key1 and Key2 are the column titles found in rows 4 and 5 of the column to be extracted. If ReportReferenceFileName is left blank,
SummaryReportTimes: This is a list of tuples of time step intervals to be shown in the summary report. Each tuple consists of (StartTimeStep, EndTimeStep), where StartTimeStep and EndTimeStep are strings enclosed in quotes that are expected to be generated by SummaryIntervals, which is defined later. Note that SummaryIntervals may include other time intervals relevant to creating the yearly report, while SummaryReportTimes selects only the intervals relevant for the final summary report, not the yearly summary report.
SummaryIntervals: A full list of intervals for the report. It is a list of numbers and sequences used to process reports. See the help on Reports for further details. Note that to generate yearly summary intervals, it is recommended to include the number 1 in the sequence, to generate yearly results for the yearly report and the plot.
ColumnFilter: A list of parameters and calculation methods for reports. It is defined as a sequence of tuples
(ParameterName, ParameterCalculationMethod, ReplacementTitleName). See the help on Reports for further details. It is recommended to create an .opt file with report options and extract this filter from the file rather than construct it using the editor.
PlotFilter: Instructions for CreatePlotsFromCSV.py to create plots from the final yearly report. This is only the ParamList component of the PlotSequence input parameter to the CreatePlotsFromCSV.py script. It should contain components from ColumnFilter, where the calculation method is replaced by the short version of the calculation method name. Note that the LegendList component is not required, since it will be calculated automatically from the titles of the scenario variations.
PlotStyles: A sequence of strings representing the color, line style, and markers of different scenario variations. This is only the StyleList component of the PlotSequence input parameter to the CreatePlotsFromCSV.py script.
ReportFilterFileName: SlurmRun.py will create a new .opt file and save it under this name. This is created for clerical purposes to allow future manipulation of this file.
The script in focus for this topic is MultiRunExportResultsAsCSV.py.
The user may wish to process simulation results using different calculation techniques than those provided so far. Or the user may wish to store the calculations within a database that can be read by other systems. To provide such capabilities, the system provides a script to convert the simulation results from the internal zip form to CSV files that can be read by many systems, including spreadsheets and database applications.
It is possible to invoke the script without input parameters in the command line and enter them manually. Yet it is usually invoked from the command line as follows:
python MultiRunExportResultsAsCSV.py FileNamePattern ResultsID ColumnName
FileNamePattern: The file pattern that describes all the zip files to be processed. It is recommended to enclose it in double quotation marks to fit both Linux and Windows formats.
ResultsID: A mandatory parameter that defines the simulation result ID to process in each file matching FileNamePattern. Typically the results ID will be 1 - this is true when running a model file without previous results; nevertheless, the user can choose a different results set that exists in all the zip files.
ColumnName: Optional column names that exist in the result set. The user can provide as many column names as needed, separated by spaces; these column names correspond to parameter names to be exported to the CSV file. If no column is defined, then the system will export all parameters calculated during simulation to the output file.
The output file name for each file matching FileNamePattern will be the same as the input file name with the .zip ending replaced by Results.csv. The first line in this output file will contain the parameter names to allow easier visualization and import into spreadsheet and database applications.
To demonstrate this script, here is an example that is based on the results from the previously described example of MultiRunSimulation.py. Try running the script with the following line:
python MultiRunExportResultsAsCSV.py "Testing_*.zip" 1 IndividualID Repetition Time Age Alive Dead
The system will create 3 files: Testing_0Results.csv, Testing_1Results.csv, and Testing_2Results.csv. Each file will contain 6 columns corresponding to the list provided by the user. These files are easily opened with a spreadsheet application, and the results can be manipulated further by the user to create their own reports.
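Since the first line of each exported file holds the column names, the files are also convenient to process programmatically, for example with Python's csv module. The data below is an invented miniature stand-in with the same header layout as the example above, not actual simulation output.

```python
import csv
import io

# A miniature stand-in for one exported file; the real Testing_0Results.csv
# has the same first-line header layout with the user-selected columns.
exported = io.StringIO(
    "IndividualID,Repetition,Time,Age,Alive,Dead\n"
    "1,0,0,67,1,0\n"
    "1,0,1,68,1,0\n"
    "2,0,0,72,1,0\n"
)
# DictReader uses the first line as field names automatically.
rows = list(csv.DictReader(exported))

# Further manipulation is now straightforward, e.g. counting alive records:
alive_count = sum(int(r['Alive']) for r in rows)
```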