Settings used for cohort analysis. Many of these settings can be overridden in the __init__ calls of the cohort classes.
Path to base directory for the wikipride project
Colormap to use. Has to be a valid name in matplotlib.pyplot.cm.datad
Path to directory for cohort data
Filter out known bots?
The language of the Wikipedia (e.g. ‘en’, ‘pt’)
The Mongo query variables used to aggregate the data.
The name of the collection
The name of the db
Reads the configuration from ConfigParser instance into the runtime settings.
Parameters: configfile – A file that can be read by a ConfigParser instance
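As a rough illustration of copying values from a ConfigParser instance into runtime settings, the sketch below parses a small configuration string. The section name `General`, the option names `language` and `filterbots`, and the helper `read_config` are assumptions made for this example; the project's actual configuration layout is not documented here.

```python
import configparser

# Hypothetical configuration text; the section and option names are
# illustrative assumptions, not the project's actual layout.
EXAMPLE_CONFIG = """
[General]
language = pt
filterbots = True
"""


def read_config(parser):
    """Copy selected options from a ConfigParser instance into a dict."""
    return {
        "language": parser.get("General", "language"),
        # getboolean understands values such as True/False, yes/no, 1/0
        "filterbots": parser.getboolean("General", "filterbots"),
    }


parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONFIG)
runtime_settings = read_config(parser)
```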
Path to store report
The path to the MySQL configuration for logging into the server, e.g. ‘~/.my.cnf’
If True, every table that we attempt to create is dropped first if it already exists. If False, only tables that do not exist yet are created
The host name of the MySQL server
The name of the database on db host where the aggregated tables will be stored. On the toolserver, this is the username prefixed with u_, e.g. u_delcerambaul
The name of the database on db host where the MediaWiki database is stored. On the toolserver, this is for example ptwiki_p
List containing all YM (e.g. ‘200401’ for January 2004) that we want to analyze
A dictionary mapping each YM in time_stamps to its array index (used to access numpy arrays)
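The relationship between the YM list and its index dictionary can be sketched as follows. The helper `month_range` and the name `time_stamps_index` are hypothetical and only serve this example; `time_stamps` is the list described above.

```python
def month_range(start_ym, end_ym):
    """Return every 'yyyymm' string from start_ym to end_ym, inclusive."""
    year, month = int(start_ym[:4]), int(start_ym[4:])
    end_year, end_month = int(end_ym[:4]), int(end_ym[4:])
    stamps = []
    while (year, month) <= (end_year, end_month):
        stamps.append("%04d%02d" % (year, month))
        month += 1
        if month > 12:
            # roll over to January of the next year
            year, month = year + 1, 1
    return stamps


time_stamps = month_range("200401", "200503")
# Map each YM to its position, e.g. to address a column of a numpy array
time_stamps_index = {ym: i for i, ym in enumerate(time_stamps)}
```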
Path to directory for user lists
Path to store wikipride visualizations of user defined cohorts
This module defines individual cohorts. All cohorts are subclasses of Cohort, and they should override the non-generic methods defined in the parent class. Different kinds of cohorts have been defined.
This module defines the abstract class Cohort. All cohort definitions must inherit this class.
Inheritance diagram: cohorts.base.Cohort
Abstract class that defines common properties of cohorts, which are defined in the cohorts modules
Adds a line to the matplotlib figure passed as argument. The dimension of the data has to match the length of time_stamps. It is assumed that the figure contains only one axes.
Returns: matplotlib figure
Iterates over the SQL data and calls self.processSQLrow(), which needs to be implemented by the concrete cohort subclass
Dictionary that contains the data, {name : numpy.array}. Different aggregates can be saved, for example ‘bytesadded’, ‘edits’, ‘bytesremovedPerEditor’
Dictionary that holds descriptive information about self.data. For example, an ‘addedBytes’ data description might be:
self.data_description[‘addedBytes’] = { title}
This method is called at the end of an aggregateDataFromXXX() method. It allows the time series data in self.data to be manipulated. For example, ‘addedBytes’ could be divided by ‘edits’ to create a new variable ‘addedPerEdit’.
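A minimal sketch of such a post-processing step is shown below. It assumes the arrays are two-dimensional with one row per cohort and one column per month, and it chooses to leave cells without edits at zero; both are assumptions of this example, and the function name `post_process` is hypothetical.

```python
import numpy as np

# Toy data: 2 cohorts x 2 months (the shape is an assumption of this sketch)
data = {
    "addedBytes": np.array([[100.0, 0.0], [300.0, 50.0]]),
    "edits": np.array([[4.0, 0.0], [10.0, 5.0]]),
}


def post_process(data):
    """Add 'addedPerEdit', leaving cells with zero edits at 0."""
    added, edits = data["addedBytes"], data["edits"]
    per_edit = np.zeros_like(added)
    # Only divide where edits > 0 to avoid division-by-zero warnings
    np.divide(added, edits, out=per_edit, where=edits > 0)
    data["addedPerEdit"] = per_edit


post_process(data)
```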
Generates the path and file name based on properties of the cohort. Additional identifying features can be used in file names by overriding this method in subclasses of the base Cohort class.
If no destination argument is passed, the method uses the ftype argument to determine which base directory should be used. Only the name of the data feature (e.g. ‘added’) and the cohort name (e.g. AbsoluteAgePerMonth) is used in the basic method.
Returns: A path without file format
Initialize the self.data dictionary with the appropriate variable names and numpy.arrays
Initialize the self.data_description dictionary with additional information
This method produces line plots from the cohort data stored in self.data. Line plots usually illustrate trends or ratios that depend on the cohort definition, so this method does nothing in the base cohort definition and should be overridden in the cohort class itself.
Loads the data from disk. It will populate self.data with {names[i] : numpy.array}. An error is raised if there is no corresponding datafile stored
Parameters: varName – variable name; destination – str, destination directory. If None, settings will be used
The Mongo query variables used to aggregate the data. If None, all fields will be returned by mongo. If ‘settings’, the mongoQueryVars from the settings will be used
The number of colors used for the wikipride graphs. If required, it should be defined in the child class definition.
True if the bots are filtered from the cohort
Saves the aggregated numpy.arrays to CSV files. There is one file for each collected variable; its name is constructed uniquely from the properties of the variable and the cohort. The CSV layout does not follow the numpy representation because the matrix is transposed: the temporal axis is vertical instead of horizontal, so each row is a measurement for a different time unit. This format is used by the visualization library dygraphs.
Parameters: destination – str, destination directory. If None, the data directory from the settings will be used
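The transposed layout can be sketched as follows, assuming a matrix with one row per cohort and one column per month. The function name `write_transposed_csv`, the header row and the column labels are assumptions of this example; the exact header and date format the project writes are not specified here.

```python
import csv
import io

import numpy as np


def write_transposed_csv(matrix, time_stamps, labels, handle):
    """Write one CSV row per time unit and one column per cohort."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ym"] + list(labels))
    # matrix.T iterates over months; each item holds all cohort values
    for ym, column in zip(time_stamps, matrix.T):
        writer.writerow([ym] + column.tolist())


matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2 cohorts x 3 months
buffer = io.StringIO()
write_transposed_csv(matrix, ["200401", "200402", "200403"], ["c0", "c1"], buffer)
csv_text = buffer.getvalue()
```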
Saves the aggregated numpy.arrays to file. There is one file for each collected variable; its name is constructed uniquely from the properties of the variable and the cohort.
Saves a matplotlib figure to disk.
Plots the cohort trends using the famous WikiPride stacked bar chart! If normal is True, the absolute values are visualized. If percentage is True, the relative values (i.e. the percentages) are visualized. If flip is True, the numpy.array is flipped upside down, so the bars are added in reverse order and the order of the cohort labels is reversed as well.
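The percentage and flip options amount to two array transformations applied before plotting. The sketch below shows only that data preparation, not the plotting itself; the function name `prepare_stack` and the cohort-by-month array layout are assumptions of this example.

```python
import numpy as np


def prepare_stack(matrix, percentage=False, flip=False):
    """Return the values to stack: rows are cohorts, columns are months."""
    values = matrix.astype(float)
    if percentage:
        # Share of each cohort in the monthly total, in percent
        totals = values.sum(axis=0)
        shares = np.zeros_like(values)
        np.divide(values, totals, out=shares, where=totals > 0)
        values = shares * 100.0
    if flip:
        # Reverse the cohort order so bars are stacked in reverse
        values = np.flipud(values)
    return values


matrix = np.array([[1, 2], [3, 2]])
stacked = prepare_stack(matrix, percentage=True, flip=True)
```

With flip enabled, any list of cohort labels would need to be reversed in the same way so that labels and bars stay aligned.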
This module implements age cohorts, AbsoluteAge and RelativeAge.
A cohort is the group of people that have started editing in the same month.
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Returns the index of the cohort, which is identical to the time index of the first edit
Initialize the self.data_description dictionary with additional information
Maximum number of edits by editor in a given month to be included
Minimum number of edits by editor in a given month to be included
Number of visible colors in the wikipride plots, e.g. one color for every six months
The SQL query returns edit information for each editor for each ym she has edited.
A cohort is the group of people that have started editing in the same month.
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Returns the index of the cohort, which is identical to the time index of the first edit
The user_id of the previously encountered editor as we iterate through the table
A cohort is the group of people that have the same age at the time of an edit. During the first month of editing, a contributor will be in the 1-month old cohort, then he switches to the 2-month cohort and so forth.
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit
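The index computation for relative age cohorts can be sketched as a difference of two time indices. The function name `relative_age_index` and the zero-based convention (edits in the first month map to index 0, i.e. the 1-month old cohort) are assumptions of this example.

```python
def relative_age_index(edit_index, first_edit_index):
    """Age in months at the time of the edit, counted from zero."""
    if edit_index < first_edit_index:
        raise ValueError("an edit cannot precede the editor's first edit")
    return edit_index - first_edit_index


time_stamps = ["200401", "200402", "200403", "200404"]
index_of = {ym: i for i, ym in enumerate(time_stamps)}

# An editor who started in 200402 and edits again in 200404
age = relative_age_index(index_of["200404"], index_of["200402"])
```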
Initialize the self.data_description dictionary with additional information
Graphs for relative age cohorts include
Maximum number of edits by editor in a given month to be included
Minimum number of edits by editor in a given month to be included
Number of visible colors in the wikipride plots, e.g. one color for every six months
The SQL query returns edit information for each editor for each ym she has edited.
A cohort is the group of people that have the same age at the time of an edit.
Cohort labels
Cohort definition
A cohort is the group of people that have the same age at the time of an edit. During the first month of editing, a contributor will be in the 1-month old cohort, then he switches to the 2-month cohort and so forth.
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit
The user_id of the previously encountered editor as we iterate through the table
This module implements histogram cohorts, e.g. EditsHistogram.
The cohorts are based on the number of edits they have done in a given month. It uses a table where the values are aggregated for all namespaces.
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Initialize the self.data_description dictionary with information used for plotting.
Graphs for editor activity histogram cohort include
The SQL query returns edit information for each editor for each ym she has edited.
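Assigning an editor to a histogram cohort means mapping a monthly edit count to a bin. The sketch below uses hypothetical bin boundaries; the bins actually used by the histogram cohorts are not listed here, and `edits_bin_index` is a name invented for this example.

```python
import bisect

# Hypothetical lower bounds of the edit-count bins: 1-4, 5-9, 10-99, 100+
BIN_LOWER_BOUNDS = [1, 5, 10, 100]


def edits_bin_index(edits):
    """Return the index of the bin that a monthly edit count falls into."""
    if edits < BIN_LOWER_BOUNDS[0]:
        raise ValueError("editor has no edits in this month")
    # bisect_right counts the lower bounds that are <= edits
    return bisect.bisect_right(BIN_LOWER_BOUNDS, edits) - 1


bins = [edits_bin_index(n) for n in (1, 5, 9, 10, 250)]
```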
The cohorts are based on the number of edits they have done in a given month. Implemented only for MongoDB.
Cohort definition
The cohorts are based on the number of edits they have done in a given month.
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Initialize the self.data_description dictionary with information used for plotting.
The ym at the end of the period months after the first edit of an editor
The user_id of the previously encountered editor as we iterate through the table
The number of months an editor is considered new
For simple cohorts :)
The namespaces themselves are cohorts
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Initialize the self.data_description dictionary with additional information
The SQL query returns edit information for each editor for each ym she has edited.
There is just one cohort, which contains the number of editors who started contributing in any given month.
NewEditors.linePlot() creates a line plot.
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Initialize the self.data_description dictionary with additional information
Creates a line plot for the number of new editors and saves it to disk.
Parameters: dest – str, destination directory
The SQL query returns the new editor count for each ym.
A cohort that is comprised of active editors that started editing in a given year.
Minimum number of edits per month to be included in the cohort
Cohort labels
Cohort definition
Returns the index of the cohort, which is identical to the time index of the first edit
Initialize the self.data_description dictionary with additional information
Only take time_stamps starting with self.year
The year the cohort started.
A cohort that is comprised of active editors that started editing in a given year. Only the contributions to the Wikipedia namespaces 4 and 5 are considered.
Cohort labels
Cohort definition
Returns ticks and labels for the colorbar of a WikiPride visualization
Initialize the self.data_description dictionary with additional information
Only take time_stamps starting with self.year
This module interacts with the MediaWiki SQL database.
Starting with the Wikimedia SQL database schema, this module creates a set of tables that will be used to aggregate the cohort trends.
Create an index on a SQL table in the user database
Create a SQL table in the user database
Drops a SQL table in the user database
Parameters: tablename – str, name of the table
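The create, index and drop operations can be illustrated with a small sketch. It uses the standard-library sqlite3 module as a stand-in for the MySQL server, so the SQL dialect differs from the real setup; the function names, the `drop_existing` flag, and the table and column names are assumptions of this example.

```python
import sqlite3


def create_table(con, tablename, columns, drop_existing=False):
    """Create a table, optionally dropping an existing one first."""
    # Identifiers are interpolated directly, so they must come from
    # trusted code, never from user input.
    if drop_existing:
        con.execute("DROP TABLE IF EXISTS %s" % tablename)
    con.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (tablename, columns))


def create_index(con, tablename, column):
    """Create an index on one column of a table."""
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_%s_%s ON %s (%s)"
        % (tablename, column, tablename, column)
    )


con = sqlite3.connect(":memory:")
create_table(con, "editor_year_month", "user_id INTEGER, ym TEXT, edits INTEGER")
create_index(con, "editor_year_month", "ym")
tables = [
    row[0]
    for row in con.execute("SELECT name FROM sqlite_master WHERE type='table'")
]
```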
This module holds the collection of SQL queries used for the preprocessing of the data
Query to editor centric table. For each user and each year/month, it contains the number of add/remove edits as well as the number of bytes added/removed.
Query to editor centric table. Same as EDITOR_YEAR_MONTH but including namespace. For each user and each year/month/namespace, it contains the number of add/remove edits as well as the number of bytes added/removed.
Query to editor centric table. Same as EDITOR_YEAR_MONTH but restricted to namespace 0 (main) and to pages that are not redirects. For each user and each year/month, it contains the number of add/remove edits as well as the number of bytes added/removed.
Query to time centric table. Same as TIME_YEAR_MONTH but including namespace. For each year/month/namespace, it contains the number of editors, the number of add/remove edits as well as the number of bytes added/removed.
Query to time centric table. For each year/month, it contains the number of editors, the number of add/remove edits as well as the number of bytes added/removed.
This module defines the content of a report, which consists of the following at the moment.
New editors
Histogram cohorts
Namespaces
A report consists of a collection of report items. A report item consists of a cohort instance and methods to generate the data and the plots.
Cohort instance
Creates the directory if it doesn’t exist already. The base directory is joined with the relative destination directory and returned.
Parameters: base – base directory (e.g. settings.datadirectory or settings.wikipridedirectory)
Returns: absolute path
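A minimal sketch of this behavior using the standard library follows. The function name `ensure_directory` is hypothetical, and a temporary directory stands in for the configured base directory.

```python
import os
import tempfile


def ensure_directory(base, destination):
    """Join base and destination, create the directory, return its path."""
    path = os.path.abspath(os.path.join(base, destination))
    # exist_ok=True makes the call a no-op if the directory already exists
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # stand-in for a configured base directory
report_path = ensure_directory(base, os.path.join("reports", "age"))
```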
Stores a simple CSV file in a format used by the JavaScript dygraphs library.
Generates and saves the cohort data. Calls the aggregateDataFromSQL() method from the Cohort instance passed as argument. The collected data matrices are stored in the Cohort.data attribute. The data matrices are saved as txt files in the data destination directory.
For the variable names in varNames (e.g. added, editors, ...), produces the WikiPride graphs using wikiPride(). If the cohort defines linePlots, they are also generated.
Relative path to the destination directory
The aggregation of the cohort data requires that data.preprocessing.process() has been executed, so that the data is preprocessed. The data.cohortdata.processData() method will use the report definition in report to create a directory structure that contains the data of the cohort definitions described below. The data is stored in the form of numpy matrices.
Creates a set of graphs which requires that data.report.processData() has been executed and the data thus aggregated. The data is loaded from disk.
Creates a database connection to the slave replica on alpha and implements methods for querying the database
SQL connection instance
Provides a database connection con to wikilytics.
Connection instance to the mongo db
Name of the mongo collection
Name of the mongo database
A set of utility methods that are used in different parts of the framework.
From http://www.scipy.org/Cookbook/Matplotlib/ColormapTransformations
Returns: a discrete colormap from the continuous colormap cmap.
Returns the start and end datetime objects for the yyyymm passed, i.e. the first and last day of the month
Parameters: ym – str, ‘yyyymm’ format
Returns: tuple of datetime objects
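One way to compute these bounds with the standard library is sketched below. The function name `month_bounds` is hypothetical, and both datetimes are set to midnight; whether the original helper returns the end of the last day instead is not specified here.

```python
import calendar
import datetime


def month_bounds(ym):
    """Return datetimes for the first and last day of a 'yyyymm' month."""
    year, month = int(ym[:4]), int(ym[4:6])
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    start = datetime.datetime(year, month, 1)
    end = datetime.datetime(year, month, last_day)
    return start, end


start, end = month_bounds("200402")  # February of a leap year
```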
Helper data structures for time stamps. List of all time units, i.e. every month, in yyyymm format.