Reference¶

Settings used for cohort analysis. Many of these settings can be overridden in the __init__ calls of the cohort classes.

settings.basedirectory¶: Path to base directory for the wikipride project

settings.cmapName¶: Colormap to use. Has to be a valid name in matplotlib.pyplot.cm.datad

settings.datadirectory¶: Path to directory for cohort data

settings.filterbots¶: Filter out known bots?

settings.language¶: The language of the Wikipedia (e.g. ‘en’, ‘pt’)

settings.mongoQueryVars¶: The Mongo query variables used to aggregate the data.

settings.mongocol¶: The name of the collection

settings.mongodb¶: The name of the db

settings.readConfig(configfile)[source]¶

Reads the configuration from a ConfigParser instance into the runtime settings.

Parameters:	configfile – A file that can be read by a ConfigParser instance
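
As a rough illustration of the kind of file a ConfigParser instance can read, the sketch below parses INI-style text with Python's standard configparser module (named ConfigParser in Python 2). The section names, option names and the helper load_example_settings are hypothetical; the layout that readConfig actually expects is not documented on this page.

```python
import configparser

# Hypothetical INI content; section and option names are assumptions.
EXAMPLE_CONFIG = """
[General]
language = pt
filterbots = True

[MySQL]
sqlhost = localhost
sqlwikidb = ptwiki_p
"""

def load_example_settings(text):
    """Parse INI text and return a plain dict of settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "language": parser.get("General", "language"),
        # getboolean understands values such as True/False, yes/no, 1/0.
        "filterbots": parser.getboolean("General", "filterbots"),
        "sqlhost": parser.get("MySQL", "sqlhost"),
        "sqlwikidb": parser.get("MySQL", "sqlwikidb"),
    }

example_settings = load_example_settings(EXAMPLE_CONFIG)
```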

settings.reportdirectory¶: Path to store report

settings.sqlconfigfile¶: The path to the MySQL configuration for logging into the server, e.g. ‘~/.my.cnf’

settings.sqldroptables¶: If True, all tables that we attempt to create will be dropped if they already exist. If False, only tables that do not already exist will be created.

settings.sqlhost¶: The host name of the MySQL server

settings.sqluserdb¶: The name of the database on the db host where the aggregated tables will be stored. On the toolserver, this is the username prefixed with u_, e.g. u_delcerambaul

settings.sqlwikidb¶: The name of the database on db host where the Mediawiki database is stored. On the toolserver, this is for example ptwiki_p

settings.time_stamps¶: List containing all YM (e.g. ‘200401’ for January 2004) that we want to analyze

settings.time_stamps_index¶: A dictionary mapping each YM in time_stamps to its array index (used to access numpy arrays)
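
A minimal sketch of how the two time stamp settings relate, assuming the index is simply the position of each YM in the list. The sample values are made up, and a plain list stands in for a numpy array so the snippet runs without dependencies.

```python
# Hypothetical list of year-month stamps, in the style of settings.time_stamps.
time_stamps = ["200401", "200402", "200403"]

# Map each stamp to its position, as described for settings.time_stamps_index.
time_stamps_index = {ym: i for i, ym in enumerate(time_stamps)}

# A plain list stands in for a numpy array with one entry per time stamp.
edits_per_month = [120, 95, 143]

# Look up the value for February 2004 through the index dictionary.
february_edits = edits_per_month[time_stamps_index["200402"]]
```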

settings.userlistdirectory¶: Path to directory for user lists

settings.wikipridedirectory¶: Path to store wikipride visualizations of user defined cohorts

Cohort definitions¶

This module defines individual cohorts. All cohorts are subclasses of Cohort, and they should override the non-generic methods defined in the parent class. Different kinds of cohorts have been defined.

This module defines the abstract class Cohort. All cohort definitions must inherit from this class.

Inheritance diagram: cohorts.base.Cohort

class cohorts.base.Cohort[source]¶

Abstract class that defines common properties of cohorts, which are defined in the cohorts modules

addLine(data, fig=None, label='')[source]¶

Adds a line to the matplotlib figure passed as argument. The dimension of the data has to match the length of time_stamps. It is assumed that the figure contains only one axes.

Parameters:

data – numpy.array of the same length as time_stamps
fig – matplotlib figure. If None, a new figure is created.
label – str, label for the legend. Defaults to an empty string.
Returns:	matplotlib figure

aggregateDataFromSQL(verbose=False, callback=None)[source]¶

Iterates over the SQL data and calls self.processSQLrow(), which needs to be implemented by the concrete cohort class.

Parameters:

verbose – bool, display progress on stdout
callback – function, a callback function that can be used for data transformations after the query has executed.

data¶: Dictionary that contains the data, {name : numpy.array}. Different aggregates can be saved, for example ‘bytesadded’, ‘edits’, ‘bytesremovedPerEditor’

data_description¶

Dictionary that holds descriptive information about self.data. For example, an ‘addedBytes’ data description might be:

self.data_description[‘addedBytes’] = { title}

finalizeData()[source]¶: This method is called at the end of an aggregateDataFromXXX() method. It allows the time series data in self.data to be manipulated. E.g. ‘addedBytes’ could be divided by ‘edits’ to create a new variable ‘addedPerEdit’.
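
The derived-variable idea can be sketched as follows. This is not the library's implementation: the helper added_per_edit is hypothetical, the data values are made up, and nested lists (rows are cohorts, columns are time stamps) stand in for numpy arrays. Returning 0.0 where no edits exist is an assumption made to avoid division by zero.

```python
def added_per_edit(added_bytes, edits):
    """Divide two cohort-by-time matrices element-wise, yielding 0.0 where there were no edits."""
    return [
        [a / e if e else 0.0 for a, e in zip(row_a, row_e)]
        for row_a, row_e in zip(added_bytes, edits)
    ]

# Hypothetical data: 2 cohorts (rows) x 3 time stamps (columns).
data = {
    "addedBytes": [[100, 300, 0], [50, 0, 80]],
    "edits": [[10, 20, 0], [5, 0, 4]],
}

# Store the ratio under a new variable name, as a finalizeData() override might.
data["addedPerEdit"] = added_per_edit(data["addedBytes"], data["edits"])
```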

getFileName(varName, destination=None, ftype='data')[source]¶

Generates the path and file name based on properties of the cohort. Additional identifying features might be used in file names by overriding this method in subclasses of the base Cohort class.

If no destination argument is passed, the method uses the ftype argument to determine which base directory should be used. Only the name of the data feature (e.g. ‘added’) and the cohort name (e.g. AbsoluteAgePerMonth) are used in the basic method.

Parameters:

varName – name of the self.data variable
destination – str, destination directory. If None, settings will be used
ftype – str, ‘data’ or ‘wikipride’
Returns:	A path without file format

getIndex(edits)[source]¶: Returns the index of the cohort

initData()[source]¶: Initialize the self.data dictionary with the appropriate variable names and numpy.arrays

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

linePlots(dest)[source]¶: This method produces line plots using the cohort data stored in self.data. Line plots usually illustrate interesting trends or ratios that depend on the cohort definition. Thus this method does nothing in the base cohort definition and should be overridden in the cohort class itself.

loadDataFromDisk(varName, destination=None)[source]¶

Loads the data from disk. It will populate self.data with {names[i] : numpy.array}. An error is raised if there is no corresponding data file stored.

Parameters:

varName – variable name
destination – str, destination directory. If None, settings will be used

mongoQueryVars¶: The Mongo query variables used to aggregate the data. If None, all fields will be returned by mongo. If ‘settings’, the mongoQueryVars from the settings will be used

ncolors¶: The number of colors used for the wikipride graphs. If required, it should be defined in the child class definition.

nobots¶: True if the bots are filtered from the cohort

processMongoDocument()[source]¶: Processes a document of the MongoDB result set

processSQLrow(row)[source]¶: Processes a row of the SQL result set

saveDataToCSV(destination=None)[source]¶

Saves the aggregated numpy.arrays to file. There is one file for each collected variable; its name is uniquely constructed from the properties of the variable and cohort. The format of the CSV doesn’t follow the numpy representation, as it transposes the matrix. Thus the temporal axis is vertical instead of horizontal, and each row is a measurement for a different time unit. This format is used by the visualization library dygraphs.

Parameters:	destination – str, destination directory. If None, the data directory from the settings will be used
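
The transposition described above can be sketched with the standard csv module. The function write_transposed_csv, the header name "time" and the sample labels are hypothetical; the exact header and time format written by saveDataToCSV are not shown on this page.

```python
import csv
import io

def write_transposed_csv(matrix, time_stamps, cohort_labels, out):
    """Write a cohort-by-time matrix so that each CSV row is one time unit."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time"] + list(cohort_labels))
    # zip(*matrix) transposes the matrix: each tuple holds all cohorts for one time unit.
    for stamp, column in zip(time_stamps, zip(*matrix)):
        writer.writerow([stamp] + list(column))

buffer = io.StringIO()
write_transposed_csv(
    [[1, 2, 3], [4, 5, 6]],              # 2 cohorts (rows) x 3 months (columns)
    ["200401", "200402", "200403"],
    ["cohort A", "cohort B"],
    buffer,
)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of file system side effects; passing an open file object instead works the same way.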

saveDataToDisk(destination=None)[source]¶: Saves the aggregated numpy.arrays to file. There is one file for each collected variable; its name is uniquely constructed from the properties of the variable and cohort.

saveFigure(name, fig, dest, title='', ylabel='', xlabel=None, ylog=False, legendpos=None, pdf=False)[source]¶

Saves a matplotlib figure to disk.

Parameters:

name – str, name of resulting file
fig – matplotlib figure to be saved.
dest – str, destination folder.
title – str, plot title. Defaults to an empty string.
ylabel – str, plot y axis label. Defaults to an empty string.
xlabel – str, plot x axis label. If none, time stamps xticks will be used.
ylog – If True the log scale is used for the y-axis, default is False.
legendpos – int, the position of the legend. If None, no legend will be displayed.
pdf – If True the file will be saved as pdf, otherwise as png.

wikiPride(varName, varDesc=None, normal=True, percentage=True, colorbar=True, ncolors=None, flip=False, pdf=False, dest=None, verbose=False)[source]¶

Plots the cohort trends using the famous WikiPride stacked bar chart! If normal is True, the absolute values are visualized. If percentage is True, the relative values are visualized (i.e. the percentages). If flip is True, the numpy.array is flipped upside down. As a result, the bars are added in reverse order, and the order of the cohort labels is reversed as well.

Parameters:

varName – str, the name of the numpy.array in self.data to visualize
varDesc – str. Alternative name for the data description. If None, varName will be used.
normal – Boolean. Visualize absolute values.
percentage – Boolean. Visualize percentages.
colorbar – Boolean. Add color bar legend.
pdf – Boolean. If True, save plot as pdf
flip – Boolean. Applies flipud() to the numpy.array, which reverses the order in which the boxes are added
dest – str. Path to the directory in which to save the plot. If None, the path in settings.py will be used
verbose – Boolean. Displays information about the graphing progress.
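
To make the normal, percentage and flip options concrete, the sketch below shows one plausible way to derive the relative values and the flipped matrix. The helpers to_percentages and flip_rows are hypothetical and nested lists stand in for numpy arrays; how wikiPride computes these internally is not documented here.

```python
def to_percentages(matrix):
    """Scale each time column so that the cohort values sum to 100."""
    totals = [sum(column) for column in zip(*matrix)]
    return [
        # Columns without any activity stay at 0.0 instead of dividing by zero.
        [100.0 * value / total if total else 0.0 for value, total in zip(row, totals)]
        for row in matrix
    ]

def flip_rows(matrix):
    """Reverse the cohort order, mirroring what numpy.flipud does for a 2-D array."""
    return matrix[::-1]

absolute = [[10, 0], [30, 0]]           # 2 cohorts (rows) x 2 months (columns)
relative = to_percentages(absolute)
flipped = flip_rows(absolute)
```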

This module implements age cohorts, AbsoluteAge and RelativeAge.

class cohorts.age.AbsoluteAgeAllNamespaces(minedits=1, maxedits=None)[source]¶

A cohort is the group of people that have started editing in the same month.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(fe)[source]¶: Returns the index of the cohort, which is identical to the time index of the first edit

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

maxedits¶: Maximum number of edits by editor in a given month to be included

minedits¶: Minimum number of edits by editor in a given month to be included

ncolors¶: Number of visible colors in the wikipride plots, e.g. one color for every six months

sqlQuery¶: The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.age.AbsoluteAgePerMonth[source]¶

A cohort is the group of people that have started editing in the same month.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(fe)[source]¶: Returns the index of the cohort, which is identical to the time index of the first edit

old_user_id¶: The user_id of the previously encountered editor as we iterate through the table

class cohorts.age.Age[source]¶

An abstract class for an age cohort.

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

class cohorts.age.RelativeAgeAllNamespaces(minedits=1, maxedits=None)[source]¶

A cohort is the group of people that have the same age at the time of an edit. During the first month of editing, a contributor will be in the 1-month old cohort, then he switches to the 2-month cohort and so forth.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ti, fe)[source]¶: Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit
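
The difference between absolute and relative age indices can be illustrated as below. The two functions are hypothetical stand-ins for the getIndex methods, and the zero-based relative age (index 0 for the first month of editing) is an assumption.

```python
def absolute_age_index(first_edit_index):
    """Absolute age: the cohort is the time index of the first edit."""
    return first_edit_index

def relative_age_index(time_index, first_edit_index):
    """Relative age: months elapsed since the first edit (assumed 0-based)."""
    return time_index - first_edit_index

# Hypothetical index in the style of settings.time_stamps_index.
time_stamps_index = {"200401": 0, "200402": 1, "200403": 2}
ti = time_stamps_index["200403"]   # month of the edit
fe = time_stamps_index["200401"]   # month of the editor's first edit
```

With these values, an editor who started in January 2004 always stays in absolute cohort 0 but has moved to relative cohort 2 by March 2004.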

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

linePlots(dest)[source]¶

Graphs for relative age cohorts include

Bytes added per edit (new vs. old editors)
Contribution percentage of bytes added for each one year cohort
Editor percentage for each one year cohort

maxedits¶: Maximum number of edits by editor in a given month to be included

minedits¶: Minimum number of edits by editor in a given month to be included

ncolors¶: Number of visible colors in the wikipride plots, e.g. one color for every six months

sqlQuery¶: The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.age.RelativeAgePerDay[source]¶

A cohort is the group of people that have the same age at the time of an edit.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ti, fe)[source]¶: Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit

class cohorts.age.RelativeAgePerMonth[source]¶

A cohort is the group of people that have the same age at the time of an edit. During the first month of editing, a contributor will be in the 1-month old cohort, then he switches to the 2-month cohort and so forth.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ti, fe)[source]¶: Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit

old_user_id¶: The user_id of the previously encountered editor as we iterate through the table

This module implements histogram cohorts, e.g. EditsHistogram.

class cohorts.histogram.EditorActivity[source]¶

The cohorts are based on the number of edits that editors have made in a given month. It uses a table where the values are aggregated for all namespaces.

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getColor(i)[source]¶: Returns a color based on the index of the cohort i

getIndex(edits)[source]¶: Returns the index of the cohort
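
One way to map an edit count to a histogram cohort is a binary search over bin boundaries, sketched below. The boundaries and the function activity_index are hypothetical; the bins actually used by the histogram cohorts are not listed on this page.

```python
import bisect

# Hypothetical lower bounds of the activity bins: 1-4, 5-99 and 100+ edits.
BIN_LOWER_BOUNDS = [1, 5, 100]

def activity_index(edits):
    """Return the bin index for a monthly edit count, or None below the first bin."""
    # bisect_right counts how many lower bounds are <= edits.
    position = bisect.bisect_right(BIN_LOWER_BOUNDS, edits) - 1
    return position if position >= 0 else None
```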

initDataDescription()[source]¶: Initialize the self.data_description dictionary with information used for plotting.

linePlots(dest)[source]¶

Graphs for editor activity histogram cohort include

Number of editors by activity
Number of edits by activity
Bytes added by activity
Bytes added per editor
Bytes added per edit
Edits per editor
The first year of one-year cohorts in one plot (x-axis is age, not time)

sqlQuery¶: The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.histogram.EditsHistogram[source]¶

The cohorts are based on the number of edits that editors have made in a given month. Implemented only for MongoDB.

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getColor(i)[source]¶: Returns a color based on the index of the cohort i

getIndex(edits)[source]¶: Returns the index of the cohort

class cohorts.histogram.NewEditorActivity(period=3)[source]¶

The cohorts are based on the number of edits that editors have made in a given month.

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getColor(i)[source]¶: Returns a color based on the index of the cohort i

getIndex(edits)[source]¶: Returns the index of the cohort

initDataDescription()[source]¶: Initialize the self.data_description dictionary with information used for plotting.

lastym¶: The ym at the end of the period months after the first edit of an editor

old_user_id¶: The user_id of the previously encountered editor as we iterate through the table

period¶: The number of months an editor is considered new

For simple cohorts :)

class cohorts.simple.NameSpaces[source]¶

The namespaces themselves are cohorts

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ns)[source]¶: Returns the index of the cohort, given the year of the first edit

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

sqlQuery¶: The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.simple.NewEditors[source]¶

There is just one cohort, which contains the number of editors who started contributing in any given month.

NewEditors.linePlots() creates a line plot.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ns)[source]¶: Not needed in this cohort!

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

linePlots(dest)[source]¶

Creates a line plot for the number of new editors and saves it to disk.

Parameters:	dest – str, destination directory

sqlQuery¶: The SQL query returns the new editor count for each ym.

class cohorts.simple.OneYearCohort(year, activation=5, overall=False)[source]¶

A cohort that is comprised of active editors that started editing in a given year.

activation¶: Minimum number of edits per month to be included in the cohort

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

getIndex(fe)[source]¶: Returns the index of the cohort, which is identical to the time index of the first edit

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

time_stamps_index¶: Only take time_stamps starting with self.year

year¶: The year the cohort started.

class cohorts.simple.ProjectSpaceCohorts(activation=5)[source]¶

A cohort that is comprised of active editors that started editing in a given year. Only the contributions to the Wikipedia namespaces 4&5 are considered.

cohort_labels¶: Cohort labels

cohorts¶: Cohort definition

colorbarTicksAndLabels(ncolors)[source]¶: Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(y)[source]¶: Returns the index of the cohort, given the year of the first edit

initDataDescription()[source]¶: Initialize the self.data_description dictionary with additional information

time_stamps_index¶: Only take time_stamps starting with self.year

Data Processing¶

This module interacts with the MediaWiki SQL database.

Preprocessing¶

Starting with the Wikimedia SQL database schema, this module creates a set of tables that will be used to aggregate the cohort trends.

data.preprocessing.createIndex(query, tablename)[source]¶

Create an index on a SQL table in the user database

Parameters:

tablename – str, name of the table
query – str, query to execute

data.preprocessing.createTable(query, tablename)[source]¶

Create a SQL table in the user database

Parameters:

tablename – str, name of the table
query – str, query to execute

data.preprocessing.dropTable(tablename)[source]¶

Drops a SQL table in the user database

Parameters:	tablename – str, name of the table

data.preprocessing.executeCommand(command, comment)[source]¶

Exports a SQL table into a file

Parameters:

command – str, the command used to export the table
comment – str, comment for the logging stream

data.preprocessing.process()[source]¶: Creates the auxiliary SQL tables on the user database.

data.preprocessing.tableExists(tablename)[source]¶

Returns True if the table exists in the user database

Parameters:	tablename – str, name of the table
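
The interplay between tableExists, dropTable, createTable and settings.sqldroptables can be sketched as follows. To stay self-contained, the example uses the standard sqlite3 module as a stand-in for the MySQL user database, so the catalog query differs from what the real module would run. The function names, the droptables argument and the table definition are hypothetical.

```python
import sqlite3

def table_exists(con, tablename):
    """Return True if a table with this name exists (SQLite catalog lookup)."""
    row = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (tablename,),
    ).fetchone()
    return row is not None

def create_table(con, query, tablename, droptables=False):
    """Create a table, honoring a drop-if-exists flag like settings.sqldroptables."""
    if table_exists(con, tablename):
        if not droptables:
            return False          # keep the existing table untouched
        # Table names cannot be bound as parameters, so quote the identifier.
        con.execute('DROP TABLE "{}"'.format(tablename.replace('"', '""')))
    con.execute(query)
    return True

con = sqlite3.connect(":memory:")
query = "CREATE TABLE editor_year_month (user_id INTEGER, ym TEXT, edits INTEGER)"
first = create_table(con, query, "editor_year_month")                    # created
second = create_table(con, query, "editor_year_month")                   # skipped
third = create_table(con, query, "editor_year_month", droptables=True)   # recreated
```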

Tables¶

This module holds the collection of SQL queries used for the preprocessing of the data

data.tables.CREATE_EDITOR_YEAR_MONTH¶: Query to create the editor-centric table. For each user and each year/month, it contains the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_EDITOR_YEAR_MONTH_NAMESPACE¶: Query to create the editor-centric table. Same as EDITOR_YEAR_MONTH but including namespace. For each user and each year/month/namespace, it contains the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_EDITOR_YEAR_MONTH_NS0_NOREDIRECT¶: Query to create the editor-centric table. Same as EDITOR_YEAR_MONTH but only for namespace 0 (main) and only for pages that are not redirects. For each user and each year/month, it contains the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_TIME_YEAR_MONTH_DAY_NAMESPACE¶: Query to create the time-centric table. Same as TIME_YEAR_MONTH_NAMESPACE but including namespace. For each year/month, it contains the number of editors, the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_TIME_YEAR_MONTH_NAMESPACE¶: Query to create the time-centric table. For each year/month, it contains the number of editors, the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_USER_COHORTS¶: Query to create an augmented user table. Includes time stamp for first edit of user, also considering archived revisions. A detailed description is available here.

data.tables.INDEX_REV_LEN_CHANGED¶: Query to create an augmented revision table. Includes the namespace and the change in the size of the article, len_change. Costly query; a detailed description is available here.

Report¶

This module defines the content of a report, which consists of the following at the moment.

Community roles
- User
- Administrators
Cohort trends
- Age Cohorts
  - More than 1 edit
  - More than 5 edits
  - More than 100 edits
  - Less than 100 edits
- New editors
- Histogram cohorts
- Namespaces
User lists
- Most active editors

class data.report.ReportItem(cohort, dest)[source]¶

A report consists of a collection of report items. A report item consists of a cohort instance and methods to generate the data and the plots.

cohort¶: Cohort instance

createDirectory(base)[source]¶

Creates the directory if it doesn’t exist already. The base directory is joined with the relative destination directory and returned.

Parameters:	base – base directory (e.g. settings.datadirectory or settings.wikipridedirectory)
Returns:	absolute path

freeData()[source]¶: Frees the data in hope of reducing the memory usage of the process.

generateCSV()[source]¶: Stores a simple csv file in a format used by the javascript dygraphs library.

generateData()[source]¶: Generates and saves the cohort data. Calls the aggregateDataFromSQL() method from the Cohort instance passed as argument. The collected data matrices are stored in the Cohort.data attribute. The data matrices are saved as txt files in the data destination directory.

generateVisualizations(varNames, **kargs)[source]¶

For the variable names in varNames, produces the WikiPride graphs using wikiPride() (e.g. added, editors, ...). If the cohort defines linePlots, they are also generated.

Parameters:

varNames – list of str, containing the names of the variables for which wikipride should be produced.
kargs – keyword arguments passed directly to wikiPride(), e.g. flip=True, percentage=False.

loadData()[source]¶: Loads the data from disk if available

relDest¶: Relative path to the destination directory

data.report.processCSV()[source]¶: The aggregation of the cohort data requires that data.preprocessing.process() has been executed and the data thus preprocessed. The data.cohortdata.processData() method will use the report definition in report to create a directory structure that contains the data of the cohort definitions described below. The data is stored in the form of numpy matrices.

data.report.processData()[source]¶: The aggregation of the cohort data requires that data.preprocessing.process() has been executed and the data thus preprocessed. The data.cohortdata.processData() method will use the report definition in report to create a directory structure that contains the data of the cohort definitions described below. The data is stored in the form of numpy matrices.

data.report.processReport()[source]¶: Creates a set of graphs which requires that data.report.processData() has been executed and the data thus aggregated. The data is loaded from disk.

Database configuration¶

Creates a database connection to the slave replica on alpha and implements methods for querying the database

db.sql.connect()[source]¶: Connect to the MySQL database

db.sql.db¶: SQL connection instance

db.sql.getCursor()[source]¶: Returns a normal cursor

db.sql.getSSCursor()[source]¶: Returns a server-side cursor

db.sql.getSSDictCursor()[source]¶: Returns a server-side dictionary cursor

Provides a database connection con to wikilytics.

db.mongo.con¶: Connection instance to the mongo db

db.mongo.mongocol¶: Name of the mongo collection

db.mongo.mongodb¶: Name of the mongo database

Utils module¶

A set of utility methods that are used in different parts of the framework.

utils.cmap_discretize(cmapName, N)[source]¶

From http://www.scipy.org/Cookbook/Matplotlib/ColormapTransformations

Parameters:

cmap – colormap instance, e.g. cm.jet.
N – Number of colors.
Returns:	a discrete colormap from the continuous colormap cmap.

utils.computeMonthStartEndtime(ym)[source]¶

Returns the starting and end datetime object for the yyyymm passed. I.e. the first and last day of the month

Parameters:	ym – str, ‘yyyymm’ format
Returns:	tuple of datetime objects
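
A minimal sketch of this computation using the standard calendar and datetime modules. The function name month_start_end is hypothetical, and both datetimes are set to midnight; whether the original end value carries a time of day is not stated here.

```python
import calendar
from datetime import datetime

def month_start_end(ym):
    """Return datetimes for the first and last day of a 'yyyymm' month."""
    year, month = int(ym[:4]), int(ym[4:6])
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return datetime(year, month, 1), datetime(year, month, last_day)

start, end = month_start_end("200402")   # February of a leap year
```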

utils.create_time_stamps_day(fromymd='20010101', toymd='20101231')[source]¶: Helper data structure for time stamps. List of all time units, i.e. every day, in yyyymmdd format.

utils.create_time_stamps_month(fromym='200101', toym='201012')[source]¶: Helper data structure for time stamps. List of all time units, i.e. every month, in yyyymm format.
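
Generating the monthly stamps can be sketched as below. The function month_stamps is a hypothetical stand-in that reuses the documented default arguments, and treating both end points as inclusive is an assumption.

```python
def month_stamps(fromym="200101", toym="201012"):
    """List every 'yyyymm' stamp from fromym to toym, both inclusive."""
    year, month = int(fromym[:4]), int(fromym[4:6])
    end = (int(toym[:4]), int(toym[4:6]))
    stamps = []
    while (year, month) <= end:
        stamps.append("{:04d}{:02d}".format(year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return stamps
```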

utils.isBot(u_id)[source]¶

Returns true if we filter for bots and u_id is a known bot.

Parameters:	ints – Boolean, if True compares u_id as int (default is False)

utils.numberOfMonths(ymStart, ymEnd)[source]¶

Returns the number of months between the parameters.

Parameters:

ymStart – str, ‘yyyymm’ format
ymEnd – str, ‘yyyymm’ format
Returns:	int, number of months
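
The month arithmetic can be sketched by converting each stamp to a running month count. The function months_between is hypothetical, and counting the difference without the start month (so identical stamps give 0) is an assumption; the original may count inclusively.

```python
def months_between(ym_start, ym_end):
    """Count the months from ym_start to ym_end, excluding the start month."""
    start = int(ym_start[:4]) * 12 + int(ym_start[4:6])
    end = int(ym_end[:4]) * 12 + int(ym_end[4:6])
    return end - start
```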
