Reference

Settings used for cohort analysis. Many of these settings can be overridden in the cohort class __init__ calls.

settings.basedirectory

Path to base directory for the wikipride project

settings.cmapName

Colormap to use. Has to be a valid name in matplotlib.pyplot.cm.datad

settings.datadirectory

Path to directory for cohort data

settings.filterbots

Filter out known bots?

settings.language

The language of the Wikipedia (e.g. ‘en’, ‘pt’)

settings.mongoQueryVars

The Mongo query variables used to aggregate the data.

settings.mongocol

The name of the collection

settings.mongodb

The name of the db

settings.readConfig(configfile)[source]

Reads the configuration from ConfigParser instance into the runtime settings.

Parameters:configfile – A file that can be read by a ConfigParser instance
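
As an illustration of the kind of file a ConfigParser instance can read, the following minimal sketch parses a small configuration into a dictionary. The section names, option names and the read_config helper are assumptions made for this example; they are not the keys or the code used by settings.readConfig.

```python
import configparser
import os
import tempfile

# Hypothetical section and option names, chosen only for illustration.
CONFIG_TEXT = """\
[General]
language = pt
filterbots = True

[Directories]
basedirectory = /tmp/wikipride
"""


def read_config(configfile):
    """Parse configfile and return the values as a plain dictionary."""
    parser = configparser.ConfigParser()
    with open(configfile) as handle:
        parser.read_file(handle)
    return {
        "language": parser.get("General", "language"),
        "filterbots": parser.getboolean("General", "filterbots"),
        "basedirectory": parser.get("Directories", "basedirectory"),
    }


# Write the sample configuration to a temporary file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".config", delete=False) as tmp:
    tmp.write(CONFIG_TEXT)
config = read_config(tmp.name)
os.remove(tmp.name)
print(config)
```
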
settings.reportdirectory

Path to store report

settings.sqlconfigfile

The path to the MySQL configuration for logging into the server, e.g. ‘~/.my.cnf’

settings.sqldroptables

If True, every table that we attempt to create is dropped first if it already exists. If False, only tables that do not already exist are created.

settings.sqlhost

The host name of the MySQL server

settings.sqluserdb

The name of the database on the db host where the aggregated tables will be stored. On the toolserver, this is the username prefixed with u_, e.g. u_delcerambaul

settings.sqlwikidb

The name of the database on db host where the Mediawiki database is stored. On the toolserver, this is for example ptwiki_p

settings.time_stamps

List containing all YM (e.g. ‘200401’ for January 2004) that we want to analyze

settings.time_stamps_index

A dictionary mapping each YM in time_stamps to its array index (used to access numpy arrays)
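
A minimal sketch of how such a mapping can be built and used. The short list of time stamps and the edits_per_month series are made-up values for illustration; a plain list stands in for the numpy array.

```python
# A short, hand-written list of YM strings standing in for settings.time_stamps.
time_stamps = ["200401", "200402", "200403", "200404"]

# Map each YM to its position, mirroring the description of time_stamps_index.
time_stamps_index = {ym: i for i, ym in enumerate(time_stamps)}

# Hypothetical per-month series with one value per entry in time_stamps.
edits_per_month = [12, 30, 7, 19]

# Look up the value for March 2004 through the index dictionary.
march = edits_per_month[time_stamps_index["200403"]]
print(march)
```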

settings.userlistdirectory

Path to directory for user lists

settings.wikipridedirectory

Path to store wikipride visualizations of user defined cohorts

Cohort definitions

This module defines individual cohorts. All cohorts are subclasses of Cohort, and they should override the non-generic methods defined in the parent class. Different kinds of cohorts have been defined.

This module defines the abstract class Cohort. All cohort definitions must inherit this class.

class cohorts.base.Cohort[source]

Abstract class that defines common properties of cohorts, which are defined in the cohorts modules

addLine(data, fig=None, label='')[source]

Adds a line to the matplotlib figure passed as argument. The length of the data has to match the length of time_stamps. The figure is assumed to contain only one axes.

Parameters:
  • data – numpy.array of same length as time_stamps
  • fig – matplotlib figure. If none, a new figure is created.
  • label – str, label for the legend. Defaults to an empty string.
Returns:

matplotlib figure

aggregateDataFromSQL(verbose=False, callback=None)[source]

Iterates over the SQL data and calls self.processSQLrow(), which needs to be implemented by the concrete cohort class

Parameters:
  • verbose – bool, display progress on stdout
  • callback – function, a callback function that can be used for data transformations after the query has executed.
data

Dictionary that contains the data: {name: numpy.array}. Different aggregates can be saved, for example ‘bytesadded’, ‘edits’, ‘bytesremovedPerEditor’

data_description

Dictionary that holds descriptive information about self.data. For example, an ‘addedBytes’ data description might be:

self.data_description[‘addedBytes’] = { title}

finalizeData()[source]

This method is called at the end of an aggregateDataFromXXX() method. It allows the time series data in self.data to be manipulated. E.g. ‘addedBytes’ could be divided by ‘edits’ to create a new variable ‘addedPerEdit’.
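
The ‘addedPerEdit’ example can be sketched as follows. The function name finalize_data, the sample values and the choice to leave 0 where no edits occurred are assumptions for illustration, not the project's implementation.

```python
import numpy as np

# Hypothetical aggregates: rows are cohorts, columns are time stamps.
data = {
    "addedBytes": np.array([[100.0, 250.0, 0.0], [40.0, 0.0, 90.0]]),
    "edits": np.array([[10.0, 25.0, 0.0], [4.0, 0.0, 30.0]]),
}


def finalize_data(data):
    """Add a derived 'addedPerEdit' series, leaving 0 where there were no edits."""
    per_edit = np.zeros_like(data["addedBytes"])
    # Divide only where the denominator is positive to avoid division by zero.
    np.divide(data["addedBytes"], data["edits"], out=per_edit,
              where=data["edits"] > 0)
    data["addedPerEdit"] = per_edit
    return data


finalize_data(data)
print(data["addedPerEdit"])
```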

getFileName(varName, destination=None, ftype='data')[source]

Generates the path and file name based on properties of the cohort. Additional identifying features might be used in file names by overwriting this method in subclasses of the base Cohort class.

If no destination argument is passed, the method uses the ftype argument to determine which base directory should be used. Only the name of the data feature (e.g. ‘added’) and the cohort name (e.g. AbsoluteAgePerMonth) is used in the basic method.

Parameters:
  • varName – name of the self.data variable
  • destination – str, destination directory. If None, settings will be used
  • ftype – str, ‘data’ or ‘wikipride’
Returns:

A path without file format

getIndex(edits)[source]

Returns the index of the cohort

initData()[source]

Initialize the self.data dictionary with the appropriate variable names and numpy.arrays

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

linePlots(dest)[source]

This method produces line plots from the cohort data stored in self.data. Line plots usually illustrate interesting trends/ratios that depend on the cohort definition. This method therefore does nothing in the base cohort definition and should be overridden in the cohort class itself.

loadDataFromDisk(varName, destination=None)[source]

Loads the data from disk. It will populate self.data with {names[i] : numpy.array}. An error is raised if there is no corresponding datafile stored

Parameters:
  • varName – variable name
  • destination – str, destination directory. If None, settings will be used
mongoQueryVars

The Mongo query variables used to aggregate the data. If None, all fields will be returned by mongo. If ‘settings’, the mongoQueryVars from the settings will be used

ncolors

The number of colors used for the wikipride graphs. If required, it should be defined in the child class definition.

nobots

True if the bots are filtered from the cohort

processMongoDocument()[source]

Processes a document of the MongoDB result set

processSQLrow(row)[source]

Processes a row of the SQL result set

saveDataToCSV(destination=None)[source]

Saves the aggregated numpy.arrays to file. There is one file for each collected variable; its name is uniquely constructed from the properties of the variable and cohort. The format of the CSV doesn’t follow the numpy representation as it transposes the matrix. Thus the temporal axis is vertical instead of horizontal, and each row is a measurement for a different time unit. This format is used by the visualization library dygraphs.

Parameters:destination – str, destination directory. If None, the data directory from the settings will be used
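
The transposition described above can be sketched with the standard library. The header row, the cohort labels and the to_csv helper are assumptions for illustration; the exact file layout written by saveDataToCSV is not documented here.

```python
import csv
import io

time_stamps = ["200401", "200402", "200403"]
cohort_labels = ["cohort A", "cohort B"]  # hypothetical labels

# In-memory layout: one row per cohort, one column per time stamp.
matrix = [[5, 8, 13], [2, 3, 1]]


def to_csv(matrix, time_stamps, cohort_labels):
    """Write the transposed matrix: one CSV row per time stamp."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ym"] + cohort_labels)
    # zip(*matrix) yields the columns of the matrix, i.e. one tuple per month.
    for ym, column in zip(time_stamps, zip(*matrix)):
        writer.writerow([ym] + list(column))
    return buffer.getvalue()


csv_text = to_csv(matrix, time_stamps, cohort_labels)
print(csv_text)
```
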
saveDataToDisk(destination=None)[source]

Saves the aggregated numpy.arrays to file. There is one file for each collected variable; its name is uniquely constructed from the properties of the variable and cohort.

saveFigure(name, fig, dest, title='', ylabel='', xlabel=None, ylog=False, legendpos=None, pdf=False)[source]

Saves a matplotlib figure to disk.

Parameters:
  • name – str, name of resulting file
  • fig – matplotlib figure to be saved.
  • dest – str, destination folder.
  • title – str, plot title. Defaults to an empty string.
  • ylabel – str, plot y axis label. Defaults to an empty string.
  • xlabel – str, plot x axis label. If none, time stamps xticks will be used.
  • ylog – If True the log scale is used for the y-axis, default is False.
  • legendpos – int, the position of the legend. If None, no legend will be displayed.
  • pdf – If True the file will be saved as pdf, otherwise as png.
wikiPride(varName, varDesc=None, normal=True, percentage=True, colorbar=True, ncolors=None, flip=False, pdf=False, dest=None, verbose=False)[source]

Plots the cohort trends using the famous WikiPride stacked bar chart! If normal is True, the absolute values are visualized. If percentage is True, the relative values are visualized (i.e. the percentages). If flip is True, the numpy.array is flipped upside down, so the bars are added in reverse order. The order of the cohort labels is reversed as a result.

Parameters:
  • varName – str, the name of the numpy.array in self.data to visualize
  • varDesc – str. Alternative name for the data description. If None, varName will be used.
  • normal – Boolean. Visualize absolute values.
  • percentage – Boolean. Visualize percentages.
  • colorbar – Boolean. Add color bar legend.
  • pdf – Boolean. If True, save plot as pdf
  • flip – Boolean. Flips the numpy.array with N.flipud(), which reverses the order in which the boxes are added
  • dest – str. Path to the directory in which to save the plot. If None, the path in settings.py will be used
  • verbose – Boolean. Displays information about the graphing progress.
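
The data transformations behind the percentage and flip options can be sketched without any plotting. The sample matrix and the to_percentages helper are assumptions for illustration; the actual drawing code of wikiPride() is not shown.

```python
import numpy as np

# Hypothetical cohort matrix: rows are cohorts, columns are time stamps.
values = np.array([[10.0, 20.0, 30.0],
                   [30.0, 20.0, 10.0],
                   [60.0, 60.0, 60.0]])


def to_percentages(values):
    """Express each cohort's value as a percentage of the column total."""
    totals = values.sum(axis=0)
    shares = np.zeros_like(values)
    # Months without any activity keep a share of 0 instead of dividing by 0.
    np.divide(100.0 * values, totals, out=shares, where=totals > 0)
    return shares


percentages = to_percentages(values)

# flip=True corresponds to reversing the cohort order before stacking.
flipped = np.flipud(values)

print(percentages)
print(flipped)
```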

This module implements age cohorts, AbsoluteAge and RelativeAge.

class cohorts.age.AbsoluteAgeAllNamespaces(minedits=1, maxedits=None)[source]

A cohort is the group of people that have started editing in the same month.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(fe)[source]

Returns the index of the cohort, which is identical to the time index of the first edit

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

maxedits

Maximum number of edits by editor in a given month to be included

minedits

Minimum number of edits by editor in a given month to be included

ncolors

Number of visible colors in the wikipride plots, e.g. one color for every six months

sqlQuery

The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.age.AbsoluteAgePerMonth[source]

A cohort is the group of people that have started editing in the same month.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(fe)[source]

Returns the index of the cohort, which is identical to the time index of the first edit

old_user_id

The user_id of the previously encountered editor as we iterate through the table

class cohorts.age.Age[source]

An abstract class for an age cohort.

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

class cohorts.age.RelativeAgeAllNamespaces(minedits=1, maxedits=None)[source]

A cohort is the group of people that have the same age at the time of an edit. During the first month of editing, a contributor will be in the 1-month old cohort, then he switches to the 2-month cohort and so forth.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ti, fe)[source]

Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit
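
A minimal sketch of the relative-age computation, assuming the cohort index is simply the difference between the two time indices, so that index 0 corresponds to the first month of editing. The function name and the error handling are illustrative only.

```python
time_stamps = ["200401", "200402", "200403", "200404"]
time_stamps_index = {ym: i for i, ym in enumerate(time_stamps)}


def relative_age_index(ti, fe):
    """Months elapsed between the first edit (fe) and the current edit (ti)."""
    if ti < fe:
        raise ValueError("an edit cannot precede the editor's first edit")
    return ti - fe


# An editor whose first edit was in February 2004 and who edits every month.
fe = time_stamps_index["200402"]
ages = [relative_age_index(time_stamps_index[ym], fe)
        for ym in ["200402", "200403", "200404"]]
print(ages)
```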

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

linePlots(dest)[source]

Graphs for relative age cohorts include

  • Bytes added per edit (new vs. old editors)
  • Contribution percentage of bytes added for each one year cohort
  • Editor percentage for each one year cohort
maxedits

Maximum number of edits by editor in a given month to be included

minedits

Minimum number of edits by editor in a given month to be included

ncolors

Number of visible colors in the wikipride plots, e.g. one color for every six months

sqlQuery

The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.age.RelativeAgePerDay[source]

A cohort is the group of people that have the same age at the time of an edit.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ti, fe)[source]

Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit

class cohorts.age.RelativeAgePerMonth[source]

A cohort is the group of people that have the same age at the time of an edit. During the first month of editing, a contributor will be in the 1-month old cohort, then he switches to the 2-month cohort and so forth.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ti, fe)[source]

Returns the index of the cohort (i.e. the relative age of the editor) from the time index of the edit and time index of the first edit

old_user_id

The user_id of the previously encountered editor as we iterate through the table

This module implements histogram cohorts, e.g. EditsHistogram.

class cohorts.histogram.EditorActivity[source]

The cohorts are based on the number of edits they have done in a given month. It uses a table where the values are aggregated for all namespaces.

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getColor(i)[source]

Returns a color based on the index of the cohort i

getIndex(edits)[source]

Returns the index of the cohort
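
One way to map an edit count to a histogram cohort is a binary search over bin boundaries. The boundaries below are hypothetical and were chosen for illustration; the bins actually used by the histogram cohorts are not listed in this reference.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the activity bins: 1-4, 5-9, 10-99, 100+ edits.
LOWER_BOUNDS = [1, 5, 10, 100]


def activity_index(edits):
    """Return the index of the bin that contains the given number of edits."""
    if edits < LOWER_BOUNDS[0]:
        raise ValueError("editors without edits belong to no cohort")
    # bisect_right counts the lower bounds that are <= edits.
    return bisect_right(LOWER_BOUNDS, edits) - 1


indices = [activity_index(n) for n in (1, 4, 5, 99, 100, 2500)]
print(indices)
```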

initDataDescription()[source]

Initialize the self.data_description dictionary with information used for plotting.

linePlots(dest)[source]

Graphs for editor activity histogram cohort include

  • Number of editors by activity
  • Number of edits by activity
  • Bytes added by activity
  • Bytes added per editor
  • Bytes added per edit
  • Edits per editor
  • The first year of one-year cohorts in one plot (x-axis is age, not time)
sqlQuery

The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.histogram.EditsHistogram[source]

The cohorts are based on the number of edits they have done in a given month. Implemented only for MongoDB.

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getColor(i)[source]

Returns a color based on the index of the cohort i

getIndex(edits)[source]

Returns the index of the cohort

class cohorts.histogram.NewEditorActivity(period=3)[source]

The cohorts are based on the number of edits they have done in a given month.

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getColor(i)[source]

Returns a color based on the index of the cohort i

getIndex(edits)[source]

Returns the index of the cohort

initDataDescription()[source]

Initialize the self.data_description dictionary with information used for plotting.

lastym

The ym at the end of the period months after the first edit of an editor

old_user_id

The user_id of the previously encountered editor as we iterate through the table

period

The number of months an editor is considered new

For simple cohorts :)

class cohorts.simple.NameSpaces[source]

The namespaces themselves are cohorts

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ns)[source]

Returns the index of the cohort, given the year of the first edit

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

sqlQuery

The SQL query returns edit information for each editor for each ym she has edited.

class cohorts.simple.NewEditors[source]

There is just one cohort, which contains the number of editors who started contributing in any given month.

NewEditors.linePlots() creates a line plot.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(ns)[source]

Not needed in this cohort!

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

linePlots(dest)[source]

Creates a line plot for the number of new editors and saves it to disk.

Parameters:dest – str, destination directory
sqlQuery

The SQL query returns the new editor count for each ym.

class cohorts.simple.OneYearCohort(year, activation=5, overall=False)[source]

A cohort that is comprised of active editors that started editing in a given year.

activation

Minimum number of edits per month to be included in the cohort

cohort_labels

Cohort labels

cohorts

Cohort definition

getIndex(fe)[source]

Returns the index of the cohort, which is identical to the time index of the first edit

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

time_stamps_index

Only take time_stamps starting with self.year

year

The year the cohort started.

class cohorts.simple.ProjectSpaceCohorts(activation=5)[source]

A cohort that is comprised of active editors that started editing in a given year. Only the contributions to the Wikipedia namespaces 4&5 are considered.

cohort_labels

Cohort labels

cohorts

Cohort definition

colorbarTicksAndLabels(ncolors)[source]

Returns ticks and labels for the colorbar of a WikiPride visualization

getIndex(y)[source]

Returns the index of the cohort, given the year of the first edit

initDataDescription()[source]

Initialize the self.data_description dictionary with additional information

time_stamps_index

Only take time_stamps starting with self.year

Data Processing

This module interacts with the MediaWiki SQL database.

Preprocessing

Starting with the Wikimedia SQL database schema, this module creates a set of tables that will be used to aggregate the cohort trends.

data.preprocessing.createIndex(query, tablename)[source]

Create an index on a SQL table in the user database

Parameters:
  • tablename – str, name of the table
  • query – str, query to execute
data.preprocessing.createTable(query, tablename)[source]

Create a SQL table in the user database

Parameters:
  • tablename – str, name of the table
  • query – str, query to execute
data.preprocessing.dropTable(tablename)[source]

Drops a SQL table in the user database

Parameters:tablename – str, name of the table
data.preprocessing.executeCommand(command, comment)[source]

Exports a SQL table into a file

Parameters:
  • command – str, the command used to export the
  • comment – str, comment for logging stream
data.preprocessing.process()[source]

Creates the auxiliary SQL tables on the user database.

data.preprocessing.tableExists(tablename)[source]

Returns True if the table exists in the user database

Parameters:tablename – str, name of the table
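
The interplay between tableExists, dropTable, createTable and the settings.sqldroptables flag can be sketched as follows. The sketch uses an in-memory sqlite3 database as a stand-in for the MySQL user database, and the function signatures, table name and query are assumptions for illustration.

```python
import sqlite3


def table_exists(conn, tablename):
    """Return True if a table with this name exists in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (tablename,),
    ).fetchone()
    return row is not None


def create_table(conn, query, tablename, droptables=False):
    """Create the table; return True if the creation query was executed."""
    if table_exists(conn, tablename):
        if not droptables:
            return False  # keep the existing table untouched
        # Identifiers cannot be bound as parameters; tablename is trusted here.
        conn.execute('DROP TABLE "%s"' % tablename)
    conn.execute(query)
    return True


conn = sqlite3.connect(":memory:")
query = "CREATE TABLE editor_ym (user_id INTEGER, ym TEXT, edits INTEGER)"

first = create_table(conn, query, "editor_ym")                    # created
second = create_table(conn, query, "editor_ym")                   # skipped
third = create_table(conn, query, "editor_ym", droptables=True)   # recreated
print(first, second, third)
```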

Tables

This module holds the collection of SQL queries used for the preprocessing of the data

data.tables.CREATE_EDITOR_YEAR_MONTH

Query to create the editor-centric table. For each user and each year/month, it contains the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_EDITOR_YEAR_MONTH_NAMESPACE

Query to create the editor-centric table. Same as EDITOR_YEAR_MONTH but including namespace. For each user and each year/month/namespace, it contains the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_EDITOR_YEAR_MONTH_NS0_NOREDIRECT

Query to create the editor-centric table. Same as EDITOR_YEAR_MONTH but only for namespace 0 (main) and only for pages that are not redirects. For each user and each year/month, it contains the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_TIME_YEAR_MONTH_DAY_NAMESPACE

Query to create the time-centric table. Same as TIME_YEAR_MONTH_NAMESPACE but including namespace. For each year/month, it contains the number of editors, the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_TIME_YEAR_MONTH_NAMESPACE

Query to create the time-centric table. For each year/month, it contains the number of editors, the number of add/remove edits as well as the number of bytes added/removed.

data.tables.CREATE_USER_COHORTS

Query to create an augmented user table. Includes time stamp for first edit of user, also considering archived revisions. A detailed description is available here.

data.tables.INDEX_REV_LEN_CHANGED

Query to create an augmented revision table. Includes the namespace and the change in article size, len_change. Costly query; a detailed description is available here.

Report

This module defines the content of a report, which consists of the following at the moment.

  • Community roles
    • User
    • Administrators
  • Cohort trends
    • Age Cohorts
      • More than 1 edit
      • More than 5 edits
      • More than 100 edits
      • Less than 100 edits
    • New editors
    • Histogram cohorts
    • Namespaces
  • User lists
    • Most active editors
class data.report.ReportItem(cohort, dest)[source]

A report consists of a collection of report items. A report item consists of a cohort instance and methods to generate the data and the plots.

cohort

Cohort instance

createDirectory(base)[source]

Creates the directory if it doesn’t exist already. The base directory is joined with the relative destination directory and returned.

Parameters:base – base directory (e.g. settings.datadirectory or settings.wikipridedirectory)
Returns:absolute path
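
A minimal sketch of the join-and-create behaviour described above. The function signature, which takes the relative destination as an explicit argument instead of reading relDest from the instance, is an assumption for illustration.

```python
import os
import tempfile


def create_directory(base, rel_dest):
    """Join base and rel_dest, create the directory if needed, return the path."""
    path = os.path.abspath(os.path.join(base, rel_dest))
    # exist_ok=True makes repeated calls harmless.
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # stands in for settings.datadirectory
path = create_directory(base, os.path.join("age", "absolute"))
again = create_directory(base, os.path.join("age", "absolute"))
print(path)
```
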
freeData()[source]

Frees the data in hope of reducing the memory usage of the process.

generateCSV()[source]

Stores a simple csv file in a format used by the javascript dygraphs library.

generateData()[source]

Generates and saves the cohort data. Calls the aggregateDataFromSQL() method from the Cohort instance passed as argument. The collected data matrices are stored in the Cohort.data attribute. The data matrices are saved as txt files in the data destination directory.

generateVisualizations(varNames, **kargs)[source]

For the variables names in varNames, produces the WikiPride graphs using wikiPride() (e.g. added, editors, ...). If the cohort defines linePlots, they are also generated.

Parameters:
  • kargs – arguments passed directly to wikiPride(). E.g. flip=True, percentage=False.
  • varNames – list of str, containing the names of the variables for which wikipride should be produced.
loadData()[source]

Loads the data from disk if available

relDest

Relative path to the destination directory

data.report.processCSV()[source]

The aggregation of the cohort data requires that data.preprocessing.process() has been executed and the data thus preprocessed. The data.cohortdata.processData() method will use the report definition in report to create a directory structure that contains the data of the cohort definitions described below. The data is stored in the form of numpy matrices.

data.report.processData()[source]

The aggregation of the cohort data requires that data.preprocessing.process() has been executed and the data thus preprocessed. The data.cohortdata.processData() method will use the report definition in report to create a directory structure that contains the data of the cohort definitions described below. The data is stored in the form of numpy matrices.

data.report.processReport()[source]

Creates a set of graphs which requires that data.report.processData() has been executed and the data thus aggregated. The data is loaded from disk.

Database configuration

Creates a database connection to the slave replica on alpha and implements methods for querying the database

db.sql.connect()[source]

Connect to the MySQL database

db.sql.db

SQL connection instance

db.sql.getCursor()[source]

Returns a normal cursor

db.sql.getSSCursor()[source]

Returns a server-side cursor

db.sql.getSSDictCursor()[source]

Returns a server-side dictionary cursor

Provides a database connection con to wikilytics.

db.mongo.con

Connection instance to the mongo db

db.mongo.mongocol

Name of the mongo collection

db.mongo.mongodb

Name of the mongo database

Utils module

A set of utility methods that are used in different parts of the framework.

utils.cmap_discretize(cmapName, N)[source]

From http://www.scipy.org/Cookbook/Matplotlib/ColormapTransformations

Parameters:
  • cmap – colormap instance, eg. cm.jet.
  • N – Number of colors.
Returns:

a discrete colormap from the continuous colormap cmap.

utils.computeMonthStartEndtime(ym)[source]

Returns the start and end datetime objects for the yyyymm passed, i.e. the first and last day of the month

Parameters:ym – str, ‘yyyymm’ format
Returns:tuple of datetime objects
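
A minimal sketch of such a computation with the standard library. Whether the end datetime falls at midnight or at the last second of the last day is not stated in this reference; the sketch assumes the last second.

```python
import calendar
from datetime import datetime


def month_start_end(ym):
    """Return (start, end) datetimes for a 'yyyymm' string."""
    year, month = int(ym[:4]), int(ym[4:6])
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    start = datetime(year, month, 1)
    end = datetime(year, month, last_day, 23, 59, 59)
    return start, end


start, end = month_start_end("200402")
print(start, end)
```
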
utils.create_time_stamps_day(fromymd='20010101', toymd='20101231')[source]

Helper data structure for time stamps. List of all time units, i.e. every day, in yyyymmdd format.

utils.create_time_stamps_month(fromym='200101', toym='201012')[source]

Helper data structure for time stamps. List of all time units, i.e. every month, in yyyymm format.
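
Generating such a list can be sketched as follows. The function name and the decision to include the end month are assumptions for illustration.

```python
def time_stamps_month(fromym="200101", toym="201012"):
    """Return every 'yyyymm' string from fromym to toym, both included."""
    year, month = int(fromym[:4]), int(fromym[4:6])
    end = (int(toym[:4]), int(toym[4:6]))
    stamps = []
    while (year, month) <= end:
        stamps.append("%04d%02d" % (year, month))
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return stamps


stamps = time_stamps_month("200311", "200402")
print(stamps)
```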

utils.isBot(u_id)[source]

Returns true if we filter for bots and u_id is a known bot.

Parameters:ints – Boolean, if True compares u_id as int (default is False)
utils.numberOfMonths(ymStart, ymEnd)[source]

Returns the number of months between the parameters.

Parameters:
  • ymStart – str, ‘yyyymm’ format
  • ymEnd – str, ‘yyyymm’ format
Returns:

int, number of months
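
A minimal sketch of the month arithmetic. It assumes that "between" means the plain difference, so that two identical arguments give 0; the convention used by utils.numberOfMonths is not specified here.

```python
def number_of_months(ym_start, ym_end):
    """Return the difference in months between two 'yyyymm' strings."""
    # Convert each value to a running month count, then subtract.
    start = int(ym_start[:4]) * 12 + int(ym_start[4:6])
    end = int(ym_end[:4]) * 12 + int(ym_end[4:6])
    return end - start


months = number_of_months("200311", "200402")
print(months)
```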