breast_cancer

The data set contains features for binary prediction of breast cancer.

The data has 569 patient records, each with 30 features and one binary target column that indicates whether the sampled breast mass is benign or malignant.

This is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset: https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

The separating plane for this dataset was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method that uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is described in [K. P. Bennett and O. L. Mangasarian, “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

References:
  • W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

  • O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

  • W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

The typical ML task in this dataset is to build a model that classifies between benign and malignant samples.
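As a rough illustration of that task, the sketch below trains a simple baseline classifier. It assumes scikit-learn is installed and uses scikit-learn's bundled copy of the same UCI dataset (`load_breast_cancer`) rather than this module's own loader. The model choice, split size, and random seed are arbitrary illustrative choices, not a recommended configuration.

```python
# Assumption: scikit-learn is installed. Its bundled copy of the UCI dataset
# stands in here for this module's own loader.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X has one row per patient record and one column per feature.
# In scikit-learn's copy the target is encoded as 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
n_samples, n_features = X.shape

# Hold out a stratified 25% test split so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit a linear classifier as a baseline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"{n_samples} samples, {n_features} features, test accuracy {accuracy:.3f}")
```

The target encoding in scikit-learn's copy may differ from the `target` column returned by this module, so check the label values before comparing results.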

Ten real-valued features are computed for each cell nucleus:
  1. radius (mean of distances from center to points on the perimeter)

  2. texture (standard deviation of gray-scale values)

  3. perimeter

  4. area

  5. smoothness (local variation in radius lengths)

  6. compactness (perimeter^2 / area - 1.0)

  7. concavity (severity of concave portions of the contour)

  8. concave points (number of concave portions of the contour)

  9. symmetry

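  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest value (the mean of the three largest values) of each of these features are computed for each image, which yields the 30 features.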
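The 30 column names in the dataset description follow three patterns built on these ten measurements: "mean X", "X error", and "worst X". A minimal sketch of how the names can be generated is shown here; `BASE_FEATURES` and `feature_names` are hypothetical helper names introduced for illustration and are not part of this module.

```python
# The ten base measurements listed above, in order.
BASE_FEATURES = (
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
)


def feature_names(base=BASE_FEATURES):
    """Return the 30 column names: means first, then errors, then worst values."""
    means = [f"mean {name}" for name in base]
    errors = [f"{name} error" for name in base]
    worsts = [f"worst {name}" for name in base]
    return means + errors + worsts


names = feature_names()
print(len(names))  # 30
print(names[0], "|", names[10], "|", names[20])
# mean radius | radius error | worst radius
```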

Dataset Shape

Property        Value
--------------  -------
Samples Total   569
Dimensionality  30
Features        real
Targets         boolean

Dataset Description

Column                   Role     Description
-----------------------  -------  -----------------------------
mean radius              Feature  mean radius
mean texture             Feature  mean texture
mean perimeter           Feature  mean perimeter
mean area                Feature  mean area
mean smoothness          Feature  mean smoothness
mean compactness         Feature  mean compactness
mean concavity           Feature  mean concavity
mean concave points      Feature  mean concave points
mean symmetry            Feature  mean symmetry
mean fractal dimension   Feature  mean fractal dimension
radius error             Feature  radius error
texture error            Feature  texture error
perimeter error          Feature  perimeter error
area error               Feature  area error
smoothness error         Feature  smoothness error
compactness error        Feature  compactness error
concavity error          Feature  concavity error
concave points error     Feature  concave points error
symmetry error           Feature  symmetry error
fractal dimension error  Feature  fractal dimension error
worst radius             Feature  worst radius
worst texture            Feature  worst texture
worst perimeter          Feature  worst perimeter
worst area               Feature  worst area
worst smoothness         Feature  worst smoothness
worst compactness        Feature  worst compactness
worst concavity          Feature  worst concavity
worst concave points     Feature  worst concave points
worst symmetry           Feature  worst symmetry
worst fractal dimension  Feature  worst fractal dimension
target                   Label    The class (Benign, Malignant)

Functions

load_data([data_format, as_train_test])

Load and return the Breast Cancer dataset (classification).

load_fitted_model([pretrained])

Load and return a fitted classification model to predict whether a sample in the Breast Cancer dataset is benign or malignant.