breast_cancer
The data set contains features for binary prediction of breast cancer.
The data has 569 patient records with 30 features and one binary target column, referring to the presence of breast cancer in the patient.
This is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. https://goo.gl/U2Uwz2
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
A separating plane was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method that uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.
The actual linear program used to obtain the separating plane in the 3-dimensional space is described in [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].
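To make the idea of finding a separating plane by linear programming concrete, here is a minimal sketch. The formulation used (minimize the average amount by which points violate `A w >= gamma + 1` and `B w <= gamma - 1`) is an assumption about the general shape of a robust linear program, not a transcription of the cited paper, and the function name `robust_lp_plane` is hypothetical. It relies on NumPy and SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def robust_lp_plane(A, B):
    """Find a plane w.x = gamma separating rows of A from rows of B.

    Illustrative assumption: minimize the mean slack needed so that
    A w >= gamma + 1 and B w <= gamma - 1.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    m, n = A.shape
    k = B.shape[0]

    # Variable order: w (n values), gamma (1), slacks y (m), slacks z (k).
    c = np.concatenate([np.zeros(n + 1), np.full(m, 1.0 / m), np.full(k, 1.0 / k)])

    # -A w + gamma - y <= -1  (points of A should lie above the plane)
    top = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    # B w - gamma - z <= -1   (points of B should lie below the plane)
    bottom = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])

    A_ub = np.vstack([top, bottom])
    b_ub = -np.ones(m + k)

    # w and gamma are free; slack variables are non-negative.
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + k)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n], res.fun


# Tiny, linearly separable toy data (not taken from the dataset).
A_pts = np.array([[2.0, 2.0], [3.0, 3.0]])
B_pts = np.array([[0.0, 0.0], [-1.0, 0.0]])
w, gamma, mean_violation = robust_lp_plane(A_pts, B_pts)
print(w, gamma, mean_violation)
```

For separable toy data like this the optimal mean violation is zero; on inseparable data the slack variables absorb the misclassified points, which is what makes the approach usable on real measurements.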
This database is also available through the UW CS FTP server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
- References:
W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
The typical ML task for this dataset is to build a model that classifies each sample as benign or malignant.
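As a rough illustration of that task, the sketch below trains a simple baseline classifier. It uses scikit-learn's own copy of the same UCI data (`sklearn.datasets.load_breast_cancer`) as a stand-in, which is an assumption: it is not the loader documented on this page. The choice of logistic regression, the 25% test split, and the random seed are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features and binary target from scikit-learn's copy of the data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set so both classes are represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features first: they span very different ranges (e.g. area vs. smoothness).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"samples={X.shape[0]}, features={X.shape[1]}, test accuracy={accuracy:.3f}")
```

Note that in scikit-learn's encoding the target value 0 means malignant and 1 means benign; the encoding used by other copies of the dataset may differ.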
- Ten real-valued features are computed for each cell nucleus:
radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)
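A few of these definitions can be made concrete with a small sketch. It assumes the nucleus boundary is given as an ordered list of (x, y) points forming a simple polygon and takes the center to be the average of those points; both are simplifying assumptions, and the function name `nucleus_shape_features` is hypothetical. The original features were extracted from images, so this only illustrates the formulas for radius, perimeter, area, and compactness.

```python
import math


def nucleus_shape_features(points):
    """Compute radius, perimeter, area, and compactness of a closed polygon."""
    n = len(points)

    # Center: average of the boundary points (an assumption of this sketch).
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Radius: mean distance from the center to the boundary points.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term

    area = abs(twice_area) / 2.0
    compactness = perimeter ** 2 / area - 1.0

    return {
        "radius": radius,
        "perimeter": perimeter,
        "area": area,
        "compactness": compactness,
    }


# Unit square as a toy contour.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = nucleus_shape_features(square)
print(features)
```

For the unit square the perimeter is 4 and the area is 1, so compactness is 4^2 / 1 - 1.0 = 15. A circle gives the smallest possible value, about 4π - 1, so larger values indicate a less compact outline.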
- Dataset Shape:
| Property | Value |
|---|---|
| Samples Total | 569 |
| Dimensionality | 30 |
| Features | real |
| Targets | boolean |
- Description:
| Column | Role | Description |
|---|---|---|
| mean radius | Feature | mean radius |
| mean texture | Feature | mean texture |
| mean perimeter | Feature | mean perimeter |
| mean area | Feature | mean area |
| mean smoothness | Feature | mean smoothness |
| mean compactness | Feature | mean compactness |
| mean concavity | Feature | mean concavity |
| mean concave points | Feature | mean concave points |
| mean symmetry | Feature | mean symmetry |
| mean fractal dimension | Feature | mean fractal dimension |
| radius error | Feature | radius error |
| texture error | Feature | texture error |
| perimeter error | Feature | perimeter error |
| area error | Feature | area error |
| smoothness error | Feature | smoothness error |
| compactness error | Feature | compactness error |
| concavity error | Feature | concavity error |
| concave points error | Feature | concave points error |
| symmetry error | Feature | symmetry error |
| fractal dimension error | Feature | fractal dimension error |
| worst radius | Feature | worst radius |
| worst texture | Feature | worst texture |
| worst perimeter | Feature | worst perimeter |
| worst area | Feature | worst area |
| worst smoothness | Feature | worst smoothness |
| worst compactness | Feature | worst compactness |
| worst concavity | Feature | worst concavity |
| worst concave points | Feature | worst concave points |
| worst symmetry | Feature | worst symmetry |
| worst fractal dimension | Feature | worst fractal dimension |
| target | Label | The class (Benign, Malignant) |
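The 30 feature columns listed above follow a regular pattern: each of the ten base measurements appears in a "mean", an "error", and a "worst" variant. The sketch below rebuilds the column names from that pattern, which can be handy for validating a loaded table. The names `BASE_FEATURES` and `feature_columns` are hypothetical helpers, not part of any library.

```python
# The ten base measurements computed for each cell nucleus.
BASE_FEATURES = [
    "radius",
    "texture",
    "perimeter",
    "area",
    "smoothness",
    "compactness",
    "concavity",
    "concave points",
    "symmetry",
    "fractal dimension",
]


def feature_columns():
    """Return the 30 feature column names in the order listed above."""
    columns = [f"mean {name}" for name in BASE_FEATURES]
    columns += [f"{name} error" for name in BASE_FEATURES]
    columns += [f"worst {name}" for name in BASE_FEATURES]
    return columns


columns = feature_columns()
print(len(columns), columns[0], columns[-1])
```

Comparing this list against the columns of a loaded table (excluding `target`) is a quick way to catch renamed or missing features.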
Functions
|
Load and return the Breast Cancer dataset (classification). |
|
Load and return a fitted classification model to predict the presence of breast cancer in the breast cancer dataset. |