
Application Of AI As Singing Trainer

Vivek Vinze, Jainam Dhami, Darshit Desai, Harshal Dalvi, Purva Raut
Dept. of Information Technology, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
vivek.vinze@gmail.com, jainamdhami2000@gmail.com, darshitr@gmail.com, harshal.dalvi@djsce.ac.in, purvapraut@gmail.com

Abstract—Many people have the talent and passion for music but cannot pursue it further due to many restrictions. One of the main restrictions is the unavailability of proper guidance and coaching. This paper aims at addressing this drawback by providing a complete music learning experience to the user. To accomplish this, we make a comprehensive comparison between an existing song selected by the user and the user's voice clips. Singing voice conversion will artificially convert the song into the trainee's voice. By examining a variety of statistical graphs and by listening to these Artificial Intelligence-generated clips of the song in their own voice, users can quickly grasp the key concepts, boost their confidence, and correct their mistakes.

Keywords—Artificial Intelligence, Singing Voice Conversion, Speech Enhancement, Speaker Embedding

I. INTRODUCTION

Singing is among the most assertive and crucial aspects of music, as well as a form of enjoyment and self-expression. Singing uses words, pitches, and tones to transmit both verbal and emotional information.

There have been many studies suggesting the benefits of music and singing. In [1], combined data was analysed to find benefits from music and social interaction that support individuals' perceptions of personal well-being. Musicians benefited from learning, sharing, and singing together. Opportunities to create friendships, overcome loneliness, and obtain a feeling of approval were among the social benefits. Singing has been shown to improve people's health and happiness. Choirs and music ensembles in the community continue to be a helpful tool to help citizens, establish community, and actively create music to transmit culture and tradition. Members recognized additional benefits to the individual, such as a sense of accomplishment and responsibility from being part of their various ensembles, which coincides with MacDonald's diverse set of conclusions [2]. As revealed by [3], singers are more health-conscious than their peers, supported by the fact that they exercise more and smoke less.

Many recent studies have reported that stress-related elements are harmful to one's overall health. In [4], Chiyoda Paramedical Care Clinic researchers studied subjects aged 60 or older and demonstrated that, in a rapidly aging society, maintaining good oral health is essential for helping senior citizens remain healthy, particularly in terms of mental health. Several objective analyses of the impacts of singing, such as its impact on mental and physical health, have been published. These findings show that singing has the potential to improve the mental and oral health of the elderly. A survey is highlighted in [5] in which many participants ranked music-making and sociability alongside family relationships and overall health as highly significant or crucial to their standard of living. It appeared that the band program was satisfying needs perceived as otherwise not being adequately met, and providing socialization benefits. In a survey conducted by [6], most of the people reported health benefits after participating in choral singing. They reported increased control over breathing, an increased feeling of activeness, and feelings of being more energized and alert to their surroundings. A large proportion of them also felt an improvement in their posture or stance and felt that singing resulted in enhanced lung capacity. It gave them a sense of achievement and lifted their mood. The Alzheimer's Society has introduced an intervention named Singing for the Brain for people with dementia and their carers [7]. Singing for the Brain brings people with dementia together to sing music they love and are familiar with in a joyful and welcoming environment. It also includes vocal exercises that help improve the brain's activity and wellbeing. According to the results of the survey, apart from helping patients cope with dementia, the intervention also serves as an enjoyable medium for them. It had positive impacts on the patients' mood and improved their memory. Most importantly, it helped them accept the diagnosis.

A lot of music enthusiasts and homegrown singers have always felt the need for a personalized training medium that will help them nurture their talents and develop a better grasp of singing. But a huge hurdle they face is the absence of home tutors and the unavailability of superior facilities such as enrolling in a singing academy. Such practical issues are faced by the majority of people, which creates a need for an independent music training interface. This interface should be readily available, easy to use, and have a user-friendly graphical user interface. It should also provide accurate and engaging results. The scope of this music trainer application includes providing exercises that can be undertaken depending on the user's choice. Various types of voice modulations can be performed by the user. It will also provide highly realistic personalized suggestions by recreating the whole song in the user's voice. This will help the user gain deeper insights. This novel suggestion-based approach will outperform the current state-of-the-art music trainer applications.


The rest of this paper is structured as follows: Section II describes the literature survey conducted. In Section III, we state our approach to solving the stated problems. Section IV contains the conclusions.

II. LITERATURE SURVEY

A. Cleaning Module

A novel two-way approach to modelling the input speech is proposed in [8]. As opposed to the traditional approaches, which model either just the speech or the speech along with a noise estimate, the approach proposed in the paper models speech and noise simultaneously. This is done by using a dual-branch convolutional neural network, known as SN-Net. The specialty of SN-Net lies in the fact that instead of performing the information fusion only at the final layer, interaction modules have been created at various intermediary feature domains. Such interconnections help each branch compensate for the other's missing components. Furthermore, this paper also proposes a module designed for feature extraction, named residual convolution-and-attention (RA). It extracts the correlation between the frequency and temporal dimensions for both noise and speech. The dual signal modelling nature of this prototype also makes it suitable for speaker separation.

Using this approach, there were significant improvements in SDR [18] (in dB) and PESQ [19] values as compared to traditional systems, showcasing improved performance.
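For context, SDR measures separation fidelity in dB, while PESQ estimates perceived listening quality. A minimal sketch of how such scores could be computed, assuming the third-party mir_eval and pesq Python packages (the tooling used in [8] is not specified):

```python
# Hedged illustration: objective enhancement metrics as used to evaluate [8].
# Assumes the `mir_eval` and `pesq` packages; the library choice is ours.
import numpy as np
from mir_eval.separation import bss_eval_sources  # SDR in dB [18]
from pesq import pesq                             # ITU-T P.862 PESQ [19]

def enhancement_scores(clean: np.ndarray, enhanced: np.ndarray, sr: int = 16000):
    """Return (SDR in dB, PESQ score) for one enhanced utterance."""
    sdr, _, _, _ = bss_eval_sources(clean[np.newaxis, :], enhanced[np.newaxis, :])
    mos = pesq(sr, clean, enhanced, 'wb')  # 'wb' = wideband mode, 16 kHz audio
    return sdr[0], mos
```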
In [9], a common inconvenience incurred during speech enhancement is described: classical methods frequently produce a randomly fluctuating noise known as musical noise. This paper presents a postfilter (PF) for spectral weighting gains that can effectively reduce musical noise. It has a strong detector for low-SNR regions and speech pauses, as well as an adaptive, frequency-dependent algorithm for smoothing the gains via soft decisions. When the postfilter is used in conjunction with traditional noise reduction approaches, absolute and relative measurements indicate consistent improvements. Instrumental measurements of segmental speech signal-to-noise ratio (SNR), noise attenuation, and cepstral distance show enhancements when it is applied on top of traditionally used musical noise removers.

A subjective listening test supported these results. Increased segmental Noise Attenuation (NA) was observed after using the postfilter. This solved the problem of musical noise.
In an attempt to introduce a novel approach, [10] is built on a convolutional neural network and also makes use of auto-regressive generation to produce acoustic features in earlier stages. Speech modeling and enhancement face multiple hindrances. First, model robustness suffers from the variety of recordings, speech, and noise parameters that exist in the real world. Second, there is limited access to clean, high-quality speech recordings. Third, the problems become increasingly severe in low-SNR (signal-to-noise ratio) cases. Scores can be improved by training very large models, but this makes overfitting to the biases of the accessible datasets more evident, leaving the model less robust to other real-world circumstances and making the first two issues more prominent. Larger-scale models benefit various neural-network-based applications, but for recent and currently existing speech modeling and enhancement frameworks, larger-scale networks exhibit reduced robustness to the wide range of real-world scenarios beyond what is observed in the training data. Thus, a semi-supervised technique contributes by increasing the amount of conversational training data available, pre-enhancing noisy data, and improving performance on real-world recordings. The optimization is also better matched to human perceptual judgments of voice quality thanks to a revised loss function oriented toward retaining speech quality.

The MOS score is 2.95 for the noisy input and 3.52 for the proposed system. MOS scores indicate the perceived quality of the voice, on a scale from 1 (bad) to 5 (best). Thus, we conclude that [10] does a decent job of enhancing voice quality.

B. AI Module

A unique framework for converting singing voices is proposed in [11], in which Generative Adversarial Networks (GANs) are used. It is made up of a pair of neural networks: a discriminator that distinguishes between converted and natural singing voices, and a generator that tries to fool the discriminator. In this research, the differences between the distributions of the original target parameters and the singing parameters obtained by the GAN are minimized. The trials conducted in [11] showed the proposed method to be effective: it converted singing voices while outperforming the standard approach. The focus of this paper is on the utilization of GANs for SVC, which does not necessitate the use of a speech recognition system. The primary contributions of this paper are: 1) a unique framework based on Generative Adversarial Networks (GANs) for converting singing voices; 2) eliminating the requirement for any external process, such as speech recognition, as well as reducing the reliance on massive quantities of training data; and 3) achieving a singing voice of excellent quality that beats methods based on Deep Neural Networks. Through a discriminative procedure, the GAN was trained to capture the important distinctions between the original target singing and the source singing. This Generative Adversarial Network is made up of a pair of DNNs which are updated repeatedly using mini-batch stochastic gradient descent. The discriminator utilized in this study is comparable to a DNN-based anti-spoofing system that can discriminate between synthetic and natural singing voices. The proposed technique performs astonishingly effectively with only a small amount of parallel training data from both singers. In the experiments, the baseline is outperformed and a high-quality singing voice is achieved.

Increased performance with respect to DNN baselines is observed. The SVC performance is denoted by a MOS score of 3.89±0.11. Thus, we conclude that [11] provides an above-average quality of singing voice synthesis.
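To make the adversarial setup concrete, the following is a minimal PyTorch sketch of the generator/discriminator update described above; the layer sizes, learning rates, and the 60-dimensional frame features are illustrative assumptions, not details taken from [11]:

```python
# Minimal PyTorch sketch of the adversarial update described for [11]:
# a generator converts source-singer frames toward the target singer while
# a discriminator separates converted from natural frames. Layer sizes,
# learning rates, and the 60-dim frame features are illustrative only.
import torch
import torch.nn as nn

FEAT = 60  # assumed per-frame acoustic feature size

G = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, FEAT))
D = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)  # mini-batch SGD, as in [11]
opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(src_feats: torch.Tensor, tgt_feats: torch.Tensor):
    """One update on a mini-batch of parallel (batch, FEAT) frames."""
    # Discriminator: natural target frames vs. converted frames.
    fake = G(src_feats).detach()
    loss_d = bce(D(tgt_feats), torch.ones(len(tgt_feats), 1)) \
           + bce(D(fake), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool the discriminator, plus a reconstruction term.
    conv = G(src_feats)
    loss_g = bce(D(conv), torch.ones(len(conv), 1)) \
           + nn.functional.mse_loss(conv, tgt_feats)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```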
A deep neural network model for singing voice conversion is presented in [12]. The suggested network does not depend on text or notes, and it transforms one performer's sounds into the voice of another. There is no supervision during training: lyrics, musical notes, sample matching among different performers, and phonetic features are completely absent. It employs three architectural components: first, a single CNN-based encoder shared by all the singers, followed by a WaveNet-based decoder, and finally a classifier which makes sure that the latent representation is singer-agnostic. A single embedding vector is assigned to each singer, and the decoder is trained on this embedding vector. This paper introduces a novel data augmentation strategy, along with original protocols and training losses based on the concept of back-translation, to deal with comparatively smaller datasets. The results of this study show that the output produces natural singing vocals that are easily recognized as belonging to the target singer.

Using this approach, a MOS score of 4 was achieved, indicating high singing voice conversion quality.

For multi-modal synthesis, [13] was proposed. It is essentially an auto-regressive generation framework that produces acoustic features (for example, the Mel-spectrogram) frame by frame for the various audio sources. Compared to collecting speech data without accompaniment, collecting singing vocal data is very tough and expensive. DurIAN-SC extracts d-vectors, and these d-vectors are then input as speaker embeddings into the voice conversion network, where they represent the speaker's identity. For extraction of the user's d-vector, just 20 seconds of the speaker's vocal data is necessary for the conversion. The similarity and naturalness scores from Mean Opinion Score (MOS) evaluations indicate that the proposed system can provide desirable continuous singing voice conversion in one pass with just 20 seconds of the tester's data. For one of its inputs, it uses the text/song lyrics of both sets of vocal data. The internal singing corpus is then included within the speaker embedding training to further strengthen the singing capabilities. Since the singing corpus isn't given speaker labels, a pseudo speaker label is assigned to each singing segment using bottom-up hierarchical agglomerative clustering (HAC).

The results obtained in [13] help portray that by combining some speech data with the singing voice conversion training model, the target singing data will have better and clearer pronunciation.
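The pseudo-labelling step lends itself to a short illustration. Below is a sketch using scikit-learn's agglomerative clustering over per-segment d-vectors; the library choice and the distance threshold are our assumptions, as [13] does not specify them:

```python
# Sketch of the pseudo-speaker-labelling step summarized from [13]:
# unlabeled singing segments receive cluster IDs via bottom-up HAC over
# their d-vectors. scikit-learn is our library choice and the distance
# threshold is an assumed value; [13] does not specify either.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pseudo_speaker_labels(d_vectors: np.ndarray, threshold: float = 0.7):
    """d_vectors: (n_segments, embed_dim); returns one label per segment."""
    hac = AgglomerativeClustering(
        n_clusters=None,               # let the threshold set the cluster count
        distance_threshold=threshold,  # stop merging above this distance
        metric="cosine",
        linkage="average",
    )
    return hac.fit_predict(d_vectors)
```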
C. Comparison Module

The software discussed in [14] describes to a musician, in real time and with great precision, the tone of the notes the user is playing or singing. It is very effective as teaching assistance for beginners, as well as for investigating refinements of sound production. Music, as the most abstract form of art, necessitates apparent ease of communication, which is achieved via intense study and attention to detail. Artists who are serious about singing spend many hours perfecting their skill and removing little flaws that the majority of their audience would never notice. A graphical interface can provide a musician with immediate, nonverbal, and correct feedback, as well as assist in analyzing performance and learning more about how artistic and technical decisions are made. Beginners can learn to find the proper notes accurately with the experimental tool created, and more skilled musicians can study subtle differences with the support of an independent and objective tool that helps them progress and know where to focus their efforts. In this paper, each note's pitch should be determined quickly enough to provide the musician immediate feedback, and this information should be displayed in an immediately useful form. The dominant frequency must be deduced from several frequency components to determine the pitch of the wave. The techniques utilized in the research produce all of these harmonic components and their amplitudes, providing a foundation for future timbre analysis. The remainder of that paper describes an approach for finding the pitch with sufficient speed and precision: it can calculate the fundamental frequency and respond in less than a twentieth of a second. It can be used as a teaching tool, letting teachers and students see details of music while it is being performed. It also works well as a tuning device for anyone, as it is more responsive and user-friendly than conventional electronic tuners.
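As a rough illustration of such real-time pitch tracking, the sketch below extracts a fundamental-frequency contour with librosa's pYIN implementation; note that [14] itself uses a different, autocorrelation-style method, so this is a stand-in with assumed frame parameters:

```python
# Illustrative fundamental-frequency tracker in the spirit of [14]. We use
# librosa's pYIN implementation as a stand-in; the paper itself describes a
# different method. Frame parameters are assumed values.
import librosa

def pitch_contour(path: str, sr: int = 22050):
    """Return (times in s, F0 in Hz with NaN for unvoiced frames)."""
    y, sr = librosa.load(path, sr=sr)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, frame_length=2048)
    return librosa.times_like(f0, sr=sr), f0
```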
D. Recommendation Module

The method discussed in [15] focuses on enhancing music recommendation systems, which could further be used on a variety of platforms. The Tunes Recommendation System (T-RECSYS) algorithm presented in this study uses a deep learning classification model with a combination of collaborative and content-based filtering inputs to create a precise recommendation algorithm that predicts in real time. When this method is applied to the Spotify RecSys Challenge data, scores of up to 88 percent precision are achieved at a balanced discrimination threshold. A one-of-a-kind top-k recommendation system is introduced in this paper, where each song in the database is given a score based on the user's preferences and the top-k scoring songs are returned, using a trained hybridization of collaborative and content-based filtering. The system extracts numerous critical metadata factors such as genre, tempo, and mood from user input as well as previous data and preferences. These variables are used as input, which enables fast and responsive recommendations. This paper's technique is comparable to Spotify's Discover Weekly playlist, a group of songs recommended by Spotify every week, derived from the user's past listening behavior. T-RECSYS can also be used for the continuation of a playlist, in which the user is recommended music to "continue" a currently playing playlist. This was, incidentally, the challenge posed by Spotify in their RecSys Challenge 2018, which affirmed that even when big recommendation systems are already deployed on the internet, there is an ongoing demand for such research. Earlier algorithms that made it possible to incorporate more than one variable type did not include real-time updates, which was shown to be a severe flaw. Instead, consumers were given recommendations based on the prior day's historical data, which is common even on well-known services like YouTube and Netflix.

Keeping all these issues in mind, a system capable of achieving high recommendation precision was made, which is easily adaptable to a variety of market services, such as Amazon or Netflix. By utilizing a 90% threshold, perfect (100%) precision was observed. Similarly, by utilizing the default 50% threshold, over 88% precision was achieved, and the standard deviation over 4 trials was 8.58%. The high precision received thus gives us state-of-the-art recommendations.
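A toy sketch of the hybrid scoring idea follows: blend a content-based score with a collaborative score, apply a discrimination threshold, and return the top-k songs. The blend weight and threshold values here are illustrative, not taken from [15]:

```python
# Toy sketch of the hybrid scoring behind T-RECSYS [15]: blend a content-
# based score with a collaborative score, apply a discrimination threshold,
# and return the top-k songs. Blend weight and threshold are illustrative.
import numpy as np

def recommend(content_scores: np.ndarray, collab_scores: np.ndarray,
              song_ids: list, k: int = 10, weight: float = 0.5,
              threshold: float = 0.5) -> list:
    """Each scores array holds one score in [0, 1] per candidate song."""
    hybrid = weight * content_scores + (1 - weight) * collab_scores
    ranked = np.argsort(hybrid)[::-1]  # best first
    return [song_ids[i] for i in ranked if hybrid[i] >= threshold][:k]
```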

The preferences of users are predicted in [16] depending on their previous, recent, and similar preferences. The use of a collaborative filtering algorithm for a music recommendation system is investigated in this paper. The primary focus of collaborative filtering is on the relationships between different items/products and target users to make a preferred prediction. This paper focuses on two approaches: user-based recommendations and item-based recommendations. The dependent variable values are present in the training rows, while on the test rows they are absent. The unknown matrix entries should be predicted in a data-driven way using the observable entries in the rest of the matrix. The user-based method depends on similar users' preferred products or ratings relative to the target user to make predictions and recommendations. The item-based method relies on how frequently items are selected together with other items. The target user is fond of one product/item from a given set of products/items, and with the help of those items, predictions and recommendations for the target user can be made.

The parameters similarity measure (α) and scoring function (q) help us understand the effectiveness of the recommendation systems. The user-based approach uses the ratings given by similar users to recommend songs; best results were obtained for α=0.3 and q=5. The item-based approach uses groups of similar items to recommend songs; best results were obtained with α=0.15 and q=3.
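A simplified item-based collaborative-filtering sketch in the spirit of [16] follows; the exact α/q formulation of the paper is omitted, so this is a reduced stand-in:

```python
# Simplified item-based collaborative filtering in the spirit of [16]:
# score an unseen song for a user from the songs they already rated,
# weighted by item-item cosine similarity. R is a (users x songs) rating
# matrix with zeros for missing entries; the α/q weighting of [16] is
# omitted for brevity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def item_based_score(R: np.ndarray, user: int, song: int) -> float:
    sim = cosine_similarity(R.T)    # (songs, songs) similarity matrix
    rated = np.nonzero(R[user])[0]  # songs this user has rated
    if rated.size == 0:
        return 0.0
    w = sim[song, rated]
    return float(w @ R[user, rated] / (np.abs(w).sum() + 1e-9))
```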
The major concentration of [17] is on creating a content feature for music recommendation systems derived from an optimized feature vector. First, classification of music into 22 different classes according to genre is carried out. Multiple low-level signal-descriptor-based feature vectors are designated and tested to accomplish this. They are further optimized using two techniques: Correlation Analysis (CA) and Principal Component Analysis (PCA). Traditional music genre classification approaches achieve an accuracy of over 80% on commonly available datasets, but there were a lot of limitations associated with them. First of all, the datasets are only around 1000 songs large. Hence, it was challenging to compare the effectiveness of the proposed solutions within the research conducted by the authors, because one needs to consider the test set, the analyzed music fragment, and the size of the classifier used to learn and improve. Furthermore, in most cases, the music databases contained only 20-30 seconds' worth of recording fragments. The proposed system optimized the input data. The effectiveness of the kNN classification algorithm was greatly improved by giving weights to parameters, limiting the number of classes, and applying the PCA approach.

The final accuracy was approximately 90%. Hence, after classifying the genre, songs of similar genres are recommended.
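A compact sketch of this pipeline, assuming scikit-learn; the feature weights and the PCA variance target are placeholders rather than the experimentally derived values of [17]:

```python
# Compact sketch of the pipeline summarized from [17]: weight the low-level
# descriptors, reduce them with PCA, and classify genre with kNN. The
# weights and the 95% variance target are placeholders; [17] derives its
# weights experimentally.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def train_genre_classifier(X: np.ndarray, y: np.ndarray,
                           feature_weights: np.ndarray):
    """X: (songs, descriptors) features; y: genre labels (up to 22 classes)."""
    model = make_pipeline(
        PCA(n_components=0.95),  # keep components explaining 95% of variance
        KNeighborsClassifier(n_neighbors=5, weights="distance"),
    )
    # The same weights must also be applied to query features at predict time.
    return model.fit(X * feature_weights, y)
```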
III. PROPOSED SYSTEM FOR AI-BASED APPROACH

In this section, we discuss the proposed system architecture, the entire workflow of our system, and the technologies which we are planning to use for its implementation.

A. System Architecture for AI Trainer

Fig. 1 demonstrates our system architecture. We have proposed a system wherein the trainee can learn and improve their singing in a detailed and informative way.

1) Graphic User Interface and Databases

The user interface will ask the trainee for a 10-20 second clip of him/her singing the song they wish to train on, selected from the songs database. This database will have a handful of songs from various genres so that the trainee can explore different options according to his/her taste. A separate database will be present to temporarily store the user's vocals.

Fig. 1: System Architecture for AI Trainer

2) Cleaning Module

The selected song and the input vocals will then pass through the cleaning module. The music accompaniment and vocals of the selected song will be separated by the background music remover. The input vocals of the user will simultaneously pass through a noise remover, which will remove interference and disturbance from the input, thus improving the vocal clarity and helping the other modules recognize singing patterns and different parameters more efficiently. One possible realization of this module is sketched below.
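```python
# One possible realization of the cleaning module, assuming the Spleeter
# (background-music removal) and noisereduce (denoising) libraries; the
# paper does not commit to specific tools, so these choices are ours.
import librosa
import noisereduce as nr
import soundfile as sf
from spleeter.separator import Separator

def clean_inputs(song_path: str, vocal_path: str, out_dir: str = "cleaned"):
    # Split the reference song into 'vocals' and 'accompaniment' stems.
    Separator("spleeter:2stems").separate_to_file(song_path, out_dir)
    # Suppress background noise in the trainee's recording.
    y, sr = librosa.load(vocal_path, sr=None)
    sf.write(f"{out_dir}/user_vocals.wav", nr.reduce_noise(y=y, sr=sr), sr)
```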

3) Comparison Module

After the audio of the student is cleaned and the background noise is suppressed, the audio is sent to the comparison module, along with the song selected by the student with its background music removed. This is where the statistical analysis of both audios is conducted. All the parameters measured in the audio, for example pitch, timbre, rhythm, and dynamics, are compared, and a detailed and informative analysis is performed. The graph-based analysis will further help the user identify the instances at which an error has occurred. The cleaned audio of the user and the song without the background music are used together for generating key music and singing visuals.
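For instance, one such graph could overlay the pitch contours of the reference vocals and the trainee's clip; the sketch below reuses the hypothetical pitch_contour() helper from Section II-C, and time alignment (e.g., DTW) is omitted for brevity:

```python
# Sketch of one such graph: overlay the pitch contours of the cleaned
# reference vocals and the trainee's clip so off-pitch regions stand out.
import matplotlib.pyplot as plt

def plot_pitch_comparison(ref_path: str, user_path: str):
    t_ref, f0_ref = pitch_contour(ref_path)
    t_usr, f0_usr = pitch_contour(user_path)
    plt.plot(t_ref, f0_ref, label="reference vocals")
    plt.plot(t_usr, f0_usr, label="trainee")
    plt.xlabel("time (s)"); plt.ylabel("F0 (Hz)")
    plt.legend(); plt.title("Pitch comparison")
    plt.show()
```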
4) AI Module

The song selected by the user will then be fed into our machine learning model. This song, along with the clip of the trainee's voice, will be used by the model to recreate the whole song in the trainee's voice, i.e., create a song that sounds as if the trainee himself/herself has sung it. Along with the generated charts, the AI-generated song will be used as a corrective measure. The periods in which the trainee made a mistake will be identified, and only those specific clips of the artificial song will be sent to the trainee. By listening to these targeted clips, the user instantly gets a benchmark to mimic, a level of singing which he/she can achieve. This is the main highlight of our proposed system.
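How the mistake periods might be located is sketched below: frames where the trainee's pitch deviates beyond a tolerance are grouped into spans, which would then index into the AI-generated song. The 1-semitone tolerance and minimum span length are assumed values:

```python
# Hedged sketch of locating the mistake periods from two aligned F0 tracks.
import numpy as np

def mistake_spans(f0_ref, f0_usr, times, tol_semitones=1.0, min_frames=10):
    """Return (start_s, end_s) spans where the trainee is off-pitch."""
    cents = 1200 * np.abs(np.log2(f0_usr / f0_ref))    # pitch error in cents
    bad = np.nan_to_num(cents) > tol_semitones * 100   # unvoiced frames pass
    spans, start = [], None
    for i, flag in enumerate(bad):
        if flag and start is None:
            start = i                                  # span opens
        elif not flag and start is not None:
            if i - start >= min_frames:                # ignore tiny blips
                spans.append((times[start], times[i]))
            start = None
    return spans
```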
5) Feedback Module

All the feedback provided to the user via the charts and artificially generated clips through the graphic user interface (GUI) will make up the feedback module. These suggestive and corrective measures will enhance the trainee's understanding of his/her mistakes and provide them with a wide scope for improvement.
6) Recommendation Module

The selected song will also be passed through the recommendation module. This is where audio processing takes place, i.e., different aspects of the selected song will be processed by the model, which will then identify different genres and music moods. This module will therefore generate a playlist based on the genre which the user selected and provide him with song recommendations. This is crucial because learning is not a one-time process, and hence the user will need to train on multiple songs until sufficient mastery is achieved.
B. Technologies used

The technologies to be used in our proposed system will include:

• Python as the programming language for the models, as its vast library support will increase productivity, enhance our models, and improve portability.

• PyQt for designing the graphic user interface, because, when compared to tkinter, PyQt has more versatility and a cleaner code base. PyQt supports database and networking applications using a range of platform APIs.

• MongoDB as the database, as it has very flexible document schemas and a flexible, intuitive data model. It has a change-friendly design and powerful querying, which will help our system excel in data adaptability.

A small sketch of how these choices could fit together is given below.
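```python
# Small sketch of the technology stack working together: a PyQt5 window
# backed by a pymongo connection. The database name, collection name,
# and connection string are assumptions, not details from the paper.
import sys
from PyQt5.QtWidgets import QApplication, QLabel, QVBoxLayout, QWidget
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ai_trainer"]  # assumed DB name

app = QApplication(sys.argv)
win = QWidget()
win.setWindowTitle("AI Singing Trainer")
layout = QVBoxLayout(win)
layout.addWidget(QLabel(f"Songs available: {db.songs.count_documents({})}"))
win.show()
sys.exit(app.exec_())
```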
IV. CONCLUSION

The singing human voice is perhaps the most significant musical instrument that exists. The advantages of singing include mental as well as physical health benefits, such as a lifted mood and improved breathing and posture. However, a personalized music trainer application is needed that provides suggestive as well as corrective measures. This application will enable users to train on the song they prefer, deepen their understanding with the help of interactive charts, and listen to the song in their own voice to get a better grasp of the areas where they need to improve. However, singing analysis is a daunting task, wherein considerable challenges await us. Gathering a suitable dataset containing 10 to 20 second clips of individuals singing various songs without background music, and filtering the input sound to get rid of background noise, are some of them. The latter is a challenge because not every user might have access to a good-quality microphone to provide the input voice. As the proposed model involves integrating various models, system requirements could increase. Although this will initially be a desktop application, it can be further deployed as a web-based application and a mobile application for increased ease of use.

REFERENCES

[1] D. Joseph and J. Southcott, “Personal, musical and social benefits of singing in a community ensemble: Three case studies in Melbourne (Australia),” The Journal for Transdisciplinary Research in Southern Africa, vol. 10, no. 2, Nov. 2014, doi: 10.4102/td.v10i2.103.
[2] A. MacDonald, “Researching with Young Children: Considering Issues of Ethics and Engagement,” Contemporary Issues in Early Childhood, vol. 14, no. 3, pp. 255–269, Jan. 2013, doi: 10.2304/ciec.2013.14.3.255.
[3] B. Schorr-Lesnick, A. S. Teirstein, L. K. Brown, and A. Miller, “Pulmonary Function in Singers and Wind-Instrument Players,” Chest, vol. 88, no. 2, pp. 201–205, Aug. 1985, doi: 10.1378/chest.88.2.201.
[4] K. Sakano et al., “Possible benefits of singing to the mental and physical condition of the elderly,” BioPsychoSocial Medicine, vol. 8, no. 1, May 2014, doi: 10.1186/1751-0759-8-11.
[5] D. D. Coffman and M. S. Adamek, “The Contributions of Wind Band Participation to Quality of Life of Senior Adults,” Music Therapy Perspectives, vol. 17, no. 1, pp. 27–31, Jan. 1999, doi: 10.1093/mtp/17.1.27.
[6] S. M. Clift and G. Hancox, “The perceived benefits of singing,” Journal of the Royal Society for the Promotion of Health, vol. 121, no. 4, pp. 248–256, Dec. 2001, doi: 10.1177/146642400112100409.
[7] “Singing for the Brain,” Alzheimer’s Society. https://www.alzheimers.org.uk/get-support/your-support-services/singing-for-the-brain (accessed Sep. 02, 2021).
[8] C. Zheng, X. Peng, Y. Zhang, S. Srinivasan, and Y. Lu, “Interactive Speech and Noise Modeling for Speech Enhancement,” arXiv:2012.09408, Dec. 2020. https://arxiv.org/abs/2012.09408 (accessed Sep. 02, 2021).
[9] T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement systems,” Apr. 2009, doi: 10.1109/icassp.2009.4960607.
[10] U. Isik, R. Giri, N. Phansalkar, J.-M. Valin, K. Helwani, and A. Krishnaswamy, “PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss,” Oct. 2020, doi: 10.21437/interspeech.2020-3027.
[11] B. Sisman, K. Vijayan, M. Dong, and H. Li, “SINGAN: Singing Voice Conversion with Generative Adversarial Networks,” Nov. 2019, doi: 10.1109/apsipaasc47483.2019.9023162.
[12] E. Nachmani and L. Wolf, “Unsupervised Singing Voice Conversion,” Sep. 2019, doi: 10.21437/interspeech.2019-1761.

[13] L. Zhang et al., “DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System,” Oct. 2020, doi: 10.21437/interspeech.2020-1789.
[14] P. McLeod and G. Wyvill, “Visualization of musical pitch,” doi: 10.1109/cgi.2003.1214486.
[15] F. Fessahaye et al., “T-RECSYS: A Novel Music Recommendation System Using Deep Learning,” Jan. 2019, doi: 10.1109/icce.2019.8662028.
[16] E. Shakirova, “Collaborative filtering for music recommender system,” 2017, doi: 10.1109/eiconrus.2017.7910613.
[17] P. Hoffmann, A. Kaczmarek, P. Spaleniak, and B. Kostek, “Music Recommendation System,” Journal of Telecommunications and Information Technology, no. 2, Jan. 2014.
[18] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, Jul. 2006, doi: 10.1109/tsa.2005.858005.
[19] ITU-T Recommendation P.862.2, “Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs.” https://www.itu.int/rec/T-REC-P.862.2 (accessed Sep. 02, 2021).

